A Two-Stage Modeling Approach for Building and Evaluating Risk Prediction Models Using High-Dimensional Two-Phase Data
When using electronic health records (EHRs) for clinical and translational studies, additional data is often collected from external sources to enrich the information extractable from EHRs. For example, it is common to combine biobank or patient survey data with data extracted from the EHR. The external data, however, is generally only available for a small subset of patients, and the combined EHR and external data set follows a monotone missingness pattern. We view the data in the two-phase design framework where the set of EHR covariates which are available for all patients is considered the phase I data and the set of external covariates which are available for only a subset of patients is considered the phase II data. Leveraging ideas from two-phase sampling design, we present a novel two-stage approach for building and evaluating risk prediction models for binary outcomes while accounting for the incoplete availability of the phase II data. Our method accomodates the high dimensionality of the phase I EHR data and allows for the complete utilization of the phase I data when incorporating incomplete phase II data. We propose new estimators for log-odds ratio parameters and the area under the ROC curve (AUC). Finally, we apply our method to an EHR data set and incorporate additional patient survey data to develop a model for predicting the risk of short-term mortality for oncology patients in the University of Pennsylvania hospital system.
KeywordsTwo-Stage Modeling, Two-Phase Design, Electronic Health Records, Predictive Accuracy
To understand health and disease today, we need new thinking and novel science —the kind we create when multiple disciplines work together from the ground up. That is why this department has put forward a bold vision in population-health science: a single academic home for biostatistics, epidemiology and informatics.