Deep Prognosis: Predicting Mortality in the ICU

Toby Manders · Insight · Nov 5, 2019


Originally posted on Towards Data Science.

In medical school, we learn to identify lab abnormalities, their causes and treatments. When a patient is observed to have a blood potassium of 5.5 mmol/L, doctors can quickly identify the culprit (e.g. renal failure, acute kidney injury, ACE inhibitor-induced, etc.) in light of the patient’s history and presentation.

The more difficult questions for providers are: how exactly does this patient’s hyperkalemia affect her future health outcomes? What if the patient’s potassium were 5.2 instead of 5.5 mmol/L? What if her potassium rose from baseline in 3 days rather than 5? What if she has a history of chronic kidney disease? Furthermore, how do all the other items on a patient’s ‘problem list’ and in her past medical history affect her distribution of outcomes? What if their interplay is not merely additive? Can we ever hope to appreciate the interactions between hundreds or thousands of factors that shape outcomes for a single patient?

If we hope to succeed, we ought to look at complicated patients, who potentially offer the greatest insights and opportunities. Patients in the intensive care unit (ICU) are some of the most complicated in the hospital. In the dataset I used for this project, the average patient was assigned 13 unique ICD-9 diagnosis codes during the ICU stay alone. These patients have the highest mortality rates in the hospital: each year there are 4 million ICU admissions in the U.S., approximately 500,000 of which result in death. And their care is some of the most expensive: with intensive therapies including mechanical ventilation, dialysis and infusions, a stay in the ICU can cost more than $10,000 per day. Most importantly for this project, these patients generate a tremendous amount of data: vital signs, lab results, interventions, progress notes, and more.

Is it possible to use that abundant data to predict survival in these patients?

Such a model would be able to answer the difficult questions above. Additionally, it would provide a single index of a patient’s acuity that could be easily interpreted by large and varied hospital teams. It would offer objective information useful in end-of-life discussions between providers, patients, and families. Importantly, the tool could remain flexible, so that it could predict events other than death.

The Data: Preprocessing & Feature Engineering

Overview of the Data

I used the eICU Collaborative Research Database for this project. Managed by MIT and distributed by PhysioNet, it’s publicly available, requiring only HIPAA training before access. Collected over two years (2014 and 2015) from ICUs across the US, it contains an enormous amount of data about more than 200,000 patient-stays in the ICU. Of those patient-stays, a little more than 10,000 ended in death.

Data Preprocessing

First, patients under 16 years old and those who stayed in the ICU for fewer than 7 hours were removed from the dataset.

Different cleaning steps were needed for each type of dataset used: diagnoses, treatments, past medical history, periodic vital signs, aperiodic vitals, and lab results. Each was loaded and pre-processed independently according to its needs.

In order to train the model, positive and negative patients need to be identified, and timepoints for those events need to be saved for later extraction of features. Because I chose to keep the model flexible, different labeling algorithms were needed for identifying mortality and other events in the dataset. In the case of mortality prediction, all patients who expired are identified, and the last event (treatment or diagnosis) for that patient is used as the timestamp. In the case of a diagnosis, only patients who received that diagnosis more than three hours after admission are selected, and the first instance of that diagnosis is used for the timestamp.
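To make the labeling concrete, here is a minimal pandas sketch of the two algorithms. It assumes eICU-style tables and column names (patientunitstayid, unitdischargestatus, diagnosisoffset, with offsets in minutes from unit admission); the helper names are mine, not from the original pipeline.

```python
import pandas as pd

def label_mortality(patient: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    """Positive = stays that ended in death; timestamp = the stay's last
    recorded event (treatment or diagnosis), in minutes from admission."""
    expired = set(patient.loc[patient["unitdischargestatus"] == "Expired",
                              "patientunitstayid"])
    labels = events.groupby("patientunitstayid")["offset"].max().to_frame("timestamp")
    labels["positive"] = labels.index.isin(expired)
    return labels

def label_diagnosis(diagnosis: pd.DataFrame, icd9: str) -> pd.Series:
    """Positive = stays that received the target diagnosis more than three
    hours (180 min) after admission; timestamp = first such instance."""
    hits = diagnosis[(diagnosis["icd9code"] == icd9) &
                     (diagnosis["diagnosisoffset"] > 180)]
    return hits.groupby("patientunitstayid")["diagnosisoffset"].min().rename("timestamp")
```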

Feature Engineering

After selecting positive and negative patients, feature vectors are generated for all of those examples. I used three categories of data — categorical, numerical and sequential.

Categorical features included ethnicity, gender, the admitting unit, and more. These are one-hot encoded.
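In pandas, that step is a one-liner; the column names below follow the eICU patient table, though the exact set of categorical fields is illustrative:

```python
import pandas as pd

# One-hot encode the categorical fields; dummy_na keeps a column for missing values.
categorical = pd.get_dummies(patient[["ethnicity", "gender", "unittype"]],
                             dummy_na=True)
```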

Numerical features represented the bulk of the feature engineering work and included vital signs, lab results, and static values like admission weight. I created 66 features here, drawing on my medical training. For example, one feature is the ratio of blood urea nitrogen (BUN) to serum creatinine, which helps localize an acute kidney injury. For each feature, the maximum, minimum, or mean value is calculated over one of three windows: a 4-hour window just prior to the ‘present moment’ (i.e. the prediction timepoint), a window extending from admission to the present moment, or a baseline window covering the first six hours of admission. Derived features, like the BUN-to-creatinine ratio, the change in creatinine from baseline, or the ratio of arterial to inspired oxygen (the P/F ratio), are calculated from these base features. Missing values are tolerated, although the 10% of examples missing the most features are removed. The remaining missing values are imputed with the feature mean, and all features are scaled.
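As an illustration of the windowing, here is a sketch of how a single lab might be aggregated, using eICU-style lab columns (labname, labresult, labresultoffset in minutes); which statistic and window each real feature uses varies, so treat the specifics as assumptions:

```python
def window_stats(lab, stay_id, name, t_pred):
    """Aggregate one lab over the three windows: 4 h before the prediction
    timepoint, admission to present, and a 6 h baseline (offsets in minutes).
    Empty windows yield NaN, which the pipeline tolerates and later imputes."""
    s = lab[(lab["patientunitstayid"] == stay_id) & (lab["labname"] == name)]
    recent   = s[s["labresultoffset"].between(t_pred - 240, t_pred)]
    stay     = s[s["labresultoffset"] <= t_pred]
    baseline = s[s["labresultoffset"] <= 360]
    return {f"{name}_recent_mean":   recent["labresult"].mean(),
            f"{name}_stay_max":      stay["labresult"].max(),
            f"{name}_baseline_mean": baseline["labresult"].mean()}

# Derived features combine the base aggregates, e.g. the BUN-to-creatinine ratio:
feats = {**window_stats(lab, sid, "BUN", t), **window_stats(lab, sid, "creatinine", t)}
feats["bun_cr_ratio"] = feats["BUN_recent_mean"] / feats["creatinine_recent_mean"]
```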

Finally, the sequential features include the past medical history diagnoses (given in English), diagnoses given in the ICU up until the present moment (given in English and ICD-9 codes) and therapeutic and diagnostic interventions up until the present moment. After translating all of the past medical history diagnoses to the common language of ICD-9 codes, all of these events are arranged by timestamp — in the order in which they actually happened.
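Assuming eICU-style frames for ICU diagnoses (dx) and treatments (tx), the merge might look like this; the mapping of past-history strings to ICD-9 codes is assumed to have happened upstream:

```python
import pandas as pd

# Give diagnoses and treatments a common schema, stack them (past history,
# once mapped to ICD-9, is appended the same way), and order by occurrence.
dx = dx.rename(columns={"diagnosisoffset": "offset", "icd9code": "token"})
tx = tx.rename(columns={"treatmentoffset": "offset", "treatmentstring": "token"})
cols = ["patientunitstayid", "offset", "token"]
events = pd.concat([dx[cols], tx[cols]])
events = events[events["offset"] <= t_pred].sort_values("offset")
sequence = events["token"].tolist()  # integer-encoded later for the embedding layer
```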

A Multimodal Deep Learning Model

A flexible, multimodal architecture for predicting diagnoses or survival from ICU data.

I chose to keep the diagnoses and treatments of each patient in sequential order so as to preserve the clinical patterns that I know to be important. For example, a patient with a history of malignancy who now has unilateral swelling of the leg is relatively likely to be diagnosed with a pulmonary embolism.

I needed a model architecture that would respect that sequential information, so I used a long short-term memory (LSTM) network with a bidirectional wrapper and 32 units. I also wanted the model to learn representations for each diagnosis and treatment, so I used an embedding layer with 32-element vectors, which the model learns as it trains.

With three 16-unit hidden layers, as determined by hyperparameter tuning, the RNN achieved an area under the ROC curve (AUROC) of 0.85.
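In tf.keras, for example, that branch might look roughly like this; a minimal sketch, where the framework is my choice for illustration and vocab_size and max_len are hypothetical placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, max_len = 5000, 200  # hypothetical placeholders

seq_in = layers.Input(shape=(max_len,), name="events")
emb = layers.Embedding(vocab_size, 32, mask_zero=True, name="embedding")(seq_in)
h = layers.Bidirectional(layers.LSTM(32))(emb)      # 32-unit bidirectional LSTM
for _ in range(3):                                  # three 16-unit hidden layers
    h = layers.Dense(16, activation="relu")(h)
rnn_out = layers.Dense(1, activation="sigmoid")(h)

rnn_model = Model(seq_in, rnn_out)
rnn_model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auroc")])
```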

Next, I trained a DNN on just the categorical and numerical data. Three hidden layers of 32 units each with 50% dropout between each layer worked best here, and after feature selection and engineering, the DNN achieved an AUROC of 0.87.
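Continuing the same sketch, the tabular branch, where n_features stands in for the one-hot and numerical columns combined:

```python
n_features = 100  # placeholder: one-hot categoricals + 66 numerical features

tab_in = layers.Input(shape=(n_features,), name="tabular")
t = tab_in
for _ in range(3):                                  # three 32-unit layers
    t = layers.Dense(32, activation="relu")(t)
    t = layers.Dropout(0.5)(t)                      # 50% dropout between layers
dnn_out = layers.Dense(1, activation="sigmoid")(t)

dnn_model = Model(tab_in, dnn_out)
dnn_model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auroc")])
```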

In order to combine the two models, I popped the last layer off of each and concatenated the outputs. That concatenated vector serves as input to a 3-layer, 64-unit DNN with dropout. An additional auxiliary output from the categorical/numerical DNN ensures that part of the network learns a representation useful to the final task.
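Continuing the two sketches above, the fusion might look roughly like this; the auxiliary loss weight is arbitrary, not a figure from the project:

```python
# Discard each branch's sigmoid head and merge the penultimate activations.
z = layers.concatenate([h, t])
for _ in range(3):                                  # 3-layer, 64-unit head
    z = layers.Dense(64, activation="relu")(z)
    z = layers.Dropout(0.5)(z)
main_out = layers.Dense(1, activation="sigmoid", name="main")(z)

# Auxiliary head on the tabular branch keeps that sub-network on task.
aux_out = layers.Dense(1, activation="sigmoid", name="aux")(t)

composite = Model([seq_in, tab_in], [main_out, aux_out])
composite.compile(optimizer="adam",
                  loss={"main": "binary_crossentropy", "aux": "binary_crossentropy"},
                  loss_weights={"main": 1.0, "aux": 0.2})  # aux weight assumed
```

So how did the composite model perform?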

Results

At the final timepoint, the composite model achieves a state-of-the-art AUROC of 0.91. But even days out from the last observation, the model performs well — reliably identifying the patients with the worst prognoses.

At every timepoint, the composite model performs best. Even days out, the model performs well.

If we look at examples of individual patients’ predictions over time, we can see how the model responds to new information:

The model recognizes a patient’s worsening clinical picture.

As the patient’s clinical picture worsens with a sequence of inauspicious events — stroke, respiratory failure, pneumonia, shock — the model correctly escalates its output, demonstrating that the score can be viewed as a severity index.

Looking at a projection of the learned embeddings onto their first principal components, we see an interesting pattern:

The pressors are far to the right — norepinephrine, vasopressin, phenylephrine. And mechanical ventilation is there, too. This stands to reason, as hemodynamic instability and ventilation ought to be markers of poor prognosis.
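For reference, extracting and projecting the embeddings takes only a few lines, continuing the tf.keras sketch above (the layer name matches the one assumed there):

```python
from sklearn.decomposition import PCA

# Pull the trained (vocab_size x 32) embedding matrix and project it to 2-D.
emb_matrix = composite.get_layer("embedding").get_weights()[0]
coords = PCA(n_components=2).fit_transform(emb_matrix)
```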

But as noted above, I wanted the model to remain flexible — so new models can be trained just by changing the diagnosis.

Training a new model is as easy as changing one input.
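With the labeling helper sketched earlier, retargeting might be as simple as the call below; the ICD-9 code is just an example:

```python
# Relabel against a different target, e.g. acute kidney failure (ICD-9 584.9),
# then rebuild features at the new timestamps and retrain the same architecture.
labels = label_diagnosis(diagnosis, icd9="584.9")
```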

I included a wide variety of features — renal, pulmonary, hepatic, and cardiovascular — so that the model performs well with other diagnoses.

The model performs well across a variety of diagnoses.

Conclusion

In summary, I’ve built a multimodal, flexible deep learning model and tool for providers, hospitals, and insurers that not only provides a single metric of a patient’s severity and mortality likelihood but can also predict other events.

I had a lot of fun on this project. Thanks for reading!

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
