Given the first 48 hours of ICU patients' data, predict per patient their:
This is to evaluate the severity of patients and prioritise them for optimal resource allocation, maximising monitoring and support for those who need them most.
Static data collected at the time the patient is admitted to the ICU:
# OS folder management
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')
# Basic Python dataframe packages
import pandas as pd
import numpy as np
# Visualization packages
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set(font_scale=2)
import math
from statistics import mean
Below, the data is extracted from the Project_Data folder. Within this folder, the data is separated into 4 folds:
- Folder Fold1
- Folder Fold2
- Folder Fold3
- Folder Fold4
- Fold1_Outcomes.csv
- Fold2_Outcomes.csv
- Fold3_Outcomes.csv
- Fold4_Outcomes.csv
data_dir = os.path.join(os.getcwd(), "Project_Data")
os.listdir(data_dir)
Below, the data from each file is extracted and collated into these variables:
- dirpaths - to collect all directory pathnames
- outcomes - to collect all csv files
- folds - to collect all txt files in each fold's folder (i.e. Fold1, Fold2, Fold3, Fold4), which contain the patients' data

isFirst = True
dirpaths = []
outcomes = [] # collect csv files
folds = {} # collect lists of txt.files
for (dirpath, dirnames, filenames) in os.walk(data_dir):
dirpaths.append(dirpath)
if isFirst:
outcomes = filenames
outcomes.sort()
dirnames.sort()
folds = {key:[] for key in dirnames}
isFirst = False
else:
folds[dirpath[-5:]] = filenames
# print(dirpaths)
# print("Number of outcome files: ", outcomes)
for key, value in folds.items():
print("Number of patients for", key, 'is', len(value))
Below, the static and temporal variables are defined and collated into these variables:
- static_variables => list of all static variables
- temporal_variables => dictionary of all temporal variables and their descriptions

# define static and temporal variables
static_variables = ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'Weight']
print("Number of static variables: " + str(len(static_variables)))
temporal_variables = {'Albumin': 'Serum Albumin (g/dL)',
'ALP': 'Alkaline phosphatase (IU/L)',
'ALT': 'Alanine transaminase (IU/L)',
'AST': 'Aspartate transaminase (IU/L)',
'Bilirubin': 'Bilirubin (mg/dL)',
'BUN': 'Blood urea nitrogen (mg/dL)',
'Cholesterol': 'Cholesterol (mg/dL)',
'Creatinine': 'Serum creatinine (mg/dL)',
'DiasABP': 'Invasive diastolic arterial blood pressure (mmHg)',
'FiO2': 'Fractional inspired O2 (0-1)',
'GCS': 'Glasgow Coma Score (3-15)',
'Glucose': 'Serum glucose (mg/dL)',
'HCO3': 'Serum bicarbonate (mmol/L)',
'HCT': 'Hematocrit (%)',
'HR': 'Heart rate (bpm)',
'K': 'Serum potassium (mEq/L)',
'Lactate': 'Lactate (mmol/L)',
'Mg': 'Serum magnesium (mmol/L)',
'MAP': 'Invasive mean arterial blood pressure (mmHg)',
'MechVent': 'Mechanical ventilation respiration (0:false or 1:true)',
'Na': 'Serum sodium (mEq/L)',
'NIDiasABP': 'Non-invasive diastolic arterial blood pressure (mmHg)',
'NIMAP': 'Non-invasive mean arterial blood pressure (mmHg)',
'NISysABP': 'Non-invasive systolic arterial blood pressure (mmHg)',
'PaCO2': 'partial pressure of arterial CO2 (mmHg)',
'PaO2': 'Partial pressure of arterial O2 (mmHg)',
'pH': 'Arterial pH (0-14)',
'Platelets': 'Platelets (cells/nL)',
'RespRate': 'Respiration rate (bpm)',
'SaO2': 'O2 saturation in hemoglobin (%)',
'SysABP': 'Invasive systolic arterial blood pressure (mmHg)',
'Temp': 'Temperature (°C)',
'TroponinI': 'Troponin-I (μg/L)',
'TroponinT': 'Troponin-T (μg/L)',
'Urine': 'Urine output (mL)',
'WBC': 'White blood cell count (cells/nL)',
'Weight': 'Weight (kg)'}
print("Number of temporal variables: " + str(len(temporal_variables.keys())))
# function to extract a patient's data from txt file and store as dataframe in long form
def extractPatientRecords(temp_dir):
data = pd.read_csv(temp_dir, sep=",")
patient_id = temp_dir[-10:].split('.')[0] #assume recordID is 6 digits
return (patient_id, data)
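A quick sketch of what the slice above does on a hypothetical path; os.path.basename would be a length-independent alternative:
sample_path = os.path.join("Project_Data", "Fold1", "132539.txt")  # hypothetical path
print(sample_path[-10:].split('.')[0])              # '132539' via the slice above
print(os.path.basename(sample_path).split('.')[0])  # same id, independent of path length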
# assume patient is unique:
# Extract all patients records from txt files
all_patients = {} # {patient id: raw dataframe from txt.file}
cv_fold = {} # {fold# : list of all patients'id}
t1 = datetime.now()
print("Started Data Extraction at", t1.strftime('%Y-%m-%d %H:%M:%S'))
for path in dirpaths[1:]:
    # get the fold name from the last 5 characters of the path
key = path[-5:]
print("\nData Extraction on", key, "has begun")
cv_fold[key] = []
# get list from folds
patients_files = folds[key]
# iterate one file at a time
for filename in patients_files:
patient_id, data = extractPatientRecords(os.path.join(path,filename))
if data.empty:
print("Empty Dataframe found at " + filename)
continue
else:
cv_fold[key].append(patient_id)
all_patients[patient_id] = data
t2 = datetime.now()
diff = t2-t1
print(key, "is done.\nTook about", diff.seconds, "seconds from start")
t2 = datetime.now()
diff = t2-t1
print("\nEntire Data Extraction completed after", diff.seconds, "seconds from start")
print("\nTotal number of folds is", len(cv_fold))
print("Fold1:", len(cv_fold['Fold1']))
print("Fold2:", len(cv_fold['Fold2']))
print("Fold3:", len(cv_fold['Fold3']))
print("Fold4:", len(cv_fold['Fold4']))
print("Total number of patients is", len(all_patients))
# extract static data frame of a patient
def getPatientStaticData(data):
static_df = data.loc[data['Time'] == '00:00', :]
static_df = static_df.loc[data['Parameter'].isin(static_variables)]
del static_df['Time']
static_df = static_df.transpose().reset_index(drop=True)
header = static_df.iloc[0]
static_df = static_df[1:]
static_df.columns = header
return static_df
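A small sketch on hypothetical rows showing the reshaping: the six static parameters recorded at time 00:00 become a single row with one column per variable.
demo = pd.DataFrame({'Time': ['00:00'] * 6 + ['00:07'],
                     'Parameter': static_variables + ['HR'],   # hypothetical patient records
                     'Value': [132999, 67, 1, 175.3, 3, 80.6, 88]})
print(getPatientStaticData(demo))  # one row: RecordID, Age, Gender, Height, ICUType, Weight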
# get patients' static dataframe(s) in 4 folds - for training and testing
all_static_dfs_folds = {}
for key, ids_list in cv_fold.items():
all_static_dfs_folds[key] = pd.DataFrame()
print(key, "has started extracting static data")
for patient_id in ids_list:
all_static_dfs_folds[key] = all_static_dfs_folds[key].append(getPatientStaticData(all_patients[patient_id]), ignore_index=True)
print(key, "has completed\n")
print(len(all_static_dfs_folds), "folds of patients' static data have been extracted")
# get patients' static dataframe in one dataframe - for data exploration
all_static_dfs = pd.DataFrame()
for key, ids_list in cv_fold.items():
print(key, "has started extracting static data")
for patient_id in ids_list:
all_static_dfs = all_static_dfs.append(getPatientStaticData(all_patients[patient_id]), ignore_index=True)
print(key, "has completed\n")
print(len(all_static_dfs), "patients' static data rows have been extracted")
# last non-nan value of each column.
def findMostRecent(data):
if data.last_valid_index() is None:
return np.nan
else:
return data[data.last_valid_index()]
def findEarliest(data):
if data.first_valid_index() is None:
return np.nan
else:
return data[data.first_valid_index()]
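A tiny sketch of both helpers on a hypothetical series with missing values at the ends:
s = pd.Series([np.nan, 7.40, 7.35, np.nan])  # hypothetical pH readings
print(findEarliest(s), findMostRecent(s))    # 7.4 7.35
print(findMostRecent(pd.Series([np.nan])))   # nan when no valid value exists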
# return a patient's temporal data in a single row given a patient's dataframe of table
# for data exploration
# frequency, most recent, earliest, overall_changes of temporal data
def extractPatientTemporalData(data, aggregate_by="freq"):
    # get the static values recorded at time 00:00 (in file order: RecordID, Age, Gender, Height, ICUType, Weight)
record_id = data.loc[data['Time'] == '00:00', :]['Value'][0]
age = data.loc[data['Time'] == '00:00', :]['Value'][1]
gender = data.loc[data['Time'] == '00:00', :]['Value'][2]
height = data.loc[data['Time'] == '00:00', :]['Value'][3]
icu_type = data.loc[data['Time'] == '00:00', :]['Value'][4]
weight = data.loc[data['Time'] == '00:00', :]['Value'][5]
if aggregate_by == "freq":
data_1 = data[['Parameter', 'Value']].groupby(['Parameter']).count()
data_1 = data_1.T
data_1['RecordID'] = record_id
data_1['Age'] = age
data_1['Gender'] = gender
data_1['Height'] = height
data_1['ICUType'] = icu_type
data_1['Weight'] = weight
return data_1
else:
data_1 = data[['Time', 'Parameter', 'Value']].groupby(['Time', 'Parameter']).median().reset_index()
data_pivot = data_1.pivot(index='Time', columns='Parameter', values='Value')
if aggregate_by == "most_recent":
return data_pivot.apply(findMostRecent, axis=0)
if aggregate_by == "earliest":
return data_pivot.apply(findEarliest, axis=0)
# difference between the most recent and the earliest
if aggregate_by == "overall_changes":
data_pivot = data_pivot.apply(findMostRecent, axis=0) - data_pivot.apply(findEarliest, axis=0)
data_pivot['RecordID'] = record_id
data_pivot['Age'] = age
data_pivot['Gender'] = gender
data_pivot['Height'] = height
data_pivot['ICUType'] = icu_type
data_pivot['Weight'] = weight
return data_pivot
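A minimal sketch (hypothetical values) of the long-to-wide pivot this function relies on, together with the most-recent aggregation:
toy = pd.DataFrame({'Time': ['00:30', '01:00', '01:00'],
                    'Parameter': ['HR', 'HR', 'Temp'],
                    'Value': [88.0, 92.0, 37.1]})
toy_pivot = toy.groupby(['Time', 'Parameter']).median().reset_index() \
               .pivot(index='Time', columns='Parameter', values='Value')
print(toy_pivot)                                # one column per parameter, NaN where unmeasured
print(toy_pivot.apply(findMostRecent, axis=0))  # HR 92.0, Temp 37.1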
# Collate all observations into 4 dataframes of patients' data, one per fold - for training and testing
def getAllTemporalDataFrameByAggregationTypeInFolds(cv_fold, all_patients, aggregator="freq"):
all_temporal_dfs_folds = {}
for key, ids_list in cv_fold.items():
all_temporal_dfs_folds[key] = pd.DataFrame()
print(key, "has started extracting temporal data")
for patient_id in ids_list:
all_temporal_dfs_folds[key] = all_temporal_dfs_folds[key].append(extractPatientTemporalData(all_patients[patient_id], aggregator), ignore_index=True)
print(key, "has completed\n")
    print(len(all_temporal_dfs_folds), "folds of patients' temporal data have been extracted with aggregator", aggregator)
return all_temporal_dfs_folds
# Collate all observations into 1 dataframe of patients' data - for data exploration
def getAllTemporalDataFrameByAggregationType(cv_fold, all_patients, aggregator="freq"):
dataframe = pd.DataFrame()
for key, ids_list in cv_fold.items():
print(key, "has started extracting temporal data")
for patient_id in ids_list:
dataframe = dataframe.append(extractPatientTemporalData(all_patients[patient_id], aggregator), ignore_index=True)
print(key, "has completed\n")
    print(len(dataframe), "patients' temporal data rows have been extracted with aggregator", aggregator)
return dataframe
Storage of temporal variables for 2 different aspects of the project:
Assumption: the frequency with which a temporal variable is measured reflects how closely the patient needed to be monitored.
# in 4 folds - frequency of monitoring temporal variables
all_temporal_dfs_folds__freq = getAllTemporalDataFrameByAggregationTypeInFolds(cv_fold, all_patients, "freq")
all_temporal_dfs_folds__freq
# get patients' 4000 temporal dataframe - by frequency of measurement of temporal data
all_temporal_dfs__freq = getAllTemporalDataFrameByAggregationType(cv_fold, all_patients, "freq")
all_temporal_dfs__freq.head()
Assumption: the last value within the 48-hour period reflects the patient's condition at the end of the window.
# in 4 folds - last value of temporal variables within the 48 hours period
all_temporal_dfs_folds__most_recent = getAllTemporalDataFrameByAggregationTypeInFolds(cv_fold, all_patients, "most_recent")
all_temporal_dfs_folds__most_recent
# get patients' 4000 temporal dataframe - by the last value within the 48 hours period
all_temporal_dfs__most_recent = getAllTemporalDataFrameByAggregationType(cv_fold, all_patients, "most_recent")
all_temporal_dfs__most_recent.head()
Assumption: the first value within the 48-hour period reflects the patient's condition on admission to the ICU.
# in 4 folds - by the first value of each of the temporal data
all_temporal_dfs_folds__earliest = getAllTemporalDataFrameByAggregationTypeInFolds(cv_fold, all_patients, "earliest")
all_temporal_dfs_folds__earliest
# get patients' 4000 temporal dataframe - by the first value of each of the temporal data
all_temporal_dfs__earliest = getAllTemporalDataFrameByAggregationType(cv_fold, all_patients, "earliest")
all_temporal_dfs__earliest.head()
def extractPatientOutcome(temp_dir):
data = pd.read_csv(temp_dir, sep=",", encoding = "ISO-8859-1")
return data
# get patients' outcome dataframe
all_outcome_dfs_folds = {}
for csv_file in outcomes:
fold_num = csv_file[0:5]
all_outcome_dfs_folds[fold_num] = pd.DataFrame()
    all_outcome_dfs_folds[fold_num] = all_outcome_dfs_folds[fold_num].append(extractPatientOutcome(os.path.join(dirpaths[0], csv_file)), ignore_index=True)
print(len(all_outcome_dfs_folds), "folds of patients' outcome data have been extracted")
# get patients' outcome dataframe
all_outcome_dfs = pd.DataFrame()
for csv_file in outcomes:
    all_outcome_dfs = all_outcome_dfs.append(extractPatientOutcome(os.path.join(dirpaths[0], csv_file)), ignore_index=True)
print(len(all_outcome_dfs), "patients' outcome records have been extracted")
Summary of Data Structures:
Folds Dictionary Objects for training and testing:
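A short recap sketch of the structures assembled so far (names as defined above):
# all_patients: {record id -> raw long-form dataframe}; cv_fold: {fold -> record ids}
# all_static_dfs_folds / all_temporal_dfs_folds__* : per-fold feature dataframes
# all_outcome_dfs_folds: {fold -> outcomes dataframe}; the *_dfs variants hold all patients at once
for key in cv_fold:
    print(key, '->', len(cv_fold[key]), 'patients; static', all_static_dfs_folds[key].shape,
          '; outcomes', all_outcome_dfs_folds[key].shape)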
Objectives:
# function to visualize a patients' temporal data given an id
def visualizePatientTemporalData(data):
    temporal = data[5:] # skip the 5 purely static rows; Weight is kept as a temporal variable
    patient_id = data.loc[data['Parameter'] == 'RecordID', :]
plt.figure(figsize=(24, 9))
tick_spacing = 2
chart = sns.scatterplot(temporal['Time'], temporal['Parameter'], s=150)
chart.set_title("Patient ID: " + str(int(patient_id['Value'])))
chart.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.setp(chart.get_xticklabels(), rotation=45)
plt.show()
Plot of each patient's temporal data.
# test visualization of Temporal data
visualizePatientTemporalData(all_patients['132539'])
all_features_df_XXX - to merge the outcome variables with the static and temporal variables
# dataframes of all variables
print(all_outcome_dfs.columns)
print(all_static_dfs.columns)
print(all_temporal_dfs__most_recent.columns)
# exploration by frequency of measurement temporal data
all_features_df = pd.merge(all_outcome_dfs, all_static_dfs)
all_features_df__freq = pd.merge(all_features_df, all_temporal_dfs__freq, how='left', on=["RecordID", "Age", "Gender", "Height", "ICUType", "Weight"])
print("Number of Features is", len(all_features_df__freq.columns))
print(all_features_df__freq.columns)
all_features_df__freq.head()
# exploration by the patient's condition on arrival in the ICU (earliest temporal data)
all_features_df = pd.merge(all_outcome_dfs, all_static_dfs)
all_features_df__earliest = pd.merge(all_features_df, all_temporal_dfs__earliest, how='left', on=["RecordID", "Age", "Gender", "Height", "ICUType", "Weight"])
print("Number of Features is", len(all_features_df__earliest.columns))
print(all_features_df__earliest.columns)
all_features_df__earliest.head()
# exploration by patient's condition by end of 48 hours in ICU (latest temporal data)
all_features_df = pd.merge(all_outcome_dfs, all_static_dfs)
all_features_df__most_recent = pd.merge(all_features_df, all_temporal_dfs__most_recent, how='left', on=["RecordID", "Age", "Gender", "Height", "ICUType", "Weight"])
print("Number of Features is", len(all_features_df__most_recent.columns))
print(all_features_df__most_recent.columns)
all_features_df__most_recent.head()
Below is an exploration of the variables' data types and the number of observations, to check for missing values in each parameter before data cleaning. This process is executed for each aggregation type (i.e. freq, most_recent, earliest).
all_features_df__most_recent.info()
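As a complementary sketch, the fraction of missing values per column can be listed directly:
missing_frac = all_features_df__most_recent.isna().mean().sort_values(ascending=False)
print(missing_frac.head(15))  # columns with the highest share of missing values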
# Static Variables are in non-null objects
# Temporal data are in float
def cleanData(df):
    # Convert Age, Gender, ICUType to integer categorical/binary variables
df[["Age", "Gender", "ICUType"]] = df[["Age", "Gender", "ICUType"]].astype(int)
# Convert Height, Weight from objects into float
df[["Height","Weight"]] = df[["Height","Weight"]].apply(pd.to_numeric)
# Missing Data -1: To change to NaN. To ensure that there are no negative values for the visualisation.
df = df.replace(to_replace=-1, value = np.nan)
# Missing Data NaN: To change to 0. To ensure that there are no null values for the visualisation.
df[['MechVent']] = df[['MechVent']].fillna(value = 0)
return df
all_features_df__most_recent = cleanData(all_features_df__most_recent)
all_features_df__most_recent.info()
all_features_df__freq = cleanData(all_features_df__freq)
all_features_df__earliest = cleanData(all_features_df__earliest)
This is to explore the distribution of the features across all ICUs.
# visualize the distribution of each parameter (boxplot, histogram, KDE)
cat_feature = ["Gender", "ICUType", "MechVent"]
num_feature = ["Age","Height","Weight","BUN","Creatinine","GCS","Glucose","HCO3","HCT","HR",
"K","Mg","NIDiasABP","NIMAP","NISysABP","Na","Platelets","RespRate","Temp",
"TroponinT","Urine","WBC","Lactate","DiasABP","FiO2","MAP","PaCO2","PaO2",
"SaO2","SysABP","pH","ALP","ALT","AST","Albumin","Bilirubin","TroponinI",
"Cholesterol"]
print(len(cat_feature))
print(len(num_feature))
# This is to explore the distribution of the features across all ICUs, gender, MechVent or In-hospital death.
from collections import OrderedDict
def visualizeKDEDistribution(df, catType="IcuType"):
plt.figure(figsize = (30, 90))
colors = OrderedDict({1: 'blue', 2: 'orange', 3: 'green', 4: 'red'})
bin_colors = OrderedDict({0: 'blue', 1: 'red'})
icuTypes = OrderedDict({1: 'Coronary Care Unit', 2: "Cardiac Surgery Recovery Unit", 3: 'Medical ICU', 4: 'Surgical ICU'})
genderTypes = OrderedDict({0: 'Female', 1: 'Male'})
mechVentTypes = OrderedDict({0: 'Not Mech Vent', 1: 'Mech Vent'})
inHospitalDeathTypes = OrderedDict({0: 'Survive', 1: 'In-Hospital Death'})
for i, feature in enumerate(num_feature):
ax = plt.subplot(19, 2, i + 1)
if catType == "IcuType":
for icuTypes_level, color in colors.items():
sns.kdeplot(df.loc[df['ICUType'] == icuTypes_level, feature].dropna(),
ax = ax, color = color, label = icuTypes[icuTypes_level])
plt.title(f'{feature.capitalize()} Distribution'); plt.xlabel(f'{feature}'); plt.ylabel('Density')
if catType == "Gender":
for genderTypes_level, color in bin_colors.items():
sns.kdeplot(df.loc[df['Gender'] == genderTypes_level, feature].dropna(),
ax = ax, color = color, label = genderTypes[genderTypes_level])
plt.title(f'{feature.capitalize()} Distribution'); plt.xlabel(f'{feature}'); plt.ylabel('Density')
if catType == "MechVent":
for mechVentTypes_level, color in bin_colors.items():
sns.kdeplot(df.loc[df['MechVent'] == mechVentTypes_level, feature].dropna(),
ax = ax, color = color, label = mechVentTypes[mechVentTypes_level])
plt.title(f'{feature.capitalize()} Distribution'); plt.xlabel(f'{feature}'); plt.ylabel('Density')
if catType == "In-hospital_death":
for inHospitalDeathTypes_level, color in bin_colors.items():
sns.kdeplot(df.loc[df['In-hospital_death'] == inHospitalDeathTypes_level, feature].dropna(),
ax = ax, color = color, label = inHospitalDeathTypes[inHospitalDeathTypes_level])
plt.title(f'{feature.capitalize()} Distribution'); plt.xlabel(f'{feature}'); plt.ylabel('Density')
plt.subplots_adjust(top = 2)
# find by ICU Type, how many in-hospital deaths
# distribution of the length of stay
all_features_df = pd.merge(all_outcome_dfs, all_static_dfs)
all_features_df.head()
# length of stay by ICUType and in-hospital death distribution
sns.catplot(x="ICUType", y="Length_of_stay", col="In-hospital_death", kind="box", data=all_features_df, aspect=1);
Above are some observations of the length of stay by ICUType and in-hospital death distribution:
# Get the list of record ids of all patients
all_record_ids = []
# Get the list of survival patients and dead patients
outcome_dead_list = []
outcome_survival_list =[]
outcome_df_folds = pd.DataFrame(columns = ['Dead' , 'Survived'], index=['Fold1', 'Fold2', 'Fold3' , 'Fold4'])
# count dead/survived per fold and collect record ids
for i, fold in enumerate(['Fold1', 'Fold2', 'Fold3', 'Fold4']):
    outcome_fold = all_outcome_dfs_folds[fold]
    fold_dead = 0
    fold_survive = 0
    for n in range(len(outcome_fold)):
        record_id = outcome_fold['RecordID'][n]
        all_record_ids.append(record_id)
        if outcome_fold['In-hospital_death'][n] == 1:
            outcome_dead_list.append(record_id)
            fold_dead += 1
        else:
            outcome_survival_list.append(record_id)
            fold_survive += 1
    outcome_df_folds.iloc[i] = [fold_dead, fold_survive]
print("The total number of patients are", len(all_record_ids))
print("The number of dead patients are", len(outcome_dead_list))
print("The number of patients who survived are", len(outcome_survival_list), "\n")
print("Number of Patients based on Dead or Survival Outcome:")
print(outcome_df_folds)
outcome_df_folds.plot(kind='bar', stacked=False, figsize=[12,6])
plt.title('Number of Patients based on Dead or Survival Outcome')
plt.xticks(rotation=0)
plt.show()
from fractions import Fraction
print("The ratio of death over survival in each fold:")
for index, row in outcome_df_folds.iterrows():
print("Ratio of", index, ":", Fraction(row['Dead']/row['Survived']).limit_denominator(), "=", row['Dead']/row['Survived'])
Observation: The ratio of deaths to survivals is roughly the same across the four folds.
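A small sketch to back that up: the per-fold death rate as a percentage.
counts = outcome_df_folds.astype(int)
death_rate = counts['Dead'] / (counts['Dead'] + counts['Survived'])
print((death_rate * 100).round(2))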
sns.catplot(x="ICUType", kind="count", col="In-hospital_death", data=all_features_df, aspect=1)
Above are the observations of in-hospital death against ICU types:
# frequency of each temporal variable for death and ICU types
all_features_df__freq[all_features_df__freq['ICUType'] == 1].loc[:, 'ALP':'pH'].sum().sort_values(ascending=False).plot.bar(figsize = (14, 5))
plt.xlabel('Variables'); plt.ylabel('Sum of Frequency');
plt.title('Sum of Frequency vs Variables in Coronary Care Unit');
# Top 12 frequently monitored variables are HR, MAP, SysABP, DiasABP, Urine, NISysABP, NIDiasABP, NIMAP, RespRate, Temp, GCS, FiO2
all_features_df__freq[all_features_df__freq['ICUType'] == 2].loc[:, 'ALP':'pH'].sum().sort_values(ascending=False).plot.bar(figsize = (14, 5), color='orange')
plt.xlabel('Variables'); plt.ylabel('Sum of Frequency');
plt.title('Sum of Frequency vs Variables in Cardiac Surgery Recovery Unit');
# Top 12 frequently monitored variables are HR, MAP, SysABP, DiasABP, Urine, Temp, GCS, NISysABP, NIDiasABP, NIMAP, pH, PaCO2
all_features_df__freq[all_features_df__freq['ICUType'] == 3].loc[:, 'ALP':'pH'].sum().sort_values(ascending=False).plot.bar(figsize = (14, 5), color='green')
plt.xlabel('Variables'); plt.ylabel('Sum of Frequency');
plt.title('Sum of Frequency vs Variables in Medical ICU');
# Top 12 frequently monitored variables are HR, NISysABP, NIDiasABP, NIMAP, Urine, SysABP, DiasABP, MAP, RespRate, Temp, GCS, FiO2
all_features_df__freq[all_features_df__freq['ICUType'] == 4].loc[:, 'ALP':'pH'].sum().sort_values(ascending=False).plot.bar(figsize = (14, 5), color='red')
plt.xlabel('Variables'); plt.ylabel('Sum of Frequency');
plt.title('Sum of Frequency vs Variables in Surgical ICU');
# Top 12 frequently monitored variables are HR, SysABP, DiasABP, MAP, Urine, GCS, NISysABP, NIDiasABP, NIMAP, Temp, RespRate, FiO2
from itertools import combinations
top12 = {1: ['HR', 'MAP', 'SysABP', 'DiasABP', 'Urine', 'NISysABP', 'NIDiasABP', 'NIMAP', 'RespRate', 'Temp', 'GCS', 'FiO2'],
         2: ['HR', 'MAP', 'SysABP', 'DiasABP', 'Urine', 'Temp', 'GCS', 'NISysABP', 'NIDiasABP', 'NIMAP', 'pH', 'PaCO2'],
         3: ['HR', 'NISysABP', 'NIDiasABP', 'NIMAP', 'Urine', 'SysABP', 'DiasABP', 'MAP', 'RespRate', 'Temp', 'GCS', 'FiO2'],
         4: ['HR', 'SysABP', 'DiasABP', 'MAP', 'Urine', 'GCS', 'NISysABP', 'NIDiasABP', 'NIMAP', 'Temp', 'RespRate', 'FiO2']}
for a, b in combinations(top12, 2):
    diff = set(top12[a]).symmetric_difference(top12[b])
    if diff:
        print('ICUType', a, 'and', b, 'differ in variables:', diff)
    else:
        print('ICUType', a, 'and', b, 'monitor the same top-12 variables')
# ICUType pairs (1,2), (2,3) and (2,4) have some differing variables
# ICUType pairs (1,3), (1,4) and (3,4) share the same top-12 variables
# the differing variables are {'pH', 'PaCO2'} in ICUType 2 versus {'RespRate', 'FiO2'} in ICUTypes 1, 3 and 4
feat = ['Age', 'HR', 'MAP', 'SysABP','DiasABP', 'NISysABP', 'NIDiasABP', 'NIMAP', 'Temp', 'GCS', 'pH', 'RespRate', 'PaCO2', 'FiO2']
colors = OrderedDict({1: 'blue', 2: 'orange', 3: 'green', 4: 'red'})
icuTypes = OrderedDict({1: 'Coronary Care Unit', 2: "Cardiac Surgery Recovery Unit", 3: 'Medical ICU', 4: 'Surgical ICU'})
for num in range(1, 5):
all_features_df__most_recent[(all_features_df__most_recent['ICUType'] == num) & (all_features_df__most_recent['In-hospital_death'] == 0) ].loc[:, feat].plot.box(figsize = (20, 5), color=colors[num])
plt.xlabel('Variables'); plt.ylabel('Unit');
plt.xticks(rotation='vertical')
plt.title('BoxPlot of variables for ' + icuTypes[num] + ' for those who survive');
all_features_df__most_recent[(all_features_df__most_recent['ICUType'] == num) & (all_features_df__most_recent['In-hospital_death'] == 1) ].loc[:, feat].plot.box(figsize = (20, 5), color=colors[num])
plt.xlabel('Variables'); plt.ylabel('Unit');
plt.xticks(rotation='vertical')
plt.title('BoxPlot of variables for ' + icuTypes[num] + ' for those who died');
plt.figure(figsize = (10, 2))
bin_colors = OrderedDict({0: 'red', 1: 'blue'})
inHospitalDeathTypes = OrderedDict({0: 'Survive', 1: 'In-Hospital Death'})
feature = "Length of Stay"
all_features_df__earliest_stay = all_features_df__earliest.loc[all_features_df__earliest['Length_of_stay'] <= 60]
for inHospitalDeathTypes_level, color in bin_colors.items():
sns.distplot(all_features_df__earliest_stay.loc[all_features_df__earliest_stay['In-hospital_death'] == inHospitalDeathTypes_level, "Length_of_stay"].dropna(),
color = color, label = inHospitalDeathTypes[inHospitalDeathTypes_level], kde=False)
plt.legend()
plt.title(f'{feature.capitalize()} Distribution');
plt.xlabel(f'{feature}');
plt.ylabel('Frequency')
plt.subplots_adjust(top = 2)
# The length-of-stay distribution for in-hospital deaths appears right-skewed; the majority of those who died stayed between 0 and 20 days in hospital
visualizeKDEDistribution(all_features_df__earliest, "In-hospital_death")
Key observations on the condition of patients upon arrival in the ICU in relation to in-hospital death:
# it may make more sense to find out in what condition the patient came in
visualizeKDEDistribution(all_features_df__earliest, "IcuType")
Key observations compared with the previous graph:
# it may make more sense to find out in what condition the patient came in
visualizeKDEDistribution(all_features_df__earliest, "Gender")
# Both genders have quite comparable distributions and density measures for each of the features
def plotFeaturesCorrelationMatrix(df, num_feature=None):
    # correlate either all columns or a selected feature list
    if num_feature is None:
        corr = df.corr()
    else:
        corr = df[num_feature].corr()
    # mask the upper triangle to avoid plotting duplicate pairs
    mask = np.zeros_like(corr)
    mask[np.triu_indices_from(mask)] = True
    with sns.axes_style("white"):
        sns.set(rc={'figure.figsize':(30,20)})
        ax = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, square=True, cmap='seismic', annot=True)
    return corr
correlation_matrix_earliest = plotFeaturesCorrelationMatrix(all_features_df__earliest)
# to extract and find the top correlated feature pairs in list format
levels = {'high': 0.8, 'mid': 0.4}
def extractCorrelatedFeatures(correlation_matrix, levels):
    # note: the deprecated .ix accessor is avoided; plain boolean masks are used instead
    correlation_dict = {}
    highly_correlation = correlation_matrix[(correlation_matrix >= levels['high']) & (correlation_matrix < 1)]
    correlation_dict['high_positive'] = findFeaturePairs(highly_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    highly_correlation = correlation_matrix[(correlation_matrix <= (-1 * levels['high']))]
    correlation_dict['high_negative'] = findFeaturePairs(highly_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    mid_correlation = correlation_matrix[(correlation_matrix >= levels['mid']) & (correlation_matrix < levels['high'])]
    correlation_dict['mid_positive'] = findFeaturePairs(mid_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    mid_correlation = correlation_matrix[((correlation_matrix <= (-1 * levels['mid'])) &
                                          (correlation_matrix > (-1 * levels['high'])))]
    correlation_dict['mid_negative'] = findFeaturePairs(mid_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    low_correlation = correlation_matrix[(correlation_matrix >= 0) & (correlation_matrix < levels['mid'])]
    correlation_dict['low_positive'] = findFeaturePairs(low_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    low_correlation = correlation_matrix[((correlation_matrix >= (-1 * levels['mid'])) & (correlation_matrix < 0))]
    correlation_dict['low_negative'] = findFeaturePairs(low_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    no_correlation = correlation_matrix[(correlation_matrix == 0)]
    correlation_dict['no'] = findFeaturePairs(no_correlation.dropna(axis=1, how='all').dropna(axis=0, how='all'))
    return correlation_dict
# given a correlation dataframe, iterate to find the correlated feature pairs:
def findFeaturePairs(correlation_matrix):
    arr_pairs = {}
    for row in correlation_matrix:
        if row != "RecordID":
            new_df = correlation_matrix[row][correlation_matrix[row].notna()].drop_duplicates(keep='first')
            if "RecordID" not in new_df and len(new_df) > 0:
                arr_pairs[frozenset({row, new_df.index[0]})] = new_df.iloc[0]
    return arr_pairs
import operator
def printFeaturePairs(correlation_dict):
# loop each dictionary
for idx, dictionary in correlation_dict.items():
print("====================",idx, "====================")
if '_negative' in idx:
sorted_d = sorted(dictionary.items(), key=operator.itemgetter(1))
else:
sorted_d = sorted(dictionary.items(), key=operator.itemgetter(1), reverse=True)
for _set in sorted_d:
print("Feature pair {} with corr. coeff. {}".format([feature for feature in _set[0]] , round(_set[1], 3)))
print("====================",idx, "====================\n")
correlation_dict = extractCorrelatedFeatures(correlation_matrix_earliest, levels)
printFeaturePairs(correlation_dict)
Observation: several feature pairs are highly correlated. These correlated features can be identified and processed together to prevent multicollinearity.
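One way to act on this observation is to compute variance inflation factors (VIFs); a minimal sketch, assuming a hand-picked subset of the correlated numeric features and dropping rows with missing values (rough, no intercept column added):
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_feats = ['SysABP', 'MAP', 'DiasABP', 'NISysABP', 'NIMAP', 'NIDiasABP']  # assumed subset
X_vif = all_features_df__earliest[vif_feats].dropna()
vifs = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                 index=vif_feats)
print(vifs.sort_values(ascending=False))  # VIF > 10 is a common multicollinearity flag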
from statsmodels.formula.api import ols
def printBestFeaturesForRegression(data, features):
betas = []
beta_stds = []
pvalues = []
R2s = []
best_R2_model = None
for f in features:
if f == 'Length_of_stay': continue
res = ols(formula='Length_of_stay ~ {}'.format(f), data=data).fit()
betas.append(res.params[f])
beta_stds.append(res.bse[f])
pvalues.append(res.pvalues[f])
R2s.append(res.rsquared)
# print(res.rsquared)
if res.rsquared >= np.max(R2s):
best_R2_model = res
slr_res = pd.DataFrame(data={'β':np.round(betas,4), 'β_std': np.round(beta_stds,4),
'p-value':np.round(pvalues,4), 'R-squared':np.round(R2s,4)},
index=features)
print('Result of Simple linear Regression')
print(slr_res[['β', 'β_std', 'p-value','R-squared']])
print('\nFeatures with significant correlation')
print(slr_res.index[slr_res['p-value']<0.05])
print("\npredictor with maximum R-squared is", slr_res['R-squared'].idxmax())
features = ['Age', 'Gender','Height', 'ICUType', 'Weight', 'BUN', 'Creatinine', 'GCS', 'Glucose',
'HCO3', 'HCT', 'HR', 'K', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na',
'Platelets', 'RespRate', 'Temp', 'TroponinT', 'Urine', 'WBC', 'Lactate',
'DiasABP', 'FiO2', 'MAP', 'MechVent', 'PaCO2', 'PaO2', 'SaO2', 'SysABP',
'pH', 'ALP', 'ALT', 'AST', 'Albumin', 'Bilirubin', 'TroponinI',
'Cholesterol']
printBestFeaturesForRegression(all_features_df__earliest, features)
Among the variables present when the patient is admitted to the ICU, Albumin is the predictor with the maximum R-squared against Length of Stay.
Of the variables measured in the early part of the ICU admission, the following have a significant correlation with Length of Stay (patients' condition on admission):
'Age', 'ICUType', 'Weight', 'BUN', 'Creatinine', 'GCS', 'HCO3', 'HCT', 'HR', 'Mg', 'NISysABP', 'Urine', 'MechVent', 'PaO2', 'Albumin', 'Bilirubin'
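As a quick spot-check of this result, one can refit the single-feature model (a sketch; the formula API drops rows with missing values by default):
res_alb = ols(formula='Length_of_stay ~ Albumin', data=all_features_df__earliest).fit()
print(round(res_alb.rsquared, 4), round(res_alb.pvalues['Albumin'], 4))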
features = ['Age', 'Gender','Height', 'ICUType', 'Weight', 'BUN', 'Creatinine', 'GCS', 'Glucose',
'HCO3', 'HCT', 'HR', 'K', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na',
'Platelets', 'RespRate', 'Temp', 'TroponinT', 'Urine', 'WBC', 'Lactate',
'DiasABP', 'FiO2', 'MAP', 'MechVent', 'PaCO2', 'PaO2', 'SaO2', 'SysABP',
'pH', 'ALP', 'ALT', 'AST', 'Albumin', 'Bilirubin', 'TroponinI',
'Cholesterol']
printBestFeaturesForRegression(all_features_df__most_recent, features)
For patients' health at the end of the 48 hours, Albumin is again the predictor with the maximum R-squared against Length of Stay.
The following variables describing patients' health by the end of the 48 hours have a significant correlation with Length of Stay:
'Age', 'ICUType', 'Weight', 'BUN', 'Creatinine', 'GCS', 'Glucose', 'HCO3', 'HCT', 'HR', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Platelets', 'Urine', 'MechVent', 'Albumin', 'Bilirubin'
features = ['Age', 'Gender','Height', 'ICUType', 'Weight', 'BUN', 'Creatinine', 'GCS', 'Glucose',
'HCO3', 'HCT', 'HR', 'K', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na',
'Platelets', 'RespRate', 'Temp', 'TroponinT', 'Urine', 'WBC', 'Lactate',
'DiasABP', 'FiO2', 'MAP', 'MechVent', 'PaCO2', 'PaO2', 'SaO2', 'SysABP',
'pH', 'ALP', 'ALT', 'AST', 'Albumin', 'Bilirubin', 'TroponinI',
'Cholesterol']
printBestFeaturesForRegression(all_features_df__freq, features)
With the frequency of monitoring patients' health signs, MechVent is the predictor with the maximum R-squared for Length of Stay.
The Frequency of monitoring the variables:
Objective: To convert the temporal data to a single feature vector per patient to obtain the Design Matrix for modelling
Physiology is the scientific study of the functions and mechanisms that operate within a living system. There are 12 physiological features that are tracked and recorded within the first 24 hours of ICU admission, as stated in a journal article (journal30024-X.pdf).
These features are:
Below lists the occurrence of each parameter within the respective timeframe for a patient:
physiological_feats = ["Bilirubin", "BUN", "GCS", "HCO3", "HR", "K", "Na", "PaO2", "SysABP", "Temp", "Urine", "WBC"]
X = all_static_dfs
df_tmp = pd.DataFrame()
data = all_patients['132540']
idx = data['Parameter'].isin(physiological_feats)
df_phy_temp = data.loc[idx, :]
Below is a consolidated summary (count and distribution statistics) of each physiological parameter for a single patient.
df_phy_par = df_phy_temp.groupby(['Parameter'])['Value']
df_phy_par.describe()
physiological_feats.append("In-hospital_death") # to add target variable for data exploration
plt.rcParams.update({'font.size': 14})
df_phy_freq = all_features_df__freq[physiological_feats]
#df_phy_freq
correlation_matrix_phy_freq = df_phy_freq.corr()
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data=correlation_matrix_phy_freq, cmap='seismic', annot=True, vmin=-1, vmax=1)
plt.show()
df_phy_most_recent = all_features_df__most_recent[physiological_feats]
#df_phy_most_recent
correlation_matrix_phy_most_recent = df_phy_most_recent.corr()
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data=correlation_matrix_phy_most_recent, cmap='seismic', annot=True, vmin=-1, vmax=1)
plt.show()
Based on the correlation matrices between the physiological features, both the freq and most_recent aggregators are considered for the design matrix. Most of the features have a high correlation when using the freq aggregator; thus most of the features used for the design matrix are most_recent features:
- freq - Frequency: HCO3, Urine
- most_recent - Most Recent: HR, Bilirubin, BUN, GCS, K, Na, PaO2, SysABP, Temp, WBC
#Fix values with wrong inputs in most recent dataframes for physiological variables.
def replace_value(df, c, value=np.nan, below=None, above=None):
idx = c
if below is not None:
idx = df[c] < below
if above is not None:
idx = df[c] > above
    if callable(value):
        # value replacement is a function of the current value
df.loc[idx, c] = df.loc[idx, c].apply(value)
else:
df.loc[idx, c] = value
return df
df_phy_most_recent = replace_value(df_phy_most_recent, 'SysABP', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'HR', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'HR', value=np.nan, above=299)
df_phy_most_recent = replace_value(df_phy_most_recent, 'PaO2', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'PaO2', value=lambda x: x*10, below=20)
df_phy_most_recent = replace_value(df_phy_most_recent, 'Temp', value=np.nan, below=25)
df_phy_most_recent = replace_value(df_phy_most_recent, 'Temp', value=np.nan, above=45)
df_phy_most_recent = replace_value(df_phy_most_recent, 'WBC', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'HCO3', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'Urine', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'BUN', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'Na', value=np.nan, below=1)
df_phy_most_recent = replace_value(df_phy_most_recent, 'Bilirubin', value=np.nan, below=0)
#df_phy_most_recent
phy_feats_freq = ["HCO3", "Urine"]
phy_feats_most_recent = ["HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
df_phy_freq_v2 = df_phy_freq[phy_feats_freq]
df_phy_most_recent_v2 = df_phy_most_recent[phy_feats_most_recent]
df_phy_v2 = df_phy_freq_v2.join(df_phy_most_recent_v2)
#df_phy_v2
correlation_matrix_phy_v2 = df_phy_v2.corr()
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data=correlation_matrix_phy_v2, cmap='seismic', annot=True, vmin=-1, vmax=1)
plt.show()
#This is to retrieve the median values for each feature.
phy_median = pd.DataFrame(data=df_phy_v2.median(), columns=['Median'])
phy_feats_freq = ['RecordID', "HCO3", "Urine"]
phy_feats_most_recent = ["HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
phy_feats = ["HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
#Merge required frequency and most recent values from respective dictionaries to form design matrix 1.
design_matrix_1 = {}
design_matrix_1_freq = {}
design_matrix_1_most_recent = {}
for key, ids_list in all_temporal_dfs_folds__most_recent.items():
design_matrix_1[key] = pd.DataFrame()
design_matrix_1_freq[key] = all_temporal_dfs_folds__freq[key][phy_feats_freq]
design_matrix_1_most_recent[key] = all_temporal_dfs_folds__most_recent[key][phy_feats_most_recent]
design_matrix_1[key] = design_matrix_1_freq[key].join(design_matrix_1_most_recent[key])
#design_matrix_1
# Replace NaN with the median value for each parameter
# (impractical out-of-range values were already replaced with NaN above)
for key, ids_list in design_matrix_1.items():
for col in phy_feats:
median_value = phy_median.loc[col]['Median']
design_matrix_1[key][col].fillna(median_value, inplace=True)
design_matrix_1
The most recent value of each vital-sign variable is taken, as it is highly indicative of a patient's condition at the end of the 48 hours.
NaN values for each vital-sign variable are replaced with a value in the normal range, on the assumption that the patient is fine on that variable and measurements were therefore not taken. The NaN values are not replaced with mean values because a patient's condition on a particular variable (e.g. MAP) does not depend on the condition of the rest of the patients.
The data are carefully evaluated fold by fold: missing data are replaced and the values are categorised by whether they fall in the normal range.
temp is created as a copy of the most recent temporal data (each fold's dataframe is copied, since dict.copy() alone is shallow).
temp = {key: df.copy() for key, df in all_temporal_dfs_folds__most_recent.items()}
Reason for choosing MechVent: these patients are unable to breathe normally on their own and require mechanical means to assist or replace their spontaneous breathing. The use of MechVent is related to a patient's mortality.
NaN values are replaced with 0.
Patients who were on MechVent are categorised as 1 and those without as 0.
# categorise MechVent: 1 if on mechanical ventilation, else 0 (NaN counts as 0)
mechvent_list_fold1 = [1 if i == 1.0 else 0 for i in temp['Fold1']['MechVent']]
mechvent_list_fold2 = [1 if i == 1.0 else 0 for i in temp['Fold2']['MechVent']]
mechvent_list_fold3 = [1 if i == 1.0 else 0 for i in temp['Fold3']['MechVent']]
mechvent_list_fold4 = [1 if i == 1.0 else 0 for i in temp['Fold4']['MechVent']]
Reason for choosing Temp: a body temperature that is too high can cause malfunction and ultimately failure of most organs, which can eventually cause death.
Hypothermia: <35.0 °C; Normal: 36.5-37.5 °C; Fever: >37.5 or 38.3 °C; Hyperthermia: >37.5 or 38.3 °C; Hyperpyrexia: >40.0 or 41.0 °C
NaN values are replaced with 37.0.
Patients with hypothermia are categorised as 1, normal as 2, fever as 3 and those above 38.3 °C as 4.
# categorise temperature: 1 hypothermia (<35), 2 normal (<=37.5), 3 fever (<=38.3), 4 above 38.3
def categorizeTemp(i):
    if float(i) < 35:
        return 1
    elif float(i) <= 37.5:
        return 2
    elif float(i) <= 38.3:
        return 3
    else:
        return 4

for fold in ['Fold1', 'Fold2', 'Fold3', 'Fold4']:
    temp[fold]['Temp'].fillna('37.0', inplace=True)
temp_list1 = [categorizeTemp(i) for i in temp['Fold1']['Temp']]
temp_list2 = [categorizeTemp(i) for i in temp['Fold2']['Temp']]
temp_list3 = [categorizeTemp(i) for i in temp['Fold3']['Temp']]
temp_list4 = [categorizeTemp(i) for i in temp['Fold4']['Temp']]
Reason for choosing GCS: a GCS score of 3 is the lowest possible score and is associated with an extremely high mortality rate and a low chance of survival.
The range of GCS is 3-15. A score of 3-8 indicates coma; above 8, there is a good chance of recovery.
NaN values are replaced with 13.
Patients in a coma are categorised as 1 and those who are not as 0.
# categorise GCS: 1 if comatose (<=8), else 0
for fold in ['Fold1', 'Fold2', 'Fold3', 'Fold4']:
    temp[fold]['GCS'].fillna('13', inplace=True)
GCS_list1 = [1 if float(i) <= 8 else 0 for i in temp['Fold1']['GCS']]
GCS_list2 = [1 if float(i) <= 8 else 0 for i in temp['Fold2']['GCS']]
GCS_list3 = [1 if float(i) <= 8 else 0 for i in temp['Fold3']['GCS']]
GCS_list4 = [1 if float(i) <= 8 else 0 for i in temp['Fold4']['GCS']]
Reason for choosing the PaO2/FiO2 ratio: it compares the oxygen level in the blood with the oxygen concentration that is breathed in, and it is associated with the chance of mortality.
There are a total of 3 classifications:
Mild: 200-300, with a mortality rate of 27%.
Moderate: 100-200, with a mortality rate of 32%.
Severe: <100, with a mortality rate of 45%.
NaN values for PaO2 are replaced with 80 and NaN values for FiO2 are replaced with 0.5.
Patients at the mild level are categorised as 1, moderate as 2 and severe as 3.
#Replace all NaN in PaO2 with 80
temp['Fold1']['PaO2'].fillna('80', inplace=True)
temp['Fold2']['PaO2'].fillna('80', inplace=True)
temp['Fold3']['PaO2'].fillna('80', inplace=True)
temp['Fold4']['PaO2'].fillna('80', inplace=True)
#Replace all NaN in FiO2 with 0.5
temp['Fold1']['FiO2'].fillna('0.5', inplace=True)
temp['Fold2']['FiO2'].fillna('0.5', inplace=True)
temp['Fold3']['FiO2'].fillna('0.5', inplace=True)
temp['Fold4']['FiO2'].fillna('0.5', inplace=True)
#To find the PaO2/FiO2 level
PaO2_FiO2_ratio1 = (temp['Fold1']['PaO2'].astype(float))/ (temp['Fold1']['FiO2'].astype(float))
PaO2_FiO2_ratio2 = (temp['Fold2']['PaO2'].astype(float))/ (temp['Fold2']['FiO2'].astype(float))
PaO2_FiO2_ratio3 = (temp['Fold3']['PaO2'].astype(float))/ (temp['Fold3']['FiO2'].astype(float))
PaO2_FiO2_ratio4 = (temp['Fold4']['PaO2'].astype(float))/ (temp['Fold4']['FiO2'].astype(float))
#1: mild, 2: Moderate, 3: Severe
def categorizePaO2FiO2(ratio):
    if float(ratio) < 100:
        return 3
    elif float(ratio) <= 200:
        return 2
    else:
        return 1

PaO2_FiO2_ratio_list1 = [categorizePaO2FiO2(i) for i in PaO2_FiO2_ratio1]
PaO2_FiO2_ratio_list2 = [categorizePaO2FiO2(i) for i in PaO2_FiO2_ratio2]
PaO2_FiO2_ratio_list3 = [categorizePaO2FiO2(i) for i in PaO2_FiO2_ratio3]
PaO2_FiO2_ratio_list4 = [categorizePaO2FiO2(i) for i in PaO2_FiO2_ratio4]
Reason for choosing HR: there is an association of heart rate with mortality.
The normal range is 50-100 bpm.
NaN values are replaced with 80.
Patients with a normal HR are categorised as 0 and those without as 1.
# categorise HR against the normal 50-100 bpm range (0: normal, 1: abnormal)
for fold in ['Fold1', 'Fold2', 'Fold3', 'Fold4']:
    temp[fold]['HR'].fillna('80', inplace=True)
HR_list1 = [0 if 50 <= float(i) <= 100 else 1 for i in temp['Fold1']['HR']]
HR_list2 = [0 if 50 <= float(i) <= 100 else 1 for i in temp['Fold2']['HR']]
HR_list3 = [0 if 50 <= float(i) <= 100 else 1 for i in temp['Fold3']['HR']]
HR_list4 = [0 if 50 <= float(i) <= 100 else 1 for i in temp['Fold4']['HR']]
Reason for choosing MAP: the normal range is between 65 and 110 mmHg. Below the normal range, vital organs do not get enough oxygen perfusion and become hypoxic.
MAP is a better factor for determining a patient's mortality than systolic blood pressure, and the MAP range is associated with mortality.
NaN values are replaced with 80.
Patients within the normal MAP range are categorised as 0 and those outside it as 1.
# categorise MAP against the normal 65-110 mmHg range (0: normal, 1: abnormal)
for fold in ['Fold1', 'Fold2', 'Fold3', 'Fold4']:
    temp[fold]['MAP'].fillna('80', inplace=True)
MAP_list1 = [0 if 65 <= float(i) <= 110 else 1 for i in temp['Fold1']['MAP']]
MAP_list2 = [0 if 65 <= float(i) <= 110 else 1 for i in temp['Fold2']['MAP']]
MAP_list3 = [0 if 65 <= float(i) <= 110 else 1 for i in temp['Fold3']['MAP']]
MAP_list4 = [0 if 65 <= float(i) <= 110 else 1 for i in temp['Fold4']['MAP']]
Now, collate a dataframe for each fold containing the six variables above, then combine the dataframes to form design matrix 2.
vital_sign_list1 = list(zip(list(all_static_dfs_folds['Fold1']['RecordID']), mechvent_list_fold1, temp_list1, GCS_list1, PaO2_FiO2_ratio_list1, HR_list1, MAP_list1))
vital_sign_list2 = list(zip(list(all_static_dfs_folds['Fold2']['RecordID']), mechvent_list_fold2, temp_list2, GCS_list2, PaO2_FiO2_ratio_list2, HR_list2, MAP_list2))
vital_sign_list3 = list(zip(list(all_static_dfs_folds['Fold3']['RecordID']), mechvent_list_fold3, temp_list3, GCS_list3, PaO2_FiO2_ratio_list3, HR_list3, MAP_list3))
vital_sign_list4 = list(zip(list(all_static_dfs_folds['Fold4']['RecordID']), mechvent_list_fold4, temp_list4, GCS_list4, PaO2_FiO2_ratio_list4, HR_list4, MAP_list4))
vital_sign_list1_df = pd.DataFrame(vital_sign_list1, columns = ['RecordID', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP'])
vital_sign_list2_df = pd.DataFrame(vital_sign_list2, columns = ['RecordID', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio','HR', 'MAP'])
vital_sign_list3_df = pd.DataFrame(vital_sign_list3, columns = ['RecordID', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP'])
vital_sign_list4_df = pd.DataFrame(vital_sign_list4, columns = ['RecordID', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP'])
design_matrix_2 = {}
design_matrix_2['Fold1'] = vital_sign_list1_df
design_matrix_2['Fold2'] = vital_sign_list2_df
design_matrix_2['Fold3'] = vital_sign_list3_df
design_matrix_2['Fold4'] = vital_sign_list4_df
design_matrix_2
Objective: To use the most recent data as a whole, see how the features perform in the model, and iterate to improve performance.
all_temporal_dfs_folds__most_recent is used for the design matrix. These are the features within the 48-hour window; they can be further selected by a forward stepwise regression model afterwards (see the sketch below).
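Forward stepwise selection is not implemented in this notebook; below is a minimal greedy sketch using the ols helper imported earlier, with an assumed candidate list (the formula API drops rows with missing values for each fit):
def forwardStepwiseR2(data, candidates, target='Length_of_stay', max_feats=5):
    # greedily add the feature that raises R-squared the most at each step (sketch)
    selected = []
    for _ in range(max_feats):
        best_feat, best_r2 = None, -1.0
        for f in candidates:
            if f in selected:
                continue
            formula = '{} ~ {}'.format(target, ' + '.join(selected + [f]))
            r2 = ols(formula=formula, data=data).fit().rsquared
            if r2 > best_r2:
                best_feat, best_r2 = f, r2
        selected.append(best_feat)
        print('added', best_feat, 'R-squared =', round(best_r2, 4))
    return selected

forwardStepwiseR2(all_features_df__most_recent, ['Age', 'GCS', 'BUN', 'Albumin', 'HR', 'Urine'])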
all_temporal_dfs_folds__most_recent = getAllTemporalDataFrameByAggregationTypeInFolds(cv_fold, all_patients, "most_recent")
all_temporal_dfs_folds__most_recent
design_matrix_3 = all_temporal_dfs_folds__most_recent.copy()
Objectives: Based on "ICU Mortality Prediction: A Classification Algorithm for Imbalanced Datasets" by Bhattacharya, S., Rajan, V. and Shrivastava, H. (2017), select the features with the least missing data.
def getVariableStats(row, characteristics='mean'):
    # Mean - if the average value exceeds the healthy threshold, is mortality more likely?
    if characteristics == 'mean':
        return np.nanmean(row)
    # Median - if the median exceeds the healthy threshold, is mortality more likely?
    if characteristics == 'median':
        arr = np.asarray(row, dtype=np.float64)  # nanmedian does not require sorted input
        return np.nanmedian(arr)
    # Mode - if the most frequent value exceeds the healthy threshold, is mortality more likely?
    if characteristics == 'mode':
        mode = max(set(row), key=row.count)
        return mode
    # Standard deviation - if the spread is large, is mortality more likely?
    if characteristics == 'sd':
        if len(row) >= 2:
            return np.nanstd(row)  # np.nanstd computes the mean internally
        else:
            return 0
# given temporal data of a patient and selected temporal features, return a dictionary of a row
def createDesignMatrix4(data, features, characteristics):
record_id = data.loc[data['Time'] == '00:00', :]['Value'][0]
age = data.loc[data['Time'] == '00:00', :]['Value'][1]
gender = data.loc[data['Time'] == '00:00', :]['Value'][2]
icu_type = data.loc[data['Time'] == '00:00', :]['Value'][4]
data = data[['Time', 'Parameter', 'Value']].groupby(['Time','Parameter']).median().reset_index()
data_pivot = data.pivot(index='Time', columns='Parameter', values='Value')
row_dict = {}
for idx, row in data_pivot.T.iterrows():
if idx in features:
            row_dict[idx] = getVariableStats(list(row.dropna()), characteristics=characteristics)
row_dict['RecordID'] = record_id
row_dict['Age'] = age
row_dict['Gender'] = gender
row_dict['ICUType'] = icu_type
return row_dict
# return mostly numerical features
# features are selected from a research paper by Bhattacharya,S., Rajan,V. and Shrivastava, H. (2017). ICU Mortality Prediction: A Classification Algorithm for Imbalanced Datasets.
# They have the least amount of missing values (below 2%)
features = ['RecordID', 'Age', 'Gender', 'ICUType', 'HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
design_matrix_4 = {}
for key, ids_list in cv_fold.items():
design_matrix_4[key] = pd.DataFrame()
print(key, "has started extracting temporal data")
for patient_id in ids_list:
design_matrix_4[key] = design_matrix_4[key].append(createDesignMatrix4(all_patients[patient_id], features, "mean"), ignore_index=True)
print(key, "has completed\n")
print(len(design_matrix_4), "folds of patients' temporal data have been extracted")
# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Model Building
from sklearn.pipeline import Pipeline
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
# Model Evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix
from statsmodels.tools.eval_measures import rmse
Objective: Store design matrix train and test sets into their respective folds and iterations for CV in model building.
- design_matrices - to store all design matrices across the 4 folds
- initializeDesignMatrixTrainTestSet function - to group the 4 folds into training and test sets across 4 iterations for each design matrix

design_matrices = {}
# For Design Matrix into training and test
def initializeDesignMatrixTrainTestSet(design_matrix):
    design_matrices_iter = {}
    folds = ['Fold1', 'Fold2', 'Fold3', 'Fold4']
    # Iter1 holds out Fold4, Iter2 holds out Fold3, Iter3 holds out Fold2, Iter4 holds out Fold1
    for i, test_fold in enumerate(reversed(folds), start=1):
        train_folds = [fold for fold in folds if fold != test_fold]
        design_matrix_X_train = pd.concat([design_matrix[fold] for fold in train_folds])
        print('X_train', design_matrix_X_train.shape)
        design_matrix_X_test = design_matrix[test_fold]
        print('X_test', design_matrix_X_test.shape)
        design_matrix_Y_train = pd.concat([all_outcome_dfs_folds[fold] for fold in train_folds])
        print('Y_train', design_matrix_Y_train.shape)
        design_matrix_Y_test = all_outcome_dfs_folds[test_fold]
        print('Y_test', design_matrix_Y_test.shape)
        design_matrices_iter['Iter' + str(i)] = {'X_train': design_matrix_X_train,
                                                 'X_test': design_matrix_X_test,
                                                 'Y_train': design_matrix_Y_train,
                                                 'Y_test': design_matrix_Y_test}
    return design_matrices_iter
design_matrices['1'] = initializeDesignMatrixTrainTestSet(design_matrix_1)
design_matrices['2'] = initializeDesignMatrixTrainTestSet(design_matrix_2)
design_matrices['3'] = initializeDesignMatrixTrainTestSet(design_matrix_3)
features_dm3 = ['ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin',
'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender',
'Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate',
'MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na',
'PaCO2', 'PaO2', 'Platelets', 'RecordID', 'RespRate', 'SaO2', 'SysABP',
'Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
# For design matrix 3, impute the missing values (encoded as -1) for all features
imp_dm3 = SimpleImputer(missing_values=-1, strategy='most_frequent')
simple_imp_dm3 = Pipeline(steps=[('imputer', imp_dm3)])
for key, design_matrix in design_matrices['3'].items():
    design_matrix_3_X_train = design_matrix['X_train'].fillna(value=-1)
    print('Before', design_matrix_3_X_train.shape)
    simple_imp_dm3.fit(design_matrix_3_X_train)
    design_matrix_3_X_train_preprocessed = simple_imp_dm3.transform(design_matrix_3_X_train)
    print("After imputation:", design_matrix_3_X_train_preprocessed.shape)
    design_matrix_3_X_train_preprocessed = pd.DataFrame.from_records(design_matrix_3_X_train_preprocessed)
    design_matrix_3_X_train_preprocessed.columns = design_matrix_3_X_train.columns
    design_matrix['X_train'] = design_matrix_3_X_train_preprocessed
    design_matrix_3_X_test = design_matrix['X_test'].fillna(value=-1)
    print('Before', design_matrix_3_X_test.shape)
    # Transform (not refit) the test set with the imputer fitted on the training set, to avoid leakage
    design_matrix_3_X_test_preprocessed = simple_imp_dm3.transform(design_matrix_3_X_test)
    print("After imputation:", design_matrix_3_X_test_preprocessed.shape)
    design_matrix_3_X_test_preprocessed = pd.DataFrame.from_records(design_matrix_3_X_test_preprocessed)
    design_matrix_3_X_test_preprocessed.columns = design_matrix_3_X_test.columns
    design_matrix['X_test'] = design_matrix_3_X_test_preprocessed
design_matrices['4'] = initializeDesignMatrixTrainTestSet(design_matrix_4)
features_dm4 = ['RecordID', 'Age', 'Gender', 'ICUType', 'HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
# For design matrix 4, impute the missing values (encoded as -1) for all features
imp_dm4 = SimpleImputer(missing_values=-1, strategy='mean')
simple_imp_dm4 = Pipeline(steps=[('imputer', imp_dm4)])
for key, design_matrix in design_matrices['4'].items():
    design_matrix_4_X_train = design_matrix['X_train'].fillna(value=-1)
    print('Before', design_matrix_4_X_train.shape)
    design_matrix_4_X_train_preprocessed = simple_imp_dm4.fit_transform(design_matrix_4_X_train)
    print("After imputation:", design_matrix_4_X_train_preprocessed.shape)
    design_matrix_4_X_train_preprocessed = pd.DataFrame.from_records(design_matrix_4_X_train_preprocessed)
    design_matrix_4_X_train_preprocessed.columns = design_matrix_4_X_train.columns
    design_matrix['X_train'] = design_matrix_4_X_train_preprocessed
    design_matrix_4_X_test = design_matrix['X_test'].fillna(value=-1)
    print('Before', design_matrix_4_X_test.shape)
    # Transform (not refit) the test set with the imputer fitted on the training set, to avoid leakage
    design_matrix_4_X_test_preprocessed = simple_imp_dm4.transform(design_matrix_4_X_test)
    print("After imputation:", design_matrix_4_X_test_preprocessed.shape)
    design_matrix_4_X_test_preprocessed = pd.DataFrame.from_records(design_matrix_4_X_test_preprocessed)
    design_matrix_4_X_test_preprocessed.columns = design_matrix_4_X_test.columns
    design_matrix['X_test'] = design_matrix_4_X_test_preprocessed
A forward stepwise regression model helps us identify which set of features gives the most explanatory power in predicting Length of Stay, as reflected by the adjusted R-squared metric, while keeping the RMSE low.
# Model 1a (regression)
# Forward stepwise multiple linear regression
def forward_selected(data, response):
    # Greedy forward selection: repeatedly add the candidate feature that most
    # improves the adjusted R-squared, until no candidate improves it
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    t1 = datetime.now()
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, ' + '.join(selected + [candidate]))
            score = ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
        t2 = datetime.now()
        diff = t2 - t1
        # Safety timeout: stop the search if it has run for more than 30 seconds
        if diff.total_seconds() > 30:
            break
    formula = "{} ~ {}".format(response, ' + '.join(selected))
    model = ols(formula, data).fit()
    t3 = datetime.now()
    diff = t3 - t1
    print("Took about", diff.seconds, "seconds from start")
    return model
# Extract the selected feature names from a fitted model's formula,
# for reuse in the regularised regression for each iteration
def processFormulaToFeatures(formula):
    # Take the right-hand side of "response ~ x1 + x2 + ..." and split on '+'
    return formula.split('~')[1].replace(" ", "").split("+")
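For example, a fitted formula decomposes as follows:
processFormulaToFeatures("Length_of_stay ~ GCS + BUN + HR")  # -> ['GCS', 'BUN', 'HR']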
regression_rsme_2a = {}
for key, val in design_matrices.items():
print("Design Matrix", key)
regression_rsme_2a[key] = {'train': [], 'test':[], 'regression':[], 'formula': [], 'adjusted_rsquared':[], 'subfeatures': []}
for iternum, value in val.items():
df_train_ols = pd.merge(value['X_train'], value['Y_train'], how='left', on='RecordID')
if key == '1':
features = ['Length_of_stay', "HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
if key == '2':
features = ['Length_of_stay', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP']
if key == '3':
features = ['Length_of_stay', 'ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender','Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate','MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na','PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP','Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
if key == '4':
features = ['Length_of_stay','HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
regression_3 = forward_selected(df_train_ols[features], 'Length_of_stay')
Y_pred = regression_3.predict(value['X_test'])
RMSE = rmse(value['Y_test']['Length_of_stay'], Y_pred)
regression_rsme_2a[key]['train'].append(math.sqrt((regression_3.resid**2).mean()))
regression_rsme_2a[key]['test'].append(RMSE)
regression_rsme_2a[key]['regression'].append(regression_3)
regression_rsme_2a[key]['formula'].append(regression_3.model.formula)
regression_rsme_2a[key]['adjusted_rsquared'].append(regression_3.rsquared_adj)
regression_rsme_2a[key]['subfeatures'].append(processFormulaToFeatures(regression_3.model.formula))
print("Mean Training RSME:", round(mean(regression_rsme_2a[key]['train']),3))
print("Mean Test RSME:", round(mean(regression_rsme_2a[key]['test']),3))
print("Mean Adjusted R-Squared:", round(mean(regression_rsme_2a[key]['adjusted_rsquared']),3))
print("")
for key, iteration_dict in regression_rsme_2a.items():
    # Print the fitted model summary for each design matrix and iteration
for idx, val in enumerate(iteration_dict['regression']):
print('************************************************************\n')
print("Design Matrix", key)
print("--For Iteration ", idx+1)
print(regression_rsme_2a[key]['regression'][idx].summary())
print('************************************************************\n')
The key features identified above are then run through an elastic net regression, adding regularisation during training in an attempt to further reduce the RMSE and improve the adjusted R-squared.
# For Elastic_net Regression
regression_rsme_2b = {}
for key, val in design_matrices.items():
print("Design Matrix", key)
regression_rsme_2b[key] = {'train': [], 'test':[], 'regression':[], 'adjusted_rsquared':[]}
for iternum, value in val.items():
df_train_ols = pd.merge(value['X_train'], value['Y_train'], how='left', on='RecordID')
        # Map Iter1..Iter4 to index 0..3 in the stepwise results
        idx = int(iternum[-1]) - 1
        if key == '1':
            features = ['Length_of_stay', "HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
        if key == '2':
            features = ['Length_of_stay', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP']
        if key == '3':
            features = ['Length_of_stay', 'ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender', 'Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate', 'MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na', 'PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP', 'Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
        if key == '4':
            features = ['Length_of_stay', 'HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
        # Reuse the feature subset selected by the stepwise regression in this iteration
        new_reduced_features = regression_rsme_2a[key]['subfeatures'][idx]
regression_3b = ols(formula='Length_of_stay ~'+ ' + '.join(new_reduced_features), data=df_train_ols[features]).fit_regularized(method='elastic_net', refit=True)
Y_pred = regression_3b.predict(value['X_test'])
RMSE = rmse(value['Y_test']['Length_of_stay'], Y_pred)
regression_rsme_2b[key]['train'].append(math.sqrt((regression_3b.resid**2).mean()))
regression_rsme_2b[key]['test'].append(RMSE)
regression_rsme_2b[key]['regression'].append(regression_3b)
regression_rsme_2b[key]['adjusted_rsquared'].append(regression_3b.rsquared_adj)
print("Mean Training RSME:", round(mean(regression_rsme_2b[key]['train']),3))
print("Mean Test RSME:", round(mean(regression_rsme_2b[key]['test']),3))
print("Mean Adjusted R-Squared:", round(mean(regression_rsme_2b[key]['adjusted_rsquared']),3))
print("")
for key, iteration_dict in regression_rsme_2b.items():
    # Print the fitted model summary for each design matrix and iteration
for idx, val in enumerate(iteration_dict['regression']):
print('************************************************************\n')
print("Design Matrix", key)
print("--For Iteration ", idx+1)
print(regression_rsme_2b[key]['regression'][idx].summary())
print('************************************************************\n')
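If we wanted to push Model 1b further, one option (not run here; a sketch only) is to sweep the elastic-net penalty strength alpha and the L1/L2 mix L1_wt rather than relying on statsmodels' defaults:
# Hypothetical penalty sweep; df_train_ols and new_reduced_features are the same objects as above
for alpha in [0.001, 0.01, 0.1]:
    for l1_wt in [0.0, 0.5, 1.0]:  # 0.0 = pure ridge, 1.0 = pure lasso
        fit = ols('Length_of_stay ~ ' + ' + '.join(new_reduced_features),
                  data=df_train_ols).fit_regularized(method='elastic_net',
                                                     alpha=alpha, L1_wt=l1_wt, refit=True)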
Unfortunately, Model 1b does not give us any improvement in performance; it actually performs worse than Model 1a.
As such, we retried with a plain multiple linear regression model as follows:
# Model 2 (regression)
# Multiple Linear Regression
regression_rsme = {}
for key, val in design_matrices.items():
print("Design Matrix", key)
regression_rsme[key] = {'train': [], 'test':[], 'regression': [], 'adjusted_rsquared':[]}
for iternum, value in val.items():
df_train_ols = pd.merge(value['X_train'], value['Y_train'], how='left', on='RecordID')
if key == '1':
features = ['Length_of_stay', "HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
sub_features = ["HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
if key == '2':
features = ['Length_of_stay', 'MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP']
sub_features = ['MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP']
if key == '3':
features = ['Length_of_stay', 'ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender','Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate','MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na','PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP','Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
sub_features = ['ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender','Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate','MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na','PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP','Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
if key == '4':
features = ['Length_of_stay','HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
sub_features = ['HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
multlin = ols(formula='Length_of_stay ~'+ ' + '.join(sub_features), data=df_train_ols[features]).fit()
Y_pred = multlin.predict(value['X_test'])
RMSE = rmse(value['Y_test']['Length_of_stay'], Y_pred)
regression_rsme[key]['train'].append(math.sqrt((multlin.resid**2).mean()))
regression_rsme[key]['test'].append(RMSE)
regression_rsme[key]['regression'].append(multlin)
regression_rsme[key]['adjusted_rsquared'].append(multlin.rsquared_adj)
print("Mean Training RSME:", round(mean(regression_rsme[key]['train']),3))
print("Mean Test RSME:", round(mean(regression_rsme[key]['test']),3))
print("Mean Adjusted R-Squared:", round(mean(regression_rsme[key]['adjusted_rsquared']),3))
print("")
Model 2's multiple linear regression performed better than the combination of Models 1a and 1b (forward stepwise regression + elastic net regression).
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score
estimators = [] #Store the list of estimators
hyperparams_list = [] #Store the list of hyperparams
auc_scores = {} #Store a dictionary of auc scores over the 4 design matrices
Next, we try a decision tree model for classification.
model_1_preprocessor = Pipeline(steps=[('f_scaler', StandardScaler(with_mean=True))])
# Model 1 (classification) - building
# Decision Tree Classifier
classifer_1_dtc = Pipeline([('preprocessor', model_1_preprocessor),
('dtc', DecisionTreeClassifier())])
hparam_grid_classifer_1_dtc = {'dtc__random_state':[0,1,2,3,4,5]}
estimators.append(classifer_1_dtc)
hyperparams_list.append(hparam_grid_classifer_1_dtc)
A basic logistic regression pipeline: StandardScaler normalises the data, SelectKBest retains the highest-scoring features, and PCA projects onto the components that explain most of the variance.
# Model 2 (classification) - building
# Logistic Regression
classifer_1_lr_1 = Pipeline([('preprocessor', model_1_preprocessor),
('select_best', SelectKBest(k='all')),
('pca', PCA()),
('classifier_LR', LogisticRegression(solver='lbfgs', random_state=0))])
hparam_grid_classifer_1_lr_1 = {'classifier_LR__C': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0],'pca__tol': [0.5, 1.0, 1.5, 2.0], 'pca__random_state': [0,1,2,3,4,5]}
estimators.append(classifer_1_lr_1)
hyperparams_list.append(hparam_grid_classifer_1_lr_1)
Other models we tried in an attempt to outperform the models above:
# K-Neighbors Classifier
classifer_1_knc_1 = Pipeline([('preprocessor', model_1_preprocessor),
('select_best', SelectKBest(k='all')),
('pca', PCA()),
('knc', KNeighborsClassifier(n_neighbors=3))])
hparam_grid_classifer_1_knc_1 = {'knc__n_neighbors': [25,50], 'select_best__k': [1, 2, 3, 4, 5],'pca__tol': [0.5, 1.0, 1.5, 2.0], 'pca__random_state': [0,1,2,3,4,5]}
estimators.append(classifer_1_knc_1)
hyperparams_list.append(hparam_grid_classifer_1_knc_1)
# Random Forest Classifier
classifer_4 = Pipeline(steps=[('f_selecter_kbest', SelectKBest(k='all')),
('pca', PCA()),
('rf', RandomForestClassifier())])
hparam_grid_classifer_4 = {'rf__max_depth': [2,3,4,5], 'rf__n_estimators': [10, 20, 30, 40, 50], 'rf__random_state': [0,1,2,3,4,5]}
estimators.append(classifer_4)
hyperparams_list.append(hparam_grid_classifer_4)
t1 = datetime.now()
print("Started Data Extraction at", t1.strftime('%Y-%m-%d %H:%M:%S'))
for key, val in design_matrices.items():
print('For Design Matrix', key)
auc_scores[key] = {}
if key == '1':
features = ["HCO3", "Urine", "HR", "Bilirubin", "BUN", "GCS", "K", "Na", "PaO2", "SysABP", "Temp", "WBC"]
if key == '2':
features = ['MechVent', 'Temp', 'GCS', 'PaO2_FiO2_ratio', 'HR', 'MAP']
if key == '3':
features = ['ALP', 'ALT', 'AST', 'Age', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Gender','Glucose', 'HCO3', 'HCT', 'HR', 'Height', 'ICUType', 'K', 'Lactate','MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na','PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP','Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'Weight', 'pH']
if key == '4':
features = ['HR', 'MAP', 'Temp', 'Na', 'K', 'Mg']
for index, est in enumerate(estimators):
auc_scores[key][index] = {'train': [], 'test': [], 'accuracy': [], 'precision': [], 'recall': [], 'classifier':[]}
for iternum, value in val.items():
estimator = GridSearchCV(est, hyperparams_list[index], cv=3, scoring='roc_auc')
estimator.fit(value['X_train'][features], value['Y_train']['In-hospital_death'])
#print("\nFor Classifier #", index+1)
#print('Best hyperparameter:', estimator.best_params_)
#print('Best Training AUC score:', round(estimator.best_score_,3))
best_est = estimator.best_estimator_
scores = cross_val_score(best_est, value['X_train'][features], value['Y_train']['In-hospital_death'], cv=3, scoring='roc_auc')
Y_pred = best_est.predict(value['X_test'][features])
#Y_pred = estimator.predict(value['X_test'][features])
#print("Test AUC Score ", round(roc_auc_score(value['Y_test']['In-hospital_death'], Y_pred),3))
#print("Test accuracy = {:.2%}".format(accuracy_score(value['Y_test']['In-hospital_death'], Y_pred)))
#print("Precision: {:.2%}; Recall: {:.2%}".format(precision_score(value['Y_test']['In-hospital_death'], Y_pred),
# recall_score(value['Y_test']['In-hospital_death'], Y_pred)))
auc_scores[key][index]['train'].append(np.mean(scores))
#auc_scores[key][index]['train'].append(round(estimator.best_score_,3))
auc_scores[key][index]['test'].append(roc_auc_score(value['Y_test']['In-hospital_death'], Y_pred))
auc_scores[key][index]['accuracy'].append(accuracy_score(value['Y_test']['In-hospital_death'], Y_pred))
auc_scores[key][index]['precision'].append(precision_score(value['Y_test']['In-hospital_death'], Y_pred))
auc_scores[key][index]['recall'].append(recall_score(value['Y_test']['In-hospital_death'], Y_pred))
auc_scores[key][index]['classifier'].append(estimator)
t2 = datetime.now()
diff = t2-t1
print("Classifier", index+1, "- Mean Training AUC: ", round(mean(auc_scores[key][index]['train']), 3))
print("Classifier", index+1, "- Mean Test AUC: ", round(mean(auc_scores[key][index]['test']),3))
print("Classifier", index+1, "- Mean Accuracy Score: {:.2%}".format(mean(auc_scores[key][index]['accuracy'])))
print("Classifier", index+1, "- Mean Precision Score: {:.2%}".format(mean(auc_scores[key][index]['precision'])))
print("Classifier", index+1, "- Mean Recall Score: {:.2%}".format(mean(auc_scores[key][index]['recall'])))
t2 = datetime.now()
diff = t2-t1
#print("Design Matrix", key, "is done.\nTook about", diff.seconds, "seconds from start")
print('************************************************************\n')
t2 = datetime.now()
diff = t2-t1
#print("\nEntire Data Extraction completed after", diff.seconds, "seconds from start")
The workflow starts with the input and ends with two predictions.
Intermediate components come from the pipeline of our best-performing model.
Differences, if any, between training and prediction are explained (no implementation is required).
Input: 48 hours of patient data (in the same file format) for a single patient.
Output: predictions of both mortality and Length of Stay (LoS).
The patient's data is first extracted from the patient's file.
The temporal variables take the latest recorded value within the first 48 hours of the patient's stay, assuming the patient survives that 48-hour period.
Design Matrix 3's features comprise all the static and temporal variables.
The data is preprocessed with a simple imputer, replacing missing data (represented by -1) with the most frequent value across the respective folds.
The data is then divided into 4 folds and split into X and Y, where X holds the features and Y the target variables.
These X and Y are further divided into training and test sets.
The data is prepared for 4-fold cross-validation (CV): over 4 iterations, the model is trained on 3 folds and tested on the remaining fold.
The models are then evaluated by their respective performance metrics: mainly Root Mean Squared Error (RMSE) for the regression model and the Area Under the Curve (AUC) score for the classifier.
These metrics are averaged over the 4 iterations for each combination of design matrix and model.
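For reference, the same hold-one-fold-out rotation could be expressed with scikit-learn's KFold (a sketch under assumptions, not the code used above; combined_X and combined_Y stand for the four folds stacked together):
from sklearn.model_selection import KFold
kf = KFold(n_splits=4, shuffle=False)  # one fold held out per iteration, as in the 4 iterations above
for train_idx, test_idx in kf.split(combined_X):
    X_train, X_test = combined_X.iloc[train_idx], combined_X.iloc[test_idx]
    Y_train, Y_test = combined_Y.iloc[train_idx], combined_Y.iloc[test_idx]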
Additional metrics evaluated are as follows (see the short worked example after this list):
the accuracy score - the proportion of all predictions, death or survival, that the model gets right.
the precision - the proportion of predicted in-hospital deaths that are actual in-hospital deaths.
the recall - the proportion of actual in-hospital deaths that the model correctly identifies.
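As a tiny worked example with made-up labels (1 = in-hospital death; numbers are illustrative only):
y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # four actual deaths
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]  # the model flags two patients as "Will Die"
print(precision_score(y_true, y_pred))  # 0.5: one of the two "Will Die" predictions is correct
print(recall_score(y_true, y_pred))     # 0.25: one of the four actual deaths is caught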
The best model from cross-validation for each task is then deployed into the system.
Each time a new patient's 48-hour interval of data arrives, it is extracted and preprocessed as one observation, and the respective models are run for a prediction: the regression model predicts the length of stay, while the classifier predicts the patient's mortality, either "Will Die" or "Will not die".
These predictions allow hospital managers to better allocate resources towards their objective of saving lives.
For the length-of-stay prediction: hospital managers can schedule the ICU units in advance based on likely future demand.
For the mortality prediction: more resources would likely be allocated to patients predicted as "Will Die", so as to increase their chance of survival.
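A minimal sketch of that deployment step (parsePatientFile and createDesignMatrix3Row are hypothetical stand-ins for the extraction code above; best_regressor and best_classifier stand for the winning fitted models):
def predictForNewPatient(filepath, best_regressor, best_classifier):
    # Parse one patient's 48-hour file and build a single-row design matrix (hypothetical helpers)
    patient = parsePatientFile(filepath)
    row = pd.DataFrame([createDesignMatrix3Row(patient)])
    # Apply the same preprocessing as in training: -1 for missing values, then the fitted imputer
    row = row.fillna(value=-1)
    row = pd.DataFrame(simple_imp_dm3.transform(row), columns=row.columns)
    los = float(np.asarray(best_regressor.predict(row))[0])
    death = int(np.asarray(best_classifier.predict(row))[0])
    return {'Length_of_stay': los,
            'Mortality': 'Will Die' if death == 1 else 'Will not die'}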
With the features of Design Matrix 3, the regression model has an explanatory power (adjusted R-squared) of 0.108 in predicting each patient's length of stay.
The model has an RMSE of 19.795, measuring the discrepancy between the observed values and the values predicted by the model.
The logistic regression pipeline with StandardScaler, SelectKBest and PCA has the best predictive power, with a ROC AUC score of 0.817 in training and 0.613 in testing.
It also has the highest accuracy score of 87.45%, along with a precision of 61.67% and a recall of 25.16%.
Accuracy: the model correctly classifies roughly 7 out of every 8 patients.
Precision: each "Will Die" prediction turns out to be correct about 61.67% of the time.
Recall: only about a quarter of the patients who actually die are flagged as "Will Die" by the model.