EHR-based studies are typically conducted on a group of patients who meet specific inclusion/exclusion criteria. Some examples are:
Diagnosis-based studies: patients diagnosed with a particular disease or condition.
Treatment-based studies: patients who have undergone a particular procedure or who were prescribed a particular medication.
Device-based studies: patients implanted with specific devices.
In all the cases above, the first task is to identify the patient cohort based on the defined criteria. Here, we will identify patients diagnosed with asthma in MIMIC.
Creating an Asthma Cohort
A common strategy for identifying the patient cohort is to use the ICD codes corresponding to asthma. However, ICD codes can be very granular, and different studies on asthma often end up using different sets of ICD codes, leading to inconsistencies. An alternative approach is to use the PheCodes corresponding to asthma and filter patients based on these codes. The PheWAS Catalog provides a mapping between PheCodes and ICD codes.
Querying the PheWAS Catalog shows that asthma corresponds to PheCode 495. Given this PheCode, we retrieve the full set of ICD codes mapped to it and then identify patients with at least one of those codes. It is common practice to further require at least two occurrences of these codes, preferably on separate dates, to improve phenotyping specificity. For this tutorial, we will require just one occurrence.
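As an aside, the stricter "at least two occurrences on distinct dates" rule mentioned above is straightforward to implement with pandas. The sketch below uses hypothetical toy data (the column names mirror this tutorial's conventions) rather than the MIMIC files:

```python
# Sketch: requiring >= 2 asthma-code occurrences on distinct dates.
# Toy data for illustration; subject IDs and dates are made up.
import pandas as pd

dx = pd.DataFrame({
    'subject_id': ['A', 'A', 'B', 'B', 'C'],
    'PheCode':    ['495', '495', '495', '495', '495'],
    'date':       ['2120-01-01', '2120-02-10', '2121-03-05', '2121-03-05', '2122-07-19'],
})

# Count distinct diagnosis dates per patient and keep those with >= 2
dates_per_patient = dx.groupby('subject_id')['date'].nunique()
strict_cohort = dates_per_patient[dates_per_patient >= 2].index.tolist()
print(strict_cohort)  # ['A'] -- B's two codes share a date, C has only one
```

Patient B is excluded because both codes fall on the same date, which is exactly the specificity gain the two-date rule buys.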
import pandas as pd
import os

base_directory = os.path.dirname(os.getcwd())

diagnoses_icd = (
    pd.read_csv(
        os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp", "diagnoses_icd.csv"),
        dtype=str,
    ).rename(columns={'icd_code': 'code'})
)
diagnoses_icd['coding_system'] = "ICD" + diagnoses_icd['icd_version']
print("----Raw MIMIC IV 3.1 Diagnoses ICD----")
display(diagnoses_icd)

icd_to_phecode = pd.read_csv(
    os.path.join(base_directory, 'scripts', 'rollup_mappings', "icd_to_phecode.csv"),
    dtype=str,
)
print("----ICD to PheCode Rollup file----")
display(icd_to_phecode.head())

print("----Comprehensive Diagnoses including ICD and PheCode----")
comprehensive_diagnoses = pd.merge(diagnoses_icd, icd_to_phecode, on=['coding_system', 'code'])
display(comprehensive_diagnoses.head())

print("----Identifying the cohort of interest----")
comprehensive_asthma_cohort = comprehensive_diagnoses[comprehensive_diagnoses['PheCode'] == '495']
display(comprehensive_asthma_cohort)
print(comprehensive_asthma_cohort.describe())

print("----Final Asthma Cohort----")
asthma_cohort = comprehensive_asthma_cohort[['subject_id']].drop_duplicates()
display(asthma_cohort)
----Raw MIMIC IV 3.1 Diagnoses ICD----
         subject_id   hadm_id  seq_num   code  icd_version  coding_system
0          10000032  22595853        1   5723            9           ICD9
1          10000032  22595853        2  78959            9           ICD9
2          10000032  22595853        3   5715            9           ICD9
3          10000032  22595853        4  07070            9           ICD9
4          10000032  22595853        5    496            9           ICD9
...             ...       ...      ...    ...          ...            ...
6364483    19999987  23865745        7  41401            9           ICD9
6364484    19999987  23865745        8  78039            9           ICD9
6364485    19999987  23865745        9   0413            9           ICD9
6364486    19999987  23865745       10  36846            9           ICD9
6364487    19999987  23865745       11   7810            9           ICD9

6364488 rows × 6 columns
----ICD to PheCode Rollup file----
   code  PheCode  coding_system
0   001      008           ICD9
1  0010      008           ICD9
2  0011      008           ICD9
3  0019      008           ICD9
4   002      008           ICD9

----Comprehensive Diagnoses including ICD and PheCode----
Now that we have identified our cohort of 20,316 patients with asthma, we proceed to extract the variables of interest. The extracted data will be aggregated at the patient level later on. First, we create the directories for saving the final aggregated outputs.
# We will extract data of interest, aggregate them, and save them here
cohort_aggregateddata_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata')
os.makedirs(cohort_aggregateddata_directory, exist_ok=True)

# We will separate codified aggregated data and NLP aggregated data
cohort_aggregateddata_codified_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'codified')
os.makedirs(cohort_aggregateddata_codified_directory, exist_ok=True)
# Processing data in batches to extract patient data of interest
rolledup_diagnoses_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata', 'Diagnoses')
rolledup_diagnoses_batch_files = os.listdir(rolledup_diagnoses_directory)

# Peek at one batch file to confirm the schema
sample_rolled_diagnoses = pd.read_csv(os.path.join(rolledup_diagnoses_directory, rolledup_diagnoses_batch_files[0]), dtype=str)

extracted_diagnoses_dfs = []
for diagnoses_batch_file in rolledup_diagnoses_batch_files:
    diagnoses_batch = pd.read_csv(os.path.join(rolledup_diagnoses_directory, diagnoses_batch_file), dtype=str)
    diagnoses_batch_extracted = pd.merge(diagnoses_batch, asthma_cohort, on=['subject_id'], how='inner')
    display(diagnoses_batch_extracted)
    extracted_diagnoses_dfs.append(diagnoses_batch_extracted)

extracted_diagnoses = pd.concat(extracted_diagnoses_dfs)
display(extracted_diagnoses)
Important Note: The diagnoses data in the MIMIC-IV Hosp dataset does not include a specific time element. For the purpose of analysis, we have inferred the timing as the admission date. Please be cautious when using this assumption in downstream analyses as it may not accurately reflect the actual timing of diagnoses.
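The admission-date inference described in the note above amounts to a join against the admissions table. A minimal sketch with toy frames (the `admittime` column name follows the MIMIC-IV admissions schema; the IDs and dates are made up):

```python
# Sketch: attaching an approximate date to diagnoses via the admission record.
# Joining on hadm_id gives each diagnosis the admission date as a proxy timestamp.
import pandas as pd

diagnoses = pd.DataFrame({'subject_id': ['A', 'A'], 'hadm_id': ['H1', 'H2'], 'code': ['49390', '4659']})
admissions = pd.DataFrame({'hadm_id': ['H1', 'H2'], 'admittime': ['2120-01-01 10:30:00', '2120-02-10 08:00:00']})

dated = diagnoses.merge(admissions, on='hadm_id', how='left')
dated['date'] = dated['admittime'].str[:10]  # keep YYYY-MM-DD only
print(dated[['subject_id', 'code', 'date']])
```

Keep in mind the resulting `date` is the admission date, not the date the diagnosis was actually made.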
Please proceed to generate the patient-level data for the other data types: medications, procedures, and labs.
Unstructured Data: MIMIC-IV Discharge Summaries
In this final step, we provide a concise guide to transforming free-text notes into structured features suitable for downstream analysis, focusing on the asthma cohort and the discharge summaries from the MIMIC-IV Note module.
To identify and extract clinically relevant concepts from notes, we translate text into codes based on mappings provided by the Unified Medical Language System (UMLS). To do this, we can leverage tools such as cTAKES, NILE, and MetaMap.
For this tutorial, we use petehr, a Python toolkit custom-built for this tutorial. However, for better performance in real-world applications, we highly recommend using NILE or similar software.
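At its core, dictionary-based concept extraction scans the text for known terms and emits the corresponding CUIs. The toy sketch below illustrates the idea only; it is not the petehr, cTAKES, NILE, or MetaMap implementation, and the term-to-CUI entries are illustrative:

```python
# Minimal illustration of dictionary-based concept extraction:
# scan lowercase text for known terms and emit their CUIs.
# (Hypothetical dictionary; real tools handle tokenization, negation, etc.)
term_to_cui = {
    'asthma': 'C0004096',
    'wheezing': 'C0043144',
}

def extract_cuis(text):
    text = text.lower()
    return [cui for term, cui in term_to_cui.items() if term in text]

print(extract_cuis("Patient presents with asthma exacerbation and wheezing."))
```

Production tools add tokenization, longest-match resolution, negation detection, and section awareness on top of this basic lookup, which is why NILE or similar software is preferred in real studies.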
MIMIC-IV Note and Hosp Module Version Differences
Here, we will focus on processing discharge summaries from the MIMIC-IV deidentified notes. It is important to note that the most recent version of the MIMIC-IV Note data is version 2.2, whereas the MIMIC-IV Hosp data we are using comes from a newer release, version 3.1. Because the modules differ in release versions, some patients present in Hosp may be missing from Note.
This kind of mismatch is common in real-world EHR systems. To resolve it, we can:
1) Restrict analyses to the intersection of patients with both Note and Hosp data during the study window.
2) Align timelines so that structured events do not extend beyond the last available note for each patient.
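Both steps above can be sketched with pandas. The toy frames below are hypothetical (column names follow the tutorial's conventions):

```python
# Sketch: (1) restrict to patients present in both modules, then
# (2) drop structured events that fall after each patient's last note.
import pandas as pd

hosp = pd.DataFrame({'subject_id': ['A', 'A', 'B'], 'date': ['2120-01-01', '2125-06-01', '2121-02-02']})
notes = pd.DataFrame({'subject_id': ['A'], 'date': ['2123-01-01']})

# 1) Intersection of patients: B has no notes, so B's rows are removed
hosp = hosp[hosp['subject_id'].isin(notes['subject_id'])]

# 2) Trim structured events past the last note per patient
last_note = notes.groupby('subject_id')['date'].max().rename('last_note')
hosp = hosp.join(last_note, on='subject_id')
hosp = hosp[hosp['date'] <= hosp['last_note']]
print(hosp[['subject_id', 'date']])  # only A's 2120-01-01 event survives
```

The ISO `YYYY-MM-DD` format makes the string comparison in step 2 equivalent to a chronological one.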
# Installing Python toolkit for processing EHR data. For NLP, we recommend you use NILE, linked below:
# https://celehs.hms.harvard.edu/software/NILE.html
!pip install petehr

import os
import pandas as pd
from petehr import Text2Code

base_directory = os.path.dirname(os.getcwd())
cohort_aggregateddata_nlp_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'nlp')
os.makedirs(cohort_aggregateddata_nlp_directory, exist_ok=True)
Requirement already satisfied: petehr in /n/data1/hsph/biostat/celehs/lab/va67/anaconda/anaconda3/lib/python3.11/site-packages (0.0.1)
# The Asthma CUI dictionary was downloaded from here: https://shiny.parse-health.org/ONCE/
# Please see the bottom of this notebook for further resources
asthma_dictionary_file = os.path.join(base_directory, 'scripts', 'meta_files', 'Asthma_NLP_Dict.csv')
asthma_dictionary = pd.read_csv(asthma_dictionary_file, dtype=str)
display(asthma_dictionary)
print(asthma_dictionary)
print(asthma_dictionary.describe())
                                           STR       CUI
0                   pulmonary hypertension nos  C0020542
1     hypertensive pulmonary vascular disease  C0020542
2                       hypertension pulmonary  C0020542
3              pulmonary hypertension disorder  C0020542
4                      pulmonary hypertensions  C0020542
...                                        ...       ...
2975           drug screen qualitative digoxin  C0337449
2976  electrocardiogram myocardial infarction  C0428953
2977                 roche brand of bumetanide  C0701009
2978      glaxosmithkline brand of carvedilol  C0719509
2979               pfizer brand of eplerenone  C1144054

[2980 rows x 2 columns]

                             STR       CUI
count                       2976      2980
unique                      2833       276
top     ventricular tachycardia  C0038454
freq                           4        52
In the summary, we can see that we have identified 276 unique CUIs (Concept Unique Identifiers) related to asthma using the ONCE app. When we include the synonymous terms for these 276 CUIs, we have around 3,000 term-to-CUI mappings.
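The relationship between unique concepts and term mappings is worth internalizing: one CUI covers many surface forms. A tiny illustrative frame (made-up rows in the same `STR`/`CUI` shape as the dictionary above):

```python
# One concept (CUI), many synonymous terms (STR): the row count exceeds
# the unique-CUI count, exactly as in the describe() summary above.
import pandas as pd

d = pd.DataFrame({
    'STR': ['asthma', 'bronchial asthma', 'wheezing'],
    'CUI': ['C0004096', 'C0004096', 'C0043144'],
})
print(d['CUI'].nunique(), len(d))  # 2 unique concepts, 3 term mappings
```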
text2cui = Text2Code(asthma_dictionary_file)
Dictionary loaded sucessfully from /n/scratch/users/v/va67/EHR_TUTORIAL_WORKSPACE/scripts/meta_files/Asthma_NLP_Dict.csv
# Check how many patients from the Asthma cohort can be identified in the notes.
# Note: `discharge` is assumed to have been loaded earlier from the MIMIC-IV
# Note module's discharge summaries table.
print(len(asthma_cohort))
discharge_asthma_cohort = discharge[discharge['subject_id'].isin(asthma_cohort['subject_id'])]
display(discharge_asthma_cohort.describe())
20316

               note_id  subject_id   hadm_id  note_type  note_seq            charttime            storetime                         text
count            50286       50286     50286      50286     50286                50286                50284                        50286
unique           50286       14958     50286          1       135                25983                50232                        50285
top     10001725-DS-12    12468016  25563031         DS        21  2160-06-09 00:00:00  2116-12-03 14:09:00  \nName: ___ Unit No: ___...
freq                 1          85         1      50286      2391                    9                    2                            2
Out of 20,316 patients in the asthma cohort, only 14,958 (approximately 74%) have discharge notes.
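The coverage figure follows directly from the counts above (20,316 cohort patients; 14,958 unique `subject_id` values in the notes):

```python
# Quick coverage check using the counts reported above
cohort_size = 20316   # patients in the asthma cohort
with_notes = 14958    # unique subject_id values among discharge notes
print(f"{with_notes / cohort_size:.1%}")  # -> 73.6%
```

Gaps like this are expected when combining modules released at different versions, as discussed earlier.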
# Selecting only the required columns
discharge = discharge[['subject_id', 'charttime', 'text']]
display(discharge.head())
  subject_id            charttime                        text
0   10000032  2180-05-07 00:00:00  \nName: ___ Unit No: _...
1   10000032  2180-06-27 00:00:00  \nName: ___ Unit No: _...
2   10000032  2180-07-25 00:00:00  \nName: ___ Unit No: _...
3   10000032  2180-08-07 00:00:00  \nName: ___ Unit No: _...
4   10000084  2160-11-25 00:00:00  \nName: ___ Unit No: __...
# Filtering to retain only the notes associated with the patients of interest
asthma_cohort_notes = pd.merge(asthma_cohort, discharge, on=['subject_id'], how='inner')
display(asthma_cohort_notes)
asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
      subject_id            charttime                             text
0       10001725  2110-04-14 00:00:00  \nName: ___ Unit No: ___\n \nA...
1       10001884  2125-10-20 00:00:00  \nName: ___ Unit No: ___\n \nA...
2       10001884  2125-10-27 00:00:00  \nName: ___ Unit No: ___\n \nA...
3       10001884  2125-12-03 00:00:00  \nName: ___ Unit No: ___\n \nA...
4       10001884  2125-12-27 00:00:00  \nName: ___ Unit No: ___\n \nA...
...          ...                  ...                              ...
50281   19757198  2189-09-12 00:00:00  \nName: ___ Unit No: __...
50282   19757198  2191-05-26 00:00:00  \nName: ___ Unit No: __...
50283   19757198  2193-06-23 00:00:00  \nName: ___ Unit No: __...
50284   19757198  2194-02-19 00:00:00  \nName: ___ Unit No: __...
50285   19757198  2194-10-02 00:00:00  \nName: ___ Unit No: __...

50286 rows × 3 columns

(The second display, after dropna, is identical: no rows contained missing values, so the table still has 50286 rows × 3 columns.)
# Converting text to CUIs (Concept Unique Identifiers)
asthma_cohort_notes['note_cui'] = asthma_cohort_notes['text'].map(lambda x: text2cui.convert(x))
asthma_cohort_notes
      subject_id            charttime                             text                                           note_cui
0       10001725  2110-04-14 00:00:00  \nName: ___ Unit No: ___\n \nA...  C5441729,C5441729,C5441729,C5441729,C5441729,C...
1       10001884  2125-10-20 00:00:00  \nName: ___ Unit No: ___\n \nA...  C5441729,C5441729,C5441729,C5441729,C5441729,C...
2       10001884  2125-10-27 00:00:00  \nName: ___ Unit No: ___\n \nA...  C5441729,C5441729,C5441729,C5441729,C5441729,C...
3       10001884  2125-12-03 00:00:00  \nName: ___ Unit No: ___\n \nA...  C5441729,C5441729,C5441729,C5441729,C5441729,C...
4       10001884  2125-12-27 00:00:00  \nName: ___ Unit No: ___\n \nA...  C5441729,C5441729,C5441729,C5441729,C5441729,C...
...          ...                  ...                              ...                                                ...
50281   19757198  2189-09-12 00:00:00  \nName: ___ Unit No: __...         C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282   19757198  2191-05-26 00:00:00  \nName: ___ Unit No: __...         C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283   19757198  2193-06-23 00:00:00  \nName: ___ Unit No: __...         C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284   19757198  2194-02-19 00:00:00  \nName: ___ Unit No: __...         C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285   19757198  2194-10-02 00:00:00  \nName: ___ Unit No: __...         C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 4 columns
Now that we have the narrative text translated to CUI codes, we can go ahead with the reformatting and cleaning process.
asthma_cohort_missing_notes = pd.DataFrame({
    'Column': asthma_cohort_notes.columns,
    'Missing_Values': asthma_cohort_notes.isna().sum(),
})
display(asthma_cohort_missing_notes)

# Selecting only columns of interest
asthma_cohort_notes = asthma_cohort_notes[['subject_id', 'charttime', 'note_cui']]
display(asthma_cohort_notes)

# Rename the time column to be consistent with other datasets
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"charttime": "date"})
display(asthma_cohort_notes)

# Cleaning the dates (keep only the YYYY-MM-DD prefix)
asthma_cohort_notes["date"] = asthma_cohort_notes["date"].str[:10]
display(asthma_cohort_notes)
                Column  Missing_Values
subject_id  subject_id               0
charttime    charttime               0
text              text               0
note_cui      note_cui               0

      subject_id            charttime                                           note_cui
0       10001725  2110-04-14 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
1       10001884  2125-10-20 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
2       10001884  2125-10-27 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
3       10001884  2125-12-03 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
4       10001884  2125-12-27 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
...          ...                  ...                                                ...
50281   19757198  2189-09-12 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282   19757198  2191-05-26 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283   19757198  2193-06-23 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284   19757198  2194-02-19 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285   19757198  2194-10-02 00:00:00  C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

(The table is displayed again with charttime renamed to date, and once more after the dates are truncated:)

      subject_id        date                                           note_cui
0       10001725  2110-04-14  C5441729,C5441729,C5441729,C5441729,C5441729,C...
1       10001884  2125-10-20  C5441729,C5441729,C5441729,C5441729,C5441729,C...
2       10001884  2125-10-27  C5441729,C5441729,C5441729,C5441729,C5441729,C...
3       10001884  2125-12-03  C5441729,C5441729,C5441729,C5441729,C5441729,C...
4       10001884  2125-12-27  C5441729,C5441729,C5441729,C5441729,C5441729,C...
...          ...         ...                                                ...
50281   19757198  2189-09-12  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282   19757198  2191-05-26  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283   19757198  2193-06-23  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284   19757198  2194-02-19  C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285   19757198  2194-10-02  C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns
# Convert the note_cui column to a list of CUIs so we can expand them in the next step
asthma_cohort_notes['note_cui_list'] = asthma_cohort_notes['note_cui'].apply(lambda x: x.split(',') if x else None)
display(asthma_cohort_notes)
# Expand the note_cui_list column to have one CUI per row
asthma_cohort_notes = asthma_cohort_notes[['subject_id', 'date', 'note_cui_list']]
asthma_cohort_notes = asthma_cohort_notes.explode('note_cui_list')
display(asthma_cohort_notes)
      subject_id        date  note_cui_list
0       10001725  2110-04-14       C5441729
0       10001725  2110-04-14       C5441729
0       10001725  2110-04-14       C5441729
0       10001725  2110-04-14       C5441729
0       10001725  2110-04-14       C5441729
...          ...         ...            ...
50285   19757198  2194-10-02       C5441729
50285   19757198  2194-10-02       C5441729
50285   19757198  2194-10-02       C5441729
50285   19757198  2194-10-02       C5441729
50285   19757198  2194-10-02       C5441729

42090077 rows × 3 columns
# Drop duplicates
asthma_cohort_notes.drop_duplicates(inplace=True)
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"note_cui_list": "cui"})
display(asthma_cohort_notes)
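After the deduplication above, each (patient, date, CUI) triple appears once, which sets up the patient-level aggregation mentioned earlier. One possible next step, sketched on toy rows (not from the tutorial's pipeline), is counting the number of distinct days each concept appears per patient:

```python
# Sketch: patient-level aggregation of CUI mentions.
# After drop_duplicates, each (subject_id, date, cui) row is unique,
# so counting rows per (subject_id, cui) gives "distinct days with this concept".
import pandas as pd

cui_rows = pd.DataFrame({
    'subject_id': ['A', 'A', 'A', 'B'],
    'date':       ['2120-01-01', '2120-01-01', '2120-02-02', '2121-05-05'],
    'cui':        ['C5441729', 'C5441729', 'C5441729', 'C5441729'],
})

dedup = cui_rows.drop_duplicates()
patient_counts = dedup.groupby(['subject_id', 'cui']).size().reset_index(name='n_days')
print(patient_counts)  # A: 2 days, B: 1 day
```

Day-level counts like these are a common way to dampen the effect of a single long note mentioning a concept many times.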