Step 4: Cohort Creation

EHR-based studies are typically conducted on a group of patients who meet specific inclusion/exclusion criteria.

In such studies, the first task is to identify the patient cohort based on the defined criteria. Here we will identify patients diagnosed with asthma in MIMIC.

Creating an Asthma Cohort

A common strategy for identifying the patient cohort is to use the ICD codes corresponding to asthma. However, ICD codes can be very granular, and different studies on asthma often end up using different sets of ICD codes, leading to inconsistencies. An alternative approach is to use the PheCodes corresponding to asthma and filter patients based on those codes. The PheWAS Catalog provides a mapping between PheCodes and ICD codes.

We see that the PheCode string Asthma corresponds to PheCode 495. Once we have the PheCode, we can identify all ICD codes that fall under it and then select patients having at least one of those ICD codes. Let’s go ahead and implement that.

Extract Cohort Data of Interest

Now, for each dataset of rolled-up data, we can extract the records of interest and aggregate them at the patient level.

import pandas as pd
import os

base_directory = os.path.dirname(os.getcwd())

# Load the raw diagnoses table and normalize the code column name
diagnoses_icd = (
    pd.read_csv(
        os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp", "diagnoses_icd.csv"),
        dtype=str,
    )
    .rename(columns={'icd_code': 'code'})
)
diagnoses_icd['coding_system'] = "ICD" + diagnoses_icd['icd_version']

print("----Raw MIMIC IV 3.1 Diagnoses ICD----")
display(diagnoses_icd)


icd_to_phecode = pd.read_csv(
    os.path.join(base_directory, 'scripts', 'rollup_mappings', "icd_to_phecode.csv"),
    dtype=str
)
print("----ICD to PheCode Rollup file----")
display(icd_to_phecode.head())


print("----Comprehensive Diagnoses including ICD and PheCode----")
# Inner join: diagnoses whose ICD code has no PheCode mapping are dropped here
comprehensive_diagnoses = pd.merge(diagnoses_icd, icd_to_phecode, on=['coding_system', 'code'], how='inner')
display(comprehensive_diagnoses.head())


print("----Identifying the cohort of interest----")
comprehensive_asthma_cohort = comprehensive_diagnoses[comprehensive_diagnoses['PheCode']=='495']
display(comprehensive_asthma_cohort)
print(comprehensive_asthma_cohort.describe())


print("----Final Asthma Cohort----")
asthma_cohort = comprehensive_asthma_cohort[['subject_id']].drop_duplicates()
display(asthma_cohort)
----Raw MIMIC IV 3.1 Diagnoses ICD----
subject_id hadm_id seq_num code icd_version coding_system
0 10000032 22595853 1 5723 9 ICD9
1 10000032 22595853 2 78959 9 ICD9
2 10000032 22595853 3 5715 9 ICD9
3 10000032 22595853 4 07070 9 ICD9
4 10000032 22595853 5 496 9 ICD9
... ... ... ... ... ... ...
6364483 19999987 23865745 7 41401 9 ICD9
6364484 19999987 23865745 8 78039 9 ICD9
6364485 19999987 23865745 9 0413 9 ICD9
6364486 19999987 23865745 10 36846 9 ICD9
6364487 19999987 23865745 11 7810 9 ICD9

6364488 rows × 6 columns

----ICD to PheCode Rollup file----
code PheCode coding_system
0 001 008 ICD9
1 0010 008 ICD9
2 0011 008 ICD9
3 0019 008 ICD9
4 002 008 ICD9
----Comprehensive Diagnoses including ICD and PheCode----
subject_id hadm_id seq_num code icd_version coding_system PheCode
0 10000032 22595853 1 5723 9 ICD9 571.81
1 10000826 20032235 4 5723 9 ICD9 571.81
2 10000826 28289260 1 5723 9 ICD9 571.81
3 10005866 26158160 4 5723 9 ICD9 571.81
4 10008924 23676183 7 5723 9 ICD9 571.81
----Identifying the cohort of interest----
subject_id hadm_id seq_num code icd_version coding_system PheCode
2478863 10001725 25563031 4 49390 9 ICD9 495
2478864 10001884 26679629 7 49390 9 ICD9 495
2478865 10003019 20030125 5 49390 9 ICD9 495
2478866 10003019 20277210 10 49390 9 ICD9 495
2478867 10003019 20962108 15 49390 9 ICD9 495
... ... ... ... ... ... ... ...
5908161 17892612 24109018 1 49382 9 ICD9 495
5908162 17997063 25519468 11 49382 9 ICD9 495
5908163 18269165 28966193 6 49382 9 ICD9 495
5908164 18958101 23643092 8 49382 9 ICD9 495
5908165 19757198 28680884 9 49382 9 ICD9 495

42057 rows × 7 columns

       subject_id   hadm_id seq_num    code icd_version coding_system PheCode
count       42057     42057   42057   42057       42057         42057   42057
unique      20316     42035      39      11           2             2       1
top      18676703  24773199       5  J45909          10         ICD10     495
freq           60         3    4105   20679       21954         21954   42057
----Final Asthma Cohort----
subject_id
2478863 10001725
2478864 10001884
2478865 10003019
2478872 10004457
2478875 10004749
... ...
5908156 16550589
5908158 17562616
5908161 17892612
5908162 17997063
5908165 19757198

20316 rows × 1 columns

# We will extract data of interest, aggregate them and save them here
cohort_aggregateddata_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata')
os.makedirs(cohort_aggregateddata_directory, exist_ok=True)

# We will separate codified aggregated data and nlp aggregated data
cohort_aggregateddata_codified_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'codified')
os.makedirs(cohort_aggregateddata_codified_directory, exist_ok=True)

# Processing data in batches to extract patient data of interest
rolledup_diagnoses_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata', 'Diagnoses')
rolledup_diagnoses_batch_files = os.listdir(rolledup_diagnoses_directory)

# Peek at one batch to confirm the schema
sample_rolled_diagnoses = pd.read_csv(os.path.join(rolledup_diagnoses_directory, rolledup_diagnoses_batch_files[0]), dtype=str)

extracted_diagnoses_dfs = []

for diagnoses_batch_file in rolledup_diagnoses_batch_files:
    diagnoses_batch = pd.read_csv(os.path.join(rolledup_diagnoses_directory, diagnoses_batch_file), dtype=str)
    diagnoses_batch_extracted = pd.merge(diagnoses_batch, asthma_cohort, on=['subject_id'], how='inner')
    display(diagnoses_batch_extracted)
    extracted_diagnoses_dfs.append(diagnoses_batch_extracted)

extracted_diagnoses = pd.concat(extracted_diagnoses_dfs)
display(extracted_diagnoses)
subject_id date PheCode
0 10004457 2140-09-17 411.4
1 10004457 2140-09-17 272.1
2 10004457 2140-09-17 401.1
3 10004457 2140-09-17 495
4 10004457 2140-09-17 185
... ... ... ...
122569 19990563 2180-11-30 783
122570 19990563 2180-11-30 250.2
122571 19990563 2180-11-30 401.1
122572 19990563 2180-11-30 457.3
122573 19990563 2180-11-30 495

122574 rows × 3 columns

subject_id date PheCode
0 10017393 2179-07-20 694.3
1 10017393 2179-07-20 960.2
2 10017393 2179-07-20 528.7
3 10017393 2179-07-20 361
4 10017393 2179-07-20 495
... ... ... ...
118535 19990581 2141-07-23 272.11
118536 19990581 2141-07-23 250.24
118537 19990581 2141-07-23 536.3
118538 19990581 2141-07-23 495
118539 19990581 2141-07-23 274.1

118540 rows × 3 columns

subject_id date PheCode
0 10011912 2176-10-21 800.3
1 10011912 2176-10-21 070.3
2 10011912 2176-10-21 174.11
3 10011912 2176-10-21 495
4 10011912 2176-10-21 317.11
... ... ... ...
115431 19999442 2148-11-19 856
115432 19999442 2148-11-19 348
115433 19999442 2148-11-19 342
115434 19999442 2148-11-19 495
115435 19999442 2148-11-19 296.2

115436 rows × 3 columns

subject_id date PheCode
0 10001884 2130-10-05 496.21
1 10001884 2130-10-05 1013
2 10001884 2130-10-05 276.14
3 10001884 2130-10-05 401.1
4 10001884 2130-10-05 272.11
... ... ... ...
115775 19997760 2187-07-09 286.2
115776 19997760 2187-07-09 594
115777 19997760 2187-07-09 972.6
115778 19997760 2187-07-09 972.1
115779 19997760 2187-07-09 974

115780 rows × 3 columns

subject_id date PheCode
0 10003019 2174-12-25 743.21
1 10003019 2174-12-25 334
2 10003019 2174-12-25 697
3 10003019 2174-12-25 510
4 10003019 2174-12-25 495
... ... ... ...
124334 19996016 2159-12-10 415
124335 19996016 2159-12-10 256.4
124336 19996016 2159-12-10 1010.6
124337 19996016 2159-12-10 652
124338 19996016 2159-12-10 663

124339 rows × 3 columns

subject_id date PheCode
0 10002800 2164-07-12 649
1 10002800 2164-07-12 521.1
2 10002800 2164-07-12 1010.6
3 10002800 2164-07-12 495
4 10002800 2164-07-12 646
... ... ... ...
117695 19996832 2179-02-21 296.22
117696 19996832 2179-02-21 297.2
117697 19996832 2179-02-21 495
117698 19996832 2179-02-21 318
117699 19996832 2179-02-21 301.2

117700 rows × 3 columns

subject_id date PheCode
0 10011607 2184-04-26 495.2
1 10011607 2184-04-26 509.1
2 10011607 2184-04-26 411.3
3 10011607 2184-04-26 401.1
4 10011607 2184-04-26 290.1
... ... ... ...
115303 19998350 2128-02-21 327.32
115304 19998350 2128-02-21 495
115305 19998350 2128-02-21 278.1
115306 19998350 2128-02-21 300.1
115307 19998350 2128-02-21 208

115308 rows × 3 columns

subject_id date PheCode
0 10001725 2110-04-11 599.2
1 10001725 2110-04-11 946
2 10001725 2110-04-11 618.5
3 10001725 2110-04-11 495
4 10001725 2110-04-11 530.11
... ... ... ...
113325 19997887 2117-04-07 318
113326 19997887 2117-04-07 288.2
113327 19997887 2117-04-07 338.1
113328 19997887 2117-04-07 789
113329 19997887 2117-04-07 458.2

113330 rows × 3 columns

subject_id date PheCode
0 10004457 2140-09-17 411.4
1 10004457 2140-09-17 272.1
2 10004457 2140-09-17 401.1
3 10004457 2140-09-17 495
4 10004457 2140-09-17 185
... ... ... ...
113325 19997887 2117-04-07 318
113326 19997887 2117-04-07 288.2
113327 19997887 2117-04-07 338.1
113328 19997887 2117-04-07 789
113329 19997887 2117-04-07 458.2

943007 rows × 3 columns

Aggregate Data at Patient Level

phecode_counts_per_patient = extracted_diagnoses.groupby(['subject_id', 'PheCode']).size().reset_index(name='counts')
display(phecode_counts_per_patient)

phecode_counts_per_patient_matrixformat = phecode_counts_per_patient.pivot_table(index='subject_id', columns='PheCode', values='counts', fill_value=0)
display(phecode_counts_per_patient_matrixformat)

# Keep the subject_id index so the saved matrix retains patient identifiers
phecode_counts_per_patient_matrixformat.to_csv(os.path.join(cohort_aggregateddata_codified_directory, "Diagnoses.csv"), index=True)
subject_id PheCode counts
0 10001725 180.1 1
1 10001725 296.2 1
2 10001725 300.1 1
3 10001725 313.1 1
4 10001725 318 1
... ... ... ...
480814 19999442 433.21 1
480815 19999442 495 2
480816 19999442 591 1
480817 19999442 594 1
480818 19999442 856 1

480819 rows × 3 columns

PheCode 008 008.5 008.51 008.52 008.6 008.7 010 031 038 038.1 ... 983 985 986 987 988 989 990 994.1 994.2 994.21
subject_id
10001725 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10001884 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10002800 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10003019 0 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 1
10004296 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19997760 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19997887 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19998350 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19999112 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19999442 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20316 rows × 1725 columns

Important Note: The diagnoses data in the MIMIC-IV Hosp module does not include a specific time element. For this analysis, we have inferred the timing from the admission date. Be cautious when relying on this assumption in downstream analyses, as it may not reflect the actual timing of each diagnosis.
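The join that produces such an inferred `date` column might look like the sketch below. This is an illustration, not the tutorial's own code; it assumes the standard MIMIC-IV Hosp `admissions.csv` layout with `hadm_id` and `admittime` columns.

```python
import pandas as pd

def attach_admission_date(diagnoses: pd.DataFrame, admissions: pd.DataFrame) -> pd.DataFrame:
    """Infer a diagnosis date from the admission date of the associated stay.

    Assumes `admissions` carries `hadm_id` and `admittime` columns, as in the
    MIMIC-IV Hosp module. Diagnoses with no matching admission get a NaN date.
    """
    dated = diagnoses.merge(admissions[["hadm_id", "admittime"]], on="hadm_id", how="left")
    # Keep only the calendar-date portion (admittime is 'YYYY-MM-DD HH:MM:SS')
    dated["date"] = dated["admittime"].str[:10]
    return dated.drop(columns=["admittime"])
```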

Please proceed to generate the patient-level data for the other data types: medications, procedures, and labs.
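To avoid repeating the batch loop for each data type, the extract-then-count steps above can be wrapped in a reusable helper. A minimal sketch, assuming each rolled-up batch file carries `subject_id` plus a single code column; the `code_column` name (e.g. `'PheCode'` for diagnoses) is a placeholder you would adjust per data type.

```python
import os
import pandas as pd

def aggregate_cohort_counts(batch_directory: str, cohort: pd.DataFrame, code_column: str) -> pd.DataFrame:
    """Filter rolled-up batch files to the cohort and count codes per patient.

    `cohort` is a one-column DataFrame of subject_id values; `code_column`
    names the rolled-up code column in the batch files.
    """
    frames = []
    for batch_file in sorted(os.listdir(batch_directory)):
        batch = pd.read_csv(os.path.join(batch_directory, batch_file), dtype=str)
        # Inner join keeps only records for cohort patients
        frames.append(batch.merge(cohort, on="subject_id", how="inner"))
    extracted = pd.concat(frames, ignore_index=True)
    counts = extracted.groupby(["subject_id", code_column]).size().reset_index(name="counts")
    # Wide patient-by-code matrix, zero-filled for absent codes
    return counts.pivot_table(index="subject_id", columns=code_column, values="counts", fill_value=0)
```

For medications, procedures, or labs you would point this at the corresponding rolled-up directory and code column, then write the result under the codified directory as in the Diagnoses.csv step.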

MIMIC Notes and Hosp module version difference

In this section, we focus on processing discharge summaries from the MIMIC-IV deidentified notes. Note that the most recent release of the MIMIC-IV note data is version 2.2, whereas the MIMIC-IV Hosp data we are using is the newer version 3.1. This mismatch means there will be patients present in the Hosp data who are missing from the note data.

This situation is typical in real-world healthcare systems, where data from different sources or timeframes do not fully align. One way to address this is to ensure that the patient cohort you are analyzing has both note and Hosp data available. Additionally, since the Hosp data is newer, you may need to truncate each patient’s records in the Hosp data to align with the last observed note date for that patient.
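The truncation idea can be sketched as follows. This is a hypothetical helper, not part of the tutorial pipeline; both frames are assumed to carry `subject_id` and an ISO `'YYYY-MM-DD'` `date` column, so plain string comparison orders dates correctly.

```python
import pandas as pd

def truncate_to_last_note(hosp_records: pd.DataFrame, notes: pd.DataFrame) -> pd.DataFrame:
    """Drop Hosp records dated after each patient's last observed note.

    Patients with no notes at all are dropped entirely (inner join), matching
    the advice to analyze only patients with both note and Hosp data.
    """
    last_note = (
        notes.groupby("subject_id")["date"].max()
        .rename("last_note_date")
        .reset_index()
    )
    merged = hosp_records.merge(last_note, on="subject_id", how="inner")
    # ISO date strings compare correctly as strings
    truncated = merged[merged["date"] <= merged["last_note_date"]]
    return truncated.drop(columns=["last_note_date"])
```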

# Installing Python Toolkit for processing EHR data. For NLP, we recommend you use NILE linked below
# https://celehs.hms.harvard.edu/software/NILE.html

!pip install petehr 

import os
import pandas as pd
from petehr import Text2Code

base_directory = os.path.dirname(os.getcwd())
cohort_aggregateddata_nlp_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'nlp')
os.makedirs(cohort_aggregateddata_nlp_directory, exist_ok=True)
Requirement already satisfied: petehr in /n/data1/hsph/biostat/celehs/lab/va67/anaconda/anaconda3/lib/python3.11/site-packages (0.0.1)
note_directory = os.path.join(base_directory, 'raw_data', 'nlp', 'physionet.org', 'files', 'mimic-iv-note', '2.2', 'note')
os.listdir(note_directory)
['discharge.csv',
 'discharge_detail.csv',
 'index.html',
 'radiology.csv',
 'radiology_detail.csv']
# The Asthma CUI dictionary was downloaded from here. https://shiny.parse-health.org/ONCE/
# Please see the bottom of this notebook for further resources 

asthma_dictionary_file = os.path.join(base_directory, 'scripts', 'meta_files','Asthma_NLP_Dict.csv')
asthma_dictionary = pd.read_csv(asthma_dictionary_file, dtype=str)
display(asthma_dictionary)

print(asthma_dictionary)
print(asthma_dictionary.describe())
STR CUI
0 pulmonary hypertension nos C0020542
1 hypertensive pulmonary vascular disease C0020542
2 hypertension pulmonary C0020542
3 pulmonary hypertension disorder C0020542
4 pulmonary hypertensions C0020542
... ... ...
2975 drug screen qualitative digoxin C0337449
2976 electrocardiogram myocardial infarction C0428953
2977 roche brand of bumetanide C0701009
2978 glaxosmithkline brand of carvedilol C0719509
2979 pfizer brand of eplerenone C1144054

2980 rows × 2 columns

                                          STR       CUI
0                  pulmonary hypertension nos  C0020542
1     hypertensive pulmonary vascular disease  C0020542
2                      hypertension pulmonary  C0020542
3             pulmonary hypertension disorder  C0020542
4                     pulmonary hypertensions  C0020542
...                                       ...       ...
2975          drug screen qualitative digoxin  C0337449
2976  electrocardiogram myocardial infarction  C0428953
2977                roche brand of bumetanide  C0701009
2978      glaxosmithkline brand of carvedilol  C0719509
2979               pfizer brand of eplerenone  C1144054

[2980 rows x 2 columns]
                            STR       CUI
count                      2976      2980
unique                     2833       276
top     ventricular tachycardia  C0038454
freq                          4        52

In the summary we can see that we have identified 276 unique CUIs (Concept Unique Identifiers) related to asthma using the ONCE app. When we include the synonymous terms for these 276 CUIs, we end up with around 3,000 CUI-term mappings.

text2cui = Text2Code(asthma_dictionary_file)
Dictionary loaded sucessfully from /n/scratch/users/v/va67/EHR_TUTORIAL_WORKSPACE/scripts/meta_files/Asthma_NLP_Dict.csv
# Loading discharge notes data

discharge = pd.read_csv(os.path.join(note_directory,"discharge.csv"), dtype=str)
display(discharge)
note_id subject_id hadm_id note_type note_seq charttime storetime text
0 10000032-DS-21 10000032 22595853 DS 21 2180-05-07 00:00:00 2180-05-09 15:26:00 \nName: ___ Unit No: _...
1 10000032-DS-22 10000032 22841357 DS 22 2180-06-27 00:00:00 2180-07-01 10:15:00 \nName: ___ Unit No: _...
2 10000032-DS-23 10000032 29079034 DS 23 2180-07-25 00:00:00 2180-07-25 21:42:00 \nName: ___ Unit No: _...
3 10000032-DS-24 10000032 25742920 DS 24 2180-08-07 00:00:00 2180-08-10 05:43:00 \nName: ___ Unit No: _...
4 10000084-DS-17 10000084 23052089 DS 17 2160-11-25 00:00:00 2160-11-25 15:09:00 \nName: ___ Unit No: __...
... ... ... ... ... ... ... ... ...
331788 19999828-DS-6 19999828 29734428 DS 6 2147-08-04 00:00:00 2147-08-12 15:36:00 \nName: ___ Unit No: ___...
331789 19999828-DS-7 19999828 25744818 DS 7 2149-01-18 00:00:00 2149-01-19 07:03:00 \nName: ___ Unit No: ___...
331790 19999840-DS-20 19999840 26071774 DS 20 2164-07-28 00:00:00 2164-07-29 14:52:00 \nName: ___ Unit No: ___\...
331791 19999840-DS-21 19999840 21033226 DS 21 2164-09-17 00:00:00 2164-09-18 01:36:00 \nName: ___ Unit No: ___\...
331792 19999987-DS-2 19999987 23865745 DS 2 2145-11-11 00:00:00 2145-11-11 13:13:00 \nName: ___ Unit No: __...

331793 rows × 8 columns

# Check how many patients from the Asthma cohort can be identified in the notes cohort.

print(len(asthma_cohort))

discharge_asthma_cohort = discharge[discharge['subject_id'].isin(asthma_cohort['subject_id'])]

display(discharge_asthma_cohort.describe())
20316
note_id subject_id hadm_id note_type note_seq charttime storetime text
count 50286 50286 50286 50286 50286 50286 50284 50286
unique 50286 14958 50286 1 135 25983 50232 50285
top 10001725-DS-12 12468016 25563031 DS 21 2160-06-09 00:00:00 2116-12-03 14:09:00 \nName: ___ Unit No: ___...
freq 1 85 1 50286 2391 9 2 2

Out of the 20,316 patients in the asthma cohort, only 14,958 (approximately 74%) have discharge notes.
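The coverage figure quoted above comes from a simple membership check; a minimal helper for reuse (the name and signature are ours, not part of the tutorial code):

```python
import pandas as pd

def note_coverage(cohort: pd.DataFrame, notes: pd.DataFrame) -> float:
    """Fraction of cohort subject_ids that appear at least once in `notes`.

    Assumes `cohort` has one row per unique subject_id.
    """
    return cohort["subject_id"].isin(notes["subject_id"]).mean()
```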

# Selecting only the required columns

discharge = discharge[['subject_id','charttime','text']]
display(discharge.head())
subject_id charttime text
0 10000032 2180-05-07 00:00:00 \nName: ___ Unit No: _...
1 10000032 2180-06-27 00:00:00 \nName: ___ Unit No: _...
2 10000032 2180-07-25 00:00:00 \nName: ___ Unit No: _...
3 10000032 2180-08-07 00:00:00 \nName: ___ Unit No: _...
4 10000084 2160-11-25 00:00:00 \nName: ___ Unit No: __...
# Filtering to retain only the notes associated with the patients of interest

asthma_cohort_notes = pd.merge(asthma_cohort, discharge, on=['subject_id'], how='inner')
display(asthma_cohort_notes)

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

# Converting Text to CUI (CONCEPT UNIQUE IDENTIFIER)

asthma_cohort_notes['note_cui'] = asthma_cohort_notes['text'].map(lambda x: text2cui.convert(x))
asthma_cohort_notes
subject_id charttime text note_cui
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 4 columns

Now that we have the narrative text translated to CUI codes, we can go ahead with the reformatting and cleaning process.

asthma_cohort_missing_notes = pd.DataFrame({'Column': asthma_cohort_notes.columns,'Missing_Values': asthma_cohort_notes.isna().sum()})
display(asthma_cohort_missing_notes)

# Selecting only columns of interest

asthma_cohort_notes=asthma_cohort_notes[['subject_id','charttime','note_cui']]
display(asthma_cohort_notes)

# Rename the time column to be consistent with the other datasets
asthma_cohort_notes = asthma_cohort_notes.rename(columns = {"charttime":"date"})
display(asthma_cohort_notes)

# Cleaning the dates
asthma_cohort_notes["date"] = asthma_cohort_notes["date"].str[:10]
display(asthma_cohort_notes)
Column Missing_Values
subject_id subject_id 0
charttime charttime 0
text text 0
note_cui note_cui 0
subject_id charttime note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

# Convert the note_cui column to cui list so we can expand them in next step

asthma_cohort_notes['note_cui_list'] = asthma_cohort_notes['note_cui'].apply(lambda x: x.split(',') if x else None)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

# Expand the note_cui_list column to have one CUI per row

asthma_cohort_notes = asthma_cohort_notes[['subject_id','date','note_cui_list']]
asthma_cohort_notes = asthma_cohort_notes.explode('note_cui_list')
display(asthma_cohort_notes)
subject_id date note_cui_list
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
... ... ... ...
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729

42090077 rows × 3 columns

# Drop Duplicates

asthma_cohort_notes.drop_duplicates(inplace=True)
asthma_cohort_notes=asthma_cohort_notes.rename(columns={"note_cui_list":"cui"})
display(asthma_cohort_notes)
subject_id date cui
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C0016860
0 10001725 2110-04-14 C0205082
0 10001725 2110-04-14 C0344315
0 10001725 2110-04-14 C0013604
... ... ... ...
50285 19757198 2194-10-02 C5201148
50285 19757198 2194-10-02 C4084203
50285 19757198 2194-10-02 C0033095
50285 19757198 2194-10-02 C0010957
50285 19757198 2194-10-02 C0011847

729164 rows × 3 columns

cui_counts_per_patient = asthma_cohort_notes.groupby(['subject_id', 'cui']).size().reset_index(name='counts')
display(cui_counts_per_patient)
subject_id cui counts
0 10001725 C0001645 1
1 10001725 C0004238 1
2 10001725 C0013404 1
3 10001725 C0013604 1
4 10001725 C0016860 1
... ... ... ...
297382 19999442 C4554100 1
297383 19999442 C4554645 1
297384 19999442 C5201148 1
297385 19999442 C5203119 1
297386 19999442 C5441729 2

297387 rows × 3 columns

cui_counts_per_patient_matrixformat = cui_counts_per_patient.pivot_table(index='subject_id', columns='cui', values='counts', fill_value=0)
display(cui_counts_per_patient_matrixformat)


# Keep the subject_id index so the saved matrix retains patient identifiers
cui_counts_per_patient_matrixformat.to_csv(os.path.join(cohort_aggregateddata_nlp_directory, "CUI_counts.csv"), index=True)
cui C0001645 C0002598 C0002962 C0003015 C0003811 C0004093 C0004238 C0006376 C0007166 C0007193 ... C4554158 C4554645 C4699158 C4723804 C4759845 C4759928 C5201148 C5203119 C5424789 C5441729
subject_id
10001725 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
10001884 1 4 0 0 1 0 7 0 0 0 ... 0 3 1 0 0 0 3 8 0 15
10002800 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 3 2 0 3
10003019 4 0 0 0 2 0 1 0 0 0 ... 0 3 0 0 0 0 7 6 0 11
10004296 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19992803 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
19997072 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 2
19997887 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 1 2 0 2
19998350 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 2 1 2 0 2
19999442 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 1 0 2

14958 rows × 229 columns
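Following the earlier advice to analyze patients with both note and Hosp data, the codified (PheCode) and NLP (CUI) matrices can be restricted to their common patients. A sketch, assuming both matrices are indexed by subject_id (e.g. reloaded with `index_col='subject_id'`):

```python
import pandas as pd

def align_on_common_patients(codified: pd.DataFrame, nlp: pd.DataFrame):
    """Restrict both patient-level matrices to subject_ids present in each.

    Returns the two matrices filtered to their shared index, in the same order.
    """
    common = codified.index.intersection(nlp.index)
    return codified.loc[common], nlp.loc[common]
```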