Step 4: Cohort Creation

EHR-based studies are typically conducted on a group of patients who meet specific inclusion/exclusion criteria.

In such studies, the first task is to identify the patient cohort based on the defined criteria. Here we will identify patients diagnosed with asthma in MIMIC.

Creating an Asthma Cohort

A common strategy for identifying the patient cohort is to use the ICD codes corresponding to asthma. However, ICD codes can be very granular, and different studies on asthma often end up using different sets of ICD codes, leading to inconsistencies. An alternative approach is to use the PheCodes corresponding to asthma and filter patients based on those codes. The PheWAS Catalog provides a mapping between PheCodes and ICD codes.

We see that the PheCode string Asthma corresponds to PheCode 495. Once we have the PheCode, we can identify all ICD codes that fall under it and then select patients having at least one of those ICD codes. Let’s go ahead and implement that.

Extract Cohort Data of Interest

Now, for each dataset of rolled-up data, we can extract the records of interest and aggregate them at the patient level.

import pandas as pd
import os

base_directory = os.path.dirname(os.getcwd())

# Load the raw diagnoses table and normalize the code column name
diagnoses_icd = (
    pd.read_csv(
        os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp", "diagnoses_icd.csv"),
        dtype=str,
    )
    .rename(columns={'icd_code': 'code'})
)
diagnoses_icd['coding_system'] = "ICD" + diagnoses_icd['icd_version']

print("----Raw MIMIC IV 3.1 Diagnoses ICD----")
display(diagnoses_icd)


icd_to_phecode = pd.read_csv(
    os.path.join(base_directory, 'scripts', 'rollup_mappings', "icd_to_phecode.csv"),
    dtype=str
)
print("----ICD to PheCode Rollup file----")
display(icd_to_phecode.head())


print("----Comprehensive Diagnoses including ICD and PheCode----")
# Inner join: diagnoses whose ICD code has no PheCode mapping are dropped here
comprehensive_diagnoses = pd.merge(diagnoses_icd, icd_to_phecode, on=['coding_system', 'code'], how='inner')
display(comprehensive_diagnoses.head())


print("----Identifying the cohort of interest----")
comprehensive_asthma_cohort = comprehensive_diagnoses[comprehensive_diagnoses['PheCode']=='495']
display(comprehensive_asthma_cohort)
print(comprehensive_asthma_cohort.describe())


print("----Final Asthma Cohort----")
asthma_cohort = comprehensive_asthma_cohort[['subject_id']].drop_duplicates()
display(asthma_cohort)
----Raw MIMIC IV 3.1 Diagnoses ICD----
subject_id hadm_id seq_num code icd_version coding_system
0 10000032 22595853 1 5723 9 ICD9
1 10000032 22595853 2 78959 9 ICD9
2 10000032 22595853 3 5715 9 ICD9
3 10000032 22595853 4 07070 9 ICD9
4 10000032 22595853 5 496 9 ICD9
... ... ... ... ... ... ...
6364483 19999987 23865745 7 41401 9 ICD9
6364484 19999987 23865745 8 78039 9 ICD9
6364485 19999987 23865745 9 0413 9 ICD9
6364486 19999987 23865745 10 36846 9 ICD9
6364487 19999987 23865745 11 7810 9 ICD9

6364488 rows × 6 columns

----ICD to PheCode Rollup file----
code PheCode coding_system
0 001 008 ICD9
1 0010 008 ICD9
2 0011 008 ICD9
3 0019 008 ICD9
4 002 008 ICD9
----Comprehensive Diagnoses including ICD and PheCode----
subject_id hadm_id seq_num code icd_version coding_system PheCode
0 10000032 22595853 1 5723 9 ICD9 571.81
1 10000826 20032235 4 5723 9 ICD9 571.81
2 10000826 28289260 1 5723 9 ICD9 571.81
3 10005866 26158160 4 5723 9 ICD9 571.81
4 10008924 23676183 7 5723 9 ICD9 571.81
----Identifying the cohort of interest----
subject_id hadm_id seq_num code icd_version coding_system PheCode
2478863 10001725 25563031 4 49390 9 ICD9 495
2478864 10001884 26679629 7 49390 9 ICD9 495
2478865 10003019 20030125 5 49390 9 ICD9 495
2478866 10003019 20277210 10 49390 9 ICD9 495
2478867 10003019 20962108 15 49390 9 ICD9 495
... ... ... ... ... ... ... ...
5908161 17892612 24109018 1 49382 9 ICD9 495
5908162 17997063 25519468 11 49382 9 ICD9 495
5908163 18269165 28966193 6 49382 9 ICD9 495
5908164 18958101 23643092 8 49382 9 ICD9 495
5908165 19757198 28680884 9 49382 9 ICD9 495

42057 rows × 7 columns

       subject_id   hadm_id seq_num    code icd_version coding_system PheCode
count       42057     42057   42057   42057       42057         42057   42057
unique      20316     42035      39      11           2             2       1
top      18676703  24773199       5  J45909          10         ICD10     495
freq           60         3    4105   20679       21954         21954   42057
----Final Asthma Cohort----
subject_id
2478863 10001725
2478864 10001884
2478865 10003019
2478872 10004457
2478875 10004749
... ...
5908156 16550589
5908158 17562616
5908161 17892612
5908162 17997063
5908165 19757198

20316 rows × 1 columns

# We will extract data of interest, aggregate them and save them here
cohort_aggregateddata_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata')
os.makedirs(cohort_aggregateddata_directory, exist_ok=True)

# We will separate codified aggregated data and nlp aggregated data
cohort_aggregateddata_codified_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'codified')
os.makedirs(cohort_aggregateddata_codified_directory, exist_ok=True)

# Processing data in batches to extract patient data of interest
rolledup_diagnoses_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata', 'Diagnoses')
rolledup_diagnoses_batch_files = os.listdir(rolledup_diagnoses_directory)

# Peek at one batch to confirm the schema
sample_rolled_diagnoses = pd.read_csv(os.path.join(rolledup_diagnoses_directory, rolledup_diagnoses_batch_files[0]), dtype=str)

extracted_diagnoses_dfs = []

for diagnoses_batch_file in rolledup_diagnoses_batch_files:
    diagnoses_batch = pd.read_csv(os.path.join(rolledup_diagnoses_directory, diagnoses_batch_file), dtype=str)
    diagnoses_batch_extracted = pd.merge(diagnoses_batch, asthma_cohort, on=['subject_id'], how='inner')
    display(diagnoses_batch_extracted)
    extracted_diagnoses_dfs.append(diagnoses_batch_extracted)

extracted_diagnoses = pd.concat(extracted_diagnoses_dfs)
display(extracted_diagnoses)
subject_id date PheCode
0 10004457 2140-09-17 411.4
1 10004457 2140-09-17 272.1
2 10004457 2140-09-17 401.1
3 10004457 2140-09-17 495
4 10004457 2140-09-17 185
... ... ... ...
122569 19990563 2180-11-30 783
122570 19990563 2180-11-30 250.2
122571 19990563 2180-11-30 401.1
122572 19990563 2180-11-30 457.3
122573 19990563 2180-11-30 495

122574 rows × 3 columns

subject_id date PheCode
0 10017393 2179-07-20 694.3
1 10017393 2179-07-20 960.2
2 10017393 2179-07-20 528.7
3 10017393 2179-07-20 361
4 10017393 2179-07-20 495
... ... ... ...
118535 19990581 2141-07-23 272.11
118536 19990581 2141-07-23 250.24
118537 19990581 2141-07-23 536.3
118538 19990581 2141-07-23 495
118539 19990581 2141-07-23 274.1

118540 rows × 3 columns

subject_id date PheCode
0 10011912 2176-10-21 800.3
1 10011912 2176-10-21 070.3
2 10011912 2176-10-21 174.11
3 10011912 2176-10-21 495
4 10011912 2176-10-21 317.11
... ... ... ...
115431 19999442 2148-11-19 856
115432 19999442 2148-11-19 348
115433 19999442 2148-11-19 342
115434 19999442 2148-11-19 495
115435 19999442 2148-11-19 296.2

115436 rows × 3 columns

subject_id date PheCode
0 10001884 2130-10-05 496.21
1 10001884 2130-10-05 1013
2 10001884 2130-10-05 276.14
3 10001884 2130-10-05 401.1
4 10001884 2130-10-05 272.11
... ... ... ...
115775 19997760 2187-07-09 286.2
115776 19997760 2187-07-09 594
115777 19997760 2187-07-09 972.6
115778 19997760 2187-07-09 972.1
115779 19997760 2187-07-09 974

115780 rows × 3 columns

subject_id date PheCode
0 10003019 2174-12-25 743.21
1 10003019 2174-12-25 334
2 10003019 2174-12-25 697
3 10003019 2174-12-25 510
4 10003019 2174-12-25 495
... ... ... ...
124334 19996016 2159-12-10 415
124335 19996016 2159-12-10 256.4
124336 19996016 2159-12-10 1010.6
124337 19996016 2159-12-10 652
124338 19996016 2159-12-10 663

124339 rows × 3 columns

subject_id date PheCode
0 10002800 2164-07-12 649
1 10002800 2164-07-12 521.1
2 10002800 2164-07-12 1010.6
3 10002800 2164-07-12 495
4 10002800 2164-07-12 646
... ... ... ...
117695 19996832 2179-02-21 296.22
117696 19996832 2179-02-21 297.2
117697 19996832 2179-02-21 495
117698 19996832 2179-02-21 318
117699 19996832 2179-02-21 301.2

117700 rows × 3 columns

subject_id date PheCode
0 10011607 2184-04-26 495.2
1 10011607 2184-04-26 509.1
2 10011607 2184-04-26 411.3
3 10011607 2184-04-26 401.1
4 10011607 2184-04-26 290.1
... ... ... ...
115303 19998350 2128-02-21 327.32
115304 19998350 2128-02-21 495
115305 19998350 2128-02-21 278.1
115306 19998350 2128-02-21 300.1
115307 19998350 2128-02-21 208

115308 rows × 3 columns

subject_id date PheCode
0 10001725 2110-04-11 599.2
1 10001725 2110-04-11 946
2 10001725 2110-04-11 618.5
3 10001725 2110-04-11 495
4 10001725 2110-04-11 530.11
... ... ... ...
113325 19997887 2117-04-07 318
113326 19997887 2117-04-07 288.2
113327 19997887 2117-04-07 338.1
113328 19997887 2117-04-07 789
113329 19997887 2117-04-07 458.2

113330 rows × 3 columns

subject_id date PheCode
0 10004457 2140-09-17 411.4
1 10004457 2140-09-17 272.1
2 10004457 2140-09-17 401.1
3 10004457 2140-09-17 495
4 10004457 2140-09-17 185
... ... ... ...
113325 19997887 2117-04-07 318
113326 19997887 2117-04-07 288.2
113327 19997887 2117-04-07 338.1
113328 19997887 2117-04-07 789
113329 19997887 2117-04-07 458.2

943007 rows × 3 columns

Aggregate Data at Patient Level

phecode_counts_per_patient = extracted_diagnoses.groupby(['subject_id', 'PheCode']).size().reset_index(name='counts')
display(phecode_counts_per_patient)

phecode_counts_per_patient_matrixformat = phecode_counts_per_patient.pivot_table(index='subject_id', columns='PheCode', values='counts', fill_value=0)
display(phecode_counts_per_patient_matrixformat)

# Keep the subject_id index so the saved matrix retains patient identifiers
phecode_counts_per_patient_matrixformat.to_csv(os.path.join(cohort_aggregateddata_codified_directory, "Diagnoses.csv"), index=True)
subject_id PheCode counts
0 10001725 180.1 1
1 10001725 296.2 1
2 10001725 300.1 1
3 10001725 313.1 1
4 10001725 318 1
... ... ... ...
480814 19999442 433.21 1
480815 19999442 495 2
480816 19999442 591 1
480817 19999442 594 1
480818 19999442 856 1

480819 rows × 3 columns

PheCode 008 008.5 008.51 008.52 008.6 008.7 010 031 038 038.1 ... 983 985 986 987 988 989 990 994.1 994.2 994.21
subject_id
10001725 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10001884 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10002800 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10003019 0 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 1
10004296 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19997760 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19997887 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19998350 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19999112 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19999442 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

20316 rows × 1725 columns

Important Note: The diagnoses data in the MIMIC-IV Hosp module does not include a specific time element. For this analysis, we have inferred the timing from the admission date. Be cautious when relying on this assumption in downstream analyses, as it may not reflect the actual timing of each diagnosis.
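The join that produces such an inferred `date` column might look like the sketch below. This is an illustration, not the tutorial's own code; it assumes the standard MIMIC-IV Hosp `admissions.csv` layout with `hadm_id` and `admittime` columns.

```python
import pandas as pd

def attach_admission_date(diagnoses: pd.DataFrame, admissions: pd.DataFrame) -> pd.DataFrame:
    """Infer a diagnosis date from the admission date of the associated stay.

    Assumes `admissions` carries `hadm_id` and `admittime` columns, as in the
    MIMIC-IV Hosp module. Diagnoses with no matching admission get a NaN date.
    """
    dated = diagnoses.merge(admissions[["hadm_id", "admittime"]], on="hadm_id", how="left")
    # Keep only the calendar-date portion (admittime is 'YYYY-MM-DD HH:MM:SS')
    dated["date"] = dated["admittime"].str[:10]
    return dated.drop(columns=["admittime"])
```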

Please proceed to generate the patient-level data for the other data types: medications, procedures, and labs.
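To avoid repeating the batch loop for each data type, the extract-then-count steps above can be wrapped in a reusable helper. A minimal sketch, assuming each rolled-up batch file carries `subject_id` plus a single code column; the `code_column` name (e.g. `'PheCode'` for diagnoses) is a placeholder you would adjust per data type.

```python
import os
import pandas as pd

def aggregate_cohort_counts(batch_directory: str, cohort: pd.DataFrame, code_column: str) -> pd.DataFrame:
    """Filter rolled-up batch files to the cohort and count codes per patient.

    `cohort` is a one-column DataFrame of subject_id values; `code_column`
    names the rolled-up code column in the batch files.
    """
    frames = []
    for batch_file in sorted(os.listdir(batch_directory)):
        batch = pd.read_csv(os.path.join(batch_directory, batch_file), dtype=str)
        # Inner join keeps only records for cohort patients
        frames.append(batch.merge(cohort, on="subject_id", how="inner"))
    extracted = pd.concat(frames, ignore_index=True)
    counts = extracted.groupby(["subject_id", code_column]).size().reset_index(name="counts")
    # Wide patient-by-code matrix, zero-filled for absent codes
    return counts.pivot_table(index="subject_id", columns=code_column, values="counts", fill_value=0)
```

For medications, procedures, or labs you would point this at the corresponding rolled-up directory and code column, then write the result under the codified directory as in the Diagnoses.csv step.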

MIMIC Notes and Hosp module version difference

In this section, we focus on processing discharge summaries from the MIMIC-IV deidentified notes. Note that the most recent release of the MIMIC-IV note data is version 2.2, whereas the MIMIC-IV Hosp data we are using is the newer version 3.1. This mismatch means there will be patients present in the Hosp data who are missing from the note data.

This situation is typical in real-world healthcare systems, where data from different sources or timeframes do not fully align. One way to address this is to ensure that the patient cohort you are analyzing has both note and Hosp data available. Additionally, since the Hosp data is newer, you may need to truncate each patient’s records in the Hosp data to align with the last observed note date for that patient.
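The truncation idea can be sketched as follows. This is a hypothetical helper, not part of the tutorial pipeline; both frames are assumed to carry `subject_id` and an ISO `'YYYY-MM-DD'` `date` column, so plain string comparison orders dates correctly.

```python
import pandas as pd

def truncate_to_last_note(hosp_records: pd.DataFrame, notes: pd.DataFrame) -> pd.DataFrame:
    """Drop Hosp records dated after each patient's last observed note.

    Patients with no notes at all are dropped entirely (inner join), matching
    the advice to analyze only patients with both note and Hosp data.
    """
    last_note = (
        notes.groupby("subject_id")["date"].max()
        .rename("last_note_date")
        .reset_index()
    )
    merged = hosp_records.merge(last_note, on="subject_id", how="inner")
    # ISO date strings compare correctly as strings
    truncated = merged[merged["date"] <= merged["last_note_date"]]
    return truncated.drop(columns=["last_note_date"])
```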

# Installing Python Toolkit for processing EHR data. For NLP, we recommend you use NILE linked below
# https://celehs.hms.harvard.edu/software/NILE.html

!pip install petehr 

import os
import pandas as pd
from petehr import Text2Code

base_directory = os.path.dirname(os.getcwd())
cohort_aggregateddata_nlp_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'nlp')
os.makedirs(cohort_aggregateddata_nlp_directory, exist_ok=True)
Requirement already satisfied: petehr in /n/data1/hsph/biostat/celehs/lab/va67/anaconda/anaconda3/lib/python3.11/site-packages (0.0.1)
note_directory = os.path.join(base_directory, 'raw_data', 'nlp', 'physionet.org', 'files', 'mimic-iv-note', '2.2', 'note')
os.listdir(note_directory)
['discharge.csv',
 'discharge_detail.csv',
 'index.html',
 'radiology.csv',
 'radiology_detail.csv']
# The Asthma CUI dictionary was downloaded from here. https://shiny.parse-health.org/ONCE/
# Please see the bottom of this notebook for further resources 

asthma_dictionary_file = os.path.join(base_directory, 'scripts', 'meta_files','Asthma_NLP_Dict.csv')
asthma_dictionary = pd.read_csv(asthma_dictionary_file, dtype=str)
display(asthma_dictionary)

print(asthma_dictionary)
print(asthma_dictionary.describe())
STR CUI
0 pulmonary hypertension nos C0020542
1 hypertensive pulmonary vascular disease C0020542
2 hypertension pulmonary C0020542
3 pulmonary hypertension disorder C0020542
4 pulmonary hypertensions C0020542
... ... ...
2975 drug screen qualitative digoxin C0337449
2976 electrocardiogram myocardial infarction C0428953
2977 roche brand of bumetanide C0701009
2978 glaxosmithkline brand of carvedilol C0719509
2979 pfizer brand of eplerenone C1144054

2980 rows × 2 columns

                                          STR       CUI
0                  pulmonary hypertension nos  C0020542
1     hypertensive pulmonary vascular disease  C0020542
2                      hypertension pulmonary  C0020542
3             pulmonary hypertension disorder  C0020542
4                     pulmonary hypertensions  C0020542
...                                       ...       ...
2975          drug screen qualitative digoxin  C0337449
2976  electrocardiogram myocardial infarction  C0428953
2977                roche brand of bumetanide  C0701009
2978      glaxosmithkline brand of carvedilol  C0719509
2979               pfizer brand of eplerenone  C1144054

[2980 rows x 2 columns]
                            STR       CUI
count                      2976      2980
unique                     2833       276
top     ventricular tachycardia  C0038454
freq                          4        52

In the summary we can see that we have identified 276 unique CUIs (Concept Unique Identifiers) related to asthma using the ONCE app. When we include the synonymous terms for these 276 CUIs, we end up with around 3,000 CUI-term mappings.

text2cui = Text2Code(asthma_dictionary_file)
Dictionary loaded sucessfully from /n/scratch/users/v/va67/EHR_TUTORIAL_WORKSPACE/scripts/meta_files/Asthma_NLP_Dict.csv
# Loading discharge notes data

discharge = pd.read_csv(os.path.join(note_directory,"discharge.csv"), dtype=str)
display(discharge)
note_id subject_id hadm_id note_type note_seq charttime storetime text
0 10000032-DS-21 10000032 22595853 DS 21 2180-05-07 00:00:00 2180-05-09 15:26:00 \nName: ___ Unit No: _...
1 10000032-DS-22 10000032 22841357 DS 22 2180-06-27 00:00:00 2180-07-01 10:15:00 \nName: ___ Unit No: _...
2 10000032-DS-23 10000032 29079034 DS 23 2180-07-25 00:00:00 2180-07-25 21:42:00 \nName: ___ Unit No: _...
3 10000032-DS-24 10000032 25742920 DS 24 2180-08-07 00:00:00 2180-08-10 05:43:00 \nName: ___ Unit No: _...
4 10000084-DS-17 10000084 23052089 DS 17 2160-11-25 00:00:00 2160-11-25 15:09:00 \nName: ___ Unit No: __...
... ... ... ... ... ... ... ... ...
331788 19999828-DS-6 19999828 29734428 DS 6 2147-08-04 00:00:00 2147-08-12 15:36:00 \nName: ___ Unit No: ___...
331789 19999828-DS-7 19999828 25744818 DS 7 2149-01-18 00:00:00 2149-01-19 07:03:00 \nName: ___ Unit No: ___...
331790 19999840-DS-20 19999840 26071774 DS 20 2164-07-28 00:00:00 2164-07-29 14:52:00 \nName: ___ Unit No: ___\...
331791 19999840-DS-21 19999840 21033226 DS 21 2164-09-17 00:00:00 2164-09-18 01:36:00 \nName: ___ Unit No: ___\...
331792 19999987-DS-2 19999987 23865745 DS 2 2145-11-11 00:00:00 2145-11-11 13:13:00 \nName: ___ Unit No: __...

331793 rows × 8 columns

# Check how many patients from the Asthma cohort can be identified in the notes cohort.

print(len(asthma_cohort))

discharge_asthma_cohort = discharge[discharge['subject_id'].isin(asthma_cohort['subject_id'])]

display(discharge_asthma_cohort.describe())
20316
note_id subject_id hadm_id note_type note_seq charttime storetime text
count 50286 50286 50286 50286 50286 50286 50284 50286
unique 50286 14958 50286 1 135 25983 50232 50285
top 10001725-DS-12 12468016 25563031 DS 21 2160-06-09 00:00:00 2116-12-03 14:09:00 \nName: ___ Unit No: ___...
freq 1 85 1 50286 2391 9 2 2

Out of the 20,316 patients in the asthma cohort, only 14,958 (approximately 74%) have discharge notes.
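The coverage figure quoted above comes from a simple membership check; a minimal helper for reuse (the name and signature are ours, not part of the tutorial code):

```python
import pandas as pd

def note_coverage(cohort: pd.DataFrame, notes: pd.DataFrame) -> float:
    """Fraction of cohort subject_ids that appear at least once in `notes`.

    Assumes `cohort` has one row per unique subject_id.
    """
    return cohort["subject_id"].isin(notes["subject_id"]).mean()
```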

# Selecting only the required columns

discharge = discharge[['subject_id','charttime','text']]
display(discharge.head())
subject_id charttime text
0 10000032 2180-05-07 00:00:00 \nName: ___ Unit No: _...
1 10000032 2180-06-27 00:00:00 \nName: ___ Unit No: _...
2 10000032 2180-07-25 00:00:00 \nName: ___ Unit No: _...
3 10000032 2180-08-07 00:00:00 \nName: ___ Unit No: _...
4 10000084 2160-11-25 00:00:00 \nName: ___ Unit No: __...
# Filtering to retain only the notes associated with the patients of interest

asthma_cohort_notes = pd.merge(asthma_cohort, discharge, on=['subject_id'], how='inner')
display(asthma_cohort_notes)

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

# Converting Text to CUI (CONCEPT UNIQUE IDENTIFIER)

asthma_cohort_notes['note_cui'] = asthma_cohort_notes['text'].map(lambda x: text2cui.convert(x))
asthma_cohort_notes
subject_id charttime text note_cui
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 4 columns

Now that we have the narrative text translated to CUI codes, we can go ahead with the reformatting and cleaning process.

asthma_cohort_missing_notes = pd.DataFrame({'Column': asthma_cohort_notes.columns,'Missing_Values': asthma_cohort_notes.isna().sum()})
display(asthma_cohort_missing_notes)

# Selecting only columns of interest

asthma_cohort_notes=asthma_cohort_notes[['subject_id','charttime','note_cui']]
display(asthma_cohort_notes)

# Rename the time column to be consistent with the other datasets
asthma_cohort_notes = asthma_cohort_notes.rename(columns = {"charttime":"date"})
display(asthma_cohort_notes)

# Cleaning the dates
asthma_cohort_notes["date"] = asthma_cohort_notes["date"].str[:10]
display(asthma_cohort_notes)
Column Missing_Values
subject_id subject_id 0
charttime charttime 0
text text 0
note_cui note_cui 0
subject_id charttime note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

# Convert the note_cui column to cui list so we can expand them in next step

asthma_cohort_notes['note_cui_list'] = asthma_cohort_notes['note_cui'].apply(lambda x: x.split(',') if x else None)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

# Expand the note_cui_list column to have one CUI per row

asthma_cohort_notes = asthma_cohort_notes[['subject_id','date','note_cui_list']]
asthma_cohort_notes = asthma_cohort_notes.explode('note_cui_list')
display(asthma_cohort_notes)
subject_id date note_cui_list
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
... ... ... ...
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729

42090077 rows × 3 columns

# Drop Duplicates

asthma_cohort_notes.drop_duplicates(inplace=True)
asthma_cohort_notes=asthma_cohort_notes.rename(columns={"note_cui_list":"cui"})
display(asthma_cohort_notes)
subject_id date cui
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C0016860
0 10001725 2110-04-14 C0205082
0 10001725 2110-04-14 C0344315
0 10001725 2110-04-14 C0013604
... ... ... ...
50285 19757198 2194-10-02 C5201148
50285 19757198 2194-10-02 C4084203
50285 19757198 2194-10-02 C0033095
50285 19757198 2194-10-02 C0010957
50285 19757198 2194-10-02 C0011847

729164 rows × 3 columns

cui_counts_per_patient = asthma_cohort_notes.groupby(['subject_id', 'cui']).size().reset_index(name='counts')
display(cui_counts_per_patient)
subject_id cui counts
0 10001725 C0001645 1
1 10001725 C0004238 1
2 10001725 C0013404 1
3 10001725 C0013604 1
4 10001725 C0016860 1
... ... ... ...
297382 19999442 C4554100 1
297383 19999442 C4554645 1
297384 19999442 C5201148 1
297385 19999442 C5203119 1
297386 19999442 C5441729 2

297387 rows × 3 columns

cui_counts_per_patient_matrixformat = cui_counts_per_patient.pivot_table(index='subject_id', columns='cui', values='counts', fill_value=0)
display(cui_counts_per_patient_matrixformat)


# Keep the subject_id index so the saved matrix retains patient identifiers
cui_counts_per_patient_matrixformat.to_csv(os.path.join(cohort_aggregateddata_nlp_directory, "CUI_counts.csv"), index=True)
cui C0001645 C0002598 C0002962 C0003015 C0003811 C0004093 C0004238 C0006376 C0007166 C0007193 ... C4554158 C4554645 C4699158 C4723804 C4759845 C4759928 C5201148 C5203119 C5424789 C5441729
subject_id
10001725 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
10001884 1 4 0 0 1 0 7 0 0 0 ... 0 3 1 0 0 0 3 8 0 15
10002800 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 3 2 0 3
10003019 4 0 0 0 2 0 1 0 0 0 ... 0 3 0 0 0 0 7 6 0 11
10004296 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19992803 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
19997072 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 2
19997887 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 1 2 0 2
19998350 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 2 1 2 0 2
19999442 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 1 1 0 2

14958 rows × 229 columns
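Following the earlier advice to analyze patients with both note and Hosp data, the codified (PheCode) and NLP (CUI) matrices can be restricted to their common patients. A sketch, assuming both matrices are indexed by subject_id (e.g. reloaded with `index_col='subject_id'`):

```python
import pandas as pd

def align_on_common_patients(codified: pd.DataFrame, nlp: pd.DataFrame):
    """Restrict both patient-level matrices to subject_ids present in each.

    Returns the two matrices filtered to their shared index, in the same order.
    """
    common = codified.index.intersection(nlp.index)
    return codified.loc[common], nlp.loc[common]
```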