Coded clinical events from structured EHR data are not the only source of information: EHR notes also contain a wealth of clinically relevant detail. The objective of this notebook is to bring structure to this unstructured note data.
To identify and extract clinically relevant concepts from notes, we translate text into codes based on mappings provided by the Unified Medical Language System (UMLS). To do this, we can leverage tools such as cTAKES, NILE, and MetaMap. We recommend using NILE; detailed instructions can be found in the PheCAP tutorial.
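Conceptually, these tools map spans of text to UMLS Concept Unique Identifiers (CUIs). As a toy sketch only (real systems such as NILE additionally handle negation, abbreviations, and word-sense disambiguation), concept extraction can be thought of as dictionary lookup; the term-to-CUI dictionary below is a small hand-picked fragment for illustration, not an official UMLS export.

```python
# Hypothetical fragment of a UMLS term-to-CUI dictionary (illustration only)
term_to_cui = {
    "asthma": "C0004096",
    "shortness of breath": "C0013404",
    "wheezing": "C0043144",
}

def extract_cuis(note_text):
    """Return the CUI of every dictionary term found in the note (case-insensitive)."""
    text = note_text.lower()
    return [cui for term, cui in term_to_cui.items() if term in text]

print(extract_cuis("Patient reports wheezing and a history of asthma."))
```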
For this tutorial, we use PETEHR, an NLP toolkit built specifically for these examples. However, for better performance in real-world applications, we highly recommend using NILE or similar software.
We present this lightweight PETEHR (Patient-Event Temporal and Hierarchical Representation) example to illustrate how temporal healthcare data can be structured; it highlights PETEHR's utility in organizing event sequences and temporal relationships. It is important to note PETEHR's inherent limitations, however. PETEHR focuses on syntactic and temporal structuring of data and does not perform semantic analysis (e.g., contextual interpretation of clinical events, inferring implicit meaning, or resolving ambiguities in unstructured text). Tasks requiring deeper contextual understanding therefore call for complementary tools or frameworks, and the example should be read as a demonstration of structural modeling rather than a comprehensive analytical solution.
MIMIC Notes and Hosp Module Version Difference
In this section, we focus on processing discharge summaries from the MIMIC-IV deidentified notes. It is important to note that the most recent release of the MIMIC-IV note data is version 2.2, whereas the MIMIC-IV Hosp data we are using comes from a newer release, version 3.1. This mismatch means some patients present in the Hosp data will be missing from the note data.
This situation is typical in real-world healthcare systems, where data from different sources or timeframes do not fully align. One way to address this is to ensure that the patient cohort you are analyzing has both note and Hosp data available. Additionally, since the Hosp data is newer, you may need to truncate each patient's Hosp records to align with the last observed note date for that patient.
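One way to sketch this truncation in pandas, using hypothetical miniature data (column names such as `charttime` and `admittime` mirror MIMIC-IV conventions):

```python
import pandas as pd

# Toy note and Hosp tables (hypothetical patients; patient 3 has no notes)
notes = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "charttime": pd.to_datetime(["2150-01-05", "2150-03-10", "2151-07-01"]),
})
hosp = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "admittime": pd.to_datetime(["2150-02-01", "2150-06-01", "2151-06-15", "2152-01-01"]),
})

# Last observed note date per patient
last_note = (notes.groupby("subject_id", as_index=False)["charttime"]
             .max()
             .rename(columns={"charttime": "last_note_date"}))

# Inner merge keeps only patients with notes; then drop Hosp records after the last note
hosp_aligned = hosp.merge(last_note, on="subject_id", how="inner")
hosp_aligned = hosp_aligned[hosp_aligned["admittime"] <= hosp_aligned["last_note_date"]]
print(hosp_aligned)
```

Here patient 3 is dropped entirely (no notes), and patient 1's later admission is truncated because it falls after that patient's last note.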
# Installing Python toolkit for processing EHR data. For NLP, we recommend NILE, linked below:
# https://celehs.hms.harvard.edu/software/NILE.html
!pip install petehr

import os

import pandas as pd

from petehr import Text2Code

base_directory = os.path.dirname(os.getcwd())
cohort_aggregateddata_nlp_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'nlp')
os.makedirs(cohort_aggregateddata_nlp_directory, exist_ok=True)
Requirement already satisfied: petehr in /n/data1/hsph/biostat/celehs/lab/va67/anaconda/anaconda3/lib/python3.11/site-packages (0.0.1)
# Check how many patients from the asthma cohort can be identified in the notes cohort.
print(len(asthma_cohort))
discharge_asthma_cohort = discharge[discharge['subject_id'].isin(asthma_cohort['subject_id'])]
display(discharge_asthma_cohort.describe())
20316
|        | note_id | subject_id | hadm_id | note_type | note_seq | charttime | storetime | text |
|--------|---------|------------|---------|-----------|----------|-----------|-----------|------|
| count  | 50286 | 50286 | 50286 | 50286 | 50286 | 50286 | 50284 | 50286 |
| unique | 50286 | 14958 | 50286 | 1 | 135 | 25983 | 50232 | 50285 |
| top    | 10001725-DS-12 | 12468016 | 25563031 | DS | 21 | 2160-06-09 00:00:00 | 2116-12-03 14:09:00 | \nName: ___ Unit No: ___... |
| freq   | 1 | 85 | 1 | 50286 | 2391 | 9 | 2 | 2 |
# Selecting only the required columns
discharge = discharge[['subject_id','charttime','text']]
display(discharge.head())
|   | subject_id | charttime | text |
|---|------------|-----------|------|
| 0 | 10000032 | 2180-05-07 00:00:00 | \nName: ___ Unit No: _... |
| 1 | 10000032 | 2180-06-27 00:00:00 | \nName: ___ Unit No: _... |
| 2 | 10000032 | 2180-07-25 00:00:00 | \nName: ___ Unit No: _... |
| 3 | 10000032 | 2180-08-07 00:00:00 | \nName: ___ Unit No: _... |
| 4 | 10000084 | 2160-11-25 00:00:00 | \nName: ___ Unit No: __... |
# Filtering to retain only the notes associated with the patients of interest
asthma_cohort_notes = pd.merge(asthma_cohort, discharge, on=['subject_id'], how='inner')
display(asthma_cohort_notes)
asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
|       | subject_id | charttime | text |
|-------|------------|-----------|------|
| 0     | 10001725 | 2110-04-14 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 1     | 10001884 | 2125-10-20 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 2     | 10001884 | 2125-10-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 3     | 10001884 | 2125-12-03 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 4     | 10001884 | 2125-12-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| ...   | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 00:00:00 | \nName: ___ Unit No: __... |
| 50282 | 19757198 | 2191-05-26 00:00:00 | \nName: ___ Unit No: __... |
| 50283 | 19757198 | 2193-06-23 00:00:00 | \nName: ___ Unit No: __... |
| 50284 | 19757198 | 2194-02-19 00:00:00 | \nName: ___ Unit No: __... |
| 50285 | 19757198 | 2194-10-02 00:00:00 | \nName: ___ Unit No: __... |

50286 rows × 3 columns

|       | subject_id | charttime | text |
|-------|------------|-----------|------|
| 0     | 10001725 | 2110-04-14 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 1     | 10001884 | 2125-10-20 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 2     | 10001884 | 2125-10-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 3     | 10001884 | 2125-12-03 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| 4     | 10001884 | 2125-12-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... |
| ...   | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 00:00:00 | \nName: ___ Unit No: __... |
| 50282 | 19757198 | 2191-05-26 00:00:00 | \nName: ___ Unit No: __... |
| 50283 | 19757198 | 2193-06-23 00:00:00 | \nName: ___ Unit No: __... |
| 50284 | 19757198 | 2194-02-19 00:00:00 | \nName: ___ Unit No: __... |
| 50285 | 19757198 | 2194-10-02 00:00:00 | \nName: ___ Unit No: __... |

50286 rows × 3 columns
# Converting text to CUIs (Concept Unique Identifiers).
# `text2cui` is an instance of the Text2Code class imported above (instantiation not shown here).
asthma_cohort_notes['note_cui'] = asthma_cohort_notes['text'].map(lambda x: text2cui.convert(x))
asthma_cohort_notes
|       | subject_id | charttime | text | note_cui |
|-------|------------|-----------|------|----------|
| 0     | 10001725 | 2110-04-14 00:00:00 | \nName: ___ Unit No: ___\n \nA... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 1     | 10001884 | 2125-10-20 00:00:00 | \nName: ___ Unit No: ___\n \nA... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 2     | 10001884 | 2125-10-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 3     | 10001884 | 2125-12-03 00:00:00 | \nName: ___ Unit No: ___\n \nA... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 4     | 10001884 | 2125-12-27 00:00:00 | \nName: ___ Unit No: ___\n \nA... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| ...   | ... | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 00:00:00 | \nName: ___ Unit No: __... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50282 | 19757198 | 2191-05-26 00:00:00 | \nName: ___ Unit No: __... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50283 | 19757198 | 2193-06-23 00:00:00 | \nName: ___ Unit No: __... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50284 | 19757198 | 2194-02-19 00:00:00 | \nName: ___ Unit No: __... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50285 | 19757198 | 2194-10-02 00:00:00 | \nName: ___ Unit No: __... | C5441729,C5441729,C5441729,C5441729,C5441729,C... |

50286 rows × 4 columns
Now that we have the narrative text translated to CUI codes, we can go ahead with the reformatting and cleaning process.
asthma_cohort_missing_notes = pd.DataFrame({'Column': asthma_cohort_notes.columns,
                                            'Missing_Values': asthma_cohort_notes.isna().sum()})
display(asthma_cohort_missing_notes)

# Selecting only columns of interest
asthma_cohort_notes = asthma_cohort_notes[['subject_id','charttime','note_cui']]
display(asthma_cohort_notes)

# Rename the time column to be consistent with the other datasets
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"charttime": "date"})
display(asthma_cohort_notes)

# Cleaning the dates (keep only the YYYY-MM-DD part)
asthma_cohort_notes["date"] = asthma_cohort_notes["date"].str[:10]
display(asthma_cohort_notes)
|            | Column | Missing_Values |
|------------|--------|----------------|
| subject_id | subject_id | 0 |
| charttime  | charttime | 0 |
| text       | text | 0 |
| note_cui   | note_cui | 0 |
|       | subject_id | charttime | note_cui |
|-------|------------|-----------|----------|
| 0     | 10001725 | 2110-04-14 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 1     | 10001884 | 2125-10-20 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 2     | 10001884 | 2125-10-27 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 3     | 10001884 | 2125-12-03 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 4     | 10001884 | 2125-12-27 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| ...   | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50282 | 19757198 | 2191-05-26 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50283 | 19757198 | 2193-06-23 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50284 | 19757198 | 2194-02-19 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50285 | 19757198 | 2194-10-02 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |

50286 rows × 3 columns
|       | subject_id | date | note_cui |
|-------|------------|------|----------|
| 0     | 10001725 | 2110-04-14 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 1     | 10001884 | 2125-10-20 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 2     | 10001884 | 2125-10-27 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 3     | 10001884 | 2125-12-03 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 4     | 10001884 | 2125-12-27 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| ...   | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50282 | 19757198 | 2191-05-26 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50283 | 19757198 | 2193-06-23 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50284 | 19757198 | 2194-02-19 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50285 | 19757198 | 2194-10-02 00:00:00 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |

50286 rows × 3 columns
|       | subject_id | date | note_cui |
|-------|------------|------|----------|
| 0     | 10001725 | 2110-04-14 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 1     | 10001884 | 2125-10-20 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 2     | 10001884 | 2125-10-27 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 3     | 10001884 | 2125-12-03 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 4     | 10001884 | 2125-12-27 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| ...   | ... | ... | ... |
| 50281 | 19757198 | 2189-09-12 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50282 | 19757198 | 2191-05-26 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50283 | 19757198 | 2193-06-23 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50284 | 19757198 | 2194-02-19 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |
| 50285 | 19757198 | 2194-10-02 | C5441729,C5441729,C5441729,C5441729,C5441729,C... |

50286 rows × 3 columns
# Convert the note_cui column to a CUI list so we can expand it in the next step
asthma_cohort_notes['note_cui_list'] = asthma_cohort_notes['note_cui'].apply(lambda x: x.split(',') if x else None)
display(asthma_cohort_notes)
# Expand the note_cui_list column to have one CUI per row
asthma_cohort_notes = asthma_cohort_notes[['subject_id','date','note_cui_list']]
asthma_cohort_notes = asthma_cohort_notes.explode('note_cui_list')
display(asthma_cohort_notes)
|       | subject_id | date | note_cui_list |
|-------|------------|------|---------------|
| 0     | 10001725 | 2110-04-14 | C5441729 |
| 0     | 10001725 | 2110-04-14 | C5441729 |
| 0     | 10001725 | 2110-04-14 | C5441729 |
| 0     | 10001725 | 2110-04-14 | C5441729 |
| 0     | 10001725 | 2110-04-14 | C5441729 |
| ...   | ... | ... | ... |
| 50285 | 19757198 | 2194-10-02 | C5441729 |
| 50285 | 19757198 | 2194-10-02 | C5441729 |
| 50285 | 19757198 | 2194-10-02 | C5441729 |
| 50285 | 19757198 | 2194-10-02 | C5441729 |
| 50285 | 19757198 | 2194-10-02 | C5441729 |

42090077 rows × 3 columns
# Drop duplicates
asthma_cohort_notes.drop_duplicates(inplace=True)
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"note_cui_list": "cui"})
display(asthma_cohort_notes)