Step 3: Natural Language Processing

The coded clinical events from structured EHR data are not the only sources of information. EHR notes also contain a wealth of information. The objective of this notebook is to bring structure to unstructured note data.

To identify and extract clinically relevant concepts from notes, we translate text into codes based on mappings provided by the Unified Medical Language System (UMLS), using tools such as cTAKES, NILE, and MetaMap. We recommend NILE; detailed instructions can be found in the PheCAP tutorial.

For this tutorial, we use PETEHR, an NLP toolkit custom-built for this tutorial. However, for better performance in real-world applications, we highly recommend NILE or similar software.

In this work, we present a lightweight example using PETEHR (Patient-Event Temporal and Hierarchical Representation) to illustrate how to structure temporal healthcare data; the example appears at the end of this discussion and highlights PETEHR's utility in organizing event sequences and temporal relationships. It is important to note PETEHR's inherent limitations: it performs syntactic and temporal structuring only and does not perform semantic analysis (e.g., contextual interpretation of clinical events, inferring implicit meaning, or resolving ambiguities in unstructured text). Tasks requiring deeper contextual understanding therefore need complementary tools or frameworks. The example should be interpreted as a demonstration of structural modeling, rather than a comprehensive analytical solution.
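To make this limitation concrete, a purely dictionary-based matcher will flag a concept even when the surrounding text negates it or attributes it to someone other than the patient. The toy matcher below is a hypothetical illustration of that failure mode, not PETEHR's actual implementation:

```python
# Toy dictionary matcher: maps phrases to CUIs by substring search.
# Illustrates why syntactic matching alone misses negation and context.
term_to_cui = {
    "pulmonary hypertension": "C0020542",
    "asthma": "C0004096",
}

def naive_match(text):
    """Return CUIs whose phrase appears anywhere in the lowercased text."""
    text = text.lower()
    return [cui for term, cui in term_to_cui.items() if term in text]

# A negated mention still matches -- the matcher has no semantic model.
print(naive_match("No evidence of pulmonary hypertension on exam."))  # -> ['C0020542']
# A family-history mention is indistinguishable from a patient diagnosis.
print(naive_match("Mother has asthma; patient denies wheezing."))     # -> ['C0004096']
```

Tools like NILE address exactly these cases with negation detection and context (certainty, family history) handling.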

MIMIC-IV Note and Hosp module version difference

In this section, we focus on processing discharge summaries from MIMIC-IV deidentified notes. It’s important to note that the most recent version of the MIMIC-IV note data is version 2.2, whereas the MIMIC-IV Hosp data we are using is from a newer version, version 3.1. This mismatch means there will be patients present in the Hosp data who are missing in the note data.

This situation is typical in real-world healthcare systems, where data from different sources or timeframes do not fully align. One way to address this is to ensure that the patient cohort you are analyzing has both note and Hosp data available. Additionally, since the Hosp data is newer, you may need to truncate each patient's records in the Hosp data to align with the last observed note date for that patient.
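That truncation can be sketched as follows; the frames and column names here are illustrative stand-ins for the real Hosp and note tables, both keyed by `subject_id` with a `date` column:

```python
import pandas as pd

# Illustrative frames: patient 1's Hosp events extend past the last note.
hosp = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2180-01-01", "2180-06-01", "2181-02-01", "2180-03-01"]),
})
notes = pd.DataFrame({
    "subject_id": [1, 2],
    "date": pd.to_datetime(["2180-07-01", "2180-05-01"]),
})

# Last observed note date per patient.
last_note = (notes.groupby("subject_id")["date"].max()
                  .rename("last_note_date").reset_index())

# Keep only Hosp rows on or before each patient's last note date;
# patients with no notes at all are dropped by the inner merge.
hosp_aligned = hosp.merge(last_note, on="subject_id")
hosp_aligned = hosp_aligned[hosp_aligned["date"] <= hosp_aligned["last_note_date"]]
print(hosp_aligned[["subject_id", "date"]])
```

Here patient 1's 2181-02-01 event is dropped because it falls after that patient's last note (2180-07-01).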

# Installing the Python toolkit for processing EHR data. For NLP, we recommend NILE, linked below:
# https://celehs.hms.harvard.edu/software/NILE.html

!pip install petehr 

import os
import pandas as pd
from petehr import Text2Code

base_directory = os.path.dirname(os.getcwd())
cohort_aggregateddata_nlp_directory = os.path.join(base_directory, 'processed_data', 'step5_cohort_aggregateddata', 'nlp')
os.makedirs(cohort_aggregateddata_nlp_directory, exist_ok=True)
note_directory = os.path.join(base_directory, 'raw_data', 'nlp', 'physionet.org', 'files', 'mimic-iv-note', '2.2', 'note')
os.listdir(note_directory)
['discharge.csv',
 'discharge_detail.csv',
 'index.html',
 'radiology.csv',
 'radiology_detail.csv']

nlp_dictionary_file = os.path.join(base_directory, 'scripts', 'meta_files','NLP_Dict.csv')
nlp_dictionary = pd.read_csv(nlp_dictionary_file, dtype=str)
display(nlp_dictionary)
print(nlp_dictionary.describe())
                                          STR       CUI
0                  pulmonary hypertension nos  C0020542
1     hypertensive pulmonary vascular disease  C0020542
2                      hypertension pulmonary  C0020542
3             pulmonary hypertension disorder  C0020542
4                     pulmonary hypertensions  C0020542
...                                       ...       ...
2975          drug screen qualitative digoxin  C0337449
2976  electrocardiogram myocardial infarction  C0428953
2977                roche brand of bumetanide  C0701009
2978      glaxosmithkline brand of carvedilol  C0719509
2979               pfizer brand of eplerenone  C1144054

[2980 rows x 2 columns]
                            STR       CUI
count                      2976      2980
unique                     2833       276
top     ventricular tachycardia  C0038454
freq                          4        52
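The `describe()` output shows the dictionary is not perfectly clean: 4 of the 2,980 rows have a missing `STR`, and only 2,833 of the 2,976 strings are unique, so some phrases map to more than one CUI. A defensive cleaning pass before building the matcher might look like the sketch below (with a hypothetical miniature dictionary; whether duplicates should be dropped or kept depends on how `Text2Code` resolves ties):

```python
import pandas as pd

# Hypothetical miniature dictionary with the same issues as NLP_Dict.csv:
# a missing string and one phrase mapped to two different CUIs.
nlp_dictionary = pd.DataFrame({
    "STR": ["pulmonary hypertension", None,
            "ventricular tachycardia", "ventricular tachycardia"],
    "CUI": ["C0020542", "C0000001", "C0042514", "C0038454"],
})

cleaned = (
    nlp_dictionary
    .dropna(subset=["STR"])                         # drop rows with no phrase
    .assign(STR=lambda d: d["STR"].str.lower())     # normalize case
    .drop_duplicates(subset=["STR"], keep="first")  # one CUI per phrase
    .reset_index(drop=True)
)
print(len(cleaned))
```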
text2cui = Text2Code(nlp_dictionary)
# Loading discharge notes data

discharge = pd.read_csv(os.path.join(note_directory,"discharge.csv"), dtype=str)
display(discharge)
note_id subject_id hadm_id note_type note_seq charttime storetime text
0 10000032-DS-21 10000032 22595853 DS 21 2180-05-07 00:00:00 2180-05-09 15:26:00 \nName: ___ Unit No: _...
1 10000032-DS-22 10000032 22841357 DS 22 2180-06-27 00:00:00 2180-07-01 10:15:00 \nName: ___ Unit No: _...
2 10000032-DS-23 10000032 29079034 DS 23 2180-07-25 00:00:00 2180-07-25 21:42:00 \nName: ___ Unit No: _...
3 10000032-DS-24 10000032 25742920 DS 24 2180-08-07 00:00:00 2180-08-10 05:43:00 \nName: ___ Unit No: _...
4 10000084-DS-17 10000084 23052089 DS 17 2160-11-25 00:00:00 2160-11-25 15:09:00 \nName: ___ Unit No: __...
... ... ... ... ... ... ... ... ...
331788 19999828-DS-6 19999828 29734428 DS 6 2147-08-04 00:00:00 2147-08-12 15:36:00 \nName: ___ Unit No: ___...
331789 19999828-DS-7 19999828 25744818 DS 7 2149-01-18 00:00:00 2149-01-19 07:03:00 \nName: ___ Unit No: ___...
331790 19999840-DS-20 19999840 26071774 DS 20 2164-07-28 00:00:00 2164-07-29 14:52:00 \nName: ___ Unit No: ___\...
331791 19999840-DS-21 19999840 21033226 DS 21 2164-09-17 00:00:00 2164-09-18 01:36:00 \nName: ___ Unit No: ___\...
331792 19999987-DS-2 19999987 23865745 DS 2 2145-11-11 00:00:00 2145-11-11 13:13:00 \nName: ___ Unit No: __...

331793 rows × 8 columns

# Check how many patients from the asthma cohort (defined in a previous step) can be identified in the notes data.

print(len(asthma_cohort))

discharge_asthma_cohort = discharge[discharge['subject_id'].isin(asthma_cohort['subject_id'])]

display(discharge_asthma_cohort.describe())
20316
note_id subject_id hadm_id note_type note_seq charttime storetime text
count 50286 50286 50286 50286 50286 50286 50284 50286
unique 50286 14958 50286 1 135 25983 50232 50285
top 10001725-DS-12 12468016 25563031 DS 21 2160-06-09 00:00:00 2116-12-03 14:09:00 \nName: ___ Unit No: ___...
freq 1 85 1 50286 2391 9 2 2
# Selecting only the required columns

discharge = discharge[['subject_id','charttime','text']]
display(discharge.head())
subject_id charttime text
0 10000032 2180-05-07 00:00:00 \nName: ___ Unit No: _...
1 10000032 2180-06-27 00:00:00 \nName: ___ Unit No: _...
2 10000032 2180-07-25 00:00:00 \nName: ___ Unit No: _...
3 10000032 2180-08-07 00:00:00 \nName: ___ Unit No: _...
4 10000084 2160-11-25 00:00:00 \nName: ___ Unit No: __...
# Filtering to retain only the notes associated with the patients of interest

asthma_cohort_notes = pd.merge(asthma_cohort, discharge, on=['subject_id'], how='inner')
display(asthma_cohort_notes)

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

subject_id charttime text
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __...

50286 rows × 3 columns

# Converting text to CUIs (Concept Unique Identifiers)

asthma_cohort_notes['note_cui'] = asthma_cohort_notes['text'].map(lambda x: text2cui.convert(x))
asthma_cohort_notes
subject_id charttime text note_cui
0 10001725 2110-04-14 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 \nName: ___ Unit No: ___\n \nA... C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ... ...
50281 19757198 2189-09-12 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 \nName: ___ Unit No: __... C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 4 columns
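Each converted note is a comma-separated string of CUIs, one entry per matched mention, so repeated mentions yield repeated codes. A quick way to inspect the mention distribution within a single note (a sketch, using a hypothetical output string rather than real MIMIC data):

```python
from collections import Counter

# Hypothetical note_cui string in the format produced above.
note_cui = "C5441729,C5441729,C0020542,C0004096,C0020542,C5441729"

# Count how often each CUI is mentioned in this one note.
counts = Counter(note_cui.split(","))
print(counts.most_common(2))  # -> [('C5441729', 3), ('C0020542', 2)]
```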

Now that we have the narrative text translated to CUI codes, we can go ahead with the reformatting and cleaning process.

asthma_cohort_missing_notes = pd.DataFrame({'Column': asthma_cohort_notes.columns,'Missing_Values': asthma_cohort_notes.isna().sum()})
display(asthma_cohort_missing_notes)

# Selecting only columns of interest

asthma_cohort_notes = asthma_cohort_notes[['subject_id','charttime','note_cui']]
display(asthma_cohort_notes)

# Rename the time column to be consistent with the other datasets
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"charttime": "date"})
display(asthma_cohort_notes)

# Cleaning the dates: keep only the YYYY-MM-DD portion of the timestamp
asthma_cohort_notes["date"] = asthma_cohort_notes["date"].str[:10]
display(asthma_cohort_notes)
Column Missing_Values
subject_id subject_id 0
charttime charttime 0
text text 0
note_cui note_cui 0
subject_id charttime note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 00:00:00 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

subject_id date note_cui
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C...
... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C...

50286 rows × 3 columns

# Convert the note_cui string to a list of CUIs so it can be expanded in the next step

asthma_cohort_notes['note_cui_list'] = asthma_cohort_notes['note_cui'].apply(lambda x: x.split(',') if x else None)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

asthma_cohort_notes.dropna(inplace=True)
display(asthma_cohort_notes)
subject_id date note_cui note_cui_list
0 10001725 2110-04-14 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
1 10001884 2125-10-20 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
2 10001884 2125-10-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
3 10001884 2125-12-03 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
4 10001884 2125-12-27 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
... ... ... ... ...
50281 19757198 2189-09-12 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50282 19757198 2191-05-26 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50283 19757198 2193-06-23 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50284 19757198 2194-02-19 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...
50285 19757198 2194-10-02 C5441729,C5441729,C5441729,C5441729,C5441729,C... [C5441729, C5441729, C5441729, C5441729, C5441...

50286 rows × 4 columns

# Expand the note_cui_list column to have one CUI per row

asthma_cohort_notes = asthma_cohort_notes[['subject_id','date','note_cui_list']]
asthma_cohort_notes = asthma_cohort_notes.explode('note_cui_list')
display(asthma_cohort_notes)
subject_id date note_cui_list
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C5441729
... ... ... ...
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729
50285 19757198 2194-10-02 C5441729

42090077 rows × 3 columns

# Drop Duplicates

asthma_cohort_notes.drop_duplicates(inplace=True)
asthma_cohort_notes = asthma_cohort_notes.rename(columns={"note_cui_list": "cui"})
display(asthma_cohort_notes)
subject_id date cui
0 10001725 2110-04-14 C5441729
0 10001725 2110-04-14 C0016860
0 10001725 2110-04-14 C0205082
0 10001725 2110-04-14 C0344315
0 10001725 2110-04-14 C0013604
... ... ... ...
50285 19757198 2194-10-02 C5201148
50285 19757198 2194-10-02 C4084203
50285 19757198 2194-10-02 C0033095
50285 19757198 2194-10-02 C0010957
50285 19757198 2194-10-02 C0011847

729164 rows × 3 columns
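Note that exploding before deduplicating materializes roughly 42 million rows that immediately collapse to about 729 thousand. A lighter-weight equivalent deduplicates each note's CUI list before exploding, then drops duplicates across notes. The sketch below uses toy data; in the real pipeline the same idea applies to `asthma_cohort_notes`:

```python
import pandas as pd

# Toy version of asthma_cohort_notes after CUI conversion:
# two notes for the same patient on the same date.
df = pd.DataFrame({
    "subject_id": ["10001725", "10001725"],
    "date": ["2110-04-14", "2110-04-14"],
    "note_cui": ["C5441729,C5441729,C0016860", "C5441729,C0205082"],
})

# Deduplicate within each note BEFORE exploding, then across notes.
df["cui"] = df["note_cui"].str.split(",").map(lambda cuis: sorted(set(cuis)))
pairs = (
    df[["subject_id", "date", "cui"]]
    .explode("cui")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(pairs)
```

This yields the same final `(subject_id, date, cui)` pairs while keeping the intermediate frame far smaller.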

cui_counts_per_patient = asthma_cohort_notes.groupby(['subject_id', 'cui']).size().reset_index(name='counts')
display(cui_counts_per_patient)
subject_id cui counts
0 10001725 C0001645 1
1 10001725 C0004238 1
2 10001725 C0013404 1
3 10001725 C0013604 1
4 10001725 C0016860 1
... ... ... ...
297382 19999442 C4554100 1
297383 19999442 C4554645 1
297384 19999442 C5201148 1
297385 19999442 C5203119 1
297386 19999442 C5441729 2

297387 rows × 3 columns
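These per-patient counts are often the input to downstream phenotyping models, which typically expect one row per patient. A sketch of pivoting to that wide form, using a hypothetical miniature version of the `cui_counts_per_patient` layout above:

```python
import pandas as pd

# Miniature stand-in for the long-format counts table.
cui_counts_per_patient = pd.DataFrame({
    "subject_id": ["10001725", "10001725", "19999442"],
    "cui": ["C0001645", "C0004238", "C5441729"],
    "counts": [1, 1, 2],
})

# One row per patient, one column per CUI, zeros where a CUI never appears.
feature_matrix = (
    cui_counts_per_patient
    .pivot(index="subject_id", columns="cui", values="counts")
    .fillna(0)
    .astype(int)
)
print(feature_matrix)
```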

For the full notebook with code, please visit here.