Structured EHR codes aren’t the only source of clinical information: free-text notes in the EHR also contain substantial clinical detail. Using NLP, we can convert those notes into standardized concept codes (e.g., CUIs) so they can be integrated with the structured EHR data.
Unified Medical Language System (UMLS)
UMLS is a collection of more than 200 health and biomedical vocabularies (e.g., ICD, SNOMED-CT, RxNorm, LOINC) developed and maintained by the U.S. National Library of Medicine to enable interoperability across systems. One of its central components, the Metathesaurus, maps synonymous terms from different vocabularies to a single Concept Unique Identifier (CUI).
One can use the UMLS Browser to search for “Diabetes mellitus.” It will display a concept page similar to the one shown below.
[Screenshot: UMLS Browser concept page for "Diabetes mellitus"]
The UMLS provides mappings from terms to Concept Unique Identifiers (CUIs) for a wide range of biomedical concepts. A few examples are shown below.

| Name | CUI | AUI | Vocabulary | Term Type | Code |
| --- | --- | --- | --- | --- | --- |
| DIABETES MELLITUS | C0011849 | — | WHO | PT | 0371 |
| Diabetes mellitus | C0011849 | A2928669 | SNOMEDCT_US | PT | 73211009 |
| Bronchial Asthma | C0004096 | A26667996 | MSH | ET | D001249 |
| Asthma | C0004096 | A2878777 | SNOMEDCT_US | PT | 195967001 |
| Hypertension | C0020538 | A0070978 | MSH | MH | D006973 |
| Blood Pressure, High | C0020538 | A26603831 | MSH | ET | D006973 |
If you have a clinical note like the one below, you can process it using a dictionary of CUI-term mappings to convert it into a sequence of CUIs.
Example: Mapping clinical note text to CUIs
Given a note:
| patient_id | date | note |
| --- | --- | --- |
| 10001 | 2012-12-12 | The patient was admitted to the hospital due to asthma. He has a secondary diagnosis of DM – Diabetes mellitus. Discharge Diagnosis: Hypertension |
After dictionary lookup, the extracted concepts and CUIs might be:
| patient_id | date | extracted_terms | mapped_CUIs |
| --- | --- | --- | --- |
| 10001 | 2012-12-12 | asthma; DM – Diabetes mellitus; Hypertension | C0004096;C0011849;C0020538 |
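The dictionary lookup can be sketched in a few lines of Python. This is a minimal illustration that assumes a greedy longest-match strategy over lowercased word tokens. It is not how petehr, NILE, cTAKES, or MetaMap work internally. The toy dictionary is taken from the UMLS examples above, and the helper names (`tokenize`, `build_index`, `text_to_cuis`) are hypothetical.

```python
import re

# Toy term-to-CUI dictionary built from the UMLS examples above.
TERM_TO_CUI = {
    "diabetes mellitus": "C0011849",
    "asthma": "C0004096",
    "bronchial asthma": "C0004096",
    "hypertension": "C0020538",
    "blood pressure, high": "C0020538",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(term_to_cui):
    """Index each dictionary term by its token tuple for fast lookup."""
    return {tuple(tokenize(term)): cui for term, cui in term_to_cui.items()}


def text_to_cuis(note, index):
    """Scan the note left to right, preferring the longest matching term."""
    tokens = tokenize(note)
    max_len = max(len(key) for key in index)
    cuis = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cui = index.get(tuple(tokens[i:i + n]))
            if cui is not None:
                cuis.append(cui)
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return cuis


note = (
    "The patient was admitted to the hospital due to asthma. "
    "He has a secondary diagnosis of DM – Diabetes mellitus. "
    "Discharge Diagnosis: Hypertension"
)
print(";".join(text_to_cuis(note, build_index(TERM_TO_CUI))))
# prints: C0004096;C0011849;C0020538
```

Longest-match-first keeps "bronchial asthma" from being counted as a separate "asthma" mention. The sketch ignores negation ("no history of asthma") and abbreviations such as "DM" that are missing from the dictionary, both of which dedicated tools typically handle.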
Several tools are available for performing this text-to-code conversion:
cTAKES – An open-source NLP system, originally developed at Mayo Clinic and now an Apache Software Foundation project, widely used for extracting information from clinical notes.
NILE – A lightweight and efficient tool developed at Harvard for extracting UMLS concepts from text; recommended for most applications.
MetaMap – A tool developed by the National Library of Medicine (NLM) that maps biomedical text to UMLS concepts.
Converting Notes to a Sequence of CUIs
For this tutorial, we analyze a random sample of MIMIC-IV discharge notes and process them using petehr, a custom natural language processing (NLP) toolkit developed for this tutorial. For higher performance and production use, we recommend NILE or comparable software. We illustrate with a note similar to the example above, using the petehr library.
from petehr import Text2Code

# Load the term-to-CUI dictionary
mapper = Text2Code("cui_dict.csv")

note = "The patient was admitted to the hospital due to asthma and taken to ICU. He has a secondary diagnosis of DM – Diabetes mellitus and hypertension "
print("Original Note:", note)

# Lowercase the note before converting it to CUIs
codes = mapper.convert(note.lower())
print("Concepts identified from the notes:", codes)
Dictionary loaded sucessfully from cui_dict.csv
Original Note: The patient was admitted to the hospital due to asthma and taken to ICU. He has a secondary diagnosis of DM – Diabetes mellitus and hypertension
Concepts identified from the notes: C0004096,C0011849,C0020538
Building the CUI-term Dictionary
To transform free-text notes into analyzable concepts, we map terms in the text to UMLS Concept Unique Identifiers (CUIs) using a term-to-CUI dictionary. Two practical pathways are outlined below.
Option A: Build a comprehensive term-to-CUI dictionary from UMLS
The UMLS Metathesaurus provides the table MRCONSO.RRF, which contains lexical variants, synonyms, and source vocabulary terms linked to CUIs. You can download the UMLS release from the National Library of Medicine:
The raw CUI-term mappings are noisy, so we encourage additional cleaning before use.
Recommended extraction filters (for English clinical use):
- Language: LAT='ENG'
- Suppressible flag: exclude rows with SUPPRESS='Y'
- Source vocabularies: We typically include all vocabularies to create a global dictionary for all purposes.
High-level workflow:
1. Extract: Load MRCONSO.RRF and filter as above.
2. Normalize: Lowercase, and optionally strip punctuation in the STR column.
3. Stopwords: Remove stopwords from the CUI-term mappings.
4. Deduplicate: Remove redundant mappings between CUIs and terms.
5. Restrict (optional): Depending on the downstream task, keep only CUIs with clinically relevant semantic types (e.g., Disease or Syndrome, Finding, Pharmacologic Substance). Semantic types are listed in the MRSTY.RRF table in UMLS.
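Steps 1 to 4 of the workflow can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not a definitive pipeline. It assumes the standard 18-column, pipe-delimited MRCONSO.RRF layout, reads "remove stopwords" as dropping terms that are themselves stopwords, and uses a tiny illustrative stopword list. The helper names and the sample rows are hypothetical: the rows are synthetic, with only CUI, LAT, STR, and SUPPRESS filled in, and are not real UMLS records. Step 5 (restricting by semantic type) is not covered.

```python
import csv
import re

# Column order of MRCONSO.RRF (pipe-delimited; each row ends with a trailing "|").
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Illustrative stopword list (an assumption; substitute your own).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "with"}


def normalize(term):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    term = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(term.split())


def build_dictionary(lines):
    """Return sorted, de-duplicated (term, CUI) pairs from MRCONSO-style lines."""
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    pairs = set()
    for fields in reader:
        if len(fields) < len(MRCONSO_COLUMNS):
            continue  # skip blank or malformed rows
        row = dict(zip(MRCONSO_COLUMNS, fields))
        # Step 1: keep English rows that are not flagged SUPPRESS='Y'.
        if row["LAT"] != "ENG" or row["SUPPRESS"] == "Y":
            continue
        # Step 2: normalize the term string.
        term = normalize(row["STR"])
        # Step 3: drop empty terms and terms that are themselves stopwords.
        if not term or term in STOPWORDS:
            continue
        # Step 4: the set removes duplicate (term, CUI) pairs.
        pairs.add((term, row["CUI"]))
    return sorted(pairs)


def make_row(cui, lat, string, suppress):
    """Build a synthetic MRCONSO-style line; unused columns are left empty."""
    values = dict.fromkeys(MRCONSO_COLUMNS, "")
    values.update(CUI=cui, LAT=lat, STR=string, SUPPRESS=suppress)
    return "|".join(values[col] for col in MRCONSO_COLUMNS) + "|"


# Synthetic sample rows for illustration only.
SAMPLE_ROWS = [
    make_row("C0011849", "ENG", "Diabetes mellitus", "N"),
    make_row("C0011849", "ENG", "DIABETES MELLITUS", "N"),  # duplicate after normalizing
    make_row("C0004096", "ENG", "Bronchial Asthma", "N"),
    make_row("C0020538", "ENG", "Blood Pressure, High", "N"),
    make_row("C0011849", "SPA", "Diabetes mellitus", "N"),  # dropped: not English
    make_row("C0020538", "ENG", "Hypertension", "Y"),       # dropped: suppressible
]

for term, cui in build_dictionary(SAMPLE_ROWS):
    print(cui, term)
```

To run this on a real release, pass an open file handle instead of `SAMPLE_ROWS`, for example `open("MRCONSO.RRF", encoding="utf-8", newline="")`. The full file is large, so streaming it row by row as above avoids loading it all into memory.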
Option B: Use ONCE to obtain condition-focused dictionaries
If your project centers on specific conditions (e.g., diabetes, asthma), you may prefer a lightweight approach: