import os
# Specify the tutorial workspace location below
LOCATION = "/n/scratch/users/v/va67/EHR_TUTORIAL_WORKSPACE"
# Build the path to the hosp module .gz files using os.path.join
hosp_gz_files_path = os.path.join(LOCATION, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp")
# Run the gunzip command directly in the IPython notebook
print("Please wait, files are being unzipped")
!gunzip {hosp_gz_files_path}/*.gz
print(f"All .gz files in {hosp_gz_files_path} have been unzipped.\n")
# Repeat for the note module .gz files
note_gz_files_path = os.path.join(LOCATION, "raw_data", "nlp", "physionet.org", "files", "mimic-iv-note", "2.2", "note")
!gunzip {note_gz_files_path}/*.gz
print(f"All .gz files in {note_gz_files_path} have been unzipped.\n")
Getting Started: Setting up Workspace and Data Overview
Accessing MIMIC-IV Dataset
MIMIC-IV is a deidentified EHR dataset sourced from Beth Israel Deaconess Medical Center (BIDMC). To gain access to MIMIC-IV, follow these steps:
- You will need to be a credentialed user. You can register here to create a PhysioNet account and become a credentialed user.
- Complete the training listed here and submit your completed training here.
- Sign the Data Use Agreement here.
Once your access is approved, you will receive a notification at your registered email address.
Setting up your Compute Environment
- Set up a workspace to download and process MIMIC data.
- Download all the tutorial notebooks.
- Create a virtual environment with the necessary Python libraries.
- Download the MIMIC data.
The following script can be run on Unix-based systems to set up the workspace.
#!/bin/bash
# Set the location variable to the path where you want to create the MIMIC workspace
LOCATION="/path/to/your/workspace"
# If you would like to download and work in your home directory, you can update above to LOCATION="$HOME"
# Check if the workspace folder already exists
if [ -d "${LOCATION}/EHR_TUTORIAL_WORKSPACE" ]; then
    echo "EHR_TUTORIAL_WORKSPACE folder already exists here ${LOCATION}/EHR_TUTORIAL_WORKSPACE"
    exit 0
fi
# Create the workspace directory
mkdir "${LOCATION}/EHR_TUTORIAL_WORKSPACE"
echo "Workspace has been created here ${LOCATION}/EHR_TUTORIAL_WORKSPACE"
# Create the workspace subdirectories
echo "Creating raw_data, processed_data, and scripts subdirectories"
mkdir -p "${LOCATION}/EHR_TUTORIAL_WORKSPACE/raw_data" \
    "${LOCATION}/EHR_TUTORIAL_WORKSPACE/raw_data/nlp" \
    "${LOCATION}/EHR_TUTORIAL_WORKSPACE/processed_data" \
    "${LOCATION}/EHR_TUTORIAL_WORKSPACE/scripts"
echo "Workspace has been set up here ${LOCATION}/EHR_TUTORIAL_WORKSPACE"
# Download and extract MIMIC Data Prep scripts from GitHub
echo "Downloading and extracting MIMIC Data Prep scripts from GitHub..."
wget https://github.com/apvidul/EHR-Processing-Tutorial/archive/refs/heads/main.zip -O mimic-data-prep.zip
unzip -q mimic-data-prep.zip -d "${LOCATION}/EHR_TUTORIAL_WORKSPACE/scripts"
mv "${LOCATION}/EHR_TUTORIAL_WORKSPACE/scripts/EHR-Processing-Tutorial-main/"* "${LOCATION}/EHR_TUTORIAL_WORKSPACE/scripts/"
# Cleanup unnecessary files and folders
rm -rf mimic-data-prep.zip "${LOCATION}/EHR_TUTORIAL_WORKSPACE/scripts/EHR-Processing-Tutorial-main"
echo "Creating the Conda environment ehr_tutorial"
conda create --yes --name ehr_tutorial pandas wordcloud jupyter tqdm matplotlib
# Download the MIMIC data
# PHYSIONET_USERNAME and PHYSIONET_PASSWORD must be set in your environment before this step
wget -r -N -c -np --user "${PHYSIONET_USERNAME}" --password "${PHYSIONET_PASSWORD}" https://physionet.org/files/mimiciv/3.1/ -P "${LOCATION}/EHR_TUTORIAL_WORKSPACE/raw_data"
wget -r -N -c -np --user "${PHYSIONET_USERNAME}" --password "${PHYSIONET_PASSWORD}" https://physionet.org/files/mimic-iv-note/2.2/ -P "${LOCATION}/EHR_TUTORIAL_WORKSPACE/raw_data/nlp"
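After the script finishes, it can help to confirm that the directories landed where the rest of the tutorial expects them. The following is a minimal sketch and not part of the original tutorial: the helper name `missing_workspace_dirs` is hypothetical, and the expected paths are inferred from the script and the download URLs above.

```python
from pathlib import Path

# Subdirectories created by the setup script, plus the module folders that
# the recursive wget download is expected to produce under raw_data.
EXPECTED_SUBDIRS = [
    "raw_data",
    "raw_data/nlp",
    "processed_data",
    "scripts",
    "raw_data/physionet.org/files/mimiciv/3.1/hosp",
    "raw_data/nlp/physionet.org/files/mimic-iv-note/2.2/note",
]


def missing_workspace_dirs(workspace):
    """Return the expected subdirectories that do not exist under workspace."""
    root = Path(workspace)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    # Hypothetical location; replace with your own workspace path.
    workspace = Path.home() / "EHR_TUTORIAL_WORKSPACE"
    missing = missing_workspace_dirs(workspace)
    if missing:
        print("Missing directories:")
        for sub in missing:
            print(f"  {workspace / sub}")
    else:
        print("Workspace layout looks complete.")
```

An empty result suggests that both the workspace skeleton and the two module downloads are in place.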
Once MIMIC is downloaded, you are ready to work on this notebook. The data is still gzip-compressed at this point, so run the unzip block in the notebook to decompress it.
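If the `gunzip` command is not available, for example when running outside an IPython notebook, the same decompression can be sketched with Python's standard library. This is an illustrative alternative rather than the tutorial's own method; the function name `gunzip_directory` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_directory(directory):
    """Decompress every *.gz file in directory and delete the archive,
    mirroring the default behaviour of the gunzip command."""
    unzipped = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            # Stream in chunks so large tables are never held fully in memory.
            shutil.copyfileobj(src, dst)
        gz_path.unlink()
        unzipped.append(out_path)
    return unzipped
```

Calling `gunzip_directory(hosp_gz_files_path)` and then `gunzip_directory(note_gz_files_path)` would cover the two modules used in this tutorial.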
Understanding EHR Data
After gaining access to the raw EHR data, you need to understand the data so you can identify the parts of interest and extract them based on your research needs. This process can be split into the following substeps:
- What data is stored? Where is it stored and how is it organized?
- What are the key elements in EHR data?
- How do you link different components to create the raw dataset of interest?
The EHR data typically comes with its own documentation, or you can consult the EHR data manager. For MIMIC data, comprehensive documentation is provided on their website here.
We also recommend watching the MIMIC data tutorial here.
What data is stored?
Electronic Health Record (EHR) data can take many forms. For example, billing information and vital signs are typically recorded as tabular data, clinical notes are typically in free text format, ECG recordings generate waveform data, and procedures such as X-rays or CT scans generate image data. These data originate from various systems within the hospital and contain a wealth of information. Today, several frameworks facilitate clinical research using EHR data, such as the i2b2 framework from MGB and Harvard, and VINCI, maintained by Veterans Affairs. These frameworks provide both data and analytical tools to support research and analysis.
EHR data is typically stored in databases. In this tutorial, we will be working with MIMIC data stored as flat files. MIMIC-IV data comes in “modules” based on the source of data generation. The following five modules are currently available to users:
- hosp - Hospital-level data for patients, including labs, microbiology, and electronic medication administration records.
- icu - ICU-level data, including event tables (e.g., chart events) identical in structure to MIMIC-III.
- ed - Data from the emergency department.
- cxr - Lookup tables and metadata from MIMIC-CXR, enabling linkage to MIMIC-IV.
- note - De-identified free-text clinical notes.
We will be using the MIMIC-IV hosp module and note module for this tutorial.
Where is the data stored and how is it organized?
The hosp module consists of all data acquired from the hospital-wide electronic health record. This includes patient information, lab measurements, microbiology, medications administered, and billed diagnoses. More information on the tables in this module can be found in the MIMIC documentation here.
Depending on the needs of your research or analysis, you will have multiple tables of interest. Below we list the tables we will be using for this tutorial.
1. Data Dictionaries
These tables contain definitions for the medical codes used in the EHR data. Any table that starts with d_ is a data dictionary, including:
- d_hcpcs: Provides descriptions of CPT codes.
- d_icd_diagnoses: Provides descriptions of ICD-9/ICD-10 billed diagnoses.
- d_icd_procedures: Provides descriptions of ICD-9/ICD-10 billed procedures.
- d_labitems: Provides descriptions of all lab items.
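To illustrate how a data dictionary is used, the sketch below turns a few rows shaped like d_icd_diagnoses into a lookup from code to description. It is a minimal example under stated assumptions: the column names `icd_code`, `icd_version`, and `long_title` are assumed, the rows are toy data typed in for illustration rather than read from the real file, and `load_code_dictionary` is a hypothetical helper. It uses only the standard library; the same lookup could be built with pandas.

```python
import csv
import io

# Toy rows in the shape of d_icd_diagnoses (assumed columns:
# icd_code, icd_version, long_title); not read from the real file.
D_ICD_DIAGNOSES_CSV = """icd_code,icd_version,long_title
4019,9,Unspecified essential hypertension
I10,10,Essential (primary) hypertension
"""


def load_code_dictionary(csv_file):
    """Map (icd_code, icd_version) -> long_title."""
    reader = csv.DictReader(csv_file)
    return {
        (row["icd_code"], int(row["icd_version"])): row["long_title"]
        for row in reader
    }


icd_titles = load_code_dictionary(io.StringIO(D_ICD_DIAGNOSES_CSV))
print(icd_titles[("I10", 10)])  # prints: Essential (primary) hypertension
```

The ICD version is kept in the key because a code string is only meaningful together with the ICD version it belongs to.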
2. Event Tables
These tables contain events recorded in the hospital.
Diagnosis Data
- diagnoses_icd: Billed ICD diagnosis codes for hospitalizations.
Medication Data
- prescriptions: Provides information about prescribed medications.
Laboratory Data
- labevents: Laboratory measurements sourced from patient-derived specimens.
Procedure Data
- hcpcsevents: Billed events occurring during hospitalization, including CPT codes.
- procedures_icd: Billed procedures for patients during their hospital stay.
3. Patient Metadata
These tables provide information about patients during their hospital stays:
- admissions: Detailed information about hospital stays.
- transfers: Detailed information about patients’ unit transfers.
- patients: Provides information on patients’ gender, age, and date of death.
What are the key elements of interest?
- Unique Patient ID: An ID that uniquely identifies a patient. In MIMIC data, every patient gets a unique ID called subject_id.
- Event/Observation: EHR data largely consists of events or observations recorded over time.
- Event/Observation Type: The type of event or observation that is recorded. This can be Diagnosis, Medication, Lab, or Procedure.
- Time: The time an event happened or an observation was made. However, the table above has no time component. We will need to identify where the date component is stored and link it with that table. We will discuss that in the following section.
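The four elements above can be pictured as one record per event. The following is a minimal sketch, not the tutorial's own code: the `Event` class and the `diagnosis_events` helper are hypothetical, the rows are made up, and using the admission time as the event time is an illustrative assumption about where the date component comes from.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    subject_id: int   # unique patient ID
    event_type: str   # e.g. "Diagnosis", "Medication", "Lab", "Procedure"
    code: str         # the recorded code or item
    time: datetime    # when the event happened or was observed


# Made-up rows for illustration only.
diagnosis_rows = [
    {"subject_id": 1, "hadm_id": 100, "icd_code": "I10"},
    {"subject_id": 1, "hadm_id": 101, "icd_code": "4019"},
]
# Assumed lookup from hadm_id to admission time.
admit_time_by_hadm = {
    100: datetime(2150, 1, 5, 8, 30),
    101: datetime(2151, 3, 2, 14, 0),
}


def diagnosis_events(rows, admit_times):
    """Attach a time to each diagnosis row by looking up its hadm_id."""
    return [
        Event(r["subject_id"], "Diagnosis", r["icd_code"], admit_times[r["hadm_id"]])
        for r in rows
    ]


events = diagnosis_events(diagnosis_rows, admit_time_by_hadm)
for event in events:
    print(event)
```

Keeping every event type in the same four-field shape makes it straightforward to stack diagnoses, medications, labs, and procedures into one long table later.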
Integrating EHR data
How are the data components linked?
In EHR data, information can be spread across multiple tables or datasets. You will need to link data from different tables to create a comprehensive dataset that fits your research needs. Identifiers are the key to linking data. MIMIC data uses three major identifiers:
- SUBJECT_ID: This is a patient-level identifier. The patients table contains demographics for each unique patient.
- HADM_ID: This is a hospital-admission-level identifier provided in the hosp module. Each hospital admission for a patient gets a unique ID.
- STAY_ID: All ICU admissions within 24 hours of each other are grouped and assigned a single identifier.
If a patient has multiple admissions to the hospital, there will be multiple hadm_id values but only a single subject_id.
If a patient has multiple ICU admissions during the same hospital stay, there will be multiple stay_id values but a single hadm_id and a single subject_id.
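The nesting of these identifiers can be sketched as a small tree: stays sit under admissions, and admissions sit under patients. The example below is illustrative only; the ID values are made up and `build_hierarchy` is a hypothetical helper, not part of MIMIC or the tutorial notebooks.

```python
from collections import defaultdict

# Made-up (subject_id, hadm_id, stay_id) triples for illustration only.
icu_stays = [
    (1, 100, 9001),
    (1, 100, 9002),  # second ICU stay within the same hospital admission
    (1, 101, 9003),  # later admission for the same patient
    (2, 200, 9004),
]


def build_hierarchy(triples):
    """Nest stay_ids under hadm_ids under subject_ids."""
    tree = defaultdict(lambda: defaultdict(list))
    for subject_id, hadm_id, stay_id in triples:
        tree[subject_id][hadm_id].append(stay_id)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {subject: dict(admissions) for subject, admissions in tree.items()}


hierarchy = build_hierarchy(icu_stays)
print(hierarchy)
# {1: {100: [9001, 9002], 101: [9003]}, 2: {200: [9004]}}
```

Here patient 1 has two hadm_id values under a single subject_id, and admission 100 has two stay_id values under a single hadm_id, matching the two cases described above.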
For the full notebook with code, please visit here.