PEHRT: A Common Pipeline for Electronic Health Record (EHR) Data Harmonization for Translational Research
Overview
PEHRT assumes that researchers are familiar with their EHR data sources, including documentation, data schema, and coding systems. PEHRT supports both structured data (e.g., coded diagnoses and procedures, prescribed medications, laboratory results) and unstructured text (e.g., clinical notes, radiology reports). EHR data may reside in relational databases or in flat files (CSV/Parquet/JSON) and do not need to conform to a Common Data Model. At minimum, PEHRT requires a unique patient identifier used consistently across tables, a timestamp for each event, and the recorded event along with the metadata needed to interpret it (codes together with their coding system, or note text together with the note type).
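As a concrete illustration, the minimal sketch below (with hypothetical column and file names) checks that a structured extract carries these minimum fields before any downstream processing; adapt the names to your own schema.

```python
import pandas as pd

# Minimal sketch (hypothetical column names): verify that a structured-EHR
# extract carries the minimum fields PEHRT expects -- a patient identifier,
# an event timestamp, the recorded code, and its coding system.
REQUIRED_COLUMNS = {"patient_id", "event_date", "code", "code_system"}

def check_minimum_fields(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["event_date"])
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    # Basic sanity checks: no null identifiers or timestamps.
    if df["patient_id"].isna().any() or df["event_date"].isna().any():
        raise ValueError("Null patient identifiers or event timestamps found")
    return df

# Example usage (hypothetical file):
# diagnoses = check_minimum_fields("diagnoses.csv")
```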
The PEHRT framework integrates structured and unstructured healthcare data across multiple institutions through two modules: (i) data cleaning and (ii) representation learning. These processes support downstream applications such as phenotyping, risk prediction, and federated learning.
Computing and Storage Requirements for EHR Data Processing
The required resources depend on the dataset’s scale. The following are some recommended setups:
| Dataset Scale | Node Setup | CPU Cores | RAM | Storage | Notes |
|---|---|---|---|---|---|
| Moderate (~100k patients) | Single node | 8–16 | 32–64 GB | ≥3× raw data size | R or Python (Pandas) or a similar framework is sufficient. |
| Large (1–3 million patients) | Single powerful node | 32–64 | 256–512 GB | 3–5× raw data size | Process data in patient batches; consider Polars or Dask for parallel processing. |
| Large (1–3 million patients) | Distributed cluster (5–10 nodes) | 8–32 per node | 64–128 GB per node | 3–5× raw data size | Prefer Parquet storage (see the conversion sketch below the table). Use Spark, Dask, or Polars. Save intermediate results and perform data quality checks. |
| Very large (>5 million patients) | Single HPC node (large memory) | ~128 | ~1 TB | 5× raw data size | Prefer this option only if a cluster is not available. Use a big-data framework such as Spark or Dask on a single high-memory server. Save intermediate data, add data quality checks, process in batches, and use HPC scratch space. |
| Very large (>5 million patients) | Distributed HPC cluster (10–20 nodes) | 32–64 per node | 128–256 GB per node | 5× raw data size | Use Parquet and a scalable engine (e.g., Spark or Dask). Checkpoint intermediate data and add data quality checks. Process in batches as data grows. Use HPC scratch space. |
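As a lightweight illustration of the Parquet and batching recommendations above, the sketch below (assuming a hypothetical single large CSV and pandas with pyarrow installed) converts the file to Parquet in fixed-size chunks so it never has to fit in memory at once.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def csv_to_parquet(csv_path: str, parquet_path: str, chunksize: int = 1_000_000) -> None:
    """Convert a large CSV to a single Parquet file in memory-bounded chunks."""
    writer = None
    # Consider passing explicit dtypes to read_csv so every chunk
    # has the same schema as the first one.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()

# Example usage (hypothetical paths):
# csv_to_parquet("diagnoses.csv", "diagnoses.parquet")
```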
Software Recommendations for EHR Data Processing
Data Extraction and Cleaning:
Use Python for extracting and cleaning data from SQL databases.
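A minimal sketch of such an extraction, assuming a hypothetical PostgreSQL connection string and table; pandas streams the query result in chunks and writes each chunk to Parquet.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name; replace with your own.
engine = create_engine("postgresql://user:password@host:5432/ehr")

query = "SELECT patient_id, event_date, code, code_system FROM diagnoses"

# Stream the result set in chunks rather than loading it all at once,
# writing each chunk to its own Parquet file for downstream processing.
for i, chunk in enumerate(pd.read_sql_query(query, engine, chunksize=500_000)):
    chunk.to_parquet(f"diagnoses_part{i:04d}.parquet", index=False)
```

NLP-based Processing of Unstructured Clinical Notes: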
Tools like NILE, cTAKES, HITEx, MedTagger, MetaMap, OBO Annotator, or Stanford CoreNLP are ideal for tasks such as:
- Named Entity Recognition (NER): Identifying clinical conditions, symptoms, medications, lab tests, etc.
- Semantic Analysis: Extracting relevant clinical concepts from notes.
Computational Libraries for Post-Processing:
Depending on the dataset size, consider Python libraries like Pandas for standard processing or Dask/Polars for parallel processing to efficiently handle large datasets.
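For example, a minimal Polars sketch (assuming the hypothetical Parquet files and column names used above) that aggregates patient-level code counts lazily and in parallel:

```python
import polars as pl

# Lazily scan the Parquet parts produced upstream, filter to one coding
# system, and count events per patient and code without loading everything
# into memory at once. Column and file names are hypothetical.
counts = (
    pl.scan_parquet("diagnoses_part*.parquet")
    .filter(pl.col("code_system") == "ICD10CM")
    .group_by(["patient_id", "code"])
    .agg(pl.len().alias("n_events"))
    .collect()
)
counts.write_parquet("patient_code_counts.parquet")
```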