PEHRT: A Common Pipeline for Electronic Health Record (EHR) Data Harmonization for Translational Research

Overview

PEHRT assumes that researchers are familiar with their EHR data sources, including documentation, data structure, and coding systems. It supports both structured data (e.g., patient IDs, diagnoses, medications, lab results) and unstructured data stored as free text (e.g., clinical notes, radiology reports). The data must be housed in a relational database but need not follow a common data model (CDM).

The PEHRT framework integrates structured and unstructured healthcare data across multiple institutions through three modules: (i) data cleaning, (ii) representation learning, and (iii) automated harmonization. These processes support downstream applications such as phenotyping, risk prediction, and federated learning.

Equipment and Software Requirements

Computing and Storage Requirements for EHR Data Processing

The required resources depend on the dataset’s scale:

  • Moderate-sized datasets (e.g., ~100,000 patients): A machine with 64GB RAM and 6-12 cores is typically sufficient.
  • Large datasets (millions of patients): Consider using either:
    • Multiple nodes with 64GB RAM each, or
    • A single node with 1TB of storage and 128 cores.

Estimate the minimum disk space and RAM from the dataset size before cleaning and processing begin; a rough check of the in-memory footprint is sketched below.
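
As a hedged illustration (not part of the PEHRT protocol itself), the per-row memory footprint of a sampled extract can be extrapolated with pandas; the file name and cohort row count below are hypothetical.

    import pandas as pd

    # Load a sample of an extracted EHR table (hypothetical file name).
    sample = pd.read_csv("diagnoses.csv", nrows=100_000)

    # Deep memory usage counts the bytes actually held by string columns.
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

    # Extrapolate to the full table to gauge whether it fits in, say, 64GB of RAM.
    total_rows = 250_000_000  # hypothetical row count for a large cohort
    print(f"~{bytes_per_row:.0f} bytes/row, ~{bytes_per_row * total_rows / 1e9:.0f} GB in memory")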

Software Recommendations for EHR Data Processing

  • Data Extraction and Cleaning:
    Use Python to extract and clean data from SQL databases; a minimal extraction sketch appears after this list.

  • NLP-based Processing of Unstructured Clinical Notes:
    Tools such as NILE, cTAKES, HITEx, MedTagger, MetaMap, OBO Annotator, and Stanford CoreNLP are well suited to tasks such as the following (a sketch for post-processing their output appears after this list):

    • Named Entity Recognition (NER): Identifying clinical conditions, symptoms, medications, lab tests, etc.
    • Semantic Analysis: Extracting relevant clinical concepts from notes.
  • Computational Libraries for Post-Processing:
    Depending on the dataset size, consider Python libraries such as Pandas for standard in-memory processing or Dask for parallel, out-of-core processing of large datasets; a Dask sketch appears after this list.
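
A minimal sketch of the extraction and cleaning step, assuming a SQLAlchemy-compatible database URI and a hypothetical diagnoses table; table, column, and connection details will differ by institution.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string; substitute your institution's database URI.
    engine = create_engine("postgresql://user:password@host:5432/ehr")

    # Extract one structured domain (diagnoses) into a dataframe.
    query = """
        SELECT patient_id, diagnosis_code, diagnosis_date
        FROM diagnoses
        WHERE diagnosis_date >= '2010-01-01'
    """
    dx = pd.read_sql(query, engine, parse_dates=["diagnosis_date"])

    # Light cleaning: normalize codes, drop empty rows and exact duplicates.
    dx["diagnosis_code"] = dx["diagnosis_code"].str.strip().str.upper()
    dx = dx.dropna(subset=["patient_id", "diagnosis_code"]).drop_duplicates()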
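
The NLP tools above generally emit one concept mention per row (for example, a UMLS CUI tied to a note identifier, with a negation flag). A sketch of rolling such output up to patient-level concept counts, assuming a hypothetical CSV layout:

    import pandas as pd

    # Hypothetical NLP output layout: one extracted mention per row,
    # with columns patient_id, note_id, cui, negated (boolean).
    mentions = pd.read_csv("nlp_output.csv")

    # Keep affirmative mentions, then count mentions per patient and concept.
    affirmed = mentions[~mentions["negated"].astype(bool)]
    counts = (
        affirmed.groupby(["patient_id", "cui"])
        .size()
        .rename("mention_count")
        .reset_index()
    )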
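
Finally, a minimal illustration of the Pandas/Dask choice: the same group-by API runs in memory with pandas or, as below, in parallel across partitioned files with Dask when the data exceed RAM. File names are hypothetical.

    import dask.dataframe as dd

    # Read many extracted files as one lazy, partitioned dataframe.
    dx = dd.read_csv("diagnoses_*.csv", dtype={"diagnosis_code": "string"})

    # The pandas-style API is unchanged; work is spread across partitions.
    code_counts = dx.groupby("diagnosis_code").size()

    # .compute() triggers parallel execution and returns a pandas Series.
    print(code_counts.compute().sort_values(ascending=False).head(10))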