Selected Work
2024 · NLP & Data Engineering

COVID-19 Records: NLP Pipeline

A scalable pipeline turning unstructured medical notes into clean, structured datasets for analysis.

PythonPySparkFAISSLLMsPydanticETL
COVID-19 Records: NLP Pipeline

Overview

Medical encounter notes are written for people, not databases. This pipeline reads unstructured COVID 19 patient records and produces clean, validated, structured data that can be analyzed at scale.

What it does

  • Large language model driven extraction with Pydantic validation to standardize fields.
  • FAISS semantic search to keep medication references consistent across free text.
  • Apache Spark analysis of comorbidities, medication frequencies, and demographic trends, with visualizations that supported the findings.

How it is built

Python drives the pipeline end to end, with PySpark for scale, FAISS for semantic matching, Pydantic for validation, and Parquet for storage. An LLM handles the initial extraction step.