2024 · NLP & Data Engineering

COVID-19 Records: NLP Pipeline

A scalable pipeline turning unstructured medical notes into clean, structured datasets for analysis.

PythonPySparkFAISSLLMsPydanticETL

Overview

Medical encounter notes are written for people, not databases. This pipeline reads unstructured COVID 19 patient records and produces clean, validated, structured data that can be analyzed at scale.

What it does

Large language model driven extraction with Pydantic validation to standardize fields.
FAISS semantic search to keep medication references consistent across free text.
Apache Spark analysis of comorbidities, medication frequencies, and demographic trends, with visualizations that supported the findings.

How it is built

Python drives the pipeline end to end, with PySpark for scale, FAISS for semantic matching, Pydantic for validation, and Parquet for storage. An LLM handles the initial extraction step.