A Medallion Architecture ETL Pipeline — a system that takes raw data, moves it through three structured layers of increasing quality, and produces clean, analytics-ready tables that a business can query and visualize.
The name "medallion" comes from the Bronze/Silver/Gold layer structure: each layer represents a higher level of data quality and business readiness. The pattern was popularized by Databricks and is widely used in production data platforms.
Raw healthcare claims data is messy and hard to analyze — inconsistent formats, no categorization, and no structure for reporting. Without a pipeline, every analyst who wants to answer a business question has to clean the data themselves, from scratch, every time.
This pipeline solves that by building a single source of truth that any analyst can query directly — without touching a single CSV.
Bronze layer: 1,338 raw healthcare claims loaded from CSV into DuckDB with Python — exactly as-is, with zero transformations. The source truth is preserved permanently, so nothing is ever lost or overwritten.
Silver layer: dbt models that apply business rules to the Bronze data — bucketing ages into life stages, classifying BMI into clinical categories, normalizing smoker flags to booleans, and flagging high-value claims over $30,000. All six data quality tests pass.
Gold layer: Kimball-style fact and dimension tables. dim_patient holds who the patient is; fact_claims holds what happened — the actual claim amounts and measures. This is the layer analysts query.
Smokers cost 3.8× more in healthcare claims than non-smokers — derived from 1,338 real insurance records processed through the full pipeline.
Built entirely with free, open-source tools used in production data engineering roles.
Python handles raw ingestion. DuckDB acts as the local analytical database — fast, free, and runs entirely on your laptop. dbt manages all SQL transformations with version control, tests, and documentation built in. Tableau Public visualizes the Gold layer output.
Schema on read vs schema on write — why the Bronze layer lands raw data without transformation, and why preserving source truth is a first principle of data engineering.
dbt fundamentals — writing modular SQL models with ref(), running automated data quality tests, and documenting transformations inline.
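As a hypothetical example of such a model, a Silver staging file might look like the sketch below. The model name, `bronze_claims` source, and columns are assumed for illustration, not taken from the project; the point is that `ref()` lets dbt resolve upstream tables and build the dependency graph:

```sql
-- models/silver/stg_claims.sql (hypothetical)
-- ref() resolves the Bronze model and records the lineage edge,
-- so dbt builds models in dependency order and tests each one.
select
    cast(age as integer)    as age,
    lower(smoker) = 'yes'   as is_smoker,
    cast(charges as double) as charges
from {{ ref('bronze_claims') }}
```

Quality checks such as `not_null` or `accepted_values` tests on these columns would then be declared in the model's schema.yml and run with `dbt test`.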
Kimball dimensional modeling — separating who (dimension tables) from what happened (fact tables), and why this structure makes analytics faster and more reliable.
Data pipeline structure — how to organize a real project with clean folder structure, version control from day one, and a README that tells the story clearly.
End-to-end delivery — going from a raw CSV to a live, published dashboard entirely with free tools, and understanding every step of the data journey in between.