Project Report

Healthcare Claims Medallion Pipeline

Python · DuckDB · dbt · 1,338 claims · Bronze / Silver / Gold · Tableau Public

What is this project?

A Medallion Architecture ETL Pipeline — a system that takes raw data, moves it through three structured layers of increasing quality, and produces clean, analytics-ready tables that a business can query and visualize.

The name "medallion" comes from the Bronze/Silver/Gold layer structure. Each layer represents a higher level of data quality and business readiness. The pattern was popularized by Databricks and is widely used in production data platforms.

📄 Raw CSV (source) → 🥉 Bronze (as-is ingest) → 🥈 Silver (dbt cleaning) → 🥇 Gold (dimensional model) → 📊 Tableau (dashboard)

What problem does it solve?

Raw healthcare claims data is messy and hard to analyze — inconsistent formats, no categorization, and no structure for reporting. Without a pipeline, every analyst who wants to answer a business question has to clean the data themselves, from scratch, every time.

This pipeline solves that by building a single source of truth that any analyst can query directly — without touching a single CSV.

What was built?

🥉 BRONZE · Raw Ingestion

Loaded 1,338 raw healthcare claims from CSV into DuckDB using Python — exactly as-is, with zero transformations. Preserves the source truth permanently so nothing is ever lost or overwritten.

🥈 SILVER · Cleaning & Standardization

dbt models that apply business rules to the Bronze data: bucketing ages into life stages, classifying BMI into clinical categories, normalizing smoker flags to booleans, and flagging high-value claims over $30,000. All six dbt data quality tests pass.

🥇 GOLD · Dimensional Model

Kimball-style fact and dimension tables. dim_patient holds who the patient is. fact_claims holds what happened — the actual claim amounts and measures. This is what analysts query.

What did the data reveal?

Business insight from the Gold layer
Smoker avg. claim: $32,108
Non-smoker avg. claim: $8,415
Cost difference: 3.8×

Smokers average 3.8× the claim costs of non-smokers — derived from 1,338 real insurance records processed through the full pipeline.

How was it built?

Built entirely with free, open-source tools used in production data engineering roles.

Python 3.13 · DuckDB · dbt-core 1.11 · dbt-duckdb · pandas · Tableau Public · Git + GitHub

Python handles raw ingestion. DuckDB acts as the local analytical database — fast, free, and runs entirely on your laptop. dbt manages all SQL transformations with version control, tests, and documentation built in. Tableau Public visualizes the Gold layer output.

What was learned?

01 · Schema on read vs schema on write — why the Bronze layer lands raw data without transformation, and why preserving source truth is a first principle of data engineering.

02 · dbt fundamentals — writing modular SQL models with ref(), running automated data quality tests, and documenting transformations inline.

03 · Kimball dimensional modeling — separating who (dimension tables) from what happened (fact tables), and why this structure makes analytics faster and more reliable.

04 · Data pipeline structure — how to organize a real project with a clean folder structure, version control from day one, and a README that tells the story clearly.

05 · End-to-end delivery — going from a raw CSV to a live, published dashboard entirely with free tools, and understanding every step of the data journey in between.

Project resources