A Medallion Architecture ETL Pipeline — a system that takes raw data, moves it through three structured layers of increasing quality, and produces clean, analytics-ready tables that a business can query and visualize.
The name "medallion" comes from the Bronze/Silver/Gold layer structure: each layer represents a higher level of data quality and business readiness. The pattern was popularized by Databricks and is widely used in production data platforms.
Raw healthcare claims data is messy and hard to analyze — inconsistent formats, no categorization, and no structure for reporting. Without a pipeline, every analyst who wants to answer a business question has to clean the data themselves, from scratch, every time.
This pipeline solves that by building a single source of truth that any analyst can query directly — without touching a single CSV.
Bronze layer: 1,338 raw healthcare claims loaded from CSV into DuckDB with Python — exactly as-is, with zero transformations. The source truth is preserved permanently, so nothing is ever lost or overwritten.
Silver layer: dbt models that apply business rules to the Bronze data — bucketing ages into life stages, classifying BMI into clinical categories, normalizing smoker flags to booleans, and flagging high-value claims over $30,000. All six data quality tests pass.
Gold layer: Kimball-style fact and dimension tables. dim_patient holds who the patient is; fact_claims holds what happened — the actual claim amounts and measures. This is the layer analysts query.
Smokers cost 3.8× more in healthcare claims than non-smokers — derived from 1,338 real insurance records processed through the full pipeline.
Built entirely with free, open-source tools used in production data engineering roles.
Python handles raw ingestion. DuckDB acts as the local analytical database — fast, free, and runs entirely on your laptop. dbt manages all SQL transformations with version control, tests, and documentation built in. Tableau Public visualizes the Gold layer output.
Schema on read vs schema on write — why the Bronze layer lands raw data without transformation, and why preserving source truth is a first principle of data engineering.
dbt fundamentals — writing modular SQL models with ref(), running automated data quality tests, and documenting transformations inline.
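As a hypothetical example of such a model, a Silver staging file might look like the sketch below. The model name, `bronze_claims` source, and columns are assumed for illustration, not taken from the project; the point is that `ref()` lets dbt resolve upstream tables and build the dependency graph:

```sql
-- models/silver/stg_claims.sql (hypothetical)
-- ref() resolves the Bronze model and records the lineage edge,
-- so dbt builds models in dependency order and tests each one.
select
    cast(age as integer)    as age,
    lower(smoker) = 'yes'   as is_smoker,
    cast(charges as double) as charges
from {{ ref('bronze_claims') }}
```

Quality checks such as `not_null` or `accepted_values` tests on these columns would then be declared in the model's schema.yml and run with `dbt test`.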
Kimball dimensional modeling — separating who (dimension tables) from what happened (fact tables), and why this structure makes analytics faster and more reliable.
Data pipeline structure — how to organize a real project with clean folder structure, version control from day one, and a README that tells the story clearly.
End-to-end delivery — going from a raw CSV to a live, published dashboard entirely with free tools, and understanding every step of the data journey in between.