Complete Beginner's Guide

Building a RAG System
from Scratch

A step-by-step walkthrough of how to build a production-quality AI documentation assistant using Retrieval-Augmented Generation, a local language model, and a vector database. Every concept explained as if you've never seen any of this before.

What is RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique where you give an AI model specific documents to read before it answers a question — instead of letting it answer from its general training knowledge.

Think about it this way: if you ask ChatGPT "how do I authenticate with your company's API?", it has no idea — it was never trained on your internal docs. It will either say "I don't know" or make something up (hallucinate).

RAG solves this. When a developer asks a question, the system first searches your documentation for the relevant paragraphs, then feeds those paragraphs to the AI model and says "answer using only this." The model reads the retrieved text and generates a grounded answer — one that comes from your actual documentation, not from its imagination.

[Figure] The complete RAG flow — question in, cited answer out: the developer's question becomes an embedding vector, ChromaDB finds matching chunks, and Ollama generates an answer with source citations.

Analogy

Imagine an open-book exam. The student (the AI model) doesn't need to memorize the textbook. Instead, they look up the relevant pages first (retrieval), read them (context), and write their answer based on what they just read (generation). That's RAG.

Why not just fine-tune the model?

Fine-tuning means retraining the AI model on your documentation so it "remembers" the content. This sounds simpler but has serious problems for documentation search: the knowledge is baked into the model's weights, so every doc update means another slow, expensive retraining run, and the model can't cite which document an answer came from.

[Figure] Fine-tuning vs RAG — why RAG is the right choice for documentation: fine-tuning bakes knowledge into model weights, while RAG fetches docs at query time.

Architecture Overview

The system has two major flows — one for ingesting documents, one for answering questions. Here's the full picture of how every component connects:

[Figure] Full system architecture — every phase mapped to its component: Developer, Chat UI, FastAPI, RAG pipeline, vector retrieval, ChromaDB, embeddings, ingestion pipeline, and documentation files, all inside a Docker container.

Phase 1

Project Setup and Environment

Before writing any logic, make it runnable.

What we did

Created the entire project skeleton — folders, virtual environment, dependencies, configuration, and a basic health check API endpoint that proves the system starts.

Why this matters

A production project isn't just code — it's a system of files that communicate intent. The folder layout you choose on day one determines how maintainable, testable, and deployable your project will be. We established these foundations up front.

Key files created

File | Purpose
--- | ---
pyproject.toml | All dependencies and project config
src/config.py | Centralized settings loaded from .env file
src/api/main.py | FastAPI app with /health endpoint
.env | Local configuration values (never committed to git)
tests/test_health.py | 4 tests proving the API starts correctly

Why Pydantic Settings?
Configuration is managed by Pydantic Settings, which reads values from environment variables or a .env file. You change behavior by editing .env, not by touching code. If a required variable is missing, the app fails immediately with a clear error.
Phase 2

Document Ingestion Pipeline

Opening files, reading text, attaching labels.

What we did

Built a DocumentLoader class that opens documentation files from a folder, reads their text content, and wraps each one in a Document object with metadata — a label saying where the text came from.

What is a Document object?

In LangChain, the unit of content is a Document. It has exactly two fields:

Document(
    page_content = "Use Bearer tokens for all API requests...",
    metadata     = {
        "source":     "data/raw/authentication.md",
        "file_name":  "authentication.md",
        "file_type":  "md",
        "load_timestamp": "2026-03-18T17:45:59"
    }
)
[Figure] The ingestion flow — files go in, labelled Documents come out: .md, .pdf, .txt, and unknown files enter DocumentLoader, which routes by file extension, enriches with metadata, and outputs a List of Documents to the Phase 3 chunker.

Why metadata matters

Content alone is not enough. Every piece of text needs metadata — where it came from, what format it was, when it was loaded. That metadata is what lets the final system say "this answer comes from authentication.md" rather than returning anonymous text.

Analogy

Think of a library intake desk. Every book that arrives gets catalogued with its title, author, and category before it goes on the shelf. The catalogue card (metadata) is just as important as the book itself — without it, you can never find the book again.

Supported file types

Extension | Loader | What it handles
--- | --- | ---
.md | UnstructuredMarkdownLoader | Markdown documentation
.txt | TextLoader | Plain text files
.pdf | PyPDFLoader | PDF documents (one Document per page)

Incremental Ingestion

A naive approach to ingestion would re-process every file every time you run the pipeline — leading to duplicate chunks in the vector store, wasted computation, and polluted search results. This system uses incremental ingestion instead.

Before processing, the pipeline queries ChromaDB's metadata index for all file_name values already stored. Any file that has been previously chunked and embedded is skipped automatically. Only new, previously unseen documents get processed.

Three Ingestion Modes
Incremental (default) — queries ChromaDB for already-indexed file names, skips them, and only processes new documents. No duplicate vectors, no redundant embedding computation.

Force re-index (--force) — deletes the entire vectorstore and re-embeds the full corpus from scratch. Use when the embedding model changes or the vectorstore is corrupted.

Single-file re-index (--reindex filename) — deletes all stale chunks for one specific file, then re-chunks and re-embeds it. Use when a document has been edited and the old vectors are out of date.
# Default — only new files get processed
python -m src.ingestion.run_ingestion

# Force — delete vectorstore, re-embed everything
python -m src.ingestion.run_ingestion --force

# Reindex one file — delete old chunks, re-embed
python -m src.ingestion.run_ingestion --reindex authentication.md

This means scaling the document corpus is a single-step operation — drop the file into data/raw/ and run python -m src.ingestion.run_ingestion. The pipeline handles the rest.

Phase 3

Document Chunking

Splitting documents into searchable pieces.

The problem

After Phase 2, each document is one big block of text. If authentication.md is 500 words long and a developer asks "what happens when my token expires?", the system would retrieve the entire 500-word file. That's wasteful and imprecise.

The solution

Chunking splits each document into small, overlapping pieces. Instead of retrieving a whole file, the system retrieves exactly the 2-3 sentences that answer the question.

[Figure] How overlap works — consecutive chunks share text at their boundaries: one document split into three overlapping chunks (chars 0-200, 150-350, 300-500), with the overlap regions highlighted.

How RecursiveCharacterTextSplitter works

RecursiveCharacterTextSplitter is LangChain's recommended general-purpose splitter. It doesn't blindly cut every 512 characters; it works down this priority list to find the most natural place to cut:

  1. Try to cut at a blank line (paragraph boundary)
  2. If chunk still too big, try a line break
  3. If still too big, try a period + space (sentence boundary)
  4. If still too big, try a space (word boundary)
  5. Last resort: cut at any character

Settings

chunk_size    = 512 characters   (how big each piece is)
chunk_overlap = 50 characters    (how much consecutive pieces share)
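To see what the size and overlap settings mean mechanically, here is a deliberately naive splitter that only does the fixed-size-with-overlap part. The real RecursiveCharacterTextSplitter additionally prefers the paragraph, line, and sentence boundaries listed above:

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Fixed-size chunking with overlap -- boundary-blind on purpose.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share their boundary text.
    """
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With chunk_size=512 and chunk_overlap=50, each chunk repeats the last 50 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.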
Phase 4

Embeddings and Vector Database

Converting text into searchable numbers.

What is an embedding?

An embedding converts text into a list of numbers (a "vector") so that a computer can understand meaning, not just keywords. Each chunk becomes a list of 384 numbers.

The magic: chunks with similar meaning produce similar numbers. So "log in" and "authenticate" end up close together in number-space, even though the words are completely different.
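"Close together in number-space" can be made precise with cosine similarity, the measure vector databases typically use. The three-dimensional vectors below are toy values for illustration; real embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    """1.0 = vectors point the same way (same meaning), near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

log_in       = [0.9, 0.1, 0.0]   # toy embedding for "log in"
authenticate = [0.8, 0.2, 0.1]   # toy embedding for "authenticate"
pizza        = [0.0, 0.1, 0.9]   # toy embedding for "order a pizza"

# Similar meanings score high, unrelated ones score low
assert cosine_similarity(log_in, authenticate) > 0.9
assert cosine_similarity(log_in, pizza) < 0.1
```

This is the whole trick: different words, similar meaning, nearby vectors.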

[Figure] Text → numbers → similarity search — the core mechanism of semantic retrieval: SentenceTransformers converts each chunk into a 384-number vector stored in ChromaDB; auth-related chunks produce similar numbers, API-endpoint chunks different ones.

Analogy

Think of GPS coordinates. "The coffee shop on 5th Avenue" and "the café at 40.758, -73.985" are different descriptions of the same place — but the GPS numbers let you calculate how close they are to other locations. Embeddings do the same thing for meaning instead of geography.

ChromaDB: the vector database

ChromaDB stores all your embedded chunks on disk. For each chunk, it stores three things: the original text, all the metadata (file name, source, chunk index), and the 384 numbers. When you search, ChromaDB compares your question's numbers against every stored chunk's numbers and returns the closest matches.

Phase 5

Retrieval Pipeline

Finding the right chunks for any question.

[Figure] How similarity search works — question in, relevant chunks out: the question is embedded into 384 numbers, ChromaDB compares them against all stored vectors, and the top k=3 chunks come back ranked by similarity score.

How it works step by step

  1. The question is converted to 384 numbers using the same embedding model
  2. ChromaDB compares those numbers against every stored chunk's numbers
  3. The 3 chunks whose numbers are closest (most similar meaning) are returned
  4. Each result comes with a similarity score — higher means more relevant

Example retrieval result

Result 1 — score: 0.611 | authentication.md chunk 0   ← most relevant
Result 2 — score: 0.269 | getting_started.txt chunk 0
Result 3 — score: 0.183 | authentication.md chunk 1
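Steps 2 and 3 above, compare and rank, are just a sort over similarity scores. A toy version using dot products over pre-normalised vectors (top_k is a hypothetical name, not the project's code):

```python
def top_k(query_vec, stored_chunks, k=3):
    """Rank stored chunks by similarity to the query vector, best first.

    `stored_chunks` is a list of (chunk_text, vector) pairs; vectors are
    assumed unit-length, so a dot product acts as a cosine similarity.
    """
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]
```

ChromaDB does exactly this, just over hundreds of thousands of 384-dimensional vectors with optimised index structures instead of a plain sort.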
No LLM involved yet
Phase 5 is pure search. The retriever is like a librarian — it doesn't answer your question, it just finds the right pages and hands them to you. The LLM in Phase 6 is the expert who reads those pages and writes the answer.
Phase 6

LLM Integration with Ollama

Adding the brain that reads and answers.

What is Ollama?

Ollama is a tool that runs large language models locally on your machine. It works like a local server — your code sends it text, and it sends back a response. No internet required after the initial model download. No API keys. Your data never leaves your laptop.

The prompt template

We don't just send the question to Mistral. We send a carefully structured prompt:

You are a helpful developer documentation assistant.
Answer the developer's question using ONLY the information
provided in the context below.
If the answer is not in the context, say "I don't have enough
information in the documentation to answer this question."
Always cite which document your answer comes from.

Context:
{retrieved chunks go here}

Question: {developer's question}

Answer:

The instruction "using ONLY the information provided" is what prevents hallucination. Without it, Mistral would blend retrieved context with its training knowledge.
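Assembling that prompt is plain string formatting. A sketch, where build_prompt and the per-chunk source labels are assumptions and the template text is the one shown above:

```python
PROMPT_TEMPLATE = """You are a helpful developer documentation assistant.
Answer the developer's question using ONLY the information
provided in the context below.
If the answer is not in the context, say "I don't have enough
information in the documentation to answer this question."
Always cite which document your answer comes from.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(chunks, question):
    """Label each retrieved chunk with its source file, then fill the template."""
    context = "\n\n".join(
        f"[source: {c['file_name']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Labelling each chunk with its file name is what lets the model obey "always cite which document your answer comes from".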

Temperature = 0.1
Temperature controls how creative vs. factual the model is. 0.0 = fully deterministic. 1.0 = creative and random. We use 0.1 — slightly flexible but mostly factual. For documentation Q&A, you want consistent, reliable answers.
Phase 7

RAG Pipeline

Wiring retrieval + LLM into one call.

What this phase does

Before Phase 7, you had separate pieces — a retriever that finds chunks and an LLM client that generates answers. Phase 7 connects them into a single RAGPipeline class. You give it a question, and it handles everything: retrieve → format context → generate → return answer with sources.
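In skeleton form, the composition might look like this. The class and field names follow the text above, but the exact signatures are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RAGResponse:
    answer: str
    sources: list   # file names the answer was grounded in
    scores: list    # similarity score per retrieved chunk

class RAGPipeline:
    """Retrieve -> format context -> generate, in one call."""

    def __init__(self, retriever, llm, k=3):
        self.retriever = retriever
        self.llm = llm
        self.k = k

    def query(self, question: str) -> RAGResponse:
        hits = self.retriever.search(question, k=self.k)   # Phase 5: retrieve
        context = "\n\n".join(h["text"] for h in hits)     # format context
        answer = self.llm.generate(context, question)      # Phase 6: generate
        return RAGResponse(
            answer=answer,
            sources=[h["file_name"] for h in hits],
            scores=[h["score"] for h in hits],
        )
```

Because the retriever and LLM client are passed in, either can be swapped or stubbed, which is also what makes the pipeline easy to test.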

[Figure] The RAG pipeline — one question in, one cited answer out: DocumentRetriever searches ChromaDB for the top 3 chunks, the context formatter labels them with source citations, OllamaClient generates the answer, and a RAGResponse carries answer, sources, scores, and model info.

Phase 8

FastAPI Inference Endpoint

Making the pipeline accessible over HTTP.

What is FastAPI?

FastAPI is a Python web framework that turns your Python functions into HTTP endpoints. After this phase, any client — a browser, a mobile app, a curl command, the chat UI — can send a question over HTTP and get a JSON answer back.

The two endpoints

Endpoint | Method | What it does
--- | --- | ---
/query | POST | Send a question, get an answer with sources as JSON
/health | GET | Check if all pipeline components are running

Automatic documentation

FastAPI auto-generates an interactive Swagger UI at /docs. You can open your browser, type a question in the form, click "Execute", and see the full JSON response — without writing any frontend code.

Why initialize the pipeline once?
The RAG pipeline is created once when the server starts and reused for every request. Loading the embedding model and connecting to Ollama takes several seconds — you don't want that delay on every question. Initialize once, serve many.
Phase 9

Chat Interface

A real UI developers can actually use.

What we built

A Streamlit chat application that talks to the FastAPI backend. Developers type a question, get an answer with source citation cards showing which document each piece of information came from — with relevance score bars under each citation.

Phase 10

Evaluation with RAGAS

Measuring quality with actual numbers.

The problem with "it seems to work"

After Phase 9 you can use the system and it feels like it works. But "feels like it works" isn't engineering — it's guessing. RAGAS gives you actual numbers so you can say "my system is 100% faithful to source documents" instead of guessing.

[Figure] The 3 RAGAS metrics — Faithfulness catches hallucination (1.0 means every claim is in the docs), Answer Relevancy catches off-topic answers (1.0 means the answer is on point), Context Precision catches bad retrieval (1.0 means all retrieved chunks were useful). All scores range 0.0-1.0: above 0.7 is good, above 0.85 is excellent, below 0.5 needs fixing.

Our scores

Metric | Score | What it tells us
--- | --- | ---
Faithfulness | 1.000 | The LLM never hallucinated — every claim comes from retrieved docs
Context Precision | 1.000 | The retriever consistently found the correct chunks
Answer Relevancy | 0.458 | Affected by timeouts and missing docs (not a quality issue)

How it works

You create a test set of questions with expected answers. RAGAS runs each question through your pipeline, gets the answer and retrieved chunks, then uses an LLM as a judge to score each metric.
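The test set itself is just questions paired with reference answers. An illustrative shape (these entries are made up for the example; the real set lives with the project's tests):

```python
# Illustrative RAGAS-style evaluation cases -- each pairs a question with
# the ground-truth answer the documentation should support.
EVAL_SET = [
    {
        "question": "How do I authenticate with the API?",
        "ground_truth": "Include a Bearer token in the Authorization header.",
    },
    {
        "question": "Which file types can be ingested?",
        "ground_truth": "Markdown, plain text, and PDF files.",
    },
]
```

For each case, the pipeline's answer and retrieved chunks are collected, and the judge LLM scores them against these references to produce the three metrics above.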

Why an LLM judge?
Natural language isn't exact. "Include the token in the Authorization header" and "Add Authorization: Bearer your_token to every request" mean the same thing. An exact string match would say they're different. An LLM judge understands they're semantically equivalent.
Phase 11

Docker Deployment + Agentic RAG

One-command deployment and an intelligent search agent.

Docker: why containerize?

Right now, running the system requires opening 2-3 terminals, activating a venv, starting Ollama, running uvicorn, running Streamlit. Docker packages everything into containers so the entire system starts with one command: docker-compose up.

The Docker stack

Container | What it runs
--- | ---
ollama | Mistral LLM server
api | FastAPI backend with RAG pipeline
ui | Streamlit chat interface
ingestion | One-time document processing job

Agentic RAG: what changed?

The naive pipeline (Phases 1-9) follows a fixed path: always retrieve 3 chunks, always generate once. The agentic pipeline adds a reasoning layer — the LLM decides what to search for, can search multiple times, and determines when it has enough information to answer.

Switching modes
One environment variable controls which pipeline runs:
PIPELINE_MODE=naive — fast, predictable, ~5 seconds
PIPELINE_MODE=agentic — smarter, slower, ~15 seconds
Both share the same API, same UI, same tests.
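The switch itself can be as small as one lookup. select_pipeline and the factory arguments are hypothetical; the PIPELINE_MODE variable name comes from the text above:

```python
import os

def select_pipeline(naive_factory, agentic_factory):
    """Build whichever pipeline PIPELINE_MODE asks for (default: naive)."""
    mode = os.getenv("PIPELINE_MODE", "naive")
    return agentic_factory() if mode == "agentic" else naive_factory()
```

Passing factories instead of ready-made pipelines means the unused mode is never constructed, so the naive path doesn't pay the agent's startup cost.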
Deep Dive

Naive RAG vs Agentic RAG

Understanding the two approaches this project implements.

Type | How it works | Best for
--- | --- | ---
Naive RAG | Fixed pipeline: embed → search → generate. No decisions. | Simple questions on well-structured docs
Advanced RAG | Adds re-ranking, query rewriting, hybrid search. Still fixed. | Better retrieval without agent complexity
Agentic RAG | LLM decides what to search, when, and how many times. | Complex questions requiring multi-step reasoning

Deep Dive

What LangChain Actually Does

Understanding the toolkit that connects everything.

LangChain is a toolkit, not a model

LangChain doesn't generate text. It doesn't search databases. It's the glue that connects the pieces that do those things. Think of it like LEGO — it gives you pre-built blocks for loading documents, splitting text, generating embeddings, searching a database, and talking to an LLM.

The packages we used

Package | What it provides | Used in
--- | --- | ---
langchain-core | Document class, PromptTemplate | Every phase
langchain-community | File loaders (Markdown, PDF, text) | Phase 2
langchain-text-splitters | RecursiveCharacterTextSplitter | Phase 3
langchain-chroma | ChromaDB integration | Phase 4
langchain-huggingface | SentenceTransformers wrapper | Phase 4
langchain-ollama | Ollama LLM and ChatOllama | Phases 6, 11
langgraph | ReAct agent framework | Phase 11

What tests are for

Each test encodes one thing that must always be true. Collectively they form a specification of the system's behavior — not just "it works today" but "it works this specific way, always." Before every commit, you run 63 tests in under 30 seconds. Green means every contract still holds.

Built as part of the Developer Documentation AI Assistant project.
github.com/sumreen7/developer-knowledge-rag