Complete Beginner's Guide

Building a RAG System
from Scratch

A step-by-step walkthrough of how to build a production-quality AI documentation assistant using Retrieval-Augmented Generation, a local language model, and a vector database. Every concept explained as if you've never seen any of this before.

What is RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique where you give an AI model specific documents to read before it answers a question — instead of letting it answer from its general training knowledge.

Think about it this way: if you ask ChatGPT "how do I authenticate with your company's API?", it has no idea — it was never trained on your internal docs. It will either say "I don't know" or make something up (hallucinate).

RAG solves this. When a developer asks a question, the system first searches your documentation for the relevant paragraphs, then feeds those paragraphs to the AI model and says "answer using only this." The model reads the retrieved text and generates a grounded answer — one that comes from your actual documentation, not from its imagination.

[Figure] The complete RAG flow — question in, cited answer out: the developer's question becomes an embedding vector, ChromaDB finds matching chunks, and Ollama generates an answer with source citations.

Analogy

Imagine an open-book exam. The student (the AI model) doesn't need to memorize the textbook. Instead, they look up the relevant pages first (retrieval), read them (context), and write their answer based on what they just read (generation). That's RAG.

Why not just fine-tune the model?

Fine-tuning means retraining the AI model on your documentation so it "remembers" the content. This sounds simpler but has serious problems for documentation search: the knowledge is baked into the model's weights, so every doc update means another slow, expensive retraining run, and the model can't cite which document an answer came from.

[Figure] Fine-tuning vs RAG — why RAG is the right choice for documentation: fine-tuning bakes knowledge into model weights, while RAG fetches docs at query time.

Architecture Overview

The system has two major flows — one for ingesting documents, one for answering questions. Here's the full picture of how every component connects:

[Figure] Full system architecture — every phase mapped to its component: Developer, Chat UI, FastAPI, RAG pipeline, vector retrieval, ChromaDB, embeddings, ingestion pipeline, and documentation files, all inside a Docker container.

Phase 1

Project Setup and Environment

Before writing any logic, make it runnable.

What we did

Created the entire project skeleton — folders, virtual environment, dependencies, configuration, and a basic health check API endpoint that proves the system starts.

Why this matters

A production project isn't just code — it's a system of files that communicate intent. The folder layout you choose on day one determines how maintainable, testable, and deployable your project will be. We established these foundations up front.

Key files created

File | Purpose
--- | ---
pyproject.toml | All dependencies and project config
src/config.py | Centralized settings loaded from .env file
src/api/main.py | FastAPI app with /health endpoint
.env | Local configuration values (never committed to git)
tests/test_health.py | 4 tests proving the API starts correctly

Why Pydantic Settings?
Configuration is managed by Pydantic Settings, which reads values from environment variables or a .env file. You change behavior by editing .env, not by touching code. If a required variable is missing, the app fails immediately with a clear error.
Phase 2

Document Ingestion Pipeline

Opening files, reading text, attaching labels.

What we did

Built a DocumentLoader class that opens documentation files from a folder, reads their text content, and wraps each one in a Document object with metadata — a label saying where the text came from.

What is a Document object?

In LangChain, the unit of content is a Document. It has exactly two fields:

Document(
    page_content = "Use Bearer tokens for all API requests...",
    metadata     = {
        "source":     "data/raw/authentication.md",
        "file_name":  "authentication.md",
        "file_type":  "md",
        "load_timestamp": "2026-03-18T17:45:59"
    }
)
[Figure] The ingestion flow — files go in, labelled Documents come out: .md, .pdf, .txt, and unknown files enter DocumentLoader, which routes by file extension, enriches with metadata, and outputs a List of Documents to the Phase 3 chunker.

Why metadata matters

Content alone is not enough. Every piece of text needs metadata — where it came from, what format it was, when it was loaded. That metadata is what lets the final system say "this answer comes from authentication.md" rather than returning anonymous text.

Analogy

Think of a library intake desk. Every book that arrives gets catalogued with its title, author, and category before it goes on the shelf. The catalogue card (metadata) is just as important as the book itself — without it, you can never find the book again.

Supported file types

Extension | Loader | What it handles
--- | --- | ---
.md | UnstructuredMarkdownLoader | Markdown documentation
.txt | TextLoader | Plain text files
.pdf | PyPDFLoader | PDF documents (one Document per page)

Incremental Ingestion

A naive approach to ingestion would re-process every file every time you run the pipeline — leading to duplicate chunks in the vector store, wasted computation, and polluted search results. This system uses incremental ingestion instead.

Before processing, the pipeline queries ChromaDB's metadata index for all file_name values already stored. Any file that has been previously chunked and embedded is skipped automatically. Only new, previously unseen documents get processed.

Three Ingestion Modes
Incremental (default) — queries ChromaDB for already-indexed file names, skips them, and only processes new documents. No duplicate vectors, no redundant embedding computation.

Force re-index (--force) — deletes the entire vectorstore and re-embeds the full corpus from scratch. Use when the embedding model changes or the vectorstore is corrupted.

Single-file re-index (--reindex filename) — deletes all stale chunks for one specific file, then re-chunks and re-embeds it. Use when a document has been edited and the old vectors are out of date.
# Default — only new files get processed
python -m src.ingestion.run_ingestion

# Force — delete vectorstore, re-embed everything
python -m src.ingestion.run_ingestion --force

# Reindex one file — delete old chunks, re-embed
python -m src.ingestion.run_ingestion --reindex authentication.md

This means scaling the document corpus is a single-step operation — drop the file into data/raw/ and run python -m src.ingestion.run_ingestion. The pipeline handles the rest.

Phase 3

Document Chunking

Splitting documents into searchable pieces.

The problem

After Phase 2, each document is one big block of text. If authentication.md is 500 words long and a developer asks "what happens when my token expires?", the system would retrieve the entire 500-word file. That's wasteful and imprecise.

The solution

Chunking splits each document into small, overlapping pieces. Instead of retrieving a whole file, the system retrieves exactly the 2-3 sentences that answer the question.

[Figure] How overlap works — consecutive chunks share text at their boundaries: one document split into three overlapping chunks (chars 0-200, 150-350, 300-500), with the overlap regions highlighted.

How RecursiveCharacterTextSplitter works

RecursiveCharacterTextSplitter is LangChain's recommended general-purpose splitter. It doesn't blindly cut every 512 characters; it works down this priority list to find the most natural place to cut:

  1. Try to cut at a blank line (paragraph boundary)
  2. If chunk still too big, try a line break
  3. If still too big, try a period + space (sentence boundary)
  4. If still too big, try a space (word boundary)
  5. Last resort: cut at any character

Settings

chunk_size    = 512 characters   (how big each piece is)
chunk_overlap = 50 characters    (how much consecutive pieces share)
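To see what the size and overlap settings mean mechanically, here is a deliberately naive splitter that only does the fixed-size-with-overlap part. The real RecursiveCharacterTextSplitter additionally prefers the paragraph, line, and sentence boundaries listed above:

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Fixed-size chunking with overlap -- boundary-blind on purpose.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share their boundary text.
    """
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With chunk_size=512 and chunk_overlap=50, each chunk repeats the last 50 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.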
Phase 4

Embeddings and Vector Database

Converting text into searchable numbers.

What is an embedding?

An embedding converts text into a list of numbers (a "vector") so that a computer can understand meaning, not just keywords. Each chunk becomes a list of 384 numbers.

The magic: chunks with similar meaning produce similar numbers. So "log in" and "authenticate" end up close together in number-space, even though the words are completely different.
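"Close together in number-space" can be made precise with cosine similarity, the measure vector databases typically use. The three-dimensional vectors below are toy values for illustration; real embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    """1.0 = vectors point the same way (same meaning), near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

log_in       = [0.9, 0.1, 0.0]   # toy embedding for "log in"
authenticate = [0.8, 0.2, 0.1]   # toy embedding for "authenticate"
pizza        = [0.0, 0.1, 0.9]   # toy embedding for "order a pizza"

# Similar meanings score high, unrelated ones score low
assert cosine_similarity(log_in, authenticate) > 0.9
assert cosine_similarity(log_in, pizza) < 0.1
```

This is the whole trick: different words, similar meaning, nearby vectors.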

[Figure] Text → numbers → similarity search — the core mechanism of semantic retrieval: SentenceTransformers converts each chunk into a 384-number vector stored in ChromaDB; auth-related chunks produce similar numbers, API-endpoint chunks different ones.

Analogy

Think of GPS coordinates. "The coffee shop on 5th Avenue" and "the café at 40.758, -73.985" are different descriptions of the same place — but the GPS numbers let you calculate how close they are to other locations. Embeddings do the same thing for meaning instead of geography.

ChromaDB: the vector database

ChromaDB stores all your embedded chunks on disk. For each chunk, it stores three things: the original text, all the metadata (file name, source, chunk index), and the 384 numbers. When you search, ChromaDB compares your question's numbers against every stored chunk's numbers and returns the closest matches.

Phase 5

Retrieval Pipeline

Finding the right chunks for any question.

[Figure] How similarity search works — question in, relevant chunks out: the question is embedded into 384 numbers, ChromaDB compares them against all stored vectors, and the top k=3 chunks come back ranked by similarity score.

How it works step by step

  1. The question is converted to 384 numbers using the same embedding model
  2. ChromaDB compares those numbers against every stored chunk's numbers
  3. The 3 chunks whose numbers are closest (most similar meaning) are returned
  4. Each result comes with a similarity score — higher means more relevant

Example retrieval result

Result 1 — score: 0.611 | authentication.md chunk 0   ← most relevant
Result 2 — score: 0.269 | getting_started.txt chunk 0
Result 3 — score: 0.183 | authentication.md chunk 1
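Steps 2 and 3 above, compare and rank, are just a sort over similarity scores. A toy version using dot products over pre-normalised vectors (top_k is a hypothetical name, not the project's code):

```python
def top_k(query_vec, stored_chunks, k=3):
    """Rank stored chunks by similarity to the query vector, best first.

    `stored_chunks` is a list of (chunk_text, vector) pairs; vectors are
    assumed unit-length, so a dot product acts as a cosine similarity.
    """
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]
```

ChromaDB does exactly this, just over hundreds of thousands of 384-dimensional vectors with optimised index structures instead of a plain sort.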
No LLM involved yet
Phase 5 is pure search. The retriever is like a librarian — it doesn't answer your question, it just finds the right pages and hands them to you. The LLM in Phase 6 is the expert who reads those pages and writes the answer.
Phase 6

LLM Integration with Ollama

Adding the brain that reads and answers.

What is Ollama?

Ollama is a tool that runs large language models locally on your machine. It works like a local server — your code sends it text, and it sends back a response. No internet required after the initial model download. No API keys. Your data never leaves your laptop.

The prompt template

We don't just send the question to Mistral. We send a carefully structured prompt:

You are a helpful developer documentation assistant.
Answer the developer's question using ONLY the information
provided in the context below.
If the answer is not in the context, say "I don't have enough
information in the documentation to answer this question."
Always cite which document your answer comes from.

Context:
{retrieved chunks go here}

Question: {developer's question}

Answer:

The instruction "using ONLY the information provided" is what prevents hallucination. Without it, Mistral would blend retrieved context with its training knowledge.
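Assembling that prompt is plain string formatting. A sketch, where build_prompt and the per-chunk source labels are assumptions and the template text is the one shown above:

```python
PROMPT_TEMPLATE = """You are a helpful developer documentation assistant.
Answer the developer's question using ONLY the information
provided in the context below.
If the answer is not in the context, say "I don't have enough
information in the documentation to answer this question."
Always cite which document your answer comes from.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(chunks, question):
    """Label each retrieved chunk with its source file, then fill the template."""
    context = "\n\n".join(
        f"[source: {c['file_name']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Labelling each chunk with its file name is what lets the model obey "always cite which document your answer comes from".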

Temperature = 0.1
Temperature controls how creative vs. factual the model is. 0.0 = fully deterministic. 1.0 = creative and random. We use 0.1 — slightly flexible but mostly factual. For documentation Q&A, you want consistent, reliable answers.
Phase 7

RAG Pipeline

Wiring retrieval + LLM into one call.

What this phase does

Before Phase 7, you had separate pieces — a retriever that finds chunks and an LLM client that generates answers. Phase 7 connects them into a single RAGPipeline class. You give it a question, and it handles everything: retrieve → format context → generate → return answer with sources.
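In skeleton form, the composition might look like this. The class and field names follow the text above, but the exact signatures are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RAGResponse:
    answer: str
    sources: list   # file names the answer was grounded in
    scores: list    # similarity score per retrieved chunk

class RAGPipeline:
    """Retrieve -> format context -> generate, in one call."""

    def __init__(self, retriever, llm, k=3):
        self.retriever = retriever
        self.llm = llm
        self.k = k

    def query(self, question: str) -> RAGResponse:
        hits = self.retriever.search(question, k=self.k)   # Phase 5: retrieve
        context = "\n\n".join(h["text"] for h in hits)     # format context
        answer = self.llm.generate(context, question)      # Phase 6: generate
        return RAGResponse(
            answer=answer,
            sources=[h["file_name"] for h in hits],
            scores=[h["score"] for h in hits],
        )
```

Because the retriever and LLM client are passed in, either can be swapped or stubbed, which is also what makes the pipeline easy to test.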

[Figure] The RAG pipeline — one question in, one cited answer out: DocumentRetriever searches ChromaDB for the top 3 chunks, the context formatter labels them with source citations, OllamaClient generates the answer, and a RAGResponse carries answer, sources, scores, and model info.

Phase 8

FastAPI Inference Endpoint

Making the pipeline accessible over HTTP.

What is FastAPI?

FastAPI is a Python web framework that turns your Python functions into HTTP endpoints. After this phase, any client — a browser, a mobile app, a curl command, the chat UI — can send a question over HTTP and get a JSON answer back.

The two endpoints

Endpoint | Method | What it does
--- | --- | ---
/query | POST | Send a question, get an answer with sources as JSON
/health | GET | Check if all pipeline components are running

Automatic documentation

FastAPI auto-generates an interactive Swagger UI at /docs. You can open your browser, type a question in the form, click "Execute", and see the full JSON response — without writing any frontend code.

Why initialize the pipeline once?
The RAG pipeline is created once when the server starts and reused for every request. Loading the embedding model and connecting to Ollama takes several seconds — you don't want that delay on every question. Initialize once, serve many.
Phase 9

Chat Interface

A real UI developers can actually use.

What we built

A Streamlit chat application that talks to the FastAPI backend. Developers type a question, get an answer with source citation cards showing which document each piece of information came from — with relevance score bars under each citation.

Phase 10

Evaluation with RAGAS

Measuring quality with actual numbers.

The problem with "it seems to work"

After Phase 9 you can use the system and it feels like it works. But "feels like it works" isn't engineering — it's guessing. RAGAS gives you actual numbers so you can say "my system is 100% faithful to source documents" instead of guessing.

[Figure] The 3 RAGAS metrics — Faithfulness catches hallucination (1.0 means every claim is in the docs), Answer Relevancy catches off-topic answers (1.0 means the answer is on point), Context Precision catches bad retrieval (1.0 means all retrieved chunks were useful). All scores range 0.0-1.0: above 0.7 is good, above 0.85 is excellent, below 0.5 needs fixing.

Our scores

Metric | Score | What it tells us
--- | --- | ---
Faithfulness | 1.000 | The LLM never hallucinated — every claim comes from retrieved docs
Context Precision | 1.000 | The retriever consistently found the correct chunks
Answer Relevancy | 0.458 | Affected by timeouts and missing docs (not a quality issue)

How it works

You create a test set of questions with expected answers. RAGAS runs each question through your pipeline, gets the answer and retrieved chunks, then uses an LLM as a judge to score each metric.
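The test set itself is just questions paired with reference answers. An illustrative shape (these entries are made up for the example; the real set lives with the project's tests):

```python
# Illustrative RAGAS-style evaluation cases -- each pairs a question with
# the ground-truth answer the documentation should support.
EVAL_SET = [
    {
        "question": "How do I authenticate with the API?",
        "ground_truth": "Include a Bearer token in the Authorization header.",
    },
    {
        "question": "Which file types can be ingested?",
        "ground_truth": "Markdown, plain text, and PDF files.",
    },
]
```

For each case, the pipeline's answer and retrieved chunks are collected, and the judge LLM scores them against these references to produce the three metrics above.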

Why an LLM judge?
Natural language isn't exact. "Include the token in the Authorization header" and "Add Authorization: Bearer your_token to every request" mean the same thing. An exact string match would say they're different. An LLM judge understands they're semantically equivalent.
Phase 11

Docker Deployment + Agentic RAG

One-command deployment and an intelligent search agent.

Docker: why containerize?

Right now, running the system requires opening 2-3 terminals, activating a venv, starting Ollama, running uvicorn, running Streamlit. Docker packages everything into containers so the entire system starts with one command: docker-compose up.

The Docker stack

Container | What it runs
--- | ---
ollama | Mistral LLM server
api | FastAPI backend with RAG pipeline
ui | Streamlit chat interface
ingestion | One-time document processing job

Agentic RAG: what changed?

The naive pipeline (Phases 1-9) follows a fixed path: always retrieve 3 chunks, always generate once. The agentic pipeline adds a reasoning layer — the LLM decides what to search for, can search multiple times, and determines when it has enough information to answer.

Switching modes
One environment variable controls which pipeline runs:
PIPELINE_MODE=naive — fast, predictable, ~5 seconds
PIPELINE_MODE=agentic — smarter, slower, ~15 seconds
Both share the same API, same UI, same tests.
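The switch itself can be as small as one lookup. select_pipeline and the factory arguments are hypothetical; the PIPELINE_MODE variable name comes from the text above:

```python
import os

def select_pipeline(naive_factory, agentic_factory):
    """Build whichever pipeline PIPELINE_MODE asks for (default: naive)."""
    mode = os.getenv("PIPELINE_MODE", "naive")
    return agentic_factory() if mode == "agentic" else naive_factory()
```

Passing factories instead of ready-made pipelines means the unused mode is never constructed, so the naive path doesn't pay the agent's startup cost.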
Deep Dive

Naive RAG vs Agentic RAG

Understanding the two approaches this project implements.

Type | How it works | Best for
--- | --- | ---
Naive RAG | Fixed pipeline: embed → search → generate. No decisions. | Simple questions on well-structured docs
Advanced RAG | Adds re-ranking, query rewriting, hybrid search. Still fixed. | Better retrieval without agent complexity
Agentic RAG | LLM decides what to search, when, and how many times. | Complex questions requiring multi-step reasoning

Deep Dive

What LangChain Actually Does

Understanding the toolkit that connects everything.

LangChain is a toolkit, not a model

LangChain doesn't generate text. It doesn't search databases. It's the glue that connects the pieces that do those things. Think of it like LEGO — it gives you pre-built blocks for loading documents, splitting text, generating embeddings, searching a database, and talking to an LLM.

The packages we used

Package | What it provides | Used in
--- | --- | ---
langchain-core | Document class, PromptTemplate | Every phase
langchain-community | File loaders (Markdown, PDF, text) | Phase 2
langchain-text-splitters | RecursiveCharacterTextSplitter | Phase 3
langchain-chroma | ChromaDB integration | Phase 4
langchain-huggingface | SentenceTransformers wrapper | Phase 4
langchain-ollama | Ollama LLM and ChatOllama | Phases 6, 11
langgraph | ReAct agent framework | Phase 11

What tests are for

Each test encodes one thing that must always be true. Collectively they form a specification of the system's behavior — not just "it works today" but "it works this specific way, always." Before every commit, you run 63 tests in under 30 seconds. Green means every contract still holds.

Built as part of the Developer Documentation AI Assistant project.
github.com/sumreen7/developer-knowledge-rag