Hameed's dev blog

AI Fundamentals: RAG


Part 1️⃣: Limitations of LLMs

🔍 1. LLMs are powerful but flawed

| Limitation | Details | Impact |
| --- | --- | --- |
| ❌ Knowledge Cutoff | Models are trained on past data | Cannot answer recent news, latest updates, or real-time information |
| ❌ No Access to Private Data | Cannot access company documents, internal databases, personal files | Cannot build enterprise tools using raw LLMs alone |
| ❌ Hallucination | Model generates confident but wrong answers | No guarantee of factual correctness |

Part 2️⃣: Fine-Tuning Deep Dive

Fine-tuning = Training the model again on your custom dataset so it learns that data internally

❌ Problems with Fine-Tuning

| Problem | Details | Impact |
| --- | --- | --- |
| Expensive | Requires GPUs and a training pipeline | Not cheap for most use cases |
| Time-Consuming | Training takes time | Not instant or iterative |
| Not Scalable for Updates | Your data changes frequently → you must retrain again and again | Becomes impractical very quickly |
| Static Knowledge | Once trained, model knowledge is fixed | Cannot dynamically fetch new info |

Fine-Tuning Workflow — Notes

💡 Quick Reminder

Remember: The 4-step workflow below applies to any fine-tuning approach. The key difference is in Step 2 (which method you choose) — everything else is the same!

🌳 Fine-Tuning Workflow (4-Step Process)

Fine-Tuning Workflow
├── 🟢 1. Data Collection
│   ├── High-quality prompt → response pairs
│   ├── Dataset size: Few hundred → few hundred thousand
│   ├── Quality > Quantity
│   └── Must be: Clean, Relevant, Consistent
│
├── 🟢 2. Choose Fine-Tuning Method
│   ├── Full Fine-Tuning (Full FT)
│   │   ├── Updates: All model parameters
│   │   ├── Cost: High
│   │   └── Performance: Best
│   ├── LoRA / QLoRA ⭐ Most Common
│   │   ├── Updates: Small subset of parameters
│   │   ├── Cost: Low
│   │   └── Performance: Very good
│   └── Adapters
│       ├── Updates: Adds small modules to model
│       ├── Cost: Low
│       └── Performance: Moderate
│
├── 🟢 3. Training
│   ├── Train for few epochs (few passes over data)
│   ├── Full FT → Update entire model
│   ├── LoRA → Keep base model frozen, update small layers
│   └── Risk: Overfitting → model becomes too rigid
│
└── 🟢 4. Evaluation & Safety Testing
    ├── Metrics
    │   ├── Exact Match: Does output match expected answer?
    │   ├── Factuality: Is the answer correct?
    │   └── Hallucination Rate: Does model generate false info?
    └── Safety Testing (Red Teaming)
        ├── Edge cases
        ├── Malicious inputs
        └── Unsafe prompts

🟢 Choose Fine-Tuning Method

Comparison Table:

| Method | Updates | Cost | Performance | Use Case |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning (Full FT) | All model parameters | High | Best | Large-scale systems |
| LoRA / QLoRA | Small subset of parameters | Low | Very good | Most real-world apps |
| Adapters | Adds small modules to model | Low | Moderate | Lightweight setups |
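A back-of-envelope calculation makes the "small subset of parameters" row concrete. The numbers below (a single 4096×4096 weight matrix, LoRA rank r = 8) are illustrative choices, not taken from any specific model:

```python
# Why LoRA updates far fewer parameters than full fine-tuning.
# Illustrative numbers: one 4096x4096 weight matrix W, LoRA rank r = 8.

d, k = 4096, 4096   # shape of one weight matrix W
r = 8               # LoRA rank (typical values range from 4 to 64)

full_ft_params = d * k        # full fine-tuning updates every entry of W
lora_params = d * r + r * k   # LoRA trains two low-rank factors: A (d x r) and B (r x k)

print(f"Full FT params: {full_ft_params:,}")   # 16,777,216
print(f"LoRA params:    {lora_params:,}")      # 65,536
print(f"LoRA trains {lora_params / full_ft_params:.2%} of this matrix")  # 0.39%
```

Less than half a percent of the matrix is trained, which is why the cost column drops from "High" to "Low" while the base model stays frozen.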

💡 Fine-Tuning: Key Takeaways

🎯 REMEMBER: Fine-tuning modifies the model itself (weights) — this is permanent and expensive

| Aspect | Details |
| --- | --- |
| What changes | The model itself (its weights) |
| Requirements | Data, compute, evaluation pipeline |
| Expensive | Yes |
| Time-consuming | Yes |
| Update frequency | Hard to update frequently |
| Best for | Stable, static knowledge and behavior; not suitable for dynamic data |

⚠️ Fine-Tuning Inefficiency Deep Dive

Fine-tuning becomes inefficient and costly as data changes, and even after retraining it does not resolve the core LLM limitations above.

📌 Practical Scenario: Why Fine-Tuning Fails in Real World

| Timeline | Action | Result | Problem | Lesson |
| --- | --- | --- | --- | --- |
| Month 1 | Fine-tune on 500 HR policy examples | Works well ✓ | None | Model trained successfully |
| Month 2 | Company updates parental leave policy | Must retrain entire model (2–3 days) | Expensive, time-consuming | Data changed → costs explode |
| Month 3 | New remote work policy added | Retrain again | Endless retraining cycle | Not scalable for dynamic data |
| Month 4 | Model knowledge 3 months behind | Hallucinating outdated info | Worse than baseline | Static knowledge becomes a liability |

Root Problem: You're trying to update model weights (static, expensive)

Better Approach (RAG): Update external documents (dynamic, instant) — No retraining needed, model always has latest info


In-Context Learning (ICL)

Definition: A core capability of LLMs where the model learns to solve a task purely by seeing examples in the prompt — without updating weights.

Key Property: Zero training needed. Examples in prompt = instant learning.

📌 Practical Example: Sentiment Analysis

A classic illustration is sentiment analysis using a technique called "few-shot prompting".

In this scenario, a user provides the LLM with a few labeled examples directly within the prompt to teach it a pattern:

  • Example 1: "I love this phone" → Sentiment: Positive
  • Example 2: "This app crashes a lot" → Sentiment: Negative
  • Example 3: "The camera is amazing" → Sentiment: Positive
  • The Query: "I hate the battery life"

The LLM observes these previous examples, learns how to perform sentiment analysis for this specific context, and correctly identifies the sentiment of the final text as Negative without ever having been explicitly retrained on this data.
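The pattern above is pure prompt assembly. A minimal sketch (the `build_few_shot_prompt` helper is a hypothetical name; the completion itself is left to whichever LLM API you call):

```python
# Few-shot prompting: the "training data" lives entirely in the prompt.
# No weights are updated; the LLM infers the pattern from the examples.

examples = [
    ("I love this phone", "Positive"),
    ("This app crashes a lot", "Negative"),
    ("The camera is amazing", "Positive"),
]
query = "I hate the battery life"

def build_few_shot_prompt(examples, query):
    """Format labeled examples, then the unlabeled query for the model to complete."""
    lines = [f'Text: "{text}"\nSentiment: {label}' for text, label in examples]
    lines.append(f'Text: "{query}"\nSentiment:')
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, query)
# Sending `prompt` to an LLM, it completes the final line with "Negative".
```

Swapping the examples swaps the task: the same model, with zero retraining, becomes a classifier for whatever pattern the prompt demonstrates.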


Part 3️⃣: Emergent Properties & Behaviors

Emergent Properties

Definition: A behavior or ability that suddenly appears in a system when it reaches a certain scale or complexity — even though it was not explicitly programmed.

Origin: Started appearing in GPT-3 (175B parameters) — became more pronounced in larger models.

In-context learning = Emergent property

Emergence Mechanics

| What Emerges | Why It's Emergent | When It Appears |
| --- | --- | --- |
| Model learns task from examples in prompt | Not explicitly trained for this | At scale (100B+ params) |
| No weight updates, no training needed | Capability wasn't hard-coded | Naturally as complexity increases |
| Reasoning ability | Not trained on reasoning tasks | Emerges with size |
| Instruction-following | Not explicitly trained for all instructions | Emerges from scale |

Key Insight for PMs

You cannot predict or control what emerges. This is why:

  • ✅ You get useful reasoning
  • ❌ You also get hallucinations
  • Both are features of the same underlying phenomenon

🔴 Negative Emergent Behaviors (VERY IMPORTANT)

| Behavior | Definition | Real-World Impact | Detection Method |
| --- | --- | --- | --- |
| ❌ Hallucination | Generates false but confident answers | Customer gets wrong info, loss of trust | Manual review, fact-checking |
| ❌ Overconfidence | Sounds certain even when wrong | User trusts bad answer | Confidence score analysis |
| ❌ Prompt Sensitivity | Small prompt change → very different output | Inconsistent, unpredictable behavior | A/B testing different phrasings |
| ❌ Bias Amplification | Learns and amplifies training data biases | Discriminatory outputs, regulatory risk | Bias audits, demographic testing |

🧠 What ties all these together

👉 These abilities:

  • Were not explicitly programmed
  • Appear when:
    • Model size increases
    • Data increases
    • Training improves

Emergent Properties & Trade-offs

The Core Tension

Positive emergent behaviors (reasoning, instruction-following) and negative ones (hallucination, overconfidence) both emerge together as scale increases. You cannot suppress one without potentially losing the other.

Why This Matters

Solution: Don't try to "fix" the model. Instead:

  1. Leverage positive emergent behaviors (reasoning)
  2. Contain negative ones (guardrails, RAG)

📌 Interview Case Study: Fintech Customer Support Chatbot

Question: You are building a customer support chatbot for a fintech app.

During testing, you notice:

  • The model can follow complex instructions without training
  • It sometimes reasons step-by-step correctly
  • But it also hallucinates occasionally

Explain why this is happening and how you would design the system to handle it.

Step 1: Identify Emergent Behavior

The model is showing emergent properties. Capabilities like instruction-following and step-by-step reasoning were not explicitly programmed but arise as the model scales.

Step 2: Explain BOTH sides (this is key)

Along with useful abilities, negative behaviors like hallucinations also emerge. These are not bugs but inherent to how LLMs work.

Step 3: Explain WHY this happens

The model generates responses based on patterns in data rather than verified knowledge, which leads to variability—sometimes correct reasoning, sometimes incorrect but confident outputs.

Step 4: Design Implication (this is where most fail)

Because of this, we cannot rely solely on the model. The system should include:

  • Retrieval (RAG) to ground responses in real data
  • Guardrails to filter unsafe or incorrect outputs
  • Evaluation mechanisms to monitor hallucination rates

Part 4️⃣: Knowledge Storage & Pre-training Limitations

Knowledge Storage Models in LLMs

Two Types of Knowledge

| Type | Storage | Update Speed | Reliability | Use Case |
| --- | --- | --- | --- | --- |
| Parametric Knowledge | Inside model weights | Requires retraining (days) | Medium (hallucinations) | General knowledge |
| External Knowledge (RAG) | Outside model (vector DB) | Instant (update docs) | High (verified sources) | Grounded facts |

Mental Model

Parametric Knowledge = Model's long-term memory (learned during training)

RAG / External = Model's short-term reference (looked up at query time)

Best practice: Use both:

  • Parametric → Reasoning, common sense
  • External → Facts, company-specific info

Why Pre-training is NOT the Solution

| Reason | Core Problem | Cost/Time Impact | Example | Better Alternative |
| --- | --- | --- | --- | --- |
| 🟢 Cost of Training | Requires huge compute (GPUs/TPUs) + months | $100K–$1M+ per run, 1–3 months | Training a 7B model costs $50K–$100K | Use RAG instead |
| 🟢 Data Changes Constantly | Real-world data updates frequently → must retrain | Every policy change = full retrain cycle | Month 1: train on Q1 data; Month 2: retrain for policy change | Retrieve from updated docs instantly |
| 🟢 Knowledge Gets Locked | Once trained, knowledge is frozen in weights | Months to add a new domain or fix errors | Correcting a hallucination requires a full retrain | Update source documents in minutes |
| 🟢 Not Scalable | Each new domain/use case = separate model | 3 domains = 3 models, 3× maintenance overhead | HR model + Finance model + Legal model | 1 base model + 3 RAG indexes |

Part 5️⃣: RAG (Retrieval-Augmented Generation)

✅ What RAG Does Instead (The Smart Solution)

Core Philosophy

Instead of: Storing knowledge IN the model (expensive, static)

Do this: Keep knowledge OUTSIDE, retrieve as needed (cheap, dynamic)

RAG Architecture Comparison

| Aspect | Pre-training | RAG |
| --- | --- | --- |
| Knowledge location | Inside model weights | External vector database |
| Update speed | Days to weeks (retrain) | Minutes (upload doc) |
| Cost to update | $50K–$500K+ | ~$0 |
| Scalability | One model per domain | One model, many data sources |
| Maintenance | Model management | Data management |
| Hallucination risk | High (facts unreliable) | Low (grounded in sources) |

RAG is preferred because: It allows using external and frequently changing data without retraining the model, making the system more efficient and scalable.


RAG: Formal Definition

RAG (Retrieval-Augmented Generation) = Technique where a system:

  1. Retrieves relevant external information
  2. Augments (adds) that info to the prompt
  3. Generates accurate responses using both

One-liner: "Let the LLM read from external documents before answering."


🎯 RAG Scenario: Leave Policy Question

User Question: "What is my company's leave policy?"

| Phase | WITHOUT RAG (Model Alone) | WITH RAG (Grounded Answer) | Outcome |
| --- | --- | --- | --- |
| Source | Parametric knowledge only | Searches HR_Policy_2024.pdf | Factual document found |
| Retrieval | ❌ No external docs searched | ✅ Retrieves: "20 paid leaves + 2 sick + 5 personal" | Company-specific data available |
| Answer Quality | Generic ("15–20 days typical") or hallucinated ("30 days!") | Specific ("20 paid leaves + 2 sick + 5 personal") | Accurate, verifiable |
| Trustworthiness | Low: user doubts accuracy | High: cited source (HR Policy 2024) | User confident in answer |
| Update Handling | Must retrain model ($50K+) | Update doc instantly ($0) | Dynamic, low-maintenance |

Step-by-Step Breakdown:

  1. Retrieve → Searches company documents → Finds relevant section → "Employees entitled to 20 paid leaves..."
  2. Augment → Adds to prompt → Context + Question combined
  3. Generate → Model produces grounded answer → "Your company provides 20 paid leaves per year"

Mapping to RAG: Retrieval ✓ | Augmented ✓ | Generation ✓


When to Choose: Fine-Tuning vs RAG

Fine-Tuning when the model needs to learn new behavior or style (static use case).

RAG when data changes frequently or you need access to real-time/private information (dynamic use case).


RAG Breakdown & Core Idea

| Component | Definition |
| --- | --- |
| Retrieval | Fetch relevant data |
| Augmented | Add that data to the prompt |
| Generation | LLM generates the answer |

Simple Analogy

| Scenario | Approach |
| --- | --- |
| Without RAG | Student answering from memory |
| With RAG | Student allowed to open the book before answering |

💡 KEY INSIGHT: RAG Solves 3 LLM Limitations

  1. Knowledge Cutoff → Fetch latest data in real-time
  2. No Private Data → Retrieve company-specific information
  3. Hallucination → Ground responses in actual retrieved facts

Complete RAG Definition

RAG is a system where instead of relying only on pretrained knowledge, we retrieve relevant external data using embeddings and vector search, then augment the prompt with that data before passing it to the LLM for grounded response generation.


End-to-End RAG Flow

| Step | Action |
| --- | --- |
| 1 | User asks question |
| 2 | Retriever finds relevant chunks |
| 3 | Combine: query + retrieved data |
| 4 | Pass to LLM |
| 5 | LLM generates answer |
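The flow above can be sketched end to end. This is deliberately a toy: word overlap stands in for embedding similarity, and `generate` is a stub where the real LLM call would go:

```python
import re

# Toy document store: three "chunks" from company policy docs.
docs = [
    "Employees are entitled to 20 paid leaves per year, 2 sick leaves, and 5 personal days.",
    "Expense reports must be submitted within 30 days of purchase.",
    "Remote work requires manager approval and a signed agreement.",
]

def tokens(text):
    """Lowercased word set; a real system would compare embeddings instead."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=1):
    """Step 2: rank chunks by word overlap with the query, return top-k."""
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)[:k]

def augment(query, chunks):
    """Step 3: combine the query and retrieved data into one prompt."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

def generate(prompt):
    """Steps 4-5: placeholder for the real LLM API call."""
    return f"[LLM answer grounded in {len(prompt)}-char prompt]"

query = "How many paid leaves do employees get per year?"  # Step 1
chunks = retrieve(query, docs)                              # Step 2
prompt = augment(query, chunks)                             # Step 3
answer = generate(prompt)                                   # Steps 4-5
```

Note that only `retrieve` changes in a production system (embeddings + vector DB); the augment-then-generate shape stays exactly the same.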

RAG System: 4-Step Architecture

A complete RAG system consists of 4 sequential phases:

┌─────────────┐       ┌──────────────┐       ┌───────────────┐       ┌────────────┐
│  1. Index   │ ---→  │  2. Retrieve │ ---→  │  3. Augment   │ ---→  │ 4. Generate│
│  (Offline)  │       │  (Real-time) │       │  (Real-time)  │       │ (Real-time)│
└─────────────┘       └──────────────┘       └───────────────┘       └────────────┘

Setup once         Live query processing

Key insight: Phase 1 runs once offline. Phases 2-4 run for every query.


🌳 Phase 1: Indexing Pipeline (Offline Setup)

| Step | What | Why | Tools | Data Flow |
| --- | --- | --- | --- | --- |
| 1. Data Ingestion | Load source knowledge into memory | Start with raw data | LangChain loaders (PyPDFLoader, YoutubeLoader, WebBaseLoader, GitLoader) | WWW → Document Loader → Documents |
| 2. Chunking | Break large docs into small, meaningful chunks | LLM token limits + better retrieval precision | RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter, SemanticChunker | Docs → Text Splitter → Multiple chunks |
| 3. Embeddings | Convert each chunk to a dense vector capturing meaning | Enable semantic similarity search ("car" ≈ "vehicle") | OpenAIEmbeddings, SentenceTransformerEmbeddings, InstructorEmbeddings | Chunks → Embedding Model → Dense Vectors |
| 4. Vector Database | Store vectors + chunk text + metadata | Fast similarity search at query time | FAISS (local), Pinecone (cloud) | Vectors → Vector Store → External Knowledge |

⚠️ Important: Entire pipeline runs BEFORE user queries (Indexing Phase / Offline Phase)
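The chunking step can be sketched as a fixed-size sliding window with overlap. Real splitters like RecursiveCharacterTextSplitter prefer paragraph and sentence boundaries, so treat this as the bare mechanism only:

```python
# Fixed-size chunking with overlap, stdlib only.
# Overlap keeps context that straddles a chunk boundary retrievable from
# either side; the chunk_size/overlap values are illustrative defaults.

def chunk_text(text, chunk_size=200, overlap=50):
    """Slide a chunk_size-character window forward by (chunk_size - overlap)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))  # stand-in for a long document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
# Windows start at 0, 150, 300, 450 -> 4 chunks; adjacent chunks share 50 chars.
```

A known rough edge of this naive version: the final chunk can be very short, and splits land mid-sentence, which is exactly what boundary-aware splitters fix.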


💡 Mental Model: The Google Search Analogy

Indexing Phase = Building Google's search index

  • Run once (slow, thorough): Takes hours/days
  • Then reuse forever: Queries return in milliseconds

That's why RAG is so fast in production!

Google's approach:
  Year 1: Crawl and index entire web (massive effort)
  Year 2+: User queries return in milliseconds

RAG's approach:
  Week 1: Index company docs (one-time effort)
  Week 2+: User queries return in milliseconds

🌳 Phase 2: Retrieval (Real-Time Query Processing)

Definition: Real-time process of finding the most relevant pieces of information from the pre-built index.

| Retrieval Step | What Happens | Input | Output |
| --- | --- | --- | --- |
| 1. Query to Embedding | Convert user query to a vector using the same embedding model | User question: "What is my leave policy?" | Query embedding vector |
| 2. Vector Search | Search vector DB for similar embeddings (cosine similarity) | Query vector + all stored vectors | Top-K matching chunks (usually 3–5) |
| 3. Return Matches | Get original chunk text + metadata from matches | Similar embeddings found | Relevant text passages |
| 4. Pass to LLM | Send matched chunks to Phase 3 (Augmentation) | Selected chunks | Ready for augmentation |

Core Question: "From all indexed documents, which 3–5 chunks best match this query?"
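Step 2 reduces to cosine similarity plus a top-k sort. The 3-dimensional vectors below are illustrative (real embeddings have hundreds or thousands of dimensions, and a vector DB replaces the linear scan):

```python
import math

# Cosine similarity + top-k over a tiny in-memory "index".

def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """index: list of (chunk_text, embedding) pairs. Return k best chunks."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = [
    ("leave policy chunk",   [0.9, 0.1, 0.0]),
    ("expense policy chunk", [0.1, 0.9, 0.0]),
    ("security chunk",       [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "What is my leave policy?"
best = top_k(query_vec, index, k=2)
# The leave-policy chunk ranks first because its vector points the same way.
```

Because cosine compares direction rather than magnitude, two chunks of very different lengths can still score as semantically close.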


🌳 Phase 3: Augmentation (Prompt Engineering)

Definition: Combining retrieved documents with the user's query to create an enriched prompt.

| Element | Role | Example |
| --- | --- | --- |
| Retrieved Context | Ground-truth data | "Employees are entitled to 20 paid leaves per year, 2 sick leaves, and 5 personal days" |
| User Question | What we need answered | "What is my company's leave policy?" |
| Template Format | Structure to combine them | Context: [chunks] → Question: [query] → Answer (use only context) |
| Output | Ready-to-send prompt | Full prompt sent to LLM in Phase 4 |

Why This Matters:

  • ✅ LLM sees relevant facts first
  • ✅ Reduces hallucinations (facts in context)
  • ✅ Answers are traceable (sources known)
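The template can be sketched directly. The exact wording of `PROMPT_TEMPLATE` is an assumption, not a required format; what matters is the shape: instruction, then context (with sources, so the answer is traceable), then question:

```python
# Augmentation: merge retrieved chunks + user question into one prompt.
# Carrying source metadata through lets the generation phase cite it.

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

retrieved = [
    {"text": "Employees are entitled to 20 paid leaves per year, "
             "2 sick leaves, and 5 personal days.",
     "source": "HR Policy 2024, Section 3.1"},
]
question = "What is my company's leave policy?"

context = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
prompt = PROMPT_TEMPLATE.format(context=context, question=question)
```

The "say you don't know" line is the anti-hallucination guardrail: it gives the model an explicit escape hatch instead of forcing it to invent an answer.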

🌳 Phase 4: Generation (LLM Response)

Definition: Final step where LLM uses augmented prompt to generate a response.

| Generation Element | What Happens | Result |
| --- | --- | --- |
| Input | Augmented prompt (context + query) | LLM sees facts before generating |
| Processing | LLM reads context, understands query, generates answer | Grounded reasoning using provided facts |
| Output Format | Answer + source citations | Traceable, verifiable response |
| Key Advantage | Facts in context prevent hallucinations | ✅ Grounded ✅ Reduced hallucination ✅ Citable |

Typical Output:

Answer: Your company provides 20 paid leaves per year, 
2 sick leaves, and 5 personal days.

Sources:
- HR Policy 2024, Section 3.1

RAG System End-to-End Flow

This comprehensive guide covers everything you need to understand RAG and why it's the preferred approach for building grounded AI systems that work with real-world data.
