Problem Statement
Organisations accumulate large volumes of internal knowledge — policies, procedures, technical documentation, meeting notes, and historical decisions — across files, wikis, and email threads. Finding accurate answers requires either knowing exactly where to look or spending time searching through multiple sources. General-purpose language models cannot access this private data and often fabricate answers when asked about domain-specific topics. The challenge was to build an accurate, trustworthy question-answering system grounded in real organisational knowledge.
Key Challenges:
- Grounding model answers in private data without fine-tuning
- Effective retrieval from large, heterogeneous document collections
- Minimising hallucination — ensuring answers cite actual source material
- Keeping the knowledge base current as documents are updated
- Handling multi-step questions requiring reasoning over multiple retrieved passages
System Architecture
Documents are ingested, chunked, and embedded into a vector database. At query time, the user's question is embedded and used to retrieve the most semantically relevant chunks. These chunks are injected into an LLM prompt as context, constraining the model to answer only from provided evidence. Source citations are returned alongside the answer for verification.
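The flow above can be sketched end-to-end with toy components (a bag-of-words "embedding" and an in-memory index stand in for the real encoder and vector database; the prompt wording is illustrative):

```python
import re

def embed(text):
    # Toy embedding: bag-of-words counts (stands in for a neural encoder).
    vec = {}
    for token in re.findall(r"\w+", text.lower()):
        vec[token] = vec.get(token, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5 or 1.0
    return dot / (norm(a) * norm(b))

def retrieve(question, index, k=2):
    # Rank stored chunks by similarity to the embedded question.
    q = embed(question)
    return sorted(index, key=lambda d: cosine(q, d["vec"]), reverse=True)[:k]

def build_prompt(question, chunks):
    # Constrain the model to the retrieved evidence.
    context = "\n".join(c["text"] for c in chunks)
    return ("Answer ONLY from the context below; say 'not found' otherwise.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

docs = ["Expenses are reimbursed within 30 days of submission.",
        "The VPN requires multi-factor authentication for remote access."]
index = [{"text": t, "vec": embed(t)} for t in docs]
question = "How many days until expenses are reimbursed?"
prompt = build_prompt(question, retrieve(question, index, k=1))
```

Only the expense-policy chunk survives retrieval, so the prompt contains the relevant evidence and nothing about the VPN.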
Document Ingestion Pipeline
Documents are parsed, split into semantically coherent chunks with overlap, embedded using a sentence embedding model, and stored in the vector database with metadata (source, version, section) for filtering.
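A minimal sketch of that ingestion step, assuming word-window chunking with overlap and a stubbed embedding call (the metadata keys mirror those named above):

```python
def embed(text):
    # Stub for the sentence-embedding model call.
    return [float(len(text))]

def chunk_with_overlap(words, size=50, overlap=10):
    # Fixed-size windows that share `overlap` words with the previous chunk.
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def ingest(doc_text, source, version, store):
    # Parse -> chunk -> embed -> store with metadata for filtering.
    words = doc_text.split()
    for section, chunk in enumerate(chunk_with_overlap(words)):
        store.append({"text": chunk,
                      "vector": embed(chunk),
                      "meta": {"source": source, "version": version,
                               "section": section}})

store = []
ingest(" ".join(f"w{i}" for i in range(120)), "handbook.pdf", "v3", store)
```

A 120-word document yields three chunks, each carrying ten words of overlap with its neighbour plus its source metadata.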
Retrieval Engine
Hybrid retrieval combining dense vector search (semantic similarity) with sparse BM25 keyword matching. Retrieval results are re-ranked by relevance to the specific query before being passed to the generation stage.
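The case study does not name the exact fusion method; reciprocal rank fusion (RRF) is one common way to merge a dense ranking with a BM25 ranking before re-ranking, sketched here (`k=60` is the conventional constant):

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    # Each argument is a list of doc ids ordered best-first. RRF rewards
    # documents ranked highly in either list without needing the two
    # scoring scales to be comparable.
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["a", "b", "c"], ["b", "c", "a"])
```

Document "b" wins because it ranks near the top of both lists, even though neither list ranks it first overall in isolation's raw scores.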
Grounded Generation
Retrieved chunks are injected into the LLM prompt with explicit instructions to answer only from provided context and to state when the answer is not found — reducing fabrication to near zero for in-scope questions.
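One plausible phrasing of that grounding instruction (the exact wording is an assumption; the key properties are "context only", an explicit refusal path, and citable source ids):

```python
GROUNDED_TEMPLATE = """You are answering from internal documentation.
Use ONLY the context below. If the context does not contain the answer,
reply exactly: "I could not find this in the knowledge base."
Cite the source id of every passage you rely on.

Context:
{context}

Question: {question}
Answer:"""

def render(question, passages):
    # Tag each passage with its source id so the model can cite it.
    context = "\n\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return GROUNDED_TEMPLATE.format(context=context, question=question)

prompt = render("What is the VPN policy?",
                [{"id": "it-sec-4", "text": "VPN requires MFA."}])
```

Giving the model a verbatim refusal string makes "answer not found" easy to detect downstream and to route to the fallback path.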
Knowledge Base Maintenance
Incremental update pipeline monitors document sources for changes, re-embeds modified documents, and removes stale vectors — keeping retrieval results accurate without full re-indexing.
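Change detection can be sketched with content hashing (a stand-in for whatever signal the real pipeline uses, such as modification timestamps or webhooks):

```python
import hashlib

def sync(sources, index_hashes):
    """sources: {doc_id: current text}; index_hashes: {doc_id: stored hash}.
    Returns (ids to re-embed, ids whose vectors should be deleted)."""
    changed, deleted = [], []
    for doc_id, text in sources.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if index_hashes.get(doc_id) != digest:
            changed.append(doc_id)        # new or modified: re-embed
            index_hashes[doc_id] = digest
    for doc_id in list(index_hashes):
        if doc_id not in sources:
            deleted.append(doc_id)        # removed at source: drop stale vectors
            del index_hashes[doc_id]
    return changed, deleted

hashes = {}
sync({"a": "v1", "b": "v1"}, hashes)          # first run: both embedded
changed, deleted = sync({"a": "v2"}, hashes)  # a modified, b removed
```

Only "a" is re-embedded and only "b"'s vectors are dropped; a third run with unchanged input does no work at all.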
Key Engineering Challenges
Chunking Strategy
Challenge: Poor chunking (too small or splitting across logical boundaries) degrades retrieval quality significantly.
Solution: Sentence-boundary-aware chunking with configurable overlap, preserving paragraph structure and section headings as metadata for context-aware retrieval filtering.
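A sketch of that chunker: sentences are packed into chunks up to a word budget, carrying the last sentence forward as overlap so no sentence is ever split mid-way. The naive regex splitter stands in for a proper sentence tokenizer:

```python
import re

def split_sentences(text):
    # Naive splitter on sentence-final punctuation (stand-in for a real
    # sentence tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(text, max_words=40, heading=None):
    chunks, current, count = [], [], 0
    for sent in split_sentences(text):
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append({"text": " ".join(current), "heading": heading})
            # One-sentence overlap: start the next chunk with the last sentence.
            current, count = [current[-1]], len(current[-1].split())
        current.append(sent)
        count += n
    if current:
        chunks.append({"text": " ".join(current), "heading": heading})
    return chunks

# Six 10-word sentences with a 25-word budget -> two sentences per chunk.
sents = [" ".join(f"s{i}w{j}" for j in range(9)) + f" s{i}end." for i in range(6)]
chunks = chunk_sentences(" ".join(sents), max_words=25, heading="Policy")
```

Attaching the section heading to every chunk is what enables the metadata filtering mentioned above.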
Hallucination Suppression
Challenge: LLMs tend to fabricate plausible-sounding answers when retrieved context is insufficient.
Solution: Explicit prompt instructions, confidence estimation based on retrieval similarity scores, and fallback responses for low-confidence retrievals directing users to contact a human expert.
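The low-confidence gate can be sketched as a threshold on the best retrieval score; the 0.35/0.6 cutoffs and the fallback wording are illustrative, not production values:

```python
FALLBACK = ("I could not find a reliable answer in the knowledge base. "
            "Please contact a subject-matter expert.")

def generate(question, passages):
    # Stub for the grounded LLM call.
    return "grounded answer"

def answer_or_fallback(question, hits, generate, threshold=0.35):
    """hits: (similarity, chunk) pairs sorted best-first."""
    if not hits or hits[0][0] < threshold:
        # Too little evidence: refuse and route to a human instead of
        # letting the model guess.
        return {"answer": FALLBACK, "sources": [], "confidence": "low"}
    passages = [chunk for _, chunk in hits]
    return {"answer": generate(question, passages),
            "sources": [c["source"] for c in passages],
            "confidence": "high" if hits[0][0] >= 0.6 else "medium"}

low = answer_or_fallback("q", [(0.2, {"source": "a"})], generate)
ok = answer_or_fallback("q", [(0.7, {"source": "a"})], generate)
```

The design choice is that generation is skipped entirely below the threshold, so a weak retrieval can never produce a confident-sounding fabrication.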
Multi-Document Reasoning
Challenge: Some questions require synthesising information across several retrieved passages rather than answering from a single chunk.
Solution: Iterative retrieval with a reasoning step that identifies sub-questions, retrieves for each, and synthesises a final answer — implemented using a lightweight agentic loop.
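The agentic loop's control flow can be sketched as follows; `decompose` and `synthesize` stand in for LLM calls, and all names and the stub knowledge base are illustrative:

```python
def multi_hop_answer(question, decompose, retrieve, synthesize, max_steps=3):
    # Decompose into sub-questions, retrieve evidence for each, then
    # synthesise one answer over all gathered evidence.
    evidence = []
    for sub_q in decompose(question)[:max_steps]:
        evidence.extend(retrieve(sub_q))
    return synthesize(question, evidence)

# Stub collaborators to show the control flow.
kb = {"Who owns service X?": "Team A owns service X.",
      "What is Team A's escalation path?": "Page the Team A on-call."}
decompose = lambda q: list(kb)          # an LLM would produce these sub-questions
retrieve = lambda sub_q: [kb[sub_q]]    # single-chunk lookup per sub-question
synthesize = lambda q, ev: " ".join(ev)

result = multi_hop_answer("How do I escalate an issue with service X?",
                          decompose, retrieve, synthesize)
```

Neither stored chunk answers the original question alone; the answer only emerges once both retrieved passages are synthesised, which is exactly the multi-hop case described above.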
Retrieval Precision vs. Recall
Challenge: Retrieving too few results misses relevant content; too many introduces noise that confuses the model.
Solution: Hybrid dense + sparse retrieval followed by a cross-encoder re-ranker selecting the top-k most relevant passages, balancing precision and coverage.
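The re-ranking stage reduces to scoring each (query, passage) pair jointly and keeping the top-k; `overlap_score` below is a toy stand-in for a real cross-encoder forward pass:

```python
def rerank(query, candidates, score_pair, k=5):
    # score_pair sees the query and passage together, unlike the
    # first-stage retrievers, which score them independently.
    scored = sorted(candidates, key=lambda p: score_pair(query, p),
                    reverse=True)
    return scored[:k]

def overlap_score(query, passage):
    # Toy scorer: shared-token count stands in for the cross-encoder.
    return len(set(query.lower().split()) & set(passage.lower().split()))

top = rerank("reset my password",
             ["the cafeteria menu changes weekly",
              "reset your password via the account portal"],
             overlap_score, k=1)
```

Taking only the top-k survivors is what trades recall from the wide first stage for the precision the generation stage needs.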
Solutions Implemented
- Hybrid Retrieval: Combined vector similarity search with BM25 keyword matching for robust coverage across semantic and lexical query types.
- Cross-Encoder Re-ranking: Secondary relevance model scoring each retrieved passage against the query before passing to the LLM, improving answer grounding.
- Source Attribution: Every answer includes citations to the source documents and sections used, enabling users to verify and explore further.
- Incremental Indexing: Change-detection pipeline that re-embeds only modified documents, keeping the knowledge base fresh without expensive full re-indexing.
- API Layer: REST API exposing query, document ingestion, and knowledge base management endpoints for integration with internal tools and portals.
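The API surface might look like the dispatch table below; the paths, payload shapes, and handler behaviour are assumptions (the case study names query, ingestion, and management endpoints but not their exact contracts), and a real deployment would sit behind a web framework:

```python
def handle_query(body):
    # Would run retrieval + grounded generation.
    return {"answer": "...", "sources": []}

def handle_ingest(body):
    # Would enqueue the document for the ingestion pipeline.
    return {"status": "queued", "doc_id": body["doc_id"]}

def handle_delete(body):
    # Would drop the document's vectors from the index.
    return {"status": "deleted", "doc_id": body["doc_id"]}

ROUTES = {("POST", "/query"): handle_query,
          ("POST", "/documents"): handle_ingest,
          ("DELETE", "/documents"): handle_delete}

def dispatch(method, path, body):
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"error": "unknown endpoint"}
    return 200, handler(body)

status, resp = dispatch("POST", "/documents", {"doc_id": "hb-1"})
```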
Outcome & Impact
- Answer accuracy on in-scope queries
- Questions resolved without human escalation
- Reduced knowledge discovery time
- Source citations on every answer returned