Problem Statement
Large codebases accumulate complexity that is difficult to transfer through documentation or code review alone. New developers spend weeks mapping systems manually; experienced developers forget the rationale behind old decisions; no one knows which module owns a particular behaviour without asking the original author. The goal was an assistant that could answer "how does X work?", "where is Y implemented?", and "what does this function do?" accurately — using the actual codebase as its source of truth.
Key Challenges:
- Code has different retrieval semantics than prose — function boundaries, call graphs, and import structures matter
- Keeping the index current as the codebase changes
- Answering architectural questions that span many files
- Not hallucinating function signatures or behaviour that doesn't exist
- Handling multiple languages and frameworks within one codebase
System Architecture
The tool indexes the codebase in two complementary ways: static analysis extracts function signatures, class hierarchies, and import graphs, and an embedding model encodes code chunks as vectors. Queries retrieve relevant code chunks, which are passed with structural context to an LLM that grounds its answers in the actual code. A CLI and web interface expose queries to developers.
Static Analysis Layer
An AST-based parser extracts function signatures, class definitions, method call graphs, and import dependencies. This structural metadata augments embedding retrieval and enables precise queries such as "show me all callers of this function" or "what does this class inherit from?".
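The kind of extraction described above can be sketched with Python's standard `ast` module. This is an illustration only, not the project's parser: the function name `extract_structure`, the shape of the returned dictionary, and the sample module are all assumptions, and only direct calls by bare name are recorded as call edges.

```python
import ast


def extract_structure(source: str) -> dict:
    """Collect signatures, base classes, imports, and call edges for one module."""
    tree = ast.parse(source)
    info = {"functions": {}, "classes": {}, "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional parameter names stand in for a full signature here.
            info["functions"][node.name] = [arg.arg for arg in node.args.args]
            # Record (caller, callee) edges for direct calls by bare name only.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    info["calls"].append((node.name, inner.func.id))
        elif isinstance(node, ast.ClassDef):
            info["classes"][node.name] = [ast.unparse(base) for base in node.bases]
        elif isinstance(node, ast.Import):
            info["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            info["imports"].append(node.module or ".")
    return info


# Hypothetical module used only to exercise the sketch.
SOURCE = '''
import os
from pathlib import Path

class Loader(Base):
    def load(self, path):
        return read_text(path)

def read_text(path):
    return Path(path).read_text()
'''

info = extract_structure(SOURCE)
# "Show me all callers of this function" becomes a filter over the call edges.
callers_of_read_text = [caller for caller, callee in info["calls"] if callee == "read_text"]
```

A production parser would also need to resolve method calls, aliases, and cross-module names, which this sketch deliberately leaves out.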
Code Embedding Index
Code blocks are embedded at function/class granularity using a code-specialised embedding model. The vector index enables semantic retrieval — finding implementations related to a concept even when the exact term isn't in the code.
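As a rough illustration of retrieval over such an index, the sketch below ranks stored chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk identifiers are made up for the example; in practice the vectors would come from the embedding model, and a dedicated vector store would likely replace the linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda chunk_id: cosine(query_vec, index[chunk_id]), reverse=True)
    return ranked[:k]


# Toy vectors standing in for real embeddings (assumed values, not model output).
index = {
    "auth/session.py::refresh_token": [0.9, 0.1, 0.0],
    "billing/invoice.py::render_pdf": [0.0, 0.2, 0.9],
    "auth/login.py::verify_password": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # e.g. the embedded question "how do we renew credentials?"
hits = top_k(query_vec, index, k=2)
```

Both authentication chunks rank above the billing chunk even though the query text shares no identifier with them, which is the behaviour semantic retrieval is meant to provide.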
Retrieval-Augmented Generation
Retrieved code chunks combined with call graph context are injected into an LLM prompt that instructs the model to answer only from provided code, cite file paths and line numbers, and indicate when something is not present in the retrieved context.
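One way such a prompt might be assembled is sketched below. The `Chunk` dataclass, the instruction wording, and the section markers are assumptions made for illustration; the source does not specify the actual prompt template.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    start_line: int
    end_line: int
    code: str


# Hypothetical instruction text reflecting the three constraints described above.
INSTRUCTIONS = (
    "Answer using only the code below. "
    "Cite every claim as path:line. "
    "If the answer is not in the provided code, say so explicitly."
)


def build_prompt(question: str, chunks: list, call_edges: list) -> str:
    """Combine instructions, numbered code chunks, and call edges into one prompt."""
    parts = [INSTRUCTIONS, ""]
    for chunk in chunks:
        parts.append(f"### {chunk.path}:{chunk.start_line}-{chunk.end_line}")
        # Prefix each line with its real line number so citations can be precise.
        for offset, line in enumerate(chunk.code.splitlines()):
            parts.append(f"{chunk.start_line + offset}: {line}")
        parts.append("")
    if call_edges:
        parts.append("### Call graph")
        parts.extend(f"{caller} -> {callee}" for caller, callee in call_edges)
        parts.append("")
    parts.append(f"Question: {question}")
    return "\n".join(parts)
```

Numbering the lines inside the prompt gives the model something concrete to cite, which in turn makes the citations checkable afterwards.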
Incremental Indexing
An update pipeline driven by Git hooks and a file watcher re-indexes only the files that changed, keeping the knowledge base current on every commit without an expensive full reindex.
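The core of any incremental scheme is deciding which files need re-indexing. The sketch below does this by comparing content hashes against a stored manifest; this is a stand-in technique chosen so the example is self-contained, and a Git hook could instead take the changed paths from something like `git diff --name-only`. The function names and the manifest format are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as a change fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(root: Path, manifest: dict, pattern: str = "*.py"):
    """Compare the tree under root with the previous manifest.

    Returns (changed, removed, new_manifest), where changed lists new or
    modified files and removed lists files that should leave the index.
    """
    current = {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob(pattern))
        if path.is_file()
    }
    changed = [name for name, digest in current.items() if manifest.get(name) != digest]
    removed = [name for name in manifest if name not in current]
    return changed, removed, current
```

Only the paths in `changed` would be re-parsed and re-embedded, and entries for `removed` paths would be deleted from the index.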
Key Engineering Challenges
Code-Appropriate Chunking
Challenge: Splitting code at arbitrary character counts breaks function context, degrading both retrieval and answer quality.
Solution: AST-aware chunking splits code at function and class boundaries, so each chunk holds a complete callable unit with its docstring, signature, and body intact.
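For Python sources, boundary-aware chunking can be sketched with the standard `ast` module, which records the first and last line of every definition. The chunk dictionary layout and the choice to keep each top-level class as a single chunk are assumptions for this example.

```python
import ast


def chunk_source(source: str, path: str) -> list:
    """Split a module into one chunk per top-level function or class."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so the chunk is complete.
            start = min([node.lineno] + [dec.lineno for dec in node.decorator_list])
            end = node.end_lineno
            chunks.append({
                "path": path,
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


# Hypothetical module used only to exercise the sketch.
SAMPLE = '''import math

def area(r):
    """Return the area of a circle."""
    return math.pi * r ** 2

class Shape:
    def name(self):
        return "shape"
'''

chunks = chunk_source(SAMPLE, "geometry.py")
```

Because each chunk carries its file path and line range, the same records can later back the file:line citations in answers.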
Cross-File Reasoning
Challenge: Architectural questions span many files — understanding a feature requires tracing calls across modules.
Solution: Static call graph traversal expands retrieved results to include callers and callees, providing multi-file context windows for questions about system behaviour.
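The expansion step can be pictured as a bounded breadth-first walk over call edges in both directions. The sketch below assumes the call graph is available as a list of (caller, callee) pairs; the function name, the `max_hops` parameter, and the sample edges are hypothetical.

```python
from collections import deque


def expand_with_neighbors(seeds, call_edges, max_hops=1):
    """Return the seed functions plus callers and callees within max_hops."""
    callees, callers = {}, {}
    for caller, callee in call_edges:
        callees.setdefault(caller, set()).add(callee)
        callers.setdefault(callee, set()).add(caller)

    seen = set(seeds)
    queue = deque((name, 0) for name in seeds)
    while queue:
        name, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not walk past the hop limit
        neighbors = callees.get(name, set()) | callers.get(name, set())
        for neighbor in sorted(neighbors):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Hypothetical call edges for illustration.
edges = [
    ("handle_request", "authenticate"),
    ("authenticate", "load_user"),
    ("load_user", "query_db"),
    ("render", "format_date"),
]
one_hop = expand_with_neighbors({"authenticate"}, edges, max_hops=1)
two_hops = expand_with_neighbors({"authenticate"}, edges, max_hops=2)
```

A hop limit matters in practice because the expanded set has to fit in the model's context window; widely used utility functions can otherwise pull in much of the codebase.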
Hallucinated Code References
Challenge: LLMs confidently describe function behaviour or signatures that don't exist, misleading developers.
Solution: Strict prompting requires a file:line citation for every claim, and a post-processing verification step checks that each cited location exists and contains the described content.
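A minimal version of the verification step might look like the sketch below. It assumes, purely for illustration, that answers cite code as a backticked symbol followed by `(path:line)`, and it treats a citation as valid only if the file is indexed, the line exists, and the symbol appears on that line. The real check on "described content" is presumably richer than this.

```python
import re

# Assumed citation format: `symbol` (path/to/file.ext:line)
CITATION = re.compile(r"`(\w+)`\s*\(([\w./-]+):(\d+)\)")


def verify_citations(answer: str, files: dict) -> list:
    """Check each citation in an answer against indexed file contents.

    files maps a path to its source text. Returns a list of
    (symbol, path, line, ok) tuples.
    """
    results = []
    for symbol, path, line_str in CITATION.findall(answer):
        line_no = int(line_str)
        lines = files.get(path, "").splitlines()
        in_range = path in files and 1 <= line_no <= len(lines)
        ok = in_range and symbol in lines[line_no - 1]
        results.append((symbol, path, line_no, ok))
    return results


# Hypothetical indexed file and model answer.
files = {
    "auth/session.py": "import time\n\ndef refresh_token(user):\n    return issue(user, time.time())\n"
}
answer = (
    "`refresh_token` (auth/session.py:3) issues a new token, and "
    "`revoke_all` (auth/session.py:40) clears every session."
)
report = verify_citations(answer, files)
```

Here the second citation fails because the file has no line 40, so the claim about `revoke_all` could be flagged or removed before the answer reaches the developer.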
Multi-Language Support
Challenge: A mixed Python/JavaScript/SQL codebase needs language-appropriate parsers and embedding strategies.
Solution: A language-detecting dispatcher routes each file to a dedicated AST parser for its language, while a single embedding model trained on mixed-language code corpora serves all of them.
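The dispatcher idea can be sketched as a lookup from file extension to language, then from language to a parser callable. Detecting language by extension is an assumption here, as are all the names; only a Python parser is registered because it is available in the standard library, and JavaScript or SQL would need their own parsers plugged in.

```python
import ast
from pathlib import Path

# Assumed mapping; a real detector might also inspect shebangs or file contents.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".sql": "sql",
}


def parse_python(source: str) -> list:
    """Return the names of top-level functions and classes in a Python module."""
    return [
        node.name
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]


# Registry of language -> parser; other languages are intentionally absent here.
PARSERS = {"python": parse_python}


def dispatch(path: str, source: str):
    """Return (language, symbols); symbols is None when no parser is registered."""
    language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
    parser = PARSERS.get(language)
    symbols = parser(source) if parser else None
    return language, symbols
```

Keeping detection and parsing in separate tables means a new language can be added by registering one extension entry and one parser, without touching the dispatch logic.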
Solutions Implemented
- AST-Aware Chunking: Function and class boundary splits preserving complete callable units with full context for accurate retrieval.
- Call Graph Expansion: Static analysis augmenting retrieved chunks with related callers and callees for multi-file architectural queries.
- Citation-Required Answers: Prompt constraints forcing file:line references for all claims, with automated verification against the index.
- Incremental Git Integration: Post-commit hooks triggering re-indexing of changed files, maintaining index freshness without batch reprocessing delays.
- Developer Interface: CLI for terminal-native queries and a web UI for browsing, bookmarking, and sharing code knowledge answers across the team.
Outcome & Impact
- New developer ramp-up time
- Citations: file:line references included
- Index updates: on every commit
- Languages: Python, JS, SQL and more