AI/ML 2024

AI-Assisted Code Knowledge Tool

Created an internal development assistant capable of analysing codebases and answering questions about architecture, dependencies, and behaviour. Combines static analysis with retrieval-augmented generation to help developers onboard faster, understand legacy systems, and locate relevant code without needing tribal knowledge.

Technology Stack:
Python · LLM · Embeddings · Static Analysis

Problem Statement

Large codebases accumulate complexity that is difficult to transfer through documentation or code review alone. New developers spend weeks mapping systems manually; experienced developers forget the rationale behind old decisions; no one knows which module owns a particular behaviour without asking the original author. The goal was an assistant that could answer "how does X work?", "where is Y implemented?", and "what does this function do?" accurately — using the actual codebase as its source of truth.

Key Challenges:

  • Code has different retrieval semantics than prose — function boundaries, call graphs, and import structures matter
  • Keeping the index current as the codebase changes
  • Answering architectural questions that span many files
  • Not hallucinating function signatures or behaviour that doesn't exist
  • Handling multiple languages and frameworks within one codebase

System Architecture

The tool indexes the codebase using static analysis to extract function signatures, class hierarchies, and import graphs alongside code embeddings. Queries retrieve relevant code chunks, which are passed with structural context to an LLM that answers grounded in actual code. A CLI and web interface expose queries to developers.

Static Analysis Layer

An AST-based parser extracts function signatures, class definitions, method call graphs, and import dependencies. This structural metadata augments embedding retrieval to enable precise queries like "show me all callers of this function" or "what does this class inherit from?"
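For the Python portion of the codebase, this extraction can be sketched with the standard-library `ast` module. The field names below (`name`, `args`, `bases`, `line`) are illustrative, not the project's actual schema:

```python
import ast

def extract_metadata(source: str) -> dict:
    """Walk a module's AST and collect function signatures and class bases."""
    tree = ast.parse(source)
    functions, classes = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            functions.append({"name": node.name, "args": args, "line": node.lineno})
        elif isinstance(node, ast.ClassDef):
            # ast.unparse (Python 3.9+) renders base-class expressions back to source.
            bases = [ast.unparse(b) for b in node.bases]
            classes.append({"name": node.name, "bases": bases, "line": node.lineno})
    return {"functions": functions, "classes": classes}

code = "class Cache(dict):\n    def get_or_load(self, key):\n        return self[key]\n"
meta = extract_metadata(code)
```

A real implementation would persist this metadata alongside the vector index so structural queries never need an LLM call at all.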

Code Embedding Index

Code blocks are embedded at function/class granularity using a code-specialised embedding model. The vector index enables semantic retrieval — finding implementations related to a concept even when the exact term isn't in the code.
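The retrieval step reduces to nearest-neighbour search over those vectors. The sketch below uses a toy bag-of-letters `embed` purely as a stand-in for the real code-specialised model (an assumption; the production system would call an embedding API and a proper vector store):

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding, for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: dict, k: int = 2) -> list[str]:
    """Rank indexed chunks by cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

index = {name: embed(body) for name, body in {
    "parse_config": "def parse_config(path): ...",
    "send_email": "def send_email(to, subject): ...",
}.items()}
```

Even with the toy embedding, `retrieve("config parsing", index, k=1)` surfaces `parse_config` first; with a real model, conceptually related code ranks highly even without lexical overlap.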

Retrieval-Augmented Generation

Retrieved code chunks combined with call graph context are injected into an LLM prompt that instructs the model to answer only from provided code, cite file paths and line numbers, and indicate when something is not present in the retrieved context.
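The prompt-assembly step can be sketched as below. The chunk fields and instruction wording are assumptions for illustration, not the project's actual prompt:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: retrieved chunks + citation rules + question."""
    context = "\n\n".join(
        f"# {c['path']}:{c['start']}-{c['end']}\n{c['code']}" for c in chunks
    )
    return (
        "Answer ONLY from the code below. Cite file:line for every claim. "
        "If the answer is not in the context, say so.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    )

prompt = build_prompt(
    "What does load() return?",
    [{"path": "app/cache.py", "start": 10, "end": 14,
      "code": "def load(key):\n    return store.get(key)"}],
)
```

Labelling each chunk with its path and line range is what makes the citation requirement enforceable downstream.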

Incremental Indexing

An update pipeline driven by git hooks and a file watcher re-indexes only changed files, keeping the knowledge base fresh on every commit without expensive full reindexing.
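The core of incremental indexing is change detection. A minimal sketch, using a content-hash manifest in place of the git hook so it runs without a repository (an assumption for illustration):

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(root: Path, manifest: dict) -> list[Path]:
    """Return files whose content hash differs from the last-seen manifest."""
    changed = []
    for path in sorted(root.rglob("*.py")):
        digest = file_hash(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest  # record for the next run
    return changed

# Demo: a fresh file is "changed" once, then stable until edited.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.py").write_text("x = 1\n")
manifest = {}
first = changed_files(tmp, manifest)
second = changed_files(tmp, manifest)
```

In the git-hook variant, `git diff --name-only` against the previous indexed commit plays the role of the manifest comparison.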

Key Engineering Challenges

Code-Appropriate Chunking

Challenge: Splitting code at arbitrary character counts breaks function context, degrading both retrieval and answer quality.

Solution: AST-aware chunking that splits at function and class boundaries, preserving complete callable units in each chunk with their docstrings, signatures, and bodies intact.
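For Python files, boundary-aware chunking falls out of the AST directly. A minimal sketch (chunk fields are assumptions):

```python
import ast

def chunk_source(source: str) -> list[dict]:
    """Split a module at top-level function/class boundaries, keeping each
    callable unit whole (signature, docstring, and body together)."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                # get_source_segment recovers the exact original text of the node.
                "code": ast.get_source_segment(source, node),
            })
    return chunks

source = '''def greet(name):
    """Say hello."""
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''
chunks = chunk_source(source)
```

A character-count splitter could cut `greet` between its docstring and body; the AST version cannot, which is the point.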

Cross-File Reasoning

Challenge: Architectural questions span many files — understanding a feature requires tracing calls across modules.

Solution: Static call graph traversal expands retrieved results to include callers and callees, providing multi-file context windows for questions about system behaviour.
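The expansion is a bounded graph traversal over the call graph built by the static-analysis layer. A sketch, assuming the graph is a plain dict of function → callees:

```python
from collections import deque

def expand(seeds: set, calls: dict, max_hops: int = 1) -> set:
    """Grow a set of retrieved functions by pulling in callers and callees
    up to max_hops away, so multi-file context reaches the LLM."""
    # Invert the callee map to find callers.
    callers = {}
    for fn, callees in calls.items():
        for c in callees:
            callers.setdefault(c, set()).add(fn)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        fn, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nb in calls.get(fn, set()) | callers.get(fn, set()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, hops + 1))
    return seen

calls = {"handler": {"validate", "save"}, "save": {"db_write"}}
context = expand({"save"}, calls)
```

Capping `max_hops` keeps the context window bounded; hop depth is a natural knob to trade recall against prompt size.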

Hallucinated Code References

Challenge: LLMs confidently describe function behaviour or signatures that don't exist, misleading developers.

Solution: Strict prompting requiring file:line citations for every claim, with a post-processing verification step checking that cited locations exist and contain the described content.
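The verification step can be sketched as a citation check against the index. The citation format and the `files` mapping (path → list of source lines) are assumptions:

```python
import re

# Matches citations like "app/cache.py:12" in model output.
CITE = re.compile(r"([\w./-]+\.py):(\d+)")

def verify(answer: str, files: dict) -> list[str]:
    """Return the citations whose file or line number does not exist."""
    problems = []
    for path, line in CITE.findall(answer):
        lines = files.get(path)
        if lines is None or not (1 <= int(line) <= len(lines)):
            problems.append(f"{path}:{line}")
    return problems

files = {"app/cache.py": ["def load(key):", "    return store.get(key)"]}
ok = verify("load() returns store.get(key) (app/cache.py:2).", files)
bad = verify("See app/cache.py:99 and utils.py:1.", files)
```

A stricter variant would also fuzzy-match the claimed content against the cited lines; existence checking alone already catches outright fabricated references.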

Multi-Language Support

Challenge: A mixed Python/JavaScript/SQL codebase needs language-appropriate parsers and embedding strategies.

Solution: Language-detecting dispatcher routing files to dedicated AST parsers per language while using a unified embedding model trained on mixed-language code corpora.
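The dispatcher itself is simple; a sketch routing by extension (parser names here are placeholders, and a real system would also sniff file content for ambiguous cases):

```python
# Extension -> parser routing table (illustrative names, not the real parsers).
PARSERS = {
    ".py": "python_ast_parser",
    ".js": "javascript_parser",
    ".sql": "sql_parser",
}

def dispatch(path: str) -> str:
    """Pick a language-specific parser for a file, with a plain-text fallback."""
    for ext, parser in PARSERS.items():
        if path.endswith(ext):
            return parser
    return "plain_text_fallback"
```

Keeping parsing per-language but embedding in one shared vector space is what lets a single query retrieve across Python, JavaScript, and SQL at once.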

Solutions Implemented

  • AST-Aware Chunking: Function and class boundary splits preserving complete callable units with full context for accurate retrieval.
  • Call Graph Expansion: Static analysis augmenting retrieved chunks with related callers and callees for multi-file architectural queries.
  • Citation-Required Answers: Prompt constraints forcing file:line references for all claims, with automated verification against the index.
  • Incremental Git Integration: Post-commit hooks triggering re-indexing of changed files, maintaining index freshness without batch reprocessing delays.
  • Developer Interface: CLI for terminal-native queries and a web UI for browsing, bookmarking, and sharing code knowledge answers across the team.

Outcome & Impact

  • 50% Faster Onboarding: new developer ramp-up time
  • Cited Every Answer: file:line references included
  • Real-Time Index Updates: on every commit
  • Multi-Language Support: Python, JS, SQL, and more