AI/ML 2024

AI Document Intelligence Platform

Designed a document processing pipeline that converts unstructured business documents — invoices, receipts, contracts, and forms — into structured, queryable records. The system combines OCR extraction, semantic classification, and contextual validation using rule engines and language models, with asynchronous background workers and a human-in-the-loop review interface.

Technology Stack:
Python, FastAPI, PostgreSQL, Redis, Celery, OCR, LLM, Vector Search

Problem Statement

Businesses routinely receive thousands of unstructured documents — invoices from vendors, receipts from field staff, contracts from partners, and forms from customers. Manual data entry is slow, error-prone, and creates bottlenecks in approval, auditing, and compliance workflows. The goal was to automate extraction and classification with sufficient accuracy to eliminate routine manual review while still allowing human verification on uncertain records.

Key Challenges:

  • Wide document variability — different layouts, fonts, languages, and quality levels
  • High accuracy requirements for financial and legal data
  • Confidence-aware routing to balance automation with human oversight
  • Scalable asynchronous processing for high document volumes
  • Audit trail requirements for compliance

System Architecture

The platform is built around a multi-stage processing pipeline. Documents are uploaded through a FastAPI service, queued in Redis, and processed by Celery workers that run OCR, classification, extraction, and validation in sequence. Results are stored in PostgreSQL with vector embeddings for semantic search and retrieval.

Ingestion Layer

A FastAPI endpoint accepts document uploads, performs initial format validation, creates a processing job, and pushes it to a Redis-backed Celery queue for asynchronous handling.
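The validate-then-enqueue step can be sketched without FastAPI or Celery so that it runs standalone. Everything named here (`MAGIC_BYTES`, `detect_format`, `Job`, `create_job`, and the in-memory queue) is a hypothetical stand-in: in the platform itself the upload arrives through the FastAPI endpoint and the job is handed to the Redis-backed Celery queue.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Magic-byte prefixes for the formats this sketch accepts (an assumed list).
MAGIC_BYTES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def detect_format(payload: bytes) -> Optional[str]:
    """Return the detected format, or None if the payload is not recognised."""
    for prefix, fmt in MAGIC_BYTES.items():
        if payload.startswith(prefix):
            return fmt
    return None


@dataclass
class Job:
    job_id: str
    filename: str
    fmt: str
    status: str = "queued"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def create_job(filename: str, payload: bytes, queue: deque) -> Job:
    """Validate the upload, create a job record, and enqueue it."""
    fmt = detect_format(payload)
    if fmt is None:
        raise ValueError(f"unsupported file format: {filename}")
    job = Job(job_id=str(uuid.uuid4()), filename=filename, fmt=fmt)
    queue.append(job)  # stand-in for dispatching a Celery task
    return job
```

Checking the file signature rather than the filename extension rejects mislabelled uploads before any worker time is spent on them.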

OCR & Extraction

An OCR engine extracts raw text and bounding-box data from images and PDFs. In a second pass, an LLM extracts structured fields (dates, amounts, parties, line items), using the positional information in the OCR layout.

Semantic Classification

Each document is classified by type (invoice, receipt, contract, form) using embeddings and a fine-tuned classifier. Classification confidence determines whether the record continues through automated processing or is placed in the manual review queue.

Validation & Storage

Rule engines validate extracted fields against business constraints (e.g., valid date ranges, VAT calculations, supplier whitelists). Valid records are committed to PostgreSQL with full audit metadata and vector embeddings for future retrieval.
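As an illustration of the kinds of rules listed above, the following sketch checks a date range, a VAT calculation, and a supplier whitelist. The `Invoice` fields, the approved-supplier set, the earliest date, and the rounding tolerance are all assumptions made for the example, not values from the platform.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List


@dataclass
class Invoice:
    supplier: str
    issue_date: date
    net: Decimal
    vat_rate: Decimal  # e.g. Decimal("0.20") for 20%
    vat: Decimal
    total: Decimal


# Hypothetical configuration values, for illustration only.
APPROVED_SUPPLIERS = {"Acme Ltd", "Globex GmbH"}
EARLIEST_DATE = date(2020, 1, 1)
TOLERANCE = Decimal("0.01")  # allow one cent of rounding drift


def validate(inv: Invoice, today: date) -> List[str]:
    """Return the names of the rules that failed; an empty list means valid."""
    failures = []
    if not (EARLIEST_DATE <= inv.issue_date <= today):
        failures.append("date_range")
    expected_vat = (inv.net * inv.vat_rate).quantize(Decimal("0.01"))
    if abs(expected_vat - inv.vat) > TOLERANCE:
        failures.append("vat_amount")
    if abs(inv.net + inv.vat - inv.total) > TOLERANCE:
        failures.append("total")
    if inv.supplier not in APPROVED_SUPPLIERS:
        failures.append("supplier_whitelist")
    return failures
```

Returning the list of failed rule names, rather than a single boolean, gives the audit trail and the reviewer a specific reason for each rejection. `Decimal` is used instead of floats so that monetary comparisons are exact.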

Key Engineering Challenges

Document Layout Variability

Challenge: Invoices and forms from different suppliers have radically different visual layouts, making rigid templates unworkable.

Solution: Combined positional OCR output with an LLM prompted to locate fields semantically rather than by fixed coordinates, enabling layout-agnostic extraction.
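One way to give an LLM layout context is to serialise the OCR tokens row by row with their coordinates and ask for fields by name. The sketch below shows that idea only; the `OcrToken` shape, the `row_tolerance` value, the prompt wording, and the field names are assumptions, and no model call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrToken:
    text: str
    x: int  # left edge of the bounding box, in pixels
    y: int  # top edge of the bounding box, in pixels


def layout_lines(tokens: List[OcrToken], row_tolerance: int = 10) -> List[str]:
    """Group tokens into visual rows (top to bottom), each ordered left to right."""
    rows: List[List[OcrToken]] = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        # Stay on the current row if the token is vertically close to its first token.
        if rows and abs(tok.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    return [
        " ".join(f"{t.text}@({t.x},{t.y})" for t in sorted(row, key=lambda t: t.x))
        for row in rows
    ]


def build_prompt(tokens: List[OcrToken], fields: List[str]) -> str:
    """Build an extraction prompt that carries layout information as text."""
    layout = "\n".join(layout_lines(tokens))
    wanted = ", ".join(fields)
    return (
        "Each token below is written as text@(x,y), grouped by visual row.\n"
        f"{layout}\n\n"
        f"Return a JSON object with these fields: {wanted}.\n"
        "Locate each field by its label and meaning, not by a fixed position. "
        "Use null for any field that is not present."
    )
```

Because the prompt describes where each token sits rather than assuming where a field should be, the same prompt can be reused across suppliers with different layouts.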

Confidence-Based Routing

Challenge: Determining when the system is reliable enough to fully automate versus when to escalate to a human reviewer.

Solution: Implemented per-field confidence scores aggregated into a document-level score. Records below configurable thresholds are routed to a prioritised review queue with pre-filled suggestions for the reviewer.

OCR Quality on Poor Scans

Challenge: Field documents often arrive as low-resolution photos with skew, shadows, or partial obstruction.

Solution: Applied pre-processing steps (deskew, contrast normalisation, denoising) before OCR, with fallback prompts instructing the LLM to reason about partially readable text.
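To make two of these pre-processing steps concrete, here is a dependency-free sketch of contrast normalisation (a min-max stretch) and denoising (a 3x3 median filter) on a grayscale image stored as nested lists. These particular algorithms are assumptions, since the source does not name the ones used; deskew is omitted, and a production pipeline would normally rely on an image-processing library rather than pure Python.

```python
from statistics import median_low
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255, row-major


def normalise_contrast(img: Image) -> Image:
    """Stretch pixel values linearly so they span the full 0-255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def median_denoise(img: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # The window is clipped at the image borders.
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = median_low(window)
    return out


def preprocess(img: Image) -> Image:
    """Apply contrast normalisation, then denoising, in the order listed above."""
    return median_denoise(normalise_contrast(img))
```

A median filter suits isolated speckle noise because a single outlier pixel never becomes the median of its neighbourhood, while edges of text strokes are preserved better than with simple averaging.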

Audit & Compliance Requirements

Challenge: Every extraction decision must be traceable for financial and regulatory audits.

Solution: Persisted the full processing trace per document — raw OCR output, LLM prompts and responses, validation rule outcomes, reviewer actions — in an immutable audit log table.
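An append-only table can be enforced at the database level with triggers that reject updates and deletes. The sketch below uses SQLite from the Python standard library purely so it runs standalone; the platform stores its data in PostgreSQL, and the table name, columns, and helper functions here are hypothetical.

```python
import json
import sqlite3
from typing import List, Tuple


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the audit table plus triggers that block UPDATE and DELETE."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS audit_log (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT NOT NULL,
            stage       TEXT NOT NULL,
            payload     TEXT NOT NULL,
            recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TRIGGER IF NOT EXISTS audit_log_no_update
        BEFORE UPDATE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        CREATE TRIGGER IF NOT EXISTS audit_log_no_delete
        BEFORE DELETE ON audit_log
        BEGIN
            SELECT RAISE(ABORT, 'audit_log is append-only');
        END;
        """
    )
    return conn


def record(conn: sqlite3.Connection, document_id: str, stage: str, payload: dict) -> None:
    """Append one processing step (OCR output, LLM exchange, rule result, ...)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO audit_log (document_id, stage, payload) VALUES (?, ?, ?)",
            (document_id, stage, json.dumps(payload, sort_keys=True)),
        )


def trace(conn: sqlite3.Connection, document_id: str) -> List[Tuple[str, dict]]:
    """Return the full processing trace for a document, in insertion order."""
    rows = conn.execute(
        "SELECT stage, payload FROM audit_log WHERE document_id = ? ORDER BY id",
        (document_id,),
    )
    return [(stage, json.loads(payload)) for stage, payload in rows]
```

Enforcing immutability in the database, instead of only in application code, means that a bug or an ad hoc script cannot silently rewrite history.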

Solutions Implemented

  • Multi-Stage Pipeline: OCR → LLM extraction → rule validation → confidence routing, all orchestrated by Celery workers with retry and dead-letter handling.
  • Semantic Field Extraction: LLM prompted with OCR layout context to locate and extract fields without relying on fixed template positions.
  • Human Review Interface: Web UI presenting low-confidence documents with field suggestions, bounding box highlights, and one-click approve or correct actions feeding back into training data.
  • Vector Search: Document embeddings enabling semantic retrieval — e.g., find all invoices similar to a flagged duplicate or retrieve contracts mentioning specific clauses.
  • Automated Reporting: Aggregated extraction metrics, compliance summaries, and vendor-level analytics generated on a scheduled basis from structured PostgreSQL records.
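The semantic retrieval in the Vector Search item comes down to ranking stored embeddings by similarity to a query embedding. The brute-force sketch below uses cosine similarity over an in-memory dictionary; the function names and the toy three-dimensional vectors are assumptions, and a real deployment would query the embeddings stored in PostgreSQL instead.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
    exclude: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return the k document ids closest to the query, best match first."""
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in index.items()
        if doc_id != exclude
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

For the duplicate-detection case, the flagged invoice's own embedding is used as the query and its id is passed as `exclude` so that it does not match itself.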

Outcome & Impact

  • 85% Automation Rate: documents processed without human review
  • 10x Processing Speed: compared to manual data entry
  • <3% Error Rate: on automated records
  • 100% Audit Coverage: full trace on every decision