Problem Statement
Businesses routinely receive thousands of unstructured documents — invoices from vendors, receipts from field staff, contracts from partners, and forms from customers. Manual data entry is slow, error-prone, and creates bottlenecks in approval, auditing, and compliance workflows. The goal was to automate extraction and classification with sufficient accuracy to eliminate routine manual review while still allowing human verification on uncertain records.
Key Challenges:
- Wide document variability — different layouts, fonts, languages, and quality levels
- High accuracy requirements for financial and legal data
- Confidence-aware routing to balance automation with human oversight
- Scalable asynchronous processing for high document volumes
- Audit trail requirements for compliance
System Architecture
The platform is built around a multi-stage processing pipeline. Documents are uploaded through a FastAPI service, queued in Redis, and processed by Celery workers that run OCR, classification, extraction, and validation in sequence. Results are stored in PostgreSQL with vector embeddings for semantic search and retrieval.
Ingestion Layer
A FastAPI endpoint accepts document uploads, performs initial format validation, creates a processing job, and pushes it to a Redis-backed Celery queue for asynchronous handling.
OCR & Extraction
The OCR engine extracts raw text and bounding-box data from images and PDFs. A second-pass LLM then extracts structured fields (dates, amounts, parties, line items), using the positional layout from the OCR output.
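A core step between OCR and the LLM pass is reconstructing reading order from raw token positions. This sketch (the `OcrToken` shape and tolerance value are assumptions, not the production schema) groups tokens into visual lines:

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    text: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    conf: float  # engine confidence, 0-1


def group_into_lines(tokens, y_tolerance=10):
    """Group OCR tokens into visual lines by vertical position,
    then sort each line left-to-right (reading order)."""
    lines: list[list[OcrToken]] = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if lines and abs(lines[-1][0].y - tok.y) <= y_tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [sorted(line, key=lambda t: t.x) for line in lines]
```

Serialising these lines, rather than a flat word soup, is what lets the downstream LLM reason about which label sits next to which value.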
Semantic Classification
Each document is classified by type (invoice, receipt, contract, form) using embeddings and a fine-tuned classifier. Classification confidence determines whether the record enters automated or manual review queues.
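The routing decision can be illustrated with a nearest-centroid classifier over embeddings. The toy 3-dimensional centroids and the 0.8 threshold below are placeholders; the real system uses high-dimensional embeddings from a fine-tuned model:

```python
import math

# Illustrative class centroids in a toy embedding space.
CENTROIDS = {
    "invoice":  [0.9, 0.1, 0.0],
    "receipt":  [0.7, 0.3, 0.1],
    "contract": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify(embedding, threshold=0.8):
    """Return (label, confidence, needs_review): documents whose best
    similarity falls below the threshold go to the manual review queue."""
    scores = {label: cosine(embedding, c) for label, c in CENTROIDS.items()}
    label = max(scores, key=scores.get)
    conf = scores[label]
    return label, conf, conf < threshold
```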
Validation & Storage
A rule engine validates extracted fields against business constraints (e.g., valid date ranges, VAT calculations, supplier whitelists). Valid records are committed to PostgreSQL with full audit metadata and vector embeddings for future retrieval.
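The rule checks mentioned above might look like the following sketch; the field names, tolerance, and whitelist entries are illustrative, not the production rule set:

```python
import datetime


def validate_invoice(record: dict) -> list[str]:
    """Run business-rule checks; return a list of violation messages
    (an empty list means the record passes)."""
    errors = []
    # Date must be parseable and not in the future.
    try:
        issued = datetime.date.fromisoformat(record["date"])
        if issued > datetime.date.today():
            errors.append("date is in the future")
    except (KeyError, ValueError):
        errors.append("missing or malformed date")
    # VAT must match net * rate within a rounding tolerance.
    expected_vat = round(record["net"] * record["vat_rate"], 2)
    if abs(record["vat"] - expected_vat) > 0.01:
        errors.append(f"VAT mismatch: expected {expected_vat}")
    # Supplier must be on the approved list (illustrative whitelist).
    if record["supplier"] not in {"ACME GmbH", "Initech Ltd"}:
        errors.append("supplier not on whitelist")
    return errors
```

Returning a list of violations rather than a boolean lets the review UI show the reviewer exactly which constraints failed.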
Key Engineering Challenges
Document Layout Variability
Challenge: Invoices and forms from different suppliers have radically different visual layouts, making rigid templates unworkable.
Solution: Combined positional OCR output with an LLM prompted to locate fields semantically rather than by fixed coordinates, enabling layout-agnostic extraction.
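One way to realise this is to serialise the positional OCR output into the prompt itself, so the model locates fields by nearby labels rather than coordinates. The prompt wording below is a hedged sketch, not the production prompt:

```python
def build_extraction_prompt(ocr_lines, fields):
    """Serialise positional OCR output into a prompt that asks the model
    to locate fields semantically rather than by fixed coordinates.
    `ocr_lines` is a list of (row_position, line_text) pairs."""
    layout = "\n".join(f"[y={y}] {text}" for y, text in ocr_lines)
    wanted = ", ".join(fields)
    return (
        "The following lines were OCR'd from a document, in reading order, "
        "each prefixed with its vertical position.\n"
        f"{layout}\n\n"
        f"Extract these fields as JSON: {wanted}. "
        "Use the surrounding labels and layout to decide which value "
        "belongs to which field; respond with null for missing fields."
    )
```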
Confidence-Based Routing
Challenge: Determining when the system is reliable enough to fully automate versus when to escalate to a human reviewer.
Solution: Implemented per-field confidence scores aggregated into a document-level score. Records below configurable thresholds are routed to a prioritised review queue with pre-filled suggestions for the reviewer.
OCR Quality on Poor Scans
Challenge: Field documents often arrive as low-resolution photos with skew, shadows, or partial obstruction.
Solution: Applied pre-processing steps (deskew, contrast normalisation, denoising) before OCR, with fallback prompts instructing the LLM to reason about partially readable text.
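To illustrate one of these steps, here is a minimal min-max contrast stretch on a grayscale image represented as rows of 0-255 ints; the production pipeline performs deskew and denoising as well, using an image library rather than pure Python:

```python
def normalise_contrast(gray):
    """Min-max contrast stretch: remap pixel values so the darkest pixel
    becomes 0 and the brightest 255, recovering dynamic range lost in
    washed-out or shadowed photos before the image reaches OCR."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return gray  # uniform image: nothing to stretch
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in gray]
```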
Audit & Compliance Requirements
Challenge: Every extraction decision must be traceable for financial and regulatory audits.
Solution: Persisted the full processing trace per document — raw OCR output, LLM prompts and responses, validation rule outcomes, reviewer actions — in an immutable audit log table.
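The tamper-evidence property of such a log can be sketched with hash chaining, where each entry embeds the hash of its predecessor; this is an illustration of the "immutable" guarantee, while the real store is an append-only PostgreSQL table:

```python
import hashlib
import json


def append_audit(log: list[dict], event: dict) -> dict:
    """Append an event to a hash-chained audit log. Each entry embeds
    the previous entry's hash, so retroactive edits are detectable."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    entry = {"event": event, "prev": prev_hash,
             "hash": hashlib.sha256(body.encode()).hexdigest()}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Re-derive every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev}, sort_keys=True)
        if entry["prev"] != prev or entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True
```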
Solutions Implemented
- Multi-Stage Pipeline: OCR → LLM extraction → rule validation → confidence routing, all orchestrated by Celery workers with retry and dead-letter handling.
- Semantic Field Extraction: LLM prompted with OCR layout context to locate and extract fields without relying on fixed template positions.
- Human Review Interface: Web UI presenting low-confidence documents with field suggestions, bounding box highlights, and one-click approve or correct actions feeding back into training data.
- Vector Search: Document embeddings enabling semantic retrieval — e.g., find all invoices similar to a flagged duplicate or retrieve contracts mentioning specific clauses.
- Automated Reporting: Aggregated extraction metrics, compliance summaries, and vendor-level analytics generated on a scheduled basis from structured PostgreSQL records.
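The vendor-level rollups in the last bullet amount to a grouped aggregation over the structured records; the field names in this sketch are illustrative rather than the actual schema:

```python
from collections import defaultdict


def vendor_summary(records):
    """Aggregate structured invoice records into per-vendor counts and
    gross totals, the kind of rollup the scheduled reporting jobs emit."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        agg = totals[rec["supplier"]]
        agg["count"] += 1
        agg["total"] += rec["gross"]
    return dict(totals)
```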
Outcome & Impact
- Automation: share of documents processed without human review
- Throughput: processing speed compared to manual data entry
- Accuracy: extraction accuracy on automated records
- Auditability: full trace on every decision