RAG Solution: Financial Document Analysis with Graph-Based Provenance Tracking

Aug 10, 2023

For a financial analysis firm processing corporate financial documents, I developed a RAG system that combines vector search with graph-based provenance tracking. The solution enables natural language queries across complex financial documents while maintaining complete traceability of information sources through Neo4j’s native graph capabilities.

The Challenge

The client needed to analyze financial reports for multiple companies, extracting business insights, risk factors, and strategic analysis. Traditional keyword search failed to capture semantic meaning, while standard RAG implementations couldn't trace information lineage, which is critical for financial due diligence and compliance requirements.

Key requirements:

  • Query across multiple documents using natural language
  • Track provenance: which company, which section, which reporting period
  • Maintain document structure and sequential context
  • Handle verbose structured formats (markup-heavy corporate documents)
  • GPU-accelerated processing for batch analysis

Technical Architecture

Document Processing Pipeline:

I built an ETL pipeline to extract and structure financial documents:

  1. Downloaded documents from document repositories
  2. Isolated key sections
  3. Stripped XML/HTML markup using BeautifulSoup, reducing file size from MBs to KBs
  4. Chunked text using RecursiveCharacterTextSplitter (2000 chars, 200 char overlap)
  5. Generated embeddings and loaded into Neo4j with provenance metadata
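The markup-stripping and chunking steps (3-4) can be sketched in pure Python. Here `strip_markup` is a stdlib stand-in for the BeautifulSoup call and `chunk_text` mimics the fixed-size/overlap behavior of RecursiveCharacterTextSplitter; the production pipeline used those libraries directly:

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Stdlib stand-in for BeautifulSoup(...).get_text(): keeps text nodes, drops tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def get_text(self):
        return " ".join(p.strip() for p in self.parts if p.strip())

def strip_markup(html: str) -> str:
    stripper = HTMLStripper()
    stripper.feed(html)
    return stripper.get_text()

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200):
    """Fixed-size character chunks with overlap, approximating
    RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks
```

Each resulting chunk is then tagged with its provenance metadata (documentId, sectionType, chunkSeqId) before embedding and loading.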

Embedding Strategy:

  • Model: WhereIsAI/UAE-Large-V1 (AnglE embeddings, 1024-dimensional)
  • Pooling: CLS token strategy with Prompts.C optimization

Neo4j Graph Schema:

I designed a schema that serves dual purposes: vector search and provenance tracking.

Nodes:

  • Chunk with properties: chunkId, text, textEmbedding (vector), documentId, sectionType, chunkSeqId, companyName

Relationships:

  • NEXT - Links sequential chunks within the same document section, preserving document order

Indexes:

  • Vector index on textEmbedding (cosine similarity, 1024 dimensions)
  • Unique constraint on chunkId
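Depending on the Neo4j version, the index setup might look like the following Cypher (the index name is illustrative; newer releases also offer a CREATE VECTOR INDEX statement):

```cypher
// Unique constraint on chunkId
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE;

// 1024-dimensional cosine vector index on textEmbedding
CALL db.index.vector.createNodeIndex(
  'chunk_text_embeddings', 'Chunk', 'textEmbedding',
  1024, 'cosine'
);
```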

Provenance Implementation:

Each chunk maintains complete lineage:

  • documentId: Source document identifier
  • sectionType: Section category (business, risk, financial_analysis, market_risk)
  • chunkSeqId: Position within section
  • NEXT relationships: Sequential graph traversal for context expansion

When a query retrieves a chunk, the system can trace back to the exact company, reporting period, section, and position within that section.
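With the properties above, that trace is a single lookup (property names as in the schema; the reporting period is recoverable from the document identifier):

```cypher
// Trace a retrieved chunk back to its source and position
MATCH (c:Chunk {chunkId: $chunkId})
RETURN c.companyName AS company,
       c.documentId  AS document,
       c.sectionType AS section,
       c.chunkSeqId  AS position;
```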

Custom LangChain Retriever:

I implemented a custom N4JRetriever (BaseRetriever subclass) that:

  1. Encodes queries using the same AnglE model
  2. Executes native Neo4j vector search via Cypher:
    CALL db.index.vector.queryNodes(index_name, k, embedding)
    
  3. Returns top-k semantically similar chunks with similarity scores
  4. Integrates seamlessly with LangChain LCEL chains
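Expanded to return provenance alongside the similarity score, the retriever's Cypher might look like this (index name illustrative):

```cypher
CALL db.index.vector.queryNodes('chunk_text_embeddings', $k, $query_embedding)
YIELD node, score
RETURN node.text        AS text,
       node.companyName AS company,
       node.sectionType AS section,
       node.chunkSeqId  AS position,
       score
ORDER BY score DESC;
```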

Advanced Context Retrieval (Prototype):

I designed a window-based retrieval pattern that leverages the NEXT relationships:

MATCH window= (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)

This retrieves not just the matching chunk, but also ±1 neighboring chunks, providing richer context while maintaining provenance. The graph traversal ensures sequential coherence.
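A fuller form of that window query might look like the following; taking the longest matched window handles chunks that sit at a section boundary and so have fewer neighbors:

```cypher
// Expand a matched chunk to a ±1 window along NEXT, keeping document order
MATCH window = (:Chunk)-[:NEXT*0..1]->(node:Chunk {chunkId: $chunkId})-[:NEXT*0..1]->(:Chunk)
WITH window ORDER BY length(window) DESC LIMIT 1
UNWIND nodes(window) AS chunk
RETURN chunk.text AS text
ORDER BY chunk.chunkSeqId;
```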

Generation & Orchestration:

  • LLM: Cohere Command-R (standard queries) / Command-R-Plus (agent workflows)
  • Framework: LangChain LCEL for retrieval → prompt → generation pipeline
  • Prompt: Hub-based RAG prompt with source attribution formatting
  • Agent: Multi-tool ReAct agent with custom retriever as vectorstore search tool

Infrastructure

  • Neo4j with APOC and GenAI plugins
  • Docker containerized deployment
  • GPU server for embedding generation
  • Batch processing with error handling and progress logging
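The batch loop itself is plain Python. This sketch (function names hypothetical) shows the error-handling and progress-logging pattern, where one failing document never aborts the run:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("batch")

def process_batch(doc_ids, process_one):
    """Run process_one over every document, logging progress and isolating failures."""
    ok, failed = [], []
    for i, doc_id in enumerate(doc_ids, 1):
        try:
            process_one(doc_id)  # e.g. strip -> chunk -> embed -> load into Neo4j
            ok.append(doc_id)
            log.info("[%d/%d] processed %s", i, len(doc_ids), doc_id)
        except Exception:
            failed.append(doc_id)
            log.exception("[%d/%d] failed %s", i, len(doc_ids), doc_id)
    log.info("done: %d ok, %d failed", len(ok), len(failed))
    return ok, failed
```

Failed document IDs are returned so they can be retried in a follow-up pass rather than silently dropped.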

Technical Innovations

  1. Unified Vector + Graph Database: Neo4j serves dual roles, eliminating the need for a separate vector store and simplifying provenance queries
  2. Section-Aware Chunking: Maintains document structural integrity across different report sections
  3. Graph-Based Context Expansion: NEXT relationships enable semantic + structural retrieval
  4. Structured Format Processing: Custom pipeline handles verbose markup-heavy document formats

Results

  • Successfully processed multiple companies’ financial documents with complete provenance tracking
  • Enabled natural language queries with traceable, auditable answers
  • Vector search returns relevant chunks in milliseconds via Neo4j native indexing
  • Graph traversal provides document context while maintaining full audit trail
  • GPU acceleration enables efficient batch processing of large document collections
  • System deployed for production use with multiple concurrent users

Technologies

Python, Neo4j, LangChain, Cohere Command-R, AnglE Embeddings (WhereIsAI/UAE-Large-V1), Docker, BeautifulSoup, APOC, Cypher