RAG Solution: Financial Document Analysis with Graph-Based Provenance Tracking
For a financial analysis firm processing corporate financial documents, I developed a RAG system that combines vector search with graph-based provenance tracking. The solution enables natural language queries across complex financial documents while maintaining complete traceability of information sources through Neo4j’s native graph capabilities.
The Challenge
The client needed to analyze financial reports across multiple companies, extracting business insights, risk factors, and strategic analysis. Traditional keyword search failed to capture semantic meaning, while standard RAG implementations could not trace information lineage, a capability critical for financial due diligence and compliance requirements.
Key requirements:
- Query across multiple documents using natural language
- Track provenance: which company, which section, which reporting period
- Maintain document structure and sequential context
- Handle verbose structured formats (markup-heavy corporate documents)
- GPU-accelerated processing for batch analysis
Technical Architecture
Document Processing Pipeline:
I built an ETL pipeline to extract and structure financial documents:
- Downloaded documents from document repositories
- Isolated key sections
- Stripped XML/HTML markup using BeautifulSoup, reducing file sizes from megabytes to kilobytes
- Chunked text using RecursiveCharacterTextSplitter (2000 chars, 200 char overlap)
- Generated embeddings and loaded into Neo4j with provenance metadata
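The chunking step above can be sketched in plain Python. The actual pipeline uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers paragraph and sentence boundaries; this stdlib-only stand-in shows only the fixed-size/overlap behavior (2000 characters, 200 overlap):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    A simplified stand-in for RecursiveCharacterTextSplitter: it slides a
    window of `chunk_size` characters forward by `chunk_size - overlap`,
    so consecutive chunks share `overlap` characters of context.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary is still intact in the neighboring chunk, which matters when individual chunks are retrieved in isolation.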
Embedding Strategy:
- Model: WhereIsAI/UAE-Large-V1 (AnglE embeddings, 1024-dimensional)
- Pooling: CLS token strategy with Prompts.C optimization
Neo4j Graph Schema:
I designed a schema that serves dual purposes, vector search and provenance tracking:
Nodes:
- Chunk with properties: chunkId, text, textEmbedding (vector), documentId, sectionType, chunkSeqId, companyName
Relationships:
- NEXT: links sequential chunks within the same document section, preserving document order
Indexes:
- Vector index on textEmbedding (cosine similarity, 1024 dimensions)
- Unique constraint on chunkId
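The schema setup can be sketched as Cypher DDL generated from Python. The index and constraint names (`form_chunks`, `unique_chunk`) are illustrative, not from the project, and the `CREATE VECTOR INDEX` syntax shown is Neo4j 5.x (earlier releases used the `db.index.vector.createNodeIndex` procedure instead):

```python
def schema_statements(dim: int = 1024) -> list[str]:
    """Cypher DDL for the Chunk schema: a uniqueness constraint on chunkId
    and a cosine vector index on textEmbedding (Neo4j 5.x syntax)."""
    return [
        # One Chunk node per chunkId; also keeps re-ingestion idempotent
        # when loading chunks with MERGE.
        "CREATE CONSTRAINT unique_chunk IF NOT EXISTS "
        "FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE",
        # Vector index consulted by db.index.vector.queryNodes at query time.
        "CREATE VECTOR INDEX form_chunks IF NOT EXISTS "
        "FOR (c:Chunk) ON (c.textEmbedding) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dim}, "
        "`vector.similarity_function`: 'cosine'}}",
    ]
```

The dimension parameter (1024) must match the AnglE embedding model exactly, or index lookups will reject the query vectors.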
Provenance Implementation:
Each chunk maintains complete lineage:
- documentId: Source document identifier
- sectionType: Section category (business, risk, financial_analysis, market_risk)
- chunkSeqId: Position within section
- NEXT relationships: Sequential graph traversal for context expansion
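The per-chunk lineage above maps naturally onto a record type. This is a sketch: the field names mirror the node properties listed above, while the deterministic chunkId scheme in `chunk_id_for` is an assumption for illustration, not documented project behavior:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ChunkRecord:
    """Provenance metadata carried by each Chunk node (names mirror the
    Neo4j properties documentId, sectionType, chunkSeqId, companyName)."""
    chunk_id: str
    text: str
    document_id: str       # source document identifier
    section_type: str      # business | risk | financial_analysis | market_risk
    chunk_seq_id: int      # position within the section; drives NEXT links
    company_name: str
    text_embedding: list[float] = field(default_factory=list)

def chunk_id_for(document_id: str, section_type: str, seq: int) -> str:
    # Hypothetical deterministic id scheme: encoding the full lineage in
    # the id makes re-ingestion idempotent under the chunkId constraint.
    return f"{document_id}-{section_type}-chunk{seq:04d}"
```

Because every retrieved chunk carries these fields, an answer can always be traced back to company, section, and position without a second lookup.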
When a query retrieves a chunk, the system can trace back to the exact company, reporting period, section, and position within that section.
Custom LangChain Retriever:
I implemented a custom N4JRetriever (BaseRetriever subclass) that:
- Encodes queries using the same AnglE model
- Executes native Neo4j vector search via Cypher's CALL db.index.vector.queryNodes(index_name, k, embedding), returning the top-k semantically similar chunks with similarity scores
- Integrates seamlessly with LangChain LCEL chains
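The retriever's shape can be sketched as follows. The real N4JRetriever subclasses LangChain's BaseRetriever and talks to a live Neo4j driver; here the embedder and Cypher runner are injected as plain callables (an assumption made for testability), so only the query-construction logic is shown:

```python
from typing import Callable

# The procedure call itself is native Neo4j; the RETURN clause projecting
# provenance fields alongside the score is an illustrative choice.
VECTOR_QUERY = """
CALL db.index.vector.queryNodes($index_name, $k, $embedding)
YIELD node, score
RETURN node.text AS text, node.documentId AS documentId,
       node.sectionType AS sectionType, node.chunkSeqId AS chunkSeqId, score
"""

class N4JRetrieverSketch:
    """Simplified shape of the custom retriever: embed the query with the
    same AnglE model used at ingestion, then run the vector search."""

    def __init__(self, embed: Callable[[str], list[float]],
                 run_cypher: Callable[[str, dict], list[dict]],
                 index_name: str = "form_chunks", k: int = 5):
        self.embed = embed          # must be the ingestion-time model
        self.run_cypher = run_cypher
        self.index_name = index_name
        self.k = k

    def retrieve(self, query: str) -> list[dict]:
        params = {"index_name": self.index_name, "k": self.k,
                  "embedding": self.embed(query)}
        return self.run_cypher(VECTOR_QUERY, params)
```

Using the same embedding model for queries and documents is the one invariant here; a mismatch silently degrades retrieval quality rather than raising an error.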
Advanced Context Retrieval (Prototype):
I designed a window-based retrieval pattern that leverages the NEXT relationships:
MATCH window = (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
This retrieves not just the matching chunk, but also ±1 neighboring chunks, providing richer context while maintaining provenance. The graph traversal ensures sequential coherence.
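An in-memory analogue makes the window semantics concrete. Given a section's chunks in NEXT order and the index of the vector-search hit, the [:NEXT*0..1] pattern on each side yields the hit plus up to one neighbor per direction (this helper is illustrative; the real traversal happens in Cypher):

```python
def context_window(chunks: list[str], hit: int, radius: int = 1) -> str:
    """Return the matched chunk plus up to `radius` neighbours on each
    side, in document order -- mirroring the [:NEXT*0..1] graph pattern,
    which degrades gracefully at section boundaries."""
    lo = max(0, hit - radius)
    hi = min(len(chunks), hit + radius + 1)
    return " ".join(chunks[lo:hi])
```

Because NEXT edges never cross section boundaries, the window can never pull in text from an unrelated section, so the expanded context keeps the same provenance as the hit itself.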
Generation & Orchestration:
- LLM: Cohere Command-R (standard queries) / Command-R-Plus (agent workflows)
- Framework: LangChain LCEL for retrieval → prompt → generation pipeline
- Prompt: Hub-based RAG prompt with source attribution formatting
- Agent: Multi-tool ReAct agent with custom retriever as vectorstore search tool
Infrastructure
- Neo4j with APOC and GenAI plugins
- Docker containerized deployment
- GPU server for embedding generation
- Batch processing with error handling and progress logging
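The batch-processing harness can be sketched as follows; the function and logger names are illustrative. The key behavior from the pipeline is that per-item failures are logged and skipped rather than aborting the whole run, with progress reported per batch:

```python
import logging
from typing import Callable

log = logging.getLogger("ingest")

def process_batches(items: list, handle: Callable, batch_size: int = 32) -> dict:
    """Drive ingestion in batches: log-and-skip failures so one bad
    document cannot abort a long GPU embedding run."""
    ok, failed = 0, 0
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            try:
                handle(item)        # e.g. embed chunk and write to Neo4j
                ok += 1
            except Exception:
                failed += 1
                log.exception("failed on item %r", item)
        log.info("progress: %d/%d processed (%d failed)",
                 min(start + batch_size, len(items)), len(items), failed)
    return {"ok": ok, "failed": failed}
```

Returning the ok/failed tally lets the caller decide whether a partial run is acceptable or the failed items need a retry pass.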
Technical Innovations
- Unified Vector + Graph Database: Neo4j serves dual roles, eliminating the need for a separate vector store and simplifying provenance queries
- Section-Aware Chunking: Maintains document structural integrity across different report sections
- Graph-Based Context Expansion: NEXT relationships enable semantic + structural retrieval
- Structured Format Processing: Custom pipeline handles verbose markup-heavy document formats
Results
- Successfully processed multiple companies’ financial documents with complete provenance tracking
- Enabled natural language queries with traceable, auditable answers
- Vector search returns relevant chunks in milliseconds via Neo4j native indexing
- Graph traversal provides document context while maintaining full audit trail
- GPU acceleration enables efficient batch processing of large document collections
- System deployed for production use with multiple concurrent users
Technologies
Python, Neo4j, LangChain, Cohere Command-R, AnglE Embeddings (WhereIsAI/UAE-Large-V1), Docker, BeautifulSoup, APOC, Cypher