RAG Solution: Financial Document Analysis with Graph-Based Provenance Tracking

Aug 10, 2023

For a financial analysis firm processing corporate financial documents, I developed a RAG system that combines vector search with graph-based provenance tracking. The solution enables natural language queries across complex financial documents while maintaining complete traceability of information sources through Neo4j’s native graph capabilities.

The Challenge

The client needed to analyze financial reports for multiple companies, extracting business insights, risk factors, and strategic analysis. Traditional keyword search failed to capture semantic meaning, while standard RAG implementations couldn't trace information lineage, which is critical for financial due diligence and compliance requirements.

Key requirements:

  • Query across multiple documents using natural language
  • Track provenance: which company, which section, which reporting period
  • Maintain document structure and sequential context
  • Handle verbose structured formats (markup-heavy corporate documents)
  • GPU-accelerated processing for batch analysis

Technical Architecture

Document Processing Pipeline:

I built an ETL pipeline to extract and structure financial documents:

  1. Downloaded documents from document repositories
  2. Isolated key sections
  3. Stripped XML/HTML markup using BeautifulSoup, reducing file size from MBs to KBs
  4. Chunked text using RecursiveCharacterTextSplitter (2000 chars, 200 char overlap)
  5. Generated embeddings and loaded into Neo4j with provenance metadata
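The markup-stripping and chunking steps (3-4) can be sketched in pure Python. Here `strip_markup` is a stdlib stand-in for the BeautifulSoup call and `chunk_text` mimics the fixed-size/overlap behavior of RecursiveCharacterTextSplitter; the production pipeline used those libraries directly:

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Stdlib stand-in for BeautifulSoup(...).get_text(): keeps text nodes, drops tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def get_text(self):
        return " ".join(p.strip() for p in self.parts if p.strip())

def strip_markup(html: str) -> str:
    stripper = HTMLStripper()
    stripper.feed(html)
    return stripper.get_text()

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200):
    """Fixed-size character chunks with overlap, approximating
    RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks
```

Each resulting chunk is then tagged with its provenance metadata (documentId, sectionType, chunkSeqId) before embedding and loading.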

Embedding Strategy:

  • Model: WhereIsAI/UAE-Large-V1 (AnglE embeddings, 1024-dimensional)
  • Pooling: CLS token strategy with Prompts.C optimization

Neo4j Graph Schema:

I designed a schema that serves dual purposes: vector search and provenance tracking.

Nodes:

  • Chunk with properties: chunkId, text, textEmbedding (vector), documentId, sectionType, chunkSeqId, companyName

Relationships:

  • NEXT - Links sequential chunks within the same document section, preserving document order

Indexes:

  • Vector index on textEmbedding (cosine similarity, 1024 dimensions)
  • Unique constraint on chunkId
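Depending on the Neo4j version, the index setup might look like the following Cypher (the index name is illustrative; newer releases also offer a CREATE VECTOR INDEX statement):

```cypher
// Unique constraint on chunkId
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE;

// 1024-dimensional cosine vector index on textEmbedding
CALL db.index.vector.createNodeIndex(
  'chunk_text_embeddings', 'Chunk', 'textEmbedding',
  1024, 'cosine'
);
```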

Provenance Implementation:

Each chunk maintains complete lineage:

  • documentId: Source document identifier
  • sectionType: Section category (business, risk, financial_analysis, market_risk)
  • chunkSeqId: Position within section
  • NEXT relationships: Sequential graph traversal for context expansion

When a query retrieves a chunk, the system can trace back to the exact company, reporting period, section, and position within that section.
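With the properties above, that trace is a single lookup (property names as in the schema; the reporting period is recoverable from the document identifier):

```cypher
// Trace a retrieved chunk back to its source and position
MATCH (c:Chunk {chunkId: $chunkId})
RETURN c.companyName AS company,
       c.documentId  AS document,
       c.sectionType AS section,
       c.chunkSeqId  AS position;
```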

Custom LangChain Retriever:

I implemented a custom N4JRetriever (BaseRetriever subclass) that:

  1. Encodes queries using the same AnglE model
  2. Executes native Neo4j vector search via Cypher:
    CALL db.index.vector.queryNodes(index_name, k, embedding)
    
  3. Returns top-k semantically similar chunks with similarity scores
  4. Integrates seamlessly with LangChain LCEL chains
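Expanded to return provenance alongside the similarity score, the retriever's Cypher might look like this (index name illustrative):

```cypher
CALL db.index.vector.queryNodes('chunk_text_embeddings', $k, $query_embedding)
YIELD node, score
RETURN node.text        AS text,
       node.companyName AS company,
       node.sectionType AS section,
       node.chunkSeqId  AS position,
       score
ORDER BY score DESC;
```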

Advanced Context Retrieval (Prototype):

I designed a window-based retrieval pattern that leverages the NEXT relationships:

MATCH window= (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)

This retrieves not just the matching chunk, but also ±1 neighboring chunks, providing richer context while maintaining provenance. The graph traversal ensures sequential coherence.
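A fuller form of that window query might look like the following; taking the longest matched window handles chunks that sit at a section boundary and so have fewer neighbors:

```cypher
// Expand a matched chunk to a ±1 window along NEXT, keeping document order
MATCH window = (:Chunk)-[:NEXT*0..1]->(node:Chunk {chunkId: $chunkId})-[:NEXT*0..1]->(:Chunk)
WITH window ORDER BY length(window) DESC LIMIT 1
UNWIND nodes(window) AS chunk
RETURN chunk.text AS text
ORDER BY chunk.chunkSeqId;
```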

Generation & Orchestration:

  • LLM: Cohere Command-R (standard queries) / Command-R-Plus (agent workflows)
  • Framework: LangChain LCEL for retrieval → prompt → generation pipeline
  • Prompt: Hub-based RAG prompt with source attribution formatting
  • Agent: Multi-tool ReAct agent with custom retriever as vectorstore search tool

Infrastructure

  • Neo4j with APOC and GenAI plugins
  • Docker containerized deployment
  • GPU server for embedding generation
  • Batch processing with error handling and progress logging
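The batch loop itself is plain Python. This sketch (function names hypothetical) shows the error-handling and progress-logging pattern, where one failing document never aborts the run:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("batch")

def process_batch(doc_ids, process_one):
    """Run process_one over every document, logging progress and isolating failures."""
    ok, failed = [], []
    for i, doc_id in enumerate(doc_ids, 1):
        try:
            process_one(doc_id)  # e.g. strip -> chunk -> embed -> load into Neo4j
            ok.append(doc_id)
            log.info("[%d/%d] processed %s", i, len(doc_ids), doc_id)
        except Exception:
            failed.append(doc_id)
            log.exception("[%d/%d] failed %s", i, len(doc_ids), doc_id)
    log.info("done: %d ok, %d failed", len(ok), len(failed))
    return ok, failed
```

Failed document IDs are returned so they can be retried in a follow-up pass rather than silently dropped.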

Technical Innovations

  1. Unified Vector + Graph Database: Neo4j serves dual roles, eliminating the need for a separate vector store and simplifying provenance queries
  2. Section-Aware Chunking: Maintains document structural integrity across different report sections
  3. Graph-Based Context Expansion: NEXT relationships enable semantic + structural retrieval
  4. Structured Format Processing: Custom pipeline handles verbose markup-heavy document formats

Results

  • Successfully processed multiple companies’ financial documents with complete provenance tracking
  • Enabled natural language queries with traceable, auditable answers
  • Vector search returns relevant chunks in milliseconds via Neo4j native indexing
  • Graph traversal provides document context while maintaining full audit trail
  • GPU acceleration enables efficient batch processing of large document collections
  • System deployed for production use with multiple concurrent users

Technologies

Python, Neo4j, LangChain, Cohere Command-R, AnglE Embeddings (WhereIsAI/UAE-Large-V1), Docker, BeautifulSoup, APOC, Cypher