AI Game Narrative Extraction Pipeline with Multi-Stage LLM Architecture

Mar 20, 2026

For a sports media technology company, I designed and built a production AI pipeline that extracts compelling game narratives from live broadcast transcriptions. The system combines a three-stage LLM architecture (parallel extraction across 9 narrative categories, independent judge verification, and top-10 curation), prompt caching optimization for cost efficiency, and structured output enforcement via Pydantic schemas. Through dozens of prompt iteration rounds evaluated by subject matter experts across four LLM vendors (Claude, OpenAI, Gemini, Grok), I converged on Claude Opus 4.6 with a carefully engineered prompt system that produces broadcast-quality narratives grounded strictly in source material.

The Challenge

A sports media technology company producing automated voice-over content needed to extract the most compelling storylines from NBA game broadcasts. The process faced several challenges:

  • Noisy input data: Speech-to-text transcriptions contain recognition errors, partial scores, and misheard player names
  • Hallucination risk: LLMs tend to fabricate details, add dramatic context, or infer outcomes not stated by commentators
  • Narrative quality: Output must match broadcast-quality sports writing with vivid language while remaining factually grounded
  • Scale: Processing hundreds of games per season with 9 narrative categories each, requiring cost-efficient API usage
  • Evaluation difficulty: No automated metric captures whether a sports narrative “feels right” to a domain expert
  • Multi-sport extensibility: Architecture must support adding new sports without modifying core pipeline logic

They needed:

  • Multi-category narrative extraction covering momentum shifts, player performances, records, injuries, and storylines
  • Factual verification preventing hallucinated details while allowing vivid sports language
  • Final curation selecting the 10 most compelling narratives per game
  • Cost-optimized LLM usage for high-volume production processing
  • Sport-agnostic architecture supporting future expansion

Solution Architecture

I designed a three-stage pipeline replacing an earlier LangGraph-based approach with pure asyncio for better control over concurrency and prompt caching:

    graph LR
      DB[(Database)] --> retrieval[Data Retrieval]
      STT[Transcription API] --> retrieval
      retrieval --> prep[Context Preparation]
      prep --> extract[Parallel Extraction, 9 Categories]
      extract --> judge[Judge Verification, Per Element]
      judge --> top10[Top-10 Curation]
      top10 --> output[Ranked Narratives]

Stage 1: Data Retrieval and Context Preparation

Multi-Source Data Integration:

The pipeline retrieves and fuses data from multiple sources to build comprehensive game context:

  • Game Database: SQL Server queries fetch game metadata, team/player rosters with nickname resolution, and box score statistics (87 distinct stat types mapped to human-readable descriptions)
  • Transcription API: JWT-authenticated HTTP calls fetch speech-to-text transcriptions segmented by game period (pre-game, quarters, between-period breaks, overtime, post-game)
  • Statistical Ground Truth: Final game statistics serve as authoritative source for correcting STT numerical errors

Context Preparation:

A pure-function transformation layer converts raw game data into prompt-ready context. Transcriptions are formatted into structured text blocks with period headers, point-per-quarter breakdowns, and confirmed score sections. Player and team statistics are formatted with zero-value filtering, organized by winner/loser team, with accent-insensitive name matching via Unicode transliteration.
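
The accent-insensitive name matching can be sketched as a small pure function; `normalize_name` is a hypothetical helper illustrating the Unicode-transliteration approach (NFKD decomposition, combining-mark removal, case folding), not the production code.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Accent-insensitive, case-insensitive key for roster name matching.

    Decompose to NFKD, drop combining marks, then casefold, so that
    transcription spellings without diacritics still match the roster.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()

# "Nikola Jokić" and "nikola jokic" map to the same lookup key
```

Keying both the roster and the transcription mentions through one function like this keeps the matching logic in a single testable place.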

Stage 2: Parallel Extraction with Judge Verification

9-Category Parallel Extraction:

The system extracts narratives across nine categories, each with specialized prompt instructions:

  1. Game Momentum Changes: Scoring runs, comebacks, quarter dominance (mandatory timing context)
  2. Player Statistics: Individual game totals and notable performances
  3. Player Gets Red Hot: Period-specific hot streaks (mandatory period specification)
  4. Season Momentum: Streaks, records, standings implications
  5. Special Records/Stats: Career highs, franchise records, all-time marks
  6. Special Stories: Rivalries, player returns, coaching milestones (highest verification standard)
  7. Specific Plays: Decisive last-minute plays only
  8. Team Stats: Team-level metrics supporting the game narrative
  9. Injuries and Absences: Impact from key player injuries or returns

All nine categories run concurrently via asyncio.gather with semaphore-bounded concurrency, each producing structured StoryElement objects (narrative + source quote + category source).
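
A minimal sketch of the semaphore-bounded fan-out, assuming a `StoryElement` shape like the one described above; `extract_category` stands in for the real per-category LLM call.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StoryElement:
    narrative: str
    source_quote: str
    category: str

# Hypothetical stand-in for the real per-category LLM extraction call.
async def extract_category(category: str, context: str) -> list[StoryElement]:
    await asyncio.sleep(0)  # placeholder for the API round-trip
    return [StoryElement(f"{category} narrative", "quote", category)]

async def extract_all(categories: list[str], context: str,
                      max_concurrency: int = 4) -> list[StoryElement]:
    """Run all category extractions concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(category: str) -> list[StoryElement]:
        async with sem:
            return await extract_category(category, context)

    results = await asyncio.gather(*(bounded(c) for c in categories))
    return [element for batch in results for element in batch]
```

The semaphore caps in-flight API requests without serializing the whole batch, which is the main thing the LangGraph replacement needed to preserve.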

Independent Judge Verification:

Every extracted element is independently verified by a judge LLM call that checks:

  • Every fact in the narrative is traceable to the source quote
  • Player names match the official roster
  • Numerical statistics match confirmed final game statistics
  • No forbidden additions (timing, outcomes, locations, inferences not in source)
  • Category-specific requirements are met (e.g., momentum shifts must include when they occurred)

Elements receiving a “fail” verdict are filtered out with logged reasons, ensuring only factually grounded narratives proceed.
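
The filtering step itself is simple once the judge verdicts are in hand; `Verdict` here is an assumed shape for the judge LLM's structured reply, not the actual production schema.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("judge")

@dataclass
class Verdict:
    passed: bool
    reason: str

def filter_verified(elements: list, verdicts: list[Verdict]) -> list:
    """Keep only elements whose judge verdict passed; log reasons for the rest."""
    kept = []
    for element, verdict in zip(elements, verdicts):
        if verdict.passed:
            kept.append(element)
        else:
            logger.info("dropped element: %s", verdict.reason)
    return kept
```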

Stage 3: Top-10 Curation

A final LLM call receives all passing narratives and applies a structured curation algorithm:

  • Deduplication: Merges overlapping facts across categories
  • Aggregation: Unifies related facts about the same player/team/period using strict language constraints (factual connectors only, no interpretive verbs)
  • Scoring: 0-10 rubric across four factors: Outcome Impact (0-4), Rarity/Significance (0-3), Star Power (0-2), Timing (0-1)
  • Selection: Includes all facts scoring 8+, fills remaining slots by score with category diversity constraints
  • Validation: Checks distinctness, accuracy, language neutrality, length, and team balance (~60% winner / ~40% loser focus)
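
The curation itself is performed by the LLM, but the selection rule it is asked to follow can be expressed deterministically. The sketch below assumes a per-category cap as the diversity constraint; the exact constraint in production may differ.

```python
from dataclasses import dataclass

@dataclass
class ScoredFact:
    text: str
    category: str
    outcome_impact: int   # 0-4
    rarity: int           # 0-3
    star_power: int       # 0-2
    timing: int           # 0-1

    @property
    def score(self) -> int:
        return self.outcome_impact + self.rarity + self.star_power + self.timing

def select_top(facts: list[ScoredFact], limit: int = 10,
               max_per_category: int = 3) -> list[ScoredFact]:
    """Include high-scoring (8+) facts first, then fill remaining slots
    by score while skipping over-represented categories."""
    ranked = sorted(facts, key=lambda f: f.score, reverse=True)
    selected: list[ScoredFact] = []
    per_category: dict[str, int] = {}
    for fact in ranked:
        if len(selected) >= limit:
            break
        count = per_category.get(fact.category, 0)
        if fact.score >= 8 or count < max_per_category:
            selected.append(fact)
            per_category[fact.category] = count + 1
    return selected
```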

Prompt Engineering Methodology

Iterative Refinement with SME Evaluation:

The prompt system was developed through dozens of iteration cycles with sports domain experts. Each cycle followed a structured process:

  1. Run extraction on a batch of games across multiple game types (blowouts, close games, overtime thrillers)
  2. A subject matter expert reviews each narrative for factual accuracy, missing stories, and narrative quality
  3. Identify systematic failure patterns (e.g., LLM adding “late-game” timing not in source, inferring causal relationships)
  4. Encode failures as explicit forbidden patterns with good/bad examples in prompts
  5. Re-evaluate on the same batch plus new games to verify fixes without regressions

Multi-Vendor LLM Evaluation:

I conducted systematic evaluation across four LLM providers:

  • Claude (Anthropic): Opus 4.6 and Sonnet 4.5 via Azure API Management
  • OpenAI: o3 and GPT-5.1 via Azure OpenAI and Azure AI Foundry
  • Gemini (Google): Gemini 3 Pro Preview and Gemini 2.5 Pro via Vertex AI
  • Grok (xAI): Grok 4 via Azure AI Foundry

Each vendor was evaluated on: factual grounding (no hallucinated details), narrative quality (vivid but accurate sports language), structured output reliability, and instruction adherence (respecting forbidden additions). Based on comprehensive SME evaluation, Claude Opus 4.6 was selected for production, delivering the best balance of factual discipline, narrative quality, and consistent structured output compliance.

Prompt Architecture for Cache Optimization:

The prompt system is split into system prompt (cached) and user message (per-call) to maximize prompt caching:

  • System prompt (shared across all 9 category calls per game): Contains the extraction role definition, forbidden additions with examples, workflow instructions, output rules, score correction rules, footer instructions, game context (teams, rosters, statistics), and full STT transcription
  • User message (varies per category): Contains only the category-specific extraction instructions

This architecture means the large system prompt (~20-50KB depending on game length) is cached after the first category call, and the remaining 8 category calls benefit from cache reads, significantly reducing per-game API costs.
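
A sketch of how such a cache-aware request might be assembled. The payload shape follows Anthropic's prompt-caching convention (`cache_control: {"type": "ephemeral"}` on the system block); the builder itself is illustrative, not the production prompt-assembly code.

```python
def build_request(model: str, system_prompt: str, category_instructions: str) -> dict:
    """Assemble a Messages API payload that marks the large shared system
    prompt as cacheable, so the follow-up category calls read from cache."""
    return {
        "model": model,
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": system_prompt,  # role, rules, game context, full STT
                "cache_control": {"type": "ephemeral"},  # cached across calls
            }
        ],
        "messages": [
            # Only this small per-category part varies between the 9 calls.
            {"role": "user", "content": category_instructions}
        ],
    }
```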

Technical Implementation

Structured Output Enforcement:

For Anthropic’s API, I implemented strict JSON schema generation from Pydantic models. Anthropic’s grammar engine requires fully inlined schemas with additionalProperties: false on every object, no $ref/$defs references, and stripped metadata fields. I built a recursive schema resolver that inlines all Pydantic $ref references and enforces these constraints, enabling reliable structured output across nested models.
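
A simplified version of such a resolver, operating on a raw JSON-schema dict (e.g. the output of Pydantic's `model_json_schema()`); it ignores recursive/self-referential models, which the production version would have to handle.

```python
def inline_schema(schema: dict) -> dict:
    """Inline all $ref/$defs references and force additionalProperties: false
    on every object, as required for strict grammar-constrained output."""
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            if "$ref" in node:
                name = node["$ref"].rsplit("/", 1)[-1]
                return resolve(defs[name])
            # Drop $defs and metadata fields; recurse into everything else.
            out = {k: resolve(v) for k, v in node.items()
                   if k not in ("$defs", "title", "default")}
            if out.get("type") == "object":
                out["additionalProperties"] = False
            return out
        if isinstance(node, list):
            return [resolve(v) for v in node]
        return node

    return resolve(schema)
```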

Resilient Processing:

  • Retry configuration with exponential backoff (2s base, 2x multiplier, 600s max, 10 attempts) for long-running thinking requests
  • Per-category failure isolation: extraction failure for one category produces an empty list without affecting others
  • Per-element judge failure: element is skipped, other elements proceed
  • Per-game pipeline failure: logged and skipped, other games continue
  • Token tracking across all LLM calls with cache read metrics for cost monitoring
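
The retry schedule above works out to the following delays; a one-liner makes the cap behavior explicit.

```python
def backoff_delays(base: float = 2.0, multiplier: float = 2.0,
                   max_delay: float = 600.0, attempts: int = 10) -> list[float]:
    """Delay schedule for the retry configuration described above:
    exponential growth from the base, capped at the maximum delay."""
    return [min(base * multiplier ** i, max_delay) for i in range(attempts)]
```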

Sport-Agnostic Registry Pattern:

The architecture uses a frozen SportConfig dataclass registered at import time:

  • Categories, stat name mappings, prompt templates directory, stream priority, and retriever class are all sport-specific
  • Adding a new sport requires only: a new sport module with retriever and config, prompt template files, and a single import line
  • No changes needed to pipeline, extraction, preparation, prompt assembly, or LLM modules
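
The registry pattern itself can be sketched in a few lines; the field names on `SportConfig` here are illustrative, not the real ones.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SportConfig:
    name: str
    categories: tuple[str, ...]
    prompt_dir: str
    stat_names: dict[str, str] = field(default_factory=dict)

_REGISTRY: dict[str, SportConfig] = {}

def register_sport(config: SportConfig) -> None:
    """Called at import time by each sport module."""
    _REGISTRY[config.name] = config

def get_sport(name: str) -> SportConfig:
    return _REGISTRY[name]

# A new sport module only needs to build a config and register it:
register_sport(SportConfig(
    name="nba",
    categories=("momentum", "player_stats", "injuries"),
    prompt_dir="prompts/nba",
))
```

Because the pipeline resolves everything through `get_sport`, adding a sport really does reduce to one module plus one import, as described above.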

Key Design Decisions

Three-stage pipeline (extract → judge → curate) over single-pass extraction:

  • Single-pass LLMs conflate finding facts with evaluating importance, leading to either missed stories or low-quality ones
  • Separate judge stage catches systematic hallucination patterns that extraction prompts alone cannot prevent
  • Curation stage has complete picture of all categories, enabling intelligent aggregation and diversity balancing

asyncio.gather over LangGraph:

  • Pure asyncio over LangGraph for direct control over concurrency and prompt construction
  • No framework overhead or extra dependencies while maintaining parallel execution
  • Precise system/user prompt splitting required for cache optimization

Explicit forbidden-addition examples in prompts:

  • Generic instructions (“don’t hallucinate”) are ineffective; LLMs need concrete positive and negative examples
  • Each forbidden category (timing, outcomes, context, locations, scores, events, conclusions, past games) includes multiple good/bad example pairs
  • SME evaluation showed this approach reduced hallucination rate significantly compared to instruction-only prompts

STT error correction via confirmed statistics:

  • Broadcast transcription frequently mishears numbers (“42” when actual is “45”) and captures partial mid-game counts
  • Final game statistics serve as authoritative ground truth for all numerical corrections
  • Prompts include explicit priority rules: final statistics always override STT-mentioned numbers

Claude Opus 4.6 selection:

  • Best factual discipline: consistently avoided adding details not in source material
  • Strongest instruction adherence: respected complex forbidden-addition rules across all categories
  • Best narrative quality: vivid sports language without crossing into fabrication
  • Reliable structured output: consistent JSON schema compliance with Pydantic models

Results & Impact

  • Deployed production pipeline processing NBA games at scale for automated voice-over content generation
  • Architected three-stage LLM pipeline (extraction, judge verification, curation) ensuring factually grounded narratives
  • Engineered prompt system through dozens of SME-evaluated iteration rounds, systematically eliminating hallucination patterns
  • Evaluated four LLM vendors (Claude, OpenAI, Gemini, Grok) with structured comparison, selecting Claude Opus 4.6 for production
  • Implemented prompt caching architecture reducing API costs by sharing system prompts across 9 parallel category calls per game
  • Built sport-agnostic registry pattern enabling new sport addition without core pipeline changes
  • Designed structured output enforcement with strict JSON schema generation for Anthropic’s grammar engine
  • Achieved reliable factual grounding through explicit forbidden-addition patterns with concrete positive/negative examples

Technologies

Python, Claude Opus 4.6, Anthropic API, Structured Output, Pydantic, asyncio, Azure API Management, SQL Server, JWT Authentication, Prompt Engineering, Prompt Caching