Intelligent Few-Shot and Glossary Selection for Multi-Language Translation

Jun 15, 2025

For a sports media organization whose rigid translation system was limited to five languages, I designed and implemented an OpenSearch-based hybrid retrieval system combining BM25 lexical matching with neural semantic search and MMR diversity optimization. The solution replaced static CSV/JSON example storage with dynamic, relevance-based few-shot selection and a three-tier glossary matching pipeline, enabling scalable multi-directional translation across multiple sports domains.

The Challenge

The organization’s existing translation system had severe scalability limitations:

  • Few-shot examples stored in CSV/JSON files, requiring manual curation per language pair
  • Language-specific embedding models for each source language, making new languages expensive
  • Random or rule-based example selection producing inconsistent translation quality
  • Static glossaries in files, difficult to update and no fuzzy matching capability
  • Limited to English-to-five-languages translation; arbitrary X-to-Y language pairs were unsupported
  • Example selection algorithm didn’t account for diversity or edge case coverage

They needed:

  • Dynamic example retrieval based on semantic similarity to source text
  • Scalable architecture where adding new languages doesn’t require new embedding models
  • Intelligent selection balancing relevance and diversity
  • Sophisticated glossary matching handling spelling variations and synonyms
  • Multi-directional translation support (any source to any target language)

OpenSearch Migration & Few-Shot Selection

I implemented hybrid search combining lexical and semantic matching for intelligent example retrieval:

Multi-Lingual Embedding Strategy:

  • Deployed paraphrase-multilingual-MiniLM-L12-v2 via SageMaker endpoint generating 384-dimensional embeddings
  • Unified multi-lingual model eliminates the need for language-specific embedding models
  • Single model handles all source and target languages with shared semantic space
  • Supports 50+ languages including all major European and Asian languages
  • Batch processing for efficient encoding of large example repositories
  • Drastically simplified adding new language pairs (no model retraining required)
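As a sketch of the batch-encoding flow (the endpoint name and payload shape below are assumptions, not the production values), example texts can be chunked into fixed-size batches and sent to the SageMaker endpoint:

```python
import json

def batch_texts(texts, batch_size=32):
    """Split a list of texts into fixed-size batches for efficient encoding."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

def build_invoke_payload(batch):
    """Build a JSON request body for the embedding endpoint.
    The {"inputs": [...]} shape is an assumption; the real contract may differ."""
    return json.dumps({"inputs": batch})

# Hypothetical invocation (requires boto3 and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="multilingual-minilm-embeddings",  # assumed endpoint name
#     ContentType="application/json",
#     Body=build_invoke_payload(batch),
# )

batches = batch_texts([f"example {i}" for i in range(70)], batch_size=32)
```

Because the model maps all languages into one shared 384-dimensional space, the same batching path serves every language pair.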

Hybrid Search Implementation:

  • BM25 lexical matching captures exact keyword overlap and sports-specific terminology
  • Neural search using cosine similarity on embeddings for semantic matching
  • Combined scoring retrieves top-20 candidate examples
  • Filters by source language, target language, and sports domain for relevance
  • Hybrid approach outperforms pure semantic or pure lexical methods
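One way to express such a hybrid query (index schema and field names here are illustrative, not the production mapping) is a `bool` query whose `should` clauses carry the BM25 and k-NN scores while `filter` clauses pin language and domain:

```python
def build_hybrid_query(text, embedding, src_lang, tgt_lang, domain, k=20):
    """Build an OpenSearch query body combining BM25 (match) and k-NN (knn)
    scoring clauses, filtered by language pair and sports domain.
    Field names are assumptions for illustration."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"source_text": text}},  # BM25 lexical score
                    {"knn": {"embedding": {"vector": embedding, "k": k}}},  # neural score
                ],
                "filter": [
                    {"term": {"source_lang": src_lang}},
                    {"term": {"target_lang": tgt_lang}},
                    {"term": {"domain": domain}},
                ],
            }
        },
    }

query = build_hybrid_query("corner kick", [0.1] * 384, "en", "de", "soccer")
```

Filters run as non-scoring clauses, so language and domain constraints narrow the candidate pool without distorting the combined relevance score.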

Diversity Optimization with MMR:

  • Implemented Maximal Marginal Relevance algorithm with lambda=0.7 balancing relevance and diversity
  • Selects final k=5 examples from top-20 candidates
  • Iteratively selects next example that maximizes relevance to query while minimizing similarity to already-selected examples
  • Prevents repetitive examples by penalizing candidates too similar to previous selections
  • Ensures coverage of different phrasings, contexts, and edge cases
  • Measurably improved translation quality compared to pure top-k similarity ranking
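The selection loop above can be sketched in a few lines of plain Python (cosine similarity computed directly, no external dependencies):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, candidates, k=5, lam=0.7):
    """Maximal Marginal Relevance: iteratively pick the candidate maximizing
    lam * relevance-to-query - (1 - lam) * max-similarity-to-selected."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -float("inf")
        for i in remaining:
            relevance = cosine(query_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

Lowering `lam` shifts the trade-off toward diversity: a near-duplicate of an already-selected example gets penalized and a less similar but novel candidate wins instead.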

OpenSearch Architecture:

  • k-NN indexes built with the FAISS engine and HNSW graphs for fast approximate nearest-neighbor search
  • Integrated with SageMaker endpoint as external embedding model
  • Separate indexes for few-shot examples and glossary terms
  • Hybrid query execution combining BM25 and neural scoring in single pass
  • Sub-second retrieval latency enabling real-time example selection
  • Horizontal scaling supports growing example repositories across language pairs
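A minimal sketch of the few-shot example index mapping (field names are assumptions; the production schema may differ) ties the pieces together: text fields for BM25, keyword fields for filtering, and a `knn_vector` field sized to the MiniLM embedding:

```python
# Illustrative OpenSearch index mapping for the few-shot example index.
# Inner product on L2-normalized vectors is equivalent to cosine similarity,
# which keeps the FAISS engine's supported space types sufficient here.
example_index_mapping = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "source_text": {"type": "text"},     # BM25-scored
            "target_text": {"type": "text"},
            "source_lang": {"type": "keyword"},  # exact-match filters
            "target_lang": {"type": "keyword"},
            "domain": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,  # matches MiniLM-L12-v2 output
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "innerproduct",
                },
            },
        }
    },
}
```

Keeping the glossary terms in a separate index with the same vector field lets both retrieval paths share one embedding endpoint.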

Glossary Matching System

I designed a three-tier matching algorithm handling terminology variations:

Tier 1: Exact Matching with Aho-Corasick Automaton

  • Constructs efficient finite-state automaton from glossary terms
  • Handles multi-word terms and overlapping matches in single pass
  • O(n + m + z) matching complexity for text length n, total glossary length m, and z reported matches
  • Catches precise terminology usage (team names, player names, technical terms)
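A compact pure-Python version of the automaton (a from-scratch sketch, not the production implementation, which could equally use a library such as pyahocorasick) shows the single-pass, overlapping-match behavior:

```python
from collections import deque

def build_automaton(terms):
    """Build a minimal Aho-Corasick automaton: a goto trie plus
    BFS-computed failure links and inherited output sets."""
    goto, fail, out = [{}], [0], [[]]
    for term in terms:
        node = 0
        for ch in term:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(term)
    queue = deque(goto[0].values())  # root children keep fail = 0
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit suffix matches
    return goto, fail, out

def find_terms(text, automaton):
    """Scan the text once, yielding (start_index, term) for every hit,
    including overlapping and multi-word matches."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for term in out[node]:
            hits.append((i - len(term) + 1, term))
    return hits
```

Note how overlapping terms all surface in one scan: a single pass over "real madrid" reports the full phrase, the nested "madrid", and the embedded "id".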

Tier 2: Fuzzy Matching with RapidFuzz

  • Sliding window approach with configurable window size and stride
  • Generates n-gram candidates from source text for comparison against glossary
  • Partial ratio scoring with 90% cutoff threshold
  • Catches spelling variations, inflections, spacing differences, and minor typos
  • Particularly effective for morphologically rich languages with case/gender variations
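The sliding-window idea can be sketched as follows; difflib's `SequenceMatcher` stands in here for RapidFuzz's `partial_ratio` (scores scaled 0..1 rather than 0..100), and the word-level windowing is a simplification of the configurable window/stride scheme:

```python
from difflib import SequenceMatcher

def fuzzy_glossary_hits(text, glossary, cutoff=0.9):
    """Slide a window the size of each glossary term across the text and
    keep (term, window, score) triples whose similarity clears the cutoff."""
    words = text.lower().split()
    hits = []
    for term in glossary:
        size = len(term.split())
        for start in range(0, max(1, len(words) - size + 1)):
            window = " ".join(words[start:start + size])
            score = SequenceMatcher(None, term.lower(), window).ratio()
            if score >= cutoff:
                hits.append((term, window, round(score, 3)))
    return hits
```

A misspelling like "Bayern Munchen" still matches the glossary entry "Bayern München" because only one character differs inside the window.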

Tier 3: Semantic Matching with Embeddings

  • Encodes text windows and glossary terms using same SageMaker-hosted embedding model
  • Cosine similarity with 0.7 threshold for semantic relatedness
  • Captures synonyms and conceptually related terms not caught by string matching
  • Handles cases where terminology varies across different sports contexts (e.g., “goalkeeper” vs “goalie”)

Text Normalization Pipeline:

  • Unicode normalization (NFKC followed by NFD) for consistent character representation
  • Lowercasing and whitespace standardization
  • Diacritics removal for fuzzy matching tier
  • Ensures consistent matching across different text encodings and input sources
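The pipeline steps above map directly onto Python's stdlib `unicodedata` module:

```python
import unicodedata

def normalize_text(text, strip_diacritics=False):
    """NFKC folds compatibility forms (fullwidth chars, ligatures); NFD then
    decomposes accented letters so combining marks can be stripped for the
    fuzzy matching tier. Finally lowercase and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = unicodedata.normalize("NFD", text)
    if strip_diacritics:
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())
```

Running NFKC before NFD matters: NFKC first maps compatibility characters to their canonical letters, and NFD then exposes the combining marks that the diacritics filter removes.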

Cascading Strategy:

  • Tiers execute in sequence, each handling cases missed by previous tier
  • Exact matches take precedence, fuzzy fills gaps, semantic catches edge cases
  • Combined approach achieves high recall without excessive false positives
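The cascade itself reduces to a small orchestrator; the tier functions below are placeholders standing in for the exact, fuzzy, and semantic matchers described above, each returning (term, translation) pairs:

```python
def cascade_match(text, exact_tier, fuzzy_tier, semantic_tier):
    """Run the three tiers in precedence order: exact matches win outright,
    then fuzzy and semantic results fill in terms not already matched."""
    matched = dict(exact_tier(text))
    for tier in (fuzzy_tier, semantic_tier):
        for term, translation in tier(text):
            matched.setdefault(term, translation)
    return matched

result = cascade_match(
    "the goal and the keeper",
    exact_tier=lambda t: [("goal", "Tor")],
    fuzzy_tier=lambda t: [("goal", "Tor?"), ("keeper", "Torwart")],
    semantic_tier=lambda t: [("net", "Netz")],
)
```

Because later tiers only fill gaps, a fuzzy or semantic hit can never override a precise terminology match, which is what keeps false positives in check.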

Technical Innovations

1. Multi-Lingual Embedding Migration: Replaced language-specific embedding models with a unified multi-lingual sentence-transformers model deployed on SageMaker. Adding a new language now requires only adding training examples, not training new models.

2. Hybrid Search Architecture: Combined BM25 lexical and neural semantic search in OpenSearch. Captures both exact terminology matches and conceptually similar examples.

3. MMR for Example Diversity: Implemented diversity-aware retrieval algorithm preventing repetitive examples. Ensures few-shot examples cover varied contexts and edge cases.

4. Three-Tier Glossary Matching: Cascaded exact, fuzzy, and semantic matching algorithms. Handles terminology variations without manual synonym lists or extensive rule maintenance.

Results

  • Transformed limited English-to-5-languages system into extensible multi-directional translation service
  • OpenSearch-based few-shot retrieval demonstrated measurable quality improvement over baseline random selection
  • MMR diversity optimization improved translation consistency across varied input contexts
  • Three-tier glossary matching handles terminology variations across morphologically diverse languages
  • Architecture enables adding new language pairs by simply adding example data (no model retraining)
  • Supports multiple sports domains with domain-specific glossaries and examples
  • Sub-second retrieval latency maintains real-time translation performance

Technologies

Python, OpenSearch, FAISS, Sentence Transformers, SageMaker, BM25, Neural Search, MMR, Aho-Corasick, RapidFuzz, Multi-Lingual Embeddings