Intelligent Few-Shot and Glossary Selection for Multi-Language Translation
For a sports media organization whose translation system was rigidly limited to 5 languages, I designed and implemented OpenSearch-based hybrid retrieval combining BM25 lexical matching with neural semantic search and MMR diversity optimization. The solution replaced static CSV/JSON example storage with dynamic, relevance-based few-shot selection and a three-tier glossary matching pipeline, enabling scalable multi-directional translation across multiple sports domains.
The Challenge
The organization’s existing translation system had severe scalability limitations:
- Few-shot examples stored in CSV/JSON files, requiring manual curation per language pair
- Language-specific embedding models for each source language, making new languages expensive
- Random or rule-based example selection producing inconsistent translation quality
- Static glossaries stored in flat files, difficult to update and lacking any fuzzy matching capability
- Limited to English-to-5-languages translation, with no support for arbitrary X-to-Y language pairs
- Example selection algorithm didn’t account for diversity or edge case coverage
They needed:
- Dynamic example retrieval based on semantic similarity to source text
- Scalable architecture where adding new languages doesn’t require new embedding models
- Intelligent selection balancing relevance and diversity
- Sophisticated glossary matching handling spelling variations and synonyms
- Multi-directional translation support (any source to any target language)
OpenSearch Migration & Few-Shot Selection
I implemented hybrid search combining lexical and semantic matching for intelligent example retrieval:
Multi-Lingual Embedding Strategy:
- Deployed paraphrase-multilingual-MiniLM-L12-v2 via a SageMaker endpoint, generating 384-dimensional embeddings
- Unified multi-lingual model eliminates need for language-specific embedding models
- Single model handles all source and target languages with shared semantic space
- Supports 50+ languages including all major European and Asian languages
- Batch processing for efficient encoding of large example repositories
- Drastically simplified adding new language pairs (no model retraining required)
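A minimal sketch of the encoding step, assuming direct use of the sentence-transformers library (in production the same model sat behind a SageMaker endpoint; the example sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer

# One unified multilingual model: a shared 384-dimensional semantic space
# for all source and target languages, so no per-language models are needed.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

examples = [
    "El portero detuvo el penalti en el minuto 90.",
    "The goalkeeper saved the penalty in the 90th minute.",
]

# Batch encoding keeps indexing of large example repositories efficient.
# Normalized vectors let cosine similarity reduce to a dot product later.
embeddings = model.encode(examples, batch_size=64, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```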
Hybrid Search Implementation:
- BM25 lexical matching captures exact keyword overlap and sports-specific terminology
- Neural search using cosine similarity on embeddings for semantic matching
- Combined scoring retrieves top-20 candidate examples
- Filters by source language, target language, and sports domain for relevance
- Hybrid approach outperforms pure semantic or pure lexical methods
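A hedged sketch of the candidate-retrieval query using opensearch-py. The field and index names (`source_text`, `embedding`, `few_shot_examples`, ...) are assumptions, and the additive bool scoring here is a simplification of how BM25 and k-NN scores can be combined:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_candidates(query_text, query_vector, src, tgt, domain, k=20):
    """Fetch top-k few-shot candidates: BM25 keyword overlap plus
    approximate k-NN semantic similarity, filtered by language pair
    and sports domain. query_vector is a 384-dim list of floats."""
    body = {
        "size": k,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"source_lang": src}},
                    {"term": {"target_lang": tgt}},
                    {"term": {"domain": domain}},
                ],
                "should": [
                    # Lexical: exact keyword and terminology overlap (BM25)
                    {"match": {"source_text": query_text}},
                    # Semantic: approximate k-NN over stored embeddings
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ],
            }
        },
    }
    return client.search(index="few_shot_examples", body=body)["hits"]["hits"]
```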
Diversity Optimization with MMR:
- Implemented Maximal Marginal Relevance algorithm with lambda=0.7 balancing relevance and diversity
- Selects final k=5 examples from top-20 candidates
- Iteratively selects the next example that maximizes relevance to the query while minimizing similarity to already-selected examples
- Prevents repetitive examples by penalizing candidates too similar to previous selections
- Ensures coverage of different phrasings, contexts, and edge cases
- Measurably improved translation quality compared to pure top-k similarity ranking
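The MMR step itself is compact. A self-contained sketch (function and variable names are mine), assuming L2-normalized embeddings so dot products equal cosine similarities:

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k=5, lam=0.7):
    """Maximal Marginal Relevance over the candidate pool: repeatedly
    pick the candidate maximizing
        lam * sim(query, cand) - (1 - lam) * max_sim(cand, selected),
    so each new example is relevant but not redundant."""
    relevance = cand_vecs @ query_vec        # sim(query, candidate)
    pairwise = cand_vecs @ cand_vecs.T       # sim(candidate, candidate)
    selected = [int(np.argmax(relevance))]   # seed with the most relevant
    while len(selected) < min(k, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]
        # Redundancy penalty: similarity to the closest already-chosen example.
        redundancy = pairwise[remaining][:, selected].max(axis=1)
        scores = lam * relevance[remaining] - (1 - lam) * redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected  # indices into cand_vecs, in selection order
```

With lambda=0.7 the score still leans toward relevance, but the redundancy term is enough to push near-duplicate examples out of the final five.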
OpenSearch Architecture:
- FAISS HNSW indexing for fast approximate nearest neighbor search
- Integrated with SageMaker endpoint as external embedding model
- Separate indexes for few-shot examples and glossary terms
- Hybrid query execution combining BM25 and neural scoring in single pass
- Sub-second retrieval latency enabling real-time example selection
- Horizontal scaling supports growing example repositories across language pairs
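An illustrative index mapping for the few-shot example index (field names are assumptions). The `knn_vector` field is backed by FAISS HNSW; with normalized embeddings, inner product is equivalent to cosine similarity:

```python
# Index body for client.indices.create(index="few_shot_examples", body=...)
few_shot_index = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "source_text": {"type": "text"},      # analyzed for BM25
            "target_text": {"type": "text"},
            "source_lang": {"type": "keyword"},   # exact-match filters
            "target_lang": {"type": "keyword"},
            "domain": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,                 # MiniLM-L12-v2 output size
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "innerproduct", # == cosine on unit vectors
                },
            },
        }
    },
}
```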
Glossary Matching System
I designed a three-tier matching algorithm handling terminology variations:
Tier 1: Exact Matching with Aho-Corasick Automaton
- Constructs efficient finite-state automaton from glossary terms
- Handles multi-word terms and overlapping matches in single pass
- Automaton construction is O(m) in total glossary length; search is O(n + z) for text of length n with z matches
- Catches precise terminology usage (team names, player names, technical terms)
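A minimal sketch with the pyahocorasick library; the glossary entries shown are placeholders:

```python
import ahocorasick  # pip install pyahocorasick

# Build the automaton once from (normalized) glossary terms; after that,
# any text can be scanned for every term in a single pass.
automaton = ahocorasick.Automaton()
for term, translation in [("champions league", "Liga de Campeones"),
                          ("penalty kick", "penalti")]:
    automaton.add_word(term, (term, translation))
automaton.make_automaton()

text = "a dramatic penalty kick decided the champions league final"
for end_idx, (term, translation) in automaton.iter(text):
    start_idx = end_idx - len(term) + 1
    print(f"{term!r} -> {translation!r} at [{start_idx}:{end_idx + 1}]")
```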
Tier 2: Fuzzy Matching with RapidFuzz
- Sliding window approach with configurable window size and stride
- Generates n-gram candidates from source text for comparison against glossary
- Partial ratio scoring with 90% cutoff threshold
- Catches spelling variations, inflections, spacing differences, and minor typos
- Particularly effective for morphologically rich languages with case/gender variations
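A sketch of the sliding-window fuzzy tier; the window size, stride, and cutoff mirror the description above, and the helper name is mine:

```python
from rapidfuzz import fuzz

def fuzzy_glossary_matches(text, glossary, window=4, stride=1, cutoff=90.0):
    """Slide a word-level window over the text and score each window
    against every glossary term with partial_ratio, keeping hits >= cutoff."""
    words = text.lower().split()
    hits = []
    for i in range(0, max(len(words) - window + 1, 1), stride):
        chunk = " ".join(words[i:i + window])
        for term in glossary:
            score = fuzz.partial_ratio(term.lower(), chunk)
            if score >= cutoff:
                hits.append((term, chunk, score))
    return hits

# Catches inflections and small typos, e.g. "penalti kicks" ~ "penalty kick".
print(fuzzy_glossary_matches("two penalti kicks in extra time", ["penalty kick"]))
```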
Tier 3: Semantic Matching with Embeddings
- Encodes text windows and glossary terms using same SageMaker-hosted embedding model
- Cosine similarity with 0.7 threshold for semantic relatedness
- Captures synonyms and conceptually related terms not caught by string matching
- Handles cases where terminology varies across different sports contexts (e.g., “goalkeeper” vs “goalie”)
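A sketch of the semantic tier, reusing the same multilingual model (the text windows are assumed to come from the fuzzy tier's sliding-window generation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_glossary_matches(windows, glossary_terms, threshold=0.7):
    """Tier 3: flag text windows semantically related to glossary terms,
    e.g. synonyms like 'goalie' for 'goalkeeper' that string matching misses."""
    w_vecs = model.encode(windows, normalize_embeddings=True)
    g_vecs = model.encode(glossary_terms, normalize_embeddings=True)
    sims = w_vecs @ g_vecs.T  # cosine similarity (vectors are unit-length)
    return [(windows[i], glossary_terms[j], float(sims[i, j]))
            for i, j in zip(*np.where(sims >= threshold))]

matches = semantic_glossary_matches(["the goalie made a great save"],
                                    ["goalkeeper"])
```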
Text Normalization Pipeline:
- Unicode normalization (NFKC followed by NFD) for consistent character representation
- Lowercasing and whitespace standardization
- Diacritics removal for fuzzy matching tier
- Ensures consistent matching across different text encodings and input sources
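The whole pipeline fits in a few lines of standard-library code; a sketch:

```python
import re
import unicodedata

def normalize(text, strip_diacritics=False):
    """NFKC first (fold compatibility forms like ligatures and full-width
    characters), then NFD so combining marks can be stripped when the
    fuzzy tier asks for diacritics-free text."""
    text = unicodedata.normalize("NFKC", text)
    text = unicodedata.normalize("NFD", text)
    if strip_diacritics:
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("Bayern  München", strip_diacritics=True))  # bayern munchen
```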
Cascading Strategy:
- Tiers execute in sequence, each handling cases missed by previous tier
- Exact matches take precedence, fuzzy fills gaps, semantic catches edge cases
- Combined approach achieves high recall without excessive false positives
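A sketch of the cascade as a driver function; the tier callables correspond to the three sketches above, and their exact signatures are illustrative:

```python
def match_glossary(text, glossary, exact_tier, fuzzy_tier, semantic_tier):
    """Run the tiers in sequence: exact matches take precedence, fuzzy
    fills gaps among still-unmatched terms, semantic catches the rest.
    Each tier is a callable returning the set of glossary terms it
    matched in `text` (see the tier sketches above)."""
    matched = set(exact_tier(text, glossary))

    remaining = [t for t in glossary if t not in matched]
    matched |= set(fuzzy_tier(text, remaining))

    remaining = [t for t in remaining if t not in matched]
    matched |= set(semantic_tier(text, remaining))
    return matched
```

Restricting each later tier to still-unmatched terms is what keeps recall high without letting the looser tiers pile up false positives.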
Technical Innovations
1. Multi-Lingual Embedding Migration: Replaced language-specific embedding models with a unified multi-lingual sentence-transformers model deployed on SageMaker. Adding a new language now requires only adding translation examples, not training new models.
2. Hybrid Search Architecture: Combined BM25 lexical and neural semantic search in OpenSearch. Captures both exact terminology matches and conceptually similar examples.
3. MMR for Example Diversity: Implemented diversity-aware retrieval algorithm preventing repetitive examples. Ensures few-shot examples cover varied contexts and edge cases.
4. Three-Tier Glossary Matching: Cascaded exact, fuzzy, and semantic matching algorithms. Handles terminology variations without manual synonym lists or extensive rule maintenance.
Results
- Transformed the limited English-to-5-languages system into an extensible multi-directional translation service
- OpenSearch-based few-shot retrieval demonstrated measurable quality improvement over baseline random selection
- MMR diversity optimization improved translation consistency across varied input contexts
- Three-tier glossary matching handles terminology variations across morphologically diverse languages
- Architecture enables adding new language pairs by simply adding example data (no model retraining)
- Supports multiple sports domains with domain-specific glossaries and examples
- Sub-second retrieval latency maintains real-time translation performance
Technologies
Python, OpenSearch, FAISS, Sentence Transformers, SageMaker, BM25, Neural Search, MMR, Aho-Corasick, RapidFuzz, Multi-Lingual Embeddings