Conversational Support Summarization with Large Language Models
For a customer service organization processing thousands of daily support conversations, I developed an automated summarization system using fine-tuned FLAN-T5. The solution generates concise summaries for management review, reducing manual analysis time and enabling rapid identification of critical issues and customer sentiment trends.
The Challenge
Customer service managers needed to review large volumes of support conversations for quality assurance and trend analysis, but manual review was time-consuming and inconsistent. The organization required:
- Automated generation of 6-7 sentence summaries highlighting key issues and resolutions
- Identification of escalation-worthy conversations
- Daily aggregate reporting on common problems and customer concerns
- Integration with existing MongoDB-based support system
- Low-latency inference to support near-real-time summarization
Model Selection & Fine-Tuning
Base Model: FLAN-T5
I selected FLAN-T5 as the foundation model for its strong instruction-following capabilities and efficient inference characteristics. The model provides a good balance between quality and operational cost.
Training Strategy:
Phase 1 - Initial Fine-tuning:
- Started with public dialogue summarization datasets (SAMSum, DialogSum)
- Adapted the model to conversational formats and summary length requirements
- Used LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, reducing training time and GPU memory requirements
Phase 2 - Domain Adaptation:
- Collected hand-crafted summaries from actual support conversations
- The annotation process involved experienced support managers writing reference summaries
- Iteratively refined the model with domain-specific examples to capture company terminology, names, and common issue patterns
Fine-Tuning Implementation:
I used the Hugging Face PEFT library with a LoRA configuration (a minimal setup is sketched after this list):
- Applied LoRA to attention layers for efficient adaptation
- Trained on GPU infrastructure using PyTorch
- Employed gradient checkpointing to manage memory constraints
- Used teacher forcing with cross-entropy loss for sequence generation
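The sketch below shows a minimal version of this setup. The checkpoint name and the LoRA hyperparameters (rank, alpha, dropout) are illustrative assumptions, not the production values.

```python
# Minimal LoRA setup with the PEFT library. The model variant and the
# hyperparameters (r, alpha, dropout) are illustrative assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumed variant; larger sizes follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                        # low-rank dimension (assumed)
    lora_alpha=32,               # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],   # T5 attention query/value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights train

# Gradient checkpointing, per the notes above; inputs must require grads
# so gradients can flow through the frozen base to the LoRA adapters.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```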
Prompt Engineering:
Designed instruction-style prompts that guide the model to produce consistent summary formats:
```
Summarize the following customer support conversation. Include the customer's main issue
and the resolution status. Keep the summary under 50 words.

Conversation:
[conversation text]

Summary:
```
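A hypothetical helper tying this template to the fine-tuned model might look like the following; it reuses `tokenizer` and `model` from the earlier sketch, and the generation settings (beam count, token budget) are assumptions rather than the tuned production values.

```python
# Fill the instruction template and generate a summary. Generation
# parameters are illustrative, not the tuned production settings.
def summarize(conversation: str, max_new_tokens: int = 80) -> str:
    prompt = (
        "Summarize the following customer support conversation. "
        "Include the customer's main issue and the resolution status. "
        "Keep the summary under 50 words.\n\n"
        f"Conversation:\n{conversation}\n\nSummary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```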
Production Pipeline Architecture
Dagster Pipeline:
I implemented a data orchestration pipeline with the following stages (a condensed code sketch follows the stage descriptions):
1. extract_conversations
- Queries MongoDB for new/updated support conversations
- Filters for completed conversations requiring summarization
- Handles incremental processing with watermark-based change detection
2. preprocess
- Formats conversations into model-compatible input structure
- Handles multi-turn dialogue formatting
- Performs basic text cleaning while preserving conversational context
3. batch_inference
- Groups conversations into batches for efficient GPU utilization
- Invokes FLAN-T5 model for summary generation
- Implements retry logic for transient failures
- Uses mixed-precision inference (FP16) for reduced memory footprint
4. post_process
- Validates summary format and length
- Extracts structured metadata (sentiment indicators, urgency flags)
- Applies business rules for quality filtering
5. store_summaries
- Persists summaries back to MongoDB alongside original conversations
- Updates conversation documents with summary field and generation timestamp
- Maintains audit trail of summarization process
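Condensed as code, the job wiring might look like this skeleton; each op body is a stub standing in for the MongoDB queries, model calls, and business rules described above.

```python
# Skeleton of the five-stage Dagster job. Op bodies are stubs; the real
# implementations hold the MongoDB, model, and validation logic.
from dagster import job, op

@op
def extract_conversations() -> list:
    # Production: watermark-filtered MongoDB query for completed conversations.
    return [{"_id": "abc123", "turns": ["Customer: ...", "Agent: ..."]}]

@op
def preprocess(conversations: list) -> list:
    # Join multi-turn dialogue into a single model-ready string per document.
    return [{"_id": c["_id"], "text": "\n".join(c["turns"])} for c in conversations]

@op
def batch_inference(examples: list) -> list:
    # Production: batched FP16 generation with retry logic; stubbed here.
    return [{"_id": e["_id"], "summary": "..."} for e in examples]

@op
def post_process(summaries: list) -> list:
    # Keep only summaries passing basic format/length validation.
    return [s for s in summaries if s["summary"]]

@op
def store_summaries(summaries: list) -> None:
    # Production: $set the summary and timestamp on each conversation document.
    for _ in summaries:
        pass

@job
def summarization_pipeline():
    store_summaries(post_process(batch_inference(preprocess(extract_conversations()))))
```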
MongoDB Integration:
- Document Storage: Conversations and summaries stored in a unified collection (see the pymongo sketch after this list)
- Vector Search: Used MongoDB Atlas Vector Search for semantic similarity queries, enabling managers to find similar past issues
- Schema: Embedded summaries within conversation documents for atomic updates
- Indexing: Compound indexes on timestamp, status, and summary fields for efficient queries
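The snippet below illustrates the embedded-summary update, the compound index, and an Atlas `$vectorSearch` query; the URI, collection, field, and index names are all placeholder assumptions.

```python
# Illustrative pymongo usage for the schema, indexing, and vector-search
# choices above. Connection string and names are placeholders.
from datetime import datetime, timezone
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb+srv://<atlas-uri>")  # placeholder Atlas URI
conversations = client["support"]["conversations"]

# Compound index backing the timestamp/status/summary queries.
conversations.create_index(
    [("updated_at", ASCENDING), ("status", ASCENDING), ("summary", ASCENDING)]
)

# Embed the summary in the conversation document as a single atomic update.
conversations.update_one(
    {"_id": "abc123"},
    {"$set": {
        "summary": "Customer reported a billing error; refund issued.",
        "summary_generated_at": datetime.now(timezone.utc),
    }},
)

# Atlas Vector Search: find past conversations with similar summaries.
similar = conversations.aggregate([
    {"$vectorSearch": {
        "index": "summary_vector_index",  # assumed Atlas search index name
        "path": "summary_embedding",      # assumed embedding field
        "queryVector": [0.1] * 768,       # embedding of the query summary
        "numCandidates": 100,
        "limit": 5,
    }}
])
```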
Infrastructure & Deployment
Model Hosting:
- Deployed FLAN-T5 model on GPU-enabled compute
- Containerized inference service with Docker
- Implemented model versioning for rollback capability
Dagster Orchestration:
- Scheduled daily batch processing for overnight runs
- Event-driven triggers for real-time summarization of priority conversations (see the sensor sketch below)
- Monitoring and alerting integrated with the pipeline for failure detection
- Asset materialization tracking for pipeline observability
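A sketch of the schedule and the priority sensor, assuming the `summarization_pipeline` job from the earlier sketch; the cron time and the polling logic are illustrative.

```python
# Nightly schedule plus an event-driven sensor for priority conversations.
# Cron time and polling logic are assumptions for illustration.
from dagster import RunRequest, ScheduleDefinition, sensor

nightly_schedule = ScheduleDefinition(
    job=summarization_pipeline,
    cron_schedule="0 2 * * *",  # overnight batch run at 02:00
)

@sensor(job=summarization_pipeline)
def priority_conversation_sensor(context):
    # Production: poll MongoDB for newly completed priority conversations.
    new_priority_ids = []  # placeholder for the polling query result
    for conv_id in new_priority_ids:
        yield RunRequest(run_key=conv_id)  # one run per priority conversation
```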
Performance Optimizations:
- Batching: Processed multiple conversations simultaneously for GPU efficiency
- Quantization: Explored INT8 quantization for cost reduction in production (both measures are sketched below)
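The sketch below shows the batched FP16 path, reusing `tokenizer` and `model` from earlier sketches, plus bitsandbytes INT8 loading as one plausible route for the quantization experiment; batch size and generation settings are assumptions.

```python
# Batched FP16 inference; batch size and generation settings are
# illustrative choices, not the tuned production values.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model = model.half().to("cuda")  # FP16 weights halve the memory footprint

def summarize_batch(texts: list, batch_size: int = 16) -> list:
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i:i + batch_size],
            return_tensors="pt", padding=True, truncation=True, max_length=1024,
        ).to("cuda")
        with torch.inference_mode():
            out = model.generate(**batch, max_new_tokens=80, num_beams=4)
        summaries.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return summaries

# One possible INT8 path (bitsandbytes); requires CUDA and `accelerate`.
int8_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",  # assumed variant, as above
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```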
Addressing Key Challenges
1. Long Conversation Handling: For conversations exceeding the model's input window, I implemented the following (a minimal sketch follows the list):
- Extractive preprocessing to identify most relevant turns
- Sliding window approach with overlap for very long conversations
- Hierarchical summarization for multi-hour chat sessions
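A minimal sketch of the sliding-window and hierarchical approaches, reusing the `summarize` helper and `tokenizer` from earlier; the window and overlap sizes are illustrative, not the production values.

```python
# Sliding-window chunking with overlap, plus a two-level hierarchical pass.
# Window/overlap sizes are assumptions for illustration.
def chunk_conversation(text: str, window: int = 512, overlap: int = 64) -> list:
    token_ids = tokenizer(text)["input_ids"]
    step = window - overlap
    return [
        tokenizer.decode(token_ids[start:start + window], skip_special_tokens=True)
        for start in range(0, max(len(token_ids) - overlap, 1), step)
    ]

def hierarchical_summary(text: str) -> str:
    # Summarize each window, then summarize the concatenated partials.
    partials = [summarize(chunk) for chunk in chunk_conversation(text)]
    return summarize("\n".join(partials)) if len(partials) > 1 else partials[0]
```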
2. Domain-Specific Terminology: Fine-tuning on customer data taught the model the relevant names, technical terms, and company-specific language, preventing generic or inaccurate summaries.
3. Summary Consistency: Instruction-style prompts and post-processing rules ensured summaries followed a consistent structure, making them easy for managers to scan quickly.
4. Quality Monitoring: Implemented a feedback loop in which managers could flag poor summaries; flagged examples fed periodic model refinement cycles.
Results & Impact
- Automated summarization of customer support conversations deployed to production
- Management review workflow streamlined with actionable, concise summaries
- System processes daily conversation volumes with consistent quality
- Enabled faster identification of emerging issues through aggregated summary analysis
- Integration with existing MongoDB infrastructure minimized operational complexity
- LoRA fine-tuning approach reduced training costs compared to full model fine-tuning
- Model updates and retraining streamlined through established Dagster pipeline
Technologies
Python, PyTorch, Hugging Face, FLAN-T5, PEFT (LoRA), Dagster, MongoDB, Docker, GPU Compute