Conversational Support Summarization with Large Language Models
For a customer service organization processing thousands of daily support conversations, I developed an automated summarization system using fine-tuned FLAN-T5. The solution generates concise summaries for management review, reducing manual analysis time and enabling rapid identification of critical issues and customer sentiment trends.
The Challenge
Customer service managers needed to review large volumes of support conversations for quality assurance and trend analysis, but manual review was time-consuming and inconsistent. The organization required:
- Automated generation of 6-7 sentence summaries highlighting key issues and resolutions
- Identification of escalation-worthy conversations
- Daily aggregate reporting on common problems and customer concerns
- Integration with existing MongoDB-based support system
- Low-latency inference to support near-real-time summarization
Model Selection & Fine-Tuning
Base Model: FLAN-T5
I selected FLAN-T5 as the foundation model for its strong instruction-following capabilities and efficient inference characteristics. The model provides a good balance between quality and operational cost.
Training Strategy:
Phase 1 - Initial Fine-tuning:
- Started with public dialogue summarization datasets (SAMSum, DialogSum)
- Adapted the model to conversational formats and summary length requirements
- Used LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, reducing training time and GPU memory requirements
Phase 2 - Domain Adaptation:
- Collected hand-crafted summaries from actual support conversations
- The annotation process involved experienced support managers writing reference summaries
- Iteratively refined the model with domain-specific examples to capture company terminology, names, and common issue patterns
Fine-Tuning Implementation:
I used the Hugging Face PEFT library with a LoRA configuration (a minimal setup is sketched after this list):
- Applied LoRA to attention layers for efficient adaptation
- Trained on GPU infrastructure using PyTorch
- Employed gradient checkpointing to manage memory constraints
- Used teacher forcing with cross-entropy loss for sequence generation
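The sketch below shows a minimal version of this setup. The checkpoint name and the LoRA hyperparameters (rank, alpha, dropout) are illustrative assumptions, not the production values.

```python
# Minimal LoRA setup with the PEFT library. The model variant and the
# hyperparameters (r, alpha, dropout) are illustrative assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumed variant; larger sizes follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                        # low-rank dimension (assumed)
    lora_alpha=32,               # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],   # T5 attention query/value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights train

# Gradient checkpointing, per the notes above; inputs must require grads
# so gradients can flow through the frozen base to the LoRA adapters.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```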
Prompt Engineering:
Designed instruction-style prompts that guide the model to produce consistent summary formats:
```
Summarize the following customer support conversation. Include the customer's main issue
and the resolution status. Keep the summary under 50 words.

Conversation:
[conversation text]

Summary:
```
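A hypothetical helper tying this template to the fine-tuned model might look like the following; it reuses `tokenizer` and `model` from the earlier sketch, and the generation settings (beam count, token budget) are assumptions rather than the tuned production values.

```python
# Fill the instruction template and generate a summary. Generation
# parameters are illustrative, not the tuned production settings.
def summarize(conversation: str, max_new_tokens: int = 80) -> str:
    prompt = (
        "Summarize the following customer support conversation. "
        "Include the customer's main issue and the resolution status. "
        "Keep the summary under 50 words.\n\n"
        f"Conversation:\n{conversation}\n\nSummary:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```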
Production Pipeline Architecture
Dagster Pipeline:
I implemented a data orchestration pipeline with the following stages (a condensed code sketch follows the stage descriptions):
1. extract_conversations
- Queries MongoDB for new/updated support conversations
- Filters for completed conversations requiring summarization
- Handles incremental processing with watermark-based change detection
2. preprocess
- Formats conversations into model-compatible input structure
- Handles multi-turn dialogue formatting
- Performs basic text cleaning while preserving conversational context
3. batch_inference
- Groups conversations into batches for efficient GPU utilization
- Invokes FLAN-T5 model for summary generation
- Implements retry logic for transient failures
- Uses mixed-precision inference (FP16) for reduced memory footprint
4. post_process
- Validates summary format and length
- Extracts structured metadata (sentiment indicators, urgency flags)
- Applies business rules for quality filtering
5. store_summaries
- Persists summaries back to MongoDB alongside original conversations
- Updates conversation documents with summary field and generation timestamp
- Maintains audit trail of summarization process
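Condensed as code, the job wiring might look like this skeleton; each op body is a stub standing in for the MongoDB queries, model calls, and business rules described above.

```python
# Skeleton of the five-stage Dagster job. Op bodies are stubs; the real
# implementations hold the MongoDB, model, and validation logic.
from dagster import job, op

@op
def extract_conversations() -> list:
    # Production: watermark-filtered MongoDB query for completed conversations.
    return [{"_id": "abc123", "turns": ["Customer: ...", "Agent: ..."]}]

@op
def preprocess(conversations: list) -> list:
    # Join multi-turn dialogue into a single model-ready string per document.
    return [{"_id": c["_id"], "text": "\n".join(c["turns"])} for c in conversations]

@op
def batch_inference(examples: list) -> list:
    # Production: batched FP16 generation with retry logic; stubbed here.
    return [{"_id": e["_id"], "summary": "..."} for e in examples]

@op
def post_process(summaries: list) -> list:
    # Keep only summaries passing basic format/length validation.
    return [s for s in summaries if s["summary"]]

@op
def store_summaries(summaries: list) -> None:
    # Production: $set the summary and timestamp on each conversation document.
    for _ in summaries:
        pass

@job
def summarization_pipeline():
    store_summaries(post_process(batch_inference(preprocess(extract_conversations()))))
```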
MongoDB Integration:
- Document Storage: Conversations and summaries stored in a unified collection (see the pymongo sketch after this list)
- Vector Search: Used MongoDB Atlas Vector Search for semantic similarity queries, enabling managers to find similar past issues
- Schema: Embedded summaries within conversation documents for atomic updates
- Indexing: Compound indexes on timestamp, status, and summary fields for efficient queries
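The snippet below illustrates the embedded-summary update, the compound index, and an Atlas `$vectorSearch` query; the URI, collection, field, and index names are all placeholder assumptions.

```python
# Illustrative pymongo usage for the schema, indexing, and vector-search
# choices above. Connection string and names are placeholders.
from datetime import datetime, timezone
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb+srv://<atlas-uri>")  # placeholder Atlas URI
conversations = client["support"]["conversations"]

# Compound index backing the timestamp/status/summary queries.
conversations.create_index(
    [("updated_at", ASCENDING), ("status", ASCENDING), ("summary", ASCENDING)]
)

# Embed the summary in the conversation document as a single atomic update.
conversations.update_one(
    {"_id": "abc123"},
    {"$set": {
        "summary": "Customer reported a billing error; refund issued.",
        "summary_generated_at": datetime.now(timezone.utc),
    }},
)

# Atlas Vector Search: find past conversations with similar summaries.
similar = conversations.aggregate([
    {"$vectorSearch": {
        "index": "summary_vector_index",  # assumed Atlas search index name
        "path": "summary_embedding",      # assumed embedding field
        "queryVector": [0.1] * 768,       # embedding of the query summary
        "numCandidates": 100,
        "limit": 5,
    }}
])
```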
Infrastructure & Deployment
Model Hosting:
- Deployed FLAN-T5 model on GPU-enabled compute
- Containerized inference service with Docker
- Implemented model versioning for rollback capability
Dagster Orchestration:
- Scheduled daily batch processing for overnight runs
- Event-driven triggers for real-time summarization of priority conversations (see the sensor sketch below)
- Monitoring and alerting integrated with the pipeline for failure detection
- Asset materialization tracking for pipeline observability
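A sketch of the schedule and the priority sensor, assuming the `summarization_pipeline` job from the earlier sketch; the cron time and the polling logic are illustrative.

```python
# Nightly schedule plus an event-driven sensor for priority conversations.
# Cron time and polling logic are assumptions for illustration.
from dagster import RunRequest, ScheduleDefinition, sensor

nightly_schedule = ScheduleDefinition(
    job=summarization_pipeline,
    cron_schedule="0 2 * * *",  # overnight batch run at 02:00
)

@sensor(job=summarization_pipeline)
def priority_conversation_sensor(context):
    # Production: poll MongoDB for newly completed priority conversations.
    new_priority_ids = []  # placeholder for the polling query result
    for conv_id in new_priority_ids:
        yield RunRequest(run_key=conv_id)  # one run per priority conversation
```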
Performance Optimizations:
- Batching: Processed multiple conversations simultaneously for GPU efficiency
- Quantization: Explored INT8 quantization for cost reduction in production (both measures are sketched below)
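The sketch below shows the batched FP16 path, reusing `tokenizer` and `model` from earlier sketches, plus bitsandbytes INT8 loading as one plausible route for the quantization experiment; batch size and generation settings are assumptions.

```python
# Batched FP16 inference; batch size and generation settings are
# illustrative choices, not the tuned production values.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model = model.half().to("cuda")  # FP16 weights halve the memory footprint

def summarize_batch(texts: list, batch_size: int = 16) -> list:
    summaries = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i:i + batch_size],
            return_tensors="pt", padding=True, truncation=True, max_length=1024,
        ).to("cuda")
        with torch.inference_mode():
            out = model.generate(**batch, max_new_tokens=80, num_beams=4)
        summaries.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return summaries

# One possible INT8 path (bitsandbytes); requires CUDA and `accelerate`.
int8_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",  # assumed variant, as above
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```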
Addressing Key Challenges
1. Long Conversation Handling: For conversations exceeding the model's input window, I implemented the following (a minimal sketch follows the list):
- Extractive preprocessing to identify most relevant turns
- Sliding window approach with overlap for very long conversations
- Hierarchical summarization for multi-hour chat sessions
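A minimal sketch of the sliding-window and hierarchical approaches, reusing the `summarize` helper and `tokenizer` from earlier; the window and overlap sizes are illustrative, not the production values.

```python
# Sliding-window chunking with overlap, plus a two-level hierarchical pass.
# Window/overlap sizes are assumptions for illustration.
def chunk_conversation(text: str, window: int = 512, overlap: int = 64) -> list:
    token_ids = tokenizer(text)["input_ids"]
    step = window - overlap
    return [
        tokenizer.decode(token_ids[start:start + window], skip_special_tokens=True)
        for start in range(0, max(len(token_ids) - overlap, 1), step)
    ]

def hierarchical_summary(text: str) -> str:
    # Summarize each window, then summarize the concatenated partials.
    partials = [summarize(chunk) for chunk in chunk_conversation(text)]
    return summarize("\n".join(partials)) if len(partials) > 1 else partials[0]
```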
2. Domain-Specific Terminology: Fine-tuning on customer data taught the model the relevant names, technical terms, and company-specific language, preventing generic or inaccurate summaries.
3. Summary Consistency: Instruction-style prompts and post-processing rules ensured summaries followed a consistent structure, making them easy for managers to scan quickly.
4. Quality Monitoring: Implemented a feedback loop in which managers could flag poor summaries; flagged examples fed periodic model refinement cycles.
Results & Impact
- Automated summarization of customer support conversations deployed to production
- Management review workflow streamlined with actionable, concise summaries
- System processes daily conversation volumes with consistent quality
- Enabled faster identification of emerging issues through aggregated summary analysis
- Integration with existing MongoDB infrastructure minimized operational complexity
- LoRA fine-tuning approach reduced training costs compared to full model fine-tuning
- Model updates and retraining streamlined through established Dagster pipeline
Technologies
Python, PyTorch, Hugging Face, FLAN-T5, PEFT (LoRA), Dagster, MongoDB, Docker, GPU Compute