LLM Evaluation in Production: How We Hit 98% Accuracy
A deep dive into the code and concepts behind our 98% accurate medical data extraction pipeline.
"It's good enough."
In most software, that phrase is fine. If your recommendation engine is 90% accurate, nobody dies. If your search bar misses a result, the user tries a different keyword.
In healthcare, "good enough" is negligent.
At Mantys, we faced a brutal reality: our AI agents were processing medical claims where a single wrong digit could stick a patient with a $20,000 bill they didn't owe. We couldn't just "vibe check" our LLM outputs. We needed a rig that was mathematically precise, reproducible, and ruthless.
We built a system that hits 98% accuracy and, more importantly, knows when it isn't sure so it can ask for help.
Here is the architecture of how we did it.
The Challenge
Healthcare data is a nightmare to parse. We aren't dealing with clean text; we're dealing with "clinical notes" (often shorthand), "insurance codes" (confusing alphanumerics), and "dates" (written in every format imaginable).
For a long time, we were stuck at ~85% accuracy. To bridge that gap to 98%, we had to stop treating the LLM as a black box and start treating it as a component in a larger validation machine.
Our Evaluation Architecture
We built a "Swiss Cheese" model of evaluation. No single layer is perfect, but if you stack enough of them, the holes don't align, and errors don't get through.
Core Components
class HealthcareEvaluationFramework:
    def __init__(self, review_threshold=0.98):
        # Layer 1: The 'Smart' Evaluator (LLM evaluating LLM)
        self.cove_evaluator = CoveEvaluator()
        # Layer 2: The 'Statistical' Evaluator
        self.gevals_framework = GevalsFramework()
        # Layer 3: The 'Deterministic' Evaluator (Logic Checks)
        self.domain_validator = HealthcareDomainValidator()
        # Quantitative log-likelihood analysis
        self.log_analyzer = LogValuesAnalyzer()
        # Safety Net: The Human
        self.human_reviewer = HumanReviewSystem()
        # Anything scoring below this bar is escalated to a person
        self.threshold = review_threshold

    def evaluate_extraction(self, document, extracted_data):
        # Multi-stage evaluation pipeline
        results = {}

        # Stage 1: Automated evaluation
        results['cove_score'] = self.cove_evaluator.evaluate(document, extracted_data)
        results['gevals_metrics'] = self.gevals_framework.compute_metrics(extracted_data)
        results['log_values'] = self.log_analyzer.analyze(extracted_data)

        # Stage 2: Domain-specific validation
        results['domain_validation'] = self.domain_validator.validate(extracted_data)

        # Overall confidence drives the escalation decision
        results['confidence'] = results['gevals_metrics']['composite_score']

        # Stage 3: Human review (for edge cases)
        if results['confidence'] < self.threshold:
            results['human_review'] = self.human_reviewer.queue_for_review(document, extracted_data)

        return results
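For orientation, here is a hedged sketch of what the call site can look like. `load_claim_document` and `extraction_agent` are placeholder names for the upstream document loader and LLM extraction step, which are outside the scope of this post.

# Illustrative call site — the loader and extraction agent are placeholders.
framework = HealthcareEvaluationFramework()

document = load_claim_document("eob_sample.pdf")   # placeholder document loader
extracted = extraction_agent.extract(document)     # upstream LLM extraction step

results = framework.evaluate_extraction(document, extracted)
print("Composite score:", results['gevals_metrics']['composite_score'])
print("Critical errors:", results['domain_validation']['critical_errors'])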
Cove: Comprehensive Evaluation for Healthcare
Cove (Comprehensive Output Verification and Evaluation) became our primary automated evaluation tool for several reasons:
Key Advantages
- Domain adaptability: Can be fine-tuned for healthcare-specific metrics
- Explainability: Provides detailed reasoning for evaluation scores
- Scalability: Handles large volumes of documents efficiently
- Integration: Easy API integration with our existing pipeline
Implementation Details
class CoveEvaluator:
    def __init__(self, model_config):
        self.model = load_cove_model(model_config)
        self.healthcare_schema = HealthcareSchema()

    def evaluate(self, source_document, extracted_data):
        # Prepare evaluation context
        context = {
            'document_type': self.identify_document_type(source_document),
            'expected_fields': self.healthcare_schema.get_required_fields(),
            'extraction_confidence': extracted_data.get('confidence', 0.0)
        }
        # Run Cove evaluation
        evaluation = self.model.evaluate(
            source=source_document,
            extraction=extracted_data,
            context=context
        )
        return self.process_cove_results(evaluation)

    def process_cove_results(self, evaluation):
        return {
            'accuracy_score': evaluation.accuracy,
            'completeness_score': evaluation.completeness,
            'consistency_score': evaluation.consistency,
            'confidence_interval': evaluation.confidence_interval,
            'field_level_scores': evaluation.field_scores,
            'explanation': evaluation.reasoning
        }
Healthcare-Specific Adaptations
We customized Cove for healthcare in three ways (a configuration sketch follows the list):
- Medical terminology training: Fine-tuned on healthcare-specific vocabulary
- Context-aware evaluation: Understanding medical relationships and dependencies
- Regulatory alignment: Ensuring evaluations align with healthcare compliance requirements
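As a concrete illustration, here is a minimal sketch of what a healthcare-specific evaluation configuration along these lines might look like; the key names, code systems, and rules are assumptions made for this example, not the exact configuration we ship.

# Hypothetical healthcare evaluation config — names and rules are illustrative.
HEALTHCARE_EVAL_CONFIG = {
    'vocabulary': {
        # Medical terminology the evaluator is tuned to recognize
        'code_systems': ['ICD-10-CM', 'CPT', 'HCPCS'],
    },
    'context_rules': [
        # Context-aware checks: relationships between extracted fields
        {'if_present': 'cpt_codes', 'require': 'icd_codes'},
        {'if_present': 'copay_amount', 'require': 'total_billed_amount'},
    ],
    'compliance': {
        # Regulatory alignment: fields that must always survive extraction
        'protected_fields': ['patient_name', 'member_id', 'date_of_service'],
        'audit_trail': True,
    },
}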
Gevals: Generative Evaluation at Scale
Gevals provided our framework for systematic, large-scale evaluation:
Core Metrics Implementation
class GevalsFramework:
    def __init__(self):
        self.metrics = {
            'accuracy': AccuracyMetric(),
            'completeness': CompletenessMetric(),
            'consistency': ConsistencyMetric(),
            'clinical_relevance': ClinicalRelevanceMetric(),
            'regulatory_compliance': ComplianceMetric()
        }

    def compute_metrics(self, extracted_data):
        results = {}
        for metric_name, metric in self.metrics.items():
            try:
                score = metric.compute(extracted_data)
                results[metric_name] = {
                    'score': score,
                    'details': metric.get_details(),
                    'confidence': metric.get_confidence()
                }
            except Exception as e:
                results[metric_name] = {'error': str(e)}

        # Compute composite score
        results['composite_score'] = self.calculate_composite_score(results)
        return results

    def calculate_composite_score(self, metric_results):
        # Weighted average based on healthcare priorities
        weights = {
            'accuracy': 0.35,
            'completeness': 0.25,
            'consistency': 0.20,
            'clinical_relevance': 0.15,
            'regulatory_compliance': 0.05
        }
        # Only average over metrics that actually produced a score,
        # so a single failed metric doesn't silently drag the composite down
        scored = {m: r['score'] for m, r in metric_results.items() if 'score' in r}
        if not scored:
            return 0.0
        weighted_sum = sum(weights[m] * s for m, s in scored.items())
        return weighted_sum / sum(weights[m] for m in scored)
Custom Healthcare Metrics
We developed healthcare-specific metrics:
Clinical Relevance Metric
class ClinicalRelevanceMetric:
    def __init__(self):
        self.medical_knowledge_base = MedicalKnowledgeBase()
        self.icd_validator = ICDCodeValidator()
        self.cpt_validator = CPTCodeValidator()

    def compute(self, extracted_data):
        relevance_scores = []

        # Validate medical codes
        if 'icd_codes' in extracted_data:
            icd_relevance = self.validate_icd_codes(extracted_data['icd_codes'])
            relevance_scores.append(icd_relevance)
        if 'cpt_codes' in extracted_data:
            cpt_relevance = self.validate_cpt_codes(extracted_data['cpt_codes'])
            relevance_scores.append(cpt_relevance)

        # Check clinical logic (e.g. diagnosis/procedure pairs that make sense together)
        clinical_logic_score = self.validate_clinical_logic(extracted_data)
        relevance_scores.append(clinical_logic_score)

        return sum(relevance_scores) / len(relevance_scores)
Log Values Method: Quantitative Accuracy Analysis
The log values approach provided quantitative insights into our model's performance:
Implementation
class LogValuesAnalyzer:
    def __init__(self):
        self.baseline_model = BaselineModel()
        self.production_model = ProductionModel()

    def analyze(self, extracted_data):
        # Calculate log-likelihood improvement over the baseline
        baseline_likelihood = self.baseline_model.log_likelihood(extracted_data)
        production_likelihood = self.production_model.log_likelihood(extracted_data)
        improvement = production_likelihood - baseline_likelihood

        # Field-level analysis: compute the improvement per field, not just overall
        field_analysis = {}
        for field, value in extracted_data.items():
            baseline_ll = self.baseline_model.field_likelihood(field, value)
            production_ll = self.production_model.field_likelihood(field, value)
            field_improvement = production_ll - baseline_ll
            field_analysis[field] = {
                'baseline_ll': baseline_ll,
                'production_ll': production_ll,
                'improvement': field_improvement,
                'confidence': self.calculate_confidence(field_improvement)
            }

        return {
            'overall_improvement': improvement,
            'field_analysis': field_analysis,
            'statistical_significance': self.test_significance(improvement)
        }
Key Insights from Log Analysis
The log values method revealed (an aggregation sketch follows the list):
- Field-specific performance: Some fields (like patient names) had higher accuracy than others (like copay amounts)
- Document type variations: Performance varied significantly across document types
- Improvement quantification: Measurable improvements over baseline approaches
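To show how insights like these can be pulled out of the raw numbers, here is a small sketch that aggregates per-field improvements across a batch of analyzer outputs. The helper is hypothetical and assumes the output shape of LogValuesAnalyzer.analyze() shown above.

from collections import defaultdict
from statistics import mean

def summarize_field_performance(analysis_results):
    # analysis_results: a list of dicts shaped like LogValuesAnalyzer.analyze() output
    per_field = defaultdict(list)
    for result in analysis_results:
        for field, stats in result['field_analysis'].items():
            per_field[field].append(stats['improvement'])

    # Sort ascending so the weakest fields (e.g. copay amounts) surface first
    return sorted(
        ((field, mean(values)) for field, values in per_field.items()),
        key=lambda pair: pair[1],
    )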
Production Pipeline Integration
Our evaluation framework integrates seamlessly with our production extraction pipeline:
Real-time Evaluation
class ProductionEvaluationPipeline:
    def __init__(self):
        self.evaluator = HealthcareEvaluationFramework()
        self.metrics_collector = MetricsCollector()
        self.alert_system = AlertSystem()

    async def process_document(self, document):
        # Extract data
        extracted_data = await self.extract_data(document)

        # Real-time evaluation
        evaluation_results = self.evaluator.evaluate_extraction(document, extracted_data)

        # Collect metrics
        self.metrics_collector.record(evaluation_results)

        # Check for alerts (the composite score comes from the Gevals layer)
        composite_score = evaluation_results['gevals_metrics']['composite_score']
        if composite_score < 0.95:
            await self.alert_system.send_alert(document, evaluation_results)

        # Decide on human review
        if self.needs_human_review(evaluation_results):
            await self.queue_for_human_review(document, extracted_data, evaluation_results)

        return extracted_data, evaluation_results

    def needs_human_review(self, evaluation_results):
        composite_score = evaluation_results['gevals_metrics']['composite_score']
        ci_low, ci_high = evaluation_results['cove_score']['confidence_interval']
        return (
            composite_score < 0.98 or
            evaluation_results['domain_validation']['critical_errors'] > 0 or
            ci_high - ci_low > 0.1
        )
Achieving 98% Accuracy: Key Strategies
Several strategies were crucial to achieving our 98% accuracy target:
1. Multi-Stage Validation
- Automated evaluation catches obvious errors
- Domain-specific validation ensures medical accuracy (a validator sketch follows this list)
- Human review handles edge cases and builds training data
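The deterministic HealthcareDomainValidator referenced in the framework never appears in the snippets above, so here is a minimal sketch of what that layer might look like; the specific rules and return shape are assumptions for illustration, not our exact production checks.

from datetime import date

class HealthcareDomainValidator:
    # Deterministic logic checks — a hedged sketch, not the production class.
    # Assumes dates have already been parsed into datetime.date objects.

    def validate(self, extracted_data):
        critical_errors = []

        # A date of service in the future is always wrong
        dos = extracted_data.get('date_of_service')
        if dos is not None and dos > date.today():
            critical_errors.append('date_of_service is in the future')

        # The patient's share can never exceed the total billed amount
        copay = extracted_data.get('copay_amount')
        billed = extracted_data.get('total_billed_amount')
        if copay is not None and billed is not None and copay > billed:
            critical_errors.append('copay_amount exceeds total_billed_amount')

        return {
            'critical_errors': len(critical_errors),
            'error_details': critical_errors,
        }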
2. Continuous Learning
class ContinuousLearningSystem:
    def __init__(self, retrain_threshold, current_model_accuracy):
        self.training_queue = TrainingQueue()
        self.model_updater = ModelUpdater()
        self.retrain_threshold = retrain_threshold
        self.current_model_accuracy = current_model_accuracy

    async def learn_from_corrections(self, corrections):
        # Add human corrections to training queue
        for correction in corrections:
            self.training_queue.add(
                document=correction.original_document,
                correct_extraction=correction.corrected_data,
                error_type=correction.error_classification
            )
        # Trigger retraining when the queue reaches the threshold
        if len(self.training_queue) >= self.retrain_threshold:
            await self.trigger_retraining()

    async def trigger_retraining(self):
        training_data = self.training_queue.get_all()
        updated_model = await self.model_updater.retrain(training_data)

        # Only promote the new model if it beats the current one
        validation_results = await self.validate_model(updated_model)
        if validation_results['accuracy'] > self.current_model_accuracy:
            await self.deploy_updated_model(updated_model)
3. Error Analysis and Prevention
class ErrorAnalysisSystem:
    def __init__(self):
        self.error_classifier = ErrorClassifier()
        self.pattern_detector = PatternDetector()

    def analyze_errors(self, evaluation_results):
        errors = self.extract_errors(evaluation_results)

        # Classify error types
        classified_errors = [
            self.error_classifier.classify(error)
            for error in errors
        ]

        # Detect patterns
        patterns = self.pattern_detector.find_patterns(classified_errors)

        # Generate improvement recommendations
        recommendations = self.generate_recommendations(patterns)

        return {
            'error_breakdown': classified_errors,
            'patterns': patterns,
            'recommendations': recommendations
        }
Results and Impact
Our comprehensive evaluation framework has delivered significant results:
Quantitative Improvements
- 98% accuracy on critical field extraction
- 80% reduction in manual verification work
- 95% confidence intervals within ±2% for most extractions
- <100ms average evaluation latency in production
Operational Benefits
- Standardized quality metrics across all extraction tasks
- Real-time monitoring of model performance
- Automated alerting for accuracy degradation
- Continuous improvement through systematic error analysis
Business Impact
- Reduced operational costs through automation
- Improved compliance with healthcare regulations
- Faster processing times for critical documents
- Higher customer satisfaction due to accuracy
Lessons Learned
Building this system taught us several important lessons:
1. Healthcare Requires Domain-Specific Evaluation
Standard NLP metrics don't capture healthcare-specific requirements. Custom metrics for clinical relevance and regulatory compliance were essential.
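To give a flavor of what a compliance-oriented metric can look like, here is a minimal, compute-only sketch of a metric like the ComplianceMetric referenced earlier (it omits the get_details/get_confidence hooks); the required fields and checks are illustrative assumptions, not a statement of what regulators actually require.

class ComplianceMetric:
    # Sketch of a regulatory-compliance metric — the rules are illustrative.

    # Fields an auditor would expect on every processed claim (assumed list)
    REQUIRED_FIELDS = ('patient_name', 'member_id', 'date_of_service')

    def compute(self, extracted_data):
        checks = []

        # Presence checks: every required field must be populated
        for field in self.REQUIRED_FIELDS:
            checks.append(bool(extracted_data.get(field)))

        # Traceability check: each value should point back to its source span
        # in the document so decisions can be audited later
        checks.append('source_spans' in extracted_data)

        # Score is the fraction of checks that pass
        return sum(checks) / len(checks)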
2. Human-in-the-Loop is Critical
Even with 98% accuracy, human oversight remains necessary for:
- Edge cases and novel document types
- Regulatory compliance verification
- Continuous improvement through error correction
3. Multi-Modal Evaluation is Powerful
Combining different evaluation approaches (Cove, Gevals, log values) provides comprehensive coverage and catches different types of errors.
4. Real-Time Monitoring Enables Quick Response
Production monitoring allows us to detect and respond to accuracy degradation quickly, maintaining system reliability.
Future Directions
We're continuing to improve our evaluation framework:
Enhanced Automation
- Self-improving models that automatically incorporate corrections
- Predictive error detection to prevent issues before they occur
- Dynamic threshold adjustment based on document complexity
Broader Healthcare Applications
- Clinical decision support evaluation frameworks
- Drug interaction detection accuracy measurement
- Population health analytics validation systems
Conclusion
Building a production-grade LLM evaluation framework isn't just about importing a library. It requires a fundamental shift in how you build AI products.
For us, the key realization was that evaluation is not a separate step. It is the product. The confidence score we generate is just as important as the data we extract.
If you are building in high-stakes domains, stop looking for a better prompt. Start looking for a better way to measure if your prompt worked.
Harshavardhan is a Founding Engineer at Mantys Healthcare AI. He spends too much time looking at precision-recall curves.