2025-01-01
12 min read

LLM Evaluation in Production: How We Hit 98% Accuracy

A deep dive into the code and concepts behind our 98% accurate medical data extraction pipeline.

LLMs · Evaluation · Healthcare Tech

"It's good enough."

In most software, that phrase is fine. If your recommendation engine is 90% accurate, nobody dies. If your search bar misses a result, the user tries a different keyword.

In healthcare, "good enough" is negligent.

At Mantys, we faced a brutal reality: our AI agents were processing medical claims where a single wrong digit could stick a patient with a $20,000 bill they didn't owe. We couldn't just "vibe check" our LLM outputs. We needed a rig that was mathematically precise, reproducible, and ruthless.

We managed to build a system that achieves 98% accuracy and, more importantly, knows exactly when it isn't accurate so it can ask for help.

Here is the architecture of how we did it.

The Challenge

Healthcare data is a nightmare to parse. We aren't dealing with clean text; we're dealing with "clinical notes" (often shorthand), "insurance codes" (confusing alphanumerics), and "dates" (written in every format imaginable).
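
To make "written in every format imaginable" concrete: before any LLM sees a date, it pays to normalize it deterministically. A minimal sketch (the format list here is illustrative, not our production set):

from datetime import datetime

# Try a fixed set of known formats; never guess.
CANDIDATE_FORMATS = [
    "%m/%d/%Y", "%m-%d-%Y", "%Y-%m-%d",
    "%b %d, %Y", "%d %b %Y", "%m/%d/%y",
]

def normalize_date(raw: str) -> str | None:
    """Return an ISO-8601 date string, or None if no known format matches."""
    raw = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: becomes a review item, not a silent guess

Returning None instead of guessing is the point: ambiguity gets surfaced, not papered over.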

For a long time, we were stuck at ~85% accuracy. To bridge that gap to 98%, we had to stop treating the LLM as a black box and start treating it as a component in a larger validation machine.

Our Evaluation Architecture

We built a "Swiss Cheese" model of evaluation. No single layer is perfect, but if you stack enough of them, the holes don't align, and errors don't get through.

Core Components

class HealthcareEvaluationFramework:
    def __init__(self):
        # Layer 1: The 'Smart' Evaluator (LLM evaluating LLM)
        self.cove_evaluator = CoveEvaluator()

        # Layer 2: The 'Statistical' Evaluator
        self.gevals_framework = GevalsFramework()
        self.log_analyzer = LogValuesAnalyzer()

        # Layer 3: The 'Deterministic' Evaluator (Logic Checks)
        self.domain_validator = HealthcareDomainValidator()

        # Safety Net: The Human
        self.human_reviewer = HumanReviewSystem()

        # Confidence below this is routed to a human (tunable)
        self.threshold = 0.95

    def evaluate_extraction(self, document, extracted_data):
        # Multi-stage evaluation pipeline
        results = {}

        # Stage 1: Automated evaluation
        results['cove_score'] = self.cove_evaluator.evaluate(document, extracted_data)
        results['gevals_metrics'] = self.gevals_framework.compute_metrics(extracted_data)
        results['log_values'] = self.log_analyzer.analyze(extracted_data)

        # Stage 2: Domain-specific validation
        results['domain_validation'] = self.domain_validator.validate(extracted_data)

        # Stage 3: Human review (for edge cases)
        # Use the Gevals composite score as the routing confidence
        results['confidence'] = results['gevals_metrics']['composite_score']
        if results['confidence'] < self.threshold:
            results['human_review'] = self.human_reviewer.queue_for_review(document, extracted_data)

        return results

Cove: Comprehensive Evaluation for Healthcare

Cove (Comprehensive Output Verification and Evaluation) became our primary automated evaluation tool for several reasons:

Key Advantages

  1. Domain adaptability: Can be fine-tuned for healthcare-specific metrics
  2. Explainability: Provides detailed reasoning for evaluation scores
  3. Scalability: Handles large volumes of documents efficiently
  4. Integration: Easy API integration with our existing pipeline

Implementation Details

class CoveEvaluator:
    def __init__(self, model_config=None):
        # Default config so the framework can instantiate with no args
        self.model = load_cove_model(model_config)
        self.healthcare_schema = HealthcareSchema()

    def evaluate(self, source_document, extracted_data):
        # Prepare evaluation context
        context = {
            'document_type': self.identify_document_type(source_document),
            'expected_fields': self.healthcare_schema.get_required_fields(),
            'extraction_confidence': extracted_data.get('confidence', 0.0)
        }

        # Run Cove evaluation
        evaluation = self.model.evaluate(
            source=source_document,
            extraction=extracted_data,
            context=context
        )

        return self.process_cove_results(evaluation)

    def process_cove_results(self, evaluation):
        return {
            'accuracy_score': evaluation.accuracy,
            'completeness_score': evaluation.completeness,
            'consistency_score': evaluation.consistency,
            'confidence_interval': evaluation.confidence_interval,
            'field_level_scores': evaluation.field_scores,
            'explanation': evaluation.reasoning
        }
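
For reference, the HealthcareSchema used above is essentially a declarative map of required fields per document type. A simplified sketch; the field names shown are illustrative, and the real schema is larger and versioned alongside the prompts:

class HealthcareSchema:
    # Illustrative field names only; the production schema is much larger.
    REQUIRED_FIELDS = {
        "claim": ["patient_name", "date_of_service", "icd_codes",
                  "cpt_codes", "copay_amount"],
        "clinical_note": ["patient_name", "date_of_service", "diagnosis"],
    }

    def get_required_fields(self, document_type="claim"):
        # Defaults to claims, the document type this pipeline sees most
        return self.REQUIRED_FIELDS.get(document_type, [])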

Healthcare-Specific Adaptations

We customized Cove for healthcare by:

  1. Medical terminology training: Fine-tuned on healthcare-specific vocabulary
  2. Context-aware evaluation: Understanding medical relationships and dependencies
  3. Regulatory alignment: Ensuring evaluations align with healthcare compliance requirements

Gevals: Generative Evaluation at Scale

Gevals provided our framework for systematic, large-scale evaluation:

Core Metrics Implementation

class GevalsFramework:
    def __init__(self):
        self.metrics = {
            'accuracy': AccuracyMetric(),
            'completeness': CompletenessMetric(),
            'consistency': ConsistencyMetric(),
            'clinical_relevance': ClinicalRelevanceMetric(),
            'regulatory_compliance': ComplianceMetric()
        }

    def compute_metrics(self, extracted_data):
        results = {}

        for metric_name, metric in self.metrics.items():
            try:
                score = metric.compute(extracted_data)
                results[metric_name] = {
                    'score': score,
                    'details': metric.get_details(),
                    'confidence': metric.get_confidence()
                }
            except Exception as e:
                results[metric_name] = {'error': str(e)}

        # Compute composite score
        results['composite_score'] = self.calculate_composite_score(results)

        return results

    def calculate_composite_score(self, metric_results):
        # Weighted average based on healthcare priorities
        weights = {
            'accuracy': 0.35,
            'completeness': 0.25,
            'consistency': 0.20,
            'clinical_relevance': 0.15,
            'regulatory_compliance': 0.05
        }

        # Only average over metrics that actually produced a score, so a
        # single failed metric doesn't silently drag the composite down
        scored = [m for m in weights if 'score' in metric_results.get(m, {})]
        if not scored:
            return 0.0

        weighted_sum = sum(weights[m] * metric_results[m]['score'] for m in scored)
        return weighted_sum / sum(weights[m] for m in scored)

Custom Healthcare Metrics

We developed healthcare-specific metrics:

Clinical Relevance Metric

class ClinicalRelevanceMetric:
    def __init__(self):
        self.medical_knowledge_base = MedicalKnowledgeBase()
        self.icd_validator = ICDCodeValidator()
        self.cpt_validator = CPTCodeValidator()

    def compute(self, extracted_data):
        relevance_scores = []

        # Validate medical codes
        if 'icd_codes' in extracted_data:
            icd_relevance = self.validate_icd_codes(extracted_data['icd_codes'])
            relevance_scores.append(icd_relevance)

        if 'cpt_codes' in extracted_data:
            cpt_relevance = self.validate_cpt_codes(extracted_data['cpt_codes'])
            relevance_scores.append(cpt_relevance)

        # Check clinical logic
        clinical_logic_score = self.validate_clinical_logic(extracted_data)
        relevance_scores.append(clinical_logic_score)

        return sum(relevance_scores) / len(relevance_scores)
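
The ICDCodeValidator and CPTCodeValidator behind this metric can be layered: a cheap shape check gates the expensive lookup against the official code tables. A format-level sketch (the regexes are deliberately permissive and are no substitute for the real ICD-10-CM and CPT tables):

import re

# Shape checks only: a code can be well-formed and still not exist.
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")
CPT_PATTERN = re.compile(r"^\d{4}[\dFTU]$")  # 5 digits, or 4 digits + F/T/U

def looks_like_icd10(code: str) -> bool:
    return bool(ICD10_PATTERN.match(code.strip().upper()))

def looks_like_cpt(code: str) -> bool:
    return bool(CPT_PATTERN.match(code.strip().upper()))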

Log Values Method: Quantitative Accuracy Analysis

The log values approach provided quantitative insights into our model's performance:

Implementation

class LogValuesAnalyzer:
    def __init__(self):
        self.baseline_model = BaselineModel()
        self.production_model = ProductionModel()

    def analyze(self, extracted_data):
        # Calculate log-likelihood improvements
        baseline_likelihood = self.baseline_model.log_likelihood(extracted_data)
        production_likelihood = self.production_model.log_likelihood(extracted_data)

        improvement = production_likelihood - baseline_likelihood

        # Field-level analysis (per-field improvement, not the overall delta)
        field_analysis = {}
        for field, value in extracted_data.items():
            baseline_ll = self.baseline_model.field_likelihood(field, value)
            production_ll = self.production_model.field_likelihood(field, value)
            field_improvement = production_ll - baseline_ll
            field_analysis[field] = {
                'baseline_ll': baseline_ll,
                'production_ll': production_ll,
                'improvement': field_improvement,
                'confidence': self.calculate_confidence(field_improvement)
            }

        return {
            'overall_improvement': improvement,
            'field_analysis': field_analysis,
            'statistical_significance': self.test_significance(improvement)
        }

Key Insights from Log Analysis

The log values method revealed:

  1. Field-specific performance: Some fields (like patient names) had higher accuracy than others (like copay amounts); a small aggregation sketch follows this list
  2. Document type variations: Performance varied significantly across document types
  3. Improvement quantification: Measurable improvements over baseline approaches
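
The first insight fell out of simple aggregation over the Cove field-level scores. A minimal sketch, assuming each evaluation contributes a dict mapping field names to scores in [0, 1]:

from collections import defaultdict
from statistics import mean

def weakest_fields(field_score_dicts, bottom_k=3):
    """Rank fields by mean score across documents, lowest first."""
    by_field = defaultdict(list)
    for scores in field_score_dicts:
        for field, score in scores.items():
            by_field[field].append(score)
    ranked = sorted(
        ((field, mean(values)) for field, values in by_field.items()),
        key=lambda item: item[1],
    )
    return ranked[:bottom_k]

Fields like copay amounts surfacing at the bottom of this ranking told us where to spend prompt and validation effort.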

Production Pipeline Integration

Our evaluation framework integrates seamlessly with our production extraction pipeline:

Real-time Evaluation

class ProductionEvaluationPipeline:
    def __init__(self):
        self.evaluator = HealthcareEvaluationFramework()
        self.metrics_collector = MetricsCollector()
        self.alert_system = AlertSystem()

    async def process_document(self, document):
        # Extract data
        extracted_data = await self.extract_data(document)

        # Real-time evaluation
        evaluation_results = self.evaluator.evaluate_extraction(document, extracted_data)

        # Collect metrics
        self.metrics_collector.record(evaluation_results)

        # Check for alerts (the composite score lives under the Gevals results)
        composite = evaluation_results['gevals_metrics']['composite_score']
        if composite < 0.95:
            await self.alert_system.send_alert(document, evaluation_results)

        # Decide on human review
        if self.needs_human_review(evaluation_results):
            await self.queue_for_human_review(document, extracted_data, evaluation_results)

        return extracted_data, evaluation_results

    def needs_human_review(self, evaluation_results):
        ci_low, ci_high = evaluation_results['cove_score']['confidence_interval']
        return (
            evaluation_results['gevals_metrics']['composite_score'] < 0.98 or
            evaluation_results['domain_validation']['critical_errors'] > 0 or
            ci_high - ci_low > 0.1  # interval too wide: the model isn't sure
        )

Achieving 98% Accuracy: Key Strategies

Several strategies were crucial to achieving our 98% accuracy target:

1. Multi-Stage Validation

  • Automated evaluation catches obvious errors
  • Domain-specific validation ensures medical accuracy (a minimal validator sketch follows this list)
  • Human review handles edge cases and builds training data
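
The deterministic layer (HealthcareDomainValidator, referenced in the framework above) is deliberately boring: hard-coded invariants an LLM cannot talk its way past. A simplified sketch; the field names and thresholds are illustrative, and the production rule set is far longer:

from datetime import date

class HealthcareDomainValidator:
    def validate(self, extracted_data):
        errors = []

        # A copay can't be negative; an implausibly large one usually
        # means a misread digit or a shifted decimal point.
        copay = extracted_data.get('copay_amount')
        if copay is not None:
            try:
                if not 0 <= float(copay) <= 10_000:  # upper bound illustrative
                    errors.append('copay_out_of_range')
            except (TypeError, ValueError):
                errors.append('copay_not_numeric')

        # A date of service in the future is impossible.
        dos = extracted_data.get('date_of_service')
        if dos is not None:
            try:
                if date.fromisoformat(str(dos)) > date.today():
                    errors.append('future_date_of_service')
            except ValueError:
                errors.append('unparseable_date_of_service')

        # 'critical_errors' is the count the production pipeline gates on
        return {'critical_errors': len(errors), 'error_codes': errors}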

2. Continuous Learning

class ContinuousLearningSystem:
    def __init__(self):
        self.training_queue = TrainingQueue()
        self.model_updater = ModelUpdater()
        self.retrain_threshold = 500       # corrections before retraining (tunable)
        self.current_model_accuracy = 0.98

    async def learn_from_corrections(self, corrections):
        # Add human corrections to training queue
        for correction in corrections:
            self.training_queue.add(
                document=correction.original_document,
                correct_extraction=correction.corrected_data,
                error_type=correction.error_classification
            )

        # Trigger retraining when queue reaches threshold
        if len(self.training_queue) >= self.retrain_threshold:
            await self.trigger_retraining()

    async def trigger_retraining(self):
        training_data = self.training_queue.get_all()
        updated_model = await self.model_updater.retrain(training_data)

        # Validate updated model
        validation_results = await self.validate_model(updated_model)

        if validation_results['accuracy'] > self.current_model_accuracy:
            await self.deploy_updated_model(updated_model)

3. Error Analysis and Prevention

class ErrorAnalysisSystem:
    def __init__(self):
        self.error_classifier = ErrorClassifier()
        self.pattern_detector = PatternDetector()

    def analyze_errors(self, evaluation_results):
        errors = self.extract_errors(evaluation_results)

        # Classify error types
        classified_errors = [
            self.error_classifier.classify(error)
            for error in errors
        ]

        # Detect patterns
        patterns = self.pattern_detector.find_patterns(classified_errors)

        # Generate improvement recommendations
        recommendations = self.generate_recommendations(patterns)

        return {
            'error_breakdown': classified_errors,
            'patterns': patterns,
            'recommendations': recommendations
        }

Results and Impact

Our comprehensive evaluation framework has delivered significant results:

Quantitative Improvements

  • 98% accuracy on critical field extraction
  • 80% reduction in manual verification work
  • 95% confidence intervals within ±2% for most extractions (see the interval sketch after this list)
  • <100ms average evaluation latency in production
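
On the interval claim above: one standard way to put a 95% confidence interval around a binomial accuracy estimate is the Wilson score interval. This is textbook statistics, not Mantys-specific code:

from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for an accuracy estimate."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - half, center + half)

# 980 correct out of 1,000 gives roughly (0.969, 0.987):
# tight enough to distinguish a real 98% from a lucky 95%.
print(wilson_interval(980, 1000))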

Operational Benefits

  • Standardized quality metrics across all extraction tasks
  • Real-time monitoring of model performance
  • Automated alerting for accuracy degradation
  • Continuous improvement through systematic error analysis

Business Impact

  • Reduced operational costs through automation
  • Improved compliance with healthcare regulations
  • Faster processing times for critical documents
  • Higher customer satisfaction due to accuracy

Lessons Learned

Building this system taught us several important lessons:

1. Healthcare Requires Domain-Specific Evaluation

Standard NLP metrics don't capture healthcare-specific requirements. Custom metrics for clinical relevance and regulatory compliance were essential.

2. Human-in-the-Loop is Critical

Even with 98% accuracy, human oversight remains necessary for:

  • Edge cases and novel document types
  • Regulatory compliance verification
  • Continuous improvement through error correction

3. Multi-Modal Evaluation is Powerful

Combining different evaluation approaches (Cove, Gevals, log values) provides comprehensive coverage and catches different types of errors.

4. Real-Time Monitoring Enables Quick Response

Production monitoring allows us to detect and respond to accuracy degradation quickly, maintaining system reliability.

Future Directions

We're continuing to improve our evaluation framework:

Enhanced Automation

  • Self-improving models that automatically incorporate corrections
  • Predictive error detection to prevent issues before they occur
  • Dynamic threshold adjustment based on document complexity

Broader Healthcare Applications

  • Clinical decision support evaluation frameworks
  • Drug interaction detection accuracy measurement
  • Population health analytics validation systems

Conclusion

Building a production-grade LLM evaluation framework isn't just about importing a library. It requires a fundamental shift in how you build AI products.

For us, the key realization was that evaluation is not a separate step. It is the product. The confidence score we generate is just as important as the data we extract.

If you are building in high-stakes domains, stop looking for a better prompt. Start looking for a better way to measure if your prompt worked.


Harshavardhan is a Founding Engineer at Mantys Healthcare AI. He spends too much time looking at precision-recall curves.