# Evaluation Guide

This guide covers how to evaluate table QA models using Tabula Rasa.

## Overview

Model evaluation helps you understand:

- How well your model performs on unseen data
- Which types of questions are challenging
- Where improvements are needed

## Basic Evaluation

```python
from tabula_rasa import TabulaRasa, TableQADataset
from tabula_rasa.evaluation import Evaluator

# Load model and test data
model = TabulaRasa.from_pretrained("./my_model")
test_dataset = TableQADataset.from_json("test.json")

# Evaluate
evaluator = Evaluator(model=model)
results = evaluator.evaluate(test_dataset)

print(f"Exact Match: {results['exact_match']:.2%}")
print(f"F1 Score: {results['f1']:.2%}")
```

## Metrics

### Exact Match (EM)

The percentage of predictions that match the ground truth exactly:

```python
# EM = (number of exact matches) / (total examples)
```

### F1 Score

Token-level F1 score between prediction and ground truth:

```python
# Precision = (true positives) / (true positives + false positives)
# Recall = (true positives) / (true positives + false negatives)
# F1 = 2 * (precision * recall) / (precision + recall)
```
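
If you need to compute these metrics outside the `Evaluator`, the sketch below shows a standalone implementation with simple whitespace tokenization (production EM/F1 implementations typically also normalize case and punctuation). `exact_match` and `token_f1` are illustrative helpers, not part of the Tabula Rasa API:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings match exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference."""
    pred_tokens = prediction.strip().split()
    ref_tokens = reference.strip().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Shared tokens, respecting multiplicity
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Corpus-level scores are the averages over all examples
predictions = ["42", "Paris"]
references = ["42", "Paris France"]
em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(predictions)
f1 = sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(predictions)
print(f"EM: {em:.2%}  F1: {f1:.2%}")
```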

### Custom Metrics

Define custom evaluation metrics:

```python
from tabula_rasa.evaluation import Metric

class TableAccuracyMetric(Metric):
    """Custom metric for table-specific accuracy."""

    def compute(self, predictions, references):
        correct = sum(
            pred.strip() == ref.strip()
            for pred, ref in zip(predictions, references)
        )
        return {"table_accuracy": correct / len(predictions)}

# Use in evaluation
evaluator = Evaluator(
    model=model,
    metrics=[TableAccuracyMetric()],
)
```

## Error Analysis

Identify where your model struggles:

```python
results = evaluator.evaluate(test_dataset, return_predictions=True)

# Collect examples the model got wrong
errors = [
    (example, pred)
    for example, pred in zip(test_dataset, results['predictions'])
    if pred != example['answer']
]

# Group errors by question type
error_types = {
    'numerical': [],
    'textual': [],
    'aggregation': [],
}
for example, pred in errors:
    if example['question_type'] == 'numerical':
        error_types['numerical'].append((example, pred))
    # ... categorize other types
```

## Benchmarking

Compare your model against baselines:

```python
from tabula_rasa.evaluation import benchmark

models = {
    'baseline': TabulaRasa.from_pretrained('t5-small'),
    'fine-tuned': TabulaRasa.from_pretrained('./my_model'),
}

results = benchmark(
    models=models,
    dataset=test_dataset,
    metrics=['exact_match', 'f1'],
)

# Print comparison
for model_name, metrics in results.items():
    print(f"{model_name}:")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.2%}")
```

## Cross-Validation

Perform k-fold cross-validation:

```python
from tabula_rasa.evaluation import cross_validate

results = cross_validate(
    model=model,
    dataset=full_dataset,
    n_splits=5,
    metrics=['exact_match', 'f1'],
)

print(f"Mean EM: {results['exact_match'].mean():.2%} ± {results['exact_match'].std():.2%}")
print(f"Mean F1: {results['f1'].mean():.2%} ± {results['f1'].std():.2%}")
```

## Performance Profiling

Measure inference speed:

```python
from tabula_rasa.evaluation import profile

# Profile the model across several batch sizes
profile_results = profile(
    model=model,
    dataset=test_dataset,
    batch_sizes=[1, 4, 8, 16],
)

for batch_size, metrics in profile_results.items():
    print(f"Batch size {batch_size}:")
    print(f"  Throughput: {metrics['throughput']:.2f} examples/sec")
    print(f"  Latency: {metrics['latency']:.2f} ms/example")
```

## Evaluation on Specific Question Types

Evaluate performance on different question categories:

```python
# Group examples by question type
question_types = {}
for example in test_dataset:
    qtype = example.get('question_type', 'general')
    if qtype not in question_types:
        question_types[qtype] = []
    question_types[qtype].append(example)

# Evaluate each type separately
for qtype, examples in question_types.items():
    subset = TableQADataset(examples)
    results = evaluator.evaluate(subset)
    print(f"\n{qtype.upper()}:")
    print(f"  EM: {results['exact_match']:.2%}")
    print(f"  F1: {results['f1']:.2%}")
```

## CLI Evaluation

Use the command-line interface:

```bash
# Basic evaluation
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --output results.json

# With specific metrics
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --metrics exact_match f1 bleu \
    --output results.json

# Detailed error analysis
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --error-analysis \
    --output results.json
```

## Best Practices

1. **Hold-Out Test Set**: Always evaluate on data the model hasn't seen during training
2. **Multiple Metrics**: Use multiple metrics to get a complete picture of performance
3. **Error Analysis**: Regularly analyze errors to identify improvement opportunities
4. **Stratified Sampling**: Ensure the test set represents all question types (see the split sketch after this list)
5. **Statistical Significance**: Use multiple runs and report confidence intervals (see the bootstrap sketch after this list)
6. **Domain-Specific Evaluation**: Create custom metrics for your specific use case
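
For the stratified-sampling practice, one option is to hold out a fixed fraction of each question type so the test set mirrors the overall distribution. A minimal sketch, assuming examples are dicts with a `question_type` field as in the sections above; `stratified_split` is an illustrative helper, not part of the Tabula Rasa API:

```python
import random
from collections import defaultdict

def stratified_split(examples, test_fraction=0.2, seed=0):
    """Hold out test_fraction of each question type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for example in examples:
        by_type[example.get('question_type', 'general')].append(example)

    train, test = [], []
    for group in by_type.values():
        rng.shuffle(group)
        # Keep at least one test example per type
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Usage (all_examples is a list of example dicts):
# train_examples, test_examples = stratified_split(all_examples)
```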
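
For the statistical-significance practice, a common approach is a percentile bootstrap over per-example scores, for instance the per-example exact-match scores you can derive from `evaluate(..., return_predictions=True)`. A minimal sketch; `bootstrap_ci` is an illustrative helper, not part of the Tabula Rasa API:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record each resample's mean
    means = sorted(
        sum(rng.choices(scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: per-example EM scores (1.0 = correct, 0.0 = wrong)
em_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
low, high = bootstrap_ci(em_scores)
mean = sum(em_scores) / len(em_scores)
print(f"EM: {mean:.2%} (95% CI: {low:.2%} to {high:.2%})")
```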

## Example Evaluation Script

```python
#!/usr/bin/env python
"""Evaluation script for table QA model."""

import argparse
import json

from tabula_rasa import TabulaRasa, TableQADataset
from tabula_rasa.evaluation import Evaluator


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--data", required=True)
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args()

    # Load model and data
    model = TabulaRasa.from_pretrained(args.model)
    dataset = TableQADataset.from_json(args.data)

    # Evaluate
    evaluator = Evaluator(model=model)
    results = evaluator.evaluate(
        dataset,
        return_predictions=True,
        return_confidence=True,
    )

    # Save results
    with open(args.output, 'w') as f:
        json.dump(results, f, indent=2)

    # Print summary
    print("\nEvaluation Results:")
    print(f"  Exact Match: {results['exact_match']:.2%}")
    print(f"  F1 Score: {results['f1']:.2%}")
    print(f"\nResults saved to {args.output}")


if __name__ == "__main__":
    main()
```
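
Assuming the script above is saved as `evaluate.py` (the filename is illustrative), you can run it like this:

```bash
# Evaluate a fine-tuned model on a held-out test set
python evaluate.py \
    --model ./my_model \
    --data test.json \
    --output results.json
```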