Evaluation Guide¶
This guide covers how to evaluate table QA models using Tabula Rasa.
Overview¶
Model evaluation helps you understand:
How well your model performs on unseen data
Which types of questions are challenging
Where improvements are needed
Basic Evaluation¶
from tabula_rasa import TabulaRasa, TableQADataset
from tabula_rasa.evaluation import Evaluator
# Load model and test data
model = TabulaRasa.from_pretrained("./my_model")
test_dataset = TableQADataset.from_json("test.json")
# Evaluate
evaluator = Evaluator(model=model)
results = evaluator.evaluate(test_dataset)
print(f"Exact Match: {results['exact_match']:.2%}")
print(f"F1 Score: {results['f1']:.2%}")
Metrics¶
Exact Match (EM)¶
The percentage of predictions that match the ground truth exactly:
# EM = (number of exact matches) / (total examples)
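For reference, a minimal stand-alone sketch of this computation over two lists of answer strings (the whitespace stripping and lowercasing are common normalization choices, not necessarily what the built-in metric applies):
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after light normalization."""
    matches = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(predictions)

# exact_match(["42", "Paris"], ["42", "paris"]) returns 1.0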
F1 Score¶
The token-level F1 score between the predicted and ground-truth answers, treating each as a bag of tokens:
# Precision = (true positives) / (true positives + false positives)
# Recall = (true positives) / (true positives + false negatives)
# F1 = 2 * (precision * recall) / (precision + recall)
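The stand-alone sketch below shows one common way to compute this, using whitespace tokenization and multiset overlap via collections.Counter; the library's built-in implementation may normalize answers differently:
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between two answer strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)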
Custom Metrics¶
Define custom evaluation metrics:
from tabula_rasa.evaluation import Metric
class TableAccuracyMetric(Metric):
    """Custom metric for table-specific accuracy."""

    def compute(self, predictions, references):
        correct = sum(
            pred.strip() == ref.strip()
            for pred, ref in zip(predictions, references)
        )
        return {"table_accuracy": correct / len(predictions)}
# Use in evaluation
evaluator = Evaluator(
    model=model,
    metrics=[TableAccuracyMetric()],
)
Error Analysis¶
Identify where your model struggles:
results = evaluator.evaluate(test_dataset, return_predictions=True)
# Analyze errors
errors = [
    (example, pred)
    for example, pred in zip(test_dataset, results['predictions'])
    if pred != example['answer']
]

# Group by error type
error_types = {
    'numerical': [],
    'textual': [],
    'aggregation': [],
}

for example, pred in errors:
    if example['question_type'] == 'numerical':
        error_types['numerical'].append((example, pred))
    # ... categorize other types
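A more compact alternative to the dict-and-loop above, which also prints a per-type error count (it assumes, as above, that each example carries a question_type field):
error_types = {}
for example, pred in errors:
    qtype = example.get('question_type', 'other')
    error_types.setdefault(qtype, []).append((example, pred))

for qtype, items in sorted(error_types.items()):
    print(f"{qtype}: {len(items)} errors")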
Benchmarking¶
Compare your model against baselines:
from tabula_rasa.evaluation import benchmark
models = {
    'baseline': TabulaRasa.from_pretrained('t5-small'),
    'fine-tuned': TabulaRasa.from_pretrained('./my_model'),
}

results = benchmark(
    models=models,
    dataset=test_dataset,
    metrics=['exact_match', 'f1'],
)

# Print comparison
for model_name, metrics in results.items():
    print(f"{model_name}:")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.2%}")
Cross-Validation¶
Perform k-fold cross-validation:
from tabula_rasa.evaluation import cross_validate
results = cross_validate(
    model=model,
    dataset=full_dataset,
    n_splits=5,
    metrics=['exact_match', 'f1'],
)
print(f"Mean EM: {results['exact_match'].mean():.2%} ± {results['exact_match'].std():.2%}")
print(f"Mean F1: {results['f1'].mean():.2%} ± {results['f1'].std():.2%}")
Performance Profiling¶
Measure inference speed:
from tabula_rasa.evaluation import profile
# Profile model
profile_results = profile(
    model=model,
    dataset=test_dataset,
    batch_sizes=[1, 4, 8, 16],
)

for batch_size, metrics in profile_results.items():
    print(f"Batch size {batch_size}:")
    print(f"  Throughput: {metrics['throughput']:.2f} examples/sec")
    print(f"  Latency: {metrics['latency']:.2f} ms/example")
Evaluation on Specific Question Types¶
Evaluate performance on different question categories:
# Group by question type
question_types = {}
for example in test_dataset:
    qtype = example.get('question_type', 'general')
    if qtype not in question_types:
        question_types[qtype] = []
    question_types[qtype].append(example)
# Evaluate each type
for qtype, examples in question_types.items():
    subset = TableQADataset(examples)
    results = evaluator.evaluate(subset)
    print(f"\n{qtype.upper()}:")
    print(f"  EM: {results['exact_match']:.2%}")
    print(f"  F1: {results['f1']:.2%}")
CLI Evaluation¶
Use the command-line interface:
# Basic evaluation
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --output results.json

# With specific metrics
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --metrics exact_match f1 bleu \
    --output results.json

# Detailed error analysis
tabula-rasa eval \
    --model ./my_model \
    --data test.json \
    --error-analysis \
    --output results.json
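The JSON written via --output can then be consumed programmatically. A minimal sketch of reading it back (the exact keys depend on the metrics you requested; 'exact_match' and 'f1' are assumed here):
import json

with open("results.json") as f:
    results = json.load(f)

# Key names are assumptions; adjust to the metrics you requested.
for name in ("exact_match", "f1"):
    if name in results:
        print(f"{name}: {results[name]:.2%}")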
Best Practices¶
Hold-Out Test Set: Always evaluate on data the model hasn’t seen during training
Multiple Metrics: Use multiple metrics to get a complete picture of performance
Error Analysis: Regularly analyze errors to identify improvement opportunities
Stratified Sampling: Ensure test set represents all question types
Statistical Significance: Use multiple runs and report confidence intervals (a bootstrap sketch follows this list)
Domain-Specific Evaluation: Create custom metrics for your specific use case
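For the statistical-significance point above, a simple percentile bootstrap over per-example exact-match scores is often enough to report a confidence interval. A stand-alone sketch (the scores list is placeholder data, not output from the library):
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Placeholder per-example exact-match scores (1 = correct, 0 = incorrect)
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
low, high = bootstrap_ci(scores)
print(f"EM mean: {sum(scores) / len(scores):.2%}  (95% CI: {low:.2%} to {high:.2%})")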
Example Evaluation Script¶
#!/usr/bin/env python
"""Evaluation script for table QA model."""
import argparse
import json
from tabula_rasa import TabulaRasa, TableQADataset
from tabula_rasa.evaluation import Evaluator
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--data", required=True)
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args()

    # Load model and data
    model = TabulaRasa.from_pretrained(args.model)
    dataset = TableQADataset.from_json(args.data)

    # Evaluate
    evaluator = Evaluator(model=model)
    results = evaluator.evaluate(
        dataset,
        return_predictions=True,
        return_confidence=True,
    )

    # Save results
    with open(args.output, 'w') as f:
        json.dump(results, f, indent=2)

    # Print summary
    print("\nEvaluation Results:")
    print(f"  Exact Match: {results['exact_match']:.2%}")
    print(f"  F1 Score: {results['f1']:.2%}")
    print(f"\nResults saved to {args.output}")


if __name__ == "__main__":
    main()