Architecture¶

StatQA’s modular design enables flexible workflows and easy extension.

Core Components¶

Metadata System (`statqa/metadata/`)¶

Schema: Pydantic models for Variables and Codebooks
Parsers: Support for text, CSV, PDF, and statistical formats
Enrichment: LLM-powered metadata enhancement

Analysis Pipeline (`statqa/analysis/`)¶

Univariate: Descriptive statistics and distribution testing
Bivariate: Correlations, group comparisons, effect sizes
Temporal: Trend detection and change point analysis
Causal: Regression with confounding control

Q/A Generation (`statqa/qa/`)¶

Templates: Structured question generation patterns
LLM Integration: Natural language paraphrasing
Visual Metadata: Plot association and descriptions

Data Flow¶

Parse Metadata → Variable/Codebook objects
LLM Enrichment → Enhanced type inference (optional)
Statistical Analysis → Numerical results
Format Insights → Human-readable text
Generate Q/A → Training-ready pairs
Export → JSONL/OpenAI/Anthropic formats

Extension Points¶

Custom Analyzers: Implement analyze() method
New Parsers: Inherit from BaseParser
Output Formats: Add formatters and exporters
LLM Providers: Extend enricher with new APIs