StatQA Documentation

StatQA is a Python framework for automatically extracting structured facts, statistical insights, and multimodal Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements paired with visualizations, supporting rapid knowledge discovery, construction of CLIP-style multimodal RAG corpora, and LLM training.
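
As a rough illustration of the idea, and not StatQA's actual API, the sketch below turns one numeric column into a plain-language fact and a matching template-based Q/A pair using only the Python standard library. The helper name `describe_column`, the column name, and the values are all hypothetical.

```python
from statistics import mean, median, stdev


def describe_column(name: str, values: list[float]) -> dict[str, str]:
    """Turn one numeric column into a fact sentence and a Q/A pair.

    Hypothetical helper for illustration; not part of StatQA.
    """
    avg = mean(values)
    med = median(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    fact = (
        f"The variable '{name}' has a mean of {avg:.2f}, "
        f"a median of {med:.2f}, and a standard deviation of {sd:.2f} "
        f"(n = {len(values)})."
    )
    return {
        "fact": fact,
        "question": f"What is the mean of '{name}'?",
        "answer": f"{avg:.2f}",
    }


# Made-up example data, for illustration only.
pair = describe_column("age", [23, 35, 31, 46, 52])
print(pair["fact"])
print(pair["question"], "->", pair["answer"])
```

A real pipeline would presumably attach a plot and provenance metadata to each pair; this sketch covers only the text side.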

Key Features

📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships
📊 Comprehensive Statistical Analysis:
- Univariate: descriptive statistics, distribution tests, robust estimators
- Bivariate: correlations, chi-square, group comparisons with effect sizes
- Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
- Causal: regression with confounding control, sensitivity analysis
💬 Natural Language Insights: Convert statistics to publication-ready text
❓ Multimodal Q/A Generation: Create CLIP-style visual-text pairs with template-based and LLM-paraphrased questions
🖼️ Rich Visual Metadata: Captions, alt-text, and visual elements for each plot (colors, annotations, features)
🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types, plot generation)
📈 Publication-Quality Visualizations: Automated plots for all analyses with question-plot association mapping
🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests
⚡ Modern Python: Pydantic-validated data models, async-ready, fully type-annotated
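
To make the temporal analysis item above more concrete, here is a minimal standalone sketch of the Mann-Kendall trend test named in the list. It is not StatQA's implementation: the function name `mann_kendall` and the sample series are hypothetical, and the sketch assumes a series without tied values and uses the normal approximation with a continuity correction.

```python
import math
from itertools import combinations


def mann_kendall(series: list[float]) -> tuple[int, float, float]:
    """Return (S, Z, two-sided p) for the Mann-Kendall trend test.

    Hypothetical helper for illustration; assumes no tied values.
    """
    n = len(series)
    # S = (number of increasing pairs) - (number of decreasing pairs),
    # taken over all pairs where the second value comes later in time.
    s = sum(
        (later > earlier) - (later < earlier)
        for earlier, later in combinations(series, 2)
    )
    # Variance of S under the null hypothesis of no trend (no tie correction).
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


# Made-up yearly values, for illustration only.
s, z, p = mann_kendall([2.1, 2.4, 2.3, 2.9, 3.2, 3.1, 3.8, 4.0])
print(f"S = {s}, Z = {z:.2f}, p = {p:.4f}")
```

A positive S with a small p-value suggests an increasing monotonic trend. The sketch ignores ties and autocorrelation, both of which a production implementation would need to handle.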

Indices and tables