StatQA Documentation
StatQA is a modern Python framework for automatically extracting structured facts, statistical insights, and multimodal Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements paired with rich visualizations, enabling rapid knowledge discovery, CLIP-style multimodal RAG corpus construction, and LLM training.
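To make the idea of turning raw values into a human-readable statement concrete, here is a minimal sketch using only the Python standard library. It does not use StatQA's API; the function name `describe_column` and the sample data are hypothetical and chosen purely for illustration.

```python
from statistics import mean, median, stdev


def describe_column(name, values):
    """Summarize a numeric column as one plain-English sentence.

    Hypothetical helper for illustration only; not part of StatQA.
    """
    clean = [v for v in values if v is not None]  # drop missing values
    return (
        f"{name} has {len(clean)} non-missing values with mean {mean(clean):.2f}, "
        f"median {median(clean):.2f}, and standard deviation {stdev(clean):.2f}."
    )


# Illustrative data: one column with a single missing entry.
fact = describe_column("age", [23, 35, None, 41, 29])
print(fact)
```

A statement like this could then be paired with a plot of the same column to form one text-image entry in a multimodal corpus.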
Key Features
Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
LLM-Powered Enrichment: Automatically infer variable types and relationships
Comprehensive Statistical Analysis:
    Univariate: descriptive statistics, distribution tests, robust estimators
    Bivariate: correlations, chi-square, group comparisons with effect sizes
    Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
    Causal: regression with confounding control, sensitivity analysis
Natural Language Insights: Convert statistics to publication-ready text
Multimodal Q/A Generation: Create CLIP-style visual-text pairs with template-based and LLM-paraphrased questions
Rich Visual Metadata: Captions, alt-text, and visual elements for each plot (colors, annotations, features)
Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types, plot generation)
Publication-Quality Visualizations: Automated plots for all analyses with question-plot association mapping
Statistical Rigor: Multiple testing correction, effect sizes, normality tests
Modern Python: Type-safe (Pydantic), async-ready, fully typed
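Two of the techniques named in the feature list, Mann-Kendall trend detection and multiple testing correction, can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not StatQA's implementation: the function names are hypothetical, the Mann-Kendall variance omits the tie correction, and Benjamini-Hochberg is assumed here as the correction method because the feature list does not say which one StatQA applies.

```python
import math


def mann_kendall(series):
    """Mann-Kendall trend test (normal approximation, no tie correction).

    Returns (S, z, two-sided p-value). Hypothetical helper, not StatQA's API.
    """
    n = len(series)
    # S = number of increasing pairs minus number of decreasing pairs.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S assuming no ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return s, z, p


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative data: a strictly increasing yearly series.
s, z, p = mann_kendall([1, 2, 3, 4, 5, 6, 7, 8])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

When many variables are tested for trends at once, adjusting the resulting p-values in this way limits the expected share of false discoveries among the results reported as significant.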