StatQA Documentation

StatQA is a Python framework for automatically extracting structured facts, insights, and multimodal Q/A pairs from tabular datasets with rich visual metadata and publication-quality plots.

Features

  • Metadata Understanding: Parse codebooks and enrich with LLM

  • Statistical Analysis: Univariate, bivariate, temporal, and causal analyses

  • Natural Language Insights: Convert statistics to human-readable text

  • Multimodal Q/A Generation: Create CLIP-style visual-text pairs for AI training

  • Rich Visual Metadata: Captions, alt-text, and visual elements for each plot

  • Publication-Quality Visualizations: Automated plots with question-plot association mapping

  • Accessibility Support: Full alt-text and captions for inclusive AI applications

Quick Start

Installation:

pip install statqa

Basic usage with multimodal Q/A generation:

from statqa import Codebook, UnivariateAnalyzer
from statqa.metadata.parsers import TextParser
from statqa.qa import QAGenerator

# Parse codebook
parser = TextParser()
codebook = parser.parse("codebook.txt")

# Run analysis with visualization
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])

# Generate multimodal Q/A pairs
qa_gen = QAGenerator()
plot_data = {"data": data, "variables": codebook.variables, "output_path": "plots/age.png"}
visual_metadata = qa_gen.generate_visual_metadata(result, variables=["age"], plot_data=plot_data)
qa_pairs = qa_gen.generate_qa_pairs(result, insight, variables=["age"], visual_data=visual_metadata)

Indices and tables