StatQA Documentation

StatQA is a Python framework for automatically extracting structured facts, statistical insights, and multimodal Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements paired with rich visualizations, supporting rapid knowledge discovery, construction of CLIP-style multimodal RAG corpora, and LLM training.
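To make the idea of turning a column into a statement concrete, here is a minimal sketch using only the Python standard library. It is not StatQA's API: the function name `describe_column`, the output keys, and the sample data are all hypothetical and chosen for illustration.

```python
from statistics import mean, median, stdev


def describe_column(name, values):
    """Hypothetical helper: turn a numeric column into a fact sentence and one Q/A pair."""
    nums = [v for v in values if v is not None]  # drop missing values
    stats = {
        "n": len(nums),
        "mean": round(float(mean(nums)), 2),
        "median": round(float(median(nums)), 2),
        "std": round(stdev(nums), 2),  # sample standard deviation
    }
    fact = (
        f"{name} has {stats['n']} non-missing values with mean {stats['mean']}, "
        f"median {stats['median']}, and standard deviation {stats['std']}."
    )
    qa = {
        "question": f"What is the mean of {name}?",
        "answer": str(stats["mean"]),
    }
    return {"stats": stats, "fact": fact, "qa": qa}


# Illustrative data only; not taken from any real dataset.
result = describe_column("age", [23, 35, None, 41, 29])
print(result["fact"])
print(result["qa"])
```

A real pipeline would also attach a plot and provenance metadata to each statement; this sketch covers only the statistics-to-text step.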

Key Features

  • 📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats

  • 🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships

  • 📊 Comprehensive Statistical Analysis:

    • Univariate: descriptive statistics, distribution tests, robust estimators

    • Bivariate: correlations, chi-square, group comparisons with effect sizes

    • Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis

    • Causal: regression with confounding control, sensitivity analysis

  • 💬 Natural Language Insights: Convert statistics to publication-ready text

  • ❓ Multimodal Q/A Generation: Create CLIP-style visual-text pairs with template-based and LLM-paraphrased questions

  • 🖼️ Rich Visual Metadata: Captions, alt-text, and visual elements for each plot (colors, annotations, features)

  • 🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types, plot generation)

  • 📈 Publication-Quality Visualizations: Automated plots for all analyses with question-plot association mapping

  • 🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests

  • ⚡ Modern Python: Type-safe (Pydantic), async-ready, fully typed
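Two of the techniques named in the list above, Mann-Kendall trend detection and multiple testing correction, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not StatQA's implementation: the function names and sample series are hypothetical, the Mann-Kendall variance omits the tie correction, and Benjamini-Hochberg is used as one common correction method (the source does not say which method StatQA applies).

```python
import math


def mann_kendall(series):
    """Mann-Kendall trend test (normal approximation, no tie correction)."""
    n = len(series)
    # S is the number of increasing pairs minus decreasing pairs over all i < j.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical yearly series, for illustration only.
columns = {
    "rising": [1, 2, 3, 4, 5, 6, 7, 8],
    "flat": [3, 3, 3, 3, 3, 3, 3, 3],
    "noisy": [5, 1, 4, 2, 6, 3, 5, 4],
}
raw_p = [mann_kendall(values)[2] for values in columns.values()]
for name, p_adj in zip(columns, benjamini_hochberg(raw_p)):
    print(f"{name}: adjusted p = {p_adj:.4f}")
```

Correcting across all tested columns at once matters because running many trend tests on one dataset inflates the chance of a spurious "significant" result.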

Indices and tables