Getting Started

Welcome to StatQA! This guide will help you get up and running quickly.

Key Features

📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships
📊 Comprehensive Statistical Analysis:
- Univariate: descriptive statistics, distribution tests, robust estimators
- Bivariate: correlations, chi-square, group comparisons with effect sizes
- Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
- Causal: regression with confounding control, sensitivity analysis
💬 Natural Language Insights: Convert statistics to publication-ready text
❓ Multimodal Q/A Generation: Create CLIP-style visual-text pairs with template-based and LLM-paraphrased questions
🖼️ Rich Visual Metadata: Captions, alt-text, and visual elements for each plot (colors, annotations, features)
🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types, plot generation)
📈 Publication-Quality Visualizations: Automated plots for all analyses with question-plot association mapping
🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests
⚡ Modern Python: Type-safe (Pydantic), async-ready, fully typed

Installation

Basic Installation

pip install statqa
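
To confirm that the install landed in the active environment, a small standard-library check like the one below can help. It is only a sketch: the helper name `installed_version` is made up for this guide, and the only assumption about StatQA is that its distribution name is `statqa`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the statqa version, or say that it is missing from this environment.
found = installed_version("statqa")
print(found if found is not None else "statqa is not installed in this environment")
```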

With Optional Features

# Quote the extras so shells such as zsh do not try to expand the brackets

# Include LLM support (OpenAI/Anthropic)
pip install "statqa[llm]"

# Include PDF parsing
pip install "statqa[pdf]"

# Include statistical formats (SPSS/Stata/SAS)
pip install "statqa[statistical-formats]"

# Development installation
pip install "statqa[dev]"

# Complete installation
pip install "statqa[all]"

From Source

git clone https://github.com/gojiplus/statqa.git
cd statqa
# Editable install with the dev extras (requires uv: pip install uv)
uv pip install -e ".[dev]"

Development Environment

For development, we recommend using uv for faster dependency management:

# Install uv first
pip install uv

# Clone and setup
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv sync --all-extras

Quick Start

1. Create a Codebook

from statqa.metadata.parsers import TextParser

codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999

# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  2: Dissatisfied
  3: Neutral
  4: Satisfied
  5: Very Satisfied
"""

parser = TextParser()
codebook = parser.parse(codebook_text)
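
To make the codebook layout above concrete, here is a minimal, standard-library sketch that reads the same text format into plain dictionaries. It is not how StatQA's `TextParser` is implemented; the function name `parse_codebook_sketch` and the resulting dictionary layout are assumptions made purely for illustration.

```python
def parse_codebook_sketch(text: str) -> dict:
    """Parse '# Variable:' blocks into {name: {field: value}} (illustrative only)."""
    variables = {}
    current = None      # fields of the variable currently being read
    in_values = False   # True while reading indented "code: label" lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# Variable:"):
            name = line.split(":", 1)[1].strip()
            current = variables.setdefault(name, {})
            in_values = False
            continue
        if current is None or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if in_values and raw.startswith((" ", "\t")):
            # Indented lines belong to the preceding "Values:" field
            current["values"][key] = value
        elif key == "Values" and not value:
            current["values"] = {}
            in_values = True
        else:
            current[key.lower()] = value
            in_values = False
    return variables


sample = """
# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  5: Very Satisfied
"""
parsed = parse_codebook_sketch(sample)
print(parsed["satisfaction"]["values"])
```

The real parser returns richer objects (the later steps index `codebook.variables["age"]`), so treat this only as a way to see which fields the text format carries.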

2. Run Statistical Analyses

import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer

# Load your data
data = pd.read_csv("survey_data.csv")

# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])

print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}

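# Bivariate analysis (stored under its own name so `result` still holds
# the univariate output used in the following steps)
biv_analyzer = BivariateAnalyzer()
biv_result = biv_analyzer.analyze(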
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"]
)
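
For intuition about what a univariate summary contains, the sketch below computes comparable descriptive statistics with the standard library, after dropping the missing-value codes declared for age in the codebook (-1 and 999). It does not call StatQA. The function name `describe_sketch`, the sample values, and the 1.5×IQR outlier rule are assumptions for illustration and may differ from what `UnivariateAnalyzer` actually does.

```python
import statistics


def describe_sketch(values, missing_codes=(-1, 999)):
    """Summarise a numeric variable after removing declared missing codes."""
    clean = [v for v in values if v not in missing_codes]
    q1, _, q3 = statistics.quantiles(clean, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # assumed outlier fences
    outliers = [v for v in clean if v < low or v > high]
    return {
        "n": len(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "std": statistics.stdev(clean),           # sample standard deviation
        "min": min(clean),
        "max": max(clean),
        "outlier_pct": 100 * len(outliers) / len(clean),
    }


# Hypothetical ages; the last two entries are missing-value codes.
ages = [23, 35, 41, 41, 52, 67, -1, 999]
summary = describe_sketch(ages)
print(summary)
```

Leaving the codes in would badly distort the mean and the range, which is why the codebook's `Missing:` field matters before any statistic is computed.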

3. Generate Natural Language Insights

from statqa.interpretation import InsightFormatter

formatter = InsightFormatter()
insight = formatter.format_univariate(result)

print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."
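
The shape of that sentence can be reproduced with ordinary string formatting. The following is a sketch only, not the `InsightFormatter` implementation: the function name `format_univariate_sketch` and the dictionary keys (`min`, `max`, `n`, `outlier_pct`) are hypothetical, and the numbers are copied from the sample output above.

```python
def format_univariate_sketch(label: str, stats: dict) -> str:
    """Render summary statistics as a one-line, Markdown-flavoured insight."""
    return (
        f"**{label}**: mean={stats['mean']:.1f}, median={stats['median']:.1f}, "
        f"std={stats['std']:.1f}, range=[{stats['min']:g}, {stats['max']:g}]. "
        f"N={stats['n']:,} [{stats['outlier_pct']:.1f}% outliers]."
    )


# Values taken from the sample output shown in this guide.
example_stats = {
    "mean": 42.5, "median": 41.0, "std": 12.3,
    "min": 18, "max": 95, "n": 1000, "outlier_pct": 2.3,
}
text = format_univariate_sketch("Respondent Age", example_stats)
print(text)
```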

4. Create Multimodal Q/A Pairs for LLM Training

from statqa.qa import QAGenerator
from statqa.visualization import PlotFactory

qa_gen = QAGenerator(use_llm=False)  # Template-based

# Generate Q/A pairs with visual metadata
plot_data = {
    "data": data,
    "variables": codebook.variables,
    "output_path": "plots/univariate_age.png"
}
visual_metadata = qa_gen.generate_visual_metadata(result, variables=["age"], plot_data=plot_data)
qa_pairs = qa_gen.generate_qa_pairs(result, insight, variables=["age"], visual_data=visual_metadata)

for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Plot: {qa['visual']['primary_plot']}")
    print(f"Caption: {qa['visual']['caption']}")
    print(f"Provenance: {qa['provenance']}\n")
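
Since the pairs are meant for LLM training, one plausible next step is to write them out as JSON Lines, one record per line. The sketch below uses only the keys that the loop above reads (`question`, `answer`, `visual.primary_plot`, `visual.caption`, `provenance`); the record contents and the helper name `to_jsonl` are hypothetical, and the real output of `generate_qa_pairs` may carry additional fields.

```python
import json


def to_jsonl(records) -> str:
    """Serialise an iterable of dicts as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Hypothetical record that mirrors the keys accessed in the loop above.
example_pairs = [
    {
        "question": "What is the average respondent age?",
        "answer": "The mean age is 42.5 years.",
        "visual": {
            "primary_plot": "plots/univariate_age.png",
            "caption": "Distribution of respondent age",
        },
        "provenance": {"method": "template"},
    }
]

jsonl = to_jsonl(example_pairs)
print(jsonl)
```

Each line can then be parsed back independently with `json.loads`, which keeps large datasets streamable.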

Next Steps

Tutorials: detailed, step-by-step guides
API Reference: complete API documentation
Examples: real-world examples