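Getting Started
Welcome to StatQA! This guide covers installation and a four-step quick start: creating a codebook, running statistical analyses, generating natural language insights, and creating multimodal Q/A pairs.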
Key Features
📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships
📊 Comprehensive Statistical Analysis:
    Univariate: descriptive statistics, distribution tests, robust estimators
    Bivariate: correlations, chi-square, group comparisons with effect sizes
    Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
    Causal: regression with confounding control, sensitivity analysis
💬 Natural Language Insights: Convert statistics to publication-ready text
❓ Multimodal Q/A Generation: Create CLIP-style visual-text pairs with template-based and LLM-paraphrased questions
🖼️ Rich Visual Metadata: Captions, alt-text, and visual elements for each plot (colors, annotations, features)
🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types, plot generation)
📈 Publication-Quality Visualizations: Automated plots for all analyses with question-plot association mapping
🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests
⚡ Modern Python: Type-safe (Pydantic), async-ready, fully typed
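To make one of the listed methods concrete: the Mann-Kendall test named under Temporal analysis checks for a monotonic trend by comparing every pair of observations in time order. The sketch below is a standard-library illustration of that test, not StatQA's implementation. The function name `mann_kendall` is hypothetical, and the variance formula assumes the series has no tied values.

```python
import math
from itertools import combinations


def mann_kendall(values):
    """Return (S, z, two-sided p) for a Mann-Kendall trend test.

    Illustrative only: the variance formula assumes no tied values.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")

    # S sums the sign of (later - earlier) over every pair in time order.
    s = sum(
        (later > earlier) - (later < earlier)
        for earlier, later in combinations(values, 2)
    )

    # Variance of S under the null hypothesis of no trend (no ties).
    var_s = n * (n - 1) * (2 * n + 5) / 18

    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


s, z, p = mann_kendall([1, 2, 3, 4, 5, 6, 7, 8])
print(f"S={s}, z={z:.2f}, p={p:.4f}")
```

A positive S with a small p-value suggests an increasing trend; a negative S suggests a decreasing one.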
Installation
Basic Installation
pip install statqa
With Optional Features
# Include LLM support (OpenAI/Anthropic)
pip install "statqa[llm]"
# Include PDF parsing
pip install "statqa[pdf]"
# Include statistical formats (SPSS/Stata/SAS)
pip install "statqa[statistical-formats]"
# Development installation
pip install "statqa[dev]"
# Complete installation
pip install "statqa[all]"
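After installing, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is illustrative and not part of StatQA, and it assumes the distribution is named `statqa`, as in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if StatQA is installed, otherwise None.
print(installed_version("statqa"))
```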
From Source
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv pip install -e ".[dev]"
Development Environment
For development, we recommend using uv for faster dependency management:
# Install uv first
pip install uv
# Clone and setup
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv sync --all-extras
Quick Start
1. Create a Codebook
from statqa.metadata.parsers import TextParser
codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
1: Very Dissatisfied
2: Dissatisfied
3: Neutral
4: Satisfied
5: Very Satisfied
"""
parser = TextParser()
codebook = parser.parse(codebook_text)
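To show how the text format above maps onto a structure, here is a deliberately small, hypothetical parser written with the standard library only. It is not StatQA's `TextParser` and makes no claim about how that class works internally. The function name `parse_codebook` and the plain-dict output are assumptions made for illustration.

```python
def parse_codebook(text):
    """Parse '# Variable:' blocks into a dict of per-variable field dicts."""
    variables = {}
    current = None
    in_values = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue

        # A '# Variable: name' line starts a new variable block.
        if line.startswith("# Variable:"):
            name = line.split(":", 1)[1].strip()
            current = {"name": name}
            variables[name] = current
            in_values = False
            continue

        if current is None:
            continue  # ignore anything before the first variable

        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        key, value = key.strip(), value.strip()

        if key == "Values" and not value:
            # A bare 'Values:' line opens a list of 'code: label' entries.
            current["values"] = {}
            in_values = True
        elif in_values and key.isdigit():
            current["values"][int(key)] = value
        else:
            in_values = False
            current[key.lower()] = value

    return variables


sample = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Missing: -1, 999
# Variable: satisfaction
Label: Job Satisfaction
Values:
1: Very Dissatisfied
5: Very Satisfied
"""

for name, fields in parse_codebook(sample).items():
    print(name, fields)
```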
2. Run Statistical Analyses
import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer
# Load your data
data = pd.read_csv("survey_data.csv")
# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])
print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}
# Bivariate analysis
biv_analyzer = BivariateAnalyzer()
# Stored under a separate name so `result` keeps the univariate output used in later steps
biv_result = biv_analyzer.analyze(
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"],
)
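The codebook in step 1 declares -1 and 999 as missing codes for age. This guide does not show how the analyzers treat them, so the sketch below only illustrates why such codes have to be excluded before computing descriptive statistics. It uses the standard library's `statistics` module; the function name `describe` and its return keys are hypothetical.

```python
import statistics


def describe(values, missing_codes=()):
    """Summarize numeric values after dropping None and declared missing codes."""
    valid = [v for v in values if v is not None and v not in missing_codes]
    if not valid:
        raise ValueError("no valid observations")
    return {
        "n": len(valid),
        "n_missing": len(values) - len(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        # The sample standard deviation needs at least two observations.
        "std": statistics.stdev(valid) if len(valid) > 1 else None,
    }


ages = [25, 35, 45, -1, 999]
print(describe(ages))                            # missing codes counted as real ages
print(describe(ages, missing_codes={-1, 999}))   # missing codes excluded
```

Comparing the two printed summaries shows how a single unfiltered 999 distorts the mean and standard deviation.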
3. Generate Natural Language Insights
from statqa.interpretation import InsightFormatter
formatter = InsightFormatter()
insight = formatter.format_univariate(result)
print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."
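The idea of turning a statistics dictionary into a sentence like the one above can be sketched with an f-string. This is an illustration only, not `InsightFormatter`'s code. The function name `format_summary` and the dictionary keys (`min`, `max`, `n`, `outlier_pct`) are assumptions chosen to reproduce the shape of the sample output.

```python
def format_summary(label, stats):
    """Render a one-line summary from a dict of descriptive statistics."""
    return (
        f"**{label}**: mean={stats['mean']:.1f}, median={stats['median']:.1f}, "
        f"std={stats['std']:.1f}, range=[{stats['min']}, {stats['max']}]. "
        f"N={stats['n']:,} [{stats['outlier_pct']:.1f}% outliers]."
    )


example_stats = {
    "mean": 42.5,
    "median": 41.0,
    "std": 12.3,
    "min": 18,
    "max": 95,
    "n": 1000,
    "outlier_pct": 2.3,
}
print(format_summary("Respondent Age", example_stats))
```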
4. Create Multimodal Q/A Pairs for LLM Training
from statqa.qa import QAGenerator
from statqa.visualization import PlotFactory
qa_gen = QAGenerator(use_llm=False) # Template-based
# Generate Q/A pairs with visual metadata
plot_data = {
    "data": data,
    "variables": codebook.variables,
    "output_path": "plots/univariate_age.png",
}
visual_metadata = qa_gen.generate_visual_metadata(result, variables=["age"], plot_data=plot_data)
qa_pairs = qa_gen.generate_qa_pairs(result, insight, variables=["age"], visual_data=visual_metadata)
for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Plot: {qa['visual']['primary_plot']}")
    print(f"Caption: {qa['visual']['caption']}")
    print(f"Provenance: {qa['provenance']}\n")
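Template-based generation, as opposed to LLM paraphrasing, can be pictured as filling fixed question and answer patterns with computed statistics. The sketch below is a hypothetical illustration of that idea and of writing the pairs as JSON Lines for training. The `TEMPLATES` list and the `template_qa` function are invented for this example and do not reflect `QAGenerator`'s actual templates; only the `question` and `answer` keys are taken from the loop above.

```python
import json

# Hypothetical (question, answer) patterns; placeholders are filled from the stats.
TEMPLATES = [
    ("What is the mean of {label}?", "The mean of {label} is {mean:.1f}."),
    ("What is the median of {label}?", "The median of {label} is {median:.1f}."),
]


def template_qa(label, stats):
    """Fill each template with the variable label and its statistics."""
    return [
        {
            "question": question.format(label=label),
            "answer": answer.format(label=label, **stats),
        }
        for question, answer in TEMPLATES
    ]


pairs = template_qa("Respondent Age", {"mean": 42.5, "median": 41.0})

# One JSON object per line is a common layout for training data.
jsonl = "\n".join(json.dumps(pair) for pair in pairs)
print(jsonl)
```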
Next Steps
Tutorials: detailed walkthroughs
API Reference: complete API documentation
Examples: real-world examples