Getting Started
Installation
Basic Installation
pip install statqa
With Optional Features
# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
# Include LLM support (OpenAI/Anthropic)
pip install "statqa[llm]"
# Include PDF parsing
pip install "statqa[pdf]"
# Include statistical formats (SPSS/Stata/SAS)
pip install "statqa[statistical-formats]"
# Development installation
pip install "statqa[dev]"
# Complete installation
pip install "statqa[all]"
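To see which optional features are usable in the current environment, one option is to probe for the packages each extra is likely to pull in. The sketch below uses only the standard library. The module names mapped to each extra are assumptions made for illustration, not taken from StatQA's packaging metadata, so adjust them to the project's actual dependencies.

```python
from importlib.util import find_spec

# Hypothetical mapping from extra name to the modules it is assumed to install.
ASSUMED_EXTRA_MODULES = {
    "llm": ["openai", "anthropic"],
    "pdf": ["pypdf"],
    "statistical-formats": ["pyreadstat"],
}


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def check_extras(mapping=ASSUMED_EXTRA_MODULES):
    """Map each extra to True only if all of its assumed modules are found."""
    return {
        extra: all(module_available(m) for m in modules)
        for extra, modules in mapping.items()
    }


if __name__ == "__main__":
    for extra, ok in check_extras().items():
        print(f"{extra}: {'available' if ok else 'missing'}")
```

A missing entry here only means the assumed module was not found; it does not prove the extra itself was skipped.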
From Source
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv pip install -e ".[dev]"
Development Environment
For development, we recommend using uv for faster dependency management:
# Install uv first
pip install uv
# Clone and setup
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv sync --all-extras
Quick Start
1. Create a Codebook
from statqa.metadata.parsers import TextParser
codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  2: Dissatisfied
  3: Neutral
  4: Satisfied
  5: Very Satisfied
"""
parser = TextParser()
codebook = parser.parse(codebook_text)
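To make the text format above concrete, here is a small standard-library sketch that reads `# Variable:` blocks into plain dictionaries. It is an illustrative stand-in, not StatQA's `TextParser`: the function name `parse_codebook_text` and the returned dictionary layout are assumptions, and the real parser presumably returns richer objects.

```python
def parse_codebook_text(text: str) -> dict:
    """Parse '# Variable:' blocks into {name: {field: value}}.

    Illustrative only; not StatQA's TextParser.
    """
    variables = {}
    current = None     # fields of the variable currently being read
    in_values = False  # True while reading "code: label" lines under Values:
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# Variable:"):
            name = line.split(":", 1)[1].strip()
            current = variables.setdefault(name, {})
            in_values = False
            continue
        if current is None or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Values" and not value:
            current["Values"] = {}
            in_values = True
        elif in_values and key.lstrip("-").isdigit():
            # Numeric keys under Values: are category codes.
            current["Values"][int(key)] = value
        else:
            in_values = False
            current[key] = value
    return variables


sample = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Missing: -1, 999
# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
1: Very Dissatisfied
5: Very Satisfied
"""
parsed = parse_codebook_text(sample)
```

Each variable ends up as a dictionary of its fields, with the `Values` entries keyed by integer code.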
2. Run Statistical Analyses
import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer
# Load your data
data = pd.read_csv("survey_data.csv")
# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])
print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}
# Bivariate analysis
biv_analyzer = BivariateAnalyzer()
# Use a separate name so the univariate `result` stays available
biv_result = biv_analyzer.analyze(
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"],
)
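The codebook declares `-1` and `999` as missing codes for `age`. How `UnivariateAnalyzer` applies them is not shown here, so the following is only a sketch of the general idea: drop sentinel codes before computing summary statistics. The helper name `summarize` and its return keys are hypothetical, and the ages are made-up toy values.

```python
import statistics


def summarize(values, missing_codes=()):
    """Summary statistics after dropping None and sentinel missing codes."""
    missing = set(missing_codes)
    valid = [v for v in values if v is not None and v not in missing]
    if not valid:
        return {"n": 0}
    return {
        "n": len(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        # Sample standard deviation needs at least two observations.
        "std": statistics.stdev(valid) if len(valid) > 1 else 0.0,
        "min": min(valid),
        "max": max(valid),
    }


ages = [25, 31, 47, -1, 52, 999, 38]  # toy data with two sentinel codes
stats = summarize(ages, missing_codes=[-1, 999])
naive = summarize(ages)  # sentinels left in, for comparison
```

With the sentinels removed, the five valid ages have a mean of 38.6; leaving them in pushes the mean of this toy list above 170, which is why missing codes need to be handled before analysis.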
3. Generate Natural Language Insights
from statqa.interpretation import InsightFormatter
formatter = InsightFormatter()
insight = formatter.format_univariate(result)
print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."
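The output string above follows a simple template. As a rough sketch of how such a line could be assembled from summary statistics, the function below reproduces the same layout. It is not `InsightFormatter`'s implementation; the name `format_univariate_sketch` and the dictionary keys (`min`, `max`, `n`, `outlier_pct`) are assumptions chosen for this example.

```python
def format_univariate_sketch(label, stats):
    """Render summary statistics as a one-line Markdown insight."""
    text = (
        f"**{label}**: mean={stats['mean']:.1f}, median={stats['median']:.1f}, "
        f"std={stats['std']:.1f}, range=[{stats['min']}, {stats['max']}]. "
        f"N={stats['n']:,}"  # the ',' format spec adds thousands separators
    )
    outlier_pct = stats.get("outlier_pct")
    if outlier_pct:
        # Only mention outliers when a non-zero share was reported.
        text += f" [{outlier_pct:.1f}% outliers]"
    return text + "."


example = format_univariate_sketch(
    "Respondent Age",
    {"mean": 42.5, "median": 41.0, "std": 12.3, "min": 18, "max": 95,
     "n": 1000, "outlier_pct": 2.3},
)
```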
4. Create Multimodal Q/A Pairs for LLM Training
from statqa.qa import QAGenerator
from statqa.visualization import PlotFactory
qa_gen = QAGenerator(use_llm=False) # Template-based
# Generate Q/A pairs with visual metadata
plot_data = {
"data": data,
"variables": codebook.variables,
"output_path": "plots/univariate_age.png"
}
visual_metadata = qa_gen.generate_visual_metadata(result, variables=["age"], plot_data=plot_data)
qa_pairs = qa_gen.generate_qa_pairs(result, insight, variables=["age"], visual_data=visual_metadata)
for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Plot: {qa['visual']['primary_plot']}")
    print(f"Caption: {qa['visual']['caption']}")
    print(f"Provenance: {qa['provenance']}\n")
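Training pipelines often expect one JSON object per line (JSONL). Assuming each Q/A pair is a JSON-serializable dictionary with the keys used in the loop above, a minimal sketch for saving and reloading them could look like this. The helpers `write_jsonl` and `read_jsonl` are hypothetical, and the sample record is invented for illustration, including the shape of its `provenance` field.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(handle):
    """Parse every non-empty line back into a dictionary."""
    return [json.loads(line) for line in handle if line.strip()]


# Invented record that mirrors the keys printed in the loop above.
sample_pairs = [{
    "question": "What is the average respondent age?",
    "answer": "The mean age is 42.5 years.",
    "visual": {
        "primary_plot": "plots/univariate_age.png",
        "caption": "Distribution of respondent age",
    },
    "provenance": {"generator": "template"},  # placeholder structure
}]

buffer = io.StringIO()  # stands in for an open file
written = write_jsonl(sample_pairs, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
```

To write to disk instead, pass a handle from `open("qa_pairs.jsonl", "w", encoding="utf-8")` in place of the in-memory buffer.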
Command-Line Interface
StatQA provides a command-line interface for common workflows:
# Parse a codebook
statqa parse-codebook codebook.csv --output codebook.json --enrich
# Run full analysis pipeline with plots and visual metadata
statqa analyze data.csv codebook.json --output-dir results/ --plots --multimodal
# Generate multimodal Q/A pairs
statqa generate-qa results/all_insights.json --output qa_pairs.jsonl --llm --visual-metadata
# Complete multimodal pipeline
statqa pipeline data.csv codebook.csv --output-dir output/ --enrich --qa --plots --multimodal
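When the pipeline needs to run from a script or scheduler, the same command can be assembled in Python. The sketch below only uses the subcommand and flags shown above; the helper names `build_pipeline_command` and `run_pipeline` are hypothetical, and the command is executed only if a `statqa` executable is found on `PATH`.

```python
import shutil
import subprocess


def build_pipeline_command(data, codebook, output_dir,
                           enrich=True, qa=True, plots=True, multimodal=True):
    """Assemble the argv list for `statqa pipeline` from the flags shown above."""
    cmd = ["statqa", "pipeline", data, codebook, "--output-dir", output_dir]
    optional_flags = (
        ("--enrich", enrich),
        ("--qa", qa),
        ("--plots", plots),
        ("--multimodal", multimodal),
    )
    cmd.extend(flag for flag, enabled in optional_flags if enabled)
    return cmd


def run_pipeline(cmd):
    """Run the command if the executable is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True)


command = build_pipeline_command("data.csv", "codebook.csv", "output/")
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.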
Next Steps
User Guide - Comprehensive guides for all features
Examples - Real-world examples with datasets
API Reference - Complete API reference
Core Concepts - Core concepts and architecture