Getting Started

Installation

Basic Installation

pip install statqa

With Optional Features

# Quote the extras so shells such as zsh do not expand the brackets

# Include LLM support (OpenAI/Anthropic)
pip install "statqa[llm]"

# Include PDF parsing
pip install "statqa[pdf]"

# Include statistical formats (SPSS/Stata/SAS)
pip install "statqa[statistical-formats]"

# Development installation
pip install "statqa[dev]"

# Complete installation
pip install "statqa[all]"

From Source

# Requires uv; install it first with: pip install uv
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv pip install -e ".[dev]"

Development Environment

For development, we recommend using uv for faster dependency management:

# Install uv first
pip install uv

# Clone and setup
git clone https://github.com/gojiplus/statqa.git
cd statqa
uv sync --all-extras
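
After any of the installation routes above, it can help to confirm that the package is visible to the interpreter you plan to use. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is hypothetical and is not part of StatQA.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical check, run in the same environment you installed into
found = installed_version("statqa")
if found:
    print(f"statqa {found} is installed")
else:
    print("statqa is not installed in this environment")
```

If the second message appears, the most likely cause is that pip or uv installed into a different virtual environment than the one running the script.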

Quick Start

1. Create a Codebook

from statqa.metadata.parsers import TextParser

codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999

# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  2: Dissatisfied
  3: Neutral
  4: Satisfied
  5: Very Satisfied
"""

parser = TextParser()
codebook = parser.parse(codebook_text)
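
The Range and Missing fields in the codebook describe which raw values count as real observations. As an illustration only, the sketch below applies the age entry by hand in plain Python. It is not how StatQA processes the codebook internally, the names MISSING_CODES, VALID_RANGE, and clean_ages are hypothetical, and treating out-of-range values as missing is an assumption made for this example.

```python
from typing import Iterable, List, Optional

MISSING_CODES = {-1, 999}  # from "Missing: -1, 999"
VALID_RANGE = (18, 99)     # from "Range: 18-99"


def clean_ages(raw: Iterable[float]) -> List[Optional[float]]:
    """Replace missing codes and out-of-range values with None."""
    low, high = VALID_RANGE
    cleaned: List[Optional[float]] = []
    for value in raw:
        if value in MISSING_CODES or not (low <= value <= high):
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned


# Made-up raw values: -1 and 999 are missing codes, 17 is below the range
print(clean_ages([25, -1, 41, 999, 17, 63]))
# → [25, None, 41, None, None, 63]
```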

2. Run Statistical Analyses

import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer

# Load your data
data = pd.read_csv("survey_data.csv")

# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])

print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}

# Bivariate analysis (stored separately so `result` still holds the univariate output)
biv_analyzer = BivariateAnalyzer()
biv_result = biv_analyzer.analyze(
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"]
)
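
To make the summary keys in the example output (mean, median, std) concrete, here is a small standard-library sketch that computes the same kinds of statistics by hand. The sample values are made up, and the use of the sample standard deviation is an assumption, since this page does not say which variant StatQA reports.

```python
import statistics

# Made-up ages, assumed to be already cleaned of missing codes
ages = [23, 35, 41, 41, 52, 60]

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "std": statistics.stdev(ages),  # sample standard deviation (n - 1 denominator)
    "n": len(ages),
}

print(summary)
```

Comparing a hand computation like this against the analyzer output on a small column is a quick way to check that missing codes were excluded as expected.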

3. Generate Natural Language Insights

from statqa.interpretation import InsightFormatter

formatter = InsightFormatter()
insight = formatter.format_univariate(result)

print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."

4. Create Multimodal Q/A Pairs for LLM Training

from statqa.qa import QAGenerator
from statqa.visualization import PlotFactory

qa_gen = QAGenerator(use_llm=False)  # Template-based

# Generate Q/A pairs with visual metadata
plot_data = {
    "data": data,
    "variables": codebook.variables,
    "output_path": "plots/univariate_age.png"
}
visual_metadata = qa_gen.generate_visual_metadata(result, variables=["age"], plot_data=plot_data)
qa_pairs = qa_gen.generate_qa_pairs(result, insight, variables=["age"], visual_data=visual_metadata)

for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Plot: {qa['visual']['primary_plot']}")
    print(f"Caption: {qa['visual']['caption']}")
    print(f"Provenance: {qa['provenance']}\n")

Command-Line Interface

StatQA provides a CLI for common workflows:

# Parse a codebook
statqa parse-codebook codebook.csv --output codebook.json --enrich

# Run full analysis pipeline with plots and visual metadata
statqa analyze data.csv codebook.json --output-dir results/ --plots --multimodal

# Generate multimodal Q/A pairs
statqa generate-qa results/all_insights.json --output qa_pairs.jsonl --llm --visual-metadata

# Complete multimodal pipeline
statqa pipeline data.csv codebook.csv --output-dir output/ --enrich --qa --plots --multimodal
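
The generate-qa command above writes qa_pairs.jsonl. Assuming that file follows the usual JSON Lines layout (one JSON object per line) and that its records carry the same question and answer keys as the Python example earlier, it could be read back with a sketch like the following. Both points are assumptions; the helper parse_jsonl is hypothetical and not part of StatQA.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


qa_path = Path("qa_pairs.jsonl")
if qa_path.exists():
    with qa_path.open(encoding="utf-8") as handle:
        for record in parse_jsonl(handle):
            # .get() avoids a KeyError if the file uses different keys
            print("Q:", record.get("question"))
            print("A:", record.get("answer"))
```

Reading line by line keeps memory use flat even when the file holds many Q/A pairs.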

Next Steps

  • User Guide - Comprehensive guides for all features

  • Examples - Real-world examples with datasets

  • API Reference - Complete API reference

  • Core Concepts - Core concepts and architecture