Metadata¶

Variable metadata parsing and enrichment functionality.

Schema¶

Core data structures for representing variables and codebooks.

Pydantic models for metadata representation.

This module defines the core data structures for variables and codebooks, providing type-safe, validated models with rich metadata support.

class statqa.metadata.schema.Codebook(**data)[source]¶

Bases: BaseModel

Represents a complete codebook/data dictionary.

name¶: Codebook name/identifier

description¶: Overall dataset description

variables¶: Mapping of variable names to Variable objects

dataset_info¶: General dataset metadata

citation¶: How to cite this dataset

version¶: Codebook version

last_updated¶: Last update date

add_variable(variable)[source]¶

Add a variable to the codebook.

Return type: None

citation: str | None¶

dataset_info: dict[str, Any]¶

description: str | None¶

get_categorical_variables()[source]¶

Get all categorical variables.

Return type: list[Variable]

get_numeric_variables()[source]¶

Get all numeric variables.

Return type: list[Variable]

get_outcome_variables()[source]¶

Get all outcome variables.

Return type: list[Variable]

get_temporal_variables()[source]¶

Get all temporal variables.

Return type: list[Variable]

get_treatment_variables()[source]¶

Get all treatment variables.

Return type: list[Variable]

get_variable(name)[source]¶

Get variable by name.

Return type: Variable | None

last_updated: str | None¶

model_config: ClassVar[ConfigDict] = {'validate_assignment': True}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str¶

variables: dict[str, Variable]¶

version: str | None¶

class statqa.metadata.schema.DataGeneratingProcess(*values)[source]¶

Bases: str, Enum

How the data was generated.

ADMINISTRATIVE = 'administrative'¶

EXPERIMENTAL = 'experimental'¶

OBSERVATIONAL = 'observational'¶

QUASI_EXPERIMENTAL = 'quasi_experimental'¶

SIMULATION = 'simulation'¶

SURVEY = 'survey'¶

UNKNOWN = 'unknown'¶
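
Because the enum mixes in `str`, its members should compare equal to their plain string values and can be looked up by value. The sketch below illustrates this with the standard library only. The enum is re-declared locally as a stand-in so the snippet runs without statqa installed, and `coerce_dgp` is a hypothetical helper, not part of the documented API.

```python
from enum import Enum


class DataGeneratingProcess(str, Enum):
    """Local stand-in mirroring the documented members (not imported from statqa)."""

    ADMINISTRATIVE = "administrative"
    EXPERIMENTAL = "experimental"
    OBSERVATIONAL = "observational"
    QUASI_EXPERIMENTAL = "quasi_experimental"
    SIMULATION = "simulation"
    SURVEY = "survey"
    UNKNOWN = "unknown"


def coerce_dgp(raw: str) -> DataGeneratingProcess:
    """Hypothetical helper: map free-form input to a member, else UNKNOWN."""
    try:
        # Lookup by value raises ValueError for unrecognised strings.
        return DataGeneratingProcess(raw.strip().lower())
    except ValueError:
        return DataGeneratingProcess.UNKNOWN


# str-mixin members compare equal to their underlying string value.
print(DataGeneratingProcess.SURVEY == "survey")
print(coerce_dgp("  Experimental "))
```

Falling back to `UNKNOWN` rather than raising is a design choice of this sketch; the same pattern would apply to `MissingPattern` and `VariableType`, which also define an `UNKNOWN` member.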

class statqa.metadata.schema.MissingPattern(*values)[source]¶

Bases: str, Enum

Pattern of missing data.

MAR = 'mar'¶

MCAR = 'mcar'¶

MNAR = 'mnar'¶

UNKNOWN = 'unknown'¶

class statqa.metadata.schema.Variable(**data)[source]¶

Bases: BaseModel

Represents a single variable/column in a dataset.

name¶: Variable identifier (e.g., 'VCF0101', 'age', 'income')

label¶: Human-readable label/description

var_type¶: Statistical type of the variable

dtype¶: Raw data type (from pandas/numpy)

description¶: Detailed description of what this variable measures

valid_values¶: Mapping of codes to descriptions (e.g., {1: "Male", 2: "Female"})

missing_values¶: Set of codes representing missing data (e.g., {-1, 999})

missing_pattern¶: Pattern of missingness

units¶: Measurement units (e.g., "years", "USD", "percentage")

range_min¶: Minimum valid value (for numeric)

range_max¶: Maximum valid value (for numeric)

is_ordinal¶: Whether categorical variable has meaningful order

dgp¶: Data generating process

is_treatment¶: Whether this is a treatment/intervention variable

is_outcome¶: Whether this is an outcome/dependent variable

is_confounder¶: Whether this is a potential confounder

temporal_variable¶: Name of associated time variable (if longitudinal)

notes¶: Additional metadata notes

source¶: Data source or survey question text

enriched_metadata¶: LLM-generated enrichment information

description: str | None¶

dgp: DataGeneratingProcess¶

dtype: str | None¶

enriched_metadata: dict[str, Any]¶

classmethod ensure_set(v)[source]¶

Ensure missing_values is a set.

Return type: set[int | str]

get_cleaned_values()[source]¶

Get valid values excluding missing codes.

Return type: dict[int | str, str]
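
One plausible reading of "valid values excluding missing codes" is a filter of the `valid_values` mapping against the `missing_values` set. The standalone function below is a minimal sketch under that assumption. It is not taken from the statqa source, and the name `cleaned_values` is illustrative.

```python
from typing import Dict, Set, Union

Code = Union[int, str]


def cleaned_values(valid_values: Dict[Code, str], missing_values: Set[Code]) -> Dict[Code, str]:
    """Return the code-to-label mapping without codes flagged as missing."""
    return {
        code: label
        for code, label in valid_values.items()
        if code not in missing_values
    }


valid = {1: "Male", 2: "Female", -1: "Refused", 999: "Not asked"}
missing = {-1, 999}
print(cleaned_values(valid, missing))  # {1: 'Male', 2: 'Female'}
```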

is_categorical()[source]¶

Check if variable is categorical.

Return type: bool

is_confounder: bool¶

is_numeric()[source]¶

Check if variable is numeric.

Return type: bool

is_ordinal: bool¶

is_outcome: bool¶

is_temporal()[source]¶

Check if variable represents time.

Return type: bool

is_treatment: bool¶

label: str¶

missing_pattern: MissingPattern¶

missing_values: set[int | str]¶

model_config: ClassVar[ConfigDict] = {'use_enum_values': True, 'validate_assignment': True}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str¶

notes: str | None¶

range_max: float | None¶

range_min: float | None¶

source: str | None¶

temporal_variable: str | None¶

units: str | None¶

valid_values: dict[int | str, str]¶

var_type: VariableType¶

class statqa.metadata.schema.VariableType(*values)[source]¶

Bases: str, Enum

Statistical type of a variable.

BOOLEAN = 'boolean'¶

CATEGORICAL_NOMINAL = 'categorical_nominal'¶

CATEGORICAL_ORDINAL = 'categorical_ordinal'¶

DATETIME = 'datetime'¶

NUMERIC_CONTINUOUS = 'numeric_continuous'¶

NUMERIC_DISCRETE = 'numeric_discrete'¶

TEXT = 'text'¶

UNKNOWN = 'unknown'¶

Parsers¶

Base Parser¶

Base parser interface for codebook parsing.

Defines the abstract interface that all codebook parsers must implement.

class statqa.metadata.parsers.base.BaseParser(**kwargs)[source]¶

Bases: ABC

Abstract base class for codebook parsers.

Parameters: **kwargs (Any) – Parser-specific configuration options

abstractmethod parse(source)[source]¶

Parse a codebook from the given source.

Parameters:

source (str | Path) – Path to codebook file or string content

Return type:

Codebook

Returns:

Parsed Codebook object

Raises:

ValueError – If source format is invalid
FileNotFoundError – If source file doesn’t exist

parse_file(file_path)[source]¶

Convenience method to parse from file path.

Parameters: file_path (str | Path) – Path to codebook file
Return type: Codebook
Returns: Parsed Codebook object

parse_string(content)[source]¶

Convenience method to parse from string content.

Parameters: content (str) – Codebook content as string
Return type: Codebook
Returns: Parsed Codebook object

abstractmethod validate(source)[source]¶

Check if this parser can handle the given source.

Parameters: source (str | Path) – Path to codebook file or string content
Return type: bool
Returns: True if parser can handle this source
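
The interface above follows the usual abstract-base-class pattern: subclasses supply `parse` and `validate`, and the base class cannot be instantiated directly. The sketch below shows that pattern with the standard library only. `SketchBaseParser` is a local stand-in rather than the statqa class, `KeyValueParser` and its `name=label` format are hypothetical, and `parse` returns a plain dict here instead of a `Codebook`.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union


class SketchBaseParser(ABC):
    """Stand-in for the documented interface; not the statqa class itself."""

    def __init__(self, **kwargs: Any) -> None:
        self.options = kwargs  # parser-specific configuration options

    @abstractmethod
    def parse(self, source: Union[str, Path]) -> dict:
        """Parse a codebook from the given source."""

    @abstractmethod
    def validate(self, source: Union[str, Path]) -> bool:
        """Return True if this parser can handle the source."""


class KeyValueParser(SketchBaseParser):
    """Hypothetical parser for 'name=label' content, one variable per line."""

    def validate(self, source: Union[str, Path]) -> bool:
        lines = [ln for ln in str(source).splitlines() if ln.strip()]
        return bool(lines) and all("=" in ln for ln in lines)

    def parse(self, source: Union[str, Path]) -> dict:
        if not self.validate(source):
            raise ValueError("source is not in name=label format")
        variables = {}
        for line in str(source).splitlines():
            if line.strip():
                name, label = line.split("=", 1)
                variables[name.strip()] = label.strip()
        return {"name": self.options.get("name", "untitled"), "variables": variables}


parser = KeyValueParser(name="demo")
print(parser.parse("age=Respondent Age\nsex = Gender"))
```

Calling `validate` before `parse` lets a caller try several parsers in turn and use the first one that accepts the source.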

CSV Parser¶

CSV-based codebook parser.

Parses codebooks stored in CSV format with columns like:

- variable_name
- label
- type
- description
- valid_values
- missing_values
- units
- etc.
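
To make the column layout concrete, the sketch below reads a small CSV codebook with the standard library's `csv.DictReader`. It does not use `CSVParser` itself. The sample content, the `read_codebook_rows` helper, and the use of `;` to separate multiple missing codes within a cell are all assumptions made for illustration; the source does not specify how multi-valued cells are encoded.

```python
import csv
import io

# Illustrative codebook; the ';' separator inside missing_values is an assumption.
SAMPLE = """variable_name,label,type,units,missing_values
age,Respondent Age,numeric_continuous,years,-1;999
gender,Gender,categorical_nominal,,0
"""


def read_codebook_rows(text: str) -> dict:
    """Hypothetical helper: map each variable_name to a dict of its metadata."""
    reader = csv.DictReader(io.StringIO(text))
    variables = {}
    for row in reader:
        raw_missing = row.get("missing_values") or ""
        variables[row["variable_name"]] = {
            "label": row["label"],
            "type": row["type"],
            "units": row["units"] or None,  # empty cell becomes None
            "missing_values": {
                int(tok) for tok in raw_missing.split(";") if tok.strip()
            },
        }
    return variables


rows = read_codebook_rows(SAMPLE)
print(rows["age"]["missing_values"] == {-1, 999})  # True
```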

class statqa.metadata.parsers.csv.CSVParser(**kwargs)[source]¶

Bases: BaseParser

Parser for CSV codebooks.

parse(source)[source]¶

Parse CSV codebook.

Return type: Codebook

validate(source)[source]¶

Check if source is valid CSV.

Return type: bool

Text Parser¶

Text-based codebook parser.

Parses structured text codebooks with variable definitions. Supports formats like:

```
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
Description: Age of respondent at time of survey

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
  1: Male
  2: Female
  3: Other
Missing: 0
```
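
A format like the one above can be read line by line: a `# Variable:` header opens a new entry, `Key: value` lines set fields, and the lines after a bare `Values:` are treated as code-to-label pairs. The function below is a minimal sketch of that idea and not the `TextParser` implementation. It assumes integer codes, leaves `Range` as an unparsed string, and returns plain dicts; `parse_text_codebook` and `SAMPLE_TEXT` are illustrative names.

```python
SAMPLE_TEXT = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
Description: Age of respondent at time of survey

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
  1: Male
  2: Female
  3: Other
Missing: 0
"""


def parse_text_codebook(text: str) -> dict:
    """Hypothetical helper: parse the structured text format into nested dicts."""
    variables = {}
    current = None
    in_values = False  # True while reading lines under a bare 'Values:'
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# Variable:"):
            current = {"valid_values": {}}
            variables[line.split(":", 1)[1].strip()] = current
            in_values = False
            continue
        key, sep, value = line.partition(":")
        if current is None or not sep:
            continue  # skip text before the first header or lines without a key
        key, value = key.strip(), value.strip()
        if key == "Values" and not value:
            in_values = True
        elif in_values and key.lstrip("-").isdigit():
            current["valid_values"][int(key)] = value
        else:
            in_values = False
            if key == "Missing":
                current["missing_values"] = {
                    int(tok) for tok in value.split(",") if tok.strip()
                }
            else:
                current[key.lower()] = value
    return variables


parsed = parse_text_codebook(SAMPLE_TEXT)
print(parsed["gender"]["valid_values"])  # {1: 'Male', 2: 'Female', 3: 'Other'}
```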

class statqa.metadata.parsers.text.TextParser(**kwargs)[source]¶

Bases: BaseParser

Parser for structured text codebooks.

parse(source)[source]¶

Parse text codebook.

Return type: Codebook

validate(source)[source]¶

Check if source is valid text format.

Return type: bool

Statistical Formats Parser¶

Statistical format parser for SPSS, Stata, and SAS files.

Uses pyreadstat library to parse statistical data files and extract rich metadata including variable labels, value labels, and missing value definitions.

class statqa.metadata.parsers.statistical.StatisticalFormatParser(**kwargs)[source]¶

Bases: BaseParser

Parser for statistical data files (SPSS, Stata, SAS).

parse(source)[source]¶

Parse statistical format file.

Return type: Codebook

validate(source)[source]¶

Check if source is a supported statistical format.

Return type: bool

Enricher¶

LLM-powered metadata enrichment.

Uses language models to verify, infer, and enrich variable metadata including:

- Type inference and validation
- Relationship suggestions
- Causal structure hints
- Missing pattern detection
- Variable importance ranking

class statqa.metadata.enricher.MetadataEnricher(provider='openai', model=None, api_key=None, **kwargs)[source]¶

Bases: object

Enrich metadata using LLM capabilities.

Supports both OpenAI and Anthropic models.

Parameters:

provider (Literal['openai', 'anthropic']) – LLM provider ('openai' or 'anthropic')
model (str | None) – Model name (defaults to gpt-4 for OpenAI or claude-3-sonnet for Anthropic)
api_key (str | None) – API key (or use environment variable)
**kwargs (Any) – Additional provider-specific options

Raises:

ImportError – If required LLM package not installed
ValueError – If provider is not supported

enrich_codebook(codebook)[source]¶

Enrich entire codebook metadata.

Parameters: codebook (Codebook) – Codebook to enrich
Return type: Codebook
Returns: Enriched Codebook

enrich_variable(variable, dataset_context=None)[source]¶

Enrich a single variable’s metadata.

Parameters:

variable (Variable) – Variable to enrich
dataset_context (str | None) – Optional context about the dataset

Return type:

Variable

Returns:

Enriched Variable with updated metadata

Raises:

EnrichmentError – If enrichment process fails
LLMConnectionError – If LLM connection fails
LLMResponseError – If LLM response is invalid