Metadata

Variable metadata parsing and enrichment functionality.

Schema

Core data structures for representing variables and codebooks.

Pydantic models for metadata representation.

This module defines the core data structures for variables and codebooks, providing type-safe, validated models with rich metadata support.

class statqa.metadata.schema.Codebook(**data)[source]

Bases: BaseModel

Represents a complete codebook/data dictionary.

name

Codebook name/identifier

description

Overall dataset description

variables

Mapping of variable names to Variable objects

dataset_info

General dataset metadata

citation

How to cite this dataset

version

Codebook version

last_updated

Last update date

add_variable(variable)[source]

Add a variable to the codebook.

Return type:

None

citation: str | None
dataset_info: dict[str, Any]
description: str | None
get_categorical_variables()[source]

Get all categorical variables.

Return type:

list[Variable]

get_numeric_variables()[source]

Get all numeric variables.

Return type:

list[Variable]

get_outcome_variables()[source]

Get all outcome variables.

Return type:

list[Variable]

get_temporal_variables()[source]

Get all temporal variables.

Return type:

list[Variable]

get_treatment_variables()[source]

Get all treatment variables.

Return type:

list[Variable]

get_variable(name)[source]

Get variable by name.

Return type:

Variable | None

last_updated: str | None
model_config: ClassVar[ConfigDict] = {'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str
variables: dict[str, Variable]
version: str | None
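
As a rough illustration of how a container with this interface behaves, the sketch below mirrors a few of the documented methods (add_variable, get_variable, get_numeric_variables, get_outcome_variables) using plain dataclasses. It is a stand-in, not statqa's pydantic implementation: `SketchVariable` and `SketchCodebook` are hypothetical names, and returning None for an unknown name is an assumption inferred only from the documented return type `Variable | None`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchVariable:
    """Hypothetical stand-in for Variable, reduced to the fields used here."""
    name: str
    label: str
    var_type: str = "unknown"
    is_outcome: bool = False


@dataclass
class SketchCodebook:
    """Hypothetical stand-in for Codebook: a named mapping of variables."""
    name: str
    variables: dict = field(default_factory=dict)

    def add_variable(self, variable: SketchVariable) -> None:
        # Variables are keyed by their own name.
        self.variables[variable.name] = variable

    def get_variable(self, name: str) -> Optional[SketchVariable]:
        # Assumption: unknown names yield None rather than raising.
        return self.variables.get(name)

    def get_numeric_variables(self) -> list:
        # Assumption: "numeric" means one of the two numeric VariableType values.
        numeric = {"numeric_continuous", "numeric_discrete"}
        return [v for v in self.variables.values() if v.var_type in numeric]

    def get_outcome_variables(self) -> list:
        return [v for v in self.variables.values() if v.is_outcome]


codebook = SketchCodebook(name="survey_2020")
codebook.add_variable(SketchVariable("age", "Respondent Age", "numeric_continuous"))
codebook.add_variable(SketchVariable("vote", "Voted", "boolean", is_outcome=True))
```

The other get_*_variables methods would follow the same filter-over-values pattern under these assumptions.
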
class statqa.metadata.schema.DataGeneratingProcess(*values)[source]

Bases: str, Enum

How the data was generated.

ADMINISTRATIVE = 'administrative'
EXPERIMENTAL = 'experimental'
OBSERVATIONAL = 'observational'
QUASI_EXPERIMENTAL = 'quasi_experimental'
SIMULATION = 'simulation'
SURVEY = 'survey'
UNKNOWN = 'unknown'
class statqa.metadata.schema.MissingPattern(*values)[source]

Bases: str, Enum

Pattern of missing data.

MAR = 'mar'
MCAR = 'mcar'
MNAR = 'mnar'
UNKNOWN = 'unknown'
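
Both enumerations above list `str` and `Enum` as bases. The sketch below re-declares the MissingPattern values locally (copied from the listing, not imported from statqa) to show what that choice means in plain Python: a member can be looked up from its raw string and compares equal to it.

```python
from enum import Enum


class MissingPattern(str, Enum):
    """Local copy of the documented values, for illustration only."""
    MAR = "mar"
    MCAR = "mcar"
    MNAR = "mnar"
    UNKNOWN = "unknown"


# Look a member up from the raw string stored in a file or a dict.
pattern = MissingPattern("mcar")

# Because the enum also subclasses str, members compare equal to their values.
is_same_member = pattern is MissingPattern.MCAR
equals_plain_string = pattern == "mcar"

# .value gives the plain string, which is convenient for JSON output.
as_text = pattern.value
```

A string that matches no member, such as `MissingPattern("random")`, raises `ValueError`.
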
class statqa.metadata.schema.Variable(**data)[source]

Bases: BaseModel

Represents a single variable/column in a dataset.

name

Variable identifier (e.g., 'VCF0101', 'age', 'income')

label

Human-readable label/description

var_type

Statistical type of the variable

dtype

Raw data type (from pandas/numpy)

description

Detailed description of what this variable measures

valid_values

Mapping of codes to descriptions (e.g., {1: "Male", 2: "Female"})

missing_values

Set of codes representing missing data (e.g., {-1, 999})

missing_pattern

Pattern of missingness

units

Measurement units (e.g., "years", "USD", "percentage")

range_min

Minimum valid value (for numeric)

range_max

Maximum valid value (for numeric)

is_ordinal

Whether categorical variable has meaningful order

dgp

Data generating process

is_treatment

Whether this is a treatment/intervention variable

is_outcome

Whether this is an outcome/dependent variable

is_confounder

Whether this is a potential confounder

temporal_variable

Name of associated time variable (if longitudinal)

notes

Additional metadata notes

source

Data source or survey question text

enriched_metadata

LLM-generated enrichment information

description: str | None
dgp: DataGeneratingProcess
dtype: str | None
enriched_metadata: dict[str, Any]
classmethod ensure_set(v)[source]

Ensure missing_values is a set.

Return type:

set[int | str]

get_cleaned_values()[source]

Get valid values excluding missing codes.

Return type:

dict[int | str, str]

is_categorical()[source]

Check if variable is categorical.

Return type:

bool

is_confounder: bool
is_numeric()[source]

Check if variable is numeric.

Return type:

bool

is_ordinal: bool
is_outcome: bool
is_temporal()[source]

Check if variable represents time.

Return type:

bool

is_treatment: bool
label: str
missing_pattern: MissingPattern
missing_values: set[int | str]
model_config: ClassVar[ConfigDict] = {'use_enum_values': True, 'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str
notes: str | None
range_max: float | None
range_min: float | None
source: str | None
temporal_variable: str | None
units: str | None
valid_values: dict[int | str, str]
var_type: VariableType
class statqa.metadata.schema.VariableType(*values)[source]

Bases: str, Enum

Statistical type of a variable.

BOOLEAN = 'boolean'
CATEGORICAL_NOMINAL = 'categorical_nominal'
CATEGORICAL_ORDINAL = 'categorical_ordinal'
DATETIME = 'datetime'
NUMERIC_CONTINUOUS = 'numeric_continuous'
NUMERIC_DISCRETE = 'numeric_discrete'
TEXT = 'text'
UNKNOWN = 'unknown'
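
The two value-handling helpers on Variable, ensure_set and get_cleaned_values, are documented only by one-line summaries. The sketch below shows one plausible reading of them as free functions: coerce whatever was supplied for missing_values into a set, then drop those codes from valid_values. The function names are hypothetical, and the handling of None and single codes is an assumption rather than confirmed library behavior.

```python
from typing import Iterable, Optional, Union

Code = Union[int, str]


def coerce_missing_values(raw: Optional[Union[Code, Iterable[Code]]]) -> set:
    """Normalize a missing-value spec to a set (assumed behavior)."""
    if raw is None:
        return set()
    if isinstance(raw, (int, str)):
        # A single code becomes a one-element set; a str is not split into characters.
        return {raw}
    return set(raw)


def cleaned_values(valid_values: dict, missing_values: set) -> dict:
    """Return the code-to-label mapping without codes flagged as missing."""
    return {
        code: label
        for code, label in valid_values.items()
        if code not in missing_values
    }


valid = {1: "Male", 2: "Female", -1: "Refused", 999: "Not asked"}
missing = coerce_missing_values([-1, 999])
cleaned = cleaned_values(valid, missing)
```

Here `cleaned` keeps only codes 1 and 2, because -1 and 999 are flagged as missing.
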

Parsers

Base Parser

Base parser interface for codebook parsing.

Defines the abstract interface that all codebook parsers must implement.

class statqa.metadata.parsers.base.BaseParser(**kwargs)[source]

Bases: ABC

Abstract base class for codebook parsers.

Parameters:

**kwargs (Any) – Parser-specific configuration options

abstractmethod parse(source)[source]

Parse a codebook from the given source.

Parameters:

source (str | Path) – Path to codebook file or string content

Return type:

Codebook

Returns:

Parsed Codebook object

Raises:
parse_file(file_path)[source]

Convenience method to parse from file path.

Parameters:

file_path (str | Path) – Path to codebook file

Return type:

Codebook

Returns:

Parsed Codebook object

parse_string(content)[source]

Convenience method to parse from string content.

Parameters:

content (str) – Codebook content as string

Return type:

Codebook

Returns:

Parsed Codebook object

abstractmethod validate(source)[source]

Check if this parser can handle the given source.

Parameters:

source (str | Path) – Path to codebook file or string content

Return type:

bool

Returns:

True if parser can handle this source
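
The interface above can be sketched with the standard `abc` module. This is a stand-in, not the library's code: `SketchParser` and `KeyValueParser` are hypothetical names, a plain dict stands in for Codebook, and the way parse_file and parse_string forward to parse is an assumption.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union


class SketchParser(ABC):
    """Hypothetical stand-in for BaseParser; a dict stands in for Codebook."""

    def __init__(self, **kwargs: Any) -> None:
        self.options = kwargs

    @abstractmethod
    def parse(self, source: Union[str, Path]) -> dict:
        """Parse a codebook from a path or from string content."""

    @abstractmethod
    def validate(self, source: Union[str, Path]) -> bool:
        """Report whether this parser can handle the source."""

    def parse_file(self, file_path: Union[str, Path]) -> dict:
        # Assumption: the convenience method simply forwards a Path to parse().
        return self.parse(Path(file_path))

    def parse_string(self, content: str) -> dict:
        # Assumption: string content is forwarded to parse() unchanged.
        return self.parse(content)


class KeyValueParser(SketchParser):
    """Toy subclass: treats 'name=label' lines as variables."""

    def validate(self, source: Union[str, Path]) -> bool:
        return isinstance(source, str) and "=" in source

    def parse(self, source: Union[str, Path]) -> dict:
        text = source.read_text() if isinstance(source, Path) else source
        pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
        return {name.strip(): label.strip() for name, label in pairs}


parser = KeyValueParser(strict=True)
result = parser.parse_string("age = Respondent Age\ngender = Gender")
```

Because parse and validate are abstract, instantiating `SketchParser` directly raises `TypeError`; only subclasses that implement both can be created.
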

CSV Parser

CSV-based codebook parser.

Parses codebooks stored in CSV format with columns such as:

  • variable_name
  • label
  • type
  • description
  • valid_values
  • missing_values
  • units

class statqa.metadata.parsers.csv.CSVParser(**kwargs)[source]

Bases: BaseParser

Parser for CSV codebooks.

parse(source)[source]

Parse CSV codebook.

Return type:

Codebook

validate(source)[source]

Check if source is valid CSV.

Return type:

bool
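
To illustrate the column layout described for CSV codebooks, the sketch below reads a small sample with the standard `csv` module. It is not CSVParser's implementation. The sample rows are invented, and the in-cell encodings (`1=Male;2=Female` for valid_values, `-1;999` for missing_values) are assumptions, since the documentation does not specify them.

```python
import csv
import io

SAMPLE = """variable_name,label,type,units,valid_values,missing_values
age,Respondent Age,numeric_continuous,years,,-1;999
gender,Gender,categorical_nominal,,1=Male;2=Female;3=Other,0
"""


def split_codes(cell: str) -> set:
    """Split a ';'-separated cell into a set of integer codes (assumed encoding)."""
    return {int(part) for part in cell.split(";") if part.strip()}


def split_value_labels(cell: str) -> dict:
    """Split 'code=label' pairs separated by ';' (assumed encoding)."""
    pairs = (part.split("=", 1) for part in cell.split(";") if "=" in part)
    return {int(code): label.strip() for code, label in pairs}


def read_csv_codebook(text: str) -> dict:
    """Return a mapping of variable name to a plain dict of its metadata."""
    variables = {}
    for row in csv.DictReader(io.StringIO(text)):
        variables[row["variable_name"]] = {
            "label": row["label"],
            "type": row["type"],
            "units": row["units"] or None,
            "valid_values": split_value_labels(row["valid_values"]),
            "missing_values": split_codes(row["missing_values"]),
        }
    return variables


variables = read_csv_codebook(SAMPLE)
```

Empty cells map to None or empty containers here, which keeps optional columns optional.
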

Text Parser

Text-based codebook parser.

Parses structured text codebooks with variable definitions. Supports formats like:

```
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
Description: Age of respondent at time of survey

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
    1: Male
    2: Female
    3: Other
Missing: 0
```

class statqa.metadata.parsers.text.TextParser(**kwargs)[source]

Bases: BaseParser

Parser for structured text codebooks.

parse(source)[source]

Parse text codebook.

Return type:

Codebook

validate(source)[source]

Check if source is valid text format.

Return type:

bool
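
A minimal sketch of how a "# Variable:" record format like the one described for this module could be read line by line. It is not TextParser's implementation: the function name is hypothetical, the sample text is embedded in the code, and the rule that numeric keys following a bare `Values:` line are value labels is an assumption about the layout.

```python
SAMPLE = """\
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
    1: Male
    2: Female
    3: Other
Missing: 0
"""


def parse_text_codebook(text: str) -> dict:
    """Parse '# Variable:' records into a dict of per-variable field dicts."""
    variables = {}
    current = None
    in_values = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key.lower() == "# variable":
            current = {"values": {}}
            variables[value] = current
            in_values = False
        elif current is None:
            continue  # ignore anything before the first record
        elif key == "Values" and not value:
            in_values = True
        elif in_values and key.lstrip("-").isdigit():
            current["values"][int(key)] = value
        else:
            in_values = False
            current[key.lower()] = value
    return variables


parsed = parse_text_codebook(SAMPLE)
```

Fields such as `Range` and `Missing` are kept as raw strings in this sketch; converting them to numbers or sets would be a separate step.
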

Statistical Formats Parser

Statistical format parser for SPSS, Stata, and SAS files.

Uses pyreadstat library to parse statistical data files and extract rich metadata including variable labels, value labels, and missing value definitions.

class statqa.metadata.parsers.statistical.StatisticalFormatParser(**kwargs)[source]

Bases: BaseParser

Parser for statistical data files (SPSS, Stata, SAS).

parse(source)[source]

Parse statistical format file.

Return type:

Codebook

validate(source)[source]

Check if source is a supported statistical format.

Return type:

bool
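
One simple way a validate step for statistical files could work is a file-extension check, sketched below. This is an assumption, not the documented behavior of StatisticalFormatParser: the function names and the extension table are hypothetical, and the real parser may inspect file contents or accept other extensions.

```python
from pathlib import Path
from typing import Optional, Union

# Assumed extension-to-format table; the real parser may accept more or fewer.
FORMATS = {
    ".sav": "spss",
    ".dta": "stata",
    ".sas7bdat": "sas",
}


def detect_format(source: Union[str, Path]) -> Optional[str]:
    """Return the format name for a recognized extension, else None."""
    return FORMATS.get(Path(source).suffix.lower())


def looks_supported(source: Union[str, Path]) -> bool:
    """Report whether the source has one of the assumed extensions."""
    return detect_format(source) is not None
```

The detected format name could then select the matching pyreadstat reader, such as `read_sav`, `read_dta`, or `read_sas7bdat`.
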

Enricher

LLM-powered metadata enhancement.

LLM-based metadata enrichment.

Uses language models to verify, infer, and enrich variable metadata, including:

  • Type inference and validation
  • Relationship suggestions
  • Causal structure hints
  • Missing pattern detection
  • Variable importance ranking
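
To make the enrichment flow concrete, the sketch below builds a prompt from a variable's known metadata, passes it to a model callable, and stores the parsed reply under enriched_metadata. Everything here is illustrative: the function names, prompt wording, and JSON keys are hypothetical, a plain dict stands in for Variable, and `fake_llm` is a stub that makes no network request.

```python
import json
from typing import Callable, Optional


def build_prompt(variable: dict, dataset_context: Optional[str] = None) -> str:
    """Assemble a prompt from known metadata (wording is illustrative)."""
    lines = [
        "Suggest metadata for this variable as a JSON object.",
        f"name: {variable['name']}",
        f"label: {variable['label']}",
    ]
    if dataset_context:
        lines.append(f"dataset: {dataset_context}")
    return "\n".join(lines)


def enrich(
    variable: dict,
    llm: Callable[[str], str],
    dataset_context: Optional[str] = None,
) -> dict:
    """Return a copy of the variable with the model's suggestions attached."""
    reply = llm(build_prompt(variable, dataset_context))
    try:
        suggestions = json.loads(reply)
    except json.JSONDecodeError:
        suggestions = {}  # keep the variable usable if the reply is not JSON
    enriched = dict(variable)
    enriched["enriched_metadata"] = suggestions
    return enriched


def fake_llm(prompt: str) -> str:
    """Stub standing in for an OpenAI or Anthropic call."""
    return json.dumps({"suggested_type": "numeric_continuous", "likely_confounder": True})


age = {"name": "age", "label": "Respondent Age"}
enriched_age = enrich(age, fake_llm, dataset_context="election survey")
```

Passing the model as a callable keeps the sketch provider-agnostic and testable offline; a real client call would replace `fake_llm`.
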

class statqa.metadata.enricher.MetadataEnricher(provider='openai', model=None, api_key=None, **kwargs)[source]

Bases: object

Enrich metadata using LLM capabilities.

Supports both OpenAI and Anthropic models.

Parameters:
  • provider (Literal['openai', 'anthropic']) – LLM provider ('openai' or 'anthropic')

  • model (str | None) – Model name (defaults to gpt-4 or claude-3-sonnet, depending on the provider)

  • api_key (str | None) – API key (or use environment variable)

  • **kwargs (Any) – Additional provider-specific options

Raises:
enrich_codebook(codebook)[source]

Enrich entire codebook metadata.

Parameters:

codebook (Codebook) – Codebook to enrich

Return type:

Codebook

Returns:

Enriched Codebook

enrich_variable(variable, dataset_context=None)[source]

Enrich a single variable’s metadata.

Parameters:
  • variable (Variable) – Variable to enrich

  • dataset_context (str | None) – Optional context about the dataset

Return type:

Variable

Returns:

Enriched Variable with updated metadata

Raises: