Metadata¶
Variable metadata parsing and enrichment functionality.
Schema¶
Core data structures for representing variables and codebooks.
Pydantic models for metadata representation.
This module defines the core data structures for variables and codebooks, providing type-safe, validated models with rich metadata support.
- class statqa.metadata.schema.Codebook(**data)[source]¶
Bases: BaseModel
Represents a complete codebook/data dictionary.
- name¶
Codebook name/identifier
- description¶
Overall dataset description
- variables¶
Mapping of variable names to Variable objects
- dataset_info¶
General dataset metadata
- citation¶
How to cite this dataset
- version¶
Codebook version
- last_updated¶
Last update date
- citation: str | None¶
- dataset_info: dict[str, Any]¶
- description: str | None¶
- last_updated: str | None¶
- model_config: ClassVar[ConfigDict] = {'validate_assignment': True}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str¶
- variables: dict[str, Variable]¶
- version: str | None¶
- class statqa.metadata.schema.DataGeneratingProcess(*values)[source]¶
How the data was generated.
- ADMINISTRATIVE = 'administrative'¶
- EXPERIMENTAL = 'experimental'¶
- OBSERVATIONAL = 'observational'¶
- QUASI_EXPERIMENTAL = 'quasi_experimental'¶
- SIMULATION = 'simulation'¶
- SURVEY = 'survey'¶
- UNKNOWN = 'unknown'¶
- class statqa.metadata.schema.MissingPattern(*values)[source]¶
Pattern of missing data.
- MAR = 'mar'¶
- MCAR = 'mcar'¶
- MNAR = 'mnar'¶
- UNKNOWN = 'unknown'¶
- class statqa.metadata.schema.Variable(**data)[source]¶
Bases: BaseModel
Represents a single variable/column in a dataset.
- name¶
Variable identifier (e.g., ‘VCF0101’, ‘age’, ‘income’)
- label¶
Human-readable label/description
- var_type¶
Statistical type of the variable
- dtype¶
Raw data type (from pandas/numpy)
- description¶
Detailed description of what this variable measures
- valid_values¶
Mapping of codes to descriptions (e.g., {1: “Male”, 2: “Female”})
- missing_values¶
Set of codes representing missing data (e.g., {-1, 999})
- missing_pattern¶
Pattern of missingness
- units¶
Measurement units (e.g., “years”, “USD”, “percentage”)
- range_min¶
Minimum valid value (for numeric)
- range_max¶
Maximum valid value (for numeric)
- is_ordinal¶
Whether categorical variable has meaningful order
- dgp¶
Data generating process
- is_treatment¶
Whether this is a treatment/intervention variable
- is_outcome¶
Whether this is an outcome/dependent variable
- is_confounder¶
Whether this is a potential confounder
- temporal_variable¶
Name of associated time variable (if longitudinal)
- notes¶
Additional metadata notes
- source¶
Data source or survey question text
- enriched_metadata¶
LLM-generated enrichment information
- description: str | None¶
- dgp: DataGeneratingProcess¶
- dtype: str | None¶
- enriched_metadata: dict[str, Any]¶
- is_confounder: bool¶
- is_ordinal: bool¶
- is_outcome: bool¶
- is_treatment: bool¶
- label: str¶
- missing_pattern: MissingPattern¶
- missing_values: set[int | str]¶
- model_config: ClassVar[ConfigDict] = {'use_enum_values': True, 'validate_assignment': True}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str¶
- notes: str | None¶
- range_max: float | None¶
- range_min: float | None¶
- source: str | None¶
- temporal_variable: str | None¶
- units: str | None¶
- valid_values: dict[int | str, str]¶
- var_type: VariableType¶
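As a rough illustration of how `valid_values` and `missing_values` work together, the sketch below decodes raw codes using plain Python containers shaped like those two fields. It does not import statqa; the `decode` helper and the sample codes are hypothetical.

```python
from __future__ import annotations

from typing import Optional, Union

Code = Union[int, str]

# Plain containers shaped like Variable.valid_values and Variable.missing_values.
valid_values: dict[Code, str] = {1: "Male", 2: "Female"}
missing_values: set[Code] = {-1, 999}


def decode(raw: Code) -> Optional[str]:
    """Return the label for a code, or None if the code marks missing data."""
    if raw in missing_values:
        return None
    # Codes without a label are passed through as text (an assumption of this sketch).
    return valid_values.get(raw, str(raw))


decoded = [decode(code) for code in [1, 2, -1, 999, 3]]
print(decoded)
```

Checking the missing codes before the label lookup keeps a sentinel such as 999 from being reported as a real category.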
- class statqa.metadata.schema.VariableType(*values)[source]¶
Statistical type of a variable.
- BOOLEAN = 'boolean'¶
- CATEGORICAL_NOMINAL = 'categorical_nominal'¶
- CATEGORICAL_ORDINAL = 'categorical_ordinal'¶
- DATETIME = 'datetime'¶
- NUMERIC_CONTINUOUS = 'numeric_continuous'¶
- NUMERIC_DISCRETE = 'numeric_discrete'¶
- TEXT = 'text'¶
- UNKNOWN = 'unknown'¶
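The member values listed above are plain strings, so a type read from a codebook file can be mapped to a member by value. The following is a minimal sketch using a local stand-in enum that mirrors the documented members; the stand-in's base class and the `to_variable_type` helper are assumptions, not part of statqa.

```python
from enum import Enum


class VariableType(Enum):
    """Local stand-in mirroring the documented members (not the statqa class)."""

    BOOLEAN = "boolean"
    CATEGORICAL_NOMINAL = "categorical_nominal"
    CATEGORICAL_ORDINAL = "categorical_ordinal"
    DATETIME = "datetime"
    NUMERIC_CONTINUOUS = "numeric_continuous"
    NUMERIC_DISCRETE = "numeric_discrete"
    TEXT = "text"
    UNKNOWN = "unknown"


def to_variable_type(raw: str) -> VariableType:
    """Look up a member by value, falling back to UNKNOWN for unrecognised text."""
    try:
        return VariableType(raw.strip().lower())
    except ValueError:
        # Enum lookup by value raises ValueError when no member matches.
        return VariableType.UNKNOWN


print(to_variable_type("numeric_continuous"))
print(to_variable_type("ratio"))
```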
Parsers¶
Base Parser¶
Base parser interface for codebook parsing.
Defines the abstract interface that all codebook parsers must implement.
- class statqa.metadata.parsers.base.BaseParser(**kwargs)[source]¶
Bases: ABC
Abstract base class for codebook parsers.
- Parameters:
**kwargs (Any) – Parser-specific configuration options
- abstractmethod parse(source)[source]¶
Parse a codebook from the given source.
- Parameters:
source (str | Path) – Path to codebook file or string content
- Return type:
- Returns:
Parsed Codebook object
- Raises:
ValueError – If source format is invalid
FileNotFoundError – If source file doesn’t exist
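To illustrate the shape of the interface described above, here is a minimal sketch of an abstract parser and one concrete subclass. `BaseParser` is redefined locally as a stand-in so the example runs without statqa; `KeyValueParser`, the `options` attribute, and the plain-dict return value are hypothetical (the documented method returns a Codebook).

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Union


class BaseParser(ABC):
    """Local stand-in for the documented interface, not the statqa class."""

    def __init__(self, **kwargs: Any) -> None:
        self.options = kwargs  # attribute name is an assumption of this sketch

    @abstractmethod
    def parse(self, source: Union[str, Path]) -> Any:
        """Parse a codebook from a path or from string content."""


class KeyValueParser(BaseParser):
    """Hypothetical parser: reads 'name: label' lines into a plain dict."""

    def parse(self, source: Union[str, Path]) -> dict[str, str]:
        if isinstance(source, Path):
            if not source.exists():
                raise FileNotFoundError(source)
            text = source.read_text(encoding="utf-8")
        else:
            text = source  # plain strings are treated as content here

        variables: dict[str, str] = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            name, sep, label = line.partition(":")
            if not sep:
                raise ValueError(f"Expected 'name: label', got {line!r}")
            variables[name.strip()] = label.strip()
        return variables


parsed = KeyValueParser().parse("age: Respondent Age\ngender: Gender")
print(parsed)
```

Because `parse` is abstract, instantiating `BaseParser` directly raises `TypeError`; only subclasses that implement it can be created.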
CSV Parser¶
CSV-based codebook parser.
Parses codebooks stored in CSV format with columns such as:
- variable_name
- label
- type
- description
- valid_values
- missing_values
- units
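A CSV codebook with the columns named above can be read with the standard library alone. The sketch below is not the statqa parser: `read_codebook_rows` is a hypothetical helper, and the comma-separated format of the `missing_values` cell is an assumption, since the cell format is not specified here.

```python
import csv
import io

CSV_TEXT = """variable_name,label,type,units,missing_values
age,Respondent Age,numeric_continuous,years,"-1, 999"
gender,Gender,categorical_nominal,,
"""


def read_codebook_rows(text):
    """Return {variable_name: fields} for a CSV codebook held in a string."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = (row.get("variable_name") or "").strip()
        if not name:
            raise ValueError("Row without a variable_name")
        # Assumed cell format: codes separated by commas, e.g. "-1, 999".
        raw_missing = row.get("missing_values") or ""
        missing = {code.strip() for code in raw_missing.split(",") if code.strip()}
        rows[name] = {
            "label": row.get("label") or "",
            "type": row.get("type") or "unknown",
            "units": row.get("units") or None,
            "missing_values": missing,
        }
    return rows


rows = read_codebook_rows(CSV_TEXT)
print(rows["age"])
```

The codes are kept as strings; converting them to integers would be a separate step once the variable's type is known.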
Text Parser¶
Text-based codebook parser.
Parses structured text codebooks with variable definitions. Supports formats like:
```
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
Description: Age of respondent at time of survey

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
  1: Male
  2: Female
  3: Other
```
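The block format shown above can be parsed line by line. The following is a minimal sketch rather than the statqa implementation: `parse_text_codebook` is a hypothetical helper, and it assumes that each value label sits on its own indented line under `Values:`.

```python
TEXT = """\
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999
Description: Age of respondent at time of survey

# Variable: gender
Label: Gender
Type: categorical_nominal
Values:
  1: Male
  2: Female
  3: Other
"""


def parse_text_codebook(text):
    """Return {variable_name: fields} parsed from '# Variable:' blocks."""
    variables = {}
    current = None
    in_values = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("# Variable:"):
            name = line.split(":", 1)[1].strip()
            current = {"valid_values": {}}
            variables[name] = current
            in_values = False
            continue
        if current is None:
            raise ValueError(f"Field before any '# Variable:' header: {line!r}")
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Expected 'Key: value', got {line!r}")
        key, value = key.strip(), value.strip()
        if in_values and raw_line[:1].isspace():
            # Indented 'code: label' line inside a Values: block.
            current["valid_values"][key] = value
        elif key == "Values" and not value:
            in_values = True
        else:
            in_values = False
            current[key.lower()] = value
    return variables


result = parse_text_codebook(TEXT)
print(result["gender"]["valid_values"])
```

Field values are kept as raw strings; splitting `Range` into bounds or `Missing` into a set of codes is left out to keep the sketch short.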
Statistical Formats Parser¶
Statistical format parser for SPSS, Stata, and SAS files.
Uses pyreadstat library to parse statistical data files and extract rich metadata including variable labels, value labels, and missing value definitions.
- class statqa.metadata.parsers.statistical.StatisticalFormatParser(**kwargs)[source]¶
Bases: BaseParser
Parser for statistical data files (SPSS, Stata, SAS).
Enricher¶
LLM-powered metadata enhancement.
LLM-based metadata enrichment.
Uses language models to verify, infer, and enrich variable metadata including:
- Type inference and validation
- Relationship suggestions
- Causal structure hints
- Missing pattern detection
- Variable importance ranking
- class statqa.metadata.enricher.MetadataEnricher(provider='openai', model=None, api_key=None, **kwargs)[source]¶
Bases: object
Enrich metadata using LLM capabilities.
Supports both OpenAI and Anthropic models.
- Parameters:
- Raises:
ImportError – If required LLM package not installed
ValueError – If provider is not supported
- enrich_variable(variable, dataset_context=None)[source]¶
Enrich a single variable’s metadata.
- Parameters:
- Return type:
- Returns:
Enriched Variable with updated metadata
- Raises:
EnrichmentError – If enrichment process fails
LLMConnectionError – If LLM connection fails
LLMResponseError – If LLM response is invalid