tabula_rasa
Tabula Rasa: Production Table Knowledge LLM.
Teaching LLMs to accurately answer questions about tabular data through
statistical sketching and execution grounding.
-
class tabula_rasa.AdvancedStatSketch(max_categories=50, confidence_level=0.95)[source]
Production-grade statistical sketch with:
- Automatic distribution detection
- Robust copula estimation
- Conditional distribution inference
- Multi-table relationship tracking
-
__init__(max_categories=50, confidence_level=0.95)[source]
Initialize the statistical sketch extractor.
- Parameters:
max_categories (int) – Defaults to 50
confidence_level (float) – Defaults to 0.95
-
Extract a comprehensive statistical sketch from a DataFrame.
- Parameters:
-
- Return type:
dict
- Returns:
Dictionary containing statistical sketch with columns, correlations,
copula parameters, and conditional distributions
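To make the idea of a statistical sketch concrete, here is a minimal, dependency-free illustration. It is not the library's implementation: `sketch_table` and `pearson` are hypothetical helpers, the table is a list of dicts standing in for a DataFrame, and the output layout is an assumption. Copula parameters and conditional distributions are left out entirely.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sketch_table(rows, max_categories=50):
    """Summarize a table given as a list of dicts (stand-in for a DataFrame)."""
    columns, numeric = {}, {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            columns[name] = {
                "type": "numeric",
                "mean": statistics.fmean(values),
                "std": statistics.stdev(values),
                "quartiles": tuple(statistics.quantiles(values, n=4)),
            }
            numeric[name] = values
        else:
            # Keep only the most frequent categories, capped at max_categories.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            top = sorted(counts.items(), key=lambda kv: -kv[1])[:max_categories]
            columns[name] = {"type": "categorical", "top_values": dict(top)}

    # Pairwise correlations between numeric columns.
    correlations = {}
    names = list(numeric)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            correlations[(a, b)] = pearson(numeric[a], numeric[b])
    return {"columns": columns, "correlations": correlations}


rows = [
    {"age": 20, "income": 30.0, "city": "A"},
    {"age": 30, "income": 50.0, "city": "B"},
    {"age": 40, "income": 70.0, "city": "A"},
    {"age": 50, "income": 90.0, "city": "A"},
]
sketch = sketch_table(rows)
```

The point of the sketch is that it is small and fixed in shape regardless of row count, which is what lets a model reason about the table without seeing every row.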
-
class tabula_rasa.AdvancedQueryExecutor(df)[source]
Execute queries on actual data.
Supports aggregations, filters, conditionals, and group-by.
-
__init__(df)[source]
Initialize query executor.
- Parameters:
df (DataFrame) – DataFrame to execute queries against
-
execute(query)[source]
Execute a structured query.
- Parameters:
query (Query) – Query object specifying the operation
- Return type:
Any
- Returns:
Query result (float for aggregates, int for counts, dict for group-by)
- Raises:
ValueError – If query type is unknown
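The dispatch behaviour described above (float for aggregates, int for counts, dict for group-by, ValueError otherwise) can be sketched without the library. Everything here is a stand-in: the `Query` dataclass only mirrors the documented field names, `execute_query` is a hypothetical function over a list of dicts, and the query type strings (`"aggregate"`, `"count"`, `"group_by"`) and the condition format `"<column> <op> <number>"` are assumptions, since the reference does not specify them.

```python
import operator
import statistics
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Query:
    """Stand-in with the same field names as the documented Query."""
    query_type: str
    target_column: Optional[str] = None
    aggregation: Optional[str] = None
    condition: Optional[str] = None
    percentile: Optional[float] = None
    group_by: Optional[str] = None


_OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}
_AGGS = {"mean": statistics.fmean, "sum": sum, "min": min, "max": max}


def _filter(rows, condition):
    # Assumed condition format: "<column> <op> <number>", e.g. "age > 35".
    if condition is None:
        return rows
    column, op, value = condition.split()
    return [r for r in rows if _OPS[op](r[column], float(value))]


def execute_query(rows, query: Query) -> Any:
    """Run a query against a list of dicts (stand-in for a DataFrame)."""
    rows = _filter(rows, query.condition)
    if query.query_type == "aggregate":
        values = [r[query.target_column] for r in rows]
        return float(_AGGS[query.aggregation](values))
    if query.query_type == "count":
        return len(rows)
    if query.query_type == "group_by":
        groups = {}
        for r in rows:
            groups.setdefault(r[query.group_by], []).append(r[query.target_column])
        return {k: float(_AGGS[query.aggregation](v)) for k, v in groups.items()}
    raise ValueError(f"Unknown query type: {query.query_type}")


rows = [
    {"dept": "eng", "salary": 100.0, "age": 30},
    {"dept": "eng", "salary": 120.0, "age": 40},
    {"dept": "ops", "salary": 80.0, "age": 50},
]
mean_salary = execute_query(rows, Query("aggregate", target_column="salary", aggregation="mean"))
older_count = execute_query(rows, Query("count", condition="age > 35"))
by_dept = execute_query(
    rows, Query("group_by", target_column="salary", aggregation="mean", group_by="dept")
)
```

Because the executor computes answers directly from the data, its results can serve as ground truth for a model trained on the sketch alone.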
-
class tabula_rasa.Query(query_type, target_column=None, aggregation=None, condition=None, percentile=None, group_by=None)[source]
Structured query representation.
-
__init__(query_type, target_column=None, aggregation=None, condition=None, percentile=None, group_by=None)
-
aggregation: str | None = None
-
condition: str | None = None
-
group_by: str | None = None
-
percentile: float | None = None
-
target_column: str | None = None
-
query_type: str
-
class tabula_rasa.StatisticalEncoder(hidden_dim=256, output_dim=768)[source]
Encode statistical sketch into neural representation.
Handles variable-length column lists via attention pooling.
-
__init__(hidden_dim=256, output_dim=768)[source]
Initialize the statistical encoder.
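- Parameters:
hidden_dim (int) – Defaults to 256
output_dim (int) – Size of the encoded output vector; defaults to 768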
-
forward(sketch)[source]
Encode sketch to fixed-size vector.
- Parameters:
sketch (dict) – Statistical sketch dictionary
- Return type:
Tensor
- Returns:
Tensor of shape (output_dim,) representing the encoded table
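Attention pooling is what lets a table with any number of columns map to a vector of one fixed size. The following is a plain-Python sketch of the general technique, not the encoder's actual code: `attention_pool` is a hypothetical helper, and the pooling query is a fixed list here, whereas in a trained encoder it would presumably be a learned parameter.

```python
import math


def attention_pool(column_vectors, query_vector):
    """Pool a variable number of column vectors into one fixed-size vector."""
    dim = len(query_vector)
    # Scaled dot-product score of each column against the pooling query.
    scale = math.sqrt(dim)
    scores = [
        sum(q * x for q, x in zip(query_vector, vec)) / scale
        for vec in column_vectors
    ]
    # Numerically stable softmax turns scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over columns; the result always has length `dim`.
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, column_vectors))
        for i in range(dim)
    ]
    return pooled, weights


query = [1.0, 1.0, 0.0, 0.0]
two_columns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
five_columns = [[float(i == j % 4) for i in range(4)] for j in range(5)]

pooled_small, weights_small = attention_pool(two_columns, query)
pooled_large, weights_large = attention_pool(five_columns, query)
```

Both calls return a vector of length 4, even though one table has two columns and the other has five.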
-
class tabula_rasa.ProductionTableQA(model_name='t5-small', stat_dim=768)[source]
Production Table QA model with T5 backbone.
Combines:
- Pretrained language understanding (T5)
- Statistical table knowledge (StatEncoder)
- Execution grounding (trained to match executor)
-
__init__(model_name='t5-small', stat_dim=768)[source]
Initialize the Table QA model.
- Parameters:
model_name (str) – Defaults to 't5-small'
stat_dim (int) – Defaults to 768
-
forward(question, sketch, return_features=False)[source]
Forward pass.
- Parameters:
question (str) – Natural language question
sketch (dict) – Statistical sketch dictionary
return_features (bool) – Whether to return intermediate features
- Return type:
dict
- Returns:
Dictionary with keys:
answer: Predicted numerical answer
confidence: Confidence score in [0, 1]
query_type_logits: Query type classification logits
features (optional): Fused representation, included only when return_features is True
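One way a caller might consume the returned dictionary is sketched below. This is illustrative only: `interpret_output` is a hypothetical helper, the query type names and their order are assumptions, and the values are plain Python numbers, whereas the model presumably returns tensors that would need converting first.

```python
def interpret_output(output, query_type_names, min_confidence=0.5):
    """Turn a forward-pass result into a small, readable summary."""
    logits = output["query_type_logits"]
    # Index of the largest logit is the predicted query type.
    best = max(range(len(logits)), key=logits.__getitem__)
    return {
        "answer": float(output["answer"]),
        "query_type": query_type_names[best],
        # Flag low-confidence predictions so the caller can fall back
        # to executing the query on the real data.
        "trusted": output["confidence"] >= min_confidence,
    }


# Stand-in for a forward-pass result; the values are made up for illustration.
example_output = {
    "answer": 42.5,
    "confidence": 0.91,
    "query_type_logits": [2.3, 0.2, -1.0],
}
result = interpret_output(example_output, ["aggregate", "count", "group_by"])
```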
-
class tabula_rasa.TableQADataset(df, sketch, n_samples=1000)[source]
Dataset for table QA with synthetic query generation.
-
__getitem__(idx)[source]
Get a single sample.
- Return type:
dict
-
__init__(df, sketch, n_samples=1000)[source]
Initialize dataset with synthetic query generation.
- Parameters:
df (DataFrame) – Source DataFrame
sketch (dict) – Statistical sketch of the DataFrame
n_samples (int) – Number of training samples to generate
-
__len__()[source]
Return number of samples.
- Return type:
int
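Synthetic query generation can be pictured as sampling a question template and computing its ground-truth answer from the data. The class below is a simplified, hypothetical stand-in, not the library's dataset: it takes a list of dicts and an explicit list of numeric columns instead of a DataFrame and sketch, and its question wording and sample keys are assumptions.

```python
import random
import statistics


class SyntheticQADataset:
    """Toy dataset of (question, answer) pairs generated from a table."""

    AGGREGATIONS = {"mean": statistics.fmean, "min": min, "max": max}

    def __init__(self, rows, numeric_columns, n_samples=1000, seed=0):
        rng = random.Random(seed)  # seeded for reproducible samples
        self.samples = []
        for _ in range(n_samples):
            column = rng.choice(numeric_columns)
            agg = rng.choice(sorted(self.AGGREGATIONS))
            values = [row[column] for row in rows]
            self.samples.append({
                "question": f"What is the {agg} of {column}?",
                # Ground truth is computed from the actual data.
                "answer": float(self.AGGREGATIONS[agg](values)),
            })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


rows = [
    {"price": 10.0, "qty": 1},
    {"price": 20.0, "qty": 2},
    {"price": 30.0, "qty": 3},
]
dataset = SyntheticQADataset(rows, ["price", "qty"], n_samples=5)
first = dataset[0]
```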
-
class tabula_rasa.ProductionTrainer(model, df, sketch, lr=0.0001, batch_size=16, device='cpu')[source]
Production training with best practices.
-
__init__(model, df, sketch, lr=0.0001, batch_size=16, device='cpu')[source]
Initialize the trainer.
- Parameters:
model (ProductionTableQA) – ProductionTableQA model to train
df (DataFrame) – Training DataFrame
sketch (dict) – Statistical sketch of the DataFrame
lr (float) – Learning rate
batch_size (int) – Batch size for training
device (str) – Device to train on ('cpu' or 'cuda')
-
train(n_epochs=10, n_train_samples=1000, n_val_samples=200)[source]
Training loop with validation.
- Parameters:
n_epochs (int) – Number of training epochs
n_train_samples (int) – Number of training samples to generate
n_val_samples (int) – Number of validation samples to generate
- Return type:
tuple[float, dict]
- Returns:
Tuple of (best_val_loss, history_dict)
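The return shape, a best validation loss plus a history dictionary, follows a common training-loop pattern, sketched here on a toy problem. This is not the trainer's code: `run_training`, the history keys `train_loss` and `val_loss`, and the one-parameter regression are all assumptions chosen to keep the example dependency-free.

```python
def run_training(train_step, validate, n_epochs=10):
    """Generic loop: train, validate, and track the best validation loss."""
    history = {"train_loss": [], "val_loss": []}
    best_val_loss = float("inf")
    for _ in range(n_epochs):
        history["train_loss"].append(train_step())
        val_loss = validate()
        history["val_loss"].append(val_loss)
        best_val_loss = min(best_val_loss, val_loss)
    return best_val_loss, history


# Toy problem: fit w in y = w * x by gradient descent (true w is 2).
train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_data = [(4.0, 8.0), (5.0, 10.0)]
state = {"w": 0.0}


def mse(data, w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train_step(lr=0.05):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (state["w"] * x - y) * x for x, y in train_data) / len(train_data)
    state["w"] -= lr * grad
    return mse(train_data, state["w"])


def validate():
    return mse(val_data, state["w"])


best_val_loss, history = run_training(train_step, validate, n_epochs=10)
```

Tracking the minimum separately from the history means the caller can report the best epoch even if later epochs overfit.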