BigQuery Sampler

The BigQuery sampler draws repository samples from the public GitHub dataset hosted on Google BigQuery and supports filtering by criteria such as creation date and language.

class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories using Google BigQuery’s GitHub dataset.

This sampler leverages the public GitHub dataset in Google BigQuery to efficiently sample repositories with complex criteria and at scale.
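The constructor arguments can be gathered in one place before the sampler is built. The sketch below is one possible way to do that, assuming the credentials path and project ID live in the environment variables GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT (names conventionally read by Google's client libraries; whether BigQuerySampler itself falls back to them is not stated here). The helper sampler_kwargs is hypothetical and not part of reporoulette. The documented log_level default of 20 is the value of logging.INFO.

```python
import logging
import os
from typing import Any, Mapping, Optional


def sampler_kwargs(
    env: Mapping[str, str] = os.environ,
    seed: Optional[int] = None,
    log_level: int = logging.INFO,  # logging.INFO == 20, the documented default
) -> dict[str, Any]:
    """Collect BigQuerySampler constructor arguments (hypothetical helper)."""
    return {
        # Unset variables become None, matching the documented defaults.
        "credentials_path": env.get("GOOGLE_APPLICATION_CREDENTIALS"),
        "project_id": env.get("GOOGLE_CLOUD_PROJECT"),
        "seed": seed,
        "log_level": log_level,
    }


# Intended use (requires reporoulette and valid GCP credentials):
#   from reporoulette import BigQuerySampler
#   sampler = BigQuerySampler(**sampler_kwargs(seed=42))
```

Passing a fixed seed is presumably what makes repeated runs comparable, though the exact reproducibility guarantees are not documented in this section.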

get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]]

Retrieve language information for a list of repositories.
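Going by the return annotation, get_languages yields a dictionary whose values are lists of language records. The sketch below relies only on that shape. What the keys identify and which fields each record carries are not specified here, so the sample data and the helper count_language_entries are made up for illustration.

```python
from typing import Any


def count_language_entries(
    languages: dict[str, list[dict[str, Any]]],
) -> dict[str, int]:
    """Count how many language records were returned per dictionary key."""
    return {key: len(records) for key, records in languages.items()}


# Made-up data in the annotated shape; real keys and record fields may differ.
example = {
    "octocat/hello-world": [{"language": "Python"}, {"language": "Shell"}],
    "octocat/empty": [],
}
print(count_language_entries(example))
```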

sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]]

Sample repositories using BigQuery.

Parameters:

n_samples – Number of repositories to sample
population – Repository population to sample from, either 'all' or 'active'
**kwargs – Additional filtering criteria

Returns:

List of repository dictionaries
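Because population has only two documented values, a thin wrapper can catch typos before any query is issued. This is a sketch rather than part of reporoulette: draw is a hypothetical helper that assumes only that the sampler object exposes sample with the signature shown above. Whether sample performs its own validation is not documented here.

```python
from typing import Any

VALID_POPULATIONS = ("all", "active")  # the two values documented for sample()


def draw(sampler: Any, n_samples: int = 100, population: str = "all",
         **filters: Any) -> list[dict[str, Any]]:
    """Validate the arguments, then delegate to sampler.sample()."""
    if population not in VALID_POPULATIONS:
        raise ValueError(
            f"population must be one of {VALID_POPULATIONS}, got {population!r}"
        )
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sampler.sample(n_samples=n_samples, population=population, **filters)
```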

sample_active(n_samples: int = 100, created_after: str | datetime | None = None, created_before: str | datetime | None = None, languages: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories with recent commit activity.
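A common use of sample_active is to restrict the draw to repositories created within a recent window. The sketch below builds that window from datetime objects, which the signature accepts, and so avoids guessing at the expected string format. sample_recent is a hypothetical helper, and the spelling of language names the sampler expects is an assumption.

```python
from datetime import datetime, timedelta
from typing import Any, Optional


def sample_recent(sampler: Any, n_samples: int = 100, days: int = 365,
                  languages: Optional[list[str]] = None,
                  now: Optional[datetime] = None) -> list[dict[str, Any]]:
    """Sample active repositories created within the last `days` days."""
    end = now or datetime.now()
    start = end - timedelta(days=days)
    return sampler.sample_active(
        n_samples=n_samples,
        created_after=start,   # datetime values, so no string format is assumed
        created_before=end,
        languages=languages,   # e.g. ["Python"]; exact spelling is an assumption
    )
```

The now argument exists only to make the window deterministic in tests; in normal use it can be left unset.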

sample_by_day(n_samples: int = 100, days_to_sample: int = 10, repos_per_day: int = 50, years_back: int = 10, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using a day-based approach with GitHub Archive tables.
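The day-based idea can be illustrated without running a query. The sketch below picks random calendar days from the past years_back years and formats them as day-table names, assuming the YYYYMMDD naming used by the public GH Archive dataset on BigQuery (for example githubarchive.day.20150101). It illustrates the approach only: how sample_by_day actually chooses days and combines the per-day results is not described here, and pick_day_tables is a hypothetical helper.

```python
import random
from datetime import date, timedelta
from typing import Optional


def pick_day_tables(days_to_sample: int = 10, years_back: int = 10,
                    seed: Optional[int] = None,
                    today: Optional[date] = None) -> list[str]:
    """Choose distinct random days and return their assumed table names."""
    today = today or date.today()
    span = years_back * 365  # approximate window; leap days are ignored
    rng = random.Random(seed)
    # Draw distinct day offsets (1 = yesterday) without replacement.
    offsets = rng.sample(range(1, span + 1), k=days_to_sample)
    tables = []
    for offset in sorted(offsets, reverse=True):  # oldest day first
        day = today - timedelta(days=offset)
        tables.append(f"githubarchive.day.{day:%Y%m%d}")
    return tables


print(pick_day_tables(days_to_sample=3, years_back=2, seed=0))
```

With the documented defaults, 10 days at 50 repositories per day would give up to 500 candidates for a request of 100 samples; how the method reduces that pool is not stated in this section.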

Advantages

Handles large sample sizes efficiently
Powerful filtering and stratification options
Not limited by GitHub API rate limits
Access to historical data and metadata

Disadvantages

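Requires a Google Cloud Platform account with a project and credentials
Can be expensive for large queries, since BigQuery bills on-demand queries by the amount of data scanned
Dataset may lag behind GitHub by roughly 24-48 hours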
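Query cost can be checked before sampling. BigQuery's on-demand pricing bills by bytes scanned, and a dry-run job reports that number without executing the query. The sketch below uses the google-cloud-bigquery client directly rather than reporoulette. The price per TiB is a placeholder argument to be replaced with the current rate for your region, and both function names are hypothetical.

```python
def dry_run_bytes(sql: str, project_id: str) -> int:
    """Return the number of bytes a query would scan, without running it."""
    # Imported lazily so the cost helper below works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # returns immediately for dry runs
    return job.total_bytes_processed


def estimate_cost_usd(bytes_scanned: int, usd_per_tib: float) -> float:
    """Convert scanned bytes to an approximate on-demand cost."""
    return bytes_scanned / 2**40 * usd_per_tib
```

The estimate is approximate: it ignores free-tier allowances and any per-query minimums that may apply to your account.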

Usage Example

from reporoulette import BigQuerySampler

# Direct usage
sampler = BigQuerySampler(
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)
repos = sampler.sample(n_samples=100)

# Using convenience function
from reporoulette import sample
results = sample(
    method='bigquery',
    n_samples=100,
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)