BigQuery Sampler

The BigQuery sampler leverages Google BigQuery’s public GitHub dataset to sample repositories with advanced filtering capabilities.

class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories using Google BigQuery’s GitHub dataset.

This sampler leverages the public GitHub dataset in Google BigQuery to efficiently sample repositories with complex criteria and at scale.

get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]]

Retrieve language information for a list of repositories.
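As a minimal sketch of working with the return value, the helper below lists entries that came back with no language data. It relies only on the documented return type, a dictionary mapping a string key to a list of dictionaries. The helper name `repos_without_languages` is hypothetical, and treating each key as a repository name is an assumption.

```python
from __future__ import annotations

from typing import Any


def repos_without_languages(language_info: dict[str, list[dict[str, Any]]]) -> list[str]:
    """Return the keys whose language list is empty, in sorted order."""
    # Assumption: each key identifies one repository; the inner dicts are not inspected.
    return sorted(repo for repo, entries in language_info.items() if not entries)
```

Calling it on the result of `get_languages(repos)` gives a quick check of how many sampled repositories lack language information.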

sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]]

Sample repositories using BigQuery.

Parameters:
  • n_samples – Number of repositories to sample

  • population – Type of repository population to sample from (‘all’ or ‘active’)

  • **kwargs – Additional filtering criteria

Returns:

List of repository dictionaries

sample_active(n_samples: int = 100, created_after: str | datetime | None = None, created_before: str | datetime | None = None, languages: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories with recent commit activity.
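The sketch below shows one way to call `sample_active` with a creation-date window and a language filter, using only the parameters in the signature above. The wrapper `sample_recent_python` is hypothetical, and the spelling of language names (here `"Python"`) is an assumption about what the sampler expects.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Any


def sample_recent_python(
    sampler: Any,
    n_samples: int = 50,
    days: int = 365,
    now: datetime | None = None,
) -> list[dict[str, Any]]:
    """Sample active repositories created within the last `days` days."""
    now = now or datetime.now()
    return sampler.sample_active(
        n_samples=n_samples,
        # created_after accepts a str or a datetime per the signature.
        created_after=now - timedelta(days=days),
        # Assumption: language names are capitalized as on GitHub.
        languages=["Python"],
    )
```

Passing the sampler in as an argument keeps the helper independent of how credentials are configured.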

sample_by_day(n_samples: int = 100, days_to_sample: int = 10, repos_per_day: int = 50, years_back: int = 10, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using a day-based approach with GitHub Archive tables.
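To illustrate the day-based idea, the sketch below picks `days_to_sample` random days from the last `years_back` years and maps each to a GitHub Archive daily table name of the form `githubarchive.day.YYYYMMDD`. It shows the general technique rather than the library's actual implementation. The helpers `pick_sample_days` and `archive_table` are hypothetical.

```python
import random
from datetime import date, timedelta


def pick_sample_days(days_to_sample=10, years_back=10, seed=None, today=None):
    """Pick distinct random days from the last `years_back` years."""
    today = today or date.today()
    rng = random.Random(seed)  # seeded for reproducible samples
    window = 365 * years_back  # approximate window length in days
    # Offsets start at 1 so the current, still-incomplete day is never picked.
    offsets = rng.sample(range(1, window + 1), k=days_to_sample)
    return sorted(today - timedelta(days=o) for o in offsets)


def archive_table(day):
    """Name of the GitHub Archive daily table for `day`."""
    return f"githubarchive.day.{day:%Y%m%d}"


days = pick_sample_days(days_to_sample=3, years_back=2, seed=42, today=date(2024, 1, 1))
tables = [archive_table(d) for d in days]
```

Each table could then be queried for up to `repos_per_day` repositories, with the pooled results reduced to `n_samples`. That last step is an assumption about how the parameters relate.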

Advantages

  • Handles large sample sizes efficiently

  • Powerful filtering and stratification options

  • Not limited by GitHub API rate limits

  • Access to historical data and metadata

Disadvantages

  • Requires a Google Cloud Platform account

  • Large queries can be expensive, because BigQuery's on-demand pricing charges by the amount of data scanned

  • Dataset may have slight delays (24-48 hours)
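Because cost depends on bytes scanned, a rough estimate can be made before running a query. The sketch below converts a byte count into a cost figure. The function `estimate_query_cost` is hypothetical, and the per-TiB rate is a value you must supply from your own billing terms. The byte count can be obtained from a BigQuery dry run (`QueryJobConfig(dry_run=True)` in the `google-cloud-bigquery` client, then `total_bytes_processed` on the returned job).

```python
def estimate_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate on-demand query cost from bytes scanned.

    price_per_tib is an input, not a known constant: look up the current
    rate for your region and billing account.
    """
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    tib_scanned = bytes_processed / 2**40  # 1 TiB = 2**40 bytes
    return tib_scanned * price_per_tib
```

Running the estimate on a dry run first makes it easier to decide whether to narrow the date range or the selected columns before paying for the full query.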

Usage Example

from reporoulette import BigQuerySampler

# Direct usage
sampler = BigQuerySampler(
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)
repos = sampler.sample(n_samples=100)

# Using convenience function
from reporoulette import sample
results = sample(
    method='bigquery',
    n_samples=100,
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)