BigQuery Sampler¶
The BigQuery sampler leverages Google BigQuery’s public GitHub dataset to sample repositories with advanced filtering capabilities.
- class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)[source]¶
Bases: BaseSampler
Sample repositories using Google BigQuery’s GitHub dataset.
This sampler queries the public GitHub dataset in Google BigQuery to efficiently sample repositories at scale and with complex filtering criteria.
- get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]][source]¶
Retrieve language information for a list of repositories.
- sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories using BigQuery.
- Parameters:
n_samples – Number of repositories to sample
population – Type of repository population to sample from (‘all’ or ‘active’)
**kwargs – Additional filtering criteria
- Returns:
List of repository dictionaries
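The returned list can be post-processed like any list of dictionaries. A minimal sketch, assuming each repository record carries fields such as "name" and "stars" (illustrative assumptions, not guaranteed field names):

```python
# Hypothetical records shaped like the sampler's output;
# the field names ("name", "stars") are assumptions for illustration.
repos = [
    {"name": "octocat/hello-world", "stars": 1500},
    {"name": "example/tiny-lib", "stars": 12},
    {"name": "example/big-framework", "stars": 98000},
]

# Keep only repositories above a star threshold, most-starred first.
popular = sorted(
    (r for r in repos if r["stars"] >= 100),
    key=lambda r: r["stars"],
    reverse=True,
)
print([r["name"] for r in popular])
```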
Advantages¶
- Handles large sample sizes efficiently
- Powerful filtering and stratification options
- Not limited by GitHub API rate limits
- Access to historical data and metadata
Disadvantages¶
- Requires a Google Cloud Platform account
- Can be expensive for large queries
- Dataset updates may lag behind GitHub by 24–48 hours
Usage Example¶
from reporoulette import BigQuerySampler

# Direct usage
sampler = BigQuerySampler(
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)
repos = sampler.sample(n_samples=100)

# Using the convenience function
from reporoulette import sample

results = sample(
    method='bigquery',
    n_samples=100,
    credentials_path="/path/to/credentials.json",
    project_id="your-gcp-project"
)
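Per its signature, get_languages returns a dict[str, list[dict[str, Any]]]. As a sketch of consuming a mapping of that shape, assuming keys are repository names and the entry fields ("language", "bytes") are illustrative assumptions, not the documented format:

```python
# A mapping shaped like get_languages' return type,
# dict[str, list[dict[str, Any]]]; the entry fields
# ("language", "bytes") are assumptions for illustration.
languages = {
    "octocat/hello-world": [
        {"language": "Python", "bytes": 52000},
        {"language": "Shell", "bytes": 1800},
    ],
    "example/tiny-lib": [
        {"language": "Rust", "bytes": 9100},
    ],
}

# Pick each repository's dominant language by byte count.
dominant = {
    repo: max(entries, key=lambda e: e["bytes"])["language"]
    for repo, entries in languages.items()
}
print(dominant)
```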