RepoRoulette: Randomly Sample GitHub Repositories¶
A Python library for randomly sampling GitHub repositories using multiple methods:
ID-based sampling: Probes random repository IDs
Temporal sampling: Weighted sampling based on repository activity by time period
BigQuery sampling: Advanced querying using Google BigQuery’s GitHub dataset
GitHub Archive sampling: Event-based sampling from GitHub Archive files
Example
>>> from reporoulette import sample
>>> results = sample(method='temporal', n_samples=10)
>>> print(f"Found {len(results['samples'])} repositories")
- class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)[source]¶
Bases: BaseSampler
Sample repositories using Google BigQuery’s GitHub dataset.
This sampler leverages the public GitHub dataset in Google BigQuery to efficiently sample repositories with complex criteria and at scale.
- get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]][source]¶
Retrieve language information for a list of repositories.
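The sampler's own implementation queries BigQuery for language data; the grouping it returns can be illustrated with a self-contained sketch. The `group_by_language` helper and the `"language"` key on repository dicts are assumptions for illustration, not part of the library's API:

```python
from collections import defaultdict
from typing import Any

def group_by_language(repos: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group repository dicts by their primary language (hypothetical helper)."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for repo in repos:
        # Repositories without a detected language fall under "Unknown".
        grouped[repo.get("language") or "Unknown"].append(repo)
    return dict(grouped)

repos = [
    {"name": "a", "language": "Python"},
    {"name": "b", "language": "Rust"},
    {"name": "c", "language": "Python"},
]
grouped = group_by_language(repos)
```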
- sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories using BigQuery.
- Parameters:
n_samples – Number of repositories to sample
population – Type of repository population to sample from (‘all’ or ‘active’)
**kwargs – Additional filtering criteria
- Returns:
List of repository dictionaries
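The exact SQL the sampler issues is not shown here; a minimal sketch of random sampling against BigQuery's public `bigquery-public-data.github_repos.sample_repos` table (which exposes `repo_name` and `watch_count` columns) might look like the following. The `build_random_sample_query` helper and the `watch_count > 0` interpretation of `population='active'` are assumptions for illustration:

```python
def build_random_sample_query(n_samples: int, population: str = "all") -> str:
    """Build a BigQuery SQL string that samples rows uniformly at random."""
    table = "bigquery-public-data.github_repos.sample_repos"
    parts = ["SELECT repo_name", f"FROM `{table}`"]
    if population == "active":
        # Hypothetical reading of 'active': at least one watcher.
        parts.append("WHERE watch_count > 0")
    # ORDER BY RAND() shuffles rows; LIMIT caps the sample size.
    parts.append(f"ORDER BY RAND() LIMIT {n_samples}")
    return " ".join(parts)

query = build_random_sample_query(5, population="active")
```

Note that `ORDER BY RAND()` scans the whole table, so it is only practical on tables of moderate size.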
- class reporoulette.GHArchiveSampler(token: str | None = None, seed: int | None = None, log_level: int = 20)[source]¶
Bases: BaseSampler
Sample repositories by downloading and processing GH Archive files.
This sampler randomly selects days from GitHub’s event history, downloads the corresponding archive files, and extracts repository information.
- gh_sampler(n_samples: int = 100, days_to_sample: int = 5, repos_per_day: int = 20, years_back: int = 10, event_types: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories by downloading and processing full days’ GH Archive files.
- Parameters:
n_samples – Target number of repositories to sample
days_to_sample – Number of random days to sample
repos_per_day – Maximum repositories to sample per day
years_back – How many years to look back
event_types – Types of GitHub events to consider
**kwargs – Additional filters to apply
- Returns:
List of repository data
- sample(n_samples: int = 100, **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories using the GH Archive approach.
This implements the abstract method from BaseSampler and delegates to the gh_sampler method with the provided parameters.
- Parameters:
n_samples – Number of repositories to sample
**kwargs – Additional parameters to pass to gh_sampler
- Returns:
List of repository data
- class reporoulette.IDSampler(token: str | None = None, min_id: int = 1, max_id: int = 850000000, rate_limit_safety: int = 100, seed: int | None = None, log_level: int = 20, auto_discover_max: bool = False)[source]¶
Bases: BaseSampler
Sample repositories using random ID probing.
This sampler generates random repository IDs within a specified range and attempts to retrieve repositories with those IDs from GitHub.
- sample(n_samples: int = 10, min_wait: float = 0.1, max_attempts: int = 1000, **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories by trying random IDs.
- Parameters:
n_samples – Number of valid repositories to collect
min_wait – Minimum wait time between API requests
max_attempts – Maximum number of IDs to try
**kwargs – Additional filters to apply
- Returns:
List of repository data
- class reporoulette.TemporalSampler(token: str | None = None, start_date: datetime | str | None = None, end_date: datetime | str | None = None, rate_limit_safety: int = 100, seed: int | None = None, years_back: int = 10, log_level: int = 20)[source]¶
Bases: BaseSampler
Sample repositories by randomly selecting days and fetching repos updated in those periods.
This sampler selects random days within a specified date range, weights them by repository count, and retrieves repositories with proportional sampling.
- sample(n_samples: int = 100, days_to_sample: int = 10, per_page: int = 100, min_wait: float = 1.0, min_stars: int = 0, min_size_kb: int = 0, language: str | None = None, **kwargs: Any) → list[dict[str, Any]][source]¶
Sample repositories by randomly selecting days with weighting based on repo count.
- Parameters:
n_samples – Target number of repositories to collect
days_to_sample – Number of random days to initially sample for count assessment
per_page – Number of results per page (max 100)
min_wait – Minimum wait time between API requests
min_stars – Minimum number of stars (0 for no filtering)
min_size_kb – Minimum repository size in KB (0 for no filtering)
language – Programming language to filter by
**kwargs – Additional filters to apply
- Returns:
List of repository data
- reporoulette.sample(method: str = 'temporal', n_samples: int = 50, token: str | None = None, **kwargs: Any) → dict[str, Any][source]¶
Sample repositories using the specified method.
- Parameters:
method – Sampling method (‘id’, ‘temporal’, ‘archive’, or ‘bigquery’)
n_samples – Number of repositories to sample
token – GitHub Personal Access Token (not used for BigQuery)
**kwargs – Additional parameters specific to each sampler
- Returns:
Dictionary with sampling results and stats
- Raises:
ValueError – If an unknown sampling method is provided
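How the function routes a method name to a sampler class is not shown in the docs; a minimal sketch of name-based dispatch with the documented ValueError behavior, using a hypothetical `dispatch` helper and toy registry, might look like:

```python
from typing import Any, Callable

def dispatch(method: str, registry: dict[str, Callable[..., Any]], **kwargs: Any) -> Any:
    """Look up a sampler factory by method name; raise ValueError on unknown names."""
    try:
        factory = registry[method]
    except KeyError:
        raise ValueError(f"Unknown sampling method: {method!r}") from None
    return factory(**kwargs)

# Toy registry standing in for the four real samplers.
registry = {
    "id": lambda **kw: "IDSampler",
    "temporal": lambda **kw: "TemporalSampler",
}
result = dispatch("temporal", registry)
```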
RepoRoulette provides multiple methods for randomly sampling GitHub repositories:
ID-based sampling: Probes random repository IDs
Temporal sampling: Weighted sampling based on repository activity by time period
BigQuery sampling: Advanced querying using Google BigQuery’s GitHub dataset
GitHub Archive sampling: Event-based sampling from GitHub Archive files
Quick Start¶
from reporoulette import sample
# Sample 10 repositories using temporal sampling
results = sample(method='temporal', n_samples=10)
print(f"Found {len(results['samples'])} repositories")