RepoRoulette: Randomly Sample GitHub Repositories

A Python library for randomly sampling GitHub repositories using multiple methods:

ID-based sampling: Probes random repository IDs
Temporal sampling: Weighted sampling based on repository activity by time period
BigQuery sampling: Advanced querying using Google BigQuery’s GitHub dataset
GitHub Archive sampling: Event-based sampling from GitHub Archive files

Example

>>> from reporoulette import sample
>>> results = sample(method='temporal', n_samples=10)
>>> print(f"Found {len(results['samples'])} repositories")

class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories using Google BigQuery’s GitHub dataset.

This sampler leverages the public GitHub dataset in Google BigQuery to efficiently sample repositories with complex criteria and at scale.

get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]]

Retrieve language information for a list of repositories.

sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]]

Sample repositories using BigQuery.

Parameters:

n_samples – Number of repositories to sample
population – Type of repository population to sample from (‘all’ or ‘active’)
**kwargs – Additional filtering criteria

Returns:

List of repository dictionaries

sample_active(n_samples: int = 100, created_after: str | datetime | None = None, created_before: str | datetime | None = None, languages: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories with recent commit activity.

sample_by_day(n_samples: int = 100, days_to_sample: int = 10, repos_per_day: int = 50, years_back: int = 10, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using a day-based approach with GitHub Archive tables.

class reporoulette.GHArchiveSampler(token: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories by downloading and processing GH Archive files.

This sampler randomly selects days from GitHub’s event history, downloads the corresponding archive files, and extracts repository information.

gh_sampler(n_samples: int = 100, days_to_sample: int = 5, repos_per_day: int = 20, years_back: int = 10, event_types: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by downloading and processing a full day’s GH Archive files.

Parameters:

n_samples – Target number of repositories to sample
days_to_sample – Number of random days to sample
repos_per_day – Maximum repositories to sample per day
years_back – How many years to look back
event_types – Types of GitHub events to consider
**kwargs – Additional filters to apply

Returns:

List of repository data
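The day-selection step described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper names `pick_random_days` and `hourly_archive_urls` are hypothetical, the look-back window is approximated as 365 days per year, and the sketch assumes one archive file per hour, following GH Archive's public `YYYY-MM-DD-H.json.gz` naming.

```python
import random
from datetime import date, timedelta


def pick_random_days(days_to_sample, years_back, today, seed=None):
    """Pick distinct random days from the `years_back` years before `today`.

    Hypothetical helper: the window is approximated as 365 days per year.
    """
    rng = random.Random(seed)
    window = years_back * 365
    offsets = rng.sample(range(1, window + 1), k=min(days_to_sample, window))
    return sorted(today - timedelta(days=offset) for offset in offsets)


def hourly_archive_urls(day):
    """Build the 24 hourly GH Archive URLs for one day (hours are not zero-padded)."""
    return [
        f"https://data.gharchive.org/{day.isoformat()}-{hour}.json.gz"
        for hour in range(24)
    ]


# Fixed reference date and seed so the selection is reproducible.
days = pick_random_days(days_to_sample=5, years_back=10, today=date(2024, 1, 1), seed=42)
urls = hourly_archive_urls(days[0])
```

Passing a fixed `seed` mirrors the `seed` argument of the sampler constructor, so the same days are drawn on every run.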

sample(n_samples: int = 100, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using the GH Archive approach.

This is the implementation of the abstract method from BaseSampler, which delegates to the gh_sampler method with the provided parameters.

Parameters:

n_samples – Number of repositories to sample
**kwargs – Additional parameters to pass to gh_sampler

Returns:

List of repository data

class reporoulette.IDSampler(token: str | None = None, min_id: int = 1, max_id: int = 850000000, rate_limit_safety: int = 100, seed: int | None = None, log_level: int = 20, auto_discover_max: bool = False)

Bases: BaseSampler

Sample repositories using random ID probing.

This sampler generates random repository IDs within a specified range and attempts to retrieve repositories with those IDs from GitHub.

sample(n_samples: int = 10, min_wait: float = 0.1, max_attempts: int = 1000, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by trying random IDs.

Parameters:

n_samples – Number of valid repositories to collect
min_wait – Minimum wait time between API requests
max_attempts – Maximum number of IDs to try
**kwargs – Additional filters to apply

Returns:

List of repository data
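The probing loop can be sketched as follows. This is an illustration under stated assumptions rather than the sampler's actual code: `probe_random_ids` and `fake_fetch` are hypothetical names, and the `fetch` callable stands in for whatever API request the library makes, returning a repository dict or `None` when the ID does not resolve.

```python
import random
import time


def probe_random_ids(fetch, n_samples=10, min_id=1, max_id=850_000_000,
                     max_attempts=1000, min_wait=0.1, seed=None):
    """Collect up to n_samples repositories by probing random IDs.

    `fetch` is a hypothetical callable: it takes an integer ID and returns
    a repository dict, or None when no accessible repository has that ID.
    """
    rng = random.Random(seed)
    found, tried = [], set()
    for _ in range(max_attempts):
        if len(found) >= n_samples:
            break
        repo_id = rng.randint(min_id, max_id)
        if repo_id in tried:
            continue  # a repeated draw still counts as an attempt
        tried.add(repo_id)
        repo = fetch(repo_id)
        if repo is not None:
            found.append(repo)
        time.sleep(min_wait)  # pause between requests
    return found


def fake_fetch(repo_id):
    """Stand-in for an API call: pretend only even IDs exist."""
    return {"id": repo_id} if repo_id % 2 == 0 else None


repos = probe_random_ids(fake_fetch, n_samples=5, min_id=1, max_id=1000,
                         min_wait=0, seed=0)
```

Because many IDs in the range do not resolve to a repository, `max_attempts` bounds the work: the function may return fewer than `n_samples` results.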

Bases: BaseSampler

Sample repositories by randomly selecting days and fetching repos updated in those periods.

This sampler selects random days within a specified date range, weights them by repository count, and retrieves repositories with proportional sampling.

sample(n_samples: int = 100, days_to_sample: int = 10, per_page: int = 100, min_wait: float = 1.0, min_stars: int = 0, min_size_kb: int = 0, language: str | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by randomly selecting days with weighting based on repo count.

Parameters:

n_samples – Target number of repositories to collect
days_to_sample – Number of random days to initially sample for count assessment
per_page – Number of results per page (max 100)
min_wait – Minimum wait time between API requests
min_stars – Minimum number of stars (0 for no filtering)
min_size_kb – Minimum repository size in KB (0 for no filtering)
language – Programming language to filter by
**kwargs – Additional filters to apply

Returns:

List of repository data
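One way to read "weighting based on repo count" is a proportional split of the sample across the selected days. The sketch below assumes that reading. `allocate_by_count` is a hypothetical helper, the per-day counts are made up for illustration, and the largest-remainder rounding is a choice made here, not something the documentation specifies.

```python
def allocate_by_count(day_counts, n_samples):
    """Split n_samples across days in proportion to each day's repository count.

    Uses largest-remainder rounding so the allocations sum to n_samples,
    capped by the total number of repositories available.
    """
    total = sum(day_counts.values())
    if total == 0:
        return {day: 0 for day in day_counts}
    n = min(n_samples, total)
    exact = {day: n * count / total for day, count in day_counts.items()}
    alloc = {day: int(share) for day, share in exact.items()}
    leftover = n - sum(alloc.values())
    # Hand the remaining slots to the days with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda d: exact[d] - alloc[d], reverse=True)
    for day in by_remainder[:leftover]:
        alloc[day] += 1
    return alloc


# Illustrative counts only; real counts would come from the API.
counts = {"2023-03-01": 1200, "2023-07-15": 300, "2021-11-02": 500}
allocation = allocate_by_count(counts, n_samples=100)
# allocation == {"2023-03-01": 60, "2023-07-15": 15, "2021-11-02": 25}
```

Days with more repositories contribute more samples, which keeps busy days from being under-represented relative to quiet ones.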

reporoulette.sample(method: str = 'temporal', n_samples: int = 50, token: str | None = None, **kwargs: Any) → dict[str, Any]

Sample repositories using the specified method.

Parameters:

method – Sampling method (‘id’, ‘temporal’, ‘archive’, or ‘bigquery’)
n_samples – Number of repositories to sample
token – GitHub Personal Access Token (not used for BigQuery)
**kwargs – Additional parameters specific to each sampler

Returns:

Dictionary with sampling results and stats

Raises:

ValueError – If an unknown sampling method is provided
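The dispatch-and-validate behaviour described above could look roughly like the sketch below. It does not import the library: `dispatch_sample` and the stub samplers are hypothetical, and of the keys in the returned dictionary only `'samples'` appears in the documented example; `'method'` is an assumption added for illustration.

```python
def dispatch_sample(method, samplers, n_samples=50, **kwargs):
    """Route a sampling request to the callable registered under `method`.

    `samplers` maps method names to callables (hypothetical stand-ins here).
    Raises ValueError for an unknown method name.
    """
    if method not in samplers:
        valid = ", ".join(sorted(samplers))
        raise ValueError(f"Unknown sampling method {method!r}; expected one of: {valid}")
    repos = samplers[method](n_samples=n_samples, **kwargs)
    return {"method": method, "samples": repos}


# Stub samplers that fabricate placeholder records instead of calling GitHub.
stub_samplers = {
    "id": lambda n_samples, **kw: [{"id": i} for i in range(n_samples)],
    "temporal": lambda n_samples, **kw: [{"day": i} for i in range(n_samples)],
}

results = dispatch_sample("temporal", stub_samplers, n_samples=3)

try:
    dispatch_sample("svn", stub_samplers)
except ValueError as err:
    message = str(err)
```

Callers that accept the method name from user input may want to catch `ValueError` in the same way and report the valid options.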

Quick Start

from reporoulette import sample

# Sample 10 repositories using temporal sampling
results = sample(method='temporal', n_samples=10)
print(f"Found {len(results['samples'])} repositories")
