RepoRoulette: Randomly Sample GitHub Repositories

A Python library for randomly sampling GitHub repositories using multiple methods:

  • ID-based sampling: Probes random repository IDs

  • Temporal sampling: Weighted sampling based on repository activity by time period

  • BigQuery sampling: Advanced querying using Google BigQuery’s GitHub dataset

  • GitHub Archive sampling: Event-based sampling from GitHub Archive files

Example

>>> from reporoulette import sample
>>> results = sample(method='temporal', n_samples=10)
>>> print(f"Found {len(results['samples'])} repositories")

class reporoulette.BigQuerySampler(credentials_path: str | None = None, project_id: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories using Google BigQuery’s GitHub dataset.

This sampler leverages the public GitHub dataset in Google BigQuery to efficiently sample repositories with complex criteria and at scale.

get_languages(repos: list[dict[str, Any]]) → dict[str, list[dict[str, Any]]]

Retrieve language information for a list of repositories.

sample(n_samples: int = 100, population: str = 'all', **kwargs: Any) → list[dict[str, Any]]

Sample repositories using BigQuery.

Parameters:
  • n_samples – Number of repositories to sample

  • population – Type of repository population to sample from (‘all’ or ‘active’)

  • **kwargs – Additional filtering criteria

Returns:

List of repository dictionaries

sample_active(n_samples: int = 100, created_after: str | datetime | None = None, created_before: str | datetime | None = None, languages: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories with recent commit activity.

sample_by_day(n_samples: int = 100, days_to_sample: int = 10, repos_per_day: int = 50, years_back: int = 10, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using a day-based approach with GitHub Archive tables.
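
As a rough illustration of what random sampling over the public dataset can look like, the sketch below only builds query text. It is not taken from BigQuerySampler: the helper name build_sample_query, the choice of the `bigquery-public-data.github_repos.languages` table, and the @languages query parameter are all assumptions made for this example.

```python
from typing import Optional

# Assumed table: one row per repository, with `repo_name` and a repeated
# `language` field of (name, bytes) records. The sampler itself may query
# different tables.
LANGUAGES_TABLE = "bigquery-public-data.github_repos.languages"


def build_sample_query(
    n_samples: int = 100, languages: Optional[list[str]] = None
) -> str:
    """Build a query that returns n_samples randomly ordered repository names.

    Hypothetical helper for illustration; it only assembles SQL text.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    lines = ["SELECT repo_name", f"FROM `{LANGUAGES_TABLE}`"]
    if languages:
        # The language list is passed as a query parameter, not interpolated.
        lines.append(
            "WHERE EXISTS (SELECT 1 FROM UNNEST(language) AS l "
            "WHERE l.name IN UNNEST(@languages))"
        )
    lines.append("ORDER BY RAND()")
    lines.append(f"LIMIT {int(n_samples)}")
    return "\n".join(lines)


query = build_sample_query(n_samples=5, languages=["Python", "Go"])
print(query)
```

Ordering by RAND() makes BigQuery read the whole table before applying the limit, so a real implementation may prefer a cheaper strategy.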

class reporoulette.GHArchiveSampler(token: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories by downloading and processing GH Archive files.

This sampler randomly selects days from GitHub’s event history, downloads the corresponding archive files, and extracts repository information.

gh_sampler(n_samples: int = 100, days_to_sample: int = 5, repos_per_day: int = 20, years_back: int = 10, event_types: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by downloading and processing a full day’s GH Archive files.

Parameters:
  • n_samples – Target number of repositories to sample

  • days_to_sample – Number of random days to sample

  • repos_per_day – Maximum repositories to sample per day

  • years_back – How many years to look back

  • event_types – Types of GitHub events to consider

  • **kwargs – Additional filters to apply

Returns:

List of repository data

sample(n_samples: int = 100, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using the GH Archive approach.

This is the implementation of the abstract method from BaseSampler, which delegates to the gh_sampler method with the provided parameters.

Parameters:
  • n_samples – Number of repositories to sample

  • **kwargs – Additional parameters to pass to gh_sampler

Returns:

List of repository data
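
The day-based approach described above can be sketched in three steps: pick random days, list that day's hourly archive files, and pull distinct repositories out of the event lines. This is a minimal sketch and not the sampler's own code. The function names are hypothetical, the hourly URL pattern is the one GH Archive publishes, and the event shape (a `type` field and a `repo` object with `id` and `name`) is assumed to be the current schema; older archives may differ.

```python
import json
import random
from datetime import date, timedelta
from typing import Iterable, Optional

# GH Archive publishes one gzipped JSON-lines file per hour (hour not zero-padded).
GH_ARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"


def random_days(n_days: int, years_back: int, today: date, rng: random.Random) -> list[date]:
    """Pick distinct random days from the years_back years before `today`."""
    span = 365 * years_back
    start = today - timedelta(days=span)
    offsets = rng.sample(range(span), min(n_days, span))
    return [start + timedelta(days=o) for o in offsets]


def hourly_urls(day: date) -> list[str]:
    """Return the 24 hourly archive URLs for one day."""
    return [GH_ARCHIVE_URL.format(day=day.isoformat(), hour=h) for h in range(24)]


def repos_from_events(
    lines: Iterable[str],
    event_types: Optional[list[str]] = None,
    limit: int = 20,
) -> list[dict]:
    """Collect up to `limit` distinct repositories from JSON event lines."""
    seen: dict[str, dict] = {}
    for line in lines:
        event = json.loads(line)
        if event_types and event.get("type") not in event_types:
            continue
        repo = event.get("repo") or {}
        name = repo.get("name")
        if name and name not in seen:
            seen[name] = {
                "id": repo.get("id"),
                "full_name": name,
                "event_type": event.get("type"),
            }
            if len(seen) >= limit:
                break
    return list(seen.values())


sample_lines = [
    '{"type": "PushEvent", "repo": {"id": 1, "name": "octo/a"}}',
    '{"type": "WatchEvent", "repo": {"id": 2, "name": "octo/b"}}',
    '{"type": "PushEvent", "repo": {"id": 1, "name": "octo/a"}}',
]
push_repos = repos_from_events(sample_lines, event_types=["PushEvent"])
urls = hourly_urls(date(2015, 1, 1))
days = random_days(5, years_back=10, today=date(2024, 1, 1), rng=random.Random(0))
```

Downloading and decompressing the files is left out so the sketch stays runnable offline; in practice each URL would be fetched and its lines fed to repos_from_events.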

class reporoulette.IDSampler(token: str | None = None, min_id: int = 1, max_id: int = 850000000, rate_limit_safety: int = 100, seed: int | None = None, log_level: int = 20, auto_discover_max: bool = False)

Bases: BaseSampler

Sample repositories using random ID probing.

This sampler generates random repository IDs within a specified range and attempts to retrieve repositories with those IDs from GitHub.

sample(n_samples: int = 10, min_wait: float = 0.1, max_attempts: int = 1000, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by trying random IDs.

Parameters:
  • n_samples – Number of valid repositories to collect

  • min_wait – Minimum wait time between API requests

  • max_attempts – Maximum number of IDs to try

  • **kwargs – Additional filters to apply

Returns:

List of repository data
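
The probing loop described above can be sketched as follows. This is an illustration of the technique and not IDSampler's implementation: probe_random_ids and the lookup callable are hypothetical names, and a stand-in function replaces the GitHub API call so the example runs offline. The parameter names mirror the documented ones.

```python
import random
from typing import Callable, Optional


def probe_random_ids(
    lookup: Callable[[int], Optional[dict]],
    n_samples: int = 10,
    min_id: int = 1,
    max_id: int = 850_000_000,
    max_attempts: int = 1000,
    seed: Optional[int] = None,
) -> list[dict]:
    """Draw random IDs and keep those for which `lookup` returns a repository."""
    rng = random.Random(seed)
    found: list[dict] = []
    tried: set[int] = set()
    attempts = 0
    while len(found) < n_samples and attempts < max_attempts:
        attempts += 1
        repo_id = rng.randint(min_id, max_id)  # inclusive on both ends
        if repo_id in tried:
            continue  # a repeated draw still counts as an attempt
        tried.add(repo_id)
        repo = lookup(repo_id)  # None stands for "no repository with this ID"
        if repo is not None:
            found.append(repo)
    return found


def fake_lookup(repo_id: int) -> Optional[dict]:
    """Stand-in for an API request: pretend only even IDs exist."""
    return {"id": repo_id} if repo_id % 2 == 0 else None


repos = probe_random_ids(fake_lookup, n_samples=5, min_id=1, max_id=1000, seed=42)
print([r["id"] for r in repos])
```

Because many IDs in the range do not resolve to an accessible repository, the number of attempts is usually larger than the number of samples, which is why max_attempts bounds the loop.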

class reporoulette.TemporalSampler(token: str | None = None, start_date: datetime | str | None = None, end_date: datetime | str | None = None, rate_limit_safety: int = 100, seed: int | None = None, years_back: int = 10, log_level: int = 20)

Bases: BaseSampler

Sample repositories by randomly selecting days and fetching repos updated in those periods.

This sampler selects random days within a specified date range, weights them by repository count, and retrieves repositories with proportional sampling.

sample(n_samples: int = 100, days_to_sample: int = 10, per_page: int = 100, min_wait: float = 1.0, min_stars: int = 0, min_size_kb: int = 0, language: str | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by randomly selecting days with weighting based on repo count.

Parameters:
  • n_samples – Target number of repositories to collect

  • days_to_sample – Number of random days to initially sample for count assessment

  • per_page – Number of results per page (max 100)

  • min_wait – Minimum wait time between API requests

  • min_stars – Minimum number of stars (0 for no filtering)

  • min_size_kb – Minimum repository size in KB (0 for no filtering)

  • language – Programming language to filter by

  • **kwargs – Additional filters to apply

Returns:

List of repository data
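
One way to read "weights them by repository count" is a proportional split of n_samples across the chosen days. The sketch below shows that idea with a largest-remainder allocation. It is an assumption about the approach, not TemporalSampler's code; pick_days and allocate are hypothetical helpers, and the per-day counts are made up for the example.

```python
import random
from datetime import date, timedelta


def pick_days(start: date, end: date, k: int, rng: random.Random) -> list[date]:
    """Pick up to k distinct random days in the inclusive range [start, end]."""
    span = (end - start).days + 1
    offsets = rng.sample(range(span), min(k, span))
    return sorted(start + timedelta(days=o) for o in offsets)


def allocate(counts: dict[date, int], n_samples: int) -> dict[date, int]:
    """Split n_samples across days in proportion to each day's repository count."""
    total = sum(counts.values())
    if total == 0:
        return {d: 0 for d in counts}
    exact = {d: n_samples * c / total for d, c in counts.items()}
    quota = {d: int(x) for d, x in exact.items()}  # round down first
    leftover = n_samples - sum(quota.values())
    # Hand the remaining samples to the days with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda d: exact[d] - quota[d], reverse=True)
    for d in by_remainder[:leftover]:
        quota[d] += 1
    return quota


rng = random.Random(7)
days = pick_days(date(2020, 1, 1), date(2020, 12, 31), k=3, rng=rng)

# Made-up repository counts for three days; a real run would query them.
d1, d2, d3 = date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)
quota = allocate({d1: 100, d2: 300, d3: 600}, n_samples=10)
print(quota)  # day counts 100/300/600 receive 1, 3 and 6 samples
```

Days with more matching repositories receive more of the sample, so busy days are not under-represented relative to quiet ones.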

reporoulette.sample(method: str = 'temporal', n_samples: int = 50, token: str | None = None, **kwargs: Any) → dict[str, Any]

Sample repositories using the specified method.

Parameters:
  • method – Sampling method (‘id’, ‘temporal’, ‘archive’, or ‘bigquery’)

  • n_samples – Number of repositories to sample

  • token – GitHub Personal Access Token (not used for BigQuery)

  • **kwargs – Additional parameters specific to each sampler

Returns:

Dictionary with sampling results and stats

Raises:

ValueError – If an unknown sampling method is provided
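
The dispatch behaviour described here, one entry point that selects a sampler by name and rejects unknown names, can be sketched as below. This is not the library's code: dispatch_sample and the stub samplers are hypothetical, and of the keys in the returned dictionary only 'samples' appears in the documented example; 'method' is added purely for illustration.

```python
from typing import Any, Callable

SamplerFn = Callable[..., list[dict[str, Any]]]


def dispatch_sample(
    method: str,
    n_samples: int,
    samplers: dict[str, SamplerFn],
    **kwargs: Any,
) -> dict[str, Any]:
    """Look up a sampler by name, run it, and wrap the result in a dictionary."""
    try:
        sampler = samplers[method]
    except KeyError:
        raise ValueError(f"Unknown sampling method: {method!r}") from None
    samples = sampler(n_samples, **kwargs)
    return {"method": method, "samples": samples}


def stub_sampler(n_samples: int, **kwargs: Any) -> list[dict[str, Any]]:
    """Stand-in for a real sampler: returns placeholder repository records."""
    return [{"id": i} for i in range(n_samples)]


# The four method names come from the parameter description above.
registry = {name: stub_sampler for name in ("id", "temporal", "archive", "bigquery")}

results = dispatch_sample("temporal", 3, registry)
print(f"Found {len(results['samples'])} repositories")  # prints "Found 3 repositories"
```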

RepoRoulette provides multiple methods for randomly sampling GitHub repositories:

  • ID-based sampling: Probes random repository IDs

  • Temporal sampling: Weighted sampling based on repository activity by time period

  • BigQuery sampling: Advanced querying using Google BigQuery’s GitHub dataset

  • GitHub Archive sampling: Event-based sampling from GitHub Archive files

Quick Start

from reporoulette import sample

# Sample 10 repositories using temporal sampling
results = sample(method='temporal', n_samples=10)
print(f"Found {len(results['samples'])} repositories")
