GitHub Archive Sampler

The GitHub Archive sampler fetches repositories by sampling events from GitHub Archive, which records the public GitHub timeline.

class reporoulette.GHArchiveSampler(token: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories by downloading and processing GH Archive files.

This sampler randomly selects days from GitHub’s event history, downloads the corresponding archive files, and extracts repository information.
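The day-selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper names pick_random_days and hourly_urls are hypothetical, and the sketch assumes the public GH Archive layout of one gzipped JSON file per hour at data.gharchive.org, with the hour written without zero padding.

```python
import random
from datetime import date, timedelta

# Assumed URL pattern for GH Archive hourly dumps (hour is 0-23, not zero-padded).
GHARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"


def pick_random_days(days_to_sample, years_back, seed=None, today=None):
    """Pick distinct random calendar days from roughly the last `years_back` years.

    Hypothetical helper: the real sampler may choose days differently.
    """
    rng = random.Random(seed)          # a seeded generator makes the draw reproducible
    today = today or date.today()
    span = 365 * years_back            # approximate window length in days
    offsets = rng.sample(range(1, span + 1), k=min(days_to_sample, span))
    return sorted(today - timedelta(days=offset) for offset in offsets)


def hourly_urls(day):
    """Build the 24 hourly archive URLs that together cover one full day."""
    return [GHARCHIVE_URL.format(day=day.isoformat(), hour=hour) for hour in range(24)]


if __name__ == "__main__":
    for day in pick_random_days(days_to_sample=2, years_back=10, seed=42):
        print(day, hourly_urls(day)[0])
```

Passing the same seed yields the same set of days, which is presumably the purpose of the seed argument on the constructor.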

gh_sampler(n_samples: int = 100, days_to_sample: int = 5, repos_per_day: int = 20, years_back: int = 10, event_types: list[str] | None = None, **kwargs: Any) -> list[dict[str, Any]]

Sample repositories by downloading and processing full days of GH Archive files.

Parameters:
  • n_samples – Target number of repositories to sample

  • days_to_sample – Number of random days to sample

  • repos_per_day – Maximum repositories to sample per day

  • years_back – How many years to look back

  • event_types – Types of GitHub events to consider

  • **kwargs – Additional filters to apply

Returns:

List of repository data
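To make the roles of event_types and repos_per_day concrete, here is a minimal sketch of how one day's events might be reduced to a capped list of repositories. It assumes the event format GH Archive has used since 2015 (one JSON object per line, with a "type" field and a "repo" object holding the "owner/name" string). The function name repos_from_events and the keys of the returned dictionaries are hypothetical; the fields the library actually returns are not documented here.

```python
import json
import random


def repos_from_events(lines, event_types=None, repos_per_day=20, seed=None):
    """Reduce one day's event lines to at most `repos_per_day` distinct repositories.

    Hypothetical helper; the output keys are illustrative only.
    """
    rng = random.Random(seed)
    wanted = set(event_types) if event_types else None  # None means "accept every type"
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                                    # skip blank lines
        event = json.loads(line)
        if wanted is not None and event.get("type") not in wanted:
            continue                                    # event type filtered out
        name = (event.get("repo") or {}).get("name")
        if name and name not in seen:                   # keep the first event per repository
            seen[name] = {
                "full_name": name,
                "event_type": event.get("type"),
                "created_at": event.get("created_at"),
            }
    names = sorted(seen)                                # stable order before the random draw
    chosen = rng.sample(names, k=min(repos_per_day, len(names)))
    return [seen[name] for name in chosen]


if __name__ == "__main__":
    demo = [
        '{"type": "PushEvent", "repo": {"name": "a/x"}}',
        '{"type": "WatchEvent", "repo": {"name": "b/y"}}',
    ]
    print(repos_from_events(demo, event_types=["PushEvent"]))
```

For a real archive file, the same function could be fed an open handle from gzip.open(path, "rt", encoding="utf-8"), since it only iterates over lines.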

sample(n_samples: int = 100, **kwargs: Any) -> list[dict[str, Any]]

Sample repositories using the GH Archive approach.

This is the implementation of the abstract method from BaseSampler, which delegates to the gh_sampler method with the provided parameters.

Parameters:
  • n_samples – Number of repositories to sample

  • **kwargs – Additional parameters to pass to gh_sampler

Returns:

List of repository data
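The delegation described for sample can be illustrated with a small stand-alone sketch. The classes below are hypothetical stand-ins, not the library's BaseSampler or GHArchiveSampler, and the "language" filter in the usage line is an invented example of an extra keyword argument.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchBaseSampler(ABC):
    """Stand-in for a base class that requires every sampler to implement sample()."""

    @abstractmethod
    def sample(self, n_samples: int = 100, **kwargs: Any) -> list[dict[str, Any]]:
        ...


class SketchArchiveSampler(SketchBaseSampler):
    """Stand-in sampler whose sample() simply forwards to gh_sampler()."""

    def gh_sampler(self, n_samples=100, days_to_sample=5, repos_per_day=20, **kwargs):
        # Placeholder body: echo the arguments so the forwarding is visible.
        return [{
            "n_samples": n_samples,
            "days_to_sample": days_to_sample,
            "repos_per_day": repos_per_day,
            "filters": kwargs,
        }]

    def sample(self, n_samples=100, **kwargs):
        # Named parameters bind to gh_sampler's signature; the rest arrive as filters.
        return self.gh_sampler(n_samples=n_samples, **kwargs)


if __name__ == "__main__":
    print(SketchArchiveSampler().sample(n_samples=10, days_to_sample=2, language="python"))
```

With this pattern, callers can tune days_to_sample or repos_per_day through sample without the base-class interface having to know about archive-specific options.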

Advantages

  • Free to use (no API tokens required)

  • Access to event-based data

  • Can sample based on specific event types

Disadvantages

  • Limited to repositories that had public event activity on the sampled days

  • May be slower due to processing compressed archives

  • Less control over sampling criteria

Usage Example

from reporoulette import GHArchiveSampler

# Direct usage: sample() forwards extra keyword arguments to gh_sampler()
sampler = GHArchiveSampler()
repos = sampler.sample(
    n_samples=10,
    days_to_sample=2,
    repos_per_day=5,
    event_types=["PushEvent", "CreateEvent"],
)

# Using convenience function
from reporoulette import sample
results = sample(method='archive', n_samples=10)