GitHub Archive Sampler

The GitHub Archive sampler fetches repositories by sampling events from GitHub Archive, which records the public GitHub timeline.

class reporoulette.GHArchiveSampler(token: str | None = None, seed: int | None = None, log_level: int = 20)

Bases: BaseSampler

Sample repositories by downloading and processing GH Archive files.

This sampler randomly selects days from GitHub’s event history, downloads the corresponding archive files, and extracts repository information.
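The day-selection step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `random_archive_urls` and its `today` parameter are hypothetical, and the look-back window is approximated as 365 days per year. The URL pattern is the one GH Archive publishes for its hourly files.

```python
import random
from datetime import date, timedelta


def random_archive_urls(days_to_sample=5, years_back=10, seed=None, today=None):
    """Pick random days in the look-back window and list their hourly archive URLs.

    Hypothetical helper for illustration; not part of reporoulette.
    """
    rng = random.Random(seed)
    today = today or date.today()
    window = 365 * years_back  # approximate look-back window, in days
    # Draw distinct offsets (1..window days ago) so no day is picked twice.
    offsets = rng.sample(range(1, window + 1), k=min(days_to_sample, window))
    urls = {}
    for offset in sorted(offsets):
        day = today - timedelta(days=offset)
        # GH Archive publishes one gzip-compressed JSON Lines file per hour;
        # the hour component of the file name is not zero-padded.
        urls[day] = [
            f"https://data.gharchive.org/{day:%Y-%m-%d}-{hour}.json.gz"
            for hour in range(24)
        ]
    return urls
```

Passing a fixed `seed` makes the chosen days reproducible, which mirrors the purpose of the `seed` argument in the class constructor.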

gh_sampler(n_samples: int = 100, days_to_sample: int = 5, repos_per_day: int = 20, years_back: int = 10, event_types: list[str] | None = None, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by downloading and processing full days of GH Archive files.

Parameters:

n_samples – Target number of repositories to sample
days_to_sample – Number of random days to sample
repos_per_day – Maximum repositories to sample per day
years_back – How many years to look back
event_types – Types of GitHub events to consider
**kwargs – Additional filters to apply

Returns:

List of repository data
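To illustrate how `event_types` and `repos_per_day` could interact, here is a rough sketch of extracting repositories from one archive file. It assumes the GH Archive event layout in which each JSON line carries a `type` field and a `repo` object with a `name`. The function `repos_from_archive` and the output keys `full_name` and `event_type` are hypothetical, chosen for this example only.

```python
import gzip
import json
import random


def repos_from_archive(raw_gz, event_types=None, repos_per_day=20, seed=None):
    """Collect unique repositories from one gzip-compressed JSON Lines archive.

    Hypothetical helper for illustration; not part of reporoulette.
    """
    wanted = set(event_types) if event_types else None
    seen = {}
    for line in gzip.decompress(raw_gz).splitlines():
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole file
        event_type = event.get("type")
        if wanted is not None and event_type not in wanted:
            continue
        name = (event.get("repo") or {}).get("name")
        # Keep only the first event seen for each repository.
        if name and name not in seen:
            seen[name] = {"full_name": name, "event_type": event_type}
    repos = list(seen.values())
    # Cap the result at repos_per_day with a seeded random draw.
    if len(repos) > repos_per_day:
        repos = random.Random(seed).sample(repos, repos_per_day)
    return repos
```

Deduplicating by repository name before applying the cap keeps very active repositories from crowding out quieter ones within a single file.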

sample(n_samples: int = 100, **kwargs: Any) → list[dict[str, Any]]

Sample repositories using the GH Archive approach.

This method implements the abstract sample method from BaseSampler by delegating to gh_sampler with the provided parameters.

Parameters:

n_samples – Number of repositories to sample
**kwargs – Additional parameters to pass to gh_sampler

Returns:

List of repository data
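The delegation described above can be pictured with a small stand-in. The classes `SketchBaseSampler` and `SketchArchiveSampler` below are hypothetical and do not download anything; they only show how keyword arguments given to `sample` would reach `gh_sampler`, with unrecognized ones collected as extra filters.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchBaseSampler(ABC):
    """Hypothetical stand-in for an abstract sampler base class."""

    @abstractmethod
    def sample(self, n_samples: int = 100, **kwargs: Any) -> list[dict[str, Any]]:
        ...


class SketchArchiveSampler(SketchBaseSampler):
    """Hypothetical stand-in that only echoes the arguments it receives."""

    def gh_sampler(self, n_samples=100, days_to_sample=5, repos_per_day=20, **kwargs):
        # Placeholder body: report what arrived instead of fetching archives.
        return [{
            "n_samples": n_samples,
            "days_to_sample": days_to_sample,
            "repos_per_day": repos_per_day,
            "filters": kwargs,
        }]

    def sample(self, n_samples=100, **kwargs):
        # Forward everything unchanged to the archive-specific method.
        return self.gh_sampler(n_samples=n_samples, **kwargs)
```

In this sketch, named parameters such as `days_to_sample` bind to their own arguments, while anything else (for example an assumed `language` filter) ends up in the catch-all `**kwargs`.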

Advantages

Free to use (no API tokens required)
Access to event-based data
Can sample based on specific event types

Disadvantages

Limited to repositories with public event activity on the sampled days
May be slower due to processing compressed archives
Less control over sampling criteria

Usage Example

from reporoulette import GHArchiveSampler

# Direct usage, with the parameters documented for gh_sampler
sampler = GHArchiveSampler()
repos = sampler.sample(
    n_samples=10,
    days_to_sample=2,
    repos_per_day=5,
    event_types=["PushEvent", "CreateEvent"],
)

# Using convenience function
from reporoulette import sample
results = sample(method='archive', n_samples=10)