ID Sampler

The ID-based sampler relies on GitHub’s sequential repository IDs: it draws random IDs from the valid ID range and probes each one, which yields a random sample of public repositories.

class reporoulette.IDSampler(token: str | None = None, min_id: int = 1, max_id: int = 850000000, rate_limit_safety: int = 100, seed: int | None = None, log_level: int = 20, auto_discover_max: bool = False)

Bases: BaseSampler

Sample repositories using random ID probing.

This sampler generates random repository IDs within a specified range and attempts to retrieve repositories with those IDs from GitHub.
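The probing loop described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `probe_random_ids` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to return repository data for an ID or `None` when the ID does not resolve to a public repository.

```python
import random
from typing import Any, Callable, Optional

Repo = dict[str, Any]


def probe_random_ids(
    fetch: Callable[[int], Optional[Repo]],
    n_samples: int,
    min_id: int = 1,
    max_id: int = 850_000_000,
    max_attempts: int = 1000,
    seed: Optional[int] = None,
) -> list[Repo]:
    """Collect up to n_samples repositories by probing random IDs.

    `fetch` is a hypothetical callable (an assumption of this sketch):
    it returns a repository dict for a valid ID, or None otherwise.
    """
    rng = random.Random(seed)  # seeded generator for reproducible draws
    found: list[Repo] = []
    tried: set[int] = set()
    attempts = 0
    while len(found) < n_samples and attempts < max_attempts:
        attempts += 1
        repo_id = rng.randint(min_id, max_id)  # inclusive on both ends
        if repo_id in tried:
            continue  # do not probe the same ID twice
        tried.add(repo_id)
        repo = fetch(repo_id)
        if repo is not None:
            found.append(repo)
    return found
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; in practice it would wrap an authenticated request to GitHub.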

sample(n_samples: int = 10, min_wait: float = 0.1, max_attempts: int = 1000, **kwargs: Any) → list[dict[str, Any]]

Sample repositories by trying random IDs.

Parameters:
  • n_samples – Number of valid repositories to collect

  • min_wait – Minimum wait time between API requests

  • max_attempts – Maximum number of IDs to try

  • **kwargs – Additional filters to apply

Returns:

List of dictionaries, each holding the data for one sampled repository
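Because not every probed ID resolves to a repository, max_attempts usually needs to be larger than n_samples. A rough way to size it, assuming each probe succeeds independently with some hit rate, is to divide n_samples by that rate and add headroom. The helper below is a hypothetical sketch; `estimate_attempts`, `hit_rate`, and `safety_factor` are not part of reporoulette, and the hit rate has to be measured or guessed by the caller.

```python
import math


def estimate_attempts(n_samples: int, hit_rate: float, safety_factor: float = 1.5) -> int:
    """Estimate how many IDs to try in order to collect n_samples hits.

    With an independent per-probe success probability `hit_rate`, the
    expected number of probes is n_samples / hit_rate. `safety_factor`
    adds headroom for unlucky runs. Both values are caller assumptions.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    return math.ceil(safety_factor * n_samples / hit_rate)
```

The result can be passed as the max_attempts argument of sample(); if the real hit rate is lower than assumed, the call may return fewer than n_samples repositories.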

Advantages

  • Truly random sampling across all public repositories

  • Simple and straightforward approach

  • Good for unbiased statistical sampling

Disadvantages

  • Low hit rate, because many IDs in the range belong to private or deleted repositories

  • Any filtering must be done after sampling

  • Limited by GitHub API rate limits
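Since filtering happens after sampling, one option is to post-process the returned list. The sketch below assumes each repository dictionary carries the `language` and `stargazers_count` keys used by the GitHub REST API repository object; whether reporoulette's results expose exactly these keys is an assumption, and `filter_repos` is a hypothetical helper.

```python
from typing import Any, Optional


def filter_repos(
    repos: list[dict[str, Any]],
    language: Optional[str] = None,
    min_stars: int = 0,
) -> list[dict[str, Any]]:
    """Keep repositories that match a language and a minimum star count.

    Assumes GitHub-style keys ("language", "stargazers_count"); missing
    or None values are treated as no language and zero stars.
    """
    kept = []
    for repo in repos:
        if language is not None and repo.get("language") != language:
            continue
        if (repo.get("stargazers_count") or 0) < min_stars:
            continue
        kept.append(repo)
    return kept
```

Filtering shrinks the sample, so it can help to request more repositories than ultimately needed and then filter down.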

Usage Example

from reporoulette import IDSampler

# Direct usage
sampler = IDSampler(token="your_github_token")
repos = sampler.sample(n_samples=10)

# Using convenience function
from reporoulette import sample
results = sample(method='id', n_samples=10)