ID Sampler
The ID-based sampler relies on GitHub’s sequential repository IDs: it draws IDs at random from the valid ID range and probes each one, so every reachable repository in that range is equally likely to be selected.
- class reporoulette.IDSampler(token: str | None = None, min_id: int = 1, max_id: int = 850000000, rate_limit_safety: int = 100, seed: int | None = None, log_level: int = 20, auto_discover_max: bool = False)
Bases: BaseSampler
Sample repositories using random ID probing.
This sampler generates random repository IDs within a specified range and attempts to retrieve repositories with those IDs from GitHub.
- sample(n_samples: int = 10, min_wait: float = 0.1, max_attempts: int = 1000, **kwargs: Any) → list[dict[str, Any]]
Sample repositories by trying random IDs.
- Parameters:
n_samples – Number of valid repositories to collect
min_wait – Minimum wait time between API requests
max_attempts – Maximum number of IDs to try
**kwargs – Additional filters to apply
- Returns:
List of repository data
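The probing loop that these parameters control can be sketched roughly as follows. This is a minimal illustration and not the library's actual implementation: `probe_random_ids` and the injected `fetch` callable are hypothetical names, and `fetch` stands in for whatever API call resolves a repository ID (returning `None` when the ID does not resolve).

```python
import random
import time
from typing import Any, Callable, Optional


def probe_random_ids(
    fetch: Callable[[int], Optional[dict[str, Any]]],
    n_samples: int = 10,
    min_id: int = 1,
    max_id: int = 850_000_000,
    max_attempts: int = 1000,
    min_wait: float = 0.1,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Collect up to n_samples repositories by probing random IDs.

    `fetch` is a hypothetical callable: it returns repository data for an
    ID that resolves and None for one that does not.
    """
    rng = random.Random(seed)        # seeded generator for reproducible draws
    found: list[dict[str, Any]] = []
    tried: set[int] = set()

    for _ in range(max_attempts):    # every draw counts against the budget
        if len(found) >= n_samples:
            break
        repo_id = rng.randint(min_id, max_id)  # inclusive on both ends
        if repo_id in tried:
            continue                 # do not spend a request on a repeated ID
        tried.add(repo_id)
        repo = fetch(repo_id)
        if repo is not None:
            found.append(repo)
        time.sleep(min_wait)         # pause between requests
    return found
```

Passing the fetch function in as an argument keeps the sketch runnable without network access; in practice it would wrap an authenticated HTTP request. Note that the loop may return fewer than `n_samples` items if `max_attempts` is exhausted first.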
Advantages
Truly random sampling across all public repositories
Simple and straightforward approach
Good for unbiased statistical sampling
Disadvantages
Low hit rate due to many invalid IDs (private/deleted repos)
Any filtering must be done after sampling
Limited by GitHub API rate limits
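The first two points can be made concrete with a short sketch. If a fraction `hit_rate` of probed IDs resolves, collecting `n_samples` repositories takes about `n_samples / hit_rate` probes on average, which helps when choosing `max_attempts`. The helper names below are hypothetical, the hit rate in the usage note is illustrative rather than measured, and the `language` key is an assumption about the returned dictionaries (it mirrors the field of that name in GitHub's REST API repository objects).

```python
import math
from typing import Any


def attempts_needed(n_samples: int, hit_rate: float) -> int:
    """Expected number of ID probes needed to collect n_samples valid repos."""
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    return math.ceil(n_samples / hit_rate)


def filter_by_language(
    repos: list[dict[str, Any]], language: str
) -> list[dict[str, Any]]:
    """Post-sampling filter; assumes each dict may carry a 'language' key."""
    return [r for r in repos if r.get("language") == language]
```

For example, `attempts_needed(10, 0.25)` returns 40: at an assumed 25% hit rate, ten valid repositories cost about forty probes. Filtering afterwards shrinks the result further, so the sample size requested up front may need to be larger than the number of repositories ultimately wanted.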
Usage Example
from reporoulette import IDSampler
# Direct usage
sampler = IDSampler(token="your_github_token")
repos = sampler.sample(n_samples=10)
# Using convenience function
from reporoulette import sample
results = sample(method='id', n_samples=10)