BloomJoin: Bloom Filter Based Joins
BloomJoin provides an alternative join implementation for R that uses an actual Bloom filter, implemented in C++ via Rcpp, to optimize the performance of joins between data frames. Traditional joins in R can be inefficient when dealing with large datasets, especially when one table is significantly larger than the other and the join key selectivity is low.
Documentation
The full pkgdown site, including function reference, articles, and release notes, is published automatically to GitHub Pages whenever changes are pushed to the main branch. You can browse it at https://gojiplus.github.io/bloomjoin/. To rebuild the site locally run:
pkgdown::build_site()
Usage
library(bloomjoin)
# Basic usage
result <- bloom_join(df1, df2, by = "id", type = "inner")
# With multiple join columns
result <- bloom_join(df1, df2, by = c("id", "date"), type = "left")
# With performance tuning parameters
result <- bloom_join(df1, df2,
by = "id",
type = "inner",
engine = "bloom",
prefilter_side = "auto",
fpr = 0.001,
n_hint = list(y = 50000),
verbose = TRUE)
Proof of Correctness
library(dplyr)
library(tibble)
set.seed(123)
left <- tibble(id = sample(1:5000, 4000), value_left = rnorm(4000))
right <- tibble(id = sample(1:5000, 1500), value_right = rnorm(1500))
bloom <- bloom_join(left, right, by = "id") %>% arrange(id)
reference <- inner_join(left, right, by = "id") %>% arrange(id)
stopifnot(identical(bloom, reference))
Benchmarking Time and Memory
library(bench)
bench::mark(
bloom_join = bloom_join(left, right, by = "id"),
dplyr_join = inner_join(left, right, by = "id"),
iterations = 5,
check = FALSE
)
For a ready-to-run demonstration that generates data, verifies correctness, and prints benchmark summaries, execute:
How It Works
BloomJoin uses a Bloom filter pre-processing step to optimize joins:
- Sample the join keys to estimate distinct counts and decide which table to pre-filter.
- Build a Bloom filter (using a cache-friendly bitset stored in C++) from the chosen build-side join keys.
- Probe the filter with the opposing table to remove rows that cannot match before delegating to the appropriate
dplyr
verb.
Because Bloom filters may produce false positives but never false negatives, this pre-filtering step safely reduces the number of rows that participate in the expensive join while preserving all possible matches. The underlying Bloom filter implementation is provided by a compiled Rcpp module for performance.
Performance Benchmarks
See here