Performance Analysis of Bloom Filter Joins
Gaurav Sood
2026-04-16
Source:vignettes/benchmarking-bloomjoin.Rmd
benchmarking-bloomjoin.RmdOverview
This vignette compares bloomjoin against standard
dplyr joins on two dimensions: speed and
memory. Bloom joins excel when tables are asymmetric
(large left, small right) with low overlap.
Correctness
First, verify that bloom_join() produces identical
results to dplyr:
dat <- generate_data(5000, 500, 0.1)
bloom_result <- bloom_join(dat$x, dat$y, by = "id") |> arrange(id)
dplyr_result <- inner_join(dat$x, dat$y, by = "id") |> arrange(id)
# Strip bloom metadata for comparison
bloom_cmp <- as.data.frame(bloom_result)
attr(bloom_cmp, "bloom_metadata") <- NULL
all.equal(bloom_cmp, as.data.frame(dplyr_result), check.attributes = FALSE)
#> [1] TRUESpeed and Memory Benchmarks
We test across scenarios varying table asymmetry and overlap.
speed_ratio and mem_ratio > 1 means bloom
wins.
scenarios <- list(
c(1e6, 1e4, 0.01),
c(1e6, 1e4, 0.05),
c(5e5, 5e3, 0.02),
c(5e5, 5e3, 0.10),
c(2e5, 2e4, 0.05),
c(2e5, 2e4, 0.25),
c(1e5, 1e5, 0.10),
c(1e5, 1e5, 0.50)
)
results <- do.call(rbind, lapply(scenarios, function(s) {
run_bench(s[1], s[2], s[3], reps = 3)
}))
kable(results,
col.names = c("n_x", "n_y", "overlap", "speed", "memory", "reduction"),
align = "rrrrrr")| n_x | n_y | overlap | speed | memory | reduction |
|---|---|---|---|---|---|
| 1,000,000 | 10,000 | 1% | 2.05 | 2.19 | 98% |
| 1,000,000 | 10,000 | 5% | 1.80 | 1.96 | 94% |
| 500,000 | 5,000 | 2% | 2.42 | 1.91 | 97% |
| 500,000 | 5,000 | 10% | 1.43 | 1.59 | 89% |
| 200,000 | 20,000 | 5% | 1.07 | 1.18 | 94% |
| 200,000 | 20,000 | 25% | 0.80 | 0.92 | 74% |
| 100,000 | 100,000 | 10% | 0.38 | 0.50 | 89% |
| 100,000 | 100,000 | 50% | 0.42 | 0.43 | 49% |
Interpretation
- speed, memory > 1: bloom wins (dplyr_time / bloom_time)
- reduction: % of rows filtered out before join — explains memory savings
High reduction means fewer rows held in memory during the join. When reduction is low, Bloom filter overhead dominates and dplyr wins.
When to Use Bloom Joins
| Condition | Bloom benefit |
|---|---|
| Large x, small y (10:1+) | Strong |
| Low overlap (<25%) | Strong |
| Equal-sized tables | None (use dplyr) |
| High overlap (>50%) | None (use dplyr) |
Tuning
The fpr parameter controls the false positive rate.
Lower values reduce false positives but increase filter size:
dat <- generate_data(50000, 5000, 0.05)
fpr_results <- do.call(rbind, lapply(c(0.001, 0.01, 0.1), function(fpr) {
result <- bench::mark(
bloom_join(dat$x, dat$y, by = "id", fpr = fpr),
iterations = 3,
check = FALSE
)
data.frame(
fpr = fpr,
median_ms = round(as.numeric(result$median) * 1000, 1),
mem_mb = round(as.numeric(result$mem_alloc) / 1024^2, 1)
)
}))
kable(fpr_results, col.names = c("FPR", "Time (ms)", "Memory (MB)"))| FPR | Time (ms) | Memory (MB) |
|---|---|---|
| 0.001 | 4.4 | 3.3 |
| 0.010 | 4.5 | 3.3 |
| 0.100 | 4.9 | 3.5 |
Default fpr = 0.01 balances speed and memory well.