Efficient Data Frame Joins Using Bloom Filters

bloomjoin

Faster, memory-efficient joins when joining a large table to a small lookup table.

bloomjoin helps when:
- Large table joined to small table (10:1 ratio or more)
- Low overlap between join keys (<25%)
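The two rules of thumb above can be checked before choosing a join. The sketch below is illustrative only: join_profile is a hypothetical helper, not part of bloomjoin, and it assumes that "overlap" means the share of rows in the large table whose key also appears in the small table, which this page does not define precisely.

```r
# Hypothetical helper (not part of bloomjoin): applies the two rules of thumb.
join_profile <- function(x_keys, y_keys) {
  # Ratio of large-table rows to small-table rows.
  size_ratio <- length(x_keys) / length(y_keys)
  # Assumed definition of overlap: share of large-table rows with a matching key.
  overlap <- mean(x_keys %in% y_keys)
  list(
    size_ratio = size_ratio,
    overlap = overlap,
    likely_helps = size_ratio >= 10 && overlap < 0.25
  )
}

# 1,000 keys in the large table, 50 in the small one, 10 of them shared.
profile <- join_profile(x_keys = 1:1000, y_keys = c(1:10, 2001:2040))
print(profile)
```

Here the size ratio is 20 and the overlap is 0.01, so both conditions hold. Treat the thresholds as guidance from the benchmarks on this page, not as hard limits.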

n_x	n_y	overlap	speed	memory
1,000,000	10,000	1%	2.0x	2.2x
1,000,000	10,000	5%	1.6x	2.0x
500,000	5,000	2%	1.7x	1.9x
200,000	20,000	5%	1.2x	1.2x

Ratios are relative to dplyr: a speed value above 1 means bloomjoin is faster, and a memory value above 1 means bloomjoin uses less memory.

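# Install the development version from GitHub (requires the devtools package)
devtools::install_github("gojiplus/bloomjoin")

library(bloomjoin)

# Join the large table to the small lookup table on the shared "id" column
result <- bloom_join(large_df, small_lookup, by = "id")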

The call mirrors dplyr's join syntax. Select the join kind with the type argument: "inner", "left", "right", "semi", or "anti".

n_x	n_y	overlap	speed	memory
100,000	100,000	10%	0.4x	0.5x
100,000	100,000	50%	0.4x	0.4x

Values < 1 mean dplyr is faster and uses less memory, so prefer dplyr when the two tables are similar in size.

Bloom filters have no false negatives, so no matches are lost.
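To make the no-false-negatives property concrete, here is a toy Bloom filter in base R used to pre-filter a large table before an ordinary join. It is a minimal sketch under stated assumptions, not bloomjoin's implementation: hash_key, bloom_build, and bloom_might_contain are hypothetical names, the hash is a simple polynomial hash, and the filter size m and hash count k are arbitrary. Keys are assumed to have the same type in both tables.

```r
# Toy Bloom filter (illustration only; bloomjoin's internals may differ).

hash_key <- function(key, seed, m) {
  # Polynomial rolling hash over the key's characters, seeded per hash function.
  h <- seed
  for (code in utf8ToInt(as.character(key))) {
    h <- (h * 131 + code) %% m
  }
  h + 1  # convert to a 1-based bit position
}

bloom_positions <- function(key, m, k) {
  # One bit position per hash function (seeds 1..k).
  vapply(seq_len(k), function(seed) hash_key(key, seed, m), numeric(1))
}

bloom_build <- function(keys, m = 8191, k = 3) {
  bits <- logical(m)
  for (key in unique(keys)) {
    bits[bloom_positions(key, m, k)] <- TRUE
  }
  list(bits = bits, m = m, k = k)
}

bloom_might_contain <- function(filter, keys) {
  # TRUE means "possibly present"; FALSE means "definitely absent".
  vapply(keys, function(key) {
    all(filter$bits[bloom_positions(key, filter$m, filter$k)])
  }, logical(1), USE.NAMES = FALSE)
}

set.seed(1)
large_df <- data.frame(id = 1:2000, value = rnorm(2000))
small_lookup <- data.frame(id = c(5L, 50L, 500L, 9999L),
                           label = c("a", "b", "c", "d"))

filter <- bloom_build(small_lookup$id)
keep <- bloom_might_contain(filter, large_df$id)

# Pre-filter the large table, then run an exact join on what remains.
filtered <- merge(large_df[keep, ], small_lookup, by = "id")
direct <- merge(large_df, small_lookup, by = "id")

# Every inserted key is reported as possibly present (no false negatives),
# so the pre-filtered join returns the same rows as the direct join.
stopifnot(all(bloom_might_contain(filter, small_lookup$id)))
stopifnot(isTRUE(all.equal(filtered, direct)))
```

The filter may also let through some rows whose keys are not in the lookup table (false positives). In this sketch the exact join that follows discards them, so they cost time but do not change the result.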

License: MIT