Abstract:
Bloom filters are space-efficient probabilistic data structures that are used to test
whether an element is a member of a set, and may return false positives. Recently,
variations referred to as learned Bloom filters were developed that can provide
improved performance in terms of the rate of false positives, by using a learned
model for the represented set. However, previous methods for learned Bloom filters
do not take full advantage of the learned model. Here we show how to frame the
problem of optimal model utilization as an optimization problem, and using our
framework derive algorithms that can achieve near-optimal performance in many
cases. Experimental results from both simulated and real-world datasets show
significant performance improvements from our optimization approach over both
the original learned Bloom filter constructions and previously proposed heuristic
improvements.