When my friends and I were undergrads (3rd year, I believe), we had an absolute blast exploring this exact topic - the intersection of Bloom filters and client-side searching. So much so that it became part of our undergrad thesis.
It all started when Stavros's blog post was circulated on Hacker News! We approached the search part using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - from a paper by Saar Cohen and Yossi Matias in the early 2000s; it's basically an iteration on counting Bloom filters. We used the minimal selection and minimal increase algorithms from the paper for insertion and ranking of results.
I wrote a blog post on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
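For anyone curious, here's the flavor of it as a toy Python sketch (counters instead of bits, with the paper's minimal increase heuristic on insert and the minimum of the counters as the frequency estimate used for ranking) - illustrative only, not our thesis code:

```python
import hashlib

class SpectralBloomFilter:
    """Toy Spectral Bloom Filter: an array of counters instead of bits."""

    def __init__(self, m=10007, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # Derive k positions from one digest (simplified; real
        # implementations use k independent hash functions).
        d = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        # Minimal increase: only bump the counters currently at the
        # minimum, which keeps overestimation errors down.
        pos = self._positions(item)
        low = min(self.counters[p] for p in pos)
        for p in pos:
            if self.counters[p] == low:
                self.counters[p] += 1

    def estimate(self, item):
        # Minimum selection: the smallest counter is the estimate
        # (it can overestimate, but never underestimates).
        return min(self.counters[p] for p in self._positions(item))

sbf = SpectralBloomFilter()
for term in ["bloom", "bloom", "filter"]:
    sbf.insert(term)
print(sbf.estimate("bloom"), sbf.estimate("filter"))  # 2 1
```

The frequency estimates are what let you rank results instead of just testing membership.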
When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter indexer using MurmurHash2 functions, and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers: querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, nearly a 30-fold increase.
However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007), and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
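For anyone who wants to check the arithmetic, in Python:

```python
from math import exp

k, m, n = 4, 10007, 1000  # hash functions, bits per filter, records per filter
fp = (1 - exp(-k * n / m)) ** k
print(f"{fp:.4f}")  # 0.0118, i.e. about 1.18%
```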
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like how CPU branch misprediction affects performance but not correctness.
A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of just about 5 seconds, or for larger queries, reducing a 30-minute wait to about 1 minute.
I just noticed the m = 10007 value in my comment above and I thought I should clarify it. The number of bits per bloom filter does not need to be a prime number if the hash functions have uniform distribution. Murmur2 hash functions do have uniform distribution, so m was not chosen to be prime in order to reduce collisions in the Bloom filter's bit positions. The reason for using a prime value was more mundane.
This was a fairly large project, roughly 3 million lines of C and C++, with numerous special-value constants defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007, so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Ha, collision avoidance all the way down
Interesting trick about the constant value, and thank you for the detailed write up!
Splunk uses bloom filters to make searching for rare events fast. Rare events are usually the most interesting.
I’ve only used Splunk with one set of devs and maybe we were doing it wrong, but it didn’t feel fast to me.
Several of us were working hard to move everything into Prometheus that made any sense to be in Prometheus instead of Splunk.
Notably, any time we had a production issue where it was unclear which team was responsible, Splunk became the bottleneck because we started exceeding quotas immediately.
Splunk is some of the best software I've ever used, but it HAS to be used with very fast storage to be effective. I've only used it on enterprise-grade storage arrays and servers with lots of RAM for caches. On modern PCIe 5.0 NVMe drives it is stupid fast.
I'm not sure what you mean by exceed quotas because Splunk is normally licensed on GB ingested per day. This can lead to bitter fights between teams over how this is allocated.
The good thing about this license model is that you can use as much hardware as you want for no extra license cost.
> used with very fast storage
That sounds like self-hosting, which is not the only product they offer. But you still have hardware that can only run so many queries at once, and then it starts queuing any additional requests, yeah? Once we had a dozen people on a call, it went to shit. We only occasionally ran into problems like this with Graphite, but you need a lot of people looking at a very large dashboard to start feeling refresh delays.
I'm not an expert at all; I learned these exist after making something similar to search a few thousand blog posts.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, etc.). You could of course do 3 or 4 letters too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on the files.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw, om.raw.
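Something like this sketch in Python, using in-memory integers as the bitsets rather than .raw files on disk (illustrative, not the actual code):

```python
def bigrams(text):
    # "bloom" -> {"bl", "lo", "oo", "om"}
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(records):
    # One bitset per bigram: bit i is set if record i contains it.
    # (The on-disk version stores each bitset as aa.raw, ab.raw, ...)
    index = {}
    for rec_id, text in enumerate(records):
        for bg in bigrams(text.lower()):
            index[bg] = index.get(bg, 0) | (1 << rec_id)
    return index

def search(index, query):
    # AND together the bitsets of every bigram in the query.
    result = None
    for bg in bigrams(query.lower()):
        bits = index.get(bg, 0)
        result = bits if result is None else result & bits
    if not result:
        return []
    # Candidates only: a record can contain all the bigrams without
    # containing the whole query, so verify against the real data.
    return [i for i in range(result.bit_length()) if result >> i & 1]

records = ["bloom filters are neat", "inverted indexes", "full text search"]
print(search(build_index(records), "bloom"))  # [0]
```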
The index is really hard to update but adding new records is easy. New records are first added as false positives until we have enough bits to push a byte to the end of each raw file.
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they had only unique characters.
Characters (or other bytes) could be combined like s=z, k=c, x=y or i=e
Calculating which combination is best for the static data set was too hard for me. One could look at the ratio to see if a letter combination is worth having.
But it works, and loading a handful of files or doing an AND is amazingly fast.
What you've described here is an n-gram inverted index (with n = 2) represented as a bitset. We could call it a bigram bitset inverted index. Glad to know you designed and implemented all of this from first principles, and that it serves you well!
I had a problem where we needed to compare large data sets between machines for keys that existed in both, and the bandwidth cost just wasn’t mathing for the median result set size. I was trying to figure out how to send a fingerprint from machine A to B, then have machine B send the hits back. Or how many round trips I could do based on set size to minimize bandwidth + latency. I ended up with a calculus problem nobody could help me solve because of an n^5 term.
My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.
May be true for offline full text search, but not true for online string search.
I invented a very fast string search algorithm based on bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
The "no sharing between filters" insight clicked for me on a different problem.
I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.
Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.
TFA's mistake is using bloom filters *instead of* an inverted index rather than on top of one. The amortization patterns stack; they don't compete.
Why do these “inverted indexes” just look like indexes to me? Too much time with databases perhaps?
I believe you could do this effectively with COBS (COmpact Bit Sliced signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...
It's a pretty neat algorithm from a 2019 paper, with the stated application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of bloom filters built for documents and combine them so that a single index tells you which docs a term maps to. Like an inverted index meets a bloom filter.
I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).
There's also code online here: https://github.com/bingmann/cobs
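The core trick is easy to sketch: build a small bloom filter per document, then store the filters column-wise ("bit-sliced"), so a term lookup reads just k slices and ANDs them into a bitmap of candidate documents. A toy Python version (not the actual COBS code, which among other things packs documents into size-based groups to stay compact):

```python
import hashlib

M, K = 64, 3  # filter width in bits and hash count (toy parameters)

def positions(term):
    d = hashlib.sha256(term.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def build(docs):
    # slices[p] is a bitmap over documents: bit j is set if doc j's
    # bloom filter has bit p set (the bit-sliced transposition).
    slices = [0] * M
    for j, terms in enumerate(docs):
        for t in terms:
            for p in positions(t):
                slices[p] |= 1 << j
    return slices

def query(slices, term):
    # AND K slices together; surviving bits mark candidate docs
    # (false positives possible, false negatives impossible).
    result = -1  # all ones
    for p in positions(term):
        result &= slices[p]
    return [j for j in range(result.bit_length()) if result >> j & 1]

docs = [{"bloom", "filter"}, {"inverted", "index"}, {"bloom", "search"}]
print(query(build(docs), "bloom"))  # [0, 2], barring false positives
```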
Bing uses bloom filters for the most-recent index:
I will forever think of Bloom filters as "bouncer filters." "Concierge filters" could also work. Basically, it is the equivalent of every movie where the detective asks the front desk about various attributes of whoever they are looking for.
It is not hard to see how you could start asking the front desk to track every obscure attribute, and how that would start to fall over for various reasons.
> Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild.
I thought that was going to be a link to Google.com
Reminds me of @danthegoodman's project: bloomsearch [1]
is there a better way than bloom filters to handle needle-in-the-haystack type searches, where the haystack might be terabytes of data and you only want a few lines?
There are a lot of "better than Bloom" filters that work similarly in some aspects. I have used Cuckoo [1] and Ribbon [2] filters for Bloom-type applications. If you have an application where you do a lot of one kind of searching, it may also be worth implementing a specialized variant of a data structure. I needed a Cuckoo-type filter on the JVM but only for 64-bit integers, and I was able to make a smaller, faster code base specialized to that data type instead of handling generic objects.
You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.
[1] https://en.wikipedia.org/wiki/Cuckoo_filter
[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...
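For a sense of how cuckoo filters work, here is a toy Python version for integer keys (a sketch of standard partial-key cuckoo hashing, not the JVM implementation mentioned above):

```python
import random

class CuckooFilter:
    # Buckets of 4 short fingerprints; each key has two candidate
    # buckets linked by XOR, so a fingerprint's alternate bucket can
    # be computed without knowing the original key.
    def __init__(self, num_buckets=1 << 16, bucket_size=4, fp_bits=8):
        self.n = num_buckets            # must be a power of two
        self.b = bucket_size
        self.fp_mask = (1 << fp_bits) - 1
        self.buckets = [[] for _ in range(self.n)]

    def _mix(self, x):
        # 64-bit integer mixer (MurmurHash3-finalizer style).
        x = (x ^ (x >> 33)) * 0xff51afd7ed558ccd & 0xffffffffffffffff
        return x ^ (x >> 33)

    def _fingerprint(self, key):
        h = self._mix(key)
        fp = (h & self.fp_mask) or 1    # reserve 0 for "empty"
        i1 = (h >> 32) % self.n
        i2 = (i1 ^ self._mix(fp)) % self.n
        return fp, i1, i2

    def insert(self, key, max_kicks=500):
        fp, i1, i2 = self._fingerprint(key)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))     # both full: start evicting
        for _ in range(max_kicks):
            j = random.randrange(self.b)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._mix(fp)) % self.n  # evictee's other bucket
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        return False                    # too full; real code resizes

    def __contains__(self, key):
        fp, i1, i2 = self._fingerprint(key)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

cf = CuckooFilter()
cf.insert(42)
print(42 in cf, 43 in cf)  # True False (barring a false positive)
```

Unlike a plain bloom filter, this layout also supports deletion (drop the fingerprint from either bucket), which matters for the dynamic-update question above.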
I wonder how often in the wild people are tuning for a 1% false-positive rate versus a much lower one, like 0.1%. You quickly reach data set sizes where even 1% puts some strain on resources or responsiveness.
Cuckoo claims 70% of the size of bloom for the same error rate, and the space is logarithmic in the error rate. Looks like about 6.6 bits per record versus 9.56 bits for bloom at 1%. But at a .5% error rate a cuckoo is 7.6 bpr. In fact, you can get to about a .13% error rate for a cuckoo only a hair larger than the equivalent bloom filter (2^9.567 ≈ 758.5).
thanks.. i'll read up on these.. it always amazes me that companies like datadog somehow made log search quick
The key insight about bloom filters lacking synergy is excellent. The ~7K document crossover point makes sense because inverted indexes amortize dictionary storage across all documents, while bloom filters must encode it linearly per document.
But doesn’t that depend on the cardinality of the indexes versus the document count? I’ve seen systems with a stupid number of tag values.
the beautiful thing about bloom filters is they let you say "definitely not here" without checking everything. that asymmetry is weirdly powerful for specific problems.
I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. false positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
the trick is recognizing when false positives don't hurt you. most engineers learn about them in theory but never find that practical use case where they're actually the right tool. same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
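that pattern is tiny to implement. a sketch, where db_lookup stands in for whatever the real (expensive) check is:

```python
import hashlib

class Bloom:
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _pos(self, key):
        d = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, key):
        for p in self._pos(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._pos(key))

seen = Bloom()

def already_processed(event_id, db_lookup):
    if event_id not in seen:
        return False            # "definitely not seen": no DB query
    return db_lookup(event_id)  # "maybe seen": verify; a false
                                # positive just costs one extra query

def mark_processed(event_id):
    seen.add(event_id)
```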
Exactly. A 1.2% false positive rate means unnecessary reads 1.2% of the time vs 100% without the filter. Even at 10% FP rate, you skip 90% of I/O.
This asymmetry works great for I/O-bound workloads (skip-indexes) but fails for TFA's approach where every document needs its own filter.
In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.
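Roughly this shape, as a toy illustration (all parameters and names invented for the example, not any particular system's code):

```python
import hashlib

M, K = 1024, 3  # per-chunk filter size in bits and hash count (made up)

def _pos(term):
    d = hashlib.sha256(term.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def chunk_filter(terms):
    # One small bloom filter summarizing the terms in a chunk.
    bits = 0
    for t in terms:
        for p in _pos(t):
            bits |= 1 << p
    return bits

def maybe_in(bits, term):
    return all(bits >> p & 1 for p in _pos(term))

# The inverted index's postings, split into chunks; the vocabulary
# is stored once, while each chunk carries only its tiny filter.
chunks = [{"bloom", "filter"}, {"tag", "index"}, {"bloom", "search"}]
filters = [chunk_filter(c) for c in chunks]

def candidate_chunks(term):
    # Filters skip chunks that definitely lack the term; only the
    # surviving chunks' postings are read at all.
    return [i for i, f in enumerate(filters) if maybe_in(f, term)]

print(candidate_chunks("bloom"))  # [0, 2], modulo rare false positives
```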
> that asymmetry is weirdly powerful for specific problems.
100% agree when it works in your favor. We use it for exactly that situation where a non-zero false positive rate is fine and you can choose how much memory to devote to getting it closer to zero.
There have been a couple of times, though, where we've needed a "keep everything not on this list" operation, and unfortunately bloom filters don't work well for that situation. There are alternatives, but none as elegantly compact as bloom filters.