by javindo on 10/28/13, 10:54 AM
Interesting anecdote: an alumni of my university was chatting with me recently about his work at Google, essentially he replaced what he described as a "large, complex machine learning pipeline" with a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.
For example, if you search for "iPhone 5S", the filter determines whether to show something like this http://i.imgur.com/Dp3y1Gi.png (not sure if sponsored makes a difference here, possibly a bad example).
by mbostock on 10/28/13, 3:44 PM
by GhotiFish on 10/28/13, 3:12 PM
I always loved bloom filters, because to me, they showed something that was very fundamentally important.
That the amount of information required so that you can say whether an element is in a set or not is significantly less that the set itself.
That is just wierd, but amazing.
If someone asks me about weird and wonderful things in computer science, this is what I talk about.
by fennecfoxen on 10/28/13, 3:54 PM
Now we just need HyperLogLog by example.
(HyperLogLogs are vaguely similar data structures in that you hash things a bunch to store a value inside, except instead of set-membership inquiries, they're better at cardinality-estimation purposes.)
by makmanalp on 10/28/13, 5:44 PM
Question, instead of using two hash functions, why not use just a single function but use a = hash(foo) and b = hash(foo+"something")? A well distributed hash function should give drastically different results for a and b for different values of foo, which is what we want, correct?
by rethab on 10/28/13, 11:57 AM
by pearkes on 10/28/13, 12:09 PM
bloomd[1] is a "high performance c server for bloom filters" for those looking for a low-overhead and deployable implementation with quite a few clients, like Python and Ruby.
[1] https://github.com/armon/bloomd
by ngcazz on 10/28/13, 10:54 AM
Wow, I didn't realize they were this simple to implement and reason about!
by sandwell on 10/28/13, 11:09 AM
This looks very cool, however I was able to get lots of false positives using the first few letters of the alphabet as test data. Could you reduce these by using k hashes and k bit vectors with m bits, i.e. one bit vector for each hash algorithm, at the expense of a little extra computation?
by gngeal on 10/28/13, 11:19 AM
Google Chrome not only uses bloom filters for malicious URL filtering but also as a part of the CSS selector matching process. I'm not sure how exactly, but allegedly, it gives them a nice speed bump in majority of the cases without compromising on generality.
by roye on 10/28/13, 4:27 PM
by timruffles on 10/28/13, 11:43 AM
Coincidence: used this page on Friday to implement bloom filters for a kata at Software Craftsmanship 2013 (in Bletchley Park of all places).
Fun to implement, I'd recommend having a go. Here's our terrible clojure code: https://gist.github.com/timruffles/7195405
by aidos on 10/28/13, 8:04 PM
Interesting. I'd never looked into how BFs work before and this is a nice clear explaination.
One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?
by tropicalmug on 10/28/13, 5:59 PM
Another great Bloom Filter library not mentioned in this article is Bitly's implementation on GitHub[0]. It has the best of both worlds: a C implementation and higher-level wrappers (Golang, PHP, Python, and Common Lisp). If you don't want to play around with a fancier DBMS and really just want the damn filters, I would look here.
[0] https://github.com/bitly/dablooms
by jbochi on 10/28/13, 9:17 PM
by joeblau on 10/28/13, 12:57 PM
Last time I checked, the Cassandra implementation was really messed up. Myself and another dev at our last company spent a few weeks debugging, adding test cases and fixing the code to get it to work correctly.
by dpcx on 10/28/13, 12:12 PM
Anyone got pointers to "simple" real-world use cases?
by mrcactu5 on 10/28/13, 5:49 PM
I had just put down this bloom filter tutorial by Jason Davies, now another one comes up. It's a sign.
www.jasondavies.com/bloomfilter/
by malkia on 10/28/13, 1:46 PM
As an example of real usage - In my ChromeBook home folder, there are several "bloom" filter files for various predefined filtering in chrome