7Sep/08Off
A whirlwind tour of Bloom filters
What's a Bloom filter?
A Bloom filter is a memory and cpu efficient way to remember which things have been seen and which haven't. They are more familiar than programmers realize; the key set of a hash map is loosely a simple Bloom filter. I haven't had a need for this level of efficiency, but it's a neat tool to keep in mind.
A hash map works by applying a hash function to data, and associating the output with the input data. Considering a simple hash map which stores only values, and not buckets of values:
Traits of a hash map key set
- After an item has been stored, its existance will be known. It will never go missing. There is no chance of saying an item has not been stored after it has been stored.
- After an item has been stored, it may be identified as a different item. This happens when the hash function generates the same value for distinct data. So there is a risk of falsely confirming item B has been stored after only storing item A.
- The space required to hold these keys will be the product of the key size and key count. While key size is probably fixed, key count is variable and will lead to more space needed as more items are added.
Introducing Bloom filters
A Bloom filter shares traits 1 and 2 of a hash map key set, but not trait 3. The space required by a Bloom filter does not increase as more items are stored. Rather than processing data with one hash function and using the result as a lookup value, a Bloom filter uses multiple hash functions to identify booleans in a fixed size array.Bloom filter traits
- Same as hash map key set trait 1
- Same as hash map key set trait 2
- Storage space is small, but depends on the desired false positive rate
- Lookup time depends on number of hash functions; can be run in parallel
- Items cannot be removed
Code
This tacky, easily improvable code was released to the public domain and promises nothing.class MediocreBloomFilter M = 16 # number of boolean fields in filter K = 3 # number of hash functions to find boolean fields attr_accessor :v # the filter, a vector of boolean fields def initialize() @v = Array.new(M, 0) end def track_item(item) indicies_for?(item).each do |i| @v[i] = 1 end end def item_tracked?(item) hits = 0 indicies_for?(item).each do |i| hits = hits + @v[i] end hits == K end def indicies_for?(item) indicies = [] md5 = Digest::MD5.new hash = "" md5.update(item.to_s) hash = md5.hexdigest K.times do |k| indicies << HEX.index(hash[k].chr.to_s.upcase) end indicies end HEX = [ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F" ] end
- Line 7 sets the filter storage to 16 integers. It could have been two bytes if I wanted to do bitwise operations.
- Line 24 runs the hash functions against the target data. Since this is tacky sample code only one hash function is executed (line 30), and multiple slices are taken from the value and converted into array indicies (at line 33).
- Using integers rather than booleans has a nice side effect. It permits removal of items by using the integers as reference counts. I haven't tested it, but minor changes should be needed around lines 12 and 19, and a removal function to decrement the counts.
- item: "abc"
- md5: 900150983cd24fb0d6963f7d28e17f72
- indicies: [9, 0, 0]
- filter: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- item: "abd"
- md5: 4911e516e5aa21d327512e0c8b197616
- indicies: [4, 9, 1]
- filter: [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
- item: "seth schroeder"
- md5: 2c38f5fd13873c139567551ca1b7496e
- indicies: [2, 12, 3]
- filter: [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
References
- Using Bloom Filters
- Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
- Burton Bloom, Space/time trade-offs in hash coding with allowable errors, CACM, 13(7):422-426, July 1970.
- Paul E. Black, "Bloom filter", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 16 May 2008. Available from: http://www.nist.gov/dads/HTML/bloomFilter.html


