Don’t Thrash: How to Cache your Hash on Flash

Authors: Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, Erez Zadok

Date: 2012

Link: PDF

This paper describes the Cascade Filter, an approximate-membership-query (AMQ) data structure that scales beyond main memory, supporting over half a million insertions/deletions per second and over 500 lookups per second on a commodity flash-based SSD.
An AMQ data structure is a dictionary that trades off space for a false positive rate on membership queries.
The Bloom filter is a well-known example of an AMQ; but it doesn’t scale beyond main memory.
An AMQ data structure supports the following dictionary operations on a set of keys:
- Insert
- Lookup
- Optionally delete.
For a key in the set, lookup returns present.
For a key not in the set, lookup returns absent with probability at least 1 − ε, where ε is a tunable false-positive rate. There is a tradeoff between ε and space consumption.
Bloom filters require about one byte per stored data item.
A quotient Filter (QF) is an in-memory AMQ data structure that is functionally similar to a Bloom filter, but lookups incur a single cache miss, as opposed to at least two in expectation for a Bloom filter.
The Cascade Filter (CF) comprises a collection of QFs organized into a data structure inspired by the Cache-Oblivious Lookahead Array (COLA).
A rotational disk performs only 100–200 (random) I/Os per second.
The Cascade Filter supports insertions at rates 40 times faster than a Bloom filter with buffering and 3,200 times faster than a traditional Bloom filter. Lookup throughput is 3 times slower than that of a Bloom filter or about the cost of 6 times random reads on flash.