Don’t Thrash: How to Cache your Hash on Flash
Authors: Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, Erez Zadok
Date: 2012
Link: PDF
- This paper describes the Cascade Filter, an approximate-membership-query (AMQ) data structure that scales beyond main memory, supporting over half a million insertions/deletions per second and over 500 lookups per second on a commodity flash-based SSD.
- An AMQ data structure is a dictionary that trades off space for a false positive rate on membership queries.
- The Bloom filter is a well-known example of an AMQ; but it doesn’t scale beyond main memory.
- An AMQ data structure supports the following dictionary operations on a set of keys:
- Insert
- Lookup
- Optionally delete.
- For a key in the set, lookup returns
present
.
- For a key not in the set, lookup returns
absent
with probability at least 1 − ε
, where ε
is a tunable false-positive rate. There is a tradeoff between ε
and space consumption.
- Bloom filters require about one byte per stored data item.
- A quotient Filter (QF) is an in-memory AMQ data structure that is functionally similar to a Bloom filter, but lookups incur a single cache miss, as opposed to at least two in expectation for a Bloom filter.
- The Cascade Filter (CF) comprises a collection of QFs organized into a data structure inspired by the Cache-Oblivious Lookahead Array (COLA).
- A rotational disk performs only
100–200
(random) I/Os per second.
- The Cascade Filter supports insertions at rates 40 times faster than a Bloom filter with buffering and 3,200 times faster than a traditional Bloom filter. Lookup throughput is 3 times slower than that of a Bloom filter or about the cost of 6 times random reads on flash.