2. Benefits

- High-speed reads and writes of key/value pairs, sustained over growing volumes of data
- Read costs are always 0 or 1 disk seeks
- Efficient use of memory
- Simple file structures with strong durability guarantees
3. Why “Lucene” KV store?

- Uses Lucene’s “Directory” APIs for low-level file access
- Based on Lucene’s concepts of segment files, soft deletes, background merges, commit points etc., BUT a fundamentally different form of index
- I’d like to offer it to the Lucene community as a “contrib” module because they have a track record in optimizing these same concepts (and could potentially make use of it in Lucene?)
4. Example benchmark results

[Benchmark chart not reproduced in this text version]

Note: regular Lucene search indexes follow the same trajectory as the “Common KV Store” when it comes to lookups on a store with millions of keys.
5. KV-Store High-level Design

Map held in RAM:

  Key hash (int)   Disk pointer (int)
  23434            0
  6545463          10
  874382           22

Disk layout:

  Num keys     Key 1 size   Key 1      Value 1 size   Value 1    Key/values 2,3,4…
  with hash    (VInt)       (byte[])   (VInt)         (byte[])
  (VInt)
  1            3            Foo        3              Bar
  2            5            Hello      5              World      7,Bonjour,8,Le Mon..

Most hashes have only one associated key and value. Some hashes will have key collisions, requiring the use of the extra columns here.
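The disk layout above can be sketched in Java. This is a minimal illustration, not the module’s actual code: the class and method names are invented here, and `writeVInt` is a small re-implementation of the Lucene-style variable-length int encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch of the on-disk record layout: numKeysWithHash (VInt), then for
// each key: key size (VInt), key bytes, value size (VInt), value bytes.
public class RecordLayout {

    // Lucene-style VInt: 7 bits per byte, high bit set on all but the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Encode one record: every key/value pair sharing the same hash.
    public static byte[] encodeRecord(String[][] pairs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, pairs.length); // num keys with this hash
        for (String[] kv : pairs) {
            byte[] key = kv[0].getBytes(StandardCharsets.UTF_8);
            byte[] value = kv[1].getBytes(StandardCharsets.UTF_8);
            writeVInt(out, key.length);
            out.write(key);
            writeVInt(out, value.length);
            out.write(value);
        }
        return out.toByteArray();
    }
}
```

Encoding the first example row (“Foo” → “Bar”) this way yields nine bytes: 1, 3, ‘F’, ‘o’, ‘o’, 3, ‘B’, ‘a’, ‘r’.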
6. Read logic (pseudo code)

  int keyHash = hash(searchKey);
  Integer filePointer = ramMap.get(keyHash);
  if (filePointer == null)
    return null; // no value stored for this key
  file.seek(filePointer);
  int numKeysWithHash = file.readInt();
  for (int i = 0; i < numKeysWithHash; i++) {
    storedKey = file.readKeyData();
    if (storedKey == searchKey)
      return file.readValueData();
    file.readValueData(); // skip over a colliding key's value
  }
  return null; // hash collision, but this exact key was never stored

There is a guaranteed maximum of one random disk seek for any lookup. With a good hashing function most lookups will only need to go once around this loop.
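The read path can be made runnable as an in-memory sketch, assuming the slide-5 layout. A `ByteBuffer` stands in for the value file and a `HashMap` for the RAM map; the names and the `Arrays.hashCode` key hash are illustrative choices, not the actual implementation’s.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the read path: at most one "seek" into the value
// file, then a scan of the (usually single-entry) collision list.
public class ReadSketch {
    public final Map<Integer, Integer> ramMap = new HashMap<>();
    public byte[] file = new byte[0]; // stands in for the on-disk store

    static int readVInt(ByteBuffer in) { // Lucene-style variable-length int
        int b = in.get() & 0xFF, value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get() & 0xFF;
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static byte[] readBytes(ByteBuffer in) { // length-prefixed byte[]
        byte[] data = new byte[readVInt(in)];
        in.get(data);
        return data;
    }

    public byte[] get(byte[] searchKey) {
        Integer filePointer = ramMap.get(Arrays.hashCode(searchKey));
        if (filePointer == null) return null;  // hash never stored
        ByteBuffer in = ByteBuffer.wrap(file);
        in.position(filePointer);              // the one random seek
        int numKeysWithHash = readVInt(in);
        for (int n = 0; n < numKeysWithHash; n++) {
            byte[] storedKey = readBytes(in);
            byte[] value = readBytes(in);      // value always follows its key
            if (Arrays.equals(storedKey, searchKey)) return value;
        }
        return null; // hash collision, but this exact key was never stored
    }
}
```

Pointing the RAM map at the slide-5 example record (1, 3, “Foo”, 3, “Bar”) and looking up “Foo” returns “Bar” with a single positioning of the buffer.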
7. Write logic (pseudo code)

  int keyHash = hash(newKey);
  int oldFilePointer = ramMap.get(keyHash);
  ramMap.put(keyHash, file.length());
  if (oldFilePointer == null) {
    file.append(1); // only 1 key with this hash
    file.append(newKey);
    file.append(newValue);
  } else {
    file.seek(oldFilePointer);
    int numOldKeys = file.readInt();
    Map tmpMap = file.readNextNKeysAndValues(numOldKeys);
    tmpMap.put(newKey, newValue);
    file.append(tmpMap.size());
    file.appendKeysAndValues(tmpMap);
  }

Updates always append to the end of the file, leaving older values unreferenced. In case of any key collisions, previously stored values are copied to the new position at the end of the file along with the new content.
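The same append-only update rule, sketched as a runnable class. String keys/values and a growable in-memory buffer are simplifications for readability; all names here are illustrative, not the module’s actual API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Append-only write sketch: every update writes a fresh record at the end
// of the file; colliding keys' previous values are copied forward with it.
public class WriteSketch {
    final Map<Integer, Integer> ramMap = new HashMap<>();
    final ByteArrayOutputStream file = new ByteArrayOutputStream();

    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) { out.write((i & 0x7F) | 0x80); i >>>= 7; }
        out.write(i);
    }

    static int readVInt(ByteBuffer in) {
        int b = in.get() & 0xFF, v = b & 0x7F;
        for (int s = 7; (b & 0x80) != 0; s += 7) { b = in.get() & 0xFF; v |= (b & 0x7F) << s; }
        return v;
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    static String readString(ByteBuffer in) {
        byte[] bytes = new byte[readVInt(in)];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public void put(String newKey, String newValue) {
        int keyHash = newKey.hashCode();
        Integer oldFilePointer = ramMap.get(keyHash);
        ramMap.put(keyHash, file.size()); // new record starts at end of file
        Map<String, String> pairs = new LinkedHashMap<>();
        if (oldFilePointer != null) {
            // Copy previously stored pairs for this hash into the new record
            ByteBuffer in = ByteBuffer.wrap(file.toByteArray());
            in.position(oldFilePointer);
            int numOldKeys = readVInt(in);
            for (int n = 0; n < numOldKeys; n++) pairs.put(readString(in), readString(in));
        }
        pairs.put(newKey, newValue); // add, or overwrite the old value
        writeVInt(file, pairs.size());
        pairs.forEach((k, v) -> { writeString(file, k); writeString(file, v); });
    }

    public String get(String key) {
        Integer p = ramMap.get(key.hashCode());
        if (p == null) return null;
        ByteBuffer in = ByteBuffer.wrap(file.toByteArray());
        in.position(p);
        int n = readVInt(in);
        for (int i = 0; i < n; i++) {
            String k = readString(in), v = readString(in);
            if (k.equals(key)) return v;
        }
        return null;
    }
}
```

Updating an existing key appends a whole new record and repoints the RAM map at it; the old record stays in the file, unreferenced, until a merge reclaims the space.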
8. Segment generations: writes

[Diagram: several hash → pointer maps held in RAM, each paired with a key/value disk store (segments 0, 1, 2, 3, ordered old → new)]

Writes append to the end of the latest generation segment until it reaches a set size, then it is made read-only and a new segment is created.
9. Segment generations: reads

[Diagram: the same per-generation hash → pointer maps and key/value disk stores, ordered old → new]

Read operations search memory maps in reverse order. The first map found with a hash is expected to have a pointer into its associated file for all the latest keys/values with this hash.
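The reverse-order rule above can be sketched as a newest-first scan over the per-generation maps. A `List` of `HashMap`s stands in for the RAM maps; the names are illustrative.

```java
import java.util.List;
import java.util.Map;

// Reverse-order generation search: the newest map that knows a hash owns
// the pointer to the latest record for that hash.
public class GenerationLookup {
    public static Integer lookup(List<Map<Integer, Integer>> generations, int keyHash) {
        for (int g = generations.size() - 1; g >= 0; g--) { // newest first
            Integer pointer = generations.get(g).get(keyHash);
            if (pointer != null) return pointer; // later generations win
        }
        return null; // hash unknown in every generation
    }
}
```

A hash present in both an old and a new generation resolves to the new generation’s pointer, which is why updates never need to touch older read-only segments.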
10. Segment generations: merges

[Diagram: hash → pointer maps and key/value disk stores; read-only segments are combined into a new segment 4]

A background thread merges read-only segments with many outdated entries into new, more compact versions.
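A hedged sketch of what a merge does on the map side (the real merge also copies each surviving key/value record into a new file and rewrites the pointers accordingly; only the map combination is shown here, with illustrative names):

```java
import java.util.HashMap;
import java.util.Map;

// Merge sketch: two read-only generations collapse into one compact map,
// with entries from the newer generation superseding outdated ones.
public class MergeSketch {
    public static Map<Integer, Integer> merge(Map<Integer, Integer> older,
                                              Map<Integer, Integer> newer) {
        Map<Integer, Integer> merged = new HashMap<>(older);
        merged.putAll(newer); // newer pointers win on duplicate hashes
        return merged;
    }
}
```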
11. Segment generations: durability

[Diagram: hash → pointer maps held in RAM over key/value disk stores, with a “segments” file recording:
  Completed Segment IDs:   0, 4
  Active Segment ID:       3
  Active committed length: 423423]

Like Lucene, commit operations create a new generation of a “segments” file, the contents of which reflect the committed (i.e. fsync’ed) state of the store.
12. Implementation details

- JVM needs sufficient RAM for 2 ints for every active key (note: using “modulo N” on the hash can cap RAM at N x 2 ints, at the cost of more key collisions = more disk IO)
- Uses Lucene Directory for:
  - Abstraction from choice of file system
  - Buffered reads/writes
  - Support for VInt encoding of numbers
  - Rate-limited merge operations
- Borrows successful Lucene concepts:
  - Multiple segments flushed then made read-only
  - “Segments” file used to list committed content (could potentially support multiple commit points)
  - Background merges
- Uses LGPL “Trove” for maps of primitives
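The “modulo N” trade-off in the first bullet can be illustrated: shrinking the hash space to N buckets bounds the RAM map at N entries, but forces distinct keys to share hashes, lengthening the on-disk collision lists. This demo (names and the tiny N are purely illustrative) groups keys by their modulo-N hash:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of the "modulo N" RAM trade-off: at most N map entries
// (2 ints each in the real store), at the cost of more key collisions,
// i.e. longer on-disk records and more IO per lookup.
public class ModuloDemo {
    public static Map<Integer, List<String>> bucketize(List<String> keys, int n) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String key : keys) {
            int hash = Math.floorMod(key.hashCode(), n); // "modulo N" hash
            buckets.computeIfAbsent(hash, h -> new ArrayList<>()).add(key);
        }
        return buckets;
    }
}
```

With more keys than buckets, at least one bucket must hold several keys (pigeonhole), so some lookups have to walk a multi-entry collision list on disk.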