ElephantDB

Nathan Marz
BackType
@nathanmarz

Specialized database for
exporting key/value
data from Hadoop

[Diagram: ElephantDB servers load files from the distributed filesystem and serve key/value queries]

BackType’s Challenges

Complex analytics
on lots of data (> 30TB)
in realtime

Data system

[Diagram: Raw data -> View 1, View 2, View 3]

Data system

[Diagram: Tweets -> # Tweets / URL, Influence scores, Trending topics]

Our approach

Speed Layer

Batch Layer

Batch layer

[Diagram: Master dataset -> MapReduce -> Batch View 1, Batch View 2, Batch View 3]

(this is too slow in practice)

Incremental batch layer

[Diagram: New data, All data, and a batch view maintenance workflow (arrows labeled Query and Append) producing Batch View 1, Batch View 2, and Batch View 3]

Batch views are always out of
date by a few hours

Speed layer

[Diagram: Messages -> DB, DB]

Compensate for high latency
of updates to batch layer

Only needs to worry about
last few hours of data

Application-level Queries

[Diagram: Batch Layer Query + Speed Layer Query -> Merge]
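
The merge step can be sketched as follows. This is a minimal illustration, assuming an additive metric (tweet counts per URL) so that merging is a simple sum. The names `query_tweet_count`, `batch_view`, and `speed_view` are hypothetical and are not part of ElephantDB.

```python
# Hypothetical sketch: answer a query by combining a batch view (a few hours
# out of date) with a speed-layer view (covers only the most recent data).

def query_tweet_count(url, batch_view, speed_view):
    """Return the total count for `url` by merging both layers.

    Assumes the metric is additive, so the merge is a sum.
    """
    batch_count = batch_view.get(url, 0)   # precomputed by the batch layer
    recent_count = speed_view.get(url, 0)  # last few hours only
    return batch_count + recent_count


# Plain dicts stand in for the two key/value stores.
batch_view = {"http://example.com/a": 120, "http://example.com/b": 7}
speed_view = {"http://example.com/a": 3, "http://example.com/c": 1}

total_a = query_tweet_count("http://example.com/a", batch_view, speed_view)
total_c = query_tweet_count("http://example.com/c", batch_view, speed_view)
```

Non-additive views (for example, unique counts) would need a different merge function.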

Batch layer

[Diagram: Master dataset -> MapReduce -> Batch View 1, Batch View 2, Batch View 3, each stored in ElephantDB]

ElephantDB

• Random reads
• Batch writes
• No random writes

Why ElephantDB?

• Simplicity
• Ease of use
• Trivially scalable
• “Just works”

Creating an ElephantDB index is decoupled from serving an index

ElephantDB indexing data flow

[Diagram: key/value pairs -> MapReduce -> sharded key/value database on the distributed filesystem]

Index creation

• ElephantDB just used as a library
• No dependencies on a live ElephantDB cluster

ElephantDB serving data flow

[Diagram: Distributed filesystem -> shards -> ElephantDB servers (two servers shown, each holding two shards)]

ElephantDB serving

• Each server in the “ring” serves a subset of the data
• Download shards, open, serve
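
How shards might be spread across a ring can be sketched as below. The slides do not state the assignment policy, so round-robin is an assumption here, and `assign_shards`, `host_for_shard`, and the host names are hypothetical.

```python
# Hypothetical sketch: each server in the ring serves a subset of the shards.

def assign_shards(hosts, num_shards):
    """Spread shard ids over hosts round-robin (an assumed policy)."""
    assignment = {host: [] for host in hosts}
    for shard in range(num_shards):
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment


def host_for_shard(shard, assignment):
    """Find which host serves a given shard."""
    for host, shards in assignment.items():
        if shard in shards:
            return host
    raise KeyError(shard)


hosts = ["edb1", "edb2", "edb3"]
assignment = assign_shards(hosts, num_shards=8)
# Each host would then download its shards from the DFS, open them with the
# local persistence, and start serving reads.
```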

Terminology
• “Domain”: Related set of key/value pairs
• “Shard”: Subset of a domain
• “Ring”: Cluster of servers that work together to serve the same set of domains
• “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)

ElephantDB domain

• Versioned
• domain-spec.yaml contains metadata (number of shards, how domains are stored)
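
Because a domain is versioned, a server can pick the newest version to serve or fall back to an older one. The sketch below assumes each version is a directory named with an integer (for example a timestamp); the slides do not specify this naming, and `latest_version` and `previous_version` are hypothetical helpers.

```python
# Hypothetical sketch: choose which version of a domain to serve.
# Assumes version directories have integer names; anything else is ignored.

def _numeric_versions(names):
    return sorted(int(n) for n in names if n.isdigit())


def latest_version(names):
    """Return the newest version number, or None if there is none."""
    versions = _numeric_versions(names)
    return versions[-1] if versions else None


def previous_version(names, current):
    """Return the newest version older than `current` (for a rollback)."""
    older = [v for v in _numeric_versions(names) if v < current]
    return older[-1] if older else None


version_dirs = ["1300000000", "1300003600", "tmp"]
current = latest_version(version_dirs)
rollback = previous_version(version_dirs, current)
```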

ElephantDB version

• Folder for each shard

ElephantDB Shard

• Files for local persistence
• Berkeley DB JE in this example

Writing to ElephantDB

Cascading

Writing to ElephantDB

Cascalog

MapReduce flow

[Diagram: (key, value) -> hash mod -> (shard, key, value) -> group by shard -> (shard, list<key, value>) -> stream into LP -> local persistence on LFS -> upload -> shard on DFS]
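
The flow above (hash mod, group by shard, stream into a local persistence) can be sketched in plain Python. This is an illustration only: the hash function is an assumption, a dict stands in for the local persistence, and the upload to the DFS is omitted. `shard_for_key` and `build_shards` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def shard_for_key(key: bytes, num_shards: int) -> int:
    """Map a key to a shard with a stable hash mod num_shards (assumed scheme)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(pairs, num_shards):
    """Turn (key, value) pairs into one in-memory 'shard' per shard id."""
    # Map step: (key, value) -> (shard, key, value)
    tagged = ((shard_for_key(k, num_shards), k, v) for k, v in pairs)

    # Shuffle step: group by shard -> (shard, list<key, value>)
    grouped = defaultdict(list)
    for shard, k, v in tagged:
        grouped[shard].append((k, v))

    # Reduce step: stream each group into a local persistence.
    # A dict stands in for a real store such as Berkeley DB.
    shards = {}
    for shard, kvs in grouped.items():
        local_persistence = {}
        for k, v in kvs:
            local_persistence[k] = v
        shards[shard] = local_persistence
    return shards


pairs = [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")]
shards = build_shards(pairs, num_shards=4)
```

A reader that uses the same `shard_for_key` function knows which shard holds a given key without consulting any index.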

Incremental ElephantDB

[Diagram: (key, value) -> hash mod -> (shard, key, value) -> group by shard -> (shard, list<key, value>); old shard on DFS -> download -> local persistence on LFS; stream into LP; upload -> new shard on DFS]

Incremental ElephantDB

• Avoid reindexing domain from scratch
• Massive performance benefits in creating/updating versions
• ElephantDB ring still has to download domain from scratch
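
An incremental update of a single shard can be sketched as below. Dicts stand in for the old and new shard files, and the `merge` parameter is an assumption: the slides do not say how an existing value and a new value are combined. `update_shard` is a hypothetical name.

```python
# Hypothetical sketch: build a new shard version from the old shard plus
# only the new key/value pairs, instead of reindexing everything.

def update_shard(old_shard, new_pairs, merge=lambda old, new: new):
    """Return a new shard; the old shard is left untouched.

    `merge` decides what to store when a key already exists
    (default: the new value replaces the old one).
    """
    updated = dict(old_shard)  # stands in for downloading the old shard
    for key, value in new_pairs:
        if key in updated:
            updated[key] = merge(updated[key], value)
        else:
            updated[key] = value
    return updated  # would be uploaded to the DFS as the new version


old_shard = {b"a": 1, b"b": 2}
new_shard = update_shard(
    old_shard,
    [(b"b", 5), (b"c", 7)],
    merge=lambda old, new: old + new,  # e.g. additive counters
)
```

Keeping the old shard unchanged is what allows a previous version to remain available after the update.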

Consuming ElephantDB
from MapReduce

• Read from shards in DFS, not from live ElephantDB servers
• “Indexed key/value file format on Hadoop”

CAP Theorem

In a distributed database, pick at most two:

Consistency
Availability
Partition Tolerance

ElephantDB and CAP

• Chooses availability and partition tolerance
• But, a very predictable kind of consistency

Comparison to HBase

• HBase can be written to in batch
• Supports random writes
• So why ElephantDB?

Comparison to HBase

• HBase has an enormous complexity cost
• HBase has many moving parts
• Does not “just work”

Comparison to HBase

• HBase is not highly available
• ElephantDB has ability to revert to older
versions of indexes

Implementation details

• Written in Clojure
• Thrift interface
• Local Persistences are pluggable

Questions?

Twitter: @nathanmarz

Email: nathan.marz@gmail.com

Web: http://nathanmarz.com
