ElephantDB
ElephantDB Presentation Transcript

  • ElephantDB. Nathan Marz, BackType (@nathanmarz)
  • Specialized database for exporting key/value data from Hadoop
  • [Diagram: ElephantDB servers load files from a distributed filesystem and answer key/value queries]
  • First, some context
  • BackType’s challenges: complex analytics on lots of data (> 30TB) in realtime
  • What is a data system?
  • Data system [diagram: raw data feeds View 1, View 2, and View 3]
  • Data system [diagram: tweets are the raw data; the views are # tweets / URL, influence scores, and trending topics]
  • Our approach Speed Layer Batch Layer
  • Batch layer [diagram: MapReduce jobs over the master dataset produce Batch View 1, Batch View 2, and Batch View 3]
  • (this is too slow in practice)
  • Incremental batch layer [diagram: new data is appended to all data; a batch view maintenance workflow queries the data and updates Batch View 1, Batch View 2, and Batch View 3]
  • Batch views are always out of date by a few hours
  • Speed layer [diagram: messages flow into DBs]
  • Compensate for high latency of updates to batch layer
  • Only needs to worry about last few hours of data
  • Application-level queries [diagram: a batch layer query and a speed layer query are merged]
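The merge step on this slide can be illustrated with a small sketch. It assumes, purely for illustration, that both layers expose a count per key (for example tweets per URL) and that merging means adding the recent speed-layer count to the older batch count. The function name and the sample data are hypothetical, not taken from the talk.

```python
def merged_count(key, batch_view, speed_view):
    """Combine the (hours-old) batch result with the speed layer's recent delta.

    Assumption: both views map key -> count, and the speed layer only
    holds data that the batch view has not absorbed yet.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical views: number of tweets per URL.
batch_view = {"http://example.com/a": 120, "http://example.com/b": 7}
speed_view = {"http://example.com/a": 3, "http://example.com/c": 1}

for url in ("http://example.com/a", "http://example.com/b", "http://example.com/c"):
    print(url, merged_count(url, batch_view, speed_view))
```

Other view types (for example trending topics) would need a different merge function; addition only fits counters.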
  • Batch layer [diagram: MapReduce jobs over the master dataset produce Batch View 1, Batch View 2, and Batch View 3, each stored in ElephantDB]
  • ElephantDB
    • Random reads
    • Batch writes
    • No random writes
  • Why ElephantDB?
    • Simplicity
    • Ease of use
    • Trivially scalable
    • “Just works”
  • Creating an ElephantDB index is disassociated from serving an index
  • ElephantDB indexing data flow [diagram: key/value pairs → MapReduce → sharded key/value database on the distributed filesystem]
  • Index creation
    • ElephantDB is just used as a library
    • No dependencies on a live ElephantDB cluster
  • ElephantDB serving data flow [diagram: shards on the distributed filesystem are loaded by ElephantDB servers, several shards per server]
  • ElephantDB serving
    • Each server in the “ring” serves a subset of the data
    • Download shards, open, serve
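As a rough sketch of "each server serves a subset of the data", the snippet below spreads shard numbers over the hosts of a ring round-robin. The slides do not say how ElephantDB actually assigns shards to servers, so the round-robin rule, the function name, and the host names are assumptions made for illustration only.

```python
def assign_shards(hosts, num_shards):
    """Assumed policy: shard i is served by hosts[i % len(hosts)]."""
    assignment = {host: [] for host in hosts}
    for shard in range(num_shards):
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment


# Hypothetical ring of two servers serving a domain with five shards.
ring = ["edb1.example.com", "edb2.example.com"]
for host, shards in assign_shards(ring, 5).items():
    # Each server would download only these shards, open them, and serve them.
    print(host, shards)
```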
  • Terminology
    • “Domain”: related set of key/value pairs
    • “Shard”: subset of a domain
    • “Ring”: cluster of servers that work together to serve the same set of domains
    • “Local persistence”: regular key/value database that implements a shard (like Berkeley DB)
  • ElephantDB domain
    • Versioned
    • domain-spec.yaml contains metadata (number of shards, how domains are stored)
  • ElephantDB version
    • Folder for each shard
  • ElephantDB shard
    • Files for local persistence
    • Berkeley DB JE in this example
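The domain / version / shard nesting described on these slides can be sketched as a directory tree. The snippet builds a throwaway tree and picks the newest version, or an older one when reverting. The numeric version folder names, the domain name, and the helper functions are hypothetical; the slides only state that a domain is versioned, has a domain-spec.yaml, and that each version holds one folder per shard.

```python
import tempfile
from pathlib import Path


def list_versions(domain_dir: Path) -> list[int]:
    """Return version numbers (assumed to be numeric folder names), oldest first."""
    return sorted(
        int(p.name) for p in domain_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )


def pick_version(domain_dir: Path, steps_back: int = 0) -> Path:
    """Return the newest version folder, or an older one if steps_back > 0."""
    versions = list_versions(domain_dir)
    if steps_back >= len(versions):
        raise ValueError("not enough versions to go back that far")
    return domain_dir / str(versions[-1 - steps_back])


def make_demo_domain(root: Path) -> Path:
    """Create a hypothetical domain with two versions of two shards each."""
    domain = root / "tweet-counts"
    for version in (1001, 1002):
        for shard in range(2):
            (domain / str(version) / str(shard)).mkdir(parents=True)
    # Placeholder only; the real file's contents are not shown in the slides.
    (domain / "domain-spec.yaml").write_text("# metadata placeholder\n")
    return domain


with tempfile.TemporaryDirectory() as tmp:
    domain = make_demo_domain(Path(tmp))
    print("serving:", pick_version(domain).name)
    print("reverted:", pick_version(domain, steps_back=1).name)
```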
  • Writing to ElephantDB Cascading
  • Writing to ElephantDB Cascalog
  • MapReduce flow [diagram: (key, value) → hash mod → (shard, key, value) → group by shard → (shard, list<key, value>) → stream into local persistence (LP) on the local filesystem (LFS) → upload → shard on DFS]
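The "hash mod" and "group by shard" steps of this flow can be sketched in a few lines. The slides do not name the hash function ElephantDB uses, so MD5 here is an assumption chosen only because it is stable across runs (Python's built-in `hash()` is randomized for strings and bytes). The function names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash, then mod."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(pairs, num_shards):
    """Turn (key, value) pairs into {shard: [(key, value), ...]}."""
    shards = defaultdict(list)
    for key, value in pairs:
        shards[shard_for(key, num_shards)].append((key, value))
    return dict(shards)


pairs = [(b"url-a", b"120"), (b"url-b", b"7"), (b"url-c", b"1")]
for shard, items in sorted(group_by_shard(pairs, num_shards=4).items()):
    # In the real flow, each group would be streamed into a local
    # persistence on local disk and then uploaded to the DFS.
    print(shard, items)
```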
  • Incremental ElephantDB [diagram: (key, value) → hash mod → (shard, key, value) → group by shard → (shard, list<key, value>) → download old shard from DFS → stream into local persistence (LP) on the local filesystem (LFS) → upload → new shard on DFS]
  • Incremental ElephantDB
    • Avoid reindexing domain from scratch
    • Massive performance benefits in creating/updating versions
    • ElephantDB ring still has to download domain from scratch
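The incremental flow (download the old shard, stream new pairs into it, upload the result as a new version) can be modeled with plain dictionaries standing in for a local persistence. The overwrite-on-conflict rule and the function name are assumptions made for illustration; the slides do not describe how existing keys are updated.

```python
def update_shard(old_shard: dict, new_pairs) -> dict:
    """Build the next version of a shard without re-indexing old data."""
    new_shard = dict(old_shard)      # stands in for downloading the old shard
    for key, value in new_pairs:     # stream only the new pairs into it
        new_shard[key] = value       # assumed policy: newest value wins
    return new_shard                 # would be uploaded as the new version


old_shard = {"url-a": 120, "url-b": 7}
new_shard = update_shard(old_shard, [("url-a", 123), ("url-c", 1)])
print(new_shard)
print(old_shard)  # the previous version is left untouched
```

Copying rather than mutating mirrors the versioned layout: the old version stays available while the new one is built.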
  • Incremental ElephantDB
  • Configuring a ring
  • Querying ElephantDB
  • Consuming ElephantDB from MapReduce
    • Read from shards in DFS, not from live ElephantDB servers
    • “Indexed key/value file format on Hadoop”
  • Demo
  • One-click deploy
  • CAP Theorem: in a distributed database, pick at most two
    • Consistency
    • Availability
    • Partition Tolerance
  • ElephantDB and CAP: Availability and Partition Tolerance
  • ElephantDB and CAP: but, a very predictable kind of consistency
  • Comparison to HBase
    • HBase can be written to in batch
    • Supports random writes
    • So why ElephantDB?
  • Comparison to HBase
    • HBase has an enormous complexity cost
    • HBase has many moving parts
    • Does not “just work”
  • Comparison to HBase
    • HBase is not highly available
    • ElephantDB has the ability to revert to older versions of indexes
  • Implementation details
    • Written in Clojure
    • Thrift interface
    • Local persistences are pluggable
  • Questions?
    • Twitter: @nathanmarz
    • Email: nathan.marz@gmail.com
    • Web: http://nathanmarz.com