ElephantDB

Transcript

  • 1. ElephantDB Nathan Marz BackType @nathanmarz
  • 2. Specialized database for exporting key/value data from Hadoop
  • 3. [Diagram: Distributed Filesystem holding Files; ElephantDB Servers; Key/value queries]
  • 4. First, some context
  • 5. BackType’s Challenges
  • 6. BackType’s Challenges Complex analytics
  • 7. BackType’s Challenges: Complex analytics on lots of data (> 30TB)
  • 8. BackType’s Challenges: Complex analytics on lots of data (> 30TB) in realtime
  • 9. What is a data system?
  • 10. Data system: Raw data → View 1, View 2, View 3
  • 11. Data system: Tweets → # Tweets / URL, Influence scores, Trending topics
  • 12. Our approach Speed Layer Batch Layer
  • 13. Batch layer: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3
  • 14. (this is too slow in practice)
  • 15. Incremental batch layer [Diagram: New data; All data; Batch view maintenance workflow (Query, Append); Batch View 1, Batch View 2, Batch View 3]
  • 16. Batch views are always out of date by a few hours
  • 17. Speed layer: Messages → DB, DB
  • 18. Compensate for high latency of updates to batch layer
  • 19. Only needs to worry about last few hours of data
  • 20. Application-level queries: Batch layer query + Speed layer query → Merge
  • 21. Batch layer: Master dataset → MapReduce → ElephantDB (Batch View 1), ElephantDB (Batch View 2), ElephantDB (Batch View 3)
  • 22. ElephantDB • Random reads • Batch writes • No random writes
  • 23. Why ElephantDB? • Simplicity • Ease of use • Trivially scalable • “Just works”
  • 24. Creating an ElephantDB index is disassociated from serving an index
  • 25. ElephantDB indexing data flow: key/value pairs → MapReduce → sharded key/value database → Distributed filesystem
  • 26. Index creation • ElephantDB just used as a library • No dependencies on a live ElephantDB cluster
  • 27. ElephantDB serving data flow: Distributed filesystem → Shards → ElephantDB Servers
  • 28. ElephantDB serving • Each server in “ring” serves a subset of the data • Download shards, open, serve
  • 29. Terminology • “Domain”: Related set of key/value pairs • “Shard”: Subset of a domain • “Ring”: Cluster of servers that work together to serve the same set of domains • “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)
  • 30. ElephantDB domain • Versioned • domain-spec.yaml contains metadata (number of shards, how domains are stored)
  • 31. ElephantDB version • Folder for each shard
  • 32. ElephantDB Shard • Files for local persistence • Berkeley DB JE in this example
  • 33. Writing to ElephantDB Cascading
  • 34. Writing to ElephantDB Cascalog
  • 35. MapReduce flow: (key, value) → hash mod → (shard, key, value) → Group by shard → (shard, list<key, value>) → Stream into LP → Local Persistence on LFS → Upload → Shard on DFS
  • 36. Incremental ElephantDB: (key, value) → hash mod → (shard, key, value) → Group by shard → (shard, list<key, value>) → Stream into LP; Old shard on DFS → Download → Local Persistence on LFS → Upload → New shard on DFS
  • 37. Incremental ElephantDB • Avoid reindexing domain from scratch • Massive performance benefits in creating/updating versions • ElephantDB ring still has to download domain from scratch
  • 38. Incremental ElephantDB
  • 39. Configuring a ring
  • 40. Querying ElephantDB
  • 41. Consuming ElephantDB from MapReduce • Read from shards in DFS, not from live ElephantDB servers • “Indexed key/value file format on Hadoop”
  • 42. Demo
  • 43. One-click deploy
  • 44. CAP Theorem: In a distributed database, pick at most two: Consistency, Availability, Partition Tolerance
  • 45. ElephantDB and CAP: Availability, Partition Tolerance
  • 46. ElephantDB and CAP: But, a very predictable kind of consistency
  • 47. Comparison to HBase • HBase can be written to in batch • Supports random writes • So why ElephantDB?
  • 48. Comparison to HBase • HBase has an enormous complexity cost • HBase has many moving parts • Does not “just work”
  • 49. Comparison to HBase • HBase is not highly available • ElephantDB has ability to revert to older versions of indexes
  • 50. Implementation details • Written in Clojure • Thrift interface • Local Persistences are pluggable
  • 51. Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com
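
The "hash mod" and "Group by shard" steps of the MapReduce flow on slide 35 can be sketched roughly as follows. This is a minimal in-memory illustration, not ElephantDB's code: the function names are hypothetical, and MD5 is an assumed stand-in because the slides do not say which hash function is used.

```python
import hashlib
from collections import defaultdict


def shard_for(key: bytes, num_shards: int) -> int:
    # Stable hash of the key, reduced modulo the shard count.
    # MD5 is an illustrative assumption; the slides only say "hash mod".
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(pairs, num_shards):
    # "Map" step: tag each pair with its shard.
    # "Reduce" step: collect the pairs that belong to each shard.
    shards = defaultdict(list)
    for key, value in pairs:
        shards[shard_for(key, num_shards)].append((key, value))
    return dict(shards)


# Hypothetical sample data: 100 keys spread over 4 shards.
pairs = [(b"url-%d" % i, i) for i in range(100)]
shards = group_by_shard(pairs, num_shards=4)
```

In the real flow each group would be streamed into a local persistence (such as Berkeley DB) and uploaded to the distributed filesystem; here each shard is just a list.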
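
The application-level merge on slide 20 might look like the sketch below for the "# Tweets / URL" view from slide 11. It assumes both layers expose simple key-to-count lookups and that the speed layer only holds counts that arrived after the last batch run, so the two can be added; the function and variable names are hypothetical.

```python
def merged_tweet_count(url, batch_view, speed_view):
    # The batch view covers everything up to the last batch run;
    # the speed view covers only what has arrived since then.
    # Addition is the merge rule assumed here because the view is a count.
    return batch_view.get(url, 0) + speed_view.get(url, 0)


# Hypothetical views, modelled as plain dictionaries.
batch_view = {"http://example.com/a": 120}
speed_view = {"http://example.com/a": 3, "http://example.com/b": 1}

total_a = merged_tweet_count("http://example.com/a", batch_view, speed_view)
```

Other views would need a different merge rule; a set-valued view, for example, would merge by union rather than addition.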
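
Slide 28 says each server in a ring serves a subset of the data. One way to picture that is the sketch below, which assumes a round-robin assignment of shards to hosts. The slides do not describe the actual assignment scheme, so the scheme, the hash function, and all names here are assumptions.

```python
import hashlib


def shard_for(key: bytes, num_shards: int) -> int:
    # Assumed stable hash (MD5) reduced modulo the shard count.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign_shards(hosts, num_shards):
    # Round-robin: shard i is served by hosts[i % len(hosts)].
    assignment = {host: [] for host in hosts}
    for shard in range(num_shards):
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment


def host_for_key(key: bytes, hosts, num_shards):
    # A client routes a read to the host that serves the key's shard.
    return hosts[shard_for(key, num_shards) % len(hosts)]


hosts = ["edb1", "edb2"]  # hypothetical host names
assignment = assign_shards(hosts, num_shards=5)
```

With this layout a host only needs to download and open the shards in its own list before it can serve reads.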
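
The versioning idea behind slides 30, 36 and 49 (build a new version from the old shard plus new data, keep old versions so the index can be reverted) can be sketched as below. Integer version numbers, dictionaries as shards, and "new value replaces old" as the update rule are all assumptions made for illustration.

```python
def next_version(versions, new_pairs):
    # versions maps a version number to that version's key/value data.
    latest = max(versions)
    updated = dict(versions[latest])  # copy, so the old version stays intact
    updated.update(new_pairs)         # assumed rule: new values overwrite old
    result = dict(versions)
    result[latest + 1] = updated
    return result


versions = {1: {"a": 1}}
versions = next_version(versions, {"b": 2, "a": 10})
```

Reverting is then a matter of serving `versions[1]` again instead of the newest entry.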