ElephantDB

Published in: Technology
  • Transcript

    • 1. ElephantDB Nathan Marz BackType @nathanmarz
    • 2. Specialized database for exporting key/value data from Hadoop
    • 3. [Diagram: files on a distributed filesystem feed several ElephantDB servers, which answer key/value queries]
    • 4. First, some context
    • 5. BackType’s Challenges
    • 6. BackType’s Challenges Complex analytics
    • 7. BackType’s Challenges: Complex analytics on lots of data (> 30TB)
    • 8. BackType’s Challenges: Complex analytics on lots of data (> 30TB) in realtime
    • 9. What is a data system?
    • 10. Data system [diagram]: Raw data → View 1, View 2, View 3
    • 11. Data system [diagram]: Tweets → # Tweets / URL, Influence scores, Trending topics
    • 12. Our approach: Speed Layer, Batch Layer
    • 13. Batch layer [diagram]: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3
    • 14. (this is too slow in practice)
    • 15. Incremental batch layer [diagram; labels: New data, All data, Batch view maintenance workflow (Query, Append), Batch View 1, Batch View 2, Batch View 3]
    • 16. Batch views are always out of date by a few hours
    • 17. Speed layer [diagram]: Messages → DB, DB
    • 18. Compensate for high latency of updates to batch layer
    • 19. Only needs to worry about last few hours of data
    • 20. Application-level queries [diagram]: Batch layer query + Speed layer query → Merge
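
The merge step on this slide can be illustrated with a small sketch. It assumes, purely for illustration, that both layers expose a "tweets per URL" view as a plain dictionary and that the merge is a simple sum; the function name `merged_count`, the sample URLs, and the counts are hypothetical and not taken from the deck.

```python
def merged_count(key, batch_view, speed_view):
    """Answer an application-level query by merging both layers.

    Assumption: the batch view holds counts up to a few hours ago and the
    speed view holds counts for the data that arrived since, so the two
    can simply be added together.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical "number of tweets per URL" views.
batch_view = {"http://example.com/a": 120, "http://example.com/b": 7}
speed_view = {"http://example.com/a": 3, "http://example.com/c": 1}

total_a = merged_count("http://example.com/a", batch_view, speed_view)  # 123
total_c = merged_count("http://example.com/c", batch_view, speed_view)  # 1
```

Views that are not additive (for example, sets of unique users) would need a different merge function; the sum is only the simplest case.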
    • 21. Batch layer [diagram]: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3, each stored in ElephantDB
    • 22. ElephantDB
        • Random reads
        • Batch writes
        • No random writes
    • 23. Why ElephantDB?
        • Simplicity
        • Ease of use
        • Trivially scalable
        • “Just works”
    • 24. Creating an ElephantDB index is disassociated from serving an index
    • 25. ElephantDB indexing data flow [diagram]: key/value pairs → MapReduce → sharded key/value database → Distributed filesystem
    • 26. Index creation
        • ElephantDB just used as a library
        • No dependencies on a live ElephantDB cluster
    • 27. ElephantDB serving data flow [diagram]: Distributed filesystem → shards → ElephantDB servers (two servers shown, two shards each)
    • 28. ElephantDB serving
        • Each server in “ring” serves a subset of the data
        • Download shards, open, serve
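
How a ring might divide shards among its servers can be sketched as follows. The deck does not say how shards are assigned to hosts, so the round-robin scheme, the function name `assign_shards`, and the host names are all assumptions made for illustration.

```python
def assign_shards(hosts, num_shards):
    """Spread shard ids over the hosts of a ring.

    Assumption: plain round-robin assignment. Each host would then
    download, open, and serve only the shards in its own list.
    """
    assignment = {host: [] for host in hosts}
    for shard in range(num_shards):
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment


# Hypothetical two-server ring serving a four-shard domain.
ring = assign_shards(["edb1", "edb2"], 4)
# ring == {"edb1": [0, 2], "edb2": [1, 3]}
```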
    • 29. Terminology
        • “Domain”: related set of key/value pairs
        • “Shard”: Subset of a domain
        • “Ring”: Cluster of servers that work together to serve same set of domains
        • “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)
    • 30. ElephantDB domain
        • Versioned
        • domain-spec.yaml contains metadata (number of shards, how domains are stored)
    • 31. ElephantDB version
        • Folder for each shard
    • 32. ElephantDB Shard
        • Files for local persistence
        • Berkeley DB JE in this example
    • 33. Writing to ElephantDB Cascading
    • 34. Writing to ElephantDB Cascalog
    • 35. MapReduce flow [diagram]: (key, value) → hash mod → (shard, key, value) → Group by shard → (shard, list<key, value>) → Stream into LP → Local Persistence on LFS → Upload → Shard on DFS
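
The flow on this slide can be sketched in a few lines. This is a single-process illustration under stated assumptions, not ElephantDB's implementation: MD5 stands in for whatever hash function the real system uses, a Python dict stands in for the local persistence (LP), and the upload to the distributed filesystem is left out. The names `shard_for_key` and `build_shards` are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for_key(key: bytes, num_shards: int) -> int:
    """"hash mod" step: map a key to a shard id (MD5 is an assumed choice)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(pairs, num_shards):
    """Group key/value pairs by shard, then stream each group into a store."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[shard_for_key(key, num_shards)].append((key, value))

    shards = {}
    for shard, kvs in grouped.items():
        store = {}  # stand-in for a local persistence such as Berkeley DB
        for key, value in kvs:
            store[key] = value
        shards[shard] = store  # the real flow would upload this shard to the DFS
    return shards


pairs = [(b"a", b"1"), (b"b", b"2"), (b"c", b"3")]
shards = build_shards(pairs, num_shards=4)
```

Because the shard id depends only on the key and the shard count, a reader can later recompute which shard holds a given key without any lookup table.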
    • 36. Incremental ElephantDB [diagram]: same flow as the previous slide, except that the old shard on DFS is first downloaded into Local Persistence on LFS; the new pairs are streamed into it and the result is uploaded as the new shard on DFS
    • 37. Incremental ElephantDB
        • Avoid reindexing domain from scratch
        • Massive performance benefits in creating/updating versions
        • ElephantDB ring still has to download domain from scratch
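
A minimal sketch of the incremental idea: start from a copy of the old shard and stream only the new pairs into it, rather than rebuilding the shard from every pair. It assumes a dict as the stand-in for local persistence and assumes that a new value simply overwrites the old one for the same key; the deck does not specify the update semantics, and `update_shard` is a hypothetical name.

```python
def update_shard(old_shard, new_pairs):
    """Produce the next version of a shard from the previous one.

    Assumption: last write wins for a repeated key. The old version is
    left untouched so that it remains available as an earlier version.
    """
    new_shard = dict(old_shard)  # stands in for "download old shard from DFS"
    for key, value in new_pairs:
        new_shard[key] = value   # stream only the new pairs into the LP
    return new_shard             # the real flow would upload this as a new version


old = {b"a": b"1"}
new = update_shard(old, [(b"a", b"2"), (b"b", b"3")])
```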
    • 38. Incremental ElephantDB
    • 39. Configuring a ring
    • 40. Querying ElephantDB
    • 41. Consuming ElephantDB from MapReduce
        • Read from shards in DFS, not from live ElephantDB servers
        • “Indexed key/value file format on Hadoop”
    • 42. Demo
    • 43. One-click deploy
    • 44. CAP Theorem: In a distributed database, pick at most two: Consistency, Availability, Partition Tolerance
    • 45. ElephantDB and CAP: Availability, Partition Tolerance
    • 46. ElephantDB and CAP: But, a very predictable kind of consistency
    • 47. Comparison to HBase
        • HBase can be written to in batch
        • Supports random writes
        • So why ElephantDB?
    • 48. Comparison to HBase
        • HBase has an enormous complexity cost
        • HBase has many moving parts
        • Does not “just work”
    • 49. Comparison to HBase
        • HBase is not highly available
        • ElephantDB has ability to revert to older versions of indexes
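
Reverting to an older version follows naturally from domains being versioned. The sketch below assumes, for illustration only, that versions are identified by increasing integers (such as timestamps) and that an operator can mark a version as bad; neither detail is stated in the deck, and `current_version` is a hypothetical helper.

```python
def current_version(versions, bad=frozenset()):
    """Pick the newest version of a domain that has not been marked bad.

    Returns None when no usable version exists.
    """
    good = [v for v in versions if v not in bad]
    return max(good) if good else None


versions = [100, 200, 300]                          # hypothetical version ids
latest = current_version(versions)                  # 300
reverted = current_version(versions, bad={300})     # 200
```

Since every version is a complete, immutable set of shard files, serving an older one only requires pointing the servers at a different folder.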
    • 50. Implementation details
        • Written in Clojure
        • Thrift interface
        • Local Persistences are pluggable
    • 51. Questions? Twitter: @nathanmarz · Email: nathan.marz@gmail.com · Web: http://nathanmarz.com
