
Transcript

  • 1. ElephantDB. Nathan Marz, BackType (@nathanmarz)
  • 2. Specialized database for exporting key/value data from Hadoop
  • 3. [Diagram: ElephantDB servers read files from a distributed filesystem and answer key/value queries]
  • 4. First, some context
  • 5. BackType’s Challenges
  • 6. BackType’s Challenges Complex analytics
  • 7. BackType’s Challenges: Complex analytics on lots of data (> 30TB)
  • 8. BackType’s Challenges: Complex analytics on lots of data (> 30TB) in realtime
  • 9. What is a data system?
  • 10. Data system [diagram: Raw data -> View 1, View 2, View 3]
  • 11. Data system [diagram: Tweets -> # Tweets / URL, Influence scores, Trending topics]
  • 12. Our approach: Speed Layer, Batch Layer
  • 13. Batch layer [diagram: Master dataset -> MapReduce -> Batch View 1, Batch View 2, Batch View 3]
  • 14. (this is too slow in practice)
  • 15. Incremental batch layer [diagram labels: New data, Append, All data, Query, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
  • 16. Batch views are always out of date by a few hours
  • 17. Speed layer [diagram: Messages -> DB, DB]
  • 18. Compensate for high latency of updates to batch layer
  • 19. Only needs to worry about last few hours of data
  • 20. Application-level queries [diagram: Batch Layer Query + Speed Layer Query -> Merge]
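The merge step on slide 20 can be illustrated with a minimal sketch. It assumes the query is a simple count (such as tweets per URL), that the batch view covers everything except the last few hours, and that the speed layer covers only those last few hours. The function name `merged_count` and the sample data are hypothetical and not part of ElephantDB.

```python
from typing import Mapping


def merged_count(key: str,
                 batch_view: Mapping[str, int],
                 speed_view: Mapping[str, int]) -> int:
    """Answer a count query by adding the batch result and the speed-layer result."""
    # Hypothetical merge rule: counts are additive, so the two layers are summed.
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Sample data (made up): the batch view lags by a few hours,
# the speed view holds only what arrived since the last batch run.
batch_view = {"http://example.com/a": 120}
speed_view = {"http://example.com/a": 3, "http://example.com/b": 1}

total = merged_count("http://example.com/a", batch_view, speed_view)
```

Other query types would need a different merge function; summation only works because counts are additive.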
  • 21. Batch layer [diagram: Master dataset -> MapReduce -> Batch View 1, Batch View 2, Batch View 3, each stored in ElephantDB]
  • 22. ElephantDB
      • Random reads
      • Batch writes
      • No random writes
  • 23. Why ElephantDB?
      • Simplicity
      • Ease of use
      • Trivially scalable
      • “Just works”
  • 24. Creating an ElephantDB index is disassociated from serving an index
  • 25. ElephantDB indexing data flow [diagram: key/value pairs -> MapReduce -> sharded key/value database on distributed filesystem]
  • 26. Index creation
      • ElephantDB just used as a library
      • No dependencies on a live ElephantDB cluster
  • 27. ElephantDB serving data flow [diagram: shards on distributed filesystem -> ElephantDB servers, each holding several shards]
  • 28. ElephantDB serving
      • Each server in “ring” serves a subset of the data
      • Download shards, open, serve
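As a rough illustration of "each server serves a subset of the data", the sketch below spreads shard numbers over the hosts of a ring. The round-robin policy, the function name `assign_shards`, and the host names are assumptions made for this example; the slides do not say how ElephantDB actually assigns shards to servers.

```python
from typing import Dict, List


def assign_shards(num_shards: int, hosts: List[str]) -> Dict[str, List[int]]:
    """Assign each shard index to one host, round-robin (assumed policy)."""
    assignment: Dict[str, List[int]] = {host: [] for host in hosts}
    for shard in range(num_shards):
        # Shard i goes to host i mod (number of hosts).
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment


# Hypothetical ring of three servers serving a domain with eight shards.
ring = assign_shards(8, ["edb1", "edb2", "edb3"])
```

Each server would then download only the shards in its own list before opening and serving them.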
  • 29. Terminology
      • “Domain”: Related set of key/value pairs
      • “Shard”: Subset of a domain
      • “Ring”: Cluster of servers that work together to serve same set of domains
      • “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)
  • 30. ElephantDB domain
      • Versioned
      • domain-spec.yaml contains metadata (number of shards, how domains are stored)
  • 31. ElephantDB version
      • Folder for each shard
  • 32. ElephantDB Shard
      • Files for local persistence
      • Berkeley DB JE in this example
  • 33. Writing to ElephantDB: Cascading
  • 34. Writing to ElephantDB: Cascalog
  • 35. MapReduce flow: (key, value) -> hash mod -> (shard, key, value) -> group by shard -> (shard, list<key, value>) -> stream into LP -> Local Persistence on LFS -> upload -> Shard on DFS
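The "hash mod" and "group by shard" steps of this flow can be sketched in a few lines. This is only an illustration: the slide does not name the hash function, so MD5 is an assumption here, and `shard_for` and `group_by_shard` are hypothetical names rather than ElephantDB functions.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[bytes, bytes]


def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index: hash the key, then take it mod the shard count."""
    digest = hashlib.md5(key).digest()  # assumed hash; any stable hash would do
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(pairs: Iterable[Pair], num_shards: int) -> Dict[int, List[Pair]]:
    """Collect (key, value) pairs into one list per shard."""
    shards: Dict[int, List[Pair]] = defaultdict(list)
    for key, value in pairs:
        shards[shard_for(key, num_shards)].append((key, value))
    return dict(shards)
```

In the real flow each per-shard list would then be streamed into a local persistence on the local filesystem and uploaded to the distributed filesystem. A stable hash matters because a reader must compute the same shard for a key as the writer did.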
  • 36. Incremental ElephantDB: (key, value) -> hash mod -> (shard, key, value) -> group by shard -> (shard, list<key, value>) -> stream into LP; Old shard on DFS -> download -> Local Persistence on LFS -> upload -> New shard on DFS
  • 37. Incremental ElephantDB
      • Avoid reindexing domain from scratch
      • Massive performance benefits in creating/updating versions
      • ElephantDB ring still has to download domain from scratch
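A minimal sketch of the incremental idea: start from a copy of the old shard and stream only the new pairs into it, instead of rebuilding the shard from all data. A plain dict stands in for the local persistence, and "newer value overwrites older value" is an assumed update rule; the slides do not specify how existing keys are merged. The name `update_shard` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def update_shard(old_shard: Dict[bytes, bytes],
                 new_pairs: Iterable[Tuple[bytes, bytes]]) -> Dict[bytes, bytes]:
    """Build the next version of a shard from the previous version plus new pairs."""
    # "Download": work on a copy so the old version stays intact.
    new_shard = dict(old_shard)
    # "Stream into LP": apply only the new pairs (assumed rule: last write wins).
    for key, value in new_pairs:
        new_shard[key] = value
    # The caller would then "upload" new_shard as a new version.
    return new_shard
```

Because the old shard is never modified, the previous version remains available after the new one is written.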
  • 38. Incremental ElephantDB
  • 39. Configuring a ring
  • 40. Querying ElephantDB
  • 41. Consuming ElephantDB from MapReduce
      • Read from shards in DFS, not from live ElephantDB servers
      • “Indexed key/value file format on Hadoop”
  • 42. Demo
  • 43. One-click deploy
  • 44. CAP Theorem: In a distributed database, pick at most two: Consistency, Availability, Partition Tolerance
  • 45. ElephantDB and CAP: Availability, Partition Tolerance
  • 46. ElephantDB and CAP: But, a very predictable kind of consistency
  • 47. Comparison to HBase
      • HBase can be written to in batch
      • Supports random writes
      • So why ElephantDB?
  • 48. Comparison to HBase
      • HBase has an enormous complexity cost
      • HBase has many moving parts
      • Does not “just work”
  • 49. Comparison to HBase
      • HBase is not highly available
      • ElephantDB has ability to revert to older versions of indexes
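One way to picture reverting to an older version of an index: if versions are identified by increasing numbers and the newest one is served, then dropping the newest version makes the previous one current again. Both the numbering scheme and this revert procedure are assumptions for illustration; the slides only state that domains are versioned and that reverting is possible. `current_version` and `revert` are hypothetical names.

```python
from typing import List, Optional


def current_version(versions: List[int]) -> Optional[int]:
    """Return the version that would be served: the highest number, if any."""
    return max(versions) if versions else None


def revert(versions: List[int]) -> List[int]:
    """Drop the newest version so the previous one becomes current."""
    newest = current_version(versions)
    return [v for v in versions if v != newest]


# Hypothetical version ids (for example, creation timestamps).
versions = [1300000000, 1300086400]
after_revert = revert(versions)
```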
  • 50. Implementation details
      • Written in Clojure
      • Thrift interface
      • Local Persistences are pluggable
  • 51. Questions? Twitter: @nathanmarz, Email: nathan.marz@gmail.com, Web: http://nathanmarz.com