ElephantDB

Published in: Technology

  1. ElephantDB. Nathan Marz, BackType, @nathanmarz
  2. Specialized database for exporting key/value data from Hadoop
  3. [Diagram] Files on a distributed filesystem → ElephantDB servers → key/value queries
  4. First, some context
  5. BackType’s Challenges
  6. BackType’s Challenges: Complex analytics
  7. BackType’s Challenges: Complex analytics on lots of data (> 30TB)
  8. BackType’s Challenges: Complex analytics on lots of data (> 30TB) in realtime
  9. What is a data system?
  10. Data system: Raw data → View 1, View 2, View 3
  11. Data system: Tweets → # Tweets / URL, Influence scores, Trending topics
  12. Our approach: Speed Layer, Batch Layer
  13. Batch layer: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3
  14. (this is too slow in practice)
  15. Incremental batch layer [diagram labels: New data, Append, All data, Query, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
  16. Batch views are always out of date by a few hours
  17. Speed layer: Messages → DB, DB
  18. Compensate for high latency of updates to batch layer
  19. Only needs to worry about last few hours of data
  20. Application-level queries: Batch layer query + Speed layer query → Merge
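
The merge on slide 20 can be illustrated with a minimal Python sketch. It assumes the view being queried is the "# Tweets / URL" count from slide 11 and that merging means adding the two counts; the function name, the dictionaries standing in for the two layers, and the example URLs are all hypothetical.

```python
def query_tweet_count(url, batch_view, speed_view):
    """Answer an application-level query by merging both layers.

    batch_view: counts computed by the batch layer (hours out of date).
    speed_view: counts for data that arrived since the last batch run.
    Summing is an assumption that fits additive views such as counts.
    """
    return batch_view.get(url, 0) + speed_view.get(url, 0)


# Hypothetical view contents, for illustration only.
batch_view = {"http://example.com/a": 120, "http://example.com/b": 7}
speed_view = {"http://example.com/a": 3, "http://example.com/c": 1}

total = query_tweet_count("http://example.com/a", batch_view, speed_view)
```

Views that are not additive (for example, trending topics) would need a different merge function; the sketch only covers the simplest case.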
  21. Batch layer: Master dataset → MapReduce → ElephantDB (Batch View 1, Batch View 2, Batch View 3)
  22. ElephantDB
      • Random reads
      • Batch writes
      • No random writes
  23. Why ElephantDB?
      • Simplicity
      • Ease of use
      • Trivially scalable
      • “Just works”
  24. Creating an ElephantDB index is disassociated from serving an index
  25. ElephantDB indexing data flow: key/value pairs → MapReduce → sharded key/value database on the distributed filesystem
  26. Index creation
      • ElephantDB just used as a library
      • No dependencies on a live ElephantDB cluster
  27. ElephantDB serving data flow: Distributed filesystem → shards → ElephantDB servers
  28. ElephantDB serving
      • Each server in “ring” serves a subset of the data
      • Download shards, open, serve
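
One way a ring could divide shards among its servers is sketched below. The slides do not say how ElephantDB actually assigns shards, so the round-robin rule, the function name `assign_shards`, and the host names are assumptions made purely for illustration.

```python
def assign_shards(hosts, num_shards):
    """Give each host in the ring a subset of the shard numbers.

    Round-robin assignment is a hypothetical policy, not necessarily
    the one ElephantDB uses.
    """
    assignment = {host: [] for host in hosts}
    for shard in range(num_shards):
        host = hosts[shard % len(hosts)]
        assignment[host].append(shard)
    return assignment


# Two hypothetical servers sharing four shards, as in the slide 27 diagram.
ring = assign_shards(["edb1.example.com", "edb2.example.com"], 4)
```

Each server would then download only its own shards from the distributed filesystem, open them, and serve reads.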
  29. Terminology
      • “Domain”: Related set of key/value pairs
      • “Shard”: Subset of a domain
      • “Ring”: Cluster of servers that work together to serve the same set of domains
      • “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)
  30. ElephantDB domain
      • Versioned
      • domain-spec.yaml contains metadata (number of shards, how domains are stored)
  31. ElephantDB version
      • Folder for each shard
  32. ElephantDB Shard
      • Files for local persistence
      • Berkeley DB JE in this example
  33. Writing to ElephantDB: Cascading
  34. Writing to ElephantDB: Cascalog
  35. MapReduce flow: (key, value) → hash mod → (shard, key, value) → group by shard → (shard, list<key, value>) → stream into LP → local persistence on LFS → upload → shard on DFS
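
The flow on slide 35 can be sketched in plain Python, outside Hadoop. This is an illustration under stated assumptions: the MD5-based hash, the function names `shard_for` and `build_shards`, and the use of an in-memory dict as a stand-in for the local persistence are all hypothetical and not taken from ElephantDB itself.

```python
import hashlib
from collections import defaultdict


def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard with 'hash mod'. MD5 is an assumed hash."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(pairs, num_shards):
    """Turn (key, value) pairs into one key/value store per shard."""
    # Map step: (key, value) -> (shard, key, value), then group by shard.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[shard_for(key, num_shards)].append((key, value))

    # Reduce step: stream each group into a local persistence.
    # A dict stands in for a real store such as Berkeley DB JE.
    shards = {}
    for shard, key_values in grouped.items():
        local_persistence = {}
        for key, value in key_values:
            local_persistence[key] = value
        shards[shard] = local_persistence
    return shards


pairs = [(f"key{i}".encode(), f"value{i}".encode()) for i in range(100)]
shards = build_shards(pairs, num_shards=4)
```

In the real flow each finished local persistence would be uploaded from the local filesystem to its shard folder on the distributed filesystem; the sketch stops before that step.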
  36. Incremental ElephantDB: same flow as above, except that the old shard is downloaded from DFS into the local persistence on LFS before the grouped pairs are streamed into it, and the result is uploaded as a new shard on DFS
  37. Incremental ElephantDB
      • Avoid reindexing domain from scratch
      • Massive performance benefits in creating/updating versions
      • ElephantDB ring still has to download domain from scratch
  38. Incremental ElephantDB
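
The incremental update described on slides 36 and 37 can be sketched as follows. A dict again stands in for a shard's local persistence, and the rule that a new value overwrites the old value for the same key is an assumption; the slides do not state how conflicts are resolved. The name `update_shard` is hypothetical.

```python
def update_shard(old_shard, new_pairs):
    """Build the next version of a shard from the previous one.

    old_shard: key/value contents of the shard as stored on DFS.
    new_pairs: (key, value) pairs routed to this shard by the new run.
    """
    # "Download": start from a copy so the old version stays intact.
    new_shard = dict(old_shard)
    # Stream only the new pairs in, instead of reindexing everything.
    for key, value in new_pairs:
        new_shard[key] = value  # assumed policy: newest value wins
    # "Upload": the caller stores the result as a new version.
    return new_shard


old_version = {b"a": b"1", b"b": b"2"}
new_version = update_shard(old_version, [(b"b", b"20"), (b"c", b"3")])
```

Because the old version is left untouched, it remains available if the new one has to be abandoned.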
  39. Configuring a ring
  40. Querying ElephantDB
  41. Consuming ElephantDB from MapReduce
      • Read from shards in DFS, not from live ElephantDB servers
      • “Indexed key/value file format on Hadoop”
  42. Demo
  43. One-click deploy
  44. CAP Theorem: In a distributed database, pick at most two: Consistency, Availability, Partition Tolerance
  45. ElephantDB and CAP: Availability, Partition Tolerance
  46. ElephantDB and CAP: But, a very predictable kind of consistency
  47. Comparison to HBase
      • HBase can be written to in batch
      • Supports random writes
      • So why ElephantDB?
  48. Comparison to HBase
      • HBase has an enormous complexity cost
      • HBase has many moving parts
      • Does not “just work”
  49. Comparison to HBase
      • HBase is not highly available
      • ElephantDB has ability to revert to older versions of indexes
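
The ability to revert to an older index version (slide 49) follows naturally from domains being versioned (slide 30). The sketch below shows the idea with an in-memory class; `VersionedDomain`, its methods, and the integer version ids are hypothetical and do not reflect ElephantDB's actual interface or on-disk layout.

```python
class VersionedDomain:
    """Keeps every published version; reads go to the newest one."""

    def __init__(self):
        self._versions = {}  # version id -> {key: value}

    def publish(self, version, data):
        # Each batch run adds a complete new version.
        self._versions[version] = dict(data)

    def current_version(self):
        return max(self._versions)

    def get(self, key):
        return self._versions[self.current_version()].get(key)

    def revert(self):
        # Drop the newest version so reads fall back to the previous one.
        if len(self._versions) < 2:
            raise ValueError("no older version to revert to")
        del self._versions[self.current_version()]


domain = VersionedDomain()
domain.publish(1, {"url-a": 120})
domain.publish(2, {"url-a": 999})  # suppose this run produced bad data
domain.revert()
restored = domain.get("url-a")
```

Since versions are only ever written in batch and never modified in place, reverting is just a matter of choosing which version to serve.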
  50. Implementation details
      • Written in Clojure
      • Thrift interface
      • Local Persistences are pluggable
  51. Questions?
      • Twitter: @nathanmarz
      • Email: nathan.marz@gmail.com
      • Web: http://nathanmarz.com
