ElephantDB

Published in: Technology

    1. ElephantDB. Nathan Marz, BackType, @nathanmarz
    2. Specialized database for exporting key/value data from Hadoop
    3. [Diagram] Files on a distributed filesystem → ElephantDB servers → key/value queries
    4. First, some context
    5. BackType’s Challenges
    6. BackType’s Challenges: complex analytics
    7. BackType’s Challenges: complex analytics on lots of data (> 30 TB)
    8. BackType’s Challenges: complex analytics on lots of data (> 30 TB) in realtime
    9. What is a data system?
    10. Data system: [Diagram] Raw data → View 1, View 2, View 3
    11. Data system: [Diagram] Tweets → # Tweets / URL, Influence scores, Trending topics
    12. Our approach: Speed Layer, Batch Layer
    13. Batch layer: [Diagram] Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3
    14. (this is too slow in practice)
    15. Incremental batch layer: [Diagram labels: New data, Append, All data, Query, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
    16. Batch views are always out of date by a few hours
    17. Speed layer: [Diagram] Messages → DB, DB
    18. Compensate for high latency of updates to batch layer
    19. Only needs to worry about last few hours of data
    20. Application-level queries: [Diagram] Batch Layer Query + Speed Layer Query → Merge
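
The merge on slide 20 can be illustrated with a small sketch. It assumes the view in question is the per-URL tweet count from slide 11, that the batch view covers everything up to a few hours ago while the speed view covers only what has arrived since, and that merging means adding the two counts. The function name and the dict-backed views are hypothetical stand-ins, not ElephantDB's API.

```python
def merged_tweet_count(url, batch_view, speed_view):
    """Answer an application-level query by combining both layers.

    batch_view: counts computed by the batch layer (hours out of date).
    speed_view: counts for data that arrived after the last batch run.
    """
    batch_count = batch_view.get(url, 0)
    speed_count = speed_view.get(url, 0)
    return batch_count + speed_count


# Hypothetical view contents, for illustration only.
batch_view = {"http://example.com/a": 120, "http://example.com/b": 7}
speed_view = {"http://example.com/a": 3, "http://example.com/c": 1}
```

Other views may need a different merge function than addition; the point is only that the application queries both layers and combines the answers.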
    21. Batch layer: [Diagram] Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3, each stored in ElephantDB
    22. ElephantDB
        • Random reads
        • Batch writes
        • Random writes (not supported)
    23. Why ElephantDB?
        • Simplicity
        • Ease of use
        • Trivially scalable
        • “Just works”
    24. Creating an ElephantDB index is disassociated from serving an index
    25. ElephantDB indexing data flow: [Diagram] key/value pairs → MapReduce → sharded key/value database on the distributed filesystem
    26. Index creation
        • ElephantDB just used as a library
        • No dependencies on a live ElephantDB cluster
    27. ElephantDB serving data flow: [Diagram] Shards on the distributed filesystem → ElephantDB servers, each holding several shards
    28. ElephantDB serving
        • Each server in “ring” serves a subset of the data
        • Download shards, open, serve
    29. Terminology
        • “Domain”: Related set of key/value pairs
        • “Shard”: Subset of a domain
        • “Ring”: Cluster of servers that work together to serve the same set of domains
        • “Local persistence”: Regular key/value database that implements a shard (like Berkeley DB)
    30. ElephantDB domain
        • Versioned
        • domain-spec.yaml contains metadata (number of shards, how domains are stored)
    31. ElephantDB version
        • Folder for each shard
    32. ElephantDB shard
        • Files for local persistence
        • Berkeley DB JE in this example
    33. Writing to ElephantDB: Cascading
    34. Writing to ElephantDB: Cascalog
    35. MapReduce flow: [Diagram] (key, value) → hash mod → (shard, key, value) → group by shard → (shard, list<key, value>) → stream into LP → Local Persistence on LFS → upload → Shard on DFS
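
The "hash mod" and "group by shard" steps of slide 35 can be sketched as follows. This is a minimal illustration, not ElephantDB's code: the choice of MD5 as the hash, the function names, and the use of a plain dict in place of a local persistence such as Berkeley DB are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index with 'hash mod num_shards'.

    MD5 is an assumed, stable hash for illustration; the hash that
    ElephantDB actually uses may differ.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(pairs, num_shards):
    """Group (key, value) pairs by shard.

    Each shard is modelled as a dict standing in for the local
    persistence that would be built on the local filesystem and then
    uploaded to the distributed filesystem.
    """
    shards = defaultdict(dict)
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key] = value
    return dict(shards)
```

Because the shard index depends only on the key and the shard count, a reader can later locate a key's shard with the same function, provided it uses the same number of shards.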
    36. Incremental ElephantDB: [Diagram] (key, value) → hash mod → (shard, key, value) → group by shard → (shard, list<key, value>) → download old shard from DFS → stream into LP → Local Persistence on LFS → upload → New shard on DFS
    37. Incremental ElephantDB
        • Avoid reindexing domain from scratch
        • Massive performance benefits in creating/updating versions
        • ElephantDB ring still has to download domain from scratch
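
The incremental flow on slides 36 and 37 can be sketched for a single shard. The sketch assumes that a new value simply replaces the old value for the same key, which may not match every domain's update logic, and it uses a dict as a stand-in for the downloaded local persistence. The function name is hypothetical.

```python
def update_shard(old_shard, new_pairs):
    """Build a new shard version from the old one plus new data.

    old_shard: dict standing in for the old shard downloaded from DFS.
    new_pairs: iterable of (key, value) pairs routed to this shard.
    Returns a new dict; the old shard is left untouched, so the
    previous version stays intact.
    """
    new_shard = dict(old_shard)      # start from the old shard's contents
    for key, value in new_pairs:     # stream only the new pairs in
        new_shard[key] = value       # assumed semantics: overwrite
    return new_shard
```

Only the new pairs are processed, which is where the saving over reindexing the whole domain comes from.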
    38. Incremental ElephantDB
    39. Configuring a ring
    40. Querying ElephantDB
    41. Consuming ElephantDB from MapReduce
        • Read from shards in DFS, not from live ElephantDB servers
        • “Indexed key/value file format on Hadoop”
    42. Demo
    43. One-click deploy
    44. CAP Theorem: In a distributed database, pick at most two: Consistency, Availability, Partition Tolerance
    45. ElephantDB and CAP: Availability, Partition Tolerance
    46. ElephantDB and CAP: But, a very predictable kind of consistency
    47. Comparison to HBase
        • HBase can be written to in batch
        • Supports random writes
        • So why ElephantDB?
    48. Comparison to HBase
        • HBase has an enormous complexity cost
        • HBase has many moving parts
        • Does not “just work”
    49. Comparison to HBase
        • HBase is not highly available
        • ElephantDB has ability to revert to older versions of indexes
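
Slide 30 says domains are versioned and slide 49 says ElephantDB can revert to older versions of an index. A rough sketch of that idea, assuming versions are identified by increasing integers (for example timestamps) and that the highest one is served. The function names are hypothetical and this is not ElephantDB's actual mechanism.

```python
def version_to_serve(versions):
    """Pick the version to serve: the highest version id (assumed rule)."""
    if not versions:
        raise ValueError("domain has no versions")
    return max(versions)


def revert(versions):
    """Drop the newest version so the previous one is served again."""
    if len(versions) < 2:
        raise ValueError("no older version to revert to")
    return sorted(versions)[:-1]
```

Since each version is a complete, immutable set of shards, reverting amounts to pointing the servers at an older version rather than undoing individual writes.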
    50. Implementation details
        • Written in Clojure
        • Thrift interface
        • Local persistences are pluggable
    51. Questions? Twitter: @nathanmarz. Email: nathan.marz@gmail.com. Web: http://nathanmarz.com
