MapReduce and NoSQL

  1. MapReduce and NoSQL: Exploring the Solution Space
  2. Changing the Game
  3. Cost (chart): Conventional vs. MR + NoSQL; axes: Dollars, Data Volume
  4. Scale (chart): Conventional vs. MR + NoSQL; axes: Dollars, Data Volume
  5. Performance Profiles
  6. Performance Profiles (chart): MapReduce and NoSQL, each on a Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  7. Performance Profiles (chart): MapReduce, NoSQL; categories: Throughput, Bulk Update, Latency, Seek
  8. Performance Profiles (chart): MapReduce, NoSQL; Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  9. Data Goals: Collect, Serve, Analyze
  10. Traditional Approach (diagram): Users, Apps; Transactional System (OLTP); Analytical System (OLAP); goals: Collect, Serve, Analyze
  11. One Approach (diagram): Users, Apps, NoSQL, Hadoop MapReduce, HDFS; goals: Collect, Serve, Analyze
  12. Analysis Challenges: analytical latency (data is always old; answers can take a long time); serving up analytical results (higher cost, complexity); incremental updates
  13. Analysis Challenges: Word Counts (a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19); New Document: “The red aardvarks live in holes.”
  14. Analysis Challenges: HDFS log files in sources/routers, sources/apps, sources/webservers; MapReduce over data from all sources for the week of Jan 13th
  15. One Approach (diagram): Users, Apps, NoSQL, Hadoop MapReduce, HDFS; goals: Collect, Serve, Analyze
  16. Possible Approach (diagram): Users, Apps, NoSQL, Hadoop MapReduce; goals: Collect, Serve, Analyze
  17. Have to be careful to avoid creating a system that’s bad at everything
  18. What could go wrong?
  19. Performance Profiles (chart): MapReduce, NoSQL; Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  20. Performance Profiles (chart): MapReduce, NoSQL, MapReduce on NoSQL; Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  21. Performance Profiles (chart): MapReduce, NoSQL, MapReduce on NoSQL; Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  22. Performance Profiles (chart): MapReduce on NoSQL; Good-to-Bad scale; categories: Throughput, Bulk Update, Latency, Seek
  23. Best Practices: use a NoSQL db that has good throughput (it helps to do local communication); isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries; if MR output is written back to the NoSQL db, it is immediately available for query
  24. The Interllective: Concept-Based Search
  25. Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
  26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
  27. Feature Vectors; Unstructured Data
  28. www.interllective.com
  29. MapReduce on MongoDB: built-in MapReduce (JavaScript); mongo-hadoop; MongoReduce
  30. MongoReduce
  31. Diagram labels: Config Replicas, App servers, Shards (S and P nodes), mongos
  32. Diagram labels: Config Replicas, App servers, MR Workers, Shards, mongos
  33. Diagram labels: Config Replicas, App servers, Single job, Shards (P nodes), mongos
  34. Mappers read directly from a single mongod process, not through mongos, so reads tend to be local; the balancer can be turned off to avoid the potential for mongos reading data twice (diagram labels: MongoDB, mongod, Map, HDFS)
  35. MongoReduce: only primaries do writes; schedule mappers on secondaries; intermediate output goes to HDFS (diagram labels: MongoDB, mongod, Map, HDFS, Reduce, mongos)
  36. MongoReduce: final output can go to HDFS or MongoDB (diagram labels: mongod, HDFS, Reduce, mongos)
  37. MongoReduce: mappers can just write to global MongoDB through mongos (diagram labels: mongod, Map, HDFS, mongos)
  38. What’s Going On? Diagram labels: Map (x4); r1, r2, r3 (x3); mongos (x2); P (x3); Identity Reducer
  39. MongoReduce: instead of specifying an HDFS directory for input, can submit MongoDB query and select statements: q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}; s = {authors: true}. Queries use indexes!
  40. MongoReduce: if outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced; can choose the shard key; reducers can choose to call update()
  41. MongoReduce: if writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID
  42. https://github.com/acordova/MongoReduce
  43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader, YourOutputFormat, YourRecordWriter, YourOutputCommitter
  44. Brisk
  45. Accumulo
  46. Accumulo: based on Google's BigTable design; uses Apache Hadoop, ZooKeeper, and Thrift; features a few novel improvements on the BigTable design: cell-level access labels and a server-side programming mechanism called Iterators
  47. Accumulo: based on Google's BigTable design; uses Apache Hadoop, ZooKeeper, and Thrift; features a few novel improvements on the BigTable design: cell-level access labels and a server-side programming mechanism called Iterators
  48. MapReduce and Accumulo: can do regular ol’ MapReduce just like w/ MongoDB, but can use Iterators to achieve a kind of ‘continual MapReduce’
  49. Diagram labels: TabletServers, App servers, Master, Ingest clients
  50. Diagram labels: TabletServer, App server, Reduce’, map(), Ingest client; WordCount table: live:142, in:2342, holes:234
  51. Diagram labels: TabletServer, App server, Reduce’, map(), Ingest client; table: live:142, in:2342, holes:234; new document: “The red aardvarks live in holes.”
  52. Diagram labels: TabletServer, App server, Reduce’, map(), Ingest client; table: live:142, in:2342, holes:234; new entries: aardvarks:1, live:1, in:1, holes:1
  53. Diagram labels: TabletServer, App server, Reduce’, map(), Ingest client; table: aardvarks:1, live:143, in:2343, holes:235
  54. Accumulo diagram labels: Map (x2), map() (x2), client (x2); r1, r2, r3 (x3); reduce’()
  55. Iterators: keys and values have the form row : column family : column qualifier : ts -> value; can specify which key elements are unique, e.g. row : column family; can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min()
  56. Key to performance: when the functions are run. Rather than atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, minor compaction time, or major compaction time
  57. TabletServer, scan (read): entries aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; Reduce’ sum() yields live:143
  58. TabletServer, major compact: entries aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; Reduce’ sum() yields aardvark:1, live:143, in:2343, holes:235
  59. Reduce’ (prime): because a function has not seen all values for a given key (another may show up); more like writing a MapReduce combiner function
  60. ‘Continuous’ MapReduce: can maintain huge result sets that are always available for query; update graph edge weights; update feature vector weights; statistical counts (normalize after query to get probabilities)
  61. Accumulo: Latin, “to accumulate” ...
  62. Accumulo: Latin, “to accumulate” ... awesomeness
  63. incubator.apache.org/accumulo; wiki.apache.org/incubator/AccumuloProposal
  64. Google Percolator: a system for incrementally processing updates to a large data set; used to create the Google web search index; reduced the average age of documents in Google search results by 50%
  65. Google Percolator: a novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable
  66. Solution Space: incremental update, multi-row consistency: Percolator; results can’t be broken down (sort): MapReduce; no multi-row updates: BigTable; computation is small: Traditional DBMS
  67. Questions
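
The query and select statements on slide 39 can be illustrated without a running database. What follows is a minimal sketch, not MongoReduce's or MongoDB's implementation: `matches` and `run_query` are hypothetical in-memory helpers that understand only equality and `$in`, and the sample documents are invented for illustration.

```python
# Query (q) and select (s) documents from slide 39, written as Python dicts.
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}


def matches(doc, query):
    """Return True if doc satisfies query (supports equality and $in only)."""
    for field, cond in query.items():
        if isinstance(cond, dict) and "$in" in cond:
            if doc.get(field) not in cond["$in"]:
                return False
        elif doc.get(field) != cond:
            return False
    return True


def run_query(docs, query, select):
    """Hypothetical helper: filter docs by query, keep only selected fields."""
    wanted = [field for field, keep in select.items() if keep]
    return [
        {field: doc[field] for field in wanted if field in doc}
        for doc in docs
        if matches(doc, query)
    ]


# Invented sample documents, for illustration only.
docs = [
    {"article_source": "nytimes.com", "authors": ["A. Writer"], "title": "t1"},
    {"article_source": "example.org", "authors": ["B. Writer"], "title": "t2"},
    {"article_source": "wsj.com", "authors": ["C. Writer"], "title": "t3"},
]
result = run_query(docs, q, s)
```

With a real driver such as PyMongo, the same two dicts would be passed as the filter and projection arguments of `Collection.find`; a real projection also returns `_id` unless it is excluded.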
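
Slide 41's advice about idempotent writes can be sketched as follows. This is an illustration under stated assumptions, not MongoReduce code: a plain dict stands in for the output collection, and `deterministic_id` is a hypothetical scheme that hashes the reduce key to get a stable id.

```python
import hashlib
import uuid


def deterministic_id(key):
    """Derive a stable id from the reduce key (hypothetical scheme)."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def write_outputs(collection, outputs, make_id):
    """Store each (key, value) pair in collection, a dict keyed by id."""
    for key, value in outputs:
        collection[make_id(key)] = {"key": key, "value": value}


outputs = [("live", 143), ("in", 2343), ("holes", 235)]

# Deterministic ids: a retried reducer overwrites its earlier writes.
stable = {}
write_outputs(stable, outputs, deterministic_id)
write_outputs(stable, outputs, deterministic_id)  # simulated task retry

# Random ids: the retry leaves a second copy of every record.
duplicated = {}
write_outputs(duplicated, outputs, lambda key: uuid.uuid4().hex)
write_outputs(duplicated, outputs, lambda key: uuid.uuid4().hex)
```

After the simulated retry, `stable` still holds one record per key, while `duplicated` holds two. This matters because MapReduce frameworks may re-run a failed or speculative task.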
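
The word-count walkthrough on slides 50 to 58 (write every value, apply Reduce’ at scan or compaction time) can be sketched in a few lines. `SummingTable` is a toy stand-in invented for this illustration, not Accumulo's API. Unlike the slides, which show only part of the table, this sketch also counts "the" and "red".

```python
import re
from collections import defaultdict


class SummingTable:
    """Toy table: appends every value, sums per key at scan or compaction."""

    def __init__(self, initial=None):
        self.entries = defaultdict(list)  # key -> list of unsummed values
        for key, value in (initial or {}).items():
            self.entries[key].append(value)

    def write(self, key, value):
        # No lock, read, +1, write cycle: just append the new value.
        self.entries[key].append(value)

    def scan(self):
        # Reduce' at read time: sum whatever values are present so far.
        return {key: sum(values) for key, values in self.entries.items()}

    def compact(self):
        # Reduce' at compaction time: replace each list with its sum.
        for key in self.entries:
            self.entries[key] = [sum(self.entries[key])]


def map_words(document):
    """map(): emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in re.findall(r"[a-z]+", document.lower())]


# Starting counts and the new document are taken from the slides.
table = SummingTable({"live": 142, "in": 2342, "holes": 234})
for word, one in map_words("The red aardvarks live in holes."):
    table.write(word, one)

counts = table.scan()   # summed on read, before any compaction
table.compact()         # afterwards each key stores a single summed value
```

Appending and summing later avoids a read-modify-write on every update, which is the performance point made on slide 56.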
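
Slide 55 says an Iterator can be told which key elements are unique and which function to run over values that share them. The sketch below is an assumption-laden illustration rather than Accumulo's Iterator interface: `combine` is a hypothetical function, keys are modeled as `(row, column family, column qualifier, ts)` tuples, and the column names are invented.

```python
from collections import defaultdict


def combine(entries, key_parts, func):
    """Group values by the first key_parts elements of each key, then apply func."""
    groups = defaultdict(list)
    for key, value in entries:
        groups[key[:key_parts]].append(value)
    return {prefix: func(values) for prefix, values in groups.items()}


# Keys are (row, column family, column qualifier, ts); column names are invented.
entries = [
    (("live", "count", "batch1", 1), 142),
    (("live", "count", "batch2", 2), 1),
    (("in", "count", "batch1", 1), 2342),
    (("in", "count", "batch2", 2), 1),
]

totals = combine(entries, 2, sum)   # unique on row : column family
largest = combine(entries, 2, max)  # same grouping, different function
```

Changing `key_parts` changes what counts as "the same key", and swapping `sum` for `max` or `min` changes the aggregation without touching the stored entries.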
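
Slide 60 mentions normalizing statistical counts after a query to get probabilities. A minimal sketch of that step, using the final counts from slide 53 as the query result and a hypothetical `normalize` helper:

```python
def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


# Counts as they stand on slide 53, treated here as the result of a query.
counts = {"aardvarks": 1, "live": 143, "in": 2343, "holes": 235}
probabilities = normalize(counts)
```

Keeping raw counts in the table and dividing only at query time means new values can keep arriving without rewriting every stored probability.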
