
MapReduce and NoSQL



A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo

Published in: Technology


  1. MapReduce and NoSQL: Exploring the Solution Space
  2. Changing the Game
  3. Conventional MR + NoSQL: Cost (chart axes: Dollars, Data Volume)
  4. Conventional MR + NoSQL: Scale (chart axes: Dollars, Data Volume)
  5. Performance Profiles
  6. Performance Profiles (charts: MapReduce and NoSQL, each rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  7. Performance Profiles (chart: MapReduce vs. NoSQL on Throughput, Bulk Update, Latency, Seek)
  8. Performance Profiles (chart: MapReduce vs. NoSQL, rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  9. Data Goals: Collect, Serve, Analyze
  10. Traditional Approach (diagram: Users, Apps; Collect; Transactional System (OLTP); Analytical System (OLAP); Serve; Analyze)
  11. One Approach (diagram: Users, Apps; Collect; NoSQL; HDFS; Hadoop MapReduce; Serve; Analyze)
  12. Analysis Challenges: Analytical latency; data is always old; answers can take a long time; serving up analytical results; higher cost, complexity; incremental updates
  13. Analysis Challenges: Word counts (a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19). New document: “The red aardvarks live in holes.”
  14. Analysis Challenges: HDFS log files: sources/routers, sources/apps, sources/webservers. MapReduce over data from all sources for the week of Jan 13th
  15. One Approach (diagram: Users, Apps; Collect; NoSQL; HDFS; Hadoop MapReduce; Serve; Analyze)
  16. Possible Approach (diagram: Users, Apps; Collect; NoSQL; Hadoop MapReduce; Serve; Analyze)
  17. Have to be careful to avoid creating a system that’s bad at everything
  18. What could go wrong?
  19. Performance Profiles (chart: MapReduce vs. NoSQL, rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  20. Performance Profiles (chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  21. Performance Profiles (chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  22. Performance Profiles (chart: MapReduce on NoSQL, rated Good to Bad on Throughput, Bulk Update, Latency, Seek)
  23. Best Practices: Use a NoSQL database that has good throughput; it helps to do local communication. Isolate MapReduce workers to a subset of your NoSQL nodes so that some remain available for fast queries. If MR output is written back to the NoSQL database, it is immediately available for query.
  24. The Interllective: Concept-Based Search
  25. Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
  26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
  27. Feature Vectors, Unstructured Data
  28. www.interllective.com
  29. MapReduce on MongoDB: Built-in MapReduce (JavaScript); mongo-hadoop; MongoReduce
  30. MongoReduce
  31. (diagram: Config, Replicas, App servers, mongos, Shards with primaries (P) and secondaries (S))
  32. (diagram: Config, Replicas, App servers, MR Workers, Shards, mongos)
  33. (diagram: Config, Replicas, App servers, Shards with primaries (P), Single job, mongos)
  34. MongoDB: Mappers read directly from a single mongod process, not through mongos, so reads tend to be local. The balancer can be turned off to avoid the potential for reading data twice. (diagram: mongod, Map, HDFS, mongos)
  35. MongoReduce: Only MongoDB primaries do writes. Schedule mappers on secondaries. Intermediate output goes to HDFS. (diagram: mongod, Map, HDFS, Reduce, mongos)
  36. MongoReduce: Final output can go to HDFS or MongoDB. (diagram: mongod, HDFS, Reduce, mongos)
  37. MongoReduce: Mappers can just write to global MongoDB through mongos. (diagram: mongod, Map, HDFS, mongos)
  38. What’s Going On? (diagram: Map tasks with outputs r1, r2, r3; mongos; primaries (P) for r1, r2, r3; Identity Reducer)
  39. MongoReduce: Instead of specifying an HDFS directory for input, can submit MongoDB query and select statements: q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}, s = {authors: true}. Queries use indexes!
  40. MongoReduce: If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced. Can choose the shard key. Reducers can choose to call update().
  41. MongoReduce: If writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID.
  42. https://github.com/acordova/MongoReduce
  43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader; YourOutputFormat, YourRecordWriter, YourOutputCommitter
  44. Brisk
  45. Accumulo
  46. Accumulo: Based on Google’s BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels; a server-side programming mechanism called Iterators.
  47. Accumulo: Based on Google’s BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels; a server-side programming mechanism called Iterators.
  48. MapReduce and Accumulo: Can do regular ol’ MapReduce just like with MongoDB, but can use Iterators to achieve a kind of ‘continual MapReduce’.
  49. (diagram: Master, TabletServers, App servers, Ingest clients)
  50. (diagram: TabletServer with Reduce’ holding the WordCount table (live: 142, in: 2342, holes: 234); App server; Ingest client with map())
  51. (diagram: TabletServer with Reduce’ (live: 142, in: 2342, holes: 234); App server; Ingest client with map(); new document: “The red aardvarks live in holes.”)
  52. (diagram: TabletServer with Reduce’ (live: 142, in: 2342, holes: 234); App server; Ingest client with map() output: aardvarks: 1, live: 1, in: 1, holes: 1)
  53. (diagram: TabletServer with Reduce’ (aardvarks: 1, live: 143, in: 2343, holes: 235); App server; Ingest client with map())
  54. Accumulo (diagram: Map tasks and client map() calls, each with outputs r1, r2, r3; reduce’() on r1, r2, r3)
  55. Iterators: Keys have the form row : column family : column qualifier : ts -> value. Can specify which key elements are unique, e.g. row : column family. Can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min().
  56. Key to performance: When the functions are run. Rather than atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, minor compaction time, or major compaction time.
  57. (diagram: TabletServer stores aardvark: 1, live: 142, live: 1, in: 2342, in: 1, holes: 234, holes: 1; on a scan, Reduce’ sum() combines values at read time, e.g. live: 143)
  58. (diagram: on major compaction, Reduce’ sum() rewrites aardvark: 1, live: 142, live: 1, in: 2342, in: 1, holes: 234, holes: 1 as aardvark: 1, live: 143, in: 2343, holes: 235)
  59. Reduce’ (prime): So named because a function has not seen all values for a given key; another may show up. More like writing a MapReduce combiner function.
  60. ‘Continuous’ MapReduce: Can maintain huge result sets that are always available for query. Update graph edge weights. Update feature vector weights. Statistical counts; normalize after query to get probabilities.
  61. Accumulo: Latin, to accumulate ...
  62. Accumulo: Latin, to accumulate ... awesomeness
  63. incubator.apache.org/accumulo, wiki.apache.org/incubator/AccumuloProposal
  64. Google Percolator: A system for incrementally processing updates to a large data set. Used to create the Google web search index. Reduced the average age of documents in Google search results by 50%.
  65. Google Percolator: A novel, proprietary system of distributed transactions and notifications built on top of BigTable.
  66. Solution Space: Incremental update, multi-row consistency: Percolator. Results can’t be broken down (sort): MapReduce. No multi-row updates: BigTable. Computation is small: Traditional DBMS.
  67. Questions
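
The "write all values, sum at read time or compaction time" idea from slides 55 to 59 can be illustrated with a small in-memory sketch. This is not Accumulo's Iterator API: the class `CombiningTable` and its methods are hypothetical names, and a plain Python list stands in for a tablet's stored entries. It only shows why writes need no lock or read, and why the combining function behaves like a MapReduce combiner that may see more values later.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


class CombiningTable:
    """Toy stand-in for a table with a combining function (hypothetical, not the Accumulo API)."""

    def __init__(self, combine: Callable[[Iterable[int]], int] = sum) -> None:
        self._entries: List[Tuple[str, int]] = []  # append-only log of raw writes
        self._combine = combine

    def write(self, key: str, value: int) -> None:
        # No lock and no read-modify-write: a write is just an append.
        self._entries.append((key, value))

    def _combined(self) -> Dict[str, int]:
        grouped: Dict[str, List[int]] = defaultdict(list)
        for key, value in self._entries:
            grouped[key].append(value)
        return {key: self._combine(values) for key, values in sorted(grouped.items())}

    def scan(self) -> Dict[str, int]:
        # Read time: combine on the fly and leave the stored entries untouched.
        return self._combined()

    def compact(self) -> None:
        # Compaction time: replace the partial values with one entry per key.
        self._entries = list(self._combined().items())

    def stored_entries(self) -> int:
        return len(self._entries)


table = CombiningTable()
# Existing word counts from the slides.
for word, count in [("live", 142), ("in", 2342), ("holes", 234)]:
    table.write(word, count)
# map() output for the new document, as listed on slide 52.
for word in ["aardvarks", "live", "in", "holes"]:
    table.write(word, 1)

result = table.scan()
entries_before = table.stored_entries()
table.compact()
entries_after = table.stored_entries()
```

In this sketch a scan returns the same totals before and after `compact()`; compaction only shrinks what is stored. Because later writes can still arrive for any key, the combining function has to be safe to apply to partial results, which `sum`, `max`, and `min` are.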

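
Slide 41's point about idempotent writes can be sketched the same way. The code below does not use MongoReduce or a MongoDB driver: a Python dict keyed on `_id` stands in for a collection, and `deterministic_id`, `write_output`, and `run_reducer` are hypothetical helpers. The assumption illustrated is that a reducer task may run more than once (for example after a retry), so an identifier derived from the output key makes the second write overwrite the first, while a random UUID produces a duplicate.

```python
import hashlib
import uuid
from typing import Dict


def deterministic_id(job_name: str, reduce_key: str) -> str:
    """Derive a stable identifier from the job and reduce key (hypothetical scheme)."""
    raw = f"{job_name}\x00{reduce_key}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def write_output(collection: Dict[str, dict], doc_id: str, key: str, value: int) -> None:
    # Stand-in for an upsert keyed on _id: the same _id overwrites, a new _id inserts.
    collection[doc_id] = {"_id": doc_id, "key": key, "value": value}


def run_reducer(collection: Dict[str, dict], key: str, value: int, stable: bool) -> None:
    doc_id = deterministic_id("wordcount", key) if stable else uuid.uuid4().hex
    write_output(collection, doc_id, key, value)


stable_out: Dict[str, dict] = {}
random_out: Dict[str, dict] = {}
# Simulate a reducer task that is retried: the same output is written twice.
for _attempt in range(2):
    run_reducer(stable_out, "live", 143, stable=True)
    run_reducer(random_out, "live", 143, stable=False)
```

After the simulated retry, `stable_out` holds one document for the key and `random_out` holds two, which is the duplication the slide warns about.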