MapReduce and NoSQL

A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo

    1. MapReduce and NoSQL: Exploring the Solution Space
    2. Changing the Game
    3. Conventional MR + NoSQL [chart: cost in dollars vs. data volume]
    4. Conventional MR + NoSQL [chart: scale, dollars vs. data volume]
    5. Performance Profiles
    6. Performance Profiles [chart: MapReduce rated good at throughput and bulk update, bad at latency and seek; NoSQL the reverse]
    7. Performance Profiles [chart: MapReduce and NoSQL side by side across throughput, bulk update, latency, and seek]
    8. Performance Profiles [chart: the same comparison with the good/bad scale marked]
    9. Data Goals: Collect, Serve, Analyze
    10. Traditional Approach [diagram: users and apps collect into a transactional system (OLTP) that serves, feeding an analytical system (OLAP) for analysis]
    11. One Approach [diagram: users and apps collect into NoSQL, which serves; Hadoop MapReduce over HDFS handles analysis]
    12. Analysis Challenges: analytical latency (data is always old; answers can take a long time); serving up analytical results; higher cost and complexity; incremental updates
    13. Analysis Challenges: existing word counts (a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19) vs. a new document: "The red aardvarks live in holes."
    14. Analysis Challenges: log files in HDFS (sources/routers, sources/apps, sources/webservers); MapReduce over data from all sources for the week of Jan 13th
    15. One Approach [diagram repeated: users and apps, NoSQL for serving, Hadoop MapReduce over HDFS for analysis]
    16. Possible Approach [diagram: HDFS dropped; Hadoop MapReduce runs directly over the NoSQL store that also serves]
    17. Have to be careful to avoid creating a system that's bad at everything
    18. What could go wrong?
    19. Performance Profiles [chart repeated: MapReduce vs. NoSQL across throughput, bulk update, latency, and seek]
    20. Performance Profiles [chart: a MapReduce-on-NoSQL profile added alongside MapReduce and NoSQL]
    21. Performance Profiles [chart: MapReduce on NoSQL rated on the good/bad scale across throughput, bulk update, latency, and seek]
    22. Performance Profiles [chart: the MapReduce-on-NoSQL profile alone]
    23. Best Practices: use a NoSQL db that has good throughput (it helps to do local communication); isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries; if MR output is written back to the NoSQL db, it is immediately available for query
    24. THE INTERLLECTIVE: Concept-Based Search
    25. Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
    26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
    27. Feature Vectors, Unstructured Data
    28. www.interllective.com
    29. MapReduce on MongoDB: built-in MapReduce (JavaScript), mongo-hadoop, MongoReduce (a driver-based sketch of the built-in route follows the transcript)
    30. MongoReduce
    31. [diagram: a sharded MongoDB cluster: config servers, shards as replica sets (P = primary, S = secondary), and app servers routing through mongos]
    32. [diagram: the same cluster with MapReduce workers placed alongside the shards]
    33. [diagram: a single job running over the shard primaries, routed through mongos]
    34. MongoDB: mappers read directly from a single mongod process, not through mongos, so reads tend to be local; the balancer can be turned off to avoid the potential for reading data twice [diagram: Map reads from mongod and writes to HDFS]
    35. MongoReduce: only MongoDB primaries do writes, so schedule mappers on secondaries; intermediate output goes to HDFS [diagram: Map reads a secondary mongod, Reduce writes through mongos]
    36. MongoReduce: final output can go to HDFS or MongoDB [diagram: Reduce writes to HDFS or through mongos]
    37. MongoReduce: mappers can just write to the global MongoDB through mongos [diagram: Map writes through mongos]
    38. What's Going On? [diagram: each mapper partitions its output (r1, r2, r3) and writes through mongos to the shard primaries; the reducer is the identity]
    39. MongoReduce: instead of specifying an HDFS directory for input, you can submit MongoDB query and select statements: q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}, s = {authors: true}. Queries use indexes! (A job-setup sketch follows the transcript.)
    40. MongoReduce: if outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced; you can choose the shard key; reducers can choose to call update()
    41. MongoReduce: if writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID (see the idempotent-write sketch after the transcript)
    42. https://github.com/acordova/MongoReduce
    43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader; YourOutputFormat, YourRecordWriter, YourOutputCommitter (a read-side skeleton follows the transcript)
    44. Brisk
    45. Accumulo
    46. Accumulo: based on Google's BigTable design; uses Apache Hadoop, ZooKeeper, and Thrift; features a few novel improvements on the BigTable design: cell-level access labels and a server-side programming mechanism called Iterators
    48. MapReduce and Accumulo: can do regular ol' MapReduce just like with MongoDB, but can also use Iterators to achieve a kind of 'continual MapReduce'
    49. [diagram: an Accumulo master and tablet servers, with app servers and ingest clients]
    50. [diagram: a tablet server holds a WordCount table (live: 142, in: 2342, holes: 234) with a reduce' iterator; an app server's ingest client runs map()]
    51. [diagram: the ingest client receives a new document: "The red aardvarks live in holes."]
    52. [diagram: map() emits aardvarks: 1, live: 1, in: 1, holes: 1 to the tablet server]
    53. [diagram: the table now reads aardvarks: 1, live: 143, in: 2343, holes: 235]
    54. Accumulo [diagram: clients run map() and send partitioned output (r1, r2, r3) to tablet servers, where reduce'() runs]
    55. Iterators: keys have the form row : column family : column qualifier : ts -> value. You can specify which key elements are unique (e.g. row : column family) and a function to execute on the values of identical key portions, e.g. sum(), max(), min().
    56. Key to performance: when the functions are run. Rather than an atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, at minor compaction time, and at major compaction time (a combiner sketch follows the transcript).
    57. [diagram: at scan time, the reduce' sum() merges stored counts (live: 142, in: 2342, holes: 234) with newly written values as they are read]
    58. [diagram: at major compaction, sum() folds the values together so the table stores aardvark: 1, live: 143, in: 2343, holes: 235]
    59. Reduce' (prime): because a function may not have seen all values for a given key (another may show up), writing one is more like writing a MapReduce combiner function
    60. 'Continuous' MapReduce: can maintain huge result sets that are always available for query; update graph edge weights; update feature vector weights; keep statistical counts and normalize after query to get probabilities
    61. Accumulo: Latin, 'to accumulate' ...
    62. Accumulo: Latin, 'to accumulate' ... awesomeness
    63. incubator.apache.org/accumulo, wiki.apache.org/incubator/AccumuloProposal
    64. Google Percolator: a system for incrementally processing updates to a large data set; used to create the Google web search index, it reduced the average age of documents in Google search results by 50%
    65. Google Percolator: a novel, proprietary system of distributed transactions and notifications built on top of BigTable
    66. Solution Space: need incremental updates with multi-row consistency: Percolator; results can't be broken down (e.g. a sort): MapReduce; no multi-row updates needed: BigTable; computation is small: a traditional DBMS
    67. 67. Questions
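
The sketches below flesh out a few of the slides. First, slide 29: MongoDB's built-in JavaScript MapReduce can be driven from the classic Java driver. A minimal word-count sketch; the text field and the output collection name are assumptions, not anything the deck specifies.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceOutput;

public class BuiltinMapReduce {
    /** Runs MongoDB's built-in (JavaScript) map-reduce as a word count. */
    static void wordCount(DBCollection docs) {
        // Map and reduce are JavaScript, shipped to the server as strings.
        String map =
            "function() {" +
            "  this.text.split(' ').forEach(function(w) { emit(w, 1); });" +
            "}";
        String reduce =
            "function(key, values) { return Array.sum(values); }";

        // Empty query = the whole collection; "word_counts" is illustrative.
        MapReduceOutput out =
            docs.mapReduce(map, reduce, "word_counts", new BasicDBObject());
        System.out.println("result docs: " + out.getOutputCollection().count());
    }
}
```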
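Slide 39: feeding a MongoDB query to a Hadoop job instead of an HDFS input path. This is a minimal sketch assuming the mongo-hadoop connector's MongoInputFormat and its documented mongo.input.* configuration keys; MongoReduce's own API may differ.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;      // assumed: mongo-hadoop connector
import com.mongodb.hadoop.util.MongoConfigUtil;  // assumed: mongo-hadoop connector

public class QueryInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Read a (possibly sharded) collection instead of an HDFS directory.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/news.articles");

        // The query and select from slide 39; these key names are the
        // connector's documented settings, not part of core Hadoop.
        conf.set("mongo.input.query",
                 "{\"article_source\": {\"$in\": [\"nytimes.com\", \"wsj.com\"]}}");
        conf.set("mongo.input.fields", "{\"authors\": true}");

        Job job = Job.getInstance(conf, "authors-by-source");
        job.setInputFormatClass(MongoInputFormat.class);
        // ...set mapper, reducer, and output format as usual...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the query runs through MongoDB's own query planner, it can use the collection's indexes, which is the point the slide is making.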
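Slide 41, on idempotent writes: if a failed task is retried, the re-emitted documents must overwrite the first attempt's output rather than accumulate next to it. Deriving _id deterministically from the output key achieves that; a random UUID or a freshly generated ObjectId would not. A sketch against the classic MongoDB Java driver; the key scheme is illustrative.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class IdempotentOutput {
    /** Writes one reduce result so that task retries overwrite, not duplicate. */
    static void writeCount(DBCollection out, String word, long count) {
        // Deterministic _id derived from the key: a re-run of the same
        // reducer targets the same document instead of inserting a new one.
        BasicDBObject doc = new BasicDBObject("_id", "wordcount:" + word)
                .append("count", count);
        out.save(doc);  // save() upserts by _id
    }
}
```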
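Slide 43's class list maps onto Hadoop's standard extension points. Below is the read side only (the output side mirrors it with OutputFormat, RecordWriter, and OutputCommitter): a skeleton whose method bodies are placeholders for your store's partition metadata and range-scan API, not a working connector.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

// Describes how to carve the table into splits and how to read each one.
public class YourInputFormat extends InputFormat<Text, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // One split per shard/tablet/region, so each mapper can read locally.
        throw new UnsupportedOperationException("ask the db for its partitions");
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<Text, Text>() {
            @Override public void initialize(InputSplit s, TaskAttemptContext c) {
                // open a cursor/scanner over this split's key range
            }
            @Override public boolean nextKeyValue() { return false; } // advance cursor
            @Override public Text getCurrentKey()   { return null; }  // current row key
            @Override public Text getCurrentValue() { return null; }  // current row body
            @Override public float getProgress()    { return 0f; }    // rows read / estimate
            @Override public void close()           { }               // release the cursor
        };
    }
}
```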
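Slides 55 and 56 in code: attach Accumulo's bundled SummingCombiner so sums are computed at scan, minor-compaction, and major-compaction time, then have the ingest path write a raw 1 per word occurrence. The table name, column family, and iterator priority are made up for the sketch.

```java
import java.util.Collections;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;
import org.apache.hadoop.io.Text;

public class ContinualWordCount {

    /** reduce'(): sum every value written under the same key. */
    static void attachSummer(Connector conn) throws Exception {
        IteratorSetting it = new IteratorSetting(10, "sum", SummingCombiner.class);
        SummingCombiner.setColumns(it,
                Collections.singletonList(new IteratorSetting.Column("count")));
        LongCombiner.setEncodingType(it, LongCombiner.Type.STRING);
        // By default the iterator applies at scan, minc, and majc scopes,
        // so there is no lock/read/+1/write round trip on ingest.
        conn.tableOperations().attachIterator("wordcount", it);
    }

    /** map(): the ingest client just writes a raw 1 per occurrence. */
    static void increment(BatchWriter writer, String word) throws Exception {
        Mutation m = new Mutation(new Text(word));  // row = the word
        m.put(new Text("count"), new Text(""), new Value("1".getBytes()));
        writer.addMutation(m);
    }
}
```

A scan then returns the folded count, matching the read-time and compaction-time pictures on slides 57 and 58.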