MapReduce and NoSQL: Exploring the Solution Space
A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo

Published in: Technology

    1. MapReduce and NoSQL: Exploring the Solution Space
    2. Changing the Game
    3. Cost. [Chart: dollars vs. data volume; series: Conventional, MR + NoSQL]
    4. Scale. [Chart: dollars vs. data volume; series: Conventional, MR + NoSQL]
    5. Performance Profiles
    6. Performance Profiles. [Chart: MapReduce and NoSQL each rated from Good to Bad on throughput, bulk update, latency, and seek]
    7. Performance Profiles. [Chart: MapReduce vs. NoSQL on throughput, bulk update, latency, and seek]
    8. Performance Profiles. [Chart: MapReduce vs. NoSQL, rated from Good to Bad on throughput, bulk update, latency, and seek]
    9. Data Goals: Collect, Serve, Analyze
    10. Traditional Approach. [Diagram labels: Users, Apps; Transactional System (OLTP); Analytical System (OLAP); Collect, Serve, Analyze]
    11. One Approach. [Diagram labels: Users, Apps; NoSQL; HDFS; Hadoop MapReduce; Collect, Serve, Analyze]
    12. Analysis Challenges: Analytical latency (data is always old; answers can take a long time). Serving up analytical results. Higher cost, complexity. Incremental updates.
    13. Analysis Challenges. Word counts: a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19. New document: “The red aardvarks live in holes.”
    14. Analysis Challenges. HDFS log files: sources/routers, sources/apps, sources/webservers. MapReduce over data from all sources for the week of Jan 13th.
    15. One Approach. [Diagram labels: Users, Apps; NoSQL; HDFS; Hadoop MapReduce; Collect, Serve, Analyze]
    16. Possible Approach. [Diagram labels: Users, Apps; NoSQL; Hadoop MapReduce; Collect, Serve, Analyze]
    17. Have to be careful to avoid creating a system that’s bad at everything
    18. What could go wrong?
    19. Performance Profiles. [Chart: MapReduce vs. NoSQL, rated from Good to Bad on throughput, bulk update, latency, and seek]
    20. Performance Profiles. [Chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated from Good to Bad on throughput, bulk update, latency, and seek]
    21. Performance Profiles. [Chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated from Good to Bad on throughput, bulk update, latency, and seek]
    22. Performance Profiles. [Chart: MapReduce on NoSQL, rated from Good to Bad on throughput, bulk update, latency, and seek]
    23. Best Practices: Use a NoSQL db that has good throughput - it helps to do local communication. Isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries. If MR output is written back to the NoSQL db, it is immediately available for query.
    24. The Interllective: Concept-Based Search
    25. Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
    26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
    27. Feature Vectors, Unstructured Data
    28. www.interllective.com
    29. MapReduce on MongoDB: Built-in MapReduce (JavaScript); mongo-hadoop; MongoReduce
    30. MongoReduce
    31. [Diagram labels: Config, Replicas, App servers, Shards (each with S and P nodes), mongos]
    32. [Diagram labels: Config, Replicas, App servers, MR Workers, Shards, mongos]
    33. [Diagram labels: Config, Replicas, App servers, Shards (P nodes), mongos, Single job]
    34. MongoDB: Mappers read directly from a single mongod process, not through mongos - tends to be local. Balancer can be turned off to avoid potential for reading data twice. [Diagram labels: mongod, Map, HDFS, mongos]
    35. MongoReduce: Only MongoDB primaries do writes. Schedule mappers on secondaries. Intermediate output goes to HDFS. [Diagram labels: mongod, Map, HDFS, Reduce, mongos]
    36. MongoReduce: Final output can go to HDFS or MongoDB. [Diagram labels: mongod, HDFS, Reduce, mongos]
    37. MongoReduce: Mappers can just write to global MongoDB through mongos. [Diagram labels: mongod, Map, HDFS, mongos]
    38. What’s Going On? [Diagram labels: four Map tasks; r1, r2, r3; mongos; P, P, P; Identity Reducer]
    39. MongoReduce: Instead of specifying an HDFS directory for input, can submit MongoDB query and select statements: q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}, s = {authors: true}. Queries use indexes!
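The query and select documents on slide 39 can be illustrated without a running server. The following is a minimal sketch, not MongoReduce's own code: `matches`, `project`, and the sample `articles` are hypothetical names and data invented for illustration, and the matcher handles only equality and `$in`. It shows which documents and fields a mapper would receive.

```python
# Query and select documents from the slide, written as Python dicts.
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}


def matches(doc, query):
    """Tiny stand-in for MongoDB matching: supports equality and $in only."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True


def project(doc, select):
    """Keep _id plus the fields marked true, like an inclusion projection."""
    keep = {name for name, wanted in select.items() if wanted}
    return {k: v for k, v in doc.items() if k in keep or k == "_id"}


# Hypothetical sample collection (not from the talk).
articles = [
    {"_id": 1, "article_source": "nytimes.com", "authors": ["A"], "body": "..."},
    {"_id": 2, "article_source": "example.org", "authors": ["B"], "body": "..."},
    {"_id": 3, "article_source": "wsj.com", "authors": ["C"], "body": "..."},
]

# Only matching documents, and only the selected fields, reach the mappers.
mapper_input = [project(d, s) for d in articles if matches(d, q)]
```

Against a live database the same two dicts would typically be passed to a driver call such as pymongo's `collection.find(q, s)`, where an index on `article_source` (assumed here) lets the query avoid a full scan.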
    40. MongoReduce: If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced. Can choose the shard key. Reducers can choose to call update().
    41. MongoReduce: If writing output to MongoDB, specify an objectId to ensure idempotent writes - i.e. not a random UUID.
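Why a deterministic id matters can be seen in a small simulation. This is a sketch under stated assumptions: a plain dict stands in for the output collection, and `deterministic_id`, `write_output`, and the sample records are hypothetical names chosen for illustration. A task that is retried (as Hadoop may do after a failure) overwrites its earlier output when ids are derived from the key, but duplicates it when ids are random.

```python
import hashlib
import uuid


def deterministic_id(key):
    """Derive a stable id from the output key, so a retry hits the same record."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:24]


def write_output(collection, record_id, key, value):
    """Upsert by id: writing the same id twice replaces rather than appends."""
    collection[record_id] = {"key": key, "value": value}


records = [("aardvarks", 1), ("live", 143)]

stable, random_ids = {}, {}
for attempt in range(2):  # the second pass simulates a retried reduce task
    for key, value in records:
        write_output(stable, deterministic_id(key), key, value)
        write_output(random_ids, uuid.uuid4().hex, key, value)

print(len(stable), len(random_ids))
```

With key-derived ids the collection ends with one record per key; with random UUIDs every retry adds another copy.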
    42. https://github.com/acordova/MongoReduce
    43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader, YourOutputFormat, YourRecordWriter, YourOutputCommitter
    44. Brisk
    45. Accumulo
    46. Accumulo: Based on Google’s BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels; a server-side programming mechanism called Iterators.
    47. Accumulo: Based on Google’s BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels; a server-side programming mechanism called Iterators.
    48. MapReduce and Accumulo: Can do regular ol’ MapReduce just like w/ MongoDB. But can use Iterators to achieve a kind of ‘continual MapReduce’.
    49. [Diagram labels: TabletServers, Master, App servers, Ingest clients]
    50. [Diagram labels: TabletServer with Reduce’; App server; Ingest client with map(). WordCount table: live:142, in:2342, holes:234]
    51. [Diagram labels: TabletServer with Reduce’; App server; Ingest client with map(). Table: live:142, in:2342, holes:234. Incoming document: “The red aardvarks live in holes.”]
    52. [Diagram labels: TabletServer with Reduce’; App server; Ingest client with map(). Table: live:142, in:2342, holes:234. map() emits: aardvarks:1, live:1, in:1, holes:1]
    53. [Diagram labels: TabletServer with Reduce’; App server; Ingest client with map(). Table: aardvarks:1, live:143, in:2343, holes:235]
    54. Accumulo. [Diagram labels: Map tasks emitting r1, r2, r3; map() in clients; r1, r2, r3; reduce’()]
    55. Iterators. Key structure: row : column family : column qualifier : ts -> value. Can specify which key elements are unique, e.g. row : column family. Can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min().
    56. Key to performance: when the functions are run. Rather than atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, minor compaction time, or major compaction time.
    57. [Diagram, scan: TabletServer holds aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; Reduce’ applies sum() at read and returns live:143]
    58. [Diagram, major compact: TabletServer holds aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; Reduce’ applies sum() and writes aardvark:1, live:143, in:2343, holes:235]
    59. Reduce’ (prime): Because a function has not seen all values for a given key - another may show up. More like writing a MapReduce combiner function.
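The write-everything, sum-later idea from slides 55 to 59 can be sketched in a few lines. This is an in-memory illustration only, not Accumulo's Iterator API: `SummingTable` and its methods are hypothetical names, and a single `compact` stands in for both minor and major compaction. Writes never read or lock; the sum is applied when a key is scanned or when the table is compacted, and because more values may arrive later the function behaves like a combiner.

```python
from collections import defaultdict


class SummingTable:
    """Toy table that stores every written value and sums lazily."""

    def __init__(self):
        self.entries = defaultdict(list)

    def write(self, key, value):
        # Append only: no lock, no read-modify-write.
        self.entries[key].append(value)

    def scan(self, key):
        # The sum is applied at read time over whatever has been written.
        return sum(self.entries.get(key, []))

    def compact(self):
        # Fold each key's values into one partial sum; later writes still work.
        for key in list(self.entries):
            self.entries[key] = [sum(self.entries[key])]


table = SummingTable()
for key, value in [("live", 142), ("in", 2342), ("holes", 234)]:
    table.write(key, value)

# Pairs emitted by map() for the new document, as on slide 52.
for key in ["aardvarks", "live", "in", "holes"]:
    table.write(key, 1)

before = table.scan("live")   # summed at read time
table.compact()
after = table.scan("live")    # same answer from the compacted entry
table.write("live", 1)        # another value shows up after compaction
```

Summation works here because it is associative and commutative, so partial sums produced at compaction can be combined with later values without changing the result.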
    60. ‘Continuous’ MapReduce: Can maintain huge result sets that are always available for query. Update graph edge weights. Update feature vector weights. Statistical counts: normalize after query to get probabilities.
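Normalizing after the query can be as simple as dividing each returned count by the total. The sketch below assumes the counts have already been read from the table; `normalize` is a hypothetical helper, and the sample counts are the word counts shown on slide 53.

```python
def normalize(counts):
    """Turn raw counts into probabilities that sum to 1 (all zeros if empty)."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


# Counts as they would come back from a query against the running totals.
counts = {"aardvarks": 1, "live": 143, "in": 2343, "holes": 235}
probabilities = normalize(counts)
```

Keeping raw counts in the table and normalizing on the client means ingest only ever adds to counters, and the totals used for normalization are always current at query time.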
    61. Accumulo - Latin: to accumulate ...
    62. Accumulo - Latin: to accumulate ... awesomeness
    63. incubator.apache.org/accumulo, wiki.apache.org/incubator/AccumuloProposal
    64. Google Percolator: A system for incrementally processing updates to a large data set. Used to create the Google web search index. Reduced the average age of documents in Google search results by 50%.
    65. Google Percolator: A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable.
    66. Solution Space. Incremental update, multi-row consistency: Percolator. Results can’t be broken down (sort): MapReduce. No multi-row updates: BigTable. Computation is small: Traditional DBMS.
    67. Questions