MapReduce and NoSQL

A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo

Transcript

  • 1. MapReduce and NoSQL: Exploring the Solution Space
  • 2. Changing the Game
  • 3. Conventional MR + NoSQL: Cost (chart of dollars vs. data volume)
  • 4. Conventional MR + NoSQL: Scale (chart of dollars vs. data volume)
  • 5. Performance Profiles
  • 6. Performance Profiles (chart): MapReduce rated from good to bad on throughput, bulk update, latency, and seek; NoSQL rated on the same axes
  • 7. Performance Profiles (chart): MapReduce and NoSQL compared on throughput, bulk update, latency, and seek
  • 8. Performance Profiles (chart): good-to-bad ratings for MapReduce and NoSQL across throughput, bulk update, latency, and seek
  • 9. Data Goals: Collect, Serve, Analyze
  • 10. Traditional Approach (diagram): users and apps collect into a transactional system (OLTP), which serves; an analytical system (OLAP) analyzes
  • 11. One Approach (diagram): users and apps collect into NoSQL, which serves; Hadoop MapReduce over HDFS analyzes
  • 12. Analysis Challenges: analytical latency (data is always old; answers can take a long time), serving up analytical results, higher cost and complexity, incremental updates
  • 13. Analysis Challenges: word counts (a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19) and a new document: “The red aardvarks live in holes.”
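The incremental-update problem on slide 13 can be sketched in Python (a hypothetical stand-alone example, not code from the talk): instead of recounting the whole corpus when one document arrives, fold the new document's words into the existing counts.

```python
from collections import Counter

def update_counts(counts, document):
    """Increment existing word counts with one new document,
    rather than recounting the whole corpus."""
    counts.update(word.strip(".,").lower() for word in document.split())
    return counts

# Existing counts from the slide, plus the new document.
counts = Counter({"a": 5342, "aardvark": 13, "an": 4553,
                  "anteater": 27, "yellow": 302, "zebra": 19})
update_counts(counts, "The red aardvarks live in holes.")
```

The hard part in a conventional MR + NoSQL stack is not this merge itself but doing it continuously at scale, which is the theme of the rest of the deck.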
  • 14. Analysis Challenges: HDFS log files (sources/routers, sources/apps, sources/webservers); MapReduce over data from all sources for the week of Jan 13th
  • 15. One Approach (diagram): users and apps collect into NoSQL, which serves; Hadoop MapReduce over HDFS analyzes
  • 16. Possible Approach (diagram): users and apps collect into NoSQL, which serves; Hadoop MapReduce runs directly over the NoSQL store to analyze
  • 17. Have to be careful to avoid creating a system that’s bad at everything
  • 18. What could go wrong?
  • 19. Performance Profiles (chart): MapReduce vs. NoSQL on throughput, bulk update, latency, and seek
  • 20. Performance Profiles (chart): adds MapReduce on NoSQL to the comparison
  • 21. Performance Profiles (chart): good-to-bad ratings for all three approaches
  • 22. Performance Profiles (chart): MapReduce on NoSQL alone, rated across the four axes
  • 23. Best Practices: use a NoSQL db that has good throughput (it helps to do local communication); isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries; if MR output is written back to the NoSQL db, it is immediately available for query
  • 24. The Interllective: Concept-Based Search
  • 25. Patents, news articles, PubMed, clinical trials, arXiv articles
  • 26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
  • 27. Feature vectors, unstructured data
  • 28. www.interllective.com
  • 29. MapReduce on MongoDB: built-in MapReduce (JavaScript), mongo-hadoop, MongoReduce
  • 30. MongoReduce
  • 31. (diagram) Sharded MongoDB cluster: config servers, shards with primary (P) and secondary (S) replicas, and app servers connecting through mongos
  • 32. (diagram) MR workers placed alongside the shards
  • 33. (diagram) A single job running across the shards
  • 34. MongoDB: mappers read directly from a single mongod process, not through mongos, so reads tend to be local; the balancer can be turned off to avoid the potential for reading data twice
  • 35. MongoReduce: only MongoDB primaries do writes, so schedule mappers on secondaries; intermediate output goes to HDFS
  • 36. MongoReduce: final output can go to HDFS or MongoDB
  • 37. MongoReduce: mappers can just write to the global MongoDB through mongos
  • 38. What’s Going On? (diagram): map output partitions (r1, r2, r3) flow through mongos to the shard primaries, with an identity reducer
  • 39. MongoReduce: instead of specifying an HDFS directory for input, you can submit MongoDB query and select statements: q = {article_source: {$in: [‘nytimes.com’, ‘wsj.com’]}}, s = {authors: true}. Queries use indexes!
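The query-as-input idea can be illustrated with a pure-Python sketch (hypothetical; no live mongod, and the matching logic covers only the $in operator shown on the slide): the query filters which documents reach the mappers, and the select document projects the fields they see.

```python
def matches(doc, query):
    # Minimal matcher: supports only the $in operator from the slide.
    for field, cond in query.items():
        if isinstance(cond, dict) and "$in" in cond:
            if doc.get(field) not in cond["$in"]:
                return False
        elif doc.get(field) != cond:
            return False
    return True

def project(doc, select):
    # Keep only the fields flagged True in the select document.
    return {k: doc[k] for k, keep in select.items() if keep and k in doc}

docs = [
    {"article_source": "nytimes.com", "authors": ["A. Reporter"], "body": "..."},
    {"article_source": "example.org", "authors": ["B. Blogger"], "body": "..."},
]
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}
mapper_input = [project(d, s) for d in docs if matches(d, q)]
```

In the real system this filtering happens inside mongod and can use indexes, which is the point of the slide: mappers never scan documents the query excludes.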
  • 40. MongoReduce: if outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced; you can choose the shard key; reducers can choose to call update()
  • 41. MongoReduce: if writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID
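Why a deterministic id makes retried writes idempotent can be sketched in a few lines of Python (a hypothetical example; the dict stands in for a MongoDB collection keyed by _id): deriving the id from the output key means a re-run task overwrites its earlier document instead of inserting a duplicate, which a random UUID would.

```python
import hashlib

def deterministic_id(key):
    # Stable id derived from the output key: 24 hex chars,
    # the same width as a MongoDB ObjectId.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:24]

collection = {}  # stand-in for a MongoDB collection keyed by _id

def save(key, value):
    _id = deterministic_id(key)
    collection[_id] = {"_id": _id, "key": key, "value": value}

save("aardvark", 13)
save("aardvark", 14)  # a retried task writes again: same _id, no duplicate
```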
  • 42. https://github.com/acordova/MongoReduce
  • 43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader, YourOutputFormat, YourRecordWriter, YourOutputCommitter
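Hadoop's real extension points are Java classes, but the contract they implement can be sketched in Python (class and method names here are hypothetical, not Hadoop's API): an input format carves the source into splits, and a record reader turns each split into (key, value) records for the mappers.

```python
class SimpleInputFormat:
    """Toy analogue of a Hadoop InputFormat: carve a document list
    into fixed-size splits, then iterate records per split."""

    def __init__(self, docs, split_size=2):
        self.docs, self.split_size = docs, split_size

    def get_splits(self):
        # Each split is an independent unit of work for one mapper.
        return [self.docs[i:i + self.split_size]
                for i in range(0, len(self.docs), self.split_size)]

    def record_reader(self, split):
        # Yield (key, value) pairs, as a Hadoop RecordReader does.
        for offset, doc in enumerate(split):
            yield offset, doc

fmt = SimpleInputFormat(["d1", "d2", "d3", "d4", "d5"])
records = [kv for split in fmt.get_splits() for kv in fmt.record_reader(split)]
```

For a NoSQL source, get_splits would typically map one split per shard or tablet so each mapper reads locally, which is the locality point made on slide 34.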
  • 44. Brisk
  • 45. Accumulo
  • 46. Accumulo: based on Google’s BigTable design; uses Apache Hadoop, ZooKeeper, and Thrift; features a few novel improvements on the BigTable design: cell-level access labels and a server-side programming mechanism called Iterators
  • 47. (repeat of slide 46)
  • 48. MapReduce and Accumulo: can do regular ol’ MapReduce just like with MongoDB, but can also use Iterators to achieve a kind of ‘continual MapReduce’
  • 49. (diagram) Accumulo cluster: TabletServers and Master, with app servers and ingest clients
  • 50. (diagram) A TabletServer holds a WordCount table (live: 142, in: 2342, holes: 234); the ingest client runs map() and the TabletServer applies Reduce’
  • 51. (diagram) The ingest client receives a new document: “The red aardvarks live in holes.”
  • 52. (diagram) map() emits aardvarks: 1, live: 1, in: 1, holes: 1
  • 53. (diagram) After Reduce’, the table holds aardvarks: 1, live: 143, in: 2343, holes: 235
  • 54. (diagram) Accumulo: map output partitions (r1, r2, r3) flow from clients to TabletServers, where reduce’() is applied
  • 55. Iterators: keys have the form row : column family : column qualifier : ts -> value; you can specify which key elements are unique (e.g. row : column family) and a function to execute on values of identical key portions, e.g. sum(), max(), min()
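The iterator idea can be sketched in pure Python (a hypothetical simplification: string keys rather than full row/family/qualifier/timestamp keys): entries sharing an identical key portion are collapsed with a function like sum() as the data is read, rather than being locked and incremented at write time.

```python
from itertools import groupby

def combining_iterator(entries, combine=sum):
    """Collapse entries sharing a key into one value, the way an
    Accumulo combiner iterator does at scan or compaction time."""
    for key, group in groupby(sorted(entries), key=lambda kv: kv[0]):
        yield key, combine(v for _, v in group)

# Stored entries: old counts plus the raw +1 entries from slide 52.
stored = [("holes", 234), ("in", 2342), ("live", 142),
          ("aardvarks", 1), ("holes", 1), ("in", 1), ("live", 1)]
combined = dict(combining_iterator(stored))
```

Swapping combine=max or combine=min gives the other aggregate functions the slide mentions, with no change to the write path.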
  • 56. Key to performance: when the functions are run. Rather than an atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, minor compaction time, and major compaction time
  • 57. (diagram) Read: a scan applies Reduce’ sum() over stored entries (aardvark: 1; live: 142 and live: 1; in: 2342 and in: 1; holes: 234 and holes: 1)
  • 58. (diagram) Major compaction: sum() rewrites the entries as aardvark: 1, live: 143, in: 2343, holes: 235
  • 59. Reduce’ (prime): because the function has not necessarily seen all values for a given key (another may show up), this is more like writing a MapReduce combiner function
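The constraint behind Reduce’ can be shown with a tiny worked example (hypothetical values): because the function may fire on any partial batch of values, and its partial results are fed back in as ordinary values, it must give the same answer regardless of how the values are grouped, i.e. it must be associative and commutative, like sum.

```python
values = [142, 1, 7, 3]  # values arriving over time for one key

# Reduce' may fire on any partial batch; partial results become values themselves.
partial_a = sum(values[:2])  # combined early, e.g. at minor compaction
partial_b = sum(values[2:])  # combined later, e.g. at major compaction

# Grouping doesn't change the answer, so it is safe to combine incrementally.
assert sum([partial_a, partial_b]) == sum(values)
```

A non-associative function (e.g. one that depends on seeing the whole value list at once) cannot be used this way, which is why the slide calls it a combiner rather than a full reducer.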
  • 60. ‘Continuous’ MapReduce: can maintain huge result sets that are always available for query; update graph edge weights; update feature vector weights; keep statistical counts (normalize after query to get probabilities)
  • 61. Accumulo: Latin, ‘to accumulate’ ...
  • 62. Accumulo: Latin, ‘to accumulate’ ... awesomeness
  • 63. incubator.apache.org/accumulowiki.apache.org/incubator/AccumuloProposal
  • 64. Google Percolator: a system for incrementally processing updates to a large data set; used to create the Google web search index; reduced the average age of documents in Google search results by 50%
  • 65. Google Percolator: a novel, proprietary system of distributed transactions and notifications built on top of BigTable
  • 66. Solution Space: incremental update with multi-row consistency: Percolator; results can’t be broken down (e.g. a sort): MapReduce; no multi-row updates: BigTable; small computation: a traditional DBMS
  • 67. Questions