MapReduce and NoSQL

A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo


Published in: Technology
  • Transcript

    • 1. MapReduce and NoSQL: Exploring the Solution Space
    • 2. Changing the Game
    • 3. Cost [chart: Conventional vs. MR + NoSQL; axes: Dollars, Data Volume]
    • 4. Scale [chart: Conventional vs. MR + NoSQL; axes: Dollars, Data Volume]
    • 5. Performance Profiles
    • 6. Performance Profiles [charts: MapReduce and NoSQL, each rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 7. Performance Profiles [chart: MapReduce and NoSQL on Throughput, Bulk Update, Latency, Seek]
    • 8. Performance Profiles [chart: MapReduce and NoSQL, rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 9. Data Goals: Collect, Serve, Analyze
    • 10. Traditional Approach [diagram labels: Collect, Serve, Analyze; Users, Apps; Transactional System (OLTP); Analytical System (OLAP)]
    • 11. One Approach [diagram labels: Collect, Serve, Analyze; Users, Apps; NoSQL; Hadoop MapReduce, HDFS]
    • 12. Analysis Challenges: Analytical latency (data is always old; answers can take a long time). Serving up analytical results. Higher cost, complexity. Incremental updates.
    • 13. Analysis Challenges: Word Counts (a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19). New Document: “The red aardvarks live in holes.”
    • 14. Analysis Challenges: HDFS log files (sources/routers, sources/apps, sources/webservers). MapReduce over data from all sources for the week of Jan 13th.
    • 15. One Approach [diagram labels: Collect, Serve, Analyze; Users, Apps; NoSQL; Hadoop MapReduce, HDFS]
    • 16. Possible Approach [diagram labels: Collect, Serve, Analyze; Users, Apps; NoSQL; Hadoop MapReduce]
    • 17. Have to be careful to avoid creating a system that’s bad at everything
    • 18. What could go wrong?
    • 19. Performance Profiles [chart: MapReduce and NoSQL, rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 20. Performance Profiles [chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 21. Performance Profiles [chart: MapReduce, NoSQL, and MapReduce on NoSQL, rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 22. Performance Profiles [chart: MapReduce on NoSQL, rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
    • 23. Best Practices: Use a NoSQL db that has good throughput; it helps to do local communication. Isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries. If MR output is written back to the NoSQL db, it is immediately available for query.
    • 24. THE INTERLLECTIVE: Concept-Based Search
    • 25. Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
    • 26. MongoDB, Python, Ruby on Rails, Hadoop, Thrift
    • 27. Feature Vectors; Unstructured Data
    • 28. www.interllective.com
    • 29. MapReduce on MongoDB: Built-in MapReduce (JavaScript); mongo-hadoop; MongoReduce
    • 30. MongoReduce
    • 31. [diagram labels: Config; Replicas; App servers; Shards, each with S and P nodes; mongos]
    • 32. [diagram labels: Config; Replicas; App servers; MR Workers; Shards; mongos]
    • 33. [diagram labels: Config; Replicas; App servers; Single job; Shards, P nodes; mongos]
    • 34. MongoDB: Mappers read directly from a single mongod process, not through mongos, so reads tend to be local. The balancer can be turned off to avoid the potential for reading data twice. [diagram labels: mongod, Map, HDFS, mongos]
    • 35. MongoReduce: Only MongoDB primaries do writes. Schedule mappers on secondaries. Intermediate output goes to HDFS. [diagram labels: mongod, Map, HDFS, Reduce, mongos]
    • 36. MongoReduce: Final output can go to HDFS or MongoDB. [diagram labels: mongod, HDFS, Reduce, mongos]
    • 37. MongoReduce: Mappers can just write to global MongoDB through mongos. [diagram labels: mongod, Map, HDFS, mongos]
    • 38. What’s Going On? [diagram labels: Map tasks, each emitting r1, r2, r3; mongos; r1, r2, r3; P nodes; Identity Reducer]
    • 39. MongoReduce: Instead of specifying an HDFS directory for input, can submit MongoDB query and select statements: q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}, s = {authors: true}. Queries use indexes!
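
The query and select statements on slide 39 can be read as a filter plus a field projection applied before records reach the mappers. The sketch below emulates that idea in plain Python over an in-memory list. `matches`, `project`, and `map_input` are hypothetical helpers written for illustration, not MongoReduce or MongoDB APIs, the sample documents are made up, and only equality and `$in` clauses are handled.

```python
def matches(doc, query):
    """True if doc satisfies every clause; supports equality and $in only."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True


def project(doc, select):
    """Keep the fields flagged true in select, plus _id (kept by default in MongoDB)."""
    keep = {name for name, wanted in select.items() if wanted} | {"_id"}
    return {k: v for k, v in doc.items() if k in keep}


def map_input(collection, query, select):
    """Yield the projected documents a mapper would receive."""
    for doc in collection:
        if matches(doc, query):
            yield project(doc, select)


# Made-up sample documents standing in for a collection.
articles = [
    {"_id": 1, "article_source": "nytimes.com", "authors": ["A. Writer"], "body": "..."},
    {"_id": 2, "article_source": "example.org", "authors": ["B. Writer"], "body": "..."},
    {"_id": 3, "article_source": "wsj.com", "authors": ["C. Writer"], "body": "..."},
]
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}
records = list(map_input(articles, q, s))
```

This sketch scans every document. A real MongoDB server can answer the same query from an index on `article_source` when one exists, which is what makes pushing the query down to the database attractive compared with reading a whole HDFS directory.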
    • 40. MongoReduce: If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced. Can choose the shard key. Reducers can choose to call update().
    • 41. MongoReduce: If writing output to MongoDB, specify an objectId to ensure idempotent writes, i.e. not a random UUID.
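
Slide 41's point about idempotent writes can be illustrated without a database. The sketch below uses a plain dict as a stand-in for a collection keyed by `_id` and simulates a reducer task that runs twice, as happens when a task is retried. Deriving the `_id` from a SHA-1 hash of the reduce key is an assumption made for illustration; the slide only says to specify the id rather than use a random UUID.

```python
import hashlib
import uuid


def deterministic_id(reduce_key):
    """Derive a stable _id from the reduce key (hypothetical scheme)."""
    return hashlib.sha1(reduce_key.encode("utf-8")).hexdigest()


def write_output(store, results, make_id):
    """Upsert each (key, value) result into store, keyed by _id."""
    for key, value in results:
        _id = make_id(key)
        store[_id] = {"_id": _id, "key": key, "value": value}


results = [("aardvark", 13), ("zebra", 19)]

stable, random_ids = {}, {}
for attempt in range(2):  # simulate a reducer task that is retried once
    write_output(stable, results, deterministic_id)
    write_output(random_ids, results, lambda key: uuid.uuid4().hex)
```

After both attempts, `stable` still holds one document per key because the second run overwrote the first, while `random_ids` holds a duplicate for every key.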
    • 42. https://github.com/acordova/MongoReduce
    • 43. DIY MapRed + NoSQL: YourInputFormat, YourInputSplit, YourRecordReader; YourOutputFormat, YourRecordWriter, YourOutputCommitter
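
In Hadoop the classes named on slide 43 are Java classes. The Python sketch below is only an analogy for the input side, showing how the three roles divide the work: the input format cuts the data store into splits, and a record reader iterates over the records in one split. The list standing in for a NoSQL table, the index-range splits, and the method names `get_splits` and `create_record_reader` are assumptions for illustration. The output side (format, record writer, committer) is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YourInputSplit:
    """A contiguous slice of the store that one mapper will read."""
    start: int
    stop: int


class YourRecordReader:
    """Iterates over the (key, value) records inside one split."""

    def __init__(self, store, split):
        self.store, self.split = store, split

    def __iter__(self):
        for i in range(self.split.start, self.split.stop):
            yield i, self.store[i]


class YourInputFormat:
    """Cuts the store into splits and hands out readers for them."""

    def __init__(self, store, split_size):
        self.store, self.split_size = store, split_size

    def get_splits(self):
        n = len(self.store)
        return [YourInputSplit(s, min(s + self.split_size, n))
                for s in range(0, n, self.split_size)]

    def create_record_reader(self, split):
        return YourRecordReader(self.store, split)


store = ["r1", "r2", "r3", "r4", "r5"]  # stand-in for a NoSQL table
fmt = YourInputFormat(store, split_size=2)
splits = fmt.get_splits()
records = [rec for sp in splits for rec in fmt.create_record_reader(sp)]
```

In a real integration a split would more likely correspond to a shard or key range on a particular node, so that the scheduler can place the mapper close to the data.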
    • 44. Brisk
    • 45. Accumulo
    • 46. Accumulo: Based on Google's BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels, and a server-side programming mechanism called Iterators.
    • 47. Accumulo: Based on Google's BigTable design. Uses Apache Hadoop, ZooKeeper, and Thrift. Features a few novel improvements on the BigTable design: cell-level access labels, and a server-side programming mechanism called Iterators.
    • 48. MapReduce and Accumulo: Can do regular ol’ MapReduce just like w/ MongoDB. But can use Iterators to achieve a kind of ‘continual MapReduce’.
    • 49. [diagram labels: Master; TabletServers; App servers; Ingest clients]
    • 50. [diagram labels: TabletServer, Reduce’; App server; Ingest client, map(); WordCount Table: live:142, in:2342, holes:234]
    • 51. [diagram labels: TabletServer, Reduce’; App server; Ingest client, map(); table: live:142, in:2342, holes:234; incoming document: “The red aardvarks live in holes.”]
    • 52. [diagram labels: TabletServer, Reduce’; App server; Ingest client, map(); table: live:142, in:2342, holes:234; map() output: aardvarks:1, live:1, in:1, holes:1]
    • 53. [diagram labels: TabletServer, Reduce’; App server; Ingest client, map(); table: aardvarks:1, live:143, in:2343, holes:235]
    • 54. Accumulo [diagram labels: Map tasks, each emitting r1, r2, r3; clients running map(); r1, r2, r3; reduce’()]
    • 55. Iterators: row : column family : column qualifier : ts -> value. Can specify which key elements are unique, e.g. row : column family. Can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min().
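
Slide 55 describes choosing which key elements count as unique and which function to apply to values that share those elements. The following is a minimal Python sketch of that grouping idea only. Accumulo's Iterators are a server-side Java mechanism, so the `combine` helper and the sample entries here are hypothetical and do not reflect its API.

```python
from collections import defaultdict

# Each entry: ((row, column_family, column_qualifier, ts), value). Made-up data.
entries = [
    (("doc1", "count", "live", 1), 2),
    (("doc1", "count", "live", 2), 3),
    (("doc1", "count", "holes", 1), 1),
    (("doc2", "count", "live", 1), 4),
]


def combine(entries, key_parts, fn):
    """Group entries by the first key_parts key elements and apply fn to each group."""
    groups = defaultdict(list)
    for key, value in entries:
        groups[key[:key_parts]].append(value)
    return {prefix: fn(values) for prefix, values in groups.items()}


by_row_family = combine(entries, 2, sum)  # unique on row : column family
by_qualifier = combine(entries, 3, max)   # unique on row : family : qualifier
```

Changing `key_parts` changes what counts as "the same key", and swapping `fn` between `sum`, `max`, and `min` changes how colliding values are merged.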
    • 56. Key to performance: when the functions are run. Rather than atomic increment (lock, read, +1, write: SLOW), write all values and sum at read time, minor compaction time, or major compaction time.
    • 57. [diagram labels: TabletServer, scan; stored values live:142, in:2342, holes:234; new values aardvark:1, live:1, in:1, holes:1; Reduce’ sum() returns live:143 on read]
    • 58. [diagram labels: TabletServer, major compact; stored values live:142, in:2342, holes:234; new values aardvark:1, live:1, in:1, holes:1; Reduce’ sum() produces aardvark:1, live:143, in:2343, holes:235]
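
The approach on slides 56 to 58 (write every value, then sum at read or compaction time) can be sketched in a few lines of Python using the word counts from the slides. `SummingTable` is a hypothetical in-memory stand-in, not Accumulo code, and it collapses minor and major compaction into a single `compact` step.

```python
from collections import defaultdict


class SummingTable:
    """Append-only table: writes never read; values are summed lazily."""

    def __init__(self, initial=None):
        self.compacted = dict(initial or {})  # totals as of the last compaction
        self.pending = defaultdict(list)      # values written since then

    def write(self, key, value):
        # No lock and no read-modify-write: just record the new value.
        self.pending[key].append(value)

    def read(self, key):
        # Sum at read time over the compacted total and pending values.
        return self.compacted.get(key, 0) + sum(self.pending.get(key, ()))

    def compact(self):
        # Fold pending values into the stored totals.
        for key, values in self.pending.items():
            self.compacted[key] = self.compacted.get(key, 0) + sum(values)
        self.pending.clear()


table = SummingTable({"live": 142, "in": 2342, "holes": 234})
for word in ["aardvarks", "live", "in", "holes"]:  # map() output from slide 52
    table.write(word, 1)

live_before_compaction = table.read("live")
table.compact()
```

Reads see the up-to-date total whether or not compaction has happened yet, and the write path stays cheap because it never waits on a lock or a read.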
    • 59. Reduce’ (prime): Because a function has not seen all values for a given key (another may show up). More like writing a MapReduce combiner function.
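
Because a Reduce’ function only ever sees some of the values for a key, its result has to come out the same however the values are batched. The Python sketch below, with made-up numbers and hypothetical function names, shows a sum that satisfies this and a mean that only works when partial results carry a (total, count) pair.

```python
from functools import reduce


def reduce_prime_sum(a, b):
    """Combine two partial counts; safe to apply in any grouping."""
    return a + b


def reduce_prime_mean(a, b):
    """Combine two (total, count) pairs so a mean can be derived later."""
    return (a[0] + b[0], a[1] + b[1])


values = [4, 8, 6, 2]

# Combining everything at once or in arbitrary partial batches gives the same sum.
all_at_once = reduce(reduce_prime_sum, values)
in_batches = reduce_prime_sum(reduce(reduce_prime_sum, values[:1]),
                              reduce(reduce_prime_sum, values[1:]))

# Averaging partial averages is wrong when the batches differ in size...
naive_mean = (sum(values[:1]) / 1 + sum(values[1:]) / 3) / 2
# ...but carrying (total, count) pairs combines correctly.
total, count = reduce(reduce_prime_mean, [(v, 1) for v in values])
true_mean = total / count
```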
    • 60. ‘Continuous’ MapReduce: Can maintain huge result sets that are always available for query. Update graph edge weights. Update feature vector weights. Statistical counts: normalize after query to get probabilities.
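
"Normalize after query" means the table stores raw counts and the client divides by the total once the counts have been fetched. A minimal Python sketch with made-up counts follows; `normalize` is a hypothetical helper.

```python
def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


# Counts as they might come back from a query (made-up numbers).
edge_counts = {"a->b": 1, "a->c": 3}
probabilities = normalize(edge_counts)
```

Storing counts rather than probabilities keeps each update a simple addition, so new values never force the other entries to be rewritten.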
    • 61. Accumulo - Latin: to accumulate ...
    • 62. Accumulo - Latin: to accumulate ... awesomeness
    • 63. incubator.apache.org/accumulo; wiki.apache.org/incubator/AccumuloProposal
    • 64. Google Percolator: A system for incrementally processing updates to a large data set. Used to create the Google web search index. Reduced the average age of documents in Google search results by 50%.
    • 65. Google Percolator: A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable.
    • 66. Solution Space: Incremental update, multi-row consistency: Percolator. Results can’t be broken down (sort): MapReduce. No multi-row updates: BigTable. Computation is small: Traditional DBMS.
    • 67. Questions
