MapReduce and NoSQL

A discussion of MapReduce over NoSQL databases, including MongoDB and Accumulo

Usage Rights

CC Attribution License

MapReduce and NoSQL: Presentation Transcript

  • MapReduce and NoSQL: Exploring the Solution Space
  • Changing the Game
  • Cost [chart: axes Dollars and Data Volume; series: Conventional, MR + NoSQL]
  • Scale [chart: axes Dollars and Data Volume; series: Conventional, MR + NoSQL]
  • Performance Profiles [chart: MapReduce and NoSQL each rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
  • Data Goals: Collect, Serve, Analyze
  • Traditional Approach [diagram: Users, Apps; Transactional System (OLTP); Analytical System (OLAP); stages: Collect, Serve, Analyze]
  • One Approach [diagram: Users, Apps; NoSQL; Hadoop MapReduce; HDFS; stages: Collect, Serve, Analyze]
  • Analysis Challenges
    Analytical latency
    Data is always old
    Answers can take a long time
    Serving up analytical results
    Higher cost, complexity
    Incremental updates
  • Analysis Challenges
    Word counts: a: 5342, aardvark: 13, an: 4553, anteater: 27, ..., yellow: 302, zebra: 19
    New document: “The red aardvarks live in holes.”
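
The incremental-update problem on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the tokenizer, the `count_words` helper, and the in-memory `Counter` standing in for stored results are all assumptions. The point is that a new document only needs its own counts folded into the existing totals, rather than a recount of everything.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter of lowercase word tokens (hypothetical tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Existing results from an earlier batch job (values taken from the slide).
word_counts = Counter({"a": 5342, "aardvark": 13, "an": 4553,
                       "anteater": 27, "yellow": 302, "zebra": 19})

new_document = "The red aardvarks live in holes."

# Incremental update: compute counts for the new document only,
# then add them to the stored totals.
delta = count_words(new_document)
word_counts.update(delta)
```

No stemming is applied here, so "aardvarks" becomes a new key and the existing "aardvark" count is untouched.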
  • Analysis Challenges
    HDFS log files: sources/routers, sources/apps, sources/webservers
    MapReduce over data from all sources for the week of Jan 13th
  • One Approach [diagram: Users, Apps; NoSQL; Hadoop MapReduce; HDFS; stages: Collect, Serve, Analyze]
  • Possible Approach [diagram: Users, Apps; NoSQL; Hadoop MapReduce; stages: Collect, Serve, Analyze]
  • Have to be careful to avoid creating a system that’s bad at everything
  • What could go wrong?
  • Performance Profiles [chart: MapReduce, NoSQL, and MapReduce on NoSQL each rated from Good to Bad on Throughput, Bulk Update, Latency, Seek]
  • Best Practices
    Use a NoSQL db that has good throughput - it helps to do local communication
    Isolate MapReduce workers to a subset of your NoSQL nodes so that some are available for fast queries
    If MR output is written back to the NoSQL db, it is immediately available for query
  • The Interllective: Concept-Based Search
  • Patents, News Articles, PubMed, Clinical Trials, arXiv Articles
  • MongoDB, Python, Ruby on Rails, Hadoop, Thrift
  • Feature Vectors; Unstructured Data
  • www.interllective.com
  • MapReduce on MongoDB
    Built-in MapReduce - JavaScript
    mongo-hadoop
    MongoReduce
  • MongoReduce
  • [diagram: Config, Replicas, App servers, Shards with primaries (P) and secondaries (S), mongos]
  • [diagram: Config, Replicas, App servers, MR Workers, Shards, mongos]
  • [diagram: Config, Replicas, App servers, Shards with primaries (P), mongos; Single job]
  • MongoDB
    Mappers read directly from a single mongod process, not through mongos - tends to be local
    Balancer can be turned off to avoid potential for reading data twice
    [diagram: mongod, Map, HDFS, mongos]
  • MongoReduce
    Only MongoDB primaries do writes. Schedule mappers on secondaries
    Intermediate output goes to HDFS
    [diagram: mongod, Map, HDFS, Reduce, mongos]
  • MongoReduce
    Final output can go to HDFS or MongoDB
    [diagram: mongod, HDFS, Reduce, mongos]
  • MongoReduce
    Mappers can just write to global MongoDB through mongos
    [diagram: mongod, Map, HDFS, mongos]
  • What’s Going On? [diagram: Map tasks, each with outputs r1, r2, r3; mongos; primaries (P) for r1, r2, r3; Identity Reducer]
  • MongoReduce
    Instead of specifying an HDFS directory for input, can submit MongoDB query and select statements:
    q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
    s = {authors: true}
    Queries use indexes!
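
To make the query and select statements concrete, here is a plain-Python emulation of what they ask for. It does not use MongoDB or the MongoReduce API: `matches`, `project`, and the sample `articles` are hypothetical, only equality and `$in` are handled, and real MongoDB projections also return `_id` by default.

```python
def matches(doc, query):
    """Return True if doc satisfies query (equality and $in only)."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$in" in cond:
            if value not in cond["$in"]:
                return False
        elif value != cond:
            return False
    return True


def project(doc, select):
    """Keep only the fields flagged True in select."""
    return {f: doc[f] for f, keep in select.items() if keep and f in doc}


# The query and select statements from the slide, written as Python dicts.
q = {"article_source": {"$in": ["nytimes.com", "wsj.com"]}}
s = {"authors": True}

# Made-up sample documents for illustration.
articles = [
    {"article_source": "nytimes.com", "authors": ["A. Writer"], "title": "One"},
    {"article_source": "example.org", "authors": ["C. Blogger"], "title": "Two"},
    {"article_source": "wsj.com", "authors": ["B. Reporter"], "title": "Three"},
]

# Only matching documents, reduced to the selected fields, reach the mappers.
mapper_input = [project(d, s) for d in articles if matches(d, q)]
```

In the real system the filtering happens inside the database, which is why indexes can be used; the list comprehension here scans every document.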
  • MongoReduce
    If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced
    Can choose the shard key
    Reducers can choose to call update()
  • MongoReduce
    If writing output to MongoDB, specify an objectId to ensure idempotent writes - i.e. not a random UUID
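
A small sketch of why the id matters, under stated assumptions: a Python dict stands in for the output collection keyed by id, and `deterministic_id`, `write_output`, and `run_reducer` are hypothetical helpers, not part of MongoReduce. When a reduce task is retried, ids derived from the reduce key overwrite the earlier attempt's documents, while random UUIDs create duplicates.

```python
import hashlib
import uuid


def deterministic_id(reduce_key):
    """Derive a stable id from the reduce key (one possible scheme)."""
    return hashlib.sha1(reduce_key.encode("utf-8")).hexdigest()


def write_output(collection, doc_id, value):
    """Upsert: insert or overwrite the document stored under doc_id."""
    collection[doc_id] = value


def run_reducer(collection, make_id):
    """Simulate one reduce task attempt writing its results."""
    for key, total in [("aardvarks", 1), ("live", 143)]:
        write_output(collection, make_id(key), {"word": key, "count": total})


# Ids derived from the key: a retried attempt rewrites the same documents.
stable = {}
run_reducer(stable, deterministic_id)
run_reducer(stable, deterministic_id)  # retried task attempt

# Random UUIDs: the retry leaves duplicate documents behind.
duplicated = {}
run_reducer(duplicated, lambda key: uuid.uuid4().hex)
run_reducer(duplicated, lambda key: uuid.uuid4().hex)  # retried task attempt
```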
  • https://github.com/acordova/MongoReduce
  • DIY MapRed + NoSQL
    YourInputFormat, YourInputSplit, YourRecordReader
    YourOutputFormat, YourRecordWriter, YourOutputCommitter
  • Brisk
  • Accumulo
  • Accumulo
    Based on Google's BigTable design
    Uses Apache Hadoop, ZooKeeper, and Thrift
    Features a few novel improvements on the BigTable design:
      cell-level access labels
      server-side programming mechanism called Iterators
  • MapReduce and Accumulo
    Can do regular ol’ MapReduce just like w/ MongoDB
    But can use Iterators to achieve a kind of ‘continual MapReduce’
  • [diagram: TabletServers, App servers, Master, Ingest clients]
  • [diagram: TabletServer (Reduce’), App server, Ingest client (map()); WordCount Table: live:142, in:2342, holes:234]
  • [diagram: TabletServer (Reduce’), App server, Ingest client (map()); table: live:142, in:2342, holes:234; incoming document: “The red aardvarks live in holes.”]
  • [diagram: TabletServer (Reduce’), App server, Ingest client (map()); table: live:142, in:2342, holes:234; map() output: aardvarks:1, live:1, in:1, holes:1]
  • [diagram: TabletServer (Reduce’), App server, Ingest client (map()); table: aardvarks:1, live:143, in:2343, holes:235]
  • Accumulo [diagram: Map tasks and map() clients, each with outputs r1, r2, r3; r1, r2, r3 with reduce’()]
  • Iterators
    row : column family : column qualifier : ts -> value
    Can specify which key elements are unique, e.g. row : column family
    Can specify a function to execute on values of identical key-portions, e.g. sum(), max(), min()
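
The idea on this slide, grouping entries on a chosen portion of the key and applying a function to each group, can be sketched in plain Python. This is not the Accumulo Iterator API; the `Key` tuple, the `combine` helper, and the sample entries are assumptions made for illustration.

```python
from collections import namedtuple

# Key layout from the slide: row : column family : column qualifier : ts
Key = namedtuple("Key", ["row", "family", "qualifier", "ts"])


def combine(entries, unique_fields, func):
    """Group (Key, value) pairs on unique_fields and apply func per group."""
    groups = {}
    for key, value in entries:
        portion = tuple(getattr(key, f) for f in unique_fields)
        groups.setdefault(portion, []).append(value)
    return {portion: func(values) for portion, values in groups.items()}


# Several versions (different ts) of the same row : column family.
entries = [
    (Key("live", "count", "", 1), 142),
    (Key("live", "count", "", 2), 1),
    (Key("holes", "count", "", 1), 234),
    (Key("holes", "count", "", 2), 1),
]

# Treat row : column family as the unique portion of the key.
totals = combine(entries, ("row", "family"), sum)
largest = combine(entries, ("row", "family"), max)
```

Changing `unique_fields` changes which entries are considered "the same key", and changing `func` changes how their values are merged.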
  • Key to performance
    When the functions are run
    Rather than atomic increment: lock, read, +1, write (SLOW)
    Write all values, sum at:
      read time
      minor compaction time
      major compaction time
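
A toy model of "write all values, sum later", assuming an in-memory dict in place of a tablet server. The `SummingTable` class and its method names are hypothetical; the sketch only shows that writes never read existing data, and that the sum is produced either when a key is read or when a compaction collapses the pending values.

```python
class SummingTable:
    """Append-only store that defers summation to read or compaction time."""

    def __init__(self):
        self._values = {}  # key -> list of values written so far

    def write(self, key, value):
        """Append without reading: no lock and no read-modify-write cycle."""
        self._values.setdefault(key, []).append(value)

    def read(self, key):
        """Sum whatever values are currently stored for key."""
        return sum(self._values.get(key, []))

    def compact(self):
        """Collapse each key's values into a single stored sum."""
        for key, values in self._values.items():
            self._values[key] = [sum(values)]

    def stored_entries(self, key):
        """Number of physical values held for key (for illustration)."""
        return len(self._values.get(key, []))


table = SummingTable()
table.write("live", 142)
table.write("live", 1)

before = table.stored_entries("live")   # two separate values on disk
at_read = table.read("live")            # summed at read time
table.compact()
after = table.stored_entries("live")    # one value after compaction
```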
  • [diagram: TabletServer holds aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; on scan, Reduce’ sum() at read returns live:143]
  • [diagram: TabletServer holds aardvark:1, live:142, live:1, in:2342, in:1, holes:234, holes:1; on major compact, Reduce’ sum() writes aardvark:1, live:143, in:2343, holes:235]
  • Reduce’ (prime)
    Because a function has not seen all values for a given key - another may show up
    More like writing a MapReduce combiner function
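
Because the function may run on only some of a key's values, it has to give the same answer however the values are grouped. The sketch below, with made-up values and hypothetical helper names, shows that a sum survives partial application while a naive mean does not, and that carrying (sum, count) pairs restores the correct mean.

```python
from functools import reduce


def add(a, b):
    """Combiner-safe: associative and commutative."""
    return a + b


def mean(values):
    return sum(values) / len(values)


def merge(a, b):
    """Merge two (sum, count) pairs; also combiner-safe."""
    return (a[0] + b[0], a[1] + b[1])


values = [142, 1, 1, 1]

# Sum: applying the function to a partial set first gives the same result.
all_at_once = reduce(add, values)
partial = reduce(add, values[:3])                 # e.g. at a compaction
with_late_value = reduce(add, [partial] + values[3:])

# Mean: averaging a partial mean with a late value gives the wrong answer.
true_mean = mean(values)
naive_mean = mean([mean(values[:3])] + values[3:])

# Fix: combine (sum, count) pairs and divide only when the result is read.
pairs = [(v, 1) for v in values]
partial_pair = reduce(merge, pairs[:3])
total_sum, total_count = reduce(merge, [partial_pair] + pairs[3:])
fixed_mean = total_sum / total_count
```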
  • ‘Continuous’ MapReduce
    Can maintain huge result sets that are always available for query
    Update graph edge weights
    Update feature vector weights
    Statistical counts: normalize after query to get probabilities
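
The last point, keeping raw counts in the table and normalizing only after a query, might look like this. The `normalize` helper and the sample edge counts are assumptions for illustration.

```python
def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


# Hypothetical query result: outgoing edge counts for one graph node.
edge_counts = {"b": 3, "c": 1}
edge_probabilities = normalize(edge_counts)
```

Storing counts rather than probabilities keeps each update a simple addition; a stored probability would need every sibling value rewritten whenever one count changes.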
  • Accumulo - Latin: to accumulate ... awesomeness
  • incubator.apache.org/accumulo
    wiki.apache.org/incubator/AccumuloProposal
  • Google Percolator
    A system for incrementally processing updates to a large data set
    Used to create the Google web search index
    Reduced the average age of documents in Google search results by 50%
  • Google Percolator
    A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable
  • Solution Space
    Incremental update, multi-row consistency: Percolator
    Results can’t be broken down (sort): MapReduce
    No multi-row updates: BigTable
    Computation is small: Traditional DBMS
  • Questions