12.
Analysis Challenges
Analytical Latency
Data is always old
Answers can take a long time
Serving up analytical results: higher cost, complexity
Incremental Updates
13.
Analysis Challenges
Word Counts
a: 5342
aardvark: 13
an: 4553
anteater: 27
...
yellow: 302
zebra: 19
New document: “The red aardvarks live in holes.”
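The update itself is cheap when the counts live in a mutable store - a minimal Java sketch (all names illustrative) of folding the new document into the table:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: fold one new document into an existing count table.
    Map<String, Long> counts = new HashMap<>();  // a: 5342, aardvark: 13, ...
    String newDoc = "The red aardvarks live in holes.";
    for (String word : newDoc.toLowerCase().replaceAll("[^a-z ]", "").split(" ")) {
        counts.merge(word, 1L, Long::sum);       // in-place increment
    }

The challenge: a batch system like MapReduce has no cheap merge() - absorbing one new document means recounting the whole corpus.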
14.
Analysis Challenges
HDFS log files:
sources/routers
sources/apps
sources/webservers
MapReduce over data from all sources for the week of Jan 13th
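A sketch of pointing one such job at all three source directories, assuming stock Hadoop (paths taken from the slide; any date filtering would still happen inside the job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // One job scans every source for the week's analysis.
    Job job = Job.getInstance(new Configuration(), "all-sources-week-of-jan-13");
    FileInputFormat.addInputPath(job, new Path("sources/routers"));
    FileInputFormat.addInputPath(job, new Path("sources/apps"));
    FileInputFormat.addInputPath(job, new Path("sources/webservers"));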
19.
Performance Profiles
[Chart: MapReduce vs. NoSQL, each rated from good to bad on throughput, bulk update, latency, and seek]
20.
Performance Profiles
[Chart: MapReduce, NoSQL, and MapReduce on NoSQL compared on the same four axes]
22.
Performance Profiles
[Chart: MapReduce on NoSQL alone - its profile across throughput, bulk update, latency, and seek]
23.
Best Practices
Use a NoSQL db that has good throughput - it helps to do local communication
Isolate MapReduce workers to a subset of your NoSQL nodes so that some remain available for fast queries
If MR output is written back to the NoSQL db, it is immediately available for query - see the sketch below
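One way to wire up that last practice is the mongo-hadoop connector; the class and configuration-key names below are that project's, and the URI is illustrative:

    import com.mongodb.hadoop.MongoOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    // Target collection for job output (mongo-hadoop's key).
    conf.set("mongo.output.uri", "mongodb://localhost:27017/analytics.results");
    Job job = Job.getInstance(conf, "mr-results-to-mongo");
    job.setOutputFormatClass(MongoOutputFormat.class);
    // Every record the reducers emit lands as a document in analytics.results,
    // where it can be queried the moment it is written.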
24.
THE INTERLLECTIVE
Concept-Based Search
33.
[Diagram: a single job running against a sharded MongoDB cluster - config replicas, app servers, mongos routers, and shards with primaries (P)]
34.
MongoDB
Mappers read directly from a single mongod process, not through mongos - reads tend to be local
Balancer can be turned off to avoid the potential for reading data twice
[Diagram: Map tasks read from local mongod processes and write to HDFS, bypassing mongos]
35.
MongoReduce
Only MongoDb primaries do writes. Schedule mappers on secondaries
Intermediate output goes to HDFS
[Diagram: Map tasks read from mongod secondaries; Reduce tasks write intermediate output to HDFS]
36.
MongoReduce
Final output can go to HDFS or MongoDb
[Diagram: Reduce tasks write either to HDFS or back to MongoDb through mongos]
37.
MongoReduce
Mappers can just write to global MongoDb through mongos
[Diagram: Map tasks write directly to MongoDb through mongos]
38.
What’s Going On?
[Diagram: four Map tasks partition their output into r1/r2/r3 buckets; an identity reducer sends each partition through mongos to the shard primaries (P)]
39.
MongoReduce
Instead of specifying an HDFS directory for input, can submit MongoDb query and select statements:
q = {article_source: {$in: ["nytimes.com", "wsj.com"]}}
s = {authors: true}
Queries use indexes!
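A sketch of handing q and s to the job; the configuration keys shown are the mongo-hadoop connector's - MongoReduce's own key names may differ:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // Query and projection passed to the input format as JSON strings.
    conf.set("mongo.input.query",
        "{\"article_source\": {\"$in\": [\"nytimes.com\", \"wsj.com\"]}}");
    conf.set("mongo.input.fields", "{\"authors\": true}");

Because article_source can be indexed, input selection becomes an index lookup instead of a full collection scan.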
40.
MongoReduce
If outputting to MongoDb, new collections are automatically sharded, pre-split, and balanced
Can choose the shard key
Reducers can choose to call update()
41.
MongoReduce
If writing output to MongoDb, specify an objectId to ensure idempotent writes - i.e. not a random UUID
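A hypothetical reducer emit using the 2.x mongo-java-driver, with _id derived from the reduce key so a re-executed (speculative or retried) reducer overwrites its earlier output instead of duplicating it:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;

    void emit(DBCollection out, String key, long count) {
        // save() upserts by _id; a random UUID here would duplicate on retry.
        out.save(new BasicDBObject("_id", key).append("count", count));
    }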
46.
Accumulo
Based on Google's BigTable design
Uses Apache Hadoop, Zookeeper, and Thrift
Features a few novel improvements on the BigTable design
cell-level access labels
server-side programming mechanism called Iterators
48.
MapReduce and Accumulo
Can do regular ol’ MapReduce just like w/ MongoDb
But can use Iterators to achieve a kind of ‘continual MapReduce’
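For the regular case, Accumulo ships a Hadoop InputFormat; a sketch using the 1.5-era API (instance, table, and credentials are placeholders):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.mapreduce.Job;

    Job job = Job.getInstance();
    job.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setConnectorInfo(job, "mrUser", new PasswordToken("secret"));
    AccumuloInputFormat.setZooKeeperInstance(job, "myInstance", "zk1:2181");
    AccumuloInputFormat.setInputTableName(job, "logs");
    AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("public"));

Mappers then receive Key/Value pairs scanned straight off the tablet servers.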
55.
Iterators
row : column family : column qualifier : ts -> value
can specify which key elements are unique, e.g.
row : column family
can specify a function to execute on values of identical key-portions, e.g.
sum(), max(), min()
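Accumulo ships stock combiners for exactly this; a sketch attaching a SummingCombiner (table and column names assumed):

    import java.util.Collections;
    import java.util.EnumSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    void attachSum(Connector conn) throws Exception {
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        // Values whose keys collide on the "counts" column family get summed.
        SummingCombiner.setColumns(setting,
            Collections.singletonList(new IteratorSetting.Column("counts")));
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        conn.tableOperations().attachIterator("wordcounts", setting,
            EnumSet.of(IteratorScope.scan, IteratorScope.minc, IteratorScope.majc));
    }

The three scopes are what the next slide is about: the function runs at scan (read) time and at minor and major compaction time.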
56.
Key to performance
When the functions are run
Rather than atomic increment (lock, read, +1, write - SLOW), write all values and sum at:
read time
minor compaction time
major compaction time
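With the combiner attached, an “increment” is just a blind write; a sketch with Accumulo's client API (table name assumed, matching the combiner sketch above):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    void increment(Connector conn, String word) throws Exception {
        BatchWriter writer = conn.createBatchWriter("wordcounts", new BatchWriterConfig());
        Mutation m = new Mutation(word);
        // No lock/read/+1/write cycle: append a bare "1" cell and let the
        // SummingCombiner fold it in at read and compaction time.
        m.put("counts", "", new Value("1".getBytes()));
        writer.addMutation(m);
        writer.close();
    }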
59.
Reduce’ (prime)
Because a function has not seen all values for a given key - another may show up
More like writing a MapReduce combiner function
60.
‘Continuous’ MapReduce
Can maintain huge result sets that are always available
for query
Update graph edge weights
Update feature vector weights
Statistical counts - normalize after query to get probabilities
64.
Google Percolator
A system for incrementally processing updates to a large data set
Used to create the Google web search index
Reduced the average age of documents in Google search results by 50%
65.
Google Percolator
A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable
66.
Solution Space
Incremental update, multi-row consistency: Percolator
Results can’t be broken down (sort): MapReduce
No multi-row updates: BigTable
Computation is small: Traditional DBMS