Analysis Challenges
Analytical Latency
Data is always old
Answers can take a long time
Serving up analytical results
Higher cost, complexity
Incremental Updates
Analysis Challenges
Word Counts
a: 5342
aardvark: 13
an: 4553
anteater: 27
...
yellow: 302
zebra: 19
New document arrives: "The red aardvarks live in holes."
Analysis Challenges
HDFS log files:
sources/routers
sources/apps
sources/webservers
MapReduce over data from all sources for the week of Jan 13th
Performance Profiles
[Chart: MapReduce, NoSQL, and MapReduce on NoSQL rated from good to bad on throughput, bulk update, latency, and seek]
Best Practices
Use a NoSQL db that has good throughput; it helps to keep communication local
Isolate MapReduce workers to a subset of your NoSQL nodes so that some remain available for fast queries
If MapReduce output is written back to the NoSQL db, it is immediately available for query
Config
[Diagram: a sharded MongoDB cluster with shards, replicas, primaries (P), mongos routers, and app servers; a single MapReduce job runs across the shards]
MongoDB
Mappers read directly from a single mongod process, not through mongos, so reads tend to be local
The balancer can be turned off to avoid the potential for reading data twice
MongoReduce
Only MongoDB primaries do writes
Schedule mappers on secondaries
Intermediate output goes to HDFS
MongoReduce
Final output can go to HDFS or MongoDB
MongoReduce
Mappers can just write to the global MongoDB through mongos
What's Going On?
[Diagram: map tasks, outputs r1/r2/r3, an identity reducer, mongos routers, and the shard primaries (P)]
MongoReduce
Instead of specifying an HDFS directory for input, you can submit MongoDB query and select (projection) statements:
q = {article_source: {$in: ['nytimes.com', 'wsj.com']}}
s = {authors: true}
Queries use indexes!
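For illustration, a minimal sketch of the same query and projection using the MongoDB Java driver's Filters and Projections helpers; the connection string, database, and collection names are hypothetical, and this is not MongoReduce's own job-submission API:

    import com.mongodb.client.*;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;
    import org.bson.conversions.Bson;

    public class ArticleQuery {
        public static void main(String[] args) {
            // hypothetical local mongod and "news.articles" collection
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> articles =
                        client.getDatabase("news").getCollection("articles");
                // q: only articles from nytimes.com or wsj.com
                Bson q = Filters.in("article_source", "nytimes.com", "wsj.com");
                // s: project just the authors field
                Bson s = Projections.include("authors");
                for (Document d : articles.find(q).projection(s)) {
                    System.out.println(d.toJson());
                }
            }
        }
    }

An index on article_source would let the server satisfy q without a full collection scan, which is the point of "Queries use indexes!"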
MongoReduce
If outputting to MongoDB, new collections are automatically sharded, pre-split, and balanced
Can choose the shard key
Reducers can choose to call update()
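For illustration, a minimal sketch of the kind of update() call a reducer might issue, using the MongoDB Java driver's upsert support; the collection, field names, and values are hypothetical, and MongoReduce's actual reducer hook may look different:

    import com.mongodb.client.*;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class UpsertCount {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> counts =
                        client.getDatabase("news").getCollection("word_counts");
                // upsert: increment the count for "aardvark", creating the doc if it is missing
                counts.updateOne(Filters.eq("_id", "aardvark"),
                                 Updates.inc("count", 2),
                                 new UpdateOptions().upsert(true));
            }
        }
    }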
Accumulo
Based on Google's BigTable design
Uses Apache Hadoop, ZooKeeper, and Thrift
Features a few novel improvements on the BigTable design:
cell-level access labels
a server-side programming mechanism called Iterators
MapReduce and Accumulo
Can do regular ol' MapReduce just like with MongoDB
But can use Iterators to achieve a kind of 'continual MapReduce'
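As a sketch of the "regular" path, here is how an Accumulo table might be wired into a Hadoop job using the Accumulo 1.6-era AccumuloInputFormat; the instance, ZooKeeper hosts, user, password, and table name are placeholders, and exact method signatures vary across Accumulo versions:

    import org.apache.accumulo.core.client.ClientConfiguration;
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class AccumuloMRJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "mapreduce over accumulo");
            // read input key/value pairs straight from an Accumulo table
            job.setInputFormatClass(AccumuloInputFormat.class);
            AccumuloInputFormat.setConnectorInfo(job, "mruser", new PasswordToken("secret"));
            AccumuloInputFormat.setZooKeeperInstance(job,
                    ClientConfiguration.loadDefault()
                            .withInstance("myInstance")
                            .withZkHosts("zk1:2181"));
            AccumuloInputFormat.setInputTableName(job, "docs");
            AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);
            // mapper, reducer, and output format are configured here as in any Hadoop job
        }
    }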
Iterators
row : column family : column qualifier : ts -> value
can specify which key elements are unique, e.g.
row : column family
can specify a function to execute on values of identical key portions, e.g.
sum(), max(), min()
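A minimal sketch of attaching one such function, Accumulo's built-in SummingCombiner, as an iterator on a table; the table name ("wordCounts") and column family ("counts") are placeholder names, and this assumes the Accumulo 1.x client API:

    import java.util.Collections;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AttachSummingCombiner {
        public static void attach(Connector connector) throws Exception {
            // run at priority 10 under the name "sum"
            IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
            // values are string-encoded longs
            SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
            // combine all values whose keys share the "counts" column family
            SummingCombiner.setColumns(setting,
                    Collections.singletonList(new IteratorSetting.Column("counts")));
            // applied at scan, minor compaction, and major compaction time
            connector.tableOperations().attachIterator("wordCounts", setting);
        }
    }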
Key to performance
When the functions are run:
Rather than an atomic increment (lock, read, +1, write: SLOW),
write all values and sum them at
read time
minor compaction time
major compaction time
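A companion sketch of the write path under this scheme: every update just appends another "1" cell, and the SummingCombiner configured above collapses them at scan and compaction time (same placeholder table and column names as before):

    import java.nio.charset.StandardCharsets;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class WriteIncrements {
        public static void countWord(Connector connector, String word) throws Exception {
            BatchWriter writer = connector.createBatchWriter("wordCounts", new BatchWriterConfig());
            // no lock/read/modify/write cycle: just append another "1" cell;
            // the combiner sums all cells for this row and column later
            Mutation m = new Mutation(word);
            m.put("counts", "", new Value("1".getBytes(StandardCharsets.UTF_8)));
            writer.addMutation(m);
            writer.close();
        }
    }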
Reduce’ (prime)
Because a function has not seen all values for a given key (another may show up later),
this is more like writing a MapReduce combiner function
‘Continuous’ MapReduce
Can maintain huge result sets that are always available for query
Update graph edge weights
Update feature vector weights
Statistical counts
normalize after query to get probabilities
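To illustrate the last point, a small sketch that scans the (already combined) counts from the placeholder wordCounts table and normalizes them into probabilities at query time:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class WordProbabilities {
        public static Map<String, Double> query(Connector connector) throws Exception {
            Scanner scan = connector.createScanner("wordCounts", Authorizations.EMPTY);
            Map<String, Long> counts = new HashMap<>();
            long total = 0;
            // the SummingCombiner has already collapsed the raw "1" cells into per-word totals
            for (Map.Entry<Key, Value> e : scan) {
                long c = Long.parseLong(e.getValue().toString());
                counts.put(e.getKey().getRow().toString(), c);
                total += c;
            }
            // normalize after the query to get probabilities
            Map<String, Double> probs = new HashMap<>();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                probs.put(e.getKey(), (double) e.getValue() / total);
            }
            return probs;
        }
    }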
Google Percolator
A system for incrementally processing updates to a large data set
Used to create the Google web search index
Reduced the average age of documents in Google search results by 50%
Google Percolator
A novel, proprietary system of Distributed Transactions and Notifications built on top of BigTable
Solution Space
Incremental update, multi-row consistency: Percolator
Results can't be broken down into small updates (e.g. sorting): MapReduce
No need for multi-row updates: BigTable
Computation is small: Traditional DBMS