MapR is a distributed filesystem and data platform modeled after Hadoop. It maintains API compatibility with Hadoop while exceeding it in performance, manageability, and other respects.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
Several other libraries, such as Vaex and Modin, also try to scale the pandas API. Dask is one of them and is very popular among pandas users; it runs on its own cluster, much as Koalas runs on top of a Spark cluster. In this talk, we will introduce Koalas and its current status, and compare Koalas with Dask, including benchmarking.
Accelerating Data Processing in Spark SQL with Pandas UDFs (Databricks)
Spark SQL provides a convenient layer of abstraction for users to express their query’s intent while letting Spark handle the more difficult task of query optimization. Since Spark 2.3, the addition of pandas UDFs has allowed users to define arbitrary functions in Python that execute on batches of data, giving them the flexibility required to write queries that suit very niche cases.
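As a hedged illustration of the Spark 2.3-era API, the sketch below defines a scalar pandas UDF that operates on whole batches (pandas Series) at a time; the column names and SparkSession setup are assumptions, not taken from the talk.

```python
# A minimal sketch of a scalar pandas UDF (Spark 2.3+ style); column names
# and data are illustrative assumptions.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.1)], ["id", "price"])

@pandas_udf("double", PandasUDFType.SCALAR)
def add_tax(price):
    # Executed on whole pandas Series batches rather than row by row.
    return price * 1.08

df.select("id", add_tax("price").alias("price_with_tax")).show()
```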
Hyperspace: An Indexing Subsystem for Apache Spark (Databricks)
At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to exploratory, ‘finding a needle in a haystack’ queries (e.g., point lookups, summarization).
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data.
Koalas: Making an Easy Transition from Pandas to Apache Spark (Databricks)
In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.
Pandas is the standard tool for data science in Python, and it is typically the first step data scientists take to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.
When data scientists work with very large data sets today, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas DataFrames (a minimal sketch follows the list below).
Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas
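As a minimal sketch of that conversion (column names, data, and the Arrow configuration key are illustrative assumptions), moving between the two DataFrame types looks roughly like this:

```python
# A minimal sketch of moving between Spark and pandas DataFrames.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-pandas-interop").getOrCreate()

# Start with a pandas DataFrame on the driver ...
pdf = pd.DataFrame({"user": ["a", "b", "c"], "clicks": [10, 3, 7]})

# ... distribute it as a Spark DataFrame for large-scale transformations ...
sdf = spark.createDataFrame(pdf)
top = sdf.filter(sdf.clicks > 5)

# ... and collect the (now small) result back to pandas for local analysis.
# Enabling Arrow speeds up the conversion; the key is
# spark.sql.execution.arrow.pyspark.enabled on Spark 3.x.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result_pdf = top.toPandas()
print(result_pdf)
```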
R is a very popular platform for data science. Apache Spark is a highly scalable data platform. How can we have the best of both worlds? How can a data scientist leverage the rich 10,000+ packages on CRAN and integrate Spark into their existing data science toolset?
SparkR is a new language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on big data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used to work with your favorite R packages. We will also discuss best practices around using these new features, and look at exciting changes already in, and coming next in, the Apache Spark 2.x releases.
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in... (DataWorks Summit)
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy/unstructured files, a structured/columnar historical data warehouse, or arriving in real time from pub/sub systems like Kafka and Kinesis.
We'll walk through a concrete example where, in less than 10 lines, we read from Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
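A hedged sketch of such a pipeline in PySpark is shown below; the broker address, topic name, and JSON schema are assumptions, and running it requires the spark-sql-kafka connector package on the classpath.

```python
# A minimal sketch: read Kafka, parse a JSON payload, aggregate by event time
# with a watermark, and write the result out continuously.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

schema = (StructType()
          .add("device", StringType())
          .add("ts", TimestampType())
          .add("reading", DoubleType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
          .option("subscribe", "events")                      # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Event-time aggregation; the watermark bounds state kept for late data.
counts = (events
          .withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("device"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")   # in practice: a table or sink for ad-hoc queries
         .start())
```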
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, the decisions we made, and other options for integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times, and I am sure others will benefit from the gotchas we were able to identify.
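One common pattern for this kind of integration, sketched here under assumptions (the shared-library name libsentiment.so and its score_text signature are hypothetical, not Bloomberg's actual API), is to load the compiled library with ctypes inside mapPartitions so it is loaded once per partition:

```python
# A hedged sketch of calling existing C++ code from PySpark via ctypes.
import ctypes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cpp-from-pyspark").getOrCreate()
sc = spark.sparkContext
# Ship the shared object to every executor (path is an assumption).
sc.addFile("/opt/models/libsentiment.so")

def score_partition(rows):
    from pyspark import SparkFiles
    lib = ctypes.CDLL(SparkFiles.get("libsentiment.so"))
    lib.score_text.argtypes = [ctypes.c_char_p]   # hypothetical C API
    lib.score_text.restype = ctypes.c_double
    for text in rows:
        yield (text, lib.score_text(text.encode("utf-8")))

texts = sc.parallelize(["markets rallied today", "shares fell on weak earnings"])
print(texts.mapPartitions(score_partition).collect())
```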
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas’ new functionalities since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI (a short usage sketch follows this list)
Pre-register for Databricks Community Edition
Read koalas docs
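A minimal getting-started sketch, assuming Koalas has been installed from PyPI as above (the column names and data are illustrative):

```python
# Koalas exposes the pandas API, but operations run as Spark jobs.
import databricks.koalas as ks

# Create a Koalas DataFrame exactly as you would a pandas one.
kdf = ks.DataFrame({"city": ["SEA", "SFO", "SEA"], "temp": [61, 68, 59]})

# Familiar pandas-style operations are executed on the Spark cluster.
print(kdf.groupby("city")["temp"].mean())

# Interoperate with pandas and Spark when needed.
pdf = kdf.to_pandas()   # collect to a local pandas DataFrame
sdf = kdf.to_spark()    # get the underlying Spark DataFrame
```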
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Databricks)
Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs one of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges.
In this hands-on tutorial we will present Koalas, a new open source project. Koalas is an open source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data across a majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth, we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing.
This document discusses Hivemall, an open source machine learning library for Apache Hive, Spark, and Pig. In brief:
Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks like classification, regression, and recommendation using SQL queries. Hivemall supports many popular machine learning algorithms and can run in parallel on large datasets using Apache Spark, Hive, Pig, and other big data frameworks. The document outlines how to run a machine learning workflow with Hivemall on Spark, including loading data, building a model, and making predictions.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overhead because all data and intermediate state of compute tasks is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs (Databricks)
This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts in order to be challenging, free to use, and have a commercial license.
2) The conceptual workload is to take hour-long audio files, split them into 15 second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset encountered limitations with accelerator-aware scheduling in Spark, memory issues with PySpark UDFs, crashes in TPUs, and the need to reorder data by
Pandas UDF and Python Type Hint in Apache Spark 3.0 (Databricks)
In the past several years, pandas UDFs have been perhaps the most important change to Apache Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusion among users. In Apache Spark 3.0, the pandas UDFs were redesigned by leveraging Python type hints.
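A brief sketch of the Spark 3.0 style, where the kind of pandas UDF is inferred from Python type hints instead of a PandasUDFType argument; the DataFrame contents are illustrative assumptions.

```python
# Spark 3.0-style pandas UDF: the Series -> Series type hints mark it as a
# scalar pandas UDF.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-type-hints").getOrCreate()
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "visits"])

@pandas_udf("string")
def shout(name: pd.Series) -> pd.Series:
    # Runs on whole pandas Series batches.
    return name.str.upper()

df.select(shout("name").alias("name_upper"), "visits").show()
```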
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks (DataWorks Summit)
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured, row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodeGen to improve performance through Java JIT-compiled code. However, the Java JIT usually does not make good use of the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based expression engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
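As a small illustration of the columnar-kernel idea from Python (the engine described above works at the C++/Gandiva level; this pyarrow sketch with made-up columns only shows the same concept):

```python
# Columns live in contiguous, SIMD-friendly buffers; compute kernels operate
# on whole columns at once rather than row by row.
import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.0, 12.5, 9.9, 14.2])
qty = pa.array([3, 1, 7, 2])

revenue = pc.multiply(prices, pc.cast(qty, pa.float64()))
print(revenue)
print(pc.sum(revenue))
```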
Spark started at Facebook as an experiment when the project was still in its early phases. Spark's appeal stemmed from its ease of use and an integrated environment to run SQL, MLlib, and custom applications. At that time the system was used by a handful of people to process small amounts of data. However, we've come a long way since then. Currently, Spark is one of the primary SQL engines at Facebook in addition to being the primary system for writing custom batch applications. This talk will cover the story of how we optimized, tuned and scaled Apache Spark at Facebook to run on 10s of thousands of machines, processing 100s of petabytes of data, and used by 1000s of data scientists, engineers and product analysts every day. In this talk, we'll focus on three areas:
* *Scaling Compute*: How Facebook runs Spark efficiently and reliably on tens of thousands of heterogeneous machines in disaggregated (shared-storage) clusters.
* *Optimizing Core Engine*: How we continuously tune, optimize and add features to the core engine in order to maximize the useful work done per second.
* *Scaling Users*: How we make Spark easy to use, and faster to debug, to seamlessly onboard new users.
Speakers: Ankit Agarwal, Sameer Agarwal
This document discusses using Apache Spark and Amazon DSSTNE to generate product recommendations at scale. It summarizes that Amazon uses Spark and Zeppelin notebooks to allow data scientists to develop queries in an agile manner. Deep learning jobs are run on GPUs using Amazon ECS, while CPU jobs run on Amazon EMR. DSSTNE is optimized for large sparse neural networks and allows defining networks in a human-readable JSON format to efficiently handle Amazon's large recommendation problems.
Relationship Extraction from Unstructured Text Based on Stanford NLP with Spa... (Spark Summit)
This document discusses using Stanford NLP and Spark to extract relationships from unstructured text. It presents a pipeline for annotating entities in oil and gas supply chain text using NER, extracting relationships using pattern matching, and simplifying sentences. The pipeline is implemented using Spark for scalability and fault tolerance. Benefits of the approach include code reuse between batch and streaming layers and easy distribution of NLP processing.
Best Practices for Building Robust Data Platform with Apache Spark and Delta (Databricks)
This talk will focus on the journey of technical challenges, trade-offs, and ground-breaking achievements in building performant and scalable pipelines, drawn from experience working with our customers.
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Stories About Spark, HPC and Barcelona by Jordi Torres (Spark Summit)
HPC in Barcelona is centered around the MareNostrum supercomputer and BSC's 425-person team from 40 countries. MareNostrum allows simulation and analysis in fields like life sciences, earth sciences, and engineering. To meet new demands of big data analytics, BSC developed the Spark4MN module to run Spark workloads on MareNostrum. Benchmarking showed Spark4MN achieved good speed-up and scale-out. Further work profiles Spark using BSC tools and benchmarks workloads like image analysis on different hardware. BSC's vision is to advance understanding through technologies like cognitive computing and deep learning.
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V... (MongoDB)
This document discusses analyzing flight data using MongoDB aggregation. It provides examples of aggregation pipelines using the group, match, project, sort, unwind, and other stages. It explores questions about major carriers, airport cancellations, and delays by distance and carrier. It also discusses visualizing route data and hub airports. Finally, it proposes a quiz on analyzing NYC flight data by importing data and performing queries on origins, cancellations, delays, and weather impacts by month between the three major NYC airports.
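A hedged sketch of that style of pipeline via PyMongo; the database, collection, and field names are assumptions rather than the webinar's actual dataset.

```python
# Average arrival delay per carrier for non-cancelled flights, worst first.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
flights = client["travel"]["flights"]

pipeline = [
    {"$match": {"cancelled": False}},
    {"$group": {"_id": "$carrier",
                "avgDelay": {"$avg": "$arrDelay"},
                "flights": {"$sum": 1}}},
    {"$sort": {"avgDelay": -1}},
    {"$project": {"carrier": "$_id", "avgDelay": 1, "flights": 1, "_id": 0}},
]

for doc in flights.aggregate(pipeline):
    print(doc)
```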
Creating a Modern Data Architecture for Digital Transformation (MongoDB)
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Back to Basics Webinar 3: Introduction to Replica Sets (MongoDB)
This document provides an introduction to MongoDB replica sets, which allow for data redundancy and high availability. It discusses how replica sets work, including the replica set life cycle and how applications should handle writes and queries when using a replica set. Specifically, it explains that the MongoDB driver is responsible for server discovery and monitoring, retry logic, and handling topology changes in a replica set to provide a consistent view of the data to applications.
The document discusses MongoDB’s Aggregation Framework, which allows users to perform ad-hoc queries and reshape data in MongoDB. It describes the key components of the aggregation pipeline, including the $match, $project, $group, and $sort operators. It provides examples of how to filter, reshape, and summarize document data using the aggregation framework. The document also covers the usage and limitations of aggregation, as well as how it can be used to enable more flexible data analysis and reporting compared to MapReduce.
Webinar: 10-Step Guide to Creating a Single View of your Business (MongoDB)
Organizations have long seen the value in aggregating data from multiple systems into a single, holistic, real-time representation of a business entity. That entity is often a customer. But the benefits of a single view in enhancing business visibility and operational intelligence can apply equally to other business contexts. Think products, supply chains, industrial machinery, cities, financial asset classes, and many more.
However, for many organizations, delivering a single view to the business has been elusive, impeded by a combination of technology and governance limitations.
MongoDB has been used in many single view projects across enterprises of all sizes and industries. In this session, we will share the best practices we have observed and institutionalized over the years. By attending the webinar, you will learn:
- A repeatable, 10-step methodology to successfully delivering a single view
- The required technology capabilities and tools to accelerate project delivery
- Case studies from customers who have built transformational single view applications on MongoDB.
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first-ever exabyte-scale system that can hold a trillion large files. Describes MapR's Distributed NameNode™ architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks like dfsio, pig-mix, nnbench, terasort, and YCSB.
Back to Basics Webinar 1: Introduction to NoSQL (MongoDB)
This document provides an overview of an introduction to NoSQL webinar. It discusses why NoSQL databases were created, the different types of NoSQL databases including key-value stores, column stores, graph stores, multi-model databases and document stores. It provides details on MongoDB, describing how MongoDB stores data as JSON-like documents with dynamic schemas and supports features like indexing, aggregation and geospatial queries. The webinar agenda is also outlined.
Webinar: Working with Graph Data in MongoDB (MongoDB)
With the release of MongoDB 3.4, the number of applications that can take advantage of MongoDB has expanded. In this session we will look at using MongoDB for representing graphs and how graph relationships can be modeled in MongoDB.
We will also look at a new aggregation operation that we recently implemented for graph traversal and computing transitive closure. We will include an overview of the new operator and provide examples of how you can exploit this new feature in your MongoDB applications.
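A hedged sketch of graph traversal with the $graphLookup stage (MongoDB 3.4+) through PyMongo; the collection and field names are illustrative assumptions, not taken from the webinar.

```python
# Follow the reportsTo field transitively to build each person's chain of managers.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
employees = client["hr"]["employees"]

pipeline = [
    {"$graphLookup": {
        "from": "employees",
        "startWith": "$reportsTo",
        "connectFromField": "reportsTo",
        "connectToField": "name",
        "as": "managementChain",
    }}
]

for doc in employees.aggregate(pipeline):
    print(doc["name"], "->", [m["name"] for m in doc["managementChain"]])
```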
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg... (MongoDB)
The United States will be deploying 16,000 traffic speed monitoring sensors - 1 on every mile of US interstate in urban centers. These sensors update the speed, weather, and pavement conditions once per minute. MongoDB will collect and aggregate live sensor data feeds from roadways around the country, support real-time queries from cars on traffic conditions on their route as well as be the platform for real-time dashboards displaying traffic conditions and more complex analytical queries used to identify traffic trends. In this session, we’ll implement a few different data aggregation techniques to query and dashboard the metrics gathered from the US interstate.
Back to Basics: My First MongoDB Application (MongoDB)
- The document is a slide deck for a webinar on building a basic blogging application using MongoDB.
- It covers MongoDB concepts like documents, collections and indexes. It then demonstrates how to install MongoDB, connect to it using the mongo shell, and insert documents.
- The slide deck proceeds to model a basic blogging application using MongoDB, creating collections for users, articles and comments. It shows how to query, update, and import large amounts of seeded data.
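A hedged sketch of those basics using PyMongo instead of the mongo shell; the database, collection, and field names are illustrative assumptions.

```python
# Insert, index, and query documents for a minimal blogging data model.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

# Documents have flexible, JSON-like schemas.
db.users.insert_one({"_id": "esther", "name": "Esther Example"})
db.articles.insert_one({
    "author": "esther",
    "title": "Hello MongoDB",
    "body": "First post!",
    "tags": ["intro", "mongodb"],
    "posted": datetime.now(timezone.utc),
})

# An index supports the common query pattern: newest articles by author.
db.articles.create_index([("author", 1), ("posted", DESCENDING)])

for article in db.articles.find({"author": "esther"}).sort("posted", DESCENDING):
    print(article["title"])
```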
The document discusses several scenarios for using Hadoop and MapR to address large-scale data and file management challenges. It describes how MapR improves upon Hadoop through innovations like containers, volumes, and transactional snapshots. It also provides examples of how MapR could be used to solve problems involving billions of files, global data distribution, real-time data processing, model deployment, video repositories, and backups.
Cisco Connect Toronto 2015: Big Data - Sean McKeown (Cisco Canada)
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Brad Anderson from MapR Technologies presented on technologies for interactive analysis (Apache Drill) and stream processing (Storm) beyond traditional batch processing with Hadoop/MapReduce. Drill allows interactive queries over large datasets through its columnar storage and distributed query engine. Storm is a framework for real-time computation over streaming data through topologies of processing components. M7 provides a more reliable and higher performance alternative to HBase through its unified storage and simplified architecture with no external daemons.
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications (Jason Shao)
Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and at MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
This document provides an overview of Apache Hadoop, a framework for storing and processing large datasets in a distributed computing environment. It discusses what big data is and the challenges of working with large datasets. Hadoop addresses these challenges through its two main components: the HDFS distributed file system, which stores data across commodity servers, and MapReduce, a programming model for processing large datasets in parallel. The document outlines the architecture and benefits of Hadoop for scalable, fault-tolerant distributed computing on big data.
Offline processing with Hadoop allows for scalable, simplified batch processing of large datasets across distributed systems. It enables increased innovation by supporting complex analytics over large data sets without strict schemas. Hadoop adoption is moving beyond legacy roles to focus on data processing and value creation through scalable and customizable systems like Cascading.
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
This document provides an introduction to Hadoop and big data. It discusses the new kinds of large, diverse data being generated and the need for platforms like Hadoop to process and analyze this data. It describes the core components of Hadoop, including HDFS for distributed storage and MapReduce for distributed processing. It also discusses some of the common applications of Hadoop and other projects in the Hadoop ecosystem like Hive, Pig, and HBase that build on the core Hadoop framework.
Have you ever heard the buzzword "big data"? Briefly described, big data is about collecting massive amounts of data, extracting both the small details and the larger trends available in it, summarizing the output, and generating important insight about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started to shop for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems on a daily basis? There are a few options, but very few that can match the popularity of Hadoop. Hadoop can store and process large amounts of data. It has a large and diverse toolset for integrations, operations and processing, and it is open source!
1) Big data is growing exponentially and new frameworks like Hadoop are needed to analyze large, unstructured datasets.
2) Hadoop uses distributed computing and storage across commodity servers to provide scalable and cost-effective analytics. It leverages local disks on each node for temporary data to improve performance.
3) Virtualizing Hadoop simplifies operations, enables mixed workloads, and provides high availability through features like vMotion and HA. It also allows for elastic scaling of compute and storage resources.
This document discusses distributed data processing using MapReduce and Hadoop in a cloud computing environment. It describes the need for scalable, economical, and reliable distributed systems to process petabytes of data across thousands of nodes. It introduces Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers using MapReduce. Key aspects of Hadoop discussed include its core components HDFS for distributed file storage and MapReduce for distributed computation.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
The document summarizes Quantcast File System (QFS), an alternative to HDFS that provides petabyte storage at half the disk space of HDFS. QFS offers significantly faster I/O than HDFS through the use of Reed-Solomon encoding, requiring only 1.5x disk space compared to HDFS's 3x. It has been production hardened at Quantcast under massive processing loads and is fully compatible with Apache Hadoop. Benchmark results show QFS writes are half the disk I/O of HDFS writes and reads require accessing the network versus HDFS's focus on data locality.
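For context on where the 1.5x figure can come from, the quick check below computes the storage overhead of replication versus Reed-Solomon striping; the 6-data + 3-parity parameters are an assumption commonly cited for QFS, not taken from this document.

```python
# Storage overhead: replication stores N full copies; Reed-Solomon stores
# data stripes plus parity stripes, so overhead = (data + parity) / data.
def rs_overhead(data_stripes: int, parity_stripes: int) -> float:
    return (data_stripes + parity_stripes) / data_stripes

print("3-way replication:", 3.0, "x")
print("RS(6 data + 3 parity):", rs_overhead(6, 3), "x")   # -> 1.5x
```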
This document provides an overview of big data and Hadoop. It discusses what big data is, why it has become important recently, and common use cases. It then describes how Hadoop addresses challenges of processing large datasets by distributing data and computation across clusters. The core Hadoop components of HDFS for storage and MapReduce for processing are explained. Example MapReduce jobs like wordcount are shown. Finally, higher-level tools like Hive and Pig that provide SQL-like interfaces are introduced.
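As a hedged illustration of the wordcount example mentioned above, here is a Hadoop Streaming version in Python (the original materials may use Java; the file name and invocation are assumptions).

```python
# wc.py: run as a Hadoop Streaming job, e.g.
#   hadoop jar hadoop-streaming.jar \
#     -mapper "python wc.py map" -reducer "python wc.py reduce" \
#     -input /data/text -output /data/wordcounts
import sys

def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for each word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```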
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application prospective like accessing HDFS content, writing streaming jobs, using .NET SDK, as well as HDInsight on premise or on Azure.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
Similar to Seattle Scalability Meetup - Ted Dunning - MapR (20)
Abstract: From billionaires to moms, some people care deeply about preserving their family photos. This lightning talk, based on Brad Fitzpatrick's 20 percent time project at Google, covers the project's aim, capabilities, and technologies (GoLang, Android), and how they marry up with NYC's focus on blockchain technologies.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
The document discusses connecting local food producers to consumers through a peer-to-peer traceability system. It notes the demand for local, traceable produce from consumers and the liability issues producers face without traceability. The proposed solution is an open-source app that allows producers and consumers to search for and transact locally sourced food through a hub-and-spoke model, building traceability through ordered transactions.
This document proposes a personal databank application to help users manage their private memories and digital assets throughout different life stages. It notes that people currently lack control over their personal data, which is often collected and used by large tech companies. The proposed solution is a personal cloud application that feels like a trusted friend or attorney, allows users to organize their memories and put their affairs in order, and promises never to sell personal data or compromise privacy. It presents an MVP, roadmap, and business model focused on freemium pricing to attract users before offering premium services.
Seattle Scalability meetup intro slides, Jan 22, 2014clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on January 22, 2014. The meetup will feature main sessions on Koverse, a data unification platform, and Samza, a new distributed stream processing framework developed at LinkedIn. It will also include community announcements and the potential for an after-beer social. Attendees are encouraged to use the hashtag #SeaScale.
The meetup agenda included two main sessions on scaling SQL databases from scratch using Hadoop and scaling DevOps for websites operating at large scale. There would also be community announcements, an after-beer social at a local pub, and the event was using the hashtag #SeaScale on social media. The meetup was hosted in Seattle on December 4, 2013.
Seattle scalability meetup intro slides 23 oct 2013clive boulton
The document announces a Seattle Scalability and Distributed Systems Meetup on October 23, 2013 that will feature two main sessions - one on Cloudera Search which uses SolrCloud and Hadoop to allow non-technical users to explore and analyze Hadoop data, and another on experiences with Amazon RDS (MySQL) including benefits and challenges. It also lists the after-beer location and provides information on submitting talks and a future GraphLab Notebook meetup.
Seattle scalability meetup intro slides 24 july 2013clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on July 24, 2013. It will include main sessions from Simply Measured and SpaceCurve about migrating to HBase and building a new big data platform, respectively. After the sessions there will be community announcements and an after-beer at the Frontier Room. Suggestions are also sought for future technical talks about production-grade big systems.
Seattle Scalability Meetup intro pptx - June 26clive boulton
This document provides an agenda for the Seattle Scalability and Distributed Systems Meetup on June 26, 2013. The main sessions will discuss best practices for developing scalable applications on Google Cloud Platform and how to build many-core scalable software systems using Concurix. There will also be community announcements, an after-beer social at Rock Bottom, and the hashtag #SeaScale for the event.
Seattle scalability meetup intro ppt May 22clive boulton
The document summarizes an upcoming Seattle Scalability and Distributed Systems Meetup on May 22, 2013. The meetup will include main sessions from Atigeo on their big data platform xPatterns and from GraphLab on their highly scalable machine learning algorithms. There will also be community announcements and an after-beer event at a nearby restaurant. Suggestions for future technical talks on production-grade big systems are welcomed.
This document discusses challenges around patents, intellectual property, and patent trolls as it relates to VRM (Vendor Relationship Management). It notes that large tech companies obtain many patents through their legal teams, and that patents can bolster startup valuations but also allow failed companies to assert patent claims. The document outlines technical, legal, social, usability, business model, and other challenges around VRM and sharing user data and controls. It suggests building on cloud platforms or a personal cloud and obtaining patents for defense. It also discusses challenges around large company vs. startup legal battles and dealing with patent trolls.
Seattle scalability meetup March 27,2013 intro slidesclive boulton
The document summarizes an upcoming meetup about scalability and distributed systems in Seattle. The meetup will include main sessions on Hortonworks and HBase application development by Nick Dimiduk and Saffron's brain-like analytics by Paul Hofmann. There will also be community announcements, an after-beer at a nearby restaurant, and the hashtag #seascale for the event.
The document summarizes an upcoming meetup on scalability and distributed systems in Seattle. It includes two main sessions, one by Braintree on achieving high availability in their Ruby on Rails application, and one by Cloudant on scaling geospatial queries across a distributed database. It also lists the location for an after-meetup social at a local bar and contact information for proposing future talk ideas.
Seattle Scalability Meetup | Accumulo and WhitePagesclive boulton
The meetup agenda included community announcements, two main sessions about Accumulo and WhitePages, and an after-beer social at Rock Bottom Brewery. Paul Brown from Koverse would discuss Accumulo, an Apache project, covering its design, unique features, and comparisons. Scott Sikora from WhitePages would discuss how they developed a daily SOLR index of every business in the US using Hadoop, Pig, and a custom job control system. The meetup encouraged technical talks about large, production-grade systems.
This document summarizes an upcoming Seattle Scalability Meetup event. It will feature talks on Cloudant (CouchDB), SEOMoz's Linkscape (now Mozscape) tool, and Microsoft's investments in Apache Hadoop, SQL Server, and big data jobs. The event will be held at a Microsoft venue, with pizza sponsored by Hortonworks and drinks by eSage. It encourages submissions of in-depth technical talks on production-grade big systems. The post-talk social is at a nearby bar.
The document announces a Seattle monthly meetup about Hadoop, Scalability, and NoSQL topics. The meetup will include community announcements, a main session by Matt Schumpert from Datameer on unlocking the power of Hadoop using their product, and socializing over beer afterwards. Gary and Greg from Clipboard will also give a talk on why they use Riak to support search of data in time-based order for high performance without mapreduce.
The document discusses leveraging a polyglot Platform-as-a-Service approach to address legacy systems and enable a transition to cloud technologies. It proposes exposing legacy data through APIs for mobile and social applications while retaining the existing on-premise systems. The approach would use distributed databases like CouchDB hosted on Cloudant to provide scalability and offline replication capabilities. This would allow developing new applications independently of legacy systems while providing a path for incremental transformation over time.
Whole Chain Traceability, pulling a Kobayashi Maru. clive boulton
The Whole Chain Traceability Consortium needs to improve one-up / one-down information sharing. Pulling a Kobayashi Maru: connecting enterprises to consumers using polyglot technologies for the design and development of a Whole Chain Traceability application.
This document discusses improving one-up/one-down information sharing between companies in the supply chain. It proposes focusing on the user experience through a neutral data store with an open API and reference applications. This would allow connecting enterprises to consumers using polyglot technologies and elastic cloud infrastructure. The goal is to build trust and adoption by making it easy for others to use and develop on top of the system.
2. Agenda
• Lightning talks / community announcements
• Main Speaker
• Bier @ Feierabend - 422 Yale Ave North
• Hashtags #Seattle #Hadoop
3. Fast & Frugal: Running a Lean Startup
with AWS – Oct 27th 10am-2pm
http://aws.amazon.com/about-aws/events/
4. Seattle AWS User Group November
9th, 2011 – 6:30 -9pm
• November we're going to hear from Amy
Woodward from EngineYard about keeping
your systems live through outages and other
problems using EngineYard atop AWS. Come
check out this great talk and learn a thing or
three about EngineYard& keeping high
availability for your systems!
• http://www.nwcloud.org
5. www.mapr.com
• MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
9. For startups
• History is always small
• The future is huge
• Must adopt new technology to survive
• Compatibility is not as important
– In fact, incompatibility is assumed
10. Physics of large companies
[Chart: company size over time; after the startup phase, absolute growth is still very large]
11. For large businesses
• Present state is always large
• Relative growth is much smaller
• Absolute growth rate can be very large
• Must adopt new technology to survive
– Cautiously!
– But must integrate technology with legacy
• Compatibility is crucial
12. The startup technology picture
[Diagram: old computers and software, current computers and software, and expected hardware and software growth, with no compatibility requirement between them]
13. The large enterprise picture
[Diagram: current hardware and software, a proof-of-concept Hadoop cluster, and a long-term Hadoop cluster, all of which must work together]
14. What does this mean?
• Hadoop is very, very good at streaming through things in batch jobs
• HBase is good at persisting data in very write-heavy workloads
• Unfortunately, the foundation of both systems is HDFS, which does not export or import well
15. Narrow Foundations
Big data is heavy and expensive to move.
[Diagram: web services, Pig, Hive, OLAP, OLTP, sequential file processing, Map/Reduce, and HBase each sit on their own silo: RDBMS, NAS, or HDFS]
16. Narrow Foundations
• Because big data has inertia, it is difficult to move
– It costs time to move
– It costs reliability because of more moving parts
• The result is many duplicate copies
17. One Possible Answer
• Widen the foundation
• Use standard communication protocols
• Allow conventional processing to share with parallel processing
18. Broad Foundation
[Diagram: the same stack of web services, Pig, Hive, OLAP, OLTP, sequential file processing, Map/Reduce, and HBase over RDBMS, NAS, and HDFS, now sharing MapR as a common foundation]
19. Broad Foundation
• Having a broad foundation allows many kinds of computation to work together
• It is no longer necessary to throw data over a wall
• Performance much higher for map-reduce
• Enterprise grade feature sets such as snapshots and mirrors can be integrated
• Operations more familiar to admin staff
21. Map-reduce key details
• User supplies f1 (map) and f2 (reduce)
– Both are pure functions with no side effects
• Framework supplies input, shuffle, output
• Framework will re-run f1 and f2 on failure
• Redundant task completion is OK
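To make that division of labor concrete, here is a minimal Hadoop job skeleton in Java: the user supplies only f1 and f2 (the F1/F2 class names and their placeholder logic are ours, purely for illustration), while the framework supplies input splitting, the shuffle, output, and re-execution on failure.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SkeletonJob {

  // f1 (map): a pure function of its input record; no side effects outside ctx.write()
  public static class F1 extends Mapper<LongWritable, Text, Text, IntWritable> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(line.toString()), new IntWritable(1));   // placeholder logic
    }
  }

  // f2 (reduce): a pure function of a key and all values emitted for that key
  public static class F2 extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      int n = 0;
      for (IntWritable v : values) n += v.get();
      ctx.write(key, new IntWritable(n));                         // placeholder logic
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "skeleton");
    job.setJarByClass(SkeletonJob.class);
    // The user names f1 and f2; everything else (splits, shuffle, output,
    // re-running failed tasks, tolerating redundant completions) is the framework's job.
    job.setMapperClass(F1.class);
    job.setReducerClass(F2.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because redundant task completion is OK, the framework can also run speculative copies of slow tasks and keep whichever finishes first; that only works because f1 and f2 have no side effects.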
23. Map-Reduce
[Diagram: input → f1 (map) tasks → local disk → shuffle → f2 (reduce) tasks → output]
24. Example – WordCount
• Mapper
– read line, tokenize into words
– emit (word, 1)
• Reducer
– read (word, [k1, … , kn])
– Emit (word, Σki)
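A minimal Java sketch of that WordCount pair, using the standard Hadoop mapreduce API (the class names are ours; the classes would be wired into a driver like the skeleton shown after slide 21):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: read a line, tokenize into words, emit (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      word.set(tok.nextToken());
      ctx.write(word, ONE);
    }
  }
}

// Reducer: read (word, [k1, ..., kn]), emit (word, sum of ki)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(word, new IntWritable(sum));
  }
}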
25. Example – Map Tiles
• Input is set of objects
– Roads (polyline)
– Towns (polygon)
– Lakes (polygon)
• Output is set of map-tiles
– Graphic image of part of map
26. Bottlenecks and Issues
• Read-only files
• Many copies in I/O path
• Shuffle based on HTTP
– Can’t use new technologies
– Eats file descriptors
• Spills go to local file space
– Bad for skewed distribution of sizes
27. MapR Areas of Development
[Diagram: HBase, Map/Reduce, and the surrounding ecosystem layered over storage, management, and services]
28. MapR Improvements
• Faster file system
– Fewer copies
– Multiple NICS
– No file descriptor or page-buf competition
• Faster map-reduce
– Uses distributed file system
– Direct RPC to receiver
– Very wide merges
29. MapR Innovations
• Volumes
– Distributed management
– Data placement
• Read/write random access file system
– Allows distributed meta-data
– Improved scaling
– Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
30. MapR's Containers
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks.
Each container contains directories & files and data blocks, and is replicated on servers.
Containers are 16-32 GB segments of disk, placed on nodes; there is no need to manage them directly.
31. MapR's Containers
Each container has a replication chain. Updates are transactional. Failures are handled by rearranging replication.
32. Container locations and replication
[Diagram: nodes N1, N2, and N3, each hosting containers with replica lists such as (N1, N2), (N3, N2), and (N1, N3)]
The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order.
33. MapR Scaling
• Containers represent 16-32 GB of data; each can hold up to 1 billion files and directories
• 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes of DRAM to cache a container; 25 GB to cache all containers for a 2 EB cluster
– But not necessary, can page to disk; a typical large 10 PB cluster needs 2 GB
• Container reports are 100x - 1000x smaller than HDFS block reports; serve 100x more data nodes
• Increase container size to 64 GB to serve a 4 EB cluster; map/reduce is not affected
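A quick back-of-the-envelope check of those numbers (a sketch only; the ~20 GB average container size is our assumption within the stated 16-32 GB range, and the other constants come from the slide):

public class ContainerScaling {
  public static void main(String[] args) {
    double containerGB = 20;                  // assume ~20 GB average within the 16-32 GB range
    double clusterEB = 2;                     // "a very large cluster"
    double clusterGB = clusterEB * 1e9;       // 1 EB = 10^9 GB

    double containers = clusterGB / containerGB;             // ~1e8, i.e. ~100M containers
    double dramBytesPerContainer = 250;                      // DRAM needed to cache one container
    double dramGB = containers * dramBytesPerContainer / 1e9;

    System.out.printf("containers: %.0f million%n", containers / 1e6);   // ~100 million
    System.out.printf("DRAM to cache them all: %.0f GB%n", dramGB);      // ~25 GB
  }
}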
34. MapR's Streaming Performance
[Bar charts: streaming throughput in MB per second for read and write, comparing raw hardware, MapR, and Hadoop on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks; higher is better. Tests: i. 16 streams x 120 GB, ii. 2000 streams x 1 GB.]
35. Terasort on MapR
10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
[Bar charts: elapsed time in minutes for 1.0 TB and 3.5 TB Terasort runs, MapR vs. Hadoop; lower is better.]
36. HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24 GB DRAM, 11 x 1 TB 7200 RPM
[Bar chart: records per second for Zipfian and uniform key distributions, MapR vs. Apache; higher is better.]
37. Small Files (Apache Hadoop, 10 nodes)
[Chart: file-create rate (files/sec) vs. number of files in millions, out of the box and tuned. Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20G DRAM; DN uses 2G DRAM.]
38. MUCH faster for some operations
Same 10 nodes …
[Chart: file-create rate vs. number of files (millions).]
39. What MapR is not
• Volumes != federation
– MapR supports > 10,000 volumes, all with independent placement and defaults
– Volumes support snapshots and mirroring
• NFS != FUSE
– Checksum and compress at gateway
– IP fail-over
– Read/write/update semantics at full speed
• MapR != maprfs
40. Not Your Father's NFS
• Multiple architectures possible
• Export to the world
– NFS gateway runs on selected gateway hosts
• Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
• Export to self
– NFS gateway runs on all data nodes, mounted from localhost
41. Export to the world
[Diagram: several NFS servers inside the cluster exporting the filesystem to an external NFS client]
42. Local server
[Diagram: the application and NFS client run on the same host as a local NFS server, which talks to the cluster nodes]
46. Sharded text indexing
[Diagram: input documents → Map (assign documents to shards) → Reducer (index text to local disk, then copy the index to the distributed file store) → clustered index storage; a copy back to the search engine's local disk is typically required before the index can be loaded]
47. Sharded text indexing
• Mapper assigns document to shard
– Shard is usually hash of document id
• Reducer indexes all documents for a shard
– Indexes created on local disk
– On success, copy index to DFS
– On failure, delete local files
• Must avoid directory collisions
– can’t use shard id!
• Must manage and reclaim local disk space
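A rough Java sketch of the shard-assignment mapper described above; the shard count, key type, and tab-separated document format are illustrative assumptions, not details from the talk.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assign each document to a shard by hashing its id; one reduce group per shard then builds that shard's index.
public class ShardAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int NUM_SHARDS = 16;   // illustrative value

  @Override
  protected void map(LongWritable offset, Text doc, Context ctx)
      throws IOException, InterruptedException {
    // Assume the document id is the first tab-separated field of the record.
    String id = doc.toString().split("\t", 2)[0];
    int shard = (id.hashCode() & Integer.MAX_VALUE) % NUM_SHARDS;
    ctx.write(new IntWritable(shard), doc);
  }
}

On the reduce side, the index is built on local disk and copied into the distributed file store only on success; to avoid the directory collisions mentioned above, the target directory has to be unique per task attempt (for example, derived from the attempt id) rather than named after the shard id.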
48. Conventional data flow
[Diagram: input documents → Map → Reducer → local disk → clustered index storage → local disk → search engine. Failure of a reducer causes garbage to accumulate in the local disk; failure of the search engine requires another download of the index from clustered storage.]
49. Simplified NFS data flows
[Diagram: input documents → Map → Reducer → clustered index storage, which the search engine reads directly. Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.]
50. Simplified NFS data flows
[Diagram: input documents → Map → Reducer → clustered storage mirrored out to several search engines. Mirroring allows exact placement of index data; arbitrary levels of replication are also possible.]
52. K-means
• Classic E-M based algorithm
• Given cluster centroids,
– Assign each data point to nearest centroid
– Accumulate new centroids
– Rinse, lather, repeat
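A minimal single-machine sketch of one such E-M iteration in Java, just to show the assign/accumulate structure (squared Euclidean distance and dense double[] points are our assumptions):

public class KMeansStep {
  // One iteration: assign each point to its nearest centroid, then accumulate new centroids.
  static double[][] step(double[][] points, double[][] centroids) {
    int k = centroids.length, d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];

    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {              // find the nearest centroid
        double dist = 0;
        for (int j = 0; j < d; j++) {
          double diff = p[j] - centroids[c][j];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      counts[best]++;                             // accumulate the new centroid
      for (int j = 0; j < d; j++) sums[best][j] += p[j];
    }

    double[][] next = new double[k][d];
    for (int c = 0; c < k; c++) {
      for (int j = 0; j < d; j++) {
        next[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
      }
    }
    return next;   // rinse, lather, repeat until the centroids stop moving
  }
}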
53. K-means, the movie
[Diagram: input → assign to nearest centroid → aggregate new centroids → updated centroids feed back into the assignment step]
57. Old tricks, new dogs
• Mapper (centroids read from local disk via the distributed cache, which copies them from HDFS to local disk)
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS (written by map-reduce)
58. Old tricks, new dogs
• Mapper (centroids read from NFS)
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS / MapR FS (written by map-reduce)
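The only real change from the previous slide is where the centroids come from: with the cluster filesystem mounted over NFS, side-data can be read with ordinary file I/O instead of being staged through the distributed cache. A hedged sketch; the mount point follows the /mapr/my.cluster/... convention used later in the talk, and the whitespace-separated file format is an assumption.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CentroidLoader {
  // Read centroids with plain java.nio from the NFS-mounted cluster filesystem,
  // e.g. in a mapper's setup() method, instead of using the distributed cache.
  static List<double[]> load(String path) throws IOException {
    List<double[]> centroids = new ArrayList<>();
    for (String line : Files.readAllLines(Paths.get(path))) {
      if (line.trim().isEmpty()) continue;
      String[] fields = line.trim().split("\\s+");
      double[] c = new double[fields.length];
      for (int i = 0; i < fields.length; i++) c[i] = Double.parseDouble(fields[i]);
      centroids.add(c);
    }
    return centroids;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical path on an NFS-mounted MapR volume.
    System.out.println(load("/mapr/my.cluster/home/ted/centroids.txt").size() + " centroids");
  }
}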
59. Poor man's Pregel
• Mapper
    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
• Lines in bold can use conventional I/O via NFS
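A rough Java rendering of that loop, under heavy assumptions: the model lives as a plain file on the NFS-mounted cluster filesystem, the accumulation step is a trivial stand-in, and the synchronize step (which a real job would implement with some barrier among the mappers) is only a placeholder.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class PregelStyleLoop {
  public static void main(String[] args) throws IOException, InterruptedException {
    Path model = Paths.get("/mapr/my.cluster/home/ted/model.txt");  // hypothetical path
    Path input = Paths.get("/mapr/my.cluster/home/ted/input.txt");  // hypothetical path

    for (int iter = 0; iter < 10; iter++) {          // "while not done", bounded here
      // read and accumulate input models (the previous iteration's output, if any)
      int previousLines = Files.exists(model) ? Files.readAllLines(model).size() : 0;

      // for each input: accumulate model (stand-in: just count input records)
      int records = Files.readAllLines(input).size();

      // write model back to the shared, NFS-visible location
      Files.write(model, List.of("iteration=" + iter,
                                 "records=" + records,
                                 "previousModelLines=" + previousLines));

      // synchronize: a real job would wait for its peers here (barrier file, counter, ...)
      Thread.sleep(100);
      // reset input format and loop; the summary is emitted once after the loop
    }
  }
}

The point of the slide is that the I/O steps marked in bold on the original become ordinary reads and writes against an NFS mount, so no custom messaging layer is needed.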
60. Click modeling architecture
[Diagram: input → feature extraction, join, and down-sampling (map-reduce) → data → sequential SGD learning; side-data is now supplied via NFS]
61. Click modeling architecture
[Diagram: the same pipeline, but map-reduce cooperates with NFS to feed several sequential SGD learners in parallel with the extracted data and side-data]
67. Trivial visualization interface
• Map-reduce output is visible via NFS
$ R
> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
> plot(error ~ t, x)
> q(save="n")
• Legacy visualization just works
68. Conclusions
• We used to know all this
• Tab completion used to work
• 5 years of work-arounds have clouded our memories
• We just have to remember the future
Editor's Notes
Constant time implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
A large enterprise cannot do that. It has to have access to the old data and has to share between old data and Hadoop-accessible data. This doesn't have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
But stock Hadoop does not handle this well.
This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what imposes the fundamental limitation, or the lack of one in MapR's case.