MapReduce Improvements in MapR Hadoop

1©MapR Technologies - Confidential
MapReduce Improvements in the
MapR Hadoop Distribution
Adam Bordelon, Senior Software Engineer at MapR
Big Data Madison meetup - 9/26/2013

2
What's this all about?
●
Background on Hadoop
●
Big Data: Distributed Filesystems
●
Big Compute:
– MapReduce
– Beyond MapReduce
●
Q&A
2

3
Hadoop History
http://s.wsj.net/public/resources/images/MI-BX925_GOOGLE_G_20130818173254.jpg

4
Big Data: Distributed FileSystems
Volume, Variety, Velocity:
Can't have big data without a scalable filesystem
http://www.lbisoftware.com/blog/wp-content/uploads/2013/06/data_mountain1.jpg

5
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

6
HDFS Architectural Flaws
● Created for storing crawled web-page data
● Files cannot be modified once written/closed.
– Write-once; append-only
● Files cannot be read before they are closed.
– Must batch-load data
● NameNode stores (in memory)
– Directory/file tree, file->block mapping
– Block replica locations
● NameNode only scales to ~100 Million files
– Some users run jobs to concatenate small files
● Written in Java, slows during GC.

7
Solution: MapR FileSystem
● Visionary CTO/Co-Founder: M.C. Srivas
– Ran Google search infrastructure team
– Chief Storage Architect at Spinnaker Networks
● Take a step back: What kind of DFS do we need in
Hadoop/Distributed-Computer?
– Easy, Scalable, Reliable
● Want traditional apps to work with DFS
– Support random Read/Write
– Standard FS interface (NFS)
● HDFS compatible
– Drop-in replacement, no recompile

9
Easy: MapR Volumes
Groups related files/directories
into a single tree structure so
they can be easily organized,
managed, and secured.
●
Replication factor
●
Scheduled snapshots, mirroring
●
Data placement control
– By device-type, rack, or
geographic location
●
Quotas and usage tracking
●
Administrative permissions
100K+ Volumes are okay

10

Each container contains

Directories & files

Data blocks

Replicated on servers

No need to manage directly

Use MapR Volumes
Scalable: Containers
Files/directories are sharded into
blocks, which are placed into mini-NNs
(containers) on disks
Containers are
16-32 GB disk
segments,
placed on
nodes

11
CLDB
Scalable: Container Location DB
N1, N2
N3, N2
N1, N2
N1, N3
N3, N2
N1
N2
N3
Container location
database (CLDB) keeps
track of nodes hosting
each container and
replication chain order

Each container has a replication chain

Updates are transactional

Failures are handled by rearranging replication

Clients cache container locations

12
Scalability Statistics
Containers represent 16 - 32GB of data

Each can hold up to 1 Billion files and directories

100M containers = ~ 2 Exabytes (a very large cluster)
250 bytes DRAM to cache a container

25GB to cache all containers for 2EB cluster
− But not necessary, can page to disk

Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x < HDFS
block-reports

Serve 100x more data-nodes

Increase container size to 64G to serve 4EB cluster

MapReduce performance not affected

13
Record-breaking Speed
Benchmark MapR
2.1.1
CDH 4.1.1 MapR
Speed
Increase
Terasort (1x replication, compression disabled)
Total 13m 35s 26m 6s 2X
Map 7m 58s 21m 8s 3X
Reduce 13m 32s 23m 37s 1.8X
DFSIO throughput/node
Read 1003 MB/s 656 MB/s 1.5X
Write 924 MB/s 654 MB/s 1.4X
YCSB (50% read, 50% update)
Throughput 36,584.4
op/s
12,500.5
op/s
2.9X
Runtime 3.80 hr 11.11 hr 2.9X
YCSB (95% read, 5% update)
Throughput 24,704.3
op/s
10,776.4
op/s
2.3X
Runtime 0.56 hr 1.29 hr 2.3X
MapR
w/Google
Apache
Hadoop
Time 54s 62s
Nodes 1003 1460
Disks 1003 5840
Cores 4012 11680
NEW WORLD RECORD
BREAK TERASORT MINUTE BARRIER
Benchmark hardware configuration:
10 servers, 12x2 cores (2.4 GHz), 12x2TB, 48 GB, 1x10GbE

14
Reliable: CLDB High Availability
● As easy as installing CLDB role on more nodes
– Writes go to CLDB master, replicated to slaves
– CLDB slaves can serve reads
● Distributed container metadata, so CLDB only
stores/recovers container locations
– Instant restart (<2 seconds), no single POF
● Shared nothing architecture
● (NFS Multinode HA too)

15
vs. Federated NN, NN HA
● Federated NameNodes
– Statically partition namespaces (like Volumes)
– Need additional NN (plus a standby) for each namespace
– Federated NN only in Hadoop-2.x (beta)
● NameNode HA
– NameNode responsible for both fs-namespace (metadata) info and block
locations; more data to checkpoint/recover.
– Starting standby NN from cold state can take tens-of-minutes for metadata,
an hour for block-locations. Need a hot standby.
– Metadata state
● All name space edits logged to shared (NFS/NAS) R/W storage, which must
also be HA; Standby polls edit log for changes.
● Or use Quorum Journal Manager, separate service/nodes
– Block locations
● Data nodes send block reports, location updates, heartbeats to both NNs

16
Reliable: Consistent Snapshots
● Automatic
de-duplication
● Saves space by
sharing blocks
● Lightning fast
● Zero performance loss
on writing to original
● Scheduled,
or on-demand
● Easy recovery with
drag and drop

18
MapR Filesystem Summary
● Easy
– Direct Access NFS
– MapR Volumes
● Fast
– C++ vs. Java
– Direct disk access,
no layered filesystems
– Lockless transactions
– High-speed RPC
– Native compression
● Scalable
– Containers,
distributed metadata
– Container Location DB
● Reliable
– CLDB High Availability
– Snapshots
– Mirroring

19
Big Compute: MapReduce
http://developer.yahoo.com/hadoop/tutorial/module4.html

20
Fast: Direct Shuffle
● Apache Shuffle
– Write map-outputs/spills to local file system
– Merge partitions for a map output into one file, index into it
– Reducers request partitions from Mappers' Http servlets
● MapR Direct Shuffle
– Write to Local Volume in MapR FS (rebalancing)
– Map-output file per reducer (no index file)
– Send shuffleRootFid with MapTaskCompletion on heartbeat
– Direct RPC from Reducer to Mapper using Fid
– Copy is just a file-system copy; no Http overhead
– More copy threads, wider merges

21
Fast: Express Lane
● Long-running jobs shouldn't hog all the slots in the
cluster and starve small, fast jobs (e.g. Hive queries)
● One or more small slots reserved on each node for
running small jobs
● Small jobs: <10 maps/reds, small input, time limit

23
Easy: Label-based Scheduling
● Assign labels to nodes or regex/glob expressions for nodes
– perfnode1* → “production”
– /.*ssd[0-9]*/ → “fast_ssd”
● Create label expressions for jobs/queues
– Queue “fast_prod” → “production && fast_ss”
● Tasks from these jobs/queues will only be assigned to nodes whose
labels match the expression.
● Combine with Data Placement policies for data and compute locality
● No static partitioning necessary
– Frequent labels file refresh
– New nodes automatically fall into appropriate regex/glob labels
– New jobs can specify label expression or use queue's or both
● http://www.mapr.com/doc/display/MapR/Placing+Jobs+on+Specified+Nodes

24
Other Improvements
● Parallel Split Computations in JobClient
– Might as well multi-thread it!
● Runaway Job Protection
– One user's fork-bomb shouldn't degrade others' performance
– CPU/memory firewalls protect system processes
● Map-side join locality
– Files in same directory/container follow same replication chain
– Same key ranges likely to be co-located on same node.
● Zero-config XML
– XML parsing takes too much time

25
MapR MapReduce Summary
● Fast
– Direct Shuffle
– Express Lane
– Parallel Split Computation
– Map-side Join Locality
– Zero-config XML
● Reliable
– JobTracker HA
– Runaway Job Protection
● Easy
– Label-based Scheduling

26
Beyond MapReduce...
http://www.nasa.gov/sites/default/files/potw1335a_0.jpg

27
M7: Enterprise-Grade HBase
Disks
ext3
JVM
DFS
JVM
HBase
Other
Distributions
Disks
Unified
Easy Dependable Fast
No RegionServers No compactions Consistent low latency
Seamless splits Instant recovery
from node failure
Real-time in-memory
configuration
Automatic merges Snapshots Disk and network
compression
In-memory column families Mirroring Reduced I/O to disk
 Unified Data Platform
 Increased Performance
 Simplified Administration

28
Apache Drill
Interactive analysis of Big Data using standard SQL
Based on Google Dremel
Interactive queries
Data analyst
Reporting
100 ms-20 min
Data mining
Modeling
Large ETL
20 min-20 hr
MapReduce
Hive
Pig
Fas
t
• Low latency queries
• Columnar execution
• Complement native interfaces
and MapReduce/Hive/Pig
Op
en
• Community driven open source project
• Under Apache Software Foundation
Mo
der
n
• Standard ANSI SQL:2003 (select/into)
• Nested/hierarchical data support
• Schema is optional
• Supports RDBMS, Hadoop and NoSQL

29
Apache YARN aka MR2
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

30
Why MapR?
http://www.mapr.com/products/why-mapr

31
Contact Us!
I'm not in Sales, so go to mapr.com to learn more:
– Integrations with AWS, GCE, Ubuntu, Lucidworks
– Partnerships, Customers
– Support, Training, Pricing
– Ecosystem Components
We're hiring!
University of Wisconsin-Madison Career Fair tomorrow
Email me at: abordelon@maprtech.com
31

MapReduce Improvements in MapR Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to MapReduce Improvements in MapR Hadoop

Similar to MapReduce Improvements in MapR Hadoop (20)

Recently uploaded

Recently uploaded (20)

MapReduce Improvements in MapR Hadoop