MapReduce Improvements in MapR Hadoop
    Presentation Transcript

    • 1 ©MapR Technologies - Confidential
      MapReduce Improvements in the MapR Hadoop Distribution
      Adam Bordelon, Senior Software Engineer at MapR
      Big Data Madison meetup - 9/26/2013
    • 2 What's this all about?
      ● Background on Hadoop
      ● Big Data: Distributed Filesystems
      ● Big Compute:
        – MapReduce
        – Beyond MapReduce
      ● Q&A
    • 3 Hadoop History
    • 4 Big Data: Distributed FileSystems
      Volume, Variety, Velocity: you can't have big data without a scalable filesystem.
    • 5 HDFS Architecture
    • 6 HDFS Architectural Flaws
      ● Created for storing crawled web-page data
      ● Files cannot be modified once written/closed
        – Write-once; append-only
      ● Files cannot be read before they are closed
        – Must batch-load data
      ● NameNode stores (in memory):
        – Directory/file tree, file-to-block mapping
        – Block replica locations
      ● NameNode only scales to ~100 million files
        – Some users run jobs to concatenate small files
      ● Written in Java; slows down during garbage collection
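A back-of-the-envelope calculation shows why the in-memory namespace caps NameNode scale. The ~150 bytes per namespace object (file, directory, or block) is a commonly cited rule of thumb, not an exact figure, and the blocks-per-file ratio here is an assumption:

```python
# Rough NameNode heap estimate. 150 bytes per namespace object
# (file, directory, or block) is a commonly cited rule of thumb.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1.5):
    """Approximate heap needed to hold the whole namespace in memory."""
    objects = num_files * (1 + blocks_per_file)  # file entries + blocks
    return objects * BYTES_PER_OBJECT / 1e9

# At ~100 million files the namespace alone needs tens of GB of heap,
# which is where JVM garbage-collection pauses start to hurt.
print(namenode_heap_gb(100_000_000))  # 37.5
```

Under these assumptions, the slide's ~100-million-file ceiling corresponds to a heap large enough that GC pauses become a real operational problem.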
    • 7 Solution: MapR FileSystem
      ● Visionary CTO/Co-Founder: M.C. Srivas
        – Ran Google's search infrastructure team
        – Chief Storage Architect at Spinnaker Networks
      ● Take a step back: what kind of DFS do we need in a Hadoop/distributed computer?
        – Easy, scalable, reliable
      ● Want traditional apps to work with the DFS
        – Support random read/write
        – Standard FS interface (NFS)
      ● HDFS compatible
        – Drop-in replacement, no recompile
    • 8 Easy: Posix-compliant NFS
    • 9 Easy: MapR Volumes
      Groups related files/directories into a single tree structure so they can be easily organized, managed, and secured.
      ● Replication factor
      ● Scheduled snapshots, mirroring
      ● Data placement control
        – By device type, rack, or geographic location
      ● Quotas and usage tracking
      ● Administrative permissions
      100K+ volumes are okay
    • 10 Scalable: Containers
      Files/directories are sharded into blocks, which are placed into mini-NNs (containers) on disks. Containers are 16-32 GB disk segments, placed on nodes.
      ● Each container contains:
        – Directories & files
        – Data blocks
      ● Replicated on servers
      ● No need to manage containers directly; use MapR Volumes
    • 11 Scalable: Container Location DB
      [Diagram: CLDB mapping each container to its hosting nodes N1, N2, N3]
      The container location database (CLDB) keeps track of the nodes hosting each container and the replication chain order.
      ● Each container has a replication chain
      ● Updates are transactional
      ● Failures are handled by rearranging replication
      ● Clients cache container locations
    • 12 Scalability Statistics
      ● Containers represent 16-32 GB of data
        – Each can hold up to 1 billion files and directories
        – 100M containers = ~2 exabytes (a very large cluster)
      ● 250 bytes of DRAM to cache a container
        – 25 GB to cache all containers for a 2 EB cluster
        – But not necessary; can page to disk
        – A typical large 10 PB cluster needs 2 GB
      ● Container reports are 100x-1000x smaller than HDFS block reports
        – Serve 100x more data nodes
        – Increase container size to 64 GB to serve a 4 EB cluster
        – MapReduce performance not affected
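The slide's headline numbers check out arithmetically. A quick sanity check, assuming an average container size of 20 GB (within the stated 16-32 GB range):

```python
# Sanity-checking the slide's scalability arithmetic.
CONTAINER_BYTES = 20 * 10**9      # assumed ~20 GB average per container
CLDB_BYTES_PER_CONTAINER = 250    # DRAM to cache one container entry

containers = 100_000_000          # 100M containers
cluster_bytes = containers * CONTAINER_BYTES
cldb_cache_bytes = containers * CLDB_BYTES_PER_CONTAINER

print(cluster_bytes / 10**18)     # exabytes of data: 2.0
print(cldb_cache_bytes / 10**9)   # GB of CLDB cache: 25.0
```

So tracking locations for a 2 EB cluster costs the CLDB roughly 25 GB of cache, versus tens of GB of NameNode heap for a cluster orders of magnitude smaller.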
    • 13 Record-breaking Speed
      Benchmark                  | MapR 2.1.1    | CDH 4.1.1     | MapR Speed Increase
      Terasort (1x replication, compression disabled)
        Total                    | 13m 35s       | 26m 6s        | 2X
        Map                      | 7m 58s        | 21m 8s        | 3X
        Reduce                   | 13m 32s       | 23m 37s       | 1.8X
      DFSIO throughput/node
        Read                     | 1003 MB/s     | 656 MB/s      | 1.5X
        Write                    | 924 MB/s      | 654 MB/s      | 1.4X
      YCSB (50% read, 50% update)
        Throughput               | 36,584.4 op/s | 12,500.5 op/s | 2.9X
        Runtime                  | 3.80 hr       | 11.11 hr      | 2.9X
      YCSB (95% read, 5% update)
        Throughput               | 24,704.3 op/s | 10,776.4 op/s | 2.3X
        Runtime                  | 0.56 hr       | 1.29 hr       | 2.3X

      NEW WORLD RECORD: breaking the Terasort minute barrier
                                 | MapR w/Google | Apache Hadoop
        Time                     | 54s           | 62s
        Nodes                    | 1003          | 1460
        Disks                    | 1003          | 5840
        Cores                    | 4012          | 11680

      Benchmark hardware configuration: 10 servers, 12x2 cores (2.4 GHz), 12x2 TB, 48 GB RAM, 1x10GbE
    • 14 Reliable: CLDB High Availability
      ● As easy as installing the CLDB role on more nodes
        – Writes go to the CLDB master and are replicated to slaves
        – CLDB slaves can serve reads
      ● Container metadata is distributed, so the CLDB only stores/recovers container locations
        – Instant restart (<2 seconds), no single point of failure
      ● Shared-nothing architecture
      ● (Multinode NFS HA too)
    • 15 vs. Federated NN, NN HA
      ● Federated NameNodes
        – Statically partition namespaces (like Volumes)
        – Need an additional NN (plus a standby) for each namespace
        – Federated NN only in Hadoop 2.x (beta)
      ● NameNode HA
        – The NameNode is responsible for both filesystem-namespace (metadata) info and block locations, so there is more data to checkpoint/recover
        – Starting a standby NN from a cold state can take tens of minutes for metadata and an hour for block locations; a hot standby is needed
        – Metadata state:
          ● All namespace edits are logged to shared (NFS/NAS) R/W storage, which must also be HA; the standby polls the edit log for changes
          ● Or use the Quorum Journal Manager, a separate service on separate nodes
        – Block locations:
          ● Data nodes send block reports, location updates, and heartbeats to both NNs
    • 16 Reliable: Consistent Snapshots
      ● Automatic de-duplication
      ● Saves space by sharing blocks
      ● Lightning fast
      ● Zero performance loss on writes to the original
      ● Scheduled, or on demand
      ● Easy recovery with drag and drop
    • 17 Reliable: Mirroring
    • 18 MapR Filesystem Summary
      ● Easy
        – Direct Access NFS
        – MapR Volumes
      ● Fast
        – C++ vs. Java
        – Direct disk access, no layered filesystems
        – Lockless transactions
        – High-speed RPC
        – Native compression
      ● Scalable
        – Containers, distributed metadata
        – Container Location DB
      ● Reliable
        – CLDB High Availability
        – Snapshots
        – Mirroring
    • 19 Big Compute: MapReduce
    • 20 Fast: Direct Shuffle
      ● Apache Shuffle
        – Writes map outputs/spills to the local file system
        – Merges the partitions for a map output into one file, with an index into it
        – Reducers request partitions from the mappers' HTTP servlets
      ● MapR Direct Shuffle
        – Writes to a Local Volume in MapR FS (rebalancing)
        – One map-output file per reducer (no index file)
        – Sends the shuffleRootFid with the MapTaskCompletion event on heartbeat
        – Direct RPC from the reducer to the mapper using the Fid
        – The copy is just a filesystem copy; no HTTP overhead
        – More copy threads, wider merges
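The key layout difference can be sketched in a few lines. This is an illustration using local directories as stand-ins for node-local storage, not MapR code; the function names and on-disk layout are invented for the example:

```python
# Illustrative sketch of the two shuffle layouts (not MapR code).
import os
import shutil
import tempfile

def apache_style_map_output(out_dir, partitions):
    """Apache-style: one merged file plus an (offset, length) index per
    reducer; reducers must ask the mapper's HTTP servlet to slice it."""
    index, offset = {}, 0
    with open(os.path.join(out_dir, "file.out"), "wb") as f:
        for reducer_id, data in enumerate(partitions):
            f.write(data)
            index[reducer_id] = (offset, len(data))
            offset += len(data)
    return index

def mapr_style_map_output(out_dir, partitions):
    """MapR-style: one file per reducer in the shuffle volume; no index
    file is needed because each reducer has its own file."""
    for reducer_id, data in enumerate(partitions):
        with open(os.path.join(out_dir, f"reducer_{reducer_id}"), "wb") as f:
            f.write(data)

def mapr_style_fetch(out_dir, reducer_id, dest_dir):
    """A reducer's fetch is just a filesystem copy, no HTTP involved."""
    src = os.path.join(out_dir, f"reducer_{reducer_id}")
    dst = os.path.join(dest_dir, f"map0_r{reducer_id}")
    shutil.copyfile(src, dst)
    return dst

# Tiny demo on throwaway directories.
map_dir, reduce_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
partitions = [b"key1\tval", b"key2\tval"]
index = apache_style_map_output(map_dir, partitions)
mapr_style_map_output(map_dir, partitions)
fetched = mapr_style_fetch(map_dir, 1, reduce_dir)
```

Because the shuffle data lives in a distributed filesystem rather than on a single node's local disk, it also survives node failure and can be rebalanced like any other data.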
    • 21 Fast: Express Lane
      ● Long-running jobs shouldn't hog all the slots in the cluster and starve small, fast jobs (e.g. Hive queries)
      ● One or more small slots are reserved on each node for running small jobs
      ● Small jobs: <10 maps/reduces, small input, time limit
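The eligibility check amounts to a few threshold comparisons. A minimal sketch, where the specific thresholds are assumptions for illustration, not MapR's actual defaults:

```python
# Illustrative sketch: does a job qualify for the express lane?
# The thresholds are assumed values, not MapR's real defaults.
def qualifies_for_express_lane(num_maps, num_reduces, input_bytes,
                               max_tasks=10,
                               max_input_bytes=10 * 1024**3):
    """Small jobs get the reserved slots so long-running jobs can't
    starve them; a runtime limit (not modeled here) evicts jobs that
    turn out not to be small after all."""
    return (num_maps < max_tasks and
            num_reduces < max_tasks and
            input_bytes <= max_input_bytes)

print(qualifies_for_express_lane(4, 1, 256 * 1024**2))   # True
print(qualifies_for_express_lane(500, 20, 5 * 1024**4))  # False
```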
    • 22 Reliable: JobTracker HA
    • 23 Easy: Label-based Scheduling
      ● Assign labels to nodes, or to regex/glob expressions for nodes
        – perfnode1* → “production”
        – /.*ssd[0-9]*/ → “fast_ssd”
      ● Create label expressions for jobs/queues
        – Queue “fast_prod” → “production && fast_ssd”
      ● Tasks from these jobs/queues will only be assigned to nodes whose labels match the expression
      ● Combine with data placement policies for data and compute locality
      ● No static partitioning necessary
        – The labels file is refreshed frequently
        – New nodes automatically fall into the appropriate regex/glob labels
        – New jobs can specify a label expression, use the queue's, or both
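The matching logic described above can be sketched as follows. This is an illustration of the idea (glob/regex patterns mapping nodes to labels, and a boolean expression over labels), not the MapR implementation; it only handles flat `&&`/`||` expressions:

```python
# Illustrative sketch of label-based scheduling (not MapR's code).
import fnmatch
import re

LABEL_RULES = [                       # pattern -> label
    ("perfnode1*", "production"),     # glob pattern
    (r"/.*ssd[0-9]*/", "fast_ssd"),   # regex pattern, written /.../
]

def labels_for(node):
    """Collect every label whose pattern matches this node name."""
    labels = set()
    for pattern, label in LABEL_RULES:
        if pattern.startswith("/") and pattern.endswith("/"):
            if re.fullmatch(pattern[1:-1], node):
                labels.add(label)
        elif fnmatch.fnmatch(node, pattern):
            labels.add(label)
    return labels

def matches(expr, node):
    """Evaluate a flat 'a && b' or 'a || b' label expression."""
    labels = labels_for(node)
    if "&&" in expr:
        return all(t.strip() in labels for t in expr.split("&&"))
    if "||" in expr:
        return any(t.strip() in labels for t in expr.split("||"))
    return expr.strip() in labels

print(matches("production && fast_ssd", "perfnode1-ssd42"))  # True
print(matches("production && fast_ssd", "webnode7"))         # False
```

Because labels are derived from patterns rather than a static node list, a newly added `perfnode1-ssd99` would match both rules with no configuration change, which is the "no static partitioning" point above.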
    • 24 Other Improvements
      ● Parallel split computation in the JobClient
        – Might as well multi-thread it!
      ● Runaway job protection
        – One user's fork bomb shouldn't degrade others' performance
        – CPU/memory firewalls protect system processes
      ● Map-side join locality
        – Files in the same directory/container follow the same replication chain
        – The same key ranges are likely co-located on the same node
      ● Zero-config XML
        – XML parsing takes too much time
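Split computation parallelizes naturally because each input file can be processed independently. A minimal sketch of the idea (the split size and function names are assumptions for illustration, not MapR's code):

```python
# Illustrative sketch: computing input splits for many files in
# parallel instead of one file at a time in the JobClient.
from concurrent.futures import ThreadPoolExecutor

SPLIT_SIZE = 128 * 1024**2  # assumed 128 MB split size

def splits_for_file(path, size):
    """One (path, offset, length) split per SPLIT_SIZE chunk."""
    return [(path, off, min(SPLIT_SIZE, size - off))
            for off in range(0, size, SPLIT_SIZE)]

def compute_splits(files, threads=8):
    """files: list of (path, size) pairs. Stat-ing and chunking each
    input file is independent work, so a thread pool speeds it up
    when there are many inputs."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_file = pool.map(lambda f: splits_for_file(*f), files)
        return [s for splits in per_file for s in splits]
```

In a real JobClient the per-file work is dominated by filesystem metadata calls, which is exactly the kind of I/O-bound work a thread pool overlaps well.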
    • 25 MapR MapReduce Summary
      ● Fast
        – Direct Shuffle
        – Express Lane
        – Parallel Split Computation
        – Map-side Join Locality
        – Zero-config XML
      ● Reliable
        – JobTracker HA
        – Runaway Job Protection
      ● Easy
        – Label-based Scheduling
    • 26 Beyond MapReduce...
    • 27 M7: Enterprise-Grade HBase
      [Diagram: other distributions stack an HBase JVM and a DFS JVM on ext3 over disks; M7 runs as one unified layer directly on disks]
      Easy, Dependable, Fast:
      ● No RegionServers
      ● No compactions
      ● Consistent low latency
      ● Seamless splits
      ● Instant recovery from node failure
      ● Real-time in-memory configuration
      ● Automatic merges
      ● Snapshots
      ● Disk and network compression
      ● In-memory column families
      ● Mirroring
      ● Reduced I/O to disk
      Unified Data Platform, Increased Performance, Simplified Administration
    • 28 Apache Drill
      Interactive analysis of Big Data using standard SQL; based on Google Dremel.
      ● Workloads:
        – Interactive queries, reporting for data analysts (100 ms-20 min): Drill
        – Data mining, modeling, large ETL (20 min-20 hr): MapReduce, Hive, Pig
      ● Fast
        – Low-latency queries
        – Columnar execution
        – Complements native interfaces and MapReduce/Hive/Pig
      ● Open
        – Community-driven open source project
        – Under the Apache Software Foundation
      ● Modern
        – Standard ANSI SQL:2003 (select/into)
        – Nested/hierarchical data support
        – Schema is optional
        – Supports RDBMS, Hadoop, and NoSQL
    • 29 Apache YARN aka MR2
    • 30 Why MapR?
    • 31 Contact Us!
      I'm not in Sales, so go to to learn more:
      – Integrations with AWS, GCE, Ubuntu, Lucidworks
      – Partnerships, Customers
      – Support, Training, Pricing
      – Ecosystem Components
      We're hiring! University of Wisconsin-Madison Career Fair tomorrow
      Email me at:
    • 32 ©MapR Technologies - Confidential
      Questions?