Apache Hadoop 3.0
Community Update
Sydney, September 2017
Sanjay Radia, Vinod Kumar Vavilapalli
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
About.html
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
–Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Hadoop 3.0
 Lots of content in trunk that did not make it to the 2.x branch
 JDK upgrade – does not truly require bumping the major number
 Hadoop command scripts rewrite (incompatible)
 Big features that need a stabilizing major release – Erasure codes
 YARN: long running services
 Ephemeral ports (incompatible)
Driving reasons, and some features taking advantage of 3.0
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0
 HDFS: Erasure codes
 YARN:
– Long running services,
– Scheduler enhancements,
– Isolation & Docker
– UI
 Lots of Trunk content
 JDK8 and newer dependent libraries
- 3.0.0-alpha1 - Sep/3/2016
- Alpha2 - Jan/25/2017
- Alpha3 – May/16/2017
- Alpha4 – July/7/2017
- Beta/GA – Q4 2017 (Estimated)
Key takeaways, and release timeline
3.0
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Major changes you should know before upgrading to Hadoop 3.0
– JDK upgrade
– Dependency upgrade
– Changes to default ports for daemons/services
– Shell script rewrite
 Features
– Hadoop Common
• Client-Side Classpath Isolation
• Shell script rewrite
– HDFS/Storage
• Erasure Coding
• Multiple Standby NameNodes
• Intra-DataNode balancer
• Cloud Storage: Support for Azure Data Lake, S3
consistency & performance
– YARN
• Support for long running services
• Scheduling enhancements: App / Queue priorities, global scheduling, placement strategies
• New UI
• ATS v2
– MAPREDUCE
• Task-level native optimization
HADOOP-11264
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
– Oracle JDK 7 reached EoL in April 2015!!
 Moving forward to use new features of JDK8
– Lambda Expressions – starting to use this
– Stream API
– Security enhancements
– Performance enhancement for HashMaps, IO/NIO, etc.
 Hadoop’s evolution with JDK upgrades
– Hadoop 2.6.x - JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
– Hadoop 3.0.x - JDK 8 or later
Hadoop Operation - JDK Upgrade
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Previously, the default ports of multiple Hadoop services were in the Linux ephemeral
port range (32768-61000)
– Can conflict with other apps running on the same node
– Can cause problems during a rolling restart if another app takes the port
 New ports:
– NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
– Secondary NN ports: 50091 -> 9869, 50090 -> 9868
– DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
 KMS service port: 16000 -> 9600
Change of Default Ports for Hadoop Services
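As a rough illustration (not from the slides) of why the old defaults were risky: several of them fall inside the Linux ephemeral range quoted above, so a short-lived outbound connection could already be holding the port when a daemon restarts. The port numbers below are copied from this slide; the snippet is only a sketch.

    # Sketch: check the old/new defaults against the Linux ephemeral port range.
    EPHEMERAL = range(32768, 61001)                          # 32768-61000, per the slide
    OLD_TO_NEW = {
        50470: 9871, 50070: 9870, 8020: 9820,                # NameNode
        50091: 9869, 50090: 9868,                            # Secondary NN
        50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864,  # DataNode
        16000: 9600,                                         # KMS
    }
    for old, new in sorted(OLD_TO_NEW.items()):
        status = "ephemeral!" if old in EPHEMERAL else "ok"
        print(f"{old} -> {new}   (old default: {status})")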
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Classpath isolation (HADOOP-11656)
 Hadoop leaks lots of dependencies
onto the application’s classpath
○ Known offenders: Guava, Protobuf,
Jackson, Jetty, …
○ Potential conflicts with your app
dependencies (No shading)
 No separate HDFS client jar means
server jars are leaked
● NN, DN libraries pulled even though
not needed
 HDFS-6200: Split HDFS client into
separate JAR
 HADOOP-11804: Shaded hadoop-client dependency
 YARN-6466: Shade the task
umbilical for a clean YARN
container environment (ongoing)
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS
Support for Three NameNodes for HA
Intra-DataNode balancer
Cloud storage improvements (see afternoon talk)
Erasure coding
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Current (2.x) HDFS Replication Strategy
 Three replicas by default
– 1st replica on local node, local rack or random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerate 2 failures
 Good data locality, local shortcut
 Multiple copies => Parallel IO for parallel compute
 Very fast block recovery and node recovery
– Parallel recovery – the bigger the cluster, the faster
– 10TB node recovery: 30 sec to a few hours
 3x storage overhead vs 1.4–1.6x for erasure coding
– Remember that Hadoop's JBOD is very cheap
• 1/10 – 1/20 the cost of SANs
• 1/10 – 1/5 the cost of NFS
[Figure: replicas r1, r2, r3 of a block placed on DataNodes across Rack I and Rack II]
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding
 k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
 Reliability: tolerate m failures
 Save disk space
 Save I/O bandwidth on the write path
 1.5x storage overhead
 Tolerate any 3 failures
[Figure: 6 data blocks (b1–b6) + 3 parity blocks (P1–P3)]

                          3-replication    (6,3) Reed-Solomon
Maximum fault tolerance   2                3
Disk usage (N bytes)      3N               1.5N

(A worked overhead example follows.)
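A quick back-of-the-envelope sketch (not part of the deck) of where the 3N vs 1.5N figures come from; the helper names are made up for illustration:

    def replication_overhead(replicas=3):
        # every byte of user data is stored 'replicas' times
        return replicas

    def ec_overhead(k=6, m=3):
        # k data blocks carry the user data, plus m parity blocks per group
        return (k + m) / k

    print(replication_overhead())   # 3   -> 3N bytes on disk for N bytes of data
    print(ec_overhead(6, 3))        # 1.5 -> 1.5N bytes on disk
    print(ec_overhead(10, 4))       # 1.4 -> matches the 1.4-1.6x range quoted earlier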
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Block Reconstruction
 Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
[Figure: data blocks b1–b6 and parity blocks P1–P3 spread across separate racks; rebuilding a lost block pulls surviving blocks across racks. A simplified reconstruction sketch follows.]
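To make the reconstruction pattern concrete, here is a deliberately simplified sketch using a single XOR parity block. HDFS actually uses Reed-Solomon (so it tolerates m failures, not one), but the shape is the same: read the surviving blocks over the network and recompute the missing one.

    # Simplified illustration only: XOR parity, not the Reed-Solomon code HDFS uses.
    data_blocks = [bytes([i] * 4) for i in range(1, 7)]      # b1..b6 as tiny 4-byte "blocks"
    parity = bytes(b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6
                   for b1, b2, b3, b4, b5, b6 in zip(*data_blocks))

    lost = 2                                                  # pretend b3's DataNode died
    survivors = [b for i, b in enumerate(data_blocks) if i != lost]

    # Rebuild by reading the surviving blocks plus parity (this is the cross-rack traffic).
    rebuilt = bytes(p ^ s1 ^ s2 ^ s3 ^ s4 ^ s5
                    for p, s1, s2, s3, s4, s5 in zip(parity, *survivors))
    assert rebuilt == data_blocks[lost]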
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding on Contiguous/Striped Blocks
Two Approaches
 EC on contiguous blocks
– Pros: Better for locality
– Cons: small files cannot be handled
 EC on striped blocks
– Pros: Leverage multiple disks in parallel
– Pros: Works for small files
– Cons: No data locality for readers
[Figure: striped layout – file cells C1…C12 striped round-robin across 6 data blocks (b1–b6) plus 3 parity blocks (P1–P3) per stripe; contiguous layout – whole files f1–f3 stored in contiguous data blocks that share a set of parity blocks. A small striping sketch follows.]
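A tiny sketch (assuming 1 MB cells and 6 data blocks per group, as on the following slides) of how striping places cells round-robin, which is why even a small file is spread over several DataNodes and why readers lose locality:

    CELL = 1 * 1024 * 1024        # 1 MB stripe cell (assumed, per the later slide)
    DATA_BLOCKS = 6               # RS(6,3) data blocks in one block group

    def cells_per_block(file_size_bytes):
        """Round-robin placement: cell i lands on data block i % 6."""
        counts = [0] * DATA_BLOCKS
        n_cells = -(-file_size_bytes // CELL)      # ceiling division
        for i in range(n_cells):
            counts[i % DATA_BLOCKS] += 1
        return counts

    print(cells_per_block(4 * 1024 * 1024))    # 4 MB file   -> [1, 1, 1, 1, 0, 0]: on 4 DataNodes
    print(cells_per_block(200 * 1024 * 1024))  # 200 MB file -> 33-34 cells on each data block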
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding Zone
 Create a zone on an empty directory
– Shell command:
hdfs erasurecode -createZone [-s <schemaName>] <path>
 All the files under a zone directory are automatically erasure coded
– Renames across zones with different EC schemas are disallowed
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Write Pipeline for Replicated Files
 Write pipeline to datanodes
 Durability
– Use 3 replicas to tolerate maximum 2 failures
 Visibility
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any replica and failover to any other replica to read the same data
 Appendable
– Files can be reopened for append
[Figure: Writer -> DN1 -> DN2 -> DN3 replication pipeline; data flows forward, acks flow back. DN = DataNode]
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel Write for EC Files
 Parallel write
– Client writes to a group of 9 datanodes at the same time
– Parity is calculated at the client side, at write time
 Durability
– (6, 3)-Reed-Solomon can tolerate maximum 3 failures
 Visibility (Same as replicated files)
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, client can failover to any other
remaining replica to read the same data.
 Appendable (Same as replicated files)
– Files can be reopened for append
[Figure: Writer streams data to DN1–DN6 and parity to DN7–DN9 in parallel; each DataNode acks back. Stripe size: 1 MB]
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC: Write Failure Handling
 Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate 3 failures.
– Requires at least 6 datanodes.
– Missing blocks will be reconstructed later.
[Figure: same parallel write pipeline, with the failed DataNodes skipped. A minimal sketch of this rule follows.]
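A minimal sketch of the rule above, assuming the RS(6,3) layout from the earlier slides (the function name is made up):

    K_DATA, M_PARITY = 6, 3

    def can_continue_writing(alive_targets):
        # The writer needs at least the 6 data streams to survive out of the 9 targets;
        # up to 3 failed targets are tolerated and their blocks are rebuilt later.
        return alive_targets >= K_DATA

    print(can_continue_writing(9))   # True  - no failures
    print(can_continue_writing(6))   # True  - 3 targets lost, still writable
    print(can_continue_writing(5))   # False - more than m failures, the write fails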
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Replication: Slow Writers & Replace Datanode on Failure
 Write pipeline for replicated files
– Datanode can be replaced in case of failure.
 Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
 EC files
– Do not support replace-datanode-on-failure.
– Slow writer improved
[Figure: replicated write pipeline Writer -> DN1 -> DN2 -> DN3, with a failed DataNode replaced by DN4]
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Reading with Parity Blocks
 Parallel read
– Read from 6 Datanodes with data blocks
– Support both stateful read and pread
 Block reconstruction
– Read parity blocks to reconstruct
missing blocks
[Figure: Reader fetches Block1–Block6 from DN1–DN6 in parallel; the missing Block3 is reconstructed from Parity1 read from DN7]
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC implications
 File data is striped across multiple nodes and racks
 Reads and writes are remote and cross-rack
 Reconstruction is network-intensive: it reads k surviving blocks cross-rack
– Need fast network
• Requires high network bandwidth between client and server
• A dead DataNode implies high network traffic and long reconstruction time
 Important to use optimized ISA-L for performance
– 1+ GB/s encode/decode speed, much faster than Java implementation
– CPU is no longer a bottleneck
 Need to combine data into larger files to avoid an explosion in replica count
– Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
– Good: 1x10GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
Works best for archival / cold data use cases (worked example below)
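A rough worked version of the bad/good comparison above. It assumes a 1 GB HDFS block size for the replicated baseline (that assumption is what makes the 4.6x / 0.46x figures line up); the helper names are illustrative only.

    def blocks_replicated(file_gb, block_gb=1, replicas=3):
        # block replicas the NameNode tracks under plain 3x replication
        return -(-file_gb // block_gb) * replicas          # ceiling division

    def blocks_ec(file_gb, k=10, m=4, block_gb=1):
        # one striped block group per k blocks of data; each group has k + m blocks
        return -(-file_gb // (k * block_gb)) * (k + m)

    print(blocks_ec(1),  "vs", blocks_replicated(1))       # 14 vs 3   (~4.6x more block objects)
    print(blocks_ec(10), "vs", blocks_replicated(10))      # 14 vs 30  (~0.46x)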
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance – write performance is faster with the right EC library
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance – TPC with no DN killed
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance - TPC with 2 DN killed
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure coding status
 Massive development effort by the Hadoop community
○ 20+ contributors from many companies (Hortonworks, Y! JP, Cloudera, Intel, Huawei, …)
○ 100s of commits over three years (started in 2014)
 Erasure coding is feature complete!
 Solidifying some user APIs in preparation for beta1
 Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Stress / endurance testing to ensure stability
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0 – YARN Enhancements
 YARN Scheduling Enhancements
 Support for Long Running Services
 Re-architecture for YARN Timeline Service - ATS v2
 Better elasticity and resource utilization
 Better resource isolation and Docker!!
 Better User Experiences
 Other Enhancements
3.0
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduling Enhancements
 Application priorities within a queue: YARN-1963
– In Queue A, App1 > App 2
 Inter-Queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
 Affinity / anti-affinity: YARN-1042
– More constraints on locations
• Affinity to rack (where you have your sibling)
• Anti-affinity (e.g. HBase region servers)
 Global Scheduling: YARN-5139
– Get rid of scheduling triggered on node heartbeats
– Replaced with a global scheduler that has parallel threads (conceptual sketch below)
• Globally optimal placement – expect evolution of the scheduler
• Critical for long running services – they stick to the allocation, so it had better be a good one
• Enhanced container scheduling throughput (6x)
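A conceptual sketch of the difference, not YARN's actual code (node names and sizes are invented): heartbeat-driven scheduling can only try the node that just reported in, while a global scheduler looks at every node and picks the best fit.

    nodes = {"n1": 4, "n2": 16, "n3": 8}     # free memory (GB) per node, hypothetical

    def heartbeat_schedule(request_gb, heartbeating_node):
        # Old model: placement is only attempted on the node whose heartbeat arrived.
        return heartbeating_node if nodes[heartbeating_node] >= request_gb else None

    def global_schedule(request_gb):
        # Global model: consider all nodes in parallel and pick the best candidate.
        fits = [n for n, free in nodes.items() if free >= request_gb]
        return max(fits, key=lambda n: nodes[n], default=None)

    print(heartbeat_schedule(6, "n1"))   # None - n1 happened to heartbeat but cannot fit 6 GB
    print(global_schedule(6))            # 'n2' - globally better placement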
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduling Enhancements (Contd.)
 CapacityScheduler improvements
– Queue Management Improvements
• More Dynamic Queue reconfiguration
• REST API support for queue management
– Absolute resource configuration support
– Priority Support in Application and Queue
– Preemption improvements
• Inter-Queue preemption support
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Key Drivers for Long Running Services
 Consolidation of Infrastructure
– Hadoop clusters have a lot of compute and storage resources (some unused)
• Can’t I use Hadoop’s resources for non-Hadoop load?
• OpenStack is hard to manage/operate, can I use YARN?
• VMs are expensive, can I use YARN?
• But does it support Docker? – yes, we heard you
 Hadoop-related data services that run outside a Hadoop cluster
– Why can't I run them in the Hadoop cluster?
 Run Hadoop services (Hive, HBase) on YARN
– Run Multiple instances
– Benefit from YARN’s Elasticity and resource management
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Built-in support for long running Service in YARN
 A native YARN framework. YARN-4692
• Abstract common framework (similar to Slider) to support long running services
• Simplified API (to manage service lifecycle)
• Better support for long running services
 Recognition of long running services
• Affects policies for preemption, container reservation, etc.
• Auto-restart of containers
• Containers for long running services are restarted on the same node when they have local state
 Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
 Dynamic container configuration
• Ask for just enough resources, and adjust them at runtime (memory is harder)
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Discovery services in YARN
 Services can run on any YARN node; how do you get its IP?
– It can also move due to node failure
 YARN Service Discovery via DNS: YARN-4757
– Expose existing service information in YARN registry via DNS
• YARN service registry’s records will be converted into DNS entries
– Discovery of container IP and service port via standard DNS lookups.
• Application
– zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
• Container
– Container 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
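Because discovery is exposed as plain DNS, no YARN client library is needed to resolve a service. A small sketch using Python's standard library; the hostname is the example from the slide and will only resolve inside a cluster whose registry DNS zone is actually set up that way:

    import socket

    # Example name from the slide; substitute your cluster's registry DNS zone.
    service_name = "zkapp1.user1.yarncluster.com"

    try:
        # Ordinary A-record lookup via the system resolver.
        addrs = {info[4][0] for info in socket.getaddrinfo(service_name, None)}
        print(service_name, "->", ", ".join(sorted(addrs)))
    except socket.gaierror as err:
        print("lookup failed (expected outside such a cluster):", err)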
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A More Powerful YARN
 Elastic Resource Model
– Dynamic Resource Configuration (YARN-291)
• Allow tune down/up on NM’s resource in runtime
– E.g. Helps when Hadoop cluster nodes are shared with other workloads
– E.g. Hadoop-on-Hadoop allows flexible resource allocation
– Graceful decommissioning of NodeManagers (YARN-914)
• Drains a node that’s being decommissioned to allow running containers to finish
• E.g. Removing a node for maintenance, Spot pricing on cloud, …
 Efficient Resource Utilization
– Support for container resizing (YARN-1197)
• Allows applications to change the size of an existing container
• E.g. long running services
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
More Powerful YARN (Contd.)
 Resource Isolation
– Resource isolation support for disk and network
• YARN-2619 (disk), YARN-2140 (network)
• Containers get a fair share of disk and network resources using Cgroups
– Docker support in LinuxContainerExecutor (YARN-3611)
• Support to launch Docker containers alongside process containers
• Packaging and resource isolation
– Packaging made easier, e.g. TensorFlow
• Complements YARN’s support for long running services
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Docker on YARN & YARN on YARN - YCloud
[Figure: a YARN cluster running Hadoop apps (MR, Tez, Spark), TensorFlow, and a nested YARN cluster that itself runs MR, Tez, and Spark]
Can use YARN to test Hadoop!!
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
YARN New UI (YARN-3368)
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Other YARN work planned in Hadoop 3.X
 Resource profiles (YARN-3926)
– Users can specify resource profile name instead of individual resources
– Resource types read via a config file
 YARN federation (YARN-2915)
– Allows YARN to scale out to tens of thousands of nodes
– Cluster of clusters which appear as a single cluster to an end user
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compatibility & Testing
3.0
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compatibility
 Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
 Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
 Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Testing and validation
 Extended alpha → beta → GA plan designed for stabilization
 EC already has some usage in production (700 nodes at Y! JP)
– Hortonworks has worked closely with this very large customer
 Hortonworks is integrating and testing HDP 3
– Integrating with all components of HDP stack
– HDP 2 integration tests, plus new ones
 Cloudera is also testing Hadoop 3 as part of their stack
 Plans for extensive HDFS EC testing by Hortonworks and Cloudera
 Happy synergy between 2.8.x and 3.0.x lines
– Shares much of the same code, fixes flow into both
– Yahoo! Deployments based on 2.8.0
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Summary : What’s new in Apache Hadoop 3.0?
Storage Optimization
HDFS: Erasure codes
Improved Utilization
YARN: Long Running Services
YARN: Scheduling Enhancements
Additional Workloads
YARN: Docker & Isolation
Easier to Use
New User Interface
Refactor Base
Lots of Trunk content
JDK8 and newer dependent libraries
3.0
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!
Reminder: BoFs on Thursday


Editor's Notes

  • #17 It enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.
  • #32 Previously based on unconsumed capacity: if a queue at 70% capacity has lots of unconsumed capacity, it is scheduled first. Now you can say that the 30% queue is higher priority.
  • #34 Original YARN design was not just for batch jobs – we started with that, but the design was general.
  • #37 Graceful degradation - remove nodes gracefully - for cloud especially if you are using spot pricing
  • #40 App centric (top two left pictures); Node centric; Resource centric – load vs capacity, overall and by queues; Cluster centric – nodes summary, heatmap of resource usage across nodes.
  • #47 Data Trends From Characteristics of the Data to Data Consumption & Interaction According to IBM, every day we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years.  Insight from Data is a key competitive differentiator Open Source is evolving and adapting with these trends the fastest Adopting Hadoop is not a destination but a journey