Apache Hadoop 3.0
Community Update
Sydney, September 2017
Sanjay Radia, Vinod Kumar Vavilapalli
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
About.html
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
–Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why Hadoop 3.0
 Lots of content in trunk that did not make it to the 2.x branch
 JDK upgrade – does not truly require bumping the major number
 Hadoop command scripts rewrite (incompatible)
 Big features that need a stabilizing major release – Erasure codes
 YARN: long running services
 Ephemeral ports (incompatible)
Driving reasons, and some features taking advantage of 3.0
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0
 HDFS: Erasure codes
 YARN:
– Long running services,
– Scheduler enhancements,
– Isolation & Docker
– UI
 Lots of Trunk content
 JDK8 and newer dependent libraries
- 3.0.0-alpha1 - Sep/3/2016
- Alpha2 - Jan/25/2017
- Alpha3 – May/16/2017
- Alpha4 – July/7/2017
- Beta/GA – Q4 2017 (Estimated)
Key takeaways, and release timeline
3.0
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Major changes you should know before upgrading to Hadoop 3.0
– JDK upgrade
– Dependency upgrade
– Changes to default ports for daemons/services
– Shell script rewrite
 Features
– Hadoop Common
• Client-Side Classpath Isolation
• Shell script rewrite
– HDFS/Storage
• Erasure Coding
• Multiple Standby NameNodes
• Intra-DataNode balancer
• Cloud Storage: Support for Azure Data Lake, S3
consistency & performance
– YARN
• Support for long running services
• Scheduling enhancements: App / Queue priorities, global scheduling, placement strategies
• New UI
• ATS v2
– MAPREDUCE
• Task-level native optimization
HADOOP-11264
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
– Oracle JDK 7 reached EoL in April 2015!!
 Moving forward to use new features of JDK8
– Lambda Expressions – starting to use this
– Stream API
– Security enhancements
– Performance enhancement for HashMaps, IO/NIO, etc.
 Hadoop’s evolution with JDK upgrades
– Hadoop 2.6.x - JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
– Hadoop 3.0.x - JDK 8 or later
Hadoop Operation - JDK Upgrade
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Previously, the default ports of multiple Hadoop services were in the Linux ephemeral
port range (32768-61000)
– Can conflict with other apps running on the same node
– Can cause problems during a rolling restart if another app takes the port
 New ports:
– NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
– Secondary NN ports: 50091 -> 9869, 50090 -> 9868
– DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
 KMS service port: 16000 -> 9600
Change of Default Ports for Hadoop Services
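As a rough illustration (not from the slides) of why the old defaults were risky: several of them fall inside the Linux ephemeral range quoted above, so a short-lived outbound connection could already be holding the port when a daemon restarts. The port numbers below are copied from this slide; the snippet is only a sketch.

    # Sketch: check the old/new defaults against the Linux ephemeral port range.
    EPHEMERAL = range(32768, 61001)                          # 32768-61000, per the slide
    OLD_TO_NEW = {
        50470: 9871, 50070: 9870, 8020: 9820,                # NameNode
        50091: 9869, 50090: 9868,                            # Secondary NN
        50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864,  # DataNode
        16000: 9600,                                         # KMS
    }
    for old, new in sorted(OLD_TO_NEW.items()):
        status = "ephemeral!" if old in EPHEMERAL else "ok"
        print(f"{old} -> {new}   (old default: {status})")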
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Classpath isolation (HADOOP-11656)
 Hadoop leaks lots of dependencies
onto the application’s classpath
○ Known offenders: Guava, Protobuf,
Jackson, Jetty, …
○ Potential conflicts with your app
dependencies (No shading)
 No separate HDFS client jar means
server jars are leaked
● NN, DN libraries pulled even though
not needed
 HDFS-6200: Split HDFS client into
separate JAR
 HADOOP-11804: Shaded hadoop-client dependency
 YARN-6466: Shade the task
umbilical for a clean YARN
container environment (ongoing)
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS
Support for Three NameNodes for HA
Intra-DataNode balancer
Cloud storage improvements (see afternoon talk)
Erasure coding
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Current (2.x) HDFS Replication Strategy
 Three replicas by default
– 1st replica on local node, local rack or random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerate 2 failures
 Good data locality, local shortcut
 Multiple copies => Parallel IO for parallel compute
 Very fast block recovery and node recovery
– Parallel recovery – the bigger the cluster, the faster
– 10TB node recovery: 30 sec to a few hours
 3x storage overhead vs 1.4–1.6x for erasure coding
– Remember that Hadoop's JBOD is very cheap
• 1/10 – 1/20 the cost of SANs
• 1/10 – 1/5 the cost of NFS
[Figure: replicas r1, r2, r3 of a block placed on DataNodes across Rack I and Rack II]
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding
 k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
 Reliability: tolerate m failures
 Save disk space
 Save I/O bandwidth on the write path
 1.5x storage overhead
 Tolerate any 3 failures
[Figure: 6 data blocks (b1–b6) + 3 parity blocks (P1–P3)]

                          3-replication    (6,3) Reed-Solomon
Maximum fault tolerance   2                3
Disk usage (N bytes)      3N               1.5N

(A worked overhead example follows.)
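A quick back-of-the-envelope sketch (not part of the deck) of where the 3N vs 1.5N figures come from; the helper names are made up for illustration:

    def replication_overhead(replicas=3):
        # every byte of user data is stored 'replicas' times
        return replicas

    def ec_overhead(k=6, m=3):
        # k data blocks carry the user data, plus m parity blocks per group
        return (k + m) / k

    print(replication_overhead())   # 3   -> 3N bytes on disk for N bytes of data
    print(ec_overhead(6, 3))        # 1.5 -> 1.5N bytes on disk
    print(ec_overhead(10, 4))       # 1.4 -> matches the 1.4-1.6x range quoted earlier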
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Block Reconstruction
 Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
[Figure: data blocks b1–b6 and parity blocks P1–P3 spread across separate racks; rebuilding a lost block pulls surviving blocks across racks. A simplified reconstruction sketch follows.]
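To make the reconstruction pattern concrete, here is a deliberately simplified sketch using a single XOR parity block. HDFS actually uses Reed-Solomon (so it tolerates m failures, not one), but the shape is the same: read the surviving blocks over the network and recompute the missing one.

    # Simplified illustration only: XOR parity, not the Reed-Solomon code HDFS uses.
    data_blocks = [bytes([i] * 4) for i in range(1, 7)]      # b1..b6 as tiny 4-byte "blocks"
    parity = bytes(b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6
                   for b1, b2, b3, b4, b5, b6 in zip(*data_blocks))

    lost = 2                                                  # pretend b3's DataNode died
    survivors = [b for i, b in enumerate(data_blocks) if i != lost]

    # Rebuild by reading the surviving blocks plus parity (this is the cross-rack traffic).
    rebuilt = bytes(p ^ s1 ^ s2 ^ s3 ^ s4 ^ s5
                    for p, s1, s2, s3, s4, s5 in zip(parity, *survivors))
    assert rebuilt == data_blocks[lost]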
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding on Contiguous/Striped Blocks
Two Approaches
 EC on contiguous blocks
– Pros: Better for locality
– Cons: small files cannot be handled
 EC on striped blocks
– Pros: Leverage multiple disks in parallel
– Pros: Works for small files
– Cons: No data locality for readers
[Figure: striped layout – file cells C1…C12 striped round-robin across 6 data blocks (b1–b6) plus 3 parity blocks (P1–P3) per stripe; contiguous layout – whole files f1–f3 stored in contiguous data blocks that share a set of parity blocks. A small striping sketch follows.]
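A tiny sketch (assuming 1 MB cells and 6 data blocks per group, as on the following slides) of how striping places cells round-robin, which is why even a small file is spread over several DataNodes and why readers lose locality:

    CELL = 1 * 1024 * 1024        # 1 MB stripe cell (assumed, per the later slide)
    DATA_BLOCKS = 6               # RS(6,3) data blocks in one block group

    def cells_per_block(file_size_bytes):
        """Round-robin placement: cell i lands on data block i % 6."""
        counts = [0] * DATA_BLOCKS
        n_cells = -(-file_size_bytes // CELL)      # ceiling division
        for i in range(n_cells):
            counts[i % DATA_BLOCKS] += 1
        return counts

    print(cells_per_block(4 * 1024 * 1024))    # 4 MB file   -> [1, 1, 1, 1, 0, 0]: on 4 DataNodes
    print(cells_per_block(200 * 1024 * 1024))  # 200 MB file -> 33-34 cells on each data block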
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure Coding Zone
 Create a zone on an empty directory
– Shell command:
hdfs erasurecode -createZone [-s <schemaName>] <path>
 All the files under a zone directory are automatically erasure coded
– Renames across zones with different EC schemas are disallowed
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Write Pipeline for Replicated Files
 Write pipeline to datanodes
 Durability
– Use 3 replicas to tolerate maximum 2 failures
 Visibility
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any replica and failover to any other replica to read the same data
 Appendable
– Files can be reopened for append
[Figure: Writer -> DN1 -> DN2 -> DN3 replication pipeline; data flows forward, acks flow back. DN = DataNode]
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel Write for EC Files
 Parallel write
– Client writes to a group of 9 datanodes at the same time
– Parity is calculated at the client side, at write time
 Durability
– (6, 3)-Reed-Solomon can tolerate maximum 3 failures
 Visibility (Same as replicated files)
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, client can failover to any other
remaining replica to read the same data.
 Appendable (Same as replicated files)
– Files can be reopened for append
[Figure: Writer streams data to DN1–DN6 and parity to DN7–DN9 in parallel; each DataNode acks back. Stripe size: 1 MB]
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC: Write Failure Handling
 Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate 3 failures.
– Requires at least 6 datanodes.
– Missing blocks will be reconstructed later.
[Figure: same parallel write pipeline, with the failed DataNodes skipped. A minimal sketch of this rule follows.]
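A minimal sketch of the rule above, assuming the RS(6,3) layout from the earlier slides (the function name is made up):

    K_DATA, M_PARITY = 6, 3

    def can_continue_writing(alive_targets):
        # The writer needs at least the 6 data streams to survive out of the 9 targets;
        # up to 3 failed targets are tolerated and their blocks are rebuilt later.
        return alive_targets >= K_DATA

    print(can_continue_writing(9))   # True  - no failures
    print(can_continue_writing(6))   # True  - 3 targets lost, still writable
    print(can_continue_writing(5))   # False - more than m failures, the write fails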
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Replication: Slow Writers & Replace Datanode on Failure
 Write pipeline for replicated files
– Datanode can be replaced in case of failure.
 Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
 EC files
– Do not support replace-datanode-on-failure.
– Slow writer improved
[Figure: replicated write pipeline Writer -> DN1 -> DN2 -> DN3, with a failed DataNode replaced by DN4]
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Reading with Parity Blocks
 Parallel read
– Read from 6 Datanodes with data blocks
– Support both stateful read and pread
 Block reconstruction
– Read parity blocks to reconstruct
missing blocks
[Figure: Reader fetches Block1–Block6 from DN1–DN6 in parallel; the missing Block3 is reconstructed from Parity1 read from DN7]
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC implications
 File data is striped across multiple nodes and racks
 Reads and writes are remote and cross-rack
 Reconstruction is network-intensive: it reads k surviving blocks cross-rack
– Need fast network
• Requires high network bandwidth between client and server
• A dead DataNode implies high network traffic and long reconstruction time
 Important to use optimized ISA-L for performance
– 1+ GB/s encode/decode speed, much faster than Java implementation
– CPU is no longer a bottleneck
 Need to combine data into larger files to avoid an explosion in replica count
– Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
– Good: 1x10GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
Works best for archival / cold data use cases (worked example below)
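A rough worked version of the bad/good comparison above. It assumes a 1 GB HDFS block size for the replicated baseline (that assumption is what makes the 4.6x / 0.46x figures line up); the helper names are illustrative only.

    def blocks_replicated(file_gb, block_gb=1, replicas=3):
        # block replicas the NameNode tracks under plain 3x replication
        return -(-file_gb // block_gb) * replicas          # ceiling division

    def blocks_ec(file_gb, k=10, m=4, block_gb=1):
        # one striped block group per k blocks of data; each group has k + m blocks
        return -(-file_gb // (k * block_gb)) * (k + m)

    print(blocks_ec(1),  "vs", blocks_replicated(1))       # 14 vs 3   (~4.6x more block objects)
    print(blocks_ec(10), "vs", blocks_replicated(10))      # 14 vs 30  (~0.46x)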
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance – write performance is faster with the right EC library
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance – TPC with no DN killed
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
EC performance - TPC with 2 DN killed
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Erasure coding status
 Massive development effort by the Hadoop community
○ 20+ contributors from many companies (Hortonworks, Y! JP, Cloudera, Intel, Huawei, …)
○ 100s of commits over three years (started in 2014)
 Erasure coding is feature complete!
 Solidifying some user APIs in preparation for beta1
 Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Stress / endurance testing to ensure stability
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0 – YARN Enhancements
 YARN Scheduling Enhancements
 Support for Long Running Services
 Re-architecture for YARN Timeline Service - ATS v2
 Better elasticity and resource utilization
 Better resource isolation and Docker!!
 Better User Experiences
 Other Enhancements
3.0
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduling Enhancements
 Application priorities within a queue: YARN-1963
– In Queue A, App1 > App 2
 Inter-Queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
 Affinity / anti-affinity: YARN-1042
– More constraints on locations
• Affinity to rack (where you have your sibling)
• Anti-affinity (e.g. HBase region servers)
 Global Scheduling: YARN-5139
– Get rid of scheduling triggered on node heartbeats
– Replaced with a global scheduler that has parallel threads (conceptual sketch below)
• Globally optimal placement – expect evolution of the scheduler
• Critical for long running services – they stick to the allocation, so it had better be a good one
• Enhanced container scheduling throughput (6x)
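A conceptual sketch of the difference, not YARN's actual code (node names and sizes are invented): heartbeat-driven scheduling can only try the node that just reported in, while a global scheduler looks at every node and picks the best fit.

    nodes = {"n1": 4, "n2": 16, "n3": 8}     # free memory (GB) per node, hypothetical

    def heartbeat_schedule(request_gb, heartbeating_node):
        # Old model: placement is only attempted on the node whose heartbeat arrived.
        return heartbeating_node if nodes[heartbeating_node] >= request_gb else None

    def global_schedule(request_gb):
        # Global model: consider all nodes in parallel and pick the best candidate.
        fits = [n for n, free in nodes.items() if free >= request_gb]
        return max(fits, key=lambda n: nodes[n], default=None)

    print(heartbeat_schedule(6, "n1"))   # None - n1 happened to heartbeat but cannot fit 6 GB
    print(global_schedule(6))            # 'n2' - globally better placement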
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scheduling Enhancements (Contd.)
 CapacityScheduler improvements
– Queue Management Improvements
• More Dynamic Queue reconfiguration
• REST API support for queue management
– Absolute resource configuration support
– Priority Support in Application and Queue
– Preemption improvements
• Inter-Queue preemption support
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Key Drivers for Long Running Services
 Consolidation of Infrastructure
– Hadoop clusters have a lot of compute and storage resources (some unused)
• Can’t I use Hadoop’s resources for non-Hadoop load?
• OpenStack is hard to manage/operate, can I use YARN?
• VMs are expensive, can I use YARN?
• But does it support Docker? – yes, we heard you
 Hadoop-related data services that run outside a Hadoop cluster
– Why can't I run them in the Hadoop cluster?
 Run Hadoop services (Hive, HBase) on YARN
– Run Multiple instances
– Benefit from YARN’s Elasticity and resource management
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Built-in support for long running Service in YARN
 A native YARN framework. YARN-4692
• Abstract common framework (similar to Slider) to support long running services
• Simplified API (to manage service lifecycle)
• Better support for long running services
 Recognition of long running services
• Affects policies for preemption, container reservation, etc.
• Auto-restart of containers
• Containers for long running services are restarted on the same node when they have local state
 Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
 Dynamic container configuration
• Ask for just enough resources, and adjust them at runtime (memory is harder)
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Discovery services in YARN
 Services can run on any YARN node; how do you get its IP?
– It can also move due to node failure
 YARN Service Discovery via DNS: YARN-4757
– Expose existing service information in YARN registry via DNS
• YARN service registry’s records will be converted into DNS entries
– Discovery of container IP and service port via standard DNS lookups.
• Application
– zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
• Container
– Container 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
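Because discovery is exposed as plain DNS, no YARN client library is needed to resolve a service. A small sketch using Python's standard library; the hostname is the example from the slide and will only resolve inside a cluster whose registry DNS zone is actually set up that way:

    import socket

    # Example name from the slide; substitute your cluster's registry DNS zone.
    service_name = "zkapp1.user1.yarncluster.com"

    try:
        # Ordinary A-record lookup via the system resolver.
        addrs = {info[4][0] for info in socket.getaddrinfo(service_name, None)}
        print(service_name, "->", ", ".join(sorted(addrs)))
    except socket.gaierror as err:
        print("lookup failed (expected outside such a cluster):", err)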
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
A More Powerful YARN
 Elastic Resource Model
– Dynamic Resource Configuration (YARN-291)
• Allow tune down/up on NM’s resource in runtime
– E.g. Helps when Hadoop cluster nodes are shared with other workloads
– E.g. Hadoop-on-Hadoop allows flexible resource allocation
– Graceful decommissioning of NodeManagers (YARN-914)
• Drains a node that’s being decommissioned to allow running containers to finish
• E.g. Removing a node for maintenance, Spot pricing on cloud, …
 Efficient Resource Utilization
– Support for container resizing (YARN-1197)
• Allows applications to change the size of an existing container
• E.g. long running services
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
More Powerful YARN (Contd.)
 Resource Isolation
– Resource isolation support for disk and network
• YARN-2619 (disk), YARN-2140 (network)
• Containers get a fair share of disk and network resources using Cgroups
– Docker support in LinuxContainerExecutor (YARN-3611)
• Support to launch Docker containers alongside process containers
• Packaging and resource isolation
– Packaging made easier, e.g. TensorFlow
• Complements YARN’s support for long running services
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Docker on YARN & YARN on YARN - YCloud
[Figure: a YARN cluster running Hadoop apps (MR, Tez, Spark), TensorFlow, and a nested YARN cluster that itself runs MR, Tez, and Spark]
Can use YARN to test Hadoop!!
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
YARN New UI (YARN-3368)
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Other YARN work planned in Hadoop 3.X
 Resource profiles (YARN-3926)
– Users can specify resource profile name instead of individual resources
– Resource types read via a config file
 YARN federation (YARN-2915)
– Allows YARN to scale out to tens of thousands of nodes
– Cluster of clusters which appear as a single cluster to an end user
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compatibility & Testing
3.0
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compatibility
 Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
 Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
 Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Testing and validation
 Extended alpha → beta → GA plan designed for stabilization
 EC already has some usage in production (700 nodes at Y! JP)
– Hortonworks has worked closely with this very large customer
 Hortonworks is integrating and testing HDP 3
– Integrating with all components of HDP stack
– HDP 2 integration tests, plus new ones
 Cloudera is also testing Hadoop 3 as part of their stack
 Plans for extensive HDFS EC testing by Hortonworks and Cloudera
 Happy synergy between 2.8.x and 3.0.x lines
– Shares much of the same code, fixes flow into both
– Yahoo! Deployments based on 2.8.0
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Summary : What’s new in Apache Hadoop 3.0?
Storage Optimization
HDFS: Erasure codes
Improved Utilization
YARN: Long Running Services
YARN: Scheduling Enhancements
Additional Workloads
YARN: Docker & Isolation
Easier to Use
New User Interface
Refactor Base
Lots of Trunk content
JDK8 and newer dependent libraries
3.0
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!
Reminder: BoFs on Thursday


Editor's Notes

  • #17 It enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.
  • #32 Previously based on unconsumed capacity: if a queue at 70% capacity has lots of unconsumed capacity, it is scheduled first. Now you can say that the 30% queue is higher priority.
  • #34 Original YARN design was not just for batch jobs – we started with that, but the design was general.
  • #37 Graceful degradation - remove nodes gracefully - for cloud especially if you are using spot pricing
  • #40 App centric (top two left pictures); Node centric; Resource centric – load vs capacity, overall and by queues; Cluster centric – nodes summary, heatmap of resource usage across nodes.
  • #47 Data Trends From Characteristics of the Data to Data Consumption & Interaction According to IBM, every day we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years.  Insight from Data is a key competitive differentiator Open Source is evolving and adapting with these trends the fastest Adopting Hadoop is not a destination but a journey