Apache Hadoop 3.0
Community Update
Sydney, September 2017
Sanjay Radia, Vinod Kumar Vavilapalli
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
– Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
Why Hadoop 3.0
Driving Reasons
 Lot of content in trunk that did not make it to the 2.x branch
 JDK upgrade – does not truly require bumping the major number
 Hadoop command scripts rewrite (incompatible)
Some features taking advantage of 3.0
 Big features that need a stabilizing major release – erasure codes
 YARN: long running services
 Ephemeral ports (incompatible)
Apache Hadoop 3.0
Key Takeaways
 HDFS: Erasure codes
 YARN:
– Long running services
– Scheduler enhancements
– Isolation & Docker
– UI
 Lots of trunk content
 JDK8 and newer dependent libraries
Release Timeline
– 3.0.0-alpha1 – Sep/3/2016
– Alpha2 – Jan/25/2017
– Alpha3 – May/16/2017
– Alpha4 – July/7/2017
– Beta/GA – Q4 2017 (estimated)
Agenda
 Major changes you should know before upgrading to Hadoop 3.0
– JDK upgrade
– Dependency upgrade
– Change on default port for
daemon/services
– Shell script rewrite
 Features
– Hadoop Common
• Client-Side Classpath Isolation
• Shell script rewrite
– HDFS/Storage
• Erasure Coding
• Multiple Standby NameNodes
• Intra-DataNode balancer
• Cloud Storage: Support for Azure Data Lake, S3
consistency & performance
– YARN
• Support for long running services
• Scheduling enhancements: App / Queue priorities, global scheduling, placement strategies
• New UI
• ATS v2
– MAPREDUCE
• Task-level native optimization
HADOOP-11264
Hadoop Operation - JDK Upgrade
 Minimum JDK for Hadoop 3.0.x is JDK 8 (HADOOP-11858)
– Oracle JDK 7 reached end of life in April 2015
 Moving forward to use new features of JDK 8
– Lambda expressions – starting to use these
– Stream API
– Security enhancements
– Performance enhancements for HashMaps, IO/NIO, etc.
 Hadoop’s evolution with JDK upgrades
– Hadoop 2.6.x - JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
– Hadoop 3.0.x - JDK 8 or later
Change of Default Ports for Hadoop Services
 Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000)
– Can conflict with other apps running on the same node
– Can cause problems during a rolling restart if another app takes the port
 New ports:
– NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
– Secondary NN ports: 50091 → 9869, 50090 → 9868
– DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
 KMS service port: 16000 → 9600
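The port changes on this slide can be captured as a small lookup table, which may help when auditing firewall rules or client configs before an upgrade. This is only a sketch: the table is copied from the slide, while the helper names (new_default_port, in_ephemeral_range) are hypothetical and not part of Hadoop.

```python
# Old -> new default ports, copied from the slide above.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "KMS": {16000: 9600},
}

# Linux ephemeral port range quoted on the slide (both ends inclusive).
EPHEMERAL_RANGE = range(32768, 61001)


def new_default_port(old_port):
    """Return the Hadoop 3.0 default for a 2.x default port, or None if it is not listed."""
    for mapping in PORT_CHANGES.values():
        if old_port in mapping:
            return mapping[old_port]
    return None


def in_ephemeral_range(port):
    """True if the port can collide with ports the kernel hands out to clients."""
    return port in EPHEMERAL_RANGE


if __name__ == "__main__":
    for service, mapping in PORT_CHANGES.items():
        for old, new in mapping.items():
            print(f"{service}: {old} -> {new}")
```

None of the new defaults falls inside the ephemeral range, which is the point of the change.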
Classpath isolation (HADOOP-11656)
 Hadoop leaks lots of dependencies onto the application’s classpath
○ Known offenders: Guava, Protobuf, Jackson, Jetty, …
○ Potential conflicts with your app dependencies (no shading)
 No separate HDFS client jar means server jars are leaked
● NN, DN libraries pulled in even though not needed
 HDFS-6200: Split HDFS client into separate JAR
 HADOOP-11804: Shaded hadoop-client dependency
 YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing)
HDFS
Support for Three NameNodes for HA
Intra-DataNode balancer
Cloud storage improvements (see afternoon talk)
Erasure coding
Current (2.x) HDFS Replication Strategy
 Three replicas by default
– 1st replica on local node, local rack or random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerate 2 failures
 Good data locality, local shortcut
 Multiple copies => Parallel IO for parallel compute
 Very Fast block recovery and node recovery
– Parallel recover - the bigger the cluster the faster
– 10 TB node recovery: 30 seconds to a few hours
 3x storage overhead vs. 1.4x-1.6x for erasure coding
– Remember that Hadoop’s JBOD is very cheap
• 1/10 - 1/20 of SANs
• 1/10 – 1/5 of NFS
[Figure: replicas r1, r2 and r3 placed on DataNodes in Rack I and Rack II]
Erasure Coding
 k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
 Reliability: tolerate m failures
 Save disk space
 Save I/O bandwidth on the write path
 1.5x storage overhead
 Tolerate any 3 failures
[Figure: data blocks b1-b6 followed by parity blocks P1-P3 (6 data blocks, 3 parity blocks)]

                               3-replication   (6, 3) Reed-Solomon
Maximum fault tolerance        2               3
Disk usage (N bytes of data)   3N              1.5N
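The numbers in the comparison follow directly from k and m: a (k, m) code stores k + m blocks for every k blocks of data and survives the loss of any m of them. A minimal sketch of that arithmetic (the function names are illustrative, not Hadoop APIs):

```python
from fractions import Fraction


def replication(n):
    """n full copies: n times the raw size, survives the loss of n - 1 copies."""
    return {"overhead": Fraction(n), "tolerated_failures": n - 1}


def reed_solomon(k, m):
    """k data + m parity blocks: (k + m) / k times the raw size, survives any m losses."""
    return {"overhead": Fraction(k + m, k), "tolerated_failures": m}


if __name__ == "__main__":
    for name, scheme in [("3-replication", replication(3)),
                         ("RS(6,3)", reed_solomon(6, 3)),
                         ("RS(10,4)", reed_solomon(10, 4))]:
        print(name, float(scheme["overhead"]), scheme["tolerated_failures"])
```

RS(6,3) gives 1.5x and RS(10,4) gives 1.4x, which matches the 1.4-1.6 range quoted on the replication slide.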
Block Reconstruction
 Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
[Figure: data blocks b1-b6 and parity blocks P1-P3, each placed on a different rack]
Erasure Coding on Contiguous/Striped Blocks
Two Approaches
 EC on contiguous blocks
– Pros: Better for locality
– Cons: small files cannot be handled
 EC on striped blocks
– Pros: Leverage multiple disks in parallel
– Pros: Works for small files
– Cons: No data locality for readers
[Figure, striped layout: stripe 1 holds cells C1-C6 plus parity cells PC1-PC3, stripe 2 holds C7-C12 plus PC4-PC6, and so on to stripe n, spread over 6 data blocks b1-b6 and 3 parity blocks P1-P3]
[Figure, contiguous layout: data blocks b1-b6 of files f1, f2 and f3 with parity blocks P1-P3]
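The striped layout in the figure can be sketched as simple index arithmetic: the file is cut into fixed-size cells that are dealt round-robin across the k data blocks, and every k cells form one stripe. The sketch below assumes a 1 MB cell (taken from the "Stripe size 1MB" label on the write slide) and k = 6; locate is a hypothetical helper, not an HDFS API.

```python
MB = 1024 * 1024


def locate(offset, cell_size=MB, k=6):
    """Map a logical file offset to (stripe, data block index, offset inside that block).

    Indices are 0-based, so cell C1 in the figure is cell 0 here.
    """
    cell = offset // cell_size          # which cell of the file holds this byte
    stripe = cell // k                  # every k consecutive cells form one stripe
    block = cell % k                    # cells are dealt round-robin over the data blocks
    offset_in_block = stripe * cell_size + offset % cell_size
    return stripe, block, offset_in_block


if __name__ == "__main__":
    for off in (0, MB, 6 * MB + 5, 7 * MB):
        print(off, locate(off))
```

Consecutive cells land on different datanodes, which is why a striped read can use several disks in parallel but loses data locality.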
Erasure Coding Zone
 Create a zone on an empty directory
– Shell command:
hdfs erasurecode -createZone [-s <schemaName>] <path>
 All the files under a zone directory are automatically erasure coded
– Renames across zones with different EC schemas are disallowed
Write Pipeline for Replicated Files
 Write pipeline to datanodes
 Durability
– Use 3 replicas to tolerate maximum 2 failures
 Visibility
– Files can be read while they are still being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any replica and failover to any other replica to read the same data
 Appendable
– Files can be reopened for append
[Figure: the Writer sends data to DN1, which forwards it to DN2 and then DN3; acks flow back along the same path. DN = DataNode]
Parallel Write for EC Files
 Parallel write
– Client writes to a group of 9 datanodes at the same time
– Calculate Parity bits at client side, at Write Time
 Durability
– (6, 3)-Reed-Solomon can tolerate maximum 3 failures
 Visibility (Same as replicated files)
– Files can be read while they are still being written
– Data can be made visible by hflush/hsync
 Consistency
– Client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, the client can fail over to any other remaining replica to read the same data.
 Appendable (Same as replicated files)
– Files can be reopened for append
[Figure: the Writer sends data to DN1 … DN6 and parity to DN7 … DN9 in parallel, and each datanode acks directly to the Writer. Stripe size 1MB]
EC: Write Failure Handling
 Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate 3 failures.
– Require at least 6 datanodes.
– Missing blocks will be reconstructed later.
[Figure: the same parallel write to DN1 … DN6 (data) and DN7 … DN9 (parity), with acks returned to the Writer]
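The failure rule on this slide reduces to one comparison: with k = 6 data and m = 3 parity blocks, a write can go on as long as at least k of the k + m datanodes are still healthy. A small sketch, where write_status is a hypothetical helper rather than anything in the HDFS client:

```python
def write_status(failed_datanodes, k=6, m=3):
    """Summarize an EC write to k + m datanodes after some of them have failed."""
    healthy = (k + m) - failed_datanodes
    can_continue = healthy >= k          # at least k blocks must be written
    return {
        "can_continue": can_continue,
        # Blocks on failed datanodes are rebuilt later, once the write has finished.
        "blocks_to_reconstruct": failed_datanodes if can_continue else None,
    }


if __name__ == "__main__":
    for failed in range(5):
        print(failed, write_status(failed))
```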
Replication: Slow Writers & Replace Datanode on Failure
 Write pipeline for replicated files
– Datanode can be replaced in case of failure.
 Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
 EC files
– Do not support replace-datanode-on-failure.
– Slow writer improved
[Figure: replicated write pipeline with the Writer, DN1, DN2, DN3 and DN4 exchanging data and acks]
Reading with Parity Blocks
 Parallel read
– Read from 6 Datanodes with data blocks
– Support both stateful read and pread
 Block reconstruction
– Read parity blocks to reconstruct
missing blocks
[Figure: the Reader reads Block1, Block2, Block4, Block5, Block6 and Parity1 from DN1-DN7 and reconstructs the missing Block3]
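Reconstruction from parity is easiest to see with a single XOR parity block. This is a deliberate simplification: HDFS uses Reed-Solomon over a Galois field, which survives up to m losses, whereas plain XOR survives only one. The idea of rebuilding a missing block from the survivors plus parity is the same. All names below are illustrative.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """Single parity block: XOR of all data blocks."""
    return reduce(xor_bytes, blocks)


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the one missing data block from the survivors and the parity block."""
    return reduce(xor_bytes, surviving_blocks, parity_block)


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    p = parity(data)
    # Pretend the middle block is lost and rebuild it from the other two plus parity.
    print(reconstruct([data[0], data[2]], p))
```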
EC implications
 File data is striped across multiple nodes and racks
 Reads and writes are remote and cross-rack
 Reconstruction is network-intensive, reads m blocks cross-rack
– Need fast network
• Require high network bandwidth between client-server
• Dead DataNode implies high network traffic and reconstruction time
 Important to use optimized ISA-L for performance
– 1+ GB/s encode/decode speed, much faster than Java implementation
– CPU is no longer a bottleneck
 Need to combine data into larger files to avoid an explosion in replica count
– Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
– Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
Works best for archival / cold data use cases
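The "Bad" and "Good" figures above can be reproduced by counting the blocks the NameNode has to track. The sketch assumes a 1 GB block size and 3x replication as the baseline, which is what the slide's numbers imply, and that a file is large enough to be striped over all k data blocks. Sizes are in GB and the helper names are hypothetical.

```python
def replicated_block_count(file_size, block_size=1, replicas=3):
    """Block replicas for a replicated file: ceil(size / block size) * replicas."""
    blocks = -(-file_size // block_size)   # integer ceiling division
    return blocks * replicas


def ec_block_count(file_size, block_size=1, k=10, m=4):
    """Blocks for a striped EC file: each block group of up to k * block_size has k + m blocks."""
    groups = -(-file_size // (k * block_size))
    return groups * (k + m)


if __name__ == "__main__":
    for size in (1, 10):
        ec, rep = ec_block_count(size), replicated_block_count(size)
        print(f"{size} GB: {ec} EC blocks vs {rep} replicas, ratio {ec / rep:.2f}")
```

The ratios come out as 14/3 and 14/30; the slide quotes them truncated as 4.6x and 0.46x.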
EC performance – write performance faster with right EC lib
EC performance – TPC with no DN killed
EC performance - TPC with 2 DN killed
Erasure coding status
 Massive development effort by the Hadoop community
○ 20+ contributors from many companies (Hortonworks, Y! JP, Cloudera, Intel, Huawei, …)
○ 100s of commits over three years (started in 2014)
 Erasure coding is feature complete!
 Solidifying some user APIs in preparation for beta1
 Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Stress / endurance testing to ensure stability
Apache Hadoop 3.0 – YARN Enhancements
 YARN Scheduling Enhancements
 Support for Long Running Services
 Re-architecture for YARN Timeline Service - ATS v2
 Better elasticity and resource utilization
 Better resource isolation and Docker!!
 Better User Experiences
 Other Enhancements
Scheduling Enhancements
 Application priorities within a queue: YARN-1963
– In Queue A, App1 > App 2
 Inter-Queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
 Affinity / anti-affinity: YARN-1042
– More constraints on placement
• Affinity to rack (where you have your sibling)
• Anti-affinity (e.g. HBase region servers)
 Global Scheduling: YARN-5139
– Get rid of scheduling triggered on node heartbeat
– Replaced with global scheduler that has parallel threads
• Globally optimal placement – expect evolution of the scheduler
• Critical for long running services – they stick to the allocation, so it had better be a good one
• Enhanced container scheduling throughput (6x)
Scheduling Enhancements (Contd.)
 CapacityScheduler improvements
– Queue Management Improvements
• More Dynamic Queue reconfiguration
• REST API support for queue management
– Absolute resource configuration support
– Priority Support in Application and Queue
– Preemption improvements
• Inter-Queue preemption support
Key Drivers for Long Running Services
 Consolidation of Infrastructure
– Hadoop clusters have a lot of compute and storage resources (some unused)
• Can’t I use Hadoop’s resources for non-Hadoop load?
• OpenStack is hard to manage/operate, can I use YARN?
• VMs are expensive, can I use YARN?
• But does it support Docker? – yes, we heard you
 Hadoop related Data Services that run outside a Hadoop cluster
– Why can’t I run them in the Hadoop cluster?
 Run Hadoop services (Hive, HBase) on YARN
– Run Multiple instances
– Benefit from YARN’s Elasticity and resource management
Built-in support for long running Service in YARN
 A native YARN framework. YARN-4692
• Abstract common Framework (Similar to Slider) to support long running service
• More simplified API (to manage service lifecycle)
• Better support for long running service
 Recognition of long running service
• Affect the policy of preemption, container reservation, etc.
• Auto-restart of containers
• Containers for long running service are restarted on same node in case of local state
 Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
 Dynamic container configuration
• Ask only for the resources needed, then adjust them at runtime (memory is harder)
Discovery services in YARN
 Services can run on any YARN node; how do you get its IP?
– It can also move due to node failure
 YARN Service Discovery via DNS: YARN-4757
– Expose existing service information in YARN registry via DNS
• YARN service registry’s records will be converted into DNS entries
– Discovery of container IP and service port via standard DNS lookups.
• Application
– zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
• Container
– Container 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
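Because the registry records are exposed as ordinary DNS entries, a client needs nothing YARN-specific to find a service. The sketch below uses only the standard library resolver; the hostname is the example from the slide and would only resolve inside such a cluster. An address lookup like this returns the IP only, so how the service port is published is left out here. resolve_service and its resolver parameter are illustrative names.

```python
import socket


def resolve_service(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for a service name, or None if it does not resolve.

    The resolver argument exists so the lookup can be stubbed out in tests.
    """
    try:
        return resolver(hostname)
    except socket.gaierror:
        return None


if __name__ == "__main__":
    # Example name from the slide; prints None outside a cluster that serves this zone.
    print(resolve_service("zkapp1.user1.yarncluster.com"))
```

If the service moves after a node failure, the same lookup simply returns the new address once the registry record is updated.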
A More Powerful YARN
 Elastic Resource Model
– Dynamic Resource Configuration (YARN-291)
• Allows tuning the NM’s resources up or down at runtime
– E.g. Helps when Hadoop cluster nodes are shared with other workloads
– E.g. Hadoop-on-Hadoop allows flexible resource allocation
– Graceful decommissioning of NodeManagers (YARN-914)
• Drains a node that’s being decommissioned to allow running containers to finish
• E.g. Removing a node for maintenance, Spot pricing on cloud, …
 Efficient Resource Utilization
– Support for container resizing (YARN-1197)
• Allows applications to change the size of an existing container
• E.g. long running services
More Powerful YARN (Contd.)
 Resource Isolation
– Resource isolation support for disk and network
• YARN-2619 (disk), YARN-2140 (network)
• Containers get a fair share of disk and network resources using Cgroups
– Docker support in LinuxContainerExecutor (YARN-3611)
• Support for launching Docker containers alongside processes
• Packaging and resource isolation
– Packaging is easier, e.g. for TensorFlow
• Complements YARN’s support for long running services
Docker on YARN & YARN on YARN → YCloud
[Figure: YARN runs Hadoop apps (MR, Tez, Spark), TensorFlow, and a nested YARN that runs its own MR, Tez and Spark]
Can use YARN to test Hadoop!
YARN New UI (YARN-3368)
Other YARN work planned in Hadoop 3.X
 Resource profiles (YARN-3926)
– Users can specify resource profile name instead of individual resources
– Resource types read via a config file
 YARN federation (YARN-2915)
– Allows YARN to scale out to tens of thousands of nodes
– Cluster of clusters which appear as a single cluster to an end user
Compatibility & Testing
Compatibility
 Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
 Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
 Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
Testing and validation
 Extended alpha → beta → GA plan designed for stabilization
 EC already has some usage in production (700 nodes at Y! JP)
– Hortonworks has worked closely with this very large customer
 Hortonworks is integrating and testing HDP 3
– Integrating with all components of HDP stack
– HDP2 ++ integration tests
 Cloudera is also testing Hadoop 3 as part of their stack
 Plans for extensive HDFS EC testing by Hortonworks and Cloudera
 Happy synergy between 2.8.x and 3.0.x lines
– Shares much of the same code, fixes flow into both
– Yahoo! Deployments based on 2.8.0
Summary : What’s new in Apache Hadoop 3.0?
 Storage Optimization
– HDFS: Erasure codes
 Improved Utilization
– YARN: Long Running Services
– YARN: Schedule Enhancements
 Additional Workloads
– YARN: Docker & Isolation
 Easier to Use
– New User Interface
 Refactor Base
– Lots of Trunk content
– JDK8 and newer dependent libraries
Thank you!
Reminder: BoFs on Thursday

Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 

Recently uploaded (20)

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 

Apache Hadoop 3.0 Community Update

  • 1. Apache Hadoop 3.0 Community Update. Sydney, September 2017. Sanjay Radia, Vinod Kumar Vavilapalli
  • 2. Sanjay Radia (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
      - Chief Architect, Founder, Hortonworks
      - Part of the original Hadoop team at Yahoo! since 2007
          - Chief Architect of Hadoop Core at Yahoo!
          - Apache Hadoop PMC and Committer
      - Prior: data center automation, virtualization, Java, HA, OSs, file systems; startup, Sun Microsystems, Inria …
      - Ph.D., University of Waterloo
  • 3. Why Hadoop 3.0 (column headings on the slide: "Driving Reasons" and "Some features taking advantage of 3.0")
      - A lot of content in trunk that did not make it into the 2.x branch
      - JDK upgrade (does not truly require bumping the major number)
      - Hadoop command scripts rewrite (incompatible)
      - Big features that need a stabilizing major release: erasure codes
      - YARN: long running services
      - Ephemeral ports (incompatible)
  • 4. Apache Hadoop 3.0
      Key Takeaways
      - HDFS: erasure codes
      - YARN: long running services, scheduler enhancements, isolation & Docker, UI
      - Lots of trunk content
      - JDK8 and newer dependent libraries
      Release Timeline
      - 3.0.0-alpha1: Sep/3/2016
      - Alpha2: Jan/25/2017
      - Alpha3: May/16/2017
      - Alpha4: July/7/2017
      - Beta/GA: Q4 2017 (estimated)
  • 5. Agenda
      - Major changes you should know before upgrading to Hadoop 3.0
          - JDK upgrade
          - Dependency upgrade
          - Change of default ports for daemons/services
          - Shell script rewrite
      - Features
          - Hadoop Common: client-side classpath isolation; shell script rewrite
          - HDFS/Storage: erasure coding; multiple standby NameNodes; intra-DataNode balancer; cloud storage (support for Azure Data Lake, S3 consistency & performance)
          - YARN: support for long running services; scheduling enhancements (app / queue priorities, global scheduling, placement strategies); new UI; ATS v2
          - MapReduce: task-level native optimization
      HADOOP-11264
  • 6. Hadoop Operation - JDK Upgrade
      - Minimum JDK for Hadoop 3.0.x is JDK8 OOP-11858
          - Oracle JDK 7 reached EoL in April 2015
      - Moving forward to use new features of JDK8
          - Lambda expressions (starting to use these)
          - Stream API
          - Security enhancements
          - Performance enhancements for HashMaps, IO/NIO, etc.
      - Hadoop’s evolution with JDK upgrades
          - Hadoop 2.6.x: JDK 6, 7, 8 or later
          - Hadoop 2.7.x/2.8.x/2.9.x: JDK 7, 8 or later
          - Hadoop 3.0.x: JDK 8 or later
  • 7. Change of Default Ports for Hadoop Services
      - Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000)
          - Can conflict with other apps running on the same node
          - Can cause problems during a rolling restart if another app takes the port
      - New ports:
          - NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
          - Secondary NN ports: 50091 → 9869, 50090 → 9868
          - DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
          - KMS service port: 16000 → 9600
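The port changes on this slide can be captured as a small lookup table, which is handy when auditing firewall rules or client configs during an upgrade. The sketch below is only an illustration: the table is copied from the slide, while the names `PORT_CHANGES`, `in_ephemeral_range`, and `old_ports_in_ephemeral_range` are hypothetical helpers and not part of Hadoop.

```python
# Linux ephemeral port range as quoted on the slide (32768-61000, inclusive).
EPHEMERAL_RANGE = range(32768, 61001)

# Old default port -> new default port, copied from the slide.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NN": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "KMS": {16000: 9600},
}


def in_ephemeral_range(port):
    """Return True if the port falls inside the ephemeral range above."""
    return port in EPHEMERAL_RANGE


def old_ports_in_ephemeral_range():
    """List the old defaults that could collide with ephemeral ports."""
    return sorted(
        old
        for ports in PORT_CHANGES.values()
        for old in ports
        if in_ephemeral_range(old)
    )


if __name__ == "__main__":
    for service, ports in PORT_CHANGES.items():
        for old, new in ports.items():
            flag = "ephemeral" if in_ephemeral_range(old) else "not ephemeral"
            print(f"{service}: {old} -> {new} (old port was {flag})")
```

Running the table through the check shows that none of the new defaults fall in the ephemeral range, and that two of the old defaults listed on the slide (8020 and 16000) were already outside it.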
  • 8. Classpath isolation (HADOOP-11656)
      - Hadoop leaks lots of dependencies onto the application’s classpath
          - Known offenders: Guava, Protobuf, Jackson, Jetty, …
          - Potential conflicts with your app dependencies (no shading)
      - No separate HDFS client jar means server jars are leaked
          - NN, DN libraries pulled in even though not needed
      - HDFS-6200: Split HDFS client into separate JAR
      - HADOOP-11804: Shaded hadoop-client dependency
      - YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing)
  • 9. HDFS
      - Support for three NameNodes for HA
      - Intra-DataNode balancer
      - Cloud storage improvements (see afternoon talk)
      - Erasure coding
  • 10. Current (2.x) HDFS Replication Strategy
      - Three replicas by default
          - 1st replica on local node, local rack or random node
          - 2nd and 3rd replicas on the same remote rack
          - Reliability: tolerates 2 failures
      - Good data locality, local shortcut
      - Multiple copies => parallel IO for parallel compute
      - Very fast block recovery and node recovery
          - Parallel recovery: the bigger the cluster, the faster
          - 10TB node recovery: 30 sec to a few hours
      - 3x storage overhead vs 1.4-1.6x for erasure coding
          - Remember that Hadoop’s JBOD is very cheap: 1/10 - 1/20 of SANs, 1/10 - 1/5 of NFS
      - [Diagram: replicas r1, r2, r3 placed on DataNodes across Rack I and Rack II]
  • 11. Erasure Coding
      - k data blocks + m parity blocks (k + m)
          - Example: Reed-Solomon 6+3
      - Reliability: tolerates m failures
      - Saves disk space
      - Saves I/O bandwidth on the write path
      - 1.5x storage overhead
      - Tolerates any 3 failures
      - [Diagram: 6 data blocks b1-b6 and 3 parity blocks P1-P3]
      - Comparison:
                                          3-replication    (6, 3) Reed-Solomon
          Maximum fault tolerance         2                3
          Disk usage (N bytes of data)    3N               1.5N
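The numbers in the comparison follow directly from k and m: a (k, m) code stores k + m blocks for every k blocks of data and tolerates m lost blocks, while r-way replication stores r copies and tolerates r - 1 losses. A minimal sketch of that arithmetic, with hypothetical function names:

```python
from fractions import Fraction


def replication_profile(replicas):
    """Fault tolerance and disk usage factor for plain r-way replication."""
    return {
        "tolerated_failures": replicas - 1,
        "disk_usage_factor": Fraction(replicas),
    }


def erasure_coding_profile(k, m):
    """Fault tolerance and disk usage factor for k data + m parity blocks."""
    return {
        "tolerated_failures": m,
        "disk_usage_factor": Fraction(k + m, k),
    }


if __name__ == "__main__":
    print("3-replication:", replication_profile(3))
    print("RS(6, 3):     ", erasure_coding_profile(6, 3))
    print("RS(10, 4):    ", erasure_coding_profile(10, 4))
```

For RS(6, 3) the disk usage factor is 9/6 = 1.5, matching the 1.5N in the comparison. RS(10, 4) gives 1.4, the low end of the 1.4-1.6 range quoted for erasure coding on the replication slide.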
  • 12. Block Reconstruction
      - Block reconstruction overhead
          - Higher network bandwidth cost
          - Extra CPU overhead
              - Local Reconstruction Codes (LRC), Hitchhiker
      - References:
          - Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
          - Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
          - Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
      - [Diagram: data blocks b1-b6 and parity blocks P1-P3 spread across racks]
  • 13. Erasure Coding on Contiguous/Striped Blocks: Two Approaches
      - EC on contiguous blocks
          - Pros: better for locality
          - Cons: small files cannot be handled
      - EC on striped blocks
          - Pros: leverages multiple disks in parallel
          - Pros: works for small files
          - Cons: no data locality for readers
      - [Diagram, striped layout: stripe 1 holds cells C1-C6 plus parity cells PC1-PC3, stripe 2 holds C7-C12 plus PC4-PC6, and so on up to stripe n, laid out over 6 data blocks b1-b6 and 3 parity blocks P1-P3]
      - [Diagram, contiguous layout: data blocks b1-b6 belonging to files f1, f2, f3, with parity blocks P1-P3]
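To make the striped layout concrete, the sketch below maps a logical file offset to a stripe, a data block, and an offset inside that block. It assumes cells are assigned round-robin to the k data blocks, as the C1-C6 / C7-C12 diagram suggests, and a 1 MiB cell size, which is an assumption for illustration. The function name is hypothetical and this is not the HDFS client code.

```python
MIB = 1024 * 1024


def locate_in_striped_layout(offset, cell_size=MIB, k=6):
    """Map a logical byte offset to (stripe_index, block_index, offset_in_block).

    All indices are 0-based, so cell C1 in the diagram is cell 0 here.
    Assumes round-robin placement of cells over the k data blocks.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    cell_index = offset // cell_size      # which cell the byte falls in
    stripe_index = cell_index // k        # which row of k cells
    block_index = cell_index % k          # which data block holds that cell
    offset_in_block = stripe_index * cell_size + offset % cell_size
    return stripe_index, block_index, offset_in_block


if __name__ == "__main__":
    for off in (0, 5 * MIB, 6 * MIB, 7 * MIB + 5):
        print(off, "->", locate_in_striped_layout(off))
```

Because consecutive cells land on different blocks, even a file of a few MiB touches several DataNodes. That accounts for both the parallelism and the loss of read locality listed above.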
  • 14. Erasure Coding Zone
      - Create a zone on an empty directory
          - Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path>
      - All the files under a zone directory are automatically erasure coded
          - Renames across zones with different EC schemas are disallowed
  • 15. Write Pipeline for Replicated Files
      - Write pipeline to DataNodes
      - Durability
          - Uses 3 replicas to tolerate a maximum of 2 failures
      - Visibility
          - Reads are supported for files that are still being written
          - Data can be made visible by hflush/hsync
      - Consistency
          - Client can start reading from any replica and fail over to any other replica to read the same data
      - Appendable
          - Files can be reopened for append
      - [Diagram: Writer sends data through DN1, DN2, DN3 and acks flow back; DN = DataNode]
  • 16. Parallel Write for EC Files
      - Parallel write
          - Client writes to a group of 9 DataNodes at the same time
          - Parity bits are calculated on the client side, at write time
      - Durability
          - (6, 3) Reed-Solomon can tolerate a maximum of 3 failures
      - Visibility (same as replicated files)
          - Reads are supported for files that are still being written
          - Data can be made visible by hflush/hsync
      - Consistency
          - Client can start reading from any 6 of the 9 replicas
          - When reading from a DataNode fails, the client can fail over to any other remaining replica to read the same data
      - Appendable (same as replicated files)
          - Files can be reopened for append
      - [Diagram: Writer sends data to DN1 … DN6 and parity to DN7 … DN9 in parallel, each returning an ack; stripe size 1MB]
  • 17. EC: Write Failure Handling
      - DataNode failure
          - Client ignores the failed DataNode and continues writing
          - Able to tolerate 3 failures
          - Requires at least 6 DataNodes
          - Missing blocks will be reconstructed later
      - [Diagram: Writer sends data to DN1 … DN6 and parity to DN7 … DN9, each returning an ack]
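The failure rule on this slide reduces to a simple count: with k data and m parity blocks, the write can proceed as long as at least k of the k + m target DataNodes are still healthy, and every failed target leaves one block to be reconstructed later. A small sketch of that rule, using hypothetical helper names and the slide's (6, 3) scheme as defaults:

```python
def can_continue_writing(healthy_datanodes, k=6, m=3):
    """Return True if an EC write can proceed with this many healthy targets."""
    if not 0 <= healthy_datanodes <= k + m:
        raise ValueError("healthy_datanodes must be between 0 and k + m")
    return healthy_datanodes >= k


def blocks_to_reconstruct(healthy_datanodes, k=6, m=3):
    """Number of blocks in the group that must be rebuilt after the write."""
    if not 0 <= healthy_datanodes <= k + m:
        raise ValueError("healthy_datanodes must be between 0 and k + m")
    return (k + m) - healthy_datanodes


if __name__ == "__main__":
    for healthy in range(9, 4, -1):
        print(healthy, can_continue_writing(healthy), blocks_to_reconstruct(healthy))
```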
  • 18. Replication: Slow Writers & Replace Datanode on Failure
      - Write pipeline for replicated files
          - A DataNode can be replaced in case of failure
      - Slow writers
          - A write pipeline may last for a long time
          - The probability of DataNode failures increases over time
          - Need to replace the DataNode on failure
      - EC files
          - Do not support replace-datanode-on-failure
          - Slow writer improved
      - [Diagram: Writer, DN1, DN2, DN3 and DN4 with data and ack flows]
  • 19. Reading with Parity Blocks
      - Parallel read
          - Read from 6 DataNodes with data blocks
          - Supports both stateful read and pread
      - Block reconstruction
          - Read parity blocks to reconstruct missing blocks
      - [Diagram: Reader reads Block1, Block2, Block4, Block5, Block6 and Parity1 from DataNodes DN1-DN7 and reconstructs Block3]
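The idea of rebuilding a missing block from the surviving blocks plus parity can be shown with a toy example. The sketch below deliberately swaps Reed-Solomon for a single XOR parity block, which tolerates only one loss. It illustrates the principle and is not the coder HDFS uses, and all function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode_parity(data_blocks):
    """Compute a single XOR parity block over the data blocks."""
    return xor_blocks(data_blocks)


def reconstruct_missing(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"blk1", b"blk2", b"blk3", b"blk4", b"blk5", b"blk6"]
    parity = encode_parity(data)
    survivors = data[:2] + data[3:]          # Block3 is lost
    print(reconstruct_missing(survivors, parity))
```

As in the diagram, the reader fetches the five surviving data blocks and one parity block, then recomputes Block3 locally. With Reed-Solomon (6, 3) the same approach extends to any three missing blocks, at a higher CPU cost.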
  • 20. EC implications
      - File data is striped across multiple nodes and racks
      - Reads and writes are remote and cross-rack
      - Reconstruction is network-intensive, reads m blocks cross-rack
          - Need fast network
              - Requires high network bandwidth between client and server
              - A dead DataNode implies high network traffic and reconstruction time
      - Important to use optimized ISA-L for performance
          - 1+ GB/s encode/decode speed, much faster than the Java implementation
          - CPU is no longer a bottleneck
      - Need to combine data into larger files to avoid an explosion in replica count
          - Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
          - Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
      - Works best for archival / cold data use cases
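The 4.6x and 0.46x figures can be reproduced with a short calculation. The sketch assumes, as the slide's numbers imply, that each 1GB file is a single block under 3-way replication and that the data being compared forms one RS(10, 4) group of 14 blocks. The function name and the returned keys are hypothetical.

```python
from fractions import Fraction


def ec_vs_replication(num_files, file_size_gb, k=10, m=4, replicas=3):
    """Compare block counts for one RS(k, m) group against r-way replication.

    Assumes one block per file under replication and one EC block group
    covering all the files together.
    """
    replicated_blocks = num_files * replicas
    ec_blocks = k + m
    return {
        "ec_blocks": ec_blocks,
        "ec_block_size_gb": Fraction(num_files * file_size_gb, k),
        "replica_ratio": Fraction(ec_blocks, replicated_blocks),
    }


if __name__ == "__main__":
    bad = ec_vs_replication(num_files=1, file_size_gb=1)
    good = ec_vs_replication(num_files=10, file_size_gb=1)
    print("bad: ", float(bad["ec_block_size_gb"]), "GB blocks,", float(bad["replica_ratio"]), "x replicas")
    print("good:", float(good["ec_block_size_gb"]), "GB blocks,", float(good["replica_ratio"]), "x replicas")
```

The ratios come out as 14/3 (about 4.67) and 14/30 (about 0.47), which the slide shows truncated to 4.6x and 0.46x.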
  • 21. EC performance – write performance faster with right EC lib
  • 22. EC performance – TPC with no DN killed
  • 23. EC performance – TPC with 2 DN killed
  • 24. Erasure coding status
      - Massive development effort by the Hadoop community
          - 20+ contributors from many companies (Hortonworks, Y! JP, Cloudera, Intel, Huawei, …)
          - 100s of commits over three years (started in 2014)
      - Erasure coding is feature complete!
      - Solidifying some user APIs in preparation for beta1
      - Current focus is on testing and integration efforts
          - Want the complete Hadoop stack to work with HDFS erasure coding enabled
          - Stress / endurance testing to ensure stability
  • 25. Apache Hadoop 3.0 – YARN Enhancements
      - YARN scheduling enhancements
      - Support for long running services
      - Re-architecture for YARN Timeline Service: ATS v2
      - Better elasticity and resource utilization
      - Better resource isolation and Docker
      - Better user experiences
      - Other enhancements
  • 26. Scheduling Enhancements
      - Application priorities within a queue: YARN-1963
          - In Queue A, App1 > App2
      - Inter-queue priorities
          - Q1 > Q2 irrespective of demand / capacity
          - Previously based on unconsumed capacity
      - Affinity / anti-affinity: YARN-1042
          - More constraints on locations
              - Affinity to rack (where you have your sibling)
              - Anti-affinity (e.g. HBase region servers)
      - Global scheduling: YARN-5139
          - Gets rid of scheduling triggered on node heartbeat
          - Replaced with a global scheduler that has parallel threads
              - Globally optimal placement; expect evolution of the scheduler
              - Critical for long running services: they stick to their allocation, so it had better be a good one
              - Enhanced container scheduling throughput (6x)
  • 27. Scheduling Enhancements (Contd.)
      - CapacityScheduler improvements
          - Queue management improvements
              - More dynamic queue reconfiguration
              - REST API support for queue management
          - Absolute resource configuration support
          - Priority support in application and queue
          - Preemption improvements
              - Inter-queue preemption support
  • 28. Key Drivers for Long Running Services
      - Consolidation of infrastructure
          - Hadoop clusters have a lot of compute and storage resources (some unused)
              - Can’t I use Hadoop’s resources for non-Hadoop load?
              - OpenStack is hard to manage/operate, can I use YARN?
              - VMs are expensive, can I use YARN?
              - But does it support Docker? Yes, we heard you
      - Hadoop-related data services that run outside a Hadoop cluster
          - Why can’t I run them in the Hadoop cluster?
      - Run Hadoop services (Hive, HBase) on YARN
          - Run multiple instances
          - Benefit from YARN’s elasticity and resource management
  • 29. Built-in support for long running services in YARN
      ▪ A native YARN framework: YARN-4692
          • Abstract common framework (similar to Slider) to support long running services
          • Simpler API (to manage the service lifecycle)
          • Better support for long running services
      ▪ Recognition of long running services
          • Affects the policy for preemption, container reservation, etc.
          • Auto-restart of containers
          • Containers for long running services are restarted on the same node in case of local state
      ▪ Service/application upgrade support: YARN-4726
          • In general, services are expected to run long enough to cross versions
      ▪ Dynamic container configuration
          • Ask for just enough resources, but adjust them at runtime (memory is harder)
  • 30. Discovery services in YARN
      ▪ Services can run on any YARN node; how do you get its IP?
          – It can also move due to node failure
      ▪ YARN service discovery via DNS: YARN-4757
          – Expose existing service information in the YARN registry via DNS
              • YARN service registry records will be converted into DNS entries
          – Discovery of container IP and service port via standard DNS lookups
              • Application – zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
              • Container – Container 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
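The record-to-DNS conversion described on this slide can be sketched as a small in-memory table. The naming scheme `<service>.<user>.<domain>` is inferred only from the slide's `zkapp1.user1.yarncluster.com` example; the record layout and the function names are hypothetical and do not reflect the actual registry DNS implementation.

```python
def service_dns_name(service: str, user: str, domain: str) -> str:
    """Build a name shaped like the slide's example (assumed scheme)."""
    return f"{service}.{user}.{domain}"


def registry_to_dns(records, domain):
    """Convert hypothetical registry records into a name -> (ip, port) table."""
    table = {}
    for record in records:
        name = service_dns_name(record["service"], record["user"], domain)
        table[name] = (record["ip"], record["port"])
    return table


# Values taken from the example on the slide.
records = [
    {"service": "zkapp1", "user": "user1", "ip": "192.168.10.11", "port": 8080},
]
dns_table = registry_to_dns(records, "yarncluster.com")
endpoint = dns_table["zkapp1.user1.yarncluster.com"]
print(endpoint)
```

A real client would not consult such a table directly; it would resolve the name with a standard resolver call such as Python's `socket.getaddrinfo`, which is the point of exposing the registry through DNS.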
  • 31. A More Powerful YARN
      ▪ Elastic resource model
          – Dynamic resource configuration (YARN-291)
              • Allows tuning an NM’s resources down/up at runtime
                  – E.g. helps when Hadoop cluster nodes are shared with other workloads
                  – E.g. Hadoop-on-Hadoop allows flexible resource allocation
          – Graceful decommissioning of NodeManagers (YARN-914)
              • Drains a node that’s being decommissioned to allow running containers to finish
              • E.g. removing a node for maintenance, spot pricing on cloud, …
      ▪ Efficient resource utilization
          – Support for container resizing (YARN-1197)
              • Allows applications to change the size of an existing container
              • E.g. long running services
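The draining behaviour behind graceful decommissioning can be sketched as a tiny state function. This is a toy model, not the ResourceManager's logic: the function name, the timeout handling, and the input format are assumptions; only the state names DECOMMISSIONING and DECOMMISSIONED are borrowed from YARN's node states.

```python
def node_state(finish_times, elapsed_s, timeout_s):
    """Return (state, still_running) for a node that is being drained.

    finish_times maps a container id to the number of seconds after the
    decommission request at which it would finish on its own (hypothetical input).
    """
    still_running = sorted(c for c, t in finish_times.items() if t > elapsed_s)
    if still_running and elapsed_s < timeout_s:
        # Node accepts no new containers, but existing ones may finish.
        return "DECOMMISSIONING", still_running
    # Either everything finished or the timeout expired; leftovers are stopped.
    return "DECOMMISSIONED", still_running


containers = {"c1": 30, "c2": 120}
print(node_state(containers, 60, 300))
print(node_state(containers, 150, 300))
print(node_state(containers, 100, 90))
```

The third call shows the timeout case: the node is decommissioned even though `c2` had not finished, which is why a generous timeout matters for long running containers.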
  • 32. More Powerful YARN (Contd.)
      ▪ Resource isolation
          – Resource isolation support for disk and network
              • YARN-2619 (disk), YARN-2140 (network)
              • Containers get a fair share of disk and network resources using cgroups
          – Docker support in LinuxContainerExecutor (YARN-3611)
              • Support for launching Docker containers alongside processes
              • Packaging and resource isolation
                  – Packaging is easier, e.g. TensorFlow
              • Complements YARN’s support for long running services
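As a rough illustration of how an application might opt a container into the Docker runtime, the sketch below builds a container launch environment. It assumes the environment variable names `YARN_CONTAINER_RUNTIME_TYPE` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE` used by Hadoop's Docker container documentation; check them against your Hadoop version. The helper name, the image name, and the base environment are hypothetical.

```python
def docker_launch_env(image, base_env=None):
    """Return a copy of base_env with the Docker runtime settings added."""
    env = dict(base_env or {})
    # Assumed variable names; verify against the Hadoop release in use.
    env["YARN_CONTAINER_RUNTIME_TYPE"] = "docker"
    env["YARN_CONTAINER_RUNTIME_DOCKER_IMAGE"] = image
    return env


# "example/tensorflow:latest" is a placeholder image name.
env = docker_launch_env("example/tensorflow:latest", {"APP_MODE": "train"})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

Containers launched without these settings would keep running as ordinary processes, which matches the slide's point that Docker containers run alongside processes.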
  • 33. Docker on YARN & YARN on YARN – YCloud
      [Diagram labels: Hadoop Apps; YARN: MR, Tez, Spark, TensorFlow; YARN: MR, Tez, Spark]
      Can use YARN to test Hadoop!
  • 34. YARN New UI (YARN-3368)
  • 35. Other YARN work planned in Hadoop 3.X
      ▪ Resource profiles (YARN-3926)
          – Users can specify a resource profile name instead of individual resources
          – Resource types are read from a config file
      ▪ YARN federation (YARN-2915)
          – Allows YARN to scale out to tens of thousands of nodes
          – A cluster of clusters which appears as a single cluster to an end user
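The idea of requesting a named profile instead of individual resources can be sketched as a lookup against a profile file. The JSON layout, the profile names, and the `resolve_profile` helper below are all hypothetical; they illustrate the concept and are not the format or API defined by YARN-3926.

```python
import json

# Hypothetical profile definitions, as they might appear in a config file.
PROFILES_JSON = """
{
  "small": {"memory-mb": 1024, "vcores": 1},
  "large": {"memory-mb": 8192, "vcores": 4}
}
"""


def resolve_profile(name, profiles, overrides=None):
    """Expand a profile name into concrete resources, applying any overrides."""
    if name not in profiles:
        raise KeyError(f"unknown resource profile: {name}")
    resources = dict(profiles[name])
    resources.update(overrides or {})
    return resources


profiles = json.loads(PROFILES_JSON)
request = resolve_profile("large", profiles, {"vcores": 2})
print(request)
```

Keeping the definitions in one file means an administrator can resize what "large" means without touching every application that asks for it.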
  • 36. Compatibility & Testing
  • 37. Compatibility
      ▪ Preserves wire-compatibility with Hadoop 2 clients
          ○ Impossible to coordinate upgrading off-cluster Hadoop clients
      ▪ Will support rolling upgrade from Hadoop 2 to Hadoop 3
          ○ Can’t take downtime to upgrade a business-critical cluster
      ▪ Not fully preserving API compatibility!
          ○ Dependency version bumps
          ○ Removal of deprecated APIs and tools
          ○ Shell script rewrite, rework of Hadoop tools scripts
          ○ Incompatible bug fixes
  • 38. Testing and validation
      ▪ Extended alpha → beta → GA plan designed for stabilization
      ▪ EC already has some usage in production (700 nodes at Y! JP)
          – Hortonworks has worked closely with this very large customer
      ▪ Hortonworks is integrating and testing HDP 3
          – Integrating with all components of the HDP stack
          – HDP2 ++ integration tests
      ▪ Cloudera is also testing Hadoop 3 as part of their stack
      ▪ Plans for extensive HDFS EC testing by Hortonworks and Cloudera
      ▪ Happy synergy between 2.8.x and 3.0.x lines
          – They share much of the same code; fixes flow into both
          – Yahoo! deployments based on 2.8.0
  • 39. Summary: What’s new in Apache Hadoop 3.0?
      ▪ Storage optimization
          – HDFS: Erasure codes
      ▪ Improved utilization
          – YARN: Long running services
          – YARN: Scheduling enhancements
      ▪ Additional workloads
          – YARN: Docker & isolation
      ▪ Easier to use
          – New user interface
      ▪ Refactored base
          – Lots of Trunk content
          – JDK8 and newer dependent libraries
  • 40. Thank you! Reminder: BoFs on Thursday

Editor's Notes

  1. Data Trends: from characteristics of the data to data consumption and interaction. According to IBM, every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years. Insight from data is a key competitive differentiator. Open source is evolving and adapting to these trends the fastest. Adopting Hadoop is not a destination but a journey.
  3. First, it enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.
  5. Previously based on unconsumed capacity: if the 70% queue has lots of unconsumed capacity, it is scheduled first. Now you can say that the 30% queue is higher priority.
  6. The original YARN design was not just for batch jobs; we started with that, but the design was general.
  7. Graceful decommissioning: remove nodes gracefully. Useful in the cloud, especially if you are using spot pricing.
  8. App centric (top two left pictures); node centric; resource centric (load vs. capacity, overall and by queue); cluster centric (nodes summary, heatmap of resource usage across nodes).