SlideShare a Scribd company logo
1 of 54
Download to read offline
1© Cloudera, Inc. All rights reserved.
Apache Hadoop 3
Andrew Wang Daniel Templeton
andrew.wang@cloudera.com daniel@cloudera.com
2© Cloudera, Inc. All rights reserved.
Who We Are
Andrew Wang
● HDFS @ Cloudera
● Hadoop PMC Member
● Release Manager for Hadoop 3.0
Daniel Templeton
● YARN @ Cloudera
● Hadoop PMC Member
3© Cloudera, Inc. All rights reserved.
An Abbreviated History of Hadoop Releases
Date Release Major Notes
2007-11-04 0.14.1 First release at the ASF
2011-12-27 1.0.0 Security, HBase support
2012-05-23 2.0.0 YARN, NameNode HA, wire compat
2014-11-18 2.6.0 HDFS encryption, rolling upgrade, node labels
2015-04-21 2.7.0 Truncate, Variable-length blocks, YARN Global Caching,
2017-03-22 2.8.0 Cloud improvement, Azure Data Lake, and etc.
2017-11-17 2.9.0 Stability Improvement
2017-12-13 3.0.0 Java 8, Erasure Coding, S3Guard, YARN Timeline Service
4© Cloudera, Inc. All rights reserved.
Motivation for Hadoop 3
● Upgrade minimum Java version to Java 8
○ Java 7 end-of-life in April 2015
○ Many Java libraries now only support Java 8
● HDFS erasure coding
○ Major feature that refactored core pieces of HDFS
○ Too big to backport to 2.x
● Classpath isolation
○ Potentially impacts all clients
● Other miscellaneous incompatible bugfixes and improvements
○ Hadoop 2.x was branched in 2011
○ 6 years of changes waiting for 3.0
5© Cloudera, Inc. All rights reserved.
Hadoop 3 Status and Release Plan
● After four alphas and one beta, 3.0.0 is out!
● Took close to two years from inception
● 3.0.1 and 3.1.0 are already in progress
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release
Release Date
3.0.0-alpha1 2016-09-03 ✔
3.0.0-alpha2 2017-01-25 ✔
3.0.0-alpha3 2017-05-26 ✔
3.0.0-alpha4 2017-07-07 ✔
3.0.0-beta1 2017-10-03 ✔
3.0.0 GA 2017-12-13 ✔
3.0.1 2017 Mar
6© Cloudera, Inc. All rights reserved.
HDFS & Hadoop Features
7© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
8© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
b1 b2 b3
b1 b2 b3
9© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
b1 b2 b3
b1 b2 b3
3 replicas
3 blocks
3 x 3 = 9 total replicas
9 / 3 = 200% overhead!
10© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
11© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
12© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
3 data blocks 2 parity blocks
3 + 2 = 5 replicas
5 / 3 = 67% overhead!
13© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
3 data blocks 2 parity blocks
3 + 2 = 5 replicas
5 / 3 = 67% overhead!
b1 b2 b10
/bigfoo.csv - 10 block file
p1 p4
10 data blocks 4 parity blocks
10 + 4 = 14 replicas
14 / 10 = 40% overhead!
... ...
14© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
15© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
X
16© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
Read 3 remaining blocks
b3
Run RS to recover b3
New copy of b3 recovered
X
17© Cloudera, Inc. All rights reserved.
Erasure coding (HDFS-7285)
● Motivation: improve storage efficiency of HDFS
○ ~2x the storage efficiency compared to 3x replication
○ Reduction of overhead from 200% to 40%
● Uses Reed-Solomon(k,m) erasure codes instead of replication
○ Support for multiple erasure coding policies
○ RS(3,2), RS(6,3), RS(10,4)
● Can improves data durability
○ RS(6,3) can tolerate 3 failures
○ RS(10,4) can tolerate 4 failures
● Missing blocks reconstructed from remaining blocks
18© Cloudera, Inc. All rights reserved.
EC implications
● File data is striped across multiple nodes and racks
● Reads and writes are remote and cross-rack
● Reconstruction is network-intensive, reads m blocks cross-rack
● Important to use Intel’s optimized ISA-L for performance
○ 1+ GB/s encode/decode speed, much faster than Java implementation
● Combine data into larger files to avoid an explosion in # replicas
○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
● Works best for archival / cold data use cases
19© Cloudera, Inc. All rights reserved.
EC performance
20© Cloudera, Inc. All rights reserved.
EC performance
21© Cloudera, Inc. All rights reserved.
EC performance
22© Cloudera, Inc. All rights reserved.
Erasure coding status
● Massive development effort by the Hadoop community
○ 20+ contributors from many companies
■ Cloudera, Intel, Hortonworks, Huawei, Y! JP, …
○ 100s of commits over more than three years (started in 2014)
● Erasure coding is ready in 3.0.0 GA!
● Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Ongoing stress / endurance testing to ensure stability at scale
23© Cloudera, Inc. All rights reserved.
● Hadoop leaks lots of dependencies
onto the application’s classpath
○ Known offenders: Guava,
Protobuf, Jackson, Jetty, …
● No separate HDFS client jar means
server jars are leaked
● YARN / MR clients not shaded
● HDFS-6200: Split HDFS client into
separate JAR
● HADOOP-11804: Shaded
hadoop-client dependency
● YARN-6466: Shade the task
umbilical for a clean YARN
container environment (ongoing)
Classpath isolation (HADOOP-11656)
24© Cloudera, Inc. All rights reserved.
Miscellaneous
● Supportability improvements
○ Shell script rewrite
○ Intra-DataNode balancer
○ Move default ports out of the ephemeral range
● Support for multiple Standby NameNodes
● Cloud enhancements
○ Support for Microsoft Azure Data Lake and Aliyun OSS
○ S3 consistency and performance improvements
● Tightened Hadoop compatibility policy
25© Cloudera, Inc. All rights reserved.
YARN Features
26© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
27© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
28© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
29© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
HDFS
Node
Manager
30© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
Spark
History
Server
31© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
Spark
History
Server
?
32© Cloudera, Inc. All rights reserved.
Application Timeline Service v2
● Store for application and system events and data
○ Distributed
○ Scalable
○ Structured Data Model
● Updated in real time
○ Application status
○ Application metrics
○ System metrics
● Fed by resource manager, node manager, and application masters
● REST API
33© Cloudera, Inc. All rights reserved.
Application Timeline Service v2
Resource
Manager
jobs
Application
Timeline
Service
HBase
34© Cloudera, Inc. All rights reserved.
Timeline
Reader
Timeline
Reader
Application Timeline Service v2
Resource
Manager
Timeline
Collecter
HBase Node
Manager
Application
Master
Timeline
Collecter
Timeline
Reader
35© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
36© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
37© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
38© Cloudera, Inc. All rights reserved.
Old YARN UI
39© Cloudera, Inc. All rights reserved.
New YARN UI
● Rich client application
○ Built on Node.js and Ember
● Improved visibility into cluster usage
○ Memory, CPU
○ By queues and applications
○ Sunburst graphs for hierarchical queues
○ NodeManager heatmap
● ATSv2 integration
○ Plot container start/stop events
○ Easy to capture delays in app execution
40© Cloudera, Inc. All rights reserved.
New YARN UI: Cluster Overview
41© Cloudera, Inc. All rights reserved.
New YARN UI: Queues
42© Cloudera, Inc. All rights reserved.
● Before Hadoop 3 memory and CPU are the only managed resources
● Resource Types allows adding new managed resources
○ Countable resources: GPUs, Disks etc.
○ Static resources: Java version, Python version, hardware profile, ...
■ Still in proposal stage
● Resource profiles
○ Similar conceptually to EC2 instance types
○ Capture complex resource request
● DRF for scheduling
● Current virtual CPU cores and memory resources work as before
Resource Types
43© Cloudera, Inc. All rights reserved.
YARN Federation
● YARN scalability
○ Twitter runs a 10k node cluster with fair scheduler
○ Yahoo! runs 4k node cluster with capacity scheduler
● Federation
○ Restrict users to sub-clusters based on policy
○ Scalability to 100k nodes and beyond
○ Independent cluster scheduling
44© Cloudera, Inc. All rights reserved.
YARN Federation
Router
Resource
Manager
Node Manager
Node Manager
Node Manager
Node Manager
Resource
Manager
Node Manager
Node Manager
Node Manager
Node Manager
Policy
Admin
45© Cloudera, Inc. All rights reserved.
Opportunistic Containers
● Scheduler’s job is to keep all resources busy
● Scheduling gaps
○ Nothing to run
○ Resource contention
○ Resource reservations
● Opportunistic containers fill those gaps
○ Requested explicitly
○ Dedicated scheduler
○ Queued at the node managers
○ Scheduled locally when resources are available
○ Preempted when guaranteed containers need to run
● Coming in 2.9 and 3.0
46© Cloudera, Inc. All rights reserved.
Oversubscription
● Resource utilization is typically
low in most clusters (20-50%)
○ Provision for peak usage
● Usage < Allocation
○ Mean Usage = ½ Peak Usage
47© Cloudera, Inc. All rights reserved.
Oversubscription
● Oversubscription
○ Allocate opportunistic containers to use allocated-but-unused resources
○ Jobs automatically use these unless they opt-out
○ Threshold to control aggressiveness of oversubscription
○ Threshold to trigger preemption
● Currently in progress
48© Cloudera, Inc. All rights reserved.
● Long Running Services
○ Slider merging into YARN
○ Docker support
● Scheduler improvements
○ Capacity scheduler
■ Performance and preemption
improvements
■ Online scheduling (“global
scheduler”)
■ Queue management
○ Fair scheduler
■ Performance and preemption
improvements
● High availability improvements
○ Better handling of transient
network issues
○ ZK-store scalability: Limit number
of children under a znode
● MapReduce Native Collector
(MAPREDUCE-2841)
○ Native implementation of the map
output collector
○ Up to 30% faster for
shuffle-intensive jobs
Other YARN Improvements
49© Cloudera, Inc. All rights reserved.
Summary: What’s new in Hadoop 3.0?
● Storage Optimization
○ HDFS: Erasure codes
● Improved Visibility into Cluster Operations
○ YARN: ATSv2
○ YARN: New UI
● Scalability & Multi-tenancy
○ YARN: Federation
● Improved Utilization
○ YARN: Opportunistic Containers
○ YARN: Oversubscription
● Refactor Base
○ Lots of Trunk content
○ JDK8 and newer dependent libraries
50© Cloudera, Inc. All rights reserved.
Compatibility and Testing
51© Cloudera, Inc. All rights reserved.
Compatibility
● Strong feedback from large users on the need for compatibility
● Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
● Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
● Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
52© Cloudera, Inc. All rights reserved.
Testing and Validation
● Cloudera CDH 6 is based on upstream Hadoop 3.0.0
○ Running full test suite
○ Integration of Hadoop 3 with all components in CDH stack
○ Same integration tests used to validate CDH5
● Plans for extensive HDFS EC testing by Cloudera and Intel
● Happy synergy between 2.8.x and 3.0.x lines
○ Shares much of the same code, fixes flow into both
○ Yahoo! doing scale testing of 2.8.0
53© Cloudera, Inc. All rights reserved.
Conclusion
● Hadoop 3.0.0 GA is out!
● Shiny new features
○ HDFS erasure coding
○ Client classpath isolation
○ YARN ATSv2
○ YARN Federation
○ Opportunistic containers and oversubscription
● Great time to get involved in testing and validation
54© Cloudera, Inc. All rights reserved.
Thank you
Andrew Wang Daniel Templeton
andrew.wang@cloudera.com daniel@cloudera.com

More Related Content

What's hot

Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflixgreggulrich
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLYugabyte
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!
 
MariaDB 마이그레이션 - 네오클로바
MariaDB 마이그레이션 - 네오클로바MariaDB 마이그레이션 - 네오클로바
MariaDB 마이그레이션 - 네오클로바NeoClova
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 

What's hot (20)

Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQL
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
MariaDB 마이그레이션 - 네오클로바
MariaDB 마이그레이션 - 네오클로바MariaDB 마이그레이션 - 네오클로바
MariaDB 마이그레이션 - 네오클로바
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 

Similar to Apache Hadoop 3

Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityDinesh Chitlangia
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Big Data Spain
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYWangda Tan
 
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdfCloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdfwchevreuil
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith SharmaNewton Alex
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...Cloudera, Inc.
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...JordanHambleton
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2hdhappy001
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 

Similar to Apache Hadoop 3 (20)

Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
 
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdfCloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
 
DDN Product Update from SC13
DDN Product Update from SC13DDN Product Update from SC13
DDN Product Update from SC13
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 

Recently uploaded (20)

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Apache Hadoop 3

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Hadoop 3 Andrew Wang Daniel Templeton andrew.wang@cloudera.com daniel@cloudera.com
  • 2. 2© Cloudera, Inc. All rights reserved. Who We Are Andrew Wang ● HDFS @ Cloudera ● Hadoop PMC Member ● Release Manager for Hadoop 3.0 Daniel Templeton ● YARN @ Cloudera ● Hadoop PMC Member
  • 3. 3© Cloudera, Inc. All rights reserved. An Abbreviated History of Hadoop Releases Date Release Major Notes 2007-11-04 0.14.1 First release at the ASF 2011-12-27 1.0.0 Security, HBase support 2012-05-23 2.0.0 YARN, NameNode HA, wire compat 2014-11-18 2.6.0 HDFS encryption, rolling upgrade, node labels 2015-04-21 2.7.0 Truncate, Variable-length blocks, YARN Global Caching, 2017-03-22 2.8.0 Cloud improvement, Azure Data Lake, and etc. 2017-11-17 2.9.0 Stability Improvement 2017-12-13 3.0.0 Java 8, Erasure Coding, S3Guard, YARN Timeline Service
  • 4. 4© Cloudera, Inc. All rights reserved. Motivation for Hadoop 3 ● Upgrade minimum Java version to Java 8 ○ Java 7 end-of-life in April 2015 ○ Many Java libraries now only support Java 8 ● HDFS erasure coding ○ Major feature that refactored core pieces of HDFS ○ Too big to backport to 2.x ● Classpath isolation ○ Potentially impacts all clients ● Other miscellaneous incompatible bugfixes and improvements ○ Hadoop 2.x was branched in 2011 ○ 6 years of changes waiting for 3.0
  • 5. 5© Cloudera, Inc. All rights reserved. Hadoop 3 Status and Release Plan ● After four alphas and one beta, 3.0.0 is out! ● Took close to two years from inception ● 3.0.1 and 3.1.0 are already in progress https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release Release Date 3.0.0-alpha1 2016-09-03 ✔ 3.0.0-alpha2 2017-01-25 ✔ 3.0.0-alpha3 2017-05-26 ✔ 3.0.0-alpha4 2017-07-07 ✔ 3.0.0-beta1 2017-10-03 ✔ 3.0.0 GA 2017-12-13 ✔ 3.0.1 2017 Mar
  • 6. 6© Cloudera, Inc. All rights reserved. HDFS & Hadoop Features
  • 7. 7© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 8. 8© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3
  • 9. 9© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3 3 replicas 3 blocks 3 x 3 = 9 total replicas 9 / 3 = 200% overhead!
  • 10. 10© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 11. 11© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2
  • 12. 12© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead!
  • 13. 13© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead! b1 b2 b10 /bigfoo.csv - 10 block file p1 p4 10 data blocks 4 parity blocks 10 + 4 = 14 replicas 14 / 10 = 40% overhead! ... ...
  • 14. 14© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2)
  • 15. 15© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) X
  • 16. 16© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) Read 3 remaining blocks b3 Run RS to recover b3 New copy of b3 recovered X
  • 17. 17© Cloudera, Inc. All rights reserved. Erasure coding (HDFS-7285) ● Motivation: improve storage efficiency of HDFS ○ ~2x the storage efficiency compared to 3x replication ○ Reduction of overhead from 200% to 40% ● Uses Reed-Solomon(k,m) erasure codes instead of replication ○ Support for multiple erasure coding policies ○ RS(3,2), RS(6,3), RS(10,4) ● Can improves data durability ○ RS(6,3) can tolerate 3 failures ○ RS(10,4) can tolerate 4 failures ● Missing blocks reconstructed from remaining blocks
  • 18. 18© Cloudera, Inc. All rights reserved. EC implications ● File data is striped across multiple nodes and racks ● Reads and writes are remote and cross-rack ● Reconstruction is network-intensive, reads m blocks cross-rack ● Important to use Intel’s optimized ISA-L for performance ○ 1+ GB/s encode/decode speed, much faster than Java implementation ● Combine data into larger files to avoid an explosion in # replicas ○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas) ○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas) ● Works best for archival / cold data use cases
  • 19. 19© Cloudera, Inc. All rights reserved. EC performance
  • 20. 20© Cloudera, Inc. All rights reserved. EC performance
  • 21. 21© Cloudera, Inc. All rights reserved. EC performance
  • 22. 22© Cloudera, Inc. All rights reserved. Erasure coding status ● Massive development effort by the Hadoop community ○ 20+ contributors from many companies ■ Cloudera, Intel, Hortonworks, Huawei, Y! JP, … ○ 100s of commits over more than three years (started in 2014) ● Erasure coding is ready in 3.0.0 GA! ● Current focus is on testing and integration efforts ○ Want the complete Hadoop stack to work with HDFS erasure coding enabled ○ Ongoing stress / endurance testing to ensure stability at scale
  • 23. 23© Cloudera, Inc. All rights reserved. ● Hadoop leaks lots of dependencies onto the application’s classpath ○ Known offenders: Guava, Protobuf, Jackson, Jetty, … ● No separate HDFS client jar means server jars are leaked ● YARN / MR clients not shaded ● HDFS-6200: Split HDFS client into separate JAR ● HADOOP-11804: Shaded hadoop-client dependency ● YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing) Classpath isolation (HADOOP-11656)
  • 24. 24© Cloudera, Inc. All rights reserved. Miscellaneous ● Supportability improvements ○ Shell script rewrite ○ Intra-DataNode balancer ○ Move default ports out of the ephemeral range ● Support for multiple Standby NameNodes ● Cloud enhancements ○ Support for Microsoft Azure Data Lake and Aliyun OSS ○ S3 consistency and performance improvements ● Tightened Hadoop compatibility policy
  • 25. 25© Cloudera, Inc. All rights reserved. YARN Features
  • 26. 26© Cloudera, Inc. All rights reserved. Job History Server Resource Manager
  • 27. 27© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs
  • 28. 28© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server
  • 29. 29© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server HDFS Node Manager
  • 30. 30© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server
  • 31. 31© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server ?
  • 32. 32© Cloudera, Inc. All rights reserved. Application Timeline Service v2 ● Store for application and system events and data ○ Distributed ○ Scalable ○ Structured Data Model ● Updated in real time ○ Application status ○ Application metrics ○ System metrics ● Fed by resource manager, node manager, and application masters ● REST API
  • 33. 33© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Resource Manager jobs Application Timeline Service HBase
  • 34. 34© Cloudera, Inc. All rights reserved. Timeline Reader Timeline Reader Application Timeline Service v2 Resource Manager Timeline Collecter HBase Node Manager Application Master Timeline Collecter Timeline Reader
  • 35. 35© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 36. 36© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 37. 37© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 38. 38© Cloudera, Inc. All rights reserved. Old YARN UI
  • 39. 39© Cloudera, Inc. All rights reserved. New YARN UI ● Rich client application ○ Built on Node.js and Ember ● Improved visibility into cluster usage ○ Memory, CPU ○ By queues and applications ○ Sunburst graphs for hierarchical queues ○ NodeManager heatmap ● ATSv2 integration ○ Plot container start/stop events ○ Easy to capture delays in app execution
  • 40. 40© Cloudera, Inc. All rights reserved. New YARN UI: Cluster Overview
  • 41. 41© Cloudera, Inc. All rights reserved. New YARN UI: Queues
  • 42. 42© Cloudera, Inc. All rights reserved. ● Before Hadoop 3 memory and CPU are the only managed resources ● Resource Types allows adding new managed resources ○ Countable resources: GPUs, Disks etc. ○ Static resources: Java version, Python version, hardware profile, ... ■ Still in proposal stage ● Resource profiles ○ Similar conceptually to EC2 instance types ○ Capture complex resource request ● DRF for scheduling ● Current virtual CPU cores and memory resources work as before Resource Types
  • 43. 43© Cloudera, Inc. All rights reserved. YARN Federation ● YARN scalability ○ Twitter runs a 10k node cluster with fair scheduler ○ Yahoo! runs 4k node cluster with capacity scheduler ● Federation ○ Restrict users to sub-clusters based on policy ○ Scalability to 100k nodes and beyond ○ Independent cluster scheduling
  • 44. 44© Cloudera, Inc. All rights reserved. YARN Federation Router Resource Manager Node Manager Node Manager Node Manager Node Manager Resource Manager Node Manager Node Manager Node Manager Node Manager Policy Admin
  • 45. 45© Cloudera, Inc. All rights reserved. Opportunistic Containers ● Scheduler’s job is to keep all resources busy ● Scheduling gaps ○ Nothing to run ○ Resource contention ○ Resource reservations ● Opportunistic containers fill those gaps ○ Requested explicitly ○ Dedicated scheduler ○ Queued at the node managers ○ Scheduled locally when resources are available ○ Preempted when guaranteed containers need to run ● Coming in 2.9 and 3.0
  • 46. 46© Cloudera, Inc. All rights reserved. Oversubscription ● Resource utilization is typically low in most clusters (20-50%) ○ Provision for peak usage ● Usage < Allocation ○ Mean Usage = ½ Peak Usage
  • 47. 47© Cloudera, Inc. All rights reserved. Oversubscription ● Oversubscription ○ Allocate opportunistic containers to use allocated-but-unused resources ○ Jobs automatically use these unless they opt-out ○ Threshold to control aggressiveness of oversubscription ○ Threshold to trigger preemption ● Currently in progress
  • 48. 48© Cloudera, Inc. All rights reserved. ● Long Running Services ○ Slider merging into YARN ○ Docker support ● Scheduler improvements ○ Capacity scheduler ■ Performance and preemption improvements ■ Online scheduling (“global scheduler”) ■ Queue management ○ Fair scheduler ■ Performance and preemption improvements ● High availability improvements ○ Better handling of transient network issues ○ ZK-store scalability: Limit number of children under a znode ● MapReduce Native Collector (MAPREDUCE-2841) ○ Native implementation of the map output collector ○ Up to 30% faster for shuffle-intensive jobs Other YARN Improvements
  • 49. 49© Cloudera, Inc. All rights reserved. Summary: What’s new in Hadoop 3.0? ● Storage Optimization ○ HDFS: Erasure codes ● Improved Visibility into Cluster Operations ○ YARN: ATSv2 ○ YARN: New UI ● Scalability & Multi-tenancy ○ YARN: Federation ● Improved Utilization ○ YARN: Opportunistic Containers ○ YARN: Oversubscription ● Refactor Base ○ Lots of Trunk content ○ JDK8 and newer dependent libraries
  • 50. 50© Cloudera, Inc. All rights reserved. Compatibility and Testing
  • 51. 51© Cloudera, Inc. All rights reserved. Compatibility ● Strong feedback from large users on the need for compatibility ● Preserves wire-compatibility with Hadoop 2 clients ○ Impossible to coordinate upgrading off-cluster Hadoop clients ● Will support rolling upgrade from Hadoop 2 to Hadoop 3 ○ Can’t take downtime to upgrade a business-critical cluster ● Not fully preserving API compatibility! ○ Dependency version bumps ○ Removal of deprecated APIs and tools ○ Shell script rewrite, rework of Hadoop tools scripts ○ Incompatible bug fixes
  • 52. 52© Cloudera, Inc. All rights reserved. Testing and Validation ● Cloudera CDH 6 is based on upstream Hadoop 3.0.0 ○ Running full test suite ○ Integration of Hadoop 3 with all components in CDH stack ○ Same integration tests used to validate CDH5 ● Plans for extensive HDFS EC testing by Cloudera and Intel ● Happy synergy between 2.8.x and 3.0.x lines ○ Shares much of the same code, fixes flow into both ○ Yahoo! doing scale testing of 2.8.0
  • 53. 53© Cloudera, Inc. All rights reserved. Conclusion ● Hadoop 3.0.0 GA is out! ● Shiny new features ○ HDFS erasure coding ○ Client classpath isolation ○ YARN ATSv2 ○ YARN Federation ○ Opportunistic containers and oversubscription ● Great time to get involved in testing and validation
  • 54. 54© Cloudera, Inc. All rights reserved. Thank you Andrew Wang Daniel Templeton andrew.wang@cloudera.com daniel@cloudera.com