Copyright©2016 NTT corp. All Rights Reserved.
What’s new in
Hadoop Common and HDFS
@Hadoop Summit Tokyo 2016
Tsuyoshi Ozawa
NTT Software Innovation Center
2016/10/26
• Tsuyoshi Ozawa
• Researcher & Engineer @ NTT
Twitter: @oza_x86_64
• Apache Hadoop Committer and PMC member
• “Introduction to Hadoop, 2nd Edition” (Japanese), Chapter 22 (YARN)
• Online article: gihyo.jp, “Why and How does Hadoop work?”
About me
• What’s new in Hadoop 3 Common and HDFS?
• Build
• Compiling source code with JDK 8
• Common
• Better Library Management
• Client-Side Classpath Isolation
• Dependency Upgrade
• Support for Azure Data Lake Storage
• Shell script rewrite
• metrics2 sink plugin for Apache Kafka HADOOP-10949
• HDFS
• Erasure Coding Phase 1 HADOOP-11264
• MR, YARN -> Junping will talk!
Agenda
Build
• We upgraded the minimum JDK to JDK 8
• HADOOP-11858
• Oracle JDK 7 reached EoL in April 2015!
• Moving forward to use new features of JDK 8
• Hadoop 2.6.x
• JDK 6, 7, 8 or later
• Hadoop 2.7.x/2.8.x/2.9.x
• JDK 7, 8 or later
• Hadoop 3.0.x
• JDK 8 or later
Apache Hadoop 3.0.0 runs on JDK 8 or later
Common
• Jersey: 1.9 to 1.19
• A root element whose content is an empty collection is now rendered as an empty object ({}) instead of null.
• grizzly-http-servlet: 2.1.2 to 2.2.21
• Guice: 3.0 to 4.0
• cglib: 2.2 to 3.2.0
• asm: 3.2 to 5.0.4
Dependency Upgrade
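A client that parses JSON from a Hadoop REST endpoint may want to accept both the old and the new form of an empty collection. The following is a minimal sketch, not taken from the talk: the function name `extract_apps` and the field names `apps` and `app` are illustrative assumptions.

```python
import json


def extract_apps(response_body):
    """Return the list of apps, tolerating both null and {} for an empty root.

    The field names "apps" and "app" are assumptions used for illustration.
    """
    payload = json.loads(response_body)
    # Before the upgrade an empty collection was null; afterwards it is {}.
    apps = payload.get("apps") or {}
    return apps.get("app", [])


old_style = '{"apps": null}'
new_style = '{"apps": {}}'
print(extract_apps(old_style), extract_apps(new_style))
```

Normalizing `null` to an empty object in one place keeps the rest of the client independent of which Jersey version the server runs.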
Client-side classpath isolation
Problem
• Application code’s dependencies can conflict with Hadoop’s own
Solution
• Separate the server-side jar from the client-side jar
• As with hbase-client, dependencies are shaded
HADOOP-11656/HADOOP-13070
[Figure: Before, the Hadoop client and server ship in a single jar file with older commons, which conflicts with user code that uses newer commons. After, user code with newer commons depends on a shaded hadoop-client.]
• The FileSystem API supports various storage systems
• HDFS
• Amazon S3
• Azure Blob Storage
• OpenStack Swift
• Hadoop 3.0.0 officially supports Azure Data Lake Storage
Support for Azure Data Lake Storage
• The CLI scripts have been rewritten!
• To fix bugs (e.g. HADOOP_CONF_DIR is honored only sometimes)
• To introduce new features
E.g.
• To launch daemons, use “{hadoop,yarn,hdfs} --daemon” instead of {hadoop,yarn,hdfs}-daemons.sh
• To print various environment variables, Java options, the classpath, etc., the “{hadoop,yarn,hdfs} --debug” option is supported
• Please check the documentation
• https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
• https://issues.apache.org/jira/browse/HADOOP-9902
Shell script rewrite
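As a rough illustration of the new command shape, the sketch below builds the argument list for a `--daemon` invocation without running it. The helper `daemon_command` is hypothetical, and the `start`/`stop`/`status` actions are an assumption to verify against the Commands Manual linked above.

```python
TOOLS = ("hadoop", "yarn", "hdfs")
ACTIONS = ("start", "stop", "status")  # assumed set of --daemon actions


def daemon_command(tool, action, daemon):
    """Build an argument list such as: hdfs --daemon start namenode."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return [tool, "--daemon", action, daemon]


# The list could be passed to subprocess.run on a node where Hadoop is installed.
print(" ".join(daemon_command("hdfs", "start", "namenode")))
```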
• Metrics System 2 (metrics2) is the collector of daemon metrics
• Hadoop daemon metrics can be sent to Apache Kafka
metrics2 sink plugin for Apache Kafka
[Figure: DataNode, NameNode, and NodeManager metrics flow through Metrics System 2 into the new Apache Kafka sink.]
HDFS
NameNode Multi-Standby
• Before: 1 Active, 1 Standby NameNode
• Need to recover immediately after the Active NN fails
• After: 1 Active, N Standby NameNodes can be chosen
• You can choose the trade-off between machine costs and operation costs
NameNode Multi-Standby
[Figure: Before, one Active NN and one Standby NN. After, one Active NN and several Standby NNs.]
HDFS
Erasure Coding
• Background
• HDFS uses chain replication for higher throughput and strong consistency
• Example: replication factor 3
• Pros
• Simplicity
• Network traffic between the client and the replicas is kept low
• Cons
• High latency
• Storage efficiency is only 33%
Replication: the traditional HDFS way
[Figure: a client and a chain of three DATA1 replicas, with Data and ACK flows.]
• Erasure coding is another way to save storage while keeping fault tolerance
• Used in RAID 5/6
• Uses “parity” instead of “copy” to recover
• Reed-Solomon coding is used
• If data is lost, it is recovered using the inverse matrix
Erasure Coding
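\[
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13}
\end{pmatrix}
\times
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ c_0 \\ c_1 \end{pmatrix}
\]
d_1 to d_4: data bits; c_0, c_1: parity bits
Storing these values instead of only storing data!
4 data bits, 2 parity bits: Reed-Solomon
Source: “A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage”, FAST ’09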
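The idea of recovering from “parity” instead of from a “copy” can be sketched with a single XOR parity, the scheme used by RAID 5. This is a deliberate simplification and not the Reed-Solomon code HDFS uses: it tolerates only one lost block, and the function names are illustrative.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


data = [b"HDFS", b"YARN", b"MRv2"]
parity = xor_parity(data)

# Pretend the second block is lost and rebuild it from the rest.
rebuilt = recover([data[0], data[2]], parity)
print(rebuilt == data[1])
```

Reed-Solomon generalizes this by computing several independent parities, so that more than one lost unit can be recovered.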
• Erasure coding is flexible: the number of data bits and parity bits can be tuned
• Example: 6 data bits, 3 parity bits
• 3-replication vs (6, 3) Reed-Solomon
Effect of Erasure Coding
                              3-replication    (6, 3) Reed-Solomon
Maximum fault tolerance       2                3
Disk usage (N bytes of data)  3N               1.5N
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
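The numbers in this comparison follow from simple arithmetic, sketched below. The function names are illustrative; disk usage is expressed as a multiple of the original data size N.

```python
from fractions import Fraction


def replication(copies):
    """Fault tolerance and disk usage multiplier for n-way replication."""
    return {"fault_tolerance": copies - 1, "disk_usage": Fraction(copies)}


def reed_solomon(data_units, parity_units):
    """Fault tolerance and disk usage multiplier for (data, parity) Reed-Solomon."""
    return {
        "fault_tolerance": parity_units,
        "disk_usage": Fraction(data_units + parity_units, data_units),
    }


for name, scheme in [("3-replication", replication(3)), ("(6, 3) RS", reed_solomon(6, 3))]:
    efficiency = 1 / scheme["disk_usage"]  # share of raw disk holding real data
    print(name, scheme["fault_tolerance"], float(scheme["disk_usage"]), f"{float(efficiency):.0%}")
```

With these definitions, 3-replication stores 3N and survives 2 failures, while (6, 3) Reed-Solomon stores 1.5N and survives 3.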
• 2 approaches
• Striping: split blocks into smaller blocks
• Pros: effective for small files
• Cons: less data locality when reading a block
• Contiguous: create parities from whole blocks
• Pros: better locality
• Cons: smaller files cannot be handled
Possible EC designs in HDFS
[Figure: a row of nine 1MB units for striping and a row of nine 64MB blocks for the contiguous layout, each labeled Data and Parities.]
• According to the fsimage Analysis report, over 90% of files are smaller than the HDFS block size (64MB)
• Figure 3 source: fsimage Analysis
https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
Which is better, striping or contiguous?
[Figure 3 from fsimage Analysis: panels for Cluster 1 and Cluster 3; 1 group: 6 blocks.]
• Start with striping, to deal with smaller files
• Hadoop 3.0.0 implemented Phase 1.1 and Phase 1.2
Apache Hadoop’s decision
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• What’s changed?
• How data is stored in the DataNode
• How metadata is stored in the NameNode
• Client write path
• Client read path
Erasure Coding in HDFS (ver. 2016)
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• Data is split into 1MB units (not 64MB blocks)
• Parity bits are calculated on the client side at write time
• Writes go out in parallel (not chain replication)
How data is stored in HDFS (write path)
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
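The striped write path can be pictured as cutting the byte stream into small units and dealing them round-robin to the data streams. The sketch below shows only that layout step under simplifying assumptions: `stripe_cells` is a hypothetical name, parity calculation is omitted, and the real client logic is described in the design document.

```python
def stripe_cells(data, cell_size=1024 * 1024, data_units=6):
    """Split data into cells and assign them round-robin to data_units streams."""
    streams = [[] for _ in range(data_units)]
    for offset in range(0, len(data), cell_size):
        cell_index = offset // cell_size
        streams[cell_index % data_units].append(data[offset:offset + cell_size])
    return streams


# A tiny cell size keeps the example readable; the slide uses 1MB units.
streams = stripe_cells(b"abcdefghijklmn", cell_size=2, data_units=6)
for index, cells in enumerate(streams):
    print(index, cells)
```

Because consecutive units land on different streams, the client can write and read them in parallel, which is also why data locality is weaker than with contiguous blocks.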
• Data is stored as 9 small blocks (6 data, 3 parities)
• If no data is lost, only the 6 data blocks are read and the parities are never touched
How to retrieve data: (6, 3) Reed-Solomon
[Figure: the client reads 6 data units of 1MB each from DataNodes; 6 data and 3 parities are stored.]
• Pros
• Low latency because of parallel write/read
• Good for small-size files
• Cons
• Requires high network bandwidth between client and server
Network traffic
Workload        3-replication         (6, 3) Reed-Solomon
Read 1 block    1 LN                  1/6 LN + 5/6 RR
Write           1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR
LN: Local Node
LR: Local Rack
RR: Remote Rack
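The entries of this table can be checked with exact fractions. The sketch below encodes the table as given on the slide and sums the traffic per workload; the dictionary layout and function names are illustrative assumptions.

```python
from fractions import Fraction as F

# Traffic per block, in units of one block, copied from the table above.
TRAFFIC = {
    ("read", "3-replication"): {"LN": F(1)},
    ("read", "rs-6-3"): {"LN": F(1, 6), "RR": F(5, 6)},
    ("write", "3-replication"): {"LN": F(1), "LR": F(1), "RR": F(1)},
    ("write", "rs-6-3"): {"LN": F(1, 6), "LR": F(1, 6), "RR": F(7, 6)},
}


def total(workload, scheme):
    """Total data moved, including the local node."""
    return sum(TRAFFIC[(workload, scheme)].values(), F(0))


def off_node(workload, scheme):
    """Data that has to cross the network (everything except LN)."""
    return sum(v for k, v in TRAFFIC[(workload, scheme)].items() if k != "LN")


for key in TRAFFIC:
    print(key, total(*key), off_node(*key))
```

The write totals come out as 3 blocks for 3-replication and 9/6 = 1.5 blocks for (6, 3) Reed-Solomon, matching the disk usage figures, while a read under erasure coding pulls 5/6 of the block over the network instead of none.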
• The write path and the read path have changed!
• How much network traffic?
• How many small files?
• If network traffic is very high, replication seems preferable
• If there is cold data and most of it is small, EC is a good option
Operation Points
• Build
• Upgrade minimum JDK to JDK 8
• Common
• Be careful about dependency management in your project if you write hand-coded MapReduce
• The shell script rewrite makes operation easier
• Kafka metrics2 sink
• New FileSystem backend: Azure Data Lake
• HDFS
• Multiple Standby NameNodes make operation flexible
• Erasure Coding
• More efficient disk usage than replication
• Operational know-how will change!
Summary
• Kai Zheng’s slides are a good reference
• http://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
• HDFS Erasure Coding Design Document
• https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• Fsimage Analysis
• https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
• Hadoop 3.0.0-alpha release notes
• http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoop-common/release/3.0.0-alpha1/CHANGES.3.0.0-alpha1.html
References
• Thanks to all users, contributors, committers, and PMC members of Apache Hadoop!
• Especially, Andrew Wang put great effort into releasing 3.0.0-alpha!
• Thanks to Kota Tsuyuzaki, an OpenStack Swift developer, for reviewing my EC-related slides!
Acknowledgement

More Related Content

What's hot

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
DataWorks Summit/Hadoop Summit
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
DataWorks Summit/Hadoop Summit
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
DataWorks Summit/Hadoop Summit
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
Yahoo Developer Network
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
DataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
DataWorks Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
DataWorks Summit/Hadoop Summit
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
trihug
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
DataWorks Summit
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
Shivaji Dutta
 

What's hot (20)

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 

Viewers also liked

Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
DataWorks Summit/Hadoop Summit
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
NVIDIA
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017
NVIDIA
 

Viewers also liked (7)

Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017
 

Similar to What's new in Hadoop Common and HDFS

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
Erik Krogen
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
Sam Ng
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance
Ceph Community
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
Tsuyoshi OZAWA
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
Andrey Kudryavtsev
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles Shiflett
Jim St. Leger
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
HPCC Systems
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Community
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
DataWorks Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
NAVER D2
 

Similar to What's new in Hadoop Common and HDFS (20)

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles Shiflett
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 

What's new in Hadoop Common and HDFS

  • 1. What’s new in Hadoop Common and HDFS @Hadoop Summit Tokyo 2016
      • Tsuyoshi Ozawa, NTT Software Innovation Center
      • 2016/10/26
      • Copyright©2016 NTT corp. All Rights Reserved.
  • 2. About me
      • Tsuyoshi Ozawa
      • Researcher & Engineer @ NTT, Twitter: @oza_x86_64
      • Apache Hadoop Committer and PMC member
      • “Introduction to Hadoop 2nd Edition” (Japanese), Chapter 22 (YARN)
      • Online article: gihyo.jp “Why and How does Hadoop work?”
  • 3. Agenda
      • What’s new in Hadoop 3 Common and HDFS?
      • Build
          • Compiling source code with JDK 8
      • Common
          • Better Library Management
              • Client-Side Class path Isolation
              • Dependency Upgrade
          • Support for Azure Data Lake Storage
          • Shell script rewrite
          • metrics2 sink plugin for Apache Kafka (HADOOP-10949)
      • HDFS
          • Erasure Coding Phase 1 (HADOOP-11264)
      • MR, YARN -> Junping will talk!
  • 4. Build
  • 5. Apache Hadoop 3.0.0 runs on JDK 8 or later
      • The minimum JDK was upgraded to JDK 8 (HADOOP-11858)
          • Oracle JDK 7 reached end of life in April 2015
          • Moving forward to use the new features of JDK 8
      • Hadoop 2.6.x: JDK 6, 7, 8 or later
      • Hadoop 2.7.x/2.8.x/2.9.x: JDK 7, 8 or later
      • Hadoop 3.0.x: JDK 8 or later
  • 6. Common
  • 7. Dependency Upgrade
      • Jersey: 1.9 to 1.19
          • A root element whose content is an empty collection changes from null to an empty object ({})
      • grizzly-http-servlet: 2.1.2 to 2.2.21
      • Guice: 3.0 to 4.0
      • cglib: 2.2 to 3.2.0
      • asm: 3.2 to 5.0.4
  • 8. Client-side classpath isolation (HADOOP-11656/HADOOP-13070)
      • Problem: the application code's dependencies can conflict with Hadoop's
      • Solution: separate the server-side jar and the client-side jar
          • Like hbase-client, dependencies are shaded
      • Diagram, before: a single jar file holds the Hadoop client and server with older commons, which conflicts with user code that uses newer commons
      • Diagram, after: the shaded hadoop-client sits alongside user code that uses newer commons
  • 9. Support for Azure Data Lake Storage
      • The FileSystem API supports various storage systems
          • HDFS
          • Amazon S3
          • Azure Blob Storage
          • OpenStack Swift
      • 3.0.0 officially supports Azure Data Lake Storage
  • 10. Shell script rewrite
      • The CLI has been renewed!
          • To fix bugs (e.g. HADOOP_CONF_DIR was honored only sometimes)
          • To introduce new features, e.g.:
              • To launch daemons, use the {hadoop,yarn,hdfs} --daemon command instead of {hadoop,yarn,hdfs}-daemons.sh
              • To print various environment variables, Java options, classpath, etc., the {hadoop,yarn,hdfs} --debug option is supported
      • Please check the documents
          • https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
          • https://issues.apache.org/jira/browse/HADOOP-9902
  • 11. metrics2 sink plugin for Apache Kafka
      • Metrics System 2 is a collector of daemon metrics
      • Hadoop’s daemon log can be dumped into Apache Kafka
      • Diagram: DataNode, NameNode, and NodeManager metrics flow through Metrics System 2 into the Apache Kafka sink (new!)
  • 12. HDFS NameNode Multi-Standby
  • 13. NameNode Multi-Standby
      • Before: 1 Active, 1 Standby NameNode
          • Need to recover immediately after the Active NN fails
      • After: 1 Active, N Standby NameNodes can be chosen
          • Operators can trade off machine costs against operation costs
      • Diagram: one Active NN with one Standby NN (before), one Active NN with several Standby NNs (after)
  • 14. HDFS Erasure Coding
  • 15. Replication: the traditional HDFS way
      • Background
          • HDFS uses chain replication for higher throughput and strong consistency
          • Example with replication factor 3 (diagram: the client sends data, DATA1 is stored on three replicas, and ACKs return to the client)
      • Pros
          • Simplicity
          • Network throughput can be suppressed between client and replicas
      • Cons
          • High latency
          • Storage efficiency is only 33%
  • 16. Erasure Coding
      • Erasure coding is another way to save storage while keeping fault tolerance
          • Used in RAID 5/6
          • Uses “parity” instead of “copy” to recover
      • Reed-Solomon coding is used
          • If data is lost, recovery is done with an inverse matrix
      • Example: Reed-Solomon with 4 data bits and 2 parity bits
          \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \end{pmatrix} \times \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix} = \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ c_0 \\ c_1 \end{pmatrix}
          • d_1 to d_4 are the data bits; c_0 and c_1 are the parity bits
          • These six values are stored instead of only the data!
      • Source: FAST ’09, “A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage”
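
The “parity instead of copy” idea from the Erasure Coding slide can be illustrated with a much simpler code than Reed-Solomon. The sketch below is an assumption-laden toy: it uses a single XOR parity block (the RAID 5 style scheme), so it survives only one lost block, and the block contents and function name are made up for illustration. It is not how HDFS computes parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data blocks (hypothetical contents) and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing block 2, then rebuild it from the survivors plus the parity.
lost_index = 2
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])

print(recovered == data[lost_index])
```

Reed-Solomon generalizes this: instead of one XOR row it appends several parity rows to the identity matrix, so that several lost values can be recovered by inverting the surviving rows.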
  • 17. Effect of Erasure Coding
      • Erasure coding is flexible: the numbers of data bits and parity bits can be tuned
          • e.g. 6 data bits, 3 parity bits
      • 3-replication vs (6, 3) Reed-Solomon

                                         3-replication   (6, 3) Reed-Solomon
          Maximum fault tolerance        2               3
          Disk usage (N bytes of data)   3N              1.5N

      • Source: HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
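
The numbers in the 3-replication vs (6, 3) Reed-Solomon comparison follow from simple arithmetic. The helper names below are hypothetical and exist only to show the calculation; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def replication(copies):
    """Disk usage (multiple of the raw data size) and tolerated losses for n-way replication."""
    return {"disk_usage": Fraction(copies), "fault_tolerance": copies - 1}


def reed_solomon(data_units, parity_units):
    """Same two figures for a (data_units, parity_units) Reed-Solomon layout."""
    return {
        "disk_usage": Fraction(data_units + parity_units, data_units),
        "fault_tolerance": parity_units,
    }


for name, scheme in [("3-replication", replication(3)), ("(6, 3) RS", reed_solomon(6, 3))]:
    efficiency = 1 / scheme["disk_usage"]  # share of the disk that holds real data
    print(name, float(scheme["disk_usage"]), scheme["fault_tolerance"], float(efficiency))
```

Storage efficiency is the inverse of disk usage, which is where the 33% figure for 3-replication comes from; (6, 3) Reed-Solomon reaches about 67%.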
  • 18. Possible EC design in HDFS
      • 2 approaches
      • Striping: splitting blocks into smaller blocks
          • Pros: effective for small files
          • Cons: less data locality when reading a block
      • Contiguous: creating parities from whole blocks
          • Pros: better locality
          • Cons: smaller files cannot be handled
      • Diagram: striping shown as nine 1MB units and contiguous as nine 64MB blocks, each divided into data and parities
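
To make the striping approach more concrete, here is a minimal sketch of how a byte offset in a file could map onto striped units. It assumes a plain round-robin layout over 6 data units of 1MB each (the sizes used elsewhere in this deck); the function names are hypothetical and the real HDFS layout logic may differ in detail.

```python
import math

CELL_SIZE = 1024 * 1024  # 1MB striping unit, as in the diagram
DATA_UNITS = 6           # data units per stripe, as in (6, 3) Reed-Solomon


def locate(offset, cell_size=CELL_SIZE, data_units=DATA_UNITS):
    """Map a file offset to (stripe index, data unit index, offset inside the unit)."""
    cell = offset // cell_size
    stripe, unit = divmod(cell, data_units)
    return stripe, unit, offset % cell_size


def units_touched(file_size, cell_size=CELL_SIZE, data_units=DATA_UNITS):
    """How many distinct data units a file of this size spreads over."""
    return min(data_units, math.ceil(file_size / cell_size))


print(locate(0))
print(locate(7 * CELL_SIZE))
print(units_touched(3 * CELL_SIZE))
```

Under these assumptions even a 3MB file is spread over three data units, which is why striping suits small files but gives up the locality of reading one large block from one node.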
  • 19. Which is better, striping or contiguous?
      • According to the fsimage analysis report, over 90% of files are smaller than the HDFS block size, 64MB
      • Figure 3 (labels: 1 group: 6 blocks; Cluster 1; Cluster 3), source: fsimage Analysis https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
  • 20. Apache Hadoop’s decision
      • Start with striping to deal with smaller files
      • Hadoop 3.0.0 implemented Phase 1.1 and Phase 1.2
      • Source: HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 21. Erasure Coding in HDFS (ver. 2016)
      • What’s changed?
          • How data is preserved in the DataNode
          • How metadata is preserved in the NameNode
          • Client write path
          • Client read path
      • Source: HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 22. How to preserve data in HDFS (write path)
      • Data is split into 1MB units (not 64MB blocks)
      • Parity bits are calculated on the client side at write time
      • Units are written in parallel (not by chain replication)
      • Source: HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 23. How to retrieve data: (6, 3) Reed-Solomon
      • Data is read from 9 small blocks (6 data, 3 parities)
      • If no data is lost, the parities are never touched
      • Diagram: the client reads the 6 data units (1MB each) from the DataNodes
  • 24. Network traffic
      • Pros
          • Low latency because of parallel write/read
          • Good for small files
      • Cons
          • Requires high network bandwidth between client and server

          Workload       3-replication         (6, 3) Reed-Solomon
          Read 1 block   1 LN                  1/6 LN + 5/6 RR
          Write          1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR

          LN: Local Node, LR: Local Rack, RR: Remote Rack
  • 25. Operation Points
      • The write path and read path are changed!
          • How much network traffic?
          • How many small files?
      • If network traffic is very high, replication seems preferable
      • If there is cold data and most of it is small, EC is a good option
  • 26. Summary
      • Build
          • Upgrade the minimum JDK to JDK 8
      • Common
          • Be careful about the dependency management of your project if you write hand-coded MapReduce
          • The shell script rewrite makes operation easy
          • Kafka Metrics2 Sink
          • New FileSystem backend: Azure Data Lake
      • HDFS
          • Multiple Standby NameNodes make operation flexible
          • Erasure Coding
              • More efficient disk usage than replication
              • All existing operational know-how will change!
  • 27. References
      • Kai Zheng’s slides are a good reference
          • http://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
      • HDFS Erasure Coding Design Document
          • https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
      • Fsimage Analysis
          • https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
      • Hadoop 3.0.0-alpha Release Note
          • http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoop-common/release/3.0.0-alpha1/CHANGES.3.0.0-alpha1.html
  • 28. Acknowledgement
      • Thanks to all users, contributors, committers, and PMC members of Apache Hadoop!
          • Especially Andrew Wang, who put great effort into releasing 3.0.0-alpha!
      • Thanks to Kota Tsuyuzaki, an OpenStack Swift developer, for reviewing my EC-related slides!