Copyright©2016 NTT corp. All Rights Reserved.
What’s new in
Hadoop Common and HDFS
@Hadoop Summit Tokyo 2016
Tsuyoshi Ozawa
NTT Software Innovation Center
2016/10/26
• Tsuyoshi Ozawa
• Researcher & Engineer @ NTT
Twitter: @oza_x86_64
• Apache Hadoop Committer and PMC member
• “Introduction to Hadoop, 2nd Edition” (Japanese), Chapter 22 (YARN)
• Online article: gihyo.jp “Why and How does Hadoop work?”
About me
• What’s new in Hadoop 3 Common and HDFS?
• Build
• Compiling source code with JDK 8
• Common
• Better Library Management
• Client-Side Classpath Isolation
• Dependency Upgrade
• Support for Azure Data Lake Storage
• Shell script rewrite
• metrics2 sink plugin for Apache Kafka HADOOP-10949
• HDFS
• Erasure Coding Phase 1 HADOOP-11264
• MR, YARN -> Junping will talk!
Agenda
Build
• We upgraded the minimum JDK to JDK 8
• HADOOP-11858
• Oracle JDK 7 reached EoL in April 2015!!
• Moving forward to use the new features of JDK 8
• Hadoop 2.6.x
• JDK 6, 7, 8 or later
• Hadoop 2.7.x/2.8.x/2.9.x
• JDK 7, 8 or later
• Hadoop 3.0.x
• JDK 8 or later
Apache Hadoop 3.0.0 runs on JDK 8 or later
Common
• Jersey: 1.9 to 1.19
• A root element whose content is an empty collection is now
serialized as an empty object ({}) instead of null.
• grizzly-http-servlet: 2.1.2 to 2.2.21
• Guice: 3.0 to 4.0
• cglib: 2.2 to 3.2.0
• asm: 3.2 to 5.0.4
Dependency Upgrade
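The Jersey change above can affect clients that parse JSON responses by hand. A minimal sketch of a parser that tolerates both serializations of an empty collection follows; the key names `apps` and `app` and the function `extract_items` are hypothetical, chosen only for illustration, and the exact response shape of any given endpoint is an assumption.

```python
import json


def extract_items(body, root_key="apps", item_key="app"):
    """Return the list under root_key/item_key, or [] for an empty collection.

    Accepts both forms of an empty collection:
    {"apps": null} (older serialization) and {"apps": {}} (newer one).
    The key names are hypothetical placeholders.
    """
    root = json.loads(body).get(root_key)
    if not root:  # None and {} both mean "no items"
        return []
    return root.get(item_key, [])
```

Treating null and {} the same way keeps such a client working before and after the upgrade.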
Client-side classpath isolation
Problem
• Application code’s dependencies can conflict with Hadoop’s
Solution
• Separating the server-side jar and the client-side jar
• Like hbase-client, dependencies are shaded
HADOOP-11656/HADOOP-13070
[Figure: before, Hadoop client and server ship in a single jar file with older commons, which conflicts with user code that uses newer commons; after, user code with newer commons runs against the shaded hadoop-client]
• The FileSystem API supports various storage systems
• HDFS
• Amazon S3
• Azure Blob Storage
• OpenStack Swift
• 3.0.0 officially supports Azure Data Lake Storage
Support for Azure Data Lake Storage
• The CLI is renewed!
• To fix bugs (e.g. HADOOP_CONF_DIR was honored only sometimes)
• To introduce new features
E.g.
• To launch daemons,
use the “{hadoop,yarn,hdfs} --daemon” command instead of
{hadoop,yarn,hdfs}-daemons.sh
• To print various environment variables, Java options, classpath,
etc., the “{hadoop,yarn,hdfs} --debug” option is supported
• Please check the documents
• https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
• https://issues.apache.org/jira/browse/HADOOP-9902
Shell script rewrite
• Metrics System 2 is the collector of daemon metrics
• Metrics of Hadoop’s daemons can be dumped into
Apache Kafka
metrics2 sink plugin for Apache Kafka
[Figure: DataNode, NameNode, and NodeManager metrics flow through Metrics System 2 into the Apache Kafka sink (new!)]
HDFS
NameNode
Multi-Standby
• Before: 1 Active – 1 Standby NameNode
• Need to recover immediately
after the Active NN fails
• After: 1 Active – N Standby NameNodes can be chosen
• Able to choose the trade-off between
machine costs and operation costs
NameNode Multi-Standby
[Figure: before, one Active NN and one Standby NN; after, one Active NN and four Standby NNs]
HDFS
Erasure Coding
• Background
• HDFS uses chain replication for
higher throughput and strong consistency
• Example: replication factor of 3
• Pros
• Simplicity
• Network traffic between the client and replicas
can be kept low
• Cons
• High latency
• Storage efficiency of only 33%
Replication (the traditional HDFS way)
[Figure: the client sends Data to the first DATA1 replica, and it is forwarded along a chain of three DATA1 replicas; ACK flows back to the client]
• Erasure coding is another way to save storage
while keeping fault tolerance
• Used in RAID 5/6
• Using “parity” instead of “copy” to recover
• Reed-Solomon coding is used
• If data is lost, recovery is done with an inverse matrix
Erasure Coding
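$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13}
\end{pmatrix}
\times
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ c_0 \\ c_1 \end{pmatrix}
$$
Data bits: d_1, d_2, d_3, d_4. Parity bits: c_0, c_1.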
FAST ’09: 7th, A Performance Evaluation and Examination of Open-Source
Erasure Coding Libraries For Storage
Storing these values (data and parity) instead of only storing data!
4 data bits – 2 parity bits Reed-Solomon
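The encode-and-recover idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HDFS codec: it works over the small prime field GF(257) rather than the GF(2^8) arithmetic that real Reed-Solomon implementations typically use, it fills the parity rows with a Cauchy matrix (one common choice that keeps every square submatrix invertible), and the names `generator`, `encode`, `solve_mod`, and `decode` are made up for this example.

```python
P = 257  # prime modulus; an assumption of this sketch (real codecs use GF(2^8))


def generator(k, m):
    """Build the (k+m) x k matrix: identity on top, Cauchy parity rows below."""
    identity = [[int(r == c) for c in range(k)] for r in range(k)]
    # Cauchy entry 1 / (x_i + y_j) with x_i = i and y_j = m + j, inverted mod P.
    parity = [[pow(i + m + j, P - 2, P) for j in range(k)] for i in range(m)]
    return identity + parity


def encode(data, m):
    """Multiply the generator matrix by the data vector (mod P)."""
    g = generator(len(data), m)
    return [sum(a * d for a, d in zip(row, data)) % P for row in g]


def solve_mod(a, y):
    """Solve a * x = y (mod P) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [v] for row, v in zip(a, y)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], P - 2, P)  # modular inverse via Fermat
        aug[col] = [(v * inv) % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def decode(survivors, k, m):
    """Recover the k data symbols from any k surviving codeword symbols.

    survivors maps codeword index -> value and needs at least k entries.
    """
    g = generator(k, m)
    idx = sorted(survivors)[:k]
    return solve_mod([g[i] for i in idx], [survivors[i] for i in idx])


data = [10, 20, 30, 40]
code = encode(data, m=2)                      # 4 data symbols + 2 parities
survivors = {i: v for i, v in enumerate(code) if i not in (1, 3)}
recovered = decode(survivors, k=4, m=2)       # two symbols were lost
```

Solving the linear system formed by the surviving rows is the "inverse matrix" step: any 4 of the 6 stored values are enough to rebuild the original data.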
• Erasure coding is flexible:
the number of data bits and parity bits can be tuned
• 6 data bits, 3 parity bits
• 3-replication vs (6, 3) Reed-Solomon
Effect of Erasure Coding
                              3-replication   (6, 3) Reed-Solomon
Maximum fault tolerance       2               3
Disk usage (N bytes of data)  3N              1.5N
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
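The figures in the comparison follow from simple arithmetic, sketched below. The function names `replication` and `reed_solomon` are illustrative only.

```python
from fractions import Fraction


def replication(copies):
    """Fault tolerance and disk usage factor for n-way replication."""
    return {
        "tolerated_failures": copies - 1,      # one copy must survive
        "disk_usage_factor": Fraction(copies),  # N bytes stored `copies` times
    }


def reed_solomon(k, m):
    """Fault tolerance and disk usage factor for (k, m) Reed-Solomon."""
    return {
        "tolerated_failures": m,                 # any m of the k+m units may be lost
        "disk_usage_factor": Fraction(k + m, k),  # k data units plus m parity units
    }


three_rep = replication(3)   # tolerates 2 failures, uses 3N
rs_6_3 = reed_solomon(6, 3)  # tolerates 3 failures, uses 1.5N
```

The inverse of the disk usage factor is the storage efficiency: 1/3 (about 33%) for 3-replication and 2/3 (about 67%) for (6, 3) Reed-Solomon.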
• 2 approaches
• Striping: splitting blocks into smaller blocks
• Pros: Effective for small files
• Cons: Less data locality when reading a block
• Contiguous: creating parities from whole blocks
• Pros: Better locality
• Cons: Small files cannot be handled
Possible EC design in HDFS
[Figure: striping shown as nine 1MB units and contiguous as nine 64MB units, each row labeled Data and Parities]
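A minimal sketch of the striping layout follows, assuming data is dealt out round-robin in fixed-size pieces across the data blocks. The names `stripe` and `unstripe` are hypothetical, and parity generation is left out.

```python
def stripe(data, k=6, cell=1024 * 1024):
    """Deal `data` out round-robin, `cell` bytes at a time, over k data blocks."""
    blocks = [[] for _ in range(k)]
    for n, offset in enumerate(range(0, len(data), cell)):
        blocks[n % k].append(data[offset:offset + cell])
    return blocks


def unstripe(blocks):
    """Reassemble the original byte string by reading the blocks row by row."""
    out = []
    rounds = max(len(b) for b in blocks)
    for r in range(rounds):
        for b in blocks:
            if r < len(b):
                out.append(b[r])
    return b"".join(out)


payload = bytes(range(20))
layout = stripe(payload, k=6, cell=2)  # tiny cell size to keep the demo readable
```

Even a file much smaller than a full block is spread over several data blocks this way, which is why striping copes with small files while giving up locality.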
• According to the fsimage analysis report, over 90% of files
are smaller than the HDFS block size (64MB)
• Figure 3 source: fsimage Analysis
https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
Which is better, striping or contiguous?
[Figure 3 from the fsimage analysis: panels for Cluster 1 and Cluster 3; 1 group: 6 blocks]
• Starting with striping, to deal with smaller files
• Hadoop 3.0.0 implemented Phase 1.1 and Phase 1.2
Apache Hadoop’s decision
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• What’s changed?
• How data is stored in DataNodes
• How metadata is stored in the NameNode
• Client Write path
• Client Read path
Erasure Coding in HDFS (ver. 2016)
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• Size of each striped data unit: 1MB (not 64MB)
• Parity bits are calculated on the client side, at write time
• Writes happen in parallel (not chain replication)
How to store data in HDFS (write path)
HDFS Erasure Coding Design Document:
https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• Read 9 small blocks
• If no data is lost, parities are never touched
How to retrieve data: (6, 3) Reed-Solomon
[Figure: the client reads 6 data units of 1MB each from DataNodes; the DataNodes hold 6 data and 3 parities]
• Pros
• Low latency because of parallel write/read
• Good for small-size files
• Cons
• Requires high network bandwidth between client and server
Network traffic
Workload       3-replication         (6, 3) Reed-Solomon
Read 1 block   1 LN                  1/6 LN + 5/6 RR
Write          1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR
LN: Local Node, LR: Local Rack, RR: Remote Rack
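The Reed-Solomon column can be reproduced with a short calculation. The sketch below assumes a placement that the table's figures imply but the slide does not spell out: one of the k + m stripe members sits on the local node, one more on the local rack (counted for writes only), and all others on remote racks. The function name `ec_traffic` is illustrative.

```python
from fractions import Fraction


def ec_traffic(k=6, m=3):
    """Traffic per block of user data, in block units, for striped (k, m) EC.

    Placement is an assumption of this sketch: one stripe member on the
    local node, one on the local rack (writes only), the rest remote.
    """
    unit = Fraction(1, k)  # each stripe member carries 1/k of the data
    read = {"LN": unit, "RR": (k - 1) * unit}            # parities untouched
    write = {"LN": unit, "LR": unit, "RR": (k + m - 2) * unit}
    return read, write


read, write = ec_traffic()
total_written = sum(write.values())  # 1.5 blocks written per block of data
```

Almost all of the traffic crosses racks, which matches the bandwidth concern listed under Cons.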
• The write path and read path are changed!
• How much network traffic?
• How many small files?
• If network traffic is very high,
replication seems to be preferred
• If the data is cold and most files are small, EC
is a good option
Operation Points
• Build
• Upgrade minimum JDK to JDK 8
• Common
• Be careful about the dependency management of your project
if you write hand-coded MapReduce
• The shell script rewrite makes operations easier
• Kafka Metrics2 Sink
• New FileSystem backend: Azure Data Lake
• HDFS
• Multiple standby NameNodes make operations flexible
• Erasure Coding
• More efficient disk usage than replication
• Existing operational know-how will change!
Summary
• Kai Zheng’s slides are a good reference
• http://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
• HDFS Erasure Coding Design Document
• https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
• Fsimage Analysis
• https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
• Hadoop 3.0.0-alpha release notes
• http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoop-common/release/3.0.0-alpha1/CHANGES.3.0.0-alpha1.html
References
• Thanks to all users, contributors, committers, and PMC members of
Apache Hadoop!
• Especially Andrew Wang, who put great effort into releasing
3.0.0-alpha!
• Thanks to Kota Tsuyuzaki, an OpenStack Swift developer,
for reviewing my EC-related slides!
Acknowledgement

More Related Content

What's hot

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...DataWorks Summit/Hadoop Summit
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 

What's hot (20)

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 

Viewers also liked

Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...DataWorks Summit/Hadoop Summit
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseDataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017NVIDIA
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017NVIDIA
 

Viewers also liked (7)

Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
Hortonworks Data Platform and IBM Systems - A Complete Solution for Cognitive...
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017
 

Similar to What's new in Hadoop Common and HDFS

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneErik Krogen
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingSam Ng
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Ceph Community
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformTsuyoshi OZAWA
 
DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansAndrey Kudryavtsev
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettJim St. Leger
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Community
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!DataWorks Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
 

Similar to What's new in Hadoop Common and HDFS (20)

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
DUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution PlansDUG'20: 13 - HPE’s DAOS Solution Plans
DUG'20: 13 - HPE’s DAOS Solution Plans
 
DPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles ShiflettDPDK Summit 2015 - Aspera - Charles Shiflett
DPDK Summit 2015 - Aspera - Charles Shiflett
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 

More from DataWorks Summit/Hadoop Summit

Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
What's new in Hadoop Common and HDFS

  • 1. Copyright©2016 NTT corp. All Rights Reserved. What’s new in Hadoop Common and HDFS @Hadoop Summit Tokyo 2016 Tsuyoshi Ozawa NTT Software Innovation Center 2016/10/26
  • 2. About me
      • Tsuyoshi Ozawa
      • Researcher and engineer at NTT (Twitter: @oza_x86_64)
      • Apache Hadoop committer and PMC member
      • “Introduction to Hadoop, 2nd Edition” (Japanese), Chapter 22 (YARN)
      • Online article on gihyo.jp: “Why and How does Hadoop work?”
  • 3. Agenda: What’s new in Hadoop 3 Common and HDFS?
      • Build
          • Compiling source code with JDK 8
      • Common
          • Better library management
          • Client-side classpath isolation
          • Dependency upgrade
          • Support for Azure Data Lake Storage
          • Shell script rewrite
          • metrics2 sink plugin for Apache Kafka (HADOOP-10949)
      • HDFS
          • Erasure Coding Phase 1 (HADOOP-11264)
      • MR, YARN -> Junping will talk!
  • 4. Build
  • 5. Apache Hadoop 3.0.0 runs on JDK 8 or later
      • The minimum JDK was upgraded to JDK 8 (HADOOP-11858)
          • Oracle JDK 7 reached EoL in April 2015
          • Hadoop can move forward and use the new features of JDK 8
      • Hadoop 2.6.x: JDK 6, 7, 8 or later
      • Hadoop 2.7.x/2.8.x/2.9.x: JDK 7, 8 or later
      • Hadoop 3.0.x: JDK 8 or later
  • 6. Common
  • 7. Dependency Upgrade
      • Jersey: 1.9 to 1.19
          • A root element whose content is an empty collection is now an empty object ({}) instead of null
      • grizzly-http-servlet: 2.1.2 to 2.2.21
      • Guice: 3.0 to 4.0
      • cglib: 2.2 to 3.2.0
      • asm: 3.2 to 5.0.4
  • 8. Client-side classpath isolation (HADOOP-11656/HADOOP-13070)
      • Problem: the application's dependencies can conflict with Hadoop's
          • Diagram (before): Hadoop client and server ship as a single jar file with an older commons library, which conflicts with the newer commons library used by user code
      • Solution: separate the server-side jar from the client-side jar
          • Like hbase-client, the client's dependencies are shaded
          • Diagram (after): user code with a newer commons library runs against the shaded hadoop-client
  • 9. Support for Azure Data Lake Storage
      • The FileSystem API supports various storage systems
          • HDFS
          • Amazon S3
          • Azure Blob Storage
          • OpenStack Swift
      • 3.0.0 officially supports Azure Data Lake Storage
  • 10. Shell script rewrite
      • The CLI has been rewritten
          • To fix bugs (e.g. HADOOP_CONF_DIR was honored only sometimes)
          • To introduce new features, e.g.
              • To launch daemons, use the {hadoop,yarn,hdfs} --daemon command instead of {hadoop,yarn,hdfs}-daemons.sh
              • To print various environment variables, Java options, the classpath, etc., use the {hadoop,yarn,hdfs} --debug option
      • Please check the documentation
          • https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
          • https://issues.apache.org/jira/browse/HADOOP-9902
  • 11. metrics2 sink plugin for Apache Kafka
      • Metrics System 2 is the collector of daemon metrics
      • Metrics from Hadoop daemons can now be dumped into Apache Kafka
      • Diagram: DataNode, NameNode, and NodeManager metrics -> Metrics System 2 -> Apache Kafka sink (new)
  • 12. HDFS NameNode Multi-Standby
  • 13. NameNode Multi-Standby
      • Before: 1 active and 1 standby NameNode
          • Operators need to recover immediately after the active NN fails
      • After: 1 active and N standby NameNodes, where N can be chosen
          • Operators can choose the trade-off between machine costs and operation costs
  • 14. HDFS Erasure Coding
  • 15. Replication, the traditional HDFS way
      • Background
          • HDFS uses chain replication for higher throughput and strong consistency
          • Diagram (replication factor 3): the client sends DATA1 to the first replica, the data is forwarded along the chain of three replicas, and the ACK is returned to the client
      • Pros
          • Simplicity
          • Network traffic between the client and the replicas is kept low
      • Cons
          • High latency
          • Storage efficiency is only 33%
  • 16. Erasure Coding
      • Erasure coding is another way to save storage while keeping fault tolerance
          • Used in RAID 5/6
          • Uses “parity” instead of “copies” to recover data
      • Reed-Solomon coding is used
          • If data is lost, it is recovered with the inverse matrix
      • Example: Reed-Solomon with 4 data bits and 2 parity bits. The values on the right-hand side (data bits d and parity bits c) are stored instead of only the data:

        \[
        \begin{pmatrix}
        1 & 0 & 0 & 0 \\
        0 & 1 & 0 & 0 \\
        0 & 0 & 1 & 0 \\
        0 & 0 & 0 & 1 \\
        X_{00} & X_{01} & X_{02} & X_{03} \\
        X_{10} & X_{11} & X_{12} & X_{13}
        \end{pmatrix}
        \times
        \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
        =
        \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ c_0 \\ c_1 \end{pmatrix}
        \]

      • Source: “A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage”, FAST ’09
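The matrix view on this slide can be tried out with a small sketch. This is not the HDFS codec: it assumes arithmetic in the prime field GF(257) and Cauchy parity rows purely to keep the example short, whereas production Reed-Solomon codecs typically work in GF(2^8). All function names are hypothetical. It encodes 4 data values with 2 parities, drops two values, and recovers them by solving the linear system formed by the surviving rows.

```python
# Illustrative only: the prime field GF(257) and the Cauchy parity rows are
# assumptions made for this sketch, not what HDFS ships.
P = 257


def inv(a):
    """Multiplicative inverse modulo the prime P (Fermat's little theorem)."""
    return pow(a, P - 2, P)


def generator(k, m):
    """(k + m) x k matrix: identity rows on top, m Cauchy parity rows below."""
    identity = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
    parity = [[inv(k + i - j) for j in range(k)] for i in range(m)]
    return identity + parity


def encode(data, m):
    """Return the k data values followed by m parity values."""
    g = generator(len(data), m)
    return [sum(x * d for x, d in zip(row, data)) % P for row in g]


def solve(a, b):
    """Solve a * x = b modulo P by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [v] for row, v in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = inv(aug[col][col])
        aug[col] = [x * scale % P for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [(x - factor * y) % P for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def recover(survivors, k, m):
    """Rebuild the k data values from any k surviving {index: value} entries."""
    g = generator(k, m)
    chosen = sorted(survivors.items())[:k]
    rows = [g[i] for i, _ in chosen]
    values = [v for _, v in chosen]
    return solve(rows, values)


data = [10, 20, 30, 40]
stored = encode(data, m=2)  # 4 data values followed by 2 parities
survivors = {i: v for i, v in enumerate(stored) if i not in (1, 3)}
restored = recover(survivors, k=4, m=2)  # the 2nd and 4th values were lost
```

A Cauchy matrix is used for the parity rows because any 4 of the 6 rows then remain invertible, which is the property that lets the code tolerate the loss of any 2 values.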
  • 17. Effect of Erasure Coding
      • Erasure coding is flexible: the numbers of data bits and parity bits can be tuned
          • e.g. 6 data bits, 3 parity bits
      • 3-replication vs (6, 3) Reed-Solomon

                                          3-replication   (6, 3) Reed-Solomon
          Maximum fault tolerance         2               3
          Disk usage (N bytes of data)    3N              1.5N

      • HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
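The comparison on this slide follows from simple arithmetic, sketched below. The function names are hypothetical. The sketch assumes that n-way replication survives n - 1 lost copies and that a (k, m) code survives m lost blocks.

```python
from fractions import Fraction


def replication(n):
    """Cost and tolerance of keeping n full copies of the data."""
    return {
        "disk_usage": Fraction(n),        # multiples of the raw data size
        "fault_tolerance": n - 1,         # copies that may be lost
        "efficiency": Fraction(1, n),     # useful fraction of the stored bytes
    }


def reed_solomon(k, m):
    """Cost and tolerance of k data blocks protected by m parity blocks."""
    return {
        "disk_usage": Fraction(k + m, k),
        "fault_tolerance": m,
        "efficiency": Fraction(k, k + m),
    }


three_way = replication(3)
rs_6_3 = reed_solomon(6, 3)
```

With these definitions, `three_way` uses 3 times the raw size at one-third efficiency, while `rs_6_3` uses 1.5 times the raw size at two-thirds efficiency and tolerates one more failure.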
  • 18. Possible EC design in HDFS: 2 approaches
      • Striping: splitting blocks into smaller blocks (diagram: 1 MB data and parity units)
          • Pros: effective for small files
          • Cons: less data locality when reading a block
      • Contiguous: creating parities from whole blocks (diagram: 64 MB data and parity blocks)
          • Pros: better locality
          • Cons: smaller files cannot be handled
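To make the striping layout concrete, here is a minimal sketch of how a byte offset in a file could map to a data block under striping. The 1 MB cell size and the 6 data blocks are taken from the deck's (6, 3) example. The round-robin placement, the function name, and the constants are assumptions for illustration, not HDFS internals.

```python
CELL_SIZE = 1024 * 1024  # 1 MB striping unit, as in the slide's diagram
DATA_BLOCKS = 6          # data blocks per group in the (6, 3) example


def locate(offset, cell_size=CELL_SIZE, data_blocks=DATA_BLOCKS):
    """Map a file offset to (data block index, offset inside that block).

    Cells are assumed to be assigned to the data blocks round-robin, so
    consecutive 1 MB cells land on different blocks.
    """
    cell = offset // cell_size
    stripe, block_index = divmod(cell, data_blocks)
    offset_in_block = stripe * cell_size + offset % cell_size
    return block_index, offset_in_block


first_cell = locate(0)
second_cell = locate(CELL_SIZE)
second_stripe = locate(6 * CELL_SIZE + 5)
```

Because even a file of a few megabytes is spread over several blocks, small files still get parity protection. The cost is that a sequential read touches many DataNodes, which is the locality drawback listed on the slide.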
  • 19. Which is better, striping or contiguous?
      • According to the fsimage analysis report, over 90% of files are smaller than the HDFS block size (64 MB)
      • Figure 3 (panels: Cluster 1, Cluster 3; 1 group: 6 blocks), source: fsimage Analysis https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
  • 20. Apache Hadoop’s decision
      • Start with striping to deal with smaller files
      • Hadoop 3.0.0 implemented Phase 1.1 and Phase 1.2
      • HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 21. Erasure Coding in HDFS (ver. 2016)
      • What’s changed?
          • How data is stored in DataNodes
          • How metadata is stored in the NameNode
          • Client write path
          • Client read path
      • HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 22. How to preserve data in HDFS (write path)
      • Data is written in 1 MB units (not 64 MB blocks)
      • Parity bits are calculated on the client side at write time
      • Blocks are written in parallel (not by chain replication)
      • HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
  • 23. How to retrieve data, (6, 3) Reed-Solomon
      • Data is spread over 9 small (1 MB) blocks on different DataNodes: 6 data and 3 parities
      • The client reads the 6 data blocks
      • If no data is lost, the parities are never touched
  • 24. Network traffic
      • Pros
          • Low latency because of parallel write/read
          • Good for small files
      • Cons
          • Requires high network bandwidth between client and server

          Workload       3-replication         (6, 3) Reed-Solomon
          Read 1 block   1 LN                  1/6 LN + 5/6 RR
          Write          1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR

          LN: Local Node, LR: Local Rack, RR: Remote Rack
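The table can be summed up to see where the extra bandwidth goes. The sketch below only adds the coefficients from the slide; the dictionary layout and function names are hypothetical, and "off-node" here simply means everything that is not LN.

```python
from fractions import Fraction as F

# Block transfers per workload, copied from the slide's table.
TRAFFIC = {
    ("read", "3-replication"): {"LN": F(1)},
    ("read", "(6, 3) Reed-Solomon"): {"LN": F(1, 6), "RR": F(5, 6)},
    ("write", "3-replication"): {"LN": F(1), "LR": F(1), "RR": F(1)},
    ("write", "(6, 3) Reed-Solomon"): {"LN": F(1, 6), "LR": F(1, 6), "RR": F(7, 6)},
}


def total(workload, scheme):
    """Total block transfers, wherever they go."""
    return sum(TRAFFIC[(workload, scheme)].values())


def off_node(workload, scheme):
    """Block transfers that leave the local node (LR and RR)."""
    return sum(v for place, v in TRAFFIC[(workload, scheme)].items() if place != "LN")


write_totals = (total("write", "3-replication"), total("write", "(6, 3) Reed-Solomon"))
read_off_node = (off_node("read", "3-replication"), off_node("read", "(6, 3) Reed-Solomon"))
```

A write moves 3 blocks' worth of data under 3-replication and 1.5 under (6, 3) Reed-Solomon, matching the disk usage figures. A read, however, goes from fully local to 5/6 remote-rack traffic, which is why the deck lists network bandwidth as the main cost.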
  • 25. Operation Points
      • The write path and read path have changed, so check:
          • How much network traffic is there?
          • How many small files are there?
      • If network traffic is very high, replication seems preferable
      • If there is cold data and most of it is small, EC is a good option
  • 26. Summary
      • Build
          • The minimum JDK is upgraded to JDK 8
      • Common
          • Be careful about the dependency management of your project if you write hand-coded MapReduce
          • The shell script rewrite makes operation easier
          • Kafka metrics2 sink
          • New FileSystem backend: Azure Data Lake
      • HDFS
          • Multiple standby NameNodes make operation flexible
          • Erasure Coding
              • More efficient disk usage than replication
              • Operational know-how will change!
  • 27. References
      • Kai Zheng’s slides are a good reference: http://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
      • HDFS Erasure Coding Design Document: https://issues.apache.org/jira/secure/attachment/12697210/HDFSErasureCodingDesign-20150206.pdf
      • Fsimage Analysis: https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf
      • Hadoop 3.0.0-alpha release notes: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoop-common/release/3.0.0-alpha1/CHANGES.3.0.0-alpha1.html
  • 28. Acknowledgement
      • Thanks to all users, contributors, committers, and PMC members of Apache Hadoop!
      • Especially, Andrew Wang made a great effort to release 3.0.0-alpha!
      • Thanks to Kota Tsuyuzaki, an OpenStack Swift developer, for reviewing my EC-related slides!