A First Look at HCFS
Introduction to
Hadoop Compatible File System
Jazz Yao-Tsung Wang
Co-founder of Hadoop.TW
https://fb.com/groups/hadoop.tw
2017-01-21 Hadoop.TW & GCPUG.TW Meetup #1 2017
HELLO!
I am Jazz Wang
Co-Founder of Hadoop.TW.
Hadoop Evangelist since 2008.
Open Source Promoter. System Admin (Ops).
You can find me at @jazzwang_tw or
https://fb.com/groups/hadoop.tw ,
https://forum.hadoop.tw
1. What is HCFS?
Let’s start with a brief introduction to Apache Hadoop
Apache Hadoop from 0.x to 1.x
[Diagram: Master and Workers #1 to #3. Computation Layer: MapReduce ( JobTracker on the master, a TaskTracker on every node ). Storage Layer: HDFS ( NameNode on the master, a DataNode on every node ).]
Apache Hadoop from 2.x to 3.x
[Diagram: Master and Workers #1 to #3. Computation Layer: YARN ( ResourceManager on the master, a NodeManager on every node, work runs in Containers ). Storage Layer: HDFS ( NameNode on the master, a DataNode on every node ).]
Needs / Trends:
Hadoop on the Cloud
http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
Why Hadoop on the Cloud ?
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
Why might you need HCFS ...
https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086&comment_tracking={%22tn%22%3A%22R%22}
Spark / Hive / Impala ...
https://aws.amazon.com/lambda/
https://cloud.google.com/functions/
http://www.forbes.com/sites/janakirammsv/2016/02/09/google-brings-serverless-computing-to-its-cloud-platform/#76e1aa9425b8
Docker
Microservice
Serverless
NoOps !?!
$$$
[Diagram: Master and Workers #1 to #3. Computation Layer: YARN ( ResourceManager on the master, a NodeManager on every node ). Storage Layer: HCFS ( in place of HDFS ).]
What is HCFS ?
Hadoop Compatible File System
Examples: Windows Azure Blob, AWS S3, Google Cloud Storage, CephFS
HCFS implementations - Cloud Storage Connector ( for Public Cloud Provider )
https://wiki.apache.org/hadoop/HCFS
Storage | URI scheme | Hadoop version | References
AWS S3 | s3:// | Hadoop 0.10 ~ Hadoop 2.7 | https://wiki.apache.org/hadoop/AmazonS3 , http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/
AWS S3 | s3n:// | Hadoop 0.18 ~ Hadoop 2.6 | ( same as s3:// )
AWS S3 | s3a:// | Hadoop 2.7+ | ( same as s3:// )
AWS EMRFS | ?? | 3rd party | http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
Windows Azure Storage Blob | wasb:// | Hadoop 2.7+ | http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/ , https://issues.apache.org/jira/browse/HADOOP-9629
Azure Data Lake | adl:// | Hadoop 3.0+ | https://hadoop.apache.org/docs/current/hadoop-azure-datalake/ , https://docs.microsoft.com/zh-tw/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
Google Cloud Storage | gs:// | 3rd party ( Hadoop 1.x , Hadoop 2.x ) | https://cloud.google.com/hadoop/google-cloud-storage-connector , https://github.com/GoogleCloudPlatform/bigdata-interop
HCFS implementations ( for Private Cloud Provider )
Storage | URI scheme | Hadoop version | References
OpenStack Swift ( rackspace ) | swift:// | Hadoop 2.7+ | https://issues.apache.org/jira/browse/HADOOP-8545 , http://hadoop.apache.org/docs/r2.7.3/hadoop-openstack/ , https://github.com/steveloughran/Hadoop-and-Swift-integration/
CephFS ( OpenStack ) | ceph:// | 3rd party ( Hadoop 1.1.x ) | http://docs.ceph.com/docs/master/cephfs/hadoop/ , https://github.com/houbin/cephfs-hadoop
Cassandra File System | cfs:// | 3rd party | http://www.datastax.com/dev/blog/cassandra-file-system-design , http://www.datastax.com/resources/whitepapers/hdfs-vs-cfs
GlusterFS | glusterfs:/// | 3rd party | https://github.com/gluster/glusterfs-hadoop , https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Hadoop/
OrangeFS | - | 3rd party ( Hadoop 1.2.1 , Hadoop 2.6.0 ) | http://docs.orangefs.com/v_2_8_8/index.htm#Hadoop_Client.htm , http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
QFS ( KFS ) | qfs:// | 3rd party | https://github.com/quantcast/qfs/wiki/Migration-Guide
Lustre | - | 3rd party | http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
MapR File System | - | 3rd party | https://www.mapr.com/products/mapr-fs , https://community.mapr.com/thread/7027
HCFS Architecture
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
New API
https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169
http://www.slideshare.net/jazzwang/hadoop-69818883
▸ AWS S3: authentication support
▸ Azure Blob supports encrypted keys
▸ CephFS does not work well with YARN because of JNI (Java Native Interface) :(
▸ Only HDFS and Azure Blob support HBase !!
2. AWS S3
Use Case : Amazon EMR
Three generations of S3 support: s3:// , s3n:// , s3a://
s3:// - The ‘classic’ s3: filesystem
▸ introduced in Hadoop 0.10.0 (HADOOP-574)
▸ deprecated and will be removed from Hadoop 3.0
▸ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
s3n:// - The second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store
▸ introduced in Hadoop 0.18.0 (HADOOP-930)
▸ rename support in Hadoop 0.19.0 (HADOOP-3361)
▸ for Hadoop 2.6 and earlier
▸ requires a compatible version of jets3t
s3a:// - The third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance
▸ introduced in Hadoop 2.6.0 (HADOOP-10400), stabilized in Hadoop 2.7 (HADOOP-11571)
▸ recommended for Hadoop 2.7 and later
▸ requires the exact version of amazon-aws-sdk
core-site.xml ( the three property pairs below are for s3:// , s3n:// and s3a:// , in that order )
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>AWS secret key</value>
</property>
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
WARNING!!
1. You cannot use S3 as a replacement for HDFS
2. Amazon S3 is an "object store"
▸ eventual consistency
▸ non-atomic rename and delete operations.
3. Your AWS credentials are valuable
▸ core-site.xml is readable cluster-wide
▸ Don’t embed the credentials in the URI
▸ S3A supports more authentication mechanisms
4. Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and their own, proprietary, S3 client.
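One way to act on point 3 is to keep the keys out of the cluster-wide core-site.xml and put them in a configuration directory that only the submitting user can read. The sketch below is an illustration and not part of the original talk: the function name `write_private_s3a_conf` and the directory layout are assumptions, and only the two `fs.s3a.*` property names come from the slides.

```shell
#!/bin/sh
# Hypothetical helper (not from the talk): write a core-site.xml that holds
# S3A credentials into a directory only the current user can read.
# usage: write_private_s3a_conf <conf_dir> <access_key> <secret_key>
# Note: values are written verbatim, so keys containing XML special
# characters such as & or < would need escaping first.
write_private_s3a_conf() {
  conf_dir=$1
  access_key=$2
  secret_key=$3

  # Owner-only directory.
  mkdir -p "$conf_dir" && chmod 700 "$conf_dir" || return 1

  (
    # New files created in this subshell are owner-only.
    umask 077
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>$access_key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>$secret_key</value>
  </property>
</configuration>
EOF
    # Tighten permissions even if the file already existed.
    chmod 600 "$conf_dir/core-site.xml"
  )
}
```

A possible use is `write_private_s3a_conf "$HOME/.hadoop-s3a" "$KEY" "$SECRET"` followed by `export HADOOP_CONF_DIR="$HOME/.hadoop-s3a"`. Because HADOOP_CONF_DIR replaces the whole configuration directory, any other settings the client needs would have to be copied there as well.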
DEMO
For Mac OS X - use Homebrew
brew install hadoop
export HADOOP_CONF_DIR=/path/to/conf   # directory that contains core-site.xml
export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
hadoop fs -ls s3n://${bucket}/
For Linux / Windows - use BigTop docker image
docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
# cd /data
/data# export HADOOP_CONF_DIR=/path/to/conf   # directory that contains core-site.xml
/data# hadoop fs -ls s3n://${bucket}/
Undocumented Secrets: debugging and workaround tips
To enable more log4j messages, you could try :
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3n://${bucket}/
To access S3-compatible services that are not operated by AWS, such as hicloud S3 and Ceph S3 (RGW):
Using s3n:// , you have to provide a jets3t.properties config file
$ cat jets3t.properties
s3service.s3-endpoint=s3.hicloud.net
s3service.https-only=false
Using s3a:// , you could add the following to core-site.xml
<property>
<name>fs.s3a.endpoint</name>
<value>s3.hicloud.net</value>
<description>default is s3.amazonaws.com</description>
</property>
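For a quick test, the s3a:// endpoint can also be overridden for a single command instead of editing core-site.xml, using the generic `-D` option that `hadoop fs` accepts. This is a minimal sketch: the wrapper name `s3a_ls` is hypothetical, and it assumes the S3A credentials are already configured and the hadoop-aws jars are on the classpath.

```shell
#!/bin/sh
# Hypothetical wrapper (not from the talk): list a bucket on an
# S3-compatible endpoint through s3a://.
# usage: s3a_ls <endpoint> <bucket> [path]
s3a_ls() {
  endpoint=$1
  bucket=$2
  path=${3:-}
  # -D is a Hadoop generic option and must come before -ls; it overrides
  # fs.s3a.endpoint for this one invocation only.
  hadoop fs -D "fs.s3a.endpoint=$endpoint" -ls "s3a://$bucket/$path"
}
```

For example, `s3a_ls s3.hicloud.net mybucket` would list the root of `mybucket` on the endpoint used in the slide.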
3. Windows Azure Storage Blob
Use Case : HDInsight / Azure Data Lake
Hadoop Azure Support: Azure Blob Storage
1. hadoop-azure.jar is located at
- /usr/lib/hadoop-mapreduce/hadoop-azure.jar ( bigtop , CDH )
- ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar ( official tar.gz , Mac brew )
2. Depends on Azure Storage SDK for Java - https://github.com/Azure/azure-storage-java
3. Features
▸ Supports configuration of multiple Azure Blob Storage accounts.
▸ Supports both page blobs and block blobs
▸ wasbs:// scheme for SSL encrypted access.
▸ Can act as a source of data in a MapReduce job, or a sink.
▸ Tested on both Linux and Windows.
4. Limitations
▸ The append operation is not implemented.
▸ File owner and group are persisted, but the permissions model is not enforced.
▸ File last access time is not tracked.
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
Configurations
In core-site.xml
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>
Examples:
> hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
> hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
> hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
My Use Case :
rsync between local and wasb
Take advantage of hadoop distcp
- Backup
hadoop distcp -update ${SOURCE_DIR} \
  wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
- Restore
hadoop distcp \
  wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} \
  ${RESTORE_DIR}
Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage
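The backup and restore commands can be wrapped in two small functions so that the container URI is written only once. This is a sketch under assumptions: `WASB_BASE`, `backup_to_wasb` and `restore_from_wasb` are hypothetical names, and the account key is assumed to be configured in core-site.xml already.

```shell
#!/bin/sh
# Hypothetical wrappers (not from the talk) around the distcp commands.
# WASB_BASE is an assumed variable holding the container URI, e.g.
#   WASB_BASE=wasb://yourcontainer@youraccount.blob.core.windows.net

# usage: backup_to_wasb <source_dir> <backup_dir>
backup_to_wasb() {
  # -update copies only files that are missing or differ at the target,
  # which is what makes repeated runs behave like rsync.
  hadoop distcp -update "$1" "${WASB_BASE:?set WASB_BASE first}/$2"
}

# usage: restore_from_wasb <backup_dir> <restore_dir>
restore_from_wasb() {
  hadoop distcp "${WASB_BASE:?set WASB_BASE first}/$1" "$2"
}
```

With `WASB_BASE` exported, `backup_to_wasb /data/logs backup1` and `restore_from_wasb backup1 /data/restore` correspond to the two commands shown on the slide.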
Use Case in TenMax:
Read / Write files from/to Azure Blob Storage
[Diagram labels: Spring Boot Web Application; FileSystem ( File System Abstraction Layer ); core-site.xml; Azure Blob Storage; Cloud Storage]
Use Hadoop as a Java library to access Hybrid Cloud Storage
4. Ceph
High Level Architecture of Hadoop 2.x with CephFS
[Diagram: Master and Workers #1 to #3. Computation Layer: YARN ( ResourceManager on the master, a NodeManager on every node ). Storage Layer: Ceph ( Mon and OSD daemons ).]
[Diagram: deployment on a virtual network ( hub ).
Hosts: hdfs01 192.168.1.239 , hdfs02 192.168.1.238 , hdfs03 192.168.1.237 , hdfs04 192.168.1.236 , node11 192.168.1.201 , node21 192.168.1.211 , node31 192.168.1.221
Roles shown: Ceph mon , Ceph OSD x 4 , ResourceManager , NodeManager x 3]
CephFS installation
1. Compile https://github.com/ceph/cephfs-hadoop
2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
5. Copy JNI related files to ${HADOOP_HOME}/lib/native/
ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://github.com/ceph/cephfs-hadoop
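Step 5 of the installation can be sketched as a function that checks for the versioned libraries before linking them, so that a missing file is reported instead of leaving a dangling link. The function name `link_cephfs_native` and its two arguments are assumptions for illustration; the library names and the target directory are the ones from the slide.

```shell
#!/bin/sh
# Hypothetical helper (not from the talk): link the CephFS native libraries
# into Hadoop's native library directory.
# usage: link_cephfs_native <dir_with_libcephfs> <hadoop_native_dir>
# Pass an absolute <dir_with_libcephfs> so the links resolve from anywhere.
link_cephfs_native() {
  src_dir=$1
  native_dir=$2
  for lib in libcephfs libcephfs_jni; do
    # Stop early if the versioned library is not where we expect it.
    if [ ! -e "$src_dir/$lib.so.1" ]; then
      echo "missing $src_dir/$lib.so.1" >&2
      return 1
    fi
    # -f replaces a stale link left over from an earlier attempt.
    ln -sf "$src_dir/$lib.so.1" "$native_dir/$lib.so"
  done
}
```

On the layout used in the slide, a call might look like `link_cephfs_native /usr/lib /usr/lib/hadoop/lib/native`, where `/usr/lib` is only a guess at where the distribution installs libcephfs.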
Known Issue :
MRAppMaster cannot find cephfs_jni
Root Cause :
There is no -Djava.library.path for MRAppMaster
G.G
Official Support is limited to Hadoop 1.1.x
http://docs.ceph.com/docs/master/cephfs/hadoop/
Why does it work for MRv1??
Let’s take a look at MapReduce v1 Architecture
Why doesn’t it work on YARN??
Let’s take a look at YARN Architecture
Without correct configuration, HCFS or YARN applications that use JNI will fail :(
http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
How to solve this issue ?
The official documentation and the source code say so ...
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
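Following the hint in the warning message, one thing to try is putting the native library directory on LD_LIBRARY_PATH for the ApplicationMaster and the map and reduce tasks, rather than using -Djava.library.path. The sketch below only builds the per-job `-D` options; the helper name `jni_env_opts` is hypothetical, and the talk does not confirm that these settings resolve the cephfs_jni failure. The three property names are standard MapReduce settings from mapred-default.xml.

```shell
#!/bin/sh
# Hypothetical helper (not from the talk): print -D options that set
# LD_LIBRARY_PATH for the MapReduce ApplicationMaster and for map and
# reduce tasks, one option per line.
# usage: jni_env_opts <native_lib_dir>
jni_env_opts() {
  native_dir=$1
  printf '%s\n' \
    "-Dyarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$native_dir" \
    "-Dmapreduce.map.env=LD_LIBRARY_PATH=$native_dir" \
    "-Dmapreduce.reduce.env=LD_LIBRARY_PATH=$native_dir"
}
```

The output could be spliced into a job submission such as `hadoop jar app.jar MainClass $(jni_env_opts /usr/lib/hadoop/lib/native) in out`, where `app.jar` and `MainClass` are placeholders. This only takes effect if the job parses generic options (for example through ToolRunner); a cluster-wide alternative is to set `mapreduce.admin.user.env` in mapred-site.xml, as the warning suggests.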
Conclusion
▸ S3 and WASB are the most mature HCFS implementations.
▹ Sorry that I’m not sure about Google Cloud Storage :(
▸ You’ll need more integration tests across the Hadoop ecosystem when using HCFS.
Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage
Use Hadoop as a Java library to access Hybrid Cloud Storage
THANKS!
Any questions?
You can find me at @jazzwang_tw &
https://fb.com/groups/hadoop.tw
CREDITS
Special thanks to all the people who made and released these
awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Introduction to HCFS

  • 1. A First Look at HCFS (HCFS 初探): Introduction to Hadoop Compatible File System. Jazz Yao-Tsung Wang, Co-founder of Hadoop.TW, https://fb.com/groups/hadoop.tw. 2017-01-21, Hadoop.TW & GCPUG.TW Meetup #1 2017
  • 2. HELLO! I am Jazz Wang Co-Founder of Hadoop.TW. Hadoop Evangelist since 2008. Open Source Promoter. System Admin (Ops). You can find me at @jazzwang_tw or https://fb.com/groups/hadoop.tw , https://forum.hadoop.tw
  • 3. 1. What is HCFS? Let’s start with a brief introduction to Apache Hadoop
  • 4. Apache Hadoop from 0.x to 1.x [Diagram: a Master and Workers #1 to #3. Computation layer (MapReduce): one JobTracker and four TaskTrackers. Storage layer (HDFS): one NameNode and four DataNodes.]
  • 5. Apache Hadoop from 2.x to 3.x [Diagram: a Master and Workers #1 to #3. Computation layer (YARN): one ResourceManager and four NodeManagers, with Containers. Storage layer (HDFS): one NameNode and four DataNodes.]
  • 6. Needs / Trends: Hadoop on the Cloud http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
  • 7. Why Hadoop on the Cloud? http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production https://www.youtube.com/watch?v=XehH3iJJy3Q
  • 8. Why might you need HCFS ... https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086&comment_tracking={%22tn%22%3A%22R%22}
  • 11. What is HCFS? Hadoop Compatible File System. [Diagram: a Master and Workers #1 to #3. Computation layer (YARN): one ResourceManager and four NodeManagers. Storage layer (HCFS): Windows Azure Blob, AWS S3, Google Cloud Storage, CephFS.]
  • 12. HCFS implementations - Cloud Storage Connector (for Public Cloud Provider) https://wiki.apache.org/hadoop/HCFS
      ▸ AWS S3
        s3:// Hadoop 0.10 ~ Hadoop 2.7
        s3n:// Hadoop 0.18 ~ Hadoop 2.6
        s3a:// Hadoop 2.7+
        https://wiki.apache.org/hadoop/AmazonS3
        http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/
      ▸ AWS EMRFS
        ?? (3rd party)
        http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
      ▸ Windows Azure Storage Blob
        wasb:// Hadoop 2.7+
        http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/
        https://issues.apache.org/jira/browse/HADOOP-9629
      ▸ Azure Data Lake
        adl:// Hadoop 3.0+
        https://hadoop.apache.org/docs/current/hadoop-azure-datalake/
        https://docs.microsoft.com/zh-tw/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
      ▸ Google Cloud Storage
        gs:// (3rd party) Hadoop 1.x, Hadoop 2.x
        https://cloud.google.com/hadoop/google-cloud-storage-connector
        https://github.com/GoogleCloudPlatform/bigdata-interop
  • 13. HCFS implementations (for Private Cloud Provider)
      ▸ OpenStack Swift (rackspace)
        swift:// Hadoop 2.7+
        https://issues.apache.org/jira/browse/HADOOP-8545
        http://hadoop.apache.org/docs/r2.7.3/hadoop-openstack/
        https://github.com/steveloughran/Hadoop-and-Swift-integration/
      ▸ CephFS (OpenStack)
        ceph:// (3rd party) Hadoop 1.1.x
        http://docs.ceph.com/docs/master/cephfs/hadoop/
        https://github.com/houbin/cephfs-hadoop
      ▸ Cassandra File System
        cfs:// (3rd party)
        http://www.datastax.com/dev/blog/cassandra-file-system-design
        http://www.datastax.com/resources/whitepapers/hdfs-vs-cfs
      ▸ GlusterFS
        glusterfs:/// (3rd party)
        https://github.com/gluster/glusterfs-hadoop
        https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Hadoop/
      ▸ OrangeFS
        (3rd party) Hadoop 1.2.1, Hadoop 2.6.0
        http://docs.orangefs.com/v_2_8_8/index.htm#Hadoop_Client.htm
        http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
      ▸ QFS (KFS)
        qfs:// (3rd party)
        https://github.com/quantcast/qfs/wiki/Migration-Guide
      ▸ Lustre
        (3rd party)
        http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre
      ▸ MapR File System
        (3rd party)
        https://www.mapr.com/products/mapr-fs
        https://community.mapr.com/thread/7027
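Since each HCFS implementation is selected by its URI scheme, the two tables above can be read as a lookup from scheme to storage system. A minimal Python sketch of that lookup follows; the names `HCFS_SCHEMES` and `identify_store` are hypothetical and only mirror the schemes listed on slides 12 and 13, they are not part of Hadoop.

```python
from urllib.parse import urlsplit

# Scheme -> storage system, copied from the tables on slides 12 and 13.
# Implementations listed without a scheme (EMRFS, OrangeFS, Lustre, MapR-FS)
# are left out on purpose.
HCFS_SCHEMES = {
    "s3": "AWS S3",
    "s3n": "AWS S3",
    "s3a": "AWS S3",
    "wasb": "Windows Azure Storage Blob",
    "adl": "Azure Data Lake",
    "gs": "Google Cloud Storage",
    "swift": "OpenStack Swift",
    "ceph": "CephFS",
    "cfs": "Cassandra File System",
    "glusterfs": "GlusterFS",
    "qfs": "QFS (KFS)",
}


def identify_store(uri):
    """Return the storage system behind a URI, or None if the scheme is not listed."""
    scheme = urlsplit(uri).scheme.lower()
    return HCFS_SCHEMES.get(scheme)


if __name__ == "__main__":
    for uri in ("s3a://bucket/key", "glusterfs:///dir", "hdfs://namenode/tmp"):
        print(uri, "->", identify_store(uri))
```

In Hadoop itself the scheme is resolved to a FileSystem class through configuration; this sketch only illustrates the naming convention.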
  • 16. https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/51169 http://www.slideshare.net/jazzwang/hadoop-69818883
      ▸ AWS S3: authentication support
      ▸ Azure Blob: supports encrypted keys
      ▸ CephFS does not work well with YARN because of JNI (Java Native Interface) :(
      ▸ Only HDFS and Azure Blob support HBase!!
  • 17. 2. AWS S3 Use Case : Amazon EMR
  • 18. Three generations of S3 support
      ▸ s3:// The ‘classic’ s3: filesystem.
        - introduced in Hadoop 0.10.0 (HADOOP-574)
        - deprecated and will be removed from Hadoop 3.0
        - Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
        - core-site.xml:
          <property> <name>fs.s3.awsAccessKeyId</name> <value>AWS access key ID</value> </property>
          <property> <name>fs.s3.awsSecretAccessKey</name> <value>AWS secret key</value> </property>
      ▸ s3n:// The second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store.
        - introduced in Hadoop 0.18.0 (HADOOP-930)
        - rename support in Hadoop 0.19.0 (HADOOP-3361)
        - Hadoop 2.6 and earlier
        - requires a compatible version of jets3t
        - core-site.xml:
          <property> <name>fs.s3n.awsAccessKeyId</name> <value>AWS access key ID</value> </property>
          <property> <name>fs.s3n.awsSecretAccessKey</name> <value>AWS secret key</value> </property>
      ▸ s3a:// The third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance.
        - introduced in Hadoop 2.6.0 (HADOOP-11571)
        - recommended for Hadoop 2.7 and later
        - requires exact version of amazon-aws-sdk
        - core-site.xml:
          <property> <name>fs.s3a.access.key</name> <value>AWS access key ID</value> </property>
          <property> <name>fs.s3a.secret.key</name> <value>AWS secret key</value> </property>
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
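The three schemes differ only in the property names that carry the credentials, so a small generator can produce the matching core-site.xml fragment. The following is a sketch in Python using only the standard library; `S3_CREDENTIAL_KEYS` and `build_core_site` are hypothetical helper names, while the property names come from the slide.

```python
import xml.etree.ElementTree as ET

# Credential property names per scheme, exactly as listed on the slide.
S3_CREDENTIAL_KEYS = {
    "s3": ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
}


def build_core_site(scheme, access_key, secret_key):
    """Return core-site.xml text holding the credentials for one S3 scheme."""
    access_name, secret_name = S3_CREDENTIAL_KEYS[scheme]
    root = ET.Element("configuration")
    for name, value in ((access_name, access_key), (secret_name, secret_key)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values only; never commit real keys.
    print(build_core_site("s3a", "AWS access key ID", "AWS secret key"))
```

Keep in mind that a file generated this way holds plain-text credentials, so it should be protected accordingly.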
  • 19. WARNING!!
      1. You cannot use S3 as a replacement for HDFS.
      2. Amazon S3 is an "object store":
         ▸ eventual consistency
         ▸ non-atomic rename and delete operations
      3. Your AWS credentials are valuable:
         ▸ core-site.xml is readable cluster-wide
         ▸ Don’t embed the credentials in the URI
         ▸ S3A supports more authentication mechanisms
      4. Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client.
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
  • 20. DEMO
      For Mac OS X:
        brew install hadoop
        export HADOOP_CONF_DIR=<directory containing core-site.xml>
        export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
        hadoop fs -ls s3n://${bucket}/
      For Linux / Windows, use the BigTop docker image:
        docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
        # cd /data
        /data# export HADOOP_CONF_DIR=<directory containing core-site.xml>
        /data# hadoop fs -ls s3n://${bucket}/
      https://wiki.apache.org/hadoop/AmazonS3 http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
  • 21. Undocumented Secrets: debugging and workaround tips (除錯/繞道密技)
      To enable more log4j messages, you could try:
        export HADOOP_ROOT_LOGGER=DEBUG,console
        hadoop fs -ls s3n://${bucket}/
      To access unofficial S3 services such as hicloud S3 and Ceph S3 (RGW):
      ▸ Using s3n://, you have to provide a config file named jets3t.properties:
        $ cat jets3t.properties
        s3service.s3-endpoint=s3.hicloud.net
        s3service.https-only=false
      ▸ Using s3a://, you could add the following to core-site.xml:
        <property> <name>fs.s3a.endpoint</name> <value>s3.hicloud.net</value> <description>default is s3.amazonaws.com</description> </property>
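The two endpoint overrides above carry the same information in different file formats. As a rough Python sketch (the function names `jets3t_properties` and `s3a_endpoint_property` are hypothetical; the keys and the sample endpoint are the ones shown on the slide), both fragments can be rendered from a single endpoint value:

```python
from xml.sax.saxutils import escape


def jets3t_properties(endpoint, https_only=False):
    """Render jets3t.properties content for s3n:// against a custom endpoint."""
    lines = [
        f"s3service.s3-endpoint={endpoint}",
        f"s3service.https-only={'true' if https_only else 'false'}",
    ]
    return "\n".join(lines) + "\n"


def s3a_endpoint_property(endpoint):
    """Render the core-site.xml property that points s3a:// at a custom endpoint."""
    return (
        "<property>\n"
        "  <name>fs.s3a.endpoint</name>\n"
        f"  <value>{escape(endpoint)}</value>\n"
        "</property>\n"
    )


if __name__ == "__main__":
    print(jets3t_properties("s3.hicloud.net"))
    print(s3a_endpoint_property("s3.hicloud.net"))
```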
  • 22. 3. Windows Azure Storage Blob Use Case : HDInsight / Azure Data Lake
  • 23. Hadoop Azure Support: Azure Blob Storage
      1. hadoop-azure.jar is located at:
         - /usr/lib/hadoop-mapreduce/hadoop-azure.jar (bigtop, CDH)
         - ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar (official tar.gz, Mac brew)
      2. Depends on Azure Storage SDK for Java: https://github.com/Azure/azure-storage-java
      3. Features
         ▸ Supports configuration of multiple Azure Blob Storage accounts.
         ▸ Supports both page blobs and block blobs.
         ▸ wasbs:// scheme for SSL encrypted access.
         ▸ Can act as a source of data in a MapReduce job, or a sink.
         ▸ Tested on both Linux and Windows.
      4. Limitations
         ▸ The append operation is not implemented.
         ▸ File owner and group are persisted, but the permissions model is not enforced.
         ▸ File last access time is not tracked.
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
  • 24. Configurations
      In core-site.xml:
        <property> <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> <value>YOUR ACCESS KEY</value> </property>
      Examples:
        > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
        > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
        > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
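The wasb:// URI and the account key property both embed the storage account name, which is easy to get wrong by hand. A minimal Python sketch of the naming pattern used in the examples above follows; the helpers `wasb_uri`, `parse_wasb_uri` and `account_key_property` are hypothetical and not part of hadoop-azure.

```python
from urllib.parse import urlsplit

BLOB_SUFFIX = "blob.core.windows.net"


def wasb_uri(container, account, path="", secure=False):
    """Build wasb://container@account.blob.core.windows.net/path (wasbs:// if secure)."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.{BLOB_SUFFIX}/{path.lstrip('/')}"


def parse_wasb_uri(uri):
    """Split a wasb:// or wasbs:// URI into (container, account, path)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a wasb URI: {uri}")
    account = parts.hostname.split(".", 1)[0]
    return parts.username, account, parts.path.lstrip("/")


def account_key_property(account):
    """Name of the core-site.xml property that holds the access key for an account."""
    return f"fs.azure.account.key.{account}.{BLOB_SUFFIX}"


if __name__ == "__main__":
    uri = wasb_uri("yourcontainer", "youraccount", "testDir/testFile")
    print(uri)
    print(parse_wasb_uri(uri))
    print(account_key_property("youraccount"))
```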
  • 25. My Use Case: rsync between local and wasb
      Take advantage of hadoop distcp:
      - Backup:
        hadoop distcp -update ${SOURCE_DIR} wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
      - Restore:
        hadoop distcp wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} ${RESTORE_DIR}
      Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage
      http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
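For scripted backups, the two distcp invocations can be assembled as argument lists and handed to a process runner. The sketch below is in Python and only builds the commands; `backup_command` and `restore_command` are hypothetical names, and actually running them assumes a working `hadoop` binary and a configured core-site.xml.

```python
import shlex


def backup_command(source_dir, wasb_target):
    """distcp with -update, as in the Backup example on the slide."""
    return ["hadoop", "distcp", "-update", source_dir, wasb_target]


def restore_command(wasb_source, restore_dir):
    """Plain distcp from blob storage back to a local or HDFS directory."""
    return ["hadoop", "distcp", wasb_source, restore_dir]


if __name__ == "__main__":
    remote = "wasb://yourcontainer@youraccount.blob.core.windows.net/backup"
    # Print the commands; pass the lists to subprocess.run(...) to execute them.
    print(shlex.join(backup_command("/data/source", remote)))
    print(shlex.join(restore_command(remote, "/data/restore")))
```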
  • 26. Use Case in TenMax: Read / Write files from/to Azure Blob Storage. [Diagram: a Spring Boot web application uses FileSystem as a file system abstraction layer, configured through core-site.xml, to reach Azure Blob Storage and other cloud storage.] Use Hadoop as a Java library to access Hybrid Cloud Storage
  • 28. High Level Architecture of Hadoop 2.x with CephFS [Diagram: a Master and Workers #1 to #3. Computation layer (YARN): one ResourceManager and four NodeManagers. Storage layer (Ceph): three Mons and four OSDs.]
  • 29. hdfs01 192.168.1.239 hdfs02 192.168.1.238 hdfs03 192.168.1.237 hdfs04 192.168.1.236 virtual network ( hub ) node11 192.168.1.201 node21 192.168.1.211 node31 192.168.1.221 Ceph mon Ceph OSD Ceph OSD Ceph OSD Ceph OSD Resource Manager Node Manager Node Manager Node Manager
  • 30. CephFS installation
      1. Compile https://github.com/ceph/cephfs-hadoop
      2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
      3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
      4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
      5. Copy JNI related files to ${HADOOP_HOME}/lib/native/
         ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
         ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
      http://docs.ceph.com/docs/master/cephfs/hadoop/ https://github.com/ceph/cephfs-hadoop
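Because the installation is a series of manual copies, a quick check that every file landed where the steps say can save debugging time later. This is a minimal Python sketch under the assumption that the layout is exactly the one listed above; `expected_cephfs_files` and `missing_cephfs_files` are hypothetical helpers, not part of cephfs-hadoop.

```python
import os


def expected_cephfs_files(hadoop_home, client_id, ceph_dir="/etc/ceph"):
    """Paths that the installation steps on the slide should have created."""
    lib = os.path.join(hadoop_home, "lib")
    native = os.path.join(lib, "native")
    return [
        os.path.join(lib, "cephfs-hadoop.jar"),
        os.path.join(lib, "cephfs-java.jar"),
        os.path.join(native, "libcephfs.so"),
        os.path.join(native, "libcephfs_jni.so"),
        os.path.join(ceph_dir, "ceph.conf"),
        os.path.join(ceph_dir, f"ceph.client.{client_id}.keyring"),
    ]


def missing_cephfs_files(hadoop_home, client_id, ceph_dir="/etc/ceph"):
    """Return the expected paths that do not exist on this machine."""
    paths = expected_cephfs_files(hadoop_home, client_id, ceph_dir)
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    # "/usr/lib/hadoop" matches the ln -s targets on the slide; "admin" is a placeholder ID.
    for path in missing_cephfs_files("/usr/lib/hadoop", "admin"):
        print("missing:", path)
```

The check has to pass on every node that runs tasks, not only on the master.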
  • 31. Known Issue: MRAppMaster cannot find cephfs_jni
  • 32. Root Cause : There is no -Djava.library.path for MRAppMaster
  • 33. Root Cause : There is no -Djava.library.path for MRAppMaster
  • 34. G.G Official Support is limited to Hadoop 1.1.x http://docs.ceph.com/docs/master/cephfs/hadoop/
  • 35. Why does it work for MRv1?? Let’s take a look at MapReduce v1 Architecture
  • 36. Why doesn’t it work on YARN?? Let’s take a look at YARN Architecture
  • 37. Without correct configuration, HCFS or YARN applications that use JNI will fail :( http://docs.orangefs.com/v_2_9/Hadoop_Use_Cases.htm
  • 38. How to solve this issue? The official documentation and source code say: "WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings." https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
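Following the warning's advice, the native library directory goes into LD_LIBRARY_PATH through mapreduce.admin.user.env rather than -Djava.library.path. Below is a Python sketch that renders such a property for mapred-site.xml; `user_env_property` is a hypothetical helper, the directory is the one used by the ln -s commands on slide 30, and whether this setting alone fixes the MRAppMaster failure is an assumption the slides do not confirm.

```python
import xml.etree.ElementTree as ET


def user_env_property(native_dir, name="mapreduce.admin.user.env"):
    """Render a <property> that sets LD_LIBRARY_PATH for MapReduce task JVMs."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    # Note: this value replaces the default for the property instead of appending to it.
    ET.SubElement(prop, "value").text = f"LD_LIBRARY_PATH={native_dir}"
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(user_env_property("/usr/lib/hadoop/lib/native"))
```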
  • 39. Conclusion ▸ S3 and WASB are the most mature HCFS implementations. ▹ Sorry that I’m not sure about Google Cloud Storage :( ▸ You’ll need more integration testing for the Hadoop ecosystem when using HCFS. ▸ Use Hadoop as an rsync tool to sync with Hybrid Cloud Storage ▸ Use Hadoop as a Java library to access Hybrid Cloud Storage
  • 40. THANKS! Any questions? You can find me at @jazzwang_tw & https://fb.com/groups/hadoop.tw
  • 41. CREDITS Special thanks to all the people who made and released these awesome resources for free: ▸ Presentation template by SlidesCarnival ▸ Photographs by Death to the Stock Photo (license) PRESENTATION DESIGN This presentation uses the following typographies and colors: ▸ Titles: Montserrat ▸ Body copy: Karla You can download the fonts on this page: http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic