Introduction to HCFS
1. A First Look at HCFS
Introduction to
Hadoop Compatible File System
Jazz Yao-Tsung Wang
Co-founder of Hadoop.TW
https://fb.com/groups/hadoop.tw
2017-01-21 Hadoop.TW & GCPUG.TW Meetup #1
2. HELLO!
I am Jazz Wang
Co-Founder of Hadoop.TW.
Hadoop Evangelist since 2008.
Open Source Promoter. System Admin (Ops).
You can find me at @jazzwang_tw or
https://fb.com/groups/hadoop.tw ,
https://forum.hadoop.tw
6. Needs / Trends:
Hadoop on the Cloud
http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
7. Why Hadoop on the Cloud ?
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
8. Why might you need HCFS ...
https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086
18. Three generations of S3 support: s3:// , s3n:// , s3a://

s3:// — the 'classic' s3: filesystem
▸ introduced in Hadoop 0.10.0 (HADOOP-574)
▸ deprecated and will be removed from Hadoop 3.0
▸ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
▸ requires a compatible version of jets3t
▸ core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>

s3n:// — the second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store
▸ introduced in Hadoop 0.18.0 (HADOOP-930)
▸ rename support in Hadoop 0.19.0 (HADOOP-3361)
▸ recommended for Hadoop 2.6 and earlier
▸ requires a compatible version of jets3t
▸ core-site.xml:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>

s3a:// — the third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance
▸ introduced in Hadoop 2.6.0 (HADOOP-11571)
▸ recommended for Hadoop 2.7 and later
▸ requires an exact version of amazon-aws-sdk
▸ core-site.xml:
<property>
<name>fs.s3a.access.key</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>AWS secret key</value>
</property>
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
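As a quick sanity check, each URI scheme selects a different connector; note that s3:// uses its own block format, so data written through it is not readable through s3n:// or s3a:// ("mybucket" is a placeholder bucket name, credentials as above):

hadoop fs -ls s3://mybucket/    # classic block-based filesystem (jets3t)
hadoop fs -ls s3n://mybucket/   # second-generation native connector (jets3t)
hadoop fs -ls s3a://mybucket/   # third-generation connector (AWS SDK), for Hadoop 2.7+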
19. WARNING!!
1. You cannot use S3 as a drop-in replacement for HDFS.
2. Amazon S3 is an "object store":
▸ eventual consistency
▸ non-atomic rename and delete operations
3. Your AWS credentials are valuable:
▸ core-site.xml is readable cluster-wide
▸ Don't embed the credentials in the URI
▸ S3A supports more authentication mechanisms
4. Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client.
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
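To make point 3 concrete, a small sketch of the discouraged and preferred ways to pass credentials ("mybucket" is a placeholder):

# DISCOURAGED: secrets embedded in the URI end up in logs and job configurations
hadoop fs -ls "s3n://ACCESS_KEY:SECRET_KEY@mybucket/"
# PREFERRED: keep the keys in core-site.xml (slide 18) with restricted file permissions,
# or use the extra authentication mechanisms that S3A offers
hadoop fs -ls s3n://mybucket/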
20. DEMO
For Mac OS X (via Homebrew):
brew install hadoop
# point HADOOP_CONF_DIR at the directory that holds your core-site.xml
export HADOOP_CONF_DIR=${PATH of core-site.xml}
# put the S3/Azure connector jars on the classpath
export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
hadoop fs -ls s3n://${bucket}/

For Linux / Windows - use the BigTop docker image:
docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
# cd /data
/data# export HADOOP_CONF_DIR=${PATH of core-site.xml}
/data# hadoop fs -ls s3n://${bucket}/
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
21. Undocumented Secrets: Debugging / Workaround Tricks
To enable more log4j messages, you could try:
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3n://${bucket}/

To access unofficial S3 services such as hicloud S3 and Ceph S3 (RGW):
Using s3n:// , you have to provide a jets3t.properties config file (jets3t reads it from the classpath, so ${HADOOP_CONF_DIR} is one place to put it):
$ cat jets3t.properties
s3service.s3-endpoint=s3.hicloud.net
s3service.https-only=false
Using s3a:// , you could add the following to core-site.xml:
<property>
<name>fs.s3a.endpoint</name>
<value>s3.hicloud.net</value>
<description>default is s3.amazonaws.com</description>
</property>
23. Hadoop Azure Support: Azure Blob Storage
1. hadoop-azure.jar is located at
- /usr/lib/hadoop-mapreduce/hadoop-azure.jar (BigTop, CDH)
- ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar (official tar.gz, Mac brew)
2. Depends on the Azure Storage SDK for Java -
https://github.com/Azure/azure-storage-java
3. Features
▸ Supports configuration of multiple Azure Blob Storage accounts (see the core-site.xml sketch below).
▸ Supports both page blobs and block blobs.
▸ wasbs:// scheme for SSL encrypted access.
▸ Can act as a source of data in a MapReduce job, or a sink.
▸ Tested on both Linux and Windows.
4. Limitations
▸ The append operation is not implemented.
▸ File owner and group are persisted, but the permissions model is not enforced.
▸ File last access time is not tracked.
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
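A minimal core-site.xml sketch for wasb:// access, following the hadoop-azure docs ("youraccount" is a placeholder storage account; the key comes from the Azure portal):

<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>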
25. My Use Case: rsync between local and wasb
Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage.
Take advantage of hadoop distcp (-update copies only files that are missing or differ at the destination):
- Backup
hadoop distcp -update ${SOURCE_DIR} wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
- Restore
hadoop distcp wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} ${RESTORE_DIR}
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
26. Use Case in TenMax: Read / Write files from/to Azure Blob Storage
[Architecture diagram: a Spring Boot web application calls the Hadoop FileSystem abstraction layer, which is configured via core-site.xml and backed by Azure Blob Storage.]
Take Hadoop as a Java library to access Hybrid Cloud Storage.
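A minimal sketch of that pattern, using the stock Hadoop FileSystem API (the container/account names and paths are placeholders; credentials are assumed to be in core-site.xml as on the previous slides):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbDemo {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml from the classpath; the fs.azure.account.key.*
        // property must be set there.
        Configuration conf = new Configuration();
        // Placeholder container/account, same URI form as the distcp examples.
        FileSystem fs = FileSystem.get(
                URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"), conf);
        // Write a file into the container.
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.writeBytes("hello from the Hadoop FileSystem API\n");
        }
        // List what we just wrote.
        for (FileStatus status : fs.listStatus(new Path("/demo"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

Because the FileSystem API is scheme-agnostic, swapping wasb:// for s3a:// (or hdfs://) only changes the URI and core-site.xml, not the application code.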
30. CephFS installation
1. Compile https://github.com/ceph/cephfs-hadoop
2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
5. Copy the JNI-related files to ${HADOOP_HOME}/lib/native/ and create the symlinks:
ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://github.com/ceph/cephfs-hadoop
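Once the jars and native libraries are in place, CephFS still has to be wired into Hadoop; a minimal core-site.xml sketch following the Ceph Hadoop docs (the monitor host is a placeholder):

<property>
<name>fs.defaultFS</name>
<value>ceph://your-mon-host:6789/</value>
</property>
<property>
<name>fs.ceph.impl</name>
<value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
<name>ceph.conf.file</name>
<value>/etc/ceph/ceph.conf</value>
</property>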
38. WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
How to solve this issue?
The official documentation and source code say so (see the mapred-site.xml sketch below) ...
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
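Following that advice, a sketch of the fix in mapred-site.xml: move the native-library directory out of -Djava.library.path and into mapreduce.admin.user.env (the second path is an example; use wherever your libcephfs_jni.so actually lives):

<property>
<name>mapreduce.admin.user.env</name>
<!-- prepend your native-lib dir; this replaces -Djava.library.path
     in mapreduce.admin.map.child.java.opts -->
<value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:/usr/lib/hadoop/lib/native</value>
</property>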
39. Conclusion
▸ S3 and WASB are the most mature HCFS implementations.
▹ Sorry that I'm not sure about Google Cloud Storage :(
▸ You'll need more integration testing across the Hadoop ecosystem when using HCFS.
▸ Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage.
▸ Take Hadoop as a Java library to access Hybrid Cloud Storage.
41. CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)
PRESENTATION DESIGN
This presentation uses the following typographies and colors:
▸ Titles: Montserrat
▸ Body copy: Karla
You can download the fonts on this page:
http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic