Introduction to HCFS
1. A First Look at HCFS
Introduction to
Hadoop Compatible File System
Jazz Yao-Tsung Wang
Co-founder of Hadoop.TW
https://fb.com/groups/hadoop.tw
2017-01-21 Hadoop.TW & GCPUG.TW Meetup #1
2. HELLO!
I am Jazz Wang
Co-Founder of Hadoop.TW.
Hadoop Evangelist since 2008.
Open Source Promoter. System Admin (Ops).
You can find me at @jazzwang_tw or
https://fb.com/groups/hadoop.tw ,
https://forum.hadoop.tw
6. Needs / Trends:
Hadoop on the Cloud
http://www.slideshare.net/jazzwang/hadoop-deployment-model-osdctw
7. Why Hadoop on the Cloud ?
http://www.slideshare.net/HadoopSummit/hadoop-cloud-storage-object-store-integration-in-production
https://www.youtube.com/watch?v=XehH3iJJy3Q
8. Why might you need HCFS ...
https://www.facebook.com/groups/hadoop.tw/permalink/1061706333938741/?comment_id=1072414466201261&reply_comment_id=1073302882779086
18. Three generations of S3 support: s3:// , s3n:// , s3a://

s3:// — the 'classic' s3: filesystem
▸ introduced in Hadoop 0.10.0 (HADOOP-574)
▸ deprecated and will be removed from Hadoop 3.0
▸ Uploaded files can be larger than 5GB, but they are not interoperable with other S3 tools.
▸ requires a compatible version of jets3t
▸ core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>

s3n:// — the second-generation s3n: filesystem, making it easy to share data between Hadoop and other applications via the S3 object store
▸ introduced in Hadoop 0.18.0 (HADOOP-930)
▸ rename support in Hadoop 0.19.0 (HADOOP-3361)
▸ recommended for Hadoop 2.6 and earlier
▸ requires a compatible version of jets3t
▸ core-site.xml:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS secret key</value>
</property>

s3a:// — the third-generation s3a: filesystem, a replacement for s3n: that supports larger files and promises higher performance
▸ introduced in Hadoop 2.6.0 (HADOOP-11571)
▸ recommended for Hadoop 2.7 and later
▸ requires an exact version of amazon-aws-sdk
▸ core-site.xml:
<property>
<name>fs.s3a.access.key</name>
<value>AWS access key ID</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>AWS secret key</value>
</property>
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
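As a quick sanity check, each URI scheme selects a different connector; note that s3:// uses its own block format, so data written through it is not readable through s3n:// or s3a:// ("mybucket" is a placeholder bucket name, credentials as above):

hadoop fs -ls s3://mybucket/    # classic block-based filesystem (jets3t)
hadoop fs -ls s3n://mybucket/   # second-generation native connector (jets3t)
hadoop fs -ls s3a://mybucket/   # third-generation connector (AWS SDK), for Hadoop 2.7+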
19. WARNING!!
1. You cannot use S3 as a drop-in replacement for HDFS.
2. Amazon S3 is an "object store":
▸ eventual consistency
▸ non-atomic rename and delete operations
3. Your AWS credentials are valuable:
▸ core-site.xml is readable cluster-wide
▸ Don't embed the credentials in the URI
▸ S3A supports more authentication mechanisms
4. Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client.
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
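To make point 3 concrete, a small sketch of the discouraged and preferred ways to pass credentials ("mybucket" is a placeholder):

# DISCOURAGED: secrets embedded in the URI end up in logs and job configurations
hadoop fs -ls "s3n://ACCESS_KEY:SECRET_KEY@mybucket/"
# PREFERRED: keep the keys in core-site.xml (slide 18) with restricted file permissions,
# or use the extra authentication mechanisms that S3A offers
hadoop fs -ls s3n://mybucket/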
20. DEMO
For Mac OS X (via Homebrew):
brew install hadoop
# point HADOOP_CONF_DIR at the directory that holds your core-site.xml
export HADOOP_CONF_DIR=${PATH of core-site.xml}
# put the S3/Azure connector jars on the classpath
export HADOOP_CLASSPATH=/usr/local/opt/hadoop/libexec/share/hadoop/tools/lib/*
hadoop fs -ls s3n://${bucket}/

For Linux / Windows - use the BigTop docker image:
docker run -it --name hcfs -h hcfs -v $(pwd):/data jazzwang/bigtop-hdfs
# cd /data
/data# export HADOOP_CONF_DIR=${PATH of core-site.xml}
/data# hadoop fs -ls s3n://${bucket}/
https://wiki.apache.org/hadoop/AmazonS3
http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html
21. Undocumented Secrets: Debugging / Workaround Tricks
To enable more log4j messages, you could try:
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls s3n://${bucket}/

To access unofficial S3 services such as hicloud S3 and Ceph S3 (RGW):
Using s3n:// , you have to provide a jets3t.properties config file (jets3t reads it from the classpath, so ${HADOOP_CONF_DIR} is one place to put it):
$ cat jets3t.properties
s3service.s3-endpoint=s3.hicloud.net
s3service.https-only=false
Using s3a:// , you could add the following to core-site.xml:
<property>
<name>fs.s3a.endpoint</name>
<value>s3.hicloud.net</value>
<description>default is s3.amazonaws.com</description>
</property>
23. Hadoop Azure Support: Azure Blob Storage
1. hadoop-azure.jar is located at
- /usr/lib/hadoop-mapreduce/hadoop-azure.jar (BigTop, CDH)
- ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure.jar (official tar.gz, Mac brew)
2. Depends on the Azure Storage SDK for Java -
https://github.com/Azure/azure-storage-java
3. Features
▸ Supports configuration of multiple Azure Blob Storage accounts (see the core-site.xml sketch below).
▸ Supports both page blobs and block blobs.
▸ wasbs:// scheme for SSL encrypted access.
▸ Can act as a source of data in a MapReduce job, or a sink.
▸ Tested on both Linux and Windows.
4. Limitations
▸ The append operation is not implemented.
▸ File owner and group are persisted, but the permissions model is not enforced.
▸ File last access time is not tracked.
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
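A minimal core-site.xml sketch for wasb:// access, following the hadoop-azure docs ("youraccount" is a placeholder storage account; the key comes from the Azure portal):

<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ACCESS KEY</value>
</property>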
25. My Use Case: rsync between local and wasb
Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage.
Take advantage of hadoop distcp (-update copies only files that are missing or differ at the destination):
- Backup
hadoop distcp -update ${SOURCE_DIR} wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR}
- Restore
hadoop distcp wasb://yourcontainer@youraccount.blob.core.windows.net/${BACKUP_DIR} ${RESTORE_DIR}
http://hadoop.apache.org/docs/r2.7.3/hadoop-azure/index.html
26. Use Case in TenMax: Read / Write files from/to Azure Blob Storage
[Architecture diagram: a Spring Boot web application calls the Hadoop FileSystem abstraction layer, which is configured via core-site.xml and backed by Azure Blob Storage.]
Take Hadoop as a Java library to access Hybrid Cloud Storage.
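A minimal sketch of that pattern, using the stock Hadoop FileSystem API (the container/account names and paths are placeholders; credentials are assumed to be in core-site.xml as on the previous slides):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbDemo {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml from the classpath; the fs.azure.account.key.*
        // property must be set there.
        Configuration conf = new Configuration();
        // Placeholder container/account, same URI form as the distcp examples.
        FileSystem fs = FileSystem.get(
                URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"), conf);
        // Write a file into the container.
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.writeBytes("hello from the Hadoop FileSystem API\n");
        }
        // List what we just wrote.
        for (FileStatus status : fs.listStatus(new Path("/demo"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

Because the FileSystem API is scheme-agnostic, swapping wasb:// for s3a:// (or hdfs://) only changes the URI and core-site.xml, not the application code.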
30. CephFS installation
1. Compile https://github.com/ceph/cephfs-hadoop
2. Copy cephfs-hadoop.jar and place it at ${HADOOP_HOME}/lib/
3. Copy ceph.conf and ceph.client.${ID}.keyring to /etc/ceph
4. Copy cephfs-java.jar to ${HADOOP_HOME}/lib/
5. Copy the JNI-related files to ${HADOOP_HOME}/lib/native/ and create the symlinks:
ln -s libcephfs.so.1 /usr/lib/hadoop/lib/native/libcephfs.so
ln -s libcephfs_jni.so.1 /usr/lib/hadoop/lib/native/libcephfs_jni.so
http://docs.ceph.com/docs/master/cephfs/hadoop/
https://github.com/ceph/cephfs-hadoop
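Once the jars and native libraries are in place, CephFS still has to be wired into Hadoop; a minimal core-site.xml sketch following the Ceph Hadoop docs (the monitor host is a placeholder):

<property>
<name>fs.defaultFS</name>
<value>ceph://your-mon-host:6789/</value>
</property>
<property>
<name>fs.ceph.impl</name>
<value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
<name>ceph.conf.file</name>
<value>/etc/ceph/ceph.conf</value>
</property>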
38. WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
How to solve this issue?
The official documentation and source code say so (see the mapred-site.xml sketch below) ...
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/NativeLibraries.html#Native_Shared_Libraries
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml#L267
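Following that advice, a sketch of the fix in mapred-site.xml: move the native-library directory out of -Djava.library.path and into mapreduce.admin.user.env (the second path is an example; use wherever your libcephfs_jni.so actually lives):

<property>
<name>mapreduce.admin.user.env</name>
<!-- prepend your native-lib dir; this replaces -Djava.library.path
     in mapreduce.admin.map.child.java.opts -->
<value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:/usr/lib/hadoop/lib/native</value>
</property>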
39. Conclusion
▸ S3 and WASB are the most mature HCFS implementations.
▹ Sorry that I'm not sure about Google Cloud Storage :(
▸ You'll need more integration testing across the Hadoop ecosystem when using HCFS.
▸ Take Hadoop as an rsync tool to sync with Hybrid Cloud Storage.
▸ Take Hadoop as a Java library to access Hybrid Cloud Storage.
41. CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)
PRESENTATION DESIGN
This presentation uses the following typographies and colors:
▸ Titles: Montserrat
▸ Body copy: Karla
You can download the fonts on this page:
http://www.google.com/fonts/#UsePlace:use/Collection:Montserrat:400,700|Karla:400,400italic,700,700italic