HADOOP
Interacting with HDFS
1
For University Program on Apache Hadoop & Apache Apex
→ What's the “Need” ? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
2
→ Hadoop ←
❏ Open source software
- a Java framework
- first release (0.1.0): April 2006; version 1.0.0: December 2011
❏ It provides both,
❏ Storage → [HDFS]
❏ Processing → [MapReduce]
❏ HDFS: Hadoop Distributed File System
3
→ How Hadoop addresses the need? ←
❏ Big data Ocean
■ Have multiple machines. Each will store some portion of data, not the entire data.
❏ Expensive hardware
■ Use commodity hardware. Simple and cheap.
❏ Frequent Failures and Difficult recovery
■ Have multiple copies of data. Have the copies in different machines.
❏ Scaling up with more machines
■ If more processing is needed, add new machines on the fly
4
→ HDFS ←
❏ Runs on Commodity hardware: Doesn't require expensive machines
❏ Large Files; Write-once, Read-many (WORM)
❏ Files are split into blocks
❏ Actual blocks go to DataNodes
❏ The metadata is stored at NameNode
❏ Blocks are replicated to different nodes
❏ Default configuration:
■ Block size = 128MB
■ Replication Factor = 3
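A rough sketch of what these defaults mean in practice: a file is cut into ceil(size / block size) blocks, and every block is stored as many times as the replication factor. The helper names `block_count` and `raw_bytes` are made up for this illustration and are not part of Hadoop.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not Hadoop commands).

# Number of HDFS blocks needed for a file: ceil(file_size / block_size).
block_count() {
    file_size=$1
    block_size=$2
    echo $(( (file_size + block_size - 1) / block_size ))
}

# Raw bytes consumed across the cluster: file_size * replication factor.
# (A partially filled last block only uses as much disk as it holds data.)
raw_bytes() {
    file_size=$1
    replication=$2
    echo $(( file_size * replication ))
}

BLOCK_SIZE=$(( 128 * 1024 * 1024 ))   # 128 MB default from the slide
FILE_SIZE=$(( 300 * 1024 * 1024 ))    # an assumed 300 MB example file

echo "blocks:         $(block_count "$FILE_SIZE" "$BLOCK_SIZE")"
echo "block replicas: $(( $(block_count "$FILE_SIZE" "$BLOCK_SIZE") * 3 ))"
echo "raw bytes:      $(raw_bytes "$FILE_SIZE" 3)"
```

With these assumed numbers, the 300 MB file needs 3 blocks, which become 9 block replicas spread over the DataNodes.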
5
6
7
8
→ Where NOT TO use HDFS ←
❏ Low latency data access
■ HDFS is optimized for high throughput of data at the expense of latency.
❏ Large number of small files
■ The NameNode holds the entire file-system metadata in memory.
■ Many small files mean a lot of metadata relative to the actual data.
❏ Multiple writers / Arbitrary file modifications
■ No support for multiple concurrent writers to a file
■ Writes are append-only: data can be added only at the end of a file
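To see why many small files hurt, it helps to count what the NameNode has to keep track of. The sketch below assumes one metadata entry per file plus one per block and ignores directories; `metadata_objects` is a hypothetical helper, and the sizes are example values only.

```shell
#!/bin/sh
# Rough count of NameNode metadata objects for a data set (illustrative):
# one entry per file plus one entry per block, directories ignored.
# Arguments: total bytes, bytes per file, block size in bytes.
metadata_objects() {
    total=$1
    per_file=$2
    block_size=$3
    files=$(( (total + per_file - 1) / per_file ))
    blocks_per_file=$(( (per_file + block_size - 1) / block_size ))
    echo $(( files + files * blocks_per_file ))
}

MIB=$(( 1024 * 1024 ))
GIB=$(( 1024 * MIB ))
BLOCK=$(( 128 * MIB ))

echo "1 GiB as one file:    $(metadata_objects "$GIB" "$GIB" "$BLOCK") objects"
echo "1 GiB as 1 MiB files: $(metadata_objects "$GIB" "$MIB" "$BLOCK") objects"
```

Under these assumptions the same gigabyte costs 9 objects as one large file and 2048 objects as 1 MiB files, all of which must fit in NameNode memory.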
9
→ Some Key Concepts ←
❏ NameNode
❏ DataNodes
❏ JobTracker
❏ TaskTrackers
❏ ResourceManager (MRv2)
❏ NodeManager (MRv2)
❏ ApplicationMaster (MRv2)
10
→ NameNode & DataNodes ←
❏ NameNode:
■ Centerpiece of HDFS: The Master
■ Only stores the block metadata: block-name, block-location etc.
■ Critical component; When down, whole cluster is considered down; Single point of failure
■ Should be configured with higher RAM
❏ DataNode:
■ Stores the actual data: The Slave
■ In constant communication with NameNode
■ When one goes down, data stays available because its blocks are replicated on other DataNodes
■ Should be configured with higher disk space
❏ SecondaryNameNode:
■ Doesn't actually act as a NameNode, and is not a hot standby
■ Stores a checkpoint image of the primary NameNode's metadata, taken periodically
■ The checkpoint can be used to help restore the NameNode
11
12
→ JobTracker & TaskTrackers ←
❏ JobTracker:
■ Talks to the NameNode to determine location of the data
■ Monitors all TaskTrackers and submits status of the job back to the client
■ When down, HDFS is still functional; no new MR job; existing jobs halted
■ Replaced by ResourceManager/ApplicationMaster in MRv2
❏ TaskTracker:
■ Runs on all DataNodes
■ TaskTracker communicates with JobTracker signaling the task progress
■ TaskTracker failure is not considered fatal
■ Replaced by NodeManager in MRv2
13
→ ResourceManager & NodeManager ←
❏ Present in Hadoop v2.0
❏ Equivalent of JobTracker & TaskTracker in v1.0
❏ ResourceManager (RM):
■ Usually runs on the NameNode machine; distributes resources among applications.
■ Two main components: Scheduler and ApplicationsManager (not to be confused with the per-application ApplicationMaster)
❏ NodeManager (NM):
■ Per-node framework agent
■ Responsible for containers
■ Monitors their resource usage
■ Reports the stats to RM
The central ResourceManager and the per-node NodeManagers together form YARN
14
15
→ Hadoop 1.0 vs. 2.0 ←
❏ HDFS 1.0:
■ Single point of failure
■ Horizontal scaling performance issue
❏ HDFS 2.0:
■ HDFS High Availability
■ HDFS Snapshot
■ Improved performance
■ HDFS Federation
16
17
HDFS Federation
→ Interacting with HDFS ←
❏ Command prompt:
■ Similar to Linux terminal commands
■ Unix is the model, POSIX is the API
❏ Web Interface:
■ Similar to browsing an FTP site in a web browser
18
Interacting With HDFS
On Command Prompt
19
→ Notes ←
File Paths on HDFS:
■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
■ /user/USERNAME/demo/file.txt
■ demo/file.txt
File System:
■ Local: local file system (linux)
■ HDFS: hadoop file system
In some places:
The terms “file” and “directory” have the same meaning.
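The path forms above differ only in how much is spelled out. The sketch below shows how a short path could be expanded into a full URI, assuming the default file system is hdfs://localhost:8020 and the HDFS home directory is /user/USERNAME. `resolve_hdfs_path` is a hypothetical helper written for this note, not an HDFS command.

```shell
#!/bin/sh
# Expand an HDFS path into a full URI (illustrative only).
# Assumptions: the default file system and home directory set below.
DEFAULT_FS="hdfs://localhost:8020"
HDFS_HOME="/user/USERNAME"

resolve_hdfs_path() {
    case "$1" in
        hdfs://*) echo "$1" ;;                            # already a full URI
        /*)       echo "${DEFAULT_FS}$1" ;;               # absolute path
        *)        echo "${DEFAULT_FS}${HDFS_HOME}/$1" ;;  # relative to the home directory
    esac
}

resolve_hdfs_path "demo/file.txt"
resolve_hdfs_path "/user/USERNAME/demo/file.txt"
resolve_hdfs_path "hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt"
```

The first two calls print the same URI, which is why relative and absolute paths can be used interchangeably in the commands that follow.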
20
→ Before we start ←
❏ Command:
■ hdfs
❏ Usage:
■ hdfs [--config confdir] COMMAND
❏ Example:
■ hdfs dfs
■ hdfs dfsadmin
■ hdfs fsck
■ hdfs namenode
■ hdfs datanode
21
hdfs `dfs` commands
22
→ General syntax for `dfs` commands ←
hdfs dfs -<COMMAND> [-OPTIONS] <PARAMETERS>
e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/
23
0. Do It yourself
❏ Syntax:
■ hdfs dfs -help [COMMAND … ]
■ hdfs dfs -usage [COMMAND … ]
❏ Example:
■ hdfs dfs -help cat
■ hdfs dfs -usage cat
24
1. List the file/directory
❏ Syntax:
■ hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
❏ Example:
■ hdfs dfs -ls
■ hdfs dfs -ls /
■ hdfs dfs -ls /user/USERNAME/demo/list-dir-example
■ hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
25
2. Creating a directory
❏ Syntax:
■ hdfs dfs -mkdir [-p] <hdfs-dir-path>
❏ Example:
■ hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
■ hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
26
3. Create a file on local & put it on HDFS
❏ Syntax:
■ vi filename.txt
■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
❏ Example:
■ vi file-copy-to-hdfs.txt
■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
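The put step can be wrapped so that the target directory is created first. This is a minimal sketch: `upload_to_hdfs` is a hypothetical function name, and it uses only the `-mkdir -p` and `-put` commands shown in this deck.

```shell
#!/bin/sh
# Upload a local file into an HDFS directory, creating the directory first.
# Hypothetical helper; usage: upload_to_hdfs <local-file> <hdfs-dir>
upload_to_hdfs() {
    local_file=$1
    hdfs_dir=$2
    if [ ! -f "$local_file" ]; then
        echo "no such local file: $local_file" >&2
        return 1
    fi
    # -mkdir -p does not fail if the directory already exists.
    hdfs dfs -mkdir -p "$hdfs_dir" &&
    hdfs dfs -put "$local_file" "$hdfs_dir"
}

# Example call (needs a running cluster):
# upload_to_hdfs file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
```

Checking the local file first avoids creating an empty HDFS directory when the source is missing.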
27
4. Get a file from HDFS to local
❏ Syntax:
■ hdfs dfs -get <hdfs-file-path> [local-dir-path]
❏ Example:
■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
28
5. Copy From LOCAL To HDFS
❏ Syntax:
■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Example:
■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
29
6. Copy To LOCAL From HDFS
❏ Syntax:
■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
❏ Example:
■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
30
7. Move a file from local to HDFS
❏ Syntax:
■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
❏ Example:
■ hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
31
8. Copy a file within HDFS
❏ Syntax:
■ hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
■ hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
32
9. Move a file within HDFS
❏ Syntax:
■ hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
■ hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
33
10. Merge files on HDFS
❏ Syntax:
■ hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
❏ Examples:
■ hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
34
11. View file contents
❏ Syntax:
■ hdfs dfs -cat <hdfs-file-path>
■ hdfs dfs -tail <hdfs-file-path>
■ hdfs dfs -text <hdfs-file-path>
❏ Examples:
■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
35
12. Remove files/dirs from HDFS
❏ Syntax:
■ hdfs dfs -rm [options] <hdfs-file-path>
❏ Examples:
■ hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
■ hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
■ hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
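Recursive deletes are easy to aim at the wrong path, so a small guard can help. The sketch below is one possible policy, not a Hadoop feature: `safe_rm` is a hypothetical wrapper that only accepts paths at least one level below a user's home directory and never passes -skipTrash.

```shell
#!/bin/sh
# Guarded recursive delete (hypothetical helper; usage: safe_rm <hdfs-path>).
# Policy assumed here: only paths below /user/<name>/ may be removed.
safe_rm() {
    target=$1
    case "$target" in
        /user/*/?*) ;;   # something below a user's home directory: allowed
        *)
            echo "refusing to remove: $target" >&2
            return 1
            ;;
    esac
    # No -skipTrash, so the data stays recoverable if trash is enabled.
    hdfs dfs -rm -R "$target"
}

# Example call (needs a running cluster):
# safe_rm /user/USERNAME/demo/remove-example/
```

The pattern check is deliberately simple and does not normalise paths containing `..`, so treat it as a starting point only.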
36
13. Change file/dir properties
❏ Syntax:
■ hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
■ hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
■ hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
❏ Examples:
■ hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
37
14. Check the file size
❏ Syntax:
■ hdfs dfs -du <hdfs-file-path>
❏ Examples:
■ hdfs dfs -du /user/USERNAME/demo/data/file.txt
■ hdfs dfs -du -s -h /user/USERNAME/demo/data/
38
15. Create a zero byte file in HDFS
❏ Syntax:
■ hdfs dfs -touchz <hdfs-file-path>
❏ Examples:
■ hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
39
16. File test operations
❏ Syntax:
■ hdfs dfs -test -[defsz] <hdfs-file-path>
❏ Examples:
■ hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
■ echo $?   (prints 0 if the test succeeded, non-zero otherwise)
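Because `-test` reports its result through the exit status, it can be used directly in an `if`. A minimal sketch, with `ensure_hdfs_dir` as a hypothetical helper name:

```shell
#!/bin/sh
# Create an HDFS directory only if it does not exist yet.
# `hdfs dfs -test -d <path>` exits with 0 when the path is a directory.
# Hypothetical helper; usage: ensure_hdfs_dir <hdfs-dir>
ensure_hdfs_dir() {
    dir=$1
    if hdfs dfs -test -d "$dir"; then
        echo "exists: $dir"
    else
        hdfs dfs -mkdir -p "$dir" && echo "created: $dir"
    fi
}

# Example call (needs a running cluster):
# ensure_hdfs_dir /user/USERNAME/demo/data
```

Using the command in the `if` condition avoids a separate `echo $?` step in scripts.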
40
17. Get FileSystem Statistics
❏ Syntax:
■ hdfs dfs -stat [format] <hdfs-file-path>
❏ Format Options:
■ %b - file size in bytes
■ %g - group name of owner
■ %n - filename
■ %o - block size
■ %r - replication
■ %u - user name of owner
■ %y - modification date
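The format options can be combined in one call and the result parsed in a script. The sketch below assumes that %b reports the file size in bytes (as current Hadoop releases do) and %o the block size; `file_blocks` is a hypothetical helper that derives the number of blocks from the two values.

```shell
#!/bin/sh
# Print how many blocks a file occupies, from the size (%b) and block size (%o)
# reported by `hdfs dfs -stat`. Hypothetical helper; usage: file_blocks <hdfs-file>
file_blocks() {
    path=$1
    stats=$(hdfs dfs -stat "%b %o" "$path") || return 1
    size=${stats% *}     # text before the space: file size in bytes
    block=${stats#* }    # text after the space: block size in bytes
    if [ "$block" -le 0 ]; then
        echo 0           # e.g. a directory, which has no block size
        return 0
    fi
    echo $(( (size + block - 1) / block ))
}

# Example call (needs a running cluster):
# file_blocks /user/USERNAME/demo/data/file.txt
```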
41
18. Get File/Dir Counts
❏ Syntax:
■ hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
❏ Example:
■ hdfs dfs -count -v /user/USERNAME/demo/
42
19. Set replication factor
❏ Syntax:
■ hdfs dfs -setrep [-R] [-w] <numReplicas> <hdfs-file-path>
❏ Examples:
■ hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
43
20. Set Block Size
❏ Syntax:
■ hdfs dfs -D dfs.blocksize=blocksize -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Examples:
■ hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
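The block size in the example is given in bytes: 67108864 is 64 MB. A tiny conversion helper (hypothetical name `mb_to_bytes`) makes such values easier to read and check:

```shell
#!/bin/sh
# Convert megabytes to the byte count used for dfs.blocksize in the example.
# Hypothetical helper, not part of Hadoop.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 64     # the value used in the example above
mb_to_bytes 128    # the default block size

# Possible use (needs a running cluster):
# hdfs dfs -D dfs.blocksize="$(mb_to_bytes 64)" -copyFromLocal /path/to/file.txt \
#     /user/USERNAME/demo/block-example/
```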
44
21. Empty the HDFS trash
❏ Syntax:
■ hdfs dfs -expunge
❏ Location of the trash:
■ /user/USERNAME/.Trash
45
Other hdfs commands (admin)
46
22. HDFS Admin Commands: fsck
❏ Syntax:
■ hdfs fsck <hdfs-file-path> [options]
❏ Options:
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]
47
48
23. HDFS Admin Commands: dfsadmin
❏ Syntax:
■ hdfs dfsadmin
❏ Options:
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-refreshNodes]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-shutdownDatanode <datanode:port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-help [cmd]]
❏ Examples:
■ hdfs dfsadmin -report -live
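Scripts often need to know whether the NameNode has left safe mode before writing data. A minimal sketch, assuming the output of `hdfs dfsadmin -safemode get` contains the text "Safe mode is OFF" when safe mode is off; `safemode_is_off` is a hypothetical helper.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the NameNode reports that safe mode is off.
# Assumption: the command prints a line containing "Safe mode is OFF".
safemode_is_off() {
    hdfs dfsadmin -safemode get | grep -q "Safe mode is OFF"
}

# Example use (needs a running cluster):
# if safemode_is_off; then echo "cluster accepts writes"; fi
```

For blocking until safe mode ends, the `-safemode wait` option listed above can be used instead of polling.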
49
50
24. HDFS Admin Commands: namenode
❏ Syntax:
■ hdfs namenode
❏ Options:
[-checkpoint] |
[-format [-clusterid cid ] [-force] [-nonInteractive] ] |
[-upgrade [-clusterid cid] ] |
[-rollback] |
[-recover [-force] ] |
[-metadataVersion ]
❏ Examples:
■ hdfs namenode -help
51
25. HDFS Admin Commands: getconf
❏ Syntax:
■ hdfs getconf [-options]
❏ Options:
[ -namenodes ] [ -secondaryNameNodes ]
[ -backupNodes ] [ -includeFile ]
[ -excludeFile ] [ -nnRpcAddresses ]
[ -confKey [key] ]
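The `-confKey` option reads a single setting from the client configuration, which is a quick way to check the defaults a cluster really uses. A small sketch with a hypothetical wrapper `show_hdfs_defaults`; `dfs.blocksize` and `dfs.replication` are the standard property names for the block size and replication factor.

```shell
#!/bin/sh
# Print the configured block size and replication factor.
# Hypothetical helper built on `hdfs getconf -confKey <key>`.
show_hdfs_defaults() {
    echo "block size:  $(hdfs getconf -confKey dfs.blocksize)"
    echo "replication: $(hdfs getconf -confKey dfs.replication)"
}

# Example call (needs a configured Hadoop client):
# show_hdfs_defaults
```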
52
Again: THE most important command!
❏ Syntax:
■ hdfs dfs -help [COMMAND … ]
■ hdfs dfs -usage [COMMAND … ]
❏ Examples:
■ hdfs dfs -help help
■ hdfs dfs -usage usage
53
Interacting With HDFS
In Web Browser
54
HDFS Web Interface (NameNode file browser)
URL:
http://namenode:50070/explorer.html (50070 is the Hadoop 2.x default port; Hadoop 3.x uses 9870)
Examples:
http://localhost:50070/explorer.html
http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
55
References
1. http://www.hadoopinrealworld.com
2. http://www.slideshare.net/sanjeeb85/hdfscommandreference
3. http://www.slideshare.net/jaganadhg/hdfs-10509123
4. http://www.slideshare.net/praveenbhat2/adv-os-presentation
5. http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
6. http://www.snia.org/sites/default/files/Hadoop2_New_And_Noteworthy_SNIA_v3.pdf
7. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
8. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
9. http://hadoop.apache.org/docs/r1.2.1/distcp.html
56
Thank You!!
Please send your questions to:
pradeep@datatorrent.com
pradeep.n.kumbhar@gmail.com
57
© 2016 DataTorrent
Resources
58
• Apache Apex website - http://apex.incubator.apache.org/
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Startup Program – Free Enterprise License for Startups, Educational Institutions,
Non-Profits - https://www.datatorrent.com/product/startup-accelerator/
• Cloud Trial - http://web.datatorrent.com/cloudtrial.html
APPENDIX
61
Copy data between HDFS clusters (distcp)
❏ Description:
■ Copy data between clusters
❏ Syntax:
■ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
■ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
■ hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
Where srclist.file contains:
■ hdfs://nn1:8020/foo/a
■ hdfs://nn1:8020/foo/b
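The `-f` form needs a source list that is itself reachable by URI. The sketch below shows one way to build it: write one source URI per line to a temporary local file, upload it with `hdfs dfs -put -f`, then pass its URI to distcp. `distcp_from_list` is a hypothetical helper, and the nn1/nn2 addresses are the placeholders from the slide.

```shell
#!/bin/sh
# Build a source list, publish it to HDFS, and run distcp with -f.
# Hypothetical helper; usage: distcp_from_list <list-uri> <dest-uri> <source-uri>...
distcp_from_list() {
    list_uri=$1
    dest=$2
    shift 2
    tmp=$(mktemp) || return 1
    printf '%s\n' "$@" > "$tmp"            # one source URI per line
    hdfs dfs -put -f "$tmp" "$list_uri" &&  # -f overwrites an older list
    hadoop distcp -f "$list_uri" "$dest"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example call (needs two reachable clusters):
# distcp_from_list hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo \
#     hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b
```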
62

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Konstantin V. Shvachko
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Anand Kulkarni
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Vaibhav Jain
 

What's hot (20)

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
 
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - RedisStorage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hdfs architecture
Hdfs architectureHdfs architecture
Hdfs architecture
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop Interacting with HDFS
Hadoop Interacting with HDFSHadoop Interacting with HDFS
Hadoop Interacting with HDFS
 
HDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once SemanticsHDFS Trunncate: Evolving Beyond Write-Once Semantics
HDFS Trunncate: Evolving Beyond Write-Once Semantics
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
 
March 2011 HUG: HDFS Federation
March 2011 HUG: HDFS FederationMarch 2011 HUG: HDFS Federation
March 2011 HUG: HDFS Federation
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop and HDFS
Hadoop and HDFSHadoop and HDFS
Hadoop and HDFS
 
HDFS_Command_Reference
HDFS_Command_ReferenceHDFS_Command_Reference
HDFS_Command_Reference
 
HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File System
 

Similar to Interacting with hdfs

Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
Hafizur Rahman
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Simplilearn
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
Edureka!
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
DataWorks Summit
 
5c_BigData_Hadoop_HDFS.PPTX
5c_BigData_Hadoop_HDFS.PPTX5c_BigData_Hadoop_HDFS.PPTX
5c_BigData_Hadoop_HDFS.PPTX
Miguel720844
 

Similar to Interacting with hdfs (20)

Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
MapReduce1.pptx
MapReduce1.pptxMapReduce1.pptx
MapReduce1.pptx
 
Data analysis on hadoop
Data analysis on hadoopData analysis on hadoop
Data analysis on hadoop
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Basic command of hadoop
Basic command of hadoopBasic command of hadoop
Basic command of hadoop
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
 
Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop, Map Reduce and Apache Pig tutorial
Hadoop, Map Reduce and Apache Pig tutorialHadoop, Map Reduce and Apache Pig tutorial
Hadoop, Map Reduce and Apache Pig tutorial
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
 
Hadoop at a glance
Hadoop at a glanceHadoop at a glance
Hadoop at a glance
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
5c_BigData_Hadoop_HDFS.PPTX
5c_BigData_Hadoop_HDFS.PPTX5c_BigData_Hadoop_HDFS.PPTX
5c_BigData_Hadoop_HDFS.PPTX
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 

Recently uploaded (20)

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 

Interacting with hdfs

  • 1. HADOOP Interacting with HDFS 1 For University Program on Apache Hadoop & Apache Apex
  • 2. → What's the “Need” ? ← ❏ Big data Ocean ❏ Expensive hardware ❏ Frequent Failures and Difficult recovery ❏ Scaling up with more machines 2
  • 3. → Hadoop ← ❏ Open source software - a Java framework - initial release: December 10, 2011 ❏ It provides both, ❏ Storage → [HDFS] ❏ Processing → [MapReduce] ❏ HDFS: Hadoop Distributed File System 3
  • 4. → How Hadoop addresses the need? ← ❏ Big data Ocean ■ Have multiple machines. Each will store some portion of data, not the entire data. ❏ Expensive hardware ■ Use commodity hardware. Simple and cheap. ❏ Frequent Failures and Difficult recovery ■ Have multiple copies of data. Have the copies in different machines. ❏ Scaling up with more machines ■ If more processing is needed, add new machines on the fly 4
  • 5. → HDFS ← ❏ Runs on Commodity hardware: Doesn't require expensive machines ❏ Large Files; Write-once, Read-many (WORM) ❏ Files are split into blocks ❏ Actual blocks go to DataNodes ❏ The metadata is stored at NameNode ❏ Replicate blocks to different node ❏ Default configuration: ■ Block size = 128MB ■ Replication Factor = 3 5
  • 6. 6
  • 7. 7
  • 8. 8
  • 9. → Where NOT TO use HDFS ← ❏ Low latency data access ■ HDFS is optimized for high throughput of data at the expense of latency. ❏ Large number of small files ■ Namenode has the entire file-system metadata in memory. ■ Too much metadata as compared to actual data. ❏ Multiple writers / Arbitrary file modifications ■ No support for multiple writers for a file ■ Always append to end of a file 9
  • 10. → Some Key Concepts ← ❏ NameNode ❏ DataNodes ❏ JobTracker ❏ TaskTrackers ❏ ResourceManager (MRv2) ❏ NodeManager (MRv2) ❏ ApplicationMaster (MRv2) 10
  • 11. → NameNode & DataNodes ← ❏ NameNode: ■ Centerpiece of HDFS: The Master ■ Only stores the block metadata: block-name, block-location etc. ■ Critical component; When down, whole cluster is considered down; Single point of failure ■ Should be configured with higher RAM ❏ DataNode: ■ Stores the actual data: The Slave ■ In constant communication with NameNode ■ When down, it does not affect the availability of data/cluster ■ Should be configured with higher disk space ❏ SecondaryNameNode: ■ Doesn't actually act as a NameNode ■ Stores the image of primary NameNode at certain checkpoint ■ Used as backup to restore NameNode 11
  • 12. 12
  • 13. → JobTracker & TaskTrackers ← ❏ JobTracker: ■ Talks to the NameNode to determine location of the data ■ Monitors all TaskTrackers and submits status of the job back to the client ■ When down, HDFS is still functional; no new MR job; existing jobs halted ■ Replaced by ResourceManager/ApplicationMaster in MRv2 ❏ TaskTracker: ■ Runs on all DataNodes ■ TaskTracker communicates with JobTracker signaling the task progress ■ TaskTracker failure is not considered fatal ■ Replaced by NodeManager in MRv2 13
  • 14. → ResourceManager & NodeManager ← ❏ Present in Hadoop v2.0 ❏ Equivalent of JobTracker & TaskTracker in v1.0 ❏ ResourceManager (RM): ■ Runs usually at NameNode; Distributes resources among applications. ■ Two main components: Scheduler and ApplicationsManager (AM) ❏ NodeManager (NM): ■ Per-node framework agent ■ Responsible for containers ■ Monitors their resource usage ■ Reports the stats to RM Central ResourceManager and Node specific Manager together is called YARN 14
  • 15. 15
  • 16. → Hadoop 1.0 vs. 2.0 ← ❏ HDFS 1.0: ■ Single point of failure ■ Horizontal scaling performance issue ❏ HDFS 2.0: ■ HDFS High Availability ■ HDFS Snapshot ■ Improved performance ■ HDFS Federation 16
  • 18. → Interacting with HDFS ← ❏ Command prompt: ■ Similar to Linux terminal commands ■ Unix is the model, POSIX is the API ❏ Web Interface: ■ Similar to browsing a FTP site on web 18
  • 19. Interacting With HDFS On Command Prompt 19
  • 20. → Notes ← File Paths on HDFS: ■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt ■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt ■ /user/USERNAME/demo/file.txt ■ demo/file.txt File System: ■ Local: local file system (linux) ■ HDFS: hadoop file system At some places: The terms “file” and “directory” has the same meaning. 20
  • 21. → Before we start ← ❏ Command: ■ hdfs ❏ Usage: ■ hdfs [--config confdir] COMMAND ❏ Example: ■ hdfs dfs ■ hdfs dfsadmin ■ hdfs fsck ■ hdfs namenode ■ hdfs datanode 21
  • 23. → In general Syntax for `dfs` commands ← hdfs dfs -<COMMAND> -[OPTIONS] <PARAMETERS> e.g. hdfs dfs -ls -R /user/USERNAME/demo/data/ 23
  • 24. 0. Do It yourself ❏ Syntax: ■ hdfs dfs -help [COMMAND … ] ■ hdfs dfs -usage [COMMAND … ] ❏ Example: ■ hdfs dfs -help cat ■ hdfs dfs -usage cat 24
  • 25. 1. List the file/directory ❏ Syntax: ■ hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path> ❏ Example: ■ hdfs dfs -ls ■ hdfs dfs -ls / ■ hdfs dfs -ls /user/USERNAME/demo/list-dir-example ■ hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example 25
  • 26. 2. Creating a directory ❏ Syntax: ■ hdfs dfs -mkdir [-p] <hdfs-dir-path> ❏ Example: ■ hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example ■ hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir- example/dir1/dir2/dir3 26
  • 27. 3. Create a file on local & put it on HDFS ❏ Syntax: ■ vi filename.txt ■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path> ❏ Example: ■ vi file-copy-to-hdfs.txt ■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put- example/ 27
  • 28. 4. Get a file from HDFS to local ❏ Syntax: ■ hdfs dfs -get <hdfs-file-path> [local-dir-path] ❏ Example: ■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from- hdfs.txt ~/demo/ 28
  • 29. 5. Copy From LOCAL To HDFS ❏ Syntax: ■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path> ❏ Example: ■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/ 29
  • 30. 6. Copy To LOCAL From HDFS ❏ Syntax: ■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path> ❏ Example: ■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal- example/file-copy-from-hdfs.txt ~/demo/ 30
  • 31. 7. Move a file from local to HDFS ❏ Syntax: ■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path> ❏ Example: ■ hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/ 31
  • 32. 8. Copy a file within HDFS ❏ Syntax: ■ hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path> ❏ Example: ■ hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/ 32
  • 33. 9. Move a file within HDFS ❏ Syntax: ■ hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path> ❏ Example: ■ hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/ 33
  • 34. 10. Merge files on HDFS ❏ Syntax: ■ hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path> ❏ Examples: ■ hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt 34
  • 35. 11. View file contents ❏ Syntax: ■ hdfs dfs -cat <hdfs-file-path> ■ hdfs dfs -tail <hdfs-file-path> ■ hdfs dfs -text <hdfs-file-path> ❏ Examples: ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt ■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head 35
  • 36. 12. Remove files/dirs from HDFS ❏ Syntax: ■ hdfs dfs -rm [options] <hdfs-file-path> ❏ Examples: ■ hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt ■ hdfs dfs -rm -R /user/USERNAME/demo/remove-example/ ■ hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
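Recursive removal is easy to get wrong, so a script might guard it. The sketch below is one possible wrapper, not a Hadoop feature: safe_hdfs_rm is a hypothetical name, and the list of refused paths is only an example. It deliberately omits -skipTrash so that, when trash is enabled, the data can still be recovered.

```shell
# Remove an HDFS path recursively, but only if it exists and is not a
# top-level path. safe_hdfs_rm is a hypothetical helper name.
safe_hdfs_rm() {
    target="$1"
    case "$target" in
        ""|/|/user|/user/)
            echo "refusing to remove: '$target'" >&2
            return 1
            ;;
    esac
    if hdfs dfs -test -e "$target"; then
        # No -skipTrash here: with trash enabled the path moves to trash
        hdfs dfs -rm -r "$target"
    else
        echo "nothing to remove: $target"
    fi
}
```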
  • 37. 13. Change file/dir properties ❏ Syntax: ■ hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path> ■ hdfs dfs -chmod [-R] <permissions> <hdfs-file-path> ■ hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path> ❏ Examples: ■ hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
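The <permissions> argument in the chmod example is an octal mode. As a small aid for reading such values, the following sketch translates a three-digit octal mode into the familiar rwx notation (owner, group, others). The function name octal_to_rwx is made up for this illustration.

```shell
# Translate a three-digit octal mode such as 750 into rwx notation.
# octal_to_rwx is a hypothetical helper name.
octal_to_rwx() {
    mode="$1"
    result=""
    # Split the mode into single digits: "750" becomes "7 5 0"
    for digit in $(printf '%s' "$mode" | sed 's/./& /g'); do
        case "$digit" in
            0) bits="---" ;; 1) bits="--x" ;; 2) bits="-w-" ;; 3) bits="-wx" ;;
            4) bits="r--" ;; 5) bits="r-x" ;; 6) bits="rw-" ;; 7) bits="rwx" ;;
            *) echo "not an octal digit: $digit" >&2; return 1 ;;
        esac
        result="$result$bits"
    done
    printf '%s\n' "$result"
}
```

For instance, octal_to_rwx 777 prints rwxrwxrwx, which shows why 777 is rarely a good choice outside of a demo: everyone can read, write, and execute.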
  • 38. 14. Check the file size ❏ Syntax: ■ hdfs dfs -du <hdfs-file-path> ❏ Examples: ■ hdfs dfs -du /user/USERNAME/demo/data/file.txt ■ hdfs dfs -du -s -h /user/USERNAME/demo/data/
  • 39. 15. Create a zero byte file in HDFS ❏ Syntax: ■ hdfs dfs -touchz <hdfs-file-path> ❏ Examples: ■ hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
  • 40. 16. File test operations ❏ Syntax: ■ hdfs dfs -test -[defsz] <hdfs-file-path> ❏ Examples: ■ hdfs dfs -test -e /user/USERNAME/demo/data/file.txt ■ echo $? (prints 0 if the test passed, non-zero otherwise)
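Because -test reports its result only through the exit status, it is usually used inside an if statement rather than followed by echo $?. The sketch below combines the -e (exists), -d (is a directory) and -z (is zero length) checks; describe_hdfs_path is a hypothetical helper name.

```shell
# Report what kind of path something is, using the exit status of -test.
# describe_hdfs_path is a hypothetical helper name.
describe_hdfs_path() {
    p="$1"
    if ! hdfs dfs -test -e "$p"; then
        echo "missing"
    elif hdfs dfs -test -d "$p"; then
        echo "directory"
    elif hdfs dfs -test -z "$p"; then
        echo "empty file"
    else
        echo "file"
    fi
}
```

Each check starts a separate client process, so this is fine for a handful of paths but slow in a loop over thousands of files.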
  • 41. 17. Get FileSystem Statistics ❏ Syntax: ■ hdfs dfs -stat [format] <hdfs-file-path> ❏ Format Options: ■ %b - file size in bytes, %g - group name of owner ■ %n - filename, %o - block size ■ %r - replication, %u - user name of owner ■ %y - modification date
  • 42. 18. Get File/Dir Counts ❏ Syntax: ■ hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path> ❏ Example: ■ hdfs dfs -count -v /user/USERNAME/demo/
  • 43. 19. Set replication factor ❏ Syntax: ■ hdfs dfs -setrep -w -R n <hdfs-file-path> ❏ Examples: ■ hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
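Changing the replication factor changes how much raw disk space a file consumes across the cluster: every block is stored once per replica. A rough sketch of that arithmetic follows; raw_bytes is a hypothetical helper name and the sizes used are illustrative only.

```shell
# Raw cluster storage used by a file = logical size x replication factor.
# raw_bytes is a hypothetical helper name.
raw_bytes() {
    size_bytes="$1"
    replication="$2"
    echo $(( size_bytes * replication ))
}
```

For a 1000-byte file, raw_bytes 1000 3 gives 3000 and raw_bytes 1000 2 gives 2000, so lowering the factor from the default of 3 to 2 frees one third of the raw space at the cost of tolerating one fewer lost replica.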
  • 44. 20. Set Block Size ❏ Syntax: ■ hdfs dfs -D dfs.blocksize=blocksize -copyFromLocal <local-file-path> <hdfs-file-path> ❏ Examples: ■ hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
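The value 67108864 in the example is 64 MB expressed in bytes (64 x 1024 x 1024). The block size determines how many blocks a file is split into, which is a ceiling division. The sketch below shows that calculation; block_count is a hypothetical helper name.

```shell
# Number of HDFS blocks a file occupies: ceil(file_size / block_size).
# block_count is a hypothetical helper name.
block_count() {
    file_bytes="$1"
    block_bytes="$2"
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}
```

A 200 MB file (209715200 bytes) needs 4 blocks at a 64 MB block size (block_count 209715200 67108864) but only 2 blocks at the default 128 MB (block_count 209715200 134217728). The last block only takes up as much space as the data it holds.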
  • 45. 21. Empty the HDFS trash ❏ Syntax: ■ hdfs dfs -expunge ❏ Location:
  • 46. Other hdfs commands (admin)
  • 47. 22. HDFS Admin Commands: fsck ❏ Syntax: ■ hdfs fsck <hdfs-file-path> ❏ Options: [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-includeSnapshots]
  • 49. 23. HDFS Admin Commands: dfsadmin ❏ Syntax: ■ hdfs dfsadmin ❏ Options: [-report [-live] [-dead] [-decommissioning]] [-safemode enter | leave | get | wait] [-refreshNodes] [-refresh <host:ipc_port> <key> [arg1..argn]] [-shutdownDatanode <datanode:port> [upgrade]] [-getDatanodeInfo <datanode_host:ipc_port>] [-help [cmd]] ❏ Examples: ■ hdfs dfsadmin -report -live
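As an illustration of using one of these options from a script, the sketch below checks whether the NameNode is in safe mode (read-only) before a job tries to write. It assumes that hdfs dfsadmin -safemode get prints a line containing "Safe mode is ON" when safe mode is active; verify the exact message on your Hadoop version. safemode_is_on is a hypothetical helper name.

```shell
# Succeed (exit status 0) if the NameNode reports that safe mode is on.
# Assumes the output contains the text "Safe mode is ON"; check this
# against your Hadoop version. safemode_is_on is a hypothetical name.
safemode_is_on() {
    hdfs dfsadmin -safemode get | grep -q 'Safe mode is ON'
}
```

A script could then run: if safemode_is_on; then echo "HDFS is read-only, try later"; fi. To simply block until safe mode ends, the -safemode wait option listed above already does that.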
  • 51. 24. HDFS Admin Commands: namenode ❏ Syntax: ■ hdfs namenode ❏ Options: [-checkpoint] | [-format [-clusterid cid] [-force] [-nonInteractive]] | [-upgrade [-clusterid cid]] | [-rollback] | [-recover [-force]] | [-metadataVersion] ❏ Examples: ■ hdfs namenode -help
  • 52. 25. HDFS Admin Commands: getconf ❏ Syntax: ■ hdfs getconf [-options] ❏ Options: [-namenodes] [-secondaryNameNodes] [-backupNodes] [-includeFile] [-excludeFile] [-nnRpcAddresses] [-confKey [key]]
  • 53. Again, THE most important command!! ❏ Syntax: ■ hdfs dfs -help [options] ■ hdfs dfs -usage [options] ❏ Examples: ■ hdfs dfs -help help ■ hdfs dfs -usage usage
  • 54. Interacting With HDFS In Web Browser
  • 57. Thank You!! Please send your questions to: pradeep@datatorrent.com, pradeep.n.kumbhar@gmail.com
  • 58. © 2016 DataTorrent Resources • Apache Apex website - http://apex.incubator.apache.org/ • Subscribe - http://apex.incubator.apache.org/community.html • Download - http://apex.incubator.apache.org/downloads.html • Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex • Facebook - https://www.facebook.com/ApacheApex/ • Meetup - http://www.meetup.com/topics/apache-apex • Startup Program - Free Enterprise License for Startups, Educational Institutions, Non-Profits - https://www.datatorrent.com/product/startup-accelerator/ • Cloud Trial - http://web.datatorrent.com/cloudtrial.html
  • 62. Copy data from one cluster to another with distcp ❏ Description: ■ Copy data between clusters ❏ Syntax: ■ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo ■ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo ■ hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo ❏ Where srclist.file contains: ■ hdfs://nn1:8020/foo/a ■ hdfs://nn1:8020/foo/b
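The source list used with -f is just a text file with one source URI per line. A small sketch of how such a file might be generated locally before it is uploaded is shown below; make_srclist is a hypothetical helper name, and the nn1/nn2 addresses are the placeholder hosts from the slide.

```shell
# Write one source URI per line into a list file for "hadoop distcp -f".
# make_srclist is a hypothetical helper name.
make_srclist() {
    list_file="$1"
    shift
    : > "$list_file"                      # truncate or create the list file
    for src in "$@"; do
        printf '%s\n' "$src" >> "$list_file"
    done
}

# Possible follow-up steps (not run here, placeholder hosts from the slide):
#   make_srclist srclist.file hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b
#   hdfs dfs -put srclist.file hdfs://nn1:8020/srclist.file
#   hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
```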