Interacting with hdfs

HADOOP
Interacting with HDFS
1
For University Program on Apache Hadoop & Apache Apex

→ What's the “Need” ? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
2

→ Hadoop ←
❏ Open source software
- a Java framework
- initial release: December 10, 2011
❏ It provides both,
❏ Storage → [HDFS]
❏ Processing → [MapReduce]
❏ HDFS: Hadoop Distributed File System
3

→ How Hadoop addresses the need? ←
❏ Big data Ocean
■ Have multiple machines. Each will store some portion of data, not the entire data.
❏ Expensive hardware
■ Use commodity hardware. Simple and cheap.
❏ Frequent Failures and Difficult recovery
■ Have multiple copies of data. Have the copies in different machines.
❏ Scaling up with more machines
■ If more processing is needed, add new machines on the fly
4

→ HDFS ←
❏ Runs on Commodity hardware: Doesn't require expensive machines
❏ Large Files; Write-once, Read-many (WORM)
❏ Files are split into blocks
❏ Actual blocks go to DataNodes
❏ The metadata is stored at NameNode
❏ Replicate blocks to different node
❏ Default configuration:
■ Block size = 128MB
■ Replication Factor = 3
5

→ Where NOT TO use HDFS ←
❏ Low latency data access
■ HDFS is optimized for high throughput of data at the expense of latency.
❏ Large number of small files
■ Namenode has the entire file-system metadata in memory.
■ Too much metadata as compared to actual data.
❏ Multiple writers / Arbitrary file modifications
■ No support for multiple writers for a file
■ Always append to end of a file
9

→ Some Key Concepts ←
❏ NameNode
❏ DataNodes
❏ JobTracker
❏ TaskTrackers
❏ ResourceManager (MRv2)
❏ NodeManager (MRv2)
❏ ApplicationMaster (MRv2)
10

→ NameNode & DataNodes ←
❏ NameNode:
■ Centerpiece of HDFS: The Master
■ Only stores the block metadata: block-name, block-location etc.
■ Critical component; When down, whole cluster is considered down; Single point of failure
■ Should be configured with higher RAM
❏ DataNode:
■ Stores the actual data: The Slave
■ In constant communication with NameNode
■ When down, it does not affect the availability of data/cluster
■ Should be configured with higher disk space
❏ SecondaryNameNode:
■ Doesn't actually act as a NameNode
■ Stores the image of primary NameNode at certain checkpoint
■ Used as backup to restore NameNode
11

→ JobTracker & TaskTrackers ←
❏ JobTracker:
■ Talks to the NameNode to determine location of the data
■ Monitors all TaskTrackers and submits status of the job back to the client
■ When down, HDFS is still functional; no new MR job; existing jobs halted
■ Replaced by ResourceManager/ApplicationMaster in MRv2
❏ TaskTracker:
■ Runs on all DataNodes
■ TaskTracker communicates with JobTracker signaling the task progress
■ TaskTracker failure is not considered fatal
■ Replaced by NodeManager in MRv2
13

→ ResourceManager & NodeManager ←
❏ Present in Hadoop v2.0
❏ Equivalent of JobTracker & TaskTracker in v1.0
❏ ResourceManager (RM):
■ Runs usually at NameNode; Distributes resources among applications.
■ Two main components: Scheduler and ApplicationsManager (AM)
❏ NodeManager (NM):
■ Per-node framework agent
■ Responsible for containers
■ Monitors their resource usage
■ Reports the stats to RM
Central ResourceManager and Node specific Manager together is called YARN
14

→ Hadoop 1.0 vs. 2.0 ←
❏ HDFS 1.0:
■ Single point of failure
■ Horizontal scaling performance issue
❏ HDFS 2.0:
■ HDFS High Availability
■ HDFS Snapshot
■ Improved performance
■ HDFS Federation
16

→ Interacting with HDFS ←
❏ Command prompt:
■ Similar to Linux terminal commands
■ Unix is the model, POSIX is the API
❏ Web Interface:
■ Similar to browsing a FTP site on web
18

Interacting With HDFS
On Command Prompt
19

→ Notes ←
File Paths on HDFS:
■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
■ /user/USERNAME/demo/file.txt
■ demo/file.txt
File System:
■ Local: local file system (linux)
■ HDFS: hadoop file system
At some places:
The terms “file” and “directory” has the same meaning.
20

→ Before we start ←
❏ Command:
■ hdfs
❏ Usage:
■ hdfs [--config confdir] COMMAND
❏ Example:
■ hdfs dfs
■ hdfs dfsadmin
■ hdfs fsck
■ hdfs namenode
■ hdfs datanode
21

→ In general Syntax for `dfs` commands ←
hdfs
dfs
-<COMMAND>
-[OPTIONS]
<PARAMETERS>
e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/
23

0. Do It yourself
❏ Syntax:
■ hdfs dfs -help [COMMAND … ]
■ hdfs dfs -usage [COMMAND … ]
❏ Example:
■ hdfs dfs -help cat
■ hdfs dfs -usage cat
24

1. List the file/directory
❏ Syntax:
■ hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
❏ Example:
■ hdfs dfs -ls
■ hdfs dfs -ls /
■ hdfs dfs -ls /user/USERNAME/demo/list-dir-example
■ hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
25

2. Creating a directory
❏ Syntax:
■ hdfs dfs -mkdir [-p] <hdfs-dir-path>
❏ Example:
■ hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
■ hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-
example/dir1/dir2/dir3
26

3. Create a file on local & put it on HDFS
❏ Syntax:
■ vi filename.txt
■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
❏ Example:
■ vi file-copy-to-hdfs.txt
■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-
example/
27

4. Get a file from HDFS to local
❏ Syntax:
■ hdfs dfs -get <hdfs-file-path> [local-dir-path]
❏ Example:
■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-
hdfs.txt ~/demo/
28

5. Copy From LOCAL To HDFS
❏ Syntax:
■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Example:
■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt
/user/USERNAME/demo/copyFromLocal-example/
29

6. Copy To LOCAL From HDFS
❏ Syntax:
■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
❏ Example:
■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-
example/file-copy-from-hdfs.txt ~/demo/
30

7. Move a file from local to HDFS
❏ Syntax:
■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
❏ Example:
■ hdfs dfs -moveFromLocal /path/to/file.txt
/user/USERNAME/demo/moveFromLocal-example/
31

8. Copy a file within HDFS
❏ Syntax:
■ hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
■ hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt
/user/USERNAME/demo/data/
32

9. Move a file within HDFS
❏ Syntax:
■ hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
❏ Example:
■ hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt
/user/USERNAME/demo/data/
33

10. Merge files on HDFS
❏ Syntax:
■ hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
❏ Examples:
■ hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/
/path/to/all-files.txt
34

11. View file contents
❏ Syntax:
■ hdfs dfs -cat <hdfs-file-path>
■ hdfs dfs -tail <hdfs-file-path>
■ hdfs dfs -text <hdfs-file-path>
❏ Examples:
■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
■ hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
35

12. Remove files/dirs from HDFS
❏ Syntax:
■ hdfs dfs -rm [options] <hdfs-file-path>
❏ Examples:
■ hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
■ hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
■ hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
36

13. Change file/dir properties
❏ Syntax:
■ hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
■ hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
■ hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
❏ Examples:
■ hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-
properties.txt
37

14. Check the file size
❏ Syntax:
■ hdfs dfs -du <hdfs-file-path>
❏ Examples:
■ hdfs dfs -du /user/USERNAME/demo/data/file.txt
■ hdfs dfs -du -s -h /user/USERNAME/demo/data/
38

15. Create a zero byte file in HDFS
❏ Syntax:
■ hdfs dfs -touchz <hdfs-file-path>
❏ Examples:
■ hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
39

16. File test operations
❏ Syntax:
■ hdfs dfs -test -[defsz] <hdfs-file-path>
❏ Examples:
■ hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
❏ echo $?
40

17. Get FileSystem Statistics
❏ Syntax:
■ hdfs dfs -stat [format] <hdfs-file-path>
❏ Format Options:
■ %b - file size in blocks, %g - group name of owner
■ %n - filename %o - block size
■ %r - replication %u - user name of owner
■ %y - modification date
41

18. Get File/Dir Counts
❏ Syntax:
■ hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
❏ Example:
■ hdfs dfs -count -v /user/USERNAME/demo/
42

19. Set replication factor
❏ Syntax:
■ hdfs dfs -setrep -w -R n <hdfs-file-path>
❏ Examples:
■ hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
43

20. Set Block Size
❏ Syntax:
■ hdfs dfs -D dfs.blocksize=blocksize -copyFromLocal <local-file-path>
<hdfs-file-path>
❏ Examples:
■ hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt
/user/USERNAME/demo/block-example/
44

21. Empty the HDFS trash
❏ Syntax:
■ hdfs dfs -expunge
❏ Location:
45

Other hdfs commands (admin)
46

22. HDFS Admin Commands: fsck
❏ Syntax:
❏ hdfs fsck <hdfs-file-path>
❏ Options:
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]
[-includeSnapshots]
47

23. HDFS Admin Commands: dfsadmin
❏ Syntax:
■ hdfs dfsadmin
❏ Options:
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-refreshNodes]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-shutdownDatanode <datanode:port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-help [cmd]]
❏ Examples:
■ hdfs dfsadmin -report -live
49

24. HDFS Admin Commands: namenode
❏ Syntax:
■ hdfs namenode
❏ Options:
[-checkpoint] |
[-format [-clusterid cid ] [-force] [-nonInteractive] ] |
[-upgrade [-clusterid cid] ] |
[-rollback] |
[-recover [-force] ] |
[-metadataVersion ]
❏ Examples:
■ hdfs namenode -help
51

25. HDFS Admin Commands: getconf
❏ Syntax:
■ hdfs getconf [-options]
❏ Options:
[ -namenodes ] [ -secondaryNameNodes ]
[ -backupNodes ] [ -includeFile ]
[ -excludeFile ] [ -nnRpcAddresses ]
[ -confKey [key] ]
52

Again,,, THE most important command !!
❏ Syntax:
■ hdfs dfs -help [options]
■ hdfs dfs -usage [options]
❏ Examples:
■ hdfs dfs -help help
■ hdfs dfs -usage usage
53

Interacting With HDFS
In Web Browser
54

Web HDFS
URL:
http://namenode:50070/explorer.html
Examples:
http://localhost:50070/explorer.html
http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
55

References
1. http://www.hadoopinrealworld.com
2. http://www.slideshare.net/sanjeeb85/hdfscommandreference
3. http://www.slideshare.net/jaganadhg/hdfs-10509123
4. http://www.slideshare.net/praveenbhat2/adv-os-presentation
5. http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
6. http://www.snia.org/sites/default/files/Hadoop2_New_And_Noteworthy_SNIA_v3.pdf
7. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
hdfs/HDFSCommands.html
8. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
common/FileSystemShell.html
9. http://hadoop.apache.org/docs/r1.2.1/distcp.html
56

Thank You!!
Please send your questions at:
pradeep@datatorrent.com
pradeep.n.kumbhar@gmail.com
57

© 2016 DataTorrent
Resources
58
• Apache Apex website - http://apex.incubator.apache.org/
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Startup Program – Free Enterprise License for Startups, Educational Institutions,
Non-Profits - https://www.datatorrent.com/product/startup-accelerator/
• Cloud Trial - http://web.datatorrent.com/cloudtrial.html

© 2016 DataTorrent
We Are Hiring
59
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release

Copy data from one node to another node in HDFS
❏ Description:
❏ Copy data between clusters
❏ Syntax:
■ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
■ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs:
//nn2:8020/bar/foo
■ hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
Where srclist.file contains
■ hdfs://nn1:8020/foo/a
■ hdfs://nn1:8020/foo/b
62

Interacting with hdfs

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Interacting with hdfs

Similar to Interacting with hdfs (20)

Recently uploaded

Recently uploaded (20)

Interacting with hdfs