These slides were created for a university workshop on Apache Hadoop and Apache Apex.
They explain most of the HDFS commands in detail, along with examples.
2. → What's the “Need”? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
3. → Hadoop ←
❏ Open source software
- a Java framework
- initial release: 2006; version 1.0: December 2011
❏ It provides both,
❏ Storage → [HDFS]
❏ Processing → [MapReduce]
❏ HDFS: Hadoop Distributed File System
4. → How Hadoop addresses the need? ←
❏ Big data Ocean
■ Have multiple machines. Each stores a portion of the data, not the entire dataset.
❏ Expensive hardware
■ Use commodity hardware. Simple and cheap.
❏ Frequent Failures and Difficult recovery
■ Keep multiple copies of the data, spread across different machines.
❏ Scaling up with more machines
■ If more processing is needed, add new machines on the fly
5. → HDFS ←
❏ Runs on Commodity hardware: Doesn't require expensive machines
❏ Large Files; Write-once, Read-many (WORM)
❏ Files are split into blocks
❏ Actual blocks go to DataNodes
❏ The metadata is stored on the NameNode
❏ Blocks are replicated to different nodes
❏ Default configuration:
■ Block size = 128MB
■ Replication Factor = 3
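❏ A quick way to check these defaults on a running cluster (a sketch; dfs.blocksize and dfs.replication are the standard property keys, and the demo file path is just an illustration):
■ hdfs getconf -confKey dfs.blocksize # 134217728 bytes = 128MB
■ hdfs getconf -confKey dfs.replication # 3
■ hdfs dfs -setrep -w 2 /user/USERNAME/demo/file.txt # change replication for one file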
9. → Where NOT TO use HDFS ←
❏ Low latency data access
■ HDFS is optimized for high throughput of data at the expense of latency.
❏ Large number of small files
■ Namenode has the entire file-system metadata in memory.
■ With many small files, the metadata becomes too large compared to the actual data.
❏ Multiple writers / Arbitrary file modifications
■ No support for multiple writers to a file
■ Writes always append to the end of a file
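❏ A rough sizing sketch (assuming the commonly cited figure of ~150 bytes of NameNode heap per file, directory, or block object):
■ 1 million small files → ~1M file objects + ~1M block objects ≈ 2M × 150 bytes ≈ 300MB of heap
■ The same data stored as a few large files would need only a handful of metadata objects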
11. → NameNode & DataNodes ←
❏ NameNode:
■ Centerpiece of HDFS: The Master
■ Only stores the block metadata: block-name, block-location etc.
■ Critical component; When down, whole cluster is considered down; Single point of failure
■ Should be configured with higher RAM
❏ DataNode:
■ Stores the actual data: The Slave
■ In constant communication with NameNode
■ When down, it does not affect the availability of data/cluster
■ Should be configured with higher disk space
❏ SecondaryNameNode:
■ Doesn't act as a standby NameNode
■ Periodically checkpoints the primary NameNode's image (merging the edit log into it)
■ The checkpoint is used as a backup to restore a failed NameNode
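❏ The NameNode's view of its DataNodes (live/dead nodes, per-node capacity and usage) can be inspected with a standard admin command:
■ hdfs dfsadmin -report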
13. → JobTracker & TaskTrackers ←
❏ JobTracker:
■ Talks to the NameNode to determine location of the data
■ Monitors all TaskTrackers and submits status of the job back to the client
■ When down, HDFS is still functional; no new MR jobs can be submitted, and running jobs halt
■ Replaced by ResourceManager/ApplicationMaster in MRv2
❏ TaskTracker:
■ Runs on all DataNodes
■ TaskTracker communicates with JobTracker signaling the task progress
■ TaskTracker failure is not considered fatal
■ Replaced by NodeManager in MRv2
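❏ Running MapReduce jobs can be listed from the CLI; the same command works in MRv2, where it talks to the ResourceManager instead:
■ mapred job -list # (hadoop job -list in older releases)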
14. → ResourceManager & NodeManager ←
❏ Present in Hadoop v2.0
❏ Equivalent of JobTracker & TaskTracker in v1.0
❏ ResourceManager (RM):
■ Usually runs on the NameNode machine; distributes resources among applications.
■ Two main components: Scheduler and ApplicationsManager
❏ NodeManager (NM):
■ Per-node framework agent
■ Responsible for containers
■ Monitors their resource usage
■ Reports the stats to RM
The central ResourceManager and the per-node NodeManagers together are called YARN (Yet Another Resource Negotiator)
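❏ Both daemons can be inspected from the YARN CLI, e.g.:
■ yarn node -list # NodeManagers registered with the RM
■ yarn application -list # applications the RM is currently tracking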
18. → Interacting with HDFS ←
❏ Command prompt:
■ Similar to Linux terminal commands
■ Unix is the model, POSIX is the API
❏ Web Interface:
■ Similar to browsing an FTP site on the web
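❏ For example, with the default Hadoop 2.x ports (these vary by version and configuration):
■ NameNode web UI: http://localhost:50070/ (cluster status; browse the file system)
■ ResourceManager web UI: http://localhost:8088/ (applications and cluster resources)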
20. → Notes ←
File Paths on HDFS:
■ hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
■ hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
■ /user/USERNAME/demo/file.txt
■ demo/file.txt
File System:
■ Local: the local (Linux) file system
■ HDFS: the Hadoop Distributed File System
At some places:
The terms “file” and “directory” are used interchangeably.
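❏ The path forms above can all refer to the same file; a quick check (assuming the demo directory exists):
■ hdfs dfs -ls hdfs://localhost:8020/user/USERNAME/demo/ # full URI
■ hdfs dfs -ls /user/USERNAME/demo/ # absolute path on the default file system
■ hdfs dfs -ls demo/ # relative to the home directory /user/USERNAME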
27. 3. Create a file on local & put it on HDFS
❏ Syntax:
■ vi filename.txt
■ hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
❏ Example:
■ vi file-copy-to-hdfs.txt
■ hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
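❏ To verify the upload (same paths as above), list the target directory and print the file:
■ hdfs dfs -ls /user/USERNAME/demo/put-example/
■ hdfs dfs -cat /user/USERNAME/demo/put-example/file-copy-to-hdfs.txt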
28. 4. Get a file from HDFS to local
❏ Syntax:
■ hdfs dfs -get <hdfs-file-path> [local-dir-path]
❏ Example:
■ hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
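❏ The downloaded copy can then be checked on the local file system:
■ ls -l ~/demo/file-copy-from-hdfs.txt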
29. 5. Copy From LOCAL To HDFS
❏ Syntax:
■ hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
❏ Example:
■ hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
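❏ Note: -copyFromLocal behaves like -put, but the source must be on the local file system; it fails if the destination file already exists, unless -f is passed:
■ hdfs dfs -copyFromLocal -f file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/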
30. 6. Copy To LOCAL From HDFS
❏ Syntax:
■ hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
❏ Example:
■ hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
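❏ Note: -copyToLocal behaves like -get, with the destination restricted to the local file system. One way to confirm the contents match (bash):
■ diff <(hdfs dfs -cat /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt) ~/demo/file-copy-from-hdfs.txt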
31. 7. Move a file from local to HDFS
❏ Syntax:
■ hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
❏ Example:
■ hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
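❏ Unlike -put, -moveFromLocal deletes the local source after a successful copy; a quick check (same paths as above):
■ hdfs dfs -ls /user/USERNAME/demo/moveFromLocal-example/ # file is now on HDFS
■ ls /path/to/file.txt # local copy is gone: 'No such file or directory'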