This document provides an overview of Hadoop Distributed File System (HDFS), MapReduce, and Apache Pig. It describes how HDFS stores and replicates large files across clusters of machines for high throughput access. MapReduce is introduced as a programming model for processing large datasets in parallel. Word count is used as an example MapReduce job. Apache Pig is presented as a framework for analyzing large datasets with a higher level of abstraction than MapReduce. Finally, common HDFS commands and a sample Pig script are shown.
3. HDFS
What Problem Does HDFS Solve?
Storing large data: a single file can be as large as a petabyte
Store a single file across many machines in a cluster
Tolerate the failure of one or more nodes
Provide high-throughput parallel access to data
4. HDFS
HDFS does not work well for
Low-latency random access
Files that must be modified in place after they are written (HDFS is write-once, read-many)
5. HDFS
How does HDFS work?
Consider a large file, multiple GB in size
The file is divided into blocks
Each block is given an identifier
Block size is typically 64 MB
Blocks are kept on different machines
• This leads to higher throughput when reading the data
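You can see how a file has been split into blocks, and where each block lives, with the filesystem checker (a hedged example; the path is a placeholder):
• hdfs fsck <hdfs file location> -files -blocks -locations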
6. HDFS
How does HDFS work?
Replicate Blocks
• Fault tolerance, guard against data loss and corruption
• Default is 3-fold replication, but configurable per file
• Individual blocks are replicated
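Since replication is configurable per file, the replication factor of an existing file can be changed from the command line; a minimal example, assuming the file is already in HDFS (the path is a placeholder):
• hdfs dfs -setrep -w 2 <hdfs file location>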
7. HDFS Architecture
Namenode and Datanodes
One namenode, many datanodes
“Master-slave” architecture
Namenode stores metadata: file names, the directory tree, and the blocks that make up each file
Datanodes store the actual data blocks
8. MapReduce
Programming model for large-scale data processing
• First used in the context of “big data” in a system from Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
• The programmer describes the computation in two steps, Map and Reduce
9. MapReduce Example: Word Count
Problem: Count the number of occurrences of each word within a text corpus, and output the counts to a file
Input: a text corpus (say, all words from the New York Times archives), stored as a file in HDFS
Output: for each unique word in the corpus, the number of occurrences of the word
Idea: the Map step emits a (word, 1) pair for every word it reads; the framework groups these pairs by word, and the Reduce step sums the counts for each word
16. MapReduce Parallelization
Different Map steps can run in parallel
All Map steps must complete before any Reduce step begins, since a Reduce step needs every value produced for its key
Different Reduce steps can run in parallel
Parallelization of a MapReduce program is automatic
17. Apache Pig
Framework for large-scale data processing, at a higher level of abstraction than MapReduce.
Programs for processing large datasets can be written much faster than in raw MapReduce, as the sketch below shows
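For instance, the word-count job from the MapReduce example above fits in a few lines of Pig Latin. A minimal sketch, assuming the corpus is a plain-text file in HDFS (the input and output paths are placeholders):

-- Load the corpus; each record is one line of text
lines = LOAD '/user/demo/corpus.txt' AS (line:chararray);
-- "Map": split every line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- "Shuffle": group identical words together
grouped = GROUP words BY word;
-- "Reduce": count the occurrences of each word
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO '/user/demo/wordcount_out';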
20. Common HDFS commands
Commands start with hdfs dfs -… or hadoop fs -…
• See the contents of a folder:
• hdfs dfs -ls <location>
• Copy from the local machine to HDFS:
• First copy the required file to the local machine via WinSCP
• hdfs dfs -copyFromLocal <local machine location> <location in HDFS>
21. Common HDFS commands
• Copy to the local machine from HDFS (note: the HDFS source comes first):
• hdfs dfs -copyToLocal <location in HDFS> <local machine location>
• Then copy the required file from the local machine to your machine via WinSCP
22. Common HDFS commands
• Make a new directory in HDFS:
• hdfs dfs -mkdir <hdfs directory location>
• See the tail of a file in HDFS:
• hdfs dfs -tail <hdfs file location>
• See the top of a file in HDFS:
• hdfs dfs -cat <hdfs file location> | head -10
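Putting these together, a short end-to-end session (all paths and file names below are placeholder assumptions):
hdfs dfs -mkdir /user/demo/data
hdfs dfs -copyFromLocal ratings.csv /user/demo/data
hdfs dfs -ls /user/demo/data
hdfs dfs -tail /user/demo/data/ratings.csv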
23. Pig Script
A sample script on INRIX XD Data
INRIX XD data schema:
segment id, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
24. Pig Script
Problem: Count the number of occurrences of confidence score = 30 for any 10 segments in the June 23rd, 2016 INRIX XD data, and output the counts to a file
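A minimal sketch of such a script, assuming the data is a comma-separated file in HDFS with the schema above (the paths, file name, and field types are assumptions):

-- Load one day of INRIX XD data (the path is a placeholder)
data = LOAD '/user/demo/inrix_xd_2016-06-23.csv' USING PigStorage(',')
    AS (segment_id:chararray, speed:double, time:chararray, confidence_score:int,
        cvalue:double, avg_speed:double, reference_speed:double, traveltime:double);
-- Keep only the records with confidence score = 30
conf30 = FILTER data BY confidence_score == 30;
-- Group the matching records by segment and count them
by_segment = GROUP conf30 BY segment_id;
counts = FOREACH by_segment GENERATE group AS segment_id, COUNT(conf30) AS occurrences;
-- Keep any 10 segments and write the result to HDFS
first10 = LIMIT counts 10;
STORE first10 INTO '/user/demo/conf30_counts';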
25. Pig Script
Run the script:
pig -x tez <script location in local machine>
Store the output on the local machine:
hdfs dfs -getmerge <hdfs location> <local machine location>
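For example, with the sketch above saved as conf30.pig (the names below are placeholder assumptions):
pig -x tez /home/demo/conf30.pig
hdfs dfs -getmerge /user/demo/conf30_counts conf30_counts.txt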