HDFS, MapReduce and
Apache Pig Tutorial
Pranamesh Chakraborty
Resources: CPRE-419 course at ISU (Large Scale Data Analysis)
Hadoop
Job and software available:
 Data storage: Hadoop Distributed File System (HDFS)
 Parallel processing: MapReduce
 Scripting, SQL: Pig, Hive
HDFS
What Problem Does HDFS Solve?
Storing large data
 A single file can be as large as a petabyte
 A single file is stored across many machines in a cluster
 Tolerant to the failure of one or more nodes
 High-throughput parallel access to data
HDFS
HDFS does not work well for
 Low-latency random access
 Files that need to be modified once written
HDFS
How does HDFS work?
 Consider a large file, many gigabytes in size
 The file is divided into blocks
 Each block is given an identifier
 Block size is typically 64 MB
 Blocks are kept on different machines
• This leads to higher data throughput
HDFS
How does HDFS work?
 Blocks are replicated
• Fault tolerance: guards against data loss and corruption
• Default is 3-fold replication, but configurable per file
• Individual blocks are replicated
HDFS Architecture
Namenode and Datanodes
 One namenode, many datanodes
 “Master-slave” architecture
 The namenode stores metadata: the filesystem namespace and the mapping of files to blocks
 The datanodes store the actual data blocks
MapReduce
Programming model for large-scale data processing
• First used in the context of “big data” in a system from Google: “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google Inc.
• The programmer describes the computation in two steps: Map and Reduce
MapReduce Example: Word Count
 Problem: Count the number of occurrences of each word within a text corpus, and output the counts to a file
 Input: text corpus (say all words from the New York
Times archives), a file in HDFS
 Output: for each unique word in the corpus, the
number of occurrences of the word
MapReduce Example: Word Count
Always think in the notion of (key, value) pairs
MapReduce Parallelization
 Different Map steps can run in parallel
 All Map steps must complete before any Reduce step begins
 Different Reduce steps can run in parallel
 Automatic parallelization of a MapReduce
program
Apache Pig
 Framework for large-scale data processing at a higher level of abstraction than MapReduce. Programs for processing large datasets can be written much faster than in raw MapReduce (a short example follows below).
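For a sense of the abstraction level, here is the word count problem from the MapReduce example written as a short Pig Latin script. This is a minimal sketch: the input and output HDFS paths are hypothetical placeholders.

-- Word count in Pig Latin (paths are hypothetical placeholders)
lines  = LOAD '/user/example/nyt_corpus.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO '/user/example/wordcount_out';

Pig compiles a script like this into one or more MapReduce (or Tez) jobs behind the scenes.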
Resources:
Reference Manual:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
Built-in functions:
https://pig.apache.org/docs/r0.11.1/func.html
Common hdfs commands
Commands start with hdfs dfs -… or hadoop fs -…
• See the contents of a folder:
• hdfs dfs -ls <location>
• Copy from the local machine to HDFS:
• First copy the required file to the local machine via WinSCP
• hdfs dfs -copyFromLocal <local machine location> <location in HDFS>
Common hdfs commands
• Copy to the local machine from HDFS:
• hdfs dfs -copyToLocal <location in HDFS> <local machine location>
• Then copy the required file from the local machine to your machine via WinSCP
Common hdfs commands
• Make a new directory in hdfs:
• hdfs dfs -mkdir <hdfs directory location>
• See the tail of a file in hdfs:
• hdfs dfs -tail <hdfs file location>
• See the top of a file in hdfs:
• hdfs dfs -cat <hdfs file name> | head -10
Pig Script
A sample script on INRIX XD Data
Inrix XD data Schema:
code, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
Pig Script
Inrix XD data Schema:
segment id, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
 Problem: Count the number of records with confidence score = 30 for any 10 segments in the June 23rd, 2016 Inrix XD data, and output the counts to a file (a sketch of such a script follows below)
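A minimal sketch of such a script, assuming the data sits in HDFS as a comma-delimited text file. The paths and delimiter are assumptions, and the field names use underscores because Pig identifiers cannot contain spaces; adjust them to the actual data.

-- Load one day of Inrix XD data (path, delimiter, and field names are assumptions)
data   = LOAD '/user/example/inrix_xd_2016-06-23.csv' USING PigStorage(',')
         AS (segment_id:chararray, speed:double, time:chararray,
             confidence_score:int, cvalue:double, avg_speed:double,
             reference_speed:double, travel_time:double);
-- Keep only records with confidence score = 30
conf30 = FILTER data BY confidence_score == 30;
-- Count those records per segment
by_seg = GROUP conf30 BY segment_id;
counts = FOREACH by_seg GENERATE group AS segment_id, COUNT(conf30) AS occurrences;
-- "Any 10 segments": keep 10 of the per-segment counts
ten    = LIMIT counts 10;
STORE ten INTO '/user/example/conf30_counts';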
Pig Script
Run the script:
pig -x tez <script location in local machine>
Store the output on the local machine:
hdfs dfs -getmerge <hdfs location> <local machine location>
