This document provides an overview of a Hadoop tutorial covering the basics of data intensive computing with Hadoop. The tutorial includes hands-on experience with MapReduce programming patterns using Hadoop Streaming and Java, working with the Hadoop filesystem HDFS, and using the Pig data processing language. Sections cover preparing the Hadoop environment, simple and advanced MapReduce examples, user-defined counters, and exploring the Pig language.
Hadoop Tutorial
GridKa School 2011
Ahmad Hammad (hammad@kit.edu)
Ariel Garcia (garcia@kit.edu)
Karlsruhe Institute of Technology
September 7, 2011
Abstract
This tutorial intends to guide you through the basics of Data Intensive Computing with the Hadoop Toolkit. At the end of the course you will hopefully have an overview and hands-on experience about the Map-Reduce computing pattern and its Hadoop implementation, about the Hadoop filesystem HDFS, and about some higher level tools built on top of these like the data processing language Pig.
Hands-on block 1
Preparation
1.1 Logging in
The tutorial will be performed in an existing Hadoop installation, a cluster of 55 nodes with a Hadoop filesystem HDFS of around 100 TB.
You should log in via ssh into
hadoop.lsdf.kit.edu # Port 22
from the login nodes provided to you by the GridKa School:
gks-1-101.fzk.de # Port 24
gks-2-151.fzk.de
NOTE: Just use the same credentials (username and password) for both accounts.
1.2 Getting acquainted with HDFS
First we will perform some data management operations on the Hadoop Distributed Filesystem HDFS. The commands have a strong similarity to the standard Unix/Linux ones.
We will use the suffixes HFile and HDir to denote whether an HDFS object is a file or a folder. Similarly, we use LFile, LDir and LPath for the corresponding objects on the local disk. For instance, some of the most useful HDFS commands are the following:
hadoop fs -ls /
hadoop fs -ls myHPath
hadoop fs -cat myHFile
hadoop fs -df
hadoop fs -cp srcHFile destHPath
hadoop fs -mv srcHFile destHPath
hadoop fs -rm myHFile
hadoop fs -rmr myHDir
hadoop fs -du myHDir
hadoop fs -mkdir myHDir
hadoop fs -get myHFile myCopyLFile
hadoop fs -put myLFile myCopyHFile
You can get all possible fs subcommands by typing
hadoop fs
Exercises:
1. List the top level folder of the HDFS
Hands-on block 2
MapReduce I: Hadoop Streaming
2.1 Finding the maximum temperature
The aim of this block is to get some first-hand experience on how Hadoop MapReduce works. We will use a simple Streaming exercise to achieve that, finding the highest temperature for each year in a real-world climate data set.
Consider the following weather data set sample:
$ cat input_local/sample.txt
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
The relevant fields in each record are:
  year:         positions 16-19  (4 digits)
  temperature:  positions 88-92  (sign and 4 digits)
  quality:      position 93      (1 digit)
The temperature is multiplied by 10. The temperature value is considered MISSING if it is equal to +9999. The value of the quality flag indicates the quality of the measurement. It has to match one of the following values: 0, 1, 4, 5 or 9; otherwise the temperature value is considered invalid.
Exercise: Write two scripts in a script language of your choice (like Bash, Python) to act as Map and Reduce functions for the file sample.txt. These two scripts should act as described below (a minimal sketch of both is given after the specifications).
The Map:
- reads the input data from standard input STDIN line-by-line
- parses every line by: year, temperature and quality
- tests if the parsed temperature is valid. That is the case when:
  temp != "+9999" and re.match("[01459]", quality)   # Python code
- outputs the year and the valid temperature as a tab-separated key-value pair string to standard output STDOUT.
The Reduce:
- reads data from standard input STDIN line-by-line
- splits the input line by the tab-separator to get the key and its value
- finds the maximum temperature for each year and prints it to STDOUT
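The following is one possible minimal sketch of the two scripts in Python; it is only an illustration, not the official solution. It assumes the record layout described above, where the 1-based positions 16-19, 88-92 and 93 correspond to the Python slices [15:19], [87:92] and index [92]:

#!/usr/bin/env python
# map.py - sketch of the Map script: emits "year<TAB>temperature" for valid records
import re
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if len(line) < 93:
        continue                          # skip malformed records
    year = line[15:19]                    # positions 16-19 (1-based)
    temp = line[87:92]                    # positions 88-92: sign and four digits
    quality = line[92]                    # position 93
    if temp != "+9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, temp))

#!/usr/bin/env python
# reduce.py - sketch of the Reduce script: keeps the maximum temperature per year
import sys

max_temp = {}
for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = int(temp)                      # raw values are the temperature multiplied by 10
    if year not in max_temp or temp > max_temp[year]:
        max_temp[year] = temp

for year, temp in sorted(max_temp.items()):
    print("%s\t%s" % (year, temp))

Remember to make both scripts executable (chmod +x map.py reduce.py) so that they can be run in the commands below.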
2.1.1 Testing the map and reduce files without Hadoop
You can test the map and reduce scripts without using Hadoop. This helps to make clear the programming concept. Let's first check what the map output is:
$ cd mapreduce1
$ cat ../input_local/sample.txt | ./map.py
Now you can run the complete map-reduce chain, to obtain the maximum temperature for each year (here sort plays the role of Hadoop's shuffle-and-sort phase between map and reduce):
$ cat ../input_local/sample.txt | ./map.py | sort | ./reduce.py
2.1.2 MapReduce with Hadoop Streaming
1. Run the MapReduce Streaming job on the local file system. What is the calculated maximum temperature? For which year?
Notice: write the following command all in one line, or use a backslash (\) at the end of each line as shown in point 2.
$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
-conf ../conf/hadoop-local.xml
-input ../input_local/sample.txt
-output myLocalOutput
-mapper ./map.py
-reducer ./reduce.py
-file ./map.py
-file ./reduce.py
2. Run the MapReduce Streaming job on HDFS. Where and what is the output of the calculated max temperature of the job?
$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar \
-input /share/data/gks2011/input/sample.txt \
-output myHdfsOutput \
-mapper ./map.py \
-reducer ./reduce.py \
-file ./map.py \
-file ./reduce.py
Important: Before repeating a run for a second time you always have to delete the output folder given with -output, or you must select a new one; otherwise Hadoop will abort the execution.
$ hadoop fs -rmr myHdfsOutput
In case of the local file system run:
$ rm -rf myLocalOutput
2.1.3 Optional
Can you tell how many MapTasks and ReduceTasks have been started for this MR job?
Hands-on block 3
MapReduce II: Developing MR programs in Java
3.1 Finding the maximum temperature with a Java MR job
In this block you will repeat the calculation of the previous one using a native Hadoop MR program.
Exercise: Based on the file MyJobSkeleton.java in your mapreduce2/ folder, try to replace all question-mark placeholders (?) in the file MyJob.java to obtain a functioning MapReduce Java program that can find the max temperature for each year as described in the previous block.
To test the program:
# Create a directory for your compiled classes
$ mkdir myclasses
# Compile your code
$ javac -classpath /usr/lib/hadoop/hadoop-core.jar
-d myclasses MyJob.java
# Create a jar
$ jar -cvf myjob.jar -C myclasses .
# Run
$ hadoop jar myjob.jar MyJob
/share/data/gks2011/input/sample.txt myHdfsOutput
Important: replace gs099 with your actual account name.
3.2 Optional MapReduce exercise
Run the program with the following input:
Hands-on block 4
MapReduce III: User-defined Counters
4.1 Understanding the RecordParser
Hadoop supports a quite sophisticated reporting framework for helping the user to keep track of his Hadoop job status.
Exercise: In the directory mapreduce3/ you will find the files RecordParser.java and MyJobWithCounters.java. Complete the user-defined Counters' Enum and call the incrCounter() method to increment the right counter at the marked places in the code. Compile, create the jar, and run your MR job with either of the following input data:
/share/data/gks2011/input/all
/share/data/gks2011/input/bigfile.txt
What do your Counters report?
To compile the program do:
$ mkdir myclasses
$ javac -classpath /usr/lib/hadoop/hadoop-core.jar -d myclasses
RecordParser.java MyJobWithCounters.java
$ jar -cvf MyJobWithCounters.jar -C myclasses .
$ hadoop jar MyJobWithCounters.jar MyJobWithCounters
input/all outputcounts
$ hadoop jar MyJobWithCounters.jar MyJobWithCounters
input/bigfile.txt outputcounts2
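As an aside, user-defined counters are not restricted to Java jobs: a Hadoop Streaming script (as used in block 2) can increment a counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to standard error. A small illustrative sketch in Python (the group and counter names are invented for illustration, not part of this exercise):

#!/usr/bin/env python
# Sketch: incrementing a user-defined counter from a Hadoop Streaming mapper
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if len(line) < 93 or line[87:92] == "+9999":
        # Invented counter group/name, purely for illustration
        sys.stderr.write("reporter:counter:Climate,MALFORMED_OR_MISSING,1\n")
        continue
    print("%s\t%s" % (line[15:19], line[87:92]))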
Hands-on block 5
The Pig data processing language
5.1 Working with Pigs
Pig is a data flow language based on Hadoop. The Pig interpreter transforms your Pig commands into MapReduce programs which are then run by Hadoop, usually in the cluster infrastructure, in a way completely transparent to you.
5.1.1 Starting the interpreter
The Pig interpreter (called Grunt) can be started in either of two modes:
local: Pig programs are executed locally, only local files can be used (no HDFS)
MapReduce: Pig programs are executed in the full Hadoop environment, with files in HDFS only
To run these modes use
pig -x local
pig -x mapreduce # Default
Note: You can also create and run Pig scripts in batch (non-interactive) mode:
pig myPigScript.pig
Exercise: Start the Grunt shell (in local mode for now) and with reduced debugging:
pig -x local -d warn
Then get acquainted with some of the Pig shell's utility commands shown in Table 5.1. Remember that you started the shell in local mode, therefore you will be browsing the local filesystem.
grunt> help
grunt> pwd
grunt> fs -ls
grunt> ls
grunt> cp README-PLEASE.txt /tmp
grunt> cat /tmp/README-PLEASE.txt
grunt> fs -rm /tmp/README-PLEASE.txt
...
Utility commands
help                       Prints some help :-)
quit                       Just that
set debug [on|off]         Enables verbose debugging
fs -CMD                    HDFS commands (work also for local files)
ls, cp, cat, rm, rmr, ...  Same commands as above (less output)
cd                         Change directory
Table 5.1: Grunt shell's utility commands
5.1.2 The Pig Latin language basics
The Pig language supports several data types: 4 scalar numeric types, 2 array types, and 3 composite data types as shown in Table 5.2. Note the examples in the rightmost column: short integers and double floats are the default types, otherwise the suffixes L or F need to be used. Important for understanding the Pig language and this tutorial are the tuples and bags.
Simple data types
int        Signed 32-bit integer          117
long       Signed 64-bit integer          117L
float      32-bit floating point          3.14F or 1.0E6F
double     64-bit floating point          3.14 or 1.41421E2
chararray  UTF8 character array (string)  Hello world!
bytearray  Byte array (binary object)
Complex data types
tuple      Ordered set of fields                                             (1,A,2.0)
bag        Collection of tuples: unordered, possibly different tuple types   {(1,2),(2,3)}
map        Set of key-value pairs: keys are unique chararrays                [key#value]
Table 5.2: The Pig Latin data types
Having said that, let's start hands on :-)
Exercise: Load data from a file containing mean daily temperature records, similar to the ones used earlier in this tutorial.
grunt> cd pig
grunt> data = LOAD 'sample-data.txt'
       AS (loc:long, date:chararray, temp:float, count:int);
grunt> DUMP data;
...
grunt> part_data = LIMIT data 5;
grunt> DUMP part_data;
...
Notice how you can dump all the data or just part of it using an auxiliary variable. Can
you explain why one of the tuples in data appears as
(,YEAR_MO_DAY,,) ?
Notice also that the real processing of data in Pig only takes place when you request some final result, for instance with DUMP or STORE. Moreover, you can also ask Pig about variables and some sample data:
grunt> DESCRIBE data;
...
grunt> ILLUSTRATE data;
...
The variable (a.k.a. alias) data is a bag of tuples. The illustrate command illustrates the variable with different sample data each time... sometimes you might see a null pointer exception with our unprepared data: can you explain why?
Exercise: Now we will learn to find the maximum temperature in our small data set.
grunt> temps = FOREACH data GENERATE temp;
grunt> temps_group = GROUP temps ALL;
grunt> max_temp = FOREACH temps_group GENERATE MAX(temps);
grunt> DUMP max_temp;
(71.6)
As you see above, Pig doesn't allow you to directly apply a function (MAX()) to your data variables, but only to a single-column bag.
Remember, Pig is not a normal programming language but a
data processing language based on MapReduce and Hadoop! This
language structure is required to allow a direct mapping of your processing
instructions to MapReduce!
Use DESCRIBE and DUMP to understand how the Pig instructions above work.
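To see how these statements relate to the MapReduce pattern used in block 2, here is a rough Python analogy with invented toy records (this is only an illustration of the data flow, not how Pig executes internally):

# FOREACH data GENERATE temp   ->  keep only the temperature column (map step)
# GROUP temps ALL              ->  collect all values into a single group (shuffle)
# GENERATE MAX(temps)          ->  apply MAX to the grouped bag (reduce step)
records = [(10010, "19730101", 71.6, 24),      # toy tuples, invented for illustration
           (10010, "19730102", 65.2, 24)]

temps = [temp for (loc, date, temp, count) in records]     # FOREACH ... GENERATE temp
grouped = {"all": temps}                                   # GROUP temps ALL
max_temp = [max(bag) for bag in grouped.values()]          # FOREACH ... GENERATE MAX(temps)
print(max_temp)                                            # [71.6]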
NOTE: if you change and reload a relation, like temps = above, you must also reload all dependent relations (temps_group, max_temp) to achieve correct results!
Exercise: Repeat the exercise above but converting the temperature to degrees Celsius instead of Fahrenheit:
  TCelsius = 5/9 * (TFahrenheit - 32)
Hint: you can use mathematical formulas in the GENERATE part of a relation, but you cannot operate with the results of a function like MAX(). Don't forget that numbers without a decimal dot are interpreted as integers, so write 5.0/9.0 rather than 5/9, which would evaluate to zero!
Data IO commands
LOAD        a1 = LOAD 'file' [USING function] [AS schema];
STORE       STORE a2 INTO 'folder' [USING function];
DUMP        DUMP a3;
LIMIT       a4 = LIMIT a3 number;
Diagnostic commands
DESCRIBE    DESCRIBE a5;      Show the schema (data types) of the relation
EXPLAIN     EXPLAIN a6;       Display the execution plan of the relation
ILLUSTRATE  ILLUSTRATE a7;    Display step by step how data is transformed (from the LOAD to the desired relation)
5.2 Using more realistic data
Above we have used a tiny data file with 20 lines of sample data. Now we will run Pig in MapReduce mode to process bigger files.
pig
Remember that now Pig will only allow you to use HDFS...
Exercise: Load data from a 200 MB data file. Since the files are now not TAB-separated (as expected by default by Pig) but space-separated, we need to explicitly tell Pig which LOAD function to use:
grunt> cd /share/data/gks2011/pig/all-years
grunt> data = LOAD 'climate-1973.txt' USING PigStorage(' ')
       AS (loc:long, wban:int, date:chararray,
       temp:float, count:int);
grunt> part_data = LIMIT data 5;
grunt> DUMP part_data;
...
Check that the data was correctly loaded using the LIMIT or the ILLUSTRATE operators.
grunt> temps = FOREACH data GENERATE temp;
grunt> temps_group = GROUP temps ALL;
grunt> max_temp = FOREACH temps_group GENERATE MAX(temps);
grunt> DUMP max_temp;
(109.9)
Exercise: Repeat the above exercise measuring the time it takes to process this single file, and then compare with the time it takes to process the whole folder (13 GB instead of 200 MB). Pig can load all files in a folder at once if you pass it a folder path:
grunt> data = LOAD '/share/data/gks2011/pig/all-years/'
       USING PigStorage(' ')
       AS (loc:long, wban:int, date:chararray,
       temp:float, count:int);
5.3 A more advanced exercise
In this realistic data set, the data is not perfect or fully cleaned up: if you look carefully, for instance, you will see a message
Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 7996 time(s).
This is due to the label lines mixed inside the file:
STN--- WBAN YEARMODA TEMP ...
We will remove those lines from the input data by using the FILTER operator. As the warnings come from the castings in the LOAD operation, we now postpone the casts to a later step, after the filter was done:
grunt> cd /share/data/gks2011/pig/all-years
grunt> data_raw = LOAD 'climate-1973.txt' USING PigStorage(' ')
       AS (loc, wban, date:chararray, temp, count);
grunt> data_flt = FILTER data_raw BY date != 'YEARMODA';
grunt> data = FOREACH data_flt GENERATE (long)loc, (int)wban,
       date, (float)temp, (int)count;
grunt> temps = FOREACH data GENERATE ((temp-32.0)*5.0/9.0);
grunt> temps_group = GROUP temps ALL;
grunt> max_temp = FOREACH temps_group GENERATE MAX(temps);
grunt> DUMP max_temp;
(43.27777862548828)
Also the mean daily temperatures were obtained from averaging a variable number of measurements: the amount is given in the 5th column, variable count. You might want to filter out all mean values obtained with less than, say, 5 measurements. This is left as an exercise to the reader.
5.4 Some extra Pig commands
Some relational operators
FILTER    Use it to work with tuples or rows of data
FOREACH   Use it to work with columns of data
GROUP     Use it to group data in a single relation
ORDER     Sort a relation based on one or more fields
...
Some built-in functions
AVG       Calculate the average of numeric values in a single-column bag
COUNT     Calculate the number of tuples in a bag
MAX/MIN   Calculate the maximum/minimum value in a single-column bag
SUM       Calculate the sum of values in a single-column bag
...
Hands-on block 6
Extras
6.1 Installing your own Hadoop
The Hadoop community has its main online presence at:
http://hadoop.apache.org/
Although you can download the latest source code and release tarballs from that location, we strongly suggest you use the more production-ready Cloudera distribution:
http://www.cloudera.com/
Cloudera provides ready-to-use Hadoop Linux packages for several distributions, as well as a Hadoop installer for configuring your own Hadoop cluster, and also a preconfigured VMWare appliance.