Pig installation and test run

This presentation helps you install Apache Pig on Hadoop and run a sample program. It also points to details on how to use the Pig Editor in Cloudera.

Published in: Technology

  1. Pig Setup and Test Run, by Kannan Kalidasan
  2. Pig Introduction
     Pig is a data flow language (Pig Latin) for writing Hadoop operations without writing MapReduce code in Java. It is a layer of abstraction on top of Hadoop that simplifies its use by offering a SQL-like interface for processing data, and it improves productivity by saving many lines of Java code. Pig supports a variety of data types, and it supports user-defined functions (UDFs) for writing custom operations in Java, Python and JavaScript.
     To learn more, I recommend the book Programming Pig by Alan Gates. The author explains the concepts in a clear and simple way.
  3. Pig Prompt is GRUNT
     Pig grunts …
     $ pig
     grunt>
  4. Pig session has two modes
     Local mode: access to a single machine. All files are installed and run using your local host and file system. This mode helps you debug a Pig script before running it on a cluster. The -x flag specifies the mode:
     pig -x local
     MapReduce mode: access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode. To add the Hadoop configuration directory to the Pig classpath:
     export PIG_CLASSPATH=$HADOOP_HOME/conf/
     The two commands below are equivalent; both start a Pig session in MapReduce mode:
     pig
     pig -x mapreduce
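The choice between the two modes can be sketched as a small shell helper. This is only an illustration: `choose_pig_mode` and `PIG_MODE` are hypothetical names, not part of Pig, and finding a `hadoop` client on the PATH does not prove that the Hadoop services are actually running.

```shell
# Hypothetical helper (not part of Pig): pick an execution mode
# depending on whether a hadoop client is available on PATH.
choose_pig_mode() {
    if command -v hadoop >/dev/null 2>&1; then
        echo "mapreduce"
    else
        echo "local"
    fi
}

PIG_MODE=$(choose_pig_mode)
echo "Would start Pig with: pig -x $PIG_MODE"

# Uncomment to actually start the session:
# pig -x "$PIG_MODE"
```

Falling back to local mode matches the advice on the slide: debug the script locally first, then run it on the cluster.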
  5. Note to Remember ...
     ● Hadoop services must be running before you start Pig in MapReduce mode, so that Pig can connect to HDFS.
     ● Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.
     ● In MapReduce mode, Pig reads input files only from HDFS and stores the results back to HDFS.
  6. Pig Installation
     1. Download the stable version of the tarball.
        http://mirror.nexcess.net/apache/pig/pig-0.12.0/
        pig-0.12.0.tar.gz
        Release notes link: http://pig.apache.org/releases.html#Download
  7. Pig Installation ...
     2. Copy the downloaded package to /usr/local. (The listing and the commands that follow use pig-0.11.1; substitute the version you downloaded.)
        kannan@kannandreams:/usr/local$ ls -ltr
        total 119460
        -rwxr-xr-x  1 root   root   63851630 Nov 11 02:11 hadoop-1.2.1.tar.gz
        drwxr-xr-x 16 hduser hadoop     4096 Nov 11 23:47 hadoop
        -rwxrwxrwx  1 root   root   58433159 Dec  3 00:55 pig-0.11.1.tar.gz
        kannan@kannandreams:/usr/local$
  8. Pig Installation ...
     3. Extract the archive and change the owner:
        sudo tar xzf pig-0.11.1.tar.gz
        sudo mv pig-0.11.1 pig
        sudo chown -R hduser:hadoop pig
        The chown command changes the owner of the pig directory from root to the Hadoop user hduser.
     4. Log in as the Hadoop user hduser and set the environment variables:
        kannan@kannandreams:/usr/local$ su - hduser
        Add the two lines below to the ~/.bashrc file:
        export PIG_HOME="/usr/local/pig"
        export PATH=$PATH:$PIG_HOME/bin
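A plain `export PATH=$PATH:$PIG_HOME/bin` appends a duplicate entry each time the profile is sourced again. The sketch below shows one way to make the addition idempotent and to check whether the `pig` launcher is found. It assumes the install location /usr/local/pig from the slide and is not the only way to do this.

```shell
# Install location taken from the slide; adjust if yours differs.
PIG_HOME="/usr/local/pig"
export PIG_HOME

# Append $PIG_HOME/bin to PATH only if it is not already there.
case ":$PATH:" in
    *":$PIG_HOME/bin:"*) ;;    # already present, nothing to do
    *) PATH="$PATH:$PIG_HOME/bin"; export PATH ;;
esac

# Report whether the pig launcher can now be found.
if command -v pig >/dev/null 2>&1; then
    echo "pig found at: $(command -v pig)"
else
    echo "pig not found; check that $PIG_HOME/bin exists"
fi
```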
  9. Pig Installation ...
     5. Source the profile file so the changes take effect:
        hduser@kannandreams:~$ . .bashrc
        hduser@kannandreams:~$
     6. Check the pig command. (The output shown below is truncated.)
        hduser@kannandreams:~$ pig -help
        Warning: $HADOOP_HOME is deprecated.
        Apache Pig version 0.11.1 (r1459641)
        compiled Mar 22 2013, 02:13:53
  10. Test Run ...
      7. Create a sample file for processing (named pigcsv here). The file does not need any particular extension; how it is parsed is determined by the load function and delimiter given in the LOAD statement.
         Sample file: create a file in an HDFS directory with the contents below.
         "2006";
         "2007";
         "2008";
         "2008";
         "2008";
         "2008";
         "2007";
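One way to create and upload this file is sketched below. It assumes one record per line, which the per-year counts in the output imply, and it uses the HDFS input path /user/hduser/piginput/pigcsv from the script in this deck. The scratch directory and the variable names are illustrative. The upload is skipped when no `hadoop` client is on the PATH.

```shell
# Work in a scratch directory so nothing existing is overwritten.
WORKDIR=$(mktemp -d)
SAMPLE="$WORKDIR/pigcsv"

# One quoted year per line, each terminated by the ';' delimiter.
cat > "$SAMPLE" <<'EOF'
"2006";
"2007";
"2008";
"2008";
"2008";
"2008";
"2007";
EOF

echo "Wrote $(wc -l < "$SAMPLE" | tr -d ' ') lines to $SAMPLE"

# Upload only when a hadoop client is available.
# On Hadoop 1.x, -mkdir creates missing parent directories;
# newer releases need -mkdir -p instead.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mkdir /user/hduser/piginput
    hadoop fs -put "$SAMPLE" /user/hduser/piginput/pigcsv
fi
```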
  11. Test Run ...
      8. Pig scripts
         Method 1 to run the Pig script: save the script as <<filename>>.pig (in my case, pig_test.pig) and run it as
         $ pig -x mapreduce pig_test.pig
         or
         $ pig pig_test.pig
         Script contents:
         SampleRecord = LOAD '/user/hduser/piginput/pigcsv' USING PigStorage(';') AS (Year:chararray);
         GroupByYear = GROUP SampleRecord BY Year;
         CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
         STORE CountByYear INTO '/user/hduser/pigoutput' USING PigStorage('\t');
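Pig refuses to STORE into an output directory that already exists, so a second run of the same script fails unless the old results are removed first. The sketch below writes the script to a file and runs it only when both `pig` and `hadoop` are on the PATH. It assumes the tab delimiter `'\t'` for the output, and `hadoop fs -rmr` is the Hadoop 1.x form of the remove command. The temporary file location and the variable names are illustrative.

```shell
# Write the script from this deck to a file (straight quotes,
# tab-delimited output assumed).
SCRIPT="${TMPDIR:-/tmp}/pig_test.pig"
OUTPUT_DIR="/user/hduser/pigoutput"

cat > "$SCRIPT" <<'EOF'
SampleRecord = LOAD '/user/hduser/piginput/pigcsv' USING PigStorage(';') AS (Year:chararray);
GroupByYear = GROUP SampleRecord BY Year;
CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear INTO '/user/hduser/pigoutput' USING PigStorage('\t');
EOF

if command -v pig >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
    # Destructive: deletes the results of any previous run.
    hadoop fs -rmr "$OUTPUT_DIR"
    pig -x mapreduce "$SCRIPT"
else
    echo "pig or hadoop not on PATH; wrote $SCRIPT but did not run it"
fi
```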
  12. Test Run ...
      Method 2 to run the Pig script: type the statements at the Grunt prompt. A statement ends at the semicolon, so it can span several lines; continuation lines show the >> prompt.
      grunt> SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
      >> USING PigStorage(';') AS (Year:chararray);
      grunt> GroupByYear = GROUP SampleRecord BY Year;
      grunt> CountByYear = FOREACH GroupByYear
      >> GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
      grunt> STORE CountByYear
      >> INTO '/user/hduser/pigoutput' USING PigStorage('\t');
  13. Test Run ...
      9. Output:
         hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000
         Warning: $HADOOP_HOME is deprecated.
         "2006":1
         "2007":2
         "2008":4
         "Year":1
         hduser@kannandreams:/usr/local/hadoop/bin$
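The counts can be cross-checked locally without Hadoop. The sketch below uses awk as a stand-in for the GROUP and COUNT steps; it is not how Pig computes the result. It assumes the seven sample records are stored one per line. The extra "Year":1 line in the output above is not reproduced, because the sample contents shown in the deck contain no such record; presumably the author's actual file also had a header line.

```shell
# Same seven records as the sample file (assumed one per line).
SAMPLE=$(mktemp)
printf '%s\n' '"2006";' '"2007";' '"2008";' '"2008";' \
    '"2008";' '"2008";' '"2007";' > "$SAMPLE"

# Split on ';', count occurrences of the first field, print key:count.
# sort makes the order deterministic (awk's for-in order is unspecified).
RESULT=$(awk -F';' '{ count[$1]++ }
    END { for (k in count) print k ":" count[k] }' "$SAMPLE" | sort)

echo "$RESULT"
rm -f "$SAMPLE"
```

For the sample data this prints "2006":1, "2007":2 and "2008":4, one per line, matching the first three lines of the Pig output.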
  14. Script Explanation
      Load the file into a relation, specifying the delimiter (';') and each field's name and type. If the file has more than one column, separate the fields with commas in the AS clause. By default, Pig loads files delimited by tabs, so any other delimiter must be specified explicitly.
      SampleRecord = LOAD '/user/hduser/piginput/pigcsv' USING PigStorage(';') AS (Year:chararray);
      Group the loaded records by year:
      GroupByYear = GROUP SampleRecord BY Year;
  15. Script Explanation ...
      Count the records in each group and generate the output as Key:Value; how you format the output is up to you. $0 is the group key (the year) and $1 is the bag of records in that group, so COUNT($1) is the number of records for that year.
      CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
      Store the relation in a file:
      STORE CountByYear INTO '/user/hduser/pigoutput' USING PigStorage('\t');
      For the complete set of script commands, refer to http://pig.apache.org/docs/r0.10.0/start.html#data-results
  16. Pig in Cloudera
      The Pig Editor in Cloudera is explained on my blog: http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!
  17. Thank You !!!
      Mail: kannanpoem1984@gmail.com
      Twitter: @kannanpoem
      Blog: http://kannandreams.wordpress.com/about/
      FB Community: www.facebook.com/groups/huge360/
      HUGE - Hadoop User Group & Enthusiasts. Huge, yes, it's all about "BIG" data. This group was created to bring together expertise and experts in Hadoop and Big Data.