Pig installation and test run



This presentation helps you install Hadoop Pig and run a sample program. It also points to details on how to use the Pig Editor in Cloudera.

Published in: Technology


  1. Pig Setup and Test Run, by Kannan Kalidasan
  2. Pig Introduction
     Pig provides a data flow language (Pig Latin) for writing Hadoop operations without writing MapReduce code in Java. It is a layer of abstraction on top of Hadoop that simplifies its use by giving a SQL-like interface to process data on Hadoop. It helps to increase productivity because far fewer lines of Java code are needed. It supports a variety of data types, and it also supports user-defined functions (UDFs) for writing custom operations in Java, Python and JavaScript. I recommend the book Programming Pig by Alan Gates for learning Pig; the author explains the concepts in a clear and simple way.
  3. The Pig prompt is Grunt (pig grunts …)
         $ pig
         grunt>
  4. A Pig session has two modes
     Local mode: access to a single machine. All files are installed and run using your local host and file system. This mode helps to debug a Pig script before it is run on a cluster. The -x flag specifies the mode:
         pig -x local
     MapReduce mode: access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode. To add the Hadoop configuration directory to the Pig classpath:
         export PIG_CLASSPATH=$HADOOP_HOME/conf/
     The two commands below are equivalent; both start a Pig session in MapReduce mode:
         pig
         pig -x mapreduce
  5. Notes to remember
     ● The Hadoop services must be running before Pig can start in MapReduce mode, connect to HDFS and proceed with the work.
     ● Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.
     ● In MapReduce mode, Pig takes input files from HDFS only and stores the results back to HDFS.
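The first note above can be checked before starting Pig. The following is a minimal sketch, assuming a single-node Hadoop 1.x setup (the deck uses hadoop-1.2.1) and the JDK's jps tool for listing Java processes. The helper name missing_daemons and the process IDs in the demo input are made up for illustration; they are not part of Pig or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): read `jps` output on stdin
# and print every Hadoop 1.x daemon that is NOT listed.
# Assumption: single-node setup, so all four daemons run on this machine.
missing_daemons() {
    running=$(cat)
    for daemon in NameNode DataNode JobTracker TaskTracker; do
        # jps prints "<pid> <ClassName>"; compare the second field exactly,
        # so that SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$running" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Demo with made-up jps output; on a real machine use: jps | missing_daemons
printf '%s\n' '2101 NameNode' '2345 DataNode' '2600 Jps' | missing_daemons
```

If the function prints nothing, all four daemons were found and `pig -x mapreduce` should be able to reach the cluster.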
  6. Pig Installation
     1. Download the stable version of the tarball, pig-0.12.0.tar.gz, from
            http://mirror.nexcess.net/apache/pig/pig-0.12.0/
        Release notes: http://pig.apache.org/releases.html#Download
  7. Pig Installation ...
     2. Copy the downloaded package to /usr/local (the listing shows pig-0.11.1.tar.gz, the version used in the remaining steps).
            kannan@kannandreams:/usr/local$ ls -ltr
            total 119460
            -rwxr-xr-x  1 root   root   63851630 Nov 11 02:11 hadoop-1.2.1.tar.gz
            drwxr-xr-x 16 hduser hadoop     4096 Nov 11 23:47 hadoop
            -rwxrwxrwx  1 root   root   58433159 Dec  3 00:55 pig-0.11.1.tar.gz
            kannan@kannandreams:/usr/local$
  8. Pig Installation ...
     3. Unpack the archive and change the owner.
            sudo tar xzf pig-0.11.1.tar.gz
            sudo mv pig-0.11.1 pig
            sudo chown -R hduser:hadoop pig
        The chown command changes the owner of the pig directory from root to the Hadoop user hduser.
     4. Log in as the Hadoop user hduser and set the environment variables.
            kannan@kannandreams:/usr/local$ su - hduser
        Add the two lines below to the ~/.bashrc file.
            export PIG_HOME="/usr/local/pig"
            export PATH=$PATH:$PIG_HOME/bin
  9. Pig Installation ...
     5. Source the profile file so the changes take effect.
            hduser@kannandreams:~$ . .bashrc
            hduser@kannandreams:~$
     6. Check the pig command. The output shown below is truncated.
            hduser@kannandreams:~$ pig -help
            Warning: $HADOOP_HOME is deprecated.
            Apache Pig version 0.11.1 (r1459641)
            compiled Mar 22 2013, 02:13:53
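If `pig -help` reports "command not found" after these steps, the usual cause is that $PIG_HOME/bin did not make it onto PATH. A minimal sketch of a check, assuming the /usr/local/pig location used in this deck; path_contains is a hypothetical helper name, not a Pig or shell built-in.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 is one of the entries in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumption: Pig was unpacked to /usr/local/pig as in the steps above.
PIG_HOME="${PIG_HOME:-/usr/local/pig}"

if path_contains "$PIG_HOME/bin"; then
    echo "PATH includes $PIG_HOME/bin"
else
    echo "PATH is missing $PIG_HOME/bin; re-check the export lines in ~/.bashrc"
fi
```

Wrapping PATH in colons before matching avoids false positives on prefixes such as /usr/local/pig/bin-old.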
  10. Test Run ...
      7. Create a sample file for processing (file name pigcsv). The file extension does not matter, because PigStorage reads the file as delimited text. Sample file: create a file in an HDFS directory with the contents below.
             "2006";
             "2007";
             "2008";
             "2008";
             "2008";
             "2008";
             "2007";
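One way to prepare this file is to write it locally and then copy it into HDFS. The sketch below only creates the local file; the upload commands are left as comments because they need a running cluster. The function name write_sample is made up for illustration, and the HDFS path is the one the script in the next step loads from.

```shell
#!/bin/sh
# Hypothetical helper: write the seven sample records to the file named in $1.
write_sample() {
    cat > "$1" <<'EOF'
"2006";
"2007";
"2008";
"2008";
"2008";
"2008";
"2007";
EOF
}

# Demo: write to a temporary file, show it, then clean up.
sample=$(mktemp)
write_sample "$sample"
cat "$sample"
rm -f "$sample"

# With Hadoop running, the upload would look like this (not executed here):
#   write_sample pigcsv
#   hadoop fs -mkdir /user/hduser/piginput
#   hadoop fs -put pigcsv /user/hduser/piginput/pigcsv
```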
  11. Test Run ...
      8. Pig scripts
         Method 1 to run the Pig script: save the script as <<filename>>.pig (in my case, pig_test.pig) and run it as
             $ pig -x mapreduce pig_test.pig
         or
             $ pig pig_test.pig
         Contents of pig_test.pig:
             SampleRecord = LOAD '/user/hduser/piginput/pigcsv' USING PigStorage(';') AS (Year:chararray);
             GroupByYear = GROUP SampleRecord BY Year;
             CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
             STORE CountByYear INTO '/user/hduser/pigoutput' USING PigStorage('\t');
  12. Test Run ...
      Method 2 to run the Pig script: type it at the Grunt prompt. Everything up to a terminating ; is treated as one statement, so a statement can span several lines.
             grunt> SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
             >> USING PigStorage(';') AS (Year:chararray);
             grunt> GroupByYear = GROUP SampleRecord BY Year;
             grunt> CountByYear = FOREACH GroupByYear
             >> GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
             grunt> STORE CountByYear
             >> INTO '/user/hduser/pigoutput' USING PigStorage('\t');
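  13. Test Run ...
      9. Output:
             hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000
             Warning: $HADOOP_HOME is deprecated.
             "2006":1
             "2007":2
             "2008":4
             "Year":1
             hduser@kannandreams:/usr/local/hadoop/bin$
         The "Year":1 row suggests that the input file used for this run also contained a "Year" header line, which Pig counted like any other record.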
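To sanity-check the counts without a cluster, the same group-and-count can be approximated with standard Unix tools. This is a rough stand-in for the Pig script, not how Pig executes it; count_by_year is a hypothetical helper name, and the input is the seven sample records from step 7 with no header line.

```shell
#!/bin/sh
# Hypothetical helper: read ';'-delimited records on stdin and print key:count,
# mimicking LOAD ... PigStorage(';'), GROUP BY Year and COUNT.
count_by_year() {
    cut -d';' -f1 |                 # first field, like PigStorage(';')
        sort |                      # bring equal keys together (the GROUP step)
        uniq -c |                   # count each run of equal keys (COUNT)
        awk '{ print $2 ":" $1 }'   # format as key:count (the CONCAT step)
}

printf '%s\n' '"2006";' '"2007";' '"2008";' '"2008";' '"2008";' '"2008";' '"2007";' |
    count_by_year
```

For the sample records this prints "2006":1, "2007":2 and "2008":4, one per line, matching the first three rows of the Pig output.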
  14. Script Explanation
      Load the file into a variable, specifying the delimiter (';') and the column name and its type. If the file has more than one column, separate the columns with commas in the AS clause. By default, Pig loads files delimited by tab, so any other delimiter character must be specified explicitly.
             SampleRecord = LOAD '/user/hduser/piginput/pigcsv' USING PigStorage(';') AS (Year:chararray);
      Group the loaded data by year.
             GroupByYear = GROUP SampleRecord BY Year;
  15. Script Explanation ...
      Count the records in each group and generate the output as Key:Value. How the output is formatted is up to you. $0 is the group key and $1 is the bag of records in that group, which COUNT counts.
             CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
      Store the variable in a file.
             STORE CountByYear INTO '/user/hduser/pigoutput' USING PigStorage('\t');
      For the complete set of script commands, refer to http://pig.apache.org/docs/r0.10.0/start.html#data-results
  16. Pig in Cloudera
      The Pig Editor in Cloudera is explained on my blog: http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!
  17. Thank You !!!
      Mail: kannanpoem1984@gmail.com
      Twitter: @kannanpoem
      Blog: http://kannandreams.wordpress.com/about/
      FB community: www.facebook.com/groups/huge360/
      HUGE (Hadoop User Group & Enthusiasts): "Huge", yes, it's all about "BIG" data. This group has been created to bring together expertise and experts in Hadoop and Big Data.