Your SlideShare is downloading. ×
Pig installation and test run
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Pig installation and test run

3,110
views

Published on

This presentation help to install Hadoop Pig and run your sample program. Also provided the details how to use Pig Editor in Cloudera

This presentation help to install Hadoop Pig and run your sample program. Also provided the details how to use Pig Editor in Cloudera

Published in: Technology

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,110
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
75
Comments
0
Likes
2
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Pig Setup and Test run By Kannan Kalidasan
  • 2. Pig Introduction Pig is a data flow language ( PigLatin ) to write Hadoop operations without using MapReduce Java code. Pig is a layer of abstraction on top of Hadoop to simplify its use by giving a SQL-like interface to process data on Hadoop. Help to increase productivity by not writing many lines of Java code. It supports a variety of data types and also support user-defined functions (UDFs) to write custom operations in Java, Python and JavaScript. I recommended To learn Programming Pig – Allan Gates book. Author explain the concepts in clear and simple way.
  • 3. Pig Prompt is GRUNT pig grunts … $ pig grunt>
  • 4. Pig session has two modes Local Mode : Access to a single machine. All files are installed and run using your local host and file system.This mode helps to debug the pig script before we process them in clusters. -x flag is used to specify the mode. pig -x local MapReduce Mode : Access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; To add Hadoop Conf details to Pig Class path export PIG_CLASSPATH=$HADOOP_HOME/conf/ both below commands are same and Start the pig session in MapReduce mode. pig or pig -x mapreduce
  • 5. Note to Remember ... ● Hadoop services should be running to start the pig MapReduce mode and connect to HDFS and proceed with our work. ● Pig translates the PigLatin scripts into MapReduce Jobs internally and run in hadoop cluster. ● In MapReduce mode, takes file from HDFS only, and stores the results back to HDFS.
  • 6. Pig Installation 1. Download the stable version of tarbal. http://mirror.nexcess.net/apache/pig/pig-0.12.0/ pig-0.12.0.tar.gz Release notes link http://pig.apache.org/releases.html#Download
  • 7. Pig Installation ... 2.Copy the downloaded package to /usr/local /usr/local kannan@kannandreams:/usr/local$ ls -ltr total 119460 -rwxr-xr-x 1 root root 63851630 Nov 11 02:11 hadoop-1.2.1.tar.gz drwxr-xr-x 16 hduser hadoop 4096 Nov 11 23:47 hadoop -rwxrwxrwx 1 root root 58433159 Dec 3 00:55 pig-0.11.1.tar.gz kannan@kannandreams:/usr/local$
  • 8. Pig Installation ... 3. unzip and change the owner sudo tar xzf pig-0.11.1.tar.gz sudo mv pig-0.11.1 pig sudo chown -R hduser:hadoop pig chown command change the owner of the directory pig from root to hadoop user hduser. 4.Login to Hadoop user hduser and set the environment variables. kannan@kannandreams:/usr/local$ su – hduser Add the below two lines in ~/.bashrc file. export PIG_HOME=”/usr/local/pig” export PATH=$PATH:$PIG_HOME/bin
  • 9. Pig Installation ... 5. Source the profile file to reflect the changes hduser@kannandreams:~$ . .bashrc hduser@kannandreams:~$ 6.check the pig command output of the command mentioned below is not complete one. hduser@kannandreams:~$ pig -help Warning: $HADOOP_HOME is deprecated. Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
  • 10. Test Run ... 7. Create a sample file for processing ( file name as pigcsv ) Extension for the file doesn’t matter . it will understand based on mime type of the file. sample file – create a file in HDFS directory with the below contents “2006″; “2007″; “2008″; “2008″; “2008″; “2008″; “2007″;
  • 11. Test Run ... 8. Pig Scripts Method 1 to run the pig script : Save the pig scripts as <<filename>>.pig ( In my case, it is pig_test. pig ) and run as $ pig -x mapreduce pig_test.pig OR $ pig pig_test.pig SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’ USING PigStorage(‘;’) AS (Year:chararray); GroupByYear = GROUP SampleRecord BY Year; CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1))); STORE CountByYear INTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’);
  • 12. Test Run ... Method 2 to run the pig script : line ends with ; is considered as one statement grunt>SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’ >> USING PigStorage(‘;’) AS (Year:chararray); grunt>GroupByYear = GROUP SampleRecord BY Year; grunt>CountByYear = FOREACH GroupByYear >>GENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1))); grunt>STORE CountByYear >>INTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’);
  • 13. Test Run ... 9. Output : hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000 Warning: $HADOOP_HOME is deprecated. “2006″:1 “2007″:2 “2008″:4 “Year”:1 hduser@kannandreams:/usr/local/hadoop/bin$
  • 14. Script Explanation Load the file into a variable by mentioning the delimiter (‘;’) and Header name and its type. Use comma to include more than one column data available in file.By Default , Pig loads files delimited by tab. Need to explicitly mention type of delimiter character. SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’ USING PigStorage(‘;’) AS (Year:chararray); Group the variable stored data by year GroupByYear = GROUP SampleRecord BY Year;
  • 15. Script Explanation ... Count the records for each group set and generate the output as Key:Value.Its your wish how you want to generate the file output.$0 is the group by criteria and $1 is the output of the count CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1))); Store the variable in a file STORE CountByYear INTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’); For Complete Script commands , refer http://pig.apache.org/docs/r0.10.0/start.html#data-results
  • 16. Pig in Cloudera Pig Editor in Cloudera are explained in my blog. http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!
  • 17. Thank You !!! mail : kannanpoem1984@gmail.com @kannanpoem on twitter Blog: http://kannandreams.wordpress.com/about/ FB Community: www.facebook.com/groups/huge360/ HUGE - Hadoop User Group & Enthusiasts Huge , Yes Its All about "BIG" Data This has been created to build a group to get expertise and experts in Hadoop and Big Data .