Introduction to Apache Pig

An introductory presentation that was delivered by a co-worker at the NYC Hadoop Meetup

Published in: Technology, Business

  1. Introduction To PIG
     The evolution of data processing frameworks
  2. What is PIG?
     Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
     Pig compiles these programs into one or more Map/Reduce jobs on the fly.
  3. Why PIG?
     Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  4. File Formats
     PigStorage
     Custom Load / Store Functions
  5. Installing PIG
     Download / Unpack tarball (pig.apache.org)
     Install RPM / DEB package (cloudera.com)
  6. Running PIG
     Grunt Shell: Enter Pig commands manually using Pig’s interactive shell, Grunt.
     Script File: Place Pig commands in a script file and run the script.
     Embedded Program: Embed Pig commands in a host language and run the program.
  7. Run Modes
     Local Mode: To run Pig in local mode, you need access to a single machine.
     Hadoop (mapreduce) Mode: To run Pig in Hadoop (mapreduce) mode, you need access to a Hadoop cluster and an HDFS installation.
  8. Sample PIG script

         -- Load a colon-delimited file and keep only the first field.
         A = load 'passwd' using PigStorage(':');
         B = foreach A generate $0 as id;
         store B into 'id.out';
  9. Sample Script With Schema

         A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
         B = FOREACH A GENERATE myudfs.UPPER(name);
  10. Eval Functions
      AVG, CONCAT, COUNT, COUNT_STAR, DIFF, IsEmpty, MAX, MIN, SIZE, SUM, TOKENIZE
  11. Math Functions
      ABS, ACOS, ASIN, ATAN, CBRT, CEIL, COSH, COS, EXP, FLOOR, LOG, LOG10, RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
  12. Pig Types
  13. Sample CW PIG script

          RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
          input = FOREACH RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions;
          GroupedInput = GROUP input BY (Category, TagId, URL);
          result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions;
          STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
  14. Sample PIG script (Filtering)

          RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
          input = FOREACH RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions;
          defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12);
          GroupedInput = GROUP defFilter BY (Category, TagId, URL);
          result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions;
          STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
  15. What is PIG UDF?
      UDF: User Defined Function
      Types of UDFs:
        Eval Functions (extends EvalFunc<String>)
        Aggregate Functions (extends EvalFunc<Long> implements Algebraic)
        Filter Functions (extends FilterFunc)
      UDFContext:
        Allows UDFs to get access to the JobConf object.
        Allows UDFs to pass configuration information between instantiations of the UDF on the front end and back end.
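  16. Sample UDF

          import java.io.IOException;
          import org.apache.pig.EvalFunc;
          import org.apache.pig.data.Tuple;

          // Eval UDF: passes the first field of the tuple to Validator.getTLD.
          // Validator is not defined on the slides; its import is omitted here.
          public class TopLevelDomain extends EvalFunc<String> {
              @Override
              public String exec(Tuple tuple) throws IOException {
                  Object o = tuple.get(0);
                  if (o == null) {
                      return null; // pass null fields through unchanged
                  }
                  return Validator.getTLD(o.toString());
              }
          }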
  17. UDF In Action

          REGISTER '$WORK_DIR/pig-support.jar';
          DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
          AA = FOREACH input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;
  18. Resources
      Apache PIG http://pig.apache.org/
      Apache Hadoop http://hadoop.apache.org/
      Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation
  19. PIG DEMO
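
For readers without a cluster at hand, the data flow of the filtering script (slide 14) can be traced in plain Java. This is only a sketch of the FILTER / GROUP / SUM semantics; it does not use the Pig API and is not the presenter's code. The `PigFlowSketch` class, the `Row` type, and the sample rows are hypothetical, and the field types (int ids, long impressions) are assumptions, since the slides do not show the schema.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of the slide 14 data flow (hypothetical, no Pig API).
class PigFlowSketch {

    // One record with the fields projected by the script's first FOREACH.
    // Field types are assumed; the slides do not show the schema.
    static final class Row {
        final int category;
        final int defLevelId;
        final int tagId;
        final String url;
        final long impressions;

        Row(int category, int defLevelId, int tagId, String url, long impressions) {
            this.category = category;
            this.defLevelId = defLevelId;
            this.tagId = tagId;
            this.url = url;
            this.impressions = impressions;
        }
    }

    // Made-up sample data standing in for the LOAD statement.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1, 8, 100, "a.com", 10L),
                new Row(1, 12, 100, "a.com", 5L),
                new Row(1, 3, 100, "a.com", 99L), // dropped by the filter
                new Row(2, 8, 200, "b.com", 7L));
    }

    // FILTER BY DefLevelId in {8, 12}, GROUP BY (Category, TagId, URL),
    // then SUM(Impressions) per group.
    static Map<List<Object>, Long> filterGroupSum(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.defLevelId == 8 || r.defLevelId == 12)
                .collect(Collectors.groupingBy(
                        r -> Arrays.<Object>asList(r.category, r.tagId, r.url),
                        Collectors.summingLong(r -> r.impressions)));
    }

    public static void main(String[] args) {
        // Print one line per group: the group key, a tab, and the summed impressions.
        filterGroupSum(sampleRows())
                .forEach((group, total) -> System.out.println(group + "\t" + total));
    }
}
```

With these sample rows, the group (1, 100, a.com) sums to 15 and the group (2, 200, b.com) to 7; the row with DefLevelId 3 never reaches the grouping step. Pig expresses the same three steps declaratively and runs them as Map/Reduce jobs rather than in a single JVM.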