Introduction to Apache Pig

An introductory presentation that was delivered by a co-worker at the NYC Hadoop Meetup

Published in: Technology, Business

  1. 1. Introduction To PIG<br />The evolution of data processing frameworks<br />
  2. 2. What is PIG?<br />Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs.<br />Pig compiles these programs into one or more Map/Reduce jobs on the fly.<br />
  3. 3. Why PIG?<br />Ease of programming - It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.<br />
  4. 4. File Formats<br />PigStorage<br />Custom Load / Store Functions<br />
  5. 5. Installing PIG<br />Download / Unpack tarball (pig.apache.org)<br />Install RPM / DEB package (cloudera.com)<br />
  6. 6. Running PIG<br />Grunt Shell: Enter Pig commands manually using Pig’s interactive shell, Grunt.<br />Script File: Place Pig commands in a script file and run the script.<br />Embedded Program: Embed Pig commands in a host language and run the program. <br />
  7. 7. Run Modes<br />Local Mode: To run Pig in local mode, you need access to a single machine.<br />Hadoop(mapreduce) Mode: To run Pig in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation. <br />
  8. 8. Sample PIG script<br />A = load 'passwd' using PigStorage(':');<br />B = foreach A generate $0 as id;<br />store B into 'id.out';<br />
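To make the data flow of the sample script concrete, here is a minimal plain-Java sketch of what it computes: split each line on ':' and keep the first field. The class name `PasswdIds` and the sample lines are assumptions for illustration; this shows the logic only, not how Pig executes the script.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of the sample script's logic (hypothetical class, not Pig code).
class PasswdIds {
    // Mirrors: B = foreach A generate $0 as id;
    static List<String> firstFields(List<String> lines) {
        return lines.stream()
                // PigStorage(':') treats ':' as the field delimiter; $0 is the first field.
                .map(line -> line.split(":", -1)[0])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> passwd = List.of(
                "root:x:0:0:root:/root:/bin/bash",
                "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin");
        System.out.println(firstFields(passwd)); // [root, daemon]
    }
}
```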
  9. 9. Sample Script With Schema<br />A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);<br />B = FOREACH A GENERATE myudfs.UPPER(name);<br />
  10. 10. Eval Functions<br />AVG<br />CONCAT<br />COUNT<br />COUNT_STAR<br />DIFF<br />IsEmpty<br />MAX<br />MIN<br />SIZE<br />SUM<br />TOKENIZE<br />
  11. 11. Math Functions<br />ABS<br />ACOS<br />ASIN<br />ATAN<br />CBRT<br />CEIL<br />COSH<br />COS<br />EXP<br />FLOOR<br />LOG<br />LOG10<br />RANDOM<br />ROUND<br />SIN<br />SINH<br />SQRT<br />TAN<br />TANH<br />
  12. 12. Pig Types<br />
  13. 13. Sample CW PIG script<br />RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');<br />input = FOREACH RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions;<br />GroupedInput = GROUP input BY (Category, TagId, URL);<br />result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions;<br />STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();<br />
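The GROUP and SUM steps in the script above amount to a group-by on a composite key followed by a per-group total. A minimal plain-Java sketch of that computation follows; the `Row` type, its field types, and the sample values are assumptions, since the real schema comes from `wide.xml`, which the slides do not show.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of GROUP ... BY (Category, TagId, URL) plus SUM(Impressions).
class ImpressionTotals {
    // Hypothetical row type; field types are assumed, not taken from the real schema.
    static final class Row {
        final int category;
        final int tagId;
        final String url;
        final long impressions;

        Row(int category, int tagId, String url, long impressions) {
            this.category = category;
            this.tagId = tagId;
            this.url = url;
            this.impressions = impressions;
        }
    }

    // The composite key plays the role of Pig's "group" tuple.
    static Map<List<Object>, Long> sumByKey(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Row r) -> List.<Object>of(r.category, r.tagId, r.url),
                LinkedHashMap::new,
                Collectors.summingLong((Row r) -> r.impressions)));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, 10, "a.example", 5L),
                new Row(1, 10, "a.example", 7L),
                new Row(2, 10, "a.example", 1L));
        System.out.println(sumByKey(rows)); // {[1, 10, a.example]=12, [2, 10, a.example]=1}
    }
}
```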
  14. 14. Sample PIG script (Filtering)<br />RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');<br />input = FOREACH RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions;<br />defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12);<br />GroupedInput = GROUP defFilter BY (Category, TagId, URL);<br />result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions;<br />STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();<br />
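The filter step keeps only rows whose DefLevelId is 8 or 12, so any later total should cover only those rows. A small plain-Java sketch of that effect, using a hypothetical two-column row layout of {defLevelId, impressions} and made-up values:

```java
import java.util.List;

// Plain-Java illustration of FILTER ... BY (DefLevelId == 8) or (DefLevelId == 12).
class DefLevelFilter {
    // Same predicate as the FILTER statement.
    static boolean keep(long defLevelId) {
        return defLevelId == 8 || defLevelId == 12;
    }

    // Each row is {defLevelId, impressions}; this layout is an assumption for the sketch.
    static long sumKept(List<long[]> rows) {
        return rows.stream()
                .filter(row -> keep(row[0]))   // drop rows that fail the predicate
                .mapToLong(row -> row[1])      // total impressions of the surviving rows
                .sum();
    }

    public static void main(String[] args) {
        List<long[]> rows = List.of(
                new long[] {8, 100},
                new long[] {3, 50},
                new long[] {12, 25});
        System.out.println(sumKept(rows)); // 125
    }
}
```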
  15. 15. What is PIG UDF?<br />UDF - User Defined Function<br />Types of UDFs:<br />Eval Functions (extends EvalFunc<String>)<br />Aggregate Functions (extends EvalFunc<Long> implements Algebraic)<br />Filter Functions (extends FilterFunc)<br />UDFContext<br />Allows UDFs to get access to the JobConf object<br />Allows UDFs to pass configuration information between instantiations of the UDF on the front end and back end.<br />
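An aggregate UDF that implements Algebraic can be evaluated in stages (Pig's interface names them through getInitial, getIntermed, and getFinal), which lets partial results be combined before the final step. The following is a simplified plain-Java sketch of that idea for a count; the class and method names are hypothetical and it does not use the Pig API.

```java
import java.util.List;

// Simplified illustration of staged (algebraic) aggregation for a count.
class AlgebraicCount {
    // Initial stage: reduce one chunk of input to a partial count.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermediate stage: merge partial counts into a larger partial count.
    static long intermediate(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Final stage: for a count, the last merge is the same addition.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        long a = initial(List.of("x", "y"));
        long b = initial(List.of("z"));
        System.out.println(finalCount(List.of(a, b))); // 3
    }
}
```

The staged result matches a single pass over all the input because addition is associative, which is the property that makes a function a candidate for this style.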
  16. 16. Sample UDF<br />public class TopLevelDomain extends EvalFunc<String> {<br /> @Override<br /> public String exec(Tuple tuple) throws IOException {<br /> Object o = tuple.get(0);<br /> if (o == null) {<br /> return null;<br /> }<br /> return Validator.getTLD(o.toString());<br /> }<br />}<br />
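The slides do not show `Validator.getTLD`, so the sketch below substitutes a hypothetical `getTld` helper that returns the text after the last dot. It is only meant to show the null-in, null-out pattern of the UDF without needing the Pig jars; the real helper may behave differently (for example, it may return a registered domain rather than a bare suffix).

```java
import java.util.Locale;

// Pig-free illustration of the UDF's null handling, with an assumed TLD helper.
class TldSketch {
    // Hypothetical stand-in for Validator.getTLD: text after the last dot, lowercased.
    static String getTld(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // no dot, or nothing after it
        }
        return host.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Mirrors the body of exec(): a null field yields a null result.
    static String exec(Object field) {
        if (field == null) {
            return null;
        }
        return getTld(field.toString());
    }

    public static void main(String[] args) {
        System.out.println(exec("www.Example.COM")); // com
        System.out.println(exec(null));              // null
    }
}
```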
  17. 17. UDF In Action<br />REGISTER '$WORK_DIR/pig-support.jar';<br />DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();<br />AA = FOREACH input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;<br />
  18. 18. Resources<br />Apache PIG http://pig.apache.org/<br />Apache Hadoop http://hadoop.apache.org/<br />Cloudera CDH https://wiki.cloudera.com/display/DOC/CDH3+Installation<br />
  19. 19. PIG DEMO<br />
