Pig: Data Analysis Tool in Cloud

Presented at the JavaOne conference in Beijing, 2010


  1. Pig: Data Analysis Tool in Cloud
     Jeff Zhang
     zjffdu@gmail.com
     Committer of Pig in ASF
  2. Agenda
     Background
     What is Pig
     Brief introduction of Pig internals
     Demo
     Q/A
  3. Data Explosion
     Web 2.0
     More digital terminals
     What we have for data analysis
       RDBMS (limited scalability)
       Parallel RDBMS (expensive)
       Programming language (too complex)
       Hadoop MapReduce (still too complex for non-Hadoop users)
  4. Then, Pig's Coming
  5. What is Pig
     Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
     Ease of programming
     Optimization opportunities
     Extensibility
     Built upon Hadoop
  6. A simple example of Pig-Latin
     Input (time_stamp, url):
       1291950309812, http://snda.com/page_1
       1291950309822, http://snda.com/page_2
       1291950309832, http://snda.com/page_3
       ….
     Page view
       raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);
       pages = foreach raw_data generate url;
       pages = group pages by url;
       pages = foreach pages generate group as url, COUNT(pages.url) as pv;
     The 10 most popular pages
       result = order pages by pv desc;
       top10 = limit result 10;
       dump top10;
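To make the data flow of the script above concrete, here is a minimal single-machine sketch in plain Java of the same computation: split each line on the comma, count views per URL, order by count descending, and keep the first N. The class name `PageViewSketch`, the method `topPages`, and the repeated fourth input line are made up for illustration; this is not how Pig executes the script, which compiles it to MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: mirrors load -> group -> COUNT -> order -> limit.
class PageViewSketch {

    // Returns up to n (url, pv) entries, most viewed first.
    // Assumes every line has the form "time_stamp,url".
    static List<Map.Entry<String, Long>> topPages(List<String> lines, int n) {
        Map<String, Long> pv = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");   // rough analogue of PigStorage(',')
            String url = fields[1].trim();       // foreach ... generate url
            pv.merge(url, 1L, Long::sum);        // group by url + COUNT
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(pv.entrySet());
        // order by pv desc (stable sort: ties keep first-seen order)
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        // limit n
        return result.subList(0, Math.min(n, result.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1291950309812, http://snda.com/page_1",
                "1291950309822, http://snda.com/page_2",
                "1291950309832, http://snda.com/page_3",
                "1291950309842, http://snda.com/page_1"); // made-up repeat visit
        for (Map.Entry<String, Long> e : topPages(lines, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The four Pig Latin statements replace all of this bookkeeping, and unlike the sketch they run in parallel across a cluster.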
  7. Operators in Pig-Latin
     Load - a = load 'data' using PigStorage('\t') as (f1:int, f2:double, f3:chararray);
     Store - store a into '/test/output' using PigStorage(',');
     Dump - dump a;
     Filter - b = filter a by f1 > 0 and f3 == 'java_one';
     Foreach - b = foreach a generate f1, f3;
     Group - b = group a by f3;
     Join - c = join a by f1, b by f1;
     Describe - describe b;
     ….
  8. Data Structure in Pig
     Cell -> field in database
       - Primitive types: int, long, float, double, bytearray, chararray, null
       - Complex types: map, tuple, databag
     Tuple -> row
       (1, 1.2, "java")
     DataBag -> table or view
       { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }
  9. How to use Pig
     Grunt (Interactive Shell)
     Java API
     Other languages (in future)
  10. Architecture of Pig
      Grunt (Interactive shell)
      PigServer (Java API)
      Parser (PigLatin -> LogicalPlan)
      PigContext
      Optimizer (LogicalPlan -> LogicalPlan)
      Compiler (LogicalPlan -> PhysicalPlan -> MapReducePlan)
      ExecutionEngine
      Hadoop
  11. Three basic operations of Pig
      Group by
      Join
      Order
  12. How Pig does Group by
      [Diagram] Data Source -> Split -> Mapper -> Partition -> Reducer
      Input records, in three splits:
        (A,1) (B,2) (C,3)
        (B,4) (B,5) (C,6)
        (A,7) (E,8) (D,9)
      Reducer output:
        (A,{(A,1),(A,7)})  (C,{(C,3),(C,6)})  (E,{(E,8)})
        (B,{(B,2),(B,4),(B,5)})  (D,{(D,9)})
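The group-by diagram above can be imitated with a short plain-Java simulation: each record is routed to a reducer by hashing its key, and each reducer collects the values that share a key. The class `GroupBySketch` and its methods are hypothetical names. The partition formula is the one used by Hadoop's default `HashPartitioner`, applied here to Java `String` keys as an assumption; real Pig keys hash differently, so the actual reducer assignment may vary.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the Mapper -> Partition -> Reducer flow.
class GroupBySketch {

    // Same formula as Hadoop's default HashPartitioner, here on plain Java Strings.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one (key -> grouped values) map per reducer.
    static List<Map<String, List<Integer>>> groupBy(
            List<Map.Entry<String, Integer>> records, int numReducers) {
        List<Map<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> r : records) {
            // "Shuffle": all records with the same key reach the same reducer.
            Map<String, List<Integer>> reducer =
                    reducers.get(partition(r.getKey(), numReducers));
            reducer.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return reducers;
    }

    // The nine records shown on the slide, in slide order.
    static List<Map.Entry<String, Integer>> slideRecords() {
        String[] keys = {"A", "B", "C", "B", "B", "C", "A", "E", "D"};
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            records.add(new AbstractMap.SimpleEntry<>(keys[i], i + 1));
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> reducers = groupBy(slideRecords(), 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

With two reducers and these single-letter keys, A, C and E land on one reducer and B and D on the other, the same grouping the slide shows.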
  13. How Pig does Join
      [Diagram] Data Source -> Split -> Mapper -> Partition -> Reducer
      Input relation A: (1,A1) (4,A4) (3,A3) (5,A5) (2,A2)
      Input relation B: (5,B5) (1,B1) (3,B3) (2,B2) (4,B4)
      Reducer output:
        ((1,A1),(1,B1))  ((3,A3),(3,B3))  ((5,A5),(5,B5))
        ((2,A2),(2,B2))  ((4,A4),(4,B4))
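The join diagram above follows the same pattern as group-by, with records from both inputs routed by join key so that matching rows meet on one reducer. Below is a minimal plain-Java sketch of such a reduce-side inner join. `JoinSketch`, `rel`, and `route` are hypothetical names, and hashing `Integer` keys with the `HashPartitioner` formula is an assumption made for the illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of a reduce-side inner join.
class JoinSketch {

    // Builds a relation of (key, prefix + key) rows, e.g. rel("A", 1, 4) -> (1,A1), (4,A4).
    static List<Map.Entry<Integer, String>> rel(String prefix, int... keys) {
        List<Map.Entry<Integer, String>> rows = new ArrayList<>();
        for (int k : keys) {
            rows.add(new AbstractMap.SimpleEntry<>(k, prefix + k));
        }
        return rows;
    }

    // Routes one row to the reducer chosen by hashing its join key.
    static void route(List<TreeMap<Integer, List<String>>> side,
                      Map.Entry<Integer, String> row, int numReducers) {
        int p = (row.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
        side.get(p).computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
    }

    // Returns, per reducer, the joined rows rendered as "((k,left),(k,right))".
    static List<List<String>> join(List<Map.Entry<Integer, String>> left,
                                   List<Map.Entry<Integer, String>> right,
                                   int numReducers) {
        List<TreeMap<Integer, List<String>>> leftSide = new ArrayList<>();
        List<TreeMap<Integer, List<String>>> rightSide = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            leftSide.add(new TreeMap<>());
            rightSide.add(new TreeMap<>());
        }
        for (Map.Entry<Integer, String> row : left) route(leftSide, row, numReducers);
        for (Map.Entry<Integer, String> row : right) route(rightSide, row, numReducers);

        List<List<String>> output = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            List<String> joined = new ArrayList<>();
            for (Map.Entry<Integer, List<String>> e : leftSide.get(i).entrySet()) {
                List<String> matches = rightSide.get(i).get(e.getKey());
                if (matches == null) continue;  // inner join: unmatched keys are dropped
                for (String l : e.getValue()) {
                    for (String r : matches) {
                        joined.add("((" + e.getKey() + "," + l + "),("
                                + e.getKey() + "," + r + "))");
                    }
                }
            }
            output.add(joined);
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> out =
                join(rel("A", 1, 4, 3, 5, 2), rel("B", 5, 1, 3, 2, 4), 2);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + out.get(i));
        }
    }
}
```

Because both relations are partitioned with the same function, a reducer never needs rows held by another reducer to complete its share of the join.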
  14. How Pig does Sort
      [Diagram] Data Source -> Split -> Mapper -> Range Partition -> Reducer
      Input records: (100) (200) (900) (50) (600) (800) (300) (400)
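The range-partition step in the sort diagram above is what makes a parallel sort globally ordered: each reducer receives a contiguous range of values, sorts it locally, and the reducer outputs read in order form the full sorted result. The plain-Java sketch below illustrates this under stated assumptions: `RangeSortSketch` is a hypothetical name, and the boundary value 400 is chosen by hand for the example, whereas a real system would derive boundaries from the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical single-process imitation of a range-partitioned sort.
class RangeSortSketch {

    // boundaries must be ascending. A value goes to the first partition whose
    // boundary is >= the value, or to the last partition if none is.
    static int partition(int value, int[] boundaries) {
        int p = 0;
        while (p < boundaries.length && value > boundaries[p]) {
            p++;
        }
        return p;
    }

    // Returns one locally sorted list per reducer (boundaries.length + 1 reducers).
    static List<List<Integer>> sort(List<Integer> input, int[] boundaries) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int v : input) {
            reducers.get(partition(v, boundaries)).add(v);
        }
        for (List<Integer> r : reducers) {
            Collections.sort(r);  // each reducer sorts only its own range
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Input records from the slide; the boundary 400 is an assumption.
        List<Integer> input = Arrays.asList(100, 200, 900, 50, 600, 800, 300, 400);
        List<List<Integer>> out = sort(input, new int[] {400});
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + out.get(i));
        }
    }
}
```

A hash partitioner would not work here, because it scatters neighbouring values across reducers and the concatenated output would no longer be sorted.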
  15. UDF (User-Defined Function)
      Pig Latin script:
        register myudf.jar;
        raw_data = load '/java_one/udf' as (name:chararray);
        firstnames = foreach raw_data generate myudf.FirstName(name);
        store firstnames into '/java_one/udf_output';
      Java implementation:
        package myudf;

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        public class FirstName extends EvalFunc<String> {
          @Override
          public String exec(Tuple input) throws IOException {
            String name = input.get(0).toString();
            ….
            return firstname;
          }
        }
  16. What Storage Pig Supports
      HDFS
        Plain text
        Binary format
        Customized format (XML, JSON, Protobuf, Thrift…)
      RDBMS (DBStorage)
      Cassandra (CassandraStorage)
      HBase (HBaseStorage)
  17. What fields can Pig be applied to
      Data Analysis
      Text Processing
      ETL
      Machine Learning
  18. Who's using Pig
      More: http://wiki.apache.org/pig/PoweredBy
  19. References
      http://pig.apache.org (Pig official site)
      http://hadoop.apache.org (Hadoop official site)
      https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
  20. Demo
  21. Thank you!
      Q&A
