SQL Saturday Pig Session (Wes Floyd) v2


  1. Data Manipulation with Pig - Wes Floyd (@weswfloyd)
  3. Pig History • Born from Yahoo Research, then incubated at Apache • Built to avoid low-level MapReduce programming without resorting to Hive/SQL queries • Committers from Yahoo, Hortonworks, LinkedIn, Salesforce, IBM, Twitter, Netflix, and others • Alan Gates on Pig
  4. Pig • An engine for executing programs on top of Hadoop • It provides a language, Pig Latin, to specify these programs
  5. HDP: Enterprise Hadoop Platform • Hortonworks Data Platform (HDP): the only 100% open source and complete platform • Integrates a full range of enterprise-ready services • Certified and tested at scale • Engineered for deep ecosystem interoperability. (Diagram: Hadoop core of HDFS, YARN, MapReduce, and Tez; data services Hive & HCatalog, Pig, and HBase; load & extract via Sqoop, Flume, NFS, and WebHDFS; operational services Oozie, Ambari, Falcon, and Knox; enterprise readiness via high availability, disaster recovery, rolling upgrades, security, and snapshots; deployable on OS/VM, cloud, or appliance.)
  6. Why use Pig? • Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25
  7. In Map-Reduce: 170 lines of code, 4 hours to write
  8. In Pig Latin: 9 lines of code, 15 minutes to write (170 lines down to 9)
     Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
     Fltrd = filter Users by age >= 18 and age <= 25;
     Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
     Jnd   = join Fltrd by name, Pages by user;
     Grpd  = group Jnd by url;
     Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
     Srtd  = order Smmd by clicks desc;
     Top5  = limit Srtd 5;
     store Top5 into 'output/top5sites' using PigStorage(',');
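To make the dataflow above concrete, here is a minimal in-memory Python sketch of what each relational step computes (filter, join, group, count, order, limit). The sample data and function name are invented for illustration; the real script runs as MapReduce jobs over HDFS files.

```python
# Illustrative Python equivalent of the Pig Latin top-5-sites script.
# Sample data is made up; the real pipeline reads HDFS files.
from collections import Counter

def top5_sites(users, pages):
    """users: (name, age) tuples; pages: (user, url) tuples."""
    # filter Users by age >= 18 and age <= 25
    fltrd = {name for name, age in users if 18 <= age <= 25}
    # join Fltrd by name, Pages by user
    jnd = [(user, url) for user, url in pages if user in fltrd]
    # group by url, count clicks, order descending, limit 5
    clicks = Counter(url for _, url in jnd)
    return clicks.most_common(5)

users = [("ann", 22), ("bob", 40), ("cam", 19)]
pages = [("ann", "a.com"), ("cam", "a.com"), ("bob", "b.com"), ("ann", "c.com")]
print(top5_sites(users, pages))  # bob (age 40) is filtered out
```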
  9. Essence of Pig • Map-Reduce is too low-level, SQL too high-level • Pig Latin is a language intended to sit between the two – Provides standard relational transforms (join, sort, etc.) – Schemas are optional: used when available, and can be defined at runtime – User Defined Functions are first-class citizens
  10. Pig Architecture • Pig client: parses, validates, optimizes, plans, and coordinates execution • Data stored in HDFS • Processing done via MapReduce
  11. Pig Elements • Pig Latin: high-level scripting language; requires no metadata or schema; statements are translated into a series of MapReduce jobs • Grunt: interactive shell • Piggybank: shared repository for User Defined Functions (UDFs)
  12. Pig Latin Data Flow • LOAD (HDFS/HCat): read the data to be manipulated from the file system • TRANSFORM (Pig): manipulate the data • DUMP or STORE (HDFS/HCat): output the data to the screen or store it for processing. In code: VARIABLE1 = LOAD [somedata]; VARIABLE2 = [TRANSFORM operation]; STORE VARIABLE2 INTO '[some location]'
  13. Pig Relations • Pig Latin statements work with relations • A bag is an unordered collection of tuples (which can be different sizes) • A tuple is an ordered set of fields • A field is a piece of data. (Diagram: a bag contains tuples; each tuple contains fields.)
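The bag/tuple/field data model can be sketched with plain Python types, purely as an analogy: a field is a value, a tuple is an ordered sequence of fields, and a bag is an unordered collection of tuples that may differ in size. The names and values below are invented for the sketch.

```python
# Analogy for Pig's data model using Python built-ins.
# A Python set of tuples stands in for a bag: unordered, and the
# member tuples need not all have the same number of fields.
bag = {
    ("alice", 30, 94501),  # a 3-field tuple
    ("bob", 25),           # tuples in a bag may have different sizes
}

for t in bag:              # each t is a tuple (ordered fields)
    for field in t:        # each field is a single piece of data
        pass

print(len(bag), ("bob", 25) in bag)
```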
  14. FILTER, GROUP, FOREACH, ORDER
      logevents = LOAD 'input/my.log' AS (date:chararray, level:chararray, code:int, message:chararray);
      severe = FILTER logevents BY (level == 'severe' AND code >= 500);
      grouped = GROUP severe BY code;
      e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
      f = FOREACH e1 GENERATE age, salary;
      g = ORDER f BY age;
  15. JOIN, GROUP, LIMIT
      employees = LOAD '[somefile]' AS (name:chararray, age:int, zip:int, salary:double);
      agegroup = GROUP employees BY age;
      h = LIMIT agegroup 100;
      e1 = LOAD '[somefile]' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
      e2 = LOAD '[somefile]' USING PigStorage(',') AS (name:chararray, phone:chararray);
      e3 = JOIN e1 BY name, e2 BY name;
  16. Pig Basics Demo
  17. Grunt Command Line Demo
  18. Hive vs Pig • Pig and Hive work well together, and many businesses use both. Hive is a good choice: when you want to query the data; when you need an answer to specific questions; if you are familiar with SQL. Pig is a good choice: for ETL (Extract -> Transform -> Load); for preparing data for easier analysis; when you have a long series of steps to perform.
  19. Tool Comparison (© Hortonworks 2012)
      Feature        | MapReduce       | Pig                                            | Hive
      Record format  | Key-value pairs | Tuple                                          | Record
      Data model     | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists, char, varchar, decimal, ...
      Schema         | Encoded in app  | Declared in script or read by loader           | Read from metadata
      Data location  | Encoded in app  | Declared in script                             | Read from metadata
      Data format    | Encoded in app  | Declared in script                             | Read from metadata
  20. T-SQL vs Hadoop Ecosystem
      Feature                 | T-SQL | Pig           | Hive
      Query data              | Yes   | Yes (in bulk) | Yes
      Local variables         | Yes   | Yes           | No
      Conditional logic       | Yes   | Limited       | Limited
      Procedural programming  | Yes   | No            | No
      UDFs                    | No    | Yes           | Yes
  21. HCatalog: Data Sharing is Hard (Photo Credit: totalAldo via Flickr) • This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries. • Joe: "OK Bob, I need today's data. Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that ALTER TABLE command." • "Dude, we need HCatalog."
  22. Pig Example • Assume you want to count how many times each of your users went to each of your URLs:
      raw = load '/data/rawevents/20120530' as (url, user);
      botless = filter raw by myudfs.NotABot(user);
      grpd = group botless by (url, user);
      cntd = foreach grpd generate flatten(group), COUNT(botless);
      store cntd into '/data/counted/20120530';
  23. Pig Example, using HCatalog • The same count with HCatalog: no need to know the file location, no need to declare the schema, and the date becomes a partition filter:
      raw = load 'rawevents' using HCatLoader();
      botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
      grpd = group botless by (url, user);
      cntd = foreach grpd generate flatten(group), COUNT(botless);
      store cntd into 'counted' using HCatStorer();
  24. Tools With HCatalog
      Feature        | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
      Record format  | Record                                    | Tuple                                          | Record
      Data model     | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
      Schema         | Read from metadata                        | Read from metadata                             | Read from metadata
      Data location  | Read from metadata                        | Read from metadata                             | Read from metadata
      Data format    | Read from metadata                        | Read from metadata                             | Read from metadata
      • Pig/MR users can read schema from metadata • Pig/MR users are insulated from schema, location, and format changes • All users have access to other users' data as soon as it is committed
  25. Pig with HCat Demo
  26. Data & Metadata REST Services APIs • WebHDFS and WebHCat provide a RESTful API as the "front door" for Hadoop: opens the door to languages other than Java; thin clients via web services vs. fat clients in a gateway; insulation from interface changes release to release • Opens Hadoop to integration with existing and new applications. (Diagram: existing and new applications call the WebHDFS and WebHCat RESTful web services, which front MapReduce, Pig, Hive, and HCatalog over HDFS, HBase, and external stores.)
  27. RESTful API Access for Pig • Code example:
      curl -s -d user.name=hue \
           -d execute="<pig script>" \
           'http://localhost:50111/templeton/v1/pig'
      • RestSharp (restsharp.org): a simple REST and HTTP API client for .NET
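The same submission can be built from Python's standard library instead of curl; a hedged sketch follows. The host, port, and user name mirror the curl example above, and the `<pig script>` placeholder is left as-is. Nothing is sent until `urlopen()` is called, so constructing the request is side-effect free.

```python
# Sketch: build the WebHCat (Templeton) Pig-submission request in Python.
# Host/port and user.name come from the curl example; requires a live
# WebHCat gateway to actually run, so the send step is commented out.
from urllib.parse import urlencode
from urllib.request import Request

body = urlencode({"user.name": "hue", "execute": "<pig script>"}).encode()
req = Request("http://localhost:50111/templeton/v1/pig",
              data=body, method="POST")
# from urllib.request import urlopen
# resp = urlopen(req)  # would submit the job to the gateway

print(req.full_url)
```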
  28. WebHCat REST API • REST endpoints: databases, tables, partitions, columns, table properties • PUT to create/update, GET to list or describe, DELETE to drop.
      Get a list of all tables in the default database:
      GET http://…/v1/ddl/database/default/table
      -> { "tables": ["counted", "processed"], "database": "default" }
      Create a table:
      PUT http://…/v1/ddl/database/default/table/rawevents
      {"columns": [{"name": "url", "type": "string"}, {"name": "user", "type": "string"}],
       "partitionedBy": [{"name": "ds", "type": "string"}]}
      -> { "table": "rawevents", "database": "default" }
      Describe it:
      GET http://…/v1/ddl/database/default/table/rawevents
      -> { "columns": [{"name": "url", "type": "string"}, {"name": "user", "type": "string"}], "database": "default", "table": "rawevents" }
  29. Pig with WebHCat Demo
  30. Hive-on-MR vs. Hive-on-Tez • Example query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state • On MapReduce, the plan runs as a chain of MR jobs (join a with b, join with c, then group/aggregate), writing intermediate results to HDFS between jobs • On Tez, the same plan runs as a single DAG with no intermediate HDFS writes • Tez avoids unneeded writes to HDFS
  31. Pig on Tez - Design • Logical Plan -> LogToPhyTranslationVisitor -> Physical Plan • Physical Plan -> TezCompiler -> Tez Plan -> Tez Execution Engine • Physical Plan -> MRCompiler -> MR Plan -> MR Execution Engine
  32. Performance numbers (time in secs, MR vs. Tez) • Replicated Join: 2.8x speedup • Join + Groupby: 1.5x • Join + Groupby + Orderby: 1.5x • 3-way Split + Join + Groupby + Orderby: 2.6x
  33. User Defined Functions • Ultimate in extensibility and portability • Custom processing in Java, Python, JavaScript, or Ruby • Integration with MapReduce phases: Map, Combine, Reduce
  34. User Defined Functions
      public class MyUDF extends EvalFunc<DataBag> implements Algebraic {
          …
      }
      • Algebraic functions • 3-phase execution – Map: called once for each tuple – Combiner: called zero or more times for each map result – Reduce
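What "algebraic" buys you can be sketched outside the Java API: the computation splits into a per-tuple initial step, a mergeable intermediate step, and a final step, which is what lets Pig push the work into MapReduce's combiner. This is a Python analogy using COUNT, not Pig's actual `Algebraic` interface (which returns class names for the three phases).

```python
# Analogy for an algebraic UDF, using COUNT. The three functions
# correspond to the map / combiner / reduce phases described above.
def initial(tup):
    """Map phase: called once per tuple, emits a partial count."""
    return 1

def intermed(partials):
    """Combiner phase: merges partial counts (may run zero or more times)."""
    return sum(partials)

def final(partials):
    """Reduce phase: merges remaining partials into the answer."""
    return sum(partials)

data = [("a",), ("b",), ("c",), ("d",)]
partials = [initial(t) for t in data]                      # map
combined = [intermed(partials[:2]), intermed(partials[2:])]  # combiner
print(final(combined))  # 4, same as len(data)
```

Because `intermed` can be applied to any subset of partials in any order, the combiner is free to run (or not) without changing the result.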
  35. User Defined Functions
      public class MyUDF extends EvalFunc<DataBag> implements Accumulator {
          …
      }
      • Accumulator functions • Incremental processing of data • Called in both map and reduce phases
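The Accumulator contract can likewise be sketched in Python terms: Pig hands the UDF successive batches of tuples rather than the whole bag at once, then asks for the result, keeping memory bounded. The class below is an analogy, not Pig's Java interface (which defines `accumulate`, `getValue`, and `cleanup`).

```python
# Analogy for an accumulator-style UDF: a running count that never
# needs to hold the full bag of tuples in memory.
class LongCount:
    def __init__(self):
        self.count = 0

    def accumulate(self, batch):
        """Called repeatedly, once per chunk of tuples."""
        self.count += len(batch)

    def get_value(self):
        """Called once, after all chunks have been accumulated."""
        return self.count

acc = LongCount()
acc.accumulate([("a",), ("b",)])   # first chunk
acc.accumulate([("c",)])           # second chunk
print(acc.get_value())  # 3
```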
  36. User Defined Functions
      public class MyUDF extends FilterFunc {
          …
      }
      • Filter functions • Return a boolean based on processing of the tuple • Called in both map and reduce phases
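A filter function is the simplest of the three: a per-tuple predicate, like the `myudfs.NotABot(user)` call in the earlier example. Below is a Python analogy; the bot list and field layout are invented for the sketch.

```python
# Analogy for a filter-style UDF: a boolean predicate applied per tuple,
# as FILTER ... BY myudfs.NotABot(user) does in the earlier Pig example.
KNOWN_BOTS = {"crawler", "spider"}  # invented list for illustration

def not_a_bot(user):
    """Return True if the tuple's user field is not a known bot."""
    return user not in KNOWN_BOTS

rows = [("crawler", "x.com"), ("ann", "a.com")]
kept = [r for r in rows if not_a_bot(r[0])]
print(kept)  # only the human user's row survives
```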
  37. Questions & Answers
