Pig Hands On November


Published in: Technology, Sports

  1. 1. Hands-on Pig with the NFL Play by Play Dataset
     Ryan Bosshart | Systems Engineer
     Nov 2013 v1
  2. 2. Outline
     • What is Pig
     • Pig Latin by Example
     • Data Model/Architecture
     • Hands-on with Pig
  3. 3. What is Pig?
     Give me every run in the 2010 season:
     SQL:
         SELECT * FROM playbyplay WHERE playtype = "RUN" AND year = 2010;
     Pig Latin:
         playbyplay = LOAD 'playbyplay' ….;
         run_plays = FILTER playbyplay BY (playtype == 'RUN') AND (year == 2010);
         DUMP run_plays;
  4. 4. Components
     • Pig resides on the user's machine
     • Jobs are submitted to the cluster and executed on the cluster
     • No need to install anything extra on the cluster
     [Diagram: Pig input on the client machine, submitted to the Hadoop cluster]
     ©2011 Cloudera, Inc. All Rights Reserved.
  5. 5. Accessing Pig
     • Grunt, the Pig shell
     • Submit a script directly
     • PigServer Java class, a JDBC-like interface
     • Hue
       • Allows textual & graphical scripting
  6. 6. Example
  7. 7. How Pig Works
     Pig Latin: Count Job
         A = LOAD 'myfile' AS (x, y, z);
         B = FILTER A BY x > 0;
         C = GROUP B BY x;
         D = FOREACH C GENERATE group, COUNT(B);
         STORE D INTO 'output';
     What Pig does with the script:
     • Parses
     • Checks
     • Optimizes
     • Plans execution
     • Submits jar to Hadoop
     • Monitors job progress
     Execution Plan
     • Map: Filter
     • Reduce: Counter
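The execution plan on this slide (filter in the map phase, count in the reduce phase) can be modelled in a few lines of plain Python. This is only a sketch of the idea, not how Pig is implemented: the sample rows and the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for 'myfile': (x, y, z)
rows = [(1, "a", 0.5), (2, "b", 1.5), (1, "c", 2.5), (-3, "d", 3.5), (0, "e", 4.5)]


def map_phase(records):
    """Map: apply the filter (x > 0) and emit (key, record) pairs keyed by x."""
    for record in records:
        x = record[0]
        if x > 0:
            yield x, record


def shuffle(pairs):
    """Shuffle: collect all records that share a key between map and reduce."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_phase(groups):
    """Reduce: count the records in each group."""
    return {key: len(records) for key, records in groups.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {1: 2, 2: 1}
```

Rows with `x <= 0` never leave the map phase, so the reducers only ever see the data they have to count.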
  8. 8. Starting Grunt
     • To start the Pig shell (Grunt), open a terminal and run
         $ pig
       or, to run against the local filesystem instead of the cluster:
         $ pig -x local
     • You should see a prompt like:
         grunt>
  9. 9. Data Types
     • Scalar types
       • Int
       • Long
       • Float
       • Double
       • Chararray
       • Bytearray
     • Complex types
       • Map: associative array
       • Tuple: ordered list of data; elements may be of any scalar or complex type
       • Bag: unordered collection of tuples
  10. 10. Load Returns a Bag
     • LOAD statements return a bag of tuples. Each tuple has multiple elements, which can be referenced by position or by name.
         arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year:int, team:chararray, player:chararray);
     • A set of tuples is referred to as a bag (normally unordered)
         (2002,KC,Willie Roaf)
         (2002,OAK,Darrell Russell)
         (2002,NYJ,Aaron Beasley)
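As a rough mental model of the tuple and bag described on this slide, the same three records can be written in plain Python. The `Arrest` type is an illustrative name only, and a Python list is an imperfect stand-in because a real Pig bag has no guaranteed order.

```python
from collections import namedtuple

# A tuple with named fields, mirroring AS (year:int, team:chararray, player:chararray)
Arrest = namedtuple("Arrest", ["year", "team", "player"])

# A bag is modelled here as a list of tuples; a real Pig bag is unordered
arrests = [
    Arrest(2002, "KC", "Willie Roaf"),
    Arrest(2002, "OAK", "Darrell Russell"),
    Arrest(2002, "NYJ", "Aaron Beasley"),
]

first = arrests[0]
print(first[2])      # by position, like $2 in Pig
print(first.player)  # by name, like player in Pig
```

Both `print` calls read the same element, which is the point of the slide: position and name are two ways of addressing one field.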
  11. 11. Bags & FOREACH
     • The FOREACH…GENERATE statement iterates over the members of a bag
         Players = FOREACH arrests GENERATE player;
     • The result of a FOREACH is another bag
     • Elements are named as in the input bag
  12. 12. Positional Reference
     • The following creates identical output data
         Players = FOREACH arrests GENERATE $2;
     • …But the elements of arrests aren't named "player", unless you do this:
         Players = FOREACH arrests GENERATE $2 AS player;
  13. 13. Grouping
     • In Pig, grouping is a separate operation from applying aggregate functions
     • The output of the GROUP statement is (key, bag), where key is the group key and bag contains a tuple for every record with that key
         arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year:int, team:chararray, player:chararray);
         arrests_by_team = GROUP arrests BY team;
  14. 14. Grouping & Types
     • GROUP BY makes an output bag containing tuples, each of which contains another bag
         Gprd = GROUP arrests BY team;
     • In: BagOf(year, team, player)
     • Out: BagOf(group, BagOf(year, team, player), named arrests)
     • The grouping item is always named "group"
  15. 15. GROUP arrests BY team;
     Input (arrests):
         (2010,TEN,Derrick Morgan)
         (2010,TEN,Vince Young)
         (2010,TEN,Kenny Britt)
         (2010,WAS,Fred Davis)
         (2010,WAS,Albert Haynesworth)
         (2010,WAS,Fred Davis)
         (2010,WAS,Fred Davis)
         (2010,WAS,Joe Joseph)
     Output (group, arrests):
         (TEN, {(2010,TEN,Derrick Morgan), (2010,TEN,Vince Young), (2010,TEN,Kenny Britt)})
         (WAS, {(2010,WAS,Fred Davis), (2010,WAS,Albert Haynesworth), (2010,WAS,Fred Davis), (2010,WAS,Fred Davis), (2010,WAS,Joe Joseph)})
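The shape of the GROUP output shown on this slide can be reproduced with a small Python sketch. The helper `group_by` is a hypothetical name, not a Pig API, and the input is cut down to four records to keep the example short.

```python
from collections import defaultdict

arrests = [
    (2010, "TEN", "Derrick Morgan"),
    (2010, "TEN", "Vince Young"),
    (2010, "WAS", "Fred Davis"),
    (2010, "WAS", "Albert Haynesworth"),
]


def group_by(bag, key_index):
    """Return a list of (group, bag) tuples, one per distinct key value."""
    groups = defaultdict(list)
    for record in bag:
        # The whole record is kept, so the key field also appears inside the inner bag
        groups[record[key_index]].append(record)
    return list(groups.items())


arrests_by_team = group_by(arrests, 1)  # field 1 is team
for group, members in arrests_by_team:
    print(group, members)
```

Note that nothing has been counted yet: grouping only rearranges the records, and any aggregate is applied in a later step.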
  16. 16. Counting Arrests by Team
         num_arrests = FOREACH arrests_by_team GENERATE group AS team, COUNT(arrests) AS total;
     Input:
         (TEN, {(2010,TEN,Derrick Morgan), (2010,TEN,Vince Young), (2010,TEN,Kenny Britt)})
         (WAS, {(2010,WAS,Fred Davis), (2010,WAS,Albert Haynesworth), (2010,WAS,Fred Davis), (2010,WAS,Fred Davis), (2010,WAS,Joe Joseph)})
     Results:
         (SEA,20)
         (STL,9)
         (TEN,31)
         (WAS,16)
  17. 17. Using Types
     • By default Pig treats data as untyped
     • The user can declare the types of the data at load time
         arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year:int, team:chararray, player:chararray);
     • If a data type is not declared but the script treats a value as a certain type, Pig will assume it is of that type and cast it
         arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year, team, player);
         Two_digit_year = FOREACH arrests GENERATE year - 2000; -- cast to int
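The difference between declaring types at load time and letting a cast happen at the point of use can be illustrated in plain Python. This is an analogy rather than Pig's actual mechanism, and the two CSV rows are invented stand-ins for the contents of arrests.csv.

```python
import csv
import io

# Hypothetical contents of arrests.csv
raw = "2002,KC,Willie Roaf\n2010,TEN,Vince Young\n"

# Without declared types every field arrives as raw text (untyped in Pig)
untyped = [tuple(row) for row in csv.reader(io.StringIO(raw))]

# With declared types the conversion happens once, at load time
typed = [(int(year), team, player) for year, team, player in untyped]

# Using an untyped field in arithmetic needs a cast at the point of use,
# which is the step Pig adds for you in `year - 2000`
two_digit_year = [int(year) - 2000 for year, _, _ in untyped]
print(two_digit_year)  # [2, 10]
```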
  18. 18. Ordering
     • Sort the teams by total number of arrests
         ordered_arrest_count = ORDER num_arrests BY total;
  19. 19. Filtering
     • Now let's apply a filter to the teams so that we only get the baddest of the bad
         subset = FILTER ordered_arrest_count BY (total > 20);
     Results:
         (KC,21)
         (SD,21)
         (CHI,23)
         (IND,23)
         (MIA,24)
         (TB,25)
         (JAC,25)
         (DEN,30)
         (TEN,31)
         (CIN,32)
         (MIN,38)
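The count, order and filter steps from the last few slides can be traced end to end in a short Python sketch. The team list is a tiny made-up sample, so the threshold is lowered from 20 to 2 to leave something in the result.

```python
from collections import Counter

# Hypothetical, tiny stand-in for the team column of arrests.csv
teams = ["TEN"] * 3 + ["WAS"] * 5 + ["SEA"] * 2

# GROUP ... BY team followed by COUNT: one (team, total) tuple per team
num_arrests = list(Counter(teams).items())

# ORDER ... BY total: ascending, which is also Pig's default
ordered = sorted(num_arrests, key=lambda pair: pair[1])

# FILTER ... BY (total > 2); the slide uses 20 on the full dataset
subset = [(team, total) for team, total in ordered if total > 2]
print(subset)  # [('TEN', 3), ('WAS', 5)]
```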
  20. 20. Loading Data
     • Welcome to the Pig data loader
     • PigStorage: loads/stores relations using field-delimited text format.
     • BinStorage: loads/stores relations from or to binary files
     • BinaryStorage: loads/stores relations containing only single-field tuples with a value of type bytearray
     • TextLoader: loads relations from a plain-text format
     • PigDump: stores relations by writing the toString() representation of tuples, one per line.
  21. 21. Sharing Metadata
     • Use HCatalog
         $ pig -useHCatalog
         grunt> playbyplay = LOAD 'playbyplay' USING org.apache.hcatalog.pig.HCatLoader();
         grunt> STORE newdata INTO 'newtable' USING org.apache.hcatalog.pig.HCatStorer();
     • *need to upload some jars to enable this:
  22. 22. User Defined Functions
     • Pig provides two statements: Register & Define
       • Register: registers a JAR file with the Pig runtime
       • Define: creates an alias for a UDF or streaming script
     • Can be used to do column transformation, filtering, ordering, custom aggregation.
     • For example, you want to write custom logic to do user session analysis:
         log = LOAD 'excite-small.log' AS (user, time, query);
         grpd = GROUP log BY user;
         cntd = FOREACH grpd GENERATE group, SessionAnalysis(log);
         STORE cntd INTO 'output';
  23. 23. Try it!
         $ pig
         grunt> arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year,team,player);
         grunt> grouped_arrests = GROUP arrests BY team;
         grunt> num_arrests = FOREACH grouped_arrests GENERATE group AS team, COUNT(arrests) AS total;
         grunt> ordered_arrests = ORDER num_arrests BY total;
         grunt> bad_boys = FILTER ordered_arrests BY (total>20);