Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Effective Hive Queries


Published on

A session from Qubole Best Practice Webinar Series- “Big Data Secrets from the Pros”. Covers how to make Apache Hive queries run faster by
a. Better layout of data on HDFS via partitioning and bucketing
b. Designing test queries by using block and bucket sampling before running the queries on large datasets
c. Using bucket map joins and parallel processing to run queries faster

Visit for more information.

Published in: Technology, Business

Effective Hive Queries

  1. 1. Effec%ve  Hive  Queries                                Secrets  From  the  Pros   We  will  be  star,ng  at  11:03  PDT   Use  the  Chat  Pane  in  GoToWebinar  to  Ask  Ques%ons!   Assess  your  level  and  learn  new  stuff   This  webinar  is  intended  for  intermediate  audiences   (familiar  with  Apache  Hive  and  Hadoop,  but  not  experts)   ?  
  2. 2. AGENDA   This  Webinar  provides  %ps  on  improving  the  performance  and   beJer  u%lizing  resources  using  the  following  best  prac%ces:   •  Data  Layout  (Par%%ons  and  Buckets)   •  Data  Sampling  (Bucket  and  Block  sampling)   •  Data  Processing  (Bucket  Map  Join  and  Parallel   execu%on)  
  3. 3. Dataset  Used   #  of  records:    276M  records   Columns:   I%nerary  ID   Year  &Quarter  of  Travel   Trip  Origin  City  &  State   Trip  Des%na%on  City  &  State   Distance  between  Origin  &  Des%na%on   Airline  Bookings  All   Includes  stops  at  intermediate  ci%es   #  of  records:    116M  records   Columns:   I%nerary  ID   Year  &Quarter  of  Travel   Trip  Origin  City  &  State   Trip  Des%na%on  City  &  State   Distance  between  Origin  &  Des%na%on   Airline  Bookings  Origin  Only   Only  first  leg  of  travel   #  of  records:    50   Columns:   State  code  &  Name   Popula%on   Census   Human  popula%on  by  US  State  
  4. 4. #1  -­‐  Data  Par%%oning     •  Problem  PaJern   –  Query  a  subset  of  data  in  a  table   –  Subset  iden%fied  by  “Column_Name  =  X”  filter   •  Solu%on  paJern   –  Layout  data  in  sub-­‐directories  with  each  directory  associated   with  a  value  of  the  par%%on  column   –  The  filter  on  par%%on  column  just  picks  a  single  sub  directory   •  Approach   –  Use  PARTITION  BY  clause   •  Benefit   –  Par%%on  pruning   –  2.7x  faster  on  a  query  on  Airline  Bookings  Dataset  (29  seconds)  
  5. 5. #1  -­‐  Data  Par%%oning   Airline  Bookings  All  Table   Origin  State  (Par%%on   Column  /  Sub-­‐directory)   CA   WY  AL   File1001.dat   File1002.dat   File100n.dat   File3001.dat   File3002.dat   File300n.dat   Filex001.dat   Filex002.dat   Filex00n.dat   Files  inside  the   par%%on   SELECT  origin_city,  origin_state   FROM  Airline_Bookings_All   WHERE  origin_state  =  ‘CA’   CREATE  TABLE  Airline_Bookings_All   ….   PARTITIONED  BY  (origin_state  STRING)  
  6. 6. #2  -­‐  Data  Bucke%ng   •  Problem  PaJern   –  Join  data  in  two  large  tables  efficiently   –  Sample  data  inside  a  table  efficiently   •  Solu%on  paJern   –  More  efficient  processing  by  storing  data  in  hash  buckets   •  Approach   •  Use  bucke%ng  using  CLUSTERED  BY  ..  INTO  n  BUCKETS   •  Benefit   –  Bucket  Map  Join   –  Bucket  Sampling  
  7. 7. #2  –  Data  Bucke%ng   CREATE  TABLE  Airline_Bookings_All   …   CLUSTERED  BY  (i%nid)  INTO  64  BUCKETS   set  hive.enforce.bucke%ng  =  true;   INSERT  OVERWRITE  TABLE  Airline_Bookings_All   SELECT  …   FROM  ..   Ailrine_Bookings_All   File00.dat   File63.dat   File01.dat   Each  File  contains  all   the  rows  that   correspond  to  the   same  hash  of  i%nid   column  
  8. 8. #2  -­‐  Data  Bucke%ng   a   File1001.dat   File1002.dat   File100n.dat   Filex001.dat   Filex002.dat   Filex00m.dat   Files  containing  table   data  bucketed  on  a   column   b   set  hive.op%mize.bucketmapjoin  =  true;     SELECT  /*+  MAPJOIN(a,  b)  */  a.*,  b.*   FROM  Airline_Bookings_All  a  JOIN  Airline_Bookings_Origin_Only  b   ON  a.i%nid  =  b.i%nid   Note:     1.  Both  the  tables  are  bucketed  on  i%nid  column   2.  The  numbers  of  buckets  in  the  two  tables  are  a  strict  mul%ple  of  each  other  
  9. 9. #3  -­‐  Bucket  Sampling   •  Problem  PaJern   –  Work  on  joinable  samples  of  data  from  different  tables   •  Solu%on  paJern   –  Use  Bucket  Sampling   •  Approach   •  TABLESAMPLE  (BUCKET  x  OUT  OF  Y  ON  column)   •  Benefit   –  Useful  while  working  with  sample  data  and  joins  
  10. 10. #3  -­‐  Bucket  Sampling   Filex002.dat   Filex030.dat   Filex064.dat   Files  containing  bookings  data   bucketed  on  i%nid   a   SELECT  a.*,  b.*   FROM  Airline_Bookings_All  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  a                      ,  Airline_Bookings_Origin_Only  TABLESAMPLE(bucket  30  out  of  64  on  i%nid)  b   WHERE  a.i%nid  =  b.i%nid   Filex001.dat   Filex063.dat   Filey002.dat   Filey030.dat   Filey064.dat   b   Filey001.dat   Filey063.dat  
  11. 11. #4  –  Block  Sampling   •  Problem  PaJern   –  View  a  sample  of  a  data  with  in  a  table   –  Sample  size  expressed  as  number  of  rows,  %age  of  data,  or   number  of  MBs   •  Solu%on  paJern   –  Use  Block  sampling   •  Approach   –  Use  TABLESAMPLE  (n%,  nM,  or  n  ROWS)   •  Benefit   –  Geyng  a  random  sample  from  the  table   –  More  op%ons  to  specify  how  many  samples  to  generate  
  12. 12. #5  –  Parallel  Execu%on   SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM ( SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8 )a JOIN ( SELECT itinid, origin, originstate FROM air_travel_origins_8 )B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY a.year, a.quarter, a.origin, a.originstate; Stage  1   Stage  2   Stage  3   Stage  1   Stage  2   Stage  3   Stage  1   Stage  2   Stage  3   set  hive.exec.parallel  =  false;   set  hive.exec.parallel  =  true;  
  13. 13. Summary   •  Iterate  quickly  on  Query  Design   – Use  Bucket  and  Block  Sampling   •  Run  queries  faster   – Par%%oning  to  invoke  Par%%on  Pruning   – Bucke%ng  to  invoke  Bucket  Map  Joins   – Execute  complex  queries  in  parallel  
  14. 14. THANK  YOU   Managed  Cluster   Built-­‐In  Connectors   Friendly  User-­‐Interface   Dedicated  Support   •  100%  Managed  Hadoop  Cluster  in  the  Cloud   •  Auto-­‐Scaling  Cluster.  Full  Life-­‐cycle  Management   •  +12  Connectors  to  Applica%ons  and  Data  Sources   •  14-­‐Day  Free  Trial  (free  account  available)   •  24/7  Customer  Support   What’s  Included?   è  ç