Hive Anatomy



Data Infrastructure Team, Facebook
Part of Apache Hadoop Hive Project
Overview

•  Conceptual-level architecture
•  (Pseudo-)code-level architecture
      •  Parser
      •  Semantic analyzer
      •  Execution
•  Example: adding a new Semijoin operator

Conceptual Level Architecture

•  Hive Components:
   –  Parser (antlr): HiveQL → Abstract Syntax Tree (AST)
   –  Semantic Analyzer: AST → DAG of MapReduce Tasks
      •  Logical Plan Generator: AST → operator trees
      •  Optimizer (logical rewrite): operator trees → operator trees
      •  Physical Plan Generator: operator trees → MapReduce Tasks
   –  Execution Libraries:
      •  Operator implementations, UDF/UDAF/UDTF
      •  SerDe & ObjectInspector, Metastore
      •  FileFormat & RecordReader

Hive User/Application Interfaces

[Diagram: the Hive shell script and HiPal* sit on top of CliDriver; ODBC and
the JDBCDriver (inheriting Driver) sit on top of HiveServer; both paths go
through Driver.]

*HiPal is a Web-based Hive client developed internally at Facebook.
Parser

[Diagram: Driver → ParseDriver]

•  ANTLR is a parser generator.
•  $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
•  Hive.g defines keywords, tokens, and translations from HiveQL to AST
   (ASTNode.java)
•  Every time Hive.g is changed, you need to ‘ant clean’ first and rebuild
   using ‘ant package’

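As a toy illustration of what the parser stage produces, here is a hand-rolled stand-in for the AST: an n-ary tree of tokens rendered in the LISP-ish style of ASTNode.dump(). The token names follow Hive's TOK_* convention, but the class itself is illustrative, not Hive's ASTNode.

```java
// Minimal stand-in for the AST the ANTLR-generated parser emits.
import java.util.ArrayList;
import java.util.List;

public class AstSketch {
    static class Node {
        final String token;               // e.g. "TOK_QUERY", "TOK_SELECT"
        final List<Node> children = new ArrayList<>();
        Node(String token) { this.token = token; }
        Node add(Node child) { children.add(child); return this; }
        // Render the tree in the style of ASTNode.dump()
        String dump() {
            if (children.isEmpty()) return token;
            StringBuilder sb = new StringBuilder("(").append(token);
            for (Node c : children) sb.append(' ').append(c.dump());
            return sb.append(')').toString();
        }
    }

    // A hand-built AST roughly matching "SELECT key FROM src"
    static Node sampleQuery() {
        return new Node("TOK_QUERY")
            .add(new Node("TOK_FROM").add(new Node("TOK_TABREF").add(new Node("src"))))
            .add(new Node("TOK_INSERT")
                .add(new Node("TOK_SELECT")
                    .add(new Node("TOK_SELEXPR").add(new Node("key")))));
    }

    public static void main(String[] args) {
        System.out.println(sampleQuery().dump());
        // (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_SELECT (TOK_SELEXPR key))))
    }
}
```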
Semantic Analyzer

[Diagram: Driver → BaseSemanticAnalyzer]

•  BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and
   SemanticAnalyzer
   –  SemanticAnalyzer handles queries, DML, and some DDL (create-table)
   –  DDLSemanticAnalyzer handles alter table etc.

Logical Plan Generation

•  SemanticAnalyzer.analyzeInternal() is the main function
   –  doPhase1(): recursively traverses the AST, checks for semantic errors,
      and gathers metadata, which is put in QB and QBParseInfo.
   –  getMetaData(): queries the metastore, gets metadata for the data
      sources, and puts it into QB and QBParseInfo.
   –  genPlan(): takes the QB/QBParseInfo and the AST and generates an
      operator tree.

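As a toy illustration of the three-phase shape of analyzeInternal(), here is a sketch in which QB, the "metastore" lookup, and the operator list are simplified stand-ins, not Hive's actual classes:

```java
// Hedged sketch of doPhase1 / getMetaData / genPlan; all names and data
// structures here are illustrative stand-ins for Hive's QB/QBParseInfo.
import java.util.*;

public class AnalyzeSketch {
    static class QB {                               // stand-in for QB/QBParseInfo
        List<String> sourceTables = new ArrayList<>();
        Map<String, List<String>> tableSchemas = new HashMap<>();
        List<String> operatorTree = new ArrayList<>();
    }

    static void doPhase1(List<String> ast, QB qb) { // walk AST, gather table names
        for (String tok : ast)
            if (tok.startsWith("TOK_TABREF:")) qb.sourceTables.add(tok.substring(11));
    }

    static void getMetaData(QB qb) {                // "metastore": fixed toy schemas
        for (String t : qb.sourceTables)
            qb.tableSchemas.put(t, Arrays.asList("key", "value"));
    }

    static void genPlan(QB qb) {                    // emit operators bottom-up
        for (String t : qb.sourceTables) qb.operatorTree.add("TableScan(" + t + ")");
        qb.operatorTree.add("Select");
        qb.operatorTree.add("FileSink");
    }

    public static void main(String[] args) {
        QB qb = new QB();
        doPhase1(Arrays.asList("TOK_QUERY", "TOK_TABREF:src", "TOK_SELECT"), qb);
        getMetaData(qb);
        genPlan(qb);
        System.out.println(qb.operatorTree);  // [TableScan(src), Select, FileSink]
    }
}
```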
Logical Plan Generation (cont.)

•  genPlan() is recursively called for each subquery (QB), and outputs the
   root of the operator tree.
•  For each subquery, genPlan() creates operators “bottom-up”*, starting from
   FROM → WHERE → GROUP BY → ORDER BY → SELECT.
•  For the FROM clause, it generates a TableScanOperator for each source
   table, then calls genLateralView() and genJoinPlan().

   *Hive code actually names each leaf operator as “root” and its downstream operators as children.
Logical Plan Generation (cont.)

•  genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT
   clauses.
   –  genFilterPlan() for the WHERE clause
   –  genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation
   –  genGroupByPlan1MR/2MR() for reduce-side aggregation
   –  genSelectPlan() for the SELECT clause
   –  genReduceSink() for marking the boundary between map/reduce phases
   –  genFileSink() to store intermediate results

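Why aggregation is split into a map-side partial phase and a reduce-side final phase can be sketched with a toy count(*): partial sums commute with the shuffle, so each mapper can pre-aggregate and the reducer merges. (The method names below are illustrative, not Hive's.)

```java
// Toy illustration of map-side partial aggregation + reduce-side merge.
import java.util.*;

public class GroupBySketch {
    // Map-side partial aggregation over one mapper's rows
    static Map<String, Long> partial(List<String> keys) {
        Map<String, Long> acc = new HashMap<>();
        for (String k : keys) acc.merge(k, 1L, Long::sum);
        return acc;
    }

    // Reduce-side merge of the shuffled partials
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> out = new HashMap<>();
        for (Map<String, Long> p : partials)
            p.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = merge(Arrays.asList(
            partial(Arrays.asList("a", "b", "a")),   // mapper 1
            partial(Arrays.asList("b", "a"))));      // mapper 2
        System.out.println(counts.get("a") + " " + counts.get("b")); // 3 2
    }
}
```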
Optimizer

•  The resulting operator tree, along with other parsing info, is stored in
   ParseContext and passed to the Optimizer.
•  The Optimizer is a set of transformation rules on the operator tree.
•  The transformation rules are specified by a regexp pattern on the tree and
   a Worker/Dispatcher framework.

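The "regexp pattern plus Worker/Dispatcher" idea can be sketched as follows: each rewrite rule is a regexp over the path of operator names leading to the current node, and the dispatcher fires the worker of the first matching rule. The names and the `%`-joined path encoding here are illustrative, not Hive's actual Rule/GraphWalker classes.

```java
// Minimal sketch of rule dispatch by regexp over an operator-name path.
import java.util.*;
import java.util.regex.Pattern;

public class DispatcherSketch {
    interface Worker { String apply(String op); }

    static String dispatch(List<String> pathOfOps, Map<Pattern, Worker> rules) {
        String path = String.join("%", pathOfOps);   // e.g. "TS%FIL"
        for (Map.Entry<Pattern, Worker> r : rules.entrySet())
            if (r.getKey().matcher(path).find())     // first matching rule wins
                return r.getValue().apply(pathOfOps.get(pathOfOps.size() - 1));
        return "no-op";
    }

    public static void main(String[] args) {
        Map<Pattern, Worker> rules = new LinkedHashMap<>();
        // "a filter somewhere under a table scan": candidate for pushdown
        rules.put(Pattern.compile("TS%.*FIL"), op -> "push-down " + op);
        System.out.println(dispatch(Arrays.asList("TS", "FIL"), rules)); // push-down FIL
    }
}
```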
Optimizer (cont.)

•  Current rewrite rules include:
   –  ColumnPruner
   –  PredicatePushDown
   –  PartitionPruner
   –  GroupByOptimizer
   –  SamplePruner
   –  MapJoinProcessor
   –  UnionProcessor
   –  JoinReorder

Physical Plan Generation

•  genMapRedWorks() takes the QB/QBParseInfo and the operator tree and
   generates a DAG of MapReduce tasks.
•  The generation is also based on the Worker/Dispatcher framework while
   traversing the operator tree.
•  Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask,
   DDLTask, CounterTask
•  validate() on the physical plan is called at the end of Driver.compile().

Preparing Execution

•  Driver.execute() takes the output from Driver.compile() and prepares the
   hadoop command line (in local mode) or calls ExecDriver.execute() (in
   remote mode).
   –  Start a session
   –  Execute PreExecutionHooks
   –  Create a Runnable for each Task that can be executed in parallel and
      launch threads within a certain limit
   –  Monitor thread status and update the session
   –  Execute PostExecutionHooks

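The "Runnable per parallel Task, launched within a certain limit" step can be sketched with a bounded thread pool. Hive's Driver manages raw Threads itself; the fixed-size pool and the task names below are stand-ins.

```java
// Sketch: run independent tasks concurrently, at most `limit` at a time,
// then collect results in submission order (the "monitor" step).
import java.util.*;
import java.util.concurrent.*;

public class TaskLaunchSketch {
    public static List<String> runTasks(List<String> runnableTasks, int limit) {
        ExecutorService pool = Executors.newFixedThreadPool(limit);
        List<Future<String>> futures = new ArrayList<>();
        for (String task : runnableTasks)
            futures.add(pool.submit(() -> task + ":done")); // stand-in for Task.execute()
        List<String> results = new ArrayList<>();
        try {
            for (Future<String> f : futures) results.add(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runTasks(Arrays.asList("Stage-1", "Stage-2", "Stage-3"), 2));
        // [Stage-1:done, Stage-2:done, Stage-3:done]
    }
}
```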
Preparing Execution (cont.)

•  Hadoop jobs are started from MapRedTask.execute().
   –  Get info on all needed JAR files, with ExecDriver as the starting class
   –  Serialize the physical plan (MapRedTask) to an XML file
   –  Gather other info such as the Hadoop version and prepare the hadoop
      command line
   –  Execute the hadoop command line in a separate process.

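XML plan serialization in the JavaBeans style can be sketched with the JDK's `java.beans.XMLEncoder`/`XMLDecoder`, which is also why plan descriptors need public getters & setters (see the Operator slide). The `FilterDesc` class here is a toy stand-in, not Hive's.

```java
// Round-trip a toy plan descriptor through JavaBeans XML serialization.
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class PlanSerSketch {
    public static class FilterDesc {              // toy stand-in for a plan node
        private String predicate = "";
        public String getPredicate() { return predicate; }
        public void setPredicate(String p) { predicate = p; }
    }

    public static FilterDesc roundTrip(FilterDesc in) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(bos)) { enc.writeObject(in); }
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(bos.toByteArray()))) {
            return (FilterDesc) dec.readObject();
        }
    }

    public static void main(String[] args) {
        FilterDesc d = new FilterDesc();
        d.setPredicate("key > 100");
        System.out.println(roundTrip(d).getPredicate());  // key > 100
    }
}
```

Fields without public accessors (or without a public no-arg constructor on the bean) would simply be dropped by the encoder, which is the failure mode the Operator slide warns about.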
Starting Hadoop Jobs

•  ExecDriver deserializes the plan from the XML file and calls execute().
•  execute() sets up the number of reducers, the job scratch dir, the
   starting mapper class (ExecMapper) and starting reducer class
   (ExecReducer), and other info in JobConf, and submits the job through
   hadoop.mapred.JobClient.
•  The query plan is again serialized into a file and put into
   DistributedCache to be sent out to mappers/reducers before the job is
   started.

Operator

•  ExecMapper creates a MapOperator as the parent of all root operators in
   the query plan and starts executing on the operator tree.
•  Each Operator class comes with a descriptor class, which contains metadata
   passed from compilation to execution.
   –  Any metadata variable that needs to be passed should have a public
      setter & getter in order for the XML serializer/deserializer to work.
•  The operator’s interface contains:
   –  initialize(): called once per operator lifetime
   –  startGroup()*: called once for each group (groupby/join)
   –  process()*: called once for each input row
   –  endGroup()*: called once for each group (groupby/join)
   –  close(): called once per operator lifetime

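The lifecycle above can be sketched with a simplified interface and a toy per-group count(*) operator; the interface is a stand-in for Hive's Operator base class, not its real signature.

```java
// Sketch of the operator lifecycle: initialize() once, then
// startGroup()/process()/endGroup() per group, then close() once.
import java.util.*;

public class OperatorSketch {
    interface Op {
        void initialize();
        void startGroup();
        void process(Object row);
        void endGroup();
        void close();
    }

    // A toy count(*) per group
    static class CountOp implements Op {
        long current;
        final List<Long> groupCounts = new ArrayList<>();
        public void initialize() { groupCounts.clear(); }
        public void startGroup() { current = 0; }
        public void process(Object row) { current++; }
        public void endGroup() { groupCounts.add(current); }
        public void close() { }
    }

    public static void main(String[] args) {
        CountOp op = new CountOp();
        op.initialize();
        for (List<Object> group : Arrays.asList(
                Arrays.<Object>asList("r1", "r2"), Arrays.<Object>asList("r3"))) {
            op.startGroup();
            for (Object row : group) op.process(row);
            op.endGroup();
        }
        op.close();
        System.out.println(op.groupCounts);  // [2, 1]
    }
}
```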
Example: Adding a Semijoin Operator

•  Left semijoin is similar to inner join, except that only the left-hand-side
   table is output, and there are no duplicated join values if there are
   duplicated keys in the RHS table.
   –  IN/EXISTS semantics
   –  SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE
   –  Output all columns in table S if its (key, value) matches at least one
      (key, value) pair in T

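The left-semi-join semantics above can be captured in a few lines: output a row of S iff its key matches at least one key in T, never duplicating S rows for duplicate T keys. A hash-set membership test (a simplification, not Hive's CommonJoinOperator) gives both properties, including the early-exit flavor — one match suffices.

```java
// Left semi join over bare keys: membership test against the RHS key set.
import java.util.*;

public class SemiJoinSketch {
    static List<String> leftSemiJoin(List<String> sKeys, List<String> tKeys) {
        Set<String> rhs = new HashSet<>(tKeys);     // duplicates in T collapse here
        List<String> out = new ArrayList<>();
        for (String k : sKeys)
            if (rhs.contains(k)) out.add(k);        // one match suffices; emit S row
        return out;
    }

    public static void main(String[] args) {
        // T has duplicate key "b"; S's "b" is still output exactly once
        System.out.println(leftSemiJoin(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("b", "b", "d")));  // [b]
    }
}
```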
Semijoin Implementation

•  Parser: add the SEMI keyword
•  SemanticAnalyzer:
   –  doPhase1(): keep a mapping of the RHS table name and its join key
      columns in QBParseInfo.
   –  genJoinTree(): set the new join type in joinDesc.joinCond
   –  genJoinPlan():
      •  Generate a map-side partial group-by operator right after the
         TableScanOperator for the RHS table. The input & output columns of
         the group-by operator are the RHS join keys.

Semijoin Implementation (cont.)

•  SemanticAnalyzer
   –  genJoinOperator(): generate a JoinOperator (left-semi type) and set the
      output fields to the LHS table’s fields
•  Execution
   –  In CommonJoinOperator, implement left semi join with an early-exit
      optimization: as soon as the RHS table of the left semi join is
      non-null, return the row from the LHS table.

Debugging

•  Debugging compile-time code (Driver till ExecDriver) is relatively easy,
   since it runs in the JVM on the client side.
•  Debugging execution-time code (after ExecDriver calls hadoop) needs some
   configuration. See the wiki:
   http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code

Questions?

Alan Dix
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

Hive Anatomy

  • 1. Hive Anatomy Data Infrastructure Team, Facebook Part of Apache Hadoop Hive Project
  • 2. Overview
     – Conceptual-level architecture
     – (Pseudo-)code-level architecture
        • Parser
        • Semantic analyzer
        • Execution
     – Example: adding a new Semijoin Operator
  • 3. Conceptual-Level Architecture
     – Hive components:
        • Parser (ANTLR): HiveQL → Abstract Syntax Tree (AST)
        • Semantic Analyzer: AST → DAG of MapReduce Tasks
           – Logical Plan Generator: AST → operator trees
           – Optimizer (logical rewrite): operator trees → operator trees
           – Physical Plan Generator: operator trees → MapReduce Tasks
        • Execution Libraries:
           – Operator implementations, UDF/UDAF/UDTF
           – SerDe & ObjectInspector, Metastore
           – FileFormat & RecordReader
  • 4.
  • 5. Hive User/Application Interfaces
     – (Diagram) Driver is reached through several front ends:
        • CliDriver: used by the Hive shell script and by HiPal*
        • HiveServer: used by ODBC clients
        • JDBCDriver (inheriting Driver)
     *HiPal is a Web-based Hive client developed internally at Facebook.
  • 6. Parser
     – ANTLR is a parser generator.
     – $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
     – Hive.g defines keywords, tokens, and translations from HiveQL to AST (ASTNode.java)
     – Every time Hive.g is changed, you need to ‘ant clean’ first and rebuild using ‘ant package’
     – (Diagram: the Driver invokes ParseDriver)
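For orientation, a rule in Hive.g looks roughly like the following. This is a schematic ANTLR fragment, not copied verbatim from Hive.g; the token and rule names are illustrative, and the real grammar is far larger. The rewrite arrow (`->`) is what builds the AST node for the clause:

```antlr
// Hypothetical sketch in the style of Hive.g: a keyword token plus a
// parser rule whose rewrite constructs the corresponding AST subtree.
KW_SELECT : 'SELECT';

selectClause
    : KW_SELECT selectList -> ^(TOK_SELECT selectList)
    ;
```

Adding a new keyword (as the semijoin example later does with SEMI) means adding a token like `KW_SEMI` here and threading it through the relevant join rule.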
  • 7. Semantic Analyzer
     – BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and SemanticAnalyzer (the Driver invokes the appropriate analyzer)
        • SemanticAnalyzer handles queries, DML, and some DDL (create-table)
        • DDLSemanticAnalyzer handles alter table etc.
  • 8. Logical Plan Generation
     – SemanticAnalyzer.analyzeInternal() is the main function
        • doPhase1(): recursively traverses the AST, checks for semantic errors, and gathers metadata, which is put in QB and QBParseInfo.
        • getMetaData(): queries the metastore for metadata about the data sources and puts it into QB and QBParseInfo.
        • genPlan(): takes the QB/QBParseInfo and the AST and generates an operator tree.
  • 9. Logical Plan Generation (cont.)
     – genPlan() is called recursively for each subquery (QB), and outputs the root of the operator tree.
     – For each subquery, genPlan() creates operators “bottom-up”*, starting from FROM → WHERE → GROUPBY → ORDERBY → SELECT.
     – In the FROM clause, it generates a TableScanOperator for each source table, then calls genLateralView() and genJoinPlan().
     *Hive code actually names each leaf operator as “root” and its downstream operators as children.
  • 10. Logical Plan Generation (cont.)
     – genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT clauses:
        • genFilterPlan() for the WHERE clause
        • genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation
        • genGroupByPlan1MR/2MR() for reduce-side aggregation
        • genSelectPlan() for the SELECT clause
        • genReduceSink() for marking the boundary between map/reduce phases
        • genFileSink() to store intermediate results
  • 11. Optimizer
     – The resulting operator tree, along with other parsing info, is stored in ParseContext and passed to the Optimizer.
     – The Optimizer is a set of transformation rules on the operator tree.
     – The transformation rules are specified by a regexp pattern on the tree and a Worker/Dispatcher framework.
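The Worker/Dispatcher idea can be sketched outside Hive as a dispatcher that tracks the path of operator names walked so far and fires a worker when a rule's regexp matches that path. This is a toy Java sketch with hypothetical class names and a list standing in for the tree; Hive's real framework is considerably richer:

```java
import java.util.*;
import java.util.regex.*;

// Toy sketch of Hive's rule-based optimizer dispatch: each rule pairs a
// regex over the operator-name path with a "worker" to run on a match.
public class RuleDispatcher {
    interface Worker { String work(String opName); }

    private final Map<Pattern, Worker> rules = new LinkedHashMap<>();

    void addRule(String pathRegex, Worker w) {
        rules.put(Pattern.compile(pathRegex), w);
    }

    // Walk a linear "operator tree" (simplified to a list of names),
    // extending the path at each step and firing the first rule whose
    // pattern matches the path walked so far.
    List<String> dispatch(List<String> operators) {
        List<String> fired = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (String op : operators) {
            path.append(op).append("%");   // "%" as a name separator
            for (Map.Entry<Pattern, Worker> r : rules.entrySet()) {
                if (r.getKey().matcher(path).find()) {
                    fired.add(r.getValue().work(op));
                    break;
                }
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        RuleDispatcher d = new RuleDispatcher();
        // e.g. react when a Filter directly follows a TableScan; the "$"
        // anchors the pattern at the end of the path walked so far.
        d.addRule("TS%FIL%$", op -> "PredicatePushDown(" + op + ")");
        System.out.println(d.dispatch(Arrays.asList("TS", "FIL", "SEL")));
        // prints: [PredicatePushDown(FIL)]
    }
}
```

A real rewrite rule such as PredicatePushDown would mutate the operator tree inside its worker instead of returning a string, but the pattern-plus-worker dispatch shape is the same.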
  • 12. Optimizer (cont.)
     – Current rewrite rules include:
        • ColumnPruner
        • PredicatePushDown
        • PartitionPruner
        • GroupByOptimizer
        • SamplePruner
        • MapJoinProcessor
        • UnionProcessor
        • JoinReorder
  • 13. Physical Plan Generation
     – genMapRedWorks() takes the QB/QBParseInfo and the operator tree and generates a DAG of MapReduce tasks.
     – The generation is also based on the Worker/Dispatcher framework while traversing the operator tree.
     – Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask, DDLTask, CounterTask
     – validate() on the physical plan is called at the end of Driver.compile().
  • 14. Preparing Execution
     – Driver.execute() takes the output from Driver.compile() and prepares the hadoop command line (in local mode) or calls ExecDriver.execute() (in remote mode):
        • Start a session
        • Execute PreExecutionHooks
        • Create a Runnable for each Task that can be executed in parallel, and launch threads within a certain limit
        • Monitor thread status and update the session
        • Execute PostExecutionHooks
  • 15. Preparing Execution (cont.)
     – Hadoop jobs are started from MapRedTask.execute():
        • Get info on all needed JAR files, with ExecDriver as the starting class
        • Serialize the physical plan (MapRedTask) to an XML file
        • Gather other info such as the Hadoop version and prepare the hadoop command line
        • Execute the hadoop command line in a separate process
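The XML plan serialization uses JavaBeans-style encoding, which is why (as slide 17 notes) descriptor metadata needs public setters and getters. A minimal standalone illustration with `java.beans.XMLEncoder`, using a toy descriptor class rather than Hive's real plan classes:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Toy "descriptor" bean: XMLEncoder only persists properties exposed
// through a public getter/setter pair on a public class with a no-arg
// constructor, mirroring the constraint on Hive's *Desc classes.
public class PlanDescDemo {
    public static class FilterDesc {
        private String predicate;
        public FilterDesc() {}
        public String getPredicate() { return predicate; }
        public void setPredicate(String p) { predicate = p; }
    }

    public static FilterDesc roundTrip(FilterDesc in) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(bos)) {
            enc.writeObject(in);                  // "plan file" on the client side
        }
        try (XMLDecoder dec =
                 new XMLDecoder(new ByteArrayInputStream(bos.toByteArray()))) {
            return (FilterDesc) dec.readObject(); // deserialized in the task JVM
        }
    }

    public static void main(String[] args) {
        FilterDesc d = new FilterDesc();
        d.setPredicate("key > 100");
        System.out.println(roundTrip(d).getPredicate()); // prints: key > 100
    }
}
```

Dropping the public setter would silently produce an XML document without the `predicate` property, which is exactly the failure mode the slide warns about.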
  • 16. Starting Hadoop Jobs
     – ExecDriver deserializes the plan from the XML file and calls execute().
     – execute() sets up the number of reducers, the job scratch dir, the starting mapper class (ExecMapper) and starting reducer class (ExecReducer), and other info in JobConf, and submits the job through hadoop.mapred.JobClient.
     – The query plan is again serialized into a file and put into DistributedCache to be sent out to mappers/reducers before the job is started.
  • 17. Operator
     – ExecMapper creates a MapOperator as the parent of all root operators in the query plan and starts executing on the operator tree.
     – Each Operator class comes with a descriptor class, which contains metadata passed from compilation to execution.
        • Any metadata variable that needs to be passed should have a public setter & getter in order for the XML serializer/deserializer to work.
     – The operator’s interface contains:
        • initialize(): called once per operator lifetime
        • startGroup()*: called once for each group (groupby/join)
        • process()*: called once for each input row
        • endGroup()*: called once for each group (groupby/join)
        • close(): called once per operator lifetime
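The lifecycle above can be sketched with a toy operator chain. This is a hypothetical, much-reduced analogue of Hive's Operator classes (no descriptors, no grouping calls): initialize() once, process() per row, close() once, with rows forwarded down the tree:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy version of the Hive operator lifecycle: rows flow from a parent
// operator to its child via forward(); a childless operator collects.
public class ToyOperators {
    static abstract class Op {
        Op child;
        List<Object[]> out = new ArrayList<>();   // sink when there is no child
        void initialize() { if (child != null) child.initialize(); }
        abstract void process(Object[] row);      // called once per input row
        void forward(Object[] row) {
            if (child != null) child.process(row); else out.add(row);
        }
        void close() { if (child != null) child.close(); }
    }

    // FilterOperator analogue: forwards a row only if the predicate holds.
    static class FilterOp extends Op {
        final Predicate<Object[]> pred;
        FilterOp(Predicate<Object[]> p) { pred = p; }
        void process(Object[] row) { if (pred.test(row)) forward(row); }
    }

    static class CollectOp extends Op {
        void process(Object[] row) { forward(row); }
    }

    public static void main(String[] args) {
        FilterOp filter = new FilterOp(r -> (Integer) r[0] > 10);
        CollectOp sink = new CollectOp();
        filter.child = sink;
        filter.initialize();
        filter.process(new Object[]{5, "a"});     // dropped by the filter
        filter.process(new Object[]{42, "b"});    // forwarded to the sink
        filter.close();
        System.out.println(sink.out.size());      // prints: 1
    }
}
```

In Hive the equivalent of `pred` would live in the operator's descriptor (e.g. a filter expression), which is why descriptor fields must survive the XML round trip described on slides 15 and 16.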
  • 18. Example: adding a Semijoin operator
     – Left semijoin is similar to inner join, except that only the left-hand-side table is output, and there are no duplicated join values if there are duplicated keys in the RHS table.
        • IN/EXISTS semantics
        • SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE
        • Output all columns in table S if its (key, value) matches at least one (key, value) pair in T
  • 19. Semijoin Implementation
     – Parser: add the SEMI keyword
     – SemanticAnalyzer:
        • doPhase1(): keep a mapping of the RHS table name and its join key columns in QBParseInfo.
        • genJoinTree(): set the new join type in joinDesc.joinCond
        • genJoinPlan(): generate a map-side partial group-by operator right after the TableScanOperator for the RHS table. The input & output columns of the group-by operator are the RHS join keys.
  • 20. Semijoin Implementation (cont.)
     – SemanticAnalyzer
        • genJoinOperator(): generate a JoinOperator (left-semi type) and set the output fields to the LHS table’s fields
     – Execution
        • In CommonJoinOperator, implement left semi join with an early-exit optimization: as long as the RHS table of the left semi join is non-null, return the row from the LHS table.
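The semantics, including the no-duplicates guarantee that the map-side group-by on the RHS keys provides, can be sketched in plain Java. This toy code is not Hive's CommonJoinOperator; it just models the same behavior on in-memory lists:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy left semi join: emit each LHS row at most once if its key exists
// in the RHS. Deduplicating the RHS keys into a set mirrors the
// map-side group-by Hive inserts on the RHS table; the containment
// check is the "early exit" (the first RHS match is enough).
public class SemiJoinDemo {
    public static List<String[]> leftSemiJoin(List<String[]> lhs,
                                              List<String> rhsKeys) {
        Set<String> keys = new HashSet<>(rhsKeys);  // RHS duplicates collapse
        List<String[]> result = new ArrayList<>();
        for (String[] row : lhs) {
            if (keys.contains(row[0])) {            // matched: output LHS row
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> s = Arrays.asList(
            new String[]{"k1", "v1"}, new String[]{"k2", "v2"});
        // a duplicated RHS key ("k1" twice) must not duplicate output rows
        List<String[]> out = leftSemiJoin(s, Arrays.asList("k1", "k1", "k3"));
        System.out.println(out.size()); // prints: 1
    }
}
```

Note that an inner join against the same inputs would emit the "k1" row twice; collapsing the RHS keys before joining is what gives the IN/EXISTS semantics.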
  • 21. Debugging
     – Debugging compile-time code (Driver through ExecDriver) is relatively easy, since it runs in the JVM on the client side.
     – Debugging execution-time code (after ExecDriver calls hadoop) needs some configuration. See the wiki: http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code