1
Introduction to Apache Hive (and
HCatalog)	
  
Mark Grover
github.com/markgrover/nyc-hug-
hive
Me!
•  Contributor	
  to	
  Apache	
  Hive	
  
•  Sec3on	
  Author	
  of	
  O’Reilly’s	
  Programming	
  Hive	
  book	
  
...
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Preamble
•  This	
  is	
  a	
  remote	
  talk	
  
•  Feel	
  free	
  to	
  ask	
  ques3ons	
  any	
  3me!	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Hive
•  Data	
  warehouse	
  system	
  for	
  Hadoop	
  
•  Enables	
  Extract/Transform/Load	
  (ETL)	
  
•  Associate	
 ...
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Why use Hive?
•  MapReduce	
  is	
  catered	
  towards	
  developers	
  
•  Run	
  SQL-­‐like	
  queries	
  that	
  get	
 ...
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Hive features
•  Create	
  table,	
  create	
  view,	
  create	
  index	
  -­‐	
  DDL	
  
•  Select,	
  where	
  clause,	
...
Hive features
•  Pluggable	
  custom	
  Input/Output	
  format	
  
•  Pluggable	
  Serializa3on	
  Deserializa3on	
  libra...
What Hive does NOT support
•  OLTP	
  workloads	
  -­‐	
  low	
  latency	
  
•  Correlated	
  subqueries	
  
•  Not	
  sup...
Other Hive features
•  Par33oning	
  
•  Sampling	
  
•  Bucke3ng	
  
•  Various	
  join	
  op3miza3ons	
  
•  Integra3on	...
Connecting to Hive
•  Hive	
  Shell	
  
•  JDBC	
  driver	
  
•  ODBC	
  driver	
  
•  Thri?	
  client	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Hive metastore
•  Backed	
  by	
  RDBMS	
  
•  Derby,	
  MySQL,	
  PostgreSQL,	
  etc.	
  supported	
  
•  Default	
  Embe...
Hive architecture
Hive Remote Mode
Hive server
Problems with Hive Server
•  No	
  sessions/concurrency	
  
•  Essen3ally	
  need	
  1	
  server	
  per	
  client	
  
•  S...
Hive server 2
Hive architecture
•  Compiler	
  
•  Parser	
  
•  Type	
  checking	
  
•  Seman3c	
  Analyzer	
  
•  Plan	
  Genera3on	
 ...
Hive architecture
•  Execu3on	
  Engine	
  
•  Plan	
  
•  Operators	
  
•  SerDes	
  
•  UDFs/UDAFs/UDTFs	
  
•  Metastor...
Architecture Summary
•  Use	
  remote	
  metastore	
  service	
  for	
  sharing	
  the	
  
metastore	
  with	
  HCatalog	
...
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Hive Metastore Remote Mode
HCatalog
•  Sub-­‐component	
  of	
  Hive	
  
•  Table	
  and	
  storage	
  management	
  service	
  
•  Public	
  APIs	
 ...
WebHCat
REST
REST
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatal...
Applications of Hive
•  Web	
  Analy3cs	
  
•  Retail	
  
•  Healthcare	
  
•  Spam	
  detec3on	
  
•  Data	
  Mining	
  
...
How to install Hive
•  Download	
  Apache	
  Hive	
  tarball	
  
•  Use	
  Apache	
  Bigtop	
  packages	
  
•  Use	
  a	
 ...
Want to learn more about Hive?
Contact info
@mark_grover	
  
github.com/markgrover	
  
linkedin.com/in/grovermark	
  
mgrover@cloudera.com	
  
	
  
	
  
Upcoming SlideShare
Loading in...5
×

Introduction to Hive and HCatalog

3,014

Published on

Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s

Published in: Engineering, Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,014
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
125
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Transcript of "Introduction to Hive and HCatalog"

  1. 1. 1 Introduction to Apache Hive (and HCatalog)   Mark Grover github.com/markgrover/nyc-hug- hive
  2. 2. Me! •  Contributor  to  Apache  Hive   •  Sec3on  Author  of  O’Reilly’s  Programming  Hive  book   •  So?ware  Developer  at  Cloudera   •  @mark_grover   •  mgrover@cloudera.com   •  hFps://github.com/markgrover/nyc-­‐hug-­‐hive  
  3. 3. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  4. 4. Preamble •  This  is  a  remote  talk   •  Feel  free  to  ask  ques3ons  any  3me!  
  5. 5. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  6. 6. Hive •  Data  warehouse  system  for  Hadoop   •  Enables  Extract/Transform/Load  (ETL)   •  Associate  structure  with  a  variety  of  data  formats   •  Logical  Table  -­‐>  Physical  Loca3on   •  Logical  Table  -­‐>  Physical  Data  Format  Handler  (SerDe)   •  Integrates  with  HDFS,  HBase,  MongoDB,  etc.   •  Query  execu3on  in  MapReduce  
  7. 7. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  8. 8. Why use Hive? •  MapReduce  is  catered  towards  developers   •  Run  SQL-­‐like  queries  that  get  compiled  and  run  as   MapReduce  jobs   •  Data  in  Hadoop  even  though  generally  unstructured   has  some  vague  structure  associated  with  it   •  Benefits  of  MapReduce  +  HDFS  (Hadoop)   •  Fault  tolerant   •  Robust   •  Scalable  
  9. 9. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  10. 10. Hive features •  Create  table,  create  view,  create  index  -­‐  DDL   •  Select,  where  clause,  group  by,  order  by,  joins   •  Pluggable  User  Defined  Func3ons  -­‐  UDFs  (e.g   from_unix3me)   •  Pluggable  User  Defined  Aggregate  Func3ons  -­‐   UDAFs  (e.g.  count,  avg)   •  Pluggable  User  Defined  Table  Genera3ng  Func3ons   -­‐  UDTFs  (e.g.  explode)  
  11. 11. Hive features •  Pluggable  custom  Input/Output  format   •  Pluggable  Serializa3on  Deserializa3on  libraries   (SerDes)   •  Pluggable  custom  map  and  reduce  scripts  
  12. 12. What Hive does NOT support •  OLTP  workloads  -­‐  low  latency   •  Correlated  subqueries   •  Not  super  performant  with  small  amounts  of  data   •  How  much  data  do  you  need  to  call  it  “Big  Data”?  
  13. 13. Other Hive features •  Par33oning   •  Sampling   •  Bucke3ng   •  Various  join  op3miza3ons   •  Integra3on  with  HBase  and  other  storage  handlers   •  Views  –  Unmaterialized   •  Complex  data  types  –  arrays,  structs,  maps  
  14. 14. Connecting to Hive •  Hive  Shell   •  JDBC  driver   •  ODBC  driver   •  Thri?  client  
  15. 15. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  16. 16. Hive metastore •  Backed  by  RDBMS   •  Derby,  MySQL,  PostgreSQL,  etc.  supported   •  Default  Embedded  Derby   •  Not  recommend  for  anything  but  a  quick  Proof  of   Concept   •  3  different  modes  of  opera3on:   •  Embedded  Derby  (default)   •  Local   •  Remote  
  17. 17. Hive architecture
  18. 18. Hive Remote Mode
  19. 19. Hive server
  20. 20. Problems with Hive Server •  No  sessions/concurrency   •  Essen3ally  need  1  server  per  client   •  Security   •  Auding/Logging  
  21. 21. Hive server 2
  22. 22. Hive architecture •  Compiler   •  Parser   •  Type  checking   •  Seman3c  Analyzer   •  Plan  Genera3on   •  Task  Genera3on  
  23. 23. Hive architecture •  Execu3on  Engine   •  Plan   •  Operators   •  SerDes   •  UDFs/UDAFs/UDTFs   •  Metastore   •  Stores  schema  of  data   •  HCatalog  
  24. 24. Architecture Summary •  Use  remote  metastore  service  for  sharing  the   metastore  with  HCatalog  and  other  tools   •  Use  Hive  Server2  for  concurrent  queries  
  25. 25. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  26. 26. Hive Metastore Remote Mode
  27. 27. HCatalog •  Sub-­‐component  of  Hive   •  Table  and  storage  management  service   •  Public  APIs  and  webservice  wrappers  for  accessing   metadata  in  Hive  metastore   •  Metastore  contains  informa3on  of  interest  to  other   tools  (Pig,  MapReduce  jobs)   •  Expose  that  informa3on  as  REST  interface   •  WebHCat:  Web  Server  for  engaging  with  the  Hive   metastore    
  28. 28. WebHCat REST REST
  29. 29. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  30. 30. Applications of Hive •  Web  Analy3cs   •  Retail   •  Healthcare   •  Spam  detec3on   •  Data  Mining   •  Ad  op3miza3on   •  ETL  workloads  
  31. 31. How to install Hive •  Download  Apache  Hive  tarball   •  Use  Apache  Bigtop  packages   •  Use  a  Demo  VM  
  32. 32. Want to learn more about Hive?
  33. 33. Contact info @mark_grover   github.com/markgrover   linkedin.com/in/grovermark   mgrover@cloudera.com      
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×