SlideShare a Scribd company logo
1 of 33
Download to read offline
1
Introduction to Apache Hive (and
HCatalog)	
  
Mark Grover
github.com/markgrover/nyc-hug-
hive
Me!
•  Contributor	
  to	
  Apache	
  Hive	
  
•  Sec3on	
  Author	
  of	
  O’Reilly’s	
  Programming	
  Hive	
  book	
  
•  So?ware	
  Developer	
  at	
  Cloudera	
  
•  @mark_grover	
  
•  mgrover@cloudera.com	
  
•  hFps://github.com/markgrover/nyc-­‐hug-­‐hive	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Preamble
•  This	
  is	
  a	
  remote	
  talk	
  
•  Feel	
  free	
  to	
  ask	
  ques3ons	
  any	
  3me!	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive
•  Data	
  warehouse	
  system	
  for	
  Hadoop	
  
•  Enables	
  Extract/Transform/Load	
  (ETL)	
  
•  Associate	
  structure	
  with	
  a	
  variety	
  of	
  data	
  formats	
  
•  Logical	
  Table	
  -­‐>	
  Physical	
  Loca3on	
  
•  Logical	
  Table	
  -­‐>	
  Physical	
  Data	
  Format	
  Handler	
  (SerDe)	
  
•  Integrates	
  with	
  HDFS,	
  HBase,	
  MongoDB,	
  etc.	
  
•  Query	
  execu3on	
  in	
  MapReduce	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Why use Hive?
•  MapReduce	
  is	
  catered	
  towards	
  developers	
  
•  Run	
  SQL-­‐like	
  queries	
  that	
  get	
  compiled	
  and	
  run	
  as	
  
MapReduce	
  jobs	
  
•  Data	
  in	
  Hadoop	
  even	
  though	
  generally	
  unstructured	
  
has	
  some	
  vague	
  structure	
  associated	
  with	
  it	
  
•  Benefits	
  of	
  MapReduce	
  +	
  HDFS	
  (Hadoop)	
  
•  Fault	
  tolerant	
  
•  Robust	
  
•  Scalable	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive features
•  Create	
  table,	
  create	
  view,	
  create	
  index	
  -­‐	
  DDL	
  
•  Select,	
  where	
  clause,	
  group	
  by,	
  order	
  by,	
  joins	
  
•  Pluggable	
  User	
  Defined	
  Func3ons	
  -­‐	
  UDFs	
  (e.g	
  
from_unix3me)	
  
•  Pluggable	
  User	
  Defined	
  Aggregate	
  Func3ons	
  -­‐	
  
UDAFs	
  (e.g.	
  count,	
  avg)	
  
•  Pluggable	
  User	
  Defined	
  Table	
  Genera3ng	
  Func3ons	
  
-­‐	
  UDTFs	
  (e.g.	
  explode)	
  
Hive features
•  Pluggable	
  custom	
  Input/Output	
  format	
  
•  Pluggable	
  Serializa3on	
  Deserializa3on	
  libraries	
  
(SerDes)	
  
•  Pluggable	
  custom	
  map	
  and	
  reduce	
  scripts	
  
What Hive does NOT support
•  OLTP	
  workloads	
  -­‐	
  low	
  latency	
  
•  Correlated	
  subqueries	
  
•  Not	
  super	
  performant	
  with	
  small	
  amounts	
  of	
  data	
  
•  How	
  much	
  data	
  do	
  you	
  need	
  to	
  call	
  it	
  “Big	
  Data”?	
  
Other Hive features
•  Par33oning	
  
•  Sampling	
  
•  Bucke3ng	
  
•  Various	
  join	
  op3miza3ons	
  
•  Integra3on	
  with	
  HBase	
  and	
  other	
  storage	
  handlers	
  
•  Views	
  –	
  Unmaterialized	
  
•  Complex	
  data	
  types	
  –	
  arrays,	
  structs,	
  maps	
  
Connecting to Hive
•  Hive	
  Shell	
  
•  JDBC	
  driver	
  
•  ODBC	
  driver	
  
•  Thri?	
  client	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive metastore
•  Backed	
  by	
  RDBMS	
  
•  Derby,	
  MySQL,	
  PostgreSQL,	
  etc.	
  supported	
  
•  Default	
  Embedded	
  Derby	
  
•  Not	
  recommend	
  for	
  anything	
  but	
  a	
  quick	
  Proof	
  of	
  
Concept	
  
•  3	
  different	
  modes	
  of	
  opera3on:	
  
•  Embedded	
  Derby	
  (default)	
  
•  Local	
  
•  Remote	
  
Hive architecture
Hive Remote Mode
Hive server
Problems with Hive Server
•  No	
  sessions/concurrency	
  
•  Essen3ally	
  need	
  1	
  server	
  per	
  client	
  
•  Security	
  
•  Auding/Logging	
  
Hive server 2
Hive architecture
•  Compiler	
  
•  Parser	
  
•  Type	
  checking	
  
•  Seman3c	
  Analyzer	
  
•  Plan	
  Genera3on	
  
•  Task	
  Genera3on	
  
Hive architecture
•  Execu3on	
  Engine	
  
•  Plan	
  
•  Operators	
  
•  SerDes	
  
•  UDFs/UDAFs/UDTFs	
  
•  Metastore	
  
•  Stores	
  schema	
  of	
  data	
  
•  HCatalog	
  
Architecture Summary
•  Use	
  remote	
  metastore	
  service	
  for	
  sharing	
  the	
  
metastore	
  with	
  HCatalog	
  and	
  other	
  tools	
  
•  Use	
  Hive	
  Server2	
  for	
  concurrent	
  queries	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive Metastore Remote Mode
HCatalog
•  Sub-­‐component	
  of	
  Hive	
  
•  Table	
  and	
  storage	
  management	
  service	
  
•  Public	
  APIs	
  and	
  webservice	
  wrappers	
  for	
  accessing	
  
metadata	
  in	
  Hive	
  metastore	
  
•  Metastore	
  contains	
  informa3on	
  of	
  interest	
  to	
  other	
  
tools	
  (Pig,	
  MapReduce	
  jobs)	
  
•  Expose	
  that	
  informa3on	
  as	
  REST	
  interface	
  
•  WebHCat:	
  Web	
  Server	
  for	
  engaging	
  with	
  the	
  Hive	
  
metastore	
  	
  
WebHCat
REST
REST
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Applications of Hive
•  Web	
  Analy3cs	
  
•  Retail	
  
•  Healthcare	
  
•  Spam	
  detec3on	
  
•  Data	
  Mining	
  
•  Ad	
  op3miza3on	
  
•  ETL	
  workloads	
  
How to install Hive
•  Download	
  Apache	
  Hive	
  tarball	
  
•  Use	
  Apache	
  Bigtop	
  packages	
  
•  Use	
  a	
  Demo	
  VM	
  
Want to learn more about Hive?
Contact info
@mark_grover	
  
github.com/markgrover	
  
linkedin.com/in/grovermark	
  
mgrover@cloudera.com	
  
	
  
	
  

More Related Content

What's hot

Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)Hafiz Ismail
 
코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우Arawn Park
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...HostedbyConfluent
 
Real Life Clean Architecture
Real Life Clean ArchitectureReal Life Clean Architecture
Real Life Clean ArchitectureMattia Battiston
 
Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Ted Wennmark
 
Infrastructure as Code with Terraform
Infrastructure as Code with TerraformInfrastructure as Code with Terraform
Infrastructure as Code with TerraformTim Berry
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialColin Charles
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to CypherNeo4j
 
REST APIs in Laravel 101
REST APIs in Laravel 101REST APIs in Laravel 101
REST APIs in Laravel 101Samantha Geitz
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsDataWorks Summit
 

What's hot (20)

MySQL Cluster Basics
MySQL Cluster BasicsMySQL Cluster Basics
MySQL Cluster Basics
 
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)
 
코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
Real Life Clean Architecture
Real Life Clean ArchitectureReal Life Clean Architecture
Real Life Clean Architecture
 
Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!Upgrade to MySQL 8.0!
Upgrade to MySQL 8.0!
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Infrastructure as Code with Terraform
Infrastructure as Code with TerraformInfrastructure as Code with Terraform
Infrastructure as Code with Terraform
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
MariaDB 10: The Complete Tutorial
MariaDB 10: The Complete TutorialMariaDB 10: The Complete Tutorial
MariaDB 10: The Complete Tutorial
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Intro to Cypher
Intro to CypherIntro to Cypher
Intro to Cypher
 
REST APIs in Laravel 101
REST APIs in Laravel 101REST APIs in Laravel 101
REST APIs in Laravel 101
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 

Viewers also liked

HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache HiveCarl Steinbach
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Big Data Spain
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Prasan Samtani
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceMing Ma
 
Hadoop and Hive
Hadoop and HiveHadoop and Hive
Hadoop and HiveZheng Shao
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6Rohit Agrawal
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Cloudera, Inc.
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Sparkdatamantra
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectGreg Makowski
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluDataWorks Summit
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herreroBigData_Europe
 

Viewers also liked (20)

HCatalog
HCatalogHCatalog
HCatalog
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
Hadoop and Hive
Hadoop and HiveHadoop and Hive
Hadoop and Hive
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot Project
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Restaurant management
Restaurant managementRestaurant management
Restaurant management
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 

Similar to Introduction to Hive and HCatalog

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in HyderabadRajitha D
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Middleware in Golang: InVision's Rye
Middleware in Golang: InVision's RyeMiddleware in Golang: InVision's Rye
Middleware in Golang: InVision's RyeCale Hoopes
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3Wen-Tien Chang
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightNilesh Gule
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCloudIDSummit
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesanynines GmbH
 
Apache Geode Meetup, London
Apache Geode Meetup, LondonApache Geode Meetup, London
Apache Geode Meetup, LondonApache Geode
 

Similar to Introduction to Hive and HCatalog (20)

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Middleware in Golang: InVision's Rye
Middleware in Golang: InVision's RyeMiddleware in Golang: InVision's Rye
Middleware in Golang: InVision's Rye
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
 
Apache Geode Meetup, London
Apache Geode Meetup, LondonApache Geode Meetup, London
Apache Geode Meetup, London
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 

Recently uploaded

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...Amil Baba Dawood bangali
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 

Recently uploaded (20)

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 

Introduction to Hive and HCatalog

  • 1. 1 Introduction to Apache Hive (and HCatalog)   Mark Grover github.com/markgrover/nyc-hug- hive
  • 2. Me! •  Contributor  to  Apache  Hive   •  Sec3on  Author  of  O’Reilly’s  Programming  Hive  book   •  So?ware  Developer  at  Cloudera   •  @mark_grover   •  mgrover@cloudera.com   •  hFps://github.com/markgrover/nyc-­‐hug-­‐hive  
  • 3. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 4. Preamble •  This  is  a  remote  talk   •  Feel  free  to  ask  ques3ons  any  3me!  
  • 5. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 6. Hive •  Data  warehouse  system  for  Hadoop   •  Enables  Extract/Transform/Load  (ETL)   •  Associate  structure  with  a  variety  of  data  formats   •  Logical  Table  -­‐>  Physical  Loca3on   •  Logical  Table  -­‐>  Physical  Data  Format  Handler  (SerDe)   •  Integrates  with  HDFS,  HBase,  MongoDB,  etc.   •  Query  execu3on  in  MapReduce  
  • 7. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 8. Why use Hive? •  MapReduce  is  catered  towards  developers   •  Run  SQL-­‐like  queries  that  get  compiled  and  run  as   MapReduce  jobs   •  Data  in  Hadoop  even  though  generally  unstructured   has  some  vague  structure  associated  with  it   •  Benefits  of  MapReduce  +  HDFS  (Hadoop)   •  Fault  tolerant   •  Robust   •  Scalable  
  • 9. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 10. Hive features •  Create  table,  create  view,  create  index  -­‐  DDL   •  Select,  where  clause,  group  by,  order  by,  joins   •  Pluggable  User  Defined  Func3ons  -­‐  UDFs  (e.g   from_unix3me)   •  Pluggable  User  Defined  Aggregate  Func3ons  -­‐   UDAFs  (e.g.  count,  avg)   •  Pluggable  User  Defined  Table  Genera3ng  Func3ons   -­‐  UDTFs  (e.g.  explode)  
  • 11. Hive features •  Pluggable  custom  Input/Output  format   •  Pluggable  Serializa3on  Deserializa3on  libraries   (SerDes)   •  Pluggable  custom  map  and  reduce  scripts  
  • 12. What Hive does NOT support •  OLTP  workloads  -­‐  low  latency   •  Correlated  subqueries   •  Not  super  performant  with  small  amounts  of  data   •  How  much  data  do  you  need  to  call  it  “Big  Data”?  
  • 13. Other Hive features •  Par33oning   •  Sampling   •  Bucke3ng   •  Various  join  op3miza3ons   •  Integra3on  with  HBase  and  other  storage  handlers   •  Views  –  Unmaterialized   •  Complex  data  types  –  arrays,  structs,  maps  
  • 14. Connecting to Hive •  Hive  Shell   •  JDBC  driver   •  ODBC  driver   •  Thri?  client  
  • 15. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 16. Hive metastore •  Backed  by  RDBMS   •  Derby,  MySQL,  PostgreSQL,  etc.  supported   •  Default  Embedded  Derby   •  Not  recommend  for  anything  but  a  quick  Proof  of   Concept   •  3  different  modes  of  opera3on:   •  Embedded  Derby  (default)   •  Local   •  Remote  
  • 20. Problems with Hive Server •  No  sessions/concurrency   •  Essen3ally  need  1  server  per  client   •  Security   •  Auding/Logging  
  • 22. Hive architecture •  Compiler   •  Parser   •  Type  checking   •  Seman3c  Analyzer   •  Plan  Genera3on   •  Task  Genera3on  
  • 23. Hive architecture •  Execu3on  Engine   •  Plan   •  Operators   •  SerDes   •  UDFs/UDAFs/UDTFs   •  Metastore   •  Stores  schema  of  data   •  HCatalog  
  • 24. Architecture Summary •  Use  remote  metastore  service  for  sharing  the   metastore  with  HCatalog  and  other  tools   •  Use  Hive  Server2  for  concurrent  queries  
  • 25. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 27. HCatalog •  Sub-­‐component  of  Hive   •  Table  and  storage  management  service   •  Public  APIs  and  webservice  wrappers  for  accessing   metadata  in  Hive  metastore   •  Metastore  contains  informa3on  of  interest  to  other   tools  (Pig,  MapReduce  jobs)   •  Expose  that  informa3on  as  REST  interface   •  WebHCat:  Web  Server  for  engaging  with  the  Hive   metastore    
  • 29. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 30. Applications of Hive •  Web  Analy3cs   •  Retail   •  Healthcare   •  Spam  detec3on   •  Data  Mining   •  Ad  op3miza3on   •  ETL  workloads  
  • 31. How to install Hive •  Download  Apache  Hive  tarball   •  Use  Apache  Bigtop  packages   •  Use  a  Demo  VM  
  • 32. Want to learn more about Hive?
  • 33. Contact info @mark_grover   github.com/markgrover   linkedin.com/in/grovermark   mgrover@cloudera.com