Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s
Technological Geeks Video 13 :-
Video Link :- https://youtu.be/mfLxxD4vjV0
FB page Link :- https://www.facebook.com/bitwsandeep/
Contents :-
Hive Architecture
Hive Components
Limitations of Hive
Hive data model
Difference with traditional RDBMS
Type system in Hive
This Hadoop Hive Tutorial will unravel the complete Introduction to Hive, Hive Architecture, Hive Commands, Hive Fundamentals & HiveQL. In addition to this, even fundamental concepts of BIG Data & Hadoop are extensively covered.
At the end, you'll have a strong knowledge regarding Hadoop Hive Basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop, targeted at SQL programmers. Hive lets SQL programmers enter the Hadoop ecosystem directly, without prerequisites in Java or other programming languages. HiveQL is similar to SQL; it is used to run Hadoop and MapReduce operations by managing and querying data.
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop, by Someshwar Kale
This presentation is based on my experience while learning Hive. Most of the things (limitations and features) covered in this PPT were in the incubating phase at the time of writing.
A short introduction to Apache Hadoop Hive, what is it and what can it do. How could we use it to connect a Hadoop cluster to business intelligence tools. Then create management reports from our Hadoop cluster data.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. Its true strengths are:
Scalability – it's affordable due to it being open-source and its use of commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – More data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
Hadoop in Practice (SDN Conference, Dec 2014), by Marcel Krcah
You sit on a big pile of data and want to know how to leverage it in your company? Interested in use-cases, examples and practical demos about the full Hadoop stack? Looking for big-data inspiration?
In this talk we will cover:
- Use-cases showing how implementing a Hadoop stack at TheNewMotion drastically helped us, software engineers, with our everyday challenges, and how Hadoop enables our management team, marketing and operations to become more data-driven.
- Practical introduction into our data warehouse, analytical and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook and Angular with D3.js.
- Easy deployment of the Hadoop stack to the cloud.
- Hermes - our homegrown command-line tool which helps us automate data-related tasks.
- Examples of exciting machine learning challenges that we are currently tackling
- Hadoop with Azure and Microsoft stack.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition), by Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches, by Mithun Radhakrishnan
Here's the talk that we presented at the Hadoop Summit 2015, in San Jose. This was an inside look at how we at Yahoo scaled Hive to work at Yahoo's data/metadata scale.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2xkCd84
This CloudxLab Introduction to Hive tutorial helps you to understand Hive in detail. Below are the topics covered in this tutorial:
1) Hive Introduction
2) Why Do We Need Hive?
3) Hive - Components
4) Hive - Limitations
5) Hive - Data Types
6) Hive - Metastore
7) Hive - Warehouse
8) Accessing Hive using Command Line
9) Accessing Hive using Hue
10) Tables in Hive - Managed and External
11) Hive - Loading Data From Local Directory
12) Hive - Loading Data From HDFS
13) S3 Based External Tables in Hive
14) Hive - Select Statements
15) Hive - Aggregations
16) Saving Data in Hive
17) Hive Tables - DDL - ALTER
18) Partitions in Hive
19) Views in Hive
20) Load JSON Data
21) Sorting & Distributing - Order By, Sort By, Distribute By, Cluster By
22) Bucketing in Hive
23) Hive - ORC Files
24) Connecting to Tableau using Hive
25) Analyzing MovieLens Data using Hive
26) Hands-on demos on CloudxLab
This presentation gives a high-level overview of Hadoop and its ecosystem. It starts with why Hadoop came into existence, how Hadoop is being used, what the components of Hadoop and its ecosystem are, who the Hadoop and ETL/BI vendors are, and how Hadoop is typically implemented. It also covers a few examples to give a kick start to someone interested in learning and practicing MapReduce, Hadoop and its ecosystem products.
Apache Hive and HBase are very popular projects in the Hadoop ecosystem. Using Hive with HBase was made possible by contributions from Facebook around 2010. In this talk, we will go over the details of how the integration works, and talk about recent improvements. Specifically, we will cover the basic architecture, schema and data type mappings, and recent filter pushdown optimizations. We will also go into detail about the security aspects of Hadoop/HBase related to Hive setups.
Data Engineering with Spring, Hadoop and Hive, by Alex Silva
This presentation will outline the evolution of the monitoring data platform pipeline at Rackspace and explore the compute and data management challenges we have faced at this scale. We will focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, observations and pitfalls encountered in building the pipeline.
Hortonworks Technical Workshop: Interactive Query with Apache Hive, by Hortonworks
Apache Hive is the defacto standard for SQL queries over petabytes of data in Hadoop. It is a comprehensive and compliant engine that offers the broadest range of SQL semantics for Hadoop, providing a powerful set of tools for analysts and developers to access Hadoop data. The session will cover the latest advancements in Hive and provide practical tips for maximizing Hive Performance.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=7c8f800cbbef256680db14c78b871f97
Apache Hive Hook
I couldn't find enough info about Hive hooks, so I made this.
I hope this presentation will be useful when you want to use hooks.
It also includes some information about metastore event listeners.
It was written based on the release-0.11 tag.
Cost-based query optimization in Apache Hive, by Julian Hyde
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
In this introduction to Apache Hive the following topics are covered:
1. Hive Origin
2. Hive philosophy and architecture
3. Hive vs. RDBMS
4. HiveQL and Hive Shell
5. Managing tables
6. Data types and schemas
7. Querying data
8. HiveODBC
9. Resources
Self-Service BI for big data applications using Apache Drill (Big Data Amster..., by Dataconomy Media
Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.
The real estate market is one of the most competitive in terms of pricing, and as a result, prices tend to vary significantly based on a variety of factors. Forecasting property prices is an important module in decision making for both buyers and investors, supporting budget allocation, property-finding strategies, and the determination of suitable policies, making it one of the top fields in which to apply the concepts of machine learning to optimise and predict prices with high accuracy.
The literature study provides a clear concept and will benefit any future endeavours. The majority of writers have come to the conclusion that artificial neural networks are more effective at forecasting, but in the real world there are other algorithms that should have been taken into account. In order to maximise profits, investors base their judgments on market trends. Developers are curious about future trends because it helps them weigh the advantages and downsides and assists them in creating new products.
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013, by Jen Stirrup
This session focused on data visualisation using Power BI, based on big data. Some examples of Hive and HDFS file storage are given. An overview of Microsoft HDInsight is supplied.
Atlanta meetup presentation, discussion around big data processing engines (Hive, HBase, Druid, Spark). Weighs the relative strengths of each engine and which use cases each of the engines are most suited for
Stinger.Next, by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the speed, scale, and SQL compliance of Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver speed, scale and better SQL.
Apache Hive was initiated by Facebook in 2007 due to its data growth.
Its ETL system began to fail over a few years as more people joined Facebook.
In August 2008, Facebook decided to move to a more scalable open-source Hadoop environment: Hive.
Facebook, Netflix and Amazon support Apache Hive SQL, now known as HiveQL.
Hadoop and Internet of Things presentation from Sinergija 2014 conference, held in Belgrade in October 2014. How the rising data resources change the business, and how the Big Data technologies combined with Internet of Things devices can help to improve the business and the everyday life. Hadoop is already the most significant technology for working with Big Data. Microsoft is playing a very important role in this field, with the Stinger initiative. The main goal is to bring the enterprise SQL at Hadoop scale.
Apache Hive is a rapidly evolving project, loved by many people in the big data ecosystem. Hive continues to expand support for analytics, reporting, and interactive queries, and the community is striving to improve support along many other aspects and use cases. In this lecture, we introduce the latest and greatest features and optimizations that appeared in this project over the last year. This includes benchmarks covering LLAP, Apache Druid materialized views and integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. I will also tell you a little about what you can expect in the future.
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...), by Stéphane Fréchette
How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not-so-familiar tools you can use today for moving and integrating Big Data. It is also important to outline the technologies and platform (introduction to Big Data, Hadoop, HDInsight and tools).
We will compare and outline options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.
1. An Introduction to
Apache HIVE
Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi
2. Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not?
• Brief About Data Warehouse
An Introduction to Apache HIVE
2 of 31
3. Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References
4. Starting Note
• What is Apache Hive?
– Open Source (Very Important!) So Free
– Data Warehouse System on Hadoop
– Provides HQL (a SQL-like query interface)
– Suitable for Structured and Semi-Structured Data
– Capability to deal with different storages and file
formats
5. Starting Note- Contd.
• What is cool about Hive
– Lets users use MapReduce without thinking in MapReduce, via the
HiveQL interface.
• Some history
– Hive is made by Facebook!
– Also developed and used by Netflix.
– Amazon uses it in Amazon Elastic MapReduce
6. Starting Note- Contd.
• What Hive is not
– Does not use complex indexes, so it does not respond in seconds!
– But it scales very well, and it works with data on the petabyte scale
– It is not independent; its performance is tied to Hadoop
7. Brief About Data Warehouse
• OLAP vs OLTP
– A DW is needed in OLAP
– We want reports and summaries, not the live transactional data
used to keep operations running
– We need reports to make operations better, not to conduct the
operations themselves!
– We use ETL to populate data in the DW.
8. Brief About Data Warehouse
Inmon approach
vs
Kimball approach
10. Brief About Data Warehouse
• Other keywords
– ODS- Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs
11. Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server
13. Hive Architecture (diagram)
• Interfaces: Web UI + Hive CLI + JDBC/ODBC, exposed via the Thrift API
• Hive QL processing: Parser → Planner → Optimizer → Execution (browse, query, DDL)
• MetaStore
• UDF/UDAF (substr, sum, average) and user-defined map-reduce scripts
• SerDe: CSV, Thrift, Regex
• FileFormats: TextFile, SequenceFile, RCFile
• Execution layer: Map Reduce on HDFS
14. Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an execution plan.
• Optimizer
– Optimizes the execution plan before it runs.
• Execution Engine
– Runs the execution plan. The execution plan is guaranteed to be a DAG.
15. Hive Architecture- Contd.
• Hive Data Model
– Any data in Hive is categorized into:
• Databases
– The first level of abstraction.
• Tables
– Ordinary tables.
• Partitions
– To manage the data transferred to MapReduce.
• Buckets
– To facilitate data access within partitions.
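As a hedged sketch of the data model above (table and column names are hypothetical), a single DDL statement can combine partitions and buckets:

```sql
-- Hypothetical example: partition by day, bucket by user id.
-- Partition columns live in the directory layout, not in the data files;
-- buckets hash the rows of each partition into a fixed number of files.
CREATE TABLE page_views (
  user_id INT,
  url     STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

Bucketing by `user_id` means all rows for a given user land in the same file of a partition, which helps sampling and map-side joins.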
16. Hive in Production
• Log processing
– Daily Report
– User Activity Measurement
• Data/Text mining
– Machine learning (Training Data)
• Business intelligence
– Advertising Delivery
– Spam Detection
17. Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By
– Data Insertion/Aggregation
18. HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string, tags string)
• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
19. HQL- Samples
• Partition
create table table_name (
id int,
name string)
partitioned by (dt string)
– Note: a partition column (here dt) is declared only in PARTITIONED BY, not in the regular column list.
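A hedged sketch of how data lands in a specific partition (the table name, file path, and dt column are hypothetical, assuming a table partitioned by a string column):

```sql
-- Each LOAD targets exactly one partition; the partition value comes
-- from the PARTITION clause, not from the contents of the file.
LOAD DATA LOCAL INPATH '/tmp/logs_2013_10_01.txt'
OVERWRITE INTO TABLE page_logs
PARTITION (dt = '2013-10-01');
```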
20. HQL- Samples
• SerDe
– User table with "id::gender::age::occupation::zipcode" format.
CREATE TABLE USER (id INT, gender STRING, age INT, occupation STRING, zipcode INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
21. HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize the data while sending it to the reducers.
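A hedged illustration of how these clauses relate (using the movies table from the earlier samples): CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same column.

```sql
-- DISTRIBUTE BY routes rows with equal keys to the same reducer;
-- SORT BY orders rows within each reducer (not globally).
SELECT * FROM movies DISTRIBUTE BY tags SORT BY tags;

-- Equivalent shorthand:
SELECT * FROM movies CLUSTER BY tags;
```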
22. Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend - Community version
– Sqoop (SQl to hadOOP, Apache license)
– SyncSort – Not Free!
23. Hive Process- Contd.
– STP (Straight-Through Processing)
• Flume – Apache licensed
• Chukwa – a part of the Apache Hadoop distribution
• Scribe – Facebook's solution for log processing and aggregation.
24. Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count Errors per session
– Count Streams per day
– Ad-hoc queries like summaries (sum, max, min, …)
26. Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into Hive every hour.
– The previous job should also run every 24 hours for summaries.
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa provides non-stop log collection.
30. Further Reading
• Apache Drill
– A software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
– A platform for creating and running MapReduce programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
Hive was built on top of Hadoop so that queries can be run over Big Data. Hive was created at Facebook. The problem Facebook faced soon became the problem of many other companies, and the performance and capabilities of RDBMSs and NoSQL systems gradually faded for big data. Reports gradually took several minutes and sometimes hours; sometimes running two reports concurrently created a big problem. Little by little, systems slowed down, got stuck, or went down. Even after solving this, the need to access data without getting involved with MapReduce became apparent: data had to be retrieved and used without mastering the complex knowledge of MapReduce. Hadoop had no schema and was hard to work with. Not reusable; for complex jobs, multiple stages of Map/Reduce functions are needed. Examples: the problem of the Tehran Province Telecommunication Company announcing outage lists or changes in its database; a query taking 36 hours versus 24 seconds; Tavanir.
What is Hadoop? Free and open source. There is a difference between open source and free; this is both free and open source. It is a data warehouse for Hadoop. It is an abstraction, an abstract system.
What is cool about Hive is that it lets us use Hadoop and big data facilities without knowing MapReduce. We benefit from scalable facilities while using a query-language interface similar to classic SQL. Hive was open-sourced by Facebook in 2008 and came under the Apache license.
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own drivers to communicate with the Hadoop world.
CLI: The Hive CLI is the console for firing Hive queries; it is used for operating on our data.
Web interface: Hive also provides a web interface to monitor/administer Hive jobs.
MetaStore: The metastore is Hive's data warehouse catalog, which stores all the structure information of the various tables/partitions in Hive (database catalog).
Thrift Server: we can expose Hive as a service, which can then be used for connecting via JDBC/ODBC etc.
UDF: User-Defined Functions
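A hedged sketch of built-in UDFs and UDAFs in HiveQL (substr is a UDF applied per row; count is a UDAF aggregating per group; the movies table and its columns follow the earlier samples):

```sql
-- The UDF substr runs once per row; the UDAF count(*) aggregates
-- all rows sharing the same group key.
SELECT substr(movie_name, 1, 1) AS initial,
       count(*)                 AS n_movies
FROM movies
GROUP BY substr(movie_name, 1, 1);
```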
Directed acyclic graph: a directed graph with no directed cycles.
Partition: each table can have one or more partition keys. Data is stored in files based on the partition key. Without partitions, all of the data is sent to MapReduce; with partitions, sending data to MapReduce is managed. Bucket: the data of each partition is further grouped by hash values; this data is kept inside the same partition folder.
For working with complex data and complex, multi-character delimiters. Use case: log processing.
DISTRIBUTE BY + SORT BY = CLUSTER BY; similar to GROUP BY.
These are like log4j, with the difference that they do pre- and post-processing on the logs.
Drill: the design goal is for Drill to scale to 10,000 servers or more, and to process petabytes of data and trillions of records in seconds.