Introduction to Apache Hive
Outline 
• What is Hive 
• Why Hive over MapReduce or Pig? 
- Advantages and disadvantages 
• Running Hive 
• HiveQL language 
• Examples 
• Internals 
• Hive vs Pig 
• Other Hadoop projects 
2
Hive 
• Data warehousing on top of Hadoop. 
• Designed to 
- enable easy data summarization 
- ad-hoc querying 
- analysis of large volumes of data. 
• HiveQL statements are automatically translated into 
MapReduce jobs 
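To make the translation step concrete, here is a minimal Python sketch (not Hive's actual planner) of how a query such as SELECT gender, count(*) FROM pv_users GROUP BY gender could be split into map, shuffle and reduce phases. The sample rows and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows of a pv_users table: (userid, gender)
ROWS = [(1, "F"), (2, "M"), (3, "F"), (4, "F")]

def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for _userid, gender in rows:
        yield gender, 1

def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, which implements count(*)."""
    return {key: sum(values) for key, values in grouped.items()}

def run_group_by_count(rows):
    """Chain the three phases for one GROUP BY ... count(*) query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))

print(run_group_by_count(ROWS))
```

In a real cluster the map and reduce phases run in parallel on many machines; the sketch only shows the data flow.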
3
Advantages of Hive 
• Higher level query language 
- Simplifies working with large amounts of data 
• Lower learning curve than Pig or MapReduce 
- HiveQL is much closer to SQL than Pig 
- Less trial and error than Pig 
4
Disadvantages 
• Updating data is complicated 
- Mainly because of using HDFS 
- Can add records 
- Can overwrite partitions 
• No real time access to data 
- Use other means like HBase or Impala 
• High latency 
5
Running Hive 
• We will look at it more closely in the practice 
session, but you can run hive from 
- Hive web interface 
- Hive shell 
• $HIVE_HOME/bin/hive for interactive shell 
• Or you can run queries directly: $HIVE_HOME/bin/hive -e 
'select a.col from tab1 a' 
- JDBC - Java Database Connectivity 
• "jdbc:hive://host:port/dbname" 
- Also possible to use hive directly in Python, C, C++, 
PHP 
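As a small illustration of the hive -e form, the Python sketch below only builds the command line; it does not start Hive. The helper name build_hive_command and the /opt/hive install path are assumptions for illustration, not part of Hive.

```python
import os
import shlex

def build_hive_command(query, hive_home=None):
    """Return the argument list for running one query with `hive -e`."""
    # HIVE_HOME is read from the environment unless given explicitly.
    home = hive_home or os.environ.get("HIVE_HOME", "")
    executable = os.path.join(home, "bin", "hive") if home else "hive"
    return [executable, "-e", query]

# '/opt/hive' is an assumed install location.
cmd = build_hive_command("select a.col from tab1 a", hive_home="/opt/hive")
# shlex.quote makes the printed command safe to paste into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine where Hive is installed, the returned list could be handed to subprocess.run to execute the query.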
6
Query Processor 
• Compiler 
- Parser 
- Type checking 
- Plan Generator 
- Task Generator 
• Execution engine 
- Plan 
- Operators 
- UDFs (and UDAFs) 
7
HiveQL 
• Hive query language provides the basic SQL like operations. 
• These operations are: 
- Ability to filter rows from a table using a where clause. 
- Ability to select certain columns from the table using a select clause. 
- Ability to do equi-joins between two tables. 
- Ability to evaluate aggregations on multiple "group by" columns for 
the data stored in a table. 
- Ability to store the results of a query into another table. 
- Ability to download the contents of a table to a local directory. 
- Ability to store the results of a query in a hadoop dfs directory. 
- Ability to manage tables and partitions (create, drop and alter). 
- Ability to use custom scripts in chosen language (for map/reduce). 
8
Data units 
• Databases: Namespaces that keep tables and other data units from naming conflicts. 
• Tables: 
- Homogeneous units of data which have the same schema. 
- Consist of the columns specified by the schema. 
• Partitions: 
- Each table can have one or more partition keys, which determine how the data is stored. 
- Partitions allow the user to efficiently identify the rows that satisfy certain criteria. 
- It is the user's job to guarantee the relationship between partition name and data! 
- Partition columns are virtual columns: they are not part of the data itself, but are derived on load. 
• Buckets (or Clusters): 
- Data in each partition may in turn be divided into Buckets based on the value of a hash 
function of some column of the Table. 
- For example the page_views table may be bucketed by userid to sample the data. 
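The bucketing idea can be sketched in a few lines of Python. This is a simplification: the integer userid itself is used as the hash value, whereas the hash Hive really applies depends on the column type. The bucket count of 32 and the sample rows are assumptions for illustration.

```python
NUM_BUCKETS = 32  # assumed bucket count for illustration

def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Return a 1-based bucket number for an integer userid.

    Simplification: the integer itself is used as the hash value.
    """
    return (userid % num_buckets) + 1

def sample_bucket(rows, bucket, num_buckets=NUM_BUCKETS):
    """Keep only the rows whose userid falls into the requested bucket."""
    return [row for row in rows if bucket_for(row["userid"], num_buckets) == bucket]

# Hypothetical page_views rows.
page_views = [{"userid": uid, "page_url": "/p%d" % uid} for uid in range(100)]
sample = sample_bucket(page_views, bucket=3)
print([row["userid"] for row in sample])
```

Because every row with the same userid lands in the same bucket, reading one bucket gives a sample of users rather than a random sample of rows.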
9
Types 
• Types are associated with the columns in the tables. The 
following Primitive types are supported: 
- Integers 
• TINYINT - 1 byte integer 
• SMALLINT - 2 byte integer 
• INT - 4 byte integer 
• BIGINT - 8 byte integer 
- Boolean type 
• BOOLEAN 
- Floating point numbers 
• FLOAT 
• DOUBLE 
- String type 
• STRING 
10
Complex types 
• Structs 
- The elements within the type can be accessed using the DOT (.) notation. 
- For example, for a column c of type STRUCT {a INT; b INT} the a field is 
accessed by the expression c.a 
• Maps (key-value tuples) 
- The elements are accessed using ['element name'] notation. 
- For example, in a map M containing a mapping from 'group' -> gid, 
the gid value can be accessed using M['group'] 
• Arrays (indexable lists) 
- The elements in the array have to be in the same type. 
- Elements can be accessed using the [n] notation where n is an index 
(zero-based) into the array. 
- For example, for an array A having the elements ['a', 'b', 'c'], A[1] 
returns 'b'. 
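For readers who know Python better than HiveQL, the three access notations have close analogues. The sketch below is only an analogy: the names C, M and A mirror the examples above and the value 42 is a made-up gid.

```python
from collections import namedtuple

# STRUCT {a INT; b INT}: fields are accessed by name with dot notation.
C = namedtuple("C", ["a", "b"])
c = C(a=1, b=2)

# MAP: values are accessed by key, like M['group'].
M = {"group": 42}  # 42 stands in for a hypothetical gid

# ARRAY: elements share one type and are indexed from zero.
A = ["a", "b", "c"]

print(c.a, M["group"], A[1])
```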
11
Built in functions 
• Round(number), floor(number), ceil(number) 
• Rand(seed) 
• Concat(string, ...), substr(string, start, length) 
• Upper(string), lower(string), trim(string) 
• Year(date), Month(date), Day(date) 
• Count(*), sum(col), avg(col), max(col), min(col) 
12
Create Table 
CREATE TABLE page_view(viewTime INT, userid BIGINT, 
    page_url STRING, referrer_url STRING, 
    ip STRING COMMENT 'IP Address of the User') 
COMMENT 'This is the page view table' 
PARTITIONED BY(dt STRING, country STRING) 
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '1' 
STORED AS SEQUENCEFILE; 
13
Load Data 
• There are multiple ways to load data into Hive 
tables. 
• The user can create an external table that points 
to a specified location within HDFS. 
• The user can copy a file into the specified location 
in HDFS and create a table pointing to this 
location with all the relevant row format 
information. 
• Once this is done, the user can transform the 
data and insert them into any other Hive table. 
14
Load example 
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT, 
    page_url STRING, referrer_url STRING, 
    ip STRING COMMENT 'IP Address of the User', 
    country STRING COMMENT 'country of origination') 
COMMENT 'This is the staging page view table' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12' 
STORED AS TEXTFILE 
LOCATION '/user/data/staging/page_view'; 
15
Load example 
• Copy the file into the staging location: 
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view 
• Transform the staged data and insert it into the partitioned table: 
FROM page_view_stg pvs 
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') 
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, pvs.ip 
WHERE pvs.country = 'US'; 
16
Loading and storing data 
• Loading data directly: 
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view 
PARTITION(dt='2008-06-08', country='US'); 
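Loading into a partition works because Hive keeps each partition in its own subdirectory named key=value under the table directory. The Python sketch below builds such a path; the helper name partition_path and the warehouse location are assumptions for illustration, while the partition keys dt and country come from the page_view definition.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory for one partition from ordered (key, value) pairs."""
    parts = ["%s=%s" % (key, value) for key, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)

# '/user/hive/warehouse/page_view' is an assumed table location.
path = partition_path(
    "/user/hive/warehouse/page_view",
    [("dt", "2008-06-08"), ("country", "US")],
)
print(path)
```

This layout is also why partition columns are virtual: their values come from the directory name, not from the data files.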
17
Storing locally 
• To write the output to local disk, for 
example to load it into an Excel spreadsheet 
later: 
INSERT OVERWRITE LOCAL DIRECTORY 
'/tmp/pv_gender_sum' 
SELECT pv_gender_sum.* 
FROM pv_gender_sum; 
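Before such output can be opened in a spreadsheet it usually has to be converted, because Hive separates fields with Ctrl-A (byte 0x01) by default when no row format is given. A minimal Python sketch of that conversion follows; the function name and the sample lines are assumptions for illustration.

```python
import csv
import io

FIELD_DELIMITER = "\x01"  # Ctrl-A, Hive's default field separator for text output

def hive_lines_to_csv(lines):
    """Convert Ctrl-A delimited lines into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(line.rstrip("\n").split(FIELD_DELIMITER))
    return out.getvalue()

# Hypothetical contents of a file under /tmp/pv_gender_sum.
sample = ["F\x01120\n", "M\x0195\n"]
print(hive_lines_to_csv(sample))
```

In practice the lines would be read from the files Hive writes into the output directory.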
18
Display functions 
• SHOW TABLES; 
- Lists all tables 
• SHOW PARTITIONS page_view; 
- Lists partitions on a specific table 
• DESCRIBE EXTENDED page_view; 
- To show all metadata 
- Mainly for debugging 
19
INSERT 
INSERT OVERWRITE TABLE xyz_com_page_views 
SELECT page_views.* 
FROM page_views 
WHERE page_views.date >= '2008-03-01' AND 
page_views.date <= '2008-03-31' AND 
page_views.referrer_url like '%xyz.com'; 
20
Multiple table/file inserts 
• The output can be sent into multiple tables or even to hadoop dfs files 
(which can then be manipulated using hdfs utilities). 
• If along with the gender breakdown, one needed to find the breakdown of 
unique page views by age, one could accomplish that with the following 
query: 
FROM pv_users 
INSERT OVERWRITE TABLE pv_gender_sum 
SELECT pv_users.gender, count(DISTINCT pv_users.userid) 
GROUP BY pv_users.gender 
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum' 
SELECT pv_users.age, count(DISTINCT pv_users.userid) 
GROUP BY pv_users.age; 
• The first insert clause sends the results of the first group by to a Hive table, 
while the second one sends the results to hadoop dfs files. 
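The benefit of a multi-insert is that the source table is scanned only once for both outputs. The Python sketch below imitates that with a single loop that feeds two distinct-user breakdowns; the sample rows and the function name are assumptions for illustration.

```python
from collections import defaultdict

def breakdowns(pv_users):
    """Return (distinct users per gender, distinct users per age) from one pass."""
    by_gender = defaultdict(set)
    by_age = defaultdict(set)
    for row in pv_users:  # the input is read only once
        by_gender[row["gender"]].add(row["userid"])
        by_age[row["age"]].add(row["userid"])
    gender_sum = {key: len(ids) for key, ids in by_gender.items()}
    age_sum = {key: len(ids) for key, ids in by_age.items()}
    return gender_sum, age_sum

# Hypothetical rows; user 1 viewed two pages and is counted once.
pv_users = [
    {"userid": 1, "gender": "F", "age": 30},
    {"userid": 1, "gender": "F", "age": 30},
    {"userid": 2, "gender": "M", "age": 30},
    {"userid": 3, "gender": "F", "age": 25},
]
gender_sum, age_sum = breakdowns(pv_users)
print(gender_sum, age_sum)
```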
21
JOIN 
• LEFT OUTER, RIGHT OUTER or FULL OUTER 
• Can join more than 2 tables at once 
• It is best to put the largest table on the rightmost 
side of the join to get the best performance. 
INSERT OVERWRITE TABLE pv_users 
SELECT pv.*, u.gender, u.age 
FROM user u JOIN page_view pv ON (pv.userid = u.id) 
WHERE pv.dt = '2008-03-03'; 
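The advice on table order is easier to see in a simplified in-memory hash join: the smaller table is buffered and the larger one is streamed past it. The Python sketch below is an illustration under that assumption, not Hive's distributed join implementation; the sample rows are made up.

```python
def join_page_views(users, page_views):
    """Equi-join on users.id = page_views.userid, streaming page_views."""
    # Build phase: the smaller table is held in memory, keyed by the join column.
    users_by_id = {u["id"]: u for u in users}
    # Probe phase: the larger table is streamed one row at a time.
    for pv in page_views:
        user = users_by_id.get(pv["userid"])
        if user is not None:  # inner join: unmatched rows are dropped
            yield dict(pv, gender=user["gender"], age=user["age"])

users = [{"id": 1, "gender": "F", "age": 30}, {"id": 2, "gender": "M", "age": 41}]
page_views = [
    {"userid": 1, "page_url": "/home"},
    {"userid": 3, "page_url": "/about"},  # no matching user
    {"userid": 2, "page_url": "/home"},
]
result = list(join_page_views(users, page_views))
print(result)
```

Only the buffered side has to fit in memory, which is why the largest table should be the one that is streamed.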
22
Aggregations 
INSERT OVERWRITE TABLE pv_gender_agg 
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), 
    sum(DISTINCT pv_users.userid) 
FROM pv_users 
GROUP BY pv_users.gender; 
23
Sampling 
INSERT OVERWRITE TABLE 
pv_gender_sum_sample 
SELECT pv_gender_sum.* 
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 
OUT OF 32); 
24
Union 
INSERT OVERWRITE TABLE actions_users 
SELECT u.id, actions.date 
FROM ( 
    SELECT av.uid AS uid, av.date AS date 
    FROM action_video av 
    WHERE av.date = '2008-06-03' 
    UNION ALL 
    SELECT ac.uid AS uid, ac.date AS date 
    FROM action_comment ac 
    WHERE ac.date = '2008-06-03' 
) actions JOIN users u ON (u.id = actions.uid); 
25
Running custom mapreduce 
FROM ( 
FROM pv_users 
MAP pv_users.userid, pv_users.date 
USING 'map_script.py' 
AS dt, uid 
CLUSTER BY dt) map_output 
INSERT OVERWRITE TABLE pv_users_reduced 
REDUCE map_output.dt, map_output.uid 
USING 'reduce_script.py' 
AS date, count; 
26
Map Example 
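import sys 
import datetime 

# Read tab-separated (userid, unixtime) rows from Hive on stdin. 
for line in sys.stdin: 
    line = line.strip() 
    userid, unixtime = line.split('\t') 
    # isoweekday() returns 1 (Monday) through 7 (Sunday). 
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday() 
    # Hive expects tab-separated columns on stdout by default. 
    print('\t'.join([userid, str(weekday)])) 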
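The deck does not show the reduce side, so the following is a hypothetical sketch of what a script like reduce_script.py might do. It assumes the reducer receives tab-separated dt and uid columns (Hive's default for scripts), that CLUSTER BY dt has placed equal dt values next to each other, and that the goal is simply to count rows per dt.

```python
import itertools

def reduce_lines(lines):
    """Yield 'dt<TAB>count' for each run of consecutive lines sharing a dt."""
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # CLUSTER BY dt guarantees that equal dt values arrive next to each other.
    for dt, group in itertools.groupby(parsed, key=lambda fields: fields[0]):
        yield "%s\t%d" % (dt, sum(1 for _ in group))

# Hypothetical reducer input: dt and uid separated by a tab.
sample = ["1\t10\n", "1\t11\n", "5\t10\n"]
for out in reduce_lines(sample):
    print(out)
```

In a real script the function would be fed sys.stdin and each result printed to stdout.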
27
Co-group 
FROM ( 
FROM ( 
FROM action_video av 
SELECT av.uid AS uid, av.id AS id, av.date AS date 
UNION ALL 
FROM action_comment ac 
SELECT ac.uid AS uid, ac.id AS id, ac.date AS date 
) union_actions 
SELECT union_actions.uid, union_actions.id, union_actions.date 
CLUSTER BY union_actions.uid) map 
INSERT OVERWRITE TABLE actions_reduced 
SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS 
(uid, id, reduced_val); 
28
UDFs 
29
Java UDF 
package com.example.hive.udf; 

import org.apache.hadoop.hive.ql.exec.UDF; 
import org.apache.hadoop.io.Text; 

public final class Lower extends UDF { 
    public Text evaluate(final Text s) { 
        if (s == null) { return null; } 
        return new Text(s.toString().toLowerCase()); 
    } 
} 
30
Using UDF 
• create temporary function my_lower as 
'com.example.hive.udf.Lower'; 
• hive> select my_lower(title), sum(freq) from 
titles group by my_lower(title); 
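To show what the query computes, here is a rough Python equivalent of applying my_lower and then summing freq per lower-cased title. The sample rows of the titles table are assumptions for illustration.

```python
from collections import defaultdict

def my_lower(title):
    """Mirror the Java UDF: return None for None, else the lower-cased string."""
    return None if title is None else title.lower()

def sum_freq_by_title(titles):
    """Equivalent of: select my_lower(title), sum(freq) ... group by my_lower(title)."""
    totals = defaultdict(int)
    for title, freq in titles:
        totals[my_lower(title)] += freq
    return dict(totals)

# Hypothetical rows of the titles table: (title, freq)
titles = [("Hive", 3), ("HIVE", 2), ("Pig", 1)]
print(sum_freq_by_title(titles))
```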
31
Hive disadvantages 
• Same disadvantages as MapReduce and Pig 
- Slow start-up and clean-up of MapReduce jobs 
• It takes time for Hadoop to schedule MR jobs 
- Not suitable for interactive OLAP Analytics 
• When results are expected in < 1 sec 
• Designed for querying and not data 
transformation 
- Limitations of the SQL language 
- Tasks like co-grouping can get complicated 
32
Pig vs Hive 
                    Pig                        Hive 
Purpose             Data transformation        Ad-hoc querying 
Language            Something similar to SQL   SQL-like 
Difficulty          Medium (trial and error)   Low to medium 
Schemas             Yes (implicit)             Yes (explicit) 
Join (distributed)  Yes                        Yes 
Shell               Yes                        Yes 
Streaming           Yes                        Yes 
Web interface       No                         Yes 
Partitions          No                         Yes 
UDFs                Yes                        Yes 
33
Pig vs Hive 
• SQL might not be the perfect language for 
expressing data transformation commands 
• Pig 
- Mainly for data transformations and processing 
- Unstructured data 
• Hive 
- Mainly for warehousing and querying data 
- Structured data 
34
More Hadoop projects 
• HBase 
- Open-source distributed database on top of HDFS 
- HBase tables only use a single key 
- Tuned for real-time access to data 
• Cloudera Impala™ 
- Simplified, real-time queries over HDFS 
- Bypasses job scheduling and removes everything 
else that makes MR slow. 
35
Big Picture 
• Store large amounts of data to HDFS 
• Process raw data: Pig 
• Build schema using Hive 
• Data queries: Hive 
• Real time access 
- access to data with HBase 
- real time queries with Impala 
36
Is Hadoop enough? 
• Why use Hadoop for large scale data processing? 
- It is becoming a de facto standard in Big Data 
- Collaboration among Top Companies instead of 
vendor tool lock-in. 
• Amazon, Apache, Facebook, Yahoo!, etc. all contribute to 
open source Hadoop 
- There are tools for everything from setting up a Hadoop cluster in 
minutes and importing data from relational databases 
to setting up workflows of MR, Pig and Hive. 
37

 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 

Hive: Warehousing over Hadoop

  • 2. Outline
    • What is Hive
    • Why Hive over MapReduce or Pig?
      - Advantages and disadvantages
    • Running Hive
    • HiveQL language
    • Examples
    • Internals
    • Hive vs Pig
    • Other Hadoop projects
  • 3. Hive
    • Data warehousing on top of Hadoop.
    • Designed to enable
      - easy data summarization
      - ad-hoc querying
      - analysis of large volumes of data.
    • HiveQL statements are automatically translated into MapReduce jobs.
  • 4. Advantages of Hive
    • Higher level query language
      - Simplifies working with large amounts of data
    • Lower learning curve than Pig or MapReduce
      - HiveQL is much closer to SQL than Pig
      - Less trial and error than Pig
  • 5. Disadvantages
    • Updating data is complicated
      - Mainly because of using HDFS
      - Can add records
      - Can overwrite partitions
    • No real-time access to data
      - Use other means like HBase or Impala
    • High latency
  • 6. Running Hive
    • We will look at it more closely in the practice session, but you can run Hive from:
      - Hive web interface
      - Hive shell
        • $HIVE_HOME/bin/hive for the interactive shell
        • Or run a query directly: $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
      - JDBC (Java Database Connectivity)
        • "jdbc:hive://host:port/dbname"
      - It is also possible to use Hive directly from Python, C, C++ and PHP
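The one-off query form on this slide can also be driven from a script. The sketch below only builds the argument list for `hive -e`; the function name `hive_command` and the fallback path `/opt/hive` are assumptions made for illustration, not part of Hive.

```python
import os


def hive_command(query, hive_home=None):
    """Build the argument list for a one-off `hive -e '<query>'` call."""
    # Prefer the explicit argument, then $HIVE_HOME, then a made-up default path.
    home = hive_home or os.environ.get("HIVE_HOME", "/opt/hive")
    return [home.rstrip("/") + "/bin/hive", "-e", query]
```

Passing the resulting list to `subprocess.run` would execute the query, provided Hive is installed at that location.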
  • 7. Query Processor
    • Compiler
      - Parser
      - Type checking
      - Plan Generator
      - Task Generator
    • Execution engine
      - Plan
      - Operators
      - UDFs (and UDAFs)
  • 8. HiveQL
    • The Hive query language provides the basic SQL-like operations.
    • These operations are:
      - Ability to filter rows from a table using a where clause.
      - Ability to select certain columns from the table using a select clause.
      - Ability to do equi-joins between two tables.
      - Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
      - Ability to store the results of a query into another table.
      - Ability to download the contents of a table to a local directory.
      - Ability to store the results of a query in a Hadoop DFS directory.
      - Ability to manage tables and partitions (create, drop and alter).
      - Ability to use custom scripts in a chosen language (for map/reduce).
  • 9. Data units
    • Databases: Namespaces that keep tables and other data units from running into naming conflicts.
    • Tables:
      - Homogeneous units of data which have the same schema.
      - Consist of the columns specified in the schema.
    • Partitions:
      - Each table can have one or more partition keys, which determine how the data is stored.
      - Partitions allow the user to efficiently identify the rows that satisfy certain criteria.
      - It is the user's job to guarantee the relationship between partition name and data!
      - Partitions are virtual columns: they are not part of the data, but are derived on load.
    • Buckets (or Clusters):
      - Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table.
      - For example, the page_views table may be bucketed by userid to sample the data.
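A minimal sketch of how partitions and buckets map onto storage, assuming the usual key=value directory naming for partitions. The warehouse path in the usage note is only an example, and `bucket_for` uses the integer value itself as the hash, which is an illustrative stand-in rather than Hive's actual hash function.

```python
def partition_path(table_dir, partition_spec):
    """Return the directory that holds one partition of a table."""
    # Each partition key becomes one key=value directory level.
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket index in [0, num_buckets) for an integer column value."""
    # Stand-in hash: the value itself, reduced modulo the bucket count.
    return value % num_buckets
```

For example, `partition_path("/user/hive/warehouse/page_view", [("dt", "2008-06-08"), ("country", "US")])` yields a path ending in `dt=2008-06-08/country=US`, and with 32 buckets userid 35 would land in bucket index 3.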
  • 10. Types
    • Types are associated with the columns in the tables. The following primitive types are supported:
      - Integers
        • TINYINT - 1-byte integer
        • SMALLINT - 2-byte integer
        • INT - 4-byte integer
        • BIGINT - 8-byte integer
      - Boolean type
        • BOOLEAN
      - Floating point numbers
        • FLOAT
        • DOUBLE
      - String type
        • STRING
  • 11. Complex types
    • Structs
      - The elements within the type can be accessed using the DOT (.) notation.
      - For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
    • Maps (key-value tuples)
      - The elements are accessed using ['element name'] notation.
      - For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
    • Arrays (indexable lists)
      - The elements in the array have to be of the same type.
      - Elements can be accessed using the [n] notation, where n is a zero-based index into the array.
      - For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
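The three access notations have close Python analogues, which may help when reading HiveQL that uses complex types. This is only an analogy: the concrete values (1, 2 and 42) are made up for illustration.

```python
from typing import NamedTuple


class C(NamedTuple):
    """Stand-in for a column of type STRUCT {a INT; b INT}."""
    a: int
    b: int


c = C(a=1, b=2)      # struct: fields are reached with dot notation
M = {"group": 42}    # map: values are reached by key
A = ["a", "b", "c"]  # array: elements are reached by zero-based index

struct_field = c.a      # like c.a in HiveQL
map_value = M["group"]  # like M['group'] in HiveQL
array_item = A[1]       # like A[1] in HiveQL
```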
  • 12. Built-in functions
    • round(number), floor(number), ceil(number)
    • rand(seed)
    • concat(string, ...), substr(string, start, length)
    • upper(string), lower(string), trim(string)
    • year(date), month(date), day(date)
    • count(*), sum(col), avg(col), max(col), min(col)
  • 13. Create Table
    CREATE TABLE page_view(
        viewTime INT,
        userid BIGINT,
        page_url STRING,
        referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '1'
    STORED AS SEQUENCEFILE;
  • 14. Load Data
    • There are multiple ways to load data into Hive tables.
    • The user can create an external table that points to a specified location within HDFS.
    • The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information.
    • Once this is done, the user can transform the data and insert it into any other Hive table.
  • 15. Load example
    CREATE EXTERNAL TABLE page_view_stg(
        viewTime INT,
        userid BIGINT,
        page_url STRING,
        referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User',
        country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';
  • 16. Load example
    • Copy the file into the staging location:
      hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
    • Then insert the staged rows into a partition of the target table:
      FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
      SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
      WHERE pvs.country = 'US';
  • 17. Loading and storing data
    • Loading data directly:
      LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
      INTO TABLE page_view PARTITION(date='2008-06-08', country='US');
  • 18. Storing locally
    • To write the output to a local disk, for example to load it into an Excel spreadsheet later:
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
      SELECT pv_gender_sum.*
      FROM pv_gender_sum;
  • 19. Display functions
    • SHOW TABLES;
      - Lists all tables
    • SHOW PARTITIONS page_view;
      - Lists partitions on a specific table
    • DESCRIBE EXTENDED page_view;
      - Shows all metadata
      - Mainly for debugging
  • 20. INSERT
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'
      AND page_views.referrer_url like '%xyz.com';
  • 21. Multiple table/file inserts
    • The output can be sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities).
    • If, along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
      FROM pv_users
      INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender
      INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;
    • The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to Hadoop DFS files.
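The point of the multi-insert form is that both breakdowns are fed from one read of pv_users. A rough Python sketch of that idea follows, assuming rows arrive as dictionaries with userid, gender and age keys; the function name `breakdowns` is made up for illustration, and this models the concept rather than Hive's execution plan.

```python
from collections import defaultdict


def breakdowns(pv_users):
    """Compute unique users per gender and per age in a single pass."""
    by_gender = defaultdict(set)
    by_age = defaultdict(set)
    for row in pv_users:  # one scan feeds both aggregations
        by_gender[row["gender"]].add(row["userid"])
        by_age[row["age"]].add(row["userid"])
    # The set sizes play the role of count(DISTINCT userid) per group.
    gender_sum = {gender: len(users) for gender, users in by_gender.items()}
    age_sum = {age: len(users) for age, users in by_age.items()}
    return gender_sum, age_sum
```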
  • 22. JOIN
    • LEFT OUTER, RIGHT OUTER or FULL OUTER
    • Can join more than 2 tables at once
    • It is best to put the largest table on the rightmost side of the join to get the best performance.
      INSERT OVERWRITE TABLE pv_users
      SELECT pv.*, u.gender, u.age
      FROM user u JOIN page_view pv ON (pv.userid = u.id)
      WHERE pv.date = '2008-03-03';
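The advice about the rightmost table follows from how such a join is typically evaluated: rows of the earlier tables are buffered, and the last table is streamed past them. A simplified in-memory sketch of that pattern, assuming dictionary rows; `stream_join` is a hypothetical name, and this is a model of the idea rather than Hive's implementation.

```python
from collections import defaultdict


def stream_join(buffered_users, streamed_page_views):
    """Equi-join on userid = id: buffer the small side, stream the large side."""
    users_by_id = defaultdict(list)
    for user in buffered_users:  # the small table is held in memory
        users_by_id[user["id"]].append(user)
    for pv in streamed_page_views:  # the large table is read once, row by row
        for user in users_by_id.get(pv["userid"], []):
            yield {**pv, "gender": user["gender"], "age": user["age"]}
```

Memory use depends only on the buffered side, which is why the largest table belongs in the streamed position.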
  • 23. Aggregations
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender,
           count(DISTINCT pv_users.userid),
           count(*),
           sum(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
  • 24. Sampling
    INSERT OVERWRITE TABLE pv_gender_sum_sample
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
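Bucket sampling can be pictured as keeping only the rows whose hashed key falls into one of the buckets. The sketch below assumes 1-based bucket numbers, as in BUCKET 3 OUT OF 32, and takes the integer key itself as the hash; that hash is an illustrative stand-in, and `table_sample` is a hypothetical helper, not a Hive API.

```python
def table_sample(rows, bucket, out_of, key):
    """Keep the rows whose key falls into the 1-based `bucket` of `out_of`."""
    if not 1 <= bucket <= out_of:
        raise ValueError("bucket must be between 1 and out_of")
    # Bucket numbers are 1-based, remainders are 0-based.
    return [row for row in rows if key(row) % out_of == bucket - 1]
```

With keys 0 to 63, `table_sample(range(64), 3, 32, key=lambda v: v)` keeps the rows with keys 2 and 34, that is roughly 1/32 of the input.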
  • 25. Union
    INSERT OVERWRITE TABLE actions_users
    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid
        FROM action_video av
        WHERE av.date = '2008-06-03'
        UNION ALL
        SELECT ac.uid AS uid
        FROM action_comment ac
        WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON (u.id = actions.uid);
  • 26. Running custom mapreduce
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script.py'
        AS dt, uid
        CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script.py'
        AS date, count;
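  • 27. Map Example
    import sys
    import datetime

    # Read tab-separated (userid, unixtime) rows from standard input.
    for line in sys.stdin:
        line = line.strip()
        userid, unixtime = line.split('\t')
        # Convert the Unix timestamp to an ISO weekday (1 = Monday ... 7 = Sunday).
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        print(','.join([userid, str(weekday)]))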
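The slides show a map script but not the 'reduce_script.py' named in the custom mapreduce query. A minimal sketch of what such a reducer could look like is given below, assuming tab-separated input lines whose first field is the clustering key. Because CLUSTER BY sends equal keys to the same reducer in sorted order, equal keys arrive next to each other and can be counted with `itertools.groupby`. The function names are made up for illustration.

```python
import sys
from itertools import groupby


def count_per_key(lines):
    """Yield 'key<TAB>count' for each run of adjacent lines sharing a key."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges adjacent rows, which matches clustered input.
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        yield f"{key}\t{sum(1 for _ in group)}"


def main(stream=sys.stdin):
    """Entry point for use as a streaming script; call main() to run it."""
    for out in count_per_key(stream):
        print(out)
```

Adding a `main()` call at the bottom of the file would turn this into a script that reads from standard input, in the same style as the map example.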
  • 28. Co-group
    FROM (
        FROM (
            FROM action_video av
            SELECT av.uid AS uid, av.id AS id, av.date AS date
            UNION ALL
            FROM action_comment ac
            SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
        ) union_actions
        SELECT union_actions.uid, union_actions.id, union_actions.date
        CLUSTER BY union_actions.uid) map
    INSERT OVERWRITE TABLE actions_reduced
        SELECT TRANSFORM(map.uid, map.id, map.date)
        USING 'reduce_script'
        AS (uid, id, reduced_val);
  • 30. Java UDF
    package com.example.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class Lower extends UDF {
        // Returns the lower-cased input, or null for a null input.
        public Text evaluate(final Text s) {
            if (s == null) {
                return null;
            }
            return new Text(s.toString().toLowerCase());
        }
    }
  • 31. Using UDF
    • Register the function:
      create temporary function my_lower as 'com.example.hive.udf.Lower';
    • Use it in a query:
      hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
  • 32. Hive disadvantages
    • Same disadvantages as MapReduce and Pig
      - Slow start-up and clean-up of MapReduce jobs
        • It takes time for Hadoop to schedule MR jobs
      - Not suitable for interactive OLAP analytics
        • When results are expected in < 1 sec
    • Designed for querying and not data transformation
      - Limitations of the SQL language
      - Tasks like co-grouping can get complicated
  • 33. Pig vs Hive
                          Pig                         Hive
    Purpose               Data transformation         Ad-hoc querying
    Language              Something similar to SQL    SQL-like
    Difficulty            Medium (trial and error)    Low to medium
    Schemas               Yes (implicit)              Yes (explicit)
    Join (distributed)    Yes                         Yes
    Shell                 Yes                         Yes
    Streaming             Yes                         Yes
    Web interface         No                          Yes
    Partitions            No                          Yes
    UDFs                  Yes                         Yes
  • 34. Pig vs Hive
    • SQL might not be the perfect language for expressing data transformation commands
    • Pig
      - Mainly for data transformations and processing
      - Unstructured data
    • Hive
      - Mainly for warehousing and querying data
      - Structured data
  • 35. More Hadoop projects
    • HBase
      - Open-source distributed database on top of HDFS
      - HBase tables only use a single key
      - Tuned for real-time access to data
    • Cloudera Impala™
      - Simplified, real-time queries over HDFS
      - Bypasses job scheduling and removes everything else that makes MR slow.
  • 36. Big Picture
    • Store large amounts of data in HDFS
    • Process raw data: Pig
    • Build schema using Hive
    • Data queries: Hive
    • Real-time access
      - access to data with HBase
      - real-time queries with Impala
  • 37. Is Hadoop enough?
    • Why use Hadoop for large-scale data processing?
      - It is becoming a de facto standard in Big Data
      - Collaboration among top companies instead of vendor tool lock-in.
        • Amazon, Apache, Facebook, Yahoo!, etc. all contribute to open-source Hadoop
      - Tools exist for everything from setting up a Hadoop cluster in minutes and importing data from relational databases to setting up workflows of MR, Pig and Hive.