Apache Hive
Agenda
• What is Apache Hive
• How to Setup
• Tutorial Examples
Hive
Hive Introduction
• Hive is a data warehouse infrastructure built on top of Hadoop
– Compiles SQL queries into MapReduce jobs and runs them on Hadoop
– Uses HDFS for storage
• A data warehouse (DW) is a database dedicated to analysis and reporting purposes
(Architecture diagram: web and CLI clients; JDBC/ODBC via the Thrift Server; the Driver compiles HiveQL; the MetaStore holds table metadata)
Hadoop has an RDB too: Hive
• Hive = the RDB of Hadoop
– Maps structured data files onto database tables
– Provides SQL querying (translates SQL into MapReduce programs)
• Good fit for:
– Users with an SQL background, and tasks that basic SQL can express
• Features:
– Scalable, user-defined functions, fault-tolerant
• Limitations:
– Longer execution times
– Fixed data schema
– No in-place updates
See : http://www.slideshare.net/Avkashslide/introduction-to-apache-hive-18003322
Hive Performance
See : http://hortonworks.com/blog/pig-performance-and-optimization-analysis/
The Hive architecture provides..
• Interfaces
– CLI
– WebUI
– API
• JDBC and ODBC
• Thrift Server (hiveserver)
– Lets remote clients run HiveQL through the API
• Metastore
– DB, table, partition…
figure Source : http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive
When the elephant meets the bee (setup)
• Unpack http://archive.cloudera.com/cdh5/cdh/5/hive-0.13.1-cdh5.3.2.tar.gz
• Edit ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin
• Edit conf/hive-env.sh:
export HADOOP_HOME=/home/hadoop/hadoop
export HIVE_CONF_DIR=/home/hadoop/hive/conf
• Set up the metastore on HDFS:
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
• Start the hive shell:
$ hive
hive>
P.S.: If you are not using MySQL for the metastore_db, there is no need to edit hive-site.xml.
Programming even a bee can do
$ hive
hive> create table A (x int, y int, z int);
hive> load data local inpath 'file1' into table A;
hive> select * from A where y > 10000;
hive> insert into table B select * from A where y > 10000;
figure Source : http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
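Note that table B must already exist before the last insert; the slide does not show that step. A minimal sketch, assuming B mirrors A's schema (which is hypothetical here):

```sql
-- Hypothetical: create B with the same schema as A so the INSERT ... SELECT type-checks
CREATE TABLE B (x INT, y INT, z INT);
INSERT INTO TABLE B SELECT * FROM A WHERE y > 10000;
```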
Hive vs. SQL

                    Hive                    RDBMS
Query language      HQL                     SQL
Storage             HDFS                    Raw device or local FS
Execution engine    MapReduce               Executor
Latency             Very high               Low
Data scale          Large                   Small
Data updates        No                      Yes
Indexing            Index, bitmap index…    Mature, full-featured indexing
Source : http://sishuok.com/forum/blogPost/list/6220.html
Exercise 1:
• Scenario:
– The organization keeps attendance records in a uniform format, scattered across departmental databases in every county and city in Taiwan. The boss wants me to gather the nationwide data and compute every employee's average working hours. All the DB tables have been exported to CSV and fed into Hadoop's HDFS, ..
• Problem:
– I know Pig lowers the barrier to MapReduce, but I am still more comfortable implementing things in SQL. If only there were a huge, free DB…
• Solution:
– Budget for a high-end server and install a large SQL Server on it
– Use Hive
After reshaping with Hive

Input table A (nm, dp, id):    Input table B (id, dt, hr):
劉  北  A1                      A1  7/7  13
李  中  B1                      A1  7/8  12
王  中  B2                      A1  7/9  4

Resulting row in final: 北  A1  劉  12.5

HiveQL
> create table A (nm STRING, dp STRING, id STRING);
> create table B (id STRING, dt DATE, hr INT);
> create table final (dp STRING, id STRING, nm STRING, avg FLOAT);
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert into table final
  select collect_set(A.dp)[0], A.id, collect_set(A.nm)[0], avg(B.hr)
  from A, B where B.hr > 8 and B.id = A.id group by A.id;

Tips:
• There is no local mode;
• After a load, Hive removes the input file (it is moved into the warehouse directory);
• Data formats are checked strictly;
• The field delimiter is a single byte;
• For create table & load data, importing with a tool is less error-prone
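The comma-style join above can also be written with an explicit JOIN, which makes the join key obvious. A sketch over the same tables, indexing collect_set's array result to yield a single value per group:

```sql
INSERT INTO TABLE final
SELECT collect_set(A.dp)[0], A.id, collect_set(A.nm)[0], avg(B.hr)
FROM A JOIN B ON (B.id = A.id)
WHERE B.hr > 8
GROUP BY A.id;
```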
Exercise 1: hands-on
$ cd ~
$ git clone https://github.com/waue0920/hadoop_example.git
$ cd ~/hadoop_example/hive/ex1
$ hadoop fs -put *.txt ./
$ hive -f exc1.hive

Exercise: run each line of exc1.hive one at a time, and inspect the result with select * from table_name, e.g.:
hive> select * from final;
Q: Is there room to improve the data?
Q: How would you improve it?
More: tables

Create Table
CREATE TABLE page_view(
viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view'
TBLPROPERTIES ("skip.header.line.count"="1");

Drop Table
DROP TABLE pv_users;

Alter Table
ALTER TABLE old_table_name REPLACE
COLUMNS (col1 TYPE, ...);
More: data

Import data
LOAD DATA
INPATH '/user/data/pv_2008-06-08_us.txt'
INTO TABLE page_view PARTITION(date='2008-06-08',
country='US');

Insert Table
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31' AND
page_views.referrer_url like '%xyz.com';
More: queries

Query
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Partition Based Query
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
AND page_views.date <= '2008-03-31'
AND page_views.referrer_url like '%xyz.com';
More: joins & aggregations

Joins
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u
JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

Aggregations
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender,
count(DISTINCT pv_users.userid),
collect_set(pv_users.name)
FROM pv_users
GROUP BY pv_users.gender;
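Hive can also scan a source table once and feed several aggregations in a single statement (multi-table insert). A sketch reusing pv_users, where pv_age_sum is a hypothetical second target table:

```sql
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE TABLE pv_age_sum
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
```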
More: union
• Tips:
– UNION: the columns produced by the two SELECT statements must have the same data types
– JOIN: merges two tables into a wider one via a shared key

UNION ALL
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
SELECT av.uid AS uid, av.date AS date
FROM action_video av
WHERE av.date = '2008-06-03'
UNION ALL
SELECT ac.uid AS uid, ac.date AS date
FROM action_comment ac
WHERE ac.date = '2008-06-03'
) actions JOIN users u ON(u.id = actions.uid);
More: brain teaser
Task: for each sex, find the employee with the maximum age.

employee:
name     sex     age
michael  male    30
will     male    35
shelley  female  27
lucy     female  57
steven   male    30

Expected output:
sex_age.sex  age  name
female       57   lucy
male         35   will

See : http://willddy.github.io/2014/12/23/Hive-Get-MAX-MIN-Value-Rows.html

Solution
SELECT employee.sex_age.sex, employee.sex_age.age, name
FROM employee JOIN
(
SELECT max(sex_age.age) as max_age, sex_age.sex as sex
FROM employee
GROUP BY sex_age.sex
) maxage
ON employee.sex_age.age = maxage.max_age
AND employee.sex_age.sex = maxage.sex;
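With Hive 0.11 or later, windowing functions give an alternative one-pass solution; a sketch assuming the same employee table with its sex_age struct column:

```sql
SELECT sex, age, name
FROM (
  SELECT sex_age.sex AS sex, sex_age.age AS age, name,
         -- rank ties: every row sharing the max age would be returned
         rank() OVER (PARTITION BY sex_age.sex ORDER BY sex_age.age DESC) AS r
  FROM employee
) t
WHERE r = 1;
```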
Exercise 2
• Description: using Taiwan's actual-price registration data, compute the highest and lowest house prices
• Source:
http://plvr.land.moi.gov.tw//Download?type=zip&fileName=lvr_landcsv.zip
Exercise 2
$ cd ~/hadoop_example/hive/ex2
$ ./get_taiwan_landprice.sh
$ hadoop fs -rmr ./hive_2_input
$ hadoop fs -put input ./hive_2_input
$ hive
hive>
Ex1: find the top 5 highest total prices
Ex2: find the top 5 highest prices per ping

Sample output (Ex1, first rows):
文山區 臺北市文山區汀州路四段181~210號 54700000
文山區 臺北市文山區新光路一段65巷31~60號 28500000
文山區 臺北市文山區政大二街121~150號 11600000
Sample output (Ex2, first rows):
文山區 臺北市文山區汀州路四段181~210號 213981 54700000
文山區 臺北市文山區新光路一段65巷31~60號 129398 28500000
文山區 臺北市文山區政大二街121~150號 100112 11600000
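One possible way to start Ex1, assuming the input has been trimmed to three columns (district, address, total price) and landed at /user/hadoop/hive_2_input on HDFS; the table and column names below are hypothetical, and the real lvr_landcsv files carry many more columns:

```sql
-- External table over the uploaded CSVs; dropping it leaves the files in place
CREATE EXTERNAL TABLE land (district STRING, address STRING, price BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/hive_2_input';

-- Top 5 by total price
SELECT district, address, price
FROM land
ORDER BY price DESC
LIMIT 5;
```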
http://hive.3du.me/
Reference
• Official Hive tutorial
– https://cwiki.apache.org/confluence/display/Hive/Tutorial
• Hive exercises
– http://hive.3du.me/
Editor's Notes
• #7 http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/
• #14–#18 https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-UsageandExamples
• #20 Data-preparation script for Exercise 2:
mkdir source; cd source
wget "http://plvr.land.moi.gov.tw//Download?type=zip&fileName=lvr_landcsv.zip" -O lvr_landcsv.zip
unzip lvr_landcsv.zip
mkdir ../input
for i in $(ls ./*.CSV); do iconv -c -f big5 -t utf8 $i -o $i".utf8"; done
mv *.utf8 ../input/
cd ../input/
rm *_BUILD.CSV.utf8 *_LAND.CSV.utf8 *_PARK.CSV.utf8
  • #21 A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3); B = LOAD 'B.txt' ; Y = FILTER A BY f1 == '8'; Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1)); ====== (1,{(1,2,3)}) (4,{(4,3,3),(4,2,1)}) (7,{(7,2,5)}) (8,{(8,4,3),(8,3,4)}) ====== X = GROUP A BY f1; ==== (1,{(1,2,3)}) (4,{(4,3,3),(4,2,1)}) (7,{(7,2,5)}) (8,{(8,4,3),(8,3,4)}) ==== Projection X = FOREACH A GENERATE f1, f2; ==== 1,2) (4,2) (8,3) (4,3) (7,2) (8,4) ==== X = FOREACH A GENERATE f1+f2 as sumf1f2; Y = FILTER X by sumf1f2 > 5.0; ===== (6.0) (11.0) (7.0) (9.0) (12.0) ===== C = COGROUP A BY $0 INNER, B BY $0 INNER; ==== (1,{(1,2,3)},{(1,3)}) (4,{(4,3,3),(4,2,1)},{(4,9),(4,6)}) (8,{(8,4,3),(8,3,4)},{(8,9)}) ====