2. AGENDA
• About this presentation
• Refreshing memory
• Homework micro project
• Some basic HiveQL in connection with the homework
• Tips and tricks
• Q and A
3. REFRESHING THE MEMORY ...
• Data warehouse infrastructure built on Hadoop
• Initially developed by Facebook
• Access any type of data through an SQL-like interface
• HiveQL: subset/extension of standard SQL
• Converts HiveQL (SQL-like) queries to MapReduce, Tez, or Spark jobs
• Not suitable for OLTP
• Not a relational database
• Not performant with small amounts of data
• Internal vs. external tables
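The internal/external distinction in one sketch (table names and the location path are illustrative): dropping a managed (internal) table deletes its data files, while dropping an external table leaves the files in place.

```sql
-- Managed (internal): Hive owns the data; DROP TABLE removes the files.
CREATE TABLE t_internal (id int);

-- External: Hive owns only the metadata; DROP TABLE keeps the files.
CREATE EXTERNAL TABLE t_external (id int)
LOCATION '/data/landing/t_external';
```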
4. HOMEWORK – THE USE CASE
[Diagram: (1) fund transaction data arrives from the Mainframe into the Hadoop cluster, where Hive exposes it as the fund transactions external table; (2) the data is loaded into the Funds internal table and the partitioned Fund tr. archive internal table; (3) a reporting application on the Linux FS produces the TOP 5 Funds report.]
5. HOMEWORK – REPORT SAMPLE
TOP 5 FUNDS SOLD BY FUND MANAGER
Date: 2017-01-23
AEGON
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000707401 AEGON Russia Részvény Befektetési Alap 837088447
HU0000710843 AEGON Lengyel Részvény Befektetési Alap B sorozat 724418208
HU0000713144 AEGON Russia Részvény Befektetési Alap PI sorozat 297676842
HU0000710157 AEGON Russia Részvény Befektetési Alap P sorozat 213092436
HU0000709514 AEGON Russia Részvény Befektetési Alap I sorozat 137271573
ERSTE
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000708656 ERSTE Abszolút Hozamú Eszközallokációs Alapok Alapja 741018249
HU0000708631 ERSTE DPM Globális Részvény Alapok Alapja 734481241
HU0000701537 Erste Nyíltvégű Közép-Európai Részvény Alapok Alapja 512927455
HU0000704200 Erste Stock Hungary Indexkövető Részvény Befektetési Alap 147623040
HU0000712492 Erste Stock Global HUF Alapok Alapja 124933138
OTP
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000709084 OTP Orosz Részvény Alap B sorozat 964596471
HU0000709092 OTP Orosz Részvény Alap C sorozat 709871151
HU0000704960 OTP Tőzsdén Kereskedett BUX Indexkövető Alap 685225834
HU0000705561 OTP Planéta Feltörekvő Piaci Részvény Alapok Alapja B sorozat 220627446
HU0000709019 OTP Orosz Részvény Alap A sorozat 59372424
....
(Callout: for each fund manager, the top 5 funds sold today, sorted by sum of amount descending.)
6. HOMEWORK TASKS
1. Create in Hive:
• the database,
• the FUNDS_TRANSACTIONS external table,
• the FUNDS and FUNDS_TR_ARCHIVE internal tables, partitioned by transaction year,
• optionally index(es),
• optionally views.
2. Load the internal tables from the homework data files.
3. Create a single query, using analytic functions, that joins the tables and produces the report data.
4. Create the reporting program that produces the report in the format described on the previous slide.
5. Optionally: schedule the run in cron or Oozie at 18:00 every day.
Send all HiveQL source and report data to me.
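Task 3 could be sketched along these lines (a hypothetical sketch only; the table and column names are assumptions, not the actual homework schema):

```sql
-- Hypothetical sketch of the report query: for each fund manager,
-- rank funds by total amount sold today and keep the top 5.
SELECT fund_manager, fund_id, fund_name, total_amt
FROM (
  SELECT f.fund_manager, f.fund_id, f.fund_name,
         SUM(t.amount) AS total_amt,
         RANK() OVER (PARTITION BY f.fund_manager
                      ORDER BY SUM(t.amount) DESC) AS rnk
  FROM funds f
  JOIN funds_transactions t ON f.fund_id = t.fund_id
  WHERE t.tr_date = CURRENT_DATE
  GROUP BY f.fund_manager, f.fund_id, f.fund_name
) ranked
WHERE rnk <= 5;
```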
7. BEELINE
• Beeline – Command Line Shell
beeline -u jdbc:hive2://localhost:10000/default -n scott -p tiger --color=true
Get the list of internal commands (save commands as a script, list the indexes of a table, list all tables, run a script from a file, ...):
0: jdbc:hive2://localhost:10000/default> !help
Idea for the homework:
beeline -u jdbc:hive2://localhost:10000/default -n scott -p tiger -f homework.hql
--outputformat=csv2 --showHeader=false --silent --showWarnings=false
| python report.py > report.txt
8. OFF TOPIC - PYTHON
...
Workaround for left-positioning UTF-8 text in Python 2.x (report.py):
print ... (90 - len(fields[1].decode('utf8'))) * ' ' ...
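The idea behind the workaround, written out as a minimal self-contained sketch (the field width 90 and the sample fund name are illustrative; the code below is Python 3 compatible):

```python
# Left-align a UTF-8 encoded field in a fixed-width column.
# In Python 2, len() on a UTF-8 byte string counts bytes, not characters,
# so the text must be decoded before measuring its width.
def pad_field(raw_bytes, width=90):
    text = raw_bytes.decode('utf8')          # measure characters, not bytes
    return text + (width - len(text)) * ' '  # pad to the column width

name = 'AEGON Russia Részvény Befektetési Alap'.encode('utf8')
padded = pad_field(name, 50)
print(len(padded))  # 50 characters, regardless of multi-byte letters
```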
9. CREATION DDL
Creating database:
• LOCATION hdfs_path
Creating table (official documentation):
• <managed> / EXTERNAL /TEMPORARY
• PARTITIONED BY
• STORED AS : use ORC or PARQUET whenever possible, 3-4x faster than TEXT
• TBLPROPERTIES : "orc.compress"="ZLIB" (or "SNAPPY"/"NONE")
• LOCATION
Note: PK and FK constraints can be defined, but are not enforced; they serve as metadata hints for optimizers.
Creating an index (avoid when the file format already contains indexes, e.g. ORC):
• COMPACT/BITMAP
• <automatic>, WITH DEFERRED REBUILD (ALTER INDEX ... REBUILD)
Note: it is possible to rebuild index by partition
Notes for creating a view :
• no materialized views
• schema frozen
• filter push down
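Putting the options above together (an illustrative sketch only; the database, table, and column names are made up, not the homework schema):

```sql
-- An external table over raw text files, and a managed ORC table
-- partitioned by year with ZLIB compression.
CREATE DATABASE IF NOT EXISTS funds_db
LOCATION '/apps/hive/warehouse/funds_db.db';

CREATE EXTERNAL TABLE funds_db.funds_transactions (
  fund_id string,
  amount  bigint,
  tr_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/funds/transactions';

CREATE TABLE funds_db.funds_tr_archive (
  fund_id string,
  amount  bigint
)
PARTITIONED BY (tr_year int)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
```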
10. ADDING PARTITION
Creating the partitioned table:
CREATE TABLE usr_part(id int, name string) PARTITIONED BY (entry_year int);
Static mode:
INSERT INTO usr_part PARTITION(entry_year=2016)
SELECT id, name FROM dyn_part_source;
Or
ALTER TABLE usr_part ADD PARTITION (entry_year=1987);
Dynamic mode (does not work with LOAD DATA):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM dyn_part_source dps
INSERT OVERWRITE TABLE usr_part PARTITION(entry_year)
SELECT dps.id, dps.name, dps.entry_year;
Directory structure:
/apps/hive/warehouse/mydb.db/usr_part
/entry_year=1987
000000_0
/entry_year=2008
000000_0
/entry_year=2009
000000_0
11. SKIPPING HEADER AT LOADING TABLE
Skipping header:
Solution 1: the Hive table skips the first row (not recommended for partitioned tables):
CREATE TABLE ....
TBLPROPERTIES ("skip.header.line.count"="1")
Solution 2: pre-processing the file with tail:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Solution 3: pre-processing the file with sed (in place):
sed -i 1d filename.csv
Warning: LOAD DATA (from HDFS) moves the source files; LOAD DATA LOCAL copies them.
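The tail variant can be checked quickly on a throwaway file (the /tmp paths and sample rows are illustrative):

```shell
# Create a small CSV with a header row, then strip the header with tail.
printf 'fund_id,amount\nHU001,100\nHU002,200\n' > /tmp/withfirstrow.csv
tail -n +2 /tmp/withfirstrow.csv > /tmp/withoutfirstrow.csv
cat /tmp/withoutfirstrow.csv   # prints only the two data rows
```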
12. RANK() ANALYTICAL (=WINDOWING) FUNCTION
Analytic functions by Example
Best earning employees by department
SELECT id, name, deptid, salary, rank() OVER (PARTITION BY deptid ORDER BY salary DESC) as rank
FROM employee;
+-----+-------+---------+---------+-------+--+
| id | name | deptid | salary | rank |
+-----+-------+---------+---------+-------+--+
| 4 | D | 1 | 4000 | 1 |
| 3 | C | 1 | 3000 | 2 |
| 5 | E | 1 | 2500 | 3 |
| 2 | B | 1 | 2000 | 4 |
| 6 | F | 1 | 1500 | 5 |
| 1 | A | 1 | 1000 | 6 |
| 11 | K | 2 | 5000 | 1 |
| 7 | G | 2 | 2500 | 2 |
| 9 | I | 2 | 2300 | 3 |
| 10 | J | 2 | 1800 | 4 |
| 12 | L | 2 | 1600 | 5 |
| 8 | H | 2 | 1400 | 6 |
+-----+-------+---------+---------+-------+--+
13. RANK() ANALYTICAL (=WINDOWING) FUNCTION
Top 3 best earning employees by department
SELECT id, name, deptid, salary, rank
FROM (
SELECT id, name, deptid, salary,
rank() OVER (PARTITION BY deptid ORDER BY salary desc) as rank
from employee
) ranked_table
WHERE ranked_table.rank <=3;
+-----+-------+---------+---------+-------+--+
| id | name | deptid | salary | rank |
+-----+-------+---------+---------+-------+--+
| 4 | D | 1 | 4000 | 1 |
| 3 | C | 1 | 3000 | 2 |
| 5 | E | 1 | 2500 | 3 |
| 11 | K | 2 | 5000 | 1 |
| 7 | G | 2 | 2500 | 2 |
| 9 | I | 2 | 2300 | 3 |
+-----+-------+---------+---------+-------+--+
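The same top-N-per-group logic, sketched outside Hive in plain Python as a cross-check (the data is copied from the example table; note that unlike rank(), this sketch does not handle salary ties):

```python
# Reproduce "top 3 earners per department", mirroring
# rank() OVER (PARTITION BY deptid ORDER BY salary DESC).
employees = [
    (1, 'A', 1, 1000), (2, 'B', 1, 2000), (3, 'C', 1, 3000),
    (4, 'D', 1, 4000), (5, 'E', 1, 2500), (6, 'F', 1, 1500),
    (7, 'G', 2, 2500), (8, 'H', 2, 1400), (9, 'I', 2, 2300),
    (10, 'J', 2, 1800), (11, 'K', 2, 5000), (12, 'L', 2, 1600),
]

def top_n_by_dept(rows, n=3):
    by_dept = {}
    for row in rows:
        by_dept.setdefault(row[2], []).append(row)   # group by deptid
    result = []
    for dept in sorted(by_dept):
        ranked = sorted(by_dept[dept], key=lambda r: -r[3])  # salary desc
        result.extend(ranked[:n])                    # keep the top n
    return result

for emp_id, name, deptid, salary in top_n_by_dept(employees):
    print(emp_id, name, deptid, salary)
```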
14. TIPS AND TRICKS - COMPLEX TYPES
• First: create a dummy table with exactly one row:
CREATE TABLE dual(x int) TBLPROPERTIES("immutable"="true");
INSERT INTO TABLE dual values (1);
• Structs
CREATE TABLE phonebook_struct(
name string,
phones struct<phone_type:string,phone_number:string>
);
INSERT INTO TABLE phonebook_struct
SELECT
'Tercsi',
NAMED_STRUCT('phone_type','home','phone_number','98347598374')
FROM dual;
SELECT phones.phone_number
FROM phonebook_struct
WHERE name='Tercsi' AND phones.phone_type='home';
15. TIPS AND TRICKS - COMPLEX TYPES
• Maps (key-value tuples)
CREATE TABLE phonebook_map(name string, phones map<string,string>);
INSERT INTO TABLE phonebook_map
SELECT 'Tercsi', str_to_map("home:348756348756") FROM dual;
SELECT phones['home'] from phonebook_map WHERE name='Tercsi';
• Arrays (indexable lists)
CREATE TABLE phonebook_array(name string, phones array<string>);
INSERT INTO phonebook_array
SELECT 'Tercsi', array('12345','678')
FROM dual;
SELECT phones[0]
FROM phonebook_array
WHERE name='Tercsi';
SELECT SORT_ARRAY(work_place)
FROM employee
WHERE ARRAY_CONTAINS(work_place, 'Montreal');
• Union (support is incomplete; use only for inspection)
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
16. TIPS AND TRICKS
Merging split files on HDFS into a single local file:
hdfs dfs -getmerge /user/andras/text_import /tmp/test
Query result directly to HDFS:
Create:
hive -e 'select * from andras.titanic' | hdfs dfs -put -f - /user/andras/t
Append:
hive -e 'select * from andras.titanic' | hdfs dfs -appendToFile - /user/andras/t
Beeline:
insert overwrite directory '/user/hive/t5' select * from titanic;
Backup and restore (with metadata):
export table titanic to '/tmp/backup';
import table titanic_imported from '/tmp/backup';
17. TIPS AND TRICKS
Current time:
select from_unixtime(unix_timestamp()) as current_time from employee limit 1;
Difference in days
select (UNIX_TIMESTAMP('2015-01-21 18:00:00') -
UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24 as daydiff
FROM employee LIMIT 1;
Converting timestamp to date:
select TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) AS curr_date
FROM employee LIMIT 1;
18. TIPS AND TRICKS
Workaround for "NullPointerException" at index rebuild:
set hive.execution.engine=mr;
alter index … rebuild;
set hive.execution.engine=tez;
Adding a third-party SerDe for advanced CSV processing:
add jar /home/andras_feher/csv-serde-1.1.2-0.11.0-all.jar;
create table airports( ... )
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
....;
Editor's Notes
Schema frozen: select * from
allow developers to perform tasks in SQL that were previously confined to procedural languages