2. AGENDA
• About this presentation
• Refreshing memory
• Homework micro project
• Some basic HiveQL in connection with the homework
• Tips and tricks
• Q and A
3. REFRESHING THE MEMORY ...
• Data warehouse infrastructure built on Hadoop
• Initially developed by Facebook
• Access any type of data through an SQL-like interface
• HiveQL: subset/extension of standard SQL
• Converts HiveQL (SQL-like) queries to MapReduce, Tez, or Spark jobs
• Not suitable for OLTP
• Not a relational database
• Not performant with small amounts of data
• Internal vs. external tables
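The internal/external distinction in one sketch (table names and the location path are illustrative): dropping a managed (internal) table deletes its data files, while dropping an external table leaves the files in place.

```sql
-- Managed (internal): Hive owns the data; DROP TABLE removes the files.
CREATE TABLE t_internal (id int);

-- External: Hive owns only the metadata; DROP TABLE keeps the files.
CREATE EXTERNAL TABLE t_external (id int)
LOCATION '/data/landing/t_external';
```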
4. HOMEWORK – THE USE CASE
[Diagram: (1) fund transaction data arrives from the Mainframe into the Hadoop cluster, where Hive exposes it as the fund transactions external table; (2) the data is loaded into the Funds internal table and the partitioned Fund tr. archive internal table; (3) a reporting application on the Linux FS produces the TOP 5 Funds report.]
5. HOMEWORK – REPORT SAMPLE
TOP 5 FUNDS SOLD BY FUND MANAGER
Date: 2017-01-23
AEGON
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000707401 AEGON Russia Részvény Befektetési Alap 837088447
HU0000710843 AEGON Lengyel Részvény Befektetési Alap B sorozat 724418208
HU0000713144 AEGON Russia Részvény Befektetési Alap PI sorozat 297676842
HU0000710157 AEGON Russia Részvény Befektetési Alap P sorozat 213092436
HU0000709514 AEGON Russia Részvény Befektetési Alap I sorozat 137271573
ERSTE
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000708656 ERSTE Abszolút Hozamú Eszközallokációs Alapok Alapja 741018249
HU0000708631 ERSTE DPM Globális Részvény Alapok Alapja 734481241
HU0000701537 Erste Nyíltvégű Közép-Európai Részvény Alapok Alapja 512927455
HU0000704200 Erste Stock Hungary Indexkövető Részvény Befektetési Alap 147623040
HU0000712492 Erste Stock Global HUF Alapok Alapja 124933138
OTP
Fund ID Fund Name Amt(HUF)
----------------------------------------------------------------------------------------
HU0000709084 OTP Orosz Részvény Alap B sorozat 964596471
HU0000709092 OTP Orosz Részvény Alap C sorozat 709871151
HU0000704960 OTP Tőzsdén Kereskedett BUX Indexkövető Alap 685225834
HU0000705561 OTP Planéta Feltörekvő Piaci Részvény Alapok Alapja B sorozat 220627446
HU0000709019 OTP Orosz Részvény Alap A sorozat 59372424
....
(Callout: for each fund manager, the top 5 funds sold today, sorted by sum of amount descending.)
6. HOMEWORK TASKS
1. Create in Hive:
• the database,
• the FUNDS_TRANSACTIONS external table,
• the FUNDS and FUNDS_TR_ARCHIVE internal tables, partitioned by transaction year,
• optionally index(es),
• optionally views.
2. Load the internal tables from the homework data files.
3. Create a single query, using analytic functions, that joins the tables and produces the report data.
4. Create the reporting program that produces the report in the format described on the previous slide.
5. Optionally: schedule the run in cron or Oozie at 18:00 every day.
Send all HiveQL source and report data to me.
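Task 3 could be sketched along these lines (a hypothetical sketch only; the table and column names are assumptions, not the actual homework schema):

```sql
-- Hypothetical sketch of the report query: for each fund manager,
-- rank funds by total amount sold today and keep the top 5.
SELECT fund_manager, fund_id, fund_name, total_amt
FROM (
  SELECT f.fund_manager, f.fund_id, f.fund_name,
         SUM(t.amount) AS total_amt,
         RANK() OVER (PARTITION BY f.fund_manager
                      ORDER BY SUM(t.amount) DESC) AS rnk
  FROM funds f
  JOIN funds_transactions t ON f.fund_id = t.fund_id
  WHERE t.tr_date = CURRENT_DATE
  GROUP BY f.fund_manager, f.fund_id, f.fund_name
) ranked
WHERE rnk <= 5;
```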
7. BEELINE
• Beeline – Command Line Shell
beeline -u jdbc:hive2://localhost:10000/default -n scott -p tiger --color=true
Get the list of internal commands (save commands as a script, list the indexes of a table, list all tables, run a script from a file, ...):
0: jdbc:hive2://localhost:10000/default> !help
Idea for the homework:
beeline -u jdbc:hive2://localhost:10000/default -n scott -p tiger -f homework.hql
--outputformat=csv2 --showHeader=false --silent --showWarnings=false
| python report.py > report.txt
8. OFF TOPIC - PYTHON
...
Workaround for left-positioning UTF-8 text in Python 2.x (report.py):
print ... (90 - len(fields[1].decode('utf8'))) * ' ' ...
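The idea behind the workaround, written out as a minimal self-contained sketch (the field width 90 and the sample fund name are illustrative; the code below is Python 3 compatible):

```python
# Left-align a UTF-8 encoded field in a fixed-width column.
# In Python 2, len() on a UTF-8 byte string counts bytes, not characters,
# so the text must be decoded before measuring its width.
def pad_field(raw_bytes, width=90):
    text = raw_bytes.decode('utf8')          # measure characters, not bytes
    return text + (width - len(text)) * ' '  # pad to the column width

name = 'AEGON Russia Részvény Befektetési Alap'.encode('utf8')
padded = pad_field(name, 50)
print(len(padded))  # 50 characters, regardless of multi-byte letters
```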
9. CREATION DDL
Creating database:
• LOCATION hdfs_path
Creating table (official documentation):
• <managed> / EXTERNAL /TEMPORARY
• PARTITIONED BY
• STORED AS : use ORC or PARQUET whenever possible, 3-4x faster than TEXT
• TBLPROPERTIES : "orc.compress"="ZLIB" (or "SNAPPY"/"NONE")
• LOCATION
Note: PK and FK constraints can be defined, but are not enforced; they serve as metadata hints for optimizers.
Creating an index (avoid when the file format already contains indexes, e.g. ORC):
• COMPACT/BITMAP
• <automatic>, WITH DEFERRED REBUILD (ALTER INDEX ... REBUILD)
Note: it is possible to rebuild index by partition
Notes for creating a view :
• no materialized views
• schema frozen
• filter push down
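Putting the options above together (an illustrative sketch only; the database, table, and column names are made up, not the homework schema):

```sql
-- An external table over raw text files, and a managed ORC table
-- partitioned by year with ZLIB compression.
CREATE DATABASE IF NOT EXISTS funds_db
LOCATION '/apps/hive/warehouse/funds_db.db';

CREATE EXTERNAL TABLE funds_db.funds_transactions (
  fund_id string,
  amount  bigint,
  tr_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/funds/transactions';

CREATE TABLE funds_db.funds_tr_archive (
  fund_id string,
  amount  bigint
)
PARTITIONED BY (tr_year int)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
```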
10. ADDING PARTITION
Creating the partitioned table:
CREATE TABLE usr_part(id int, name string) PARTITIONED BY (entry_year int);
Static mode:
INSERT INTO usr_part PARTITION(entry_year=2016)
SELECT id, name FROM dyn_part_source;
Or
ALTER TABLE usr_part ADD PARTITION (entry_year=1987);
Dynamic mode (does not work with LOAD DATA):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM dyn_part_source dps
INSERT OVERWRITE TABLE usr_part PARTITION(entry_year)
SELECT dps.id, dps.name, dps.entry_year;
Directory structure:
/apps/hive/warehouse/mydb.db/usr_part
/entry_year=1987
000000_0
/entry_year=2008
000000_0
/entry_year=2009
000000_0
11. SKIPPING HEADER AT LOADING TABLE
Skipping header:
Solution 1: the Hive table skips the first row (not recommended for partitioned tables):
CREATE TABLE ....
TBLPROPERTIES ("skip.header.line.count"="1")
Solution 2: pre-processing the file with tail:
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Solution 3: pre-processing the file with sed (in place):
sed -i 1d filename.csv
Warning: LOAD DATA (from HDFS) moves the source files; LOAD DATA LOCAL copies them.
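The tail variant can be checked quickly on a throwaway file (the /tmp paths and sample rows are illustrative):

```shell
# Create a small CSV with a header row, then strip the header with tail.
printf 'fund_id,amount\nHU001,100\nHU002,200\n' > /tmp/withfirstrow.csv
tail -n +2 /tmp/withfirstrow.csv > /tmp/withoutfirstrow.csv
cat /tmp/withoutfirstrow.csv   # prints only the two data rows
```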
12. RANK() ANALYTICAL (=WINDOWING) FUNCTION
Analytic functions by Example
Best earning employees by department
SELECT id, name, deptid, salary, rank() OVER (PARTITION BY deptid ORDER BY salary DESC) as rank
FROM employee;
+-----+-------+---------+---------+-------+--+
| id | name | deptid | salary | rank |
+-----+-------+---------+---------+-------+--+
| 4 | D | 1 | 4000 | 1 |
| 3 | C | 1 | 3000 | 2 |
| 5 | E | 1 | 2500 | 3 |
| 2 | B | 1 | 2000 | 4 |
| 6 | F | 1 | 1500 | 5 |
| 1 | A | 1 | 1000 | 6 |
| 11 | K | 2 | 5000 | 1 |
| 7 | G | 2 | 2500 | 2 |
| 9 | I | 2 | 2300 | 3 |
| 10 | J | 2 | 1800 | 4 |
| 12 | L | 2 | 1600 | 5 |
| 8 | H | 2 | 1400 | 6 |
+-----+-------+---------+---------+-------+--+
13. RANK() ANALYTICAL (=WINDOWING) FUNCTION
Top 3 best earning employees by department
SELECT id, name, deptid, salary, rank
FROM (
SELECT id, name, deptid, salary,
rank() OVER (PARTITION BY deptid ORDER BY salary desc) as rank
from employee
) ranked_table
WHERE ranked_table.rank <=3;
+-----+-------+---------+---------+-------+--+
| id | name | deptid | salary | rank |
+-----+-------+---------+---------+-------+--+
| 4 | D | 1 | 4000 | 1 |
| 3 | C | 1 | 3000 | 2 |
| 5 | E | 1 | 2500 | 3 |
| 11 | K | 2 | 5000 | 1 |
| 7 | G | 2 | 2500 | 2 |
| 9 | I | 2 | 2300 | 3 |
+-----+-------+---------+---------+-------+--+
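The same top-N-per-group logic, sketched outside Hive in plain Python as a cross-check (the data is copied from the example table; note that unlike rank(), this sketch does not handle salary ties):

```python
# Reproduce "top 3 earners per department", mirroring
# rank() OVER (PARTITION BY deptid ORDER BY salary DESC).
employees = [
    (1, 'A', 1, 1000), (2, 'B', 1, 2000), (3, 'C', 1, 3000),
    (4, 'D', 1, 4000), (5, 'E', 1, 2500), (6, 'F', 1, 1500),
    (7, 'G', 2, 2500), (8, 'H', 2, 1400), (9, 'I', 2, 2300),
    (10, 'J', 2, 1800), (11, 'K', 2, 5000), (12, 'L', 2, 1600),
]

def top_n_by_dept(rows, n=3):
    by_dept = {}
    for row in rows:
        by_dept.setdefault(row[2], []).append(row)   # group by deptid
    result = []
    for dept in sorted(by_dept):
        ranked = sorted(by_dept[dept], key=lambda r: -r[3])  # salary desc
        result.extend(ranked[:n])                    # keep the top n
    return result

for emp_id, name, deptid, salary in top_n_by_dept(employees):
    print(emp_id, name, deptid, salary)
```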
14. TIPS AND TRICKS - COMPLEX TYPES
• First: create a dummy table with exactly one row:
CREATE TABLE dual(x int) TBLPROPERTIES("immutable"="true");
INSERT INTO TABLE dual values (1);
• Structs
CREATE TABLE phonebook_struct(
name string,
phones struct<phone_type:string,phone_number:string>
);
INSERT INTO TABLE phonebook_struct
SELECT
'Tercsi',
NAMED_STRUCT('phone_type','home','phone_number','98347598374')
FROM dual;
SELECT phones.phone_number
FROM phonebook_struct
WHERE name='Tercsi' AND phones.phone_type='home';
15. TIPS AND TRICKS - COMPLEX TYPES
• Maps (key-value tuples)
CREATE TABLE phonebook_map(name string, phones map<string,string>);
INSERT INTO TABLE phonebook_map
SELECT 'Tercsi', str_to_map("home:348756348756") FROM dual;
SELECT phones['home'] from phonebook_map WHERE name='Tercsi';
• Arrays (indexable lists)
CREATE TABLE phonebook_array(name string, phones array<string>);
INSERT INTO phonebook_array
SELECT 'Tercsi', array('12345','678')
FROM dual;
SELECT phones[0]
FROM phonebook_array
WHERE name='Tercsi';
SELECT SORT_ARRAY(work_place)
FROM employee
WHERE ARRAY_CONTAINS(work_place, 'Montreal');
• Union (support is incomplete; use only for inspection)
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
16. TIPS AND TRICKS
Merging split files on HDFS into a single local file:
hdfs dfs -getmerge /user/andras/text_import /tmp/test
Query result directly to HDFS:
Create:
hive -e 'select * from andras.titanic' | hdfs dfs -put -f - /user/andras/t
Append:
hive -e 'select * from andras.titanic' | hdfs dfs -appendToFile - /user/andras/t
Beeline:
insert overwrite directory '/user/hive/t5' select * from titanic;
Backup and restore (with metadata):
export table titanic to '/tmp/backup';
import table titanic_imported from '/tmp/backup';
17. TIPS AND TRICKS
Current time:
select from_unixtime(unix_timestamp()) as current_time from employee limit 1;
Difference in days
select (UNIX_TIMESTAMP('2015-01-21 18:00:00') -
UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24 as daydiff
FROM employee LIMIT 1;
Converting timestamp to date:
select TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) AS curr_date
FROM employee LIMIT 1;
18. TIPS AND TRICKS
Workaround for "NullPointerException" at index rebuild:
set hive.execution.engine=mr;
alter index … rebuild;
set hive.execution.engine=tez;
Adding a third-party SerDe for advanced CSV processing:
add jar /home/andras_feher/csv-serde-1.1.2-0.11.0-all.jar;
create table airports( ... )
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
....;
Editor's Notes
Schema frozen: select * from
allow developers to perform tasks in SQL that were previously confined to procedural languages