Hive - Apache hadoop Bigdata training by Desing Pathshala

Apache Hadoop
Design Pathshala
April 22, 2014
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com |
http://designpathshala.com
1

Hive
 Developed at Facebook
 Used for majority of Facebook jobs
 “Relational database” built on Hadoop
 Maintains list of table schemas
 SQL-like query language (HiveQL)
 Supports table partitioning, clustering, complex data types, some optimizations
2

Apache Hadoop Bigdata
Training By Design Pathshala
Contact us on: admin@designpathshala.com
Or Call us at: +91 120 260 5512 or +91 98 188 23045
Visit us at: http://designpathshala.com
3

Why Another Data Warehousing System?
 Problem : Data, data and more data
 Several TBs of data everyday
 The Hadoop Experiment:
 Uses Hadoop File System (HDFS)
 Scalable/Available
 Problem
 Long development life cycle
 Map-Reduce hard to program
 Solution : HIVE
4

What is HIVE?
 A system for managing and querying unstructured data
as if it were structured
 Uses Map-Reduce for execution
 HDFS for Storage
5

Or Call us at: +91 120 260 5512 or +91 98 188 23045
6

Word Count
 Instead of 65 line java code lets try hive.
 create table doc(
text string
) row format delimited fields terminated by 'n' stored as
textfile;
 Load Data inpath ‘docs’ overwrite into table doc;
 SELECT word, COUNT(*) FROM doc LATERAL VIEW
explode(split(text, ' ')) lTable as word GROUP BY word;
7

Or Call us at: +91 120 260 5512 or +91 98 188 23045
8

Type System
 Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT.
– Boolean: BOOLEAN.
– Floating point numbers: FLOAT, DOUBLE .
– String: STRING.
– Timestamp (Unix epoch seconds)
 Complex types
– Structs: {a INT; b INT}. Name.a returns a’s value
– Maps: M[‘key'] returns value
– Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
9

Data Model- Tables
 Tables
 Analogous to tables in relational DBs.
 Each table has corresponding directory in HDFS.
 Example
 Table “designpathshala” could hold its data inside HDFS
directory
 /com/designpathshala
10

Creating a Hive Table
CREATE TABLE designpathshala_employees(
name STRING,
Salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING,FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
COMMENT 'This is the page view table'
PARTITIONED BY(department STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY'001'
COLLECTION ITEMS TERMINATED BY '002'
MAP KEYS TERMINATED BY '003'
LINES TERMINATED BY 'n’
STORED AS TEXTFILE;
11

Or Call us at: +91 120 260 5512 or +91 98 188 23045
12

Hive Table Data
John DoeÂ100000.0ÂMary Smith^BTodd JonesÂFederal Taxes^C.2^BState
Taxes^C.05^BInsurance^C.1Â1 Michigan Ave.^BChicago^BIL^B60600
Mary SmithÂ80000.0ÂBill KingÂFederal Taxes^C.2^BState Taxes^C.
05^BInsurance^C.1Â100 Ontario St.^BChicago^BIL^B60601
Todd JonesÂ70000.0ÂFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.
1Â200 Chicago Ave.^BOak Park^BIL^B60700
Bill KingÂ60000.0ÂFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.
1Â300 Obscure Dr.^BObscuria^BIL^B60100
13

Hive Table JSON Format
{
"name": "John Doe",
"salary": 100000.0,
"subordinates": ["Mary Smith", "Todd Jones"],
"deductions": {
"Federal Taxes": .2,
"State Taxes": .05,
"Insurance": .1
},
"address": {
"street": "1 Michigan Ave.",
"city": "Chicago",
"state": "IL",
"zip": 60600
}
}
14

Database
 CREATE DATABASE desingpathshala;
 CREATE DATABASE IF NOT EXISTS designpathshala;
 SHOW DATABASES;
 SHOW DATABASES LIKE ‘d.*’;
 Default location is: /user/hive/warehouse/{databasename}.db
 Its configured by property hive.metastore.warehouse.dir
 CREATE DATABASE designpathshala LOCATION ‘my/preferred/location’;
 CREATE DATABASE desighpathshala COMMENT ‘it holds data related to
desing Pathshala institute’;
15

Or Call us at: +91 120 260 5512 or +91 98 188 23045
16

Database
 DESCRIBE DATABASE designpathshala;
 DESCRIBE DATABASE EXTENDED designpathshala;
 Set hive.cli.print.current.db=true;
 DROP DATABASE IF EXISTS designpathshala;
17

Tables
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'=‘dp', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
18

Tables
 CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;
 SHOW TABLES;
 SHOW TABLES IN mydb;
 SHOW TABLES ‘desi.*’;
 DESCRIBE mytable;
 DESCRIBE EXTENDED mytable;
19

Managed Tables or Internal Tables
 When location is not defined
 Tables crated in default warehouse directory
 When we drop table hive deletes data in table
20

Or Call us at: +91 120 260 5512 or +91 98 188 23045
21

External Tables
 Point to existing data directories in HDFS
 Can create table and partitions
 Data is assumed to be in Hive-compatible format
 Dropping external table drops only the metadata
22

External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
symbol varchar(100),
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
tradeDate date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
23

Or Call us at: +91 120 260 5512 or +91 98 188 23045
24

External Tables
 CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
25

Partition
CREATE TABLE employees (
name STRING,
salary FLOAT,
deductions MAP<STRING, FLOAT>,
)
PARTITIONED BY (country STRING, state STRING);
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
...
26

Partition
 SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
hive> set hive.mapred.mode=strict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
27

Partition
hive> SHOW PARTITIONS employees;
...
Country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...
28

Partition External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't';
ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
29

Or Call us at: +91 120 260 5512 or +91 98 188 23045
30

Partition External Tables
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
...
hive> DESCRIBE EXTENDED log_messages;
...
message string,
year int,
month int,
day int
Detailed Table Information...
partitionKeys:[FieldSchema(name:year, type:int, comment:null),
FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null)],
...
31

Serialization/Deserialization
 Generic (De)Serialzation Interface SerDe
 Uses LazySerDe
 Flexible Interface to translate unstructured data into
structured data
 Designed to read data separated by different delimiter
characters
 The SerDes are located in 'hive_contrib.jar';
32

Or Call us at: +91 120 260 5512 or +91 98 188 23045
33

Hive File Formats - Sequence file
 Hive lets users store different file formats
 Helps in performance improvements
 SQL Example:
CREATE TABLE dest1(key INT, value STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
34

Hive File Formats - Avro
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES
('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';
35

Drop tables
DROP TABLE IF EXISTS employees;
For external tables, the metadata is deleted but the data is not.
36

Alter Table
 ALTER TABLE modifies table metadata only.
 The data for the table is untouched.
 Rename a column
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER other_column; --Moved the hms column after other_column
 Removes all the existing columns and replaces them with the new columns specified
ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity'
message STRING COMMENT 'The rest of the message');
37

Or Call us at: +91 120 260 5512 or +91 98 188 23045
38

Alter Table
 Alter Storage Properties
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
39

Renaming a Table
 ALTER TABLE log_messages RENAME TO logmsgs;
40

Alter Table
 Modifying format
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
 Modifying SerDe properties
ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2');
41

Or Call us at: +91 120 260 5512 or +91 98 188 23045
42

Alter Table
 Add new SERDEPROPERTIES for the currentSerDe
ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4');
 Alter the storage properties
ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;
43

Alter Table
 ARCHIVE PARTITION statement captures the partition files into a
Hadoop archive (HAR) file. This only reduces the number of files in the
filesystem, reducing the load on the NameNode, but doesn’t provide
any space savings (e.g., through compression):
ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);
44

Or Call us at: +91 120 260 5512 or +91 98 188 23045
45

PARTITION Cont..
 Below statements prevent the partition from being dropped and
queried:
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
46

Loading Data
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
 LOAD DATA LOCAL ... copies the local data to the final location in
the distributed filesystem, while LOAD DATA ... (i.e., without
LOCAL) moves the data to the final location.
47

Insert Data
 OVERWRITE keyword, any data already present in the target
directory will be deleted first. Without the keyword, the new files
are simply added to the target directory. However, if files already
exist in the target directory that match filenames being loaded,
the old files are overwritten.
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
48

Or Call us at: +91 120 260 5512 or +91 98 188 23045
49

Dynamic Partition Inserts
 Hive determines the values of the partition keys, country and
state, from the last two columns in the SELECT clause.
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
50

Or Call us at: +91 120 260 5512 or +91 98 188 23045
51

Mix of Dynamic & Static Partition
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
52

Dynamic partitions properties
 hive.exec.dynamic.partion - Set to true to enable dynamic partitioning.
 hive.exec.dynamic.partition.mode - Set to nonstrict to enable all partitions to be determined
dynamically.
 hive.exec.max.dynamic.partitions.pernode - The maximum number of dynamic partitions
that can be created
 by each mapper or reducer.
 hive.exec.max.dynamic.partitions - The total number of dynamic partitions that can be
created by
 one statement with dynamic partitioning.
 hive.exec.max.created.files - The maximum total number of files that can be created
globally.
53

Or Call us at: +91 120 260 5512 or +91 98 188 23045
54

Dynamic table creation & Data export
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
55

Nested Select
hive> FROM (
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE 100000.0 0.2 80000
56

Or Call us at: +91 120 260 5512 or +91 98 188 23045
57

CASE … WHEN … THEN Statements
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low‘
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
58

Or Call us at: +91 120 260 5512 or +91 98 188 23045
59

Group by
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
 GROUP BY year(ymd);
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
60

Or Call us at: +91 120 260 5512 or +91 98 188 23045
61

Joins – Inner Join
hive> SELECT a.ymd, a.price_close, b.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
62

Joins Optimization
 When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.
hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
 Use smaller table first in join
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM big s JOIN small d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM smalltable d JOIN bigtable s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
63

Or Call us at: +91 120 260 5512 or +91 98 188 23045
64

Joins Optimization
 Hive assumes last table is largest in query
 It attempts to buffer the other tables and stream the last table, while performing joins on
individual records
 So, you should have largest table at the last
 OR give hint
 Select /*+ STREAMTABLE(a) */ stock, price from stocks a join dividents b on
a.symbol=b.symbol
65

Left & Right Outer join
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
 WHERE s.symbol = 'AAPL';
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol =
s.symbol
> WHERE s.symbol = 'AAPL';
66

Creating an Index
CREATE TABLE employees (
name STRING,
salary FLOAT,
deductions MAP<STRING, FLOAT>,
)
PARTITIONED BY (country STRING, state STRING);
CREATE INDEX employees_index
ON TABLE employees (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('creator = 'me', 'created_at' = 'some_time')
IN TABLE employees_index_table
PARTITIONED BY (country, name)
COMMENT 'Employees indexed by country and name.';
67

Or Call us at: +91 120 260 5512 or +91 98 188 23045
68

Creating an Index
ALTER INDEX employees_index
ON TABLE employees
PARTITION (country = 'US')
REBUILD;
SHOW FORMATTED INDEX ON employees;
DROP INDEX IF EXISTS employees_index ON TABLE employees;
69

HDFS
Map Reduce
Web UI + Hive CLI +
JDBC/ODBC
Browse, Query, DDL
MetaStore
Thrift API
Hive QL
Parser
Planner
Optimizer
Execution
UDF/UDAF
substr
sum
average
SerDe
CSV
Thrift
Regex
FileFormats
TextFile
SequenceFile
RCFile
User-defined
Map-reduce Scripts
Architecture
70

Or Call us at: +91 120 260 5512 or +91 98 188 23045
71

Pros
 Pros
 A easy way to process large scale data
 Support SQL-based queries
 Provide more user defined interfaces to
extend
 Programmability
 Efficient execution plans for performance
 Interoperability with other database tools
www.designpathshala.com | +91 120 260 5512 | +91 98 188
23045 | admin@designpathshala.com |
12/15/2014
72

Cons
 Cons
 No easy way to append data
 Files in HDFS are immutable
 Future work
 Views / Variables
 More operator
 In/Exists semantic
 More future work in the mail list
www.designpathshala.com | +91 120 260 5512 | +91 98 188
23045 | admin@designpathshala.com |
12/15/2014
73

Or Call us at: +91 120 260 5512 or +91 98 188 23045
74

Hive - Apache hadoop Bigdata training by Desing Pathshala

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Hive - Apache hadoop Bigdata training by Desing Pathshala

Similar to Hive - Apache hadoop Bigdata training by Desing Pathshala (20)

Recently uploaded

Recently uploaded (20)

Hive - Apache hadoop Bigdata training by Desing Pathshala