Apache Hadoop 
Design Pathshala 
April 22, 2014 
www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | 
http://designpathshala.com 
Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains a list of table schemas
• SQL-like query language (HiveQL)
• Supports table partitioning, clustering, complex data types, and some optimizations
Apache Hadoop Bigdata 
Training By Design Pathshala 
Contact us on: admin@designpathshala.com 
Or Call us at: +91 120 260 5512 or +91 98 188 23045 
Visit us at: http://designpathshala.com 
Why Another Data Warehousing System?
• Problem: data, data, and more data
– Several TBs of data every day
• The Hadoop experiment:
– Uses the Hadoop Distributed File System (HDFS)
– Scalable and available
• Problem
– Long development life cycle
– Map-Reduce is hard to program
• Solution: Hive
What is Hive?
• A system for managing and querying unstructured data as if it were structured
• Uses Map-Reduce for execution
• Uses HDFS for storage
Word Count
• Instead of 65 lines of Java code, let's try Hive.
• CREATE TABLE doc (
text STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE;
• LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE doc;
• SELECT word, COUNT(*) FROM doc LATERAL VIEW
explode(split(text, ' ')) lTable AS word GROUP BY word;
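To make the query's logic concrete, here is a minimal Python sketch of what the SELECT computes. It is only a model of the semantics, not how Hive executes the query: `split` breaks each row on a single space, `explode` turns the resulting array into one row per word, and `GROUP BY` with `COUNT(*)` tallies them. The function name `word_count` and the sample rows are illustrative assumptions.

```python
from collections import Counter


def word_count(lines):
    """Model of: SELECT word, COUNT(*) ... explode(split(text, ' ')) ... GROUP BY word."""
    counts = Counter()
    for text in lines:
        # split(text, ' ') yields an array; explode() emits one row per element
        for word in text.split(' '):
            counts[word] += 1  # GROUP BY word, COUNT(*)
    return dict(counts)


# Hypothetical contents of the doc table, one string per row
sample_rows = ["to be or not to be", "to do"]
result = word_count(sample_rows)
```

As in the HiveQL version, consecutive spaces would produce empty-string "words", because the split is on a single space character.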
Type System
• Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT
– Boolean: BOOLEAN
– Floating-point numbers: FLOAT, DOUBLE
– String: STRING
– Timestamp (Unix epoch seconds)
• Complex types
– Structs: {a INT; b INT}; for a struct column c, c.a returns the value of field a
– Maps: M['key'] returns the value stored under 'key'
– Arrays: ['a', 'b', 'c']; A[1] returns 'b' (indexes start at 0)
Data Model - Tables
• Tables
– Analogous to tables in relational DBs
– Each table has a corresponding directory in HDFS
• Example
– Table “designpathshala” could hold its data in the HDFS directory /com/designpathshala
Creating a Hive Table 
CREATE TABLE designpathshala_employees(
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING,FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
COMMENT 'This is the page view table'
PARTITIONED BY(department STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Hive Table Data 
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601
Todd Jones^A70000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700
Bill King^A60000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100
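The delimiters in these records map onto the column types of the table definition: ^A (\001) separates fields, ^B (\002) separates collection items, and ^C (\003) separates a map key from its value. The following Python sketch decodes one record under those assumptions. It is an illustration of the layout only, not Hive's SerDe code, and the helper names `caret` and `parse_employee` are hypothetical.

```python
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"  # printed as ^A, ^B, ^C on the slide


def caret(text):
    """Turn the slide's printable ^A/^B/^C notation into real control characters."""
    return text.replace("^A", FIELD).replace("^B", ITEM).replace("^C", KEY)


def parse_employee(line):
    """Decode one text row into the five columns of the employees table."""
    name, salary, subs, deds, addr = line.split(FIELD)
    # ARRAY<STRING>: items separated by ^B; an empty field means an empty array
    subordinates = subs.split(ITEM) if subs else []
    # MAP<STRING,FLOAT>: entries separated by ^B, key and value separated by ^C
    deductions = {}
    if deds:
        for pair in deds.split(ITEM):
            key, value = pair.split(KEY)
            deductions[key] = float(value)
    # STRUCT<street, city, state, zip>: members separated by ^B, in declared order
    street, city, state, zip_code = addr.split(ITEM)
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subordinates,
        "deductions": deductions,
        "address": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }


record = caret(
    "John Doe^A100000.0^AMary Smith^BTodd Jones"
    "^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1"
    "^A1 Michigan Ave.^BChicago^BIL^B60600"
)
row = parse_employee(record)
```

The parsed `row` has the same shape as the JSON rendering of the record.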
Hive Table JSON Format 
{
"name": "John Doe",
"salary": 100000.0,
"subordinates": ["Mary Smith", "Todd Jones"],
"deductions": {
"Federal Taxes": 0.2,
"State Taxes": 0.05,
"Insurance": 0.1
},
"address": {
"street": "1 Michigan Ave.",
"city": "Chicago",
"state": "IL",
"zip": 60600
}
}
Database
• CREATE DATABASE designpathshala;
• CREATE DATABASE IF NOT EXISTS designpathshala;
• SHOW DATABASES;
• SHOW DATABASES LIKE 'd.*';
• The default location is /user/hive/warehouse/{databasename}.db
• It is configured by the property hive.metastore.warehouse.dir
• CREATE DATABASE designpathshala LOCATION 'my/preferred/location';
• CREATE DATABASE designpathshala COMMENT 'it holds data related to Design Pathshala institute';
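As a small illustration of the default-location rule, the sketch below builds the directory for a database, and for a table inside it, from the warehouse root. The function names are hypothetical, and the default root is simply the /user/hive/warehouse value quoted on the slide; a real cluster may set hive.metastore.warehouse.dir to something else.

```python
import posixpath


def database_dir(name, warehouse_dir="/user/hive/warehouse"):
    """Default directory of a database: {warehouse_dir}/{name}.db"""
    return posixpath.join(warehouse_dir, name + ".db")


def table_dir(database, table, warehouse_dir="/user/hive/warehouse"):
    """Default directory of a managed table inside that database."""
    return posixpath.join(database_dir(database, warehouse_dir), table)


db_path = database_dir("designpathshala")
tbl_path = table_dir("mydb", "employees")
```

A database created with an explicit LOCATION clause does not follow this rule; its directory is whatever the clause names.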
Database
• DESCRIBE DATABASE designpathshala;
• DESCRIBE DATABASE EXTENDED designpathshala;
• SET hive.cli.print.current.db=true;
• DROP DATABASE IF EXISTS designpathshala;
Tables 
CREATE TABLE IF NOT EXISTS mydb.employees ( 
name STRING COMMENT 'Employee name', 
salary FLOAT COMMENT 'Employee salary', 
subordinates ARRAY<STRING> COMMENT 'Names of subordinates', 
deductions MAP<STRING, FLOAT> 
COMMENT 'Keys are deductions names, values are percentages', 
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> 
COMMENT 'Home address') 
COMMENT 'Description of the table' 
TBLPROPERTIES ('creator'='dp', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees'; 
Tables
• CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;
• SHOW TABLES;
• SHOW TABLES IN mydb;
• SHOW TABLES 'desi.*';
• DESCRIBE mytable;
• DESCRIBE EXTENDED mytable;
Managed Tables or Internal Tables
• When no location is defined, tables are created in the default warehouse directory
• When we drop a table, Hive deletes the data in the table
External Tables
• Point to existing data directories in HDFS
• Can create tables and partitions
• Data is assumed to be in a Hive-compatible format
• Dropping an external table drops only the metadata
External Tables 
CREATE EXTERNAL TABLE IF NOT EXISTS stocks ( 
symbol varchar(100), 
price_open FLOAT, 
price_high FLOAT, 
price_low FLOAT, 
price_close FLOAT, 
volume INT, 
tradeDate date) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '/data/stocks'; 
External Tables
• CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
Partition 
CREATE TABLE employees ( 
name STRING, 
salary FLOAT, 
subordinates ARRAY<STRING>, 
deductions MAP<STRING, FLOAT>, 
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> 
) 
PARTITIONED BY (country STRING, state STRING); 
... 
.../employees/country=CA/state=AB 
.../employees/country=CA/state=BC 
... 
.../employees/country=US/state=AL 
.../employees/country=US/state=AK 
... 
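The directory listing above follows a simple naming rule: one `key=value` path segment per partition column, in PARTITIONED BY order. The Python sketch below reproduces that rule and shows why a filter on partition columns lets whole directories be skipped. The base path `/warehouse/employees` is a made-up stand-in for the prefix the slide elides, and the function names are illustrative only.

```python
def partition_path(table_dir, spec):
    """Build a partition directory from (column, value) pairs in PARTITIONED BY order."""
    segments = [f"{column}={value}" for column, value in spec]
    return "/".join([table_dir.rstrip("/")] + segments)


def parse_partition(path):
    """Recover the partition columns encoded in a directory path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, **wanted):
    """Keep only directories whose partition values match the filter."""
    return [
        p for p in paths
        if all(parse_partition(p).get(col) == val for col, val in wanted.items())
    ]


base = "/warehouse/employees"  # hypothetical table directory
paths = [
    partition_path(base, [("country", c), ("state", s)])
    for c, s in [("CA", "AB"), ("CA", "BC"), ("US", "AL"), ("US", "AK")]
]
us_only = prune(paths, country="US")
```

A query that filters on `country` and `state` only needs to read the files under the matching directories, which is the main performance benefit of partitioning.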
Partition
• SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
hive> set hive.mapred.mode=strict; 
hive> SELECT e.name, e.salary FROM employees e LIMIT 100; 
FAILED: Error in semantic analysis: No partition predicate found for 
Alias "e" Table "employees" 
hive> set hive.mapred.mode=nonstrict; 
hive> SELECT e.name, e.salary FROM employees e LIMIT 100; 
Partition 
hive> SHOW PARTITIONS employees; 
... 
country=CA/state=AB
country=CA/state=BC 
... 
country=US/state=AL 
country=US/state=AK 
... 
hive> SHOW PARTITIONS employees PARTITION(country='US'); 
country=US/state=AL 
country=US/state=AK 
... 
Partition External Tables 
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages ( 
hms INT, 
severity STRING, 
server STRING, 
process_id INT, 
message STRING) 
PARTITIONED BY (year INT, month INT, day INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2) 
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02'; 
Partition External Tables 
hive> SHOW PARTITIONS log_messages; 
... 
year=2011/month=12/day=31 
year=2012/month=1/day=1 
year=2012/month=1/day=2 
... 
hive> DESCRIBE EXTENDED log_messages; 
... 
message string, 
year int, 
month int, 
day int 
Detailed Table Information... 
partitionKeys:[FieldSchema(name:year, type:int, comment:null), 
FieldSchema(name:month, type:int, comment:null), 
FieldSchema(name:day, type:int, comment:null)], 
... 
Serialization/Deserialization
• Generic (de)serialization interface: SerDe
• Uses LazySerDe
• Flexible interface to translate unstructured data into structured data
• Designed to read data separated by different delimiter characters
• The SerDes are located in 'hive_contrib.jar'
Hive File Formats - Sequence file
• Hive lets users store data in different file formats
• Choosing a suitable format helps improve performance
• SQL example:
CREATE TABLE dest1(key INT, value STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat';
Hive File Formats - Avro 
CREATE TABLE kst 
PARTITIONED BY (ds string) 
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe' 
WITH SERDEPROPERTIES 
('schema.url'='http://schema_provider/kst.avsc') 
STORED AS 
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat' 
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'; 
Drop tables 
DROP TABLE IF EXISTS employees; 
For external tables, the metadata is deleted but the data is not. 
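The difference between dropping a managed table and dropping an external one can be modeled in a few lines of Python. This is a toy model under stated assumptions, not Hive's implementation: the "metastore" is a plain dict, the "filesystem" is a set of paths, and every name except the /data/stocks location is made up for illustration.

```python
def drop_table(metastore, files, name):
    """Toy DROP TABLE IF EXISTS: always remove metadata, remove data only if managed."""
    table = metastore.pop(name, None)
    if table is None:
        return  # IF EXISTS: dropping a missing table is not an error
    if not table["external"]:
        # Managed table: Hive owns the data, so the files are deleted too
        prefix = table["location"] + "/"
        for path in [p for p in files if p.startswith(prefix)]:
            files.discard(path)


metastore = {
    "employees": {"external": False, "location": "/warehouse/employees"},
    "stocks": {"external": True, "location": "/data/stocks"},
}
files = {"/warehouse/employees/part-0", "/data/stocks/part-0"}

drop_table(metastore, files, "employees")  # metadata and data removed
drop_table(metastore, files, "stocks")     # metadata removed, data kept
```

After both calls the metastore is empty, but the file under /data/stocks is still present.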
Alter Table
• ALTER TABLE modifies table metadata only.
• The data for the table is untouched.
• Rename a column:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER other_column; -- also moves the renamed column after other_column
• Remove all the existing columns and replace them with the new columns specified:
ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');
Alter Table
• Alter storage properties:
ALTER TABLE log_messages 
PARTITION(year = 2012, month = 1, day = 1) 
SET FILEFORMAT SEQUENCEFILE; 
Renaming a Table
• ALTER TABLE log_messages RENAME TO logmsgs;
Alter Table
• Modifying the file format:
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
• Modifying SerDe properties:
ALTER TABLE table_using_JSON_storage 
SET SERDE 'com.example.JSONSerDe' 
WITH SERDEPROPERTIES ( 
'prop1' = 'value1', 
'prop2' = 'value2'); 
Alter Table
• Add new SERDEPROPERTIES for the current SerDe:
ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4');
• Alter the storage properties:
ALTER TABLE stocks 
CLUSTERED BY (exchange, symbol) 
SORTED BY (symbol) 
INTO 48 BUCKETS; 
Alter Table
• The ARCHIVE PARTITION statement captures the partition files into a Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, which reduces the load on the NameNode; it doesn't provide any space savings (e.g., through compression):
ALTER TABLE log_messages ARCHIVE 
PARTITION(year = 2012, month = 1, day = 1); 
Partition (cont.)
• The statements below prevent the partition from being dropped and from being queried, respectively:
ALTER TABLE log_messages 
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP; 
ALTER TABLE log_messages 
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE; 
Loading Data
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
• LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.
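The copy-versus-move distinction can be tried out with a local-filesystem analogy. The sketch below is only an analogy: it uses Python's `shutil` on ordinary local directories rather than HDFS, and the function name `load_data` and its `local` flag are hypothetical stand-ins for the two forms of the statement.

```python
import os
import shutil


def load_data(src, table_dir, local):
    """Analogy for LOAD DATA [LOCAL] INPATH: copy when local, move otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    dest = os.path.join(table_dir, os.path.basename(src))
    if local:
        shutil.copy(src, dest)  # LOAD DATA LOCAL: the source file stays in place
    else:
        shutil.move(src, dest)  # LOAD DATA: the source file is gone afterwards
    return dest
```

The practical consequence is that, without LOCAL, the input files no longer exist at their original path once the load finishes.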
Insert Data
• With the OVERWRITE keyword, any data already present in the target directory is deleted first. Without the keyword, the new files are simply added to the target directory. However, if files already exist in the target directory that match the filenames being loaded, the old files are overwritten.
INSERT OVERWRITE TABLE employees 
PARTITION (country = 'US', state = 'OR') 
SELECT * FROM staged_employees se 
WHERE se.cnty = 'US' AND se.st = 'OR'; 
Dynamic Partition Inserts
• Hive determines the values of the partition keys, country and state, from the last two columns in the SELECT clause.
INSERT OVERWRITE TABLE employees 
PARTITION (country, state) 
SELECT ..., se.cnty, se.st 
FROM staged_employees se; 
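The rule that the trailing SELECT columns become the partition values can be sketched in Python as follows. This models only the routing of rows to partitions, not Hive's execution; the function name and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, partition_keys):
    """Route each row to a partition chosen by its last len(partition_keys) columns."""
    n = len(partition_keys)
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:-n], row[-n:]  # trailing columns are the partition values
        spec = tuple(zip(partition_keys, values))
        partitions[spec].append(data)      # only the leading columns are stored as data
    return dict(partitions)


# Hypothetical staged rows: (name, salary, cnty, st)
staged = [
    ("Ann", 1.0, "US", "OR"),
    ("Bob", 2.0, "US", "CA"),
    ("Cy", 3.0, "US", "OR"),
]
result = dynamic_partition_insert(staged, ["country", "state"])
```

The mapping is positional: the column names `cnty` and `st` in the source table do not need to match the partition column names `country` and `state`.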
Mix of Dynamic & Static Partition 
INSERT OVERWRITE TABLE employees 
PARTITION (country = 'US', state) 
SELECT ..., se.cnty, se.st 
FROM staged_employees se 
WHERE se.cnty = 'US'; 
Dynamic partitions properties
• hive.exec.dynamic.partition - Set to true to enable dynamic partitioning.
• hive.exec.dynamic.partition.mode - Set to nonstrict to enable all partitions to be determined dynamically.
• hive.exec.max.dynamic.partitions.pernode - The maximum number of dynamic partitions that can be created by each mapper or reducer.
• hive.exec.max.dynamic.partitions - The total number of dynamic partitions that can be created by one statement with dynamic partitioning.
• hive.exec.max.created.files - The maximum total number of files that can be created globally.
Dynamic table creation & Data export 
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees se
WHERE se.state = 'CA';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees se
WHERE se.state = 'CA';
Nested Select 
hive> FROM (
> SELECT upper(name) AS name, salary, deductions["Federal Taxes"] AS fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes 
> FROM employees 
> ) e 
> SELECT e.name, e.salary_minus_fed_taxes 
> WHERE e.salary_minus_fed_taxes > 70000; 
JOHN DOE 100000.0 0.2 80000 
CASE … WHEN … THEN Statements 
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle' 
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high' 
> ELSE 'very high' 
> END AS bracket FROM employees; 
Group by 
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd);
hive> SELECT year(ymd), avg(price_close) FROM stocks 
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' 
> GROUP BY year(ymd) 
> HAVING avg(price_close) > 50.0; 
Joins – Inner Join 
hive> SELECT a.ymd, a.price_close, b.price_close 
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd 
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM'; 
Joins Optimization
• When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.
hive> SELECT a.ymd, a.price_close, b.price_close, c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
• Use the smaller table first in the join. The first query below lists the large table first; the second lists the small table first, which is the preferred order:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM bigtable s JOIN smalltable d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM smalltable d JOIN bigtable s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Joins Optimization
• Hive assumes the last table in the query is the largest
• It attempts to buffer the other tables and stream the last table while performing joins on individual records
• So, put the largest table last
• Or give a hint:
SELECT /*+ STREAMTABLE(a) */ stock, price FROM stocks a JOIN dividends b ON a.symbol = b.symbol;
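The buffer-and-stream idea can be sketched as a simple hash join in Python: the smaller input is held in memory, and the larger one is read once, row by row. This is a simplified model under stated assumptions and not Hive's actual join operator; the function name `stream_join` and the sample rows are illustrative.

```python
from collections import defaultdict


def stream_join(buffered, streamed, key):
    """Inner join that keeps `buffered` in memory and iterates `streamed` once."""
    index = defaultdict(list)
    for row in buffered:            # small side: fully buffered, indexed by join key
        index[row[key]].append(row)
    for row in streamed:            # large side: may be a generator, never materialized
        for match in index.get(row[key], []):
            yield {**match, **row}


dividends = [{"symbol": "AAPL", "dividend": 0.5}]              # hypothetical small table
stocks = iter([{"symbol": "AAPL", "price": 10.0},
               {"symbol": "IBM", "price": 20.0}])              # hypothetical large table
joined = list(stream_join(dividends, stocks, "symbol"))
```

Memory use depends only on the buffered side, which is why the ordering of tables (or the STREAMTABLE hint) matters when one input is much larger than the other.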
Left & Right Outer join 
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
Creating an Index 
CREATE TABLE employees ( 
name STRING, 
salary FLOAT, 
subordinates ARRAY<STRING>, 
deductions MAP<STRING, FLOAT>, 
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> 
) 
PARTITIONED BY (country STRING, state STRING); 
CREATE INDEX employees_index 
ON TABLE employees (country) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD 
IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
IN TABLE employees_index_table 
PARTITIONED BY (country, name) 
COMMENT 'Employees indexed by country and name.'; 
Creating an Index 
ALTER INDEX employees_index 
ON TABLE employees 
PARTITION (country = 'US') 
REBUILD; 
SHOW FORMATTED INDEX ON employees; 
DROP INDEX IF EXISTS employees_index ON TABLE employees; 
Architecture
[Figure: Hive architecture diagram. Labels recovered from the slide: Web UI, Hive CLI, and JDBC/ODBC clients (browse, query, DDL); Thrift API; MetaStore; Hive QL (parser, planner, optimizer, execution); UDF/UDAF (substr, sum, average); SerDe (CSV, Thrift, Regex); file formats (TextFile, SequenceFile, RCFile); user-defined Map-Reduce scripts; Map Reduce; HDFS.]
Pros
• An easy way to process large-scale data
• Supports SQL-based queries
• Provides more user-defined interfaces to extend
• Programmability
• Efficient execution plans for performance
• Interoperability with other database tools
Cons
• No easy way to append data
– Files in HDFS are immutable
• Future work
– Views / variables
– More operators
– IN/EXISTS semantics
– More future work is listed on the mailing list

 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
 
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
 

Hive - Apache Hadoop Bigdata training by Design Pathshala

  • 1. Apache Hadoop. Design Pathshala, April 22, 2014. www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com
  • 2. Hive
      • Developed at Facebook
      • Used for the majority of Facebook jobs
      • "Relational database" built on Hadoop
      • Maintains a list of table schemas
      • SQL-like query language (HiveQL)
      • Supports table partitioning, clustering, complex data types, and some optimizations
  • 3. Apache Hadoop Bigdata Training By Design Pathshala
      - Contact us on: admin@designpathshala.com
      - Or call us at: +91 120 260 5512 or +91 98 188 23045
      - Visit us at: http://designpathshala.com
  • 4. Why Another Data Warehousing System?
      - Problem: data, data, and more data
          - Several TBs of data every day
      - The Hadoop experiment:
          - Uses the Hadoop Distributed File System (HDFS)
          - Scalable and available
      - Problem:
          - Long development life cycle
          - MapReduce is hard to program
      - Solution: Hive
  • 5. What is Hive?
      - A system for managing and querying unstructured data as if it were structured
      - Uses MapReduce for execution
      - Uses HDFS for storage
  • 7. Word Count
      - Instead of 65 lines of Java code, let's try Hive:
        CREATE TABLE doc (
          text STRING
        ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE;

        LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE doc;

        SELECT word, COUNT(*)
        FROM doc LATERAL VIEW explode(split(text, ' ')) lTable AS word
        GROUP BY word;
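To make the word-count query concrete, here is a minimal Python sketch of what its `split`, `explode`, and `GROUP BY` steps compute. It is an illustration only and not how Hive runs the query (Hive compiles it into MapReduce jobs). The function name `word_count` and the sample lines are made up for this example.

```python
from collections import Counter


def word_count(lines):
    """Count words the way the HiveQL query does: split each row on a
    single space, flatten the pieces (explode), then group and count."""
    counts = Counter()
    for text in lines:                 # one table row per input line
        for word in text.split(" "):   # split(text, ' ') followed by explode
            counts[word] += 1          # GROUP BY word, COUNT(*)
    return dict(counts)


if __name__ == "__main__":
    print(word_count(["hive on hadoop", "hive is sql on hadoop"]))
```

Splitting on a single space means two consecutive spaces produce an empty "word", which is counted like any other value.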
  • 9. Type System
      - Primitive types
          - Integers: TINYINT, SMALLINT, INT, BIGINT
          - Boolean: BOOLEAN
          - Floating-point numbers: FLOAT, DOUBLE
          - String: STRING
          - Timestamp: TIMESTAMP (Unix epoch seconds)
      - Complex types
          - Structs: {a INT; b INT}; name.a returns the value of field a
          - Maps: M['key'] returns the value stored under 'key'
          - Arrays: for A = ['a', 'b', 'c'], A[1] returns 'b'
  • 10. Data Model - Tables
      - Tables are analogous to tables in relational databases.
      - Each table has a corresponding directory in HDFS.
      - Example: the table "designpathshala" could hold its data in the HDFS directory /com/designpathshala
  • 11. Creating a Hive Table
        CREATE TABLE designpathshala_employees (
          name STRING,
          salary FLOAT,
          subordinates ARRAY<STRING>,
          deductions MAP<STRING, FLOAT>,
          address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
        COMMENT 'This is the page view table'
        PARTITIONED BY (department STRING)
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\001'
          COLLECTION ITEMS TERMINATED BY '\002'
          MAP KEYS TERMINATED BY '\003'
          LINES TERMINATED BY '\n'
        STORED AS TEXTFILE;
  • 13. Hive Table Data (^A, ^B, and ^C stand for the control characters \001, \002, and \003; one record per line)
        John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
        Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601
        Todd Jones^A70000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700
        Bill King^A60000.0^A^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100
  • 14. Hive Table JSON Format (the first record above, shown as JSON)
        {
          "name": "John Doe",
          "salary": 100000.0,
          "subordinates": ["Mary Smith", "Todd Jones"],
          "deductions": {
            "Federal Taxes": 0.2,
            "State Taxes": 0.05,
            "Insurance": 0.1
          },
          "address": {
            "street": "1 Michigan Ave.",
            "city": "Chicago",
            "state": "IL",
            "zip": 60600
          }
        }
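The relationship between the delimited text row and its JSON view can be sketched in Python. This is a rough illustration of how the three delimiters nest (fields, then collection items, then map keys), not Hive's actual SerDe code. The function name `parse_employee` is hypothetical, and the parser assumes exactly the five columns of the employees table.

```python
import json

# ^A, ^B and ^C in the slides are the control characters \001, \002, \003.
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"


def parse_employee(line):
    """Split one text row into the five columns of the employees table."""
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD)
    street, city, state, zip_code = addr.split(ITEM)
    return {
        "name": name,
        "salary": float(salary),
        # An empty field means an empty collection, not [""].
        "subordinates": subs.split(ITEM) if subs else [],
        "deductions": {
            key: float(value)
            for key, value in (pair.split(KEY) for pair in deds.split(ITEM))
        } if deds else {},
        "address": {"street": street, "city": city,
                    "state": state, "zip": int(zip_code)},
    }


if __name__ == "__main__":
    row = ("John Doe\x01100000.0\x01Mary Smith\x02Todd Jones\x01"
           "Federal Taxes\x03.2\x02State Taxes\x03.05\x02Insurance\x03.1\x01"
           "1 Michigan Ave.\x02Chicago\x02IL\x0260600")
    print(json.dumps(parse_employee(row), indent=2))
```

A record with no subordinates still needs its empty field between two ^A characters; otherwise every later column shifts one position to the left.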
  • 15. Database
        CREATE DATABASE designpathshala;
        CREATE DATABASE IF NOT EXISTS designpathshala;
        SHOW DATABASES;
        SHOW DATABASES LIKE 'd.*';
      - The default location is /user/hive/warehouse/{databasename}.db
      - The warehouse directory is configured by the property hive.metastore.warehouse.dir
        CREATE DATABASE designpathshala LOCATION 'my/preferred/location';
        CREATE DATABASE designpathshala COMMENT 'It holds data related to the Design Pathshala institute';
  • 17. Database
        DESCRIBE DATABASE designpathshala;
        DESCRIBE DATABASE EXTENDED designpathshala;
        SET hive.cli.print.current.db=true;
        DROP DATABASE IF EXISTS designpathshala;
  • 18. Tables
        CREATE TABLE IF NOT EXISTS mydb.employees (
          name STRING COMMENT 'Employee name',
          salary FLOAT COMMENT 'Employee salary',
          subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
          deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
          address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
        COMMENT 'Description of the table'
        TBLPROPERTIES ('creator'='dp', 'created_at'='2012-01-02 10:00:00', ...)
        LOCATION '/user/hive/warehouse/mydb.db/employees';
  • 19. Tables
        CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees;
        SHOW TABLES;
        SHOW TABLES IN mydb;
        SHOW TABLES 'desi.*';
        DESCRIBE mytable;
        DESCRIBE EXTENDED mytable;
  • 20. Managed Tables or Internal Tables
      - When no location is defined, the table is created in the default warehouse directory.
      - When a managed table is dropped, Hive also deletes the data in the table.
  • 22. External Tables
      - Point to existing data directories in HDFS
      - Can create tables and partitions
      - Data is assumed to be in a Hive-compatible format
      - Dropping an external table drops only the metadata
  • 23. External Tables
        CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
          symbol VARCHAR(100),
          price_open FLOAT,
          price_high FLOAT,
          price_low FLOAT,
          price_close FLOAT,
          volume INT,
          tradeDate DATE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/stocks';
  • 25. External Tables
        CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
        LIKE mydb.employees
        LOCATION '/path/to/data';
  • 26. Partition
        CREATE TABLE employees (
          name STRING,
          salary FLOAT,
          subordinates ARRAY<STRING>,
          deductions MAP<STRING, FLOAT>,
          address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
        )
        PARTITIONED BY (country STRING, state STRING);
      - Each partition gets its own subdirectory under the table directory:
        ...
        .../employees/country=CA/state=AB
        .../employees/country=CA/state=BC
        ...
        .../employees/country=US/state=AL
        .../employees/country=US/state=AK
        ...
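The directory layout on the partition slide follows a simple `key=value` naming rule, which is also what lets Hive skip whole directories when a query filters on partition columns. A minimal Python sketch of both ideas follows. The function names and the warehouse path in the demo are assumptions for illustration, and the escaping Hive applies to special characters in partition values is not modeled.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory for one partition: one key=value level per
    partition column, in the order given in PARTITIONED BY."""
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def matching_partitions(paths, **wanted):
    """Keep only the partition directories whose key=value segments match
    every requested equality filter (a toy form of partition pruning)."""
    selected = []
    for path in paths:
        segments = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(segments.get(key) == value for key, value in wanted.items()):
            selected.append(path)
    return selected


if __name__ == "__main__":
    base = "/user/hive/warehouse/employees"  # hypothetical table directory
    dirs = [partition_path(base, [("country", c), ("state", s)])
            for c, s in [("CA", "AB"), ("CA", "BC"), ("US", "AL"), ("US", "AK")]]
    print(matching_partitions(dirs, country="US"))
```

A filter such as `country = 'US'` only has to read the directories returned by `matching_partitions`, which is why partition predicates can cut scan time substantially.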
  • 27. Partition
      - Filtering on partition columns lets Hive read only the matching directories:
        SELECT * FROM employees WHERE country = 'US' AND state = 'IL';
      - In strict mode, queries on a partitioned table without a partition predicate are rejected:
        hive> set hive.mapred.mode=strict;
        hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
        FAILED: Error in semantic analysis: No partition predicate found for Alias "e" Table "employees"
        hive> set hive.mapred.mode=nonstrict;
        hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
  • 28. Partition
        hive> SHOW PARTITIONS employees;
        ...
        country=CA/state=AB
        country=CA/state=BC
        ...
        country=US/state=AL
        country=US/state=AK
        ...
        hive> SHOW PARTITIONS employees PARTITION(country='US');
        country=US/state=AL
        country=US/state=AK
        ...
  • 29. Partition External Tables
        CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
          hms INT,
          severity STRING,
          server STRING,
          process_id INT,
          message STRING)
        PARTITIONED BY (year INT, month INT, day INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

        ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
        LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
  • 31. Partition External Tables
        hive> SHOW PARTITIONS log_messages;
        ...
        year=2011/month=12/day=31
        year=2012/month=1/day=1
        year=2012/month=1/day=2
        ...
        hive> DESCRIBE EXTENDED log_messages;
        ...
        message string,
        year int,
        month int,
        day int
        Detailed Table Information...
        partitionKeys:[FieldSchema(name:year, type:int, comment:null),
        FieldSchema(name:month, type:int, comment:null),
        FieldSchema(name:day, type:int, comment:null)],
        ...
  • 32. Serialization/Deserialization
      - Generic (de)serialization interface: SerDe
      - Uses LazySerDe
      - Flexible interface to translate unstructured data into structured data
      - Designed to read data separated by different delimiter characters
      - The SerDes are located in 'hive_contrib.jar'
  • 34. Hive File Formats - Sequence File
      - Hive lets users store data in different file formats
      - Choosing a suitable format can improve performance
      - SQL example:
        CREATE TABLE dest1 (key INT, value STRING)
        STORED AS
          INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat';
  • 35. Hive File Formats - Avro
        CREATE TABLE kst
        PARTITIONED BY (ds STRING)
        ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
        WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
        STORED AS
          INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
          OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';
  • 36. Drop Tables
        DROP TABLE IF EXISTS employees;
      - For external tables, the metadata is deleted but the data is not.
  • 37. Alter Table
      - ALTER TABLE modifies table metadata only.
      - The data for the table is untouched.
      - Rename a column (and, optionally, move it):
        ALTER TABLE log_messages
        CHANGE COLUMN hms hours_minutes_seconds INT
        COMMENT 'The hours, minutes, and seconds part of the timestamp'
        AFTER other_column; -- moves the renamed column after other_column
      - Remove all the existing columns and replace them with the new columns specified:
        ALTER TABLE log_messages REPLACE COLUMNS (
          hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
          severity STRING COMMENT 'The message severity',
          message STRING COMMENT 'The rest of the message');
  • 39. Alter Table
      - Alter storage properties:
        ALTER TABLE log_messages
        PARTITION(year = 2012, month = 1, day = 1)
        SET FILEFORMAT SEQUENCEFILE;
  • 40. Renaming a Table
        ALTER TABLE log_messages RENAME TO logmsgs;
  • 41. Alter Table
      - Modifying the file format:
        ALTER TABLE log_messages
        PARTITION(year = 2012, month = 1, day = 1)
        SET FILEFORMAT SEQUENCEFILE;
      - Modifying SerDe properties:
        ALTER TABLE table_using_JSON_storage
        SET SERDE 'com.example.JSONSerDe'
        WITH SERDEPROPERTIES (
          'prop1' = 'value1',
          'prop2' = 'value2');
  • 43. Alter Table
      - Add new SERDEPROPERTIES for the current SerDe:
        ALTER TABLE table_using_JSON_storage
        SET SERDEPROPERTIES (
          'prop3' = 'value3',
          'prop4' = 'value4');
      - Alter the storage properties:
        ALTER TABLE stocks
        CLUSTERED BY (exchange, symbol)
        SORTED BY (symbol)
        INTO 48 BUCKETS;
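CLUSTERED BY ... INTO 48 BUCKETS declares that each row belongs to one of 48 buckets chosen from a hash of the clustering columns. The Python sketch below shows the general "hash, then modulo" idea only. It uses CRC32 as a stand-in hash, which is an assumption for illustration and not the hash function Hive uses, so the bucket numbers it prints will not match Hive's.

```python
import zlib


def bucket_for(values, num_buckets=48):
    """Map the clustering-column values of one row to a bucket number.

    CRC32 is a stand-in hash chosen because it is deterministic across
    runs; Hive applies its own hash to the column values.
    """
    key = "\x01".join(str(value) for value in values).encode("utf-8")
    return zlib.crc32(key) % num_buckets


if __name__ == "__main__":
    for row in [("NASDAQ", "AAPL"), ("NYSE", "IBM"), ("NYSE", "GE")]:
        print(row, "-> bucket", bucket_for(row))
```

Rows with equal clustering values always land in the same bucket, which is the property bucketed joins and sampling rely on.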
  • 44. Alter Table
      - The ARCHIVE PARTITION statement captures the partition files into a Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, which lowers the load on the NameNode; it does not save any space (e.g., through compression):
        ALTER TABLE log_messages ARCHIVE PARTITION(year = 2012, month = 1, day = 1);
  • 46. Partition (cont.)
      - The statements below prevent the partition from being dropped and from being queried, respectively:
        ALTER TABLE log_messages
        PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

        ALTER TABLE log_messages
        PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
  • 47. Loading Data
        LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
        OVERWRITE INTO TABLE employees
        PARTITION (country = 'US', state = 'CA');
      - LOAD DATA LOCAL ... copies the local data to the final location in the distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves the data to the final location.
  • 48. Insert Data
      - With the OVERWRITE keyword, any data already present in the target directory is deleted first. Without the keyword, the new files are simply added to the target directory. However, if files already exist in the target directory that match the filenames being loaded, the old files are overwritten.
        INSERT OVERWRITE TABLE employees
        PARTITION (country = 'US', state = 'OR')
        SELECT * FROM staged_employees se
        WHERE se.cnty = 'US' AND se.st = 'OR';
  • 50. Dynamic Partition Inserts
      - Hive determines the values of the partition keys, country and state, from the last two columns in the SELECT clause.
        INSERT OVERWRITE TABLE employees
        PARTITION (country, state)
        SELECT ..., se.cnty, se.st
        FROM staged_employees se;
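The rule that the trailing SELECT columns become the partition values can be pictured with a small Python sketch. It only illustrates the routing logic, not Hive's implementation. The function name `route_rows` and the sample rows are invented for this example.

```python
from collections import defaultdict


def route_rows(rows, partition_keys):
    """Group rows by the values of their trailing columns.

    The last len(partition_keys) columns of each row are treated as the
    partition values; the remaining columns are the data written into
    that partition's directory.
    """
    n = len(partition_keys)
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:-n], row[-n:]
        directory = "/".join(f"{k}={v}" for k, v in zip(partition_keys, values))
        partitions[directory].append(data)
    return dict(partitions)


if __name__ == "__main__":
    staged = [("John Doe", 100000.0, "US", "IL"),
              ("Ann Lee", 90000.0, "CA", "BC"),
              ("Bill King", 60000.0, "US", "IL")]
    for directory, data in route_rows(staged, ["country", "state"]).items():
        print(directory, data)
```

Because the match is positional, the order of the trailing columns must follow the order of the keys in the PARTITION clause; the column names themselves (here `cnty` and `st` in the slide) do not matter.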
  • 52. Mix of Dynamic & Static Partition
        INSERT OVERWRITE TABLE employees
        PARTITION (country = 'US', state)
        SELECT ..., se.cnty, se.st
        FROM staged_employees se
        WHERE se.cnty = 'US';
  • 53. Dynamic Partition Properties
      - hive.exec.dynamic.partition - Set to true to enable dynamic partitioning.
      - hive.exec.dynamic.partition.mode - Set to nonstrict to enable all partitions to be determined dynamically.
      - hive.exec.max.dynamic.partitions.pernode - The maximum number of dynamic partitions that can be created by each mapper or reducer.
      - hive.exec.max.dynamic.partitions - The total number of dynamic partitions that can be created by one statement with dynamic partitioning.
      - hive.exec.max.created.files - The maximum total number of files that can be created globally.
  • 55. Dynamic Table Creation & Data Export
        CREATE TABLE ca_employees
        AS SELECT name, salary, address
        FROM employees se
        WHERE se.state = 'CA';

        INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
        SELECT name, salary, address
        FROM employees se
        WHERE se.state = 'CA';
  • 56. Nested Select
        hive> FROM (
            >   SELECT upper(name) AS name, salary, deductions["Federal Taxes"] AS fed_taxes,
            >     round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
            >   FROM employees
            > ) e
            > SELECT e.name, e.salary, e.fed_taxes, e.salary_minus_fed_taxes
            > WHERE e.salary_minus_fed_taxes > 70000;
        JOHN DOE 100000.0 0.2 80000
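The nested query works in two stages: the inner SELECT derives new columns, and the outer SELECT filters on them. A minimal Python sketch of those two stages follows. The function name `high_net_salaries` and the in-memory employee records are assumptions for illustration, and the sketch returns only the name and the derived column.

```python
def high_net_salaries(employees, threshold=70000):
    """Stage 1 derives the upper-cased name and salary after federal tax;
    stage 2 keeps only the rows whose derived value exceeds the threshold."""
    derived = [
        {
            "name": e["name"].upper(),
            "salary_minus_fed_taxes": round(
                e["salary"] * (1 - e["deductions"]["Federal Taxes"])),
        }
        for e in employees
    ]
    return [(row["name"], row["salary_minus_fed_taxes"])
            for row in derived
            if row["salary_minus_fed_taxes"] > threshold]


if __name__ == "__main__":
    staff = [
        {"name": "John Doe", "salary": 100000.0, "deductions": {"Federal Taxes": 0.2}},
        {"name": "Mary Smith", "salary": 80000.0, "deductions": {"Federal Taxes": 0.2}},
    ]
    print(high_net_salaries(staff))
```

The derived column has to be computed in the inner stage because a WHERE clause cannot refer to an alias defined in the same SELECT list.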
  • 58. CASE … WHEN … THEN Statements
        hive> SELECT name, salary,
            >   CASE
            >     WHEN salary < 50000.0 THEN 'low'
            >     WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
            >     WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
            >     ELSE 'very high'
            >   END AS bracket FROM employees;
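A CASE expression is evaluated top to bottom and returns the value of the first branch whose condition holds. The same brackets can be written as a plain Python function, shown here as a sketch for comparison; the name `salary_bracket` is made up for this example.

```python
def salary_bracket(salary):
    """Return the bracket label the CASE expression assigns to a salary."""
    if salary < 50000.0:
        return "low"
    if salary < 70000.0:      # reached only when salary >= 50000.0
        return "middle"
    if salary < 100000.0:     # reached only when salary >= 70000.0
        return "high"
    return "very high"        # the ELSE branch


if __name__ == "__main__":
    for amount in (60000.0, 70000.0, 80000.0, 100000.0):
        print(amount, salary_bracket(amount))
```

Because earlier branches have already excluded the lower ranges, the explicit `salary >= ...` lower bounds in the HiveQL version are redundant but harmless.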
  • 60. Group By
        hive> SELECT year(ymd), avg(price_close) FROM stocks
            > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
            > GROUP BY year(ymd);

        hive> SELECT year(ymd), avg(price_close) FROM stocks
            > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
            > GROUP BY year(ymd)
            > HAVING avg(price_close) > 50.0;
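The difference between the two queries is that HAVING filters groups after aggregation, whereas WHERE filters rows before it. A small Python sketch of that order of operations follows. The function name and the sample prices are invented for illustration, and the input rows are assumed to have already passed the WHERE filter on exchange and symbol.

```python
from collections import defaultdict


def yearly_average_close(rows, min_average=None):
    """Average price_close per year; optionally keep only the years whose
    average exceeds min_average (the HAVING step)."""
    totals = defaultdict(lambda: [0.0, 0])   # year -> [sum, count]
    for ymd, price_close in rows:            # ymd is a 'YYYY-MM-DD' string
        year = int(ymd[:4])                  # year(ymd)
        totals[year][0] += price_close
        totals[year][1] += 1
    averages = {year: s / n for year, (s, n) in totals.items()}
    if min_average is not None:              # HAVING avg(price_close) > ...
        averages = {y: a for y, a in averages.items() if a > min_average}
    return averages


if __name__ == "__main__":
    prices = [("2009-01-02", 40.0), ("2009-06-01", 50.0),
              ("2010-01-04", 60.0), ("2010-06-01", 80.0)]
    print(yearly_average_close(prices))
    print(yearly_average_close(prices, min_average=50.0))
```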
  • 62. Joins - Inner Join
        hive> SELECT a.ymd, a.price_close, b.price_close
            > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
            > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
  • 63. Joins Optimization
      - When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.
        hive> SELECT a.ymd, a.price_close, b.price_close, c.price_close
            > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
            > JOIN stocks c ON a.ymd = c.ymd
            > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
      - Put the smaller table first in the join.
        Less efficient (large table first):
        SELECT s.ymd, s.symbol, s.price_close, d.dividend
        FROM bigtable s JOIN smalltable d ON s.ymd = d.ymd AND s.symbol = d.symbol
        WHERE s.symbol = 'AAPL';
        Preferred (small table first):
        SELECT s.ymd, s.symbol, s.price_close, d.dividend
        FROM smalltable d JOIN bigtable s ON s.ymd = d.ymd AND s.symbol = d.symbol
        WHERE s.symbol = 'AAPL';
  • 65. Joins Optimization
      - Hive assumes the last table in the query is the largest.
      - It attempts to buffer the other tables and stream the last table while performing joins on individual records.
      - So place the largest table last in the join,
      - or give a hint naming the table to stream:
        SELECT /*+ STREAMTABLE(a) */ stock, price
        FROM stocks a JOIN dividends b ON a.symbol = b.symbol;
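Why table order matters can be seen in a simplified Python sketch of a buffer-and-stream join. As a simplifying assumption, the sketch holds the whole smaller input in memory, whereas Hive does its buffering per join key inside each reducer. The function name `stream_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def stream_join(buffered_rows, streamed_rows, key):
    """Inner join that keeps buffered_rows in memory and reads
    streamed_rows exactly once, one row at a time."""
    index = defaultdict(list)
    for row in buffered_rows:          # memory cost grows with this input
        index[row[key]].append(row)
    for big in streamed_rows:          # may be a generator; never stored
        for small in index.get(big[key], []):
            yield {**small, **big}


if __name__ == "__main__":
    dividends = [{"symbol": "AAPL", "dividend": 0.5}]
    stocks = ({"symbol": s, "price_close": p}
              for s, p in [("AAPL", 10.0), ("IBM", 20.0), ("AAPL", 11.0)])
    for joined in stream_join(dividends, stocks, "symbol"):
        print(joined)
```

Memory use depends only on the buffered side, so buffering the small table and streaming the large one keeps the footprint low.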
  • 66. Left & Right Outer Join
        hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
            > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
            > WHERE s.symbol = 'AAPL';

        hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
            > FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
            > WHERE s.symbol = 'AAPL';
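Both queries keep every stocks row and fill the dividend with NULL when no matching dividends row exists; the right outer join is the same operation with the table roles swapped. A minimal Python sketch of left outer join semantics follows, with `None` standing in for NULL. The function name and sample data are made up for illustration.

```python
def left_outer_join(left_rows, right_rows, keys, right_columns):
    """Return every left row, extended with right_columns from each
    matching right row, or with None when nothing matches."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    result = []
    for left in left_rows:
        matches = index.get(tuple(left[k] for k in keys))
        if matches:
            for right in matches:
                result.append({**left, **{c: right[c] for c in right_columns}})
        else:
            result.append({**left, **{c: None for c in right_columns}})
    return result


if __name__ == "__main__":
    stocks = [{"ymd": "2010-02-04", "symbol": "AAPL", "price_close": 192.05},
              {"ymd": "2010-02-05", "symbol": "AAPL", "price_close": 195.46}]
    dividends = [{"ymd": "2010-02-04", "symbol": "AAPL", "dividend": 0.5}]
    for row in left_outer_join(stocks, dividends, ["ymd", "symbol"], ["dividend"]):
        print(row)
```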
  • 67. Creating an Index
        CREATE TABLE employees (
          name STRING,
          salary FLOAT,
          subordinates ARRAY<STRING>,
          deductions MAP<STRING, FLOAT>,
          address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
        )
        PARTITIONED BY (country STRING, state STRING);

        CREATE INDEX employees_index
        ON TABLE employees (country)
        AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
        WITH DEFERRED REBUILD
        IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
        IN TABLE employees_index_table
        PARTITIONED BY (country, name)
        COMMENT 'Employees indexed by country and name.';
  • 69. Creating an Index
        ALTER INDEX employees_index
        ON TABLE employees
        PARTITION (country = 'US')
        REBUILD;

        SHOW FORMATTED INDEX ON employees;

        DROP INDEX IF EXISTS employees_index ON TABLE employees;
  • 70. Architecture (diagram)
      - Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
      - MetaStore, Thrift API
      - Hive QL: parser, planner, optimizer, execution
      - UDF/UDAF: substr, sum, average
      - SerDe: CSV, Thrift, Regex
      - File formats: TextFile, SequenceFile, RCFile
      - User-defined MapReduce scripts
      - Underlying layers: MapReduce, HDFS
  • 72. Pros
      - An easy way to process large-scale data
      - Supports SQL-based queries
      - Provides more user-defined interfaces to extend programmability
      - Efficient execution plans for performance
      - Interoperability with other database tools
  • 73. Cons
      - No easy way to append data
      - Files in HDFS are immutable
      - Future work
          - Views / variables
          - More operators
          - IN/EXISTS semantics
          - More future work is listed on the mailing list