SlideShare a Scribd company logo
1 of 57
Download to read offline
Hive and Shark
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45
Motivation
MapReduce is hard to program.
No schema, lack of query languages, e.g., SQL.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 2 / 45
Solution
Adding tables, columns, partitions, and a subset of SQL to unstruc-
tured data.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 3 / 45
Hive
A system for managing and querying structured data built on top
of Hadoop.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
Hive
A system for managing and querying structured data built on top
of Hadoop.
Converts a query to a series of MapReduce phases.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
Hive
A system for managing and querying structured data built on top
of Hadoop.
Converts a query to a series of MapReduce phases.
Initially developed by Facebook.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
Hive
A system for managing and querying structured data built on top
of Hadoop.
Converts a query to a series of MapReduce phases.
Initially developed by Facebook.
Focuses on scalability and extensibility.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
Scalability
Massive scale out and fault tolerance capabilities on commodity
hardware.
Can handle petabytes of data.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 5 / 45
Extensibility
Data types: primitive types and complex types.
User Defined Functions (UDF).
Serializer/Deserializer: text, binary, JSON ...
Storage: HDFS, Hbase, S3 ...
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 6 / 45
RDBMS vs. Hive
RDBMS Hive
Language SQL HiveQL
Update Capabilities INSERT, UPDATE, and DELETE INSERT OVERWRITE; no UPDATE or DELETE
OLAP Yes Yes
OLTP Yes No
Latency Sub-second Minutes or more
Indexes Any number of indexes No indexes, data is always scanned (in parallel)
Data size TBs PBs
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 7 / 45
RDBMS vs. Hive
RDBMS Hive
Language SQL HiveQL
Update Capabilities INSERT, UPDATE, and DELETE INSERT OVERWRITE; no UPDATE or DELETE
OLAP Yes Yes
OLTP Yes No
Latency Sub-second Minutes or more
Indexes Any number of indexes No indexes, data is always scanned (in parallel)
Data size TBs PBs
Online Analytical Processing (OLAP): allows users to analyze
database information from multiple database systems at one time.
Online Transaction Processing (OLTP): facilitates and manages
transaction-oriented applications.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 7 / 45
Hive Data Model
Re-used from RDBMS:
• Database: Set of Tables.
• Table: Set of Rows that have the same schema (same columns).
• Row: A single record; a set of columns.
• Column: provides value and type for a single value.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 8 / 45
Hive Data Model - Table
Analogous to tables in relational databases.
Each table has a corresponding HDFS directory.
For example data for table customer is in the directory
/db/customer.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 9 / 45
Hive Data Model - Partition
A coarse-grained partitioning of a table based on the value of a
column, such as a date.
Faster queries on slices of the data.
If customer is partitioned on column country, then data with a
particular country value SE, will be stored in files within the directory
/db/customer/country=SE.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 10 / 45
Hive Data Model - Bucket
Data in each partition may in turn be divided into buckets based on
the hash of a column in the table.
For more efficient queries.
If customer country partition is subdivided further into buckets,
based on username (hashed on username), the data for each bucket
will be stored within the directories:
/db/customer/country=SE/000000 0
...
/db/customer/country=SE/000000 5
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 11 / 45
Column Data Types
Primitive types
• integers, float, strings, dates and booleans
Nestable collections
• array and map
User-defined types
• Users can also define their own types programmatically
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 12 / 45
Hive Operations
HiveQL: SQL-like query languages
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
Hive Operations
HiveQL: SQL-like query languages
DDL operations (Data Definition Language)
• Create, Alter, Drop
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
Hive Operations
HiveQL: SQL-like query languages
DDL operations (Data Definition Language)
• Create, Alter, Drop
DML operations (Data Manipulation Language)
• Load and Insert (overwrite)
• Does not support updating and deleting
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
Hive Operations
HiveQL: SQL-like query languages
DDL operations (Data Definition Language)
• Create, Alter, Drop
DML operations (Data Manipulation Language)
• Load and Insert (overwrite)
• Does not support updating and deleting
Query operations
• Select, Filter, Join, Groupby
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
DDL Operations (1/3)
Create tables
-- Creates a table with three columns
CREATE TABLE customer (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 14 / 45
DDL Operations (1/3)
Create tables
-- Creates a table with three columns
CREATE TABLE customer (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’;
Create tables with partitions
-- Creates a table with three columns and a partition column
-- /db/customer2/country=SE;
-- /db/customer2/country=IR;
CREATE TABLE customer2 (id INT, name STRING, address STRING)
PARTITION BY (country STRING)
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 14 / 45
DDL Operations (2/3)
Create tables with buckets
-- Specify the columns to bucket on and the number of buckets
-- /db/customer3/000000_0
-- /db/customer3/000000_1
-- /db/customer3/000000_2
set hive.enforce.bucketing = true;
CREATE TABLE customer3 (id INT, name STRING, address STRING)
CLUSTERED BY (id) INTO 3 BUCKETS;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 15 / 45
DDL Operations (2/3)
Create tables with buckets
-- Specify the columns to bucket on and the number of buckets
-- /db/customer3/000000_0
-- /db/customer3/000000_1
-- /db/customer3/000000_2
set hive.enforce.bucketing = true;
CREATE TABLE customer3 (id INT, name STRING, address STRING)
CLUSTERED BY (id) INTO 3 BUCKETS;
Browsing through tables
-- lists all the tables
SHOW TABLES;
-- shows the list of columns
DESCRIBE customer;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 15 / 45
DDL Operations (3/3)
Altering tables
-- rename the customer table to alaki
ALTER TABLE customer RENAME TO alaki;
-- add two new columns to the customer table
ALTER TABLE customer ADD COLUMNS (job STRING);
ALTER TABLE customer ADD COLUMNS (grade INT COMMENT ’some comment’);
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 16 / 45
DDL Operations (3/3)
Altering tables
-- rename the customer table to alaki
ALTER TABLE customer RENAME TO alaki;
-- add two new columns to the customer table
ALTER TABLE customer ADD COLUMNS (job STRING);
ALTER TABLE customer ADD COLUMNS (grade INT COMMENT ’some comment’);
Dropping tables
DROP TABLE customer;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 16 / 45
DML Operations
Loading data from flat files.
-- if ’LOCAL’ is omitted then it looks for the file in HDFS.
-- the ’OVERWRITE’ signifies that existing data in the table is deleted.
-- if the ’OVERWRITE’ is omitted, data are appended to existing data sets.
LOAD DATA LOCAL INPATH ’data.txt’ OVERWRITE INTO TABLE customer;
-- loads data into different partitions
LOAD DATA LOCAL INPATH ’data1.txt’ OVERWRITE INTO TABLE customer2
PARTITION (country=’SE’);
LOAD DATA LOCAL INPATH ’data2.txt’ OVERWRITE INTO TABLE customer2
PARTITION (country=’IR’);
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 17 / 45
DML Operations
Loading data from flat files.
-- if ’LOCAL’ is omitted then it looks for the file in HDFS.
-- the ’OVERWRITE’ signifies that existing data in the table is deleted.
-- if the ’OVERWRITE’ is omitted, data are appended to existing data sets.
LOAD DATA LOCAL INPATH ’data.txt’ OVERWRITE INTO TABLE customer;
-- loads data into different partitions
LOAD DATA LOCAL INPATH ’data1.txt’ OVERWRITE INTO TABLE customer2
PARTITION (country=’SE’);
LOAD DATA LOCAL INPATH ’data2.txt’ OVERWRITE INTO TABLE customer2
PARTITION (country=’IR’);
Store the query results in tables
INSERT OVERWRITE TABLE customer SELECT * From old_customers;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 17 / 45
Query Operations (1/3)
Selects and filters
SELECT id FROM customer2 WHERE country=’SE’;
-- selects all rows from customer table into a local directory
INSERT OVERWRITE LOCAL DIRECTORY ’/tmp/hive-sample-out’ SELECT *
FROM customer;
-- selects all rows from customer2 table into a directory in hdfs
INSERT OVERWRITE DIRECTORY ’/tmp/hdfs_ir’ SELECT * FROM customer2
WHERE country=’IR’;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 18 / 45
Query Operations (2/3)
Aggregations and groups
SELECT MAX(id) FROM customer;
SELECT country, COUNT(*), SUM(id) FROM customer2 GROUP BY country;
INSERT TABLE high_id_customer SELECT c.name, COUNT(*) FROM customer c
WHERE c.id > 10 GROUP BY c.name;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 19 / 45
Query Operations (3/3)
Join
CREATE TABLE customer (id INT, name STRING, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’;
CREATE TABLE order (id INT, cus_id INT, prod_id INT, price INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’;
SELECT * FROM customer c JOIN order o ON (c.id = o.cus_id);
SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN
(SELECT cus_id, sum(price) AS exp FROM order GROUP BY cus_id) ce
ON (c.id = ce.cus_id) INSERT OVERWRITE TABLE order_customer;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 20 / 45
User-Defined Function (UDF)
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class Lower extends UDF {
public Text evaluate(final Text s) {
if (s == null) { return null; }
return new Text(s.toString().toLowerCase());
}
}
-- Register the class
CREATE FUNCTION my_lower AS ’com.example.hive.udf.Lower’;
-- Using the function
SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 21 / 45
Executing SQL Questions
Processes HiveQL statements and generates the execution plan
through three-phase processes.
1 Query parsing: transforms a query string to a parse tree representa-
tion.
2 Logical plan generation: converts the internal query representation
to a logical plan, and optimizes it.
3 Physical plan generation: split the optimized logical plan into multiple
map/reduce and HDFS tasks.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 22 / 45
Optimization (1/2)
Column pruning
• Projecting out the needed columns.
Predicate pushdown
• Filtering rows early in the processing, by pushing down predicates to
the scan (if possible).
Partition pruning
• Pruning out files of partitions that do not satisfy the predicate.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 23 / 45
Optimization (2/2)
Map-side joins
• The small tables are replicated in all the mappers and joined with
other tables.
• No reducer needed.
Join reordering
• Only materialized and kept small tables in memory.
• This ensures that the join operation does not exceed memory limits
on the reducer side.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 24 / 45
Hive Components (1/8)
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 25 / 45
Hive Components (2/8)
External interfaces
• User interfaces, e.g., CLI and web UI
• Application programming interfaces, e.g., JDBC and ODBC
• Thrift, a framework for cross-language services.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 26 / 45
Hive Components (3/8)
Driver
• Manages the life cycle of a HiveQL statement during compilation,
optimization and execution.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 27 / 45
Hive Components (4/8)
Compiler (Parser/Query Optimizer)
• Translates the HiveQL statement into a a logical plan, and
optimizes it.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 28 / 45
Hive Components (5/8)
Physical plan
• Transforms the logical plan into a DAG of Map/Reduce jobs.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 29 / 45
Hive Components (6/8)
Execution engine
• The driver submits the individual mapreduce jobs from the DAG to
the execution engine in a topological order.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 30 / 45
Hive Components (7/8)
SerDe
• Serializer/Deserializer allows Hive to read and write table rows in
any custom format.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 31 / 45
Hive Components (8/8)
Metastore
• The system catalog.
• Contains metadata about the tables.
• Metadata is specified during table creation and reused every time the
table is referenced in HiveQL.
• Metadatas are stored on either a traditional relational database, e.g.,
MySQL, or file system and not HDFS.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 32 / 45
Hive on Spark
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 33 / 45
Spark RDD - Reminder
RDDs are immutable, partitioned collections that can be created
through various transformations, e.g., map, groupByKey, join.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 34 / 45
Executing SQL over Spark RDDs
Shark runs SQL queries over Spark using three-step process:
1 Query parsing: Shark uses Hive query compiler to parse the query
and generate a parse tree.
2 Logical plan generation: the tree is turned into a logical plan and
basic logical optimization is applied.
3 Physical plan generation: Shark applies additional optimization and
creates a physical plan consisting of transformations on RDDs.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 35 / 45
Hive Components
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 36 / 45
Shark Components
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 37 / 45
Shark and Spark
Shark extended RDD execution model:
• Partial DAG Execution (PDE): to re-optimize a running query after
running the first few stages of its task DAG.
• In-memory columnar storage and compression: to process relational
data efficiently.
• Control over data partitioning.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 38 / 45
Partial DAG Execution (1/2)
How to optimize the following query?
SELECT * FROM table1 a JOIN table2 b ON (a.key = b.key)
WHERE my_crazy_udf(b.field1, b.field2) = true;
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 39 / 45
Partial DAG Execution (1/2)
How to optimize the following query?
SELECT * FROM table1 a JOIN table2 b ON (a.key = b.key)
WHERE my_crazy_udf(b.field1, b.field2) = true;
It can not use cost-based optimization techniques that rely on ac-
curate a priori data statistics.
They require dynamic approaches to query optimization.
Partial DAG Execution (PDE): dynamic alteration of query plans
based on data statistics collected at run-time.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 39 / 45
Partial DAG Execution (2/2)
The workers gather customizable statistics at global and per-
partition granularities at run-time.
Each worker sends the collected statistics to the master.
The master aggregates the statistics and alters the query plan based
on such statistics.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 40 / 45
Columnar Memory Store
Simply caching Hive records as JVM objects is inefficient.
12 to 16 bytes of overhead per object in JVM implementation:
• e.g., storing a 270MB table as JVM objects uses approximately 971
MB of memory.
Shark employs column-oriented storage using arrays of primitive ob-
jects.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 41 / 45
Data Partitioning
Shark allows co-partitioning two tables, which are frequently joined
together, on a common key for faster joins in subsequent queries.
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 42 / 45
Shark/Spark Integration
Shark provides a simple API for programmers to convert results from
SQL queries into a special type of RDDs: sql2rdd.
val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20")
println(youngUsers.count)
val featureMatrix = youngUsers.map(extractFeatures(_))
kmeans(featureMatrix)
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 43 / 45
Summary
Operators: DDL, DML, SQL
Hive architecture vs. Shark architecture
Add advance features to Spark, e.g., PDE, columnar memory store
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 44 / 45
Questions?
Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 45 / 45

More Related Content

What's hot

0104 abap dictionary
0104 abap dictionary0104 abap dictionary
0104 abap dictionary
vkyecc1
 
Internal tables
Internal tablesInternal tables
Internal tables
waseem27
 

What's hot (20)

ABAP Programming Overview
ABAP Programming OverviewABAP Programming Overview
ABAP Programming Overview
 
SAS Functions
SAS FunctionsSAS Functions
SAS Functions
 
scientific writing 01 - latex
scientific writing   01 - latexscientific writing   01 - latex
scientific writing 01 - latex
 
Reading Fixed And Varying Data
Reading Fixed And Varying DataReading Fixed And Varying Data
Reading Fixed And Varying Data
 
Sas Plots Graphs
Sas Plots GraphsSas Plots Graphs
Sas Plots Graphs
 
SAS Proc SQL
SAS Proc SQLSAS Proc SQL
SAS Proc SQL
 
LaTeX Part 1
LaTeX Part 1LaTeX Part 1
LaTeX Part 1
 
0104 abap dictionary
0104 abap dictionary0104 abap dictionary
0104 abap dictionary
 
SAS ODS HTML
SAS ODS HTMLSAS ODS HTML
SAS ODS HTML
 
Internal tables
Internal tablesInternal tables
Internal tables
 
SAS Macros
SAS MacrosSAS Macros
SAS Macros
 
05 internal tables
05 internal tables05 internal tables
05 internal tables
 
Enhancing statistical report capabilities using the clin plus report engine
Enhancing statistical report capabilities using the clin plus report engineEnhancing statistical report capabilities using the clin plus report engine
Enhancing statistical report capabilities using the clin plus report engine
 
Understanding SAS Data Step Processing
Understanding SAS Data Step ProcessingUnderstanding SAS Data Step Processing
Understanding SAS Data Step Processing
 
How to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon RedshiftHow to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon Redshift
 
Compare And Merge Scripts
Compare And Merge ScriptsCompare And Merge Scripts
Compare And Merge Scripts
 
SAS Access / SAS Connect
SAS Access / SAS ConnectSAS Access / SAS Connect
SAS Access / SAS Connect
 
Utility Procedures in SAS
Utility Procedures in SASUtility Procedures in SAS
Utility Procedures in SAS
 
Chap16 scr
Chap16 scrChap16 scr
Chap16 scr
 
Latex workshop
Latex workshopLatex workshop
Latex workshop
 

Viewers also liked (20)

Scala
ScalaScala
Scala
 
Spark
SparkSpark
Spark
 
P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution Network
 
Graph processing - Pregel
Graph processing - PregelGraph processing - Pregel
Graph processing - Pregel
 
Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
 
MapReduce
MapReduceMapReduce
MapReduce
 
Introduction to stream processing
Introduction to stream processingIntroduction to stream processing
Introduction to stream processing
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Graph processing - Graphlab
Graph processing - GraphlabGraph processing - Graphlab
Graph processing - Graphlab
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1
 
Threads
ThreadsThreads
Threads
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
 
IO Systems
IO SystemsIO Systems
IO Systems
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Security
SecuritySecurity
Security
 
Protection
ProtectionProtection
Protection
 

Similar to Hive and Shark

HBase_-_data_operaet le opérations de calciletions_final.pptx
HBase_-_data_operaet le opérations de calciletions_final.pptxHBase_-_data_operaet le opérations de calciletions_final.pptx
HBase_-_data_operaet le opérations de calciletions_final.pptx
HmadSADAQ2
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
Carol McDonald
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
Cloudera, Inc.
 

Similar to Hive and Shark (20)

Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSHive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Apache hive
Apache hiveApache hive
Apache hive
 
HBase_-_data_operaet le opérations de calciletions_final.pptx
HBase_-_data_operaet le opérations de calciletions_final.pptxHBase_-_data_operaet le opérations de calciletions_final.pptx
HBase_-_data_operaet le opérations de calciletions_final.pptx
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Dynamo and NoSQL Databases
Dynamo and NoSQL DatabasesDynamo and NoSQL Databases
Dynamo and NoSQL Databases
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Hspark index conf
Hspark index confHspark index conf
Hspark index conf
 
Hive
HiveHive
Hive
 

More from Amir Payberah

More from Amir Payberah (18)

Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing Frameworks
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1
 
File System Interface
File System InterfaceFile System Interface
File System Interface
 
Storage
StorageStorage
Storage
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
Main Memory - Part1
Main Memory - Part1Main Memory - Part1
Main Memory - Part1
 
Deadlocks
DeadlocksDeadlocks
Deadlocks
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Hive and Shark

  • 1. Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45
  • 2. Motivation MapReduce is hard to program. No schema, lack of query languages, e.g., SQL. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 2 / 45
  • 3. Solution Adding tables, columns, partitions, and a subset of SQL to unstruc- tured data. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 3 / 45
  • 4. Hive A system for managing and querying structured data built on top of Hadoop. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
  • 5. Hive A system for managing and querying structured data built on top of Hadoop. Converts a query to a series of MapReduce phases. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
  • 6. Hive A system for managing and querying structured data built on top of Hadoop. Converts a query to a series of MapReduce phases. Initially developed by Facebook. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
  • 7. Hive A system for managing and querying structured data built on top of Hadoop. Converts a query to a series of MapReduce phases. Initially developed by Facebook. Focuses on scalability and extensibility. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 4 / 45
  • 8. Scalability Massive scale out and fault tolerance capabilities on commodity hardware. Can handle petabytes of data. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 5 / 45
  • 9. Extensibility Data types: primitive types and complex types. User Defined Functions (UDF). Serializer/Deserializer: text, binary, JSON ... Storage: HDFS, Hbase, S3 ... Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 6 / 45
  • 10. RDBMS vs. Hive RDBMS Hive Language SQL HiveQL Update Capabilities INSERT, UPDATE, and DELETE INSERT OVERWRITE; no UPDATE or DELETE OLAP Yes Yes OLTP Yes No Latency Sub-second Minutes or more Indexes Any number of indexes No indexes, data is always scanned (in parallel) Data size TBs PBs Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 7 / 45
  • 11. RDBMS vs. Hive RDBMS Hive Language SQL HiveQL Update Capabilities INSERT, UPDATE, and DELETE INSERT OVERWRITE; no UPDATE or DELETE OLAP Yes Yes OLTP Yes No Latency Sub-second Minutes or more Indexes Any number of indexes No indexes, data is always scanned (in parallel) Data size TBs PBs Online Analytical Processing (OLAP): allows users to analyze database information from multiple database systems at one time. Online Transaction Processing (OLTP): facilitates and manages transaction-oriented applications. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 7 / 45
  • 12. Hive Data Model Re-used from RDBMS: • Database: Set of Tables. • Table: Set of Rows that have the same schema (same columns). • Row: A single record; a set of columns. • Column: provides value and type for a single value. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 8 / 45
  • 13. Hive Data Model - Table Analogous to tables in relational databases. Each table has a corresponding HDFS directory. For example data for table customer is in the directory /db/customer. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 9 / 45
  • 14. Hive Data Model - Partition A coarse-grained partitioning of a table based on the value of a column, such as a date. Faster queries on slices of the data. If customer is partitioned on column country, then data with a particular country value SE, will be stored in files within the directory /db/customer/country=SE. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 10 / 45
  • 15. Hive Data Model - Bucket Data in each partition may in turn be divided into buckets based on the hash of a column in the table. For more efficient queries. If customer country partition is subdivided further into buckets, based on username (hashed on username), the data for each bucket will be stored within the directories: /db/customer/country=SE/000000 0 ... /db/customer/country=SE/000000 5 Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 11 / 45
  • 16. Column Data Types Primitive types • integers, float, strings, dates and booleans Nestable collections • array and map User-defined types • Users can also define their own types programmatically Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 12 / 45
  • 17. Hive Operations HiveQL: SQL-like query languages Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
  • 18. Hive Operations HiveQL: SQL-like query languages DDL operations (Data Definition Language) • Create, Alter, Drop Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
  • 19. Hive Operations HiveQL: SQL-like query languages DDL operations (Data Definition Language) • Create, Alter, Drop DML operations (Data Manipulation Language) • Load and Insert (overwrite) • Does not support updating and deleting Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
  • 20. Hive Operations HiveQL: SQL-like query languages DDL operations (Data Definition Language) • Create, Alter, Drop DML operations (Data Manipulation Language) • Load and Insert (overwrite) • Does not support updating and deleting Query operations • Select, Filter, Join, Groupby Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 13 / 45
  • 21. DDL Operations (1/3) Create tables -- Creates a table with three columns CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 14 / 45
  • 22. DDL Operations (1/3) Create tables -- Creates a table with three columns CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’; Create tables with partitions -- Creates a table with three columns and a partition column -- /db/customer2/country=SE; -- /db/customer2/country=IR; CREATE TABLE customer2 (id INT, name STRING, address STRING) PARTITION BY (country STRING) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 14 / 45
  • 23. DDL Operations (2/3) Create tables with buckets -- Specify the columns to bucket on and the number of buckets -- /db/customer3/000000_0 -- /db/customer3/000000_1 -- /db/customer3/000000_2 set hive.enforce.bucketing = true; CREATE TABLE customer3 (id INT, name STRING, address STRING) CLUSTERED BY (id) INTO 3 BUCKETS; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 15 / 45
  • 24. DDL Operations (2/3) Create tables with buckets -- Specify the columns to bucket on and the number of buckets -- /db/customer3/000000_0 -- /db/customer3/000000_1 -- /db/customer3/000000_2 set hive.enforce.bucketing = true; CREATE TABLE customer3 (id INT, name STRING, address STRING) CLUSTERED BY (id) INTO 3 BUCKETS; Browsing through tables -- lists all the tables SHOW TABLES; -- shows the list of columns DESCRIBE customer; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 15 / 45
  • 25. DDL Operations (3/3) Altering tables -- rename the customer table to alaki ALTER TABLE customer RENAME TO alaki; -- add two new columns to the customer table ALTER TABLE customer ADD COLUMNS (job STRING); ALTER TABLE customer ADD COLUMNS (grade INT COMMENT ’some comment’); Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 16 / 45
  • 26. DDL Operations (3/3) Altering tables -- rename the customer table to alaki ALTER TABLE customer RENAME TO alaki; -- add two new columns to the customer table ALTER TABLE customer ADD COLUMNS (job STRING); ALTER TABLE customer ADD COLUMNS (grade INT COMMENT ’some comment’); Dropping tables DROP TABLE customer; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 16 / 45
  • 27. DML Operations Loading data from flat files. -- if ’LOCAL’ is omitted then it looks for the file in HDFS. -- the ’OVERWRITE’ signifies that existing data in the table is deleted. -- if the ’OVERWRITE’ is omitted, data are appended to existing data sets. LOAD DATA LOCAL INPATH ’data.txt’ OVERWRITE INTO TABLE customer; -- loads data into different partitions LOAD DATA LOCAL INPATH ’data1.txt’ OVERWRITE INTO TABLE customer2 PARTITION (country=’SE’); LOAD DATA LOCAL INPATH ’data2.txt’ OVERWRITE INTO TABLE customer2 PARTITION (country=’IR’); Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 17 / 45
  • 28. DML Operations Loading data from flat files. -- if ’LOCAL’ is omitted then it looks for the file in HDFS. -- the ’OVERWRITE’ signifies that existing data in the table is deleted. -- if the ’OVERWRITE’ is omitted, data are appended to existing data sets. LOAD DATA LOCAL INPATH ’data.txt’ OVERWRITE INTO TABLE customer; -- loads data into different partitions LOAD DATA LOCAL INPATH ’data1.txt’ OVERWRITE INTO TABLE customer2 PARTITION (country=’SE’); LOAD DATA LOCAL INPATH ’data2.txt’ OVERWRITE INTO TABLE customer2 PARTITION (country=’IR’); Store the query results in tables INSERT OVERWRITE TABLE customer SELECT * From old_customers; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 17 / 45
  • 29. Query Operations (1/3) Selects and filters SELECT id FROM customer2 WHERE country=’SE’; -- selects all rows from customer table into a local directory INSERT OVERWRITE LOCAL DIRECTORY ’/tmp/hive-sample-out’ SELECT * FROM customer; -- selects all rows from customer2 table into a directory in hdfs INSERT OVERWRITE DIRECTORY ’/tmp/hdfs_ir’ SELECT * FROM customer2 WHERE country=’IR’; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 18 / 45
  • 30. Query Operations (2/3) Aggregations and groups SELECT MAX(id) FROM customer; SELECT country, COUNT(*), SUM(id) FROM customer2 GROUP BY country; INSERT TABLE high_id_customer SELECT c.name, COUNT(*) FROM customer c WHERE c.id > 10 GROUP BY c.name; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 19 / 45
  • 31. Query Operations (3/3) Join CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’; CREATE TABLE order (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ’t’; SELECT * FROM customer c JOIN order o ON (c.id = o.cus_id); SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order GROUP BY cus_id) ce ON (c.id = ce.cus_id) INSERT OVERWRITE TABLE order_customer; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 20 / 45
  • 32. User-Defined Function (UDF) package com.example.hive.udf; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.Text; public final class Lower extends UDF { public Text evaluate(final Text s) { if (s == null) { return null; } return new Text(s.toString().toLowerCase()); } } -- Register the class CREATE FUNCTION my_lower AS ’com.example.hive.udf.Lower’; -- Using the function SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title); Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 21 / 45
  • 33. Executing SQL Questions Processes HiveQL statements and generates the execution plan through three-phase processes. 1 Query parsing: transforms a query string to a parse tree representa- tion. 2 Logical plan generation: converts the internal query representation to a logical plan, and optimizes it. 3 Physical plan generation: split the optimized logical plan into multiple map/reduce and HDFS tasks. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 22 / 45
  • 34. Optimization (1/2) Column pruning • Projecting out the needed columns. Predicate pushdown • Filtering rows early in the processing, by pushing down predicates to the scan (if possible). Partition pruning • Pruning out files of partitions that do not satisfy the predicate. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 23 / 45
  • 35. Optimization (2/2) Map-side joins • The small tables are replicated in all the mappers and joined with other tables. • No reducer needed. Join reordering • Only materialized and kept small tables in memory. • This ensures that the join operation does not exceed memory limits on the reducer side. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 24 / 45
  • 36. Hive Components (1/8) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 25 / 45
  • 37. Hive Components (2/8) External interfaces • User interfaces, e.g., CLI and web UI • Application programming interfaces, e.g., JDBC and ODBC • Thrift, a framework for cross-language services. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 26 / 45
  • 38. Hive Components (3/8) Driver • Manages the life cycle of a HiveQL statement during compilation, optimization and execution. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 27 / 45
  • 39. Hive Components (4/8) Compiler (Parser/Query Optimizer) • Translates the HiveQL statement into a a logical plan, and optimizes it. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 28 / 45
  • 40. Hive Components (5/8) Physical plan • Transforms the logical plan into a DAG of Map/Reduce jobs. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 29 / 45
  • 41. Hive Components (6/8) Execution engine • The driver submits the individual mapreduce jobs from the DAG to the execution engine in a topological order. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 30 / 45
  • 42. Hive Components (7/8) SerDe • Serializer/Deserializer allows Hive to read and write table rows in any custom format. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 31 / 45
  • 43. Hive Components (8/8) Metastore • The system catalog. • Contains metadata about the tables. • Metadata is specified during table creation and reused every time the table is referenced in HiveQL. • Metadatas are stored on either a traditional relational database, e.g., MySQL, or file system and not HDFS. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 32 / 45
  • 44. Hive on Spark Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 33 / 45
  • 45. Spark RDD - Reminder RDDs are immutable, partitioned collections that can be created through various transformations, e.g., map, groupByKey, join. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 34 / 45
  • 46. Executing SQL over Spark RDDs Shark runs SQL queries over Spark using three-step process: 1 Query parsing: Shark uses Hive query compiler to parse the query and generate a parse tree. 2 Logical plan generation: the tree is turned into a logical plan and basic logical optimization is applied. 3 Physical plan generation: Shark applies additional optimization and creates a physical plan consisting of transformations on RDDs. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 35 / 45
  • 47. Hive Components Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 36 / 45
  • 48. Shark Components Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 37 / 45
  • 49. Shark and Spark Shark extended RDD execution model: • Partial DAG Execution (PDE): to re-optimize a running query after running the first few stages of its task DAG. • In-memory columnar storage and compression: to process relational data efficiently. • Control over data partitioning. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 38 / 45
  • 50. Partial DAG Execution (1/2) How to optimize the following query? SELECT * FROM table1 a JOIN table2 b ON (a.key = b.key) WHERE my_crazy_udf(b.field1, b.field2) = true; Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 39 / 45
  • 51. Partial DAG Execution (1/2) How to optimize the following query? SELECT * FROM table1 a JOIN table2 b ON (a.key = b.key) WHERE my_crazy_udf(b.field1, b.field2) = true; It can not use cost-based optimization techniques that rely on ac- curate a priori data statistics. They require dynamic approaches to query optimization. Partial DAG Execution (PDE): dynamic alteration of query plans based on data statistics collected at run-time. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 39 / 45
  • 52. Partial DAG Execution (2/2) The workers gather customizable statistics at global and per- partition granularities at run-time. Each worker sends the collected statistics to the master. The master aggregates the statistics and alters the query plan based on such statistics. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 40 / 45
  • 53. Columnar Memory Store Simply caching Hive records as JVM objects is inefficient. 12 to 16 bytes of overhead per object in JVM implementation: • e.g., storing a 270MB table as JVM objects uses approximately 971 MB of memory. Shark employs column-oriented storage using arrays of primitive ob- jects. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 41 / 45
  • 54. Data Partitioning Shark allows co-partitioning two tables, which are frequently joined together, on a common key for faster joins in subsequent queries. Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 42 / 45
  • 55. Shark/Spark Integration Shark provides a simple API for programmers to convert results from SQL queries into a special type of RDDs: sql2rdd. val youngUsers = sql2rdd("SELECT * FROM users WHERE age < 20") println(youngUsers.count) val featureMatrix = youngUsers.map(extractFeatures(_)) kmeans(featureMatrix) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 43 / 45
  • 56. Summary Operators: DDL, DML, SQL Hive architecture vs. Shark architecture Add advance features to Spark, e.g., PDE, columnar memory store Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 44 / 45
  • 57. Questions? Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 45 / 45