SlideShare a Scribd company logo
Apache Hive Overview
Ten Tools for Ten Big Data Areas
Series 04 Big Data Warehouse
www.sparkera.ca
Ten Tools for Ten Big Data Areas – Overview
2© Sparkera. Confidential. All Rights Reserved
10 Tools
10 Areas
Programming
SearchandIndex
First ETL fully on Yarn
Data storing platform
Data computing platform
SQL & Metadata
Visualize with just few clicks
Powerful as Java
Simple as Python
real-time
streaming
Made easier
Yours
Google
Lightning-fast cluster computing
Real-time distributed data store
High throughput
distributed messaging
Agenda
3© Sparkera. Confidential. All Rights Reserved
About Apache Hive History
2 Apache Hive Architecture
3 Hive Data Type, Table, Partition, Bucket, View
4 Hive Data Definition and Manipulation Language
5 Hive Data Exchange and Sorting
1
Apache Hive Overview
• Hive is an top Apache project which provides Data Warehousing Solution on Hadoop
• Provides SQL-like query language named HiveQL - HQL, minimal learning curve
• Supports running on different computing frameworks, such as tez, mapreduce, spark
• Supports ad hoc querying data on HDFS and HBase
• Supports user-defined functions, scripts, and customized I/O format to extend
functionality
• Matured JDBC and ODBC drivers allow many applications to pull Hive data for
seamless reporting
• Hive has a well-defined architecture for metadata management, authentication,
optimizations
4© Sparkera. Confidential. All Rights Reserved
Apache Hive History
• AUG 2007 – Journey Start from Facebook
• MAY 2013 – Hive 0.11.0 as Stinger Phase 1 - ORC, HiveSever2
• OCT 2013 – Hive 0.12.0 as Stinger Phase 2 - ORC improvement
• APR 2014 – Hive 0.13.0 as Stinger Phase 3 - vectorized query engine and Tez
• NOV 2014 – Hive 0.14.0 as Stinger.next Phase 1- Cost-based optimizer
• FEB 2015 – Hive 1.0.0
• MAR 2015 – Hive 1.1.0
• MAY 2015 – Hive 1.2.0
Most BI and ETL tools leverage Hive as connector
The Stinger Initiative - Making Apache Hive 100 Times Faster
5© Sparkera. Confidential. All Rights Reserved
Hive vs. MapReduce – Word Count
Vs.
67 Lines
Hive Metadata
• To support features like schema(s) and data partitioning Hive keeps its metadata
in a Relational Database
• By default, Hive is Packaged with Derby, a lightweight embedded SQL DB
 Default Derby based is good for evaluation an testing
 Schema is not shared between users as each user has their own instance of
embedded Derby
 Stored in metastore_db directory which resides in the directory that hive was started
from
• Can easily switch another SQL installation such as MySQL
• HCatalog as part of Hive exposures Hive metadata to other ecosystem
Hive Architecture
JDBC/ODBC/Other
API
Thrift Server
(Hive Server 1 and 2)
Hive Web Interface Hive Command Line
Driver
(Compiler,Optimizer,Plan Executor)
Metastore
Hive Interface – Command Line
• CML and Beeline
• Command Mode
Hive Interface – Command Line Cont.
• Interactive Mode
Example of JDBC URL:
jdbc:hive2://ubuntu:11000/db2;user=foo;password=barc
Hive Interface – Others
• Hive Web Interface|Hue|JDBC/ODBC|IDE
Data Type – Primitive Type
Type Example Type Example
TINYINT 10Y SMALLINT 10S
INT 10 BIGINT 100L
FLOAT 1.342 DOUBLE 1.234
DECIMAL 3.14 BINARY 1010
BOOLEAN TRUE STRING ‘Book’ or “Book”
CHAR ‘YES’ or “YES” VARCHAR ‘Book’ or “Book”
DATE ‘2013-01-31’ TIMESTAMP ‘2013-01-31
00:13:00.345’
Data Type – Complex Type
• ARRAY has same type for all the elements – equal to MAP uses sequence from 0 as keys
• MAP has same type of key value pairs
• STRUCT likes table/records
Type Example Define Example
ARRAY [‘Apple’,’Orange’,’M
ongo’]
ARRAY<string> a[0] = ‘Apple’
MAP {‘A’:’Apple’,’O’:’Oran
ge’}
MAP<string,string> b[‘A’] = ‘Apple’
STRUCT {‘Apple’:2} STRUCT<fruit:string,weight:int> c.weight = 2
Hive Data Structure
Data Structure Logical Physical
Database A collection of tables Folder with files
Table A collection of rows of data Folder with files
Partition Columns to split data Folder
Buckets Columns to distribute data Files
Row Line of records Line in a file
Columns Slice of records Specified positions in each line
Views Shortcut of rows of data n/a
Index Statistics of data Folder with files
Hive Database
The database in describes a collection of tables that are used for a similar purpose or
belong to the same group. If the database is not specified (use database_name), the
default database is used. Hive creates a directory for each database at
/user/hive/warehouse, which can be defined through hive.metastore.warehouse.dir
property except default database.
• CREATE DATABASE IF NOT EXISTS myhivebook;
• SHOW DATABASES;
• DESCRIBE DATABASE default; //Provide more details than ’SHOW’, such as
comments, location
• DROP DATABASE IF EXISTS myhivebook CASCADE;
• ALTER DATABASE myhivebook SET OWNER user willddy;
Hive Tables - Concept
External Tables
Data is kept in the HDFS path specified by LOCATION keywords. The data is not
managed by Hive fully since drop the table (metadata) will not delete the data
Internal Tables/Managed Table
Data is kept in the default path, such as /user/hive/warehouse/employee. The data
is fully managed by Hive since drop the table (metadata) will also delete the data
Hive Tables - DDL
CREATE EXTERNAL TABLE IF NOT EXISTS employee_external (
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
COMMENT 'This is an external table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/user/dayongd/employee';
IF NOT EXISTS is optional
TEMPORARY table is also supported
List of columns with data type
There is no constrain support
Table comment is optional
How to separate columns
How to separate MAP and
Collections
File format
Data files path in
HDFS (URI)
Side Notes - Delimiters
The default delimiters in Hive are as follows:
• Field delimiter: Can be used with Ctrl + A or ^A (Use 001 when creating the table)
• Collection item delimiter: Can be used with Ctrl + B or ^B (002)
• Map key delimiter: Can be used with Ctrl + C or ^C (003)
Issues
If delimiter is overridden during the table creation, it only works in flat structure JIRA Hive-
365. For nested types, the level of nesting determines the delimiter, for example
• ARRAY of ARRAY – Outer ARRAY is ^B, inner ARRAY is ^C
• MAP of ARRAY – Outer MAP is ^C, inner ARRAY is ^D
Hive Table – DDL – Create Table
• CTAS – CREATE TABLE ctas_employee as SELECT * FROM employee
• CTAS cannot create a partition, external, or bucket table
• CTAS with Common Table Expression (CTE)
CREATE TABLE cte_employee AS
WITH
r1 AS (SELECT name FROM r2 WHERE name = 'Michael'),
r2 AS (SELECT name FROM employee WHERE sex_age.sex= 'Male'),
r3 AS (SELECT name FROM employee WHERE sex_age.sex= 'Female')
SELECT * FROM r1 UNION ALL SELECT * FROM r3;
• Create table like other table (fast), such as CREATE TABLE employee_like LIKE employee
Hive Table – Drop/Truncate/Alter Table
• DROP TABLE IF EXISTS employee [PERGE] statement removes metadata completely and
move data to .Trash folder in the user home directory in HDFS if configured. With PERGE
option (since Hive 0.14.0), the data is removed completely. When to drop an external
table, the data is not removed. Such as DROP TABLE IF EXISTS employee;
• TRUNCATE TABLE employee statement removes all rows of data form an internal table.
• ALTER TABLE employee RENAME TO new_employee statement renames the table
• ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments’)
statement sets table property
• ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$’) statement set
SerDe properties
• ALTER TABLE c_employee SET FILEFORMAT RCFILE statement sets file format.
Hive Table – Alter Table Columns
• ALTER TABLE employee_internal CHANGE name employee_name STRING
[BEFORE|AFTER] sex_age statement can be used to change column name, position or
type
• ALTER TABLE c_employee ADD COLUMNS (work string) statement add another column
and type to the table at the end.
• ALTER TABLE c_employee REPLACE COLUMNS (name string) statement replace all
columns in the table by specified column and type. After ALTER in this example, there is
only one column in the table.
Note:
The ALTER TABLE statement will only modify the metadata of Hive, not actually data. User
should make sure the actual data conforms with the metadata definition.
Hive Partitions - Overview
• To increase performance Hive has the capability to partition data
 The values of partitioned column divide a table into segments (folders)
 Entire partitions can be ignored at query time
 Similar to relational databases’ but not as advanced
• Partitions have to be properly created by users. When inserting data must
specify a partition
• There is no difference in schema between "partition" columns and regular
columns
• At query time, Hive will automatically filter out partitions for better
performance
Hive Partitions
CREATE TABLE employee_partitioned(
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
PARTITIONED BY (Year INT, Month INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Partition is NOT enabled automatically. We have
to add/delete partition by ALTER TABLE
statement
Hive Partitions – Dynamic Partitions
Hive also supports dynamically giving partition values. This is useful when data volume is
large and we don't know what will be the partition values. To enable it, see line 1.
By default, the user must specify at least one static partition column. This is to avoid
accidentally overwriting partitions. To disable this restriction, we can set the partition
mode to nonstrict (see line 3) from the default strict mode.
Hive Buckets - Overview
• The bucket corresponds to the segments of files in the HDFS
• Mechanism to query and examine random samples of data and speed JOIN
• Break data into a set of buckets based on a hash function of a "bucket column”
• Hive doesn’t automatically enforce bucketing. User is required to specify the
number of buckets by setting # of reducer or set the enforced bucketing.
SET mapred.reduce.tasks = 256;
SET hive.enforce.bucketing = true;
• To define number of buckets, we should avoid having too much or too little data in each
buckets. A better choice somewhere near two blocks of data. Use 2N as the number of
buckets
Hive Buckets - DDL
• Use ‘CLUSTERED BY’ statements to define
buckets (Line 10)
• To populate the data into the buckets, we
have to use INSERT statement instead of
LOAD statement (Line 16 – 17) since it
does not verify the data against metadata
definition (copy files to the folders only)
• Buckets are list of files segments (Line 23
and 24)
Hive Views – Overview
• View is a logical structure to simplify query by hiding sub query, join, functions
in a virtual table.
• Hive view does not store data or get materialized
• Once the view is created, its schema is frozen immediately. Subsequent changes
to the underlying table schema (such as adding a column) will not be reflected.
• If the underlying table is dropped or changed, querying the view will be failed.
CREATE VIEW view_name AS SELECT statement;
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = 'This is a view');
DROP VIEW view_name;
Hive SELECT Statement
• SELECT statement is used to project the rows meeting query conditions specified
• Hive SELECT statement is subset of database standard SQL
Examples
• SELECT *, SELECT column_name, SELECT DISTINCT
• SELECT * FROM employee LIMIT 5 //Limit rows returned
• WITH t1 AS (SELECT …) SELECT * FROM t1 //CTE – Common Table Expression
• SELECT * FROM (SELECT * FROM employee) a; //Nested query
• SELECT name, sex_age.sex AS sex FROM employee a WHERE EXISTS
(SELECT * FROM employee b WHERE a.sex_age.sex = b.sex_age.sex AND b.sex_age.sex = 'male');
//The subquery support EXIST, NOT EXSIT, IN, NOT IN.
However, nested subquery is not supported. IN and NOT IN only support one column.
Hive Join - Overview
• JOIN statement is used to combine the rows from two or more tables together
• Hive JOIN statement is similar to database JOIN. But Hive does not support unequal JOIN.
• INNER JOIN, OUTER JOIN (RIGHT OUTER JOIN, LEFT OURTER JOIN, FULL OUTER JOIN), and
CROSS JOIN (Cartesian product/JOIN ON 1=1), Implicit JOIN (INNER JOIN without JOIN
keywords but use , separate tables)
Example
• C = C1 JOIN C2
• A = C1 LEFT OUTER JOIN C2
• B = C1 RIGHT OUTER JOIN C2
• AUBUC = C1 FULL OUTER JOIN C2
A BC
Hive Join – MAPJOIN
• MAPJOIN statement means doing JOIN only by map without reduce job. The MAPJOIN
statement reads all the data from the small table to memory and broadcast to all maps.
• When hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN
to MAPJOIN at runtime if possible instead of checking the MAPJOIN hint.
Example:
SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee emp JOIN employee_hr emph ON emp.name = emph.name;
The MAPJOIN operator does not support the following:
• Use MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER
BY/DISTRIBUTE BY
• Use MAPJOIN before by UNION, JOIN and other MAPJOIN
Hive Set Ops – UNION
• Hive support UNION ALL (keep duplications) and support UNION after v1.2.0
• Hive UNION ALL cannot use in the top level query until v0.13.0,
such as SELECT A UNION ALL SELECT B
• Other set operator can be implemented using JOIN/OUTER JOIN to implement
Example:
//MINUS
jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> LEFT JOIN employee_hr b
. . . . . . .> ON a.name = b.name
. . . . . . .> WHERE b.name IS NULL;
//INTERCEPT
jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> JOIN employee_hr b
. . . . . . .> ON a.name = b.name;
Hive Data Movement – LOAD
• To move data in Hive, it uses the LOAD keyword. Move here means the original
data is moved to the target table/partition and does not exist in the original
place anymore.
LOCAL keywords specifies where the files are located in the host
OVERWRITE is used to decide whether to append or replace the existing data
Hive Data Exchange – INSERT
• To insert data into table/partition, Hive uses INSERT statement.
• Generally speaking, Hive INSERT is weak than what’s in the DBMS.
• Hive INSERT supports OVERWRITE and INTO syntax
• Hive only support INSERT … SELECT (INSERT VALUES start from v0.14.0 on
special ACID tables only)
• Hive use INSERT OVERWRITE LOCAL DIRECTORY local_directory SELECT * FROM
employee to insert data to the local files (opposite to LOAD statement). Only
“INTO” keywords are supported. There is no support of appending data.
• Hive supports multiple INSERT from the same table
Hive Data Exchange – INSERT Example
Better performance
Ways to export data
Hive Data Exchange – [EX|IM]PORT
• Hive uses IMPORT and EXPORT statement to deal with data migration
• All data and metadata are exported or imported
• The EXPORT statement export data in a subdirectory called data and metadata in a file
called _metadata. For example,
 EXPORT TABLE employee TO '/user/dayongd/output3';
 EXPORT TABLE employee_partitioned partition
(year=2014, month=11) TO '/user/dayongd/output5';
• After EXPORT, we can manually copy the exported files to the other HDFS. Then, import
them with IMPORT statement.
 IMPORT TABLE empolyee_imported FROM '/user/dayongd/output3';
 IMPORT TABLE employee_partitioned_imported FROM '/user/dayongd/output5’;
Hive Sorting Data – ORDER and SORT
• ORDER BY (ASC|DESC) is similar to standard SQL
• ORDER BY in Hive performs a global sort of data by using only one reducer. For
example, SELECT name FROM employee ORDER BY NAME DESC;
• SORT BY (ASC|DESC) decides how to sort the data in each reducer.
• When number of reducer is set to 1,
it is equal to ORDER BY. For example,
When there is more than 1
reducer, the data is not sort
properly.
Hive Sorting Data – DISTRIBUTE
• DISTRIBUTE BY is similar to the GROUP BY statement in standard SQL
• It makes sure the rows with matching column value will be partitioned to the
same reducer
• When to use with SORT BY, put DISTRIBUTE BY before SORT BY (since partition is
ahead of reducer sort)
• The distributed column must appear in the SELECT column list.
SELECT name, employee_id
FROM employee_hr
DISTRIBUTE BY employee_id SORT BY name;
Hive Sorting Data – CLUSTER
• CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column
• CLUSTER BY does not support DESC yet
• To fully utilize all reducer to perform global sort, we can use CLUSTER BY first,
then ORDER BY after.
• An example,
SELECT name, employee_id
FROM employee_hr CLUSTER BY name;
Hive Other Topics Not Covered Today
• Hive operators
• Hive UDFs
• Hive aggregations
• UDFs extension
• Streaming
• Hive securities
• Hue
www.sparkera.ca
BIG DATA is not only about data,
but the understanding of the data
and how people use data actively to improve their life.

More Related Content

What's hot

Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For HadoopCloudera, Inc.
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
Prashant Gupta
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
Jeff Hammerbacher
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
Mahmood Reza Esmaili Zand
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
nzhang
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Skills Matter
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
Dushhyant Kumar
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
Guy Harrison
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
Subhas Kumar Ghosh
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Hive Hadoop
Hive HadoopHive Hadoop
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 

What's hot (19)

Hw09 Sqoop Database Import For Hadoop
Hw09   Sqoop Database Import For HadoopHw09   Sqoop Database Import For Hadoop
Hw09 Sqoop Database Import For Hadoop
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 

Viewers also liked

Big data skew
Big data skewBig data skew
Big data skew
ayan ray
 
Hive introduction 介绍
Hive  introduction 介绍Hive  introduction 介绍
Hive introduction 介绍
ablozhou
 
User-Defined Table Generating Functions
User-Defined Table Generating FunctionsUser-Defined Table Generating Functions
User-Defined Table Generating Functionspauly1
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheCon
amarsri
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
Uday Vakalapudi
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
Zheng Shao
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive	Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive
Alex Silva
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)
Thomas Vanhove
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive
Liyin Tang
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
Namit Jain
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
Shrihari Rathod
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object Model
Zheng Shao
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive Hook
Minwoo Kim
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
Zheng Shao
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
Recruit Technologies
 
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Nagato Kasaki
 

Viewers also liked (20)

Big data skew
Big data skewBig data skew
Big data skew
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Hive introduction 介绍
Hive  introduction 介绍Hive  introduction 介绍
Hive introduction 介绍
 
User-Defined Table Generating Functions
User-Defined Table Generating FunctionsUser-Defined Table Generating Functions
User-Defined Table Generating Functions
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheCon
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive	Data Engineering with Spring, Hadoop and Hive
Data Engineering with Spring, Hadoop and Hive
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Hive Object Model
Hive Object ModelHive Object Model
Hive Object Model
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive Hook
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
Hive on Spark を活用した高速データ分析 - Hadoop / Spark Conference Japan 2016
 

Similar to Ten tools for ten big data areas 04_Apache Hive

Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
earnwithme2522
 
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSHive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
RUHULAMINHAZARIKA
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Michael Rys
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
Subhas Kumar Ghosh
 
HivePart1.pptx
HivePart1.pptxHivePart1.pptx
HivePart1.pptx
RamyaMurugesan12
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Michael Rys
 
Apache hive
Apache hiveApache hive
Apache hive
Ayapparaj SKS
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
Chirag Ahuja
 
Apache Hive
Apache HiveApache Hive
Apache Hive
Amit Khandelwal
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
Eyad Garelnabi
 
Implementing the Databese Server session 02
Implementing the Databese Server session 02Implementing the Databese Server session 02
Implementing the Databese Server session 02Guillermo Julca
 
Hive
HiveHive
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Michael Rys
 
Apache Hive
Apache HiveApache Hive
Apache Hive
tusharsinghal58
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Michael Rys
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
Remus Rusanu
 
1650607.ppt
1650607.ppt1650607.ppt
1650607.ppt
KalsoomTahir2
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
Luis Marques
 

Similar to Ten tools for ten big data areas 04_Apache Hive (20)

Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSHive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
HivePart1.pptx
HivePart1.pptxHivePart1.pptx
HivePart1.pptx
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Apache hive
Apache hiveApache hive
Apache hive
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Implementing the Databese Server session 02
Implementing the Databese Server session 02Implementing the Databese Server session 02
Implementing the Databese Server session 02
 
Hive
HiveHive
Hive
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
 
1650607.ppt
1650607.ppt1650607.ppt
1650607.ppt
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 

Recently uploaded

学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
zyfovom
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
CIOWomenMagazine
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
eutxy
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
vmemo1
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
cuobya
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
Danica Gill
 
Understanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdfUnderstanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdf
SEO Article Boost
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
Trish Parr
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
nhiyenphan2005
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
cuobya
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
harveenkaur52
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
JeyaPerumal1
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 

Recently uploaded (20)

学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
 
Understanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdfUnderstanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdf
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
2.Cellular Networks_The final stage of connectivity is achieved by segmenting...
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 

Ten tools for ten big data areas 04_Apache Hive

  • 1. Apache Hive Overview Ten Tools for Ten Big Data Areas Series 04 Big Data Warehouse www.sparkera.ca
  • 2. Ten Tools for Ten Big Data Areas – Overview 2© Sparkera. Confidential. All Rights Reserved 10 Tools 10 Areas Programming SearchandIndex First ETL fully on Yarn Data storing platform Data computing platform SQL & Metadata Visualize with just few clicks Powerful as Java Simple as Python real-time streaming Made easier Yours Google Lightning-fast cluster computing Real-time distributed data store High throughput distributed messaging
  • 3. Agenda 3© Sparkera. Confidential. All Rights Reserved About Apache Hive History 2 Apache Hive Architecture 3 Hive Data Type, Table, Partition, Bucket, View 4 Hive Data Definition and Manipulation Language 5 Hive Data Exchange and Sorting 1
  • 4. Apache Hive Overview • Hive is an top Apache project which provides Data Warehousing Solution on Hadoop • Provides SQL-like query language named HiveQL - HQL, minimal learning curve • Supports running on different computing frameworks, such as tez, mapreduce, spark • Supports ad hoc querying data on HDFS and HBase • Supports user-defined functions, scripts, and customized I/O format to extend functionality • Matured JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting • Hive has a well-defined architecture for metadata management, authentication, optimizations 4© Sparkera. Confidential. All Rights Reserved
  • 5. Apache Hive History • AUG 2007 – Journey Start from Facebook • MAY 2013 – Hive 0.11.0 as Stinger Phase 1 - ORC, HiveSever2 • OCT 2013 – Hive 0.12.0 as Stinger Phase 2 - ORC improvement • APR 2014 – Hive 0.13.0 as Stinger Phase 3 - vectorized query engine and Tez • NOV 2014 – Hive 0.14.0 as Stinger.next Phase 1- Cost-based optimizer • FEB 2015 – Hive 1.0.0 • MAR 2015 – Hive 1.1.0 • MAY 2015 – Hive 1.2.0 Most BI and ETL tools leverage Hive as connector The Stinger Initiative - Making Apache Hive 100 Times Faster 5© Sparkera. Confidential. All Rights Reserved
  • 6. Hive vs. MapReduce – Word Count Vs. 67 Lines
  • 7. Hive Metadata • To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database • By default, Hive is Packaged with Derby, a lightweight embedded SQL DB  Default Derby based is good for evaluation an testing  Schema is not shared between users as each user has their own instance of embedded Derby  Stored in metastore_db directory which resides in the directory that hive was started from • Can easily switch another SQL installation such as MySQL • HCatalog as part of Hive exposures Hive metadata to other ecosystem
  • 8. Hive Architecture JDBC/ODBC/Other API Thrift Server (Hive Server 1 and 2) Hive Web Interface Hive Command Line Driver (Compiler,Optimizer,Plan Executor) Metastore
  • 9. Hive Interface – Command Line • CML and Beeline • Command Mode
  • 10. Hive Interface – Command Line Cont. • Interactive Mode Example of JDBC URL: jdbc:hive2://ubuntu:11000/db2;user=foo;password=barc
  • 11. Hive Interface – Others • Hive Web Interface|Hue|JDBC/ODBC|IDE
  • 12. Data Type – Primitive Type Type Example Type Example TINYINT 10Y SMALLINT 10S INT 10 BIGINT 100L FLOAT 1.342 DOUBLE 1.234 DECIMAL 3.14 BINARY 1010 BOOLEAN TRUE STRING ‘Book’ or “Book” CHAR ‘YES’ or “YES” VARCHAR ‘Book’ or “Book” DATE ‘2013-01-31’ TIMESTAMP ‘2013-01-31 00:13:00.345’
  • 13. Data Type – Complex Type • ARRAY has same type for all the elements – equal to MAP uses sequence from 0 as keys • MAP has same type of key value pairs • STRUCT likes table/records Type Example Define Example ARRAY [‘Apple’,’Orange’,’M ongo’] ARRAY<string> a[0] = ‘Apple’ MAP {‘A’:’Apple’,’O’:’Oran ge’} MAP<string,string> b[‘A’] = ‘Apple’ STRUCT {‘Apple’:2} STRUCT<fruit:string,weight:int> c.weight = 2
  • 14. Hive Data Structure Data Structure Logical Physical Database A collection of tables Folder with files Table A collection of rows of data Folder with files Partition Columns to split data Folder Buckets Columns to distribute data Files Row Line of records Line in a file Columns Slice of records Specified positions in each line Views Shortcut of rows of data n/a Index Statistics of data Folder with files
  • 15. Hive Database The database in describes a collection of tables that are used for a similar purpose or belong to the same group. If the database is not specified (use database_name), the default database is used. Hive creates a directory for each database at /user/hive/warehouse, which can be defined through hive.metastore.warehouse.dir property except default database. • CREATE DATABASE IF NOT EXISTS myhivebook; • SHOW DATABASES; • DESCRIBE DATABASE default; //Provide more details than ’SHOW’, such as comments, location • DROP DATABASE IF EXISTS myhivebook CASCADE; • ALTER DATABASE myhivebook SET OWNER user willddy;
  • 16. Hive Tables - Concept External Tables Data is kept in the HDFS path specified by LOCATION keywords. The data is not managed by Hive fully since drop the table (metadata) will not delete the data Internal Tables/Managed Table Data is kept in the default path, such as /user/hive/warehouse/employee. The data is fully managed by Hive since drop the table (metadata) will also delete the data
  • 17. Hive Tables - DDL CREATE EXTERNAL TABLE IF NOT EXISTS employee_external ( name string, work_place ARRAY<string>, sex_age STRUCT<sex:string,age:int>, skills_score MAP<string,int>, depart_title MAP<STRING,ARRAY<STRING>> ) COMMENT 'This is an external table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':' STORED AS TEXTFILE LOCATION '/user/dayongd/employee'; IF NOT EXISTS is optional TEMPORARY table is also supported List of columns with data type There is no constrain support Table comment is optional How to separate columns How to separate MAP and Collections File format Data files path in HDFS (URI)
  • 18. Side Notes - Delimiters The default delimiters in Hive are as follows: • Field delimiter: Can be used with Ctrl + A or ^A (Use 001 when creating the table) • Collection item delimiter: Can be used with Ctrl + B or ^B (002) • Map key delimiter: Can be used with Ctrl + C or ^C (003) Issues If delimiter is overridden during the table creation, it only works in flat structure JIRA Hive- 365. For nested types, the level of nesting determines the delimiter, for example • ARRAY of ARRAY – Outer ARRAY is ^B, inner ARRAY is ^C • MAP of ARRAY – Outer MAP is ^C, inner ARRAY is ^D
  • 19. Hive Table – DDL – Create Table • CTAS – CREATE TABLE ctas_employee as SELECT * FROM employee • CTAS cannot create a partition, external, or bucket table • CTAS with Common Table Expression (CTE) CREATE TABLE cte_employee AS WITH r1 AS (SELECT name FROM r2 WHERE name = 'Michael'), r2 AS (SELECT name FROM employee WHERE sex_age.sex= 'Male'), r3 AS (SELECT name FROM employee WHERE sex_age.sex= 'Female') SELECT * FROM r1 UNION ALL SELECT * FROM r3; • Create table like other table (fast), such as CREATE TABLE employee_like LIKE employee
  • 20. Hive Table – Drop/Truncate/Alter Table • DROP TABLE IF EXISTS employee [PERGE] statement removes metadata completely and move data to .Trash folder in the user home directory in HDFS if configured. With PERGE option (since Hive 0.14.0), the data is removed completely. When to drop an external table, the data is not removed. Such as DROP TABLE IF EXISTS employee; • TRUNCATE TABLE employee statement removes all rows of data form an internal table. • ALTER TABLE employee RENAME TO new_employee statement renames the table • ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments’) statement sets table property • ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$’) statement set SerDe properties • ALTER TABLE c_employee SET FILEFORMAT RCFILE statement sets file format.
  • 21. Hive Table – Alter Table Columns • ALTER TABLE employee_internal CHANGE name employee_name STRING [BEFORE|AFTER] sex_age statement can be used to change column name, position or type • ALTER TABLE c_employee ADD COLUMNS (work string) statement add another column and type to the table at the end. • ALTER TABLE c_employee REPLACE COLUMNS (name string) statement replace all columns in the table by specified column and type. After ALTER in this example, there is only one column in the table. Note: The ALTER TABLE statement will only modify the metadata of Hive, not actually data. User should make sure the actual data conforms with the metadata definition.
  • 22. Hive Partitions - Overview • To increase performance Hive has the capability to partition data  The values of partitioned column divide a table into segments (folders)  Entire partitions can be ignored at query time  Similar to relational databases’ but not as advanced • Partitions have to be properly created by users. When inserting data must specify a partition • There is no difference in schema between "partition" columns and regular columns • At query time, Hive will automatically filter out partitions for better performance
  • 23. Hive Partitions CREATE TABLE employee_partitioned( name string, work_place ARRAY<string>, sex_age STRUCT<sex:string,age:int>, skills_score MAP<string,int>, depart_title MAP<STRING,ARRAY<STRING>> ) PARTITIONED BY (Year INT, Month INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':'; Partition is NOT enabled automatically. We have to add/delete partition by ALTER TABLE statement
  • 24. Hive Partitions – Dynamic Partitions Hive also supports dynamically giving partition values. This is useful when data volume is large and we don't know what will be the partition values. To enable it, see line 1. By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict (see line 3) from the default strict mode.
  • 25. Hive Buckets - Overview • The bucket corresponds to the segments of files in the HDFS • Mechanism to query and examine random samples of data and speed JOIN • Break data into a set of buckets based on a hash function of a "bucket column” • Hive doesn’t automatically enforce bucketing. User is required to specify the number of buckets by setting # of reducer or set the enforced bucketing. SET mapred.reduce.tasks = 256; SET hive.enforce.bucketing = true; • To define number of buckets, we should avoid having too much or too little data in each buckets. A better choice somewhere near two blocks of data. Use 2N as the number of buckets
  • 26. Hive Buckets - DDL • Use ‘CLUSTERED BY’ statements to define buckets (Line 10) • To populate the data into the buckets, we have to use INSERT statement instead of LOAD statement (Line 16 – 17) since it does not verify the data against metadata definition (copy files to the folders only) • Buckets are list of files segments (Line 23 and 24)
  • 27. Hive Views – Overview • View is a logical structure to simplify query by hiding sub query, join, functions in a virtual table. • Hive view does not store data or get materialized • Once the view is created, its schema is frozen immediately. Subsequent changes to the underlying table schema (such as adding a column) will not be reflected. • If the underlying table is dropped or changed, querying the view will be failed. CREATE VIEW view_name AS SELECT statement; ALTER VIEW view_name SET TBLPROPERTIES ('comment' = 'This is a view'); DROP VIEW view_name;
  • 28. Hive SELECT Statement • SELECT statement is used to project the rows meeting query conditions specified • Hive SELECT statement is subset of database standard SQL Examples • SELECT *, SELECT column_name, SELECT DISTINCT • SELECT * FROM employee LIMIT 5 //Limit rows returned • WITH t1 AS (SELECT …) SELECT * FROM t1 //CTE – Common Table Expression • SELECT * FROM (SELECT * FROM employee) a; //Nested query • SELECT name, sex_age.sex AS sex FROM employee a WHERE EXISTS (SELECT * FROM employee b WHERE a.sex_age.sex = b.sex_age.sex AND b.sex_age.sex = 'male'); //The subquery support EXIST, NOT EXSIT, IN, NOT IN. However, nested subquery is not supported. IN and NOT IN only support one column.
  • 29. Hive Join - Overview • JOIN statement is used to combine the rows from two or more tables together • Hive JOIN statement is similar to database JOIN. But Hive does not support unequal JOIN. • INNER JOIN, OUTER JOIN (RIGHT OUTER JOIN, LEFT OURTER JOIN, FULL OUTER JOIN), and CROSS JOIN (Cartesian product/JOIN ON 1=1), Implicit JOIN (INNER JOIN without JOIN keywords but use , separate tables) Example • C = C1 JOIN C2 • A = C1 LEFT OUTER JOIN C2 • B = C1 RIGHT OUTER JOIN C2 • AUBUC = C1 FULL OUTER JOIN C2 A BC
  • 30. Hive Join – MAPJOIN • MAPJOIN statement means doing JOIN only by map without reduce job. The MAPJOIN statement reads all the data from the small table to memory and broadcast to all maps. • When hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN to MAPJOIN at runtime if possible instead of checking the MAPJOIN hint. Example: SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number FROM employee emp JOIN employee_hr emph ON emp.name = emph.name; The MAPJOIN operator does not support the following: • Use MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY • Use MAPJOIN before by UNION, JOIN and other MAPJOIN
  • 31. Hive Set Ops – UNION • Hive support UNION ALL (keep duplications) and support UNION after v1.2.0 • Hive UNION ALL cannot use in the top level query until v0.13.0, such as SELECT A UNION ALL SELECT B • Other set operator can be implemented using JOIN/OUTER JOIN to implement Example: //MINUS jdbc:hive2://> SELECT a.name . . . . . . .> FROM employee a . . . . . . .> LEFT JOIN employee_hr b . . . . . . .> ON a.name = b.name . . . . . . .> WHERE b.name IS NULL; //INTERCEPT jdbc:hive2://> SELECT a.name . . . . . . .> FROM employee a . . . . . . .> JOIN employee_hr b . . . . . . .> ON a.name = b.name;
  • 32. Hive Data Movement – LOAD • To move data in Hive, it uses the LOAD keyword. Move here means the original data is moved to the target table/partition and does not exist in the original place anymore. LOCAL keywords specifies where the files are located in the host OVERWRITE is used to decide whether to append or replace the existing data
  • 33. Hive Data Exchange – INSERT • To insert data into table/partition, Hive uses INSERT statement. • Generally speaking, Hive INSERT is weak than what’s in the DBMS. • Hive INSERT supports OVERWRITE and INTO syntax • Hive only support INSERT … SELECT (INSERT VALUES start from v0.14.0 on special ACID tables only) • Hive use INSERT OVERWRITE LOCAL DIRECTORY local_directory SELECT * FROM employee to insert data to the local files (opposite to LOAD statement). Only “INTO” keywords are supported. There is no support of appending data. • Hive supports multiple INSERT from the same table
  • 34. Hive Data Exchange – INSERT Example Better performance Ways to export data
  • 35. Hive Data Exchange – [EX|IM]PORT • Hive uses IMPORT and EXPORT statement to deal with data migration • All data and metadata are exported or imported • The EXPORT statement export data in a subdirectory called data and metadata in a file called _metadata. For example,  EXPORT TABLE employee TO '/user/dayongd/output3';  EXPORT TABLE employee_partitioned partition (year=2014, month=11) TO '/user/dayongd/output5'; • After EXPORT, we can manually copy the exported files to the other HDFS. Then, import them with IMPORT statement.  IMPORT TABLE empolyee_imported FROM '/user/dayongd/output3';  IMPORT TABLE employee_partitioned_imported FROM '/user/dayongd/output5’;
  • 36. Hive Sorting Data – ORDER and SORT • ORDER BY (ASC|DESC) is similar to standard SQL • ORDER BY in Hive performs a global sort of data by using only one reducer. For example, SELECT name FROM employee ORDER BY NAME DESC; • SORT BY (ASC|DESC) decides how to sort the data in each reducer. • When number of reducer is set to 1, it is equal to ORDER BY. For example, When there is more than 1 reducer, the data is not sort properly.
  • 37. Hive Sorting Data – DISTRIBUTE • DISTRIBUTE BY is similar to the GROUP BY statement in standard SQL • It makes sure the rows with matching column value will be partitioned to the same reducer • When to use with SORT BY, put DISTRIBUTE BY before SORT BY (since partition is ahead of reducer sort) • The distributed column must appear in the SELECT column list. SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id SORT BY name;
  • 38. Hive Sorting Data – CLUSTER • CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column • CLUSTER BY does not support DESC yet • To fully utilize all reducer to perform global sort, we can use CLUSTER BY first, then ORDER BY after. • An example, SELECT name, employee_id FROM employee_hr CLUSTER BY name;
  • 39. Hive Other Topics Not Covered Today • Hive operators • Hive UDFs • Hive aggregations • UDFs extension • Streaming • Hive securities • Hue
  • 40. www.sparkera.ca BIG DATA is not only about data, but the understanding of the data and how people use data actively to improve their life.

Editor's Notes

  1. Lightning-fast cluster computing
  2. You need to use the special hiveconf for variable substitution. e.g. hive> set CURRENT_DATE='2015-06-26'; hive> select * from foo where day >= '${hiveconf:CURRENT_DATE}' similarly, you could pass on command line: % hive -hiveconf CURRENT_DATE='2015-06-26' -f test.hql Note that there are env and system variables as well, so you can reference ${env:USER} for example. To see all the available variables, from the command line, run % hive -e 'set;' or from the hive prompt, run hive> set;
  3. No sub-directory is allowed in the URI
  4. DROP TABLE IF EXISTS empty_ctas_employee;
  5. Data in the same buckets can join fast since they are likely have the same group of keys
  6. View is not likely optimized by optimizer
  7. View is not likely optimized by optimizer
  8. View is not likely optimized by optimizer
  9. View is not likely optimized by optimizer
  10. View is not likely optimized by optimizer
  11. View is not likely optimized by optimizer
  12. View is not likely optimized by optimizer
  13. View is not likely optimized by optimizer
  14. View is not likely optimized by optimizer
  15. View is not likely optimized by optimizer
  16. View is not likely optimized by optimizer
  17. View is not likely optimized by optimizer
  18. View is not likely optimized by optimizer