Apache Hive is an open-source data warehousing project that provides tools to enable easy data extraction, transformation, and querying for data stored in various databases and file systems. It allows users with no programming experience to query data using a SQL-like language called HiveQL. Hive provides structured data processing on top of Hadoop and allows for SQL queries, user-defined functions, and custom data formats. It also supports JDBC/ODBC connectivity to connect business intelligence tools and applications to the stored data.
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop (Someshwar Kale)
This presentation is based on my experience while learning Hive. Most of the features and limitations covered in this presentation were in an incubating phase at the time of writing.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
Interested in learning Hadoop, but overwhelmed by the number of components in the Hadoop ecosystem? Would you like to get some hands-on experience with Hadoop, but you don't know Linux or Java? This session focuses on giving a high-level explanation of Hive and HiveQL and how you can use them to get started with Hadoop without knowing Linux or Java.
Learning Objectives - This module will help you understand Apache Hive installation, loading and querying data in Hive, and more.
Topics - Hive Architecture and Installation, Comparison with Traditional Databases, HiveQL: Data Types, Operators and Functions, Hive Tables (Managed Tables and External Tables, Partitions and Buckets, Storage Formats, Importing Data, Altering Tables, Dropping Tables), Querying Data (Sorting and Aggregating, MapReduce Scripts, Joins & Subqueries, Views, Map- and Reduce-side Joins to Optimize Queries).
If you’re already a SQL user then working with Hadoop may be a little easier than you think, thanks to Apache Hive. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
This cheat sheet covers:
-- Query
-- Metadata
-- SQL Compatibility
-- Command Line
-- Hive Shell
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2xkCd84
This CloudxLab Introduction to Hive tutorial helps you to understand Hive in detail. Below are the topics covered in this tutorial:
1) Hive Introduction
2) Why Do We Need Hive?
3) Hive - Components
4) Hive - Limitations
5) Hive - Data Types
6) Hive - Metastore
7) Hive - Warehouse
8) Accessing Hive using Command Line
9) Accessing Hive using Hue
10) Tables in Hive - Managed and External
11) Hive - Loading Data From Local Directory
12) Hive - Loading Data From HDFS
13) S3 Based External Tables in Hive
14) Hive - Select Statements
15) Hive - Aggregations
16) Saving Data in Hive
17) Hive Tables - DDL - ALTER
18) Partitions in Hive
19) Views in Hive
20) Load JSON Data
21) Sorting & Distributing - Order By, Sort By, Distribute By, Cluster By
22) Bucketing in Hive
23) Hive - ORC Files
24) Connecting to Tableau using Hive
25) Analyzing MovieLens Data using Hive
26) Hands-on demos on CloudxLab
In this talk, we'll discuss the technical design of supporting HBase as a "native" data source for Spark SQL, to achieve both query and load performance and scalability: near-precise execution locality for queries and loading, fine-tuned partition pruning, predicate pushdown, plan execution through coprocessors, and an optimized, fully parallelized bulk loader. Point and range queries on dimensional attributes benefit particularly well from these techniques. Preliminary test results against established SQL-on-HBase technologies will be provided. The speaker will also share the future plan and real-world use cases, particularly in the telecom industry.
Introduction to Sqoop - Aaron Kimball, Cloudera - Hadoop User Group UK (Skills Matter)
In this talk of Hadoop User Group UK meeting, Aaron Kimball from Cloudera introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop's distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We'll also cover some deeper technical details of Sqoop's architecture, and take a look at some upcoming aspects of Sqoop's development roadmap.
Hive Training -- Motivations and Real World Use Cases (nzhang)
Hive is an open-source data warehouse system based on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
This deck presents the best practices of using Apache Hive with good performance. It covers getting data into Hive, using ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive Bucketing and reading Hive Explain query plans.
The document explains the problem, causes, and effects of data skew. It also explains different techniques to minimize data skew across various big data technologies such as MapReduce, Hive, and Pig.
Data Engineering with Spring, Hadoop and Hive Alex Silva
This presentation will outline the evolution of the monitoring data platform pipeline at Rackspace and explore the compute and data management challenges we have faced at this scale. We will focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, observations and pitfalls encountered in building the pipeline.
Apache Hive Hook
I couldn't find enough info about Hive hooks, so I made this. I hope this presentation will be useful when you want to use hooks.
It also includes some information about metastore event listeners.
It was written based on the release-0.11 tag.
Cost-based query optimization in Apache Hive (Julian Hyde)
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
Hive on Spark for High-Speed Data Analysis - Hadoop / Spark Conference Japan 2016 (Nagato Kasaki)
At DMM.com, we currently collect more than 100 million behavior-log records per day, along with content information from each service and open data such as regional information, and use them for data-driven marketing and marketing automation. However, as the data grew in scale and its uses diversified, data-processing latency became an issue. This talk presents a case study in which replacing the existing Hive processing with Hive on Spark cut daily batch-processing time to one third, and explains concretely how to adopt Hive on Spark and what its benefits are.
Hadoop / Spark Conference Japan 2016
http://www.eventbrite.com/e/hadoop-spark-conference-japan-2016-tickets-20809016328
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...) (Michael Rys)
When processing TB and PB of data, running your Big Data queries at scale and having them perform at peak is essential. In this session, we show you some state-of-the art tools on how to analyze U-SQL job performances and we discuss in-depth best practices on designing your data layout both for files and tables and writing performing and scalable queries using U-SQL. You will learn how to analyze performance and scale bottlenecks and will learn several tips on how to make your big data processing scripts both faster and scale better.
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635) (Michael Rys)
Data Lakes have become a new tool in building modern data warehouse architectures. In this presentation we will introduce Microsoft's Azure Data Lake offering and its new big data processing language, U-SQL, which makes big data processing easy by combining the declarativity of SQL with the extensibility of C#. We will explain why we introduced U-SQL, show with an example how to analyze some tweet data using U-SQL and its extensibility capabilities, and take you on an introductory tour of U-SQL geared towards existing SQL users.
slides for SQL Saturday 635, Vancouver BC, Aug 2017
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor... (Michael Rys)
More and more customers who are looking to modernize analytics needs are exploring the data lake approach in Azure. Typically, they are most challenged by a bewildering array of poorly integrated technologies and a variety of data formats, data types not all of which are conveniently handled by existing ETL technologies. In this session, we’ll explore the basic shape of a modern ETL pipeline through the lens of Azure Data Lake. We will explore how this pipeline can scale from one to thousands of nodes at a moment’s notice to respond to business needs, how its extensibility model allows pipelines to simultaneously integrate procedural code written in .NET languages or even Python and R, how that same extensibility model allows pipelines to deal with a variety of formats such as CSV, XML, JSON, Images, or any enterprise-specific document format, and finally explore how the next generation of ETL scenarios are enabled though the integration of Intelligence in the data layer in the form of built-in Cognitive capabilities.
Concepts of Apache Hive in Big Data.
Contents:
What is Hive?
Why Hive?
How Hive works
Hive architecture
Data models in Hive
Pros and cons of Hive
HiveQL
Pig vs Hive
7. Hive Metadata
• To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
• By default, Hive is packaged with Derby, a lightweight embedded SQL database
- The default Derby-based metastore is good for evaluation and testing
- The schema is not shared between users, since each user has their own instance of embedded Derby
- It is stored in a metastore_db directory, which resides in the directory that Hive was started from
• You can easily switch to another SQL installation, such as MySQL
• HCatalog, part of Hive, exposes Hive metadata to the rest of the ecosystem
12. Data Type – Primitive Type
Type      Example            Type       Example
TINYINT   10Y                SMALLINT   10S
INT       10                 BIGINT     100L
FLOAT     1.342              DOUBLE     1.234
DECIMAL   3.14               BINARY     1010
BOOLEAN   TRUE               STRING     'Book' or "Book"
CHAR      'YES' or "YES"     VARCHAR    'Book' or "Book"
DATE      '2013-01-31'       TIMESTAMP  '2013-01-31 00:13:00.345'
13. Data Type – Complex Type
• ARRAY holds elements of the same type – it behaves like a MAP whose keys are a sequence starting from 0
• MAP holds key-value pairs whose keys share one type and whose values share one type
• STRUCT is like a table row/record
Type    Example                           Define                            Usage Example
ARRAY   ['Apple','Orange','Mango']        ARRAY<string>                     a[0] = 'Apple'
MAP     {'A':'Apple','O':'Orange'}        MAP<string,string>                b['A'] = 'Apple'
STRUCT  {'Apple':2}                       STRUCT<fruit:string,weight:int>   c.weight = 2
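As a concrete illustration (not part of the original slides), here is a minimal sketch of accessing each complex type, assuming the employee_external table defined in the DDL slide below; the map key 'DB' is a made-up example:
SELECT
  name,
  work_place[0]      AS first_workplace,  -- ARRAY element by index
  sex_age.sex        AS sex,              -- STRUCT field by name
  skills_score['DB'] AS db_score          -- MAP value by key ('DB' is a hypothetical key)
FROM employee_external;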
14. Hive Data Structure
Data Structure   Logical                        Physical
Database         A collection of tables         Folder with files
Table            A collection of rows of data   Folder with files
Partition        Columns to split data          Folder
Buckets          Columns to distribute data     Files
Row              Line of records                Line in a file
Columns          Slice of records               Specified positions in each line
Views            Shortcut of rows of data       n/a
Index            Statistics of data             Folder with files
15. Hive Database
A database in Hive describes a collection of tables that are used for a similar purpose or belong to the same group. If no database is specified (USE database_name), the default database is used. Hive creates a directory for each database (except the default database) under /user/hive/warehouse, a location that can be changed through the hive.metastore.warehouse.dir property.
• CREATE DATABASE IF NOT EXISTS myhivebook;
• SHOW DATABASES;
• DESCRIBE DATABASE default; //Provides more details than 'SHOW', such as comments and location
• DROP DATABASE IF EXISTS myhivebook CASCADE;
• ALTER DATABASE myhivebook SET OWNER USER willddy;
16. Hive Tables - Concept
External Tables
Data is kept in the HDFS path specified by the LOCATION keyword. The data is not fully managed by Hive, since dropping the table (metadata) will not delete the data.
Internal Tables/Managed Tables
Data is kept in the default path, such as /user/hive/warehouse/employee. The data is fully managed by Hive, since dropping the table (metadata) will also delete the data.
17. Hive Tables - DDL
CREATE EXTERNAL TABLE IF NOT EXISTS employee_external (
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
COMMENT 'This is an external table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/user/dayongd/employee';
Notes on the clauses above:
• IF NOT EXISTS is optional
• TEMPORARY tables are also supported
• The column list specifies each column with its data type; there is no constraint support
• The table comment is optional
• ROW FORMAT DELIMITED specifies how to separate columns, and how to separate MAP keys and collection items
• STORED AS specifies the file format
• LOCATION specifies the data file path in HDFS (URI)
18. Side Notes - Delimiters
The default delimiters in Hive are as follows:
• Field delimiter: Ctrl + A or ^A (use \001 when creating the table)
• Collection item delimiter: Ctrl + B or ^B (\002)
• Map key delimiter: Ctrl + C or ^C (\003)
Issues
If a delimiter is overridden during table creation, it only works for flat structures (JIRA HIVE-365). For nested types, the level of nesting determines the delimiter, for example:
• ARRAY of ARRAY – the outer ARRAY uses ^B, the inner ARRAY uses ^C
• MAP of ARRAY – the outer MAP uses ^C, the inner ARRAY uses ^D
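As an illustrative sketch (not in the original slides), the default delimiters can also be written out explicitly in octal form when creating a table; the table raw_log and its columns are made up:
CREATE TABLE raw_log (
  col1 string,
  col2 ARRAY<string>,
  col3 MAP<string,string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'            -- same as the default ^A
COLLECTION ITEMS TERMINATED BY '\002'  -- same as the default ^B
MAP KEYS TERMINATED BY '\003';         -- same as the default ^C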
19. Hive Table – DDL – Create Table
• CTAS – CREATE TABLE ctas_employee AS SELECT * FROM employee
• CTAS cannot create a partitioned, external, or bucketed table
• CTAS with a Common Table Expression (CTE):
CREATE TABLE cte_employee AS
WITH
r1 AS (SELECT name FROM r2 WHERE name = 'Michael'),
r2 AS (SELECT name FROM employee WHERE sex_age.sex = 'Male'),
r3 AS (SELECT name FROM employee WHERE sex_age.sex = 'Female')
SELECT * FROM r1 UNION ALL SELECT * FROM r3;
• Create a table like another table (fast, since it copies only the schema): CREATE TABLE employee_like LIKE employee
20. Hive Table – Drop/Truncate/Alter Table
• DROP TABLE IF EXISTS employee [PURGE] removes the metadata completely and moves the data to the .Trash folder in the user's home directory in HDFS, if configured. With the PURGE option (since Hive 0.14.0), the data is removed completely. When dropping an external table, the data is not removed. Example: DROP TABLE IF EXISTS employee;
• TRUNCATE TABLE employee removes all rows of data from an internal table
• ALTER TABLE employee RENAME TO new_employee renames the table
• ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments') sets a table property
• ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim' = '$') sets SerDe properties
• ALTER TABLE c_employee SET FILEFORMAT RCFILE sets the file format
21. Hive Table – Alter Table Columns
• ALTER TABLE employee_internal CHANGE name employee_name STRING [BEFORE|AFTER] sex_age changes a column's name, position, or type
• ALTER TABLE c_employee ADD COLUMNS (work string) adds another column (with its type) at the end of the table
• ALTER TABLE c_employee REPLACE COLUMNS (name string) replaces all columns in the table with the specified columns and types. After the ALTER in this example, there is only one column in the table.
Note:
The ALTER TABLE statement only modifies Hive's metadata, not the actual data. Users should make sure the actual data conforms to the metadata definition.
22. Hive Partitions - Overview
• To increase performance, Hive can partition data
- The values of the partition columns divide a table into segments (folders)
- Entire partitions can be skipped at query time
- Similar to partitioning in relational databases, but not as advanced
• Partitions have to be created explicitly by users; when inserting data, you must specify a partition
• There is no difference in schema between "partition" columns and regular columns
• At query time, Hive automatically filters out partitions for better performance
23. Hive Partitions
CREATE TABLE employee_partitioned(
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
)
PARTITIONED BY (Year INT, Month INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Partitioning is NOT enabled automatically. We have to add/delete partitions with ALTER TABLE statements, as sketched below.
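For example, a minimal sketch of adding and dropping a partition on the table above (the year/month values are made up):
ALTER TABLE employee_partitioned ADD PARTITION (year=2018, month=6);
SHOW PARTITIONS employee_partitioned;
ALTER TABLE employee_partitioned DROP PARTITION (year=2018, month=6);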
24. Hive Partitions – Dynamic Partitions
Hive also supports supplying partition values dynamically. This is useful when the data volume is large and we don't know in advance what the partition values will be. To enable it, set hive.exec.dynamic.partition to true.
By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, set the partition mode (hive.exec.dynamic.partition.mode) to nonstrict instead of the default strict.
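Since the original screenshot is not reproduced here, the following is a minimal sketch of a dynamic-partition insert; employee_staging is a hypothetical source table whose last two columns supply the year and month values:
SET hive.exec.dynamic.partition=true;            -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
INSERT OVERWRITE TABLE employee_partitioned PARTITION (year, month)
SELECT name, work_place, sex_age, skills_score, depart_title, year, month
FROM employee_staging;  -- dynamic partition columns must come last in the SELECT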
25. Hive Buckets - Overview
• A bucket corresponds to a segment of files in HDFS
• Buckets are a mechanism to query and examine random samples of data, and to speed up JOINs
• Data is broken into a set of buckets based on a hash function of a "bucket column"
• Hive doesn't automatically enforce bucketing. The user is required to specify the number of buckets, either by setting the number of reducers or by enabling enforced bucketing:
SET mapred.reduce.tasks = 256;
SET hive.enforce.bucketing = true;
• When choosing the number of buckets, we should avoid having too much or too little data in each bucket. A good choice is somewhere near two blocks of data per bucket, with a power of two (2^N) as the number of buckets.
26. Hive Buckets - DDL
• Use CLUSTERED BY to define buckets
• To populate the buckets, we have to use an INSERT statement instead of a LOAD statement, since LOAD does not verify the data against the metadata definition (it only copies files into the folders)
• On disk, buckets appear as a list of file segments
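Since the DDL screenshot is not reproduced here, this is a minimal sketch consistent with the notes above; the table name employee_bucketed and the bucket count of 4 are made up:
CREATE TABLE employee_bucketed (
  name string,
  work_place ARRAY<string>,
  sex_age STRUCT<sex:string,age:int>
)
CLUSTERED BY (name) INTO 4 BUCKETS  -- CLUSTERED BY defines the bucket column and count
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
SET hive.enforce.bucketing = true;  -- let Hive set the matching number of reducers
INSERT OVERWRITE TABLE employee_bucketed
SELECT name, work_place, sex_age FROM employee;  -- INSERT, not LOAD, populates the buckets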
27. Hive Views – Overview
• A view is a logical structure that simplifies queries by hiding subqueries, joins, and functions behind a virtual table
• A Hive view does not store data and is never materialized
• Once the view is created, its schema is frozen immediately; subsequent changes to the underlying table schema (such as adding a column) will not be reflected
• If the underlying table is dropped or changed, querying the view will fail
CREATE VIEW view_name AS SELECT statement;
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = 'This is a view');
DROP VIEW view_name;
28. Hive SELECT Statement
• The SELECT statement projects the rows meeting the specified query conditions
• Hive's SELECT statement is a subset of standard database SQL
Examples
• SELECT *, SELECT column_name, SELECT DISTINCT
• SELECT * FROM employee LIMIT 5 //Limits the rows returned
• WITH t1 AS (SELECT …) SELECT * FROM t1 //CTE – Common Table Expression
• SELECT * FROM (SELECT * FROM employee) a; //Nested query
• SELECT name, sex_age.sex AS sex FROM employee a WHERE EXISTS (SELECT * FROM employee b WHERE a.sex_age.sex = b.sex_age.sex AND b.sex_age.sex = 'male');
//Subqueries support EXISTS, NOT EXISTS, IN, and NOT IN. However, nested subqueries are not supported, and IN/NOT IN only support one column.
29. Hive Join - Overview
• The JOIN statement combines rows from two or more tables
• Hive's JOIN statement is similar to a database JOIN, but Hive does not support unequal (non-equi) JOINs
• Supported: INNER JOIN, OUTER JOIN (RIGHT OUTER JOIN, LEFT OUTER JOIN, FULL OUTER JOIN), CROSS JOIN (Cartesian product / JOIN ON 1=1), and implicit JOIN (an INNER JOIN without the JOIN keyword, using commas to separate the tables)
Example (in Venn-diagram terms, with result sets A, B, and C):
• C = C1 JOIN C2
• A = C1 LEFT OUTER JOIN C2
• B = C1 RIGHT OUTER JOIN C2
• A∪B∪C = C1 FULL OUTER JOIN C2
30. Hive Join – MAPJOIN
• A MAPJOIN performs the JOIN with map tasks only, without a reduce job. The MAPJOIN reads all the data from the small table into memory and broadcasts it to all the mappers.
• When hive.auto.convert.join is set to true, Hive automatically converts a JOIN to a MAPJOIN at runtime if possible, instead of checking for the MAPJOIN hint.
Example:
SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee emp JOIN employee_hr emph ON emp.name = emph.name;
The MAPJOIN operator does not support the following:
• Using MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
• Using MAPJOIN before UNION, JOIN, and other MAPJOINs
31. Hive Set Ops – UNION
• Hive supports UNION ALL (which keeps duplicates) and, since v1.2.0, UNION (which removes duplicates)
• Hive UNION ALL could not be used in a top-level query until v0.13.0, such as SELECT A UNION ALL SELECT B
• Other set operators can be emulated using JOIN/OUTER JOIN
Example:
//MINUS
jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> LEFT JOIN employee_hr b
. . . . . . .> ON a.name = b.name
. . . . . . .> WHERE b.name IS NULL;
//INTERSECT
jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> JOIN employee_hr b
. . . . . . .> ON a.name = b.name;
32. Hive Data Movement – LOAD
• To move data into Hive, use the LOAD keyword. "Move" here means the original data is moved to the target table/partition and no longer exists in the original place.
• The LOCAL keyword specifies that the source files are located on the local host (rather than in HDFS)
• OVERWRITE decides whether to replace the existing data; without it, the data is appended
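For example (illustrative paths; the source files are made up):
-- load from the local filesystem, replacing the existing data
LOAD DATA LOCAL INPATH '/tmp/employee.txt' OVERWRITE INTO TABLE employee;
-- load from HDFS, appending to the existing data
LOAD DATA INPATH '/user/dayongd/employee_hr.txt' INTO TABLE employee_hr;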
33. Hive Data Exchange – INSERT
• To insert data into a table/partition, Hive uses the INSERT statement
• Generally speaking, Hive's INSERT is weaker than a DBMS's
• Hive INSERT supports OVERWRITE and INTO syntax
• Hive only supports INSERT … SELECT (INSERT … VALUES is available from v0.14.0, on special ACID tables only)
• Hive uses INSERT OVERWRITE LOCAL DIRECTORY local_directory SELECT * FROM employee to write data to local files (the opposite of the LOAD statement). Only the OVERWRITE keyword is supported when writing to a directory; there is no support for appending data.
• Hive supports multiple INSERTs from the same table, as shown in the example on the next slide
34. Hive Data Exchange – INSERT Example
(The original slide shows a screenshot of a multi-insert statement, annotated "Better performance" and "Ways to export data".)
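Since the screenshot is not reproduced, here is a hedged sketch of the multi-insert pattern this slide refers to: the source table is scanned once to feed several targets (the performance benefit) while also exporting data. employee_backup and the output directory are made up, and employee_backup is assumed to already exist with the same schema:
FROM employee
INSERT OVERWRITE TABLE employee_backup
SELECT *
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/employee_export'
SELECT *;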
35. Hive Data Exchange – [EX|IM]PORT
• Hive uses the IMPORT and EXPORT statements for data migration
• All data and metadata are exported or imported
• The EXPORT statement exports the data into a subdirectory called data and the metadata into a file called _metadata. For example:
EXPORT TABLE employee TO '/user/dayongd/output3';
EXPORT TABLE employee_partitioned PARTITION (year=2014, month=11) TO '/user/dayongd/output5';
• After EXPORT, we can manually copy the exported files to another HDFS cluster, then import them with the IMPORT statement:
IMPORT TABLE employee_imported FROM '/user/dayongd/output3';
IMPORT TABLE employee_partitioned_imported FROM '/user/dayongd/output5';
36. Hive Sorting Data – ORDER and SORT
• ORDER BY (ASC|DESC) is similar to standard SQL
• ORDER BY in Hive performs a global sort using only one reducer. For example: SELECT name FROM employee ORDER BY name DESC;
• SORT BY (ASC|DESC) decides how to sort the data within each reducer
• When the number of reducers is set to 1, SORT BY is equal to ORDER BY. When there is more than one reducer, the data is only sorted within each reducer, not globally.
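Since the screenshot example is missing, here is a minimal sketch of SORT BY with more than one reducer (the reducer count is arbitrary):
SET mapred.reduce.tasks = 2;                  -- more than one reducer
SELECT name FROM employee SORT BY name DESC;  -- each reducer's output is sorted, but not globally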
37. Hive Sorting Data – DISTRIBUTE
• DISTRIBUTE BY is similar to the GROUP BY statement in standard SQL
• It makes sure that rows with matching column values are sent to the same reducer
• When used with SORT BY, put DISTRIBUTE BY before SORT BY (since partitioning happens before the per-reducer sort)
• The distributed column must appear in the SELECT column list
SELECT name, employee_id
FROM employee_hr
DISTRIBUTE BY employee_id SORT BY name;
38. Hive Sorting Data – CLUSTER
• CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column
• CLUSTER BY does not support DESC yet
• To fully utilize all reducers while performing a global sort, we can use CLUSTER BY first, then ORDER BY afterwards
• An example:
SELECT name, employee_id
FROM employee_hr CLUSTER BY name;
40. www.sparkera.ca
BIG DATA is not only about data, but the understanding of the data and how people use data actively to improve their lives.
Editor's Notes
Lightning-fast cluster computing
You need to use the special hiveconf namespace for variable substitution, e.g.
hive> set CURRENT_DATE='2015-06-26';
hive> select * from foo where day >= '${hiveconf:CURRENT_DATE}'
similarly, you could pass on command line:
% hive -hiveconf CURRENT_DATE='2015-06-26' -f test.hql
Note that there are env and system variables as well, so you can reference ${env:USER} for example.
To see all the available variables, from the command line, run
% hive -e 'set;'
or from the hive prompt, run
hive> set;
No sub-directory is allowed in the URI
DROP TABLE IF EXISTS empty_ctas_employee;
Data in the same buckets can be joined quickly, since they likely contain the same groups of keys.