This document provides an overview of Hive, including:
1. Hive's architecture, which uses HDFS for storage, MapReduce for execution, and an RDBMS for metadata.
2. Hive's data types (primitive and collection types) and its file formats.
3. Hive's query language (HiveQL), which resembles SQL and can be used to define databases and tables and to load and query data.
Big Data & Analytics (CSE6005) L6.pptx
1. SESSION 2017-2018
B.TECH (CSE) YEAR: III SEMESTER: VI
INTRODUCTION TO HIVE
(CSE6005)
MODULE 2 (L6)
Presented By
Vivek Kumar
Dept of Computer Engineering & Applications
GLA University India
2. Agenda
Introduction to Hive
Learning Objectives
1. To study the Hive architecture
2. To study the Hive file formats
3. To study the Hive Query Language
Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation language statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.
3. Agenda
What is Hive?
Hive Architecture
Hive Data Types
Primitive Data Types
Collection Data Types
Hive File Format
Text File
Sequence File
RCFile (Record Columnar File)
4. Agenda …
Hive Query Language
DDL (Data Definition Language) Statements
DML (Data Manipulation Language) Statements
Database
Tables
Partitions
Buckets
Aggregation
Group BY and Having
SerDe
5. Case Study: Retail
Major Indian retailers, including Future Group, Reliance Industries, Tata Group and Aditya Birla Group, use Hive.
One of the retail groups, let's call it BigX, wanted its last 5 years of semi-structured data to be analyzed for trends and patterns.
Let us see how we can solve their problem using Hadoop.
6. Case Study: Retail cont..
About BigX
BigX is a hypermarket chain in India. It currently has 220+ stores across 85 cities and towns in India and employs 35,000+ people. Its annual revenue for the year 2011 was USD 1 billion. It offers a wide range of products including fashion and apparel, food products, books, furniture, electronics, health care, general merchandise and entertainment sections.
7. Case Study: Retail cont..
Problem Scenario
1. One of BigX's log datasets that needed to be analyzed was approximately 12 TB in overall size and held 5 years of vital information in semi-structured form.
8. Case Study: Retail cont..
2. Traditional business intelligence (BI) tools are good up to a certain degree, usually several hundreds of gigabytes. But when the scale is of the order of terabytes and petabytes, these frameworks become inefficient. Also, BI tools work best when data is present in a known, pre-defined schema. The particular dataset from BigX was mostly logs which didn't conform to any specific schema.
9. Case Study: Retail cont..
3. It took around 12+ hours to move the data into their business intelligence systems bi-weekly. BigX wanted to reduce this time drastically.
4. Querying such a large dataset was taking too long.
10. Case Study: Retail cont..
Solution
This is where Hadoop shines in all its glory as a solution. Since the size of the logs dataset is 12 TB, at such a large scale the problem is two-fold:
Problem 1: Moving the logs dataset to HDFS periodically
Problem 2: Performing the analysis on this HDFS dataset
11. Case Study: Retail cont..
Solution of Problem 1
Since the logs are unstructured in this case, Sqoop was of little or no use. So Flume was used to move the log data periodically into HDFS.
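The Flume ingestion step described above can be sketched as a minimal agent configuration. This is only an illustration: the agent name, spool directory and HDFS path are hypothetical, and a real deployment would tune channel and sink properties.

```properties
# Minimal Flume agent: picks up completed log files from a local
# spool directory and writes them into date-partitioned HDFS paths.
bigx.sources = logsrc
bigx.channels = ch1
bigx.sinks = hdfssink

# Source: watch a local directory for new log files
bigx.sources.logsrc.type = spooldir
bigx.sources.logsrc.spoolDir = /var/log/bigx
bigx.sources.logsrc.channels = ch1

# Channel: buffer events on disk for durability
bigx.channels.ch1.type = file

# Sink: write raw events to HDFS, one directory per day
bigx.sinks.hdfssink.type = hdfs
bigx.sinks.hdfssink.hdfs.path = hdfs://namenode/bigx/logs/%Y-%m-%d
bigx.sinks.hdfssink.hdfs.fileType = DataStream
# spooldir events carry no timestamp header, so use the local clock
bigx.sinks.hdfssink.hdfs.useLocalTimeStamp = true
bigx.sinks.hdfssink.channel = ch1
```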
12. Case Study: Retail cont..
Solution of Problem 2
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. It provides an SQL-like language called HiveQL and converts the query into MapReduce tasks.
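As an illustration, a HiveQL query over BigX-style data reads like ordinary SQL, yet Hive compiles it into MapReduce jobs behind the scenes. The table and column names below are assumptions for the sketch.

```sql
-- Total sales per store (illustrative schema).
-- Hive turns the GROUP BY into a map phase (emit store_id, amount)
-- and a reduce phase (sum the amounts for each store_id).
SELECT store_id, SUM(amount) AS total_sales
FROM sales
GROUP BY store_id
ORDER BY total_sales DESC;
```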
14. Hive in this Case Study
Hive uses "Schema on Read", unlike a traditional database which uses "Schema on Write".
While reading log files, the simplest recommended approach during Hive table creation is to use a RegexSerDe.
By default, Hive metadata is stored in an embedded Derby database which allows only one user to issue queries. This is not ideal for production purposes. Hence, Hive was configured to use an external metastore database.
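A table over raw logs using a RegexSerDe, as recommended above, might be declared as follows. This is a sketch: the column set, the regular expression (written for an Apache-style access log) and the HDFS location are assumptions, not BigX's actual schema.

```sql
CREATE EXTERNAL TABLE bigx_logs (
  host    STRING,
  ts      STRING,
  request STRING,
  status  STRING,
  size    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column; adjust the pattern to the real log layout
  "input.regex" = "(\\S+) \\[([^\\]]*)\\] \"([^\"]*)\" (\\d+) (\\d+)"
)
LOCATION '/bigx/logs';
```

On very old Hive releases the class lives at org.apache.hadoop.hive.contrib.serde2.RegexSerDe instead.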
15. Conclusion- Case Study: Retail
Using the Hadoop system, log transfer time was reduced to ~3 hours bi-weekly, and querying time also improved significantly.
Thanks to Vijay, Big Data Lead at 8KMiles, who holds an M.Tech in Information Retrieval from IIIT-B, for the case study.
https://yourstory.com/2012/04/hive-for-retail-analysis/
16. What is Hive?
Hive is a data warehousing tool used to query structured data, built on top of Hadoop.
Facebook created the Hive component to manage their ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage
2. MapReduce for execution
3. An RDBMS to store metadata
17. What is Hive?
Apache Hive is a popular SQL interface for batch processing on Hadoop.
Hadoop was built to organize and store massive amounts of data.
Hive gives another way to access data inside the cluster in an easy, quick way.
18. Hive provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard.
Hive was one of the earliest projects to bring higher-level languages to Apache Hadoop.
Hive gives analysts and data scientists the ability to access data without being experts in Java.
Hive gives structure to data on HDFS, making it a data warehousing platform.
19. This interface to Hadoop not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce.
Let us take a moment to thank the Facebook team: Hive was developed by the Facebook Data team and, after being used internally, it was contributed to the Apache Software Foundation.
Currently Hive is freely available as an open-source project.
20. What Hive is not?
Hive is not a relational database; it uses a database to store metadata, but the data that Hive processes is stored in HDFS.
Hive is not designed for online transaction processing (OLTP).
Hive is not suited for real-time queries and row-level updates; it is best used for batch jobs over large sets of immutable data such as web logs.
21. Typical Use-Case of Hive
Hive takes large amounts of unstructured data and places it into a structured view.
Hive supports use cases such as ad-hoc queries, summarization and data analysis.
HiveQL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs) and table functions (UDTFs).
It converts SQL queries into MapReduce jobs.
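Registering and invoking a custom UDF from HiveQL follows the pattern below. The JAR path and the Java class name are hypothetical placeholders; only the ADD JAR / CREATE TEMPORARY FUNCTION syntax is standard HiveQL.

```sql
-- Make the JAR containing the UDF implementation available to the session
ADD JAR /path/to/bigx-udfs.jar;

-- Bind a SQL-callable name to the (hypothetical) Java class
CREATE TEMPORARY FUNCTION mask_card AS 'com.example.hive.udf.MaskCard';

-- Use it like any built-in function
SELECT mask_card(card_number) FROM transactions;
```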
22. Features of Hive
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists, and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
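The rich data types mentioned above can be sketched in a table definition like the following (table and column names are illustrative):

```sql
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                       -- a list of values
  address STRUCT<city:STRING, pincode:STRING>, -- named fields
  phones  MAP<STRING, STRING>                  -- key/value pairs
);

-- Element access: index into arrays, dot into structs, key into maps
SELECT name, skills[0], address.city, phones['home']
FROM employees;
```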
23. Prerequisites of Hive in Hadoop
The prerequisites for setting up Hive and running queries are:
1. User should have a stable build of Hadoop
2. Machine should have Java 1.6 installed
3. Basic Java programming skills
4. Basic SQL knowledge
Start all the services of Hadoop using the command $ start-all.sh.
Check that all services are running, then use $ hive to start Hive.
24. Hive Integration and Workflow
Hourly log data can be stored directly into HDFS; data cleaning is then performed on the log file, and finally a Hive table can be created to query the log file.
[Workflow diagram: Hourly Log → Hadoop HDFS → Log Compression → Hive Table 1 / Hive Table 2]
26. Hive Architecture
The various parts are as follows:
Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
Hive Web Interface: A simple graphic user interface to interact with Hive and to execute queries.
Hive Server: This is an optional server. It can be used to submit Hive jobs from a remote client.
JDBC / ODBC: Jobs can be submitted from a JDBC client. One can write Java code to connect to Hive and submit jobs on it.
27. Hive Architecture
Driver: Hive queries are sent to the driver for compilation, optimization and execution.
Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A Metastore consists of the following:
- Metastore service: Offers an interface to Hive.
- Database: Stores data definitions, mappings to the data and others.
The metadata stored in the metastore includes IDs of databases, tables and indexes, the time of creation of a table, the input format used for a table, the output format used for a table, etc. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore.
28. Hive Architecture
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive; it is an Apache Derby database. In this mode, both the database and the metastore service run embedded in the main Hive Server process. Figure 9.8 shows an Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. A local metastore allows multiple connections at a time. In this mode, the Hive metastore service runs in the main Hive Server process, but the metastore database runs in a separate process, and can be on a separate host. Figure 9.9 shows a local metastore.
29. Hive Architecture
3. Remote Metastore: In this, the Hive driver and the metastore interface run on different JVMs (which can run on different machines as well), as in Figure 9.10. This way the database can be fire-walled from the Hive users, and database credentials are completely isolated from the users of Hive.
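Clients locate a remote metastore through the hive.metastore.uris property in hive-site.xml, which carries the metastore service's Thrift URI. The host name below is a placeholder; 9083 is the conventional metastore port.

```xml
<!-- hive-site.xml fragment on the client side (host is hypothetical) -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
```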
32. Hive Data Model Contd.
Tables
- Analogous to relational tables
- Each table has a corresponding directory in
HDFS
- Data serialized and stored as files within that
directory
- Hive has a default serialization built in which
supports compression and lazy deserialization
- Users can specify custom serialization –
deserialization schemes (SerDe’s)
33. Hive Data Model Contd.
Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within
subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT,
month INT)
So each partition will be split out into different folders
like
Sales/country=US/year=2012/month=12
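The mapping from partition-column values to subdirectory names can be sketched in Python (this is an illustration of the layout, not Hive code; the function and row are hypothetical):

```python
# Sketch: how rows of a partitioned table map to HDFS
# subdirectories of the form key=value/key=value/...
def partition_path(table_dir, partition_cols, row):
    """Build the partition directory path for one row."""
    parts = "/".join(f"{c}={row[c]}" for c in partition_cols)
    return f"{table_dir}/{parts}"

row = {"sale_id": 1, "amount": 9.99,
       "country": "US", "year": 2012, "month": 12}
print(partition_path("Sales", ["country", "year", "month"], row))
# → Sales/country=US/year=2012/month=12
```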
35. Partition
Partitioning, in general, means horizontally dividing
data into a number of equal, manageable slices.
Each partition is stored as a directory within the
table's directory in the data warehouse.
Partitioning is a common concept in data
warehousing, and two types of partitions are
available in data warehouse contexts.
These are
i) SQL Partition
ii) Hive Partition
36. Hive Partition
A Hive partition serves the same purpose as a SQL
partition. The main difference between them is that a
SQL partition is supported only for a single column
in a table, whereas a Hive partition supports
multiple columns in a table.
37. Hive Data Model Contd.
Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket
number
- Each bucket is stored as a file in partition
directory
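The bucket-assignment formula above can be sketched in Python. Note that Hive uses its own per-type hash function in Java; Python's built-in `hash` stands in here as an assumption for illustration (for small integers, `hash(n) == n`):

```python
# Sketch: bucket assignment as H(column) mod NumBuckets.
NUM_BUCKETS = 3

def bucket_number(value):
    """Return the bucket (0..NUM_BUCKETS-1) a column value falls into."""
    return hash(value) % NUM_BUCKETS

# Rows with the same column value always land in the same bucket.
rolls = [101, 102, 103, 104]
print({r: bucket_number(r) for r in rolls})
# → {101: 2, 102: 0, 103: 1, 104: 2}
```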
38. Hive Data Types
Numeric Data Types
TINYINT   1-byte signed integer
SMALLINT  2-byte signed integer
INT       4-byte signed integer
BIGINT    8-byte signed integer
FLOAT     4-byte single-precision floating-point number
DOUBLE    8-byte double-precision floating-point number
String Types
STRING
VARCHAR   Only available starting with Hive 0.12.0
CHAR      Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (")
Miscellaneous Types
BOOLEAN
BINARY    Only available starting with Hive
39. Hive Data Types cont..
Collection Data Types
STRUCT  Similar to a C struct. Fields are accessed using dot notation.
E.g.: struct('John', 'Doe')
MAP A collection of key - value pairs. Fields are accessed using [] notation.
E.g.: map('first', 'John', 'last', 'Doe')
ARRAY Ordered sequence of same types. Fields are accessed using array index.
E.g.: array('John', 'Doe')
40. Hive File Format
Text File: The default file format is the text file.
Sequence File: Sequence files are flat files
that store binary key-value pairs.
RCFile (Record Columnar File):
RCFile stores the data in a column-oriented
manner, which ensures that aggregation
operations are not expensive.
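Why a column-oriented layout makes aggregation cheap can be sketched in Python (an illustration of the storage idea, not of RCFile's actual on-disk format):

```python
# Sketch: row-oriented vs column-oriented storage of the same data.
rows = [(1, 10.0), (2, 20.0), (3, 30.0)]      # row layout
columns = {"sale_id": [1, 2, 3],              # column layout
           "amount": [10.0, 20.0, 30.0]}

# Row layout: every full row must be touched to sum one field.
total_row = sum(r[1] for r in rows)

# Column layout: only the 'amount' column is scanned,
# so an aggregation reads far less data.
total_col = sum(columns["amount"])

print(total_row, total_col)  # → 60.0 60.0
```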
41. Hive Query Language (HQL)
Works on Databases, Tables, Partitions, Buckets
(Clusters)
Create and manage tables and partitions.
Support various Relational, Arithmetic, and Logical
Operators.
Evaluate functions.
Download the contents of a table to a local
directory, or the results of queries to an HDFS directory.
42. Database
To create a database named “STUDENTS”
with comments and database properties.
CREATE DATABASE IF NOT EXISTS
STUDENTS COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');
43. Database
To describe a database
DESCRIBE DATABASE STUDENTS;
To show Databases
SHOW DATABASES;
To drop database.
DROP DATABASE STUDENTS;
44. Tables
There are two types of tables in Hive:
Managed table
External table
The difference between the two is what happens when
you drop a table:
if it is a managed table, Hive deletes both the data
and the metadata;
if it is an external table, Hive deletes only the metadata.
Use the EXTERNAL keyword to create an external
table.
45. Tables
To create managed table named ‘STUDENT’.
CREATE TABLE IF NOT EXISTS
STUDENT (rollno INT, name STRING, gpa
FLOAT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t';
46. Tables
To create external table named
‘EXT_STUDENT’.
CREATE EXTERNAL TABLE IF NOT EXISTS
EXT_STUDENT (rollno INT, name STRING, gpa
FLOAT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' LOCATION
'/STUDENT_INFO';
47. Tables
To load data into the table from file named
student.tsv.
LOAD DATA LOCAL INPATH
'/root/hivedemos/student.tsv' OVERWRITE
INTO TABLE EXT_STUDENT;
To retrieve the student details from the
"EXT_STUDENT" table.
SELECT * from EXT_STUDENT;
48. Table ALTER Operations
ALTER TABLE mytablename RENAME TO mt;
ALTER TABLE mytable ADD COLUMNS (mycol
STRING);
ALTER TABLE name RENAME TO new_name;
ALTER TABLE name DROP [COLUMN]
column_name;
ALTER TABLE name CHANGE column_name
new_name new_type;
ALTER TABLE name REPLACE COLUMNS
(col_spec[, col_spec ...]);
49. Partitions
Partitions split the larger dataset into more meaningful chunks.
Hive provides two kinds of partitions: Static Partition and Dynamic
Partition.
• To create static partition based on “gpa” column.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT
(rollno INT, name STRING) PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load data into partition table from table.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT
PARTITION (gpa =4.0) SELECT rollno, name from
EXT_STUDENT where gpa=4.0;
50. Partitions
• To create a dynamic-partition table partitioned on the gpa column.
CREATE TABLE IF NOT EXISTS
DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
To load data into a dynamic-partition table from another table.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Note: The dynamic partition strict mode requires at least one static
partition column. To turn this off,
set hive.exec.dynamic.partition.mode=nonstrict
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT
PARTITION (gpa) SELECT rollno,name,gpa from
EXT_STUDENT;
51. Buckets
Tables or partitions are sub-divided
into buckets to provide extra structure to the
data, which may be used for more efficient
querying. Bucketing works based on the value
of a hash function of some column of a table.
We can also add partitions to a table by altering
the table. Let us assume we have a table
called employee with fields such as Id, Name,
Salary, Designation, Dept, and yoj.
52. Buckets
• To create a bucketed table having 3 buckets.
CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno
INT,name STRING,grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;
Load data to bucketed table.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno,name,grade;
To display the content of the first bucket.
SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);
53. Aggregations
Hive supports aggregation functions like avg,
count, etc.
To use the average and count aggregation
functions.
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
54. Group by and Having
To use the GROUP BY and HAVING clauses.
SELECT rollno, name,gpa
FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;
55. SerDe
SerDe stands for Serializer/Deserializer.
A SerDe contains the logic to convert unstructured
data into records.
SerDes are implemented in Java.
Serializers are used at the time of writing.
Deserializers are used at query time (SELECT
Statement).
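The serialize-on-write / deserialize-on-read idea can be sketched in Python (an assumption for illustration only; a real Hive SerDe is a Java class implementing Hive's SerDe interface):

```python
# Sketch: a minimal serializer/deserializer for tab-delimited
# student records, mirroring what a Hive SerDe does.
FIELDS = ["rollno", "name", "gpa"]

def serialize(record):
    """Used at write time: record dict -> delimited line."""
    return "\t".join(str(record[f]) for f in FIELDS)

def deserialize(line):
    """Used at query time: delimited line -> typed record."""
    rollno, name, gpa = line.rstrip("\n").split("\t")
    return {"rollno": int(rollno), "name": name, "gpa": float(gpa)}

line = serialize({"rollno": 1, "name": "John", "gpa": 3.9})
print(deserialize(line))
# → {'rollno': 1, 'name': 'John', 'gpa': 3.9}
```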
56. Fill in the blanks
The metastore consists of ______________
and a ______________.
The most commonly used interface to interact
with Hive is ______________.
The default metastore for Hive is
______________.
Metastore contains ______________ of Hive
tables.
______________ is responsible for
compilation, optimization, and execution of
Hive queries.