Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s
Technological Geeks Video 13 :-
Video Link :- https://youtu.be/mfLxxD4vjV0
FB page Link :- https://www.facebook.com/bitwsandeep/
Contents :-
Hive Architecture
Hive Components
Limitations of Hive
Hive data model
Difference with traditional RDBMS
Type system in Hive
This Hadoop Hive Tutorial will unravel the complete Introduction to Hive, Hive Architecture, Hive Commands, Hive Fundamentals & HiveQL. In addition to this, even fundamental concepts of BIG Data & Hadoop are extensively covered.
At the end, you'll have a strong knowledge regarding Hadoop Hive Basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop, targeted at SQL programmers. Hive lets SQL programmers enter the Hadoop ecosystem directly, without prerequisites in Java or other programming languages. HiveQL is similar to SQL; it is used to run Hadoop and MapReduce operations by managing and querying data.
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop, by Someshwar Kale
This presentation is based on my experience while learning Hive. Most of the things (limitations and features) covered in this PPT were in the incubating phase at the time of writing.
A short introduction to Apache Hadoop Hive, what is it and what can it do. How could we use it to connect a Hadoop cluster to business intelligence tools. Then create management reports from our Hadoop cluster data.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. Its true strengths are:
Scalability – it's affordable due to it being open-source and its use of commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – More data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
Hadoop in Practice (SDN Conference, Dec 2014), by Marcel Krcah
You sit on a big pile of data and want to know how to leverage it in your company? Interested in use-cases, examples and practical demos about the full Hadoop stack? Looking for big-data inspiration?
In this talk we will cover:
- Use-cases showing how implementing a Hadoop stack at TheNewMotion drastically helped us, software engineers, with our everyday challenges, and how Hadoop enables our management team, marketing and operations to become more data-driven.
- Practical introduction into our data warehouse, analytical and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook and Angular with D3.js.
- Easy deployment of the Hadoop stack to the cloud.
- Hermes - our homegrown command-line tool which helps us automate data-related tasks.
- Examples of exciting machine learning challenges that we are currently tackling
- Hadoop with Azure and Microsoft stack.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition), by Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches, by Mithun Radhakrishnan
Here's the talk that we presented at the Hadoop Summit 2015, in San Jose. This was an inside look at how we at Yahoo scaled Hive to work at Yahoo's data/metadata scale.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2xkCd84
This CloudxLab Introduction to Hive tutorial helps you to understand Hive in detail. Below are the topics covered in this tutorial:
1) Hive Introduction
2) Why Do We Need Hive?
3) Hive - Components
4) Hive - Limitations
5) Hive - Data Types
6) Hive - Metastore
7) Hive - Warehouse
8) Accessing Hive using Command Line
9) Accessing Hive using Hue
10) Tables in Hive - Managed and External
11) Hive - Loading Data From Local Directory
12) Hive - Loading Data From HDFS
13) S3 Based External Tables in Hive
14) Hive - Select Statements
15) Hive - Aggregations
16) Saving Data in Hive
17) Hive Tables - DDL - ALTER
18) Partitions in Hive
19) Views in Hive
20) Load JSON Data
21) Sorting & Distributing - Order By, Sort By, Distribute By, Cluster By
22) Bucketing in Hive
23) Hive - ORC Files
24) Connecting to Tableau using Hive
25) Analyzing MovieLens Data using Hive
26) Hands-on demos on CloudxLab
This presentation gives a high-level overview of Hadoop and its ecosystem. It starts with why Hadoop came into existence, how Hadoop is being used, what the components of Hadoop and its ecosystem are, who the Hadoop and ETL/BI vendors are, and how Hadoop is typically implemented. It also covers a few examples to give a kick start to someone interested in learning and practicing MapReduce, Hadoop and its ecosystem products.
Apache Hive and HBase are very popular projects in the Hadoop ecosystem. Using Hive with HBase was made possible by contributions from Facebook around 2010. In this talk, we will go over the details of how the integration works, and talk about recent improvements. Specifically, we will cover the basic architecture, schema and data type mappings, and recent filter pushdown optimizations. We will also go into detail about the security aspects of Hadoop/HBase related to Hive setups.
Data Engineering with Spring, Hadoop and Hive, by Alex Silva
This presentation will outline the evolution of the monitoring data platform pipeline at Rackspace and explore the compute and data management challenges we have faced at this scale. We will focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, observations and pitfalls encountered in building the pipeline.
Hortonworks Technical Workshop: Interactive Query with Apache Hive, by Hortonworks
Apache Hive is the defacto standard for SQL queries over petabytes of data in Hadoop. It is a comprehensive and compliant engine that offers the broadest range of SQL semantics for Hadoop, providing a powerful set of tools for analysts and developers to access Hadoop data. The session will cover the latest advancements in Hive and provide practical tips for maximizing Hive Performance.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=7c8f800cbbef256680db14c78b871f97
Apache Hive Hook
I couldn't find enough info about Hive hooks, so I made this.
I hope this presentation will be useful when you want to use hooks.
It also includes some information about metastore event listeners.
It was written based on the release-0.11 tag.
Cost-based query optimization in Apache Hive, by Julian Hyde
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
In this introduction to Apache Hive the following topics are covered:
1. Hive Origin
2. Hive philosophy and architecture
3. Hive vs. RDBMS
4. HiveQL and Hive Shell
5. Managing tables
6. Data types and schemas
7. Querying data
8. HiveODBC
9. Resources
Self-Service BI for big data applications using Apache Drill (Big Data Amster..., by Dataconomy Media
Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.
The real estate market is one of the most competitive in terms of pricing, and as a result, prices tend to vary significantly based on a variety of factors. Forecasting property prices is an important module in decision making for both buyers and investors, supporting budget allocation, property-finding strategies, and the determination of suitable policies, making it one of the top fields in which to apply the concepts of machine learning to optimise and predict prices with high accuracy.
The literature study provides a clear concept and will benefit any future endeavours. The majority of writers have come to the conclusion that artificial neural networks are more effective at forecasting, but in the real world there are other algorithms that should have been taken into account. In order to maximise profits, investors base their judgments on market trends. Developers are curious about future trends because it helps them weigh the advantages and downsides and assists them in creating new products.
Data Visualisation with Hadoop Mashups, Hive, Power BI and Excel 2013, by Jen Stirrup
This session focused on data visualisation using Power BI, based on big data. Some examples of Hive and HDFS file storage are given. An overview of Microsoft HDInsight is supplied.
Atlanta meetup presentation, discussion around big data processing engines (Hive, HBase, Druid, Spark). Weighs the relative strengths of each engine and which use cases each of the engines are most suited for
Stinger.Next, by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the speed, scale, and SQL compliance of Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver speed, scale and better SQL.
Apache Hive was initiated by Facebook in 2007 due to its data growth.
Its ETL system began to fail over a few years as more people joined Facebook.
In August 2008, Facebook decided to move to a more scalable open-source Hadoop environment: Hive.
Facebook, Netflix and Amazon support Apache Hive SQL, now known as HiveQL.
Hadoop and Internet of Things presentation from Sinergija 2014 conference, held in Belgrade in October 2014. How the rising data resources change the business, and how the Big Data technologies combined with Internet of Things devices can help to improve the business and the everyday life. Hadoop is already the most significant technology for working with Big Data. Microsoft is playing a very important role in this field, with the Stinger initiative. The main goal is to bring the enterprise SQL at Hadoop scale.
Apache Hive is a rapidly evolving project, loved by many people in the big data ecosystem. Hive continues to expand support for analytics, reporting, and interactive queries, and the community is striving to improve support along many other aspects and use cases. In this lecture, we introduce the latest and greatest features and optimizations that appeared in this project over the last year. This includes benchmarks covering LLAP, Apache Druid materialized views and integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. I will also tell you a little about what you can expect in the future.
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...), by Stéphane Fréchette
How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not-so-familiar tools you can use today for moving and integrating Big Data. It is also important to outline the technologies and platform (introduction to Big Data, Hadoop, HDInsight and tools).
We will compare and outline options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.
1. An Introduction to
Apache HIVE
Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi
2. Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not?
• Brief About Data Warehouse
An Introduction to Apache HIVE
2 of 31
3. Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References
4. Starting Note
• What is Apache Hive?
– Open Source (Very Important!) So Free
– Data Warehouse System on Hadoop
– Provides HQL (a SQL-like query interface)
– Suitable for Structured and Semi-Structured Data
– Capability to deal with different storages and file
formats
5. Starting Note- Contd.
• What is cool about Hive
– Lets users use MapReduce without thinking in MapReduce, via the
HiveQL interface.
• Some history
– Hive is made by Facebook!
– Also developed and used by Netflix.
– Amazon uses it in Amazon Elastic MapReduce
6. Starting Note- Contd.
• What Hive is not
– Does not use complex indexes, so it does not respond in seconds!
– But it scales very well, and it works with data on the petabyte scale
– It is not independent; its performance is tied to Hadoop
7. Brief About Data Warehouse
• OLAP vs OLTP
– A DW is needed in OLAP
– We want reports and summaries, not the live transactional data
used to keep operations running
– We need reports to make operations better, not to conduct the
operations themselves!
– We use ETL to populate data in the DW.
8. Brief About Data Warehouse
Inmon approach
vs
Kimball approach
10. Brief About Data Warehouse
• Other keywords
– ODS- Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs
11. Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server
13. Hive Architecture (diagram)
• Interfaces: Web UI + Hive CLI + JDBC/ODBC, exposed via the Thrift API
• Hive QL processing: Parser → Planner → Optimizer → Execution (browse, query, DDL)
• MetaStore
• UDF/UDAF (substr, sum, average) and user-defined map-reduce scripts
• SerDe: CSV, Thrift, Regex
• FileFormats: TextFile, SequenceFile, RCFile
• Execution layer: Map Reduce on HDFS
14. Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an execution plan.
• Optimizer
– Optimizes the execution plan before it runs.
• Execution Engine
– Runs the execution plan. The execution plan is guaranteed to be a DAG.
15. Hive Architecture- Contd.
• Hive Data Model
– Any data in Hive is categorized into:
• Databases
– The first level of abstraction.
• Tables
– Ordinary tables.
• Partitions
– To manage the data transferred to MapReduce.
• Buckets
– To facilitate data access within partitions.
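As a hedged sketch of the data model above (table and column names are hypothetical), a single DDL statement can combine partitions and buckets:

```sql
-- Hypothetical example: partition by day, bucket by user id.
-- Partition columns live in the directory layout, not in the data files;
-- buckets hash the rows of each partition into a fixed number of files.
CREATE TABLE page_views (
  user_id INT,
  url     STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

Bucketing by `user_id` means all rows for a given user land in the same file of a partition, which helps sampling and map-side joins.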
16. Hive in Production
• Log processing
– Daily Report
– User Activity Measurement
• Data/Text mining
– Machine learning (Training Data)
• Business intelligence
– Advertising Delivery
– Spam Detection
17. Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By
– Data Insertion/Aggregation
18. HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string, tags string)
• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
19. HQL- Samples
• Partition
create table table_name (
id int,
name string)
partitioned by (dt string)
– Note: a partition column (here dt) is declared only in PARTITIONED BY, not in the regular column list.
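A hedged sketch of how data lands in a specific partition (the table name, file path, and dt column are hypothetical, assuming a table partitioned by a string column):

```sql
-- Each LOAD targets exactly one partition; the partition value comes
-- from the PARTITION clause, not from the contents of the file.
LOAD DATA LOCAL INPATH '/tmp/logs_2013_10_01.txt'
OVERWRITE INTO TABLE page_logs
PARTITION (dt = '2013-10-01');
```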
20. HQL- Samples
• SerDe
– User table with "id::gender::age::occupation::zipcode" format.
CREATE TABLE USER (id INT, gender STRING, age INT, occupation STRING, zipcode INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
21. HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize the data while sending it to the reducers.
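A hedged illustration of how these clauses relate (using the movies table from the earlier samples): CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same column.

```sql
-- DISTRIBUTE BY routes rows with equal keys to the same reducer;
-- SORT BY orders rows within each reducer (not globally).
SELECT * FROM movies DISTRIBUTE BY tags SORT BY tags;

-- Equivalent shorthand:
SELECT * FROM movies CLUSTER BY tags;
```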
22. Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend - Community version
– Sqoop (SQl to hadOOP, Apache license)
– SyncSort – Not Free!
23. Hive Process- Contd.
– STP (Straight-Through Processing)
• Flume – Apache licensed
• Chukwa – a part of the Apache Hadoop distribution
• Scribe – Facebook's solution for log processing and aggregation.
24. Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count Errors per session
– Count Streams per day
– Ad-hoc queries like summaries (sum, max, min, …)
26. Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into Hive every hour.
– The previous job should also run every 24 hours for summaries.
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa provides non-stop log collection.
30. Further Reading
• Apache Drill
– A software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
– A platform for creating and running MapReduce programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
Hive was built on top of Hadoop so that queries can be run over Big Data. Hive was created at Facebook. The problem Facebook faced soon became the problem of many other companies, and the performance and capabilities of RDBMSs and NoSQL systems gradually faded for big data. Reports gradually took several minutes and sometimes hours; sometimes running two reports concurrently created a big problem. Little by little, systems slowed down, got stuck, or went down. Even after solving this, the need to access data without getting involved with MapReduce became apparent: data had to be retrieved and used without mastering the complex knowledge of MapReduce. Hadoop had no schema and was hard to work with. Not reusable; for complex jobs, multiple stages of Map/Reduce functions are needed. Examples: the problem of the Tehran Province Telecommunication Company announcing outage lists or changes in its database; a query taking 36 hours versus 24 seconds; Tavanir.
What is Hadoop? Free and open source. There is a difference between open source and free; this is both free and open source. It is a data warehouse for Hadoop. It is an abstraction, an abstract system.
What is cool about Hive is that it lets us use Hadoop and big data facilities without knowing MapReduce. We benefit from scalable facilities while using a query-language interface similar to classic SQL. Hive was open-sourced by Facebook in 2008 and came under the Apache license.
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own drivers to communicate with the Hadoop world.
CLI: The Hive CLI is the console for firing Hive queries; it is used for operating on our data.
Web interface: Hive also provides a web interface to monitor/administer Hive jobs.
MetaStore: The metastore is Hive's data warehouse catalog, which stores all the structure information of the various tables/partitions in Hive (database catalog).
Thrift Server: we can expose Hive as a service, which can then be used for connecting via JDBC/ODBC etc.
UDF: User-Defined Functions
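A hedged sketch of built-in UDFs and UDAFs in HiveQL (substr is a UDF applied per row; count is a UDAF aggregating per group; the movies table and its columns follow the earlier samples):

```sql
-- The UDF substr runs once per row; the UDAF count(*) aggregates
-- all rows sharing the same group key.
SELECT substr(movie_name, 1, 1) AS initial,
       count(*)                 AS n_movies
FROM movies
GROUP BY substr(movie_name, 1, 1);
```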
Directed acyclic graph: a directed graph with no directed cycles.
Partition: each table can have one or more partition keys. Data is stored in files based on the partition key. Without partitions, all of the data is sent to MapReduce; with partitions, sending data to MapReduce is managed. Bucket: the data of each partition is further grouped by hash values; this data is kept inside the same partition folder.
For working with complex data and complex, multi-character delimiters. Use case: log processing.
DISTRIBUTE BY + SORT BY = CLUSTER BY; similar to GROUP BY.
These are like log4j, with the difference that they do pre- and post-processing on the logs.
Drill: the design goal is for Drill to scale to 10,000 servers or more, and to process petabytes of data and trillions of records in seconds.