This document provides an introduction to Apache Hive, including:
- Hive enables data warehousing and analysis of large datasets stored in Hadoop through the HiveQL query language, which is automatically translated into MapReduce jobs.
- Key advantages of Hive include a higher-level query language that simplifies working with large datasets and a lower learning curve than Pig or MapReduce. However, updating data is complicated because of HDFS, and Hive queries have high latency.
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Someshwar Kale
This presentation is based on my experience while learning Hive. Most of the features and limitations covered in this presentation were in an incubating phase at the time of writing.
2. Outline
• What is Hive
• Why Hive over MapReduce or Pig?
- Advantages and disadvantages
• Running Hive
• HiveQL language
• Examples
• Internals
• Hive vs Pig
• Other Hadoop projects
3. Hive
• Data warehousing on top of Hadoop.
• Designed to
- enable easy data summarization
- ad-hoc querying
- analysis of large volumes of data.
• HiveQL statements are automatically translated into MapReduce jobs
4. Advantages of Hive
• Higher level query language
- Simplifies working with large amounts of data
• Lower learning curve than Pig or MapReduce
- HiveQL is much closer to SQL than Pig
- Less trial and error than Pig
5. Disadvantages
• Updating data is complicated
- Mainly because of using HDFS
- Can add records
- Can overwrite partitions
• No real time access to data
- Use other means like HBase or Impala
• High latency
6. Running Hive
• We will look at it more closely in the practice session, but you can run Hive from
- Hive web interface
- Hive shell
• $HIVE_HOME/bin/hive for interactive shell
• Or you can run queries directly: $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
- JDBC - Java Database Connectivity
• "jdbc:hive://host:port/dbname"
- Also possible to use Hive directly in Python, C, C++, PHP
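A saved script file can also be executed with the -f flag, mirroring the -e example above (a minimal sketch; the script path here is hypothetical):
$HIVE_HOME/bin/hive -f /path/to/queries.hql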
7. Query Processor
• Compiler
- Parser
- Type checking
- Plan Generator
- Task Generator
• Execution engine
- Plan
- Operators
- UDFs (and UDAFs)
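To inspect what the compiler produces for a given query, Hive's EXPLAIN statement prints the generated plan; a minimal sketch against the page_view table defined later in this deck:
EXPLAIN
SELECT country, count(*) FROM page_view GROUP BY country;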
8. HiveQL
• Hive query language provides the basic SQL-like operations.
• These operations are (a combined sketch follows this list):
- Ability to filter rows from a table using a where clause.
- Ability to select certain columns from the table using a select clause.
- Ability to do equi-joins between two tables.
- Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
- Ability to store the results of a query into another table.
- Ability to download the contents of a table to a local directory.
- Ability to store the results of a query in a hadoop dfs directory.
- Ability to manage tables and partitions (create, drop and alter).
- Ability to use custom scripts in chosen language (for map/reduce).
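A minimal sketch combining several of these operations — filtering, aggregating, and storing the result in another table (the target table us_page_counts is hypothetical; page_view is defined later in this deck):
INSERT OVERWRITE TABLE us_page_counts
SELECT page_url, count(*)
FROM page_view
WHERE country = 'US'
GROUP BY page_url;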
9. Data units
• Databases: Namespaces that separate tables and other data units from naming conflicts.
• Tables:
- Homogeneous units of data which have the same schema.
- Consist of the columns specified in that schema.
• Partitions:
- Each Table can have one or more partition Keys which determine how data is stored.
- Partitions allow the user to efficiently identify the rows that satisfy a certain criterion.
- It is the user's job to guarantee the relationship between partition name and data!
- Partitions are virtual columns; they are not part of the data, but are derived on load.
• Buckets (or Clusters):
- Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
- For example the page_views table may be bucketed by userid to sample the data (see the sketch below).
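A minimal DDL sketch of a partitioned, bucketed table (the column choices here are illustrative):
CREATE TABLE page_views(viewTime INT, userid BIGINT, page_url STRING)
PARTITIONED BY(dt STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS;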
10. Types
• Types are associated with the columns in the tables. The following Primitive types are supported:
- Integers
• TINYINT - 1 byte integer
• SMALLINT - 2 byte integer
• INT - 4 byte integer
• BIGINT - 8 byte integer
- Boolean type
• BOOLEAN
- Floating point numbers
• FLOAT
• DOUBLE
- String type
• STRING
11. Complex types
• Structs
- the elements within the type can be accessed using the DOT (.) notation.
- For example, for a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a
• Maps (key-value tuples)
- The elements are accessed using ['element name'] notation.
- For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
• Arrays (indexable lists)
- The elements in the array have to be of the same type.
- Elements can be accessed using the [n] notation, where n is an index (zero-based) into the array.
- For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
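A minimal sketch of declaring and reading all three complex types (table and column names are illustrative):
CREATE TABLE complex_demo(
c STRUCT<a:INT, b:INT>,
m MAP<STRING, BIGINT>,
arr ARRAY<STRING>);
SELECT c.a, m['group'], arr[1] FROM complex_demo;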
12. Built in functions
• Round(number), floor(number), ceil(number)
• Rand(seed)
• Concat(string1, string2), substr(string, start, length)
• Upper(string), lower(string), trim(string)
• Year(date), Month(date), Day(date)
• Count(*), sum(col), avg(col), max(col), min(col)
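A minimal usage sketch (the users_demo table and its columns are hypothetical):
SELECT upper(trim(name)), round(score), year(signup_date)
FROM users_demo;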
13. Create Table
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
14. Load Data
• There are multiple ways to load data into Hive tables.
• The user can create an external table that points to a specified location within HDFS.
• The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information.
• Once this is done, the user can transform the data and insert it into any other Hive table.
15. Load example
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
17. Loading and storing data
• Loading data directly:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view
PARTITION(date='2008-06-08', country='US')
18. Storing locally
• To write the output onto a local disk, for example to load it into an Excel spreadsheet later:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
19. Display functions
• SHOW TABLES;
- Lists all tables
• SHOW PARTITIONS page_view;
- Lists partitions on a specific table
• DESCRIBE EXTENDED page_view;
- To show all metadata
- Mainly for debugging
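Two related standard commands: SHOW DATABASES lists the available databases, and DESCRIBE without EXTENDED prints just the column schema:
SHOW DATABASES;
DESCRIBE page_view;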
20. INSERT
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
page_views.date <= '2008-03-31' AND
page_views.referrer_url like '%xyz.com';
21. Multiple table/file inserts
• The output can be sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities).
• If, along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
• The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to hadoop dfs files.
22. JOIN
• LEFT OUTER, RIGHT OUTER or FULL OUTER
• Can join more than 2 tables at once
• It is best to put the largest table on the rightmost side of the join to get the best performance.
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
23. Aggregations
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
24. Sampling
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
• Reads only the 3rd of 32 buckets, so roughly 1/32 of the data is scanned (assuming the table was bucketed into 32 buckets).
25. Union
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
SELECT av.uid AS uid, av.date AS date
FROM action_video av
WHERE av.date = '2008-06-03'
UNION ALL
SELECT ac.uid AS uid, ac.date AS date
FROM action_comment ac
WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid);
26. Running custom mapreduce
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script.py'
AS date, count;
27. Map Example
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print(','.join([userid, str(weekday)]))
28. Co-group
FROM (
FROM (
FROM action_video av
SELECT av.uid AS uid, av.id AS id, av.date AS date
UNION ALL
FROM action_comment ac
SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
) union_actions
SELECT union_actions.uid, union_actions.id, union_actions.date
CLUSTER BY union_actions.uid) map
INSERT OVERWRITE TABLE actions_reduced
SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
30. Java UDF
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
31. Using UDF
• create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
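Before the function can be registered, the compiled jar typically has to be added to the session with Hive's ADD JAR command (a minimal sketch; the jar path here is hypothetical):
ADD JAR /path/to/my_udfs.jar;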
32. Hive disadvantages
• Same disadvantages as MapReduce and Pig
- Slow start-up and clean-up of MapReduce jobs
• It takes time for Hadoop to schedule MR jobs
- Not suitable for interactive OLAP Analytics
• When results are expected in < 1 sec
• Designed for querying and not data transformation
- Limitations of the SQL language
- Tasks like co-grouping can get complicated
33. Pig vs Hive
                    Pig                       Hive
Purpose             Data transformation       Ad-hoc querying
Language            Something similar to SQL  SQL-like
Difficulty          Medium (trial-and-error)  Low to medium
Schemas             Yes (implicit)            Yes (explicit)
Join (distributed)  Yes                       Yes
Shell               Yes                       Yes
Streaming           Yes                       Yes
Web interface       No                        Yes
Partitions          No                        Yes
UDFs                Yes                       Yes
34. Pig vs Hive
• SQL might not be the perfect language for expressing data transformation commands
• Pig
- Mainly for data transformations and processing
- Unstructured data
• Hive
- Mainly for warehousing and querying data
- Structured data
35. More Hadoop projects
• HBase
- Open-source distributed database on top of HDFS
- HBase tables only use a single key
- Tuned for real-time access to data
• Cloudera Impala™
- Simplified, real-time queries over HDFS
- Bypasses job scheduling, and removes everything else that makes MR slow.
36. Big Picture
• Store large amounts of data in HDFS
• Process raw data: Pig
• Build schema using Hive
• Data queries: Hive
• Real time access
- access to data with HBase
- real-time queries with Impala
37. Is Hadoop enough?
• Why use Hadoop for large scale data processing?
- It is becoming a de facto standard in Big Data
- Collaboration among top companies instead of vendor tool lock-in.
• Amazon, Apache, Facebook, Yahoo!, etc. all contribute to open source Hadoop
- There are tools for everything from setting up a Hadoop cluster in minutes and importing data from relational databases, to setting up workflows of MR, Pig and Hive.