Apache Hive
Sheetal Sharma
Intern At IBM Innovation Centre
Apache Hive
● Apache Hive is a tool built on top of Hadoop
for analyzing large, unstructured data sets
using a SQL-like syntax, thus making Hadoop
accessible to legions of existing BI and
corporate analytics researchers.
● Hive is fundamentally an operational data
store that's also suitable for analyzing large,
relatively static data sets where query time is
not important.
● Hive makes an excellent addition to an existing data
warehouse, but it is not a replacement. Instead,
using Hive to augment a data warehouse is a great
way to leverage existing investments while keeping
up with the data deluge.
● The Hive data store brings together vast amounts
of unstructured data -- such as log files,
customer tweets, email messages, geo-data,
and CRM interactions -- and stores them in an
unstructured format on cheap commodity
hardware.
● Hive allows analysts to project a database-like
structure on this data, to resemble traditional
tables, columns, and rows, and to write
SQL-like queries over it.
● This means that different schemas may be
projected over the same data sets, depending
on the nature of the query, allowing the user to
ask questions that weren't envisioned when
the data was gathered.
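
The idea of projecting different schemas over the same stored data can be illustrated with a minimal Python sketch. It is not Hive code: the log lines, the `project` helper, and the column names are all hypothetical, and the point is only that the schema is applied when the data is read, not when it is written.

```python
import csv
import io

# Hypothetical raw log data, stored once as tab-separated text.
RAW_LOG = (
    "2014-05-01\talice\t/home\t200\n"
    "2014-05-01\tbob\t/cart\t500\n"
    "2014-05-02\talice\t/cart\t200\n"
)

def project(raw, columns):
    """Apply a schema at read time.

    `columns` maps a field position to a (name, type) pair.
    """
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    return [
        {name: cast(row[pos]) for pos, (name, cast) in columns.items()}
        for row in reader
    ]

# Schema A: who visited on which day.
visits = project(RAW_LOG, {0: ("day", str), 1: ("user", str)})

# Schema B: which pages returned which status codes.
errors = project(RAW_LOG, {2: ("page", str), 3: ("status", int)})
```

Both `visits` and `errors` are derived from the same bytes; only the projection differs, which is what lets new questions be asked after the data has been collected.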
● Hive queries traditionally had high latency,
and even small queries could take some time
to run because they were transformed into
map-reduce jobs and submitted to the cluster
to be run in batch mode.
● Long-running queries were inconvenient and
troublesome to run in a multi-user
environment, where a single job could
dominate the cluster.
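
To make the map-reduce transformation more concrete, here is a conceptual Python sketch of how a query such as `SELECT user, COUNT(*) FROM visits GROUP BY user` could be decomposed into map, shuffle, and reduce steps. It runs in a single process and does not reflect Hive's actual query planner; the table, the rows, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit (key, 1) for every input row, as a mapper would.
    for user, _page in rows:
        yield user, 1

def shuffle(pairs):
    # Group values by key, standing in for the cluster's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the values for each key, as a reducer would.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical contents of a "visits" table.
rows = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
counts = reduce_phase(shuffle(map_phase(rows)))
```

On a real cluster each phase is scheduled as part of a batch job, and that scheduling overhead is one reason even small queries could take a noticeable time to return.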
● Although HiveQL, the query language, is based on SQL-92,
it differs from SQL in some important ways because it
runs on top of Hadoop.
● For instance, DDL (Data Definition Language)
commands need to account for the fact that tables
exist in a multi-user file system that supports multiple
storage formats.
● Nevertheless, SQL users will find the HiveQL
language familiar and should not have any problems
adapting to it.
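
As a hedged illustration of how DDL has to account for file location and storage format, the Python sketch below assembles a HiveQL `CREATE EXTERNAL TABLE` statement as a string. The helper `external_table_ddl`, the table names, the columns, and the HDFS paths are hypothetical; only the HiveQL clauses themselves (`ROW FORMAT DELIMITED`, `STORED AS`, `LOCATION`) are standard syntax.

```python
def external_table_ddl(name, columns, location, stored_as="TEXTFILE"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement as a string."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    parts = [f"CREATE EXTERNAL TABLE IF NOT EXISTS {name} ({cols})"]
    if stored_as == "TEXTFILE":
        # Text files need to be told how fields are separated.
        parts.append("ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    parts.append(f"STORED AS {stored_as}")
    # The table is a view over files that already live in the file system.
    parts.append(f"LOCATION '{location}'")
    return "\n".join(parts)

columns = [("day", "STRING"), ("status", "INT")]
text_ddl = external_table_ddl("web_logs", columns, "/data/raw/web_logs")
orc_ddl = external_table_ddl("web_logs_orc", columns, "/data/orc/web_logs", "ORC")
```

Unlike a typical relational `CREATE TABLE`, the statement names both where the files live and how they are encoded, because Hive does not own the storage layer.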
Hive platform architecture
● From the top down, Hive looks much like any other
relational database.
● Users write SQL queries and submit them for
processing, either through a command-line tool that
interacts directly with the database engine or
through third-party tools that communicate with the
database via JDBC or ODBC.
● By using the JDBC and ODBC drivers, available for
Mac and Windows, data workers can connect their
favorite SQL client to Hive to browse, query, and
create tables.
Working with Hive
● HiveQL was designed to ease the transition from SQL
and to get data analysts up and running on Hadoop right
away.
● Most BI and SQL developer tools can connect to Hive as
easily as to any other database. Using the ODBC
connector, users can import data and use tools like
PowerPivot for Excel to explore and analyze data,
making big data accessible across the organization.
Differences between HiveQL and standard SQL
Hive 0.13 was designed to perform full-table scans
across petabyte-scale data sets using the YARN and Tez
infrastructure, so some features normally found in a
relational database aren't available to the Hive user.
These include transactions, cursors, prepared
statements, row-level updates and deletes, and the
ability to cancel a running query.
The absence of these features won't significantly
affect data analysis, but it might affect your ability to use
existing SQL queries on a Hive cluster.
In a traditional database environment, the database
engine controls all reads and writes to the database. In
Hive, the database tables are stored as files in the
Hadoop Distributed File System (HDFS), where other
applications could have modified them.
Although this openness can be useful, it means that Hive
can never be certain whether the data being read matches
the schema.
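
The consequence of reading files that other applications may have changed can be sketched in Python as follows. This is a minimal illustration and not Hive's reader: it assumes that a field which is missing or cannot be cast to the declared type is surfaced as a NULL-like value (`None` here) instead of aborting the query. The function name, the schema, and the sample lines are hypothetical.

```python
def read_with_schema(lines, schema):
    """Parse tab-separated lines against (name, cast) pairs.

    Fields that are missing or fail to cast become None.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (ValueError, IndexError):
                row[name] = None  # analogous to a NULL in the query result
        rows.append(row)
    return rows

schema = [("user", str), ("status", int)]
# The second and third lines no longer match the declared schema,
# as if another (hypothetical) application had rewritten the file.
rows = read_with_schema(["alice\t200\n", "bob\tok\n", "carol\n"], schema)
```

The mismatch is only discovered at query time, which is the trade-off of letting the file system, not the database engine, control writes.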
Aspects of Data Storage
File Formats and Compression
● Tuning Hive queries can involve making the underlying
map-reduce jobs run more efficiently by optimizing the
number, type, and size of the files backing the database
tables.
● Hive's default storage format is text, which has the
advantage of being usable by other tools.
● The disadvantage, however, is that queries over raw
text files can't be easily optimized.
Hive can read and write several file formats and decompress
many of them on the fly. Storage requirements and query
efficiency can differ dramatically among these file formats, as can
be seen in the figure below (courtesy of Hortonworks).
File formats are an active area of research in the Hadoop community.
Efficient file formats both reduce storage costs and increase query
efficiency.
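
A rough way to see why the physical layout of a file matters is to store the same hypothetical table in two layouts and compress each one. The Python sketch below uses `zlib` from the standard library as a stand-in for the codecs used on a Hadoop cluster; it does not implement any real Hive file format, and which layout comes out smaller depends on the data.

```python
import zlib

# Hypothetical table: 1,000 rows of (day, status).
rows = [("2014-05-01", "200" if i % 10 else "500") for i in range(1000)]

# Row-oriented text layout: one tab-separated line per row.
row_text = "\n".join("\t".join(r) for r in rows).encode()

# Column-oriented layout: all values of one column stored together.
col_text = "\n".join("\t".join(col) for col in zip(*rows)).encode()

row_compressed = zlib.compress(row_text)
col_compressed = zlib.compress(col_text)

sizes = {
    "row_raw": len(row_text),
    "row_compressed": len(row_compressed),
    "column_raw": len(col_text),
    "column_compressed": len(col_compressed),
}
```

Inspecting `sizes` for a sample of real data gives a first impression of how much a different layout or codec might save before committing to a format.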
For Example
● Suppose you want to run a query that the built-in
SQL functions do not cover. Without a
user-defined function (UDF), you would have to dump
a temporary table to disk, run a second tool (such
as Pig or a custom Java program) for your custom
query, and possibly produce a third table in HDFS
that would then be analyzed by Hive.
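
Hive UDFs are typically written in Java, so the sketch below does not show the UDF API. Instead it illustrates a related mechanism, Hive's `TRANSFORM` clause, which streams rows to an external script as tab-separated lines and reads result rows back in the same form. The script name `extract_domain.py`, the table `web_logs`, and its columns are hypothetical.

```python
from urllib.parse import urlparse

def transform(lines):
    """Turn 'user<TAB>url' rows into 'user<TAB>domain' rows."""
    for line in lines:
        user, url = line.rstrip("\n").split("\t")
        yield f"{user}\t{urlparse(url).netloc}"

# In a streaming script the rows would come from sys.stdin and each
# result would be printed to stdout. From HiveQL, such a script could
# be invoked along these lines (names are assumptions):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (user, url) USING 'python extract_domain.py'
#     AS (user, domain) FROM web_logs;
sample = list(transform(["alice\thttp://example.com/cart\n"]))
```

Either way, the custom logic runs inside the query itself, so no intermediate table has to be written out and handed to a second tool.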
Hive Query Performance
Hive 0.13 is the final piece in the Stinger initiative, a
community effort to improve the performance of Hive. The
most significant feature of 0.13 is the ability to run queries on
the new Tez execution framework.
● Query times dropped by half when run on Tez.
● On queries that could be cached, times dropped another 30
percent.
● On larger data sets, the speedup was even more dramatic.
● This makes it possible to execute petabyte-scale queries to
refine and cleanse data for later incorporation into data
warehouse analytics.
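
The combined effect of the two reported speedups can be worked out with a few lines of arithmetic. The baseline of 100 seconds is a hypothetical figure, and the calculation assumes that the additional 30 percent is measured relative to the Tez time, which the text does not state explicitly.

```python
baseline_seconds = 100.0                    # hypothetical MapReduce query time
tez_seconds = baseline_seconds * 0.5        # "query times dropped by half"
cached_seconds = tez_seconds * (1 - 0.30)   # "another 30 percent", assumed relative to the Tez time

# Fraction of the original run time that has been saved overall.
overall_reduction = 1 - cached_seconds / baseline_seconds
```

Under these assumptions a cacheable query would take about 35 percent of its original time, an overall reduction of about 65 percent.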
● Hadoop and Hive could also be used in the reverse scenario:
to off-load data summaries that would otherwise need to be
stored in the data warehouse at much greater cost.
● Organizations or departments without a data warehouse can
start with Hive to get a feel for the value of data analytics.
● Hive makes a great, low-cost, large-scale operational data
store with a fair set of analytics tools.
● Hive offers near-linear scalability in query processing and an
order-of-magnitude better price/performance ratio than
traditional enterprise data warehouses.
Apache Hive At a Glance
Thank You!

Apache hive1

  • 1. Apache Hive Sheetal Sharma Intern At IBM Innovation Centre
  • 2. Apache Hive ● Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, thus making Hadoop accessible to legions of existing BI and corporate analytics researchers. ● Hive is fundamentally an operational data store that's also suitable for analyzing large, relatively static data sets where query time is not important.
  • 3. Apache Hive ● Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge. ● Hive data store brings together vast amounts of unstructured data -- such as log files, customer tweets, email messages, geo-data, and CRM interactions -- and stores them in an unstructured format on cheap commodity hardware.
  • 4. Apache Hive ● Hive allows analysts to project a database-like structure on this data, to resemble traditional tables, columns, and rows, and to write SQL-like queries over it. ● This means that different schemas may be projected over the same data sets, depending on the nature of the query, allowing the user to ask questions that weren't envisioned when the data was gathered.
  • 5. Apache Hive ● Hive queries traditionally had high latency, and even small queries could take some time to run because they were transformed into map-reduce jobs and submitted to the cluster to be run in batch mode. ● Long-running queries were inconvenient and troublesome to run in a multi-user environment, where a single job could dominate the cluster.
  • 7. Apache Hive ● HiveQL, the query language, is based on SQL-92, but it differs from SQL in some important ways because it runs on top of Hadoop. ● For instance, DDL (Data Definition Language) commands need to account for the fact that tables exist in a multi-user file system that supports multiple storage formats. ● Nevertheless, SQL users will find the HiveQL language familiar and should not have any problems adapting to it.
  • 9. Hive platform architecture ● From the top down, Hive looks much like any other relational database. ● Users write SQL queries and submit them for processing, either through a command-line tool that interacts directly with the database engine or through third-party tools that communicate with the database via JDBC or ODBC. ● By using the JDBC and ODBC drivers, available for Mac and Windows, data workers can connect their favorite SQL client to Hive to browse, query, and create tables.
  • 10. Working with Hive ● HiveQL was designed to ease the transition from SQL and to get data analysts up and running on Hadoop right away. ● Most BI and SQL developer tools can connect to Hive as easily as to any other database. Using the ODBC connector, users can import data and use tools like PowerPivot for Excel to explore and analyze data, making big data accessible across the organization.
  • 11. Differences in HiveQL and standard SQL Hive 0.13 was designed to perform full-table scans across petabyte-scale data sets using the YARN and Tez infrastructure, so some features normally found in a relational database aren't available to the Hive user. These include transactions, cursors, prepared statements, row-level updates and deletes, and the ability to cancel a running query. The absence of these features won't significantly affect data analysis, but it might affect your ability to use existing SQL queries on a Hive cluster.
  • 12. Differences in HiveQL and standard SQL In a traditional database environment, the database engine controls all reads and writes to the database. In Hive, the database tables are stored as files in the Hadoop Distributed File System (HDFS), where other applications could have modified them. Although this can be a good thing, it means that Hive can never be certain if the data being read matches the schema.
  • 13. Aspects of Data Storage File formats and Compression ● Tuning Hive queries can involve making the underlying map-reduce jobs run more efficiently by optimizing the number, type, and size of the files backing the database tables. ● Hive's default storage format is text, which has the advantage of being usable by other tools. ● The disadvantage, however, is that queries over raw text files can't be easily optimized.
  • 14. Hive can read and write several file formats and decompress many of them on the fly. Storage requirements and query efficiency can differ dramatically among these file formats, as can be seen in the figure below (courtesy of Hortonworks). File formats are an active area of research in the Hadoop community. Efficient file formats both reduce storage costs and increase query efficiency.
  • 15. For Example ● For example, let's say you want to run a query that's not part of the built-in SQL. Without a user-defined function (UDF), you would have to dump a temporary table to disk, run a second tool (such as Pig or Java) for your custom query, and possibly produce a third table in HDFS that would then be analyzed by Hive.
  • 16. Hive Query Performance Hive 0.13 is the final piece in the Stinger initiative, a community effort to improve the performance of Hive. The most significant feature of 0.13 is the ability to run queries on the new Tez execution framework. ● Query times dropped by half when run on Tez. ● On queries that could be cached, times dropped another 30 percent. ● On larger data sets, the speedup was even more dramatic. ● It becomes possible to execute petabyte-scale queries to refine and cleanse data for later incorporation into data warehouse analytics.
  • 17. Hive Query Performance ● Hadoop and Hive could also be used in the reverse scenario: to off-load data summaries that would otherwise need to be stored in the data warehouse at much greater cost. ● Organizations or departments without a data warehouse can start with Hive to get a feel for the value of data analytics. ● Hive makes a great, low-cost, large-scale operational data store with a fair set of analytics tools. ● Hive offers near-linear scalability in query processing and an order-of-magnitude better price/performance ratio than traditional enterprise data warehouses.
  • 18. Apache Hive At a Glance
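
The idea from slides 4 and 12, that a schema is projected onto stored files at read time and that the data may not match it, can be illustrated with a small sketch. This is plain Python, not Hive code: the sample log lines, the `project` helper, and both schemas are hypothetical and exist only for illustration. Returning None for a field that does not fit the schema is meant to resemble how Hive commonly yields NULL for such fields; treat that as an assumption about the storage format in use.

```python
import csv
import io

# Hypothetical raw data: tab-separated log lines stored as-is, with no schema attached.
RAW = (
    "2024-01-05\talice\t/home\t120\n"
    "2024-01-05\tbob\t/cart\t340\n"
    "2024-01-06\talice\t/cart\t95\n"
)


def project(raw, schema):
    """Apply a schema (column name -> (field position, type)) while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw), delimiter="\t"):
        row = {}
        for name, (pos, cast) in schema.items():
            try:
                row[name] = cast(fields[pos])
            except (IndexError, ValueError):
                # The stored data does not match the schema: report a null value.
                row[name] = None
        rows.append(row)
    return rows


# Two different schemas projected over the same stored bytes.
visits_schema = {"user": (1, str), "page": (2, str)}
latency_schema = {"day": (0, str), "millis": (3, int)}

visits = project(RAW, visits_schema)
latency = project(RAW, latency_schema)

# Question 1: which users reached the cart page?
cart_users = sorted({r["user"] for r in visits if r["page"] == "/cart"})

# Question 2: total response time per day.
total_by_day = {}
for r in latency:
    total_by_day[r["day"]] = total_by_day.get(r["day"], 0) + r["millis"]

print(cart_users)
print(total_by_day)
```

Neither question requires rewriting or reloading the stored data; only the schema applied at read time changes.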