Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements brought to Hive by the Stinger initiative.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Hadoop Infrastructure @Uber: Past, Present and Future (DataWorks Summit)
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. Within Uber's data infrastructure, Hadoop is a central component. This talk covers the journey of Hadoop at Uber and future plans for scaling to support billions of trips: how the cluster grew from 10 to 2,000 nodes and how it will scale to tens of thousands of nodes. We will share our mistakes, lessons and wins, how we process billions of events per day, and the unique challenges and real-world use cases involved in co-locating Uber's service architecture with batch workloads (e.g. data pipelines, machine learning and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some problems in ways not attempted before. We hope this presentation serves as an example that encourages the audience to enhance the ecosystem themselves, growing the community around these projects and the big data space overall. The talk is for anyone working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes; it will also introduce some of the technologies the Uber team is building in the big data space.
Tez is the next-generation Hadoop query processing framework built on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic, enabling the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
Real-Time Interactive Queries in Hadoop: Big Data Warehousing Meetup (Caserta)
During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. Also, Hortonworks provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
HPE Hadoop Solutions - From use cases to proposal (DataWorks Summit)
Hadoop now does much more than storage and MapReduce, and it keeps improving and innovating. It brings near-real-time, interactive and cost-efficient capabilities to big data.
Join us to hear about solutions based on Hadoop: how they respond to specific customer needs, with which components from the Hadoop ecosystem, and based on which HPE Reference Architectures for the platform.
Hadoop solutions such as ETL offloading, predictive analytics, ad hoc query, complex event processing, stream processing, search, machine learning, deep learning, and more.
Based on software components such as Spark, Hive, HBase, Kafka, Storm, Flume, Impala and Elasticsearch.
Speaker
John Osborn, SA, Hewlett Packard Enterprise
Hadoop's capabilities offer untapped potential for business insights, but companies often get weighed down with DIY platforms and fail to keep up with the requirements. Join this Dell EMC session, which addresses this challenge with ready bundles that quickly deliver solutions for ETL offload, single view, and IoT.
Get more value from your big data:
• Deploy big data applications faster
• Increase business agility
• Confidently deliver high performance and endless scale
• Improve IT operational efficiency
Speaker
Shawn Smith, Big Data Specialist, Dell EMC
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes (DataWorks Summit)
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure that holds all the data inside an organization. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling HDFS storage management: the centralized scheme within the NameNode becomes the main bottleneck and limits the total number of files that can be stored. Although a typical large HDFS cluster can store several hundred petabytes of data, it handles large numbers of small files inefficiently under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to remove this limitation. Storage management is extended to a distributed scheme, and a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture HDFS storage management scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Big Data Warehousing: Pig vs. Hive Comparison (Caserta)
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we presented a Hive vs. Pig comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse (DataWorks Summit)
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has largely been driven by business continuity planning or geo-localization requirements. It has also recently been gaining interest from a hybrid-cloud perspective, i.e. where people augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities: how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also cover what users need to be aware of to make replication optimal for their use case.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the open source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk gives an introduction to the Spark stack, explains how Spark achieves lightning-fast results, and shows how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) - Sudhir Mallem
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
using storage formats: Parquet, ORC, RCFile and Avro
Compression: Snappy, zlib and default compression (gzip)
Lisa McIntyre, Digital Asset Management Librarian at GSD&M, and Bob Canaway, CMO at Nuxeo, present on the new use cases for digital asset management in the enterprise.
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas (MongoDB)
In this talk, Appboy co-founder and CIO Jon Hyman will discuss various schemas that Appboy has evolved to use on MongoDB, remaining agile as Appboy has grown to massive scale. Jon will discuss topics such as random sampling of documents, multivariate testing and multi-armed bandit optimization of such tests, field tokenization, and how Appboy stores multi-dimensional data on an individual user basis to be able to quickly optimize for the best time to deliver messages to end users. Appboy is the global leader in Marketing Automation for Apps, helping clients such as Urban Outfitters, Shutterfly, Kixeye, PicsArt, USA Today Sports, and iHeartRadio increase engagement through automated messaging. Each month, Appboy collects tens of billions of data points from hundreds of millions of monthly active users.
Real-time, Sensor-based Monitoring of Shipping Containers (benaam)
This presentation describes a sensor-based solution for real-time monitoring of high-value assets in-transit so shippers can react quickly to unplanned events such as delays, cargo damage, and even thefts.
Selected as one of the best presentations at the 2012 Vail Computer Elements Workshop. For 42 years, this 4-day workshop has served leading architects of the computer industry. The agenda is 100% invited technical talks and the audience is mostly previous speakers.
Learn how to deconstruct what it means to be "Open," as well as how to engage developers, leverage users, and shape your data to make your platform ready for commercial use.
Presented April 14th, 2009, at BayCHI: http://www.baychi.org/calendar/20090414/
Hadoop & Greenplum: Why Do Such a Thing? (Ed Kohlwey)
Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.
This guide explains how to implement an Aruba 802.11n wireless network that must provide high-speed access to an auditorium-style room with 500 or more seats. Aruba Networks refers to such networks as high-density wireless LANs (HD WLANs). Lecture halls, hotel ballrooms, and convention centers are common examples of spaces with this requirement. Because the number of concurrent users on an AP is limited, serving such a large number of devices requires access point (AP) densities well in excess of the usual one AP per 2,500 – 5,000 ft2 (225 – 450 m2). Such coverage areas therefore present many special technical design challenges. This validated reference design provides the design principles, capacity planning methods, and physical installation knowledge needed to successfully deploy HD WLANs.
Best practice strategies to clean up and maintain your database with Hether G... (Blackbaud Pacific)
In this webinar Hether Ghelf, Blackbaud Pacific’s Senior Consultant & Project Manager, discusses a best practice approach to database cleaning and continued maintenance.
Cleansing your data can have an immediate impact on your business by increasing retention and response rates, decreasing the volume of mail returned from post, and ensuring mail is reaching your organisation’s constituents.
View the recording here: https://www.blackbaud.com.au/notforprofit-events/webinars/past
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform (Hortonworks)
Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at:
Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax , Xstream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution (Etu Solution)
Speaker: Senior Product Consultant, Informatica | 尹寒柏
Session overview: In the Big Data era, what counts is not the quantity of data but the depth at which you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once a mere buzzword, into a verb: moving from BI to CI, connecting with the pulse of the consumer economy and gaining insight into customer intent. One mindset matters in the Big Data era, though: in the end, the competition is not just about growing data volumes but about who understands the data more deeply. Informatica is the answer. With Informatica, we relieve the enormous pressure on enterprises to deliver trustworthy data in a timely manner; and as data volume and complexity keep rising, Informatica can also bring data together faster, making it meaningful and usable for improving efficiency, quality, certainty and competitive advantage. Informatica achieves this goal faster and more effectively, and is SYSTEX Group's (精誠集團) tool of choice for the Big Data era.
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions that help you become data-driven. Join us in this session to hear about HPE's enterprise-grade Hadoop solution, which encompasses the following:
-Infrastructure – Two industrialized solutions optimized for Hadoop; a standard solution with co-located storage and compute and an elastic solution which lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, and Hadoop ecosystem components like Spark and more. And a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE’s data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such as industrial-grade SQL on Hadoop with Vertica, data encryption with SecureData, the SAP ecosystem with SAP HANA Vora, multitenancy with BlueData, object storage with Scality, and more.
In 2017, more and more corporations are looking to reduce operational overheads in their enterprise data warehouse (EDW) installations. Hortonworks just launched the industry's first turnkey EDW optimization solution together with our partners Syncsort and AtScale. Join Hortonworks' CTO Scott Gnau to learn more about this exciting solution and its three use cases.
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To... (Precisely)
So you built your Hadoop cluster. How do you get data from hundreds of database tables, streaming Kafka sources, and data shared by 20-year-old COBOL programs, all in there and working together quickly, efficiently and securely? With many customers asking this same question, Hortonworks recently expanded its partnership with Syncsort to provide optimized ETL onboarding for Hadoop. During this talk, we'll discuss how a next-generation ETL tool, built on contributions to the open source community and natively integrated in Hadoop, can drive lasting value for your organization. 1) Seamlessly onboard data from all your enterprise sources – batch and streaming -- into Hadoop for fast and easy analytics. 2) Stay agile and simplify your environment with a "design once, deploy anywhere" approach that minimizes disruption and risk in the face of a rapidly evolving big data ecosystem. 3) Secure, govern and manage your data with full integration with Apache Ambari, Apache Ranger, and more. These benefits come to life with real customer case studies. Learn how a national insurance company and global hotel chain are using Hortonworks HDP and Syncsort DMX-h to get bigger insights from their enterprise data, securely, efficiently, and cost-effectively, without spending hundreds of man-hours.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 (Rajit Saha)
At VMware Corporate IT, the Data Solution and Delivery Team has built an enterprise advanced data analytics platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2 and Alpine Data Labs.
Cisco Big Data Warehouse Expansion Featuring MapR Distribution (Appfluent Technology)
Learn more about the Cisco Big Data Warehouse Expansion Solution featuring MapR Distribution including Apache Hadoop.
The BDWE solution begins with the collection of data usage statistics by Appfluent. It then combines Cisco UCS hardware optimized for running the MapR Distribution including Hadoop, software for federating multiple data sources, and a comprehensive services methodology for assessing, migrating, virtualizing, and operating a logically expanded warehouse.
View the recording:
http://hortonworks.com/webinar/accelerating-real-time-data-ingest-hadoop/
Hadoop didn’t disrupt the data center. The exploding amounts of data did. But, let’s face it, if you can’t move your data to Hadoop, then you can’t use it in Hadoop. The experts from Hortonworks, the #1 leader in Hadoop development, and Attunity, a leading data management software provider, cover:
- How to ingest your most valuable data into Hadoop using Attunity Replicate
- About how customers are using Hortonworks DataFlow (HDF) powered by Apache NiFi
- How to combine the real-time change data capture (CDC) technology with connected data platforms from Hortonworks
We discuss how Attunity Replicate and Hortonworks Data Flow (HDF) work together to move data into Hadoop.
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC) - Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single-purpose data lakes
- How to build a logical multipurpose data lake for business users
- The newer use cases that make multipurpose data lakes a necessity
Gluent Extending Enterprise Applications with Hadoop (Gluent)
This presentation shows how to transparently extend enterprise applications with the power of modern data platforms such as Hadoop. Application re-writing is not needed and there is no downtime when virtualizing data with Gluent.
Similar to Hadoop from Hive with Stinger to Tez (20)
Slides for the Manchester Power BI User Group meeting on June 27th, 2019.
Subject: Power BI for Developers, covering Power BI Embedded and Power BI Custom Visuals
Presentation given as part of the Global Azure Bootcamp 2017, April 22, 2017. Subject: a one-day hands-on workshop about the Cortana Intelligence Suite.
Presentation given at SQL Saturday Denmark (#541), October 15, 2016 about the Power BI REST API and Power BI Embedded.
Demos are available at https://github.com/liprec/demos
2. 2
Introduction
Jan Pieter Posthuma
Microsoft Data Consultant
Rubicon, a local consultancy firm in the Netherlands
Architect role on multiple projects
Analysis Services, Reporting Services, Big Data, HDInsight, Cloud BI, Power BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jp.posthuma@rubicon.nl
4. 4
Hadoop
Hadoop is a collection of software for building a data-intensive distributed cluster running on commodity hardware:
‘store and process the data on the Internet in a simple, scalable and economically feasible way’
Widely accepted by database vendors as a solution for unstructured data
Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight (now on Windows and Linux)
Available on premises and as an Azure service
Hortonworks Data Platform (HDP) is 100% open source!
5. 5
Why SQL on Hadoop?
‘Hadoop is great for cost, but MapReduce is too difficult.’
‘SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer.’
‘I’m deleting important data because it’s too expensive to store it.’
6. 6
Hive
Originally developed at Facebook to address traditional RDBMS limitations.
300+ PB of data under management.
600+ TB of data loaded daily.
60,000+ Hive queries per day.
More than 1,000 users per day.
Initial Apache release in April 2009
Problem: Hive is bound to MapReduce, which leads to high latency; it needs higher performance
7. 7
Stinger
‘Making Apache Hive 100 Times Faster’
Hortonworks blog, February 2013
Vectorized SQL Engine + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X+
8. 8
ORCFiles
Started by Hortonworks to optimize the existing RCFile format, with input from Microsoft, to work together with the query engine and Tez
Two goals:
Improve query speed
Improve storage efficiency
CREATE TABLE … STORED AS ORC
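As a minimal sketch of that syntax in use (the table names here are hypothetical), an existing text-format table can be converted to ORC with a CREATE TABLE AS SELECT, after which queries read the smaller columnar copy:
-- assumes a plain-text source table named web_logs already exists
CREATE TABLE web_logs_orc STORED AS ORC AS SELECT * FROM web_logs;
-- queries now scan compressed, column-oriented stripes instead of raw text
SELECT COUNT(*) FROM web_logs_orc;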
11. 11
Stinger TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x Query Speedup, Maximum 160x Query Speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.
The cost-based optimizer added in Hive 0.14 gave an additional 2.5x speedup.
12. 12
Stinger.Next
Stinger.Next (in 3 phases)
Transactions with ACID semantics – allow users to easily modify data with inserts, updates and deletes, extending Hive from the traditional write-once, read-often system to support analytics over changing data.
Sub-second queries – allow users to deploy Hive for interactive dashboards and exploratory analytics with more demanding response-time requirements, with the emergence of LLAP (Live Long and Process) and Hive on Spark.
SQL:2011 analytics – allows rich reporting to be deployed on Hive faster, more simply and more reliably using standard SQL. A powerful cost-based optimizer ensures that complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
13. 13
Apache Hive: Modern Architecture (legend: historical / current / in development)
Storage – columnar storage: ORCFile, Parquet; unstructured data: JSON, CSV, Text, Avro, custom (weblog)
Engine – SQL engines: row engine, vector engine
SQL – SQL support: SQL:2011 optimizer, HCatalog, HiveServer2
Cache – block cache, Linux cache, vector cache, LLAP (persistent server)
Distributed execution – Hadoop 1: MapReduce; Hadoop 2: Tez and Spark
15. 15
Links
Microsoft Big Data:
http://www.microsoft.com/bigdata
Hortonworks:
http://www.hortonworks.com
Try it yourself via Windows Azure HDInsight:
http://azure.com/hdinsight
Based on Google’s academic papers about distributed storage (2003) and large-scale data processing (2004), which inspired HDFS and MapReduce.
Hadoop started in 2005.
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
Baselined against Hive 0.10 and delivered over 18 months in three phases:
1. Introducing ORC files (Optimized Row Columnar)
2. Vectorized query engine
3. Hive on Tez
http://hortonworks.com/blog/100x-faster-hive/
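As a hedged illustration of how those three phases surface in a Hive session (the table name is hypothetical; the two settings are the standard Hive switches for the Tez engine and the vectorized engine):
set hive.execution.engine=tez;               -- phase 3: execute on Tez instead of MapReduce
set hive.vectorized.execution.enabled=true;  -- phase 2: process batches of ~1,024 rows at a time
-- phase 1: vectorization requires the table to be stored as ORC
SELECT state, COUNT(*) FROM customers_orc GROUP BY state;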
ORC file structure:
Default stripe size is 250 MB
File footer:
Stripe locations
Stripe row counts and the data types of each column
Statistics for each column (min, max, count and sum)
Postscript:
Compression parameters
Stripe index:
Min and max values for each column
Row index for positions in the file (by default every 10,000 rows)
Stripe footer:
Column stream locations (such as row data, nullable and dictionaries)
Column encodings
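These structural parameters can be tuned per table. A sketch, assuming a hypothetical table; orc.compress, orc.stripe.size and orc.row.index.stride are the standard ORC table properties corresponding to the pieces described above:
CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',         -- postscript: compression parameters
  'orc.stripe.size' = '268435456',   -- stripe size in bytes
  'orc.row.index.stride' = '10000'   -- row index entry every 10,000 rows
);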
YARN (Yet Another Resource Negotiator).
In Hadoop 1, MapReduce is both the data processor and the cluster resource manager, centrally managed via the JobTracker on the head node, which is inefficient.
YARN splits the JobTracker into a ResourceManager (on the head node) and per-node NodeManagers. Each node can communicate with other nodes and share status.
Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session
- Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data.
- Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules.
- YARN manages resources in a Hadoop cluster, based on cluster capacity and load.
In short: Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf.
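To make that concrete, a hedged sketch (table names hypothetical): hive.prewarm.* are the Hive settings for keeping hot Tez containers, and the join-plus-aggregation below compiles into a single Tez DAG instead of a chain of MapReduce jobs with intermediate HDFS writes:
set hive.execution.engine=tez;
set hive.prewarm.enabled=true;       -- start hot containers when the session opens
set hive.prewarm.numcontainers=10;   -- number of containers to keep ready
-- one multi-stage query, one Tez DAG
SELECT c.state, SUM(o.amount)
FROM orders o JOIN customers c ON o.customer_id = c.id
GROUP BY c.state;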
Stinger.Next started in the second half of 2014. Phase 1 has been delivered, so ACID transactions are now possible via a delta-file mechanism.
The next two phases are scheduled for release in 2015 (first and second half).
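As a minimal sketch of the phase 1 ACID capability (hypothetical table; the requirements shown – bucketed ORC storage, the transactional flag and the DbTxnManager – are the standard Hive 0.14 ACID prerequisites):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.txn.manager.DbTxnManager;
CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
-- updates and deletes are written as delta files and merged on read / during compaction
UPDATE accounts SET balance = balance + 100 WHERE id = 42;
DELETE FROM accounts WHERE id = 7;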