This document provides an overview of the Stinger initiative to improve the performance of interactive Hive queries. The Stinger project optimized Hive so that queries return results in seconds instead of minutes or hours, through features such as Hive on Tez, vectorized processing, predicate pushdown, the ORC file format, and a cost-based optimizer. Together these optimizations improved Hive performance by more than 100x, enabling interactive use of Hive on large datasets for the first time.
Real Time Interactive Queries in Hadoop: Big Data Warehousing Meetup (Caserta)
During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. Also, Hortonworks provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
Big Data Warehousing: Pig vs. Hive Comparison (Caserta)
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we made a Hive vs. Pig comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements the Stinger initiative brought to Hive.
Hadoop Infrastructure @ Uber: Past, Present and Future (DataWorks Summit)
Uber’s mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop plays a critical role in the data infrastructure. We want to talk about the journey of Hadoop at Uber and our future plans for scaling to billions of trips. We will talk about Uber's most unique use cases and how the Hadoop ecosystem we built helped us on this journey. We will discuss how we scaled from 10 to 2,000 nodes, and how we plan to scale to tens of thousands of nodes in the future. We will talk about our mistakes, lessons, and wins, and how we process billions of events per day. We will cover the unique challenges and real-world use cases, and how we will co-locate Uber's service architecture with batch workloads (e.g., data pipelines, machine learning, and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has uniquely solved some problems in ways they have never been solved before. This presentation will help the audience use Uber as an example and encourage them to enhance the ecosystem, growing the community around these projects and the big data space overall. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes. This talk will help them understand the Hadoop ecosystem and how to use it efficiently, and will introduce some of the technologies the Uber team is building in the big data space.
The Fundamentals Guide to HDP and HDInsight (Gert Drapers)
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve its data problems, from ETL and data warehousing to event-processing pipelines. As Hadoop consists of many components, services and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
2015 nov 27_thug_paytm_rt_ingest_brief_final (Adam Muise)
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system built on Sqoop to streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Introduction to HDInsight Hadoop on Windows Azure services, including using the interactive console with JavaScript and running WordCount via other methods (Streaming, Hive, etc.)
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Introduction to Kudu - StampedeCon 2016 (StampedeCon)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) (Sudhir Mallem)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez), using the storage formats Parquet, ORC, RCFile and Avro, and the compression codecs Snappy, zlib and default compression (gzip).
There are any number of tricks and traps around getting the query optimizer to provide you with an optimal execution plan that gets you your data quickly and efficiently. But, at the end of the day, the principal driving factor of the optimizer, and therefore of your queries, is the statistics that define your data. This session teaches you how those statistics are put together and maintained by SQL Server. Different types of maintenance result in different levels of accuracy within statistics, so we detail what the structures and information look like after this maintenance. Understanding how the optimizer works with statistics will better enable you to understand why you're getting the performance and types of execution plans that you are getting. That understanding enables you to write better T-SQL statements and deal with performance problems such as bad parameter sniffing.
A great PowerPoint presentation on DBMS concepts from start to finish, with good examples chapter by chapter. Please go through each chapter sequentially for your knowledge.
Easy-going study material for a better understanding of Database Management System concepts.
Copper: A high performance workflow engine (dmoebius)
COPPER (COmmon Persistable Process Execution Runtime) is an open-source high performance workflow engine that persists the state of workflow instances (processes) into a database. So there is no limit to the runtime of a process: it can run for weeks, months or years. In addition, this strategy leads to crash safety.
A workflow can describe business processes, for example, but any kind of use case is supported. The "modelling" language is Java, which has several advantages:
* with COPPER any Java developer is able to design workflows
* all Java developers like to use Java
* many Java libs can be integrated within COPPER
* many Java tools, like IDEs, can be used
* COPPER increases your productivity when using a workflow engine
* using Java solutions will protect your investment
* COPPER is open source under the Apache License 2.0
Please visit copper-engine.org for details.
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop... (DataKitchen)
The main objective of this workshop is to give the audience hands-on experience with several Hadoop technologies and jump-start their Hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping into the technology, the founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when.
NOTE: To complete the hands-on portion in the time allotted, attendees should come with a newly created AWS (Amazon Web Services) account and complete the other prerequisites found on the DataKitchen blog.
Hopsworks in the cloud: Berlin Buzzwords 2019 (Jim Dowling)
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
Session 6 - Application domains in the search for advanced statistical technologies... (Jürgen Ambrosi)
In this session we will see, with the usual hands-on demo approach, how to use the R language to perform value-added analyses.
We will get a first-hand look at the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
In this session we will be joined by Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
Apache Tez: Accelerating Hadoop Query Processing (Teddy Choi)
Jeff Markham, Technical Director for Hortonworks Asia, introduces Tez. Tez is software that replaces MapReduce to accelerate Hadoop query processing. He explains why Tez was created, how it is structured, how its optimizations work, and how much its performance has improved.
SF Big Analytics meetup: Hoodie from Uber (Chester Chen)
Even after a decade, the name "Hadoop" remains synonymous with "big data", even as new options for processing/querying (stream processing, in-memory analytics, interactive SQL) and storage services (S3/Google Cloud/Azure) have emerged and unlocked new possibilities. However, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and strain on usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we can provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process. We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real-world problems from Uber. We will then introduce "Hoodie", an open source Spark library built at Uber, to enable faster data for petabyte-scale data analytics and solve these problems. We will deep-dive into the design and implementation of the system and discuss the core concepts around timeline consistency and the trade-offs between ingest speed and query performance. We contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and share the technical direction ahead for the project.
Speaker: VINOTH CHANDAR, Staff Software Engineer at Uber
Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing and querying systems at Uber, including "Hoodie". He has a keen interest in unified architectures for data analytics and processing.
Previously, Vinoth was the lead on LinkedIn's Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.
xPatterns is a big data analytics platform-as-a-service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation; tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs; and dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse consisting of several hundred TB of medical, pharmacy and lab data comprising tens of billions of records. We will showcase the xPatterns components, in the form of APIs and tools, employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS stack (Spark, Shark, Mesos and Tachyon), with lessons learned and demos.
Overview of Stinger: Interactive Query for Hive
1. Overview of Stinger: Interactive Query for Hive
David Kaiser
OC Big Data Meetup #1, May 21, 2014
@ddkaiser | linkedin.com/in/dkaiser | slideshare.net/ddkaiser | dkaiser@cdk.com | dkaiser@hortonworks.com
2. Who Am I?
David Kaiser
20+ years experience with Linux
3 years experience with Hadoop
Career experiences:
• Data Warehousing
• Geospatial Analytics
• Open-source Solutions and Architecture
Employed at Hortonworks as a Senior Solutions Engineer
@ddkaiser | linkedin.com/in/dkaiser | slideshare.net/ddkaiser | dkaiser@cdk.com | dkaiser@hortonworks.com
3. Overview of Stinger: Interactive Query for Hive
• Abstract:
– Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been many new approaches for increased application performance.
– Hive is the most used SQL implementation on Hadoop.
– Hive provides the greatest degree of SQL compatibility on Hadoop.
– But… Hive is slow.
– Hive WAS slow.
– This talk will discuss the Stinger initiative, which improved Hive performance over 100x.
4. Stinger Project (announced February 2013)
Batch AND Interactive SQL-in-Hadoop
Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Hive 0.13, April 2014:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost-Based Optimizer (Optiq)
• Vectorized Processing
Goals:
• Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale: The only SQL processing in Hadoop designed for queries that scale from TB to PB
• SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
An Open Community at its finest: Apache Hive Contribution
• 1,672 Jira tickets closed
• 145 developers
• 44 companies
• ~400,000 lines of code added…
• 13 months
5. Outcomes from the Stinger Project
Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce. | Latency
Vectorized Query | Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). | Latency
ORC File | Columnar, type-aware format with indices. | Latency
Cost-Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms, etc. | Latency
Hive as a Service | Leaves the engine running between sessions. | Latency
Buffer Cache | Keeps the most-used HDFS file blocks in memory. | Latency
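The query planner and cost-based optimizer only help if statistics exist in the Metastore. A minimal HiveQL sketch of gathering them (the table name "sales" and its columns are hypothetical; exact property defaults vary by Hive version):

    -- Enable the cost-based optimizer and statistics-backed planning (Hive 0.13)
    SET hive.cbo.enable=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.column.stats=true;

    -- Gather table-level and column-level statistics for the optimizer
    ANALYZE TABLE sales COMPUTE STATISTICS;
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS state, price;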
6. Hadoop 2: Moving Past MapReduce
HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
Single-use system: batch apps
HADOOP 2.0:
• HDFS2 (redundant, highly-available & reliable storage)
• YARN (cluster resource management)
• MapReduce (data processing) and others
Multi-purpose platform: batch, interactive, online, streaming, …
7. Apache Tez as the new Primitive
MapReduce as base (HADOOP 1.0):
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (SQL), Others (Cascading)
Apache Tez as base (HADOOP 2.0):
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• Data flow: Pig | SQL: Hive | Others (Cascading) | Batch: MapReduce
• Slider (continuous execution) | Online data processing: HBase, Accumulo | Real-time stream processing: Storm
8. Complete Open Source Stack
• YARN is the logical extension of Apache Hadoop
– Complements HDFS, the data reservoir
• Resource Management for the Enterprise Data Lake
– Shared, secure, multi-tenant Hadoop
Allows for all processing in Open-Source Hadoop:
HDFS2 (redundant, reliable storage) and YARN (cluster resource management), running BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), and OTHER (Search, Weave…)
9. Hive on Tez - Execution
Feature | Description | Benefit
Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
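A rough sketch of the switches behind these features from a Hive session (property names are the commonly documented ones for Hive 0.13 on Tez; defaults and availability vary by version):

    -- Run queries on Tez instead of MapReduce
    SET hive.execution.engine=tez;

    -- Re-use finished containers for more work instead of exiting
    SET tez.am.container.reuse.enabled=true;

    -- Pre-launch a pool of hot containers ready to serve queries
    SET hive.prewarm.enabled=true;
    SET hive.prewarm.numcontainers=10;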
10. ORC File Advantages
Sustained Query Times: Apache Hive 0.12 provides sustained, acceptable query times even at petabyte scale.
Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.
File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500):
• Encoded with Text: 585 GB (original size)
• Encoded with RCFile: 505 GB (14% smaller)
• Encoded with Parquet (as used by Impala): 221 GB (62% smaller)
• Encoded with ORCFile (as used by Hive 12): 131 GB (78% smaller)
• Larger block sizes
• Columnar format arranges columns adjacent within the file for compression & fast access
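Getting the smaller ORC footprint is mostly a storage-clause choice at table creation. A hedged HiveQL sketch (table and column names are hypothetical):

    -- Create an ORC-backed table with zlib compression
    CREATE TABLE sales_orc (
      id BIGINT,
      state STRING,
      price DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB");

    -- Convert an existing text-encoded table (hypothetical sales_text)
    INSERT OVERWRITE TABLE sales_orc SELECT * FROM sales_text;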
11. ORCFile File Format
Query-Optimized: split-able, columnar storage file
Efficient Reads: break into large "stripes" of data for efficient read
Fast Filtering: built-in index, min/max, metadata for fast filtering of blocks; bloom filters if desired
Efficient Compression: decompose complex row types into primitives: massive compression and efficient comparisons for filtering
Precomputation: built-in aggregates per block (min, max, count, sum, etc.)
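The stripe indexes and min/max metadata only filter anything when predicate pushdown is enabled in the reader. A sketch (the bloom filter table property arrived in Hive releases after the 0.13 era covered here, so that part is indicative only):

    -- Let the ORC reader skip stripes/row groups using built-in min/max indexes
    SET hive.optimize.index.filter=true;

    -- Optional bloom filters on selected columns (later Hive releases; column list is hypothetical)
    -- CREATE TABLE ... STORED AS ORC
    --   TBLPROPERTIES ("orc.bloom.filter.columns"="state");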
12. A Journey to SQL Compliance
Evolution of SQL Compliance in Hive
SQL Datatypes | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE | GROUP BY, ORDER BY, HAVING
BOOLEAN | JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION | Inner, outer, cross and semi joins
STRING | Sub-queries in the FROM clause
BINARY | ROLLUP and CUBE
TIMESTAMP | UNION
DECIMAL | Standard aggregations (sum, avg, etc.)
DATE | Custom Java UDFs
VARCHAR | Windowing functions (OVER, RANK, etc.)
CHAR | Advanced UDFs (ngram, XPath, URL)
Interval Types | Sub-queries for IN/NOT IN, HAVING
| JOINs in WHERE clause
| INSERT/UPDATE/DELETE
Legend: items on the original slide are color-coded by the release that introduced them (Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap)
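As one example of the newer semantics, a windowing function from the Hive 0.11 work; a sketch against a hypothetical sales table:

    -- Rank items by price within each state using OVER/RANK
    SELECT state, itemId, price,
           RANK() OVER (PARTITION BY state ORDER BY price DESC) AS price_rank
    FROM sales;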
13. Tez – Execution Performance
• Performance gains over MapReduce:
– Eliminate the replicated write barrier between successive computations.
– Eliminate the job-launch overhead of workflow jobs.
– Eliminate the extra stage of map reads in every workflow job.
– Eliminate the queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
(Diagram: Pig/Hive on MR vs. Pig/Hive on Tez)
14. Hive-on-MR vs. Hive-on-Tez
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Hive – MR: the query runs as a chain of MapReduce jobs (SELECT b.id; JOIN (a, b); SELECT a.state, c.price; JOIN (a, c); then GROUP BY a.state with COUNT(*) and AVERAGE(c.price)), and each map/reduce stage writes its intermediate output to HDFS.
Hive – Tez: the same query runs as a single DAG (SELECT b.id; JOIN (a, b); SELECT a.state, c.itemId; JOIN (a, c); GROUP BY a.state with COUNT(*) and AVERAGE(c.price)). Tez avoids unneeded writes to HDFS.
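To compare the two plans on your own cluster, the same query can be explained under each engine; a sketch (written with AVG, the actual HiveQL spelling of the slide's AVERAGE; tables a, b, c are the slide's hypothetical tables):

    -- Plan under MapReduce
    SET hive.execution.engine=mr;
    EXPLAIN SELECT a.state, COUNT(*), AVG(c.price)
    FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state;

    -- Plan under Tez: fewer stages, no intermediate HDFS writes
    SET hive.execution.engine=tez;
    EXPLAIN SELECT a.state, COUNT(*), AVG(c.price)
    FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state;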
15. Vectorization
• Rewrite all operations to operate on blocks of 1K+ records, rather than one record at a time
• Block is an array of Java scalars, not Objects (eliminating Objects compounds GC gains over time)
• Avoids many function calls and CPU pipeline stalls
• Sized to fit in L1 cache, avoiding cache misses
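Turning vectorization on is a single switch, provided the input is ORC in Hive 0.13; a minimal sketch:

    -- Process ~1K-row batches instead of row-at-a-time
    SET hive.vectorized.execution.enabled=true;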
16. Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
Container Pre-Launch | Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries.
Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning.
Tez Integration | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput.
In-Memory Cache | Hot data kept in RAM for fast access.
22. Your Fastest On-ramp to Enterprise Hadoop™!
http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop