Apache Tez : Accelerating
Hadoop Query Processing
Arun C. Murthy (@acmurthy), Founder & Architect, Hortonworks
Bikas Saha (@bikassaha), Hortonworks
(@hortonworks)
© Hortonworks Inc. 2013
Hello!
• Founder/Architect at Hortonworks Inc.
–Lead - Map-Reduce/YARN/Tez
–Formerly, Architect Hadoop MapReduce, Yahoo
–Responsible for running Hadoop MapReduce as a service for all
of Yahoo (~50k nodes footprint)
• Apache Hadoop, ASF
–Former VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
–Long-term Committer/PMC member (full time for 7 years)
–Release Manager for hadoop-2.x
Once upon a time …
… long, long ago, there was a kingdom we shall call
Apache Hadoop
http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
Hadoop begat …
… a two-headed monster on every node in the kingdom;
each belonged to a different clan and answered to a
different master
http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
Knights of Bytes - HDFS
… stored data uncompromisingly in directories/files, nary a
care about contents
http://whoiscraigmoser.com/Images/identity/knight.png
Prince of Processing - MapReduce
He ruled with an iron fist by mapping,
and then by mercilessly reducing data
http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
Peace Reigned
… for a while with the odd change in the direction of the wind
http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
Slowly, but surely …
Human beings define reality through misery and suffering.
- Agent Smith
http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
Slowly, but surely …
… people of the kingdom clamored for more.
A palpable sense of greed & expectation.
http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
Signs of Distress
SQL said some, others said Machine Learning,
still others said Real-Time Event Processing
http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
A Meeting at the Summit
MapReduce is dead!
Err… not quite.
We need more options! We need more!
True…
http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
A Meeting at the Summit
A common thread YARN running through all applications…
Long live the King!
http://whipup.net/wp-content/images/2008/08/yarn.gif
The Edict
Henceforth, in the Kingdom of King YARN…
MapReduce has been relegated to the status
of, merely, one of the applications!
http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
Reign of King YARN
King YARN came to the throne
with promises to return power
to all applications
equally, lower performance
taxes and resource
management…
http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
Oh the Shame!
Well, at least, Prince MapReduce still had
powerful allies like Highness Hive,
Powerful Pig, Cheery Cascading…
http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
Things get worse before better
Unfortunately, things got a lot worse for Prince MapReduce…
http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
Knight Tez
He did MapReduce, and so much more…
Smartly aligned himself to Kingdom YARN.
http://twomorrows.com/alterego/media/08shiningknight.gif
Knight Tez
Long-term alliances of MapReduce with
Hive, Pig, Cascading etc. broke up…
http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
… they decided to throw in their
lot with Knight Tez!
http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
Happily ever after…
(nothing cute to say)
On a more serious note…
Every season has a flavor…
SQL-on-Hadoop is the new black!
SQL-on-Hadoop will be solved within
the existing ecosystem
Looking ahead
What will it be next year?
Real-time event processing?
Machine Learning?
Play to our strengths
Invest in the Apache Hadoop platform
and the ecosystem (Hive et al).
Seriously…
Technical Details
Tez – Introduction
• Distributed execution
framework targeted towards
data-processing applications.
• Based on expressing a
computation as a dataflow
graph.
• Built on top of YARN – the
resource management
framework for Hadoop.
• Open source Apache incubator
project and Apache licensed.
Tez – Design Themes
• Empowering End Users
• Execution Performance
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
Tez – Empowering End Users
• Expressive dataflow definition APIs
–Enable definition of complex data flow pipelines using simple
graph connection APIs. Tez expands the logical plan at runtime.
–Targeted towards data processing applications like Hive/Pig but
not limited to them. Hive/Pig query plans naturally map to Tez dataflow
graphs with no translation impedance.
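The idea of declaring a logical graph and expanding it into tasks at runtime can be sketched in a few lines. This is a minimal illustration only, not the Tez API: the `Dag` class, its method names and the edges wired below are all assumptions made for the example. Only the vertex names (TaskA to TaskE, two tasks each) come from the slide.

```python
from collections import defaultdict


class Dag:
    """Hypothetical logical dataflow graph (not the Tez API)."""

    def __init__(self):
        self.parallelism = {}           # vertex name -> number of tasks
        self.edges = defaultdict(list)  # producer -> list of consumers

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def connect(self, producer, consumer):
        # Reject edges that mention a vertex that was never declared.
        for name in (producer, consumer):
            if name not in self.parallelism:
                raise ValueError(f"unknown vertex: {name}")
        self.edges[producer].append(consumer)
        return self

    def expand(self):
        """Expand each logical vertex into its named physical tasks."""
        return {v: [f"{v}-{i}" for i in range(1, n + 1)]
                for v, n in self.parallelism.items()}

    def topological_order(self):
        """Return vertices so that every producer precedes its consumers."""
        indegree = {v: 0 for v in self.parallelism}
        for consumers in self.edges.values():
            for c in consumers:
                indegree[c] += 1
        ready = sorted(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            v = ready.pop(0)
            order.append(v)
            for c in self.edges[v]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
        return order


# The wiring below is illustrative; the slide does not spell out the edges.
dag = Dag()
for name in ("TaskA", "TaskB", "TaskC", "TaskD", "TaskE"):
    dag.add_vertex(name, parallelism=2)
dag.connect("TaskA", "TaskD").connect("TaskB", "TaskD")
dag.connect("TaskC", "TaskE").connect("TaskD", "TaskE")

tasks = dag.expand()
order = dag.topological_order()
```

The user only states vertices and connections; the per-task names such as `TaskA-1` appear when the plan is expanded.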
[Figure: example dataflow graph with two rows of tasks: TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2 and TaskD-1, TaskD-2, TaskE-1, TaskE-2]
Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort, with a Preprocessor Stage, a Partition Stage and an Aggregate Stage (each shown as Task-1, Task-2), plus a Sampler, Samples and Ranges]
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
–Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
–End goal is to have a library of inputs, outputs and processors that
can be programmatically composed to generate useful operators.
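A rough sketch of the composition idea, assuming each input, processor and output can be modelled as a plain function. The names `compose_task`, `list_input`, `sum_by_key` and `sorted_output` are hypothetical stand-ins, not Tez classes. They loosely mirror the ShuffleInput, ReduceProcessor and FileSortedOutput roles from the slide.

```python
def compose_task(read_input, process, write_output):
    """Wire an input, a processor and an output into one runnable task."""
    def run():
        return write_output(process(read_input()))
    return run


def list_input(records):
    """Hypothetical input: replays an in-memory list of (key, value) pairs."""
    return lambda: iter(records)


def sum_by_key(records):
    """Hypothetical reduce-style processor: sums the values for each key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals.items()


def sorted_output(records):
    """Hypothetical output: emits records sorted by key."""
    return sorted(records)


def plain_output(records):
    """Hypothetical output: emits records in arrival order."""
    return list(records)


data = [("b", 1), ("a", 2), ("b", 3)]

# Same input and processor, two different outputs: only the wiring changes.
intermediate_reduce = compose_task(list_input(data), sum_by_key, sorted_output)
final_reduce = compose_task(list_input(data), sum_by_key, plain_output)

intermediate_result = intermediate_reduce()
final_result = final_reduce()
```

Swapping one piece (here the output) yields a different operator without touching the other two, which is the point of keeping a library of composable parts.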
[Figure: example compositions. IntermediateReduce: ShuffleInput, ReduceProcessor, FileSortedOutput. FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput. PairwiseJoin: Input1, Input2, JoinProcessor, FileSortedOutput.]
Tez – Empowering End Users
• Data type agnostic
–Tez is concerned only with the movement of data: files and
streams of bytes.
–Does not impose any data format on the user application. An MR
application can use Key-Value pairs on top of Tez. Hive and Pig
can use tuple-oriented formats that are natural and native to them.
[Figure: a Tez Task with Bytes in and Bytes out over a File or Stream; User Code works with Key Value pairs or Tuples]
Tez – Empowering End Users
• Simplifying deployment
–Tez is a completely client-side application.
–No deployments to do. Simply upload the Tez libraries to any accessible
FileSystem and point the local Tez configuration at that location.
–Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
–Leverages YARN local resources and distributed cache.
[Figure: two Client Machines, each with a TezClient, alongside Tez Lib 1 and Tez Lib 2 stored on HDFS; TezTasks run under Node Managers]
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power APIs come great responsibilities
Tez – Execution Performance
• Performance gains over MapReduce
• Plan reconfiguration at runtime
• Optimal resource management
• Dynamic physical data flow decisions
Tez – Execution Performance
• Performance gains over MapReduce
–Eliminate replicated write barrier between successive
computations.
–Eliminate job launch overhead of workflow jobs.
–Eliminate extra stage of map reads in every workflow job.
–Eliminate queue and resource contention suffered by workflow
jobs that are started after a predecessor job completes.
[Figure: Pig/Hive on MR compared with Pig/Hive on Tez]
Tez – Execution Performance
• Plan reconfiguration at runtime
–Dynamic runtime concurrency control based on data size, user
operator resources, available cluster resources and locality.
–Advanced changes in dataflow graph structure.
–Progressive graph construction in concert with user optimizer.
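One way to picture the data-size part of this decision is the sketch below. It is an illustration under stated assumptions, not how Tez implements it: the function names and the 1 GiB-per-reducer target are hypothetical, and operator resources, cluster capacity and locality are not modelled. The numbers follow the slide's example of 100 planned reducers and only 10 GB of data.

```python
GIB = 1024 ** 3


def choose_parallelism(planned_tasks, input_bytes, target_bytes_per_task):
    """Shrink the planned task count when the observed data is small.

    Hypothetical policy: one task per target_bytes_per_task of input,
    never more than planned and never fewer than one.
    """
    needed = (input_bytes + target_bytes_per_task - 1) // target_bytes_per_task
    return max(1, min(planned_tasks, needed))


def assign_partitions(num_partitions, num_tasks):
    """Spread the already-written partitions over the reduced task count."""
    return [list(range(t, num_partitions, num_tasks)) for t in range(num_tasks)]


# Stage 1 wrote 100 partitions; only 10 GiB of data was actually produced.
reducers = choose_parallelism(planned_tasks=100,
                              input_bytes=10 * GIB,
                              target_bytes_per_task=1 * GIB)
assignment = assign_partitions(num_partitions=100, num_tasks=reducers)
```

Because the upstream stage has already partitioned its output 100 ways, the smaller set of reducers each reads several partitions instead of one.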
[Figure: plan reconfiguration using HDFS Blocks and YARN Resources. Stage 1 has 50 maps and 100 partitions; Stage 2 goes from 100 reducers to 10 reducers because there is only 10 GB of data]
Tez – Execution Performance
• Optimal resource management
–Reuse YARN containers to launch new tasks.
–Reuse YARN containers to enable shared objects across tasks.
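The benefit of reusing a container can be sketched with a long-lived task host that keeps a small object cache between tasks. Everything here is a hypothetical model for illustration: `TaskHost`, `get_or_create` and the lookup-table example are assumptions, not Tez or YARN classes.

```python
class TaskHost:
    """Hypothetical stand-in for a reused container running tasks in turn."""

    def __init__(self):
        self.shared_objects = {}  # survives from one task to the next
        self.tasks_run = 0

    def get_or_create(self, key, factory):
        """Build an expensive object once, then hand it to later tasks."""
        if key not in self.shared_objects:
            self.shared_objects[key] = factory()
        return self.shared_objects[key]

    def run(self, task):
        self.tasks_run += 1
        return task(self)


load_count = 0


def load_lookup_table():
    """Pretend this is costly, e.g. reading a small table from storage."""
    global load_count
    load_count += 1
    return {"us": "United States", "in": "India"}


def make_task(code):
    def task(host):
        table = host.get_or_create("lookup", load_lookup_table)
        return table[code]
    return task


host = TaskHost()
results = [host.run(make_task(code)) for code in ("us", "in")]
```

Both tasks ran in the same host, so the table was loaded once; with a fresh container per task it would have been loaded twice.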
[Figure: the Tez Application Master in one YARN Container exchanges Start Task and Task Done messages with a TezTask Host in another YARN Container, which runs TezTask1 and TezTask2 with SharedObjects]
Tez – Execution Performance
• Dynamic physical data flow decisions
–Decide the type of physical byte movement and storage on the fly.
–Store intermediate data on distributed store, local store or in-memory.
–Transfer bytes via blocking files or streaming and the spectrum in
between.
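A minimal sketch of such an on-the-fly decision, assuming the choice depends only on the producer's output size and on whether the data must outlive the node. The `Store` enum, `choose_store` and the 64 MiB budget are hypothetical; the slide does not say which signals or thresholds Tez uses.

```python
from enum import Enum

MIB = 1024 ** 2


class Store(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED = "distributed store"


def choose_store(output_bytes, memory_budget_bytes, must_survive_node_loss=False):
    """Pick where a producer's intermediate output should live.

    Hypothetical policy: durable data goes to the distributed store,
    small outputs stay in memory, everything else goes to a local file.
    """
    if must_survive_node_loss:
        return Store.DISTRIBUTED
    if output_bytes <= memory_budget_bytes:
        return Store.IN_MEMORY
    return Store.LOCAL_FILE


small = choose_store(output_bytes=8 * MIB, memory_budget_bytes=64 * MIB)
large = choose_store(output_bytes=512 * MIB, memory_budget_bytes=64 * MIB)
durable = choose_store(output_bytes=8 * MIB, memory_budget_bytes=64 * MIB,
                       must_survive_node_loss=True)
```

The decision is made per producer once its output size is known, so the same logical edge can be served from memory in one run and from a local file in another.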
[Figure: decided at runtime. A small-size Producer passes data to its Consumer In-Memory; another Producer passes data to its Consumer through a Local File]
Tez – Current status
• Apache Incubator Project
–Rapid development. Over 270 JIRAs opened. Over 170 resolved.
–Growing community.
• Focus on stability
–Testing and quality are highest priority.
–Code ready and deployed on multi-node clusters.
• DAG of MR processing is working
–Already functionally equivalent to MapReduce. Existing MapReduce
jobs can be executed on Tez with few or no changes.
–Working Hive prototype that can target Tez for execution of
queries.
–Work started on prototype of Pig that can target Tez.
Tez – Current status
[Figure: typical pattern in a TPC-DS query, where the Fact Table is joined in turn with Dimension Table 1, 2 and 3 through three Joins, producing Result Table 1, 2 and 3; and an optimization for small data sets, shown as the Fact Table with the dimension tables]
Both can now run as a single Tez job
Tez – Looking ahead
• Early adopters and contributors welcome
–Adopters to drive more scenarios. Contributors to make them
happen.
• Stay tuned for Tez meetups with deep dives on Tez
architecture and using Tez
• Useful links
–Work tracking: https://issues.apache.org/jira/browse/TEZ
–Code: https://github.com/apache/incubator-tez
–High level design document and API specification:
https://issues.apache.org/jira/browse/TEZ-65
–Developer list: dev@tez.incubator.apache.org
–User list: user@tez.incubator.apache.org
–Issues list: issues@tez.incubator.apache.org
Tez – Takeaways
• Distributed execution framework that works on
computations represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Execution architecture designed to enable dynamic
performance optimizations at runtime
• Open source Apache project – your use-cases and
code are welcome
• It works and is already being used by Hive
Tez
Thanks for your time and attention!
Questions?
Page 44

More Related Content

What's hot

Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)NAVER D2
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 

What's hot (20)

Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 

Similar to Apache Tez: Accelerating Hadoop Query Processing

February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureRajesh Balamohan
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and futureCodemotion
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
 

Similar to Apache Tez: Accelerating Hadoop Query Processing (20)

February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and future
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Apache Tez: Accelerating Hadoop Query Processing

  • 1. Apache Tez: Accelerating Hadoop Query Processing
      Arun C. Murthy (@acmurthy), Founder & Architect; Bikas Saha (@bikassaha); Hortonworks (@hortonworks)
  • 2. Hello!
      • Founder/Architect at Hortonworks Inc.
        – Lead: MapReduce/YARN/Tez
        – Formerly Architect, Hadoop MapReduce, Yahoo
        – Responsible for running Hadoop MapReduce as a service for all of Yahoo (~50k-node footprint)
      • Apache Hadoop, ASF
        – Former VP, Apache Hadoop, ASF (Chair of the Apache Hadoop PMC)
        – Long-term committer/PMC member (full time for 7 years)
        – Release Manager for hadoop-2.x
      © Hortonworks Inc. 2013
  • 3. Once upon a time …
      … long, long ago, there was a kingdom we shall call Apache Hadoop
      http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
  • 4. Hadoop begat …
      … a two-headed monster on every node in the kingdom; each belonged to a different clan and answered to a different master
      http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
  • 5. Knights of Bytes - HDFS
      … stored data uncompromisingly in directories/files, nary a care about contents
      http://whoiscraigmoser.com/Images/identity/knight.png
  • 6. Prince of Processing - MapReduce
      He ruled with an iron fist by mapping, and then by mercilessly reducing data
      http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
  • 7. Peace Reigned
      … for a while with the odd change in the direction of the wind
      http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
  • 8. Slowly, but surely …
      Human beings define reality through misery and suffering. - Agent Smith
      http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
  • 9. Slowly, but surely …
      Human beings define reality through misery and suffering. - Agent Smith
      http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
  • 10. Slowly, but surely …
      … people of the kingdom clamored for more. A palpable sense of greed & expectation.
      http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
  • 11. Signs of Distress
      SQL said some, others said Machine Learning, still others said Real-Time Event Processing
      http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
  • 12. A Meeting at the Summit
      MapReduce is dead! Err… not quite. We need more options! We need more! True…
      http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
  • 13. A Meeting at the Summit
      A common thread YARN running through all applications… Long live the King!
      http://whipup.net/wp-content/images/2008/08/yarn.gif
  • 14. The Edict
      Henceforth, in the Kingdom of King YARN… MapReduce has been relegated to the status of, merely, one of the applications!
      http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
  • 15. Reign of King YARN
      King YARN came to the throne with promises to return power to all applications equally, lower performance taxes and resource management…
      http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
  • 16. Oh the Shame!
      Well, at least, Prince MapReduce still had powerful allies like Highness Hive, Powerful Pig, Cheery Cascading…
      http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
  • 17. Things get worse before better
      Unfortunately, things got a lot worse for Prince MapReduce…
      http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
  • 18. Knight Tez
      He did MapReduce, and so much more… Smartly aligned himself to Kingdom YARN.
      http://twomorrows.com/alterego/media/08shiningknight.gif
  • 19. Knight Tez
      Long-term alliances of MapReduce with Hive, Pig, Cascading etc. broke up…
      http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
      … they decided to throw in their lot with Knight Tez!
      http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
  • 20. Happily ever after…
      (nothing cute to say)
  • 21. On a more serious note…
  • 22. Every season has a flavor…
      SQL-on-Hadoop is the new black!
      SQL-on-Hadoop will be solved within the existing ecosystem
  • 23. Looking ahead
      What will it be next year? Real-time event processing? Machine Learning?
  • 24. Play to our strengths
      Invest in the Apache Hadoop platform and the ecosystem (Hive et al).
  • 25. Seriously… Technical Details
  • 26. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Built on top of YARN – the resource management framework for Hadoop.
      • Open source Apache Incubator project, Apache licensed.
  • 27. Tez – Design Themes
      • Empowering End Users
      • Execution Performance
  • 28. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying deployment
  • 29. Tez – Empowering End Users
      • Expressive dataflow definition APIs
        – Enable definition of complex dataflow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
        – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
      [Diagram: a dataflow graph of tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2]
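To make the idea of a logical plan that is expanded into tasks at runtime concrete, here is a minimal Python sketch. It is not the Tez API: `LogicalDag`, `add_vertex`, `connect` and `expand` are hypothetical names, and the edges between vertices are an assumption, since the slide text does not say how the vertices are connected. Only the `TaskA-1` style task names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalDag:
    """Hypothetical logical plan: named vertices plus directed edges."""
    parallelism: dict = field(default_factory=dict)  # vertex name -> task count
    edges: list = field(default_factory=list)        # (producer, consumer) pairs

    def add_vertex(self, name, tasks):
        self.parallelism[name] = tasks
        return self

    def connect(self, producer, consumer):
        if producer not in self.parallelism or consumer not in self.parallelism:
            raise KeyError("connect() needs two known vertices")
        self.edges.append((producer, consumer))
        return self

    def expand(self):
        """Expand every logical vertex into its physical tasks."""
        return {v: [f"Task{v}-{i}" for i in range(1, n + 1)]
                for v, n in self.parallelism.items()}

    def topological_order(self):
        """Kahn's algorithm: every producer is listed before its consumers."""
        indegree = {v: 0 for v in self.parallelism}
        for _, consumer in self.edges:
            indegree[consumer] += 1
        ready = sorted(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            vertex = ready.pop(0)
            order.append(vertex)
            for producer, consumer in self.edges:
                if producer == vertex:
                    indegree[consumer] -= 1
                    if indegree[consumer] == 0:
                        ready.append(consumer)
        if len(order) != len(self.parallelism):
            raise ValueError("graph has a cycle")
        return order


# Assumed topology for illustration: A and B both feed C.
dag = (LogicalDag()
       .add_vertex("A", 2).add_vertex("B", 2).add_vertex("C", 2)
       .connect("A", "C").connect("B", "C"))
tasks = dag.expand()
print(tasks["A"])               # ['TaskA-1', 'TaskA-2']
print(dag.topological_order())  # ['A', 'B', 'C']
```

The user only describes vertices and connections; the per-vertex task lists are derived afterwards, which is the separation between logical plan and physical tasks that the slide describes.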
  • 30. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      [Diagram: Distributed Sort. Preprocessor Stage, Partition Stage and Aggregate Stage, each with Task-1 and Task-2, plus a Sampler; edge labels: Samples, Ranges]
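The distributed sort on this slide can be read as sample, derive key ranges, partition by range, then sort each partition. The Python sketch below shows that general technique on a single machine. The function names, the every-third-record sampling rule and the data are all illustrative assumptions, not details taken from Tez.

```python
import bisect


def choose_split_points(samples, num_partitions):
    """Pick num_partitions - 1 split points from the sorted samples."""
    ordered = sorted(samples)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(records, split_points):
    """Send each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        partitions[bisect.bisect_right(split_points, record)].append(record)
    return partitions


data = [42, 7, 19, 88, 3, 56, 71, 25, 64, 10, 95, 33]

# "Sampler": take every third record (an arbitrary, deterministic choice).
samples = data[::3]
split_points = choose_split_points(samples, num_partitions=3)

# "Partition stage" routes by range; "aggregate stage" sorts each partition.
partitions = range_partition(data, split_points)
sorted_partitions = [sorted(p) for p in partitions]

# Concatenating the sorted partitions in order gives a globally sorted result.
globally_sorted = [r for p in sorted_partitions for r in p]
print(split_points)     # [42, 71]
print(globally_sorted)
```

Because the ranges do not overlap, no merge step is needed after the per-partition sorts, which is why the ranges have to be known before partitioning starts.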
  • 31. Tez – Empowering End Users
      • Flexible Input-Processor-Output runtime model
        – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
        – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful operators.
      [Diagram: three example compositions]
        IntermediateReduce: ShuffleInput, ReduceProcessor, FileSortedOutput
        FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput
        PairwiseJoin: Input1, Input2, JoinProcessor, FileSortedOutput
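As a rough illustration of composing a task from an input, a processor and an output, the following Python sketch wires three plain functions together. The function names only echo the labels on the slide. They are hypothetical stand-ins, not Tez classes, and nothing here reads a real shuffle or writes to HDFS.

```python
def make_task(read_input, process, write_output):
    """Wire one input, one processor and one output into a runnable task."""
    def run():
        return write_output(process(read_input()))
    return run


def shuffle_input(pairs):
    """Input stand-in: hands over (key, value) pairs in key order."""
    return lambda: sorted(pairs)


def reduce_processor(pairs):
    """Processor stand-in: sums the values of each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def file_sorted_output(totals):
    """Output stand-in: key-sorted pairs, in place of a sorted file."""
    return sorted(totals.items())


def hdfs_output(totals):
    """Output stand-in: a plain dict, in place of a final write to HDFS."""
    return dict(totals)


pairs = [("b", 2), ("a", 1), ("b", 3)]

# Same input and processor, different output: two different operators.
intermediate_reduce = make_task(shuffle_input(pairs), reduce_processor, file_sorted_output)
final_reduce = make_task(shuffle_input(pairs), reduce_processor, hdfs_output)

print(intermediate_reduce())  # [('a', 1), ('b', 5)]
print(final_reduce())         # {'a': 1, 'b': 5}
```

Swapping only the output turns an intermediate reduce into a final reduce, which is the kind of reuse a library of inputs, processors and outputs is meant to allow.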
  • 32. Tez – Empowering End Users
      • Data type agnostic
        – Tez is only concerned with the movement of data: files and streams of bytes.
        – Does not impose any data format on the user application. An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
      [Diagram labels: File, Stream, Key Value, Tuples, Tez Task, User Code, Bytes]
  • 33. Tez – Empowering End Users
      • Simplifying deployment
        – Tez is a completely client-side application.
        – No deployments to do. Simply upload to any accessible FileSystem and change the local Tez configuration to point to it.
        – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
        – Leverages YARN local resources and distributed cache.
      [Diagram labels: two Client Machines, each with a TezClient; two Node Managers, each with a TezTask; HDFS holding Tez Lib 1 and Tez Lib 2]
  • 34. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying usage
      With great power APIs come great responsibilities
  • 35. Tez – Execution Performance
      • Performance gains over MapReduce
      • Plan reconfiguration at runtime
      • Optimal resource management
      • Dynamic physical data flow decisions
  • 36. Tez – Execution Performance
      • Performance gains over MapReduce
        – Eliminate replicated write barrier between successive computations.
        – Eliminate job launch overhead of workflow jobs.
        – Eliminate extra stage of map reads in every workflow job.
        – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
      [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]
  • 37. Tez – Execution Performance
      • Plan reconfiguration at runtime
        – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
        – Advanced changes in dataflow graph structure.
        – Progressive graph construction in concert with user optimizer.
      [Diagram: inputs are HDFS Blocks and YARN Resources. As planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data: Stage 2 drops from 100 to 10 reducers]
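The slide's example (100 planned reducers cut to 10 when only 10 GB of data shows up) can be sketched as a small sizing rule. The Python below is an illustration under stated assumptions: the 1 GB-per-reducer target and both function names are hypothetical and are not taken from Tez.

```python
import math

GB = 1024 ** 3


def choose_reducer_count(total_bytes, planned_reducers, bytes_per_reducer):
    """Shrink the planned reducer count when the observed data is small."""
    needed = max(1, math.ceil(total_bytes / bytes_per_reducer))
    return min(planned_reducers, needed)


def assign_partitions(num_partitions, num_reducers):
    """Group consecutive partitions into at most num_reducers groups."""
    per_reducer = math.ceil(num_partitions / num_reducers)
    return [list(range(start, min(start + per_reducer, num_partitions)))
            for start in range(0, num_partitions, per_reducer)]


# Assumption for illustration: aim for roughly 1 GB of input per reducer.
reducers = choose_reducer_count(total_bytes=10 * GB,
                                planned_reducers=100,
                                bytes_per_reducer=1 * GB)
assignment = assign_partitions(num_partitions=100, num_reducers=reducers)

print(reducers)            # 10
print(assignment[0])       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The map side still writes 100 partitions as planned; only the number of consumers changes, with each reducer reading a contiguous group of partitions.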
  • 38. Tez – Execution Performance
      • Optimal resource management
        – Reuse YARN containers to launch new tasks.
        – Reuse YARN containers to enable shared objects across tasks.
      [Diagram labels: Tez Application Master exchanging Start Task, Task Done, Start Task with a TezTask Host in a YARN Container; a YARN Container running TezTask1 and TezTask2 with SharedObjects]
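A minimal Python sketch of the container-reuse idea follows: several tasks run one after another in the same long-lived container and share an object that is loaded only once. `Container`, `get_shared` and the lookup table are hypothetical names chosen for illustration. Real YARN containers are separate processes, which this sketch does not model.

```python
class Container:
    """Stand-in for a reused container that outlives individual tasks."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}   # survives from one task to the next
        self.tasks_run = 0

    def get_shared(self, key, loader):
        """Load an object on first use and hand back the cached copy afterwards."""
        if key not in self.shared_objects:
            self.shared_objects[key] = loader()
        return self.shared_objects[key]

    def run(self, task):
        self.tasks_run += 1
        return task(self)


load_log = []  # records every time the expensive load actually happens


def load_lookup_table():
    load_log.append("loaded")
    return {"us": "United States", "in": "India"}


def make_task(code):
    def task(container):
        table = container.get_shared("lookup", load_lookup_table)
        return table[code]
    return task


container = Container("container_01")
results = [container.run(make_task(code)) for code in ("us", "in", "us")]

print(results)         # ['United States', 'India', 'United States']
print(len(load_log))   # 1
```

Three tasks ran, but the table was built once. Launching a fresh container per task would have paid that cost three times, on top of the container start-up itself.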
  • 39. Tez – Execution Performance
      • Dynamic physical data flow decisions
        – Decide the type of physical byte movement and storage on the fly.
        – Store intermediate data on distributed store, local store or in-memory.
        – Transfer bytes via blocking files or streaming and the spectrum in between.
      [Diagram, decided at runtime: Producer (small size) to In-Memory to Consumer; Producer to Local File to Consumer]
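One way to picture deciding storage on the fly is a small policy function that is evaluated once the producer's output size is known. The Python below is purely a sketch: the policy, the 64 MB memory budget and the `must_survive_node_loss` flag are assumptions made for illustration, and the slides do not describe how Tez actually makes this choice.

```python
MB = 1024 * 1024


def choose_storage(output_bytes, memory_budget_bytes, must_survive_node_loss=False):
    """Pick where a producer's intermediate output should live (hypothetical policy)."""
    if must_survive_node_loss:
        return "distributed-store"   # pay for replication only when asked to
    if output_bytes <= memory_budget_bytes:
        return "in-memory"           # small output: skip the disk entirely
    return "local-file"              # larger output: spill to local disk


# Assumed budget for illustration only.
budget = 64 * MB

print(choose_storage(2 * MB, budget))                                  # in-memory
print(choose_storage(500 * MB, budget))                                # local-file
print(choose_storage(2 * MB, budget, must_survive_node_loss=True))     # distributed-store
```

The point of the sketch is only that the decision is taken per edge and per run, from sizes observed at runtime, instead of being fixed when the plan is written.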
  • 40. Tez – Current status
      • Apache Incubator project
        – Rapid development. Over 270 JIRAs opened. Over 170 resolved.
        – Growing community.
      • Focus on stability
        – Testing and quality are highest priority.
        – Code ready and deployed on multi-node clusters.
      • DAG of MR processing is working
        – Already functionally equivalent to MapReduce. Existing MapReduce jobs can be executed on Tez with few or no changes.
        – Working Hive prototype that can target Tez for execution of queries.
        – Work started on prototype of Pig that can target Tez.
  • 41. Tez – Current status
      [Diagram: Typical pattern in a TPC-DS query. Labels: Fact Table; Dimension Table 1, Dimension Table 2, Dimension Table 3; Result Table 1, Result Table 2, Result Table 3; three Joins]
      [Diagram: Optimization for small data sets. Labels: Fact Table; Dimension Table 1, Dimension Table 1, Dimension Table 1]
      Both can now run as a single Tez job.
  • 42. Tez – Looking ahead
      • Early adopters and contributors welcome
        – Adopters to drive more scenarios. Contributors to make them happen.
      • Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
      • Useful links
        – Work tracking: https://issues.apache.org/jira/browse/TEZ
        – Code: https://github.com/apache/incubator-tez
        – High level design document and API specification: https://issues.apache.org/jira/browse/TEZ-65
        – Developer list: dev@tez.incubator.apache.org
        – User list: user@tez.incubator.apache.org
        – Issues list: issues@tez.incubator.apache.org
  • 43. Tez – Takeaways
      • Distributed execution framework that works on computations represented as dataflow graphs
      • Naturally maps to execution plans produced by query optimizers
      • Execution architecture designed to enable dynamic performance optimizations at runtime
      • Open source Apache project – your use-cases and code are welcome
      • It works and is already being used by Hive
  • 44. Tez
      Thanks for your time and attention! Questions?