© Hortonworks Inc. 2014

Accelerating Hadoop Data Pipelines

FifthElephant.in 2014

gopalv @ apache.org

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.

Hadoop 1 -> Hadoop 2

HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• On top of MapReduce: Pig (data flow), Hive (SQL), Others (Cascading)

HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine), serving Data Flow (Pig), SQL (Hive) and Others (Cascading)
• Also shown on YARN: Batch (MapReduce), Real Time Stream Processing (Storm), Online Data Processing (HBase, Accumulo)

Monolithic
• Resource Management
• Execution Engine
• User API

Layered
• Resource Management – YARN
• Execution Engine – Tez
• User API – Hive, Pig, Cascading, Your App!

Tez – Design considerations

Look to the Future with an eye on the Past

Don't solve problems that have already been solved. Or you will have to solve them again!
• Leverage discrete task based compute model for elasticity, scalability and fault tolerance
• Leverage several man years of work in Hadoop Map-Reduce data shuffling operations
• Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN
• Leverage built-in security mechanisms in Hadoop for privacy and isolation

Tez – Problems that it addresses

• Expressing the computation
  • Direct and elegant representation of the data processing flow
  • Interfacing with application code and new technologies
• Performance
  • Late binding: make decisions as late as possible, using real data available at runtime
  • Leverage the resources of the cluster efficiently
  • Just work out of the box!
  • Customizable engine to let applications tailor the job to meet their specific requirements
• Operation simplicity
  • Painless to operate, experiment and upgrade

Tez – Simplifying Operations

• Tez is a pure YARN application. Easy and safe to try it out!
• No deployments to do, no servers to run
• Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
• Leverages YARN local resources.

[Diagram labels: two Client Machines, each running a TezClient; two Node Managers, each running a TezTask; HDFS holding Tez Lib 1 and Tez Lib 2.]

Tez – Expressing the computation

[Diagram: Distributed Sort. A Preprocessor Stage (Task-1, Task-2) feeds a Sampler; the Samples become Ranges used by the Partition Stage (Task-1, Task-2), which feeds the Aggregate Stage (Task-1, Task-2).]

Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers
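
The vertex-and-edge model above can be sketched in a few lines of Java. This is an illustration only, not the Tez API: the class `DagSketch` and its methods are hypothetical names, and the edge set in `main` is one reading of the distributed sort diagram. The sketch stores each vertex with the consumers it feeds and uses Kahn's algorithm to produce an order in which every producer comes before its consumers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; this is not part of the Tez API.
class DagSketch {
    // Each vertex (a data transformation) maps to the consumers it feeds.
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    // An edge represents data movement from a producer to a consumer.
    DagSketch addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        return this;
    }

    // Kahn's algorithm: returns a producer-before-consumer order,
    // or throws if the graph contains a cycle (so it is not a DAG).
    List<String> topologicalOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : consumers.keySet()) {
            pendingInputs.put(v, 0);
        }
        for (List<String> outs : consumers.values()) {
            for (String c : outs) {
                pendingInputs.merge(c, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pendingInputs.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String c : consumers.get(v)) {
                if (pendingInputs.merge(c, -1, Integer::sum) == 0) {
                    ready.add(c);
                }
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // Assumed edges, read off the distributed sort diagram.
        DagSketch sort = new DagSketch()
            .addEdge("preprocessor", "sampler")
            .addEdge("preprocessor", "partition")
            .addEdge("sampler", "partition")
            .addEdge("partition", "aggregate");
        System.out.println(sort.topologicalOrder());
    }
}
```

A real engine would attach processors to vertices and data movement properties to edges; the sketch keeps only the graph shape.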
  
MR is a 2-vertex subset of Tez

But Tez is so much more

Tez – Expressing the computation

Tez defines the following APIs to define the work
• DAG API
  • Defines the structure of the data processing and the relationship between producers and consumers
  • Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
  • This is how all the tasks in the job get specified
• Runtime API
  • Defines the interface through which the framework and app code interact with each other
  • App code transforms data and moves it between tasks
  • This is how we specify what actually executes in each task on the cluster nodes

Tez – DAG API

Defines the global processing flow

// Define DAG
DAG dag = new DAG();

// Define Vertex
Vertex source = new Vertex(Processor.class);

// Define Edge
Edge edge = new Edge(source, destination,
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);

// Connect them
dag.addVertex(source).addEdge(edge)…

[Diagram: example DAG with vertices map1, map2, reduce1, reduce2 and join1; the edges are labelled Scatter_Gather, Bipartite, Sequential.]

Tez – Logical DAG expansion at Runtime

[Diagram: the logical vertices map1, map2, reduce1, reduce2 and join1, each expanded into parallel tasks at runtime.]
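
To make the expansion step concrete, here is a hedged Java sketch of how one logical edge could turn into task-to-task connections once each vertex has a task count. It is not Tez code: `ExpansionSketch`, `Movement` and `expand` are hypothetical names, and the task counts in `main` are made up. The three movement types correspond to the one-to-one, broadcast and scatter-gather edges discussed in this deck.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the names below are not Tez classes.
class ExpansionSketch {
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the physical task-to-task connections for one logical edge,
    // formatted as "producer[i]->consumer[j]".
    static List<String> expand(String producer, int producerTasks,
                               String consumer, int consumerTasks,
                               Movement movement) {
        List<String> links = new ArrayList<>();
        if (movement == Movement.ONE_TO_ONE) {
            if (producerTasks != consumerTasks) {
                throw new IllegalArgumentException(
                    "one-to-one edges need equal parallelism");
            }
            for (int i = 0; i < producerTasks; i++) {
                links.add(producer + "[" + i + "]->" + consumer + "[" + i + "]");
            }
            return links;
        }
        // BROADCAST sends each producer task's whole output to every consumer
        // task; SCATTER_GATHER sends one partition of it to each consumer task.
        // Both connect every producer task to every consumer task.
        for (int i = 0; i < producerTasks; i++) {
            for (int j = 0; j < consumerTasks; j++) {
                links.add(producer + "[" + i + "]->" + consumer + "[" + j + "]");
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Task counts are made up for the example.
        System.out.println(expand("map1", 3, "reduce1", 2, Movement.SCATTER_GATHER));
        System.out.println(expand("reduce1", 2, "join1", 2, Movement.ONE_TO_ONE));
    }
}
```

The sketch only enumerates connections; it does not model how much data travels over each one, which is where broadcast and scatter-gather differ.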
Tez – Library of Inputs and Outputs

Classical 'Map': HDFS Input -> Map Processor -> Sorted Output
Classical 'Reduce': Shuffle Input -> Reduce Processor -> HDFS Output
Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output

• What is built in?
  – Hadoop InputFormat/OutputFormat
  – SortedGroupedPartitioned Key-Value Input/Output
  – UnsortedGroupedPartitioned Key-Value Input/Output
  – Key-Value Input/Output

Tez – Broadcast Edge

SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk
from store_sales
group by ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);

Hive: Broadcast Join

Hive – MR:
• Store Sales scan. Group by and aggregation.
• Inventory and Store Sales (aggr.) output scan and shuffle join.

Hive – Tez:
• Store Sales scan. Group by and aggregation reduce size of this input.
• Inventory scan and join.
• The two stages are connected by a broadcast edge.

Tez – Custom Edge

SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);

Hive: Dynamically Partitioned Hash Join

Hive – MR:
• Inventory scan (runs as single local map task)
• Store Sales scan and join (Inventory hash table read as side file)

Hive – Tez:
• Inventory scan (runs on cluster, potentially more than 1 mapper)
• Store Sales scan and join (custom vertex reads both inputs, no side file reads)
• Custom edge (routes outputs of previous stage to the correct mappers of the next stage)

Tez – Multiple Outputs

FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk
= d_date_sk and d_year = 2000)
INSERT INTO TABLE t1 SELECT distinct ss_item_sk
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;

Hive: Multi-insert queries

Hive – MR:
• Map join date_dim/store sales
• Materialize join on HDFS
• Two MR jobs to do the distinct

Hive – Tez:
• Broadcast join (scan date_dim, join store sales)
• Distinct for customer + items

Tez – One to One Edge

Pig: Skewed Join

l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';

Pig – MR stages: Sample L; Aggregate; stage sample map on distributed cache; Partition L and Partition R; Join.
Pig – Tez stages: Load & Sample; Aggregate; broadcast sample map; Partition L (pass through input via 1-1 edge) and Partition R; Join.

Tez – Bringing it all together

TPCDS – Query-27 with Hive on Tez
• Tez Session populates container pool
• Dimension table calculation and HDFS split generation in parallel
• Dimension tables broadcast to Hive MapJoin tasks
• Final Reducer pre-launched and fetches completed inputs

Tez – Performance

• Benefits of expressing the data processing as a DAG
  • Reducing overheads and queuing effects
  • Gives system the global picture for better planning
• Efficient use of resources
  • Re-use resources to maximize utilization
  • Pre-launch, pre-warm and cache
  • Locality & resource aware scheduling
• Support for application defined DAG modifications at runtime for optimized execution
  • Change task concurrency
  • Change task scheduling
  • Change DAG edges
  • Change DAG vertices

Tez – Benefits of DAG execution

• Faster Execution and Higher Predictability
  – Eliminate replicated write barrier between successive computations.
  – Eliminate job launch overhead of workflow jobs.
  – Eliminate extra stage of map reads in every workflow job.
  – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
  – Better locality because the engine has the global picture

[Diagram: Pig/Hive - MR compared with Pig/Hive - Tez]

Tez – Container Re-Use

• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution

[Diagram: the Tez Application Master, in its own YARN container, sends Start Task to a TezTask Host running in a YARN Container / JVM. TezTask1 runs and reports Task Done, then TezTask2 is started in the same JVM. Both tasks have access to SharedObjects.]
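
A small Java simulation can show why re-use reduces launches. It is a sketch under stated assumptions, not Tez code: `ContainerPoolSketch` and its methods are hypothetical, tasks are assumed to run in equal waves, and a container id stands in for a warm JVM. With re-use, the number of container launches is capped by the pool size instead of growing with the task count.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of container re-use; not Tez code.
class ContainerPoolSketch {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private final int maxContainers;
    private int launched = 0;

    ContainerPoolSketch(int maxContainers) {
        if (maxContainers <= 0) {
            throw new IllegalArgumentException("need at least one container");
        }
        this.maxContainers = maxContainers;
    }

    // Hands out an idle container if one exists, otherwise launches a new
    // one. Returns -1 when the pool is at capacity and nothing is idle.
    int acquire() {
        if (!idle.isEmpty()) {
            return idle.pop();
        }
        if (launched < maxContainers) {
            return launched++;
        }
        return -1;
    }

    // A finished task returns its container (and warm JVM) to the pool.
    void release(int containerId) {
        idle.push(containerId);
    }

    int launchedCount() {
        return launched;
    }

    // Runs `tasks` tasks in waves of at most `maxContainers` and reports how
    // many containers had to be launched in total.
    static int launchesFor(int tasks, int maxContainers) {
        ContainerPoolSketch pool = new ContainerPoolSketch(maxContainers);
        int remaining = tasks;
        while (remaining > 0) {
            int wave = Math.min(remaining, maxContainers);
            int[] held = new int[wave];
            for (int i = 0; i < wave; i++) {
                held[i] = pool.acquire();
            }
            for (int id : held) {
                pool.release(id);
            }
            remaining -= wave;
        }
        return pool.launchedCount();
    }

    public static void main(String[] args) {
        // 10 tasks on at most 4 containers: 4 launches instead of 10.
        System.out.println(launchesFor(10, 4));
    }
}
```

Without a pool, every task would pay for its own container allocation and JVM start; the sketch counts only launches and ignores timing.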
Tez – Sessions

Sessions
• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• Represents a connection between the user and the cluster
• Multiple DAGs/queries executed in the same AM
• Containers re-used across queries
• Takes care of data locality and releasing resources when idle

[Diagram: the Client starts a session and submits DAGs to the Application Master, which contains the Task Scheduler, the ContainerPool with pre-warmed JVMs, and the Shared Object Registry.]

Tez – Re-Use in Action

[Figure: task execution timeline]

Tez – Customizable Core Engine

[Diagram: starting Vertex-1 and Vertex-2 invokes the Vertex Manager, which starts tasks; the DAG Scheduler supplies task priorities (Get Priority) and the Task Scheduler supplies containers (Get container).]

• Vertex Manager
  • Determines task parallelism
  • Determines when tasks in a vertex can start.
• DAG Scheduler
  • Determines priority of task
• Task Scheduler
  • Allocates containers from YARN and assigns them to tasks
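
The division of labour between the three components can be sketched as three small interfaces. Everything here is hypothetical and chosen for illustration: the interfaces are not the Tez plug-in interfaces, and the policies in `main` (one task per 256 MiB, priority equal to vertex depth, round-robin over 4 containers) are made-up examples of what an application might plug in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces that mirror the three roles named on the slide.
// They are not the Tez plug-in interfaces.
class EngineSketch {
    interface VertexManager { int parallelism(long inputBytes); }
    interface DagScheduler { int priority(int vertexDepth); }
    interface TaskScheduler { int containerFor(int taskNumber); }

    private final VertexManager vertexManager;
    private final DagScheduler dagScheduler;
    private final TaskScheduler taskScheduler;
    private int tasksPlanned = 0;

    EngineSketch(VertexManager vm, DagScheduler ds, TaskScheduler ts) {
        this.vertexManager = vm;
        this.dagScheduler = ds;
        this.taskScheduler = ts;
    }

    // Plans the tasks of one vertex by consulting the three components in
    // turn: parallelism, then priority, then a container per task.
    List<String> planVertex(String name, int depth, long inputBytes) {
        int tasks = vertexManager.parallelism(inputBytes);
        int priority = dagScheduler.priority(depth);
        List<String> plan = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            int container = taskScheduler.containerFor(tasksPlanned++);
            plan.add(name + "[" + i + "] priority=" + priority
                + " container=" + container);
        }
        return plan;
    }

    public static void main(String[] args) {
        // All three policies are made-up examples.
        EngineSketch engine = new EngineSketch(
            bytes -> (int) Math.max(1, bytes / (256L << 20)), // one task per 256 MiB
            depth -> depth,                                    // priority follows depth
            taskNumber -> taskNumber % 4);                     // round-robin, 4 containers
        System.out.println(engine.planVertex("map1", 0, 1L << 30));
        System.out.println(engine.planVertex("reduce1", 1, 300L << 20));
    }
}
```

Because each role sits behind its own interface, swapping one policy does not touch the other two, which is the customization point the slide describes.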
Tez – Theory to Practice

• In theory, there is no difference between theory and practice.
• But, in practice, there is.

Tez – Data at scale

Hive TPC-DS Scale 10TB

[Chart: time in secs, MR vs Tez. Categories with speedup labels: Replicated Join (2.8x); Join + Groupby (1.5x); Join + Groupby + Orderby (1.5x); 3 way Split + Join + Groupby + Orderby (2.6x).]

Tez – Pig performance gains

• Demonstrate performance gains from a basic translation to a Tez DAG
• Deeper integration in the works for further boost

Tez – Iterative algorithms

• Pig can do iterative algorithms on top of Tez
• This uses heavy-weight iteration (for-loop + map)
• Future work for faster loop-unrolled out-of-order iteration
• 1-1 edges between loops allow building morsel-style parallelism

[Chart: k-means, time in secs for MR vs Tez at 10, 50 and 100 iterations; speedup labels 14.84X, 13.12X and 5.37X.]

* Source code at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
Tez – Designed for big, busy clusters

• Number of stages in the DAG
  • The higher the number of stages in the DAG, the better Tez performs relative to MR.
• Cluster/queue capacity
  • The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
• Size of intermediate output
  • The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage (cross-rack traffic).
• Size of data in the job
  • For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger percentage of the total time for smaller jobs.
• Move workloads from gateway boxes to the cluster
  • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster.

Tez – What if you can't get enough containers?

• 78 vertices + 8374 tasks on 50 YARN containers
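
As a rough worked figure for the numbers above: 8374 tasks on 50 containers averages 8374 / 50 = 167.48 tasks per container, so each container is re-used many times over. The Java sketch below computes the lower bound on task "waves" under simplifying assumptions that are not from the slides: one task per container at a time, equal task durations, and no waiting on DAG dependencies. `WavesSketch` and `minWaves` are hypothetical names.

```java
// Hypothetical back-of-the-envelope helper; not Tez code.
class WavesSketch {
    // Minimum number of task waves when every container runs one task at a
    // time: ceil(tasks / containers), computed in integer arithmetic.
    static int minWaves(int tasks, int containers) {
        if (containers <= 0) {
            throw new IllegalArgumentException("need at least one container");
        }
        return (tasks + containers - 1) / containers;
    }

    public static void main(String[] args) {
        // 8374 tasks on 50 containers: at least 168 waves.
        System.out.println(minWaves(8374, 50));
    }
}
```

Real schedules will need more waves than this bound, because vertices downstream in the DAG cannot start before their inputs are available.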
Tez – Adoption

• Hive
  • Hadoop standard for declarative access via SQL-like interface
• Pig
  • Hadoop standard for procedural scripting and pipeline processing
• Cascading
  • Developer friendly Java API and SDK
  • Scalding (Scala API on Cascading)
• Commercial Vendors
  • ETL: Use Tez instead of MR or custom pipelines
  • Analytics Vendors: Use Tez as a target platform for scaling parallel analytical tools to large data-sets

Tez – Roadmap

• Richer DAG support
  – Addition of vertices at runtime
  – Shared edges for shared outputs
  – Enhance Input/Output collections
• Performance optimizations
  – Improve throughput at high concurrency
  – Improve locality aware scheduling (co-scheduling)
  – Add framework level data statistics
  – HDFS memory storage integration
• Usability
  – Stability and testability
  – API ease of use
  – Tools for performance analysis and debugging

Tez – Community

• Early adopters and code contributors welcome
  – Adopters to drive more scenarios. Contributors to make them happen.
• Technical blog series
  – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/tez
  – Developer list: dev@tez.apache.org
  – User list: user@tez.apache.org
  – Issues list: issues@tez.apache.org

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Tez: Accelerating Data Pipelines - fifthel

  • 1. Accelerating Hadoop Data Pipelines
      FifthElephant.in 2014
      gopalv @ apache.org
      © Hortonworks Inc. 2014
  • 2. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Highly customizable to meet a broad spectrum of use cases.
      • Built on top of YARN – the resource management framework for Hadoop.
      • Open source Apache project and Apache licensed.
  • 3. Hadoop 1 -> Hadoop 2
      • HADOOP 1.0
          • HDFS (redundant, reliable storage)
          • MapReduce (cluster resource management & data processing)
          • Pig (data flow), Hive (sql), Others (cascading)
      • HADOOP 2.0
          • HDFS2 (redundant, reliable storage)
          • YARN (cluster resource management)
          • Tez (execution engine)
          • Data Flow: Pig; SQL: Hive; Others (Cascading)
          • Batch: MapReduce
          • Real Time Stream Processing: Storm
          • Online Data Processing: HBase, Accumulo
      • Monolithic
          • Resource Management
          • Execution Engine
          • User API
      • Layered
          • Resource Management – YARN
          • Execution Engine – Tez
          • User API – Hive, Pig, Cascading, Your App!
  • 4. Tez – Design considerations
      Don’t solve problems that have already been solved. Or you will have to solve them again!
      • Leverage discrete task based compute model for elasticity, scalability and fault tolerance
      • Leverage several man years of work in Hadoop Map-Reduce data shuffling operations
      • Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN
      • Leverage built-in security mechanisms in Hadoop for privacy and isolation
      Look to the Future with an eye on the Past
  • 5. Tez – Problems that it addresses
      • Expressing the computation
          • Direct and elegant representation of the data processing flow
          • Interfacing with application code and new technologies
      • Performance
          • Late Binding: Make decisions as late as possible using real data at runtime
          • Leverage the resources of the cluster efficiently
          • Just work out of the box!
          • Customizable engine to let applications tailor the job to meet their specific requirements
      • Operation simplicity
          • Painless to operate, experiment and upgrade
  • 6. Tez – Simplifying Operations
      • Tez is a pure YARN application. Easy and safe to try it out!
      • No deployments to do, no servers to run
      • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
      • Leverages YARN local resources.
      [Diagram: TezClient on two client machines, Tez Lib 1 and Tez Lib 2 on HDFS, TezTask on Node Managers]
  • 7. Tez – Expressing the computation
      Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
      • Vertices in the graph represent data transformations
      • Edges represent data movement from producers to consumers
      [Diagram: Distributed Sort with Preprocessor Stage, Sampler, Partition Stage and Aggregate Stage (Task-1, Task-2 in each stage), exchanging Samples and Ranges]
  • 8. MR is a 2-vertex sub-set of Tez
  • 9. But Tez is so much more
  • 10. Tez – Expressing the computation
      Tez defines the following APIs to define the work
      • DAG API
          • Defines the structure of the data processing and the relationship between producers and consumers
          • Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
          • This is how all the tasks in the job get specified
      • Runtime API
          • Defines the interface through which the framework and app code interact with each other
          • App code transforms data and moves it between tasks
          • This is how we specify what actually executes in each task on the cluster nodes
  • 11. Tez – DAG API
      // Define DAG
      DAG dag = new DAG();
      // Define Vertex
      Vertex source = new Vertex(Processor.class);
      // Define Edge
      Edge edge = Edge(source, destination, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
      // Connect them
      dag.addVertex(source).addEdge(edge)…
      Defines the global processing flow
      [Diagram: DAG with vertices map1, reduce1, map2, reduce2 and join1; edges labeled Scatter_Gather, Bipartite, Sequential]
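The builder pattern on the DAG API slide can be sketched in a few lines of Python. This is a toy model, not the Tez API: the `DAG` and `Edge` classes, their method names, and the wiring of the five vertices (map1 to reduce1, map2 to reduce2, both reducers into join1) are assumptions made for illustration. It shows how chained `add_vertex`/`add_edge` calls define a graph and how a topological order gives one valid execution order for the vertices.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str
    destination: str
    movement: str = "SCATTER_GATHER"  # descriptive label only in this sketch


@dataclass
class DAG:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_vertex(self, name):
        if name in self.vertices:
            raise ValueError(f"duplicate vertex: {name}")
        self.vertices.append(name)
        return self  # returning self allows chaining, as on the slide

    def add_edge(self, edge):
        for end in (edge.source, edge.destination):
            if end not in self.vertices:
                raise ValueError(f"unknown vertex: {end}")
        self.edges.append(edge)
        return self

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph has a cycle."""
        indegree = {v: 0 for v in self.vertices}
        for e in self.edges:
            indegree[e.destination] += 1
        ready = deque(v for v in self.vertices if indegree[v] == 0)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for e in self.edges:
                if e.source == v:
                    indegree[e.destination] -= 1
                    if indegree[e.destination] == 0:
                        ready.append(e.destination)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return order


def build_example():
    # Hypothetical wiring of the vertices named in the slide's diagram.
    dag = DAG()
    for name in ("map1", "map2", "reduce1", "reduce2", "join1"):
        dag.add_vertex(name)
    dag.add_edge(Edge("map1", "reduce1")).add_edge(Edge("map2", "reduce2"))
    dag.add_edge(Edge("reduce1", "join1")).add_edge(Edge("reduce2", "join1"))
    return dag
```

Calling `build_example().topological_order()` yields the maps first, then the reducers, then the join. A real engine would start a vertex as soon as its inputs allow rather than following a single linear order.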
  • 12. Tez – Logical DAG expansion at Runtime
      [Diagram: logical vertices map1, reduce1, map2, reduce2 and join1 expanded into tasks]
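Expanding a logical DAG into tasks can be illustrated with a small Python sketch. The function name `expand`, the movement labels, and the per-vertex parallelism numbers are assumptions for illustration, not Tez internals. The idea shown: each logical vertex becomes N tasks, a scatter-gather or broadcast edge connects every producer task to every consumer task, and a one-to-one edge connects task i to task i.

```python
def expand(parallelism, logical_edges):
    """Expand a logical DAG into task-level connections.

    parallelism:   dict mapping vertex name -> number of tasks
    logical_edges: list of (source, destination, movement) tuples
    """
    tasks = [(v, i) for v, n in parallelism.items() for i in range(n)]
    connections = []
    for src, dst, movement in logical_edges:
        n_src, n_dst = parallelism[src], parallelism[dst]
        if movement == "ONE_TO_ONE":
            if n_src != n_dst:
                raise ValueError("ONE_TO_ONE needs equal parallelism")
            # task i of the producer feeds task i of the consumer
            connections += [((src, i), (dst, i)) for i in range(n_src)]
        elif movement in ("SCATTER_GATHER", "BROADCAST"):
            # every producer task feeds every consumer task
            connections += [
                ((src, i), (dst, j))
                for i in range(n_src)
                for j in range(n_dst)
            ]
        else:
            raise ValueError(f"unknown movement: {movement}")
    return tasks, connections
```

With `{"map1": 3, "reduce1": 2}` and one scatter-gather edge, the sketch produces 5 tasks and 6 task-level connections. The two all-to-all movements differ in what travels on each connection (one partition versus the full output), which this sketch does not model.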
  • 13. Tez – Library of Inputs and Outputs
      • Classical ‘Map’: HDFS Input, Map Processor, Sorted Output
      • Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output
      • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
      • What is built in?
          – Hadoop InputFormat/OutputFormat
          – SortedGroupedPartitioned Key-Value Input/Output
          – UnsortedGroupedPartitioned Key-Value Input/Output
          – Key-Value Input/Output
  • 14. Tez – Broadcast Edge
      SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
      FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk
            from store_sales group by ss_item_sk) ss
      JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
      • Hive – MR
          • Store Sales scan. Group by and aggregation.
          • Inventory and Store Sales (aggr.) output scan and shuffle join.
      • Hive – Tez
          • Store Sales scan. Group by and aggregation reduce size of this input.
          • Broadcast edge
          • Inventory scan and Join
      Hive: Broadcast Join
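The broadcast join on this slide can be sketched in plain Python. This is a minimal illustration of the technique, not Hive or Tez code: the function names and the list-of-dicts row format are assumptions, and only the column names come from the slide's query. The small aggregated side is built once as a hash table, and each inventory partition (standing in for one map task) probes it locally, so the large side needs no shuffle.

```python
from collections import defaultdict


def average_price_by_item(store_sales):
    """Aggregate the store_sales side: ss_item_sk -> avg(ss_sold_price)."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum, count]
    for row in store_sales:
        acc = totals[row["ss_item_sk"]]
        acc[0] += row["ss_sold_price"]
        acc[1] += 1
    return {item: total / count for item, (total, count) in totals.items()}


def broadcast_join(inventory_partitions, avg_price):
    """Join every inventory partition against the broadcast hash table."""
    joined = []
    for partition in inventory_partitions:  # each partition models one task
        for row in partition:
            item = row["inv_item_sk"]
            if item in avg_price:  # inner join: drop unmatched rows
                joined.append(
                    (item, avg_price[item], row["inv_quantity_on_hand"])
                )
    return joined
```

The design point is that the aggregation shrinks store_sales before it is sent anywhere, which is what makes broadcasting it to every inventory task cheap.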
  • 15. Tez – Custom Edge
      SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
      FROM store_sales ss
      JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
      • Hive – MR
          • Inventory scan (Runs as single local map task)
          • Store Sales scan and Join (Inventory hash table read as side file)
      • Hive – Tez
          • Inventory scan (Runs on cluster, potentially more than 1 mapper)
          • Custom edge (routes outputs of previous stage to the correct Mappers of the next stage)
          • Store Sales scan and Join (Custom vertex reads both inputs – no side file reads)
      Hive: Dynamically Partitioned Hash Join
  • 16. Tez – Multiple Outputs
      FROM (SELECT * FROM store_sales, date_dim
            WHERE ss_sold_date_sk = d_date_sk and d_year = 2000)
      INSERT INTO TABLE t1 SELECT distinct ss_item_sk
      INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;
      • Hive – MR
          • Map join date_dim/store sales
          • Materialize join on HDFS
          • Two MR jobs to do the distinct
      • Hive – Tez
          • Broadcast Join (scan date_dim, join store sales)
          • Distinct for customer + items
      Hive: Multi-insert queries
  • 17. Tez – One to One Edge
      l = LOAD 'left' AS (x, y);
      r = LOAD 'right' AS (x, z);
      j = JOIN l BY x, r BY x USING 'skewed';
      [Diagram: Pig – MR vs. Pig – Tez plans for the skewed join. Labels: Load & Sample; Sample L; Aggregate; Stage sample map on distributed cache; Broadcast sample map; Partition L; Partition R; Partition L and Partition R; Pass through input via 1-1 edge; Join; HDFS]
      Pig: Skewed Join
  • 18. Tez – Bringing it all together
      • Tez Session populates container pool
      • Dimension table calculation and HDFS split generation in parallel
      • Dimension tables broadcasted to Hive MapJoin tasks
      • Final Reducer pre-launched and fetches completed inputs
      TPCDS – Query-27 with Hive on Tez
  • 19. Tez – Performance
      • Benefits of expressing the data processing as a DAG
          • Reducing overheads and queuing effects
          • Gives system the global picture for better planning
      • Efficient use of resources
          • Re-use resources to maximize utilization
          • Pre-launch, pre-warm and cache
          • Locality & resource aware scheduling
      • Support for application defined DAG modifications at runtime for optimized execution
          • Change task concurrency
          • Change task scheduling
          • Change DAG edges
          • Change DAG vertices
  • 20. Tez – Benefits of DAG execution
      • Faster Execution and Higher Predictability
          – Eliminate replicated write barrier between successive computations.
          – Eliminate job launch overhead of workflow jobs.
          – Eliminate extra stage of map reads in every workflow job.
          – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
          – Better locality because the engine has the global picture
      [Diagram: Pig/Hive - MR vs. Pig/Hive - Tez]
  • 21. Tez – Container Re-Use
      • Reuse YARN containers/JVMs to launch new tasks
      • Reduce scheduling and launching delays
      • Shared in-memory data across tasks
      • JVM JIT friendly execution
      [Diagram: Tez Application Master in a YARN container sends Start Task and receives Task Done from a TezTask Host (YARN Container / JVM) running TezTask1 and TezTask2 with SharedObjects]
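The effect of container re-use on launch overhead can be estimated with a back-of-the-envelope Python sketch. This is a deliberately crude model under stated assumptions: every task takes the same time, tasks run in full waves across all containers, DAG dependencies are ignored, and the 2-second launch and 10-second task figures are hypothetical. Only the 8374 tasks and 50 containers come from a later slide in this deck.

```python
import math


def total_time(num_tasks, num_containers, launch_secs, task_secs, reuse):
    """Estimate wall-clock time for uniform tasks run in waves.

    Without reuse, every wave pays the container/JVM launch cost again.
    With reuse, each container is launched once and then runs task
    after task.
    """
    waves = math.ceil(num_tasks / num_containers)
    if reuse:
        return launch_secs + waves * task_secs
    return waves * (launch_secs + task_secs)


# Hypothetical timings; task and container counts taken from the deck.
without_reuse = total_time(8374, 50, launch_secs=2, task_secs=10, reuse=False)
with_reuse = total_time(8374, 50, launch_secs=2, task_secs=10, reuse=True)
```

In this model the saving grows with the number of waves, so it is largest when many short tasks share few containers. The other benefits on the slide (shared in-memory objects, a warmed-up JIT) are not captured here.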
  • 22. Tez – Sessions
      • Standard concepts of pre-launch and pre-warm applied
      • Key for Interactive queries
      • Represents a connection between the user and the cluster
      • Multiple DAGs/Queries executed in the same AM
      • Containers re-used across queries
      • Takes care of data locality and releasing resources when idle
      [Diagram: Client starts a session and submits DAGs to the Application Master, which holds the Task Scheduler, ContainerPool (pre-warmed JVMs) and Shared Object Registry]
  • 23. Tez – Re-Use in Action
      [Figure: Task Execution Timeline]
  • 24. Tez – Customizable Core Engine
      • Vertex Manager
          • Determines task parallelism
          • Determines when tasks in a vertex can start.
      • DAG Scheduler
          • Determines priority of task
      • Task Scheduler
          • Allocates containers from YARN and assigns them to tasks
      [Diagram: Vertex-1 and Vertex-2 with Vertex Manager (Start vertex, Start tasks), DAG Scheduler (Get Priority) and Task Scheduler (Get container)]
  • 25. Tez – Theory to Practice
      • In theory, there is no difference between theory and practice.
      • But, in practice, there is.
  • 26. Tez – Data at scale
      [Chart: Hive TPC-DS, Scale 10TB]
  • 27. Tez – Pig performance gains
      • Demonstrate performance gains from a basic translation to a Tez DAG
      • Deeper integration in the works for further boost
      [Chart: time in secs (axis 0 to 5000), MR vs. Tez. Replicated Join (2.8x); Join + Groupby (1.5x); Join + Groupby + Orderby (1.5x); 3 way Split + Join + Groupby + Orderby (2.6x)]
  • 28. Tez – iterative algorithms
      • Pig can do iterative algorithms on top of Tez
      • This uses heavy-weight iteration (for-loop + map)
      • Future work for faster loop-unrolled out-of-order iteration
      • 1-1 edges between loops allow building morsel style parallelism
      [Chart: k-means, time in secs vs. iteration count (10, 50, 100), MR vs. Tez; speedup labels 14.84X, 13.12X, 5.37X]
      * Source code at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
  • 29. Tez – Designed for big, busy clusters
      • Number of stages in the DAG
          • The higher the number of stages in the DAG, the better the performance of Tez (over MR).
      • Cluster/queue capacity
          • The more congested a queue is, the better the performance of Tez (over MR), due to container reuse.
      • Size of intermediate output
          • The larger the intermediate output, the better the performance of Tez (over MR), due to reduced HDFS usage (cross-rack traffic).
      • Size of data in the job
          • For smaller data and more stages, the performance of Tez (over MR) will be better, as launch overhead is a high percentage of the total time for smaller jobs.
      • Move workloads from gateway boxes to the cluster
          • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster.
  • 30. Tez – what if you can’t get enough containers?
      • 78 vertices + 8374 tasks on 50 YARN containers
  • 31. Tez – Adoption
      • Hive
          • Hadoop standard for declarative access via SQL-like interface
      • Pig
          • Hadoop standard for procedural scripting and pipeline processing
      • Cascading
          • Developer friendly Java API and SDK
          • Scalding (Scala API on Cascading)
      • Commercial Vendors
          • ETL: Use Tez instead of MR or custom pipelines
          • Analytics Vendors: Use Tez as a target platform for scaling parallel analytical tools to large data-sets
  • 32. Tez – Roadmap
      • Richer DAG support
          – Addition of vertices at runtime
          – Shared edges for shared outputs
          – Enhance Input/Output collections
      • Performance optimizations
          – Improve throughput at high concurrency
          – Improve locality aware scheduling (co-scheduling)
          – Add framework level data statistics
          – HDFS memory storage integration
      • Usability
          – Stability and testability
          – API ease of use
          – Tools for performance analysis and debugging
  • 33. Tez – Community
      • Early adopters and code contributors welcome
          – Adopters to drive more scenarios. Contributors to make them happen.
      • Technical blog series
          – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
      • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/tez
          – Developer list: dev@tez.apache.org
          – User list: user@tez.apache.org
          – Issues list: issues@tez.apache.org