Apache Hive and Stinger:
SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O’Malley (@owen_omalley)
@hortonworks

© Hortonworks Inc. 2013.
YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop
•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  ONLINE (HBase)
•  STREAMING (Storm, S4, …)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  OTHER (Search, Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Hadoop Beyond Batch with YARN
A shift from the old to the new…
HADOOP 1: Single Use System (Batch Apps)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
•  MapReduce (batch), Tez (interactive), Others (varied)
•  YARN (operating system: cluster resource management)
•  HDFS2 (redundant, reliable storage)
Apache Tez (“Speed”)
•  Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>

YARN ApplicationMaster to run DAG of Tez Tasks
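
The <Input, Processor, Output> triple can be pictured with a small Python sketch. This is only an illustration of the pluggable-task idea, not the Apache Tez API: `Task`, `list_input`, `count_words` and `list_output` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical types for illustration only; this is not the Apache Tez API.
Input = Callable[[], Iterable]
Processor = Callable[[Iterable], Iterator]
Output = Callable[[Iterable], None]


@dataclass
class Task:
    """A task is just three pluggable pieces wired together."""
    read: Input
    process: Processor
    write: Output

    def run(self) -> None:
        self.write(self.process(self.read()))


def list_input(rows: List[str]) -> Input:
    """Input that reads from an in-memory list (stands in for HDFS or shuffle input)."""
    return lambda: iter(rows)


def count_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Processor that counts words and emits (word, count) pairs in sorted order."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    yield from sorted(counts.items())


def list_output(sink: list) -> Output:
    """Output that appends records to a list (stands in for sorted or HDFS output)."""
    return lambda records: sink.extend(records)


sink: list = []
Task(list_input(["a b", "b c"]), count_words, list_output(sink)).run()
print(sink)
```

Swapping any one of the three pieces (for example a different output) changes the task's role without touching the other two, which is the property the slide attributes to Tez tasks.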
Tez: Building blocks for scalable data processing
•  Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
•  Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
•  Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
Hive-on-MR vs. Hive-on-Tez
Tez avoids unneeded writes to HDFS

SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg_y
FROM c GROUP BY x
ORDER BY avg_y;

[Figure: side-by-side query plans, Hive – MR and Hive – Tez, built from map (M) and reduce (R) tasks. Stages shown: SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c) with SELECT c.price; JOIN(a, b) with GROUP BY a.state, COUNT(*), AVERAGE(c.price). HDFS appears as the hand-off between jobs in the plan.]
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)

• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc
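
The effect of keeping a session warm can be sketched as simple arithmetic. The model below is an assumption made for illustration, not a measurement: `total_latency` is a hypothetical helper, the query durations are made-up inputs, and the 5-second launch overhead is the lower end of the ~5s – 30s range quoted above.

```python
from typing import List


def total_latency(query_seconds: List[float],
                  launch_overhead: float,
                  reuse_session: bool) -> float:
    """Wall-clock time for a sequence of queries under a toy model.

    Without a session, every query pays the launch overhead.
    With a session, the overhead is paid once, when the session starts.
    """
    if reuse_session:
        return launch_overhead + sum(query_seconds)
    return sum(launch_overhead + q for q in query_seconds)


# Hypothetical query durations in seconds, chosen only for illustration.
queries = [4.0, 7.0, 2.0]

cold = total_latency(queries, launch_overhead=5.0, reuse_session=False)
warm = total_latency(queries, launch_overhead=5.0, reuse_session=True)
print(cold, warm)  # 28.0 18.0
```

In this toy model the saving grows with the number of queries, because each additional query on a warm session avoids one more launch.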
Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative

A broad, community-based effort to drive the next generation of HIVE

Stinger Project (announced February 2013)

Goals:
•  Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
•  Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
•  SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop

Hive 0.11, May 2013:
•  Base Optimizations
•  SQL Analytic Functions
•  ORCFile, Modern File Format

Hive 0.12, October 2013:
•  VARCHAR, DATE Types
•  ORCFile predicate pushdown
•  Advanced Optimizations
•  Performance Boosts via YARN

Coming Soon:
•  Hive on Apache Tez
•  Query Service
•  Buffer Cache
•  Cost Based Optimizer (Optiq)
•  Vectorized Processing
Hive 0.12
Release Theme: Speed, Scale and SQL

Specific Features
•  10x faster query launch when using large number (500+) of partitions
•  ORCFile predicate pushdown speeds queries
•  Evaluate LIMIT on the map side
•  Parallel ORDER BY
•  New query optimizer
•  Introduces VARCHAR and DATE datatypes
•  GROUP BY on structs or unions

Included Components: Apache Hive 0.12
SPEED: Increasing Hive Performance
Interactive Query Times across ALL use cases
•  Simple and advanced queries in seconds
•  Integrates seamlessly with existing tools
•  Currently a >100x improvement in just nine months
Performance Improvements included in Hive 0.12
–  Base & advanced query optimization
–  Startup time improvement
–  Join optimizations

Stinger Phase 3: Interactive Query In Hadoop
Query | Hive 0.10 | Hive 0.11 (Phase 1) | Trunk (Phase 3) | Improvement
TPC-DS Query 27: Pricing Analytics using Star Schema Join | 1400s | 39s | 7.2s | 190x
TPC-DS Query 82: Inventory Analytics Joining 2 Large Fact Tables | 3200s | 65s | 14.9s | 200x
All Results at Scale Factor 200 (Approximately 200GB Data)
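
The improvement factors can be checked with a few lines of Python. The pairing of bars to queries is an assumption read off the chart labels (1400s and 7.2s for Query 27, 3200s and 14.9s for Query 82); `speedup` is a hypothetical helper written for this check.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the newer run is."""
    return before_seconds / after_seconds


# Assumed pairing of the chart's bar labels to queries.
timings = {
    "TPC-DS Query 27": {"Hive 0.10": 1400.0, "Trunk (Phase 3)": 7.2},
    "TPC-DS Query 82": {"Hive 0.10": 3200.0, "Trunk (Phase 3)": 14.9},
}

speedups = {
    query: speedup(t["Hive 0.10"], t["Trunk (Phase 3)"])
    for query, t in timings.items()
}

for query, factor in speedups.items():
    print(f"{query}: {factor:.1f}x")
```

Under that pairing the ratios come out at roughly 194x and 215x, which is consistent with the rounded 190x and 200x labels on the chart.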
Speed: Delivering Interactive Query
Query Time in Seconds
Query | Hive 0.12 | Trunk (Phase 3)
TPC-DS Query 52: Star Schema Join | 41.1s | 4.2s
TPC-DS Query 55: Star Schema Join | 39.8s | 4.1s
Test Cluster:
•  200 GB Data (Impala: Parquet, Hive: ORCFile)
•  20 Nodes, 24GB RAM each, 6x disk each
Speed: Delivering Interactive Query
Query Time in Seconds
Query | Hive 0.12 | Trunk (Phase 3)
TPC-DS Query 28: Vectorization | 31s | 9.8s
TPC-DS Query 12: Complex join (M-R-R pattern) | 22s | 6.7s
Test Cluster:
•  200 GB Data (Impala: Parquet, Hive: ORCFile)
•  20 Nodes, 24GB RAM each, 6x disk each
AMPLab Big Data Benchmark
AMPLab Query 1: Simple Filter Query
[Chart: Query Time in Seconds (lower is better) for AMPLab Query 1a, 1b and 1c. Series: Hive 0.10 (5 node EC2) and Trunk (Phase 3). Bar labels: 63s, 63s, 45s, 1.6s, 2.3s, 9.4s]
Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24GB RAM each, 6x Disk each
AMPLab Big Data Benchmark
AMPLab Query 2: Group By IP Block and Aggregate
[Chart: Query Time in Seconds (lower is better) for AMPLab Query 2a, 2b and 2c. Series: Hive 0.10 (5 node EC2) and Trunk (Phase 3). Bar labels: 552s, 466s, 104.3s, 490s, 118.3s, 172.7s]
Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24GB RAM each, 6x Disk each
AMPLab Big Data Benchmark
AMPLab Query 3: Correlate Page Rankings and Revenues Across Time
[Chart: Query Time in Seconds (lower is better) for AMPLab Query 3a and 3b. Series: Hive 0.10 (5 node EC2) and Trunk (Phase 3). Bar labels: 490s, 466s, 40s, 145s]
Stinger Phase 3 Cluster Configuration:
•  AMPLab Data Set (~135 GB Data)
•  20 Nodes, 24GB RAM each, 6x Disk each
How Stinger Phase 3 Delivers Interactive Query

Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce | Latency
Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics including histograms etc. | Latency
SQL: Enhancing SQL Semantics
SQL Compliance
Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop

Hive SQL Datatypes
•  INT
•  TINYINT/SMALLINT/BIGINT
•  BOOLEAN
•  FLOAT
•  DOUBLE
•  STRING
•  TIMESTAMP
•  BINARY
•  DECIMAL
•  ARRAY, MAP, STRUCT, UNION
•  DATE
•  VARCHAR
•  CHAR

Hive SQL Semantics
•  SELECT, INSERT
•  GROUP BY, ORDER BY, SORT BY
•  JOIN on explicit join key
•  Inner, outer, cross and semi joins
•  Sub-queries in FROM clause
•  ROLLUP and CUBE
•  UNION
•  Windowing Functions (OVER, RANK, etc)
•  Custom Java UDFs
•  Standard Aggregation (SUM, AVG, etc.)
•  Advanced UDFs (ngram, Xpath, URL)
•  Sub-queries in WHERE, HAVING
•  Expanded JOIN Syntax
•  SQL Compliant Security (GRANT, etc.)
•  INSERT/UPDATE/DELETE (ACID)

Legend: Available, Hive 0.12, Roadmap
ORC File Format
• Columnar format for complex data types
• Built into Hive from 0.11
• Support for Pig and MapReduce via HCat
• Two levels of compression
– Lightweight type-specific and generic

• Built in indexes
– Every 10,000 rows with position information
– Min, Max, Sum, Count of each column
– Supports seek to row number
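
A rough Python sketch of how a per-row-group index like this lets a reader skip data. It is a toy model under stated assumptions, not ORC's actual on-disk layout or reader code: `IndexEntry`, `build_index` and `groups_to_read` are hypothetical names, and the column is a plain list of integers.

```python
from dataclasses import dataclass
from typing import List

ROW_GROUP_SIZE = 10_000  # from the slide: one index entry every 10,000 rows


@dataclass
class IndexEntry:
    first_row: int  # position information: where the row group starts
    minimum: int
    maximum: int
    total: int
    count: int


def build_index(column: List[int], group_size: int = ROW_GROUP_SIZE) -> List[IndexEntry]:
    """Record min, max, sum and count for each group of rows."""
    entries = []
    for start in range(0, len(column), group_size):
        chunk = column[start:start + group_size]
        entries.append(IndexEntry(start, min(chunk), max(chunk), sum(chunk), len(chunk)))
    return entries


def groups_to_read(index: List[IndexEntry], lo: int, hi: int) -> List[int]:
    """Start rows of the groups that may hold values in [lo, hi]; the rest are skipped."""
    return [e.first_row for e in index if e.maximum >= lo and e.minimum <= hi]


column = list(range(35_000))
index = build_index(column)
hits = groups_to_read(index, 12_500, 21_000)
print(hits)  # [10000, 20000]
```

Only two of the four row groups overlap the requested range, so a reader working this way would seek straight to rows 10,000 and 20,000 and never touch the others.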

SCALE: Interactive Query at Petabyte Scale
Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale
Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster

File Size Comparison Across Encoding Methods
Dataset: TPC-DS Scale 500 Dataset
Encoded with | File Size | Reduction
Text | 585 GB | (Original Size)
RCFile | 505 GB | 14% Smaller
Parquet (Impala) | 221 GB | 62% Smaller
ORCFile (Hive 0.12) | 131 GB | 78% Smaller

•  Larger Block Sizes
•  Columnar format arranges columns adjacent within the file for compression & fast access
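
The "% Smaller" figures follow directly from the file sizes. A quick check in Python, using the sizes from the chart (the variable names are just for this example):

```python
# File sizes in GB from the chart, keyed by encoding.
sizes_gb = {"Text": 585, "RCFile": 505, "Parquet": 221, "ORCFile": 131}

baseline = sizes_gb["Text"]  # the original, text-encoded size

# Percentage reduction relative to the text baseline, rounded to whole percent.
reduction = {
    name: round(100 * (1 - size / baseline))
    for name, size in sizes_gb.items()
}

print(reduction)  # {'Text': 0, 'RCFile': 14, 'Parquet': 62, 'ORCFile': 78}
```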
ORC File Format
• Hive 0.12
– Predicate Push Down
– Improved run length encoding
– Adaptive string dictionaries
– Padding stripes to HDFS block boundaries

• Trunk
– Stripe-based Input Splits
– Input Split elimination
– Vectorized Reader
– Customized Pig Load and Store functions

Vectorized Query Execution
• Designed for Modern Processor Architectures
– Avoid branching in the inner loop.
– Make the most use of L1 and L2 cache.

• How It Works
– Process records in batches of 1,000 rows
– Generate code from templates to minimize branching.

• What It Gives
– 30x improvement in rows processed per second.
– Initial prototype: 100M rows/sec on laptop
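
The batch-at-a-time structure can be sketched in Python. This only illustrates the shape of the computation; Python will not reproduce the cache and branch-prediction gains described above, and `batches` and `sum_filtered` are hypothetical names, not Hive classes.

```python
from typing import Iterator, List

BATCH_SIZE = 1_000  # from the slide: records are processed in batches of 1,000 rows


def batches(column: List[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Split a column into fixed-size batches; the last batch may be shorter."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def sum_filtered(column: List[int], threshold: int) -> int:
    """Sum the values above a threshold, one column batch at a time."""
    total = 0
    for batch in batches(column):
        # Filter step: one tight loop that records which positions qualify.
        selected = [i for i, value in enumerate(batch) if value > threshold]
        # Aggregate step: a second tight loop over the selected positions only.
        total += sum(batch[i] for i in selected)
    return total


column = list(range(2_500))
print(sum_filtered(column, 2_000))
```

Each operator runs a simple loop over a whole batch before the next operator starts, instead of pushing one row through every operator in turn.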

HDFS Buffer Cache
• Use memory mapped buffers for zero copy
– Avoid overhead of going through DataNode
– Can mlock the block files into RAM

• ORC Reader enhanced for zero-copy reads
– New compression interfaces in Hadoop

• Vectorization specific reader
– Read 1000 rows at a time
– Read into Hive’s internal representation
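
As a local-file analogy for memory-mapped, zero-copy reads, the sketch below uses Python's standard `mmap` module. It does not use any HDFS API and does not show mlock; `checksum_region` and the temporary file are assumptions made for the example.

```python
import mmap
import os
import tempfile


def checksum_region(path: str, offset: int, length: int) -> int:
    """Sum the bytes in [offset, offset + length) without copying them out of the mapping."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # memoryview slices refer to the mapped pages directly instead of copying bytes.
        with memoryview(mapped) as whole:
            with whole[offset:offset + length] as region:
                return sum(region)


# Write a small block file to read back through the mapping.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
    block_path = tmp.name

try:
    result = checksum_region(block_path, 10, 5)  # bytes 10, 11, 12, 13, 14
    print(result)  # 60
finally:
    os.remove(block_path)
```

The views are released before the mapping closes, which `mmap` requires; the reader never calls `read()` to pull the data into a separate buffer.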

Next Steps
• Blog
http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/

• Stinger Initiative
http://hortonworks.com/labs/stinger/

• Stinger Beta: HDP-2.1 Beta, December, 2013

Thank You!

@acmurthy
@alanfgates
@owen_omalley
@hortonworks
Standalone metastore-dws-sjc-june-2018alanfgates
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013alanfgates
 

More from alanfgates (6)

Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 

Recently uploaded

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Strata Stinger Talk October 2013

  • 1. Apache Hive and Stinger: SQL in Hadoop Arun Murthy (@acmurthy) Alan Gates (@alanfgates) Owen O’Malley (@owen_omalley) @hortonworks © Hortonworks Inc. 2013.
  • 2. YARN: Taking Hadoop Beyond Batch
      Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service.
      Applications Run Natively IN Hadoop:
      – BATCH (MapReduce)
      – INTERACTIVE (Tez)
      – ONLINE (HBase)
      – STREAMING (Storm, S4, …)
      – GRAPH (Giraph)
      – IN-MEMORY (Spark)
      – HPC MPI (OpenMPI)
      – OTHER (Search, Weave, …)
      Underlying layers: YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage).
  • 3. Hadoop Beyond Batch with YARN
      A shift from the old to the new…
      – HADOOP 1, Single Use System (Batch Apps): MapReduce (cluster resource management & data processing) over HDFS (redundant, reliable storage)
      – HADOOP 2, Multi Use Data Platform (Batch, Interactive, Online, Streaming, …): MapReduce (batch), Tez (interactive) and Others (varied) over YARN (operating system: cluster resource management) over HDFS2 (redundant, reliable storage)
  • 4. Apache Tez (“Speed”)
      Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
      – Lower latency for interactive queries
      – Higher throughput for batch queries
      – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
      Task with pluggable Input, Processor and Output: Tez Task - <Input, Processor, Output>
      YARN ApplicationMaster to run DAG of Tez Tasks
  • 5. Tez: Building blocks for scalable data processing
      – Classical ‘Map’: HDFS Input, Map Processor, Sorted Output
      – Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output
      – Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
  • 6. Hive-on-MR vs. Hive-on-Tez
      Tez avoids unneeded writes to HDFS.
      Example query: SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x ORDER BY AVG;
      Diagram (Hive – MR next to Hive – Tez): map (M) and reduce (R) stages labelled SELECT a.state; SELECT b.id; SELECT a.state, c.itemId; JOIN (a, c) SELECT c.price; and JOIN(a, b) GROUP BY a.state COUNT(*) AVERAGE(c.price), with HDFS boxes marking intermediate writes.
  • 7. Tez Sessions
      … because Map/Reduce query startup is expensive
      – Tez Sessions: hot containers ready for immediate use; removes task and job launch overhead (~5s – 30s)
      – Hive: session launch/shutdown in background (seamless, user not aware); submits query plan directly to Tez Session
      Native Hadoop service, not ad-hoc
  • 8. Tez Delivers Interactive Query - Out of the Box!
      Feature: Description (Benefit)
      – Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster (Latency)
      – Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries (Latency)
      – Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! (Latency)
      – Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics (Throughput)
      – Tez In-Memory Cache: Hot data kept in RAM for fast access (Latency)
      – Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput (Throughput)
  • 9. Batch AND Interactive SQL-IN-Hadoop
      Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE
      Goals:
      – Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
      – Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
      – SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
      …all IN Hadoop
      Stinger Project (announced February 2013):
      – Hive 0.11, May 2013: Base Optimizations; SQL Analytic Functions; ORCFile, Modern File Format
      – Hive 0.12, October 2013: VARCHAR, DATE Types; ORCFile predicate pushdown; Advanced Optimizations; Performance Boosts via YARN
      – Coming Soon: Hive on Apache Tez; Query Service; Buffer Cache; Cost Based Optimizer (Optiq); Vectorized Processing
  • 10. Hive 0.12
      Release Theme: Speed, Scale and SQL
      Specific Features:
      – 10x faster query launch when using large number (500+) of partitions
      – ORCFile predicate pushdown speeds queries
      – Evaluate LIMIT on the map side
      – Parallel ORDER BY
      – New query optimizer
      – Introduces VARCHAR and DATE datatypes
      – GROUP BY on structs or unions
      Included Components: Apache Hive 0.12
  • 11. SPEED: Increasing Hive Performance
      Interactive Query Times across ALL use cases
      – Simple and advanced queries in seconds
      – Integrates seamlessly with existing tools
      – Currently a >100x improvement in just nine months
      Performance Improvements included in Hive 12
      – Base & advanced query optimization
      – Startup time improvement
      – Join optimizations
  • 12. Stinger Phase 3: Interactive Query In Hadoop
      – TPC-DS Query 27: Pricing Analytics using Star Schema Join
      – TPC-DS Query 82: Inventory Analytics Joining 2 Large Fact Tables
      Chart series: Hive 10, Hive 0.11 (Phase 1), Trunk (Phase 3)
      Chart labels, in extraction order: 1400s, 190x Improvement, 3200s, 200x Improvement, 65s, 39s, 14.9s, 7.2s
      All Results at Scale Factor 200 (Approximately 200GB Data)
  • 13. Speed: Delivering Interactive Query
      Query Time in Seconds
      – TPC-DS Query 52: Star Schema Join
      – TPC-DS Query 55: Star Schema Join
      Chart series: Hive 0.12, Trunk (Phase 3)
      Chart labels, in extraction order: 41.1s, 39.8s, 4.2s, 4.1s
      Test Cluster: 200 GB Data (Impala: Parquet; Hive: ORCFile); 20 Nodes, 24GB RAM each, 6x disk each
  • 14. Speed: Delivering Interactive Query
      Query Time in Seconds
      – TPC-DS Query 28: Vectorization
      – TPC-DS Query 12: Complex join (M-R-R pattern)
      Chart series: Hive 0.12, Trunk (Phase 3)
      Chart labels, in extraction order: 31s, 22s, 9.8s, 6.7s
      Test Cluster: 200 GB Data (Impala: Parquet; Hive: ORCFile); 20 Nodes, 24GB RAM each, 6x disk each
  • 15. AMPLab Big Data Benchmark
      AMPLab Query 1: Simple Filter Query (Query 1a, 1b, 1c)
      Query Time in Seconds (lower is better)
      Chart series: Hive 0.10 (5 node EC2), Trunk (Phase 3)
      Chart labels, in extraction order: 63s, 63s, 45s, 1.6s, 2.3s, 9.4s
      Stinger Phase 3 Cluster Configuration: AMPLab Data Set (~135 GB Data); 20 Nodes, 24GB RAM each, 6x Disk each
  • 16. AMPLab Big Data Benchmark
      AMPLab Query 2: Group By IP Block and Aggregate (Query 2a, 2b, 2c)
      Query Time in Seconds (lower is better)
      Chart series: Hive 0.10 (5 node EC2), Trunk (Phase 3)
      Chart labels, in extraction order: 552s, 466s, 104.3s, 490s, 118.3s, 172.7s
      Stinger Phase 3 Cluster Configuration: AMPLab Data Set (~135 GB Data); 20 Nodes, 24GB RAM each, 6x Disk each
  • 17. AMPLab Big Data Benchmark
      AMPLab Query 3: Correlate Page Rankings and Revenues Across Time (Query 3a, 3b)
      Query Time in Seconds (lower is better)
      Chart series: Hive 0.10 (5 node EC2), Trunk (Phase 3)
      Chart labels, in extraction order: 490s, 466s, 40s, 145s
      Stinger Phase 3 Cluster Configuration: AMPLab Data Set (~135 GB Data); 20 Nodes, 24GB RAM each, 6x Disk each
  • 18. How Stinger Phase 3 Delivers Interactive Query
      Feature: Description (Benefit)
      – Tez Integration: Tez is a significantly better engine than MapReduce (Latency)
      – Vectorized Query: Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time (Throughput)
      – Query Planner: Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) (Latency)
      – Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics including histograms etc. (Latency)
  • 19. SQL: Enhancing SQL Semantics
      SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
      Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR
      Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries in WHERE, HAVING; Expanded JOIN Syntax; SQL Compliant Security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID)
      Legend (colour coding lost in the transcript): Available, Hive 0.12, Roadmap
  • 20. ORC File Format
      – Columnar format for complex data types
      – Built into Hive from 0.11
      – Support for Pig and MapReduce via HCat
      – Two levels of compression: lightweight type-specific and generic
      – Built in indexes: every 10,000 rows with position information; Min, Max, Sum, Count of each column; supports seek to row number
  • 21. SCALE: Interactive Query at Petabyte Scale
      Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale
      Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster
      File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500 Dataset):
      – Encoded with Text: 585 GB (Original Size)
      – Encoded with RCFile: 505 GB (14% Smaller)
      – Encoded with Parquet (Impala): 221 GB (62% Smaller)
      – Encoded with ORCFile (Hive 12): 131 GB (78% Smaller)
      Notes on ORCFile: Larger Block Sizes; Columnar format arranges columns adjacent within the file for compression & fast access
  • 22. ORC File Format
      – Hive 0.12: Predicate Push Down; Improved run length encoding; Adaptive string dictionaries; Padding stripes to HDFS block boundaries
      – Trunk: Stripe-based Input Splits; Input Split elimination; Vectorized Reader; Customized Pig Load and Store functions
  • 23. Vectorized Query Execution
      – Designed for Modern Processor Architectures: avoid branching in the inner loop; make the most use of L1 and L2 cache
      – How It Works: process records in batches of 1,000 rows; generate code from templates to minimize branching
      – What It Gives: 30x improvement in rows processed per second; initial prototype: 100M rows/sec on laptop
  • 24. HDFS Buffer Cache
      – Use memory mapped buffers for zero copy: avoid overhead of going through DataNode; can mlock the block files into RAM
      – ORC Reader enhanced for zero-copy reads: new compression interfaces in Hadoop
      – Vectorization specific reader: read 1000 rows at a time; read into Hive’s internal representation
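
The idea on the HDFS Buffer Cache slide (memory-mapped buffers read 1000 rows at a time) can be illustrated outside Hadoop with the operating system's own `mmap` facility. The following is a minimal Python sketch, not Hive or HDFS code: the file layout (packed native 64-bit integers) and the helper names `write_column` and `sum_greater_than` are assumptions made only for the illustration.

```python
import array
import mmap
import os
import tempfile

BATCH_ROWS = 1000  # batch size quoted on the slide


def write_column(path, values):
    """Write integers as packed native int64s (hypothetical stand-in for a column file)."""
    with open(path, "wb") as f:
        array.array("q", values).tofile(f)


def sum_greater_than(path, threshold):
    """Sum every value > threshold, scanning the mapped file one batch at a time."""
    total = 0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as raw, \
            raw.cast("q") as column:
        for start in range(0, len(column), BATCH_ROWS):
            # Slicing a memoryview gives another view over the same mapped pages,
            # so the file contents are not copied into an intermediate bytes object.
            with column[start:start + BATCH_ROWS] as batch:
                total += sum(v for v in batch if v > threshold)
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_column(path, range(2500))
        print(sum_greater_than(path, 1999))
```

The views are released before the mapping is closed because Python refuses to close a mapping that still has exported buffers. The real reader described on the slide works against HDFS block files and Hive's internal batch representation, which this sketch does not model.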

Editor's Notes

  1. Query 52: star join followed by group/order (different keys), selective filter. Query 55: same.
  2. Query 28: 4-subquery join. Query 12: star join over range of dates.
  3. Query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
  4. SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
  5. SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
  6. With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
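
A filter such as the one in note 3 (`WHERE pageRank > X`) is the kind of predicate that the ORC slides' built-in indexes (min, max, sum and count kept every 10,000 rows) and predicate pushdown are meant to help with. The Python sketch below illustrates the general idea only, under stated assumptions: `IndexEntry`, `build_index` and `scan_greater_than` are hypothetical names, the column is an in-memory list, and none of this is the ORC reader's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

ROWS_PER_INDEX_ENTRY = 10_000  # index stride quoted on the ORC slide


@dataclass
class IndexEntry:
    start: int    # position of the first row covered by this entry
    count: int    # number of rows covered
    minimum: int
    maximum: int
    total: int    # sum of the covered values


def build_index(column: Sequence[int],
                stride: int = ROWS_PER_INDEX_ENTRY) -> List[IndexEntry]:
    """Record min/max/sum/count for every `stride` rows of the column."""
    entries = []
    for start in range(0, len(column), stride):
        chunk = column[start:start + stride]
        entries.append(IndexEntry(start, len(chunk), min(chunk), max(chunk), sum(chunk)))
    return entries


def scan_greater_than(column: Sequence[int], index: List[IndexEntry],
                      threshold: int) -> Tuple[List[int], int]:
    """Return (values > threshold, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for entry in index:
        if entry.maximum <= threshold:
            continue  # no row in this group can satisfy value > threshold
        groups_read += 1
        rows = column[entry.start:entry.start + entry.count]
        matches.extend(v for v in rows if v > threshold)
    return matches, groups_read
```

With a sorted or clustered column most row groups are skipped without being read; with uniformly scattered values the index prunes little, which is one reason such statistics help some queries far more than others.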