Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORC: 2015
Gopal Vijayaraghavan
ORC – Optimized Row-Columnar File
● Columnar Storage
● Row-groups & Fixed splits
● Protobuf Metadata Storage
● Type-safe Vectorization
● Hive ACID transactions
● Single SerDe for Format
Need for Speed: The Stinger Initiative
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100X
ORC at Facebook
● Compression: Saved more than 1,400 servers' worth of storage. [1]
● Compression: Compression ratio increased from 5x to 8x globally. [1]
ORC at Spotify
● IO: 16x less HDFS read when using ORC versus Avro. [2]
● CPU: 32x less CPU when using ORC versus Avro. [2]
ORC: Today
What is Optimized about ORC?
ORC – Optimized Row-Columnar File
● Columnar Storage
● Row-groups & Stripe splits
● Protobuf Metadata Storage
● Type-safe Vectorization
● Hive ACID transactions
● Single SerDe for Format
Columnar Storage
Storage Performance
● Compress each column differently
● Detect & compress common sub-sequences
● Auto-increment ids
● String Enums
● Large Integers (uid scale)
● Unique strings (UUIDs)
Read Performance
● Column projection
● Columnar deserializers
● Data locality
Write Throughput
● Stats auto-gather
Row-groups & Stripe splits
Split Parallelism
● Effective parallelism
● No seeks to find boundaries
● No splits with zero data
● Decompress fixed chunks
Stripes
● Single unsplittable chunk
● Will reside in 1 HDFS block entirely
● Is self-contained for all read ops
A Single SerDe for all ORC Files
A Single Writer
● No mismatch of serialization
● Forward compatibility
Readers
● Multiple reader implementations
● Allows for vector readers
● And row-mode readers
● Similar loop – good JIT hit-rate
Protobuf Metadata Storage
Standardized Metadata
● Readers are easier to write
● Metadata readers are auto-generated
Metadata Forward Compatibility
● Protobuf Optional fields
Statistics Storage in Metadata
● Standard serialization for stats
● Allows for predicate pushdown (PPD) into the IO layer
Type-safe Vectorization
Schema on Write
● Write ORC Structs with types
● SerDe & InputFormat
Read Performance
● Data is read with few copies
● Primitive types are fast
● Primitives are also unboxed
● Predicates are typed too
ORC: ETL Improvements
Always more new data
ORC (Zlib): Compress Differently
[Chart: Time Taken, ETL for TPC-H LineItem (scale 1 TB). ORC (old zlib): 674; ORC SNAPPY: 389; ORC (new zlib): 433]
Different Zlib algorithms for encoding
● Z_FILTERED
● Z_DEFAULT
● Z_BEST_SPEED
● Z_DEFAULT_COMPRESSION
In detail
● Compress IS_NULL bitsets lightly
● Compress Integers differently from Doubles
● Compress string dictionaries differently
● Allow for user choice
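The idea of compressing each stream kind differently can be sketched with Python's standard `zlib` module, which exposes the same level and strategy constants. This is a minimal illustration, not ORC's writer: the `POLICY` table and the pairing of stream kind to setting are assumptions made for the example, and `Z_DEFAULT_STRATEGY` is used where the slide says `Z_DEFAULT`.

```python
import struct
import zlib


def deflate(data: bytes, level: int, strategy: int) -> bytes:
    """Compress one stream with an explicit zlib level and strategy."""
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                            zlib.DEF_MEM_LEVEL, strategy)
    return comp.compress(data) + comp.flush()


# Hypothetical per-stream policy (an assumption for illustration only).
POLICY = {
    "is_null_bitset": (zlib.Z_BEST_SPEED, zlib.Z_DEFAULT_STRATEGY),
    "integers": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_FILTERED),
    "string_dictionary": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
}


def compress_streams(streams: dict) -> dict:
    """Apply the policy's (level, strategy) pair to each named stream."""
    return {name: deflate(data, *POLICY[name]) for name, data in streams.items()}


streams = {
    "is_null_bitset": bytes([0xFF]) * 4096,
    "integers": struct.pack("<1000q", *range(1000)),
    "string_dictionary": b"\n".join([b"AIR", b"MAIL", b"RAIL", b"SHIP", b"TRUCK"]) * 200,
}
compressed = compress_streams(streams)
for name, blob in compressed.items():
    print(name, len(streams[name]), "->", len(blob))
```

A user-facing override ("Allow for user choice") would amount to replacing entries in the policy table.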
ORC (Zlib): Compress Differently
[Chart: Size on Disk, Data Sizes for TPC-H LineItem (scale 1 TB). ORC (old zlib): 178.5; ORC SNAPPY: 225.1; ORC (new zlib): 172.2]
Using JDK8 SIMD: Integer Writers
Integer encodings
● Base + Delta
● Run-length
● Direct
Trade-off for Size/Speed
● Use fixed bit-width loops
● Snap to nearest bit-width
[Chart: ORC Write Integer Performance (smaller is better). X axis: Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y axis: Mean Time (ms), 0 to 2000. Series: hive 0.13 bitpacking; hive 1.0 bitpacking (new)]
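A small sketch of the size/speed trade-off: subtract a base, find the bits the largest remainder needs, snap up to one of the widths on the chart's x axis, then pack at that fixed width. Treating "Base + Delta" as "minimum value plus offsets" is one reading and an assumption here. The function names are hypothetical and this is not the ORC writer's code.

```python
# Widths taken from the chart's x axis.
SUPPORTED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def snap_width(bits_needed: int) -> int:
    """Round a bit count up to the nearest supported fixed width."""
    for width in SUPPORTED_WIDTHS:
        if width >= bits_needed:
            return width
    raise ValueError("value needs more than 64 bits")


def pack(values, width: int) -> bytes:
    """Pack non-negative ints at a fixed bit width, most significant bit first."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = (-total_bits) % 8  # zero-pad the tail up to a whole byte
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def encode_base_delta(values):
    """Return (base, width, packed offsets from the base)."""
    base = min(values)
    deltas = [v - base for v in values]
    width = snap_width(max(max(deltas).bit_length(), 1))
    return base, width, pack(deltas, width)


base, width, payload = encode_base_delta([1000, 1003, 1001, 1007])
print(base, width, payload.hex())
```

Snapping 3 bits up to 4 wastes a little space, but every value then sits at a predictable offset, which is what lets a loop run at a fixed width.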
Double Writers
[Chart: ORC Write Double Performance (smaller is better). Mean Time (ms) by double write mode. old: 273.331; buffered + BE: 247.634; buffered + LE: 231.741]
● JVM is big-endian
● X86 is little-endian
● Special handling of NaN
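The byte-order point can be illustrated with Python's `struct` module, which lets the caller choose big-endian or little-endian layout when serializing doubles into a buffer. This is only a sketch of the two layouts, under the assumption that a buffered writer emits whole batches at once. It does not reproduce what the special NaN handling in ORC actually is; it only shows that NaN round trips need `math.isnan` rather than `==`.

```python
import math
import struct


def write_doubles(values, little_endian: bool = True) -> bytes:
    """Serialize a batch of doubles into one buffer in the chosen byte order."""
    fmt = ("<" if little_endian else ">") + f"{len(values)}d"
    return struct.pack(fmt, *values)


def read_doubles(buf: bytes, little_endian: bool = True):
    """Inverse of write_doubles; the reader must use the same byte order."""
    count = len(buf) // 8
    fmt = ("<" if little_endian else ">") + f"{count}d"
    return list(struct.unpack(fmt, buf))


le = write_doubles([1.0], little_endian=True)
be = write_doubles([1.0], little_endian=False)
print(le.hex(), be.hex())
```

On a little-endian CPU the little-endian layout matches the in-register representation, so no byte swap is needed per value.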
ORC: Scale compression buffers
[Chart: File Size. X axis: Compression Buffer Size in KB (8, 16, 32, 64, 128, 256); Y axis: Size in MB. Series: ZLIB, SNAPPY. Data labels, in buffer-size order: 269.4, 263.3, 258.5, 258.4, 258.4, 258.4 and 184.8, 183.5, 182.2, 180.1, 178.3, 177.4]
Large Columns vs More Columns
● Adjust when >1000 columns
Trade-offs
● Compression
● Low memory use
More additions
● Dynamically partitioned insert
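The buffer-size trade-off can be reproduced in miniature: compress the same data in independent chunks of different sizes and compare totals. The sketch assumes each buffer is compressed on its own, so a smaller buffer gives the compressor less history but needs less memory per open column. The sample data and function name are made up for illustration.

```python
import zlib


def compressed_size(data: bytes, buffer_kb: int) -> int:
    """Total size when every buffer_kb-sized chunk is compressed independently."""
    chunk = buffer_kb * 1024
    return sum(len(zlib.compress(data[i:i + chunk]))
               for i in range(0, len(data), chunk))


# Synthetic, repetitive row data (hypothetical; not TPC-H).
sample = b"".join(b"%d|%d|shipmode-%d\n" % (i, i * 7 % 1000, i % 5)
                  for i in range(60000))

sizes = {kb: compressed_size(sample, kb) for kb in (8, 16, 32, 64, 128, 256)}
for kb, size in sizes.items():
    print(f"{kb:>3} KB buffers -> {size} bytes")
```

With more than 1000 columns, the per-column buffers add up, which is why the slide suggests adjusting the size downward at that point.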
ORC: Streaming Ingest + ACID
● Broken pattern: Partitions for Atomicity
● Isolation & Consistency on retries
● Transactions are pluggable (txn.manager)
● Cache/Replication friendly (base + deltas)
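The "base + deltas" layout can be pictured as merge-on-read: an immutable base plus small per-transaction delta files, combined at read time and filtered by which transactions the reader may see. The sketch below is a generic illustration under that assumption. The record shapes, operation names and `committed` set are hypothetical and do not describe Hive's on-disk format.

```python
def merge_base_and_deltas(base, deltas, committed):
    """Merge an immutable base with per-transaction deltas.

    base:      dict of row_id -> row
    deltas:    list of (txn_id, [(op, row_id, row), ...])
    committed: set of txn_ids visible to this reader
    """
    rows = dict(base)  # the base itself is never modified
    for txn_id, events in sorted(deltas, key=lambda d: d[0]):
        if txn_id not in committed:
            continue  # isolation: skip in-flight or aborted transactions
        for op, row_id, row in events:
            if op == "delete":
                rows.pop(row_id, None)
            else:  # "insert" or "update"
                rows[row_id] = row
    return rows


base = {1: "a", 2: "b"}
deltas = [
    (11, [("update", 2, "b2"), ("insert", 3, "c")]),
    (12, [("delete", 1, None)]),
    (13, [("insert", 4, "d")]),  # not committed yet
]
print(merge_base_and_deltas(base, deltas, committed={11, 12}))
```

Because the base never changes, it can be cached or replicated freely; only the small deltas are new.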
ORC: LLAP and Sub-second
ORC – Pushing for Sub-second
ORC: Row Indexes
Min-Max pruning
● Evaluate on statistics
Bloom filters
● Better String filters
● Filter a random distribution
LLAP Future
● Row-level vector SARGs
[Chart: Rows Read (log scale) for "from tpch_1000.lineitem where l_orderkey = 1212000001;". No Indexes: 5,999,989,709; Min-Max Indexes: 540,000; Bloomfilter Indexes: 10,000]
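The two index kinds can be sketched as a two-stage test per row group: skip the group if the key is outside its min/max range, then skip it if a Bloom filter says the key is definitely absent. This is a minimal sketch, not ORC's index format. The `BloomFilter` class, its sizes and the SHA-256 based hashing are assumptions chosen for a short, dependency-free example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(row_groups):
    """Collect min, max and a Bloom filter for each row group."""
    index = []
    for rows in row_groups:
        bloom = BloomFilter()
        for v in rows:
            bloom.add(v)
        index.append({"min": min(rows), "max": max(rows), "bloom": bloom})
    return index


def groups_to_read(index, key, use_bloom: bool):
    """Return the row-group numbers that still have to be read for key."""
    selected = []
    for i, stats in enumerate(index):
        if not (stats["min"] <= key <= stats["max"]):
            continue  # min-max pruning
        if use_bloom and not stats["bloom"].might_contain(key):
            continue  # Bloom filter pruning
        selected.append(i)
    return selected


# Randomly distributed keys: every group's range covers the key.
index = build_index([[5, 900, 42], [7, 850, 300], [1, 999, 1212]])
print(groups_to_read(index, 300, use_bloom=False))
print(groups_to_read(index, 300, use_bloom=True))
```

When values are randomly distributed, min-max ranges overlap and prune little, which is the case the slide assigns to Bloom filters.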
ORC: Row Indexes
[Chart: Time Taken in seconds (smaller is better) for "* from tpch_1000.lineitem where l_orderkey=1212000001;". No Indexes: 74; Min-Max Indexes: 4.5; Bloomfilter Indexes: 1.34]
ORC: JDK8 SIMD Readers
Integer encodings
● Base + Delta
● Run-length
● Direct
Trade-off for Size/Speed
● Use fixed bit-width loops
● Snap to nearest bit-width
[Chart: ORC Read Integer Performance. X axis: Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y axis: Mean Time (ms), 0 to 1800. Series: hive 0.13 unpacking; hive 1.0 unpacking (new)]
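On the read side, a fixed width means every value sits at a known bit offset, and byte-aligned widths can skip bit arithmetic altogether. The sketch below assumes values were packed most significant bit first with zero padding at the tail. The function name and the byte-aligned fast path are illustrative, not the Hive reader.

```python
def unpack(data: bytes, width: int, count: int):
    """Decode `count` fixed-width unsigned ints packed MSB-first into `data`."""
    if width % 8 == 0:
        # Byte-aligned widths (8, 16, 24, ...): plain slicing, no shifting.
        step = width // 8
        return [int.from_bytes(data[i * step:(i + 1) * step], "big")
                for i in range(count)]
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    out = []
    for i in range(count):
        shift = total_bits - width * (i + 1)
        out.append((acc >> shift) & mask)
    return out


# Four 4-bit offsets packed into two bytes, then re-based (base is hypothetical).
offsets = unpack(bytes([0x03, 0x17]), width=4, count=4)
values = [1000 + d for d in offsets]
print(offsets, values)
```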
ORC: Vectorization + SIMD
Advantage of a Single SerDe
● Primitive Types
Allocation free tight inner loops
● JDK8 has auto-vectorization
Vectorized Early Filter
● Vectors can be filtered early in ORC
● StringDictionary can be used to binary-search
Vectorized SIMD Join
● Performance for single key joins
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
0x00007f13d2e6afc4: vmovdqu %ymm2,0x10(%rdx,%rax,8)
0x00007f13d2e6afca: vaddpd %ymm1,%ymm3,%ymm2
0x00007f13d2e6afce: vmovdqu %ymm2,0x30(%rdx,%r10,8) ;*dastore vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
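The early-filter bullet about the string dictionary can be sketched as follows: with a sorted dictionary, an equality predicate is resolved once by binary search, and the per-row work becomes an integer comparison on dictionary codes. This is an illustrative sketch with hypothetical function names. It assumes the dictionary is sorted and each row stores the index of its value.

```python
from bisect import bisect_left


def dictionary_encode(values):
    """Return (sorted dictionary, per-row codes into that dictionary)."""
    dictionary = sorted(set(values))
    code_of = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def filter_equals(dictionary, codes, needle):
    """Row numbers whose value equals `needle`, without comparing strings per row."""
    pos = bisect_left(dictionary, needle)
    if pos == len(dictionary) or dictionary[pos] != needle:
        return []  # value absent from the dictionary: the whole batch is skipped
    return [row for row, code in enumerate(codes) if code == pos]


dictionary, codes = dictionary_encode(["SHIP", "AIR", "RAIL", "AIR", "TRUCK"])
print(dictionary, codes)
print(filter_equals(dictionary, codes, "AIR"))
```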
ORC: Split Strategies + Tez Grouping
Amdahl’s Law
● As fast as the slowest task
● Slice work thinly, but not too thin
Split-generation vs Execution time
● ETL
● BI
● Hybrid
Split-grouping & estimation
● ColumnarSplit size
● Group by estimate, not file size
● Bucket pruning
[Figure: Slow split]
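"Group by estimate, not file size" can be sketched as a greedy grouping that sums only the bytes of the columns a query projects. This is a hedged illustration, not the Tez grouping algorithm. The split records, `column_bytes` field and `target_bytes` parameter are hypothetical.

```python
def estimate_bytes(split, projected_columns):
    """Bytes a task would actually read: only the projected columns count."""
    return sum(split["column_bytes"][c] for c in projected_columns)


def group_splits(splits, projected_columns, target_bytes):
    """Greedily pack consecutive splits until the estimate exceeds the target."""
    groups, current, current_size = [], [], 0
    for split in splits:
        size = estimate_bytes(split, projected_columns)
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(split["name"])
        current_size += size
    if current:
        groups.append(current)
    return groups


splits = [{"name": f"s{i}", "column_bytes": {"l_orderkey": 10, "l_comment": 90}}
          for i in range(4)]
print(group_splits(splits, ["l_orderkey"], target_bytes=25))
print(group_splits(splits, ["l_orderkey", "l_comment"], target_bytes=25))
```

The same files yield fewer, larger groups when a query touches only a narrow column, which keeps tasks from being sliced too thin.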
ORC: LLAP
● JIT Performance for short queries
● Row-group level caching
● Asynchronous IO Elevator
● Multi-threaded Column Vector processing
ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
Questions?
Interested? Stop by the Hortonworks booth to learn more
Endnotes
[1] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[2] http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014

More Related Content

What's hot

ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013Owen O'Malley
 
Local Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixLocal Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixRajeshbabu Chintaguntla
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxData
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Strongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixStrongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixYugabyteDB
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetDataWorks Summit
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OGeorge Cao
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLScyllaDB
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 

What's hot (20)

ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
 
Local Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixLocal Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache Phoenix
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 
Strongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache PhoenixStrongly Consistent Global Indexes for Apache Phoenix
Strongly Consistent Global Indexes for Apache Phoenix
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 

Viewers also liked

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerDataWorks Summit
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveDataWorks Summit
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015alanfgates
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Differences of Deep Learning Frameworks
Differences of Deep Learning FrameworksDifferences of Deep Learning Frameworks
Differences of Deep Learning FrameworksSeiya Tokui
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Differences of Deep Learning Frameworks
Differences of Deep Learning FrameworksDifferences of Deep Learning Frameworks
Differences of Deep Learning Frameworks
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 

Similar to ORC 2015

Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 editionBob Ward
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus SDN/OpenFlow switch
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFTDataWorks Summit
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Community
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...MongoDB
 

Similar to ORC 2015 (20)

ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Oracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 serveryOracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 servery
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
Ceph Day Berlin: Deploying Flash Storage for Ceph without Compromising Perfor...
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Ceph
CephCeph
Ceph
 
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...
Transforming your Business with Scale-Out Flash: How MongoDB & Flash Accelera...
 

More from t3rmin4t0r

Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Deadt3rmin4t0r
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetupt3rmin4t0r
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
TEZ-8 UI Walkthrough
TEZ-8 UI WalkthroughTEZ-8 UI Walkthrough
TEZ-8 UI Walkthrought3rmin4t0r
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 


ORC 2015

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved ORC: 2015 Gopal Vijayaraghavan
  • 2. ORC – Optimized Row-Columnar File: Columnar Storage; Row-groups & Fixed splits; Protobuf Metadata Storage; Type-safe Vectorization; Hive ACID transactions; Single SerDe for Format
  • 3. Need for Speed: The Stinger Initiative. Stinger: An Open Roadmap to improve Apache Hive’s performance 100x. Launched: February 2013; Delivered: April 2014. Delivered in 100% Apache Open Source. Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100X
  • 4. ORC at Facebook. Compression: Saved more than 1,400 servers' worth of storage. Compression: Compression ratio increased from 5x to 8x globally. [1]
  • 5. ORC at Spotify. IO: 16x less HDFS read when using ORC versus Avro.(5) CPU: 32x less CPU when using ORC versus Avro.(5) [2]
  • 6. ORC: Today. What is Optimized about ORC?
  • 7. ORC – Optimized Row-Columnar File: Columnar Storage; Row-groups & Stripe splits; Protobuf Metadata Storage; Type-safe Vectorization; Hive ACID transactions; Single SerDe for Format
  • 8. Columnar Storage. Storage Performance: ● Compress each column differently ● Detect & compress common sub-sequences ● Auto-increment ids ● String Enums ● Large Integers (uid scale) ● Unique strings (UUIDs). Read Performance: ● Column projection ● Columnar deserializers ● Data locality. Write Throughput: ● Stats auto-gather
  • 9. Row-groups & Stripe splits. Split Parallelism: ● Effective parallelism ● No seeks to find boundaries ● No splits with zero data ● Decompress fixed chunks. Stripes: ● Single unsplittable chunk ● Will reside in 1 HDFS block entirely ● Is self-contained for all read ops
  • 10. A Single SerDe for all ORC Files. A Single Writer: ● No mismatch of serialization ● Forward compatibility. Readers: ● Multiple reader implementations ● Allows for vector readers ● And row-mode readers ● Similar loop, good JIT hit-rate
  • 11. Protobuf Metadata Storage. Standardized Metadata: ● Readers are easier to write ● Metadata readers are auto-generated. Metadata Forward Compatibility: ● Protobuf Optional fields. Statistics Storage in Metadata: ● Standard serialization for stats ● Allows for PPD (predicate push-down) into the IO layer
  • 12. Type-safe Vectorization. Schema on Write: ● Write ORC Structs with types ● SerDe & InputFormat. Read Performance: ● Data is read with few copies ● Primitive types are fast ● Primitives are also unboxed ● Predicates are typed too
  • 13. ORC: ETL Improvements. Always more new data
  • 14. ORC (Zlib): Compress Differently. Different Zlib algorithms for encoding: ● Z_FILTERED ● Z_DEFAULT ● Z_BEST_SPEED ● Z_DEFAULT_COMPRESSION. In detail: ● Compress IS_NULL bitsets lightly ● Compress Integers differently from Doubles ● Compress string dictionaries differently ● Allow for user choice. [Chart: ETL for TPC-H LineItem (scale 1 TB), Time Taken: ORC (old zlib) 674; ORC SNAPPY 389; ORC (new zlib) 433]
  • 15. ORC (Zlib): Compress Differently. Different Zlib algorithms for encoding: ● Z_FILTERED ● Z_DEFAULT ● Z_BEST_SPEED ● Z_DEFAULT_COMPRESSION. In detail: ● Compress IS_NULL bitsets lightly ● Compress Integers differently from Doubles ● Compress string dictionaries differently ● Allow for user choice. [Chart: Data Sizes for TPC-H LineItem (scale 1 TB), Size on Disk: ORC (old zlib) 178.5; ORC SNAPPY 225.1; ORC (new zlib) 172.2]
  • 16. Using JDK8 SIMD: Integer Writers. Integer encodings: ● Base + Delta ● Run-length ● Direct. Trade-off for Size/Speed: ● Use fixed bit-width loops ● Snap to nearest bit-width. [Chart: ORC Write Integer Performance (smaller is better); mean time (ms) vs. bit width (1 to 64); series: hive 0.13 bitpacking, hive 1.0 bitpacking (new)]
  • 17. Double Writers: ● JVM is big-endian ● X86 is little-endian ● Special handling of NaN. [Chart: ORC Write Double Performance (smaller is better); mean time (ms) by double write mode: old 273.331; buffered + BE 247.634; buffered + LE 231.741]
  • 18. ORC: Scale compression buffers. Large Columns vs More Columns: ● Adjust when >1000 columns. Trade-offs: ● Compression ● Low memory use. More additions: ● Dynamically partitioned insert. [Chart: File Size in MB vs. Compression Buffer Size in KB (8, 16, 32, 64, 128, 256); series: ZLIB, SNAPPY; plotted values 269.4, 263.3, 258.5, 258.4, 258.4, 258.4 and 184.8, 183.5, 182.2, 180.1, 178.3, 177.4]
  • 19. ORC: Streaming Ingest + ACID. (-) Broken pattern: Partitions for Atomicity. (+) Isolation & Consistency on retries. (+) Transactions are pluggable (txn.manager). (+) Cache/Replication friendly (base + deltas)
  • 20. ORC: LLAP and Sub-second. ORC – Pushing for Sub-second
  • 21. ORC: Row Indexes. Min-Max pruning: ● Evaluate on statistics. Bloom filters: ● Better String filters ● Filter a random distribution. LLAP Future: ● Row-level vector SARGs. [Chart: Rows Read (log scale) for "from tpch_1000.lineitem where l_orderkey = 1212000001;": No Indexes 5,999,989,709; Min-Max Indexes 540,000; Bloomfilter Indexes 10,000]
  • 22. ORC: Row Indexes. Min-Max pruning: ● Evaluate on statistics. Bloom filters: ● Better String filters ● Filter a random distribution. LLAP Future: ● Row-level vector SARGs. [Chart: Time Taken in seconds (smaller is better) for "* from tpch_1000.lineitem where l_orderkey=1212000001;": No Indexes 74; Min-Max Indexes 4.5; Bloomfilter Indexes 1.34]
  • 23. ORC: JDK8 SIMD Readers. Integer encodings: ● Base + Delta ● Run-length ● Direct. Trade-off for Size/Speed: ● Use fixed bit-width loops ● Snap to nearest bit-width. [Chart: ORC Read Integer Performance; mean time (ms) vs. bit width (1 to 64); series: hive 0.13 unpacking, hive-1.0 unpacking (new)]
  • 24. ORC: Vectorization + SIMD. Advantage of a Single SerDe: ● Primitive Types Allocation free tight inner loops ● JDK8 has auto-vectorization. Vectorized Early Filter: ● Vectors can be filtered early in ORC ● StringDictionary can be used to binary-search. Vectorized SIMD Join: ● Performance for single key joins. Disassembly shown on the slide:
      0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
      0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
      0x00007f13d2e6afba: movslq %eax,%r10
      0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
      0x00007f13d2e6afc4: vmovdqu %ymm2,0x10(%rdx,%rax,8)
      0x00007f13d2e6afca: vaddpd %ymm1,%ymm3,%ymm2
      0x00007f13d2e6afce: vmovdqu %ymm2,0x30(%rdx,%r10,8) ;*dastore vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
  • 25. ORC: Split Strategies + Tez Grouping. Amdahl’s Law: ● As fast as the slowest task ● Slice work thinly, but not too thin. Split-generation vs Execution time: ● ETL ● BI ● Hybrid. Split-grouping & estimation: ● ColumnarSplit size ● Group by estimate, not file size ● Bucket pruning. [Figure label: Slow split]
  • 26. ORC: LLAP. JIT Performance for short queries; Row-group level caching; Asynchronous IO Elevator; Multi-threaded Column Vector processing
  • 27. ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
  • 28. Questions? Interested? Stop by the Hortonworks booth to learn more
  • 29. Endnotes: [1] https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ [2] http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
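
The min-max pruning described on the Row Indexes slides (21 and 22) can be illustrated with a small sketch. This is not ORC's reader code: the `RowGroupStats` class, the helper names, and the 10,000-row group size are assumptions made for illustration only. The idea is that each row group records the minimum and maximum of a column, and an equality predicate only needs to read the groups whose range could contain the key.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Min/max statistics for one row group of one column (hypothetical structure)."""
    start_row: int
    num_rows: int
    min_value: int
    max_value: int


def build_stats(values: List[int], rows_per_group: int) -> List[RowGroupStats]:
    """Split a column into fixed-size row groups and record min/max for each."""
    stats = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        stats.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return stats


def groups_to_read(stats: List[RowGroupStats], key: int) -> List[RowGroupStats]:
    """Keep only the row groups whose [min, max] range could contain the key."""
    return [s for s in stats if s.min_value <= key <= s.max_value]


def rows_read(stats: List[RowGroupStats], key: int) -> int:
    """Number of rows that must be scanned after min-max pruning."""
    return sum(s.num_rows for s in groups_to_read(stats, key))


# A sorted key column prunes well: only one group's range covers the key.
keys = list(range(100_000))
stats = build_stats(keys, rows_per_group=10_000)
print(rows_read(stats, 42_000))  # 10000
```

Min-max pruning works best when the column is sorted or clustered; for a randomly distributed column most groups span nearly the full value range and few can be skipped, which is the case the slides give for adding Bloom filters.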