Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORC 2015: Faster, Better, Smaller
Prasanth Jayachandran
Apache Hive Team, Hortonworks
@prasanth_j
Apache ORC – Optimized Row-Columnar File
Apache TLP – orc.apache.org
Type Specific Encodings
Came out of Apache Hive
Vectorized Readers (Java, C++)
Projection and Predicate Pushdown
Columnar Storage
Block Compression
Hive ACID transactions
Single SerDe Format
Protobuf Metadata Storage
ORC: Format Specification
How does ORC store data?
ORC File Layout
 File Footer and Postscript
 Stripes
 Indexes (row group indexes and Bloom filters, interleaved)
 Min/Max stats, positions for every 10K rows
 Data
 Multiple streams per column, encoded and compressed independently
 Stripe Footer
 Locations of streams, type of encoding
 Full specification at (1)
ORC Writer
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
 One tree writer per flattened column
 Multiple streams per column
 PRESENT
 DATA
 LENGTH
 DICTIONARY_DATA
 SECONDARY
 ROW_INDEX
 BLOOM_FILTER
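As a rough illustration of "one tree writer per flattened column", the sketch below walks the example schema in pre-order and gives each type node the next column id, which is how the ORC specification numbers columns. The nested-tuple representation and the `flatten` helper are hypothetical and exist only for this example; they are not part of any ORC library.

```python
# Hypothetical nested representation of the slide's schema: (kind, [(name, child), ...]).
SCHEMA = ("struct", [
    ("i", ("int", [])),
    ("m", ("map", [
        ("k", ("string", [])),
        ("v", ("struct", [
            ("s", ("string", [])),
            ("d", ("double", [])),
        ])),
    ])),
    ("t", ("time", [])),
])


def flatten(node, path="root", out=None):
    """Pre-order walk: every type node receives the next column id."""
    if out is None:
        out = []
    kind, children = node
    out.append((len(out), path, kind))
    for name, child in children:
        flatten(child, f"{path}.{name}", out)
    return out


columns = flatten(SCHEMA)
for col_id, path, kind in columns:
    print(col_id, path, kind)
```

For this schema the walk yields eight columns (ids 0 through 7), so a writer would hold eight tree writers, each emitting its own subset of the streams listed above.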
ORC Data Streams
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
 Streams can be suppressed.
 Example: PRESENT stream is suppressed when all values in a stripe are non-null.
[Diagram: IS_PRESENT, DATA, DICTIONARY, LENGTH, and SECONDARY streams; compression buffers]
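Stream suppression can be pictured with a minimal sketch, assuming a column is just a Python list in which `None` marks a null. The `split_streams` helper is hypothetical and ignores encoding and compression; it only shows that the PRESENT stream is omitted when there is nothing for it to record.

```python
def split_streams(values):
    """Split a column into PRESENT and DATA streams; omit PRESENT if no nulls."""
    present = [v is not None for v in values]
    data = [v for v in values if v is not None]
    streams = {"DATA": data}
    if not all(present):
        # Only materialize the null bitmap when at least one value is null.
        streams["PRESENT"] = present
    return streams


print(split_streams([1, 2, 3]))
print(split_streams([1, None, 3]))
```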
ORC: Features Timeline
How has ORC improved over time?
Timeline
February 2013
 Stinger Initiative Announcement*
 Roadmap to improve Apache Hive’s
performance by 100x
 Delivered in 100% Apache Open Source
* http://hortonworks.com/blog/100x-faster-hive/
SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100x
Timeline
March 2013
Optimized Row Columnar (ORC)
file format committed to Hive
 Hive version: 0.11
 Native data format in Hive
Timeline
March 2013
Predicate Pushdown
 SARG interface
 Prune stripes and row groups
based on min/max statistics
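The pruning idea can be sketched as follows, assuming an equality predicate on an integer column. `RowGroupStats` and `select_row_groups` are hypothetical names for this example and do not reflect the SARG interface itself.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int
    maximum: int


def may_contain_equal(stats, literal):
    """A row group can only hold `literal` if it lies inside [min, max]."""
    return stats.minimum <= literal <= stats.maximum


def select_row_groups(index, literal):
    """Return the positions of row groups that still have to be read."""
    return [i for i, s in enumerate(index) if may_contain_equal(s, literal)]


index = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(150, 300)]
print(select_row_groups(index, 160))
```

Min/max statistics can only rule a row group out; a group that passes the check may still contain no matching row.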
Improved Run Length Encoding
 Tighter bit packing
 Longer runs
 DELTA, SHORT_REPEATS,
DIRECT, PATCHED_BASE
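To see why a delta-style encoding allows tighter bit packing, consider this minimal sketch. It is not the ORC wire format; `delta_encode`, `delta_decode`, and `bit_width` are hypothetical helpers, and the timestamps are made-up sample values.

```python
def delta_encode(values):
    """Return (base, deltas) for a non-empty integer sequence."""
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas


def delta_decode(base, deltas):
    """Rebuild the original sequence from the base value and the deltas."""
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def bit_width(values):
    """Bits needed to hold the largest non-negative value (at least 1)."""
    return max(1, max(v.bit_length() for v in values))


timestamps = [1_400_000_000 + 3 * i for i in range(8)]
base, deltas = delta_encode(timestamps)
direct_bits = bit_width(timestamps)
delta_bits = bit_width(deltas)
print(direct_bits, delta_bits)
```

Storing the values directly needs 31 bits each, while the deltas fit in 2 bits each plus one base value.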
Run Length Encoding Improvements
Dataset                                           RLE (hive 0.11)                        RLE (hive >= 0.12)
                                                  Ratio    Encode (ms)  Decode (ms)      Ratio    Encode (ms)  Decode (ms)
Twitter Census API ID (24,556,361 records)        2.32     1770         1263             6.97     1558         864
HTTP Archive (bytes.json)                         79.4     198          191              200.82   263          125
Github Archive (root.payload.name.txt.dict-len)   114.05   21           15               260.73   23           15
AOL Querylog Epoch (36,389,577 records)           2.51     553          364              3.7      652          246
(Ratio = compression ratio; Encode/Decode = encoding and decoding time in ms)
Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx
Timeline
April 2013
Vectorized ORC readers
 Read and process columns in
batches of size 1024
Null stream suppression
 Suppress PRESENT stream
if no nulls in a stripe
 Enables fast path in vectorization
June 2013
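The interaction between batch reads and null stream suppression can be sketched as below. Only the batch size of 1024 comes from the slide; `batches` and `add_constant` are hypothetical helpers, and plain Python lists stand in for column vectors.

```python
BATCH_SIZE = 1024


def batches(column, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def add_constant(values, present, constant):
    """Add a constant to a batch; `present` is None when the batch has no nulls."""
    if present is None:
        # Fast path: no per-row null check inside the loop.
        return [v + constant for v in values]
    return [v + constant if p else None for v, p in zip(values, present)]


column = list(range(2500))
sizes = [len(b) for b in batches(column)]
print(sizes)
print(add_constant([1.0, 2.0], None, 1.0))
print(add_constant([1.0, 0.0], [True, False], 1.0))
```

When the PRESENT stream is suppressed, the reader knows up front that a batch has no nulls and can take the branch-free path.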
Timeline
October 2013
Statistics Interface
 Writer – Update statistics during load time
 Reader – ANALYZE TABLE .. NOSCAN
Split Elimination
 Stripe level column statistics
 Eliminate stripes that do not satisfy
predicate conditions
November 2013
Timeline
February 2014
Zero copy read path
 HDFS caching APIs to read directly into
memory without extra data copies
Serialization Improvements
 Bit width alignment (trade-off space
for speed)
 Unrolled bit packing and unpacking
 Buffered double reader and writer
June 2014
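Bit width alignment can be illustrated with a small sketch: instead of packing values at their exact width, the width is rounded up to the nearest "aligned" width, wasting a few bits but making packing and unpacking loops simpler. The set of aligned widths below (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64) is taken from the bit widths benchmarked in this deck; that the writer uses exactly this set is an assumption, and `closest_aligned_width` is a hypothetical helper.

```python
ALIGNED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def closest_aligned_width(bits):
    """Round a bit width up to the nearest aligned width."""
    for width in ALIGNED_WIDTHS:
        if bits <= width:
            return width
    raise ValueError(f"unsupported bit width: {bits}")


for bits in (3, 9, 17, 33, 64):
    print(bits, "->", closest_aligned_width(bits))
```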
Serialization Improvements
[Chart: ORC Read Integer Performance (smaller is better). X-axis: bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64). Y-axis: mean time (ms), 0 to 1800. Series: hive 0.13 unpacking; hive-1.0 unpacking (new).]
Serialization Improvements
[Chart: ORC Read Double Performance (smaller is better). Mean time (ms) by double read mode: hive <= 0.13: 241.679; buffered + BE: 171.045; buffered + LE: 174.163. ~1.4x improvement.]
Timeline
June 2014
Adaptive compression buffer size
 For tables with more than 1000 columns, adjust the compression buffer size based on available memory
 Avoids wide table OOMs
Fast stripe level file merging
 Many small files to few large files
 No Decompression, No Decoding
 ALTER TABLE … CONCATENATE
July 2014
Fast File Merging
[Chart: ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better). Total time in seconds by CONCAT-supporting file format.]
Format   Load Time (s)   Merge Time (s)   Total (s)
ORC      1091            245              1336
RCFile   651             816              1467
~3.33x improvement in merge time
Timeline
July 2014
ORC Padding Improvements
 Pad bytes to avoid remote HDFS reads
 Last stripe is adjusted to fit within HDFS
block boundary (worst case: 5% wastage)
Decouple stripe size vs block size
 Smaller stripes (64MB)
 More stripes per block (4 per block)
 Better parallelism & split elimination
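The padding decision can be sketched roughly as follows. This assumes the 5% tolerance is expressed as a fraction of the stripe size and that the block is 256 MB, which matches 4 stripes of 64 MB per block; `plan_next_stripe` is a hypothetical helper, not the writer's actual code.

```python
MB = 1024 * 1024
BLOCK_SIZE = 256 * MB   # assumed HDFS block size
STRIPE_SIZE = 64 * MB   # stripe size from the slide


def plan_next_stripe(available_in_block, stripe_size=STRIPE_SIZE, tolerance=0.05):
    """Return (padding_bytes, next_stripe_bytes) for the space left in a block."""
    if available_in_block >= stripe_size:
        # A full stripe still fits in the current block.
        return 0, stripe_size
    if available_in_block > tolerance * stripe_size:
        # Enough room to be worth a smaller stripe that ends at the boundary.
        return 0, available_in_block
    # Too little room: pad it out and start a full stripe in the next block.
    return available_in_block, stripe_size


print(BLOCK_SIZE // STRIPE_SIZE)
print(plan_next_stripe(100 * MB))
print(plan_next_stripe(10 * MB))
print(plan_next_stripe(2 * MB))
```

Under these assumptions the padding never exceeds 5% of a stripe, and no stripe straddles a block boundary, so a reader never needs a second (possibly remote) block to read one stripe.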
Timeline
September 2014
String Dictionary Improvements
 Row group level checking
 Remember decision across stripes
 Avoids expensive RBTree insertions
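A minimal sketch of the row-group-level check, assuming the writer abandons dictionary encoding once the ratio of distinct values to rows exceeds a threshold. The `StringColumnWriter` class and the 0.8 threshold are assumptions for illustration, and a Python set stands in for the red-black tree.

```python
class StringColumnWriter:
    """Tracks distinct strings and decides whether a dictionary is worthwhile."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold      # assumed distinct/rows cut-off
        self.use_dictionary = True      # decision is remembered once made
        self.dictionary = set()
        self.rows = 0

    def add(self, value):
        self.rows += 1
        if self.use_dictionary:
            # Only pay for dictionary insertions while the dictionary is in use.
            self.dictionary.add(value)

    def check_row_group(self):
        """Called at a row-group boundary; may switch to direct encoding."""
        if self.use_dictionary and self.rows > 0:
            if len(self.dictionary) / self.rows > self.threshold:
                self.use_dictionary = False
                self.dictionary.clear()
        return self.use_dictionary


unique = StringColumnWriter()
for i in range(1000):
    unique.add(f"id-{i}")
print(unique.check_row_group())

repeated = StringColumnWriter()
for i in range(1000):
    repeated.add(f"state-{i % 10}")
print(repeated.check_row_group())
```

Because `use_dictionary` stays `False` after the switch, later rows (and, in this sketch, later stripes written by the same object) skip the dictionary insertions entirely.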
String Dictionary Improvements
[Chart: String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Load time in seconds by Hive version: hive <= 0.13: 767; hive > 0.13: 540. ~1.4x improvement.]
Timeline
September 2014
Improved ZLIB compression
 Different streams compressed with
different zlib strategies/levels
 Compress integers and doubles
differently
 Data and Dictionary stream
- Looks for smaller byte patterns
 All other streams
- Less LZ77, More Huffman
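Per-stream strategies can be tried out with Python's `zlib` module, which exposes the same zlib strategy constants. The mapping below is an interpretation of the slide: the default strategy for DATA and DICTIONARY_DATA streams, and `Z_FILTERED` (less string matching, more Huffman coding) for everything else. The `compress_stream` helper and the choice of default compression level are assumptions.

```python
import zlib

# Assumed mapping; stream kinds not listed here fall back to Z_FILTERED.
STRATEGY_BY_STREAM = {
    "DATA": zlib.Z_DEFAULT_STRATEGY,
    "DICTIONARY_DATA": zlib.Z_DEFAULT_STRATEGY,
}


def compress_stream(kind, payload, level=zlib.Z_DEFAULT_COMPRESSION):
    """Compress one stream with a strategy chosen by its kind."""
    strategy = STRATEGY_BY_STREAM.get(kind, zlib.Z_FILTERED)
    compressor = zlib.compressobj(
        level, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, strategy
    )
    return compressor.compress(payload) + compressor.flush()


payload = bytes(range(256)) * 64
for kind in ("DATA", "LENGTH"):
    blob = compress_stream(kind, payload)
    print(kind, len(payload), "->", len(blob))
```

The strategy only changes how the compressor searches for matches; the output is ordinary zlib data, so the reader decompresses every stream the same way.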
ZLIB Improvements
[Chart: Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Data size in GB by file format + compression codec: ORC + old ZLIB: 178.5; ORC + new ZLIB: 172.2; ORC + SNAPPY: 225.1. New ZLIB is ~4% smaller than old ZLIB and ~1.3x smaller than SNAPPY.]
ZLIB Improvements
[Chart: Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Load time by file format + compression codec: ORC + old ZLIB: 674; ORC + new ZLIB: 433; ORC + SNAPPY: 389. ~1.6x improvement over old ZLIB; only ~10% slower than SNAPPY.]
Timeline
September 2014
ACID transactions
 Order of millions of rows
 Not designed for OLTP requirements
 Streaming Ingest via Flume or Storm
 Atomically add base and delta directories
 Minor compaction – Merge many delta files
 Major compaction – Re-write base files to
incorporate delta file changes
Broken pattern: Add Partitions for Atomicity
Timeline
January 2015
hasNull flag in ORC internal index
 Better pruning of row groups
 Improves the performance of
SELECT .. WHERE column IS NULL;
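The effect of the flag on an `IS NULL` predicate can be sketched as below. `RowGroupEntry` and `row_groups_for_is_null` are hypothetical names; the point is only that min/max values say nothing about nulls, whereas a per-row-group `has_null` flag lets the reader skip every group without one.

```python
from dataclasses import dataclass


@dataclass
class RowGroupEntry:
    minimum: int
    maximum: int
    has_null: bool


def row_groups_for_is_null(index):
    """Only row groups flagged as containing a null can satisfy IS NULL."""
    return [i for i, entry in enumerate(index) if entry.has_null]


index = [
    RowGroupEntry(1, 50, False),
    RowGroupEntry(51, 90, True),
    RowGroupEntry(91, 200, False),
]
print(row_groups_for_is_null(index))
```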
hasNull in Index Improvement
[Chart: select * from lineitem where l_shipdate is null (smaller is better). Execution time in seconds by Hive version: hive < 1.1.0: 66.73; hive >= 1.1.0: 7.87. ~8.5x improvement. Bytes read: 208.77 GB vs 539 MB.]
Timeline
February 2015
Bloom Filter Index
 Much better row group pruning when
compared to min/max
 Bloom filter evaluated after the
fast Min/Max based elimination
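The two-step check (cheap min/max first, Bloom filter second) can be sketched as follows. This toy `BloomFilter` uses SHA-256 from the standard library purely for convenience; the hash function, bit count, and number of hashes are assumptions and do not reflect how ORC builds its filters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive one bit position per seed from a seeded hash of the value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_group_may_match(minimum, maximum, bloom, literal):
    """Check min/max first; consult the Bloom filter only if that passes."""
    if not (minimum <= literal <= maximum):
        return False
    return bloom.might_contain(literal)


bloom = BloomFilter()
for key in (10, 500, 990):
    bloom.add(key)
print(row_group_may_match(10, 990, bloom, 500))
print(row_group_may_match(10, 990, bloom, 2000))
```

A Bloom filter never produces false negatives, so a row group holding the key is always read; it helps when the key falls inside the min/max range of many row groups but is actually present in only a few.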
Bloom Filter Indexes Improvements
[Chart: Rows read (log scale, smaller is better) for select * from tpch_1000.lineitem where l_orderkey = 1212000001; No indexes: 5,999,989,709; Min-Max indexes: 540,000; Bloom filter indexes: 10,000.]
Bloom Filter Indexes Improvements
[Chart: Time taken in seconds (smaller is better) for select * from tpch_1000.lineitem where l_orderkey=1212000001; No indexes: 74; Min-Max indexes: 4.5; Bloom filter indexes: 1.34. ~16x improvement with min/max indexes; a further ~3.3x with Bloom filter indexes.]
Timeline
April 2015
Split Strategies
 BI – Skip reading file footer
 ETL – Read and cache file footer
 HYBRID – Default. Chooses BI/ETL
based on number of files and
average file size
 Group splits based on columnar
projection size instead of file size
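One way the HYBRID choice could work is sketched below. The slide only says the decision depends on the number of files and the average file size; the specific rule (ETL for large files or for few files, BI otherwise), the parameter names, and the threshold values in the usage lines are assumptions.

```python
MB = 1024 * 1024


def choose_split_strategy(file_sizes, max_split_size, etl_file_threshold):
    """Pick BI or ETL from the file count and the average file size."""
    if not file_sizes:
        raise ValueError("no files to split")
    average = sum(file_sizes) / len(file_sizes)
    if average > max_split_size or len(file_sizes) <= etl_file_threshold:
        # Large files, or few enough files that reading footers is affordable.
        return "ETL"
    # Many small files: skip the footers and generate splits quickly.
    return "BI"


print(choose_split_strategy([10 * MB] * 100_000, 256 * MB, 100))
print(choose_split_strategy([1024 * MB] * 100_000, 256 * MB, 100))
print(choose_split_strategy([10 * MB] * 5, 256 * MB, 100))
```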
Timeline
April 2015
ORC became Apache Top Level Project
 C++ reader with contributions from
Hortonworks, HP and Microsoft
 Column encryption to encrypt
sensitive columns
http://orc.apache.org/
ORC: In Production
ORC at Facebook
Compression: Saved more than 1,400 servers' worth of storage. (2)
Compression: Compression ratio increased from 5x to 8x globally. (2)
ORC at Spotify
IO: 16x less HDFS read when using ORC versus Avro. (3)
CPU: 32x less CPU when using ORC versus Avro. (3)
ORC at Yahoo!
Speedup: 6-50x speedup when using ORC versus Text File. (4)
Speedup: 1.6-30x speedup when using ORC versus RCFile. (4)
ORC: LLAP and Sub-second
ORC – Pushing for Sub-second
ORC: LLAP
JIT Performance for short queries
Row-group level caching
Asynchronous IO Elevator
Multi-threaded Column Vector processing
ORC: Vectorization + SIMD
 Allocation-free tight inner loops enable the JDK's auto-vectorization
 Vectors can be filtered early in ORC
 String dictionary can be used to binary-search
 Vectorized SIMD Join
 Improves performance for single key joins
Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure to have the HotSpot disassembler in $JAVA_HOME/jre/lib
Generated Assembly:
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
AVX - Vector Addition Packed Double (vaddpd)
4 doubles loaded into 256-bit registers
ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
select * from tpch_1000.lineitem where l_orderkey=1212000001;
Questions?
Interested? Stop by the Hortonworks booth to learn more
Endnotes
(1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
(2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
(4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3

More Related Content

What's hot

Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Deadt3rmin4t0r
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013Owen O'Malley
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetupt3rmin4t0r
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoYu Liu
 

What's hot (20)

Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
ORC Files
ORC FilesORC Files
ORC Files
 
HiveACIDPublic
HiveACIDPublicHiveACIDPublic
HiveACIDPublic
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetup
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
HadoopFileFormats_2016
HadoopFileFormats_2016HadoopFileFormats_2016
HadoopFileFormats_2016
 

Viewers also liked

cstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLcstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLCitus Data
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Differences of Deep Learning Frameworks
Differences of Deep Learning FrameworksDifferences of Deep Learning Frameworks
Differences of Deep Learning FrameworksSeiya Tokui
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 

Viewers also liked (9)

cstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLcstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQL
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Differences of Deep Learning Frameworks
Differences of Deep Learning FrameworksDifferences of Deep Learning Frameworks
Differences of Deep Learning Frameworks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 

Similar to ORC 2015: Faster, Better, Smaller

Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghaiYifeng Jiang
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in HiveEugene Koifman
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...Big Data Spain
 

Similar to ORC 2015: Faster, Better, Smaller (20)

Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in Hive
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
ORC 2015: Faster, Better, Smaller

  • 1. ORC 2015: Faster, Better, Smaller. Prasanth Jayachandran, Apache Hive Team, Hortonworks (@prasanth_j). © Hortonworks Inc. 2011 – 2015. All Rights Reserved.
  • 2. Apache ORC – Optimized Row-Columnar File
      - Apache TLP – orc.apache.org
      - Type Specific Encodings
      - Came out of Apache Hive
      - Vectorized Readers (Java, C++)
      - Projection and Predicate Pushdown
      - Columnar Storage
      - Block Compression
      - Hive ACID transactions
      - Single SerDe Format
      - Protobuf Metadata Storage
  • 3. ORC: Format Specification. How does ORC store data?
  • 4. ORC File Layout
      - File Footer and Postscript
      - Stripes
          - Indexes (row group indexes and Bloom filters interleaved): min/max stats, positions for every 10K rows
          - Data: multiple streams per column, encoded and compressed independently
          - Stripe Footer: locations of streams, type of encoding
      - Full specification at [1]
  • 5. ORC Writer. Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
      - One tree writer per flattened column
      - Multiple streams per column: PRESENT, DATA, LENGTH, DICTIONARY_DATA, SECONDARY, ROW_INDEX, BLOOM_FILTER
  • 6. ORC Data Streams. Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
      - Streams can be suppressed.
      - Example: the PRESENT stream is suppressed when all values in a stripe are non-null.
      - Diagram: the IS_PRESENT, DATA, DICTIONARY, LENGTH and SECONDARY streams feed the compression buffers.
  • 7. ORC: Features Timeline. How has ORC improved over time?
  • 8. Timeline (February 2013)
      - Stinger Initiative Announcement*
      - Roadmap to improve Apache Hive’s performance by 100x
      - Delivered in 100% Apache Open Source
      - Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100x
      - * http://hortonworks.com/blog/100x-faster-hive/
  • 9. Timeline (March 2013)
      - Optimized Row Columnar (ORC) file format committed to Hive
      - Hive version: 0.11
      - Native data format in Hive
  • 10. Timeline (March 2013)
      - Predicate Pushdown
          - SARG interface
          - Prune stripes and row groups based on min/max statistics
      - Improved Run Length Encoding
          - Tighter bit packing
          - Longer runs
          - DELTA, SHORT_REPEATS, DIRECT, PATCHED_BASE
  • 11. Run Length Encoding Improvements. Each entry gives compression ratio / encoding time (ms) / decoding time (ms).
      - Twitter Census API ID (24,556,361 records): RLE (hive 0.11) 2.32 / 1770 / 1263; RLE (hive >= 0.12) 6.97 / 1558 / 864
      - HTTP Archive (bytes.json): RLE (hive 0.11) 79.4 / 198 / 191; RLE (hive >= 0.12) 200.82 / 263 / 125
      - Github Archive (root.payload.name.txt.dict-len): RLE (hive 0.11) 114.05 / 21 / 15; RLE (hive >= 0.12) 260.73 / 23 / 15
      - AOL Querylog Epoch (36,389,577 records): RLE (hive 0.11) 2.51 / 553 / 364; RLE (hive >= 0.12) 3.7 / 652 / 246
      - Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx
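
The encoding names on slide 10 suggest that the writer picks a different representation depending on the shape of the integer run. The sketch below illustrates that idea only; it is not the ORC wire format. The labels `REPEAT`, `DELTA` and `DIRECT`, the function `choose_encoding`, and the returned tuples are hypothetical, and PATCHED_BASE is left out for brevity.

```python
from typing import List, Tuple


def choose_encoding(values: List[int]) -> Tuple[str, tuple]:
    """Pick a simplified encoding for a run of integers (illustrative only)."""
    if not values:
        return ("DIRECT", ())
    # All values identical: store the value once plus a repeat count.
    if all(v == values[0] for v in values):
        return ("REPEAT", (values[0], len(values)))
    # Constant step between neighbours: store base, step and length.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if all(d == deltas[0] for d in deltas):
        return ("DELTA", (values[0], deltas[0], len(values)))
    # Otherwise keep every value, packed at the width of the largest magnitude.
    width = max(max(abs(v) for v in values).bit_length(), 1)
    return ("DIRECT", (width, tuple(values)))


print(choose_encoding([7, 7, 7, 7]))
print(choose_encoding([10, 12, 14, 16]))
print(choose_encoding([3, 9, 1]))
```

A repeated or evenly spaced run collapses to two or three numbers regardless of its length, which is the intuition behind the higher compression ratios reported on slide 11.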
  • 12. Timeline (April 2013, June 2013)
      - Vectorized ORC readers
          - Read and process columns in batches of size 1024
      - Null stream suppression
          - Suppress PRESENT stream if no nulls in a stripe
          - Enables fast path in vectorization
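
A minimal sketch of why a suppressed PRESENT stream enables a fast path, assuming a batch is modeled as a list of values plus an optional list of present flags. `ColumnBatch`, `to_batches` and `add_constant` are hypothetical names for illustration, not Hive or ORC classes; only the batch size of 1024 comes from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

BATCH_SIZE = 1024  # batch size named on the slide


@dataclass
class ColumnBatch:
    values: List[float]
    # None models a suppressed PRESENT stream: the batch has no nulls.
    present: Optional[List[bool]] = None


def to_batches(values: List[float], size: int = BATCH_SIZE) -> List[ColumnBatch]:
    """Split a column into fixed-size batches (the last one may be shorter)."""
    return [ColumnBatch(values[i:i + size]) for i in range(0, len(values), size)]


def add_constant(batch: ColumnBatch, constant: float) -> List[Optional[float]]:
    if batch.present is None:
        # Fast path: no per-value null check inside the loop.
        return [v + constant for v in batch.values]
    # Slow path: consult the present flag for every value.
    return [v + constant if p else None
            for v, p in zip(batch.values, batch.present)]


batches = to_batches([1.0] * 2500)
print(len(batches), len(batches[-1].values))
```

The fast path has no branch per value, which is the kind of tight loop a JIT compiler can optimize most aggressively.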
  • 13. Timeline (October 2013, November 2013)
      - Statistics Interface
          - Writer: update statistics during load time
          - Reader: ANALYZE TABLE .. NOSCAN
      - Split Elimination
          - Stripe level column statistics
          - Eliminate stripes that do not satisfy predicate conditions
  • 14. Timeline (February 2014, June 2014)
      - Zero copy read path
          - HDFS caching APIs to read directly into memory without extra data copies
      - Serialization Improvements
          - Bit width alignment (trades space for speed)
          - Unrolled bit packing and unpacking
          - Buffered double reader and writer
  • 15. Serialization Improvements. Chart: ORC Read Integer Performance (smaller is better). Mean time (ms) against bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64), comparing hive 0.13 unpacking with hive-1.0 unpacking (new).
  • 16. Serialization Improvements. Chart: ORC Read Double Performance (smaller is better). Mean time (ms) by double read mode: hive <= 0.13: 241.679; buffered + BE: 171.045; buffered + LE: 174.163. ~1.4x improvement.
  • 17. Timeline (June 2014, July 2014)
      - Adaptive compression buffer size
          - For tables with more than 1000 columns, adjust the compression buffer size based on available memory
          - Avoids wide table OOMs
      - Fast stripe level file merging
          - Many small files to few large files
          - No decompression, no decoding
          - ALTER TABLE … CONCATENATE
  • 18. Fast File Merging. Chart: ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better). Total time in seconds for file formats that support CONCAT:
      - ORC: load time 1091, merge time 245, total 1336
      - RCFile: load time 651, merge time 816, total 1467
      - ~3.33x improvement in merge time
  • 19. Timeline (July 2014)
      - ORC Padding Improvements
          - Pad bytes to avoid remote HDFS reads
          - Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)
      - Decouple stripe size vs block size
          - Smaller stripes (64MB)
          - More stripes per block (4 per block)
          - Better parallelism & split elimination
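
The padding rule on slide 19 can be sketched as a small decision function. This is a simplified model under stated assumptions, not the ORC writer's code: `plan_stripe` is a hypothetical name, the 5% tolerance is taken from the slide's "worst case" figure and is assumed here to be measured against the stripe size, and the 256 MB block is simply four 64 MB stripes.

```python
MB = 1024 * 1024


def plan_stripe(block_size: int, position: int, stripe_size: int,
                padding_tolerance: float = 0.05) -> tuple:
    """Return (padding_bytes, stripe_bytes) for the next stripe.

    position is the current write offset in the file.
    """
    remaining = block_size - (position % block_size)
    if remaining >= stripe_size:
        # A full stripe still fits inside the current block.
        return (0, stripe_size)
    if remaining <= padding_tolerance * stripe_size:
        # Little space left: pad it out and start the stripe in a new block.
        return (remaining, stripe_size)
    # Otherwise shrink the stripe so that it ends at the block boundary.
    return (0, remaining)


print(plan_stripe(256 * MB, 0, 64 * MB))
print(plan_stripe(256 * MB, 254 * MB, 64 * MB))
print(plan_stripe(256 * MB, 226 * MB, 64 * MB))
```

In every branch the stripe stays within one block, so a reader scheduled on that block's node does not have to fetch part of the stripe remotely.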
  • 20. Timeline (September 2014)
      - String Dictionary Improvements
          - Row group level checking
          - Remember decision across stripes
          - Avoids expensive RBTree insertions
  • 21. String Dictionary Improvements. Chart: TPC-H 1000 Scale Lineitem load time in seconds by Hive version (smaller is better): hive <= 0.13: 767; hive > 0.13: 540. ~1.4x improvement.
  • 22. Timeline (September 2014)
      - Improved ZLIB compression
          - Different streams compressed with different zlib strategies/levels
          - Compress integers and doubles differently
          - Data and Dictionary streams: looks for smaller byte patterns
          - All other streams: less LZ77, more Huffman
  • 23. ZLIB Improvements. Chart: Data Size Improvements – TPC-H 1000 Scale Lineitem (smaller is better). Data size in GB by file format and compression codec: ORC + old ZLIB: 178.5; ORC + new ZLIB: 172.2; ORC + SNAPPY: 225.1. New ZLIB is ~4% smaller than old ZLIB and ~1.3x smaller than SNAPPY.
  • 24. ZLIB Improvements. Chart: Load Time Improvements – TPC-H 1000 Scale Lineitem (smaller is better). Load time by file format and compression codec: ORC + old ZLIB: 674; ORC + new ZLIB: 433; ORC + SNAPPY: 389. ~1.6x improvement; only ~10% slower than SNAPPY.
  • 25. Timeline (September 2014)
      - ACID transactions
          - Order of millions of rows
          - Not designed for OLTP requirements
          - Streaming ingest via Flume or Storm
          - Atomically add base and delta directories
          - Minor compaction: merge many delta files
          - Major compaction: rewrite base files to incorporate delta file changes
      - Broken pattern: add partitions for atomicity
  • 26. Timeline (January 2015)
      - hasNull flag in ORC internal index
          - Better pruning of row groups
          - Improves the performance of SELECT .. WHERE column IS NULL;
  • 27. hasNull in Index Improvement. Chart: execution time in seconds for select * from lineitem where l_shipdate is null (smaller is better): hive < 1.1.0: 66.73; hive >= 1.1.0: 7.87. ~8.5x improvement. Bytes read: 208.77 GB vs 539 MB.
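
A minimal sketch of how a per-row-group hasNull flag lets an IS NULL predicate skip row groups. `RowGroupStats` and `groups_for_is_null` are hypothetical names; the real index entries are protobuf messages described in the ORC specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    minimum: Optional[int]
    maximum: Optional[int]
    has_null: bool  # set by the writer when any value in the group is null


def groups_for_is_null(stats: List[RowGroupStats]) -> List[int]:
    """Indexes of the row groups that can contain a null and must be read."""
    return [i for i, s in enumerate(stats) if s.has_null]


stats = [
    RowGroupStats(1, 90, has_null=False),
    RowGroupStats(91, 180, has_null=True),
    RowGroupStats(181, 270, has_null=False),
]
print(groups_for_is_null(stats))
```

Min/max statistics alone cannot answer this predicate, because nulls do not contribute to either bound; without the flag every row group would have to be read.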
  • 28. Timeline (February 2015)
      - Bloom Filter Index
          - Much better row group pruning when compared to min/max
          - Bloom filter evaluated after the fast min/max based elimination
  • 29. Bloom Filter Indexes Improvements. Chart: rows read (log scale, smaller is better) for select * from tpch_1000.lineitem where l_orderkey = 1212000001; No indexes: 5,999,989,709; min-max indexes: 540,000; Bloom filter indexes: 10,000.
  • 30. Bloom Filter Indexes Improvements. Chart: time taken in seconds (smaller is better) for select * from tpch_1000.lineitem where l_orderkey=1212000001; No indexes: 74; min-max indexes: 4.5 (~16x improvement); Bloom filter indexes: 1.34 (a further ~3.3x improvement).
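
The two-stage pruning described on slide 28 (cheap min/max check first, Bloom filter second) can be sketched as follows. This is an illustration under assumptions: the `BloomFilter` class, its SHA-256 based hashing and its sizes are hypothetical choices made for the example and are not ORC's Bloom filter implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterator, List


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: int) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: int) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class RowGroupIndex:
    minimum: int
    maximum: int
    bloom: BloomFilter


def build_index(values: List[int]) -> RowGroupIndex:
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    return RowGroupIndex(min(values), max(values), bloom)


def row_groups_to_read(indexes: List[RowGroupIndex], key: int) -> List[int]:
    """Row groups that may hold rows matching `column = key`."""
    selected = []
    for i, idx in enumerate(indexes):
        if key < idx.minimum or key > idx.maximum:
            continue  # eliminated by the fast min/max check
        if not idx.bloom.might_contain(key):
            continue  # eliminated by the Bloom filter
        selected.append(i)
    return selected


indexes = [build_index(list(range(0, 20, 2))), build_index(list(range(100, 110)))]
print(row_groups_to_read(indexes, 105))
```

A key that falls inside a row group's min/max range but was never written to it passes the first check and is normally rejected by the second, which is where the drop from 540,000 to 10,000 rows read on slide 29 comes from.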
  • 31. Timeline (April 2015)
      - Split Strategies
          - BI: skip reading file footer
          - ETL: read and cache file footer
          - HYBRID: default. Chooses BI/ETL based on number of files and average file size
      - Group splits based on columnar projection size instead of file size
  • 32. Timeline (April 2015)
      - ORC became an Apache Top Level Project
      - C++ reader with contributions from Hortonworks, HP and Microsoft
      - Column encryption to encrypt sensitive columns
      - http://orc.apache.org/
  • 33. ORC: In Production
  • 34. ORC at Facebook
      - Compression: saved more than 1,400 servers worth of storage.(2)
      - Compression: compression ratio increased from 5x to 8x globally.(2)
  • 35. ORC at Spotify
      - IO: 16x less HDFS read when using ORC versus Avro.(3)
      - CPU: 32x less CPU when using ORC versus Avro.(3)
  • 36. ORC at Yahoo!
      - Speedup: 6-50x speedup when using ORC versus Text File.(4)
      - Speedup: 1.6-30x speedup when using ORC versus RCFile.(4)
  • 37. ORC: LLAP and Sub-second. ORC – Pushing for Sub-second
  • 38. ORC: LLAP
      - JIT performance for short queries
      - Row-group level caching
      - Asynchronous IO Elevator
      - Multi-threaded Column Vector processing
  • 39. ORC: Vectorization + SIMD
      - Allocation-free tight inner loops enable the JDK’s auto-vectorization
      - Vectors can be filtered early in ORC
      - String dictionary can be used to binary-search
      - Vectorized SIMD Join: improves performance for single key joins
      - Example query: select ss_ext_tax + 1.0 from store_sales_orc;
      - JVM options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
      - Note: make sure to have the hotspot disassembler in $JAVA_HOME/jre/lib
      - Generated assembly (AVX vector addition, packed double; 4 doubles loaded to 256 bit registers):
          0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
          0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
          0x00007f13d2e6afba: movslq %eax,%r10
          0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
  • 40. ORC: LLAP (+ SIMD + Split Strategies + Row Indexes). Query: select * from tpch_1000.lineitem where l_orderkey=1212000001;
  • 41. Questions? Interested? Stop by the Hortonworks booth to learn more.
  • 42. Endnotes
      - (1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
      - (2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
      - (3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
      - (4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3