ORC Deep Dive 2020
Owen O'Malley
A deep dive into the architecture of Apache ORC.
1.
ORC DEEP DIVE
Owen O’Malley
omalley@apache.org
@owen_omalley
January 2020
2.
OVERVIEW
3.
© 2019 Cloudera, Inc. All rights reserved.
REQUIREMENTS
• Files had to be completely self-describing
  • Schema
  • File version
• Tight compression ⇒ Run Length Encoding (RLE) & compression
• Column projection ⇒ segregate column data
• Predicate pushdown ⇒ understand & index user’s types
• Files had to be easy & fast to divide
• Compatible with write-once file systems
4.
FILE STRUCTURE
• The file footer contains:
  • Metadata – schema, file statistics
  • Stripe information – metadata and location of stripes
  • Postscript with the compression, buffer size, & file version
• ORC file data is divided into stripes.
  • Stripes are self-contained sets of rows organized by columns.
  • Stripes are the smallest unit of work for tasks.
  • Default is ~64MB, but often configured larger.
5.
STRIPE STRUCTURE
• Within a stripe, the metadata is in the stripe footer.
  • List of streams
  • Column encoding information (e.g. direct or dictionary)
• Columns are written as a set of streams. There are 3 kinds:
  • Index streams
  • Data streams
  • Dictionary streams
6.
FILE STRUCTURE (diagram)
7.
READ PATH
• The Reader reads the last 16 KB of the file, and more as needed
• The RowReader reads
  • Stripe footer
  • Required streams
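The first step of the read path can be sketched as follows. The sketch relies on the ORC format's convention that the final byte of the file holds the length of the postscript. The function name read_tail is hypothetical, the 16 KB constant is taken from the slide, and the postscript is returned as undecoded bytes (in a real file it is a protobuf message, which this sketch does not parse).

```python
import io

TAIL_GUESS = 16 * 1024  # initial read size mentioned on the slide


def read_tail(f, tail_guess=TAIL_GUESS):
    """Return (postscript_bytes, bytes_before_postscript) for a seekable binary file."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    start = max(0, size - tail_guess)
    f.seek(start)
    tail = f.read(size - start)
    # The last byte of the file is the length of the postscript.
    ps_len = tail[-1]
    postscript = tail[-1 - ps_len:-1]
    # The footer and metadata sit just before the postscript; a real reader
    # decodes the postscript to learn their lengths and the compression codec,
    # and reads more of the file if 16 KB was not enough.
    return postscript, tail[:-1 - ps_len]


# Synthetic stand-in for a file tail: 5 "footer" bytes, a 3-byte
# "postscript", then the one-byte postscript length.
fake = io.BytesIO(b"FOOTR" + b"PS!" + bytes([3]))
postscript, rest = read_tail(fake)
print(postscript, rest)
```

Reading a fixed-size tail first means that small footers cost a single read, which matters on file systems where each request is expensive.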
8.
STREAMS
• Each stream is an independent sequence of bytes
• Serialization into streams depends on column type & encoding
• Optional pipeline stages:
  • Run Length Encoding (RLE) – first pass integer compression
  • Generic compression – Zlib, Snappy, LZO, Zstd
  • Encryption – AES/CTR
9.
DATA ENCODING
10.
COMPOUND TYPES
• Compound types are serialized as trees of columns.
  • struct, list, map, uniontype all have child columns
• Types are numbered in a preorder traversal
• The column reading classes are called TreeReader
• Example schema: a: int, b: map<string, struct<c: string, d: double>>, e: timestamp
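To make the preorder numbering concrete, here is a small sketch that assigns ids to the example schema a: int, b: map<string, struct<c: string, d: double>>, e: timestamp. The TypeNode class and assign_column_ids function are hypothetical helpers written for this illustration, not ORC library classes. The labels "key" and "value" for the map's children are also just names chosen here.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    name: str   # field name, or a role such as "key"/"value" for map children
    kind: str   # e.g. "struct", "map", "int"
    children: list = field(default_factory=list)


def assign_column_ids(root):
    """Number types in preorder: a parent gets its id before its children."""
    ids = []

    def visit(node, path):
        ids.append((len(ids), path, node.kind))
        for child in node.children:
            visit(child, f"{path}.{child.name}" if path else child.name)

    visit(root, "")
    return ids


schema = TypeNode("", "struct", [
    TypeNode("a", "int"),
    TypeNode("b", "map", [
        TypeNode("key", "string"),
        TypeNode("value", "struct", [
            TypeNode("c", "string"),
            TypeNode("d", "double"),
        ]),
    ]),
    TypeNode("e", "timestamp"),
])

for col_id, path, kind in assign_column_ids(schema):
    print(col_id, path or "<root>", kind)
```

The root struct gets id 0, a gets 1, the map b gets 2 with its key and value at 3 and 4, the nested fields c and d get 5 and 6, and e gets 7.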
11.
ENCODING COLUMNS
• To interpret a stream, you need three pieces of information:
  • Column type
  • Column encoding (direct, dictionary)
  • Stream kind (present, data, length, etc.)
• All columns, if they have nulls, will have a present stream
  • Serialized using a boolean RLE
• Integer columns are serialized with
  • A data stream using integer RLE
12.
ENCODING COLUMNS
• Binary columns are serialized with:
  • Length stream of integer RLE
  • Data stream of raw sequence of bytes
• String columns may be direct or dictionary encoded
  • Direct looks like binary column, but dictionary is different
  • Dictionary_data is raw sequence of dictionary bytes
  • Length is an integer RLE stream of the dictionary lengths
  • Data is an integer RLE stream of indexes into dictionary
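The three streams of a dictionary-encoded string column can be sketched as below. This is a simplified illustration: the function names are hypothetical, the dictionary is sorted as an assumption of the sketch, nulls (the present stream) are ignored, and the integer streams are left as plain Python lists where ORC would additionally apply integer RLE.

```python
def dictionary_encode(values):
    """Split string values into (dictionary_data, lengths, data)."""
    dictionary = sorted(set(values))  # assumption of this sketch: sorted entries
    index = {s: i for i, s in enumerate(dictionary)}
    encoded = [s.encode("utf-8") for s in dictionary]
    dictionary_data = b"".join(encoded)       # raw dictionary bytes
    lengths = [len(b) for b in encoded]       # byte length of each entry
    data = [index[s] for s in values]         # one dictionary index per row
    return dictionary_data, lengths, data


def dictionary_decode(dictionary_data, lengths, data):
    """Rebuild the original values from the three streams."""
    entries, pos = [], 0
    for n in lengths:
        entries.append(dictionary_data[pos:pos + n].decode("utf-8"))
        pos += n
    return [entries[i] for i in data]


rows = ["nevada", "california", "nevada", "utah", "california", "nevada"]
dictionary_data, lengths, data = dictionary_encode(rows)
print(dictionary_data)  # b'californianevadautah'
print(lengths)          # [10, 6, 4]
print(data)             # [1, 0, 1, 2, 0, 1]
```

Six rows are stored as three dictionary entries plus six small integers, which is why dictionary encoding pays off for columns with few distinct values.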
13.
ENCODING COLUMNS
• Lists and maps record the number of child elements
  • Length is an integer RLE stream
• Structs only have the present stream
• Timestamps need nanosecond resolution (ouch!)
  • Data is an integer RLE of seconds from Jan 2015
  • Secondary is an integer RLE of nanoseconds with trailing zeros suppressed
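A sketch of the timestamp split described above: seconds relative to the 2015 epoch go into one value and the nanoseconds into another, with trailing decimal zeros removed. The nanosecond scheme follows the ORC format specification's description (if there are at least two trailing zeros, strip them and store the count minus one in the low three bits). Using UTC for the epoch is an assumption of this sketch, the function names are hypothetical, and timestamps before 2015 are not handled.

```python
from datetime import datetime, timezone

ORC_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)  # UTC assumed for the sketch


def encode_nanos(nanos):
    """Pack nanoseconds, suppressing two or more trailing decimal zeros."""
    if nanos == 0:
        return 0
    if nanos % 100 != 0:
        return nanos << 3              # fewer than two trailing zeros: store as-is
    nanos //= 100
    zeros = 1                          # stored value is (zeros removed - 1)
    while nanos % 10 == 0 and zeros < 7:
        nanos //= 10
        zeros += 1
    return (nanos << 3) | zeros


def decode_nanos(serialized):
    zeros = serialized & 7
    nanos = serialized >> 3
    if zeros:
        nanos *= 10 ** (zeros + 1)
    return nanos


def encode_timestamp(ts):
    """Return (seconds since the 2015 epoch, packed nanoseconds)."""
    delta = ts - ORC_EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds, encode_nanos(delta.microseconds * 1000)


print(hex(encode_nanos(1000)))     # 0xa
print(hex(encode_nanos(100000)))   # 0xc
print(encode_timestamp(datetime(2015, 1, 1, 0, 0, 1, 1000, tzinfo=timezone.utc)))
```

Most real timestamps have millisecond or coarser precision, so the packed nanosecond values stay small and compress well under integer RLE.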
14.
RUN LENGTH ENCODING
• Goal is to get some cheap quick compression
  • Handles repeating/incrementing values
  • Handles integer byte packing
• Two versions
  • Version 1 – relatively simple repeat/literal encoding
  • Version 2 – complex encoding with 4 variants
• Column encoding of *_V2 means use RLE version 2
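As an illustration of the simpler Version 1 scheme, the sketch below decodes unsigned integer RLE v1 as described in the ORC format specification. A control byte of 0 to 127 introduces a run of control + 3 values with a signed one-byte delta and a varint base. A control byte of 128 to 255 introduces 256 - control literal varints. The function names are hypothetical, and signed (zigzag) values are left out to keep the sketch short.

```python
def read_varint(buf, pos):
    """Read an unsigned base-128 varint (low 7 bits first); return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def decode_rle_v1(buf):
    """Decode an unsigned integer RLE v1 byte sequence into a list of ints."""
    out, pos = [], 0
    while pos < len(buf):
        control = buf[pos]
        pos += 1
        if control < 0x80:
            # Run: control + 3 values, a signed one-byte delta, then the base value.
            length = control + 3
            delta = buf[pos] - 256 if buf[pos] >= 0x80 else buf[pos]
            pos += 1
            base, pos = read_varint(buf, pos)
            out.extend(base + i * delta for i in range(length))
        else:
            # Literals: 256 - control values, each stored as a varint.
            for _ in range(256 - control):
                value, pos = read_varint(buf, pos)
                out.append(value)
    return out


# 100 copies of 7 take three bytes; so does the sequence 100, 99, ..., 1.
print(decode_rle_v1(bytes([0x61, 0x00, 0x07]))[:5])   # [7, 7, 7, 7, 7]
print(decode_rle_v1(bytes([0x61, 0xFF, 0x64]))[:5])   # [100, 99, 98, 97, 96]
print(decode_rle_v1(bytes([0xFB, 0x02, 0x03, 0x06, 0x07, 0x0B])))  # [2, 3, 6, 7, 11]
```

The delta byte is what lets a single run cover incrementing or decrementing values as well as plain repeats.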
15.
COMPRESSION & INDEXES
16.
ROW PRUNING
• Three levels of indexing/row pruning
  • File – uses file statistics in file footer
  • Stripe – uses stripe statistics before file footer
  • Row group (default of 10k rows) – uses index stream
• The index stream for each column includes for each row group
  • Column statistics (min, max, count, sum)
  • The start positions of each stream
17.
SEARCH ARGUMENTS
• Engines can pass Search Arguments (SArgs) to the RowReader.
  • Limited set of operations (=, <=>, <, <=, in, between, is null)
  • Compare one column to literal(s)
• Can only eliminate entire row groups, stripes, or files.
  • Engine must still filter the individual rows afterwards
• For Hive, ensure hive.optimize.index.filter is true.
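The idea that statistics can only rule out whole row groups, never individual rows, can be sketched as follows for an equality predicate. RowGroupStats, may_contain_equal, and select_row_groups are hypothetical names for this illustration, and the statistics are made-up values. Real ORC statistics and search arguments cover more types and operators.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int
    maximum: int
    count: int  # number of non-null values


def may_contain_equal(stats, literal):
    """True if the row group cannot be ruled out for `column = literal`."""
    return stats.count > 0 and stats.minimum <= literal <= stats.maximum


def select_row_groups(index, literal):
    """Return the positions of row groups that still have to be read."""
    return [i for i, stats in enumerate(index) if may_contain_equal(stats, literal)]


# Made-up min/max statistics for three row groups of 10k rows each.
index = [
    RowGroupStats(0, 9_999, 10_000),
    RowGroupStats(10_000, 19_999, 10_000),
    RowGroupStats(15_000, 30_000, 10_000),
]
candidates = select_row_groups(index, 17_500)
print(candidates)  # [1, 2]
```

Both surviving row groups may or may not hold the value 17,500, which is why the engine still has to filter the individual rows after pruning.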
18.
COMPRESSION
• All of the generic compression is done in chunks
  • Codec is reinitialized at start of chunk
  • Each chunk is compressed separately
  • Each uncompressed chunk is at most the buffer size
• Each chunk has a 3 byte header giving:
  • Compressed size of chunk
  • Whether it is the original or compressed
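The 3 byte chunk header can be sketched as below. Per the ORC format specification, the header is a little-endian value equal to the chunk length times two, plus one if the chunk was stored in its original (uncompressed) form. The function names are hypothetical.

```python
def encode_chunk_header(length, is_original):
    """Pack a chunk length and the 'stored as original' flag into 3 bytes."""
    value = (length << 1) | (1 if is_original else 0)
    return value.to_bytes(3, "little")  # raises OverflowError if length >= 2**23


def decode_chunk_header(header):
    """Return (length, is_original) from the first 3 bytes of a chunk."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)


print(encode_chunk_header(100_000, False).hex())   # 400d03
print(encode_chunk_header(5, True).hex())          # 0b0000
print(decode_chunk_header(bytes([0x40, 0x0D, 0x03])))  # (100000, False)
```

The "original" flag lets a writer keep a chunk uncompressed when the codec would have made it larger.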
19.
INDEXES
• Wanted ability to seek to each row group
  • Allows fine grain seeking & row pruning
• Could have flushed stream compression pipeline
  • Would have dramatically lowered compression
• Instead treat compression & RLE as gray boxes
  • Use our knowledge of compression & RLE
  • Always start fresh at beginning of chunk or run
20.
INDEX POSITIONS
• Records information to seek to a given row in all of a column’s streams
• Includes:
  • C Compressed bytes
  • U Uncompressed bytes
  • V RLE values
• Diagram: C, U, & V jump to row group (RG) 4
21.
BLOOM FILTERS
• For use cases where you need to find particular values
  • Sorting by that column allows min/max filtering
  • But you can only sort on one column effectively
• Bloom filters are probabilistic data structures
  • Only useful for equality, not less than or greater than
  • Need ~10 bits/distinct value ⇒ opt in
• ORC uses a bloom_filter_utf8 stream to record a bloom filter per row group
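To show why roughly 10 bits per distinct value is a reasonable budget, here is a minimal generic Bloom filter sketch. It is not ORC's implementation: the class name is hypothetical and SHA-256 is swapped in as a convenient standard-library hash. The expected_fpp function uses the standard approximation (1 - e^(-k/b))^k for k hash functions and b bits per value.

```python
import hashlib
import math


class BloomFilter:
    """Generic Bloom filter; hashing scheme chosen for this sketch only."""

    def __init__(self, expected_items, bits_per_item=10):
        self.num_bits = max(1, expected_items * bits_per_item)
        # The optimal number of hash functions is about bits_per_item * ln 2.
        self.num_hashes = max(1, round(bits_per_item * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        # Double hashing: derive all bit positions from two base hashes.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))


def expected_fpp(bits_per_item, num_hashes):
    """Approximate false positive probability of a full filter."""
    return (1 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


bloom = BloomFilter(expected_items=1000)
for i in range(1000):
    bloom.add(f"key-{i}")
print(bloom.num_hashes, round(expected_fpp(10, bloom.num_hashes), 4))
```

A filter can answer "definitely absent" or "possibly present" for an equality test, but it carries no ordering information, which is why it does not help range predicates.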
22.
ROW PRUNING EXAMPLE
• TPC-DS from tpch1000.lineitem where l_orderkey = 1212000001;
Index    Rows Read      Time
Nothing  5,999,989,709  74 sec
Min/Max  540,000        4.5 sec
Bloom    10,000         1.3 sec
23.
VERSIONING
24.
COMPATIBILITY
• Within a file version, old readers must be able to read all files.
  • A few exceptions (e.g. new codecs, types)
• Version 0 (from Hive 0.11)
  • Only RLE V1 & string dictionary encoding
• Version 1 (from Hive 0.12 forward)
• Version 2 (under development)
• The library includes ability to write any file version.
  • Enables smooth upgrades across clusters
25.
WRITER VERSION
• When fixes or feature additions are made to the writer, we bump the writer version.
  • Allows reader to work around bugs, especially in index
  • Does not affect reader compatibility
• We should require that each minor version adds a new one.
• We also record which writer wrote the file:
  • Java, C++, Presto, Go
26.
EXAMPLE WORKAROUND FOR HIVE-8746
• Timestamps suck!
• ORC uses an epoch of 01-01-2015 00:00:00.
  • Timestamp columns record seconds offset from epoch
• Unfortunately, the original code used the local time zone.
  • If reader and writer were in time zones with the same rules, it worked.
• Fix involved writing the writer time zone into file.
  • Forwards and backwards compatible
27.
ADDITIONAL FEATURES
28.
SCHEMA EVOLUTION
• User passes desired schema to RecordReader factory.
• SchemaEvolution class maps between file & reader schemas.
  • The mapping can be positional or name based.
  • Conversions based on legacy Hive behavior…
• The RecordReader uses the mapping to translate
  • Choosing streams uses the file schema column ids
  • Type translation is done by ConvertTreeReaderFactory.
  • Adds an additional TreeReader that does conversion.
29.
STRIPE CONCATENATION & FLUSH
• ORC has a special operator to concatenate files
  • Requires consistent options & schema
  • Concatenates stripes without reserialization
• ORC can flush the current contents including a file footer while still writing to the file.
  • Writes a side file with the current offset of the file tail
  • When the file closes the intermediate file footers are ignored
30.
COLUMN ENCRYPTION
• Released in ORC 1.6
• Allows consistent column level access control across engines
• Writes two variants of data
  • Encrypted original
  • Unencrypted statically masked
• Each variant has its own streams & encodings
• Each column has a unique local key, which is encrypted by KMS
31.
OTHER DEVELOPER TOOLS
• Benchmarks
  • Hive & Spark
  • Avro, JSON, ORC, and Parquet
  • Three data sets (taxi, sales, github)
• Docker
  • Allows automated builds on all supported Linux variants
• Site source code is with C++ & Java
32.
USING ORC
33.
WHICH VERSION IS IT?
Engine        Version      ORC Version
Hive          0.11 to 2.2  Hive ORC 0.11 to 2.2
Hive          2.3          ORC 1.3
Hive          3.0          ORC 1.4
Hive          3.1          ORC 1.5
Spark hive    *            Hive ORC 1.2
Spark native  2.3          ORC 1.4
Spark native  2.4 to 3.0   ORC 1.5
34.
FROM SQL
• Hive:
  • Add “stored as orc” to table definition
  • Table properties override configuration for ORC
• Spark’s “spark.sql.orc.impl” controls implementation
  • native – Use ORC 1.5
  • hive – Use ORC from Hive 1.2
35.
FROM JAVA
• Use the ORC project rather than Hive’s ORC.
  • Maven group id: org.apache.orc version: 1.6.2
  • nohive classifier avoids interfering with Hive’s packages
• Two levels of access
  • orc-core – Faster access, but uses Hive’s vectorized API
  • orc-mapreduce – Row by row access, simpler OrcStruct API
• MapReduce API implements WritableComparable
  • Can be shuffled
  • Need to specify type information in configuration for shuffle or output
36.
FROM C++
• Pure C++ client library
  • No JNI or JDK so client can estimate and control memory
• Uses pure C++ HDFS client from HDFS-8707
• Reader and writer are stable and in production use.
• Runs on Linux, Mac OS, and Windows.
• Docker scripts for CentOS 6-8, Debian 8-10, Ubuntu 14-18
• CI builds on Mac OS, Ubuntu, and Windows
37.
FROM COMMAND LINE
• Using hive --orcfiledump from Hive
  • -j -p – pretty prints the metadata as JSON
  • -d – prints data as JSON
• Using java -jar orc-tools-*-uber.jar from ORC
  • meta -j -p – print the metadata as JSON
  • data – print data as JSON
  • convert – convert CSV, JSON, or ORC to ORC
  • json-schema – scan a set of JSON documents to find schema
38.
DEBUGGING
• Things to look for:
  • Stripe size
  • Rows/Stripe
  • File version
  • Writer version
  • Width of schema
  • Sanity of statistics
  • Column encoding
  • Size of dictionaries
39.
OPTIMIZATION
40.
STRIPE SIZE
• Makes a huge difference in performance
  • orc.stripe.size or hive.exec.orc.default.stripe.size
  • Controls the amount of buffer in writer. Default is 64MB
• Trade off
  • Large = larger, more efficient reads
  • Small = less memory and more granular processing splits
• Multiple files written at the same time will shrink stripes
41.
HDFS BLOCK PADDING
• The stripes don’t align exactly with HDFS blocks
  • Unless orc.write.variable.length.blocks is enabled
• HDFS scatters blocks around cluster
• Often want to pad to block boundaries
  • Costs space, but improves performance
  • orc.default.block.padding
  • orc.block.padding.tolerance
42.
SPLIT CALCULATION
• BI
  • Small fast queries
  • Splits based on HDFS blocks
• ETL
  • Large queries
  • Read file footer and apply SearchArg to stripes
  • Can include footer in splits (hive.orc.splits.include.file.footer)
• Hybrid
  • If small files or lots of files, use BI
43.
CONCLUSION
44.
FOR MORE INFORMATION
• The orc_proto.proto defines the ORC metadata
• Read code and especially OrcConf, which has all of the knobs
• Website on https://orc.apache.org/
  • /bugs ⇒ jira repository
  • /src ⇒ github repository
  • /specification ⇒ format specification
• Apache email list dev@orc.apache.org
45.
THANK YOU
Owen O’Malley
omalley@apache.org
@owen_omalley