Submit Search
Upload
ORC improvement in Apache Spark 2.3
•
Download as PPTX, PDF
•
9 likes
•
2,530 views
Dongjoon Hyun
Follow
Dataworks Summit 2018 Berlin - ORC improvement in Apache Spark 2.3
Read less
Read more
Software
Slideshow view
Report
Share
Slideshow view
Report
Share
1 of 45
Download now
Recommended
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
Recommended
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
HadoopFileFormats_2016
HadoopFileFormats_2016
Jakub Wszolek, PhD
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Owen O'Malley
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
DataWorks Summit/Hadoop Summit
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
LLAP Nov Meetup
LLAP Nov Meetup
t3rmin4t0r
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
ORC 2015
ORC 2015
t3rmin4t0r
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
DataWorks Summit
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
You Can't Search Without Data
You Can't Search Without Data
Bryan Bende
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Running Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
DataWorks Summit
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
More Related Content
What's hot
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
HadoopFileFormats_2016
HadoopFileFormats_2016
Jakub Wszolek, PhD
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Owen O'Malley
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
DataWorks Summit/Hadoop Summit
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
LLAP Nov Meetup
LLAP Nov Meetup
t3rmin4t0r
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
ORC 2015
ORC 2015
t3rmin4t0r
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
DataWorks Summit
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
You Can't Search Without Data
You Can't Search Without Data
Bryan Bende
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Running Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
What's hot
(20)
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
LLAP: Building Cloud First BI
LLAP: Building Cloud First BI
HadoopFileFormats_2016
HadoopFileFormats_2016
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
LLAP Nov Meetup
LLAP Nov Meetup
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
ORC 2015
ORC 2015
Next Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
Hive acid and_2.x new_features
Hive acid and_2.x new_features
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
You Can't Search Without Data
You Can't Search Without Data
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Running Services on YARN
Running Services on YARN
ORC Deep Dive 2020
ORC Deep Dive 2020
Similar to ORC improvement in Apache Spark 2.3
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
DataWorks Summit
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
Intro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hortonworks
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
What's new in apache hive
What's new in apache hive
DataWorks Summit
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
Sunil Govindan
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks
Hadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
Antonio García-Domínguez
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union - Tokyo
DataWorks Summit
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
DataWorks Summit
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
The Apache Software Foundation
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
High throughput data replication over RAFT
High throughput data replication over RAFT
DataWorks Summit
LLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
DataWorks Summit
Similar to ORC improvement in Apache Spark 2.3
(20)
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
What's new in apache hive
What's new in apache hive
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union - Tokyo
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
High throughput data replication over RAFT
High throughput data replication over RAFT
LLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
Recently uploaded
EY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
Neo4j
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
stazi3110
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
Hanief Utama
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
jennyeacort
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
OPEN KNOWLEDGE GmbH
MYjobs Presentation Django-based project
MYjobs Presentation Django-based project
AnoyGreter
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
kzayra69
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
Ortus Solutions, Corp
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
qr0udbr0
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Matt Ray
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
FerryKemperman
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
Livetecs LLC
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
andrehoraa
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Christina Lin
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
Andreas Granig
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
soniya singh
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
Dinusha Kumarasiri
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
VICTOR MAESTRE RAMIREZ
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
Christina Lin
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
VICTOR MAESTRE RAMIREZ
Recently uploaded
(20)
EY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
MYjobs Presentation Django-based project
MYjobs Presentation Django-based project
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
ORC improvement in Apache Spark 2.3
1.
1 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Improvement in Apache Spark 2.3 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team April 2018
2.
2 © Hortonworks
Inc. 2011–2018. All rights reserved Dongjoon Hyun • Hortonworks • Principal Software Engineer @ Data Science Team • Apache Project • Apache REEF Project Management Committee(PMC) Member & Committer • Apache Spark Project Contributor • GitHub • https://github.com/dongjoon-hyun
3.
3 © Hortonworks
Inc. 2011–2018. All rights reserved Agenda • What’s New in Apache Spark 2.3 • Previous ORC issues in Apache Spark • Current Approach & Demo • Performance & Limitation • Future roadmap
4.
4 © Hortonworks
Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features What’s New in Apache Spark 2.3
5.
5 © Hortonworks
Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features What’s New in Apache Spark 2.3
6.
6 © Hortonworks
Inc. 2011–2018. All rights reserved Spark’s file-based data sources • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables
7.
7 © Hortonworks
Inc. 2011–2018. All rights reserved Motivation • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables Fast Flexible Hive Table Access
8.
8 © Hortonworks
Inc. 2011–2018. All rights reserved Previous ORC Issues in Spark
9.
9 © Hortonworks
Inc. 2011–2018. All rights reserved Background – The story of Spark, ORC, and Hive • Before Apache ORC • Hive 1.2.1 (2015 JUN) SPARK-2883 (Spark 1.4) • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) • v1.4.0 (2017 MAY)
10.
10 © Hortonworks
Inc. 2011–2018. All rights reserved Background – The story of Spark, ORC, and Hive – Cont. • Before Apache ORC • Hive 1.2.1 (2015 JUN) SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC • v1.0.0 (2016 JAN) ... • v1.3.3 (2017 FEB) HIVE-15841 Hive 2.3 • v1.4.0 (2017 MAY) SPARK-21422 Spark 2.3 • v1.4.1 (2017 OCT) SPARK-22300 Spark 2.3 • v1.4.3 (2018 FEB) SPARK-23340, HIVE-18674 Hive 3.0 Spark 2.4
11.
11 © Hortonworks
Inc. 2011–2018. All rights reserved Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
12.
12 © Hortonworks
Inc. 2011–2018. All rights reserved Category 1 – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014) ORC string statistics are not merged correctly • HIVE_4243 (2015) Fix column names in FileSinkOperator • HIVE_12055(2015) Create row-by-row shims for the write path • HIVE_13083(2016) Writing HiveDecimal can wrongly suppress present stream • ORC_101 (2016) Correct the use of the default charset in bloomfilter • ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
13.
13 © Hortonworks
Inc. 2011–2018. All rights reserved Category 2 – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading partition-columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
14.
14 © Hortonworks
Inc. 2011–2018. All rights reserved • `FileNotFoundException` at writing empty partitions as ORC • Create structured steam with ORC files Write (SPARK-15474) Read (SPARK-22781) Category 3 – Structured streaming spark.readStream.orc(path)
15.
15 © Hortonworks
Inc. 2011–2018. All rights reserved Category 4 – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
16.
16 © Hortonworks
Inc. 2011–2018. All rights reserved Category 5 – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) • Introduced at Spark 2.2, but throws AnalysisException for ORC • Support column positional mismatch (SPARK-22267) • Return wrong result if ORC file schema is different from Hive MetaStore schema order
17.
17 © Hortonworks
Inc. 2011–2018. All rights reserved Category 6 – Robustness • ORC metadata exceed ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
18.
18 © Hortonworks
Inc. 2011–2018. All rights reserved Current Approach
19.
19 © Hortonworks
Inc. 2011–2018. All rights reserved Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) FileFormat TextBasedFileFormat ParquetFileFormat OrcFileFormat HiveFileFormat JsonFileFormat LibSVMFileFormat CSVFileFormat TextFileFormat o.a.s.sql.execution.datasources o.a.s.ml.source.libsvmo.a.s.sql.hive.orc OrcFileFormat `hive` OrcFileFormat from Hive 1.2.1 `native` OrcFileFormat with ORC 1.4.1
20.
20 © Hortonworks
Inc. 2011–2018. All rights reserved In Reality – Four cases for ORC Reader/Writer `hive` Reader`native` Reader `hive` Writer `native` Writer • New Data • New Apps • Best performance (Vectorized Reader) • New Data • Old Apps • Improved performance (Non-vectorized Reader) • Old Data • New Apps • Improved performance (Vectorized Reader) • Old Data • Old Apps • As-Is performance (Non-vectorized Reader) 1 2 3 4
21.
21 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / native reader native writer / hive reader hive writer / hive reader 4x 1 2 3 4 https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
22.
22 © Hortonworks
Inc. 2011–2018. All rights reserved How to specify `native` OrcFileFormat directly CREATE TABLE people (name string, age int) USING org.apache.spark.sql.execution.datasources.orc df.write .format("org.apache.spark.sql.execution.datasources.orc") .save(path) spark.read .format("org.apache.spark.sql.execution.datasources.orc") .load(path) Read Dataset Write Dataset Create ORC Table
23.
23 © Hortonworks
Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB') spark.read.orc(path) df.write.orc(path) spark.read.format("orc").load (path) df.write.format("orc").save(path) Read/Write Dataset Read/Write Dataset Create ORC Table
24.
24 © Hortonworks
Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) spark.readStream.orc(path) spark.readStream.format("orc").load(path) df.writeStream .option("checkpointLocation", path1) .format("orc") .option("path", path2) .start Read/Write Structured Stream
25.
25 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations orc.impl # of cols <= codegen.maxFields `native` `hive` ORC Reader `hive` true spark.sql.codegen.maxFields=100 (default) false `native` ORC Columnar Batch Reader all atomic types true false `native` ORC Record Reader orc.enableVectorizedReader false true
26.
26 © Hortonworks
Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations – Cont. orc.enableVectorizedReader Wrapping ORC ColumnVector Spark OrcColumnVector orc.copyBatchToSpark true false Copying ORC ColumnVector Spark OffHeapColumnVector true columnVector.offheap.enabled true Copying ORC ColumnVector Spark OnHeapColumnVector false `native` ORC Columnar Batch Reader
27.
27 © Hortonworks
Inc. 2011–2018. All rights reserved Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) • `spark.sql.orc.impl=native` is required, too. CREATE TABLE people (name string, age int) STORED AS ORC CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip') SPARK-23355
28.
28 © Hortonworks
Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns • Before SPARK-21929, users drop and recreate ORC table with an updated schema. • User-defined schema reduces schema inference cost and handles upcasting • boolean -> byte -> short -> int -> long • float -> double spark.read.schema("col1 int").orc(path) spark.read.schema("col1 long, col2 long").orc(path) Old Data New Data
29.
29 © Hortonworks
Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources – Cont. 1. Native Vectorized ORC Reader 2. Only safe change via upcasting 3. JSON is the most flexible for changing types File Format TEXT CSV JSON ORC `hive` ORC `native`1 PARQUET Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️ Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️ Hide Column ✔️ ✔️ ✔️ Change Type2 ✔️ ✔️3 ✔️ Change Position ✔️ ✔️ ✔️
30.
30 © Hortonworks
Inc. 2011–2018. All rights reserved Demo 1 ORC configuration
31.
31 © Hortonworks
Inc. 2011–2018. All rights reserved Demo 2 PySpark with ORC
32.
32 © Hortonworks
Inc. 2011–2018. All rights reserved Performance
33.
33 © Hortonworks
Inc. 2011–2018. All rights reserved Micro Benchmark • Target • Apache Spark 2.3.0 • Apache ORC 1.4.1 • Machine • MacBook Pro (2015 Mid) • Intel® Core™ i7-4770JQ CPI @ 2.20GHz • Mac OS X 10.13.4 • JDK 1.8.0_161
34.
34 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / hive reader 4x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
35.
35 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Vectorized Read 0 500 1000 1500 2000 2500 TINYINT SMALLINT INT BIGINT FLOAT DOULBE native hive 15M rows in a single-column table Time (ms) 10x 5x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 11x
36.
36 © Hortonworks
Inc. 2011–2018. All rights reserved Performance – Partitioned table read 0 500 1000 1500 2000 2500 Data column Partition column Both columns native hive Time (ms) 21x7x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 15M rows in a partitioned table
37.
37 © Hortonworks
Inc. 2011–2018. All rights reserved Predicate Pushdown 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Select 10% rows (id < value) Select 50% rows (id < value) Select 90% rows (id < value) Select all rows (id IS NOT NULL) parquet native Time (ms) https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala 15M rows with 5 data columns and 1 sequential id column
38.
38 © Hortonworks
Inc. 2011–2018. All rights reserved Limitation Future Roadmap
39.
39 © Hortonworks
Inc. 2011–2018. All rights reserved Limitation • Spark vectorization supports atomic types only • Limited simple schema evolution. JSON provides more • boolean -> byte -> short -> int -> long • float -> double • `convertMetastore` ignores `STORED AS` table properties (SPARK-23355) • Both ORC/Parquet
40.
40 © Hortonworks
Inc. 2011–2018. All rights reserved Future Roadmap – Apache Spark 2.4 (2018 Fall) • Feature Parity for ORC with Parquet (SPARK-20901) • Use `native` ORC implementation by default (SPARK-23456) • Use ORC predicate pushdown by default (SPARK-21783) • Use `convertMetastoreOrc` by default (SPARK-22279) • Test ORC as default data source format (SPARK-23553) • Test and support Bloom Filters (SPARK-12417)
41.
41 © Hortonworks
Inc. 2011–2018. All rights reserved Future Roadmap – On-going work • Support VectorUDT/MatrixUDT (SPARK-22320) • Support CHAR/VARCHAR Types • Vectorized Writer with DataSource V2 • ALTER TABLE … CHANGE column type (SPARK-18727)
42.
42 © Hortonworks
Inc. 2011–2018. All rights reserved Summary • Apache Spark 2.3 starts to take advantage of Apache ORC • Native vectorized ORC reader • boosts Spark ORC performance • provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
43.
43 © Hortonworks
Inc. 2011–2018. All rights reserved Reference • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache- spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc- met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
44.
44 © Hortonworks
Inc. 2011–2018. All rights reserved Questions?
45.
45 © Hortonworks
Inc. 2011–2018. All rights reserved Thank you
Download now