© Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement in Apache Spark 2.3
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
April 2018
Dongjoon Hyun
• Hortonworks
• Principal Software Engineer @ Data Science Team
• Apache Project
• Apache REEF Project Management Committee (PMC) Member & Committer
• Apache Spark Project Contributor
• GitHub
• https://github.com/dongjoon-hyun
Agenda
• What’s New in Apache Spark 2.3
• Previous ORC issues in Apache Spark
• Current Approach & Demo
• Performance & Limitation
• Future roadmap
What’s New in Apache Spark 2.3

Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2

Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark’s file-based data sources
• TEXT: the simplest one, with a single string-column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes
• PARQUET: the only one with a vectorized reader
• ORC: popular for shared Hive tables
Motivation
• TEXT: the simplest one, with a single string-column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes → Flexible
• PARQUET: the only one with a vectorized reader → Fast
• ORC: popular for shared Hive tables → Hive Table Access

Goal: make ORC fast and flexible, too, while keeping its Hive table access.
Previous ORC Issues in Spark
Background – The history of Spark and ORC
• Before Apache ORC
• Hive 1.2.1 (2015 JUN) → SPARK-2883 (Hive ORC has been used since Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
• v1.1.0 (2016 JUN)
• v1.2.0 (2016 AUG)
• v1.3.0 (2017 JAN)
• v1.4.0 (2017 MAY) → SPARK-21422 (Apache ORC added in Spark 2.3)
• v1.4.1 (2017 OCT) → SPARK-22300
• v1.4.3 (2018 FEB) → SPARK-23340 (Spark 2.4)
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
Issues with ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Fix column names in FileSinkOperator
• HIVE_12055 (2015) Create row-by-row shims for the write path
• HIVE_13083 (2016) Writing HiveDecimal can wrongly suppress present stream
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
Issues with performance
• Vectorized ORC Reader (SPARK-16060)
• Fast partition-column-only reads (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
Issues with structured streaming
• Write (SPARK-15474): `FileNotFoundException` when writing empty partitions as ORC
• Read (SPARK-22781): cannot create a structured stream over ORC files
  spark.readStream.orc(path)
Issues with column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
Issues with Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
• Introduced in Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
• Returned wrong results if the ORC file schema order differs from the Hive MetaStore schema order
• `convertMetastore` ignores storage properties (SPARK-22158, fixed in 2.2.1)
• `convertMetastoreOrc` was introduced in Spark 2.0, but it had several issues.
Issues with robustness
• ORC metadata exceeds the ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
• `FileNotFound` for file names with special characters (SPARK-22146, fixed in 2.2.1)
Current Approach
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)

FileFormat class hierarchy (from the slide diagram):
FileFormat
├─ TextBasedFileFormat
│  ├─ TextFileFormat
│  ├─ CSVFileFormat
│  ├─ JsonFileFormat
│  └─ LibSVMFileFormat (o.a.s.ml.source.libsvm)
├─ ParquetFileFormat
├─ OrcFileFormat (o.a.s.sql.execution.datasources): the new `native` OrcFileFormat, with ORC 1.4.3
├─ OrcFileFormat (o.a.s.sql.hive.orc): the existing `hive` OrcFileFormat, from Hive 1.2.1
└─ HiveFileFormat
18 © Hortonworks Inc. 2011–2018. All rights reserved
In Reality – Four cases for ORC Reader/Writer
1. `native` Writer + `native` Reader: New Data, New Apps → Best performance (Vectorized Reader)
2. `native` Writer + `hive` Reader: New Data, Old Apps → Improved performance (Non-vectorized Reader)
3. `hive` Writer + `native` Reader: Old Data, New Apps → Improved performance (Vectorized Reader)
4. `hive` Writer + `hive` Reader: Old Data, Old Apps → As-Is performance (Non-vectorized Reader)
Performance – Single column scan from wide tables
[Chart: scan time (ms, 0–1200) vs. number of columns (100/200/300), 1M rows with all-BIGINT columns, one series per writer/reader case above; case 1 (`native`/`native`) is about 4x faster than case 4 (`hive`/`hive`).]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
How to specify `native` OrcFileFormat directly

Create ORC Table:
CREATE TABLE people (name string, age int)
USING org.apache.spark.sql.execution.datasources.orc

Write Dataset:
df.write
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(path)

Read Dataset:
spark.read
  .format("org.apache.spark.sql.execution.datasources.orc")
  .load(path)
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)

Create ORC Table:
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')

Read/Write Dataset:
spark.read.orc(path)
df.write.orc(path)

spark.read.format("orc").load(path)
df.write.format("orc").save(path)
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)

Read/Write Structured Stream:
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)

df.writeStream
  .option("checkpointLocation", path1)
  .format("orc")
  .option("path", path2)
  .start()
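A self-contained sketch of the streaming path (assumptions: SparkSession `spark`, illustrative paths; a streaming ORC source needs an explicit schema unless spark.sql.streaming.schemaInference is enabled):

import org.apache.spark.sql.types._

// Assumed schema of the incoming ORC files.
val schema = new StructType().add("name", StringType).add("age", IntegerType)

val stream = spark.readStream.schema(schema).orc("/tmp/orc_in")

val query = stream.writeStream
  .option("checkpointLocation", "/tmp/orc_ckpt")  // required for file sinks
  .format("orc")
  .option("path", "/tmp/orc_out")
  .start()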
ORC Readers with `spark.sql.` configurations

Reader selection (flattened from the slide’s decision diagram):
• spark.sql.orc.impl=hive → `hive` ORC Reader
• spark.sql.orc.impl=native and spark.sql.orc.enableVectorizedReader=false → `native` ORC Record Reader
• spark.sql.orc.impl=native and spark.sql.orc.enableVectorizedReader=true:
  • all columns are atomic types and # of cols <= spark.sql.codegen.maxFields (default: 100) → `native` ORC Columnar Batch Reader
  • otherwise → `native` ORC Record Reader
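For example, these session settings (a sketch; the property names are the ones shown above) steer selection toward the Columnar Batch Reader:

// Sketch: request the `native` ORC Columnar Batch Reader.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
// The batch reader is still used only for an all-atomic-type schema with
// at most spark.sql.codegen.maxFields (default: 100) columns.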
ORC Readers with `spark.sql.` configurations – Cont.

Inside the `native` ORC Columnar Batch Reader:
• spark.sql.orc.copyBatchToSpark=false → Wrapping: ORC ColumnVector → Spark OrcColumnVector
• spark.sql.orc.copyBatchToSpark=true and spark.sql.columnVector.offheap.enabled=false → Copying: ORC ColumnVector → Spark OnHeapColumnVector
• spark.sql.orc.copyBatchToSpark=true and spark.sql.columnVector.offheap.enabled=true → Copying: ORC ColumnVector → Spark OffHeapColumnVector
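A sketch of the corresponding settings (the wrap-vs-copy choice only matters once the Columnar Batch Reader is selected):

// Sketch: wrap ORC's vectors directly (no copy) ...
spark.conf.set("spark.sql.orc.copyBatchToSpark", "false")

// ... or copy each batch into Spark-managed vectors, optionally off-heap.
spark.conf.set("spark.sql.orc.copyBatchToSpark", "true")
spark.conf.set("spark.sql.columnVector.offheap.enabled", "true")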
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
• `spark.sql.orc.impl=native` is required, too.

CREATE TABLE people (name string, age int)
STORED AS ORC

CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
(SPARK-23355)
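Put together, a minimal sketch (assuming a Hive-enabled SparkSession named `spark`):

// Sketch: enable vectorized reads over a Hive ORC table.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

spark.sql("CREATE TABLE people (name string, age int) STORED AS ORC")
spark.sql("SELECT * FROM people").show()  // served by the `native` vectorized path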
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
• Before SPARK-21929, users had to drop and recreate an ORC table with an updated schema
• A user-defined schema avoids schema inference cost and handles upcasting:
• boolean -> byte -> short -> int -> long
• float -> double

spark.read.schema("col1 int").orc(path)
spark.read.schema("col1 long, col2 long").orc(path)
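A small sketch of read-time upcasting (illustrative path; assumes `import spark.implicits._` is available, as in spark-shell):

import spark.implicits._

// Write files whose physical column type is int ...
Seq((1, 10), (2, 20)).toDF("col1", "col2").write.mode("overwrite").orc("/tmp/evolve")

// ... then read them back with a wider, user-defined schema.
spark.read.schema("col1 long, col2 long").orc("/tmp/evolve").printSchema()
// col1 and col2 are now read as long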
Schema evolution at reading file-based data sources – Cont.

| Schema change         | TEXT | CSV | JSON | ORC `hive` | ORC `native`¹ | PARQUET |
|-----------------------|------|-----|------|------------|---------------|---------|
| Add Column At The End |      | ✔️  | ✔️   | ✔️         | ✔️            | ✔️      |
| Hide Trailing Column  |      | ✔️  | ✔️   | ✔️         | ✔️            | ✔️      |
| Hide Column           |      |     | ✔️   |            | ✔️            | ✔️      |
| Change Type²          |      | ✔️  | ✔️³  |            | ✔️            |         |
| Change Position       |      |     | ✔️   |            | ✔️            | ✔️      |

1. Native Vectorized ORC Reader
2. Only safe changes via upcasting
3. JSON is the most flexible for changing types
Demo 1
ORC configuration
Demo 2
PySpark with ORC
Performance
Micro Benchmark
• Target
• Apache Spark 2.3.0
• Apache ORC 1.4.1
• Machine
• MacBook Pro (Mid 2015)
• Intel® Core™ i7-4770HQ CPU @ 2.20GHz
• Mac OS X 10.13.4
• JDK 1.8.0_161
Performance – Single column scan from wide tables
[Chart: scan time (ms, 0–1200) vs. number of columns (100/200/300), 1M rows with all-BIGINT columns; `native` writer / `native` reader is about 4x faster than `hive` writer / `hive` reader.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Vectorized Read
[Chart: scan time (ms, 0–2500) for a 15M-row single-column table, per type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); `native` is roughly 5x–11x faster than `hive`, depending on the type.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Partitioned table read
[Chart: scan time (ms, 0–2500) for 15M rows in a partitioned table, reading the data column, the partition column, and both; `native` is about 7x–21x faster than `hive`, with the largest gain on the partition-column-only scan.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Predicate Pushdown
[Chart: query time (ms, 0–18000) for 15M rows with 5 data columns and 1 sequential id column, comparing `parquet` and `native` ORC when selecting 10%, 50%, 90% (id < value), and all rows (id IS NOT NULL).]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
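To reproduce this comparison, ORC predicate pushdown must be enabled explicitly in Spark 2.3 (it is off by default; SPARK-21783 tracks flipping the default). A sketch with an illustrative path:

// Sketch: enable ORC predicate pushdown, then filter on the id column.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.read.orc("/tmp/orc_table").filter("id < 1500000").count()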
Limitation
Future Roadmap
Limitation
• Spark vectorization supports atomic types only
• Only limited, simple schema evolution; JSON provides more:
• boolean -> byte -> short -> int -> long
• float -> double
• `convertMetastore` ignores `STORED AS` table properties (SPARK-23355); affects both ORC and Parquet
Future Roadmap – Apache Spark 2.4 (2018 Fall)
• Feature Parity for ORC with Parquet (SPARK-20901)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
Future Roadmap – On-going work
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Support CHAR/VARCHAR Types
• Vectorized Writer with DataSource V2
• ALTER TABLE … CHANGE column type (SPARK-18727)
Summary
• Apache Spark 2.3 starts to take advantage of Apache ORC
• Native vectorized ORC reader
• boosts Spark ORC performance
• provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
Reference
• https://youtu.be/ZVSD9EsQl-8, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
Questions?
Thank you
