© Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement in Apache Spark 2.3
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
April 2018
Dongjoon Hyun
• Hortonworks
• Principal Software Engineer @ Data Science Team
• Apache Project
• Apache REEF Project Management Committee (PMC) Member & Committer
• Apache Spark Project Contributor
• GitHub
• https://github.com/dongjoon-hyun
Agenda
• What’s New in Apache Spark 2.3
• Previous ORC issues in Apache Spark
• Current Approach & Demo
• Performance & Limitation
• Future roadmap
What’s New in Apache Spark 2.3

Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2

Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark’s file-based data sources
• TEXT: the simplest one, with a single string-column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes
• PARQUET: the only one with a vectorized reader
• ORC: popular for shared Hive tables
Motivation
• TEXT: the simplest one, with a single string-column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes → Flexible
• PARQUET: the only one with a vectorized reader → Fast
• ORC: popular for shared Hive tables → Hive Table Access

Goal: make ORC fast and flexible, too, while keeping its Hive table access.
Previous ORC Issues in Spark
Background – The history of Spark and ORC
• Before Apache ORC
• Hive 1.2.1 (2015 JUN) → SPARK-2883 (Hive ORC has been used since Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
• v1.1.0 (2016 JUN)
• v1.2.0 (2016 AUG)
• v1.3.0 (2017 JAN)
• v1.4.0 (2017 MAY) → SPARK-21422 (Apache ORC added in Spark 2.3)
• v1.4.1 (2017 OCT) → SPARK-22300
• v1.4.3 (2018 FEB) → SPARK-23340 (Spark 2.4)
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
Issues with ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Fix column names in FileSinkOperator
• HIVE_12055 (2015) Create row-by-row shims for the write path
• HIVE_13083 (2016) Writing HiveDecimal can wrongly suppress present stream
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
Issues with performance
• Vectorized ORC Reader (SPARK-16060)
• Fast partition-column-only reads (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
Issues with structured streaming
• Write (SPARK-15474): `FileNotFoundException` when writing empty partitions as ORC
• Read (SPARK-22781): cannot create a structured stream over ORC files
  spark.readStream.orc(path)
Issues with column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
Issues with Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
• Introduced in Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
• Returned wrong results if the ORC file schema order differs from the Hive MetaStore schema order
• `convertMetastore` ignores storage properties (SPARK-22158, fixed in 2.2.1)
• `convertMetastoreOrc` was introduced in Spark 2.0, but it had several issues.
Issues with robustness
• ORC metadata exceeds the ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
• `FileNotFound` for file names with special characters (SPARK-22146, fixed in 2.2.1)
Current Approach
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)

FileFormat class hierarchy (from the slide diagram):
FileFormat
├─ TextBasedFileFormat
│  ├─ TextFileFormat
│  ├─ CSVFileFormat
│  ├─ JsonFileFormat
│  └─ LibSVMFileFormat (o.a.s.ml.source.libsvm)
├─ ParquetFileFormat
├─ OrcFileFormat (o.a.s.sql.execution.datasources): the new `native` OrcFileFormat, with ORC 1.4.3
├─ OrcFileFormat (o.a.s.sql.hive.orc): the existing `hive` OrcFileFormat, from Hive 1.2.1
└─ HiveFileFormat
18 © Hortonworks Inc. 2011–2018. All rights reserved
In Reality – Four cases for ORC Reader/Writer
1. `native` Writer + `native` Reader: New Data, New Apps → Best performance (Vectorized Reader)
2. `native` Writer + `hive` Reader: New Data, Old Apps → Improved performance (Non-vectorized Reader)
3. `hive` Writer + `native` Reader: Old Data, New Apps → Improved performance (Vectorized Reader)
4. `hive` Writer + `hive` Reader: Old Data, Old Apps → As-Is performance (Non-vectorized Reader)
Performance – Single column scan from wide tables
[Chart: scan time (ms, 0–1200) vs. number of columns (100/200/300), 1M rows with all-BIGINT columns, one series per writer/reader case above; case 1 (`native`/`native`) is about 4x faster than case 4 (`hive`/`hive`).]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
How to specify `native` OrcFileFormat directly

Create ORC Table:
CREATE TABLE people (name string, age int)
USING org.apache.spark.sql.execution.datasources.orc

Write Dataset:
df.write
  .format("org.apache.spark.sql.execution.datasources.orc")
  .save(path)

Read Dataset:
spark.read
  .format("org.apache.spark.sql.execution.datasources.orc")
  .load(path)
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)

Create ORC Table:
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')

Read/Write Dataset:
spark.read.orc(path)
df.write.orc(path)

spark.read.format("orc").load(path)
df.write.format("orc").save(path)
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)

Read/Write Structured Stream:
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)

df.writeStream
  .option("checkpointLocation", path1)
  .format("orc")
  .option("path", path2)
  .start()
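A self-contained sketch of the streaming path (assumptions: SparkSession `spark`, illustrative paths; a streaming ORC source needs an explicit schema unless spark.sql.streaming.schemaInference is enabled):

import org.apache.spark.sql.types._

// Assumed schema of the incoming ORC files.
val schema = new StructType().add("name", StringType).add("age", IntegerType)

val stream = spark.readStream.schema(schema).orc("/tmp/orc_in")

val query = stream.writeStream
  .option("checkpointLocation", "/tmp/orc_ckpt")  // required for file sinks
  .format("orc")
  .option("path", "/tmp/orc_out")
  .start()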
ORC Readers with `spark.sql.` configurations

Reader selection (flattened from the slide’s decision diagram):
• spark.sql.orc.impl=hive → `hive` ORC Reader
• spark.sql.orc.impl=native and spark.sql.orc.enableVectorizedReader=false → `native` ORC Record Reader
• spark.sql.orc.impl=native and spark.sql.orc.enableVectorizedReader=true:
  • all columns are atomic types and # of cols <= spark.sql.codegen.maxFields (default: 100) → `native` ORC Columnar Batch Reader
  • otherwise → `native` ORC Record Reader
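For example, these session settings (a sketch; the property names are the ones shown above) steer selection toward the Columnar Batch Reader:

// Sketch: request the `native` ORC Columnar Batch Reader.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
// The batch reader is still used only for an all-atomic-type schema with
// at most spark.sql.codegen.maxFields (default: 100) columns.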
ORC Readers with `spark.sql.` configurations – Cont.

Inside the `native` ORC Columnar Batch Reader:
• spark.sql.orc.copyBatchToSpark=false → Wrapping: ORC ColumnVector → Spark OrcColumnVector
• spark.sql.orc.copyBatchToSpark=true and spark.sql.columnVector.offheap.enabled=false → Copying: ORC ColumnVector → Spark OnHeapColumnVector
• spark.sql.orc.copyBatchToSpark=true and spark.sql.columnVector.offheap.enabled=true → Copying: ORC ColumnVector → Spark OffHeapColumnVector
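A sketch of the corresponding settings (the wrap-vs-copy choice only matters once the Columnar Batch Reader is selected):

// Sketch: wrap ORC's vectors directly (no copy) ...
spark.conf.set("spark.sql.orc.copyBatchToSpark", "false")

// ... or copy each batch into Spark-managed vectors, optionally off-heap.
spark.conf.set("spark.sql.orc.copyBatchToSpark", "true")
spark.conf.set("spark.sql.columnVector.offheap.enabled", "true")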
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
• `spark.sql.orc.impl=native` is required, too.

CREATE TABLE people (name string, age int)
STORED AS ORC

CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
(SPARK-23355)
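Put together, a minimal sketch (assuming a Hive-enabled SparkSession named `spark`):

// Sketch: enable vectorized reads over a Hive ORC table.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

spark.sql("CREATE TABLE people (name string, age int) STORED AS ORC")
spark.sql("SELECT * FROM people").show()  // served by the `native` vectorized path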
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
• Before SPARK-21929, users had to drop and recreate an ORC table with an updated schema
• A user-defined schema avoids schema inference cost and handles upcasting:
• boolean -> byte -> short -> int -> long
• float -> double

spark.read.schema("col1 int").orc(path)
spark.read.schema("col1 long, col2 long").orc(path)
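A small sketch of read-time upcasting (illustrative path; assumes `import spark.implicits._` is available, as in spark-shell):

import spark.implicits._

// Write files whose physical column type is int ...
Seq((1, 10), (2, 20)).toDF("col1", "col2").write.mode("overwrite").orc("/tmp/evolve")

// ... then read them back with a wider, user-defined schema.
spark.read.schema("col1 long, col2 long").orc("/tmp/evolve").printSchema()
// col1 and col2 are now read as long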
Schema evolution at reading file-based data sources – Cont.

| Schema change         | TEXT | CSV | JSON | ORC `hive` | ORC `native`¹ | PARQUET |
|-----------------------|------|-----|------|------------|---------------|---------|
| Add Column At The End |      | ✔️  | ✔️   | ✔️         | ✔️            | ✔️      |
| Hide Trailing Column  |      | ✔️  | ✔️   | ✔️         | ✔️            | ✔️      |
| Hide Column           |      |     | ✔️   |            | ✔️            | ✔️      |
| Change Type²          |      | ✔️  | ✔️³  |            | ✔️            |         |
| Change Position       |      |     | ✔️   |            | ✔️            | ✔️      |

1. Native Vectorized ORC Reader
2. Only safe changes via upcasting
3. JSON is the most flexible for changing types
Demo 1
ORC configuration
Demo 2
PySpark with ORC
Performance
Micro Benchmark
• Target
• Apache Spark 2.3.0
• Apache ORC 1.4.1
• Machine
• MacBook Pro (Mid 2015)
• Intel® Core™ i7-4770HQ CPU @ 2.20GHz
• Mac OS X 10.13.4
• JDK 1.8.0_161
Performance – Single column scan from wide tables
[Chart: scan time (ms, 0–1200) vs. number of columns (100/200/300), 1M rows with all-BIGINT columns; `native` writer / `native` reader is about 4x faster than `hive` writer / `hive` reader.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Vectorized Read
[Chart: scan time (ms, 0–2500) for a 15M-row single-column table, per type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); `native` is roughly 5x–11x faster than `hive`, depending on the type.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Partitioned table read
[Chart: scan time (ms, 0–2500) for 15M rows in a partitioned table, reading the data column, the partition column, and both; `native` is about 7x–21x faster than `hive`, with the largest gain on the partition-column-only scan.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Predicate Pushdown
[Chart: query time (ms, 0–18000) for 15M rows with 5 data columns and 1 sequential id column, comparing `parquet` and `native` ORC when selecting 10%, 50%, 90% (id < value), and all rows (id IS NOT NULL).]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
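To reproduce this comparison, ORC predicate pushdown must be enabled explicitly in Spark 2.3 (it is off by default; SPARK-21783 tracks flipping the default). A sketch with an illustrative path:

// Sketch: enable ORC predicate pushdown, then filter on the id column.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.read.orc("/tmp/orc_table").filter("id < 1500000").count()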
Limitation
Future Roadmap
Limitation
• Spark vectorization supports atomic types only
• Only limited, simple schema evolution; JSON provides more:
• boolean -> byte -> short -> int -> long
• float -> double
• `convertMetastore` ignores `STORED AS` table properties (SPARK-23355); affects both ORC and Parquet
Future Roadmap – Apache Spark 2.4 (2018 Fall)
• Feature Parity for ORC with Parquet (SPARK-20901)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
Future Roadmap – On-going work
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Support CHAR/VARCHAR Types
• Vectorized Writer with DataSource V2
• ALTER TABLE … CHANGE column type (SPARK-18727)
Summary
• Apache Spark 2.3 starts to take advantage of Apache ORC
• Native vectorized ORC reader
• boosts Spark ORC performance
• provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
Reference
• https://youtu.be/ZVSD9EsQl-8, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
Questions?
Thank you
