1 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement & Roadmap
in Apache Spark 2.3 and 2.4
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
June 2018
Dongjoon Hyun
• Hortonworks
− Principal Software Engineer @ Data Science Team
• Apache Project
− Apache REEF Project Management Committee (PMC) Member & Committer
− Apache Spark Project Contributor
• GitHub
− https://github.com/dongjoon-hyun
HDP 2.6.5 (May 2018)
• Apache Spark
− 2.3.0 (2018 FEB)
• Apache ORC
− 1.4.3 (2018 FEB)
• Apache KAFKA
− 1.0.0 (2017 NOV)
Apache Spark 2.3.x
Major Features and Experimental Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark 2.3.0 (and 2.3.1) has 1409 (and 134) JIRA issues.
Spark’s built-in file-based data sources
• TEXT: The simplest one, with a single string column schema
• CSV: Popular for data science workloads
• JSON: The most flexible one for schema changes
• PARQUET: The only one with a vectorized reader
• ORC: Storage-efficient and popular for shared Hive tables
Motivation
• Fast
• Flexible
• Hive Table Access
The story of Spark, ORC, and Hive
• Before Apache ORC
− Hive 1.2.1 (2015 JUN): SPARK-2883; Hive 1.2.1, Spark 1.4
• After Apache ORC
− v1.0.0 (2016 JAN)
− v1.3.3 (2017 FEB): HIVE-15841; Hive 2.3.0 ~ 2.3.3
− v1.4.1 (2017 OCT): SPARK-22300; Spark 2.3.0 (FEB)
− v1.4.3 (2018 FEB): SPARK-23340, HIVE-18674; Hive 3.0 (MAY)
− v1.4.4 (2018 MAY): SPARK-24322; Spark 2.3.1 (JUN)
− v1.5.1 (2018 MAY): SPARK-24576, HIVE-19669; Hive 3.1, Spark 2.4
Previous ORC Issues in Spark
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
Category 1 – ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014): ORC string statistics are not merged correctly
• HIVE_4243 (2015): Use real column names from Hive tables
• HIVE_12055 (2015): Vectorized Writer
• HIVE_13083 (2016): Decimals write present stream correctly
• ORC_101 (2016): Correct the use of the default charset in bloomfilter
• ORC_135 (2018): PPD for timestamp is wrong when reader/writer timezones are different
Category 2 – Performance
• Vectorized ORC Reader (SPARK-16060)
• Fast reading partition-columns (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
Category 3 – Structured streaming
• Write (SPARK-15474)
− `FileNotFoundException` when writing empty partitions as ORC
• Read (SPARK-22781)
− Create a structured stream from ORC files: spark.readStream.orc(path)
Category 4 – Column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
Category 5 – Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
− Introduced in Spark 2.2, but threw `AnalysisException` for ORC
• Support column positional mismatch (SPARK-22267)
− Returned wrong results when the ORC file schema order differed from the Hive Metastore schema order
• Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4)
− For ORC/Parquet Hive tables, `convertMetastore` ignored table properties
Category 6 – Robustness
• ORC metadata exceeds ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
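The last two items can be tried out with a minimal sketch like the one below. It assumes the short names on the slide correspond to the session settings `spark.sql.files.ignoreCorruptFiles` and `spark.sql.files.ignoreMissingFiles`; the object name, application name, and input path are hypothetical, and the Spark SQL library is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object TolerantOrcRead {
  // Settings assumed to correspond to the two options named on the slide.
  val settings: Map[String, String] = Map(
    "spark.sql.files.ignoreCorruptFiles" -> "true",
    "spark.sql.files.ignoreMissingFiles" -> "true"
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tolerant-orc-read") // hypothetical name
      .master("local[*]")
      .getOrCreate()

    // Apply the settings to the running session.
    settings.foreach { case (key, value) => spark.conf.set(key, value) }

    // Hypothetical input directory; pass an existing one as the first argument.
    val path = args.headOption.getOrElse("/tmp/orc-input")
    val df = spark.read.orc(path)
    println(s"rows read: ${df.count()}")

    spark.stop()
  }
}
```

With both settings enabled, unreadable or vanished files are expected to be skipped rather than failing the whole read, so the printed row count may be lower than the number of rows originally written.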
Current Approach
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
− `hive` OrcFileFormat from Hive 1.2.1 (o.a.s.sql.hive.orc)
− `native` OrcFileFormat with ORC 1.4+ (o.a.s.sql.execution.datasources)
[Class diagram: FileFormat with TextBasedFileFormat, ParquetFileFormat, OrcFileFormat, HiveFileFormat, JsonFileFormat, LibSVMFileFormat, CSVFileFormat, and TextFileFormat; packages o.a.s.sql.execution.datasources, o.a.s.ml.source.libsvm, and o.a.s.sql.hive.orc]
In Reality – Four cases for ORC Reader/Writer
• `native` Writer / `native` Reader
− New Data, New Apps
− Best performance (Vectorized Reader)
• `native` Writer / `hive` Reader
− New Data, Old Apps
− Improved performance (Non-vectorized Reader)
• `hive` Writer / `native` Reader
− Old Data, New Apps
− Improved performance (Vectorized Reader)
• `hive` Writer / `hive` Reader
− Old Data, Old Apps
− As-Is performance (Non-vectorized Reader)
Performance – Single column scan from wide tables
[Chart: scan time (ms) against number of columns (100, 200, 300) for 1M rows with all BIGINT columns; series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader; annotated difference: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Dataset
spark.read.orc(path)
df.write.orc(path)
spark.read.format("orc").load(path)
df.write.format("orc").save(path)
Create ORC Table
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')
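As a rough illustration of the switch above, the following sketch sets `spark.sql.orc.impl` to `native` when building the session and then writes and reads back a small Dataset. The object name, application name, temporary directory, and sample rows are hypothetical, and the Spark SQL library is assumed to be on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object NativeOrcRoundTrip {
  /** Writes two rows as ORC with the native implementation and returns the row count read back. */
  def roundTrip(): Long = {
    val spark = SparkSession.builder()
      .appName("native-orc-round-trip") // hypothetical name
      .master("local[*]")
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical scratch location inside a fresh temporary directory.
    val path = Files.createTempDirectory("orc-demo").resolve("people").toString

    val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
    people.write.orc(path)

    val count = spark.read.orc(path).count()
    spark.stop()
    count
  }

  def main(args: Array[String]): Unit =
    println(s"rows read back: ${roundTrip()}")
}
```

The same setting can also be changed on an existing session with `spark.conf.set("spark.sql.orc.impl", "native")`.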
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Structured Stream
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)
df.writeStream
  .option("checkpointLocation", path1)
  .format("orc")
  .option("path", path2)
  .start()
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
− `spark.sql.orc.impl=native` is required, too.
CREATE TABLE people (name string, age int)
STORED AS ORC
CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
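The two settings above might be combined as in the sketch below. It is only an outline under stated assumptions: the `spark-hive` module is on the classpath, a local metastore can be created in the working directory, and the object name, application name, and the table name `people_orc_demo` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveOrcVectorizedRead {
  // Both settings named on the slide; the second requires the first.
  val settings: Map[String, String] = Map(
    "spark.sql.orc.impl" -> "native",
    "spark.sql.hive.convertMetastoreOrc" -> "true"
  )

  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder()
      .appName("hive-orc-vectorized-read") // hypothetical name
      .master("local[*]")
      .enableHiveSupport()
    // Fold the settings into the builder before the session is created.
    val spark = settings
      .foldLeft(builder) { case (b, (key, value)) => b.config(key, value) }
      .getOrCreate()

    // Hypothetical Hive ORC table used only for this illustration.
    spark.sql("CREATE TABLE IF NOT EXISTS people_orc_demo (name string, age int) STORED AS ORC")
    spark.sql("INSERT INTO people_orc_demo VALUES ('Alice', 30)")
    spark.sql("SELECT name FROM people_orc_demo WHERE age > 20").show()

    spark.stop()
  }
}
```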
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
− Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema.
• A user-defined schema reduces schema inference cost and handles upcasting
− boolean -> byte -> short -> int -> long
− float -> double
Old Data: spark.read.schema("col1 int").orc(path)
New Data: spark.read.schema("col1 long, col2 long").orc(path)
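The upcasting rule can be pictured with a small, Spark-independent sketch. It only encodes the two widening chains exactly as listed on the slide; it is not Spark's actual implementation, and the object and method names are hypothetical.

```scala
object SafeUpcast {
  // Widening chains copied from the slide, narrowest type first.
  private val integral = Vector("boolean", "byte", "short", "int", "long")
  private val floating = Vector("float", "double")

  /** True when `to` is the same as, or wider than, `from` within one chain. */
  def isSafe(from: String, to: String): Boolean =
    Seq(integral, floating).exists { chain =>
      val i = chain.indexOf(from)
      val j = chain.indexOf(to)
      i >= 0 && j >= i
    }
}
```

Under this model, reading an old `int` column with a `long` schema is accepted, while narrowing (`long` to `int`) or crossing chains (`int` to `double`) is rejected.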
Schema evolution at reading file-based data sources – Cont.
File formats compared: TEXT, CSV, JSON, ORC `hive`, ORC `native` (1), PARQUET
• Add Column At The End: CSV, JSON, ORC `hive`, ORC `native`, PARQUET
• Hide Trailing Column: CSV, JSON, ORC `hive`, ORC `native`, PARQUET
• Hide Column: JSON, ORC `native`, PARQUET
• Change Column Type (2): CSV, JSON (3), ORC `native`
• Change Column Position: JSON, ORC `native`, PARQUET
1. Native Vectorized ORC Reader
2. Only safe change via upcasting
3. JSON is the most flexible for changing types
Performance
Micro Benchmark (Apache Spark 2.3.0)
• Target
− Apache Spark 2.3.0
− Apache ORC 1.4.1
• Machine
− MacBook Pro (2015 Mid)
− Intel® Core™ i7-4770HQ CPU @ 2.20GHz
− Mac OS X 10.13.4
− JDK 1.8.0_161
Performance – Single column scan from wide tables
[Chart: scan time (ms) against number of columns (100, 200, 300) for 1M rows with all BIGINT columns; series: native writer / native reader and hive writer / hive reader; annotated difference: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Vectorized Read
[Chart: read time (ms) for 15M rows in a single-column table, `native` vs. `hive`, by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); annotated differences: 10x, 5x, 11x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Partitioned table read
[Chart: read time (ms) for 15M rows in a partitioned table, `native` vs. `hive`, reading the data column, the partition column, and both columns; annotated differences: 21x, 7x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Predicate Pushdown
[Chart: time (ms), parquet vs. `native` ORC, for 15M rows with 5 data columns and 1 sequential id column; queries: select 10%, 50%, and 90% of rows (id < value) and select all rows (id IS NOT NULL)]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
Demo
Support Matrix
Future Roadmap
Support Matrix
• Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5.
• HDP 2.6.3~4: TP for ORC on Spark
− Spark 2.2, ORC N/A
− spark.sql.orc.enabled=true
− spark.sql.orc.char.enabled=true
• HDP 2.6.5: GA for ORC on Spark
− Spark 2.3.0+, ORC 1.4.3
− spark.sql.orc.impl=native
• HDP 3.0 EA (1): Early Access
− Spark 2.3.1+, ORC 1.4.3+
− spark.sql.orc.impl=native
1. https://hortonworks.com/info/early-access-hdp-3-0/
Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall)
Umbrella Issue
• Feature Parity for ORC with Parquet (SPARK-20901)
Sub issues
• Upgrade Apache ORC to 1.5.1 (SPARK-24576)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Support table properties with `convertMetastoreOrc/Parquet` (SPARK-23355)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
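Until these defaults change, the native implementation and predicate pushdown can be switched on explicitly. The sketch below assumes the pushdown item refers to the setting `spark.sql.orc.filterPushdown`; the object name, application name, and temporary path are hypothetical, and the Spark SQL library is assumed to be on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object OrcPushdownSketch {
  /** Writes ids 0 until 1000 as ORC and counts the rows with id below `limit`. */
  def countBelow(limit: Long): Long = {
    val spark = SparkSession.builder()
      .appName("orc-pushdown-sketch") // hypothetical name
      .master("local[*]")
      .config("spark.sql.orc.impl", "native")
      .config("spark.sql.orc.filterPushdown", "true") // assumed setting name
      .getOrCreate()

    // Hypothetical scratch location inside a fresh temporary directory.
    val path = Files.createTempDirectory("orc-ppd").resolve("ids").toString
    spark.range(0, 1000).write.orc(path)

    val selected = spark.read.orc(path).filter(s"id < $limit")
    selected.explain() // prints the physical plan, including any pushed filters
    val count = selected.count()

    spark.stop()
    count
  }

  def main(args: Array[String]): Unit =
    println(s"rows selected: ${countBelow(100L)}")
}
```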
Future Roadmap – On-going work
• ORC Column-level encryption (with ORC 1.6)
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Vectorized Writer with DataSource V2
• Support CHAR/VARCHAR Types
• ALTER TABLE … CHANGE column type (SPARK-18727)
Summary
• Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC
− Improved feature parity between Spark and Hive
• Native vectorized ORC reader
− boosts Spark ORC performance
− provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
Reference
• https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin
• https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
Questions?
Thank you

 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 


ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4

  • 1. ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4. Dongjoon Hyun, Principal Software Engineer @ Hortonworks Data Science Team. June 2018. © Hortonworks Inc. 2011–2018. All rights reserved.
  • 2. Dongjoon Hyun • Hortonworks − Principal Software Engineer @ Data Science Team • Apache Project − Apache REEF Project Management Committee (PMC) Member & Committer − Apache Spark Project Contributor • GitHub − https://github.com/dongjoon-hyun
  • 3. HDP 2.6.5 (May 2018) • Apache Spark − 2.3.0 (2018 FEB) • Apache ORC − 1.4.3 (2018 FEB) • Apache Kafka − 1.0.0 (2017 NOV)
  • 4. Apache Spark 2.3.x: Major Features and Experimental Features • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing. Spark 2.3.0 has 1409 JIRA issues and Spark 2.3.1 has 134.
  • 6. Spark’s built-in file-based data sources • TEXT: The simplest one, with a single string column schema • CSV: Popular for data science workloads • JSON: The most flexible one for schema changes • PARQUET: The only one with a vectorized reader • ORC: Storage-efficient and popular for shared Hive tables
  • 7. Motivation • TEXT: The simplest one, with a single string column schema • CSV: Popular for data science workloads • JSON: The most flexible one for schema changes • PARQUET: The only one with a vectorized reader • ORC: Storage-efficient and popular for shared Hive tables. Slide labels: Fast, Flexible, Hive Table Access
  • 8. The story of Spark, ORC, and Hive • Before Apache ORC − Hive 1.2.1 (2015 JUN) → SPARK-2883, Hive 1.2.1, Spark 1.4
  • 9. The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN) → SPARK-2883, Hive 1.2.1, Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB) → HIVE-15841, Hive 2.3.0 ~ 2.3.3
  • 10. The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN) → SPARK-2883, Hive 1.2.1, Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB) → HIVE-15841, Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT) → SPARK-22300, Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB) → SPARK-23340, HIVE-18674, Hive 3.0 (MAY) − v1.4.4 (2018 MAY) → SPARK-24322, Spark 2.3.1 (JUN)
  • 11. The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN) → SPARK-2883, Hive 1.2.1, Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB) → HIVE-15841, Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT) → SPARK-22300, Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB) → SPARK-23340, HIVE-18674, Hive 3.0 (MAY) − v1.4.4 (2018 MAY) → SPARK-24322, Spark 2.3.1 (JUN) − v1.5.1 (2018 MAY) → SPARK-24576, HIVE-19669, Hive 3.1, Spark 2.4
  • 13. Previous ORC Issues in Spark
  • 14. Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
  • 15. Category 1 – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014): ORC string statistics are not merged correctly • HIVE_4243 (2015): Use real column names from Hive tables • HIVE_12055 (2015): Vectorized Writer • HIVE_13083 (2016): Decimals write present stream correctly • ORC_101 (2016): Correct the use of the default charset in bloomfilter • ORC_135 (2018): PPD for timestamp is wrong when reader/writer timezones are different
  • 16. Category 2 – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading partition-columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
  • 17. Category 3 – Structured streaming • Write (SPARK-15474): `FileNotFoundException` at writing empty partitions as ORC • Read (SPARK-22781): Create structured stream with ORC files, e.g. spark.readStream.orc(path)
  • 18. Category 4 – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
  • 19. Category 5 – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) − Introduced in Spark 2.2, but threw AnalysisException for ORC • Support column positional mismatch (SPARK-22267) − Returned wrong results if the ORC file schema order differed from the Hive MetaStore schema order • Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4) − For ORC/Parquet Hive tables, `convertMetastore` ignored table properties
  • 20. Category 6 – Robustness • ORC metadata exceeds ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
  • 21. Current Approach
  • 22. Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) • `hive`: OrcFileFormat from Hive 1.2.1 (o.a.s.sql.hive.orc) • `native`: OrcFileFormat with ORC 1.4+ (o.a.s.sql.execution.datasources) [Diagram: FileFormat class hierarchy showing TextBasedFileFormat, ParquetFileFormat, OrcFileFormat, HiveFileFormat, JsonFileFormat, LibSVMFileFormat (o.a.s.ml.source.libsvm), CSVFileFormat, TextFileFormat]
  • 23. In Reality – Four cases for ORC Reader/Writer • `native` Writer, `native` Reader: New Data, New Apps; best performance (Vectorized Reader) • `native` Writer, `hive` Reader: New Data, Old Apps; improved performance (Non-vectorized Reader) • `hive` Writer, `native` Reader: Old Data, New Apps; improved performance (Vectorized Reader) • `hive` Writer, `hive` Reader: Old Data, Old Apps; as-is performance (Non-vectorized Reader)
  • 24. Performance – Single column scan from wide tables [Chart: time (ms) vs. number of columns (100, 200, 300) for 1M rows with all BIGINT columns; series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader; annotated 4x] Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 25. Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) • Read/Write Dataset: spark.read.orc(path), df.write.orc(path), spark.read.format("orc").load(path), df.write.format("orc").save(path) • Create ORC Table: CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB')
  • 26. Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) • Read/Write Structured Stream: spark.readStream.orc(path), spark.readStream.format("orc").load(path), df.writeStream.option("checkpointLocation", path1).format("orc").option("path", path2).start
  • 27. Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) − `spark.sql.orc.impl=native` is required, too. • CREATE TABLE people (name string, age int) STORED AS ORC • CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
  • 28. Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns − Before SPARK-21929, users had to drop and recreate an ORC table with an updated schema. • A user-defined schema reduces schema inference cost and handles upcasting − boolean -> byte -> short -> int -> long − float -> double • Examples: spark.read.schema("col1 int").orc(path), spark.read.schema("col1 long, col2 long").orc(path) (diagram labels: Old Data, New Data)
  • 29. Schema evolution at reading file-based data sources – Cont. [Table: columns are File Format TEXT, CSV, JSON, ORC `hive`, ORC `native` (1), PARQUET; check marks per row as extracted, column alignment not preserved] • Add Column At The End: ✔️ ✔️ ✔️ ✔️ ✔️ • Hide Trailing Column: ✔️ ✔️ ✔️ ✔️ ✔️ • Hide Column: ✔️ ✔️ ✔️ • Change Column Type (2): ✔️ ✔️ (3) ✔️ • Change Column Position: ✔️ ✔️ ✔️. Notes: (1) Native Vectorized ORC Reader (2) Only safe change via upcasting (3) JSON is the most flexible for changing types
  • 30. Performance
  • 31. Micro Benchmark (Apache Spark 2.3.0) • Target − Apache Spark 2.3.0 − Apache ORC 1.4.1 • Machine − MacBook Pro (2015 Mid) − Intel® Core™ i7-4770HQ CPU @ 2.20GHz − Mac OS X 10.13.4 − JDK 1.8.0_161
  • 32. Performance – Single column scan from wide tables [Chart: time (ms) vs. number of columns (100, 200, 300) for 1M rows with all BIGINT columns; series: native writer / native reader, hive writer / hive reader; annotated 4x] Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 33. Performance – Vectorized Read [Chart: time (ms) by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE) for 15M rows in a single-column table; series: native, hive; annotated 10x, 5x, 11x] Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 34. Performance – Partitioned table read [Chart: time (ms) for Data column, Partition column, Both columns, with 15M rows in a partitioned table; series: native, hive; annotated 21x, 7x] Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 35. Predicate Pushdown [Chart: time (ms) for Select 10% rows (id < value), Select 50% rows (id < value), Select 90% rows (id < value), Select all rows (id IS NOT NULL), with 15M rows, 5 data columns and 1 sequential id column; series: parquet, native] Source: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
  • 36. Demo
  • 37. Support Matrix, Future Roadmap
  • 38. Support Matrix • Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5. • HDP 2.6.3~4: TP for ORC on Spark; Spark 2.2; ORC: N/A; spark.sql.orc.enabled=true; spark.sql.orc.char.enabled=true • HDP 2.6.5: GA for ORC on Spark; Spark 2.3.0+; ORC 1.4.3; spark.sql.orc.impl=native; spark.sql.orc.char.enabled: N/A • HDP 3.0 EA (1): Early Access; Spark 2.3.1+; ORC 1.4.3+; spark.sql.orc.impl=native; spark.sql.orc.char.enabled: N/A. Note (1): https://hortonworks.com/info/early-access-hdp-3-0/
  • 39. Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall) • Umbrella Issue − Feature Parity for ORC with Parquet (SPARK-20901) • Sub issues − Upgrade Apache ORC to 1.5.1 (SPARK-24576) − Use `native` ORC implementation by default (SPARK-23456) − Use ORC predicate pushdown by default (SPARK-21783) − Use `convertMetastoreOrc` by default (SPARK-22279) − Support table properties with `convertMetastoreOrc/Parquet` (SPARK-23355) − Test ORC as default data source format (SPARK-23553) − Test and support Bloom Filters (SPARK-12417)
  • 40. Future Roadmap – On-going work • ORC Column-level encryption (with ORC 1.6) • Support VectorUDT/MatrixUDT (SPARK-22320) • Vectorized Writer with DataSource V2 • Support CHAR/VARCHAR Types • ALTER TABLE … CHANGE column type (SPARK-18727)
  • 41. Summary • Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC − Improved feature parity between Spark and Hive • Native vectorized ORC reader − boosts Spark ORC performance − provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
  • 42. Reference • https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
  • 43. Questions?
  • 44. Thank you
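
As a rough illustration of the `spark.sql.orc.impl=native` setting and the user-defined read schema described in the deck, the Scala sketch below writes a small ORC dataset and reads it back with `age` upcast from int to long. It assumes Spark 2.3 or later on the classpath; the object name `NativeOrcSketch`, the helper `writeThenReadUpcast`, the sample rows, the local master, and the temporary directory are all hypothetical choices made for this example and do not come from the slides.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object NativeOrcSketch {
  // Writes a tiny (name string, age int) dataset as ORC, then reads it back
  // with a user-defined schema that widens age from int to long.
  // Helper name and sample data are illustrative assumptions.
  def writeThenReadUpcast(spark: SparkSession, path: String): Array[(String, Long)] = {
    import spark.implicits._
    Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
      .write.mode("overwrite").orc(path)
    // Supplying the schema skips inference and requests the wider type.
    spark.read.schema("name string, age long").orc(path)
      .as[(String, Long)]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("native-orc-sketch")
      .master("local[*]")                      // assumption: local test run
      .config("spark.sql.orc.impl", "native")  // select the ORC 1.4+ based reader/writer
      .getOrCreate()
    // Assumption: a throwaway temp directory stands in for a real table path.
    val path = Files.createTempDirectory("orc-sketch").resolve("people").toString
    try writeThenReadUpcast(spark, path).foreach(println)
    finally spark.stop()
  }
}
```

The configuration is set on the session builder here for brevity; setting it in `spark-defaults.conf` or via `--conf` on `spark-submit` should have the same effect.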