PERFORMANCE UPDATE: WHEN
APACHE ORC MET APACHE SPARK
Dongjoon Hyun - dhyun@hortonworks.com
Owen O'Malley - owen@hortonworks.com
@owen_omalley
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Performance Update
Spark – Apache ORC 1.4 Integration
Benchmark Overview
Results
Roadmap
ORC History
 Originally released as part of Hive
– Released in Hive 0.11.0 (2013-05-16)
– Included in each Hive release until Hive 2.3.0 (2017-07-17)
 Factored out of Hive
– Improve integration with other tools
– Shrink the size of the dependencies
– Releases faster than Hive
– Added C++ reader and new C++ writer
Integration History
 Before Apache ORC
– Hive 1.2.1 (2015-06-27)  SPARK-2883
Since Spark 1.4, the ORC implementation from Hive has been used.
 After Apache ORC
– v1.0.0 (2016-01-25)
– v1.1.0 (2016-06-10)
– v1.2.0 (2016-08-25)
– v1.3.0 (2017-01-23)
– v1.4.0 (2017-05-08)  SPARK-21422
For Spark 2.3, an Apache ORC dependency is added and used.
Integration Design (in HDP 2.6.3 with Spark 2.2)
 Switch between old and new formats by configuration
Class hierarchy (o.a.s.sql.execution.datasources):
– FileFormat
• TextBasedFileFormat: JsonFileFormat, CSVFileFormat, TextFileFormat
• ParquetFileFormat
• OrcFileFormat
• HiveFileFormat
• LibSVMFileFormat (o.a.s.ml.source.libsvm)
Two ORC data sources:
– Old OrcFileFormat in o.a.s.sql.hive.orc, from Hive 1.2.1 (SPARK-2883)
– New OrcFileFormat in o.a.s.sql.execution.datasources, with ORC 1.4.0 (SPARK-21422, SPARK-20682, SPARK-20728)
Benefit in Apache Spark
 Speed
– Use both Spark ColumnarBatch and ORC RowBatch together more seamlessly
 Stability
– Apache ORC 1.4.0 includes many bug fixes, and Spark can rely more on the ORC community.
 Usability
– Users can read and write ORC data sources without the hive module (i.e., the -Phive build profile).
 Maintainability
– Reduces the Hive dependency; old legacy code can be removed later.
Faster and More Scalable
[Chart: Single Column Scan from Wide Tables (1M rows with all BIGINT columns): time (ms) vs. number of columns, OLD vs. NEW]
[Chart: Vectorized Read (15M rows in a single-column table): time (ms) per column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE), OLD vs. NEW]
Support Matrix
 HDP 2.6.3 adds a new, faster, and more stable ORC file format for ORC tables, subject to a subset of the limitations of Apache Spark 2.2.
 The following are not supported yet:
– Zero-byte ORC File
– Schema Evolution
• Adding columns at the end
• Changing types and deleting columns
 Please see the full JIRA issue list on the next slides.
Done Tickets
 SPARK-20901 Feature parity for ORC with Parquet
– The parent ticket of all ORC issues
 Done
– SPARK-20566 ColumnVector should support `appendFloats` for array
– SPARK-21422 Depend on Apache ORC 1.4.0
– SPARK-21831 Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
– SPARK-21839 Support SQL config for ORC compression
– SPARK-21884 Fix StackOverflowError on MetadataOnlyQuery
– SPARK-21912 ORC/Parquet table should not create invalid column names
On-Going Tickets
 SPARK-20682 Support a new faster ORC data source based on Apache ORC
 SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
 SPARK-16060 Vectorized Orc reader
 SPARK-21791 ORC should support column names with dot
 SPARK-21787 Support for pushing down filters for DATE types in ORC
 SPARK-19809 Zero byte ORC file support
 SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
To-do Tickets
 Configuration
– SPARK-21783 Turn on ORC filter push-down by default
 Read
– SPARK-11412 Support merge schema for ORC
– SPARK-16628 OrcConversions should not convert an ORC table
 Alter
– SPARK-21929 Support `ALTER TABLE ADD COLUMNS(..)` for ORC data source
– SPARK-18355 Spark SQL fails to read from an ORC table with a new column
 Write
– SPARK-12417 Orc bloom filter options are not propagated during file write
Benchmark Overview
• Benchmark Objective
• Compare performance of New ORC vs Old ORC in Spark
• With TPC-DS at 1 TB and 10 TB scales
Enabling New ORC in Spark
 spark.sql.hive.convertMetastoreOrc=true
 spark.sql.orc.enabled=true
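A minimal sketch of how these properties might be supplied; the property names are from this slide, but delivering them via spark-defaults.conf (or equivalent --conf flags) is an assumption about the deployment:

```
# spark-defaults.conf (or --conf flags on spark-shell / spark-submit)
spark.sql.hive.convertMetastoreOrc  true
spark.sql.orc.enabled               true
```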
Software/Hardware/Cluster Details
 Software
– Internal branch of Spark 2.2 (in HDP)
 Node Configuration
– 32 CPU, Intel E5-2640, 2.00GHz
– 256 GB RAM
– 6 SATA Disks (~4 TB, 7200 rpm)
– 10 Gbps network card
 Cluster
– 10 nodes used for 1 TB
– 15 nodes used for 10 TB
– Double type used in data
– TPC-DS (1.4) queries in Spark used for benchmarking
Spark Tuning
Spark Parameter Name: Default Value → Tuned Value
– spark.sql.hive.convertMetastoreOrc: false → true
– spark.sql.orc.enabled: false → true
– spark.sql.orc.filterPushdown: false → true
– spark.sql.statistics.fallBackToHdfs: false → true
– spark.sql.autoBroadcastJoinThreshold: 10L * 1024 * 1024 (10 MB) → 26214400
– spark.shuffle.io.numConnectionsPerPeer: 1 → 10
– spark.io.compression.lz4.blockSize: 32k → 128k
– spark.sql.shuffle.partitions: 200 → 300
– spark.network.timeout: 120s → 600s
– spark.locality.wait: 3s → 0s
Hive Parameter Name (hive-site.xml, mainly for data ingestion): Default Value → Tuned Value
– hive.exec.max.dynamic.partitions: 1000 → 10000
– hive.exec.dynamic.partition.mode: strict → nonstrict
*All tables analyzed with the `noscan` option to collect basic stats.
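As a sanity check on the units in the tuning above: the tuned spark.sql.autoBroadcastJoinThreshold of 26214400 bytes works out to exactly 25 MB, versus the 10 MB default (plain arithmetic, independent of Spark):

```python
# Broadcast-join threshold values from the tuning table, in bytes.
default_threshold = 10 * 1024 * 1024  # Spark default: 10 MB
tuned_threshold = 26214400            # tuned value used in the benchmark

assert default_threshold == 10485760
assert tuned_threshold == 25 * 1024 * 1024  # the threshold was raised to 25 MB
print(tuned_threshold // (1024 * 1024))     # → 25
```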
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
* Total runtime of 74 queries in TPC-DS
- New ORC is > 2x faster than old ORC
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
New ORC vs Old ORC
New ORC consistently and significantly outperforms old ORC across all queries
Results : 1 TB Scale (10 executors, 170 GB, 25 cores)
* Total runtime of 97 queries in TPC-DS @ 1 TB scale
• New ORC is ~2x faster than old ORC
*Profiling shows high CPU usage in libzip.so during ORC reads; ORC-175 (Intel ISA-L: Intelligent Storage Acceleration Library, an accelerated inflate) can be used to improve this further.
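The inflate hot spot can be illustrated outside Spark with Python's built-in zlib (a standalone sketch; the sample data and compression level are made up, not taken from the benchmark). ORC's default codec is zlib, so every stripe read pays the inflate cost that dominated the profile:

```python
import zlib

# Fake TPC-DS-like rows standing in for the contents of an ORC stream.
payload = b"store_sales,2451813,3,42,17.35\n" * 100_000

# Compress once at write time with ORC's default zlib codec...
compressed = zlib.compress(payload, level=6)

# ...then every read runs inflate, which is where the CPU time goes.
restored = zlib.decompress(compressed)
assert restored == payload
assert len(compressed) < len(payload)  # zlib earns its CPU cost in space savings
```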
Comparison of New ORC vs Old ORC vs Parquet (10 TB Scale)
* Total runtime of 74 queries in TPC-DS
* 10 TB Scale (15 executors, 170 GB, 25 cores)
Roadmap
 Tune default Ambari configs for Spark based on these benchmark results for a better out-of-the-box (OOTB) experience for end users
 Additional enhancements, such as parallel reading of ORC footers
 Include ORC-175 (Intel ISA-L) when complete
 Contribute the fixes back to the community
Thanks!
 Questions
– For ORC – dev@orc.apache.org
– For Spark – dev@spark.apache.org
– Benchmarks – dhyun@hortonworks.com
Results : 10 TB Scale (15 executors, 170 GB, 25 cores)
Comparison of New ORC vs Parquet
