Apache Spark 2.3, released on February 2018, is the fourth release in 2.x line and has a lot of new improvements. One of the notable improvements is ORC support. Apache Spark 2.3 adds a native ORC file format implementation by using the latest Apache ORC 1.4.1. Users can switch between “native” and “hive” ORC file formats. Hive ORC file format is the existing one until Spark 2.2.
In this talk, I'll talk about three key changes. First of all, performance. New native ORC implementation is faster 2x - 11x times on 10TB TPCDS benchmark. Vectorized query execution over ORC files improves Spark ORC query execution greatly. Especially, ORC filter pushdown can be faster than Parquet due to in-file indexes. Second, as a part of native ORC support, Spark 2.3 can convert the Hive ORC tables into Spark ORC data sources automatically. This solves several existing ORC issues and Spark 2.4 will enable it by default. Last, but not least, Spark 2.3 officially supports structural streaming over ORC data sources. You can create a streaming dataset over ORC files.
Dongjoon Hyun, Staff Software Engineer, Hortonworks