
Performance Update: When Apache ORC Met Apache Spark

Apache Spark 1.4 introduced support for Apache ORC. However, it initially did not take advantage of the full power of ORC. For instance, it was slow because ORC vectorization was not used, and predicate push-down was not supported on DATE types. Recently the Apache Spark community has started to use the latest Apache ORC, which includes new enhancements to address these limitations. In this talk, we show the result of integrating the latest Apache ORC with Apache Spark. We will also review the latest enhancements and the roadmap.

Speakers:
Owen O'Malley, Co-founder & Technical Fellow, Hortonworks
Dongjoon Hyun, Staff Software Engineer, Hortonworks

Note: This is an updated presentation from what was presented at DataWorks Summit Sydney, correcting a few minor errors.


  1. Performance Update: When Apache ORC Met Apache Spark
     Dongjoon Hyun - dhyun@hortonworks.com
     Owen O'Malley - owen@hortonworks.com, @owen_omalley
  2. Agenda
     © Hortonworks Inc. 2011 – 2016. All Rights Reserved
     – Spark – Apache ORC 1.4 Integration
     – Benchmark Overview
     – Results
     – Roadmap
  3. ORC History
     – Originally released as part of Hive
       • Released in Hive 0.11.0 (2013-05-16)
       • Included in each release before Hive 2.3.0 (2017-07-17)
     – Factored out of Hive
       • Improves integration with other tools
       • Shrinks the size of the dependencies
       • Releases faster than Hive
       • Added a C++ reader and a new C++ writer
  4. Integration History
     – Before Apache ORC: Hive 1.2.1 (2015-06-27)
       • SPARK-2883: since Spark 1.4, Hive's ORC implementation is used.
     – After Apache ORC: v1.0.0 (2016-01-25), v1.1.0 (2016-06-10), v1.2.0 (2016-08-25), v1.3.0 (2017-01-23), v1.4.0 (2017-05-08)
       • SPARK-21422: for Spark 2.3, an Apache ORC dependency is added and used.
  5. Integration Design (in HDP 2.6.3 with Spark 2.2)
     – Switch between the old and new formats by configuration
     – [Class diagram] FileFormat (o.a.s.sql.execution.datasources) is implemented by TextBasedFileFormat (CSVFileFormat, JsonFileFormat, TextFileFormat), ParquetFileFormat, OrcFileFormat, and HiveFileFormat; LibSVMFileFormat lives in o.a.s.ml.source.libsvm, and the old OrcFileFormat in o.a.s.sql.hive.orc
     – Old OrcFileFormat: from Hive 1.2.1 (SPARK-2883)
     – New OrcFileFormat: with ORC 1.4.0 (SPARK-21422, SPARK-20682, SPARK-20728)
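Besides the configuration-driven switch, either implementation can be selected explicitly by naming its fully qualified class in the DataFrame reader. A minimal sketch, assuming a running SparkSession on a build that ships both classes (the path `/tmp/people.orc` is illustrative):

```scala
// Read the same ORC data through either implementation by passing the
// fully qualified FileFormat class name to format().

// New OrcFileFormat backed by Apache ORC 1.4 (SPARK-21422, SPARK-20682)
val dfNew = spark.read
  .format("org.apache.spark.sql.execution.datasources.orc.OrcFileFormat")
  .load("/tmp/people.orc")

// Old OrcFileFormat backed by Hive 1.2.1 (SPARK-2883)
val dfOld = spark.read
  .format("org.apache.spark.sql.hive.orc.OrcFileFormat")
  .load("/tmp/people.orc")
```

In normal use, the short name `orc` plus the configuration flags on the next slides decide which class is resolved; the explicit form above is mainly useful for side-by-side comparison.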
  6. Benefits in Apache Spark
     – Speed: use Spark ColumnarBatch and ORC RowBatch together more seamlessly
     – Stability: Apache ORC 1.4.0 has many fixes, and Spark can rely more on the ORC community
     – Usability: users can read ORC data sources without the Hive module (i.e., without -Phive)
     – Maintainability: reduces the Hive dependency and allows removing legacy code later
  7. Faster and More Scalable
     [Chart: "Single Column Scan from Wide Tables" (1M rows with all BIGINT columns) - time (ms) vs. number of columns, OLD vs. NEW]
     [Chart: "Vectorized Read" (15M rows in a single-column table) - time (ms) for TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE; OLD vs. NEW]
  8. Support Matrix
     – HDP 2.6.3 is going to add a new, faster, and stable ORC file format for ORC tables, with a subset of the limitations of Apache Spark 2.2.
     – The following are not supported yet:
       • Zero-byte ORC files
       • Schema evolution: adding columns at the end, changing types, and deleting columns
     – Please see the full JIRA issue list in the next slides.
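As an illustration of the schema-evolution limitation, files written before a column was added at the end cannot yet be read back through the new path. A hypothetical sketch (the table name `t` and column names are illustrative, and the failing step depends on the exact build; see SPARK-21929 and SPARK-18355 in the ticket slides):

```scala
// Hypothetical illustration of the unsupported schema-evolution case:
// the first file on disk was written with schema (a) only.
spark.sql("CREATE TABLE t (a INT) USING ORC")
spark.sql("INSERT INTO t VALUES (1)")

// Adding a column at the end, then reading the old file back, is the
// scenario that is not supported yet in this integration.
spark.sql("ALTER TABLE t ADD COLUMNS (b INT)")
spark.table("t").show()
```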
  9. Done Tickets
     – SPARK-20901 Feature parity for ORC with Parquet (the parent ticket of all ORC issues)
     – SPARK-20566 ColumnVector should support `appendFloats` for array
     – SPARK-21422 Depend on Apache ORC 1.4.0
     – SPARK-21831 Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
     – SPARK-21839 Support SQL config for ORC compression
     – SPARK-21884 Fix StackOverflowError on MetadataOnlyQuery
     – SPARK-21912 ORC/Parquet table should not create invalid column names
  10. On-Going Tickets
      – SPARK-20682 Support a new faster ORC data source based on Apache ORC
      – SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core
      – SPARK-16060 Vectorized ORC reader
      – SPARK-21791 ORC should support column names with dot
      – SPARK-21787 Support for pushing down filters for DATE types in ORC
      – SPARK-19809 Zero-byte ORC file support
      – SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
  11. To-Do Tickets
      – Configuration
        • SPARK-21783 Turn on ORC filter push-down by default
      – Read
        • SPARK-11412 Support merge schema for ORC
        • SPARK-16628 OrcConversions should not convert an ORC table
      – Alter
        • SPARK-21929 Support `ALTER TABLE ADD COLUMNS(..)` for ORC data source
        • SPARK-18355 Spark SQL fails to read from an ORC table with a new column
      – Write
        • SPARK-12417 ORC bloom filter options are not propagated during file write
  12. Agenda: Spark – Apache ORC 1.4 Integration, Benchmark Overview, Results, Roadmap
  13. Benchmark Overview
      – Objective: compare the performance of the new ORC vs. the old ORC in Spark
      – Workload: TPC-DS at different scales, 1 TB and 10 TB
  14. Enabling the New ORC in Spark
      – spark.sql.hive.convertMetastoreOrc=true
      – spark.sql.orc.enabled=true
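For example, the two flags can be set once in `spark-defaults.conf`; a sketch, assuming an HDP 2.6.3 build that ships both implementations:

```properties
# conf/spark-defaults.conf
# Route Hive metastore ORC tables through the data source path
spark.sql.hive.convertMetastoreOrc  true
# Select the new ORC 1.4-based OrcFileFormat instead of the Hive 1.2.1 one
spark.sql.orc.enabled               true
```

They can equivalently be set per session, e.g. `spark.conf.set("spark.sql.orc.enabled", "true")`, or passed at submit time with `--conf`.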
  15. Software/Hardware/Cluster Details
      – Software: internal branch of Spark 2.2 (in HDP)
      – Node configuration: 32 CPUs (Intel E5-2640, 2.00 GHz), 256 GB RAM, 6 SATA disks (~4 TB, 7200 rpm), 10 Gbps network card
      – Cluster: 10 nodes used for 1 TB, 15 nodes used for 10 TB
      – DOUBLE type used in the data; TPC-DS (1.4) queries in Spark used for benchmarking
  16. Spark Tuning

      Spark Parameter Name                   | Default Value             | Tuned Value
      ---------------------------------------|---------------------------|------------
      spark.sql.hive.convertMetastoreOrc     | false                     | true
      spark.sql.orc.enabled                  | false                     | true
      spark.sql.orc.filterPushdown           | false                     | true
      spark.sql.statistics.fallBackToHdfs    | false                     | true
      spark.sql.autoBroadcastJoinThreshold   | 10L * 1024 * 1024 (10 MB) | 26214400
      spark.shuffle.io.numConnectionsPerPeer | 1                         | 10
      spark.io.compression.lz4.blockSize     | 32k                       | 128kb
      spark.sql.shuffle.partitions           | 200                       | 300
      spark.network.timeout                  | 120s                      | 600s
      spark.locality.wait                    | 3s                        | 0s

      Hive Parameter Name (hive-site.xml)*   | Default Value             | Tuned Value
      ---------------------------------------|---------------------------|------------
      hive.exec.max.dynamic.partitions       | 1000                      | 10000
      hive.exec.dynamic.partition.mode       | strict                    | nonstrict

      *mainly for data ingestion
      *all tables analyzed with the `noscan` option for getting basic stats
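The tuned Spark values from the table above can be captured in one place as a config fragment; a sketch of the corresponding `spark-defaults.conf` entries (values taken directly from the table, not additional recommendations):

```properties
# conf/spark-defaults.conf -- tuned values used in this benchmark
spark.sql.hive.convertMetastoreOrc      true
spark.sql.orc.enabled                   true
spark.sql.orc.filterPushdown            true
spark.sql.statistics.fallBackToHdfs     true
spark.sql.autoBroadcastJoinThreshold    26214400
spark.shuffle.io.numConnectionsPerPeer  10
spark.io.compression.lz4.blockSize      128kb
spark.sql.shuffle.partitions            300
spark.network.timeout                   600s
spark.locality.wait                     0s
```

The two Hive parameters go in `hive-site.xml` instead, and per the footnote were mainly needed for data ingestion.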
  17. Agenda: Spark – Apache ORC 1.4 Integration, Benchmark Overview, Results, Roadmap
  18. Results: 10 TB Scale (15 executors, 170 GB, 25 cores)
      – Total runtime of 74 TPC-DS queries: the new ORC is more than 2x better than the old ORC
  19. Results: 10 TB Scale (15 executors, 170 GB, 25 cores) – New ORC vs. Old ORC
      – The new ORC consistently and significantly outperforms the old ORC in all queries
  20. Results: 1 TB Scale (10 executors, 170 GB, 25 cores)
      – Total runtime of 97 TPC-DS queries at 1 TB scale: the new ORC is ~2x faster than the old ORC
  21. – The profiler shows high CPU usage in libzip.so during ORC reads
      – ORC-175 (Intel ISA-L: Intelligent Storage Acceleration Library for inflate) can be used to improve this further
  22. Comparison of New ORC vs. Old ORC vs. Parquet (10 TB Scale)
      – Total runtime of 74 TPC-DS queries at 10 TB scale (15 executors, 170 GB, 25 cores)
  23. Agenda: Spark – Apache ORC 1.4 Integration, Benchmark Overview, Results, Roadmap
  24. Roadmap
      – Tune default Ambari configurations for Spark based on these benchmark results for a better out-of-the-box experience for end users
      – Additional enhancements, such as parallel reading of footers in ORC
      – Include ORC-175 (Intel ISA-L) when complete
      – Contribute the fixes back to the community
  25. Thanks! Questions:
      – For ORC: dev@orc.apache.org
      – For Spark: dev@spark.apache.org
      – Benchmarks: dhyun@hortonworks.com
  26. Results: 10 TB Scale (15 executors, 170 GB, 25 cores) – Comparison of New ORC vs. Parquet
