
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yuanjian Li




Spark SQL is one of the most popular components in the big data warehouse for SQL queries in batch mode, and it allows users to process data from various data sources in a highly efficient way. However, Spark SQL is a general-purpose SQL engine and is not well designed for ad hoc queries. To fulfill such requirements, Intel invented an Apache Spark data source plugin called Spinach, which leverages user-customized indices and fine-grained data cache mechanisms.

To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of a “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach extends the Spark SQL DDL to allow users to define customized indices on a relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the Spark executor process, there is no extra deployment effort: all you need to do is pick Spinach from Spark packages when launching Spark SQL.
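The fine-grained “Fiber” cache described above can be pictured as a cache whose unit is one column chunk of one row group, rather than a whole file. Below is a minimal Python sketch of that idea (class and parameter names are hypothetical, not Spinach's actual API): entries are keyed by (file, row group, column) and evicted in LRU order.

```python
from collections import OrderedDict

class FiberCache:
    """LRU cache keyed by (file, rowgroup, column) - a sketch of
    fine-grained fiber caching, not OAP's real implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, file, rowgroup, column, loader):
        key = (file, rowgroup, column)
        if key in self.entries:
            self.entries.move_to_end(key)       # cache hit: mark recently used
            return self.entries[key]
        data = loader(key)                       # cache miss: load the fiber
        self.entries[key] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return data
```

Because the unit is a single column of a single row group, a query touching few columns only pulls those fibers into memory instead of whole files.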

Spinach has been deployed in Baidu’s production environment since Q4 2016. It has helped several teams migrate their regular data analysis tasks from Hive or MR jobs to ad hoc queries. In Baidu’s search ads system, FengChao, data engineers analyze advertising effectiveness based on several terabytes of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in scenarios of complex searches over large data volumes. It reduces the average search cost from minutes to seconds, while adding only a 3% data size increase for a single index.





  1. OAP: Optimized Analytics Package for Spark Platform. Daoyuan Wang (Intel), Yuanjian Li (Baidu)
  2. Notice and Disclaimers: • Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for a full list of Intel trademarks. • Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. • Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. • No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. • You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. • The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
  3. About us Daoyuan Wang • Developer @ Intel • Focuses on Spark optimization • An active Spark contributor since 2014 Yuanjian Li • Baidu INF distributed computation • Apache Spark contributor • Baidu Spark team leader
  4. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  5. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  6. Data Analytics in Big Data Definition • People want to run OLAP against large datasets as fast as possible. • People want to extract information from newly arriving data as soon as possible.
  7. Data Analytics Acceleration is Required by Spark Users http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/2016_Spark_Survey/2016_Spark_Infographic.pdf
  8. Emerging hardware technology Intel® Optane™ Technology Data Center Solutions Accelerate applications for fast caching and storage, reduce transaction costs for latency-sensitive workloads, and increase scale per server. Intel® Optane™ technology allows data centers to deploy bigger and more affordable datasets to gain new insights from large memory pools.
  9. Our proposal – OAP Spark* Job Server Spark SQL / Structured Streaming / Core Cassandra* HBase* Redis* Alluxio* HDFS* S3* … Storage Layer Hive* Table Parquet* JSON* ORC* Redis* Connector Cassandra* Connector OAP (Codename “Spinach”) • Indexed Data Source / Cache Aware • RDMA, QAT, ISA-L, FPGA … • User Customized Indices • Columnar formats; supports Parquet, ORC • Runtime Computing vs. Data Store • Columnar Fine-grained Cache • Spark Executor in-process Cache • 3D XPoint (App Direct Mode) • Auto tuning based on periodical job history • K8S Integration / AES-NI Encryption
  10. Why OAP Low cost • Makes full use of existing hardware • Open source Good Performance • Index just like a traditional database • Up to 5x boost in real-world workloads Easy to Use • Easy to deploy • Easy to maintain • Easy to learn
  11. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  12. A Simple Example 1. Run with OAP: $SPARK_HOME/sbin/start-thriftserver --package oap.jar; 2. Create an OAP table: beeline> CREATE TABLE src(a: Int, b: String) USING spn; 3. Create a single-column B+ tree index: beeline> CREATE SINDEX idx_1 ON src (a) USING BTREE; 4. Insert data: beeline> INSERT INTO TABLE src SELECT key, value FROM xxx; 5. Refresh the index: beeline> REFRESH SINDEX ON src; 6. Execution automatically utilizes the index: beeline> SELECT MAX(value), MIN(value) FROM src WHERE a > 100 AND a < 1000;
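To see why the B+ tree index created by CREATE SINDEX helps the final range query, here is a toy stand-in in plain Python (not OAP's data structures): keeping (value, row id) pairs sorted by value lets a predicate like a > 100 AND a < 1000 be answered with two binary searches instead of a full scan.

```python
import bisect

class SimpleRangeIndex:
    """Sorted-key index mapping column values to row ids - a toy
    stand-in for a B+ tree leaf level, for illustration only."""

    def __init__(self, rows, column):
        # rows: list of dicts; keep (value, row_id) pairs sorted by value
        pairs = sorted((row[column], i) for i, row in enumerate(rows))
        self.keys = [v for v, _ in pairs]
        self.row_ids = [i for _, i in pairs]

    def range_scan(self, lo, hi):
        """Row ids whose value satisfies lo < value < hi."""
        start = bisect.bisect_right(self.keys, lo)   # first key > lo
        end = bisect.bisect_left(self.keys, hi)      # first key >= hi
        return self.row_ids[start:end]

rows = [{"a": v, "b": str(v)} for v in (50, 150, 999, 1000, 500)]
idx = SimpleRangeIndex(rows, "a")
matches = [rows[i]["a"] for i in idx.range_scan(100, 1000)]
```

The matching row ids are then used to fetch only the needed rows from the data file, which mirrors the "access data file for RowIDs directly" path in OAP.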
  13. OAP Files and Fibers A data file consists of RowGroup #1 … RowGroup #N, each holding Column (Fiber) #1 … Column (Fiber) #N, plus an OAP meta file. There is one index file for every data file, containing index meta, statistics, and the index data structure (Index Fiber).
  14. OAP Internals – index Spark predicate push down (FilteredScan): read the OAP meta; if an index is available, read its statistics before using it, get local RowIDs from the index, and access the data file for those RowIDs directly; otherwise fall back to a full table scan. Index selection supports BTree and BitMap indices and finds the best match among all created indices. Supported statistics include MinMax, PartByValue, Sample, and BloomFilter. OAP cached access only reads the data fibers needed and puts those fibers into the cache (in-memory fibers).
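One of the statistics named above, the Bloom filter, lets the reader skip a file entirely when a looked-up value is definitely absent. A minimal sketch of the idea follows (bit count, hash count, and hashing scheme are arbitrary choices for illustration, not OAP's parameters):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch. A False answer from might_contain
    means the value is definitely absent, so the file can be skipped;
    True means "maybe present" and the data must still be read."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded digests of the value
        for seed in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (seed, value)).encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))
```

False positives are possible (a few stray True answers), but never false negatives, which is exactly the property a skip-statistic needs.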
  15. OAP compatible layer To read row #m from a Parquet data file, the Parquet compatible layer finds the containing row group #k among RowGroup #1 … #k, reads that row group, and extracts the specific rows into the cache.
  16. OAP Data locality Spark as a Service: the SpinachContext (driver) tracks fiber cache locality through a FiberSensor, which receives heartbeats from the FiberCacheManager in each executor; meta data and index live on storage (HDFS / S3 / OSS).
  17. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  18. Performance OAP Index and Cache Performance, query time in seconds: Parquet Vectorized Read 72.083; OAP Indexed Read 7.095; OAP Indexed Read with Fiber Cache 2.304. Cluster: 1 master + 2 slaves. Hardware: CPU 2x E5-2699 v4, RAM 256 GB, storage S3610 1.6 TB. Data: 300 GB (compressed Parquet), 2 billion records.
  19. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  20. Spark in Baidu • 2014 (version 0.8): Spark imported to Baidu; standalone cluster built; integrated with in-house FS / Pub-Sub / DW. • 2015 (version 1.4): cluster built over YARN; integrated with the in-house resource scheduler system. • 2016 (version 1.6): SQL / Graph service over Spark. • 2017 (version 2.1): OAP. Scale grew from 80 nodes and 50 jobs/day in 2014 to 1000 and 300 in 2015, 3000 and 1500 in 2016, and 6500 and 5800 in 2017.
  21. Baidu Big SQL (BBS) Web UI and RESTful API front a BBS HTTP Server, a BBS Master, and BBS Workers, built on the Cache & Index Layer (OAP), a Roll-Up Table Layer, and Spark over YARN. API Layer: • Meta Control API • Job API: Load / Export / Query / Index Control. Control Layer: • Meta Control • Job Scheduler • Spark Driver • Query Classification. Boosting Layer: • Roll-Up Table Management • Roll-Up Query Change • Index Create / Update • Cache Hit
  22. Baidu Big SQL Resource Management & Isolation: the BBS Master classifies incoming jobs (alter table / create index / query) and dispatches them to BBS Workers over physical queues: a FAIR query queue feeding a big-query pool, a small-query pool, and an index-create pool, plus separate import/load physical queues for load jobs from the data sources (logs, DW), all running on Spark over YARN.
  23. Introductory Story
  24. Introductory Story Get the top 10 charge sums and the corresponding advertisers triggered by the query word ‘flower’. • Create an index on the ‘userid’ column • Various index types to choose from for different field types • 5x speed boost over native Spark SQL, 80x over MR jobs • 3 days of Baidu charging logs, 4 TB of data, 70,000+ files, query time in 10~15 s
  25. Roll Up Table Layer The fact table (date, userid, searchid, baiduid, cmatch, …, shows, clicks, charge) has 700+ columns, but 99% of queries use fewer than 10 of them. For a query like SELECT date, userid, shows, clicks, charge FROM …, a roll-up table keyed by (date, userid) pre-aggregates the metrics (e.g. date 1, userid 1: shows 50, clicks 5, charge 25; date 1, userid 2: shows 40, clicks 4, charge 20); other roll-ups, such as by (date, cmatch), are kept as well. Multiple roll-up tables are maintained transparently to the user.
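Building such a roll-up is just a group-by with sums over the handful of columns queries actually touch. A minimal sketch (column names taken from the slide; the function is illustrative, not BBS code):

```python
from collections import defaultdict

def roll_up(rows, group_cols, sum_cols):
    """Pre-aggregate a wide fact table down to the few columns most
    queries need - a sketch of the roll-up table idea above."""
    acc = defaultdict(lambda: [0] * len(sum_cols))
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        for i, col in enumerate(sum_cols):
            acc[key][i] += row[col]
    # Re-assemble each group key and its sums into a narrow row
    return [{**dict(zip(group_cols, key)), **dict(zip(sum_cols, sums))}
            for key, sums in acc.items()]
```

Feeding it the slide's sample rows (five events for userid 1 and four for userid 2, each with shows 10, clicks 1, charge 5) reproduces the (date, userid) roll-up shown: shows 50/40, clicks 5/4, charge 25/20.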
  26. OAP In BigSQL For SELECT xxx FROM xxx WHERE age > 29 AND department IN (INF, AI-Lab), the index build step produces, from the data file (Name, Department, Age, …: John/INF/35, Michelle/AI-Lab/29, Amy/INF/42, Kim/AI-Lab/27, Mary/AI-Lab/47), an index file with a sorted Age index (27 → row 3, 29 → row 1, 35 → row 0, 42 → row 2, 47 → row 4) and a Department bit array (INF → 10100, AI-Lab → 01011). A skippable reader then uses the indices instead of a normal table scan.
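The Department bit arrays on this slide are a bitmap index: one bit per row per distinct value. A small Python sketch using the slide's own table shows how the IN predicate becomes a bitwise OR and the age predicate prunes the remaining rows (the helper function is illustrative, not OAP code):

```python
def build_bitmap_index(rows, column):
    """Bitmap index: one bit array per distinct column value,
    bit i set iff row i has that value (e.g. INF -> 10100)."""
    index = {}
    for row_id, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[row_id] = 1
    return index

rows = [
    {"name": "John",     "department": "INF",    "age": 35},
    {"name": "Michelle", "department": "AI-Lab", "age": 29},
    {"name": "Amy",      "department": "INF",    "age": 42},
    {"name": "Kim",      "department": "AI-Lab", "age": 27},
    {"name": "Mary",     "department": "AI-Lab", "age": 47},
]
dept_index = build_bitmap_index(rows, "department")

# department IN (INF, AI-Lab): OR the two bit arrays together
dept_bits = [a | b for a, b in zip(dept_index["INF"], dept_index["AI-Lab"])]

# Combine with age > 29; in OAP the age side would come from the sorted index
matches = [i for i, bit in enumerate(dept_bits) if bit and rows[i]["age"] > 29]
```

Bitmap indices suit low-cardinality columns like department, where each bit array covers many rows and predicates combine with cheap bitwise operations.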
  27. OAP In BigSQL At query time, matching rows are loaded from the data file (Name, Department, Age, …) into the in-memory cache as (value → row index in data file) entries, e.g. Department: INF → 2, AI-Lab → 3; Age: 35 → 0, 29 → 1.
  28. BBS’s Contributions to Spark • SPARK-4502 Spark SQL reads unnecessary nested fields from Parquet • SPARK-18700 getCached in HiveMetastoreCatalog not thread safe, causing driver OOM • SPARK-20408 Get glob path in parallel to reduce resolve relation time • …
  29. Agenda • Background for OAP • Key features • Benchmark • OAP and Spark in Baidu • Future plans
  30. Future plans • Compatibility with more data formats • Explicit cache and cache management • Optimize SQL operators (join, aggregate) with indices • Integrate with Structured Streaming • Utilize the latest hardware technology, such as Intel QAT or 3D XPoint • Welcome to contribute! https://github.com/Intel-bigdata/OAP
  31. Thank You. daoyuan.wang@intel.com liyuanjian@baidu.com
