Improving Spark SQL at LinkedIn

Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn's Spark team. In this talk, we present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the main projects we are working on:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skew with adaptive execution
* Enhancing the cost-based optimizer with a history-based learning approach

Improving Spark SQL
At LinkedIn
Fangshi Li
Staff Software Engineer
LinkedIn

Agenda
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. Cost-based optimizer

Spark SQL adoption at LinkedIn
60% of the jobs running on our cluster are Spark jobs
Spark jobs:
⅔ Spark SQL
⅓ RDD
Spark SQL jobs:
⅔ DataFrame/SQL API
⅓ Dataset API

Goals
Enable computations that could not be completed before
Make every job run faster

Spark SQL roadmap at LinkedIn: 3-level optimization
Operator-level: Dataset ser-de, joins
Plan-level: Adaptive Execution, CBO
Cluster-level: Multi-query optimization

Dataset performance
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
The Dataset API has performance issues due to:
1. Excessive conversion overhead
2. No column pruning for ORC/Parquet

Solutions
Apple:
Spark + AI Summit 2019 talk: “Bridging the Gap Between Datasets and DataFrames”
Uses a bytecode analyzer to convert the user's lambda functions into SQL expressions
E.g., x.id > 0 ----> isLargerThan(col("id"), Literal(0))
LinkedIn:
Uses a bytecode analyzer to find out which columns the user's lambdas use, and prunes the columns that are not needed
val ds: Dataset[TrackingEvent] = createDataset()
val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key))
Big performance boost for ORC/Parquet, since the required columns can be pushed down to the readers
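
The effect of pruning can be illustrated with a minimal, Spark-free sketch. Everything in it is an assumption made for illustration: the toy ColumnarFile type, the scan helper, and the extra payload and timestamp columns are hypothetical, and the list of used columns is written by hand where the bytecode analyzer would derive it from the lambdas.

```scala
object ColumnPruningSketch {
  // A toy columnar file: column name -> values, one entry per row (hypothetical).
  type ColumnarFile = Map[String, Vector[Any]]

  // Reads only the requested columns; returns the rows and the number of cells touched.
  def scan(file: ColumnarFile, columns: Seq[String]): (Vector[Map[String, Any]], Int) = {
    val rowCount = file.values.headOption.map(_.length).getOrElse(0)
    val rows = Vector.tabulate(rowCount) { i =>
      columns.map(c => c -> file(c)(i)).toMap
    }
    (rows, rowCount * columns.length)
  }

  val file: ColumnarFile = Map(
    "id" -> Vector(1, -2, 3),
    "key" -> Vector("a", "b", "c"),
    "payload" -> Vector("x" * 10, "y" * 10, "z" * 10),
    "timestamp" -> Vector(100L, 200L, 300L)
  )

  // Columns touched by x.id > 0 and (x.id, x.key); assumed to come from the analyzer.
  val usedColumns: Seq[String] = Seq("id", "key")

  // Returns (result of filter + map, cells read without pruning, cells read with pruning).
  def run(): (Vector[(Int, String)], Int, Int) = {
    val (_, fullCells) = scan(file, file.keys.toSeq)
    val (rows, prunedCells) = scan(file, usedColumns)
    val result = rows
      .filter(r => r("id").asInstanceOf[Int] > 0)
      .map(r => (r("id").asInstanceOf[Int], r("key").asInstanceOf[String]))
    (result, fullCells, prunedCells)
  }
}
```

On this toy data, ColumnPruningSketch.run() keeps the rows with ids 1 and 3 and touches 6 cells where an unpruned scan touches 12.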

A recommendation use case at LinkedIn
1. Join the pair features with the viewer features
2. Join the intermediate result with the entity features
3. Score each joined record with an ML model
4. Rank the top N entities for each viewer

Exploding intermediate data
Can we perform the 3-way join and the scoring in a single step, without exploding the intermediate data?

2d partitioned join
- Partition the left, right, and pair tables into M, N, and M*N partitions, respectively
- The left and pair tables are sorted within each partition
- For each partition of the pair table:
  - join the left table with a sort-merge join
  - join the right table with a shuffle-hash join
  - for each joined record, perform scoring right away and output the scorable
- Rank the scorables
Adaptive Execution (AE) at LinkedIn
Optimize the query plan while the job is running (SPARK-23128)
Handle data skew in joins: works great!
Convert shuffle-based joins to broadcast joins at runtime: needs a shuffle map stage before converting to a broadcast join
Should we use Adaptive Execution to optimize the join plan at runtime?

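The kind of runtime decision described above can be sketched as a pure function over the partition sizes reported by the shuffle map stage. This is a simplified illustration, not the SPARK-23128 implementation: the replan function, the JoinPlan types, and the broadcastThreshold, skewFactor, and skewMinBytes parameters are hypothetical, and only the left side is checked for skew.

```scala
object AdaptiveJoinSketch {
  sealed trait JoinPlan
  case object BroadcastJoin extends JoinPlan
  final case class SortMergeJoin(skewedLeftPartitions: Seq[Int]) extends JoinPlan

  // Upper median of the partition sizes; used as the "typical" partition size.
  private def median(sizes: Seq[Long]): Long = {
    val sorted = sizes.sorted
    sorted(sorted.length / 2)
  }

  // Sizes are per-partition byte counts, known only after the shuffle map stage has run.
  def replan(
      leftSizes: Seq[Long],
      rightSizes: Seq[Long],
      broadcastThreshold: Long,
      skewFactor: Long,
      skewMinBytes: Long): JoinPlan = {
    require(leftSizes.nonEmpty && rightSizes.nonEmpty, "need sizes from both sides")
    if (math.min(leftSizes.sum, rightSizes.sum) <= broadcastThreshold) {
      // One side turned out to be small: switch to a broadcast join.
      BroadcastJoin
    } else {
      // Otherwise keep the shuffle-based join and flag partitions that are far above typical.
      val typical = median(leftSizes)
      val skewed = leftSizes.zipWithIndex.collect {
        case (size, index) if size > skewMinBytes && size > skewFactor * typical => index
      }
      SortMergeJoin(skewed)
    }
  }
}
```

Because the sizes come from the shuffle map stage, the shuffle has already been paid for by the time the broadcast decision is made, which is the trade-off the slide raises.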
CBO (cost-based optimizer)
The CBO in Spark can optimize the query plan based on operator cost (data size, number of records).
Benefits:
Choose the best join strategy: broadcast vs. shuffle-hash vs. sort-merge
Multi-join reordering

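A minimal sketch of how size estimates could drive the join-strategy choice is shown below. The thresholds and the rule itself are simplifying assumptions for illustration and are not Spark's actual planner logic: choose, broadcastThreshold, perPartitionHashBudget, and the "three times smaller" test are all hypothetical.

```scala
object JoinStrategySketch {
  sealed trait Strategy
  case object Broadcast extends Strategy
  case object ShuffleHash extends Strategy
  case object SortMerge extends Strategy

  // All sizes are estimated bytes, as a cost-based optimizer would get them from stats.
  def choose(
      leftBytes: Long,
      rightBytes: Long,
      broadcastThreshold: Long,
      perPartitionHashBudget: Long,
      numPartitions: Int): Strategy = {
    val small = math.min(leftBytes, rightBytes)
    val large = math.max(leftBytes, rightBytes)
    if (small <= broadcastThreshold) {
      Broadcast // the small side fits on every executor
    } else if (small / numPartitions <= perPartitionHashBudget && small * 3 <= large) {
      ShuffleHash // a per-partition hash table fits, and one side is clearly smaller
    } else {
      SortMerge // safe default for two large inputs
    }
  }
}
```

With wrong or missing size estimates the same function picks a different branch, which is why the quality of the stats matters so much.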
CBO (cost-based optimizer)
The native CBO in Spark has a usability issue:
It requires detailed stats (count, min, max, distinct, histograms) to be available for the input datasets.
It requires scheduled jobs to compute stats on all datasets, which is very expensive.

CBO (cost-based optimizer)
Can we learn the stats from history? YES!

Learning-based CBO
Eliminate the CBO’s dependency on pre-computing stats by
learning stats from job histories
A general approach to benefit all SQL engines

Learning-based CBO
Approach 1: Instance-based learning
Ref: “LEO: DB2’s Learning Optimizer”
Approach 2: Model-based learning
Ref: “SageDB: A Learned Database System”
Learning-based CBO vs no-CBO

Summary
1. Automated column pruning for Dataset
2. 2d partitioned join
3. Adaptive Execution
4. History-based CBO (cost-based optimizer)


Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

More from Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Improving Spark SQL at LinkedIn

1. Improving Spark SQL at LinkedIn. Fangshi Li, Staff Software Engineer, LinkedIn

2. Agenda: (1) Automated column pruning for Dataset; (2) 2d partitioned join; (3) Adaptive Execution; (4) Cost-based optimizer

3. Spark SQL adoption at LinkedIn: 60% of the jobs running on our cluster are Spark jobs. Spark jobs: ⅔ Spark SQL, ⅓ RDD. Spark SQL jobs: ⅔ DataFrame/SQL API, ⅓ Dataset API.

4. Goals: enable computations that could not be completed before; make every job run faster.

5. Spark SQL roadmap at LinkedIn: 3-level optimization. Operator-level: Dataset ser-de, joins. Plan-level: Adaptive Execution, CBO. Cluster-level: multi-query optimization.

6. Agenda: (1) Automated column pruning for Dataset; (2) 2d partitioned join; (3) Adaptive Execution; (4) Cost-based optimization (CBO)

7. Dataset performance: val ds: Dataset[TrackingEvent] = createDataset(); val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key)). Dataset has performance issues due to (1) excessive conversion overhead and (2) no column pruning for ORC/Parquet.

8. Solutions. Apple (Spark + AI 2019 talk: “Bridging the Gap Between Datasets and DataFrames”): using a bytecode analyzer, convert the user lambda functions into SQL expressions, e.g., x.id > 0 ----> isLargerThan(col(“id”), Literal(0)). LinkedIn: using a bytecode analyzer, find out which columns are used in the user lambdas, and prune the columns that are not needed. Example: val ds: Dataset[TrackingEvent] = createDataset(); val ds2 = ds.filter(x => x.id > 0).map(x => (x.id, x.key)). Big performance boost for ORC/Parquet, since the required columns can be pushed down to the readers.

9. Agenda: (1) Automated column pruning for Dataset; (2) 2d partitioned join; (3) Adaptive Execution; (4) Cost-based optimization (CBO)

10. A recommendation use case at LinkedIn: (1) the pair feature joins with the viewer feature; (2) the intermediate result joins with the entity feature; (3) an ML model scores each joined record; (4) the top N entities are ranked for each viewer.

11. Exploding intermediate data: can we perform the 3-way join and the scoring in a single step without exploding intermediate data?

12. 2d partitioned join: partition the left, right, and pair tables into M, N, and M*N partitions. The left and pair tables are sorted within each partition. For each partition in the pair table: join the left table with a sort-merge join, then join the right table with a shuffle-hash join. For each joined record, perform scoring right away and output the scorable. Finally, rank the scorables.

13. Before: 10+ h. After: 1 h.
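
The steps on slide 12 can be illustrated with plain Scala collections. This is only a minimal single-machine sketch, not LinkedIn's implementation: the record types, the `score` function, the modulo partitioner, and the assumption that viewer ids are unique in the left table are all hypothetical choices made for illustration.

```scala
object TwoDJoinSketch {
  // Hypothetical record types; the slides do not define a schema.
  case class Viewer(id: Int, feature: Double)
  case class Entity(id: Int, feature: Double)
  case class Pair(viewerId: Int, entityId: Int, feature: Double)
  case class Scored(viewerId: Int, entityId: Int, score: Double)

  // Stand-in for the ML model (assumption): any function of the three features.
  def score(v: Viewer, e: Entity, p: Pair): Double = v.feature + e.feature + p.feature

  // Non-negative modulo so every key maps to a valid partition index.
  private def part(key: Int, n: Int): Int = ((key % n) + n) % n

  def joinAndScore(viewers: Seq[Viewer], entities: Seq[Entity], pairs: Seq[Pair],
                   m: Int, n: Int): Seq[Scored] = {
    // Left table: M partitions, each sorted by key.
    val leftParts = viewers.groupBy(v => part(v.id, m))
      .map { case (k, vs) => k -> vs.sortBy(_.id).toVector }
    // Right table: N partitions, each built into a hash map.
    val rightParts = entities.groupBy(e => part(e.id, n))
      .map { case (k, es) => k -> es.map(e => e.id -> e).toMap }
    // Pair table: M*N partitions keyed by (left partition, right partition),
    // each sorted by the left join key.
    val pairParts = pairs.groupBy(p => (part(p.viewerId, m), part(p.entityId, n)))
      .map { case (k, ps) => k -> ps.sortBy(_.viewerId) }

    pairParts.toSeq.flatMap { case ((i, j), ps) =>
      val left = leftParts.getOrElse(i, Vector.empty[Viewer])
      val right = rightParts.getOrElse(j, Map.empty[Int, Entity])
      var cursor = 0 // merge-join cursor over the sorted left partition
      val out = Seq.newBuilder[Scored]
      ps.foreach { p =>
        while (cursor < left.length && left(cursor).id < p.viewerId) cursor += 1
        if (cursor < left.length && left(cursor).id == p.viewerId) {
          // Hash lookup on the right side, then score immediately,
          // so the 3-way joined record is never materialized.
          right.get(p.entityId).foreach { e =>
            out += Scored(p.viewerId, p.entityId, score(left(cursor), e, p))
          }
        }
      }
      out.result()
    }
  }

  // Rank: keep the top N scored entities per viewer.
  def topN(scored: Seq[Scored], n: Int): Map[Int, Seq[Scored]] =
    scored.groupBy(_.viewerId).map { case (v, ss) => v -> ss.sortBy(s => -s.score).take(n) }
}
```

The point of the layout is that each pair partition needs exactly one left partition and one right partition, so only the small scored output leaves the join step.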

14. Agenda: (1) Automated column pruning for Dataset; (2) 2d hash partitioned join; (3) Adaptive Execution (AE); (4) Cost-based optimization (CBO)

15. Adaptive Execution (AE) at LinkedIn: optimize the query plan while the job is running (SPARK-23128). Handling data skew in joins: works great! Converting a shuffle-based join to a broadcast join at runtime: needs a shuffle map stage before converting to a broadcast join. Should we use Adaptive Execution to optimize the join plan at runtime?
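
As a rough illustration of the skew handling mentioned on slide 15, the sketch below flags skewed shuffle partitions from the sizes observed after the map stage and decides how many pieces to split each one into. It is not Spark's implementation; the rule (larger than a multiple of the median and above an absolute floor) and the names `factor`, `minBytes`, and `targetBytes` are assumptions chosen for illustration.

```scala
object SkewSketch {
  // Median of the observed partition sizes (upper median for even counts).
  def median(sizes: Seq[Long]): Long = {
    val sorted = sizes.sorted
    sorted(sorted.length / 2)
  }

  // A partition counts as skewed when it is both much larger than the median
  // and larger than an absolute floor. Both knobs are illustrative.
  def skewedPartitions(sizes: Seq[Long], factor: Int, minBytes: Long): Seq[Int] =
    if (sizes.isEmpty) Seq.empty
    else {
      val med = median(sizes)
      sizes.zipWithIndex.collect { case (s, i) if s > med * factor && s > minBytes => i }
    }

  // Number of pieces to split a skewed partition into so each is near the target size.
  def splits(size: Long, targetBytes: Long): Int =
    math.max(1, math.ceil(size.toDouble / targetBytes).toInt)
}
```

Because the decision uses sizes measured at runtime, it needs no statistics to be computed ahead of the job.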

16. Agenda: (1) Automated column pruning for Dataset; (2) 2d hash partitioned join; (3) Adaptive Execution; (4) Cost-based optimization (CBO)

17. CBO (cost-based optimizer): the CBO in Spark can optimize the query plan based on operator cost (data size, number of records). Benefits: choosing the best join strategy (broadcast vs. shuffle-hash vs. sort-merge); multi-join reordering.
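
To make the join-strategy choice on slide 17 concrete, here is a minimal sketch of a size-based decision. It is a simplification and not Spark's planner logic: the `Stats` type and the two limits (`broadcastLimit`, `hashBuildLimit`) are hypothetical, and only the estimated size of the smaller side is consulted.

```scala
object JoinStrategySketch {
  sealed trait Strategy
  case object Broadcast extends Strategy
  case object ShuffleHash extends Strategy
  case object SortMerge extends Strategy

  // Estimated statistics for one join input; this sketch only uses the size.
  case class Stats(sizeInBytes: Long, rowCount: Long)

  def choose(left: Stats, right: Stats, broadcastLimit: Long, hashBuildLimit: Long): Strategy = {
    val smaller = math.min(left.sizeInBytes, right.sizeInBytes)
    if (smaller <= broadcastLimit) Broadcast         // small side fits on every executor
    else if (smaller <= hashBuildLimit) ShuffleHash  // per-partition hash table is affordable
    else SortMerge                                   // fall back to the most scalable join
  }
}
```

The quality of the choice depends entirely on how accurate the estimates in `Stats` are.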

18. CBO (cost-based optimizer): the native CBO in Spark has usability issues. It requires detailed stats (count, min, max, distinct, histograms) to be available for the input datasets, and it requires scheduled jobs to compute stats on all datasets, which is very expensive.

19. CBO (cost-based optimizer): can we learn the stats from history? YES!

20. Learning-based CBO: eliminate the CBO’s dependency on pre-computed stats by learning stats from job histories. A general approach that can benefit all SQL engines.

21. Learning-based CBO. Approach 1: instance-based learning (ref: “LEO: DB2’s Learning Optimizer”). Approach 2: model-based learning (ref: “SageDB: A Learned Database System”).

22. Learning-based CBO vs no-CBO. Approach 1: instance-based learning (ref: “LEO: DB2’s Learning Optimizer”). Approach 2: model-based learning (ref: “SageDB: A Learned Database System”).
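
One way to picture the instance-based approach from slides 20 and 21 is a lookup table of statistics observed in past runs, keyed by a signature of the sub-plan that produced them. The sketch below is an assumption-laden illustration, not the system described in the talk: the string signature, the `Observed` type, and the averaging window are all hypothetical.

```scala
object HistoryStatsSketch {
  // Statistics observed for one sub-plan in one completed job.
  case class Observed(rowCount: Long, sizeInBytes: Long)

  final class StatsHistory {
    private val store = scala.collection.mutable.Map.empty[String, Vector[Observed]]

    // Called after a job finishes: remember what the sub-plan actually produced.
    def record(signature: String, obs: Observed): Unit =
      store.update(signature, store.getOrElse(signature, Vector.empty[Observed]) :+ obs)

    // Called at planning time: average the most recent `window` observations,
    // or return None when this sub-plan has never been seen.
    def estimate(signature: String, window: Int): Option[Observed] =
      store.get(signature).map(_.takeRight(window)).filter(_.nonEmpty).map { recent =>
        Observed(
          recent.map(_.rowCount).sum / recent.length,
          recent.map(_.sizeInBytes).sum / recent.length)
      }
  }
}
```

When `estimate` returns `None`, a planner would fall back to its default behavior, which matches the no-CBO baseline.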

23. Summary: (1) Automated column pruning for Dataset; (2) 2d partitioned join; (3) Adaptive Execution; (4) History-based CBO (cost-based optimizer)

24. Thank you
