Developing apache spark jobs in .net using mobius

Mobius is a C# binding for Apache Spark that allows .NET developers to build Spark applications using C#. It enables reusing existing .NET code and libraries in Spark and makes C# a first-class language for Spark. Mobius integrates with the Spark runtime by launching C# worker processes that communicate with the Java Virtual Machine to execute C# transformations and actions on RDDs in a pipelined fashion for better performance.

Scaling spark on kubernetes at Lyft

Li Gao

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.

Scaling Apache Spark on Kubernetes at Lyft

Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both Machine Learning and large scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, Li Gao and Rohit Menon will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Topics include:
- Key traits of Apache Spark on Kubernetes.
- Deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data.
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling.
- Dynamic job scale estimation and runtime dynamic job configuration.
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.

Speakers: Li Gao, Rohit Menon

Spark Compute as a Service at Paypal with Prabhu Kasinathan

Apache Spark is a gift to the big data community, and it adds tons of new features with every release. However, it’s difficult to manage petabyte-scale Hadoop clusters with hundreds of edge nodes and multiple Spark releases while still demonstrating operational efficiency and standardization. In order to address these challenges, PayPal has developed and deployed a REST-based Spark platform: Spark Compute as a Service (SCaaS), which provides improved application development, execution, logging, security, workload management and tuning. This session will walk through the top challenges faced by PayPal administrators, developers and operations and describe how PayPal’s SCaaS platform overcomes them by leveraging open source tools and technologies, like Livy, Jupyter, SparkMagic, Zeppelin, SQL Tools, Kafka and Elastic. You’ll also hear about the improvements PayPal has added, which enable it to run more than 10,000 Spark applications in production effectively.

Migrating to Apache Spark at Netflix

In the last two years, Netflix has seen a mass migration to Spark from Pig and other MR engines. This talk will focus on the challenges of that migration and the work that has made it possible. This will include contributions that Netflix has made to Spark to enable wider adoption and on-going projects to make Spark appeal to a broader range of analysts, beyond data and ML engineers. Speaker: Ryan Blue

HDFS on Kubernetes—Lessons Learned with Kimoon Kim

There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely. This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.

Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...

HostedbyConfluent

Whether you are deploying a new application as microservices or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of APIs or using an event bus for better durability, reliability and extensibility of your application. If you choose to go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro Schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change and much, much more. In this talk we will discuss all the nuances and considerations around using Avro Schemas for your JSON event payloads, from developer tools to DevOps approaches, versioning, governance and some “gotchas” we found when working with Avro Schemas and the Confluent Schema Registry.

Whirlpools in the Stream with Jayesh Lalwani

This document summarizes some challenges and solutions related to structured streaming in Spark. It discusses issues with joining streaming and batch data due to lack of pushdown predicates. It also covers problems with caching batch dataframes, lack of a JDBC sink in streaming mode initially, issues with checkpoints being inconsistent, and limitations on aggregating aggregated dataframes. Solutions proposed include caching data outside Spark, looking up batch data in map/flatmap, direct database writes, using NFS for checkpoints, and custom aggregations without Spark SQL.

High Performance Python on Apache Spark

Wes McKinney

This document contains the slides from a presentation given by Wes McKinney on high performance Python on Apache Spark. The presentation discusses why Python is an important and productive language, defines what is meant by "high performance Python", and explores techniques for building fast Python software such as embracing limitations of the Python interpreter and using native data structures and compiled extensions where needed. Specific examples are provided around control flow, reading CSV files, and the importance of efficient in-memory data structures.

fluentd -- the missing log collector

Muga Nishizawa

Fluentd is an open source log collector that allows flexible collection and routing of log data. It uses JSON format for log messages and supports many input and output plugins. Fluentd can collect logs from files, network services, and applications before routing them to storage and analysis services like MongoDB, HDFS, and Treasure Data. The open source project has grown a large community contributing over 100 plugins to make log collection and processing easier.

Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Shelan Perera

Running Spark Inside Containers with Haohai Ma and Khalid Ahmed

This presentation describes the journey we went through in containerizing Spark workloads into multiple elastic Spark clusters in a multi-tenant Kubernetes environment. Initially we deployed Spark binaries onto a host-level filesystem, so that the Spark drivers, executors and master could transparently migrate to run inside a Docker container by automatically mounting host-level volumes. In this environment, we do not need to prepare a specific Spark image in order to run Spark workloads in containers. We then utilized Kubernetes helm charts to deploy a Spark cluster. The administrator could further create a Spark instance group for each tenant. A Spark instance group, which is akin to the Spark notion of a tenant, is logically an independent kingdom for a tenant’s Spark applications in which they own dedicated Spark masters, history server, shuffle service and notebooks. Once a Spark instance group is created, it automatically generates its image and commits it to a specified repository. Meanwhile, from Kubernetes’ perspective, each Spark instance group is a first-class deployment and thus the administrator can scale its size up or down according to the tenant’s SLA and demand. In a cloud-based data center, each Spark cluster can provide Spark as a service while sharing the Kubernetes cluster. Each tenant that is registered into the service gets a fully isolated Spark instance group. In an on-prem Kubernetes cluster, each Spark cluster can map to a Business Unit, and thus each user in the BU can get a dedicated Spark instance group. The next step on this journey will address resource sharing across Spark instance groups by leveraging new Kubernetes features (Kubernetes31068/9), as well as elastic workload containers depending on job demands (Spark18278). Demo: https://www.youtube.com/watch?v=eFYu6o3-Ea4&t=5s

Understanding and Improving Code Generation

RedisConf17 - Pain-free Pipelining

Redis Labs

The document discusses how pipelining commands can improve performance when making requests to Redis over a network. It shows that pipelining multiple commands together can increase throughput significantly by reducing the number of round trips needed to the server. The document provides a benchmark showing single commands getting 17k requests/second while pipelined commands achieve 260k requests/second. It also demonstrates how to simulate network latency using Toxiproxy to throttle connections and see even larger gains from pipelining when there is network overhead. The RedPipe library is introduced as a way to pipeline commands while still maintaining a familiar API and handling responses with futures to avoid blocking.
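The round-trip argument in this summary can be illustrated with a toy model. The sketch below does not use Redis or RedPipe at all: `FakeServer`, `run_unpipelined`, `run_pipelined` and `estimated_seconds` are hypothetical names invented for illustration, and the "server" is an in-memory dictionary that only counts how many requests it receives. It is meant to show why batching commands reduces network waiting, not to reproduce the benchmark figures quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """In-memory stand-in for a key-value server (hypothetical, not Redis)."""
    store: dict = field(default_factory=dict)
    round_trips: int = 0

    def handle(self, commands):
        """Process one network request that carries one or more commands."""
        self.round_trips += 1
        replies = []
        for op, key, *args in commands:
            if op == "SET":
                self.store[key] = args[0]
                replies.append("OK")
            elif op == "GET":
                replies.append(self.store.get(key))
            else:
                raise ValueError(f"unsupported command: {op}")
        return replies


def run_unpipelined(server, commands):
    """Send each command in its own request: one round trip per command."""
    return [server.handle([cmd])[0] for cmd in commands]


def run_pipelined(server, commands):
    """Send all commands in a single request: one round trip in total."""
    return server.handle(list(commands))


def estimated_seconds(round_trips, latency_ms):
    """Time spent purely waiting on the network at a fixed latency."""
    return round_trips * latency_ms / 1000.0


commands = [("SET", f"key:{i}", i) for i in range(100)] + [("GET", "key:7")]

plain = FakeServer()
plain_replies = run_unpipelined(plain, commands)

piped = FakeServer()
piped_replies = run_pipelined(piped, commands)

# Same replies either way; only the number of round trips differs.
print("round trips:", plain.round_trips, "vs", piped.round_trips)
print("network wait at 10 ms latency:",
      estimated_seconds(plain.round_trips, 10), "s vs",
      estimated_seconds(piped.round_trips, 10), "s")
```

In this model the gap grows linearly with latency, which matches the summary's point that pipelining helps most when network overhead is high.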

[Spark Summit 2017 NA] Apache Spark on Kubernetes

Timothy Chen

This document summarizes a presentation about running Apache Spark on Kubernetes. It discusses how Spark jobs can be scheduled and run on Kubernetes, including scheduling the driver and executor pods. Key points of the design include the Kubernetes scheduler backend for Spark and components like the file staging server. The roadmap outlines upcoming support for features like Spark Streaming and improvements to dynamic allocation.

Tuning and Monitoring Deep Learning on Apache Spark

Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs. Speaker: Tim Hunter. This talk was originally presented at Spark Summit East 2017.

Connect Code to Resource Consumption to Scale Your Production Spark Applicati...

Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will also discuss various sources of information on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. We will show how Pepperdata’s products can clearly identify such usages and tie them to specific lines of code. We will show how Spark application owners can quickly identify the root causes of such common problems as job slowdowns, inadequate memory configuration, and Java garbage collection issues.

A Collaborative Data Science Development Workflow

Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.

Apache Spark Streaming in K8s with ArgoCD & Spark Operator

Over the last year, we have been moving from a batch processing setup with Airflow on EC2 instances to a powerful and scalable setup using Airflow and Spark in K8s. The increasing need to keep up with technology changes, new community advances, and multidisciplinary teams forced us to design a solution in which we can run multiple Spark versions at the same time without duplicating infrastructure, while simplifying deployment, maintenance, and development.

Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...

Brandon O'Brien

Contact: https://www.linkedin.com/in/brandonjobrien @hakczar Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz A demo and explanation of building a streaming application using Spark Streaming, Node.js and Redis with a real time visualization. Includes discussion of internals of Spark and Spark streaming including RDD partitioning and code and data distribution and cluster resource allocation.

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Evan Chan

This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies. We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts. We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.

Portable Streaming Pipelines with Apache Beam

1) Apache Beam is an open source unified model for defining both batch and streaming data processing pipelines. It allows writing pipelines once that can run on multiple distributed processing backends.
2) The Beam model separates the data processing logic from runtime requirements. It defines concepts like processing time vs event time to allow portability across batch and streaming runners.
3) Beam supports extensible IO connectors and aims to allow pipelines written in one language to run on different runtimes through language-specific SDKs. Currently, Java and Python SDKs can run on backends like Apache Spark, Flink, and Google Cloud Dataflow.
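The distinction between event time and processing time mentioned in this summary can be sketched without any Beam dependency. The example below is plain Python and does not use the Beam SDK: the function names, the tuple layout `(event_time, arrival_time, payload)` and the sample data are all assumptions made for illustration, and it ignores watermarks, triggers and late-data handling that a real runner would need.

```python
from collections import defaultdict


def window_start(timestamp, size):
    """Start of the fixed (tumbling) window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(events, size, time_of):
    """Count events per fixed window, using `time_of(event)` as the clock."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(time_of(event), size)] += 1
    return dict(sorted(counts.items()))


# Hypothetical records: (event_time_s, arrival_time_s, payload).
# "c" and "e" were produced early but arrived late.
events = [
    (1, 2, "a"),
    (12, 13, "b"),
    (3, 14, "c"),
    (14, 15, "d"),
    (9, 21, "e"),
]

by_event_time = count_per_window(events, 10, time_of=lambda e: e[0])
by_processing_time = count_per_window(events, 10, time_of=lambda e: e[1])

print("event time:     ", by_event_time)       # windows keyed by when events happened
print("processing time:", by_processing_time)  # windows keyed by when events arrived
```

Grouping by event time gives the same counts however late the records arrive, which is the property that lets one pipeline definition produce consistent results on both batch and streaming runners.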

The Future of Real-Time in Spark

Reynold Xin

Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...

DataWorks Summit/Hadoop Summit

Over the last 5 years, Kafka and Flink have become mature technologies that have allowed us to embrace the streaming paradigm. You can bet on them to build reliable and efficient applications. They are active projects backed by companies using them in production. They have a good community contributing, and sharing experience and knowledge. Kafka and Flink are solid choices if you want to build a data platform that your data scientists or developers can use to collect, process, and distribute data. You can put together Kafka Connect, Kafka, Schema Registry, and Flink. First, you will take care of their deployment. Then, for each case, you will set up each part, and of course develop the Flink job so it can integrate easily with the rest. Looks like a challenging but exciting project, doesn't it? In this session, you will learn how you can build such a data platform, what the nitty-gritty details of each part are, how you can plug them together, in particular how to plug Flink into the Kafka ecosystem, what common pitfalls to avoid, and what it takes to deploy on Kubernetes. Even if you are not familiar with all the technologies, there will be enough introduction so you can follow. Come and learn how we can actually cross the streams!

Scylla Summit 2022: ORM and Query Building in Rust

ScyllaDB

This document discusses building ORM and queries in Rust for ScyllaDB. It covers why Rust is suitable, features of the Rust programming language, and modules in the ScyllaDB Rust ORM like query transformation and table metadata. It also discusses concepts like load balancing policies, query building, and mapping tables to Rust structs in the ORM. Lastly, it briefly mentions Scylla products/drivers and areas of future work.

File Format Benchmark - Avro, JSON, ORC & Parquet

This document summarizes a benchmark study of file formats for Hadoop, including Avro, JSON, ORC, and Parquet. It found that ORC with zlib compression generally performed best for full table scans. However, Avro with Snappy compression worked better for datasets with many shared strings. The study also found that column projection was significantly faster for columnar formats like ORC and Parquet compared to row-oriented formats. Overall, the document provides a high-level overview of performance comparisons between file formats for different use cases.

Graph Analytics

Khalid Salama

What's hot

HDFS on Kubernetes—Lessons Learned with Kimoon Kim

Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...

HostedbyConfluent

Whirlpools in the Stream with Jayesh Lalwani

High Performance Python on Apache Spark

Wes McKinney

fluentd -- the missing log collector

Muga Nishizawa

Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Shelan Perera

Running Spark Inside Containers with Haohai Ma and Khalid Ahmed

Understanding and Improving Code Generation

RedisConf17 - Pain-free Pipelining

Redis Labs

[Spark Summit 2017 NA] Apache Spark on Kubernetes

Timothy Chen

Tuning and Monitoring Deep Learning on Apache Spark

Connect Code to Resource Consumption to Scale Your Production Spark Applicati...

A Collaborative Data Science Development Workflow

Apache Spark Streaming in K8s with ArgoCD & Spark Operator

Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...

Brandon O'Brien

Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Evan Chan

Portable Streaming Pipelines with Apache Beam

The Future of Real-Time in Spark

Reynold Xin

Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...

DataWorks Summit/Hadoop Summit

Scylla Summit 2022: ORM and Query Building in Rust

ScyllaDB

Viewers also liked

File Format Benchmark - Avro, JSON, ORC & Parquet

Graph Analytics

Khalid Salama

Machine learning with Spark

Khalid Salama

Parquet and AVRO

airisData

Parquet Strata/Hadoop World, New York 2013

Julien Le Dem

Parquet is a columnar storage format for Hadoop data. It was developed collaboratively by Twitter and Cloudera to address the need for efficient analytics on large datasets. Parquet provides more efficient compression and I/O compared to row-based formats by only reading and decompressing the columns needed by a query. It has been adopted by many companies for analytics workloads involving terabytes to petabytes of data. Parquet is language-independent and supports integration with frameworks like Hive, Pig, and Impala. It provides significant performance improvements and storage savings compared to traditional row-based formats.
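The I/O argument for columnar storage in this summary can be illustrated with a small in-memory model. The sketch below does not use Parquet or any Parquet library: `to_columns`, `sum_from_rows`, `sum_from_columns` and the sample records are hypothetical, and "values touched" is only a rough stand-in for bytes read from disk. It assumes every record has the same fields.

```python
def to_columns(rows):
    """Pivot a list of uniform records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, name):
    """Row layout: every field of every record is read to reach one column."""
    values_touched = 0
    total = 0
    for row in rows:
        values_touched += len(row)
        total += row[name]
    return total, values_touched


def sum_from_columns(columns, name):
    """Column layout: only the requested column is read."""
    column = columns[name]
    return sum(column), len(column)


rows = [
    {"user": "ann", "country": "DE", "amount": 10},
    {"user": "bob", "country": "US", "amount": 20},
    {"user": "cyd", "country": "US", "amount": 30},
]
columns = to_columns(rows)

row_total, row_touched = sum_from_rows(rows, "amount")
col_total, col_touched = sum_from_columns(columns, "amount")

print("row layout:   ", row_total, "with", row_touched, "values read")
print("column layout:", col_total, "with", col_touched, "values read")
```

Both layouts give the same answer, but the column layout reads a fraction of the values, and that fraction shrinks as tables get wider. Storing each column contiguously also places similar values next to each other, which is one reason columnar formats tend to compress well.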

Efficient Data Storage for Analytics with Apache Parquet 2.0

Cloudera, Inc.

Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...

StampedeCon

At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats has different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.

Similar to Developing apache spark jobs in .net using mobius

Spark Summit EU talk by Kaarthik Sivashanmugam

This document introduces Mobius, a C# API for building Apache Spark applications in .NET. Mobius allows organizations invested in .NET to develop Spark jobs using C#. It provides bindings between the Scala/Java Spark API and C#, enabling reuse of existing .NET libraries in Spark and development of Spark applications using C# and F#. The document demonstrates running and debugging Mobius applications, and discusses performance considerations and internals of the Mobius driver and worker architecture.

.NET per la Data Science e oltre

Marco Parenzan

.NET for Azure Synapse (and viceversa)

Marco Parenzan

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...

Michael Rys

Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Mike Broberg

Ml2

poovarasu maniandan

Microsoft R Server allows users to run R code on large datasets in a distributed, parallel manner across SQL Server, Spark, and Hadoop without code changes. It provides scalable machine learning algorithms and tools to operationalize models for real-time scoring. The document discusses how R code can be run remotely on Hadoop and Spark clusters using technologies like RevoScaleR and Sparklyr for scalability.

Intro to big data analytics using microsoft machine learning server with spark

Alex Zeltov

Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark

By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes. Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process, from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together. R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.

This meetup will NOT be a data science intro or R intro to programming. It is about working with data and big data on MLS.
- How to scale R
- Work with R and Hadoop + Spark
- Demo of MLS on HDP/HDInsight server with RStudio
- How to operationalize deploying models using MLS Webservice operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering

Speaker Bio: Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology and most recently in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Prior to joining Microsoft, he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in HDP and HDF platforms.

Introducing Kafka's Streams API

The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing and fault tolerance via state stores in Kafka.

Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...

Michael Rys

This document introduces .NET for Apache Spark, which allows .NET developers to use the Apache Spark analytics engine for big data and machine learning. It discusses why .NET support is needed for Apache Spark given that much business logic is written in .NET. It provides an overview of .NET for Apache Spark's capabilities including Spark DataFrames, machine learning, and performance that is on par or faster than PySpark. Examples and demos are shown. Future plans are discussed to improve the tooling, expand programming experiences, and provide out-of-box experiences on platforms like Azure HDInsight and Azure Databricks. Readers are encouraged to engage with the open source project and provide feedback.

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform

Yao Yao

Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics.
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3