Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021 a new EU regulation (EU 2019/1381) has been enacted, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.
To comply with this new regulation, delaware has been supporting EFSA in undergoing a large Digital Transformation program. We have been designing and rolling-out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned w.r.t. changing processes and data models. At the heart of the platform lie two important patterns:
1. An Event Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events near real-time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publisher/subscribe mechanism.
2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Events streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources. This store in turn feeds into APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real-time.
At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
Apache Spark is a unified analytics engine for large-scale, distributed data processing. And Spark MLlib (Machine Learning library) is a scalable Spark implementation of some common machine learning (ML) functionality, as well associated tests and data generators.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Degrading Performance? You Might be Suffering From the Small Files SyndromeDatabricks
No matter if your data pipelines are handling real-time event-driven streams, near-real-time streams, or batch processing jobs. When you work with a massive amount of data made out of small files, specifically parquet, your system performance will degrade.
A small file is one that is significantly smaller than the storage block size. Yes, even with object stores such as Amazon S3, Azure Blob, etc., there is minimum block size. Having a significantly smaller object file can result in wasted space on the disk since the storage is optimized to support fast read and write for minimal block size.
To understand why this happens, you need first to understand how cloud storage works with the Apache Spark engine. In this session, you will learn about Parquet, the Storage API calls, how they work together, why small files are a problem, and how you can leverage DeltaLake for a more straightforward, cleaner solution.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021 a new EU regulation (EU 2019/1381) has been enacted, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.
To comply with this new regulation, delaware has been supporting EFSA in undergoing a large Digital Transformation program. We have been designing and rolling-out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned w.r.t. changing processes and data models. At the heart of the platform lie two important patterns:
1. An Event Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events near real-time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publisher/subscribe mechanism.
2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Events streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources. This store in turn feeds into APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real-time.
At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
Apache Spark is a unified analytics engine for large-scale, distributed data processing. And Spark MLlib (Machine Learning library) is a scalable Spark implementation of some common machine learning (ML) functionality, as well associated tests and data generators.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Degrading Performance? You Might be Suffering From the Small Files SyndromeDatabricks
No matter if your data pipelines are handling real-time event-driven streams, near-real-time streams, or batch processing jobs. When you work with a massive amount of data made out of small files, specifically parquet, your system performance will degrade.
A small file is one that is significantly smaller than the storage block size. Yes, even with object stores such as Amazon S3, Azure Blob, etc., there is minimum block size. Having a significantly smaller object file can result in wasted space on the disk since the storage is optimized to support fast read and write for minimal block size.
To understand why this happens, you need first to understand how cloud storage works with the Apache Spark engine. In this session, you will learn about Parquet, the Storage API calls, how they work together, why small files are a problem, and how you can leverage DeltaLake for a more straightforward, cleaner solution.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
There are also many libraries trying to scale pandas APIs, such as Vaex, Modin, and so on. Dask is one of them and very popular among pandas users, and also works on its own cluster similar to Koalas which is on top of Spark cluster. In this talk, we will introduce Koalas and its current status, and the comparison between Koalas and Dask, including benchmarking.
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
Presentation consists of an amazing bundle of Pro tips and tricks for building an insanely scalable Apache Spark and Spark Streaming based data pipeline.
Presentation consists of 4 parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy spark so it does not backfire
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner. We’ll survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Apache Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
how to build deep learning models in a few lines of code;
how to scale common tasks like transfer learning and prediction; and how to publish models in Spark SQL.
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold talks about the underlying techniques used to achieve high performance sorting using Spark and Scala, among which are sun.misc.Unsafe, exploiting cache locality, high-level resource pipelining.
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
This talk will focus on Journey of technical challenges, trade offs and ground-breaking achievements for building performant and scalable pipelines from the experience working with our customers.
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Advanced Natural Language Processing with Apache Spark NLPDatabricks
NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
In this talk, we’ll survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
• It has a simple API that integrates well with enterprise Machine Learning pipelines.
• It automatically scales out common Deep Learning patterns, thanks to Spark.
• It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
• how to build deep learning models in a few lines of code;
• how to scale common tasks like transfer learning and prediction; and
• how to publish models in Spark SQL.
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuDatabricks
XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost.
While being one of the most popular machine learning systems, XGBoost is only one of the components in a complete data analytic pipeline. The data ETL/exploration/serving functionalities are built up on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face the difficulties/inconveniences in data navigating and application development/deployment.
We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost.
The communication channel between Spark and XGBoost is established based on RDDs/DataFrame/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into Spark MLLib pipeline and tuned through the tools provided by MLLib. In this talk, I will cover the motivation/history/design philosophy/implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share the insights on building a heterogeneous data analytic pipeline based on Spark and other data intelligence frameworks and bring more discussions on this topic.
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
Spark SQL works very well with structured row-based data. Vectorized reader and writer for parquet/orc can make I/O much faster. It also used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions under complicated queries. Apache Arrow provides columnar in-memory layout and SIMD optimized kernels as well as a LLVM based SQL engine Gandiva. These native based libraries can accelerate Spark SQL by reduce the CPU usage for both I/O and execution.
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.
There are also many libraries trying to scale pandas APIs, such as Vaex, Modin, and so on. Dask is one of them and very popular among pandas users, and also works on its own cluster similar to Koalas which is on top of Spark cluster. In this talk, we will introduce Koalas and its current status, and the comparison between Koalas and Dask, including benchmarking.
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
Presentation consists of an amazing bundle of Pro tips and tricks for building an insanely scalable Apache Spark and Spark Streaming based data pipeline.
Presentation consists of 4 parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy spark so it does not backfire
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner. We’ll survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Apache Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
how to build deep learning models in a few lines of code;
how to scale common tasks like transfer learning and prediction; and how to publish models in Spark SQL.
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold talks about the underlying techniques used to achieve high performance sorting using Spark and Scala, among which are sun.misc.Unsafe, exploiting cache locality, high-level resource pipelining.
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
This talk will focus on Journey of technical challenges, trade offs and ground-breaking achievements for building performant and scalable pipelines from the experience working with our customers.
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Advanced Natural Language Processing with Apache Spark NLPDatabricks
NLP is a key component in many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
In this talk, we’ll survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
• It has a simple API that integrates well with enterprise Machine Learning pipelines.
• It automatically scales out common Deep Learning patterns, thanks to Spark.
• It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
• how to build deep learning models in a few lines of code;
• how to scale common tasks like transfer learning and prediction; and
• how to publish models in Spark SQL.
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing.
Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks.
We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuDatabricks
XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost.
While being one of the most popular machine learning systems, XGBoost is only one of the components in a complete data analytic pipeline. The data ETL/exploration/serving functionalities are built up on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face the difficulties/inconveniences in data navigating and application development/deployment.
We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost.
The communication channel between Spark and XGBoost is established based on RDDs/DataFrame/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into Spark MLLib pipeline and tuned through the tools provided by MLLib. In this talk, I will cover the motivation/history/design philosophy/implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share the insights on building a heterogeneous data analytic pipeline based on Spark and other data intelligence frameworks and bring more discussions on this topic.
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
Spark SQL works very well with structured row-based data. Vectorized reader and writer for parquet/orc can make I/O much faster. It also used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions under complicated queries. Apache Arrow provides columnar in-memory layout and SIMD optimized kernels as well as a LLVM based SQL engine Gandiva. These native based libraries can accelerate Spark SQL by reduce the CPU usage for both I/O and execution.
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
Abstract:
Data exploration often requires running aggregation/slice-dice queries on data sourced from disparate sources. You may want to identify distribution patterns, outliers, etc and aid the feature selection process as you train your predictive models. As you begin to understand your data, you want to ask ad-hoc questions expressed through your visualization tool (which typically translates to SQL queries), study the results and iteratively explore the data set through more queries. Unfortunately, even when data sets can be in-memory, large data set computations take time breaking the train of thought and increasing time to insight . We know Spark can be fast through its in-memory parallel processing. But, Spark 1.x isn’t quite there. Spark 2.0 promises to offer 10X better speed than its predecessor. Spark 2.0 ushers some impressive improvements to interactive query performance. We first explore these advances - compiling the query plan eliminating virtual function calls, and other improvements in the Catalyst engine. We compare the performance to other popular popular query processing engines by studying the spark query plans. We then go through SnappyData (an open source project that integrates Spark with a database that offers OLTP, OLAP and stream processing in a single cluster) where we use smarter data colocation and Synopses data (.e.g. Stratified sampling) to dramatically cut down on the memory requirements as well as the query latency. We explain the key concepts in summarizing data using structures like stratified sampling by walking through some examples in Apache Zeppelin notebooks (a open source visualization tool for spark) and demonstrate how we can explore massive data sets with just your laptop resources while achieving remarkable speeds.
Bio:
Jags is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire and other distributed in-memory Bio:
Jags Ramnarayan is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire and other distributed in-memory products.
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
Despite the increase of deep learning practitioners and researchers, many of them do not use GPUs, this may lead to long training/evaluation cycles and non-practical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!), It covers the very basics of how GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
Two popular tools for doing Machine Learning on top of JVM ecosystem is H2O and SparkML. This presentation compares these two tools as Machine Learning libraries (Didn't consider Spark's Data Munjing perspective). This work was done during June of 2018.
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bringing execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
Overview and extended description: AI is expected to be the engine of technological advancements in the healthcare industry, especially in the areas of radiology and image processing. The purpose of this session is to demonstrate how we can build a AI-based Radiologist system using Apache Spark and Analytics Zoo to detect pneumonia and other diseases from chest x-ray images. The dataset, released by the NIH, contains around 110,00 X-ray images of around 30,000 unique patients, annotated with up to 14 different thoracic pathology labels. Stanford University developed a state-of-the-art model using CNN and exceeds average radiologist performance on the F1 metric. This talk focuses on how we can build a multi-label image classification model in a distributed Apache Spark infrastructure, and demonstrate how to build complex image transformations and deep learning pipelines using BigDL and Analytics Zoo with scalability and ease of use. Some practical image pre-processing procedures and evaluation metrics are introduced. We will also discuss runtime configuration, near-linear scalability for training and model serving, and other general performance topics.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two- to three-orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the rooline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism, and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds a INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and protyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
3. About us
▪ PHD students from ADALab at UCSD, advised by
Prof. Arun Kumar
▪ Our research mission: democratize data science
▪ More:
Supun Nakandala
https://scnakandala.github.io/
Yuhao Zhang
https://yhzhang.info/
ADALab
https://adalabucsd.github.io/
5. Problem: training deep nets is Painful!
Batch size?
8, 16, 64, 256 ...
Model architecture?
3 layer CNN,5 layer
CNN, LSTM…
Learning rate?
0.1, 0.01, 0.001,
0.0001 ...
Regularization?
L2, L1, Dropout,
Batchnorm ...
4 4 4 4
256 Different configurations !
Model performance = f(model architecture, hyperparameters, ...)
→Trial and error
Need for speed → $$$
(Distributed DL)
→ Better utilization of resources
6. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
7. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
8. Introduction - mini-batch SGD
Model
Updated Model
η ∇
Learning
rate
Avg. of
gradients
X1 X2 y
1.1 2.3 0
0.9 1.6 1
0.6 1.3 1
... ... ...
... ... ...
... ... ...
... ... ...
... ... ...
... ... ...
One
mini-batch
The most popular algorithm family for
training deep nets
10. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
14. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
15. Data Parallelism - Problem Setting
Models(tasks)
Partitioned data
High data scalability
16. Data Parallelism
Queue
Training on one mini-batch
or full partition
● Update only per epoch: bulk synchronous parallelism
(model averaging)
○ Bad convergence
● Update per mini-batch: sync parameter server
○ + Async updates: async parameter server
○ + Decentralized: MPI allreduce (Horovod)
○ High communication cost
Updates
17. Task Parallelism
+ high throughput
- low data scalability
- memory/storage wastage
Data Parallelism
+ high data scalability
- low throughput
- high communication cost
Model Hopper Parallelism (Cerebro)
+ high throughput
+ high data scalability
+ low communication cost
+ no memory/storage wastage
18. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
27. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
29. Implementation Details
▪ Spark DataFrames converted to partitioned Parquet
and locally cached in workers
▪ TensorFlow threads run training on local data
partitions
▪ Model Hopping implemented via shared file system
30. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
31. Example: Grid Search on
Model Selection + Hyperparameter
Search
▪ Two model architecture: {VGG16, ResNet50}
▪ Two learning rate: {1e-4, 1e-6}
▪ Two batch size: {32, 256}
32. Initialization
from pyspark.sql import SparkSession
import cerebro
spark = SparkSession.builder.master(...) # initialize spark
spark_backend = cerebro.backend.SparkBackend(
spark_context=spark.sparkContext, num_workers=num_workers
) # initialize cerebro
data_store = cerebro.storage.HDFSStore('hdfs://...') # set the shared data
storage
33. Define the Models
params = {'model_arch':['vgg16', 'resnet50'], 'learning_rate':[1e-4, 1e-6], 'batch_size':[32, 256]}
def estimator_gen_fn(params):
'''A model factory that returns an estimator,
given the input hyper-parameters, as well as model architectures'''
if params['model_arch'] == 'resnet50':
model = ... # tf.keras model
elif params['model_arch'] == 'vgg16':
model = ... # tf.keras model
optimizer = tf.keras.optimizers.Adam(lr=params['learning_rate']) # choose optimizer
loss = ... # define loss
estimator = cerebro.keras.SparkEstimator(model=model,
optimizer=optimizer,
loss=loss,
batch_size=params['batch_size'])
return estimator
34. Run Grid Search
df = ... # read data in as Spark DataFrame
grid_search = cerebro.tune.GridSearch(spark_backend,
data_store,
estimator_gen_fn,
params,
epoch=5,
validation=0.2,
feature_columns=['features'],
label_columns=['labels'])
model = grid_search.fit(df)
35. Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
36. Tests - Setups - Hardware
▪ 9-node cluster, 1 master + 8 workers
▪ On each nodes:
▪ Intel Xeon 10-core 2.20 GHz CPU x 2
▪ 192 GB RAM
▪ Nvidia P100 GPU x 1
42. Tests - Cerebro-Spark Gantt Chart
▪ Only overhead: stragglers randomly caused by TF 2.1 Keras Model saving/loading.
Overheads range from 1% to 300%
Stragglers