End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta (Databricks)
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
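As a rough sketch of these two features (not the Hopsworks pipeline itself), the following PySpark snippet shows expectation-style validation, a schema-enforced write, and a time-travel read; the table path, column names, and version number are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("feature-validation-demo").getOrCreate()

    # Hypothetical raw feature data; path and columns are illustrative only.
    features = spark.read.parquet("/tmp/raw_clicks")

    # Expectation-style validation: drop rows with missing values before they
    # can reach the feature store and harm model training.
    valid = features.filter(col("user_id").isNotNull() & col("click_rate").isNotNull())

    # Schema enforcement: Delta rejects appends whose schema does not match the table.
    valid.write.format("delta").mode("append").save("/tmp/feature_store/clicks")

    # Time travel: re-read the exact snapshot that was used to train a given model.
    training_data = (spark.read.format("delta")
                     .option("versionAsOf", 12)   # version recorded at training time
                     .load("/tmp/feature_store/clicks"))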
We will also discuss the next steps needed to take this work further. Finally, we will give a live demo showing how Delta can be used in end-to-end ML pipelines with Spark on Hopsworks.
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs (Databricks)
Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged providing cost, performance and power consumption advantages. Field programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well-known for high-performance dense-matrix, highly regular operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to inform decisions about if and how to accelerate.
This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL, via services on Amazon Web Services (AWS) cloud. These solutions’ goal is providing Spark users high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user’s application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
In the last two years, Netflix has seen a mass migration to Spark from Pig and other MR engines. This talk will focus on the challenges of that migration and the work that has made it possible. This will include contributions that Netflix has made to Spark to enable wider adoption and on-going projects to make Spark appeal to a broader range of analysts, beyond data and ML engineers.
Speaker: Ryan Blue
- A brief introduction to Spark Core
- Introduction to Spark Streaming
- A Demo of Streaming by evaluating the top hashtags being used
- Introduction to Spark MLlib
- A Demo of MLlib by building a simple movie recommendation engine
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... (Databricks)
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
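As an illustrative sketch of the "dynamic DDL" idea (not GoPro's actual implementation, which used Spark Streaming, Kafka, HBase, Hive and S3), the PySpark snippet below lets a sample payload from the producer dictate the schema and lands each micro-batch in an automatically created, SQL-queryable table; the broker, topic, and table names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json

    spark = (SparkSession.builder.appName("dynamic-ddl-demo")
             .enableHiveSupport().getOrCreate())

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "device_events")               # placeholder topic
           .load())

    # Infer the schema from a sample payload, so the data provider drives the DDL.
    sample = '{"device_id": "cam-001", "event": "record_start", "ts": 1540000000}'
    schema = spark.read.json(spark.sparkContext.parallelize([sample])).schema

    events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))

    # Each micro-batch is appended to a table that data scientists can query in SQL.
    query = (events.writeStream
             .foreachBatch(lambda df, _: df.write.mode("append").saveAsTable("device_events"))
             .option("checkpointLocation", "/tmp/checkpoints/device_events")
             .start())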
Scaling Apache Spark on Kubernetes at Lyft (Databricks)
Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, and Apache Spark has evolved to handle both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to take ETL data processing to a different level. In this talk, Li Gao and Rohit Menon will discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes.
- A deep dive into Lyft's multi-cluster setup and operations for handling petabytes of production data.
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling.
- Dynamic job scale estimation and runtime dynamic job configuration.
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speakers: Li Gao, Rohit Menon
Build Large-Scale Data Analytics and AI Pipeline Using RayDP (Databricks)
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and utilize Ray's in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
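A minimal sketch of this pattern, based on RayDP's documented entry point (https://github.com/oap-project/raydp); the cluster sizing values and the toy DataFrame are placeholders.

    import ray
    import raydp

    ray.init()  # or ray.init(address="auto") to join an existing Ray cluster

    # Start an Apache Spark job inside the Ray cluster.
    spark = raydp.init_spark(app_name="raydp-demo",
                             num_executors=2,
                             executor_cores=2,
                             executor_memory="4GB")

    df = spark.range(0, 1_000_000).withColumnRenamed("id", "feature")

    # Hand the preprocessed data to other Ray libraries via Ray's object store,
    # e.g. as a Ray Dataset that a distributed training library can consume.
    ds = ray.data.from_spark(df)
    print(ds.count())

    raydp.stop_spark()
    ray.shutdown()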
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow (Databricks)
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
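To make the MLflow side of this concrete, here is a hedged sketch of plain MLflow autologging plus ONNX model logging (using the skl2onnx converter); it is not the Flock extensions described in the talk, and the toy model is illustrative only.

    import mlflow
    import mlflow.onnx
    from skl2onnx import to_onnx
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    mlflow.autolog()  # automatic logging of parameters, metrics, and the fitted model

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    with mlflow.start_run():
        model = LogisticRegression(max_iter=200).fit(X, y)

        # Convert to ONNX and log it so the ONNX Runtime can be used for inference.
        onnx_model = to_onnx(model, X[:1].astype("float32"))
        mlflow.onnx.log_model(onnx_model, artifact_path="model_onnx")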
Resource-Efficient Deep Learning Model Selection on Apache Spark (Databricks)
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Building Reliable Data Lakes at Scale with Delta Lake (Databricks)
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark APIs. It provides high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
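As a rough, hedged illustration of the ACID behaviour the tutorial covers (not the tutorial's own notebooks), the PySpark sketch below performs an atomic upsert (MERGE) into a Delta table; it assumes the delta-spark package is available, and the paths and columns are made up.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder.appName("delta-acid-demo")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Initial batch load creates the table.
    spark.createDataFrame([(1, "new"), (2, "new")], ["order_id", "status"]) \
         .write.format("delta").save("/tmp/delta/orders")

    # Later updates (from a batch job or a streaming micro-batch) are merged
    # atomically: concurrent readers never observe a half-applied update.
    updates = spark.createDataFrame([(2, "shipped"), (3, "new")], ["order_id", "status"])

    target = DeltaTable.forPath(spark, "/tmp/delta/orders")
    (target.alias("t")
           .merge(updates.alias("u"), "t.order_id = u.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())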
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof... (Spark Summit)
In this talk, we will present the SPynq framework: a framework for the efficient mapping and acceleration of Spark applications on heterogeneous all-programmable MPSoC-based platforms, such as Zynq. Spark has been mapped to the Pynq platform, and the proposed framework allows the seamless utilization of the programmable logic for the hardware acceleration of computationally intensive Spark kernels. We have also developed the required libraries in Spark, by extending the MLlib library, that hide the accelerator's details and minimize the design effort needed to utilize the accelerators. A cluster of 4 nodes (workers) based on the all-programmable MPSoCs has been implemented, and the proposed platform is evaluated on a typical machine learning application based on logistic regression. The logistic regression kernel has been developed as an accelerator and incorporated into Spark. The developed system is compared to a high-performance Xeon cluster that is typically used in cloud computing. The performance evaluation shows that the heterogeneous accelerator-based MPSoC can achieve up to 2.3x system speedup compared with a Xeon system (at 90% accuracy) and 20x better energy efficiency. For embedded applications, the proposed system can achieve up to 40x speedup compared to a software-only implementation on low-power embedded processors and 30x lower energy consumption.
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA (Databricks)
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021, a new EU regulation (EU 2019/1381) came into force, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.
To comply with this new regulation, delaware has been supporting EFSA in undergoing a large Digital Transformation program. We have been designing and rolling out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned with changing processes and data models. At the heart of the platform lie two important patterns:
1. An Event-Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events in near real-time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publish/subscribe mechanism.
2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Event streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources; a minimal sketch of this ingestion pattern is shown below. This store in turn feeds into APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real-time.
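The sketch below is illustrative only (not EFSA's actual pipeline): it shows, in PySpark, how events published on the EDA layer can be continuously curated into a Delta "bronze" table of the lakehouse; the broker endpoint, topic, and paths are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("eda-to-lakehouse").getOrCreate()

    events = (spark.readStream.format("kafka")   # e.g. a Kafka-compatible event broker
              .option("kafka.bootstrap.servers", "eventbroker:9093")   # placeholder
              .option("subscribe", "risk-assessment-events")           # placeholder topic
              .load()
              .select(col("key").cast("string"),
                      col("value").cast("string"),
                      col("timestamp")))

    # Land the raw events as a curated, queryable Delta table in the lakehouse.
    (events.writeStream.format("delta")
           .option("checkpointLocation", "/lake/checkpoints/bronze_events")
           .start("/lake/bronze/risk_assessment_events"))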
At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum... (Spark Summit)
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
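As a hedged sketch of the self-managing idea (not the exact tooling demoed in the talk), an application can call the platform API to change its own worker count; the example below uses the official Kubernetes Python client, with placeholder deployment and namespace names.

    from kubernetes import client, config

    def scale_workers(replicas, deployment="spark-workers", namespace="my-app"):
        """Ask the platform API to change the number of Spark worker pods."""
        config.load_incluster_config()   # the application runs inside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name=deployment,
            namespace=namespace,
            body={"spec": {"replicas": replicas}})

    # The application decides its own scale, e.g. growing before a heavy stage
    # and shrinking back once the stage finishes.
    scale_workers(10)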
SparkR - Play Spark Using R (20160909 HadoopCon) (wqchen)
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
Have you considered keeping your children at home and giving it a whirl? Are you nervous about all the schooly stuff you need to do? Relax and enjoy as you learn together. You can do it! Magazines, free curriculum, software, and good old fun.
This is the presentation I gave at LESS2010 on the 18th of October 2010. The deck describes how we do lean product development and customer development, and what the current results are.
Here's all you want to know about cloud computing: why it is used, its advantages, structure, etc. All queries regarding cloud computing are addressed in this presentation. For a demo of such software in the accounting field, visit www.arcus-universe.com
How Online Education Can Be Used to Reduce Illiteracy in Remote Areas (shuvo510)
Nowadays, online education is a popular medium of education. How can online education be used to reduce the illiteracy problem in the remote areas of our country?
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/seamless-deployment-of-multimedia-and-machine-learning-applications-at-the-edge-a-presentation-from-qualcomm/
Megha Daga, Senior Director of Product Management for AIoT at Qualcomm, presents the “Seamless Deployment of Multimedia and Machine Learning Applications at the Edge” tutorial at the May 2022 Embedded Vision Summit.
There has been an explosion of opportunities for edge compute solutions across the internet of things. This growth in opportunities and the diversity of applications is leading to fragmentation in the IoT space both in hardware and software, which creates challenges for developers. In addition, customers and developers are facing challenges in efficient data management and optimized application deployment on embedded edge platforms.
In this session, Daga introduces the Qualcomm Intelligent Multimedia SDK, which empowers developers to tackle these challenges and deploy edge compute applications in a scalable, flexible and optimized way. The Qualcomm Intelligent Multimedia SDK easily decodes and organizes sensor data and executes applications efficiently on edge platforms.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn (Grokking VN)
This tech talk by Khai Tran covers LinkedIn's data pipeline, which is used to collect tens of billions of messages every day, and how they run a real-time processing system to aggregate this data for metrics monitoring.
Some of the points the talk will cover:
- An introduction to LinkedIn's unified metrics platform
- How LinkedIn sets up its big data pipeline using Kafka, HDFS, Apache Calcite and Apache Samza
- The concept of nearline storage, and how LinkedIn moved from an offline architecture to a nearline architecture
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Currently a staff software engineer at LinkedIn, responsible for the metrics monitoring system. Previously worked at Amazon AWS and Oracle.
- PhD, University of Wisconsin-Madison, with research on database systems.
Total Knockout: Start-to-Finish Development of Suitability Applications Using... (Blue Raster)
2014 Esri International Developer Summit User Presentation
Learn strategies for developing cross-platform suitability analysis applications using ArcGIS Image Server, D3 and Knockout.
Round 1: Suitability Web Services, automating raster pre-processing and mosaic creation, creating custom raster functions, publishing and consuming image services
Round 2: Visualization, calculating statistics for user drawn polygons, displaying results using D3, creating map and raster data export
Round 3: Building the App, building suitability user interfaces in JavaScript, mobile development strategies, managing data using Knockout
Offline maps for mobile developers (Android/iOS) (Vadim Nikolaev)
Accessing maps offline is a common problem for mobile developers. The presentation considers the reasons this need arises and possible libraries for solving the task.
Abhishek Singhal, Riversoft
Ben O’Donnell, BIMobject
Albert Szilvasy, Autodesk
AutoCAD DWG files are widely used in many industries today, and the Forge platform provides REST APIs to unlock the data inside them or to create new ones. This class will show how to securely connect Forge to various data storage services where your DWGs reside. It will demonstrate how to access a database while processing a DWG, a critical piece for many customers who embed database keys in their DWG files. Finally, we will look at how a real-world customer uses the Design Automation APIs today.
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain (MDC_UNICA)
Flexibility and high efficiency are common design drivers in the embedded systems domain. Coarse-grained reconfigurable coprocessors can tackle these issues, but they suffer from complex design, debugging and application-mapping problems. In this paper, we propose an automated design flow that aids developers in designing and managing coarse-grained reconfigurable coprocessors. It provides both the hardware IP and the software drivers, featuring two different levels of coupling with the host processor. The presented solution has been tested on a JPEG codec, targeting a commercial Xilinx Virtex-5 FPGA.
As a leading IT service provider in the consumer finance field, Shanghai Rongzhijia Financial Information Service Co., Ltd. built China's first Internet loan search platform. It went from zero to over 30 million users, who have taken out nearly 15 billion RMB in loans through the platform within only two years.
These slides introduce how they evolved their IT system from a monolithic application to a Dubbo-based microservice architecture.
Ford Motor Company's mission to become both an Automotive and Mobility company has required an evolution in our analytics data flow, from traditional batch processing systems to dynamically routed stream processing based systems. Valuable data is continually being generated across the enterprise, from consumer WiFi in dealerships, robots working on the assembly line, and vehicle diagnostic data, and is now flowing into Ford's Real Time Streaming Architecture (RTSA). Our goal was to develop a provider agnostic, end to end solution to ingest and dynamically route individual streams of data in less than one second from edge node to Ford's on premise data center, or vice versa. The architecture dynamically scales in the cloud to reliably handle thousands of outbound and inbound transactions per second, with data provenance capabilities to audit data flow from end to end.
Romuald Zdebskiy, Games Lead for CEE, Microsoft
There is a lot of buzz about the cloud for game development. At the same time, there are scenarios that benefit more from the cloud as well as those that benefit less. The goal of this session is to walk you through more than 10 gamedev scenarios where the cloud works well and where it does not. We will use Microsoft Cloud – Azure – as an example.
In this presentation, keynoting a session at the 2012 Model Based Enterprise Summit on lightweight visualization, Consortium Executive Director Dave Opsahl explains how what we typically refer to as "visualization", while a needed and extremely valuable category of solutions, is not the same as "communication". The MBE Summit, hosted annually by the National Institute of Standards and Technology, is a gathering of managers, technologists, engineers, and thought leaders on how to enable the Model Based Enterprise (MBE).
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (Jen Aman)
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
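To give a flavour of "simple programs that label data", here is a hedged sketch using the labeling-function API of the current snorkel package (the API at the time of this talk may have differed); the spam example and thresholds are illustrative only.

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier

    SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

    @labeling_function()
    def lf_contains_link(x):
        # Messages containing a URL are probably spam.
        return SPAM if "http" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_short_message(x):
        # Very short messages are probably not spam.
        return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

    df = pd.DataFrame({"text": ["check out http://spam.example",
                                "thanks!",
                                "buy now http://offer.example"]})

    applier = PandasLFApplier([lf_contains_link, lf_short_message])
    label_matrix = applier.apply(df)   # noisy labels, later denoised by Snorkel's label model
    print(label_matrix)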
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
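For a concrete but hedged picture of GPU-aware cluster configuration: this webinar predates Spark 3.x resource scheduling, so treat the properties below as one current way to express the idea of giving each task its own GPU; the discovery-script path is a placeholder.

    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("dl-on-spark")
             # Each executor owns 2 GPUs and each task claims 1, so two deep learning
             # tasks can run per executor without contending for the same device.
             .config("spark.executor.resource.gpu.amount", "2")
             .config("spark.task.resource.gpu.amount", "1")
             .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
             .getOrCreate())

    def train_partition(rows):
        # Inside a task, the assigned device indices come from the task context.
        gpus = TaskContext.get().resources()["gpu"].addresses   # e.g. ["0"]
        # ... pin the deep learning framework to `gpus` here ...
        return [len(list(rows))]

    print(spark.sparkContext.parallelize(range(8), 4).mapPartitions(train_partition).collect())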
RISELab: Enabling Intelligent Real-Time Decisions (Jen Aman)
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
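As a small, plain-Python illustration of one of these ideas (skipping vertices whose ranks have already converged); this is a toy sketch, not the STICD implementation, and it uses a naive per-vertex delta check for convergence.

    def pagerank_skip_converged(out_links, damping=0.85, tol=1e-10, max_iter=100):
        """Power iteration that stops updating vertices once their rank stabilises."""
        n = len(out_links)
        in_links = {v: [] for v in out_links}
        for u, outs in out_links.items():
            for v in outs:
                in_links[v].append(u)

        rank = {v: 1.0 / n for v in out_links}
        converged = set()

        for _ in range(max_iter):
            new_rank = {}
            for v in out_links:
                if v in converged:            # skip work for already-converged vertices
                    new_rank[v] = rank[v]
                    continue
                s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
                new_rank[v] = (1 - damping) / n + damping * s
                if abs(new_rank[v] - rank[v]) < tol:
                    converged.add(v)
            rank = new_rank
            if len(converged) == n:
                break
        return rank

    # Tiny 4-vertex example with no dangling nodes.
    print(pagerank_skip_converged({0: [1], 1: [2], 2: [0, 3], 3: [2]}))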
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
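A hedged, non-distributed sketch of the levelwise idea using networkx for the SCC decomposition; it assumes the input graph has no dead ends, as the abstract requires, and is meant only to show the component-by-component processing order.

    import networkx as nx

    def levelwise_pagerank(G, damping=0.85, tol=1e-12, max_iter=200):
        n = G.number_of_nodes()
        rank = {v: 1.0 / n for v in G}

        # Condense the graph into strongly connected components and process them
        # in topological order; upstream ranks are final before a block starts.
        cond = nx.condensation(G)
        for c in nx.topological_sort(cond):
            block = cond.nodes[c]["members"]
            for _ in range(max_iter):
                diff = 0.0
                for v in block:
                    s = sum(rank[u] / G.out_degree(u) for u in G.predecessors(v))
                    new = (1 - damping) / n + damping * s
                    diff += abs(new - rank[v])
                    rank[v] = new
                if diff < tol:
                    break
        return rank

    # Small example; the self-loop on vertex 3 avoids a dead end.
    G = nx.DiGraph([(0, 1), (1, 2), (2, 1), (2, 3), (3, 3)])
    print(levelwise_pagerank(G))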
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Large Scale Multimedia Data Intelligence And Analysis On Spark
1. Large Scale Multimedia Data Processing on Spark and Related Applications at Baidu
Quan Wang, Baidu USA
2. Motivations
• Why Multimedia?
• Examples of large scale multimedia processing at Baidu:
– HD map generation and simulation for self-driving cars
– Image feature extraction and transform for CTR predictions
4. Data Challenge in the Self-driving Cars Project
• Backend map generation and simulation:
– Point cloud input of 40MB/s for a single 3D LiDAR sensor
– Counting all 2D/3D sensors => TB level of data, per hour, per car
• Example: HD map generation on Spark
5. Another Example: Baidu Image Search
• Billions of images need feature extractions & transformations for deep learning applications:
– Image recognition and classification
– Ranking for best picture to show
– CTR prediction
6. Challenges
• Core functions for feature extraction
• Efficient large-scale distributed execution of feature extraction with multimedia input support
• Plug and play for any feature extraction executable never designed for Spark; flexible and easy to use for platform users
7. Feature Extraction Core Function
• The feature extraction C++ program depends on the CDNN + OpenCV libraries
– Compute the per-pixel difference based on a pre-computed mean
– Feed the difference values into a pre-computed CDNN model
– Produce image features after multiple layers of computation
• Need streaming/pipe based function support
13-15. Technical Details: Binary Flow on Each Executor Node
[Diagram across slides 13-15: a partition of multimedia files is encoded and serialized into a binary stream, which is deserialized and decoded for the user logic; the output is encoded and serialized back into a binary stream, gathered as a local collection of output bytes on each executor and then as a master collection of bytes, and finally decoded/deserialized and written as output binary files on DFS.]
16. Implementation Highlights
• All data serialization and encoding is done inside the Spark RDD, and is thus linearly scalable
• Flexible output format (feature vectors, ranking scores, processed binaries, etc.)
• Easy to plug customized encoding/serialization functions directly into the platform
• Support for passing Spark internal information (e.g. partition id, task attempt id) into the user program
17. Flexibility
Flexible output format (feature vectors, ranking scores, processed binaries, etc.)
Easy to plug customized encoding/serialization functions directly into Spark
19. Conclusion
• Introduce the Binary Piped RDD for:
– Platform-level abstraction of input data formats in their original binary form
– Seamless streaming to and from existing executables/libraries for high-level data analysis and understanding
– Linear scalability with input data
• General data intelligence and analysis
– Binary input format + pipe-based bin/lib execution
Missing functionality in Spark/Hadoop
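The deck's own solution is a custom Binary Piped RDD; as a hedged approximation using only stock Spark APIs, the PySpark sketch below base64-encodes whole image files so they survive the line-oriented RDD.pipe() call into an external feature-extraction executable (the executable path and input/output locations are placeholders).

    import base64
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("binary-pipe-demo").getOrCreate()
    sc = spark.sparkContext

    # Read whole image files as (path, bytes) pairs.
    images = sc.binaryFiles("hdfs:///data/images/*.jpg")

    # Encode each binary payload as one base64 text line so it survives the
    # line-oriented pipe, then stream it through an external extractor binary.
    encoded = images.map(lambda kv: base64.b64encode(kv[1]).decode("ascii"))
    features = encoded.pipe("/opt/extract_features")   # placeholder executable

    features.saveAsTextFile("hdfs:///data/image_features")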