For predicting vehicle defects at BMW, a machine learning pipeline evaluating several thousand features was implemented. Because important features can point to specific defects, a feature selection approach was used. To further evaluate feature importance, several feature selection techniques (filters and wrappers) were implemented as Spark ML PipelineStages that operate on DataFrames, so they can be incorporated into a complete Spark ML Pipeline including preprocessing and classification. The general steps for building custom Spark ML Estimators are presented, the API of the newly implemented feature selection techniques is demonstrated, and results of a performance analysis are shown. Experiences gained and pitfalls to avoid are also shared.
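To make the Estimator pattern concrete, here is a minimal PySpark sketch (the talk's implementation is in Scala, and the variance-threshold rule below is illustrative, not one of the talk's actual techniques): a filter-style selector implemented as a custom Estimator that produces a Model usable as a Pipeline stage.

from pyspark.ml import Estimator, Transformer, Pipeline
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.stat import Summarizer
from pyspark.ml.classification import LogisticRegression

class VarianceSelectorModel(Transformer):
    """Transformer produced by the fit step; keeps the selected feature indices."""
    def __init__(self, indices, inputCol, outputCol):
        super().__init__()
        self.slicer = VectorSlicer(inputCol=inputCol, outputCol=outputCol, indices=indices)

    def _transform(self, df):
        return self.slicer.transform(df)

class VarianceSelector(Estimator):
    """Filter-style feature selector: keeps features whose variance exceeds a threshold."""
    def __init__(self, inputCol="features", outputCol="selected", threshold=1e-6):
        super().__init__()
        self.inputCol, self.outputCol, self.threshold = inputCol, outputCol, threshold

    def _fit(self, df):
        variances = df.select(Summarizer.variance(df[self.inputCol])).first()[0]
        keep = [i for i, v in enumerate(variances) if v > self.threshold]
        return VarianceSelectorModel(keep, self.inputCol, self.outputCol)

# The custom stage composes with standard stages in one Pipeline.
pipeline = Pipeline(stages=[
    VarianceSelector(threshold=1e-4),
    LogisticRegression(featuresCol="selected"),
])

A production-grade stage would additionally declare proper Params and implement persistence, which is part of what the talk covers for the Scala API.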
A tremendous backlog of predictive modeling problems in the industry and short supply of trained data scientists have spiked interest in automation over the last few years. A new academic field, AutoML, has emerged. However, there is a significant gap between the topics that are academically interesting and automation capabilities that are necessary to solve real-world industrial problems end-to-end. An even greater challenge is enabling a non-expert to build a robust and trustworthy AI solution for their company. In this talk, we’ll discuss what an industry-grade AutoML system consists of and the scientific and engineering challenges of building it.
Using MLOps to Bring ML to Production / The Promise of MLOps (Weaveworks)
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and with orchestrating between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications by establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction of MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
"Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems. In a typical machine learning application, practitioners must apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Automating the end-to-end process of applying machine learning offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform models that were designed by hand."
In this talk we will discuss how QuSandbox and the Model Analytics Studio can be used in the selection of machine learning models. We will also illustrate AutoML frameworks through demos and examples and show you how to get started.
Microsoft Introduction to Automated Machine Learning (Setu Chokshi)
A gentle introduction to Microsoft's AutoML SDK package. This presentation explains why automated machine learning deserves a place in any data scientist's toolbox. The AutoML SDK allows you to build and run machine learning workflows with the Azure Machine Learning service, and you can interact with the service from any Python environment, including Jupyter Notebooks or your favourite Python IDE.
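A minimal sketch of what submitting an automated ML experiment looks like with the v1 Azure ML Python SDK (azureml-train-automl); the dataset, label column and experiment name are hypothetical, and the exact surface may differ across SDK versions:

from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()          # reads the config.json downloaded from the portal

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_ds,           # assumed: a registered TabularDataset
    label_column_name="churned",      # hypothetical label column
    primary_metric="AUC_weighted",
    experiment_timeout_minutes=30,
)

run = Experiment(ws, "automl-demo").submit(automl_config)
best_run, fitted_model = run.get_output()   # best child run and its fitted pipeline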
The demos included in the presentation make use of Azure Notebooks.
Introduction to the new TensorFlow 2.x and the Coral AI Edge TPU hardware. The presentation introduces TensorFlow's main features, such as the Sequential and Functional APIs, mobile support with TensorFlow Lite, web support with TensorFlow.js, and Google Cloud support with TFX.
In addition, the presentation introduces the new Edge TPU architecture from Coral AI, including its main hardware features and a description of the compilation flow.
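As a quick taste of the two model-building styles mentioned above, plus the TensorFlow Lite conversion used for edge deployment (a minimal sketch; the layer sizes are arbitrary):

import tensorflow as tf

# Sequential API: a linear stack of layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Functional API: the same network as an explicit graph, which also supports
# multiple inputs/outputs and shared layers.
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
functional_model = tf.keras.Model(inputs, outputs)

# TensorFlow Lite conversion for mobile/edge targets. Coral Edge TPUs additionally
# require full-integer quantization and a pass through the edgetpu_compiler.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("model.tflite", "wb") as f:
    f.write(converter.convert())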
How to use Azure Machine Learning service to manage the lifecycle of your models. Azure Machine Learning uses a Machine Learning Operations (MLOps) approach, which improves the quality and consistency of your machine learning solutions.
Explainability for Natural Language Processing (Yunyao Li)
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
  title={Explainability for Natural Language Processing},
  author={Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
  journal={AACL-IJCNLP 2020},
  year={2020}
}
Presenters: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
The key challenge in making AI technology more accessible to the broader community is the scarcity of AI experts. Most businesses simply don't have the much-needed resources or skills for modeling and engineering. This is why automated machine learning and deep learning technologies (AutoML and AutoDL) are increasingly valued in academia and industry. The core of AI is model design. Automated machine learning technology lowers the barriers to AI application, enabling developers with no AI expertise to independently and easily develop and deploy AI models. Automated machine learning is expected to completely overturn the AI industry in the next few years, making AI ubiquitous.
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud (Márton Kodok)
Vertex AI is a managed ML platform for practitioners to accelerate experiments and deploy AI models.
Enhanced developer experience
- Build with the groundbreaking ML tools that power Google
- Approachable from the non-ML developer perspective (AutoML, managed models, training)
- Ease the life of a data scientist/ML engineer (feature store, managed datasets, endpoints, notebooks)
- Infrastructure management overhead has been almost completely eliminated
- Unified UI for the entire ML workflow
- End-to-end integration for data and AI, with pipelines that handle complex ML tasks
- Explainable AI and TensorBoard to visualize and track ML experiments
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. It is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as tracking experiments, reproducibility, deployment tooling and model versioning. Get ready to get your hands dirty with a quick ML project using MLflow, releasing it to production to understand the MLOps lifecycle.
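A minimal sketch of how the Tracking, Models and Registry components fit together in code (the run below trains a toy model; registering a model assumes a registry-capable tracking server):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True))

with mlflow.start_run():                       # Tracking: one experiment run
    model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    mlflow.log_param("n_estimators", 200)      # log hyperparameters...
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))  # ...and metrics
    # Models + Registry: package the model and (optionally) register a version.
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-rf")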
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps evolved from 2018 to 2020, and what I expect for the future of MLOps.
AI-Assisted Feature Selection for Big Data Modeling (Databricks)
The number of features going into models is growing at an exponential rate thanks to the power of Spark. So is the number of models each company is creating. The common approach is to throw as many features as you can into a model.
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
As AI becomes more and more prevalent in our lives, the decisions it makes for us are becoming more and more impactful on our lives and those of others.
How can we help people trust the models we're building? The field of Explainable AI focuses on making any machine learning model interpretable by non-experts.
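One widely used post-hoc technique is SHAP, which attributes a model's prediction to its input features; a short sketch assuming the shap package:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)           # fast, exact attribution for tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])
shap.summary_plot(shap_values, X.iloc[:100])    # global view of feature influence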
Banking Circle: Money Laundering Beware: A Modern Approach to AML with Machin... (Neo4j)
by Ruben Menke, Lead Data Scientist at Banking Circle
In this talk, Banking Circle will show how a modern computational method is essential in the fight against money laundering.
As the complexity of choosing optimised, task-specific steps and ML models is often beyond non-experts, the rapid growth of machine learning applications has created demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area, which targets the progressive automation of machine learning, AutoML.
Although it focuses on end users without expert knowledge, AutoML also offers new tools to machine learning experts, for example to:
1. Perform architecture search over deep representations
2. Analyse the importance of hyperparameters (see the sketch below).
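A small sketch of item 2, in the spirit of fANOVA-style analysis: sample random configurations, record their cross-validated scores, then fit a surrogate model over the configurations and read off which hyperparameters drive performance (the search space here is illustrative):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
names = ["learning_rate", "max_depth", "n_estimators"]

configs, scores = [], []
for _ in range(20):                      # sample 20 random configurations
    cfg = [10 ** rng.uniform(-3, 0), int(rng.integers(1, 6)), int(rng.integers(50, 300))]
    est = GradientBoostingClassifier(**dict(zip(names, cfg)))
    configs.append(cfg)
    scores.append(cross_val_score(est, X, y, cv=3).mean())

# Surrogate model: which hyperparameters explain the variation in score?
surrogate = RandomForestRegressor(random_state=0).fit(configs, scores)
for name, imp in zip(names, surrogate.feature_importances_):
    print(f"{name}: {imp:.2f}")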
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma... (Databricks)
Building accurate machine learning models has been an art of data scientists, involving algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through these 'black arts' have begun. We have developed a Spark-based automatic predictive modeling system that searches for the best algorithm, the best parameters and the best features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. Our evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover a highly accurate one in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB memory and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints.
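A sketch of the kind of automated search Spark makes easy (this is standard Spark ML tuning, not the authors' system; train_df is an assumed DataFrame with label and features columns):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 200])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=4)   # evaluate candidate models concurrently
model = cv.fit(train_df)             # returns the best model found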
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... (Spark Summit)
In this talk, we'll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems, which personalize the Netflix experience for its 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the "missing MatPlotLib" for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line holds a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data arriving at low frequency but in large volumes; 2) continuous data arriving at high frequency, where each record is small but the total volume is very large, such as vibration data used to assess equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, unordered and unbounded. Making effective real-time decisions to extract value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput and reliable operational system covering data acquisition, transmission, analysis and storage. The actual use case proved that the system meets the needs of real-time decision-making. The system greatly improves predictive fault repair and production-line material tracking efficiency, and can reduce the labor required on the production lines by about half.
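A condensed sketch of the described pattern in PySpark: consume high-frequency sensor readings from Kafka, aggregate them in event-time windows, and emit alerts with low latency (broker address, topic and field names are hypothetical, and the Kafka connector package must be on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("line-monitor").getOrCreate()
schema = (StructType()
          .add("machine_id", StringType())
          .add("vibration", DoubleType())
          .add("ts", TimestampType()))

readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensor-readings")
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# 10-second event-time windows; the watermark bounds state kept for late data.
alerts = (readings
          .withWatermark("ts", "30 seconds")
          .groupBy(window("ts", "10 seconds"), "machine_id")
          .agg(avg("vibration").alias("avg_vibration"))
          .where("avg_vibration > 0.8"))      # illustrative threshold

query = alerts.writeStream.outputMode("update").format("console").start()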
From Pipelines to Refineries: scaling big data applications with Tim Hunter (Databricks)
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that can involve hundreds of transformation steps, especially when confronted with the need for rapid iteration. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole-program checks, auto-caching, and aggressive computation parallelization and reuse.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.... (Databricks)
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do you deploy these ML model to a production environment? How do you embed what you’ve learned into customer facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter (Databricks)
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
- the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
- how to work with complex data such as images in Spark and Deep Learning Pipelines;
- how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.
Finally, we discuss integration with popular deep learning frameworks.
Spark Streaming Programming Techniques You Should Know with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ... (Databricks)
Current workshop diagnostics are based on manually-generated decision trees. This approach is increasingly reaching its limits due to growing variant diversity and the increasing complexity of vehicle systems. This session will describe BMW's new Apache Spark-enabled approach: use the available data from cars and workshops to train models that are able to predict the right part to replace, or the action to take.
You’ll get an overview and presentation of BMW’s complete pipeline including ETL, model training based on Spark 2.1, serializing results along with metadata and serving the gained insights as Web-App. You’ll also hear how Spark helped BMW leverage the information from millions of observations and thousands of features, and learn what pitfalls they experienced (e.g. setting up a working dev-toolchain, working with 50K features, parallelizing well), and how you can avoid them.
Evolution of a Vehicle After It Has Been Released: How It's Made and Managed (Samuel Festus)
A research project based on the theme: the evolution of a vehicle after it has been released, how it's made, and how it's managed.
It focuses on the Renault system design used in automotive manufacturing, as well as serial-life management.
James Goel, MIPI Technical Steering Group chair, shares a state-of-the-art MASS (MIPI Automotive SerDes Solutions) display architecture that leverages the latest MIPI DSI-2℠ protocols using VDC-M visually lossless compression algorithms to optimize pixel bandwidth within tightly constrained display systems.
Triple Forward Camera from Tesla Model 3 (system_plus)
Complete analysis of the main sensing component of Tesla's Autopilot system.
Reverse Costing - Structure, process and cost report - find more here: https://www.systemplus.fr/reverse-costing-reports/triple-forward-camera-from-tesla-model-3/
Intland Software | codeBeamer ALM: What's in the Pipeline for the Automotive ... (Intland Software GmbH)
This talk was presented by Andreas Pabinger and Benjamin Engele (Intland Software) at Intland Connect: Annual User Conference 2020 on 22 Oct 2020. To learn more, visit: https://intland.com/intland-connect-annual-user-conference-2020/
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever... (Yole Developpement)
A cutting-edge ToF imager technology from Sony/Softkinetic, adapted by Melexis for automotive in-cabin applications.
Today, Time-of-Flight (ToF) systems are among the most innovative technologies offering imaging companies an opportunity to lead the market. Every major player wants to integrate these devices to provide functions such as 3D imaging, proximity sensing, ambient light sensing and gesture recognition.
Sony/Softkinetic has been investigating this technology deeply, providing a unique pixel technology to several image sensor manufacturers in three application areas: consumer, automotive and industrial. For automotive applications, Sony/Softkinetic has licensed its technology to Melexis, which has worked on the pixel design to provide a ToF imager for gesture recognition.
The MLX75023 is an automotive 3D ToF Imager already integrated into gesture recognition systems from car makers like BMW. The 3D ToF Imager is packaged using Glass Ball Grid Array technology.
The device comprises the sensor die and the glass filter in the same component, in thin 0.7 mm packaging.
This report analyzes the complete component, from the glass near-infrared band pass filter to the collector, based on the ToF pixel technology licenses developed by Softkinetic and improved by Melexis.
The report includes a complete cost analysis and price estimation of the device, based on a detailed description of the package and the ToF imager.
It also features a complete ToF pixel technology comparison with the Infineon, STMicroelectronics and Texas Instruments ToF imagers, which are also based on Sony/Softkinetic technology, with details on the companies' choices.
More information on that report at http://www.i-micronews.com/reports.html
Overview of XBOM and its functional enablement around regulatory compliance, new audit capabilities, and overall efficiencies throughout the entire supply chain, from the OEM down to the smallest vendor.
Position Sensor IC Innovations Creating Value in Automotive Applications (Heinz Oyrer)
Realizing that innovations in electronics can complement production innovations in gaining a competitive advantage, semiconductors are becoming increasingly important in the automotive value chain. Sensing is a major function of the electronics system, with position sensors as the biggest segment of automotive sensor demand. Magnetic position sensors in vehicles are increasingly installed for applications such as throttle valve position, suspension control, power-assisted steering, electronic gas pedals, electronic brake pedals and adaptive headlight systems. In addition, previously mechanically-driven applications such as power steering, transmission actuation and engine cooling are increasingly being electrically powered, driving the demand for brushless DC motors and thereby stimulating demand for motor position and control sensors in the automotive sector.
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar (DataScienceConferenc1)
Reproducibility of ML systems is an increasingly important topic in the ML community. Reproducibility ensures the conclusiveness of model performance, provides an understanding of how an ML system works, and reduces unnecessary errors when the system is deployed into production. With increasing AI regulation, it will soon become a requirement for many ML applications. In this talk, we will explore different aspects of reproducibility, such as reproducibility of the dataset, data processing, the ML model, its randomness and hyperparameters, and the code and SW environment, as well as concepts and practical tools such as data versioning, feature, metadata and artifact stores, model registries and containerization, which together ensure reproducibility of our experiments.
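A small sketch of a few reproducibility levers from the list above: pin the randomness, fingerprint the dataset, and record the software environment alongside the trained model (the metadata one would send to a model registry or artifact store):

import hashlib, json, platform, random
import numpy as np
import sklearn
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

X, y = load_wine(return_X_y=True)
data_fingerprint = hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()

model = RandomForestClassifier(random_state=SEED).fit(X, y)

run_record = {                         # metadata to version with the model
    "seed": SEED,
    "data_sha256": data_fingerprint,
    "python": platform.python_version(),
    "sklearn": sklearn.__version__,
    "params": {k: str(v) for k, v in model.get_params().items()},
}
print(json.dumps(run_record, indent=2))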
FPGA-Based Acceleration Architecture for Spark SQL with Qi Xie and Quanfu Wang (Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It leverages the highly parallel computing capability of FPGAs to accelerate Spark SQL queries and, thanks to the higher power efficiency of FPGAs compared with CPUs, lowers power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based engine units which perform basic computations such as substring, arithmetic and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, each of which is fed into an engine unit according to its pattern. SQL engine units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is finally transformed into a hardware pipeline. We will present performance benchmark results comparing queries on the FPGA-based Spark SQL acceleration architecture (XEON E5 plus FPGA) with Spark SQL queries on the XEON E5 alone, showing 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra (Spark Summit)
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models from weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work underway to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... (Spark Summit)
Graph is on the rise and it's time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we'll be explaining everything from the ground up!
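For a flavor of the GraphFrames motif API (a sketch assuming the graphframes package and an active SparkSession named spark; the motif finds mutual-follow pairs):

from graphframes import GraphFrame

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "a"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")   # declarative structural pattern
mutual.select("a.id", "b.id").show()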
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... (Spark Summit)
Building accurate machine learning models has been an art of data scientists, involving algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through these 'black arts' have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system that searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB memory and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF'17 (#SFds5), but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim Dowling (Spark Summit)
In Sweden, from the RISE ICE Data Center at www.hops.site, we provide researchers with both Spark-as-a-Service and, more recently, TensorFlow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with TensorFlow, from TensorFrames to TensorFlowOnSpark to Databricks' Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorFlowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network with TensorFlow. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as low-level data manipulation, lack of support for image processing, text analytics and deep learning, and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub Wozniak (Spark Summit)
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about the NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin Kim (Spark Summit)
In Between (a mobile app for couples, downloaded 20M times globally), daily batches are run for extracting metrics, analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and expandability of Spark, data operations have become extremely efficient. The entire team, including Biz Dev, Global Operations and Designers, enjoys the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. After this presentation, listeners will know how a small and agile team lives with data (how we build the organization, culture and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... (Spark Summit)
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in the projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with... (Spark Summit)
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... (Spark Summit)
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim Simeonov (Spark Summit)
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... (Spark Summit)
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and sends notifications describing the problem detected, so users can act quickly to avoid serious problems that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code and will be able to adapt and implement it in your company. This solution already helped prevent about US$3M in losses last year.
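A toy sketch of the core idea: fit a regression on a healthy period of a business metric, then alert when the observed value deviates from the prediction by more than a threshold (data, window sizes and threshold are synthetic and illustrative; the talk uses Spark MLlib rather than scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.arange(48).reshape(-1, 1)                 # hourly time index
revenue = 100 + 2.0 * hours.ravel() + np.random.default_rng(1).normal(0, 3, 48)
revenue[40:] -= 35                                   # simulated leakage/outage

model = LinearRegression().fit(hours[:36], revenue[:36])   # train on healthy window
residuals = revenue - model.predict(hours)
threshold = 3 * residuals[:36].std()                 # tolerance from healthy residuals

for t in np.where(np.abs(residuals) > threshold)[0]:
    print(f"hour {t}: deviation {residuals[t]:+.1f} exceeds threshold, notify on-call")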
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and cover the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways. First, the use of Databricks and AWS makes this a scalable implementation, with compute costs considerably lower than traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. Second, we create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, and journals. This standard representation reduces the recommendation problem to a pairwise similarity search, and hence offers a basic recommender for cross-product applications where no dedicated recommender engine exists. Third, traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. We, however, have roughly 25 million authorships that are manually curated or corrected from user feedback. Since it is crucial to maintain historical profiles, we have developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw output of the pairwise similarity function into final clusters. Lessons from this talk can help any company that wants to integrate its data or deduplicate its user, customer, or product databases.
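A minimal sketch of the fingerprinting idea with Spark ML's Word2Vec (the abstracts DataFrame and its column names are assumptions, and the talk's deep-learning encoders are not shown):

import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}

// Hypothetical DataFrame of abstracts with a text column "abstract".
val words = new Tokenizer()
  .setInputCol("abstract").setOutputCol("words")
  .transform(abstracts)

val w2v = new Word2Vec()
  .setInputCol("words").setOutputCol("fingerprint")
  .setVectorSize(100)   // size of the dense document fingerprint
  .setNumPartitions(64) // higher parallelism speeds up training, at some cost in accuracy

val fingerprints = w2v.fit(words).transform(words)
// Pairwise similarity now reduces to a distance between fingerprint vectors.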
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... (Spark Summit)
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation-plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan with the Spark SQL Catalyst optimizer. We conduct case studies on a series of ML models and matrix computations on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance comparison. Our experiments are performed on six real-world datasets: social network data (soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
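For a flavor of distributed matrix computation on Spark, here is a minimal sketch using the stock BlockMatrix API (MatFast itself extends Spark SQL; the matrices and block size here are arbitrary):

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Sparse matrices as (row, col, value) entries; sc is the SparkContext.
val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 3.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 1, 2.0), MatrixEntry(2, 0, 4.0)))

val a: BlockMatrix = new CoordinateMatrix(entriesA).toBlockMatrix(1024, 1024).cache()
val b: BlockMatrix = new CoordinateMatrix(entriesB).toBlockMatrix(1024, 1024).cache()

// Block size and partitioning determine shuffle volume; choosing them
// per operation is the kind of decision MatFast automates.
val product = a.multiply(b)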
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa... (Spark Summit)
Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:
1. Zeppelin server: a custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to the remote Spark context on Spark Job-server. A rich set of APIs is exposed to the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: a custom application is deployed, which implements the set of APIs exposed by the Zeppelin custom interpreter as one or more Spark jobs.
3. Context router: routes user queries from the custom interpreter to one of many Spark Job-servers / contexts.
The solution has the following characteristics:
* Multi-tenancy: there are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: the notebooks do not use the Spark interpreter, but a custom interpreter connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: the context router identifies which contexts are under heavy load or responding slowly, and selects the most suitable context for serving a user query.
* Efficiency: we use Alluxio for caching common datasets.
* Elastic resource usage: we use Spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it is doing actual work.
spark-bench is an open-source benchmarking tool, and it's also much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
1. Kaminski, Schlegel | Oct. 25, 2017
BUILDING CUSTOM ML PIPELINESTAGES
FOR FEATURE SELECTION.
SPARK SUMMIT EUROPE 2017.
2. WHAT YOU WILL LEARN DURING THIS SESSION.
What data-driven car diagnostics look like at BMW.
Get a good understanding of the most important elements of Spark ML PipelineStages (using a feature selection example).
Attention: there will be Scala code examples!
How to use spark-FeatureSelection in your Spark ML Pipeline.
The impact of feature selection on learning performance and on understanding the big data black box.
3. MOTIVATION.
The #3 contributor to warranty incidents for OEMs is "no trouble found" cases. [1]
Potential root causes:
Manually formalized expert knowledge cannot cope with the vast number of possibilities.
Cars are getting more and more complex (hybridization, connectivity).
Less experienced workshop staff in evolving markets.
Improve three workflows at once by shifting from a manual to a data-driven approach:
Automatic knowledge generation.
Automatic workshop diagnostics.
Predictive maintenance.
[1] BearingPoint, Global Automotive Warranty Survey Report 2009
20. SPARK PIPELINE API.
PipelineStage: the interface for usage in a Pipeline.
Estimator: 'learns from data' (data in, Transformer out).
Transformer: 'transforms data' (data in, data out).
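A minimal sketch of that contract, using a stock stage (column names are made up):

import org.apache.spark.ml.feature.StandardScaler

// An Estimator's fit() consumes a DataFrame and returns a Transformer (a Model);
// the Model's transform() maps a DataFrame to a new DataFrame.
val scaler = new StandardScaler()           // Estimator
  .setInputCol("features").setOutputCol("scaled")
val scalerModel = scaler.fit(train)         // learns from data -> Transformer
val scaled = scalerModel.transform(train)   // transforms data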
22. ORG.APACHE.SPARK.ML.*
PipelineStage
  Estimator: interface for usage in a Pipeline; learns from data.
    Pipeline: concatenates PipelineStages.
    Predictor: interface for predictors.
    FeatureSelector: interface for feature selection.
  Transformer: transforms data.
    Model: a fitted model.
      PipelineModel: model from a Pipeline.
      PredictionModel: model from a Predictor.
      FeatureSelectionModel: model from a FeatureSelector.
24. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.

abstract class FeatureSelector[
    Learner <: FeatureSelector[Learner, M],
    M <: FeatureSelectorModel[M]]
  extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {

  // Setters for params in FeatureSelectorParams. Returning Learner allows
  // setter chaining: mdl.setParam1(val1).setParam2(val2)...
  def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]

  // PipelineStage and Estimator methods.
  // transformSchema performs input validation and fails fast: it returns
  // the transformed schema or throws an exception.
  override def transformSchema(schema: StructType): StructType = {}

  // fit learns from data and returns a Model. Here: calculate feature importances.
  override def fit(dataset: Dataset[_]): M = {}

  override def copy(extra: ParamMap): Learner

  // Abstract methods that are called from fit(). Not necessary, but avoids
  // code duplication in the concrete selectors.
  protected def train(dataset: Dataset[_]): Array[(Int, Double)]
  protected def make(uid: String, selectedFeatures: Array[Int],
                     featureImportances: Map[String, Double]): M
}

The type parameters tell the Estimator what it shall return; the concrete model type M is defined later. DefaultParamsWritable makes all Params writable.

transformSchema works on a DataFrame schema such as: features: VectorColumn, selected: VectorColumn, label: Double. For example, a DataFrame whose features/label columns hold rows like [0,1,0,1] / 1.0 and [1,0,0,0] / 1.0. Attention: VectorColumns have metadata (name, type, range, etc.).

fit (= learn from data) maps a Dataset to a Transformer.
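A minimal sketch of what a concrete transformSchema could look like for such a selector, assuming the usual featuresCol/labelCol/outputCol params (not the package's verbatim code; SchemaUtils and VectorUDT are package-private to Spark, which is one reason to live inside org.apache.spark.ml):

import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types.StructType

override def transformSchema(schema: StructType): StructType = {
  SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)  // fail fast on wrong input
  SchemaUtils.checkNumericType(schema, $(labelCol))
  SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)       // the transformed schema
}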
37. MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.

abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]] (
    override val uid: String,
    val selectedFeatures: Array[Int],
    val featureImportances: Map[String, Double])
  extends Model[M] with FeatureSelectorParams with MLWritable {

  // Setters for params in FeatureSelectorParams
  def setFeaturesCol(value: String): this.type = set(featuresCol, value)

  // PipelineStage and Transformer methods: same idea as in the Estimator,
  // but with different tasks.
  override def transformSchema(schema: StructType): StructType = {}
  override def transform(dataset: Dataset[_]): DataFrame = {}  // transforms data

  def write: MLWriter  // adds persistence
}

MLWritable is mixed in for persistence.
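For illustration, a concrete transform could slice the input vector down to the selected indices; a minimal sketch of such a method body (metadata handling omitted, not the package's verbatim code):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}

override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema)  // validate the input and fail fast
  val indices = selectedFeatures.sorted
  val slice = udf { v: Vector => Vectors.dense(indices.map(v.apply)) }
  dataset.withColumn($(outputCol), slice(col($(featuresCol))))
}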
43. GIVING YOUR NEW PIPELINESTAGE PARAMETERS.

import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared._

private[selection] trait FeatureSelectorParams extends Params
  with HasFeaturesCol with HasOutputCol with HasLabelCol {
  // Define params and getters here...
  final val param = new Param[Type](this, "name", "description")
  def getParam: Type = $(param)
}

Mixing in the shared column params is possible because the package lives in org.apache.spark.ml. Param subclasses exist out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... For other types you need to implement jsonEncode and jsonDecode to maintain persistence. Getters are shared between Estimator and Transformer; setters are not, to allow setter chaining.
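A minimal sketch of one concrete, validated param with a default, as it could appear inside this trait (the name numTopFeatures is illustrative):

import org.apache.spark.ml.param.{IntParam, ParamValidators}

final val numTopFeatures = new IntParam(this, "numTopFeatures",
  "number of features to keep", ParamValidators.gtEq(1))  // built-in validation
setDefault(numTopFeatures -> 50)                          // sensible default
def getNumTopFeatures: Int = $(numTopFeatures)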
47. ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.

What has to be saved?
Metadata: uid, timestamp, version, ...
Parameters
Learnt data: selectedFeatures & featureImportances

Since we are in org.apache.spark.ml, metadata and parameters are handled by DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata(). For the learnt data, create a DataFrame and use write.parquet(...).

How do we do that? Create a companion object FeatureSelectorModel, which offers the following classes:

abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
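A minimal sketch of what the writer could look like, following the pattern Spark's own model writers use (the Data case class and the "data" subpath are illustrative):

import org.apache.hadoop.fs.Path
import org.apache.spark.ml.util.{DefaultParamsWriter, MLWriter}

class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {
  // Mirrors the learnt fields so they can be written as a DataFrame.
  private case class Data(selectedFeatures: Seq[Int],
                          featureImportances: Map[String, Double])

  override protected def saveImpl(path: String): Unit = {
    // Metadata: uid, timestamp, Spark version and all params.
    DefaultParamsWriter.saveMetadata(instance, path, sc)
    // Learnt data: a one-row DataFrame written as parquet.
    val data = Data(instance.selectedFeatures, instance.featureImportances)
    sparkSession.createDataFrame(Seq(data)).repartition(1)
      .write.parquet(new Path(path, "data").toString)
  }
}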
51. HOW TO USE SPARK-FEATURESELECTION.
54. HOW TO USE SPARK-FEATURESELECTION.

import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline

// Load data
val df = spark.read.parquet("path/to/data/train.parquet")

// Feature selectors; they offer different selection methods.
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

// VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
val dfT = plModel.transform(df).drop("features")

df (input):
features | label
[0,1,0,1] | 1.0
[0,0,0,0] | 0.0
[1,1,0,0] | 0.0
[1,0,0,0] | 1.0

fit computes per-feature scores:
Feature | F1  | F2  | F3  | F4
Score 1 | 0.9 | 0.7 | 0.0 | 0.5
Score 2 | 0.6 | 0.8 | 0.0 | 0.4

transform keeps only the selected features, dfT (output):
selected | label
[0,1] | 1.0
[0,0] | 0.0
[1,1] | 0.0
[1,0] | 1.0
58. SPARK-FEATURESELECTION PACKAGE.
Offers selection based on:
Gini coefficient
Correlation coefficient
Information gain
L1-logistic regression weights
Random forest importances
Utility stage:
VectorMerger
Three modes:
Percentile (default)
Fixed number of columns
Compare to random column [4]
Find it on GitHub (spark-FeatureSelection) or on spark-packages.
[4] Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
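A purely illustrative configuration of one selector for the modes above; the mode-related setter names here are assumptions, not the package's confirmed API, so check the spark-FeatureSelection docs:

// Illustrative only: setSelectorType / setPercentile are assumed names.
val sel = new GiniSelector()
  .setInputCol("features")
  .setOutputCol("gini")
  .setSelectorType("percentile") // or fixed number of columns / random cut-off
  .setPercentile(0.5)            // hypothetical: keep the top 50% of features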
59. PERFORMANCE.
[Chart: Area under normalized PRC and ROC, comparing FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees.]
[Chart: Time [s] for FS methods and random forest (Multibucketizer, Gini, Correlation, Information gain, Chi², Random forest), for the same four configurations.]
[Chart: Correlation between feature importances from feature selection (Chi², Correlation, Gini, InfoGain) and random forest.]
64. LESSONS LEARNT.
Know what your data looks like and where it is located! Example: operations can succeed in local mode, but fail on a cluster.
Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache is MEMORY_AND_DISK.
Do not reinvent the wheel for common methods: consider putting your stages into the spark.ml namespace.
Use the Spark web UI to understand your Spark jobs.
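A minimal sketch of the caching advice (df stands for any DataFrame known to fit in memory):

import org.apache.spark.storage.StorageLevel

// Dataset.cache() defaults to MEMORY_AND_DISK; be explicit when the
// data is known to fit in memory:
val cached = df.persist(StorageLevel.MEMORY_ONLY)
cached.count()  // materialize once; later stages reuse the in-memory copy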
65. QUESTIONS?
Marc.Kaminski@bmw.de
Bernhard.bb.Schegel@bmw.de
66. BACKUP.
67. DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
Own namespace vs. org.apache.spark.ml.*

Own namespace:
Pro: safer solution.
Con: code duplication.

org.apache.spark.ml.*:
Pro: less code duplication (sharedParams, SchemaUtils, ...); easier to implement persistence.
Con: more dangerous when not cautious.
68. FEATURE SELECTION.
Motivation:
Many sparse features: the feature space has to be reduced; select features that carry a lot of information for prediction.
Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.

Example: the Noise column carries little information about the XOR label, so feature selection (e.g. correlation, information gain, random forest) removes it.

Before:
F1 | F2 | Noise | Label = F1 XOR F2
0  | 0  | 0     | 0
1  | 0  | 0     | 1
0  | 1  | 0     | 1
1  | 1  | 1     | 0

Feature importance:
Feature 1: 0.7
Feature 2: 0.7
Noise: 0.2

After feature selection:
F1 | F2 | Label = F1 XOR F2
0  | 0  | 0
1  | 0  | 1
0  | 1  | 1
1  | 1  | 0
70. FEATURE SELECTION.

Filter:
Description: evaluate intrinsic data properties.
Advantages: fast; scalable.
Disadvantages: ignores inter-feature dependencies; ignores interaction with the classifier.
Examples: chi-squared, information gain, correlation.

Wrapper:
Description: evaluate model performance of a feature subset.
Advantages: captures feature dependencies; simple.
Disadvantages: classifier-dependent selection; computationally expensive; risk of overfitting.
Examples: genetic algorithms, search algorithms.

Embedded:
Description: feature selection is embedded in classifier training.
Advantages: captures feature dependencies.
Disadvantages: classifier-dependent selection.
Examples: L1-logistic regression, random forest.
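For comparison, Spark itself ships one filter-style selector out of the box; a minimal sketch of its use (a df with features and label columns is assumed):

import org.apache.spark.ml.feature.ChiSqSelector

val chi = new ChiSqSelector()
  .setNumTopFeatures(2)  // keep the two highest-scoring features
  .setFeaturesCol("features").setLabelCol("label").setOutputCol("selected")
val reduced = chi.fit(df).transform(df)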
71. CHALLENGES.
Big plans for DataFrames: performing many operations on many columns can take a long time to build and optimize the DAG.
Column limit for DataFrames, introduced by several JIRAs (especially SPARK-18016). Hopefully fixed in Spark 2.3.0.
Spark PipelineStages are not consistent in how they handle DataFrame schemas: sometimes no schema is appended.