Alok Singh presented on R4ML, an R frontend for Apache SystemML that integrates with Apache Spark's SparkR APIs. R4ML allows users to perform scalable machine learning tasks like linear regression, classification, and factorization on big data using R's familiar linear algebra syntax. It supports both built-in algorithms as well as custom algorithms written in DML. R4ML bridges gaps in SparkR by supporting a wider range of algorithms and data types like wide tables and images. The goal is to make distributed machine learning easier for data scientists and analysts.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements–at the high end, they may need to execute millions of tasks per second with millisecond level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
Balancing Automation and Explanation in Machine LearningDatabricks
For a machine learning application to be successful, it is not enough to give highly accurate predictions: Customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability - for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. However, with the right system design this doesn't mean it has to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into an open source AutoML library TransmogrifAI, an extension to SparkMlLib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics.
These applications have very demanding computational requirements–at the high end, they may need to execute millions of tasks per second with millisecond level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
Presented by Mostafa Madjipour., Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/ machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) It is time consuming; 2) slows the impact of data science team on business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages would be to implement a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML among others are developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Willump: Optimizing Feature Computation in ML InferenceDatabricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Seamless End-to-End Production Machine Learning with Seldon and MLflowDatabricks
Deploying and managing machine learning models at scale introduces new complexities. Fortunately, there are tools that simplify this process. In this talk we walk you through an end-to-end hands on example showing how you can go from research to production without much complexity by leveraging the Seldon Core and MLflow frameworks. We will train a set of ML models, and we will showcase a simple way to deploy them to a kubernetes cluster through sophisticated deployment methods, including canary deployments, shadow deployments and we’ll touch upon richer ML graphs such as explainer deployments.
Aggregate programming is a novel paradigm that addresses, at the core, many issues commonly found in the development of large-scale, situated, self-adaptive systems. It is a particular macro-programming approach where a developer expresses the behaviour of the system at the aggregate-level, by targeting the distributed computational machine which is given by the entire set of (possibly mobile and heterogeneous) networked devices pervading the environment. It is the model that takes care of turning a system-level behaviour specification into the concrete, device-centric programs executed locally by each component.
Aggregate computing is formally grounded in the field calculus, a minimal functional language that works with computational fields, i.e., distributed data structures mapping devices (digital representatives of space-time portions) to computational objects. Fields are a useful unifying abstraction for drawing a connection between the physical and the computational world, and between the local and global programming viewpoints. This approach is compositional, allowing to define layers of building blocks of increasing abstraction, and is also amenable to formal analyses.
In this talk, I will present scafi (SCAla with computational FIels), an aggregate computing framework for the Scala programming language which provides (i) an internal DSL for expressing aggregate computations as well as (ii) a library support for the configuration and execution of aggregate systems. There is no need to learn ad-hoc external DSLs anymore: with scafi, Scala programmers can instantaneously start playing with this new, intriguing development approach!
Reproducible AI using MLflow and PyTorchDatabricks
Model reproducibility is becoming the next frontier for successful AI models building and deployments for both Research and Production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability and speed up collaboration for AI projects.
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
"GOJEK, the Southeast Asian super-app, has seen an explosive growth in both users and data over the past three years. Today the technology startup uses big data powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products. From selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance. Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.
What you will learn:
Basics of object detection including main concepts and techniques
Main ideas from the DETR and Deformable DETR approaches to object detection
Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
How to serve models trained in Determined using MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
mlflow: Accelerating the End-to-End ML lifecycleDatabricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.
Scalable Automatic Machine Learning in H2OSri Ambati
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Recent Developments In SparkR For Advanced AnalyticsDatabricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do you deploy these ML model to a production environment? How do you embed what you’ve learned into customer facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
Presented by Mostafa Madjipour., Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/
Reducing the gap between R&D and production is still a challenge for data science/ machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services.
This process has some disadvantages: 1) It is time consuming; 2) slows the impact of data science team on business; 3) code rewriting is prone to errors.
A possible solution to overcome the aforementioned disadvantages would be to implement a deployment strategy that easily embeds/transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML among others are developed for this purpose.
In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine, paired with a set of business rules.
Willump: Optimizing Feature Computation in ML InferenceDatabricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S in UC Berkeley and Master from MIT.
Seamless End-to-End Production Machine Learning with Seldon and MLflowDatabricks
Deploying and managing machine learning models at scale introduces new complexities. Fortunately, there are tools that simplify this process. In this talk we walk you through an end-to-end hands on example showing how you can go from research to production without much complexity by leveraging the Seldon Core and MLflow frameworks. We will train a set of ML models, and we will showcase a simple way to deploy them to a kubernetes cluster through sophisticated deployment methods, including canary deployments, shadow deployments and we’ll touch upon richer ML graphs such as explainer deployments.
Aggregate programming is a novel paradigm that addresses, at the core, many issues commonly found in the development of large-scale, situated, self-adaptive systems. It is a particular macro-programming approach where a developer expresses the behaviour of the system at the aggregate-level, by targeting the distributed computational machine which is given by the entire set of (possibly mobile and heterogeneous) networked devices pervading the environment. It is the model that takes care of turning a system-level behaviour specification into the concrete, device-centric programs executed locally by each component.
Aggregate computing is formally grounded in the field calculus, a minimal functional language that works with computational fields, i.e., distributed data structures mapping devices (digital representatives of space-time portions) to computational objects. Fields are a useful unifying abstraction for drawing a connection between the physical and the computational world, and between the local and global programming viewpoints. This approach is compositional, allowing to define layers of building blocks of increasing abstraction, and is also amenable to formal analyses.
In this talk, I will present scafi (SCAla with computational FIels), an aggregate computing framework for the Scala programming language which provides (i) an internal DSL for expressing aggregate computations as well as (ii) a library support for the configuration and execution of aggregate systems. There is no need to learn ad-hoc external DSLs anymore: with scafi, Scala programmers can instantaneously start playing with this new, intriguing development approach!
Reproducible AI using MLflow and PyTorchDatabricks
Model reproducibility is becoming the next frontier for successful AI models building and deployments for both Research and Production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability and speed up collaboration for AI projects.
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
"GOJEK, the Southeast Asian super-app, has seen an explosive growth in both users and data over the past three years. Today the technology startup uses big data powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products. From selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
Object detection is a central problem in computer vision and underpins many applications from medical image analysis to autonomous driving. In this talk, we will review the basics of object detection from fundamental concepts to practical techniques. Then, we will dive into cutting-edge methods that use transformers to drastically simplify the object detection pipeline while maintaining predictive performance. Finally, we will show how to train these models at scale using Determined’s integrated deep learning platform and then serve the models using MLflow.
What you will learn:
Basics of object detection including main concepts and techniques
Main ideas from the DETR and Deformable DETR approaches to object detection
Overview of the core capabilities of Determined’s deep learning platform, with a focus on its support for effortless distributed training
How to serve models trained in Determined using MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource - which enables seamless data movement between Spark Data Frames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS . Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities manages a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
mlflow: Accelerating the End-to-End ML lifecycleDatabricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.
Scalable Automatic Machine Learning in H2OSri Ambati
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Recent Developments In SparkR For Advanced AnalyticsDatabricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do you deploy these ML model to a production environment? How do you embed what you’ve learned into customer facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Slides for a presentation I gave for the Machine Learning with Spark Tokyo meetup.
Introduction to Spark, H2O, SparklingWater and live demos of GBM and DL.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
https://www.eventbrite.com/e/talk-by-paco-nathan-graph-analytics-in-spark-tickets-17173189472
Big Brains meetup hosted by BloomReach, 2015-06-04
Case study / demo of a large-scale graph analytics project, leveraging GraphX in Apache Spark to surface insights about open source developer communities — based on data mining of their email forums. The project works with any Apache email archive, applying NLP and machine learning techniques to analyze message threads, then constructs a large graph. Graph analytics, based on concise Scala coding examples in Spark, surface themes and interactions within the community. Results are used as feedback for respective developer communities, such as leaderboards, etc. As an example, we will examine analysis of the Spark developer community itself.
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separated scoring engine. The idea of interoperability is great but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats and new standards, not based on export/import workflow and shares Apache Spark API.
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
Apache Spark is the next big data processing tool for Data Scientist. As seen on the recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end to end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's Architecture. What's out now and what's in Spark 2.0Spark APIs: Most common APIs used by Spark Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset. SparkSQL Analytics + Visualizations of the Dataset using MatplotLibSentiment Analysis on Reddit Comments
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data application are being written. It is rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because – Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists. – Its unified API and a single processing engine (i.e. Spark core engine) allows a single cluster and a single set of operational processes to cover the full spectrum of uses cases – batch, interactive and stream processing. – Its stronger, exactly-once semantics makes it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Similar to R4ML: An R Based Scalable Machine Learning Framework (20)
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Spark Technology Center
About Presenter Alok Singh
Principal Engineer at the IBM Spark
Technology Center. He is leading the R4ML
team at IBM. He has built/architected
analytical frameworks and implemented
various machine learning algorithms.
Interests: Big Data, Scalable Machine
Learning Software, and Algorithms.
Contact:
email- singh_alok@hotmail.com
Github -https://github.com/aloknsingh
2
3. Spark Technology Center
3
IBM
Spark Technology
Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Spark Technology Center
5. Spark Technology Center
5
R4ML
R4ML is an R-language
front end to Apache
SystemML that
integrates seamlessly
with Apache Spark’s
SparkR APIs.
6. Spark Technology Center
SparkR is a
Thin Wrapper
Over Apache
Spark
Brings in the powerful statistics from R to Spark
6
Source: SparkR talk by Shivaram Venkatraman in Spark Summit 15
7. Spark Technology Center
Linear Algebra
is the Language of
Machine Learning.
Linear algebra is
powerful,
precise,
and high-level.
Express complex transformations
over large arrays of data…
…using a small number of operations.
…in a clear and unambiguous way
7
8. Spark Technology Center
8
Linear Algebra at
Scale via
Distributed
Computing
• In R Linear Algebra and various matrix packages
are very useful to create the new algorithms
• Wouldn’t it be nice that all the linear algebra scale to
millions and billions of rows ?
• Wouldn’t it be nicer if user can write his code
without worrying about the internal distributed
framework.
9. Spark Technology Center
Apache
SystemML
(incubating)
Goal: Take the linear algebra formulation of an
algorithm and make it scale automatically
IBM Research technology, now an open-source
Apache top level project.
To learn more: http://systemml.apache.org9
Top-level project!
10. Spark Technology Center
LinearAlgebra-BasedProgrammingLanguages
L2 regularized linear regression by the conjugate gradient method in DML:
X = read.csv(…); # n x m feature matrix
y = read.csv(…); # n x 1 feature vector
maxi = 50; lambda = 0.001; ...
r = -(t(X) %*% y)
norm_r2 = sum(r * r); p = -r # initial gradient
w = matrix(0, ncol(X), 1); i = 0
while(i < maxi & norm_r2 > norm_r2_trgt) {
q=((t(X)%*%(X%*%p))+lambda*p); # compute conjugate gradient
alpha = norm_r2 / sum(p * q) # compute step size
# update model and residuals
w=w+alpha*p; r=r+alpha*q
old_norm_r2 = norm_r2
norm_r2 = sum(r^2); i = i + 1
p = -r + norm_r2/old_norm_r2 * p
}
write.csv(w, …);
Variables
are
matrices
Expressions
use linear
algebra
operations
10
16. Spark Technology Center
IBM
Pre-processing
• Typical machine learning use cases require many pre-processing steps
• R4ML provides and enhances these commonly used pre-processing options:
• NA removal
• Binning
• Normalizing
• Recoding
• One-hot encoding
• R4ML also supports all the standard SparkR pre-processors
18. Spark Technology Center
IBM
R4ML’s Built-in ML Algorithms
• Regression
• Linear Model (LM)
• Generalized Linear Model (GLM)
• Step Linear Model (Step.LM)
• Classification
• Multi logistic Regression(MLOGIT)
• Support Vector Machine (SVM)
• Factorization
• Principal Component Analysis (PCA)
• Alternating Least Square (ALS)
• Survival analysis
• Kaplan Meier (KM) Survival model
• Cox Proportional Hazard model
20. Spark Technology Center
IBM
Custom ML algorithms
• Allows user to create the custom machine learning algorithms for POC .
• Works via the Apache SystemML DML connector.
• User need to attach the input and output variables from R/R4ML to the DML
variables.
• User can create the library using the R package management.
23. Spark Technology Center
IBM
Airline Data Analysis using R4ML
• Do exploratory analysis and manual feature selection for the real dataset
• Pre-process the inputs and do proper na removal , recoding etc
• PCA examples:
• SVM example: (predict airline delay using the SVM)
• See the results
• Talks about wider table support in R4ML and systemML
• Illustrate the cross validation and model tuning
24. Spark Technology Center
IBM
Bridging the gap for the Wide columns datasets
• Many Dataset when done one hot encoding , creates many columns
• Many image processing applications requires many columns
• So one of the distinguishing feature of the systemML and R4ML is the support for the
wider columns.
24
26. Spark Technology Center
IBM
Future work
• More ML algorithm support.
• We will have many real world use cases, examples, and PDF demos.
• Many more new work items are in the pipeline. Please see the latest changes at
https://github.com/SparkTC/r4ml
• Other useful links
• SparkR: https://spark.apache.org/docs/latest/sparkr.html
• Apache SystemML: http://apache.github.io/systemml/algorithms-reference.html
27. Spark Technology Center
IBM
Summary
• Distributed ML is complex problem and Apache SystemML makes it easier and we have
created a R frontend which work with underlying SparkR and whole R eco system also.
• R4ML is an open-source scalable machine learning package for implementing end to end ML
flows.
• It bridges important gaps in SparkR and uses the best of Apache SystemML and SparkR to
make data scientists and analysts get the most out of big data analytics.
• R4ML advantages:
• More ML algorithms
• Ability to write custom algorithms
• Wide table data set (images, etc.)
• Commonly used pre-processors
R4ML is build on the top of SystemML and SparkR and the basic core engine is the Spark.
Before we go deeper into what is R4ML . Lets talk a little bit about the two open source Apache project SystemML and SparkR
SparkR brings best of Spark and R together
Let me introduce why systemML is needed. Lets talk about linear algebra and why it is so important for the machine learning.
-SystemML makes writing code in the linear algebra more seamless and easy.
-Currently SystemML is an Apache top level project project.
-Automatic Optimization based on data and cluster characteristics to ensure both efficiency and scalability
-DML lang: declarative machine learning language
Latest state of the art, Spark
In memory distributed MR like framework
Various API support like Scala, Java, Python and R
System-ML
Library for Customized machine learning on Large Scale data
It has a compiler which will analyze, optimize, and create the spark based MR plan to be run on the cluster.
Focus on the simple, fortran and R like syntax
R
R has been the de-facto tools and standard for analyzing the stats and ML
How easy is to write the complex formulations
Left hand block: represent the R4ML
Top upper block represent SparkR
- have multiple workers
- can support Cran packages
Top bottom represent SystemML
-
The main engine is Spark
Distributed file system is used for storing data
Lets dive deeper into R4ML and see how it can help big data scientists in doing their job
A typical machine learning flow and how R4ML can fit seamlessly into it
Lets walk it through
Lets call out one special feature of custom machine learning algo provided by systemML base engine.
Support for csv and other standard formats.
APIs r4ml.read.csv same as read.csv.
-Explain
The input data usually requires some cleanup and pre-processing, this is to increase the predictive power of each of the features
Shorter Version
NA removeable and replace with mean or constant
Binning allows user to bucketized a col
Normalizing is centering and scaling
Recoding for converting string to integer for the ordinal values
One hot encoding when order is not imp i.e states of america
Longer Version
R4ML supports all the major pre-processing task
- NA removal: this options let one remove the missing data or user can substitute with the constant or with the mean (if it is numeric)
- One of the typical use case of binning is say we have the height of the people in feet and inches and if we only care about three top level category i.e short , medium and tall, user can use this option
- Most of the algo’s prediction power or the result become better if it is normalized i.e substract the mean and then divide it by stddev.
- Since most of the machine learning algorithm is the matrix based linear algebra,
- Many times the categorical columns have inherit order in it like height or shirt_size, it options provide user the ability to encode those string columns into the ordinal categorical numeric value
- But many times categorical columns have no inherit orders like people’s race or states they lives in, in that case, we would like to do one hot encoding for those.
This examples illustrates the typical pre-processing steps.
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Note that we are doing both recoding and one hot encoding for the name of species
Shorter Version
---------------------
Go over the buckets
Explain differentiator (wide col support)
Mention Step.LM , KM, Cox
Longer Version
---------------------
GLM
===
Similar to LM for the solving the linear system to calcualte y_est
However the output is related to the y_est via the link function and distribution
Allows one to model a variety of use cases
Supports various family and the link function
unlike the base package we don’t suffer from the number of feature limitations
PCA
===
Rotate the data so that axis align to the maximum variance
User can then chose the rotated data eigen-vectors along the axis which results in dim reduction
R4ML APIs are similar to prcomp pca from R
MLOGIT
======
Uses logit function to separate the dataset into C different classes
Same as Poisson GLM (log linear model)
SVM
===
Maximum margin classifier.
User can use regularization using slack variables.
Multi classes uses, one against rest strategy.
ALS
===
Approximate factorization of a low rank matrix using alternating least square method
Used in the recommendation system
Survival models
===============
Use for the examining a particular event (usually death) to occur and it’s hazard
Support KM
Proportional hazard model cox
Here we illustrates training and scoring using the built-in GLM algorithms.
Note that in the typically ML flow, we divide the input data set into training and test set and build the model on train dataset and verify it on the test dataset
We in the end look at the our prediction and compare it with the real measurement.
We also print out the typical statistics which measures the perf of the model like R-squared, dispersion.
-Airline dataset
The Airline data set consists of flight arrival and departure details for all commercial flights from 1987 to 2008. The approximately 120MM records (CSV format), occupy 120GB space
This dataset are real world examples and showcase the typical pb and use case
How examples was created
- Via the R package knitr by running the R4ML’s R code embedded with docs
Notes:
- you can copy paste the examples from this and it will work in your R4ML console
- for convenient , we have create following markers
- green: the code of interest
- yellow: the outputs
- purple: notes
- call outs: to highlight the key message
Message: it will evolve and back of big company like IBM