A Microservices Framework for Real-Time Model Scoring Using Structured Stream... - Databricks
Open-source technologies allow developers to build microservices frameworks for myriad real-time applications; one such application is real-time model scoring. In this session, we will showcase how to architect a microservices framework and, in particular, how to use it to build a low-latency, real-time model scoring system. At the core of the architecture lies Apache Spark’s Structured Streaming, which delivers low-latency predictions, coupled with Docker and Flask as additional open-source tools for model serving. In this session, you will walk away with:
* Knowledge of enterprise-grade model as a service
* Streaming architecture design principles enabling real-time machine learning
* Key concepts and building blocks for real-time model scoring
* Real-time and production use cases across industries, such as IIoT, predictive maintenance, fraud detection, and sepsis detection
Best Practices for Engineering Production-Ready Software with Apache Spark - Databricks
Notebooks are a great tool for Big Data. They have drastically changed the way scientists and engineers develop and share ideas. However, most world-class Spark products cannot be easily engineered, tested and deployed just by modifying or combining notebooks. Taking a prototype to production with high quality typically involves proper software engineering.
Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, ... - Databricks
This talk describes migrating a large random forest classifier from scikit-learn to Spark's MLlib. We cut training time from 2 days to 2 hours, reduced failed runs, and now track experiments better with MLflow. Kount provides certainty in digital interactions like online credit card transactions. One of our scores uses a random forest classifier with 250 trees and 100,000 nodes per tree. We used scikit-learn to train on 60 million samples that each contained over 150 features. The in-memory requirements exceeded 750 GB, training took 2 days, and the process was not robust to disruptions in our database or training execution.
To migrate the workflow to Spark, we built a 6-node cluster with HDFS, providing 1.35 TB of RAM and 484 cores. Using MLlib and parallelization, the training time for our random forests is now less than 2 hours. Training data stays in our production environment, which used to require a deploy cycle to move locally-developed code onto our training server. The new implementation uses Jupyter notebooks for remote development with server-side execution. MLflow tracks all input parameters, code, and the git revision number, while the performance metrics and the model itself are retained as experiment artifacts.
The new workflow is also robust to service disruption. Our training pipeline begins by pulling from a Vertica database. Originally, this single connection took over 8 hours to complete, with any problem causing a restart. Using sqoop and multiple connections, we now pull the data in 45 minutes. The old technique used volatile storage and required pulling the data anew for each experiment; now, we pull the data from Vertica one time and then reload it much faster from HDFS. While a significant undertaking, moving to the Spark ecosystem converted an ad hoc and hands-on training process into a fully repeatable pipeline that meets regulatory and business goals for traceability and speed.
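To make the new training job concrete, here is a minimal sketch, not Kount's actual code, of fitting a Spark MLlib random forest while tracking the run with MLflow; the HDFS path, column names, and the maxDepth setting are illustrative assumptions.

import mlflow
import mlflow.spark
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-migration-sketch").getOrCreate()

# Reload training data from HDFS after the one-time pull from the database.
df = spark.read.parquet("hdfs:///training/fraud_samples.parquet")  # assumed path
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

with mlflow.start_run():
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=250, maxDepth=20, seed=42)
    mlflow.log_params({"numTrees": 250, "maxDepth": 20})  # reproducible inputs
    model = rf.fit(assembled)
    mlflow.spark.log_model(model, "model")  # retain the model as a run artifact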
Speaker: Josh Johnston
SparkML: Easy ML Productization for Real-Time Bidding - Databricks
dataxu bids on ads in real-time on behalf of its customers at the rate of 3 million requests a second and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and runs over multi-terabyte datasets. In this presentation we will share the lessons learned from our transition towards a fully automated Spark-based machine learning system and how this has drastically reduced the time to get a research idea into production. We'll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models
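As a sketch of what unattended auto-tuning looks like in Spark ML, the snippet below uses CrossValidator over a small parameter grid; the column names, label, grid values, and train_df DataFrame are assumptions for illustration, not dataxu's actual pipeline.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assumed feature and label columns for a per-advertiser win-rate model.
assembler = VectorAssembler(inputCols=["bid_price", "ctr", "hour"], outputCol="features")
lr = LogisticRegression(labelCol="won", featuresCol="features")
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="won"),
                    numFolds=3)
# best_model = cv.fit(train_df).bestModel  # train_df: an assumed DataFrame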
Speakers: Maximo Gurmendez, Javier Buquet
Sanjeev Satheesh, Research Scientist, Baidu at The AI Conference 2017 - MLconf
Sanjeev Satheesh leads the Deep Speech team at Baidu’s Silicon Valley AI Lab. Baidu SVAIL is focused on developing hard AI technologies to impact hundreds of millions of people.
The Story of End to End Models in Deep Learning
The past few years have seen the explosive entrance of end-to-end deep learning models in computer vision, speech recognition, machine translation, text-to-speech, and other fields. In this talk, we look at this trend to identify what has worked well, and try to make some predictions for the future based on the next set of unsolved problems.
Looking into the Future: Using Google's Prediction API - Justin Grammens
We all would like to predict the future at some point in our lives. Well, thanks to Google, we can now be one step closer! This talk will give an overview of what the Google Prediction API is and how you can use it to analyze data sets, discuss its strengths and weaknesses, and run open data sets through the system, covering both regression and categorization models.
Common Problems in Hyperparameter Optimization - SigOpt
Originally given at MLConf NYC 2017.
All large machine learning pipelines have tunable parameters, commonly referred to as hyperparameters. Hyperparameter optimization is the process by which we find the values for these parameters that cause our system to perform the best. SigOpt provides a Bayesian optimization platform that is commonly used for hyperparameter optimization, and I’m going to share some of the common problems we’ve seen when integrating into machine learning pipelines.
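To make the problem concrete, here is a minimal sketch of the search loop that hyperparameter optimization automates, using scikit-learn's randomized search as a generic stand-in; SigOpt's own Bayesian optimization API is not shown, and the model and parameter ranges are arbitrary illustrative choices.

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),    # the tunable hyperparameters
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 6),
    },
    n_iter=20, cv=3, random_state=0)
search.fit(X, y)  # a Bayesian optimizer would choose these trials adaptively
print(search.best_params_, search.best_score_)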
Bootstrapping of PySpark Models for Factorial A/B Tests - Databricks
A/B testing, i.e., measuring the impact of proposed variants of, for example, e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than bare statistical significance: the probability that a change has a positive effect, how much revenue is at risk, and so on.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which, however, does not provide standard errors for the coefficients. We therefore use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve implemented parallelization of an already parallelized GLM computation to be able to scale this computation horizontally over a large cluster in Databricks and describe various tweaks and how they’ve improved the performance.
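A minimal sketch of the bootstrapping idea, assuming a Gaussian GLM and invented column names rather than the production implementation: each resample is refit with Spark's own parallelism, and the collected coefficients approximate their sampling distribution.

import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

assembler = VectorAssembler(inputCols=["variant_a", "variant_b"], outputCol="features")
glm = GeneralizedLinearRegression(labelCol="revenue", featuresCol="features",
                                  family="gaussian", link="identity")

def bootstrap_coefficients(df, n_resamples=100):
    """Refit the GLM on resampled data; each fit is parallelized by Spark."""
    coefs = []
    for i in range(n_resamples):
        resampled = df.sample(withReplacement=True, fraction=1.0, seed=i)
        coefs.append(glm.fit(assembler.transform(resampled)).coefficients.toArray())
    return np.array(coefs)

# coefs = bootstrap_coefficients(ab_test_df)  # ab_test_df: an assumed DataFrame
# np.percentile(coefs, [2.5, 97.5], axis=0) gives a 95% interval per coefficient.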
Detecting Financial Fraud at Scale with Machine Learning - Databricks
Detecting fraudulent patterns at scale is a challenge given the massive amounts of data to sift through, the complexity of constantly evolving techniques, and the very small number of actual examples of fraudulent behavior. In finance, added security concerns and the importance of explaining how fraudulent behavior was identified further increase the difficulty of the task. Legacy systems rely on rule-based detection that is difficult to implement and run at scale; the resulting code is complex and brittle, making it difficult to update to keep up with new threats.
In this talk, we will go over how to convert a rule-based financial fraud detection program to use machine learning on Spark as part of a scalable, modular solution. We will examine how to identify appropriate features and labels and how to create a feedback loop that will allow the model to evolve and improve over time. We will also look at how MLflow may be leveraged throughout this effort for experiment tracking and model deployment.
Specifically, we will discuss:
- How to create a fraud-detection data pipeline
- How to leverage a framework for building features from large datasets
- How to create modular code to re-use and maintain new machine learning models
- How to choose appropriate models and algorithms for a given fraud-detection problem
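As a sketch of the modular pipeline shape this implies, with invented feature names and an assumed labeled DataFrame (not the session's actual code): feature assembly, a tree-based classifier, and MLflow experiment tracking around the fit.

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

feature_cols = ["amount", "merchant_risk", "txn_per_hour"]  # invented features
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    GBTClassifier(labelCol="is_fraud", featuresCol="features", maxIter=50),
])

with mlflow.start_run(run_name="fraud-gbt"):
    mlflow.log_param("maxIter", 50)
    model = pipeline.fit(labeled_txns)      # labeled_txns: an assumed DataFrame
    mlflow.spark.log_model(model, "model")  # artifact for later deployment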
Augmenting Machine Learning with Databricks Labs AutoML Toolkit - Databricks
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing those models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML Toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this on financial loan risk data, with code snippets and notebooks that will be free to download.
Using MLOps to Bring ML to Production/The Promise of MLOps - Weaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train, and serve ML models, and with orchestrating between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications by establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction of MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
Automated Hyperparameter Tuning, Scaling and Tracking - Databricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
- Apache PySpark MLlib integration with MLflow for automatically tracking tuning
- Hyperopt integration with Apache Spark to distribute tuning, and with MLflow for automatic tracking
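A minimal sketch of the second integration, assuming a small scikit-learn dataset for illustration: Hyperopt proposes hyperparameters, SparkTrials fans the evaluations out over the cluster, and MLflow records the outcome.

import mlflow
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 random_state=0)
    # Hyperopt minimizes, so return the negated cross-validated accuracy.
    return -cross_val_score(clf, X, y, cv=3).mean()

space = {"n_estimators": hp.quniform("n_estimators", 50, 300, 25),
         "max_depth": hp.quniform("max_depth", 2, 12, 1)}

with mlflow.start_run():
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=32, trials=SparkTrials(parallelism=4))
    mlflow.log_params(best)  # record the winning configuration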
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and the Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two machine learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. He received his B.S. from UC Berkeley and his Master's from MIT.
MLflow: Infrastructure for a Complete Machine Learning Life Cycle - Databricks
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but these platforms are limited to each company’s internal infrastructure.
In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
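The tracking abstraction amounts to a few lines of code; here is a minimal sketch using the MLflow tracking API, with an arbitrary scikit-learn model standing in for any library of your choice.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    mlflow.log_param("alpha", alpha)                  # reproduce the run later
    mlflow.log_metric("train_r2", model.score(X, y))  # compare across runs
    mlflow.sklearn.log_model(model, "model")          # reusable model artifact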
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure - Fei Chen
ML Platform meetups are quarterly meetups where we discuss and share advanced technology for machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform - Databricks
In large enterprises, large solutions are sometimes required to tackle even the smallest tasks, and ML is no different. At Comcast we are building a comprehensive, configuration-based, continuously integrated and deployed platform for data pipeline transformations, model development, and deployment. This is accomplished using a range of tools and frameworks such as Databricks, MLflow, Apache Spark, and others. With a Databricks environment used by hundreds of researchers and petabytes of data, scale is critical to Comcast, so making it all work together in a frictionless experience is a high priority.
The platform consists of a number of components: an abstraction for data pipelines and transformations to allow our data scientists the freedom to combine the most appropriate algorithms from different frameworks, experiment tracking, project and model packaging using MLflow, and model serving via the Kubeflow environment on Kubernetes. The architecture, progress, and current state of the platform will be discussed, as well as the challenges we had to overcome to make this platform work at Comcast scale. As a machine learning practitioner, you will gain knowledge in: an example of data pipeline abstraction; ways to package and track your ML projects and experiments at scale; and how Comcast uses Kubeflow on Kubernetes to bring everything together.
Keynote: Artificial Intelligence Methods for Time Series Forecasting and Classification of Real-Time IoT Sensor Data Streams, Romeo Kienzler, Chief Data Scientist - IBM Watson IoT WW, IBM Academy of Technology
GPT-2: Language Models are Unsupervised Multitask Learners - Young Seok Kim
A review of the paper "Language Models are Unsupervised Multitask Learners" (GPT-2) by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in English, but the presentation is done in Korean)
Creating a new language to support open innovation - Mike Hucka
Presentation given on 19 August 2013 at a BioBriefings meeting of the BioMelbourne Network (http://www.biomelbourne.org/events/view/289) in Melbourne, Australia.
Automated Construction of Node Software Using Attributes in a Ubiquitous Sens... - JM code group
Sensors 2010, 10(9), 8663-8682; doi:10.3390/s100908663
Article
Automated Construction of Node Software Using Attributes in a Ubiquitous Sensor Network Environment
Woojin Lee, Juil Kim and JangMook Kang*
SCI-grade journal (computer and network fields)
http://www.mdpi.com/1424-8220/10/9/8663
Performance Comparison between PyTorch and MindSpore - ijdms
Deep learning is now well used in many fields. However, training neural networks involves large amounts of data, which has led to many deep learning frameworks that aim to serve practitioners with services that are more convenient to use and perform better. MindSpore and PyTorch are both deep learning frameworks: MindSpore is owned by HUAWEI, while PyTorch is owned by Facebook. Some people think that HUAWEI's MindSpore performs better than Facebook's PyTorch, which leaves deep learning practitioners confused about the choice between the two. In this paper, we perform analytical and experimental analysis to compare the training speed of MindSpore and PyTorch on a single GPU. To ensure that our survey is as comprehensive as possible, we carefully selected neural networks in 2 main domains, covering computer vision and natural language processing (NLP). The contribution of this work is twofold. First, we conduct detailed benchmarking experiments on MindSpore and PyTorch to analyze the reasons for their performance differences. Second, this work provides guidance for end users choosing between these two frameworks.
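A minimal sketch of the kind of single-GPU training-speed measurement the paper describes, shown for the PyTorch side with an arbitrary model, batch size, and iteration count; the MindSpore side would mirror it.

import time
import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 3, 224, 224, device=device)  # synthetic batch
y = torch.randint(0, 10, (64,), device=device)

def train_step():
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

for _ in range(5):               # warm-up steps, excluded from timing
    train_step()
if device == "cuda":
    torch.cuda.synchronize()     # GPU kernels run asynchronously
start = time.perf_counter()
for _ in range(50):
    train_step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 50 * 1000:.1f} ms per training step")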
Intro to Deep Learning with Keras - using TensorFlow backend - Amin Golnari
An overview of deep learning; installing Keras on Windows and how to use it; creating a sequential (multilayer) network and training it on the MNIST handwritten digits data; and visualization and optimization in Keras, with examples.
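A minimal sketch of the deck's running example, a sequential network trained on the MNIST digits, written here against the tf.keras API.

from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))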
Similar to Neel Sundaresan - Teaching a machine to code (20)
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments... - MLconf
Understanding Human Impact: Social and Equity Assessments for AI Technologies
Social and Equity Impact Assessments have broad applications, but they can be a useful tool to explore and mitigate machine learning fairness issues. They can be applied to product-specific questions as a way to generate insights and learnings about users, as well as about broader impacts on society, resulting from the deployment of new and emerging technologies.
In this presentation, my goal is to advocate for and highlight the need for community and external stakeholder engagement in order to develop a new knowledge base and understanding of the human and social consequences of algorithmic decision making, and to introduce principles, methods, and processes for these types of impact assessments.
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding - MLconf
The Brain’s Guide to Dealing with Context in Language Understanding
Like the visual cortex, the regions of the brain involved in understanding language represent information hierarchically. But whereas the visual cortex organizes things into a spatial hierarchy, the language regions encode information into a hierarchy of timescales. This organization is key to our uniquely human ability to integrate semantic information across narratives. More and more, deep learning-based approaches to natural language understanding embrace models that incorporate contextual information at varying timescales. This has not only led to state-of-the-art performance on many difficult natural language tasks, but also to breakthroughs in our understanding of brain activity.
In this talk, we will discuss the important connection between language understanding and context at different timescales. We will explore how different deep learning architectures capture timescales in language and how closely their encodings mimic the brain. Along the way, we will uncover some surprising discoveries about what depth does and doesn’t buy you in deep recurrent neural networks. And we’ll describe a new, more flexible way to think about these architectures and ease design space exploration. Finally, we’ll discuss some of the exciting applications made possible by these breakthroughs.
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re... - MLconf
Applying Computer Vision to Reduce Contamination in the Recycling Stream
With China’s recent refusal of most foreign recyclables, North American waste haulers are scrambling to figure out how to make on-shore recycling cost-effective in order to continue providing recycling services. Recyclables that were once being shipped to China for manual sorting are now primarily being redirected to landfills or incinerators. Without a solution, a nearly $5 billion annual recycling market could come to a halt.
Purity in the recycling stream is key to this effort as contaminants in the stream can increase the cost of operations, damage equipment and reduce the ability to create pure commodities suitable for creating recycled goods. This market disruption as a result of China’s new regulations, however, provides us the chance to re-examine and improve our current disposal & collection habits with modern monitoring & artificial intelligence technology.
Using images from our in-dumpster cameras, Compology has developed an ML-based process that helps identify, measure, and alert on contaminants in recycling containers before they are picked up, helping keep the recycling stream clean.
Our convolutional neural network flags potential instances of contamination inside a dumpster, enabling garbage haulers to know which containers have the wrong type of material inside. This allows them to provide targeted, timely education, and when appropriate, assess fines, to improve recycling compliance at the businesses and residences they serve, helping keep recycling services financially viable.
In this presentation, we will walk through our ML-based contamination measurement and scoring process by showing how Waste Management, a national waste hauler, has achieved a 57% contamination reduction across nearly 2,000 containers over six months. This progress shows significant strides towards financially viable recycling services.
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush - MLconf
Quantum Computing: a Treasure Hunt, not a Gold Rush
Quantum computers promise a significant step up in computational power over conventional computers, but also suffer a number of counterintuitive limitations, both in their computational model and in leading lab implementations. In this talk, we review how quantum computers compete with conventional computers and how conventional computers try to hold their ground. Then we outline what stands in the way of successful quantum ML applications.
Josh Wills - Data Labeling as Religious Experience - MLconf
Data Labeling as Religious Experience
One of the most common places to deploy a production machine learning system is as a replacement for a legacy rules-based system that is having a hard time keeping up with new edge cases and requirements. I'll walk through the process and tooling we used to design, train, and deploy a model to replace a set of static rules we had for handling invite spam at Slack, talk about what we learned, and discuss some problems to solve in order to make these migrations easier for everyone.
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai... - MLconf
Project GaitNet: Ushering in the ImageNet moment for human Gait kinematics
The emergence of the upright human bipedal gait can be traced back 4 to 2.8 million years, to the now extinct hominin Australopithecus afarensis. Fine-grained analysis of gait using the modern MEMS sensors found on all smartphones not only reveals a lot about a person’s orthopedic and neuromuscular health status, but also carries enough idiosyncratic clues to be harnessed as a passive biometric. While many siloed attempts to model bipedal gait sensor data have been made by the machine learning community, these were done with small datasets, often collected in restricted academic environs. In this talk, we will introduce the ImageNet moment for human gait analysis by presenting 'Project GaitNet', the largest planet-scale motion-sensor-based human bipedal gait dataset ever curated. We’ll also present the associated state-of-the-art results in classifying humans using novel deep neural architectures, and the related success stories we have enjoyed in transfer-learning into disparate domains of human kinematics analysis.
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea... - MLconf
Machine Learning Methods in Detecting Alzheimer’s Disease from Speech and Language
Alzheimer's disease affects millions of people worldwide, and it is important to predict the disease as early and as accurately as possible. In this talk, I will discuss the development of novel ML models that help classify healthy people versus those who develop Alzheimer's, using short samples of human speech. As input to the model, features of different modalities are extracted from speech audio samples and transcriptions: (1) syntactic measures, such as production rules extracted from syntactic parse trees; (2) lexical measures, such as features of lexical richness and complexity and lexical norms; and (3) acoustic measures, such as standard Mel-frequency cepstral coefficients. I will present an ML model that detects cognitive impairment by reaching agreement among modalities. The resulting model is able to achieve state-of-the-art performance in both a supervised and a semi-supervised manner, using manual transcripts of human speech. Additionally, I will discuss potential limitations of any fully-automated speech-based Alzheimer's disease detection model, focusing mostly on the analysis of the impact of a not-so-accurate automatic speech recognition (ASR) system on classification performance. To illustrate this, I will present experiments with controlled amounts of artificially generated ASR errors and explain how deletion errors affect Alzheimer's detection performance the most, due to their impact on the features of syntactic and lexical complexity.
Meghana Ravikumar - Optimized Image Classification on the Cheap - MLconf
Optimized Image Classification on the Cheap
In this talk, we anchor on building an image classifier trained on the Stanford Cars dataset to evaluate two approaches to transfer learning, fine-tuning and feature extraction, and the impact of hyperparameter optimization on these techniques. Once we define the most performant transfer learning technique for Stanford Cars, we will double the size of the dataset through image augmentation to boost the classifier’s performance. We will use Bayesian optimization to learn the hyperparameters associated with image transformations, using the downstream image classifier’s performance as the guide. In conjunction with model performance, we will also focus on the features of these augmented images and the downstream implications for our image classifier.
To both maximize model performance on a budget and explore the impact of optimization on these methods, we apply a particularly efficient implementation of Bayesian optimization to each of these architectures in this comparison. Our goal is to draw on a rigorous set of experimental results that can help us answer the question: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pre-trained models?
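For concreteness, here is a minimal sketch, not the talk's actual code, of the two transfer-learning variants being compared, using a torchvision ResNet; 196 is the number of classes in Stanford Cars.

import torch.nn as nn
import torchvision.models as models

def feature_extraction_model(num_classes=196):
    """Freeze the pretrained backbone; only the new head is trained."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # stays trainable
    return model

def fine_tuning_model(num_classes=196):
    """Replace the head but leave every layer trainable end to end."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model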
Noam Finkelstein - The Importance of Modeling Data Collection - MLconf
The Importance of Modeling Data Collection
Data sets used in machine learning are often collected in a systematically biased way - certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when the patient is feeling unwell than otherwise. Failing to account for observation bias can, of course, result in poor predictions on new data. By contrast, properly accounting for this bias allows us to make better use of the data we do have.
In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points that are known to be collected at random - that is, in an unbiased way - can vastly improve our ability to account for observation bias in the rest of the data set.
My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.
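As one simple illustration of such an adjustment, here is a generic inverse-probability-weighting sketch under an assumed, known collection model; this is a stand-in for the kinds of corrections discussed, not necessarily the speaker's method.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Assumed, known collection model: points with larger y are likelier to be
# observed (as when lab tests are ordered mainly when patients feel unwell).
p_observe = 1.0 / (1.0 + np.exp(-3.0 * y))
observed = rng.uniform(size=x.size) < p_observe

# Weight each observed point by the inverse of its observation probability,
# so under-collected regions of the input space regain their influence.
weights = 1.0 / p_observe[observed]
model = KernelRidge(kernel="rbf", gamma=0.5).fit(
    x[observed].reshape(-1, 1), y[observed], sample_weight=weights)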
The Uncanny Valley of ML
Every so often, the conundrum of the Uncanny Valley re-emerges as advanced technologies evolve from clearly experimental products to refined accepted technologies. We have seen its effects in robotics, computer graphics, and page load times. The debate of how to handle the new technology detracts from its benefits. When machine learning is added to human decision systems a similar effect can be measured in increased response time and decreased accuracy. These systems include radiology, judicial assignments, bus schedules, housing prices, power grids and a growing variety of applications. Unfortunately, the Uncanny Valley of ML can be hard to detect in these systems and can lead to degraded system performance when ML is introduced, at great expense. Here, we'll introduce key design principles for introducing ML into human decision systems to navigate around the Uncanny Valley and avoid its pitfalls.
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks - MLconf
Deep Learning Architectures for Semantic Relation Detection Tasks
Recognizing and distinguishing specific semantic relations from other types of semantic relations is an essential part of language understanding systems. Identifying expressions with similar and contrasting meanings is valuable for NLP systems which go beyond recognizing semantic relatedness and need to identify specific semantic relations. In this talk, I will first present novel techniques for creating the labelled datasets required for training deep learning models to classify semantic relations between phrases. I will then present various neural network architectures that integrate morphological features into integrated path-based and distributional relation detection algorithms, and demonstrate that this model outperforms state-of-the-art models in distinguishing semantic relations and is capable of efficiently handling multi-word expressions.
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D... - MLconf
Building an Incrementally Trained, Local Taste Aware, Global Deep Learned Recommender System Model
At Netflix, our main goal is to maximize our members’ enjoyment of the selected show by minimizing the amount of time it takes for them to find it. We try to achieve this goal by personalizing almost all the aspects of our product -- from what shows to recommend, to how to present these shows and construct their home-pages to what images to select per show, among many other things. Everything is recommendations for us and as an applied Machine Learning group, we spend our time building models for personalization that will eventually increase the joy and satisfaction of our members. In this talk we will primarily focus our attention on a) making a global deep learned recommender model that is regional tastes and popularity aware and b) adapting this model to changing taste preferences as well as dynamic catalog availability.
We will first go through some standard recommender system models that use Matrix Factorization and Topic Models, and then compare and contrast them with more powerful, higher-capacity deep learning based models such as sequence models that use recurrent neural networks. We will show what it entails to build a global model that is aware of regional taste preferences and catalog availability. We will show how models built on the simple Maximum Likelihood principle fail to do that. We will then describe one solution that we have employed to enable global deep learned models to focus their attention on capturing regional taste preferences and a changing catalog.
In the latter half of the talk, we will discuss how we do incremental learning of deep learned recommender system models. Why do we need to do that? Everything changes with time. Users’ tastes change with time. What’s available on Netflix and what’s popular also change over time. Therefore, updating or improving recommendation systems over time is necessary to bring more joy to users. In addition to how we apply incremental learning, we will discuss some of the challenges we face involving large-scale data preparation, infrastructure setup for incremental model training, and pipeline scheduling. Incremental training enables us to serve fresher models trained on fresher and larger amounts of data. This helps our recommender system adapt nicely and quickly to catalog and taste changes, and improves overall performance.
Vito Ostuni - The Voice: New Challenges in a Zero UI World - MLconf
Vito Ostuni - The Voice: New Challenges in a Zero UI World
The adoption of voice-enabled devices has seen an explosive growth in the last few years and music consumption is among the most popular use cases. Music personalization and recommendation plays a major role at Pandora in providing a daily delightful listening experience for millions of users. In turn, providing the same perfectly tailored listening experience through these novel voice interfaces brings new interesting challenges and exciting opportunities. In this talk we will describe how we apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic, and broad open-ended. We will describe how we use deep learning slot filling techniques and query classification to interpret the user intent and identify the main concepts in the query.
We will also present the differences and challenges regarding evaluation of voice powered recommendation systems. Since pure voice interfaces do not contain visual UI elements, relevance labels need to be inferred through implicit actions such as play time, query reformulations or other types of session level information. Another difference is that while the typical recommendation task corresponds to recommending a ranked list of items, a voice play request translates into a single item play action. Thus, some considerations about closed feedback loops need to be made. In summary, improving the quality of voice interactions in music services is a relatively new challenge and many exciting opportunities for breakthroughs still remain. There are many new aspects of recommendation system interfaces to address to bring a delightful and effortless experience for voice users. We will share a few open challenges to solve for the future.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen developers implement features on the front-end just by following the standard rules of a framework, thinking that this is enough to successfully launch the project, and then the project fails. How can this be prevented, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs, while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across more than 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neel Sundaresan - Teaching a machine to code
1. Teaching Machines to Code
Neel Sundaresan
Microsoft Corp.
MLConf 2019 NY
2. It's all about Data
• ~19M software developers in the world (source: TechRepublic, Ranger 2013)
  • 2/3 professionals, the rest hobbyists
• 29 million IT/ICT professionals
• Growing OSS data through GitHub, StackOverflow, etc.
• 10 years of GitHub:
  • 10M users
  • 26M projects
  • 400M commits
  • ~7M committers
  • ~1M active users and ~250K monthly new users
  • ~800K new projects per month
4. New Opportunities
• Take advantage of large-scale data, advances in AI algorithms, and the availability of distributed systems, cloud, and powerful compute (GPUs) to revolutionize developer productivity
5. Let's first start with Data…
• D. E. Knuth (1971) analyzed about 800 Fortran programs and found that:
  • 95% of loops increment the index by 1
  • 85% of loops had 5 statements or fewer
  • 53% of loops were singly nested
• More recent analysis (Allamanis et al.) of 25 MLOC showed the following stats:
  • 90% of loops have < 15 lines; 90% have no nesting; very simple control structures
  • 50 classes of loop idioms cover 50% of concrete loops
• Benefits:
  • Data-driven frameworks for code refactoring
  • Program optimization opportunities
  • Language design opportunities
6. Statistical model of code
• Lexical/code generative models (tokenizers)
  • E.g. sequence-based models (n-gram models from NLP), sequence-to-sequence character models in RNNs/LSTMs, sparse pointer-based neural models for Python
• Neural models are superior to n-gram models:
  • more expensive to train and execute, and need a lot more data
  • perform much better because one can model long-range declare-use scenarios
  • can catch patterns across contexts better than n-grams (sequences of code that are similar but with changed variables – the "sentiment" of the code)
• Word2Vec; for code more recently: Code2Vec, Code2Seq
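To ground the contrast above, here is a minimal sketch (illustrative only, not the system described in this talk) of an n-gram sequence model over code tokens:

from collections import Counter, defaultdict

def train_trigram(token_streams):
    """Count (t1, t2) -> next-token frequencies over tokenized files."""
    counts = defaultdict(Counter)
    for tokens in token_streams:
        padded = ["<s>", "<s>"] + tokens
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts

def predict(counts, prev2, prev1, k=5):
    """Top-k most likely next tokens given the last two tokens."""
    return [tok for tok, _ in counts[(prev2, prev1)].most_common(k)]

model = train_trigram([["x", "=", "list", "(", ")", ";"],
                       ["y", "=", "list", "(", ")", ";"]])
print(predict(model, "=", "list"))   # ['(']

A neural model replaces the count table with learned embeddings and a recurrent state, which is what lets it carry long-range declare-use information that a fixed-window n-gram cannot.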
7. Statistical model of code
• Representational models (abstract syntax trees)
  • These models represent code better than sequence models, but are more expensive
  • There is work on running LSTMs over such representations (for limited program synthesis applications)
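As a concrete illustration of the representational view, the sketch below uses Python's standard ast module on a toy snippet to expose the tree structure that such models consume instead of a flat token sequence:

import ast

snippet = "total = 0\nfor x in items:\n    total += x\n"
tree = ast.parse(snippet)
for node in ast.walk(tree):
    print(type(node).__name__)   # Module, Assign, For, AugAssign, ...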
8. Statistical model of code
• Latent models
  • Looking for hidden design patterns, programming idioms, standardized APIs, summaries, anomalies, etc.
  • Need unsupervised learning: challenging!
  • Previous research used tree substitution grammars to identify similar grammar productions (program tree fragments)
  • Graph-based representations used to identify common API usage
9. Application of code models
• Recommenders. Example: code completion in IDEs
  • Instead of using alphabetical or default orders, statistical learning can rank completions by likelihood
  • Early work by Bruch et al.
  • Bayesian graphical models using structure for predicting the next call, by Proksch, integrated into the Eclipse IDE
• How to evaluate the recommender systems?
  • Keystrokes saved? Overall productivity? Engagement models? Reduced bugs?
10. Inferring coding conventions
• Coding conventions for better maintenance
  • How to format code
  • Variable and class naming conventions (Allamanis et al.)
  • An alternative to linter rules…
11. Inferring bugs
• Identifying buggy code is like anomaly detection
  • Buggy code has unusual patterns whose probabilities differ markedly from normal code
• N-gram language-model-based complexity measures have shown results comparable to tools like FindBugs
  • Even syntax-error reporting closest to where the error occurs
• Since problematic code is rare (like anomalies, by definition), false positives are likely and high precision is hard to achieve
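A hedged sketch of this idea: score code by its per-token cross-entropy under a simple bigram model, so that statistically "surprising" code stands out. The corpus and snippets below are toy values, not data from the talk:

import math
from collections import Counter, defaultdict

def train_bigram(streams):
    counts = defaultdict(Counter)
    for toks in streams:
        for a, b in zip(["<s>"] + toks, toks):
            counts[a][b] += 1
    return counts

def cross_entropy(counts, toks, alpha=1.0, vocab=1000):
    """Average negative log2 probability with add-alpha smoothing."""
    total = 0.0
    for a, b in zip(["<s>"] + toks, toks):
        p = (counts[a][b] + alpha) / (sum(counts[a].values()) + alpha * vocab)
        total -= math.log2(p)
    return total / len(toks)

corpus = [["if", "(", "x", "==", "0", ")"]] * 50
model = train_bigram(corpus)
print(cross_entropy(model, ["if", "(", "x", "==", "0", ")"]))   # low score
print(cross_entropy(model, ["if", "(", "x", "=", "0", ")"]))    # higher: suspicious '='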
12. Program Synthesis
• Autogenerating programs from specifications
  • With a vast number of program examples and associated metadata, attempts to match the specs to the metadata and extract matching code
• SCIgen (automatic paper generator from MIT)
  • "SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence." They use it to detect bogus conferences!
• Airbnb Sketch2Code (design to code)
  • A UX web design mockup to HTML using deep learning (Pix2Code)
• DeepCoder (MSR / U of Cambridge)
  • Uses inductive program synthesis: given a set of inputs/outputs, searches a space of candidate programs and finds the one that matches
  • Works for DSLs (domain-specific languages) with limited constructs, not for languages like C++
• Automatically finding patches (MIT Prophet/Genesis)
• Bayou system from Rice U.
13. A Case Study: IntelliSense (Code Completion)
20. Data Source
Number of C# repos: 2000+
Number of repos we were able to build and parse to form our dataset: 700+
Number of .cs documents in the dataset: 200K+
21. What questions can we ask of this dataset?
• How is C# used?
  1. Which are the most frequently used classes?
  2. Are there patterns in how methods of one class are used?
• How to make recommendations? Which features are useful?
  1. Will the same model and parameters work for all classes?
  2. Do we have enough data?
  3. Would the previous usage of methods from other classes help with prediction?
• When making a prediction:
  1. Which pieces of information provided by code analyzers would be helpful?
  2. What is the reasonable segment of code to look at – the entire document/function or the most recent calls?
22. How often is each class used?
Top n classes    Coverage
100              28%
300              37.5%
1,088            50%
5,986            70%
13,203           80%
30,668           90%
[Figure: Number of Invocations per Class for the Top 50 Classes (total invocations in the dataset, 0 to 45,000). The most invoked classes include string, System.Windows.Forms.Control, System.Collections.Generic.List, System.Linq.Enumerable, System.Array, System.Text.StringBuilder, and other common framework classes; popularity drops off quickly with a long tail.]
23. How often do we face the cold start problem?
Invocation composition in different class groups:
Invocation        Top 100 classes   Top 100–200 classes   Top 200–300 classes
First             14.34%            25.42%                36.41%
Second            9.63%             15.31%                17.85%
Third             6.65%             9.28%                 9.88%
Fourth            5.30%             6.81%                 6.89%
Fifth or after    64.07%            43.18%                28.98%
24. Sequence Model
• A second-order Markov chain: the probability of the current invocation depends on the two previous invocations
• Very fast to train
• Performed quite well in both offline and online testing
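A minimal sketch of such a second-order Markov chain for method completion (illustrative only: the call sequences are invented, and the frequency fallback mirrors the cold-start handling discussed on the previous slide):

from collections import Counter, defaultdict

class SecondOrderMarkov:
    def __init__(self):
        self.table = defaultdict(Counter)
        self.fallback = Counter()            # frequency model for cold start

    def fit(self, call_sequences):
        for calls in call_sequences:
            self.fallback.update(calls)
            padded = ["<s>", "<s>"] + calls
            for a, b, c in zip(padded, padded[1:], padded[2:]):
                self.table[(a, b)][c] += 1

    def suggest(self, prev2, prev1, k=5):
        ranked = self.table[(prev2, prev1)] or self.fallback
        return [m for m, _ in ranked.most_common(k)]

model = SecondOrderMarkov()
model.fit([["Open", "Read", "Close"], ["Open", "Read", "Read", "Close"]])
print(model.suggest("Open", "Read"))   # e.g. ['Close', 'Read']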
25. Modeling Method Calls: Summary
• The Sequence model performs better in both offline and online testing
• 1. Frequency Model: 1 MB model size, 38% top-1 accuracy
• 3. Sequence Model: 5.3 MB model size, 58% top-1 accuracy
[Figure: Percentage of Invocations per string method (0–40%): string.Format, string.Equals, string.IsNullOrEmpty, string.Replace, string.Trim, string.Substring, string.IndexOf, string.Contains, string.IsNullOrWhiteSpace, string.EndsWith, string.ToUpper, string.Compare…, string.LastIndexOf, string.ToCharArray, string.PadLeft, string.IndexOfAny.]
26. Our IntelliSense system
• Languages supported: C#, Python, C++, Java, XAML, TypeScript
• Platforms: VS Code, Visual Studio
• Check out this blog:
27. A Deep learning approach
• The deep learning model consumes ASTs corresponding to code snippets as input for training
• AST tokens are mapped to numeric embedding vectors, which are learned via backpropagation using Word2Vec
• The method-call receiver token is substituted with its inferred type, when available
• Local variables are optionally normalized to <var:variable type>
Pipeline: (1) source code snippet → (2) extract training sequences with an AST parser → (3) vectorize → (4) embed, building the code embedding matrix.
Example: the tokens
…. "loss", "=", "tf", ".", "reduce_sum", "(", "tf", ".", "square", "(", "linear_model", "-", "y", ")", ")", "\n", "optimizer", "=", "tf", ".", "tensorflow.train", "."
become an id array such as
array([11, 9, 4, 12, 11, 9, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 22, 16, 11, 9, 4, 12, 11, 9, 8, 13, 14, 23, 16, 11, 9, 3, 12, 11, 9, 5, 12, 15, 24, 22, 13, 13, 14, 25, 16, 11, 9, 7, 9, 6], dtype=int32)
and then a float embedding tensor
array([[[-0.00179027, 0.01935565, -0.00102201, ..., -0.11528983, 0.02137219, 0.08332191], ..., [-0.04104977, 0.04417963, -0.01034168, ..., 0.04209893, 0.00140189, -0.10478071]]], dtype=float32)
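A toy sketch of steps 2–4 of this pipeline, assuming PyTorch and a tiny vocabulary (the token list echoes the slide's example; ids and dimensions are illustrative, not the production values):

import torch
import torch.nn as nn

tokens = ["loss", "=", "tf", ".", "reduce_sum", "("]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

ids = torch.tensor([[vocab[t] for t in tokens]])        # shape (1, seq_len)
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=150)
vectors = embed(ids)                                    # shape (1, seq_len, 150)
print(vectors.shape)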
28. Neural network architecture
The task of method completion is predicting a token m* conditioned on a sequence of input tokens c_t, t = 0, …, T, corresponding to the terminal nodes of the AST for a code snippet ending in a terminal ".".
x_t = L c_t, where L is the word embedding matrix of size d_x × |V|, d_x is the word embedding dimension, and V is the vocabulary.
h_t = f(x_t, h_{t-1}), where f is the stacked LSTM taking the previous hidden state and current input and producing the next hidden state.
P(m|C) = y_T = softmax(W h_T + b), where W is the output projection matrix and b is the bias.
m* = argmax_m P(m|C)
The LSTM has 2 layers with 100 hidden units each, with recurrent dropout and L2 regularization applied.
[Figure: code snippets → code embedding → stacked LSTM → linear layer → softmax prediction over y_0 … y_|V|.]
Ref: Svyatkovskiy, Fu, Sundaresan, Zhao
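A sketch of this architecture under stated assumptions (PyTorch, toy vocabulary size; not the production model):

import torch
import torch.nn as nn

class CompletionLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=150, hidden=100, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # x_t = L c_t
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            dropout=0.2, batch_first=True)      # h_t = f(x_t, h_{t-1})
        self.out = nn.Linear(hidden, vocab_size)                # W h_T + b

    def forward(self, ids):
        x = self.embed(ids)
        h, _ = self.lstm(x)
        logits = self.out(h[:, -1, :])          # last hidden state h_T
        return logits.softmax(dim=-1)           # P(m | C)

model = CompletionLSTM(vocab_size=1000)
probs = model(torch.randint(0, 1000, (1, 20)))
print(probs.argmax(dim=-1))                     # m* = argmax P(m | C)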
29. Hyperparameter tuning
Our model has several tunable hyperparameters, determined by random-search optimization: rerunning model training until convergence (via early stopping) and selecting the best-performing combination by validation accuracy.
Hyperparameter                          Best value
Base learning rate                      0.002
Learning rate decay per epoch           0.97
Num. recurrent neural network layers    2
Num. hidden units in LSTM, per layer    100
Type of RNN                             LSTM
Batch size                              256
Type of loss function                   Categorical cross-entropy
Num. lookback tokens                    200+
Num. timesteps for backpropagation      100
Embedded vector dimension               150
Stochastic optimization scheme          Adam
Weight regularization of all layers     10
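An illustrative random-search loop in the spirit of this slide; train_and_validate is a hypothetical stand-in for a real training run, and the search space values are examples, not the talk's:

import random

space = {
    "base_lr": [0.01, 0.002, 0.001],
    "lr_decay": [0.99, 0.97, 0.95],
    "hidden_units": [50, 100, 200],
    "batch_size": [128, 256, 512],
}

def train_and_validate(cfg):
    # Hypothetical stand-in: a real run would train the LSTM to convergence
    # with early stopping and return validation accuracy.
    return random.random()

best = (None, 0.0)
for _ in range(20):
    cfg = {k: random.choice(v) for k, v in space.items()}
    val_acc = train_and_validate(cfg)
    if val_acc > best[1]:
        best = (cfg, val_acc)
print(best)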
30. Offline model evaluation (top-5 accuracy)
Offline precision across all classes lifted by almost 20%.
Category                                     Number of classes
Improved with DL                             8,014
Approximately the same                       1,488
Declined with DL                             235
Completion available with DL but not MC      263
• Most completion classes improve with the deep learning approach
• 2.5% of classes decline – mostly belonging to Python web microframeworks like Flask and Tornado
• For some classes, type information for the receiver token is not available; DL is still able to provide completions in that case
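For clarity, the top-5 metric used here can be computed as below (toy predictions, not the talk's data): a completion counts as correct if the true method appears among the first five suggestions.

def top_k_accuracy(suggestions, truths, k=5):
    """Fraction of cases where the true token is in the top-k suggestions."""
    hits = sum(t in s[:k] for s, t in zip(suggestions, truths))
    return hits / len(truths)

preds = [["Read", "Close", "Seek"], ["Write", "Flush"]]
truth = ["Close", "Dispose"]
print(top_k_accuracy(preds, truth))   # 0.5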
32. Why use deep learning?
• The deep learning model achieves better accuracy
• It is suitable for more advanced completion scenarios (not just methods)
• There is an opportunity to predict out-of-vocabulary tokens
• Why not?
  • Poor interpretability
  • Model sizes are bigger, and serving performance is an issue
33. Deployment challenges
• Need to reduce model size on disk
  • Change the neural network architecture to reduce the number of trainable parameters
  • Reuse the input word embedding matrix as the output classification matrix, removing the large fully connected layer (model size reduction from 202 to 152 MB, with no accuracy loss)
• Model compression
  • Apply post-training neural network quantization to store weight matrices in 8-bit integer format (further model size reduction from 152 to 38 MB, 3% accuracy loss)
• Serving speed
  • Current serving speeds on the edge are 5x slower than a cheap model
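A hedged PyTorch sketch of the two reductions described above: weight tying (reusing the embedding matrix as the output classifier) and post-training dynamic quantization to int8. The MB figures on the slide come from the talk's model, not this toy one:

import torch
import torch.nn as nn

class TiedLSTM(nn.Module):
    def __init__(self, vocab=1000, dim=150, hidden=150):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, hidden, num_layers=2, batch_first=True)

    def forward(self, ids):
        h, _ = self.lstm(self.embed(ids))
        # Weight tying: reuse the embedding matrix as the output projection,
        # removing the separate fully connected classification layer.
        return h[:, -1, :] @ self.embed.weight.t()

model = TiedLSTM()
# Post-training dynamic quantization: store LSTM/Linear weights as int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
print(quantized(torch.randint(0, 1000, (1, 10))).shape)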
34. Can we teach machines to review code?
• What does the data tell us? Open-source Python pull requests:
[Figure: Distribution of review types in open-source peer reviews of Python pull requests (percentage of reviews, 0–50%): affirmative reviews, stylistic reviews, docstring reviews, Python-version related, code duplication, test related, error/exception related, string-manipulation related, regular-expression related, print/debug/logging related, import related.]
• ~43% of reviews are basic/stylistic reviews
• ~15% of reviews are related to comments
Gupta, Sundaresan (KDD 2018)
35. Architecture
Training phase: crawl Git repositories for historical code reviews → code and review preprocessing → (code, review) pairs → training-data generation (relevant and non-relevant pairs) → train a multi-encoder deep learning model.
Testing phase: new pull request → review candidate selection from a repository of common reviews (built by review clustering) → vectorized (code, candidate review) pairs scored by the multi-encoder model → emit the review with maximum model confidence.
The multi-encoder model runs separate LSTMs over the code, the candidate review, and the code context, and feeds them into DNNs that output a relevance score.
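A loose sketch of the multi-encoder scorer (assumptions: PyTorch, toy vocabulary and dimensions; the paper's exact architecture may differ):

import torch
import torch.nn as nn

class MultiEncoderRelevance(nn.Module):
    def __init__(self, vocab=1000, dim=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.enc_code = nn.LSTM(dim, hidden, batch_first=True)
        self.enc_review = nn.LSTM(dim, hidden, batch_first=True)
        self.enc_context = nn.LSTM(dim, hidden, batch_first=True)
        self.scorer = nn.Sequential(            # DNN over the three encodings
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # relevance score in [0, 1]

    def encode(self, enc, ids):
        _, (h, _) = enc(self.embed(ids))
        return h[-1]                             # final hidden state

    def forward(self, code, review, context):
        feats = torch.cat([self.encode(self.enc_code, code),
                           self.encode(self.enc_review, review),
                           self.encode(self.enc_context, context)], dim=-1)
        return self.scorer(feats)

m = MultiEncoderRelevance()
score = m(torch.randint(0, 1000, (1, 30)),
          torch.randint(0, 1000, (1, 12)),
          torch.randint(0, 1000, (1, 30)))
print(score)   # model confidence that the candidate review fits the code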
36. Opportunities
• OSS gives us lots and lots of data about code and coders
• The cloud gives us the opportunity to process lots and lots of data
• Recent and rapid advances in ML and AI
• Take advantage of newer advances (Transformer networks / GPT-x)
• But… challenges remain
  • While computer languages are synthetic, unlike natural languages, the systems are programmed by humans
  • Scale, sparsity, speed
• We have barely scratched the surface… a lot more to come…
37. Thank you!
• We have a number of initiatives in the area of applying AI at scale to software engineering
• We are hiring! Email neels@Microsoft.com
In order to create an IntelliSense that suggests the right method when you need it, we need lots of examples of realistic usage of the various classes in the .NET framework and other common libraries.
So we crawled all the public C# repos on GitHub with more than 100 stars.
There were 2300 of those. We could automatically restore and build one third of these.
This gave us 200,000 .cs files.
--------------
When there are multiple solutions in one repo, we only parse the first one to avoid duplication.
The ones that could not be parsed either did not contain a .sln file, or we could not load/open the first .sln file within 60 seconds, or we could not get a compilation of the code within 60 seconds.
Each solution was given 2 minutes to restore its NuGet packages.
One of the issues that JoC raised is that many popular repos on GitHub are libraries, and we suspected that the coding patterns employed could be different from normal application solutions. JoC pointed us to a repo from MSIT.
What is in the data? Different approaches we take. Talk about the data – rich information in the data. Jumping into sequence data too fast? Make a slide for questions? Are certain calls different from others?
Data driven approach
Mention collaboration
Here before we move on to the modeling part, let’s take a detour and think about what this dataset enables us to answer, and look at some data that justify our approach.
Generally, we want to understand which classes are used most often so we can focus our effort.
Are there patterns in how methods are used that we can take advantage of?
In relation to making useful recommendations,
we would want to know what are the most informative features,
and also how local should our context information be - the entire document, the current function or the last few calls?
Once we develop a model, we’d like to know if one model works for all, whether we have sufficient training data.
Here’s one question that’s readily answerable. This graph shows the number of invocations for the top 50 classes in our dataset.
We can see that the most popular class is string, followed by WinForms Control and List.
We can see that the popularity drops off very quickly, and it has a very long tail.
The top 300 classes cover nearly 40%.
(For the precision results you’ll see later, everything is reported for the top 300 classes.)
As is the case with all recommender systems, we also face the cold start problem, meaning that there’s no contextual information to base our recommendation on.
This happens when the current invocation is the first time this class is called in the current document, as we have little idea on what the developer is trying to write.
Let’s focus on the light blue part of each bar. This is the portion of invocations that are first of its class in the document.
We see that for the top 100 classes, only about 15% of the time would we need to make a recommendation with no context. This gets more severe as the class becomes more rarely used.
We have experimented with three types of models. The Frequency model is a simple popularity ranking which we use for the cold start scenarios.
We then implemented variants of the Clustering model because it is a popular approach for API recommenders in the literature.
The precision of the Clustering model was modest, and so we implemented the Sequence model, which we thought was a better model of the coding process. It turned out to have the highest precision and is the model we use in production today.