This document summarizes a talk on scaling machine learning algorithms to big data settings using a divide-and-conquer approach. It discusses three converging trends of big data, distributed computing, and machine learning. The goal is to extend machine learning to big data, but traditional ML algorithms do not scale well. The proposed approach divides data into subsets, applies existing ML algorithms to each subset in parallel, and then combines the results. Matrix factorization is provided as an example application, where the Divide-Factor-Combine framework allows preserving theoretical guarantees while enabling scalability.
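The divide-factor-combine idea can be sketched on a single machine. The following is a minimal numpy illustration, not the talk's actual framework: it splits a low-rank matrix into column blocks, factors each block independently (the step that would run in parallel), and combines the results by projecting every block onto the column space recovered from the first block. The low-rank test matrix and the projection-based combine step are simplifying assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build an exactly low-rank matrix: m x n with rank r.
m, n, r = 60, 40, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Divide: split the columns into t subsets.
t = 4
blocks = np.array_split(np.arange(n), t)

# Factor: truncated SVD of each column block (these could run in parallel).
def truncated_svd(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

factored = [truncated_svd(M[:, idx], r) for idx in blocks]

# Combine: project every block onto the column space spanned by the first
# block's left factors, then stitch the blocks back together in order.
U0 = factored[0][0]
M_hat = np.hstack([U0 @ (U0.T @ M[:, idx]) for idx in blocks])

err = np.linalg.norm(M - M_hat) / np.linalg.norm(M)
```

Because the test matrix is exactly rank r, the first block's column space already spans the whole matrix and the combined reconstruction is essentially exact; with noisy data the combine step is where the framework's theoretical guarantees do the work.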
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial - Xavier Amatriain
There is more to recommendation algorithms than rating prediction, and there is more to recommender systems than algorithms. In this tutorial, given at the 2012 ACM Recommender Systems Conference in Dublin, I review topics such as different interaction and user feedback mechanisms, offline experimentation and A/B testing, and software architectures for recommender systems.
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM - lucenerevolution
In this session we will show how to build a text classifier using Apache Lucene/Solr together with the libSVM library. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include the naive Bayes model, logistic regression, neural networks, support vector machines (SVM), etc. We use Lucene/Solr to construct the feature vector. Then we use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then use the Hadoop MapReduce framework to reconcile the results of our classifiers. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.
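The one-vs-all scheme itself is easy to demonstrate without the Lucene/libSVM stack. The toy sketch below substitutes a standard-library perceptron for the SVM and a bag-of-words counter for the Solr feature vector; the job-offer examples and category names are invented for illustration.

```python
from collections import Counter

def featurize(text):
    # Toy bag-of-words feature vector (Lucene/Solr builds real ones).
    return Counter(text.lower().split())

def train_perceptron(examples, epochs=20):
    # examples: list of (features, +1/-1); returns a sparse weight dict.
    w = {}
    for _ in range(epochs):
        for feats, y in examples:
            score = sum(w.get(f, 0.0) * v for f, v in feats.items())
            if y * score <= 0:            # misclassified -> update weights
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w

def train_one_vs_all(labeled):
    # One binary classifier per class, as in the talk's one-vs-all SVMs.
    classes = {y for _, y in labeled}
    return {c: train_perceptron([(x, 1 if y == c else -1) for x, y in labeled])
            for c in classes}

def predict(models, feats):
    # Reconcile: pick the class whose classifier scores highest.
    return max(models, key=lambda c: sum(models[c].get(f, 0.0) * v
                                         for f, v in feats.items()))

docs = [("python developer wanted", "engineering"),
        ("senior java engineer", "engineering"),
        ("sales account manager", "sales"),
        ("regional sales representative", "sales")]
labeled = [(featurize(t), y) for t, y in docs]
models = train_one_vs_all(labeled)
```

In the talk's setting the per-class training jobs are what MapReduce parallelizes; the `predict` step plays the role of the reconciliation phase.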
Recommender Systems from A to Z – Model Evaluation - Crossing Minds
The third meetup will cover evaluating different models for our recommender system. We will review strategies for checking whether a model is underfitting or overfitting. After that, we will present and analyze the losses typically used to train recommendation models. We will compare regression, classification, and rank-based losses, and discuss when each is convenient to use. Finally, we will cover the metrics typically used to evaluate the performance of recommender systems and how to test that models are giving good results in production.
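The regression-versus-ranking contrast can be made concrete in a few lines of plain Python: a prediction vector can look poor under a regression loss yet perfect under a rank loss, and vice versa. The numbers below are made up for illustration.

```python
def mse(preds, targets):
    # Regression loss: penalizes the absolute rating values.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def pairwise_rank_error(preds, targets):
    # Rank loss: fraction of item pairs ordered differently than the targets.
    pairs = [(i, j) for i in range(len(preds)) for j in range(len(preds))
             if targets[i] > targets[j]]
    wrong = sum(1 for i, j in pairs if preds[i] <= preds[j])
    return wrong / len(pairs)

targets = [5.0, 4.0, 2.0, 1.0]          # true ratings
shifted = [3.0, 2.0, 0.0, -1.0]         # biased but correctly ordered
scrambled = [4.0, 5.0, 1.0, 2.0]        # accurate values, two pairs swapped

# The shifted model looks bad to MSE but perfect to the rank loss;
# the scrambled model is the other way around.
```

This is why top-N recommendation tasks usually favor rank-based losses and metrics over plain rating-error ones.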
Artificial Intelligence Course: Linear models - ananth
In this presentation we introduce the linear models: regression and classification, illustrated with several examples. Concepts such as underfitting (bias) and overfitting (variance) are presented. Linear models can be used as stand-alone classifiers for simple cases, and they are essential building blocks within larger deep learning networks.
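A classic way to see underfitting and overfitting is to fit polynomials of increasing degree to noisy data and watch the training error. The sketch below uses numpy; the data, noise level, and degrees are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    return float(np.mean(resid ** 2))

# Degree 1 underfits (high bias); degree 9 starts chasing the noise
# (high variance) and drives the *training* error toward zero.
errs = {d: train_error(d) for d in (1, 3, 9)}
```

Training error alone always improves with capacity; the overfitting story only becomes visible when the same models are scored on held-out data.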
This is the first lecture on Applied Machine Learning. The course focuses on emerging and modern aspects of the subject such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Hidden Markov Models (HMM). It deals with several application areas such as Natural Language Processing and Image Understanding. This presentation provides the landscape.
Summary: Graphs are structures commonly used in computer science to model interactions among entities. I will start by introducing the basic formulations of graph-based machine learning, which has been a popular research topic over the past decade and has led to a powerful set of techniques. In particular, I will show examples of how it acts as a generic data mining and predictive analytics tool. In the second part, I will discuss applications of such learning techniques in media analytics: (1) image analysis, where visually coherent objects are isolated from images; and (2) social analysis of videos, where actors' social properties are predicted from videos. Materials in this part are based on our recent publications in highly selective venues (papers at https://sites.google.com/site/leiding2010/ ).
Bio: Lei Ding is a researcher who makes sense of large amounts of data in all media types. He currently works at Intent Media as a scientist, focusing on data analytics and applied machine learning in online advertising. Previously, he worked at several research institutions, including Columbia University, UIUC and IBM Research, on digital and social media analysis and understanding. He received a Ph.D. in Computer Science and Engineering from The Ohio State University, where he was a Distinguished University Fellow.
Deep Learning For Practitioners, lecture 2: Selecting the right applications... - ananth
In this presentation we articulate when deep learning techniques yield the best results from a practitioner's viewpoint. Should we apply deep learning techniques to every machine learning problem? What characteristics make an application suitable for deep learning? Does more data automatically imply better results regardless of the algorithm or model? Does "automated feature learning" obviate the need for data preprocessing and feature design?
Understanding how high-powered ML models arrive at their predictions is an important aspect of machine learning, and SHAP is a powerful tool that enables practitioners to understand how different features combine to help a model arrive at a prediction.
This slide deck is from a presentation given at PyData Global on the theoretical foundations of SHAP as well as how to use its library. The presentation can be found here: https://pydata.org/global2021/schedule/presentation/3/behind-the-black-box-how-to-understand-any-ml-model-using-shap/
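SHAP approximates Shapley values efficiently; for a model with only a couple of features they can be computed exactly by brute force, which makes the definition concrete. The toy "model" below, an additive function of two features plus an interaction term, is hypothetical and uses only the standard library.

```python
from itertools import permutations

def shapley_values(features, value_fn):
    # Exact Shapley values: average each feature's marginal contribution
    # over all orderings in which features can be revealed.
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(present)
            present.add(f)
            phi[f] += value_fn(present) - before
    return {f: v / len(orderings) for f, v in phi.items()}

# Toy "model": prediction as a function of which features are known.
base, effects = 10.0, {"age": 3.0, "income": 5.0}

def value_fn(present):
    # Additive effects, plus an interaction when both features are present.
    v = base + sum(effects[f] for f in present)
    if {"age", "income"} <= present:
        v += 2.0
    return v

phi = shapley_values(["age", "income"], value_fn)
# The contributions sum to value(all) - value(none): 3 + 5 + 2 = 10,
# with the interaction bonus split evenly between the two features.
```

The factorial blow-up in orderings is exactly why SHAP relies on model-specific shortcuts (e.g. TreeSHAP) and sampling approximations for real models.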
Robust and declarative machine learning pipelines for predictive buying at Ba...Gianmario Spacagna
Proof of concept of how to use Scala, Spark and the recent library Sparkz to build production-quality machine learning pipelines for predicting buyers of financial products.
The pipelines are implemented through custom declarative APIs that give us greater control, transparency and testability over the whole process.
The example followed the validation and evaluation principles defined in The Data Science Manifesto, available in beta at www.datasciencemanifesto.org
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 6: Time Series and Deepnets. By Charles Parker (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
VSSML16 LR1. Summary Day 1
Valencian Summer School in Machine Learning 2016
Day 1
Summary Day 1
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Spotify uses a range of machine learning models to power its music recommendation features, including the Discover page and Radio. Due to the iterative nature of training, these models suffer from the I/O overhead of Hadoop and are a natural fit for the Spark programming paradigm. In this talk I will present both the right way and the wrong way to implement collaborative filtering models with Spark. Additionally, I will take a deep dive into how matrix factorization is implemented in the MLlib library.
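MLlib distributes the alternating least squares (ALS) updates across a cluster; the single-machine numpy toy below shows the alternating ridge-regression updates at the heart of the algorithm, under the simplifying assumption of a fully observed ratings matrix (real collaborative filtering sums only over observed entries).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 30, 20, 4, 0.1

# Synthetic ratings generated from hidden rank-k factors.
R = rng.standard_normal((n_users, k)) @ rng.standard_normal((k, n_items))

U = rng.standard_normal((n_users, k))   # user factors
V = rng.standard_normal((n_items, k))   # item factors
eye = lam * np.eye(k)

def rmse():
    return float(np.sqrt(np.mean((R - U @ V.T) ** 2)))

before = rmse()
for _ in range(10):
    # Fix V and solve a ridge regression for every user's factors,
    # then do the symmetric update for the items.
    U = np.linalg.solve(V.T @ V + eye, V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + eye, U.T @ R).T
after = rmse()
```

Each half-update is embarrassingly parallel across users (or items), which is precisely what MLlib exploits; getting the data partitioning wrong is the "wrong way" the talk warns about.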
Introduction to Machine Learning: Machine learning (ML) is a branch of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.
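As a minimal example of "historical data in, prediction out", here is a one-variable least-squares fit written with only the standard library; the spend/sales numbers are invented for illustration.

```python
# Historical data: advertising spend -> observed sales (made-up numbers).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.1, 9.9]

# Ordinary least squares for y = a * x + b, computed by hand.
n = len(spend)
mx = sum(spend) / n
my = sum(sales) / n
a = sum((x - mx) * (y - my) for x, y in zip(spend, sales)) / \
    sum((x - mx) ** 2 for x in spend)
b = my - a * mx

# Predict the output for a new, unseen spend level.
predicted = a * 6.0 + b
```

The model "learns" the slope and intercept from past observations rather than having them hard-coded, which is the whole point of the definition above.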
A unique sorting algorithm with linear time & space complexity - eSAT Journals
Abstract: Sorting a list means selecting the particular permutation of its members in which they appear in increasing or decreasing order. A sorted list is a prerequisite for several optimized operations, such as searching for an element, inserting or removing an element, and merging two sorted lists in a database. As the volume of information around us grows day by day, and these data must be managed in real-life situations, efficient and cost-effective sorting algorithms are required. There are many fundamental and problem-oriented sorting algorithms, yet sorting still attracts a great deal of research, perhaps due to the difficulty of solving it efficiently despite its simple and familiar statement. Algorithms that do the same work using different mechanisms differ in the time and space they require, so an algorithm is chosen according to one's needs with respect to time and space complexity. Nowadays memory is comparatively cheap, so time complexity is the major concern. The presented approach sorts a list in linear time and space using the divide-and-conquer rule, partitioning the problem into n (input size) subproblems that are solved recursively. The time and space required by the algorithm are optimized by reducing the height of the recursion tree, and the reduced height is very small compared with the problem size. The asymptotic efficiency of this algorithm is therefore very high with respect to both time and space. Keywords: sorting, searching, permutation, divide and conquer algorithm, asymptotic efficiency, space complexity, time complexity, recursion.
Divide and Conquer Algorithms - D&C is a distinct algorithm design technique in computer science, wherein a problem is solved by recursively invoking the algorithm on smaller instances of the same problem. Binary search, merge sort and Euclid's algorithm can all be formulated as divide and conquer algorithms; Strassen's algorithm and the nearest-neighbor algorithm are two further examples.
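Merge sort makes the pattern concrete: divide the list in half, recurse on each half, and combine the two sorted halves.

```python
def merge_sort(xs):
    # Divide: split the list in half until single elements remain.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # Combine: merge the two sorted halves into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

The recursion tree has depth log n and does O(n) merging work per level, giving the familiar O(n log n) total.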
MLSEV. Logistic Regression, Deepnets, and Time Series - BigML, Inc
Supervised Learning (Part II): Logistic Regression, Deepnets, and Time Series, by BigML.
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
BOOSTING ADVERSARIAL ATTACKS WITH MOMENTUM - Tianyu Pang and Chao Du, THU - D... - GeekPwn Keen
Youtube: https://www.youtube.com/watch?v=Pu2WQnU0GzA
Deep neural networks are vulnerable to adversarial examples, which raises security concerns about these algorithms due to the potentially severe consequences. Adversarial attacks serve as an important surrogate for evaluating the robustness of deep learning models before they are deployed. However, most existing adversarial attacks can only fool a black-box model with a low success rate. To address this issue, we propose a broad class of momentum-based iterative algorithms to boost adversarial attacks. By integrating a momentum term into the iterative attack process, our methods can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples. To further improve the success rates of black-box attacks, we apply momentum iterative algorithms to an ensemble of models, and show that adversarially trained models with a strong defense ability are also vulnerable to our black-box attacks. We hope that the proposed methods will serve as a benchmark for evaluating the robustness of various deep models and defense methods. With this method, we won first place in both the NIPS 2017 Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions.
Tianyu Pang is a first-year Ph.D. student in the TSAIL Group in the Department of Computer Science and Technology, Tsinghua University, advised by Prof. Jun Zhu. His research interests include machine learning, deep learning and their applications in computer vision, especially the robustness of deep learning.
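The momentum update at the core of the method can be sketched independently of any network. In the numpy toy below, a simple quadratic loss with an analytic gradient stands in for the model, and `eps`, `steps` and `mu` are illustrative values, not the paper's settings; the loop follows the MI-FGSM recipe of accumulating an L1-normalized gradient into a momentum term, stepping along its sign, and clipping back into the epsilon-ball.

```python
import numpy as np

def momentum_attack(x0, grad_fn, eps=0.3, steps=10, mu=1.0):
    # MI-FGSM-style loop: accumulate a normalized-gradient momentum term,
    # step along its sign, and project back into the eps-ball around x0.
    alpha = eps / steps
    x, g = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        grad = grad_fn(x)
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # L1-normalized
        x = np.clip(x + alpha * np.sign(g), x0 - eps, x0 + eps)
    return x

# Toy stand-in for a model loss: L(x) = ||x - c||^2, gradient 2(x - c).
# An attack performs gradient *ascent* on the loss.
c = np.array([1.0, -1.0, 0.5])
loss = lambda x: float(np.sum((x - c) ** 2))
grad_fn = lambda x: 2 * (x - c)

x0 = np.zeros(3)
x_adv = momentum_attack(x0, grad_fn, eps=0.3)
```

The L1 normalization is what keeps the momentum accumulator comparable across steps; without it, early large gradients would dominate the update direction.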
Correlation, causation and incrementality: recommendation problems at Netflix ... - Roelof van Zwol
Within Netflix, personalization is a key differentiator, helping members quickly discover new content that matches their taste. Done well, it creates an immersive user experience; however, when a recommendation is out of tune, it is immediately noticed by our members. During this presentation I will cover some of the personalization and recommendation tasks that jointly define the Netflix user experience, which entertains more than 130M members worldwide. In particular, I will focus on several algorithmic challenges related to the launch of new Netflix originals in the service, and go over concepts such as causality, incrementality and explore-exploit strategies.
The research presented in this talk represents the collaborative efforts of a team of research scientists and engineers at Netflix on our journey to create best-in-class user experiences.
Context-aware recommender systems (CARS) help improve the effectiveness of recommendations by adapting to users' preferences in different contextual situations. One approach to CARS that has been shown to be particularly effective is Context-Aware Matrix Factorization (CAMF). CAMF incorporates contextual dependencies into the standard matrix factorization (MF) process, where users and items are represented as collections of weights over various latent factors. In this paper, we introduce another CARS approach based on an extension of matrix factorization, namely, the Sparse Linear Method (SLIM). We develop a family of deviation-based contextual SLIM (CSLIM) recommendation algorithms by learning rating deviations in different contextual conditions. Our CSLIM approach is better at explaining the underlying reasons behind contextual recommendations, and our experimental evaluations over five context-aware data sets demonstrate that these CSLIM algorithms outperform the state-of-the-art CARS algorithms in the top-N recommendation task. We also discuss the criteria for selecting the appropriate CSLIM algorithm in advance based on the underlying characteristics of the data.
In this Spark session Ravi Saraogi talks about why estimating default risk in fund structures can be a challenging task. He presents on how this process has evolved over the years and the current methodologies for assessing such risks.
By popular demand, here is a case study of my first Kaggle competition from about a year ago. Hope you find it useful. Thank you again to my fantastic team.
The Comprehensive Product Platform Planning (CP3) framework presents a flexible mathematical model of the platform planning process, which allows (i) the formation of sub-families of products, and (ii) the simultaneous identification and quantification of platform/scaling design variables. The CP3 model is founded on a generalized commonality matrix that represents the product platform plan, and yields a mixed binary-integer nonlinear programming problem. In this paper, we develop a methodology to reduce the high-dimensional binary integer problem to a more tractable integer problem, where the commonality matrix is represented by a set of integer variables. Subsequently, we determine the feasible set of values for the integer variables in the case of families with 3-7 kinds of products. The cardinality of the feasible set is found to be orders of magnitude smaller than the total number of unique combinations of the commonality variables. In addition, we also present the development of a generalized approach to Mixed-Discrete Non-Linear Optimization (MDNLO) that can be implemented through standard non-gradient-based optimization algorithms. This MDNLO technique is expected to provide a robust and computationally inexpensive optimization framework for the reduced CP3 model. The generalized approach to MDNLO uses continuous optimization as the primary search strategy; however, it evaluates the system model only at the feasible locations in the discrete variable space.
Should all A-rated banks have the same default risk as Lehman? - Zhongmin Luo
1. Financial institutions need to construct proxy CDS rates for counterparties lacking liquid CDS quotes; these are required for CVA pricing, CVA risk-charge calculation, etc.;
2. Existing CDS proxy methods do not meet regulatory requirements and are vulnerable to arbitrage;
3. After investigating the 8 most popular machine learning algorithms, we show that machine learning techniques can be used to construct reliable CDS proxies that meet regulatory requirements while remaining free of the above problems;
4. Feature variable selection can be critical to the performance of CDS-proxy construction methods;
5. The effects of feature variable correlations on classification performance have to be investigated in the case of financial data.
VOLT: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Applications to Spatiotemporally Dependent Data
Applying Linear Optimization Using GLPK - Jeremy Chen
A brief introduction to linear optimization with a focus on applying it with the high-quality open-source solver GLPK.
Originally prepared for an intra-department sharing session.
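GLPK handles LPs of any size; for a toy two-variable LP you can sanity-check the answer with nothing but the standard library, using the fact that a bounded LP attains its optimum at a vertex of the feasible region, i.e. at the intersection of two constraint boundaries. The LP below is a made-up example, not one from the slides.

```python
from itertools import combinations

# Toy LP: maximize 3x + 4y subject to
#   x + 2y <= 14,   3x - y >= 0,   x - y <= 2,   x >= 0,   y >= 0.
# Every constraint written in "a*x + b*y <= c" form:
cons = [(1, 2, 14), (-3, 1, 0), (1, -1, 2), (-1, 0, 0), (0, -1, 0)]
objective = lambda x, y: 3 * x + 4 * y

def feasible(x, y, tol=1e-9):
    return all(a * x + b * y <= c + tol for a, b, c in cons)

# Enumerate all intersections of two constraint boundaries (Cramer's rule)
# and keep the feasible point with the best objective value.
best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue                      # parallel boundaries, no vertex
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    if feasible(x, y) and (best is None or objective(x, y) > objective(*best)):
        best = (x, y)
```

This brute-force vertex enumeration is exponential in the number of constraints, which is exactly why one reaches for a real solver like GLPK beyond toy sizes.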
Why Deep Learning Works: Self Regularization in Deep Neural Networks - Charles Martin
Talk (to be given) June 8, 2018 at UC Berkeley / NERSC
In Collaboration with Michael Mahoney, UC Berkeley
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly contradictory aspects of deep neural networks (DNNs). We apply RMT to several well-known pre-trained models: LeNet5, AlexNet and Inception V3, as well as 2 small toy models. We show that the DNN training process itself implicitly implements a form of self-regularization associated with the entropy collapse / information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization, whereas large, modern deep networks display a new kind of heavy-tailed self-regularization. We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training. Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy-tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization-gap phenomenon unique to DNNs. We argue that this heavy-tailed self-regularization has practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN energy landscape / optimization problem.
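The kind of spectral diagnostic described can be sketched with numpy. The snippet below computes the empirical eigenvalue spectrum of the correlation matrix X = WᵀW/n for a random (untrained) weight matrix and compares it with the Marchenko-Pastur bulk edge from RMT; the heavy-tailed spectra reported for trained networks show eigenvalue mass well beyond this edge. This is an illustration of the diagnostic, not the authors' code, and the matrix dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 300                       # "layer" dimensions, n >= m
W = rng.standard_normal((n, m))        # untrained Gaussian weights

# Empirical spectral density of the correlation matrix X = W^T W / n.
X = W.T @ W / n
eigs = np.linalg.eigvalsh(X)

# Marchenko-Pastur bulk edge for aspect ratio q = m / n, unit variance.
q = m / n
lam_plus = (1 + np.sqrt(q)) ** 2

# For pure noise, essentially all eigenvalues fall below the bulk edge;
# heavy-tailed spectra of trained DNN layers spill far beyond it.
frac_below = float(np.mean(eigs <= lam_plus * 1.05))
```

Running the same computation on the weight matrices of a trained large network, rather than Gaussian noise, is where the paper's heavy-tail signature shows up.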
Tensors Are All You Need: Faster Inference with Hummingbird - Databricks
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g., GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve 1000x speedup in inferencing on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speedup the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments... - MLconf
Understanding Human Impact: Social and Equity Assessments for AI Technologies
Social and equity impact assessments have broad applications. They can be a useful tool for exploring and mitigating machine learning fairness issues, can be applied to product-specific questions as a way to generate insights and learnings about users, and can surface impacts on society at large resulting from the deployment of new and emerging technologies.
In this presentation, my goal is to advocate for and highlight the need for community and external stakeholder engagement, to develop a new knowledge base and understanding of the human and social consequences of algorithmic decision making, and to introduce principles, methods and processes for these types of impact assessments.
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding - MLconf
The Brain’s Guide to Dealing with Context in Language Understanding
Like the visual cortex, the regions of the brain involved in understanding language represent information hierarchically. But whereas the visual cortex organizes things into a spatial hierarchy, the language regions encode information into a hierarchy of timescale. This organization is key to our uniquely human ability to integrate semantic information across narratives. More and more, deep learning-based approaches to natural language understanding embrace models that incorporate contextual information at varying timescales. This has not only led to state-of-the art performance on many difficult natural language tasks, but also to breakthroughs in our understanding of brain activity.
In this talk, we will discuss the important connection between language understanding and context at different timescales. We will explore how different deep learning architectures capture timescales in language and how closely their encodings mimic the brain. Along the way, we will uncover some surprising discoveries about what depth does and doesn’t buy you in deep recurrent neural networks. And we’ll describe a new, more flexible way to think about these architectures and ease design space exploration. Finally, we’ll discuss some of the exciting applications made possible by these breakthroughs.
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
Applying Computer Vision to Reduce Contamination in the Recycling Stream
With China’s recent refusal of most foreign recyclables, North American waste haulers are scrambling to figure out how to make on-shore recycling cost-effective in order to continue providing recycling services. Recyclables that were once being shipped to China for manual sorting are now primarily being redirected to landfills or incinerators. Without a solution, a nearly $5 billion annual recycling market could come to a halt.
Purity in the recycling stream is key to this effort as contaminants in the stream can increase the cost of operations, damage equipment and reduce the ability to create pure commodities suitable for creating recycled goods. This market disruption as a result of China’s new regulations, however, provides us the chance to re-examine and improve our current disposal & collection habits with modern monitoring & artificial intelligence technology.
Using images from our in-dumpster cameras, Compology has developed an ML-based process that helps identify, measure and alert for contaminants in recycling containers before they are picked-up, helping keep the recycling stream clean.
Our convolutional neural network flags potential instances of contamination inside a dumpster, enabling garbage haulers to know which containers have the wrong type of material inside. This allows them to provide targeted, timely education, and when appropriate, assess fines, to improve recycling compliance at the businesses and residences they serve, helping keep recycling services financially viable.
In this presentation, we will walk through our ML-based contamination measurement and scoring process, showing how Waste Management, a national waste hauler, achieved a 57% contamination reduction across nearly 2,000 containers over six months. This progress marks significant strides towards financially viable recycling services.
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
Quantum Computing: a Treasure Hunt, not a Gold Rush
Quantum computers promise a significant step up in computational power over conventional computers, but also suffer a number of counterintuitive limitations --- both in their computational model and in leading lab implementations. In this talk, we review how quantum computers compete with conventional computers and how conventional computers try to hold their ground. Then we outline what stands in the way of successful quantum ML applications.
Josh Wills - Data Labeling as Religious ExperienceMLconf
Data Labeling as Religious Experience
One of the most common places to deploy a production machine learning system is as a replacement for a legacy rules-based system that is having a hard time keeping up with new edge cases and requirements. I'll walk through the process and tooling we used to design, train, and deploy a model to replace a set of static rules we had for handling invite spam at Slack, talk about what we learned, and discuss some problems to solve in order to make these migrations easier for everyone.
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
Project GaitNet: Ushering in the ImageNet moment for human Gait kinematics
The emergence of the upright human bipedal gait can be traced back roughly 4 to 2.8 million years, to the now-extinct hominin Australopithecus afarensis. Fine-grained analysis of gait using the modern MEMS sensors found on all smartphones not only reveals a lot about a person’s orthopedic and neuromuscular health status, but also carries enough idiosyncratic clues to be harnessed as a passive biometric. While the machine learning community has made many siloed attempts to model bipedal gait sensor data, these were done with small datasets, often collected in restricted academic environs. In this talk, we will introduce the ImageNet moment for human gait analysis by presenting 'Project GaitNet', the largest planet-scale motion-sensor-based human bipedal gait dataset ever curated. We’ll also present the associated state-of-the-art results in classifying humans using novel deep neural architectures, and the related success stories we have enjoyed in transfer learning into disparate domains of human kinematics analysis.
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
Machine Learning Methods in Detecting Alzheimer’s Disease from Speech and Language
Alzheimer's disease affects millions of people worldwide, and it is important to predict the disease as early and as accurately as possible. In this talk, I will discuss the development of novel ML models that help distinguish healthy people from those who develop Alzheimer's, using short samples of human speech. As input to the model, features of different modalities are extracted from speech audio samples and transcriptions: (1) syntactic measures, such as production rules extracted from syntactic parse trees, (2) lexical measures, such as features of lexical richness and complexity and lexical norms, and (3) acoustic measures, such as standard Mel-frequency cepstral coefficients. I will present an ML model that detects cognitive impairment by reaching agreement among modalities. The resulting model is able to achieve state-of-the-art performance in both supervised and semi-supervised settings, using manual transcripts of human speech. Additionally, I will discuss potential limitations of any fully automated speech-based Alzheimer's disease detection model, focusing mostly on the analysis of the impact of not-so-accurate automatic speech recognition (ASR) on classification performance. To illustrate this, I will present experiments with controlled amounts of artificially generated ASR errors and explain how deletion errors affect Alzheimer's detection performance the most, due to their impact on features of syntactic and lexical complexity.
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
Optimized Image Classification on the Cheap
In this talk, we anchor on building an image classifier trained on the Stanford Cars dataset to evaluate two approaches to transfer learning -fine tuning and feature extraction- and the impact of hyperparameter optimization on these techniques. Once we define the most performant transfer learning technique for Stanford Cars, we will double the size of the dataset through image augmentation to boost the classifier’s performance. We will use Bayesian optimization to learn the hyperparameters associated with image transformations using the downstream image classifier’s performance as the guide. In conjunction with model performance, we will also focus on the features of these augmented images and the downstream implications for our image classifier.
To both maximize model performance on a budget and explore the impact of optimization on these methods, we apply a particularly efficient implementation of Bayesian optimization to each of these architectures in this comparison. Our goal is to draw on a rigorous set of experimental results that can help us answer the question: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pre-trained models?
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
The Importance of Modeling Data Collection
Data sets used in machine learning are often collected in a systematically biased way - certain data points are more likely to be collected than others. We call this "observation bias". For example, in health care, we are more likely to see lab tests when the patient is feeling unwell than otherwise. Failing to account for observation bias can, of course, result in poor predictions on new data. By contrast, properly accounting for this bias allows us to make better use of the data we do have.
In this presentation, we discuss practical and theoretical approaches to dealing with observation bias. When the nature of the bias is known, there are simple adjustments we can make to nonparametric function estimation techniques, such as Gaussian Process models. We also discuss the scenario where the data collection model is unknown. In this case, there are steps we can take to estimate it from observed data. Finally, we demonstrate that having a small subset of data points that are known to be collected at random - that is, in an unbiased way - can vastly improve our ability to account for observation bias in the rest of the data set.
My hope is that attendees of this presentation will be aware of the perils of observation bias in their own work, and be equipped with tools to address it.
The Uncanny Valley of ML
Every so often, the conundrum of the Uncanny Valley re-emerges as advanced technologies evolve from clearly experimental products to refined accepted technologies. We have seen its effects in robotics, computer graphics, and page load times. The debate of how to handle the new technology detracts from its benefits. When machine learning is added to human decision systems a similar effect can be measured in increased response time and decreased accuracy. These systems include radiology, judicial assignments, bus schedules, housing prices, power grids and a growing variety of applications. Unfortunately, the Uncanny Valley of ML can be hard to detect in these systems and can lead to degraded system performance when ML is introduced, at great expense. Here, we'll introduce key design principles for introducing ML into human decision systems to navigate around the Uncanny Valley and avoid its pitfalls.
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
Deep Learning Architectures for Semantic Relation Detection Tasks
Recognizing and distinguishing specific semantic relations from other types of semantic relations is an essential part of language understanding systems. Identifying expressions with similar and contrasting meanings is valuable for NLP systems which go beyond recognizing semantic relatedness and need to identify specific semantic relations. In this talk, I will first present novel techniques for creating the labelled datasets required for training deep learning models to classify semantic relations between phrases. I will then present neural network architectures that integrate morphological features into combined path-based and distributional relation detection algorithms, and demonstrate that this model outperforms state-of-the-art models in distinguishing semantic relations and is capable of efficiently handling multi-word expressions.
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
Building an Incrementally Trained, Local Taste Aware, Global Deep Learned Recommender System Model
At Netflix, our main goal is to maximize our members’ enjoyment of the selected show by minimizing the amount of time it takes for them to find it. We try to achieve this goal by personalizing almost all the aspects of our product -- from what shows to recommend, to how to present these shows and construct their home-pages to what images to select per show, among many other things. Everything is recommendations for us and as an applied Machine Learning group, we spend our time building models for personalization that will eventually increase the joy and satisfaction of our members. In this talk we will primarily focus our attention on a) making a global deep learned recommender model that is regional tastes and popularity aware and b) adapting this model to changing taste preferences as well as dynamic catalog availability.
We will first go through some standard recommender system models that use Matrix Factorization and Topic Models and then compare and contrast them with more powerful and higher-capacity deep learning based models, such as sequence models that use recurrent neural networks. We will show what it entails to build a global model that is aware of regional taste preferences and catalog availability. We will show how models built on the simple Maximum Likelihood principle fail to do that. We will then describe one solution that we have employed to enable the global deep learned models to focus their attention on capturing regional taste preferences and the changing catalog. In the latter half of the talk, we will discuss how we do incremental learning of deep learned recommender system models. Why do we need to do that? Everything changes with time. Users’ tastes change with time. What’s available on Netflix and what’s popular also change over time. Therefore, updating or improving recommendation systems over time is necessary to bring more joy to users. In addition to how we apply incremental learning, we will discuss some of the challenges we face involving large-scale data preparation, infrastructure setup for incremental model training, as well as pipeline scheduling. The incremental training enables us to serve fresher models trained on fresher and larger amounts of data. This helps our recommender system to quickly adapt to catalog and users’ taste changes, and improves overall performance.
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
The Voice: New Challenges in a Zero UI World
The adoption of voice-enabled devices has seen an explosive growth in the last few years and music consumption is among the most popular use cases. Music personalization and recommendation plays a major role at Pandora in providing a daily delightful listening experience for millions of users. In turn, providing the same perfectly tailored listening experience through these novel voice interfaces brings new interesting challenges and exciting opportunities. In this talk we will describe how we apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic, and broad open-ended. We will describe how we use deep learning slot filling techniques and query classification to interpret the user intent and identify the main concepts in the query.
We will also present the differences and challenges regarding evaluation of voice powered recommendation systems. Since pure voice interfaces do not contain visual UI elements, relevance labels need to be inferred through implicit actions such as play time, query reformulations or other types of session level information. Another difference is that while the typical recommendation task corresponds to recommending a ranked list of items, a voice play request translates into a single item play action. Thus, some considerations about closed feedback loops need to be made. In summary, improving the quality of voice interactions in music services is a relatively new challenge and many exciting opportunities for breakthroughs still remain. There are many new aspects of recommendation system interfaces to address to bring a delightful and effortless experience for voice users. We will share a few open challenges to solve for the future.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Welcome to ViralQR, your best QR code generator.ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through QR technology. Whether you run a small business or a huge enterprise, our easy-to-use platform provides multiple options that can be tailored to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, enhancing customer interaction and making business operations more fluid. We strongly believe in the ability of QR codes to change how businesses interact with their customers, and we are set on making that technology accessible and usable far and wide.
Our Achievements
Since our inception, we have successfully served many clients, offering QR codes for marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and rich features, which help businesses make QR codes.
Our Services
At ViralQR, we offer a comprehensive suite of services that caters to your needs:
Static QR Codes: Create free static QR codes. These can store information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR Codes: These offer advanced features and are subscription-based. They can link directly to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform. From there, you can easily subscribe and experience the full range of dynamic QR codes. The subscription plans are priced flexibly so that businesses of every size can afford to benefit from our service.
Why choose us?
ViralQR provides services for marketing, advertising, catering, retail, and more. QR codes can be placed on fliers, packaging, merchandise, and banners, or substitute for cash and cards in a restaurant or coffee shop. By integrating QR codes into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
ViralQR subscribers receive detailed analytics and tracking tools that give a clear view of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
Thank you for choosing ViralQR; we offer nothing but the best in QR code services to meet diverse business needs!
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot across several tools in the UiPath Suite:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
6.–10. Goal: Extend ML to the Big Data Setting
Challenge: ML not developed with scalability in mind
✦ Does not naturally scale / leverage distributed computing
Our approach: Divide-and-conquer
✦ Apply existing base algorithms to subsets of data and combine
✓ Build upon existing suites of ML algorithms
✓ Preserve favorable algorithm properties
✓ Naturally leverage distributed computing
E.g.,
✦ Matrix factorization (DFC) [MTJ, NIPS11; TMMFJ, ICCV13]
✦ Assessing estimator quality (BLB) [KTSJ, ICML12; KTSJ, JRSS13; KTASJ, KDD13]
✦ Genomic variant calling [BTTJPYS13, submitted; CTZFJP13, submitted]
(Diagram: the intersection of Machine Learning, Big Data, and Distributed Computing)
16. Matrix Completion
Goal: Recover a matrix from a subset of its entries
Can we do this at scale?
✦ Netflix: 30M users, 100K+ videos
✦ Facebook: 1B users
✦ Pandora: 70M active users, 1M songs
✦ Amazon: millions of users and products
✦ ...
18–20. Reducing Degrees of Freedom
✦ Problem: Impossible without additional information
✦ mn degrees of freedom
✦ Solution: Assume a small # of factors determines preference
✦ O(m + n) degrees of freedom
✦ Linear storage costs
[Figure: the m×n matrix factors as an (m×r)(r×n) product, 'low-rank']
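The degrees-of-freedom argument above can be made concrete with a small sketch (the sizes and rank here are illustrative, not from the slides):

```python
import numpy as np

# A rank-r m x n matrix is determined by an m x r factor and an
# r x n factor, so storage drops from m*n numbers to (m + n)*r.
m, n, r = 1_000, 500, 10

full_entries = m * n              # dense storage: 500,000 numbers
factored_entries = (m + n) * r    # factored storage: 15,000 numbers

# Build a random rank-r matrix from its factors and sanity-check
# that a submatrix never exceeds rank r.
A = np.random.randn(m, r)
B = np.random.randn(r, n)
L = A @ B
assert np.linalg.matrix_rank(L[:50, :50]) <= r

print(full_entries, factored_entries)  # 500000 15000
```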
25. Bad Information Spread
✦ Problem: Other ratings don't inform us about the missing rating (bad spread of information)
✦ Solution: Assume incoherence with the standard basis [Candes and Recht, 2009]
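A minimal sketch of why the incoherence assumption is needed (a hypothetical worst case, not an example from the talk): a rank-1 matrix whose mass sits in a single entry spreads no information to the rest of the matrix, so a sample that misses that entry is indistinguishable from a sample of the zero matrix.

```python
import numpy as np

n = 100
L0 = np.zeros((n, n))
L0[0, 0] = 1.0                      # rank-1, but maximally coherent

rng = np.random.default_rng(0)
mask = rng.random((n, n)) < 0.5     # observe roughly half the entries
mask[0, 0] = False                  # the one informative entry is missed

observed = L0[mask]
# Every observed value is 0, so the all-zeros matrix fits the data
# perfectly and exact recovery of L0 is hopeless.
assert np.all(observed == 0)
```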
34. Divide-Factor-Combine (DFC) [MTJ, NIPS11]
✦ D step: Divide input matrix into submatrices
✦ F step: Factor in parallel using a base MC algorithm
✦ C step: Combine submatrix estimates
Advantages:
✦ Submatrix factorization is much cheaper and easily parallelized
✦ Minimal communication between parallel jobs
✦ Retains comparable recovery guarantees (with proper choice of division / combination strategies)
36–44. DFC-Proj
✦ D step: Randomly partition observed entries into t submatrices
✦ F step: Complete the submatrices in parallel
✦ Reduced cost: expect a t-fold speedup per iteration
✦ Parallel computation: pay the cost of one cheaper MC
✦ C step: Project onto a single low-dimensional column space
✦ Roughly, share information across sub-solutions
✦ Minimal cost: linear in n, quadratic in the rank of the sub-solutions
✦ Ensemble: Project onto the column space of each sub-solution and average
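The D/F/C steps above can be sketched as follows. This is a toy illustration only: the base MC algorithm is stubbed out with a rank-r truncated SVD on zero-filled submatrices, and the F step runs sequentially; a real system would plug in a proper matrix-completion solver and parallelize it.

```python
import numpy as np

def base_complete(M_obs, mask, r):
    """Stub base algorithm: zero-fill missing entries, truncate to rank r."""
    U, s, Vt = np.linalg.svd(np.where(mask, M_obs, 0.0), full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def dfc_proj(M_obs, mask, t, r):
    m, n = M_obs.shape
    # D step: randomly split the columns into t submatrices.
    cols = np.array_split(np.random.permutation(n), t)
    # F step: complete each submatrix (in parallel, in a real system).
    subs = [(c, base_complete(M_obs[:, c], mask[:, c], r)) for c in cols]
    # C step: project every sub-solution onto the column space of the
    # first sub-solution, sharing information across sub-solutions.
    Q, _ = np.linalg.qr(subs[0][1])
    L_hat = np.empty((m, n))
    for c, S in subs:
        L_hat[:, c] = Q @ (Q.T @ S)
    return L_hat
```

The ensemble variant would project onto each sub-solution's column space in turn and average the results.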
45–47. Does It Work? Yes, with high probability.
Theorem: Assume:
✦ L₀ is low-rank and incoherent,
✦ Ω̃(r(n + m)) entries sampled uniformly at random,
✦ the nuclear norm heuristic is the base algorithm.
Then L̂ = L₀ with (slightly less) high probability.
✦ Noisy setting: (2 + ε) approximation of the original bound
✦ Can divide into an increasing number of subproblems (t → ∞) when the number of observed entries is Ω̃(r²(n + m))
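The guarantee above can be written out in full; this is a paraphrase of the slide's statement, not an exact quote of the paper's theorem:

```latex
% DFC-Proj recovery guarantee (paraphrased from the slides).
% Noiseless case: if $L_0$ is rank-$r$ and incoherent, and
% $\tilde{\Omega}(r(n+m))$ entries are sampled uniformly at random,
% then with the nuclear norm heuristic as base algorithm,
\hat{L} = L_0
% holds with (slightly less) high probability.
% Noisy case: DFC attains a $(2 + \epsilon)$-approximation of the
% base algorithm's error bound, and the number of subproblems may
% grow ($t \to \infty$) once the observed-entry count reaches
% $\tilde{\Omega}\!\left(r^2 (n + m)\right)$.
```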
58–60. Video Surveillance
✦ Goal: separate foreground from background
✦ Store video as a matrix
✦ Low-rank = background
✦ Outliers = movement
[Frames shown: Original Frame; Nuclear Norm (342.5s); DFC-5% (24.2s); DFC-0.5% (5.2s)]
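The "low-rank = background, outliers = movement" decomposition can be sketched on synthetic data. Note the hedge: the talk uses a robust nuclear-norm formulation; this sketch substitutes a plain rank-1 SVD, which suffices to expose a bright moving object against a static scene.

```python
import numpy as np

rng = np.random.default_rng(0)

h, w, T = 20, 30, 50
background = rng.random(h * w)                 # one static scene, vectorized
frames = np.tile(background[:, None], (1, T))  # video matrix: pixels x frames
frames[200:220, 25] += 5.0                     # a bright "object" in frame 25

# Rank-1 approximation recovers the repeated background...
U, s, Vt = np.linalg.svd(frames, full_matrices=False)
low_rank = s[0] * np.outer(U[:, 0], Vt[0])

# ...and the residual isolates the movement.
foreground = frames - low_rank
```

The large entries of `foreground` concentrate at the object pixels of frame 25, which is exactly the separation the surveillance slides illustrate.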
74–77. Motivation: Face Images (Subspace Segmentation)
[Figure: face images = low-rank + 'noise']
✦ Model images of five people via five low-dimensional subspaces
✦ Recover the subspaces to cluster the images
✦ The nuclear norm heuristic provably recovers the subspaces
✦ Guarantees are preserved with DFC [TMMFJ, ICCV13]
✦ Toy experiment: identify images corresponding to the same person (10 people, 640 images)
✦ DFC results: linear speedup, state-of-the-art accuracy
80–83. Video Event Detection
✦ Input: videos, some of which are associated with events
✦ Goal: predict events for unlabeled videos
✦ Idea:
✦ Featurize each video
✦ Learn video clusters via the nuclear norm heuristic
✦ Given labeled nodes and the cluster structure, make predictions
Can do this at scale with DFC!
84. DFC Summary
✦ DFC: a distributed framework for matrix factorization
✦ Similar recovery guarantees
✦ Significant speedups
✦ DFC applied to 3 classes of problems:
✦ Matrix completion
✦ Robust matrix factorization
✦ Subspace recovery
✦ Extend DFC to other MF methods, e.g., ALS, SGD?
85–89. Big Data and Distributed Computing are valuable resources, but ...
✦ Challenge 1: ML not developed with scalability in mind → Divide-and-Conquer (e.g., DFC)
✦ Challenge 2: ML not developed with ease-of-use in mind → MLbase (www.mlbase.org)