In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively home in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and the modeling workflow. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
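For a concrete picture of the automated search mentioned above, here is a minimal grid search sketch; it is illustrative rather than taken from the talk, and it assumes a feature matrix X and target vector y have already been loaded.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Exhaustively evaluate a small hyperparameter grid with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=5,
)
search.fit(X, y)  # X and y are assumed to be loaded elsewhere
print(search.best_params_, search.best_score_)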
We live with an abundance of ML resources: from open source tools, to GPU workstations, to cloud-hosted AutoML. What's more, the lines between AI research and everyday ML have blurred; you can recreate a state-of-the-art model from arXiv papers at home. But can you afford to? In this talk, we explore ways to recession-proof your ML process without sacrificing accuracy, explainability, or value.
(Py)testing the Limits of Machine Learning | Rebecca Bilbro
Despite the hype cycle, each day machine learning becomes a little less magic and a little more real. Predictions increasingly drive our everyday lives, embedded into more of our everyday applications. To support this creative surge, development teams are evolving, integrating novel open source software and state-of-the-art GPU hardware, and bringing on essential new teammates like data ethicists and machine learning engineers. Software teams are also now challenged to build and maintain codebases that are intentionally not fully deterministic.
This nondeterminism can manifest in a number of surprising and oftentimes very stressful ways! Successive runs of model training may produce slight but meaningful variations. Data wrangling pipelines turn out to be extremely sensitive to the order in which transformations are applied, and require thoughtful orchestration to avoid leakage. Model hyperparameters that can be tuned independently may have mutually exclusive conditions. Models can also degrade over time, producing increasingly unreliable predictions. Moreover, open source libraries are living, dynamic things; the latest release of your team's favorite library might cause your code to suddenly behave in unexpected ways.
Put simply, as ML becomes more of an expectation than an exception in our industry, testing has never been more important! Fortunately, we are lucky to have a rich open source ecosystem to support us in our journey to build the next generation of apps in a safe, stable way. In this talk we'll share some hard-won lessons, favorite open source packages, and reusable techniques for testing ML software components.
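As a flavor of what such tests can look like (a hedged sketch, not one of the talk's actual examples), one common pattern is to pin random seeds and assert a performance floor rather than an exact score:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def test_model_meets_accuracy_floor():
    # Fixed seeds keep the synthetic data and the estimator reproducible across runs.
    X, y = make_classification(n_samples=500, random_state=42)
    model = LogisticRegression(max_iter=1000, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)
    # Assert a floor rather than an exact value, since training is not fully deterministic.
    assert scores.mean() > 0.8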
EuroSciPy 2019: Visual diagnostics at scale | Rebecca Bilbro
The hunt for the most effective machine learning model is hard enough with a modest dataset, and much more so as our data grow! As we search for the optimal combination of features, algorithm, and hyperparameters, we often use tools like histograms, heatmaps, embeddings, and other plots to make our processes more informed and effective. However, large, high-dimensional datasets can prove particularly challenging. In this talk, we’ll explore a suite of visual diagnostics, investigate their strengths and weaknesses in the face of increasingly big data, and consider how we can steer the machine learning process, not only purposefully but at scale!
Yellowbrick: Steering machine learning with visual transformers | Rebecca Bilbro
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively home in on quality models than exhaustive search.
This talk presents a new Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and the modeling workflow. Yellowbrick is an open source, pure Python project that extends Scikit-Learn with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grained control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
In this talk, we'll explore not only what you can do with Yellowbrick, but how it works under the hood (since we're always looking for new contributors!). We'll illustrate how Yellowbrick extends the Scikit-Learn and Matplotlib APIs with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process - providing iterative visual diagnostics throughout the transformation of high dimensional data.
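As a taste of the API, here is a brief sketch using one of Yellowbrick's feature visualizers; X is assumed to be a loaded feature matrix, and this is illustrative rather than a specific example from the talk.

from yellowbrick.features import Rank2D

visualizer = Rank2D(algorithm="pearson")  # pairwise Pearson correlation of features
visualizer.fit(X)        # learn the rankings from the data
visualizer.transform(X)  # feature visualizers also behave as transformers in a pipeline
visualizer.poof()        # render the figure (poof() was the Yellowbrick API at the time of this talk)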
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively home in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick (scikit-yb.org), which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and the modeling workflow. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
The Incredible Disappearing Data Scientist | Rebecca Bilbro
The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century" as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine critical integration points — technological and otherwise — we are currently tackling, which will ultimately determine our success, and our viability, over the next 10 years.
Winning Kaggle 101: Introduction to Stacking | Ted Xiao
An Introduction to Stacking by Erin LeDell, from H2O.ai
Presented as part of the "Winning Kaggle 101" event, hosted by Machine Learning at Berkeley and Data Science Society at Berkeley. Special thanks to the Berkeley Institute of Data Science for the venue!
H2O.ai: http://www.h2o.ai/
ML@B: ml.berkeley.edu
DSSB: http://dssberkeley.org
BIDS: http://bids.berkeley.edu/
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016 | MLconf
Building a Machine Learning Platform at Quora: Each month, over 100 million people use Quora to share and grow their knowledge. Machine learning has played a critical role in enabling us to grow to this scale, with applications ranging from understanding content quality to identifying users’ interests and expertise. By investing in a reusable, extensible machine learning platform, our small team of ML engineers has been able to productionize dozens of different models and algorithms that power many features across Quora.
In this talk, I’ll discuss the core ideas behind our ML platform, as well as some of the specific systems, tools, and abstractions that have enabled us to scale our approach to machine learning.
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt | Eugene Yan Ziyou
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15 | MLconf
Many Shades of Scale: Big Learning Beyond Big Data: In the machine learning research community, much of the attention devoted to ‘big data’ in recent years has been manifested as development of new algorithms and systems for distributed training on many examples. This focus has led to significant advances in the field, from basic but operational implementations on popular platforms to highly sophisticated prototypes in the literature. In the meantime, other aspects of scaling up learning have received relatively little attention, although they are often more pressing in practice. The talk will survey these less-studied facets of big learning: scaling to an extremely large number of features, to many components in predictive pipelines, and to multiple data scientists collaborating on shared experiments.
In this talk, Dmitry shares the approach to feature engineering that he has used successfully in various Kaggle competitions. He covers common techniques for converting features into the numeric representations that ML algorithms require.
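For illustration only (not Dmitry's actual pipeline), the most common of those conversions look something like this:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"city": ["tokyo", "paris", "tokyo"], "age": [23, 41, 35]})

# One-hot encode the categorical column into a numeric matrix.
cities = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# Standardize the numeric column to zero mean and unit variance.
ages = StandardScaler().fit_transform(df[["age"]])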
How to Win Machine Learning Competitions? | HackerEarth
This presentation was given by Marios Michailidis (a.k.a. Kazanova), currently ranked #3 on Kaggle, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
VSSML16 LR1. Summary Day 1
Valencian Summer School in Machine Learning 2016
Day 1
Summary Day 1
Mercè Martin (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016
Top contenders in the 2015 KDD Cup include the team from DataRobot comprising Owen Zhang, the #1-ranked Kaggler, and top Kagglers Xavier Conort and Sergey Yurgenson. Get an in-depth look as Xavier describes their approach. DataRobot allowed the team to focus on feature engineering by automating model training, hyperparameter tuning, and model blending, giving the team a firm advantage.
Winning data science competitions, presented by Owen Zhang | Vivian S. Zhang
Meetup event hosted by NYC Open Data Meetup and NYC Data Science Academy. Speaker: Owen Zhang. Event info: http://www.meetup.com/NYC-Open-Data/events/219370251/
Online Machine Learning: introduction and examples | Felipe
In this talk I introduce the topic of Online Machine Learning, which deals with techniques for doing machine learning in an online setting, i.e. where you train your model a few examples at a time, rather than using the full dataset (off-line learning).
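A minimal sketch of that setting with scikit-learn, assuming a hypothetical stream() generator that yields small labeled batches:

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit

for X_batch, y_batch in stream():  # stream() is a placeholder for your data source
    # Update the model a few examples at a time instead of refitting on the full dataset.
    model.partial_fit(X_batch, y_batch, classes=classes)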
Automated machine learning lectures given at the Advanced Course on Data Science & Machine Learning. AutoML, hyperparameter optimization, Bayesian optimization, Neural Architecture Search, Meta-learning, MAML
Machine Learning: why we should know and how it works | Kevin Lee
The most popular buzz word nowadays in the technology world is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years through automating and optimizing processes such as: self-driving vehicles; online recommendation on Netflix and Amazon; fraud detection in banks; image and video recognition; natural language processing; question answering machines (e.g., IBM Watson); and many more. This is leading many organizations to seek experts who can implement Machine Learning into their businesses.
Statistical programmers and statisticians in the pharmaceutical industry are in a very interesting position. We have backgrounds very similar to those of Machine Learning experts, spanning programming, statistics, and data expertise, and thus already embody the essential technical skill sets. This similarity leads many individuals to ask us about Machine Learning, and if you lead a biometrics group, you get asked even more often.
The paper is intended for statistical programmers and statisticians who are interested in learning and applying Machine Learning to lead innovation in the pharmaceutical industry. The paper starts with an introduction to the basic concepts of Machine Learning: the hypothesis, the cost function, and gradient descent. It then introduces supervised ML (e.g., Support Vector Machines, Decision Trees, Logistic Regression), unsupervised ML (e.g., clustering), and the most powerful ML algorithm, the Artificial Neural Network (ANN). The paper also introduces some popular SAS® ML procedures and SAS Visual Data Mining and Machine Learning. Finally, it discusses current and future ML implementations and how programmers and statisticians could lead this exciting and disruptive technology in the pharmaceutical industry.
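To make those three concepts concrete, here is a tiny illustrative example in Python (not from the paper, which uses SAS): a linear hypothesis, a mean squared error cost function, and gradient descent updates.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])

w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    y_hat = w * X + b                        # hypothesis
    cost = ((y_hat - y) ** 2).mean()         # cost function (mean squared error)
    grad_w = 2 * ((y_hat - y) * X).mean()    # gradient of the cost with respect to w
    grad_b = 2 * (y_hat - y).mean()          # gradient of the cost with respect to b
    w, b = w - lr * grad_w, b - lr * grad_b  # gradient descent update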
This tutorial introduces the basic idea of machine learning with a very simple example. Machine learning teaches machines (and us too) to learn to carry out tasks and concepts by themselves. It is that simple, so here is an overview:
http://www.softwareschule.ch/examples/machinelearning.jpg
The nearest neighbour algorithm is one of the most basic building blocks of machine learning: it finds patterns by looking at the closest labelled examples. This is a supervised type of learning.
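A minimal sketch of that idea with scikit-learn (the tiny dataset is made up purely for illustration):

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]  # two obvious groups of points
y = [0, 0, 1, 1]                      # labels supplied in advance (supervised learning)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
print(model.predict([[2, 1], [8, 9]]))  # each query takes the label of its nearest neighbour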
Machine Learning Tutorial Part 2 | Machine Learning Tutorial For Beginners... | Simplilearn
This presentation on Machine Learning will help you understand what clustering is, K-Means clustering, a flowchart to understand K-Means clustering along with a demo showing clustering of cars into brands, what logistic regression is, the logistic regression curve, the sigmoid function, and a demo on how to classify a tumor as malignant or benign based on its features. Machine Learning algorithms can help computers play chess, perform surgeries, and get smarter and more personal. K-Means and logistic regression are two widely used Machine Learning algorithms which we are going to discuss in this video. Logistic regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps to predict the probability of an event by fitting data to a logit function; it is also called logit regression. K-means clustering is an unsupervised learning algorithm: you don't have labeled data, unlike in supervised learning. You have a set of data that you want to group into clusters, meaning objects that are similar in nature and characteristics are put together. This is what k-means clustering is all about. Now, let us get started and understand K-Means clustering and logistic regression in detail; a minimal code sketch of both algorithms follows the topic list below.
The following topics are explained in this Machine Learning tutorial, part 2:
1. Clustering
- What is clustering?
- K-Means clustering
- Flowchart to understand K-Means clustering
- Demo - Clustering of cars based on brands
2. Logistic regression
- What is logistic regression?
- Logistic regression curve & Sigmoid function
- Demo - Classify a tumor as malignant or benign based on features
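As a minimal, purely illustrative sketch of the two algorithms above (synthetic data standing in for the car and tumor datasets used in the demos):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression

# K-Means: group unlabeled points into k clusters.
X_cars, _ = make_blobs(n_samples=300, centers=3, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_cars)

# Logistic regression: fit a sigmoid to predict a binary label (e.g. malignant vs. benign).
X_tumors, y_tumors = make_classification(n_samples=300, random_state=0)
probabilities = LogisticRegression(max_iter=1000).fit(X_tumors, y_tumors).predict_proba(X_tumors)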
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at: https://www.simplilearn.com/
It's Not Magic - Explaining classification algorithms | Brian Lange
As organizations increasingly leverage data and machine learning methods, people throughout those organizations need to build a basic "data literacy" in those topics. In this session, data scientist and instructor Brian Lange provides simple, visual, and equation-free explanations for a variety of classification algorithms, geared toward helping anyone understand how they work. Now with Python code examples!
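In that spirit, a small hedged example (synthetic data, not one of the talk's own demos) comparing two common classifiers on the same toy problem:

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A two-class dataset with a curved boundary that a linear model struggles to capture.
X_train, X_test, y_train, y_test = train_test_split(*make_moons(n_samples=500, noise=0.3, random_state=0))

for clf in (LogisticRegression(), DecisionTreeClassifier(max_depth=5)):
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__, clf.score(X_test, y_test))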
Learning Predictive Modeling with TSA and Kaggle | Yvonne K. Matos
Ever wanted to do a challenging data science project but feel like you don’t have enough experience? Just go for it! Diving into a 3 TB, 3D image dataset has been my best learning experience. I want to share this deep learning project with you, and tips for overcoming challenges along the way.
Innovations in technology have revolutionized financial services to such an extent that large financial institutions like Goldman Sachs claim to be technology companies! It is no secret that technological innovations like data science and AI are fundamentally changing how financial products are created, tested, and delivered. While it is exciting to learn about the technologies themselves, there is very little guidance available on how companies and financial professionals should retool and gear themselves for the upcoming revolution.
In this master class, we will discuss key innovations in Data Science and AI and connect applications of these novel fields in forecasting and optimization. Through case studies and examples, we will demonstrate why now is the time you should invest to learn about the topics that will reshape the financial services industry of the future!
Topic
- Frontier topics in Optimization
Data Structures for Data Privacy: Lessons Learned in Production | Rebecca Bilbro
In this talk we present a decentralized messaging protocol and storage system in use across North America, Europe, and South East Asia. Built at the behest of an international nonprofit working group, the protocol and system are designed to address a unique problem at the intersection of financial crime regulation, distributed ledger technology, and user privacy. In our talk we discuss the many lessons learned in the process of architecting, implementing, and fostering adoption of this system. We present the Secure Envelope, a data structure that employs a combination of methods (symmetric and asymmetric encryption, mTLS, protocol buffers, etc.) to safeguard data privacy both at rest and in flight.
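To sketch the general pattern (this is NOT the Secure Envelope protocol itself, just a generic envelope-encryption illustration using the widely used cryptography package): encrypt the payload with a fresh symmetric key, then wrap that key with the recipient's public key.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Encrypt the payload with a fresh symmetric key.
data_key = Fernet.generate_key()
ciphertext = Fernet(data_key).encrypt(b"compliance payload")

# Wrap the symmetric key with the recipient's RSA public key so only they can unwrap it.
recipient_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None)
wrapped_key = recipient_key.public_key().encrypt(data_key, oaep)

# The envelope travels as (ciphertext, wrapped_key); the recipient reverses the steps.
plaintext = Fernet(recipient_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)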
Conflict-Free Replicated Data Types (PyCon 2022) | Rebecca Bilbro
Jupyter Notebook may be one of the most controversial open source projects released in the last ten years! Love them or hate them, they’ve become a mainstay of data science and machine learning, and a significant part of the Python ecosystem. While Jupyter can simplify experimentation, rapid prototyping, documentation, and visualization, it often impedes version control, code review, and test coverage. Dev teams must accept the good with the bad… but what if they didn’t have to? In this talk we introduce conflict-free replicated data types (CRDT), a special object that supports strong consistency, and which can be used to enhance Jupyter notebooks for a truly collaborative experience.
First proposed by Shapiro et al. in 2011, conflict-free replicated data types (CRDTs) evolved out of the distributed systems community for replication of data across a network of replicas. CRDTs are objects that come with a special guarantee: namely, that two different copies of that object can be strongly consistent, meaning they can be kept in sync. While CRDTs have enjoyed a good amount of attention from academia over recent years, primarily among database and cloud researchers, they have not led to many practical applications for everyday developers. However, recent work by Kleppmann et al. shows CRDTs can be used for real-time rich-text collaboration, creating a “Google doc”-type experience with any document in a networked file system.
In this talk, we’ll present the basics of CRDTs and demonstrate how they work with examples written in Python. Next, we’ll explain how CRDTs can enable more collaborative Jupyter notebooks, opening up features such as synchronous insertions, diffs, and auto-merges, even with multiple simultaneous contributors!
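For a flavor of the Python examples, here is one of the simplest CRDTs, a grow-only counter (a generic illustration, not code from the talk): each replica increments only its own slot, and merge takes an element-wise maximum, so replicas converge no matter the order in which they exchange state.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count observed from that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Merging is commutative, associative, and idempotent: element-wise max.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge to the same total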
Anti-Entropy Replication for Cost-Effective Eventual Consistency | Rebecca Bilbro
Eventually consistent systems are often more cost-effective to implement and maintain than their strongly consistent cousins. Gossip-based anti-entropy methods can be used to improve the consistency of such systems even as they expand geographically.
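A minimal sketch of the idea (illustrative only, not the talk's implementation): replicas keep last-writer-wins entries and periodically gossip with a random peer, so every update eventually reaches every replica.

import random

class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, value, ts):
        self.store[key] = (ts, value)

    def gossip_with(self, peer):
        # Anti-entropy: both sides keep the newest version of every key either has seen.
        for key in set(self.store) | set(peer.store):
            newest = max(self.store.get(key, (0, None)), peer.store.get(key, (0, None)))
            self.store[key] = peer.store[key] = newest

replicas = [Replica() for _ in range(3)]
replicas[0].put("greeting", "hello", ts=1)
replicas[2].put("greeting", "bonjour", ts=2)

for _ in range(10):       # a few random gossip rounds
    a, b = random.sample(replicas, 2)
    a.gossip_with(b)      # all replicas trend toward the newest write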
In the machine learning community, we're trained to think of size as inversely proportional to bias, driving us to ever larger datasets, increasingly complex model architectures, and ever better accuracy scores. But bigger doesn't always mean better.
What data quality issues emerge in large datasets? What complications surface as features become more geodistributed (e.g., diurnal patterns, seasonal variations, datetime formatting, multilingual text, etc.)? What happens as models attempt to extrapolate bigger and bigger patterns? Why is it that the pursuit of megamodels has driven a wedge between the ML definition of “bias” and the more colloquial sense of the word?
Perhaps the time has come to move away from monolithic models that reduce rich variations and complexities to a simple argmax on the output layer and instead embrace a new generation of model architectures that are just as organic and diverse as the data they seek to encode.
There have never been more commercial tools available for building distributed data apps — from cloud hosting services, to cloud-native databases, to cloud-based analytics platforms. So why is it still so hard to make a successful app with a global user base?
One of the toughest challenges cloud offerings take on is the problem of consensus, abstracting away most of the complexity. That's no small feat, given that this is a hard enough problem that people spend years getting a PhD just to understand it! Unfortunately, while buying off-the-shelf cloud services can accelerate the path to an MVP, it also makes optimization tough. How will we scale during a period of rapid user growth? How do we do I18n and l10n or guarantee a good UX for users on the other side of the world? How do we prevent replication that might get us into legal trouble?
In this talk, we'll consider several case studies of global apps (both successful and otherwise!), talk about the limitations of off-the-shelf consensus, and consider a future where everyday developers can use open source tools to build distributed data apps that are easier to reason about, maintain, and tune.
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 | Rebecca Bilbro
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed, and more effective.
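As a small taste of the tutorial's material (X_train, X_test, y_train, and y_test are assumed to be already split; ClassificationReport is one of Yellowbrick's classifier visualizers):

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ClassificationReport

visualizer = ClassificationReport(LogisticRegression())
visualizer.fit(X_train, y_train)   # fit the wrapped estimator
visualizer.score(X_test, y_test)   # compute per-class precision, recall, and F1
visualizer.poof()                  # render the heatmap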
A Visual Exploration of Distance, Documents, and Distributions | Rebecca Bilbro
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
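A small illustration of why the choice matters (toy documents, not the talk's data): Euclidean distance is sensitive to document length, while cosine distance compares direction only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

docs = ["the cat sat", "the cat sat " * 10, "stocks fell sharply"]
X = CountVectorizer().fit_transform(docs)

# Euclidean: the long repetition of doc 0 looks farther away than an unrelated topic.
print(euclidean_distances(X[0], X[1]), euclidean_distances(X[0], X[2]))
# Cosine: doc 0 and its repetition point in the same direction; the unrelated topic does not.
print(cosine_distances(X[0], X[1]), cosine_distances(X[0], X[2]))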
While data privacy challenges long predate current trends in machine-learning-as-a-service (MLAAS) offerings, predictive APIs do expose significant new attack vectors. To provide users with tailored recommendations, these applications often expose endpoints either to dynamic models or to pre-trained model artifacts, which learn patterns from data to surface insights. Problems arise when training data are collected, stored, and modeled in ways that jeopardize privacy. Even when user data is not exposed directly, private information can often be inferred using a technique called model inversion. In this talk, I discuss current research in black box model inversion and present a machine learning approach to discovering the model families of deployed black box models using only their decision topologies. Prior work suggests the efficacy of model family specific attack vectors (i.e., once the model is no longer a black box, it is easier to exploit). As such, we approach the problem only of model discovery and not of model inversion, reasoning that by solving the problem of model identification, we clear a path for information security and cryptography experts to use domain-specific tools for model inversion.
Learning machine learning with Yellowbrick | Rebecca Bilbro
Yellowbrick is an open source Python library that provides visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods.
Data Intelligence 2017 - Building a Gigaword Corpus | Rebecca Bilbro
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
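A minimal sketch of the streaming pattern described above (not Baleen or Minke themselves): read documents lazily, one at a time, so the full corpus never has to fit in memory.

import os

def stream_corpus(root):
    """Yield (category, text) pairs one document at a time from category subfolders."""
    for category in os.listdir(root):
        for name in os.listdir(os.path.join(root, category)):
            with open(os.path.join(root, category, name), encoding="utf-8") as f:
                yield category, f.read()

# Downstream preprocessing consumes the generator without loading everything at once:
# for category, text in stream_corpus("corpus/"):
#     tokens = preprocess(text)  # preprocess() is a placeholder for normalization steps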
Threats to mobile devices are more prevalent than ever and are increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 | Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... | James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GridMate - End to end testing is a critical piece to ensure quality and avoid... | ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
A tale of scale & speed: How the US Navy is enabling software delivery from l... | sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA Connect | Kari Kakkonen
Slides from my and Rik Marselis's talk at the DASA Connect conference on 30.5.2024. We discuss what testing is, what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
UiPath Test Automation using UiPath Test Suite series, part 5 | DianaGray10
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join the discussion.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs | Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
5. from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# model_selection.KFold takes n_splits (the old cross_validation API
# took the dataset length and n_folds)
kfold = ms.KFold(n_splits=12)

# Cross-validate every model and keep the best mean score
max([
    ms.cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
7. ● Search is difficult, particularly in high dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the amount of time increases exponentially.
Except ...
10. Enter Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
11. # Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the estimator to the data
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)
Scikit-Learn Estimator Interface
12. # Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()
Yellowbrick Visualizer Interface
14. Is this room occupied?
● Given labelled data with amount of light, heat, humidity, etc.
● Which features are most predictive?
● How hard is it going to be to distinguish the empty rooms from the occupied ones?
19. Why isn’t my model predictive?
● What to do with a low-accuracy classifier?
● Check for class imbalance.
● Visual cue that we might try stratified sampling, oversampling, or getting more data.
23. What’s the right k?
● How many clusters do you see?
● How do you pick an initial value for k in k-means clustering?
● How do you know whether to increase or decrease k?
● Is partitive clustering the right choice?
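One common way to steer the choice of k is the elbow method, which Yellowbrick implements as KElbowVisualizer. A minimal sketch, assuming a feature matrix X and an arbitrary k range (newer Yellowbrick releases rename poof() to show()):
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Fit KMeans for each candidate k and plot the score curve; the "elbow"
# suggests a reasonable number of clusters
visualizer = KElbowVisualizer(KMeans(), k=(2, 10))
visualizer.fit(X)
visualizer.poof()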
33. The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
class Estimator(object):
    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
Estimators
34. Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X’.
class Transformer(Estimator):
    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
Transformers
35. A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.
class Visualizer(Estimator):
    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()

    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()

    def poof(self):
        """
        Show the figure
        """
        plt.show()
Visualizers
Good morning and thank you for coming!
Stickers!
Today I’d like to tell you about an open source Python project I’ve been working on for the last two years.
It’s called Yellowbrick, and it’s a tool you can use to steer the machine learning process using visual transformers.
================================================== TALK DESC / REFERENCE ==================================================
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflow. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
Ben and Rebecca are active contributors to the open source community, and this talk is based on Yellowbrick, a project they've been building together with the team at District Data Labs, an open source collaborative in Washington, DC. They are also co-authors of the forthcoming O'Reilly book, Applied Text Analysis with Python, and organizers of Data Community DC, a not-for-profit organization of 9 meetups that hosts free monthly events and lectures for the local data community in Washington, DC.
Visualizing the Model Selection Process
1. The Model Selection Process
a. YB is part of the model selection workflow
b. Feature Selection
c. Algorithm Selection (Model Family/Form ⟶ Fitted Model)
d. Hyperparameter Tuning
e. Selection with Cross-Validation
f. Try all the models!
g. Hyperparameter Tuning
h. Model selection is Search
3. Main Point: Visual Steering Improves Model Selection:
a. Leads to better models (better F1/R2 Scores)
b. Gets to good models more quickly
c. Produces more insight about models
d. Research People: prove this.
4. The Scikit-Learn API
a. YB _extends_ the Scikit-Learn API
b. Tricky because: functional/procedural matplotlib + OO
c. Estimators
d. Transformers
e. Visualizers
f. Pipelines
g. Visual Workflows and Pipelines
5. Primary YB Requirements
a. Fits into the sklearn API and workflow
b. Implements matplotlib calls efficiently
c. Low overhead if `poof()` is not called
d. Just flexible enough for users to adapt to their data
e. Easy to add new visualizers
f. Looks as good as Seaborn
g. Minimal dependencies: sklearn, numpy, matplotlib -- that's all!
h. Primary Requirement: Implements Visual Steering
6. The Visualizer
a. Current class hierarchy
b. The Visualizer Interface
c. Axes management
d. `draw()`, `poof()`, and `finalize()`
e. A simple example of a visualizer
f. Feature Visualizers
g. Score Visualizers
h. Multi-Estimator Visualizers
9. Visual Pipelines
a. Multiple Visualizations
b. Interactivity
10. Optimizing Visualization!
11. Utilities
a. Style management: Sequences, Palettes, Color Codes
b. Best Fit Lines
c. Type Detection (is_classifier, etc.)
d. Exceptions
12. Documentation and Sphinx
13. Contributing
a. Git/Branch Management
b. Issues, milestones, and labels
c. Waffle Board
14. User Testing and Research
https://flic.kr/p/5EsRYP
It all started in an ivory tower somewhere.
Once upon a time, you had to go to school to learn machine learning.
And you’d spend years and ultimately specialize in a particular model family
Bayesian methods
Gaussian processes
support vector machines
...and get very very good at tuning those models.
Then everything got really easy really fast.
In 2010, Scikit-Learn was publicly released (by INRIA).
And it started to grow very very quickly.
Suddenly there were dozens of models at the fingertips of any Python programmer.
Scikit-Learn has so much going for it
Models, models, models
Also transformers, vectorization tools, sample datasets
Pipelines
But arguably the best part is the consistent API
You can plug the same data into nearly any of the models and it will work!
So it becomes just an optimization problem
You can loop through all the models and just pick the one with the best score.
It almost doesn’t matter which model you use, or why or how it’s working.
http://derekerdman.com/ilovemilkshakes/january2009/DO_IT/haircuts_try_them_all.jpg
Slide showing MLaaS
Except that hyperparameter space is large, and grid search is slow if you don’t already know what you’re looking for:
Alpha/penalty for regularization
Kernel function in support vector machine
Leaves or depth of a decision tree
Neighbors used in a nearest neighbor classifier
Clusters in a k-means clustering
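To make the cost concrete, here is a minimal sketch of an exhaustive grid search over just two SVC hyperparameters with scikit-learn's GridSearchCV; the grid values and the data X, y are placeholders, and every additional hyperparameter multiplies the number of fits:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 3 kernels x 4 values of C = 12 candidates, each refit on every CV fold
param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=12)
search.fit(X, y)
print(search.best_params_, search.best_score_)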
https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Blindfold_%28PSF%29.png/1200px-Blindfold_%28PSF%29.png
The answer is visual steering.
Leverage human pattern recognition and engagement.
Get to better models, faster, with more insight,
Overview first; zoom and filter; details on demand.
Enhance model selection and evaluation
Goal is not to replace other libraries.
Model visualization not data visualization
YB extends the Scikit-Learn API
This is a high dimensional visualization problem.
For classification, we potentially want to see whether there is good separability
Are some features more predictive than others?
The unit circle requires normalization
Points are dropped in the middle of the circle and pulled toward the outer edge by their feature values
Ordering of features matters
Automatic ordering with optimization to minimize the amount of overlap
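These notes describe Yellowbrick's RadViz feature visualizer. A minimal sketch, assuming a feature matrix X, labels y, and placeholder class names (poof() is called show() in newer releases):
from yellowbrick.features import RadViz

# Radial layout: each point starts at the center and is pulled toward
# the features it scores high on, after normalization to the unit circle
visualizer = RadViz(classes=["unoccupied", "occupied"])
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()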
Feature engineering requires understanding of the relationships between features
Visualize pairwise relationships
Heatmap
Pearson shows us strong correlations => potential collinearity
Covariance helps us understand the sequence of relationships
Other correlation metrics - we just used the ones that were implemented in numpy, but looking to expand
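A minimal sketch of the pairwise ranking heatmap described above, using Yellowbrick's Rank2D; the metric can be swapped between the Pearson correlation and covariance:
from yellowbrick.features import Rank2D

# Rank pairs of features and draw the result as a heatmap
visualizer = Rank2D(algorithm="pearson")   # or algorithm="covariance"
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()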
Frequency distribution - top 50 tokens
t-distributed Stochastic Neighbor Embedding (t-SNE): decomposition, then projection into a 2D scatterplot
Visual part-of-speech tagging
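For the text examples, a minimal sketch of the t-SNE projection using Yellowbrick's text module; the corpus and labels variables are placeholders for the raw documents and their classes:
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

# Vectorize the corpus, then decompose and project it into a 2D scatterplot
docs = TfidfVectorizer().fit_transform(corpus)
tsne = TSNEVisualizer()
tsne.fit(docs, labels)   # labels color the points by class
tsne.poof()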
YB extends the Scikit-Learn API
Can we quickly detect class imbalance issues
Stratified sampling, oversampling, getting more data -- tricks will help us balance
But supervised methods can mask imbalance in the training data; simple graphs like these give us an at-a-glance reference
As this gets into multiclass problems, class domination could be harder to see and really affect modeling
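A minimal sketch of the class balance check described here, assuming Yellowbrick's ClassBalance visualizer with placeholder class names and training labels y_train (in recent releases it lives in yellowbrick.target and poof() is called show()):
from yellowbrick.target import ClassBalance

# Bar chart of instances per class; a lopsided chart is the visual cue to
# try stratified sampling, oversampling, or collecting more data
visualizer = ClassBalance(labels=["unoccupied", "occupied"])
visualizer.fit(y_train)
visualizer.poof()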
Receiver operating characteristics/area under curve
Class imbalance
Classification report heatmap - Quickly identify strengths & weaknesses of model - F1 vs Type I & Type II error
Visual confusion matrix - misclassification on a per-class basis
Where/why/how is model performing good/bad
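A minimal sketch of the two classifier views mentioned above, using Yellowbrick's ClassificationReport and ConfusionMatrix; the LogisticRegression estimator, class names, and train/test splits are placeholders:
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix

# Heatmap of precision, recall, and F1 per class
report = ClassificationReport(LogisticRegression(), classes=["unoccupied", "occupied"])
report.fit(X_train, y_train)
report.score(X_test, y_test)
report.poof()

# Per-class view of where the model misclassifies
cm = ConfusionMatrix(LogisticRegression(), classes=["unoccupied", "occupied"])
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.poof()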
Prediction error plot - the 45-degree line is the theoretically perfect fit
Residuals plot - the zero line means no error
See changes in the amount of variance between x and y, or along the x axis => heteroscedasticity
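A minimal sketch of the residuals view, assuming Yellowbrick's ResidualsPlot with a placeholder Ridge estimator and train/test splits:
from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Plot residuals against predicted values; a fan shape along the x axis
# is the visual cue for heteroscedasticity
visualizer = ResidualsPlot(Ridge())
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()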
YB extends the Scikit-Learn API
Which regularization technique to use? Lasso/L1, Ridge/L2, or ElasticNet L1+L2
Regularization uses a Norm to penalize complexity at a rate, alpha
The higher the alpha, the more the regularization.
Minimizing complexity reduces variance in the model, but increases bias
Goal: select the smallest alpha such that error is minimized
Visualize the tradeoff
Surprising to see: higher alpha increasing error, alpha jumping around, etc.
Embed R2, MSE, etc into the graph - quick reference
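A minimal sketch of visualizing this tradeoff with Yellowbrick's AlphaSelection and a cross-validated Lasso; the alpha range and the data X, y are placeholders:
import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Plot error against alpha to find the smallest alpha that minimizes error
alphas = np.logspace(-10, 1, 400)
visualizer = AlphaSelection(LassoCV(alphas=alphas))
visualizer.fit(X, y)
visualizer.poof()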
Want to contribute?
Here’s some information about how the Yellowbrick API works
YB extends the Scikit-Learn API
Where to hook in?
Estimators learn from data
Have a fit and predict method
Transformers transform data
Have a transform method
Visualizers can be estimators or transformers
Generally have a draw, finalize, and poof method
Contribute!
Needs: new features, testers, blog posts