20180526@Taiwan AI Academy, Professional Managers Class.
Covering important concepts of classical machine learning, in preparation for deep learning topics to follow. Topics include regression (linear, polynomial, Gaussian and sigmoid basis functions), dimension reduction (PCA, LDA, ISOMAP), clustering (K-means, GMM, Mean-Shift, DBSCAN, Spectral Clustering), classification (Naive Bayes, Logistic Regression, SVM, kNN, Decision Tree, Classifier Ensembles, Bagging, Boosting, AdaBoost) and Semi-Supervised learning techniques. Emphasis on sampling, probability, curse of dimensionality, decision theory and classifier generalizability.
Covering important topics of Classical Machine Learning in 16 hours, in preparation for the following 10 weeks of Deep Learning courses at Taiwan AI Academy from 2018/02-2018/05. Topics include regression (linear, polynomial, Gaussian and sigmoid basis functions), dimension reduction (PCA, LDA, ISOMAP), clustering (K-means, GMM, Mean-Shift, DBSCAN, Spectral Clustering), classification (Naive Bayes, Logistic Regression, SVM, kNN, Decision Tree, Classifier Ensembles, Bagging, Boosting, AdaBoost) and Semi-Supervised learning techniques. Emphasis on sampling, probability, curse of dimensionality, decision theory and classifier generalizability.
Machine Learning Foundations for Professional Managers
Albert Y. C. Chen
20180804@Taiwan AI Academy, Hsinchu
6-hour lecture for those new to machine learning, to grasp the concepts, advantages and limitations of various classical machine learning methods. More importantly, to learn the skills to break down large, complicated AI projects into manageable pieces, where features and functionalities can be added incrementally and annotated data accumulated. Take-home message: machine learning is always a delicate balance between model complexity M and number of data N so that the trained classifier generalizes well and does not overfit.
Tips for would-be founders, technical or non-technical, before rolling up their sleeves and developing their products! Topics range from various ways of "pretotyping" to accurately gauge target customers' response, to the lean method, minimum viable product, feature selection, planning a product with a robust data cycle, coping with delays, and guiding a team of rockstar engineers to build the right product and build the product right. Some personal experiences are shared at the end as case studies.
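The balance between model complexity M and number of data N can be illustrated with a small polynomial-fitting sketch. This is not taken from the lecture: the sine target, the noise level, the choice N = 10 and the degrees tried are all assumptions made for illustration, and the helper names (`fit_polynomial`, `mse`, `make_data`) are hypothetical. It fits polynomials of increasing degree M to the same N noisy points and reports the error on the training points and on separately drawn points.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of a degree-M polynomial via the normal equations."""
    n = degree + 1
    # Normal equations (A^T A) w = A^T y, where A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aty[r] - tail) / ata[r][r]
    return coeffs


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with these coefficients."""
    def predict(x):
        return sum(w * x ** j for j, w in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """n evenly spaced points on [-1, 1] with noisy sine targets (assumed toy task)."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    ys = [math.sin(math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(10, rng)  # N = 10 training points
test_x, test_y = make_data(50, rng)    # held-out points from the same process

errors = {}
for degree in (1, 3, 7):               # model complexity M
    w = fit_polynomial(train_x, train_y, degree)
    errors[degree] = (mse(w, train_x, train_y), mse(w, test_x, test_y))
    print(f"M={degree}: train MSE={errors[degree][0]:.4f}, "
          f"test MSE={errors[degree][1]:.4f}")
```

Training error can only go down as M grows, because each higher-degree model contains the lower-degree ones. The held-out error is the number to watch: when M is large relative to N it typically stops improving or gets worse, which is the overfitting the take-home message warns about.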
Shou-de Lin is currently a full professor in the CSIE department of National Taiwan University (NTU). He holds a BS in EE from National Taiwan University, an MS-EE from the University of Michigan, and an MS in Computational Linguistics and a PhD in Computer Science, both from the University of Southern California. He leads the Machine Discovery and Social Network Mining Lab at NTU. Before joining NTU, he was a post-doctoral research fellow at the Los Alamos National Lab. Prof. Lin's research covers machine learning and data mining, social network analysis, and natural language processing. His international recognition includes the best paper award at the IEEE Web Intelligence conference 2003, a Google Research Award in 2007, a Microsoft research award in 2008, a merit paper award at TAAI 2010, the best paper award at ASONAM 2011, and the US Aerospace AFOSR/AOARD research award for 5 years. He is the all-time winner of the ACM KDD Cup, having led or co-led the NTU team to 5 championships. He also led a team to win the WSDM Cup 2016 championship. He has served as a senior PC member for SIGKDD and an area chair for ACL. He is currently an associate editor for the International Journal on Social Network Mining, the Journal of Information Science and Engineering, and the International Journal of Computational Linguistics and Chinese Language Processing. He received the Young Scholars' Creativity Award from the Foundation for the Advancement of Outstanding Scholarship and the Ta-You Wu Memorial Award.
Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a broad overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile.
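Quick Tour of Machine Learning (機器學習速遊)
Machine learning aims to let computers improve themselves from experience accumulated in data. In recent years it has been widely applied to data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics. It has become one of the foundational disciplines of data science and an essential tool for any data scientist.
In this course, Professor Hsuan-Tien Lin (林軒田) of the Department of Computer Science and Information Engineering at National Taiwan University takes just six hours to guide everyone quickly through the foundations of machine learning, introducing the core models and some popular techniques, in the hope of helping everyone understand the field efficiently and solidly so as to make proper use of the various machine learning tools. The course suits all data analysts who want to start working with data, and is recommended to every data science enthusiast interested in data analysis.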
Research grants from the European Research Council (ERC) are great to have, but hard to get. In this talk, I give twelve personal tips that I have found helpful for preparing a grant proposal.
Slides of a talk at INRIA Nancy, 20 December 2017
Half-day session on machine learning and its applications. It introduces Artificial Intelligence, then moves on to Machine Learning: applications, algorithms, types, using the Cloud for ML, Deep Learning, and some resources to start with.
Machine Learning: Opening the Pandora's Box - Dhiana Deva @ QCon São Paulo 2019
Dhiana Deva
Introducing Machine Learning is like opening the Pandora's Box - it unveils important issues in your data, metrics, and product. In order to deal with such complexity, pragmatic practices are required to obtain reliable results. In this talk, we will go through learnings gained from introducing Machine Learning in different contexts, from academia, start-ups, consulting to tech giants - covering practices for experimentation, infrastructure, planning, performance evaluation and product vision in the context of machine learning products.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Automated attendance system based on facial recognition
Dhanush Kasargod
A MATLAB-based system that takes attendance in a classroom automatically using a camera. This project was carried out as a final-year project in our Electronics and Communications Engineering course. I have uploaded the entire MATLAB code to mathworks.com, and the entire report will be available on my academia.edu page. I will be delighted to hear from you.
ODSC India 2018: Topological space creation & Clustering at BigData scale
Kuldeep Jiwani
Every dataset has an inherent natural geometry associated with it. We are generally influenced by how the world visually appears to us and apply the same flat Euclidean geometry to data. The data geometry could be curved or have holes, and distances cannot be defined in all cases. If we still impose Euclidean geometry on it, we may distort the data space and destroy the information content inside it.
In the BigData world we regularly have to handle TBs of data and extract meaningful information from it, which requires applying many Unsupervised Machine Learning techniques. Two important steps in this process are building a topological space that captures the natural geometry of the data and then clustering in that topological space to obtain meaningful clusters.
This talk will walk through "Data Geometry" discovery techniques, first analytically and then via applied Machine Learning methods, so that listeners can take away hands-on techniques for discovering the real geometry of their data. Attendees will be presented with various BigData techniques, along with Apache Spark code showing how to build data geometry over massive data lakes.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
- A high-level overview of artificial intelligence
- The importance of predictions across different domains of life
- Big (text) data
- Competition as a discovery process
- Domain-general learning
- Computer vision and natural language processing
- Elements of a machine learning system
- A hierarchy of problem classes
- Data collection
- The purpose of a model
- Logistic loss function
- Likelihood, log likelihood and maximum likelihood
- Ockham's Razor
- Intelligence as sequence prediction
- Building blocks of neural networks: neurons, weights and layers
- Logistic regression as a neural network
- Sigmoid function
- A look at backpropagation
- Gradient descent
- Convolutional neural networks
- Max-pooling
- Deep neural networks
Modeling and Aggregation of Complex Annotations via Annotation Distances
- Alex Braylan and Matt Lease
- The University of Texas at Austin
ABSTRACT
Modeling annotators and their labels is valuable for ensuring collected data quality. Though many models have been proposed for binary or categorical labels, prior methods do not generalize to complex annotations (e.g., open-ended text, multivariate, or structured responses) without devising new models for each specific task. To obviate the need for task-specific modeling, we propose to model distances between labels, rather than the labels themselves. Our models are largely agnostic to the distance function; we leave it to the requesters to specify an appropriate distance function for their given annotation task. We propose three models of annotation quality, including a Bayesian hierarchical extension of multidimensional scaling which can be trained in an unsupervised or semi-supervised manner. Results show the generality and effectiveness of our models across diverse complex annotation tasks: sequence labeling, translation, syntactic parsing, and ranking.
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Maninda Edirisooriya
The supervised ML technique K-Nearest Neighbors and unsupervised clustering techniques are covered in this lesson. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
Find Your Passion and Make a Difference in Your Career
Albert Y. C. Chen
20180314 at National Taiwan Normal University.
Reflection on my own career, from being inspired to work on CV/ML research during my graduate studies at NTNU, to going abroad to obtain my Ph.D., and later my career in this field. The talk emphasizes the importance of innovation and how to realize one's new ideas within large and small organizations.
AI gold rush, tool vendors and the next big thing
2017/12/27 at Mediatek
- Overview of booming AI applications, from media, entertainment, e-commerce, autonomous driving, surveillance, industrial inspection, medical imaging, bioinformatics, finance, etc., along with expert predictions of their market size and growth.
- Dissect the applications with largest size and growth into their technical components and their unmet demands.
- Among all the unmet demands and uncertainties in this AI gold rush, what should an IC design company do? I’ll briefly cover NVIDIA’s case, which most of us know well already, then supplement case studies of Qualcomm, Intel, Google TPU and other smaller firms.
Even when we have a clear target, it takes years for supporting libraries and software to be properly optimized. I’ll share some thoughts and personal experiences on how to make sequentially-ordered hardware/software/library optimization happen faster and in parallel, and the tools that the IC design house needs to provide in order for it to happen.
Practical computer vision: A problem-driven approach towards learning CV/ML/DL
Albert Y. C. Chen
Albert Chen Ph.D., 20170726 at Academia Sinica, Taiwan
Invited Speech during Academia Sinica's AI month
Think different, in Finance. An outsider's two cents on how finance majors could rethink their role and value in the rapidly changing AI era, with some FinTech case studies.
Albert Y. C. Chen, Ph.D., VP of R&D at Viscovery--Visual Search, Simply Smarter.
Invited speech at Automatic Optical Inspection Equipment Association (AOIEA) Annual Summit, Taiwan, 2017/06/15, "Deep Learning and Automatic Optical Inspection".
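Dr. Albert Y. C. Chen (陳彥呈), VP of R&D at Viscovery, gave the keynote "The Wave of AOI Transformation under Artificial Intelligence: Breakthroughs and New Opportunities in Image Recognition Technology" at the Automatic Optical Inspection Equipment Association members' annual meeting on June 15, 2017.
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert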
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. We ended with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Securing your Kubernetes cluster: a step-by-step guide to success!
KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
2. Albert Y. C. Chen, Ph.D.
陳彥呈博士
albert@viscovery.com
http://www.linkedin.com/in/aycchen
http://slideshare.net/albertycchen
• Experience
2017-present: Vice President of R&D @ Viscovery
2015-2017: Chief Scientist @ Viscovery
2015-2015: Principal Scientist @ Nervve Technologies
2013-2014: Senior Scientist @ Tandent Vision Science
2011-2012: @ GE Global Research, Computer Vision Lab
• Education
Ph.D. in Computer Science, SUNY-Buffalo
M.S. in Computer Science, NTNU
B.S. in Computer Science, NTHU
3. When something is important enough,
you do it even if the odds are not in your favor.
Elon Musk
Falcon 9
takeoff
Falcon 9
decelerate
Falcon 9
vertical
touchdown
4. What is “Machine Learning”?
• Machine Learning (ML):
• Human Learning:
• Manual Programming:
rules
5. When to program manually?
• Deterministic problems: repeat 1B
times, still get the same answer,
• problems lacking data,
• problems with easily separable data.
When to use machine learning? (our focus today)
• Data with noise,
• data of high dimension,
• data of large volume,
• data that changes over time.
Manual Programming vs Machine Learning
6. • Data easily separable with Exploratory Data
Analysis (EDA), e.g., box plots, histograms, and scatter plots.
• What if the data remains messy/inseparable?
Problems with easily Separable Data
7. • Automatic seafood sorting machine
• How do we sort them? By length? By weight?
Dealing with not-so-separable data?
Salmon
vs
Seabass
8. • Sort salmon and sea bass by weight? hmm...
Dealing with not-so-separable data?
9. • Sort salmon and sea bass by color? slightly better
Dealing with not-so-separable data?
10. • What if we sort salmon and sea bass with both
weight and color? Much better, but still...
Dealing with not-so-separable data?
11. What if we add another feature?
• More features ≠ better: when the number of features is multiplied by N,
the feature space grows to its N-th power, and the number of samples
needed for ML grows in proportion to it.
12. • Most of the volume of an n-D sphere is
concentrated in a thin shell near the surface!!!
• For a D-dimensional sphere of radius $r = 1$, the fraction of its volume
lying between $r = 1 - \epsilon$ and $r = 1$ is: $1 - (1 - \epsilon)^D$
The curse of dimensionality
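The thin-shell claim can be checked numerically from the expression $1 - (1 - \epsilon)^D$. A minimal sketch; the shell thickness `eps = 0.05` and the function name `shell_fraction` are illustrative assumptions, not values from the slides:

```python
def shell_fraction(eps: float, dim: int) -> float:
    """Fraction of a unit D-ball's volume lying between r = 1 - eps and r = 1."""
    return 1.0 - (1.0 - eps) ** dim

# eps = 0.05 is an illustrative choice, not a value from the slides.
for dim in (1, 2, 10, 100, 1000):
    print(f"D = {dim:4d}: {shell_fraction(0.05, dim):.4f}")
```

Even with a shell only 5% of the radius thick, the fraction approaches 1 as D grows, which is why uniformly sampled high-dimensional points almost all sit near the surface.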
13. • The curse of dimensionality affects not only the
feature space, but also the input, the output, and more.
• It is much more challenging to train a good n-class
classifier, e.g., face recognition: 1-to-1
verification vs 1-to-n identification.
• Many more issues arise from using a general-purpose
1M-class classifier than from a problem-specific
1k-class classifier.
Problems with high dimensionality are prevalent
14. Recognition
Accuracy:
• 1 to 1: 99%+
• 1 to 100: 90%
• 1 to 10,000:
50%-70%.
• 1 to 1M: 30%.
LFW dataset, common FN↑, FP↓
Prevalent high-dim problem, eg.1
• 1-to-N face identification, in the wild!
15. Prevalent high-dim problem, eg.2
• Smart photo album, with Google Cloud Vision
Distance between
histograms of 1M bins
is very close to 0 for
most of the time.
16. • Real data will often be confined to a region of
the space having lower effective dimensionality.
• Data will typically exhibit some smoothness
properties (at least locally).
Living with high dimensions
E.g., Low-dimensional
“manifold” of faces,
embedded within a
high-dim space.
Keywords:
• dimension reduction,
• learned features,
• manifold learning.
17. • Data is often not clean and easily separable.
• Sometimes, data is way too noisy
• A way to deal with that is to add additional
features/measurements, but we run into the
problem of: feature dimension >> # data
• Sometimes, the data volume is too large to be
put into memory and learned at once.
• Sometimes, the data evolves over time.
That's what machine learning is about
19. We present you,
a simple & usable map for ML!
Unsupervised, continuous: Dimension Reduction
Unsupervised, discrete: Clustering
Supervised, continuous (predicting a quantity): Regression
Supervised, discrete (predicting a category): Classification
21. Dimension Reduction
Machine Learning Roadmap
22. • Goal: try to find a more compact
representation of the data
• Assume that the high
dimensional data actually
reside in an inherent low-
dimensional space.
• Additional dimensions are
just random noise
• Goal is to recover these
inherent dimensions and
discard noise.
Unsupervised Dimension Reduction
23. • Create a basis where
the axes represent the
dimensions of variance,
from high to low.
• Finds correlations in
data dimensions to
produce the best possible
lower-dimensional
representation based
on linear projections.
Principal Component Analysis (PCA)
25. PCA algorithm, conceptual steps
• Find a line s.t. when data is projected onto the
line, it has the maximum variance.
26. • Find new line orthogonal to the first that has the
maximum projected variance.
PCA algorithm, conceptual steps
27. • Repeat until d lines are found. The projected position of
a point on these lines gives its coordinates in
the d-dimensional reduced space.
• Computing this set of lines is achieved by
eigen-decomposition of the covariance matrix.
PCA algorithm, conceptual steps
28. • View PCA as minimizing the reconstruction error
of using a low-dimensional approximation of the
original data.
Alternative view of PCA
29. • Calculate the covariance matrix of the data S
• Calculate the eigen-vectors/eigen-values of S
• Rank the eigen-values in decreasing order
• Select eigen-vectors that retain a fixed % of the
variance, e.g., 80%, s.t., $\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i} \lambda_i} \ge 80\%$
Dimension Reduction using PCA
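The four PCA steps (covariance, eigen-decomposition, ranking, variance cut-off) can be sketched with NumPy. This is an illustration under assumptions: the function name `pca_eig`, the toy data, and the random seed are hypothetical; only the 80% threshold comes from the slide.

```python
import numpy as np

def pca_eig(X, keep=0.80):
    """PCA by eigen-decomposition of the covariance matrix S.

    X is an (n_samples, n_features) array; `keep` is the fraction of
    variance to retain (0.80 mirrors the 80% example in the text).
    """
    Xc = X - X.mean(axis=0)                    # center the data
    S = np.cov(Xc, rowvar=False)               # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh, because S is symmetric
    order = np.argsort(eigvals)[::-1]          # rank eigenvalues, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(ratio, keep) + 1)  # smallest d whose ratio reaches `keep`
    return Xc @ eigvecs[:, :d], eigvecs[:, :d], eigvals

rng = np.random.default_rng(0)
# Toy data (an assumption): 3-D points whose variance lies mostly along one axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
Z, W, lam = pca_eig(X)
print(Z.shape, lam.round(2))
```

On this toy data a single component already retains more than 80% of the variance, so the projected data `Z` has one column.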
30. PCA example: Eigenfaces
Mean face
Basis of variance (eigenvectors)
M. Turk; A. Pentland (1991). "Face recognition using eigenfaces".
Proc. IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–591.
31. The ATT face database (formerly the ORL
database), 10 pictures of 40 subjects each
32. • Covariance of the image data is big. Finding
eigenvectors of large matrices is slow.
• Singular Value Decomposition (SVD) can be
used to compute principal components.
• SVD steps:
• Create centered data matrix X
• Solve: $X = U S V^T$
• Columns of V are the eigenvectors of the covariance matrix $\Sigma$,
sorted from largest to smallest eigenvalues.
PCA, scaling up
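The SVD route can be cross-checked against a direct eigen-decomposition of the covariance matrix. A small sketch on made-up data (the random matrix and seed are assumptions); it relies on the covariance eigenvalues being the squared singular values divided by n - 1:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # toy data (an assumption)
Xc = X - X.mean(axis=0)                                  # centered data matrix

# Thin SVD: Xc = U @ diag(s) @ Vt; rows of Vt (columns of V) are the principal axes.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_from_svd = s**2 / (Xc.shape[0] - 1)   # eigenvalues of the covariance matrix

# Cross-check against a direct eigen-decomposition of the covariance matrix.
eigvals_direct = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
print(np.allclose(eigvals_from_svd, eigvals_direct))
```

The SVD never forms the covariance matrix explicitly, which is the point of using it when the feature dimension (e.g., pixels per image) is large.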
35. • Useful preprocessing for easing the "curse of
dimensionality" problem.
• Reduced dimension: simpler hypothesis
space
• Smaller VC dimension: less overfitting
• PCA can also be seen as noise reduction
• Fails when data consists of multiple separate
clusters
PCA discussion
36. • Also named Fisher Discriminant Analysis
• It can be viewed as
• a dimension reduction method,
• a generative classifier p(x|y), Gaussian with
distinct $\mu$ for each class but shared $\Sigma$.
Linear Discriminant Analysis (LDA)
Figure: classes mixed vs. better separation
37. • Find a projection direction so that the separation
between classes is maximized.
• Objective 1: maximize the distance between the
projected means of different classes
original means: $m_1 = \frac{1}{N_1} \sum_{x \in C_1} x$, $m_2 = \frac{1}{N_2} \sum_{x \in C_2} x$
projected means: $m'_1 = \frac{1}{N_1} \sum_{x \in C_1} w^T x$, $m'_2 = \frac{1}{N_2} \sum_{x \in C_2} w^T x$
LDA Objectives
38. • Objective 2: minimize scatter (variance within
class)
Total within class scatter for projected class i: $s_i^2 = \sum_{x \in C_i} (w^T x - m'_i)^2$
Total within class scatter: $s_1^2 + s_2^2$
LDA Objectives
39. • There are a number of different ways to combine
the two objectives.
• LDA seeks to optimize the following objective:
LDA Objective
42. • Objective remains the same, $J(w) = \frac{w^T S_B w}{w^T S_w w}$, with slightly
different definition for between-class scatter: $S_B = \frac{1}{k} \sum_{i=1}^{k} (m_i - m)(m_i - m)^T$
• Solution: k-1 eigenvectors of $S_w^{-1} S_B$
LDA for Multi-Classes
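The multi-class solution, the leading k-1 eigenvectors of $S_w^{-1} S_B$, can be sketched as follows. Assumptions to note: the function name `lda_directions` is hypothetical, $S_w$ is taken as the sum of per-class scatter matrices, $m$ is taken as the mean of all samples, and the two-class toy data is made up.

```python
import numpy as np

def lda_directions(X, y):
    """Return the k-1 leading eigenvectors of S_w^{-1} S_B as columns."""
    classes = np.unique(y)
    k, m = len(classes), X.mean(axis=0)       # m: overall mean (an assumption)
    S_w = np.zeros((X.shape[1], X.shape[1]))
    S_B = np.zeros_like(S_w)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)        # within-class scatter
        S_B += np.outer(mc - m, mc - m) / k   # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_B))
    order = np.argsort(eigvals.real)[::-1][: k - 1]
    return eigvecs.real[:, order]

rng = np.random.default_rng(0)
# Toy data (an assumption): two classes separated along the first axis only.
X0 = rng.normal(loc=[0.0, 0.0], scale=[0.5, 3.0], size=(100, 2))
X1 = rng.normal(loc=[3.0, 0.0], scale=[0.5, 3.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
W = lda_directions(X, y)
print(W.ravel())
```

PCA on the same data would pick the second axis (largest variance), while this direction follows the first axis, where the classes actually separate.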
43. • Data often lies on
or near a nonlinear
low-dimensional
curve.
• We call such
low-d structures
manifolds.
• Algorithms include:
ICA, LLE, Isomap.
Nonlinear Dimension Reduction
swiss roll data
44. • A non-linear method for dimensionality reduction
• Preserves the global, nonlinear geometry of the
data by preserving the geodesic distances.
• Geodesic: shortest route between two points on
the surface of a manifold.
ISOMAP: Isometric Feature Mapping
45. 1. Approximate the geodesic distance between
every pair of points in the data.
• The manifold is locally linear
• Euclidean distance works well for points that
are close enough.
• For points that are far apart, their geodesic
distance can be approximated by summing
up local Euclidean distances.
2. Find a Euclidean mapping of the data that
preserves the geodesic distance.
ISOMAP algorithm
46. • Construct a graph by:
• Connecting i and j if:
• d(i,j) < ε (if computing ε-isomap), or
• i is one of j's k nearest neighbors (k-isomap)
• Set the edge weight equal to d(i,j), the Euclidean
distance
• Compute the Geodesic distance between any
two points as the shortest path distance.
Geodesic Distance
47. • We can use Multi-Dimensional Scaling (MDS), a
class of statistical techniques that:
• Given:
• n x n matrix of dissimilarities between n
objects
• Outputs:
• a coordinate configuration of the data in low-d
space $\mathbb{R}^d$ whose Euclidean distances closely
match given dissimilarities.
Compute low-dimensional mapping
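The whole ISOMAP pipeline (neighborhood graph, shortest-path geodesics, then MDS) fits in a short NumPy sketch. Assumptions: this is the k-isomap variant, the shortest paths use a plain Floyd-Warshall loop, the embedding uses classical MDS, the neighborhood graph is assumed connected, and the function name `isomap` and the half-circle test data are illustrative.

```python
import numpy as np

def isomap(X, n_neighbors=6, n_components=2):
    """k-isomap sketch: kNN graph -> shortest paths -> classical MDS."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean d(i,j)

    # Keep only edges to each point's k nearest neighbors (symmetrized).
    G = np.full((n, n), np.inf)
    nearest = np.argsort(D, axis=1)[:, 1 : n_neighbors + 1]
    rows = np.repeat(np.arange(n), n_neighbors)
    G[rows, nearest.ravel()] = D[rows, nearest.ravel()]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)

    # Floyd-Warshall: geodesic distance = shortest path through the graph.
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])

    # Classical MDS on the geodesic distances (assumes a connected graph).
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (G**2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

t = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t)])   # half circle: a 1-D manifold in 2-D
Z = isomap(X, n_neighbors=4, n_components=1)
print(abs(np.corrcoef(t, Z[:, 0])[0, 1]))
```

On the half circle the 1-D embedding is almost perfectly correlated with the angle `t`, i.e., the curve has been "unrolled" along its geodesic.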
52. • Sometimes, the data volume is large.
• Group together similar points and represent
them with a single token.
• Issues:
• How do we define two points/images/patches
being "similar"?
• How do we compute an overall grouping from
pairwise similarity?
Clustering
53. • Grouping pixels of similar appearance and
spatial proximity together; there are so many ways
to do it, yet none is perfect.
Clustering Example
55. • Summarizing Data
• Look at large amounts of data
• Patch-based compression or denoising
• Represent a large continuous vector with the
cluster number
• Counting
• Histograms of texture, color, SIFT vectors
• Segmentation
• Separate the image into different regions
• Prediction
• Images in the same cluster may have the same
labels
Why do we cluster?
56. • K-means
• Iteratively re-assign points to the nearest cluster
center
• Gaussian Mixture Model (GMM) Clustering
• Mean-shift clustering
• Estimate modes of pdf
• Hierarchical clustering
• Start with each point as its own cluster and
iteratively merge the closest clusters
• Spectral clustering
• Split the nodes in a graph based on assigned
links with similarity weights
How do we cluster?
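57. • Goal: cluster to minimize variance in data given
clusters while preserving information.
$c^*, \delta^* = \arg\min_{c, \delta} \frac{1}{N} \sum_{j=0}^{N} \sum_{i=0}^{K} \delta_{i,j} (c_i - x_j)^2$
where $c_i$ is a cluster center, $x_j$ is a data point, and $\delta_{i,j}$ indicates whether $x_j$ is assigned to $c_i$.
Clustering for Summarization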
58. • Euclidean Distance:
$\mathrm{distance}(x, y) = \sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2 + \cdots + (y_n - x_n)^2} = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$
$\|y - x\| = \sqrt{(y - x) \cdot (y - x)}$
• Cosine similarity:
$x \cdot y = \|x\|_2 \, \|y\|_2 \cos\theta$
$\mathrm{similarity}(x, y) = \cos(\theta) = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$
$\theta = \arccos\left( \frac{x \cdot y}{|x| \, |y|} \right)$
How do we measure similarity?
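Both measures are one-liners in NumPy. A small sketch with hypothetical function names and made-up vectors, chosen so the difference between the two measures is visible:

```python
import numpy as np

def euclidean_distance(x, y):
    return float(np.sqrt(np.sum((y - x) ** 2)))

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([3.0, 4.0])
y = np.array([6.0, 8.0])   # same direction as x, twice the length
print(euclidean_distance(x, y))   # 5.0
print(cosine_similarity(x, y))    # 1.0 up to floating-point rounding
```

The two vectors are far apart in Euclidean terms yet maximally similar by cosine, since cosine similarity ignores magnitude.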
59. • Compare distance of closest (NN1) and second
closest (NN2) feature vector neighbor.
• If NN1≈NN2, ratio NN1/NN2 will be ≈1 →
matches too close.
• As NN1 << NN2, ratio NN1/NN2 tends to 0.
• Sorting by this ratio puts matches in order of
confidence.
Nearest Neighbor Distance Ratio
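The ratio test can be sketched as below. The function name `ratio_test_matches` and the toy descriptors are hypothetical, and the default `max_ratio=0.8` is only a placeholder: the appropriate threshold depends on the data and application.

```python
import numpy as np

def ratio_test_matches(query, database, max_ratio=0.8):
    """Match each query descriptor to its nearest database descriptor,
    keeping only matches whose NN1/NN2 distance ratio is below max_ratio.
    Returns (query_index, database_index, ratio) tuples sorted by ratio."""
    matches = []
    for qi, q in enumerate(query):
        dists = np.linalg.norm(database - q, axis=1)
        nn1, nn2 = np.argsort(dists)[:2]          # closest and second closest
        ratio = dists[nn1] / dists[nn2]
        if ratio < max_ratio:
            matches.append((qi, int(nn1), float(ratio)))
    return sorted(matches, key=lambda m: m[2])    # most confident first

database = np.array([[0.0, 0.0], [10.0, 0.0], [10.5, 0.0]])
query = np.array([[0.1, 0.0], [10.25, 0.0]])
matches = ratio_test_matches(query, database)
print(matches)
```

The first query has one clearly closest neighbor and is kept; the second sits exactly between two database points (ratio 1) and is rejected as ambiguous.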
60. • How to threshold the nearest neighbor ratio?
Nearest Neighbor Distance Ratio
Lowe IJCV
2004 on
40,000
points.
Threshold
depends on
data and
specific
applications
61. 1. Randomly select k initial cluster centers
2. Assign each point to nearest center
3. Update cluster centers as the mean of the points
4. repeat 2-3 until no points are re-assigned.
k-means clustering
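The four steps map directly onto a short NumPy loop. A minimal sketch, assuming Euclidean distance and a fixed random seed; the function name `kmeans`, the `max_iter` safety cap, and the two-blob toy data are illustrative choices:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)                    # step 2
        if np.array_equal(new_labels, labels):               # step 4: nothing re-assigned
            break
        labels = new_labels
        for j in range(k):                                   # step 3
            if np.any(labels == j):       # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(0)
# Toy data (an assumption): two tight, well-separated blobs.
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
centers, labels = kmeans(X, k=2)
print(centers.round(2))
```

Because the result depends on the random initial centers, running several seeds and keeping the lowest-variance solution is a common safeguard.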
63. • Initialization
• Randomly select K points as initial cluster
center
• Greedily choose K points to minimize residual
• Distance measures
• Euclidean or others?
• Optimization
• Will converge to local minimum
• May want to use the best out of multiple trials
k-means: design choices
64. • Cluster on one set, use another (reserved) set to
test K.
• Minimum Description Length (MDL) principle for
model comparison.
• Minimize Schwarz Criterion, a.k.a. Bayesian
Information Criterion (BIC)
• (When building dictionaries, more clusters
typically work better.)
How to choose k
65. • Generative
• How well are points reconstructed from the
cluster?
• Discriminative
• How well do the clusters correspond to labels
(purity)
How to evaluate clusters?
66. • Pros
• Finds cluster centers that minimize conditional
variance (good representation of data)
• simple and fast
• easy to implement
k-means pros & cons
67. • Cons
• Need to choose K
• Sensitive to outliers
• Prone to local minima
• All clusters have the same parameters
• Can be slow. Each iteration is O(KNd) for N d-
dimensional points
k-means pros & cons
68. • Clusters are spherical
• Clusters are well separated
• Clusters are of similar volumes
• Clusters have similar number of points
k-means works if
69. • Hard assignments, or probabilistic assignments?
• Case against hard assignments:
• Clusters may overlap
• Some clusters may be wider than others
• Can use a probabilistic model, $P(X \mid Y) P(Y)$
• Challenge: need to estimate model
parameters without labeled Ys.
GMM Clustering
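70. • Assume m-dimensional data points
• $P(Y)$ still multinomial, with k classes
• $P(X \mid Y = i),\ i = 1, \cdots, k$ are k
multivariate Gaussians
$P(X = x \mid Y = i) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_i|}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$
where $\mu_i$ is the mean (m-dim vector), $\Sigma_i$ is the variance (m*m matrix), and $|\Sigma_i|$ is the determinant of the matrix.
Gaussian Mixture Models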
72. • EM after 20 iterations
EM for GMM MLE example
73. • GMM for some bio assay data
EM for GMM MLE example
74. EM for GMM MLE example
• GMM for some bio
assay data, fitted
separately for three
different
compounds.
75. • For a GMM with hard assignments and unit variance,
EM is equivalent to the k-means clustering
algorithm!!!
• EM, like k-means, uses coordinate ascent, and can
get stuck in a local optimum.
GMM Clustering, notes
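To make the EM alternation concrete, here is a minimal sketch for a one-dimensional Gaussian mixture. Assumptions: the function name `em_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the synthetic two-component data are all illustrative and not taken from the slides.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    # Deterministic init (an assumption): spread the means over the quantiles.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, i] proportional to P(Y=i) P(x_n | Y=i).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = weights * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments.
        nk = r.sum(axis=0)
        weights = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return weights, mu, var

rng = np.random.default_rng(0)
# Synthetic data (an assumption): components centered at -2 and 3.
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
weights, mu, var = em_gmm_1d(x)
print(weights.round(2), mu.round(2), var.round(2))
```

Replacing the soft responsibilities `r` with a one-hot argmax and fixing the variances to 1 turns this loop into k-means, which is the equivalence noted above.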
76. • mean-shift seeks modes of a given set of points
1. Choose kernel and bandwidth
2. For each point:
1. center a window on that point
2. compute the mean of the data in the
search window
3. center the search window at the new
mean location, repeat 2,3 until converge.
3. Assign points that lead to nearby modes to
the same cluster.
Mean-Shift Clustering
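The mean-shift steps can be sketched as follows. Assumptions: a flat (uniform) kernel stands in for "choose kernel", modes closer than one bandwidth are merged into one cluster, and the function name `mean_shift`, the tolerance, and the toy data are illustrative.

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=100, tol=1e-6):
    """Flat-kernel mean shift: move every window to the mean of its neighbors."""
    modes = X.astype(float).copy()            # one window centered on each point
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for i, p in enumerate(modes):
            in_window = np.linalg.norm(X - p, axis=1) <= bandwidth
            shifted[i] = X[in_window].mean(axis=0)   # mean of data in the window
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    # Points whose modes end up within one bandwidth share a cluster label.
    labels = np.full(len(X), -1)
    centers = []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < bandwidth:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

rng = np.random.default_rng(0)
# Toy data (an assumption): two tight blobs far apart.
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
centers, labels = mean_shift(X, bandwidth=1.0)
print(len(centers), labels)
```

Note that the number of clusters is never specified; it falls out of the bandwidth, which is the main parameter to tune.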
77. • Try to find modes of a non-parametric density
Mean-shift algorithm
Color
space
Color
space
clusters
78. • Attraction basin: the region for which all
trajectories lead to the same mode.
• Cluster: all data points in the attraction basin of
a mode.
Attraction Basin
Slides by Y. Ukrainitz & B. Sarel
83. • Mean-shift can also be used as clustering-based
image segmentation.
Mean-Shift Segmentation
D. Comaniciu and P. Meer, Mean Shift: A Robust
Approach toward Feature Space Analysis, PAMI 2002.
84. • Compute features for each pixel (color, gradients,
texture, etc.).
• Set kernel size for features $K_f$ and position $K_s$.
• Initialize windows at individual pixel locations.
• Run mean shift for each window until convergence.
• Merge windows that are within width of $K_f$ and $K_s$.
Mean-Shift Segmentation
85. • Speedups:
• binned estimation
• fast neighbor search
• update each window in each iteration
• Other tricks
• Use kNN to determine window sizes
adaptively
Mean-Shift
86. • Pros
• Good general-practice segmentation
• Flexible in number and shape of regions
• robust to outliers
• Cons
• Have to choose kernel size in advance
• Not suitable for high-dimensional features
Mean-Shift pros & cons
87. • DBSCAN: Density-based spatial
clustering of applications with noise.
• Density: number of points within a
specified radius (ε-Neighborhood)
• Core point: a point with more than
a specified number of points
(MinPts) within ε.
• Border point: has fewer than
MinPts within ε, but is in the
neighborhood of a core point.
• Noise point: any point that is not a
core point or border point.
DBSCAN
Figure (MinPts = 4, radius ε): p is a core point, q is a border point, o is a noise point.
88. • Density-reachable: p is density-reachable
from q w.r.t. ε and
MinPts if there is a chain of
objects p1, ..., pn with p1=q and
pn=p, s.t. pi+1 is directly density-reachable
from pi w.r.t. ε and
MinPts for all 1 ≤ i < n.
• Density-connectivity: p is
density-connected to q w.r.t. ε
and MinPts if there is an object
o, s.t. both p and q are density-reachable
from o w.r.t. ε and
MinPts.
DBSCAN
89. • Cluster: a cluster C in a set of objects D w.r.t. ε
and MinPts is a non-empty subset of D satisfying
• Maximality: for all p,q, if p ∈ C and q is
density-reachable from p w.r.t. ε and MinPts, then q ∈ C.
• Connectivity: for all p,q ∈ C, p is density-connected
to q w.r.t. ε and MinPts in D.
• Note: cluster contains core & border points.
• Noise: objects which are not directly density-reachable
from at least one core object.
DBSCAN clustering
90. 1. Select a point p
2. Retrieve all points density-reachable from p
w.r.t. ε and MinPts.
1. if p is a core point, a cluster is formed
2. if p is a border point, no points are density
reachable from p and DBSCAN visits the
next point of the database
3. continue 1,2, until all points are processed.
(result independent of process ordering)
DBSCAN clustering algorithm
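The procedure can be sketched compactly with a precomputed distance matrix. Assumptions: a point counts as core when at least `min_pts` points (itself included) lie within `eps`, which is one common convention; the function name `dbscan` and the grid-plus-outlier toy data are illustrative.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # includes i itself
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or not is_core[p]:
            continue                      # already clustered, or not a core point
        labels[p] = cluster
        frontier = [p]
        while frontier:                   # collect everything density-reachable from p
            q = frontier.pop()
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = cluster
                    if is_core[r]:        # only core points extend the cluster
                        frontier.append(r)
        cluster += 1
    return labels

# Toy data (an assumption): two dense 3x3 grids and one isolated point.
grid = np.array([[i * 0.1, j * 0.1] for i in range(3) for j in range(3)])
X = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])
labels = dbscan(X, eps=0.3, min_pts=4)
print(labels)
```

The isolated point is never reached from any core point, so it keeps the noise label -1 instead of being forced into a cluster.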
91. • Heuristic: for points in a cluster, their kth nearest
neighbors are at roughly the same distance.
• Noise points have the kth nearest neighbor at
farthest distance.
• So, plot sorted distance of every point to its kth
nearest neighbor.
DBSCAN parameters
sharp change;
good candidate
for ε and MinPts.
92. • Pros
• No need to decide K beforehand,
• Robust to noise, since it doesn't require every
point being assigned nor partition the data.
• Scales well to large datasets with .
• Stable across runs and different data ordering.
• Cons
• Trouble when clusters have different densities.
• ε may be hard to choose.
DBSCAN pros & cons
94. • Method:
1. Every point is its own cluster
2. Find closest pair of clusters, merge into one
3. repeat
• The definition of closest is what differentiates
various flavors of agglomerative clustering
algorithms.
Agglomerative Clustering
95. • How to define the linkage/cluster similarity?
• Maximum or complete-linkage clustering
(a.k.a., farthest neighbor clustering)
• Minimum or single-linkage clustering
(a.k.a., nearest neighbor clustering)
• Mean or average linkage clustering (UPGMA)
• Centroid linkage clustering (UPGMC)
• Minimum Energy Clustering
• Sum of all intra-cluster variance
• Increase in variance for clusters being merged
Agglomerative Clustering
Figure panels: single linkage, complete linkage, average linkage, centroid linkage
96. • How many clusters?
• Clustering creates a dendrogram (a tree)
• Threshold based on max number of clusters or
based on distance between merges.
Agglomerative Clustering
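A naive version of the merge loop, stopped at a maximum number of clusters, can be sketched as below. Assumptions: the function name `agglomerative` is hypothetical, only single, complete, and average linkage are covered, and the brute-force pair search is chosen for readability rather than speed.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Merge the closest pair of clusters until n_clusters remain."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]      # every point is its own cluster
    link = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = link(D[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)                 # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Toy data (an assumption): two groups of three points on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerative(X, n_clusters=2, linkage=linkage))
```

Recording the distance `best[0]` at each merge would give the heights of the dendrogram, which is what a distance-based stopping threshold would be compared against.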
97. • Pros
• Simple to implement, widespread application
• Clusters have adaptive shapes
• Provides a hierarchy of clusters
• Cons
• May have imbalanced clusters
• Still have to choose the number of clusters or
thresholds
• Need to use an ultrametric to get a meaningful
hierarchy
Agglomerative Clustering
98. • Group points based on links in a graph
Spectral Clustering
A
B
99. • Normalized Cut
• A cut in a graph that penalizes large
segments
• Fix by normalizing for size of segments
volume(A) = sum of costs of all edges that
touch A
Spectral Clustering
$\mathrm{NormalizedCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{volume}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{volume}(B)}$
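The normalized cut of a given partition is easy to evaluate from an affinity matrix. A minimal sketch; the function name `normalized_cut` and the 4-node graph are made up, and the volume of a segment is computed as the sum of the weighted degrees of its nodes:

```python
import numpy as np

def normalized_cut(W, in_A):
    """W: symmetric affinity matrix; in_A: boolean mask selecting segment A."""
    in_A = np.asarray(in_A, dtype=bool)
    cut = W[in_A][:, ~in_A].sum()        # total weight of edges between A and B
    vol_A = W[in_A].sum()                # weighted degree of the nodes in A
    vol_B = W[~in_A].sum()               # weighted degree of the nodes in B
    return cut / vol_A + cut / vol_B

# Toy graph (an assumption): two tightly linked pairs joined by one weak edge.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
balanced = normalized_cut(W, [True, True, False, False])
lopsided = normalized_cut(W, [True, False, False, False])
print(balanced, lopsided)
```

Cutting the weak edge gives a much lower score than slicing off a single node, which is the behavior the normalization is designed to produce.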
100. • Determining importance by random walk
• What's the probability of visiting a given node?
• Create adjacency matrix based on visual similarity
• Edge weights determine probability of transition
Visual Page Rank
Jing & Baluja 2008
101. • Quantization/Summarization: K-means
• aims to preserve variance of original data
• can easily assign new point to a cluster
Which Clustering Algorithm to use?
Quantization for computing
histograms
Summary of 20,000 photos of Rome using “greedy k-means”
http://grail.cs.washington.edu/projects/canonview/
102. • Image segmentation: agglomerative clustering
• More flexible with distance measures (e.g.,
can be based on boundary prediction)
• adapts better to specific data
• hierarchy can be useful
Which Clustering Algorithm to use?
http://www.cs.berkeley.edu/~arbelaez/UCM.html
103. • K-means useful for
summarization, building
dictionaries of patches,
general clustering.
• Agglomerative clustering
useful for segmentation,
general clustering.
• Spectral clustering useful for
determining relevance,
summarization, segmentation.
Which Clustering Algorithm to use?
112. • In correlation, two variables are treated as
independent.
• In regression, one variable (x) is independent,
while the other (y) is dependent.
• Goal: if you know something about x, this would
help you predict something about y.
Regression
113. • Expected value at a
given level of x: $y = w_0 + w_1 x$ (fixed exactly on the line)
• Predicted value for a
new x: $y' = w_0 + w_1 x + \epsilon$, where $\epsilon$ is a random error that
follows a normal distribution with 0 mean and variance $\sigma^2$
Simple Linear Regression
Figure: regression line on x and y axes, annotated with $w_0$ and $w_0/w_1$
114. Multiple Linear Regression
$y(x, w) = w_0 + w_1 x_1 + \cdots + w_D x_D$
• Linear function of parameters $w_0, ..., w_D$, also a
linear function of the input variables $x_i$, has very
restricted modeling power (can't even fit curves).
• Assumes that:
• The relationship between X and Y is linear.
• Y is distributed normally at each value of X.
• The variance of Y at each value of X is the
same.
• The observations are independent.
115. • Before going further, let’s take a look at
polynomial line fitting (polynomial regression.)
Linear Regression
Given N=10 blue dots, try to find the function $\sin(2\pi x)$
that is used for generating the data points.
116. • Polynomial line fitting: $y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M + \epsilon$
• M is the order of the polynomial
• linear function of the coefficients $w$
• nonlinear function of $x$
• Objective: minimize the error between the
predictions $y(x_n, w)$ and the target value $t_n$ of $x_n$:
$E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2$
or, the root-mean-square error $E_{RMS} = \sqrt{2 E(w^*) / N}$
Polynomial Regression
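The fit and the RMS error can be reproduced with a least-squares solve. A sketch under assumptions: the noise level (0.2), the random seed, the evenly spaced inputs, and the function names `fit_polynomial` and `rms_error` are illustrative; only N = 10 and the $\sin(2\pi x)$ generator come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
# Noisy targets; the noise level is an assumption.
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def fit_polynomial(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes E(w)
    return w

def rms_error(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2 * E / len(x))

errors = {M: rms_error(fit_polynomial(x, t, M), x, t) for M in (0, 1, 3, 9)}
for M, e in errors.items():
    print(M, e)
```

The training error can only go down as M grows and is essentially zero at M = 9, where the polynomial passes through all 10 points; that says nothing about the error on new data.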
118. • There are only 10 data points, i.e., 9 degrees of
freedom; we can get 0 training error when M=9.
• Food for thought: make sure your deep neural
network is not just "memorizing" the training
data when its M >> the data's DoF.
Polynomial regression w. var. M
119. • With M=9, but N=15 (left) and N=100 (right), the
over-fitting problem is greatly reduced.
• ML is all about balancing M and N. One rough
heuristic is that N should be 5x-10x of M (model
complexity, not necessarily the number of param.)
What happens with more data?
120. • Regularization: used for controlling over-fitting.
• E.g., discourage coefficients from reaching
large values:
$\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \|w\|^2$
where $\|w\|^2 = w^T w = w_0^2 + w_1^2 + \cdots + w_M^2$
Regularization
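For a polynomial model, the regularized error has a closed-form minimizer: setting the gradient to zero gives $(\lambda I + \Phi^T \Phi) w = \Phi^T t$, where $\Phi$ is the matrix of powers of x. A minimal sketch; the function name `fit_ridge_polynomial`, the data, and the $\lambda$ values tried are assumptions:

```python
import numpy as np

def fit_ridge_polynomial(x, t, M, lam):
    """Minimize 0.5 * sum (y - t)^2 + 0.5 * lam * ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^M
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)  # noise level is an assumption

norms = {}
for lam in (1e-6, 1e-3, 1.0):
    w = fit_ridge_polynomial(x, t, M=9, lam=lam)
    norms[lam] = float(np.linalg.norm(w))
    print(lam, norms[lam])
```

Larger $\lambda$ shrinks the weight vector, trading a slightly worse training fit for a smoother curve.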
121. • Extending linear regression to linear
combinations of fixed nonlinear functions:
$y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x)$
where $w = (w_0, \ldots, w_{M-1})^T$, $\phi = (\phi_0, \ldots, \phi_{M-1})^T$
• Basis functions $\{\phi_j(x)\}$: act as "features" in ML.
• Linear basis function: $\phi_j(x) = x_j$
• Polynomial basis function: $\phi_j(x) = x^j$
• Gaussian basis function
• Sigmoid basis function
Linear Models for Regression
122. • Global functions of
the input variable,
s.t. changes in one
region of input
space affect all
other regions.
Polynomial Basis Functions
$\phi_j(x) = x^j$
123. • Local functions, a
small change in x
only affects nearby
basis functions.
• $\mu_j$ and $s$ control
the location and
scale (width).
Gaussian Basis Functions
$\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}$
124. • Local functions, a
small change in x
only affects nearby
basis functions.
• $\mu_j$ and $s$ control
the location and
scale (slope).
Sigmoidal Basis Functions
$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
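Gaussian and sigmoidal basis functions plug into the same least-squares machinery as polynomials; only the feature matrix changes. A sketch under assumptions: the nine centers, the shared scale `s = 0.15`, the added bias column, and the noiseless sine target are illustrative choices.

```python
import numpy as np

def gaussian_basis(x, mu, s):
    return np.exp(-((x[:, None] - mu) ** 2) / (2 * s**2))

def sigmoid_basis(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu) / s))

x = np.linspace(0.0, 1.0, 50)
t = np.sin(2 * np.pi * x)
mu = np.linspace(0.0, 1.0, 9)     # basis centers (an assumption)
s = 0.15                          # shared scale (an assumption)

rms = {}
for name, basis in (("gaussian", gaussian_basis), ("sigmoid", sigmoid_basis)):
    Phi = np.column_stack([np.ones_like(x), basis(x, mu, s)])  # bias + features
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    rms[name] = float(np.sqrt(np.mean((Phi @ w - t) ** 2)))
    print(name, rms[name])
```

The model stays linear in `w`, so the fit is still a single linear solve even though the resulting curve is nonlinear in x.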
125. • Adding a regularization term to an error function: $E_D(w) + \lambda E_W(w)$
• One of the simplest forms of regularizer is the sum-of-squares
of the weight vector elements: $E_W(w) = \frac{1}{2} w^T w$
• This type of weight decay regularizer (in ML),
a.k.a., parameter shrinkage (in statistics)
encourages weight values to decay towards
zero, unless supported by the data.
Regularized Least Squares
126. • A more general regularizer in the form of:
$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$
(first term: sum of squared error; second term: generalized regularizer)
• q=2 is the quadratic regularizer (last page).
• q=1 is known as lasso in statistics.
Regularized Least Squares
127. • LASSO: least absolute shrinkage and selection
operator
• When $\lambda$ is sufficiently large, some of the
coefficients $w_j$ are driven to zero, leading to a
sparse model
LASSO
131. • Before we start, we need to estimate the data
distribution and develop sampling strategies,
• figure out how to measure/quantify data, or, in
other words, represent them as features,
• figure out how to split data into training and
validation sets.
• After we learn a model, we need to measure the
fit, or the error on the validation set.
• Finally, we need to evaluate how well our trained
model generalizes.
Steps for Supervised Learning
132. Sampling & Distributions
The importance of good sampling & distribution estimation.
Population $X$ with attribute $Y$, modeled by function $f : X \to Y$
Sample $x \in X, y \in Y$ (figure: a handful of smiley faces drawn from a population of varied faces)
Learn $f'$ from $D = \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}$
$f'$ incorrectly predicts that everyone else “smiles crazily”
133. • The chance of getting a "perfect" sample of the
population on the first try is very small. When
the population is huge, this problem worsens.
• Noise during the measurement process adds
additional uncertainties.
• As a result, it is natural to try multiple times, and
formulate the problem in a probabilistic way.
Sampling & Distributions
134. When we measure the wrong
features, we’ll need very
complicated classifiers, and
the results are still not ideal.
Features
Figure: baseball vs tennis ball
There are always “exceptions”
that would ruin our perfect
assumptions: a yellow baseball?
We learn the best features from data with deep learning.
135. • k-fold cross validation
Splitting data
Repurposing the smiley-face
figures to represent the set of
annotated data.
Randomly split into k groups
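A random k-way split takes only a few lines. A minimal sketch; the generator name `k_fold_indices`, the seed, and the choice of k = 4 are assumptions (24 samples matches the number of faces in the figure):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)   # random split into k groups
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

splits = list(k_fold_indices(24, k=4))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))   # 18 6 on every line
```

Training k models, one per split, and averaging their validation errors gives a less noisy estimate than a single train/validation split.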
136. • Given a set of samples $x_i \in X$ and their ground
truth annotation $y_i$, learn a function $y = f(x)$
that minimizes the prediction error $E(y_j, f(x_j))$
for new $x_j \notin X$.
• The function $y = f(x)$ is a classifier. Classifiers
divide input space into decision regions
separated by decision boundaries.
Supervised Learning
Figure: input space with axes $x_1$ and $x_2$, decision regions $R_1$, $R_2$, $R_3$, and a decision boundary
137. • Spam detection:
• X = { characters and words in the email }
• Y = { spam, not spam}
• Digit recognition:
• X = cut out, normalized images of digits
• Y = {0,1,2,3,4,5,6,7,8,9}
• Medical diagnosis
• X = set of all symptoms
• Y = set of all diseases
Supervised Learning Examples
138. • Joint probability of X taking
the value xi and Y taking the
value yj: $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$
• Marginalizing: probability
that X takes the value xi
irrespective of Y: $p(X = x_i) = \frac{c_i}{N}$, where $c_i = \sum_j n_{ij}$
Before we train classifiers, a gentle
review on probability notations
Figure: grid of counts with cell $n_{ij}$ at column $x_i$ and row $y_j$, column total $c_i$, row total $r_j$
139. • Conditional Probability: the
fraction of instances where Y
= yj given that X = xi: $p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$
• Product Rule: $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$
(we will be seeing this a lot when building classifiers)
Before we train classifiers, a gentle
review on probability notations
140. • Bayes' Rule plays a central
role in pattern recognition
and machine learning.
• From the product rule,
together with the symmetric
property $p(X, Y) = p(Y, X)$,
we get: $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_Y p(X \mid Y)\, p(Y)$
(posterior probability, given prior p(Y) and likelihood p(X|Y))
Bayes' Rule & Posterior Probability
141. • p(Y = a) = 1/4, p(Y = b) = 3/4
• p(X = blue | Y = a) = 3/5
• p(X = green | Y = a) = 2/5
When we randomly draw a ball that is blue, the
probability that it comes from Y=a is?
Example of Bayes' Rule
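Figure: two boxes, Y=a and Y=b
$p(Y = a \mid X = blue) = \frac{p(X = blue \mid Y = a)\, p(Y = a)}{p(X = blue)}$
$= \frac{p(X = blue \mid Y = a)\, p(Y = a)}{p(X = blue \mid Y = a)\, p(Y = a) + p(X = blue \mid Y = b)\, p(Y = b)}$
$= \frac{\frac{3}{5} \cdot \frac{1}{4}}{\frac{3}{5} \cdot \frac{1}{4} + \frac{2}{5} \cdot \frac{3}{4}} = \frac{\frac{3}{20}}{\frac{3}{20} + \frac{6}{20}} = \frac{\frac{3}{20}}{\frac{9}{20}} = \frac{1}{3}$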
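The same arithmetic can be verified with exact fractions. A small sketch; the variable names are illustrative, and $p(X = blue \mid Y = b) = 2/5$ is the value used in the worked equation rather than one listed in the bullets:

```python
from fractions import Fraction

prior = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
# p(X = blue | Y); the value for Y = b is the 2/5 used in the worked equation.
likelihood_blue = {"a": Fraction(3, 5), "b": Fraction(2, 5)}

evidence = sum(likelihood_blue[y] * prior[y] for y in prior)      # p(X = blue)
posterior = {y: likelihood_blue[y] * prior[y] / evidence for y in prior}
print(posterior["a"], posterior["b"])   # 1/3 2/3
```

Even though a blue ball is more likely under Y=a, the larger prior on Y=b makes b the more probable source.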
142. What are Posterior Probability and
Generative Models good for?
Discriminative Model:
directly learn the data
boundary
Generative Model:
represent the data
and boundary
143. • Learn to directly predict labels from the data
• Often uses simpler boundaries (e.g., linear) for
hopes of better generalization.
• Often easier to predict a label from the data than
to model the data.
• E.g.,
• Logistic Regression
• Support Vector Machines
• Max Entropy Markov Model
• Conditional Random Fields
Discriminative Models
144. • Represent both the data and the boundary.
• Often use conditional independence and priors.
• Modeling data is challenging; need to make and
verify assumptions about data distribution
• Modeling data aids prediction & generalization.
• E.g.,
• Naive Bayes
• Gaussian Mixture Model (GMM)
• Hidden Markov Model
• Generative Adversarial Networks (GAN)
Generative Models
145. • Find a linear function to separate the classes
Linear Classifiers
• Logistic Regression
• Naïve Bayes
• Linear SVM
146. • Use a probabilistic approach to model the data distribution p(X, Y): given data X, find the Y that maximizes the posterior probability p(Y|X).
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \quad \text{where } p(X) = \sum_Y p(X \mid Y)\, p(Y)$$
• Problem: we need to model all p(X|Y) and p(Y). If X consists of n binary features, there are 2^n possible values for X.
• The Naïve Bayes assumption: the X_i's are conditionally independent given Y.
$$p(X_1, \ldots, X_n \mid Y) = \prod_i p(X_i \mid Y)$$
Naïve Bayes Classifier
147. • Given:
• Prior p(Y)
• n conditionally independent features,
represented by the vector X, given the class Y
• For each Xi, we have likelihood p(Xi | Y)
• Decision rule:
Naïve Bayes Classifier
$$Y^* = \arg\max_Y\; p(Y)\, p(X_1, \ldots, X_n \mid Y) = \arg\max_Y\; p(Y) \prod_i p(X_i \mid Y)$$
148. • For discrete Naïve Bayes, simply count:
• Prior:
$$p(Y = y') = \frac{\mathrm{Count}(Y = y')}{\sum_y \mathrm{Count}(Y = y)}$$
• Likelihood:
$$p(X_i = x' \mid Y = y') = \frac{\mathrm{Count}(X_i = x', Y = y')}{\sum_x \mathrm{Count}(X_i = x, Y = y')}$$
• Naïve Bayes Model:
$$p(Y \mid X) \propto p(Y) \prod_i p(X_i \mid Y)$$
Maximum Likelihood for Naïve Bayes
149. • Conditional probability model:
$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
• Classifier:
$$\tilde{y} = \arg\max_{k \in \{1, \ldots, K\}}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$
Naïve Bayes Classifier
150. • The features X are the entire document; Xi is the ith word in the
article. X is huge, so the NB assumption helps a lot!
Naïve Bayes for Text Classification
151. • Typical additional assumption: Xi's position in
document doesn't matter: bag of words.
aardvark 0
about 2
all 2
Africa 1
apple 0
...
gas 1
...
oil 1
...
Zaire 0
Naïve Bayes for Text Classification
152. • Learning Phase:
• Prior: p(Y), count how many documents in
each topic (prior).
• Likelihood: p(Xi|Y), for each topic, count how
many times a word appears in documents of
this topic.
• Testing Phase: for each document, use Naïve
Bayes' decision rule:
$$\arg\max_y\; p(y) \prod_{i=1}^{\text{words}} p(x_i \mid y)$$
Naïve Bayes for Text Classification
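The learning and testing phases above can be sketched as follows. This is an illustrative sketch, not the course's implementation: the toy documents are made up, log-probabilities are summed to avoid numerical underflow, and add-one (Laplace) smoothing is an assumption added here so that unseen words do not zero out the product.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, topic). Learning phase: count documents and words per topic."""
    topic_counts = Counter(topic for _, topic in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, topic in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    priors = {t: cnt / len(docs) for t, cnt in topic_counts.items()}   # p(Y)
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    """Testing phase: argmax_y log p(y) + sum_i log p(x_i | y)."""
    best_topic, best_score = None, -math.inf
    for topic, prior in priors.items():
        total = sum(word_counts[topic].values())
        score = math.log(prior)
        for w in words:
            # Add-one smoothing: an assumption of this sketch, not from the slides.
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical toy corpus, for illustration only.
docs = [
    ("oil gas price".split(), "energy"),
    ("gas pipeline oil".split(), "energy"),
    ("game win team".split(), "sports"),
    ("team win score".split(), "sports"),
]
priors, word_counts, vocab = train_nb(docs)
print(classify_nb("oil gas".split(), priors, word_counts, vocab))
```

Because of the bag-of-words assumption, word order in the query does not affect the result.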
153. • Given 1000 training documents from each
group, learn to classify new documents
according to which newsgroup it came from.
• comp.graphics,
• comp.os.ms-windows.misc
• ...
• soc.religion.christian
• talk.religion.misc
• ...
• misc.forsale
• ...
Naïve Bayes for Text Classification
155. • Usually, features are not conditionally independent:
$$p(X_1, \ldots, X_n \mid Y) \neq \prod_i p(X_i \mid Y)$$
• The estimated probabilities p(Y|X) are often biased towards 0 or 1.
• Nonetheless, Naïve Bayes is the single most used classifier.
• Naïve Bayes performs well, even when its assumptions are violated.
• Know its assumptions and when to use it.
Naïve Bayes Classifier Issues
156. • Regression model for which the dependent
variable is categorical.
• Binomial/Binary Logistic Regression
• Multinomial Logistic Regression
• Ordinal Logistic Regression (categorical, but
ordered)
• Substituting $\tilde{x} = w_0 + w_1 x$ into the logistic function
$$f(\tilde{x}) = \frac{1}{1 + e^{-\tilde{x}}},$$
we get:
$$y(x, w) = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$$
Logistic Regression
157. • E.g., for predicting:
• mortality of injured patients,
• risk of developing a certain disease based on
observations of the patient,
• whether an American voter would vote
Democratic or Republican,
• probability of failure of a given process, system or
product,
• customer's propensity to purchase a product or
halt a subscription,
• likelihood of homeowner defaulting on mortgage.
When to use logistic regression?
158. • Hours studied vs passing the exam
Logistic Regression Example
$$P_{\text{pass}}(h) = \frac{1}{1 + e^{-(-4.0777 + 1.5046 \cdot h)}}$$
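The fitted curve can be evaluated directly. A minimal sketch using the two coefficients from the slide; the function name `p_pass` and the chosen hour values are illustrative.

```python
import math

W0, W1 = -4.0777, 1.5046   # intercept and slope from the slide

def p_pass(hours):
    """Probability of passing after studying for `hours` hours."""
    return 1.0 / (1.0 + math.exp(-(W0 + W1 * hours)))

for h in (1, 2, 3, 4, 5):
    print(f"{h} hours -> P(pass) = {p_pass(h):.2f}")

# The probability crosses 0.5 where W0 + W1 * h = 0.
boundary_hours = -W0 / W1
print(f"P(pass) = 0.5 at about {boundary_hours:.2f} hours")
```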
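159. • Prediction: output the Y with the highest p(Y|X), where
$$p(Y = 0 \mid X, w) = \frac{1}{1 + \exp(w_0 + \sum_i w_i X_i)}$$
$$p(Y = 1 \mid X, w) = \frac{\exp(w_0 + \sum_i w_i X_i)}{1 + \exp(w_0 + \sum_i w_i X_i)}$$
• For binary Y, output Y = 1 if
$$1 < \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} \iff 1 < \exp\Big(w_0 + \sum_{i=1}^{n} w_i X_i\Big) \iff 0 < w_0 + \sum_{i=1}^{n} w_i X_i$$
• The decision boundary is therefore linear:
$$w_0 + w \cdot X = 0$$
Logistic Regression: decision boundary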
160. • Decision boundary: p(Y=0 | X, w) = 0.5
• Slope of the line defines how quickly probabilities go to 0
or 1 around decision boundary.
Visualizing $p(Y = 0 \mid X, w) = \frac{1}{1 + \exp(w_0 + w_1 x_1)}$
163. • Maximize conditional log likelihood (Maximum
Likelihood Estimation, MLE):
• No closed-form solution.
• Concave function of w → no need to worry
about local optima; easy to optimize.
$$l(w) \equiv \ln \prod_j p(y^j \mid x^j, w) = \sum_j \Big[ y^j \Big(w_0 + \sum_i w_i x_i^j\Big) - \ln\Big(1 + \exp\Big(w_0 + \sum_i w_i x_i^j\Big)\Big) \Big]$$
Logistic Regression Param. Estimation
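164. • The conditional log likelihood for logistic regression is concave (equivalently, its negative is convex)!
• Gradient:
$$\nabla_w l(w) = \left[ \frac{\partial l(w)}{\partial w_0}, \ldots, \frac{\partial l(w)}{\partial w_n} \right]$$
• Gradient Ascent update rule:
$$\Delta w = \eta \nabla_w l(w)$$
$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \frac{\partial l(w)}{\partial w_i}$$
• Simple, powerful, used in many places.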
Logistic Regression Param. Estimation
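The gradient ascent update can be sketched in a few lines of NumPy. This is a sketch under stated assumptions: the toy data `hours`/`passed` are made up, the step size `eta` and step count are arbitrary choices, and the gradient used is the standard one for this log likelihood, $\partial l / \partial w_i = \sum_j x_i^j (y^j - p(Y = 1 \mid x^j, w))$.

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) = sum_j [ y_j * z_j - ln(1 + exp(z_j)) ], with z_j = w . x_j"""
    z = X @ w
    return float(np.sum(y * z - np.log1p(np.exp(z))))

def gradient(w, X, y):
    """dl/dw = X^T (y - p(Y = 1 | x, w))"""
    p1 = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p1)

def fit_logistic(X, y, eta=0.01, steps=2000):
    """Gradient ascent on l(w); a column of ones is prepended so that w[0] plays the role of w0."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w = w + eta * gradient(w, Xb, y)   # w(t+1) <- w(t) + eta * gradient
    return w, Xb

# Hypothetical toy data: hours studied vs. pass (1) / fail (0).
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
w, Xb = fit_logistic(hours, passed)
print(w, log_likelihood(w, Xb, passed))
```

Since the objective is concave, a sufficiently small step size moves the weights steadily towards the single global optimum.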
165. • MLE tends to prefer large weights
• Higher likelihood of properly classified
examples close to decision boundary.
• Larger influence of corresponding features on
decision.
• Can cause overfitting!!!
Logistic Regression Param. Estimation
166. • Regularization to avoid large weights and overfitting.
• Add priors on w and formulate as a Maximum a Posteriori (MAP) optimization problem:
$$p(w \mid Y, X) \propto p(Y \mid X, w)\, p(w)$$
• Define the prior as a normal distribution with zero mean and identity covariance; this pushes the parameters towards zero.
• MAP estimate:
$$w^* = \arg\max_w\; \ln \Big[ p(w) \prod_{j=1}^{N} p(y^j \mid x^j, w) \Big]$$
Logistic Regression Param. Estimation
167. • Logistic Regression in the more general case, where Y ∈ {y_1, ..., y_R}. Define a weight vector w_i for each y_i, i = 1, ..., R-1.
$$p(Y = 1 \mid X) \propto \exp\Big(w_{10} + \sum_i w_{1i} X_i\Big)$$
$$p(Y = 2 \mid X) \propto \exp\Big(w_{20} + \sum_i w_{2i} X_i\Big)$$
...
$$p(Y = R \mid X) = 1 - \sum_{j=1}^{R-1} p(Y = j \mid X)$$
Logistic Regression for Discrete Classification
168. • E.g., Y = {0, 1}, X = <X1, ..., Xn>, Xi continuous.
Naïve Bayes vs Logistic Regression
| | Naïve Bayes (generative) | Logistic Regression (discriminative) |
| Number of parameters | 4n+1 | n+1 |
| Parameter estimation | uncoupled | coupled |
| # training samples → infinite & model correct | good classifier | good classifier |
| # training samples → infinite & model incorrect | biased classifier | less-biased classifier |
| Training samples needed | O(log N) | O(N) |
| Training convergence speed | faster | slower |
169. Naïve Bayes vs Logistic Regression
• Examples from UCI Machine Learning dataset
170. Perceptron
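• Invented in 1957 at the Cornell Aeronautical Lab. Intended to be a machine, rather than a program, capable of recognition.
• A linear (binary) classifier.
[Figure: the Mark I perceptron machine; diagram of inputs i_1, ..., i_n that are weighted, summed and passed through f to produce the output o.]
$$o = f\Big( \sum_{k=1}^{n} i_k \cdot w_k \Big)$$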
171. • Start with zero weights: w = 0
• For t = 1...T (T passes over data)
  • For i = 1...n (each training sample)
    • Classify with current weights (sign(x) is +1 if x > 0, else -1):
$$y = \mathrm{sign}(w \cdot x_i)$$
    • If correct (i.e., y = y_i), no change!
    • If wrong, update:
$$w = w + y_i x_i$$
[Figure: a misclassified sample with y_i = -1 moves w to w + (-1) x_i.]
Binary Perceptron Algorithm
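The algorithm above can be sketched directly in NumPy. Assumptions in this sketch: the four training points are made up, labels are +1/-1, and a constant 1 is included as the first feature so that the first weight acts as a bias.

```python
import numpy as np

def sign(v):
    """+1 if v > 0, else -1 (as on the slide)."""
    return 1 if v > 0 else -1

def train_perceptron(X, y, T=10):
    w = np.zeros(X.shape[1])              # start with zero weights
    for _ in range(T):                    # T passes over the data
        for xi, yi in zip(X, y):          # each training sample
            if sign(w @ xi) != yi:        # wrong: update
                w = w + yi * xi
    return w

# Hypothetical linearly separable toy data; first column is the constant bias feature.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w)
```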
179. • If we have more than two classes:
• Have a weight vector for each class wy
• Calculate an activation function for each class
• Highest activation wins
Multiclass Perceptron
$$\mathrm{activation}_w(x, y) = w_y \cdot x$$
$$y^* = \arg\max_y\; \mathrm{activation}_w(x, y)$$
180. • Starts with zero weights
• For t=1, ..., T, i=1, ..., n (T times over data)
• Classify with current weights
• If correct (y=yi), no change!
• If wrong: subtract features xi from weights for
predicted class wy and add them to weights
for correct class wyi.
Multiclass Perceptron
$$y = \arg\max_y\; w_y \cdot x_i$$
$$w_y = w_y - x_i$$
$$w_{y_i} = w_{y_i} + x_i$$
[Figure: the update moves w_y to w_y - x_i and w_{y_i} to w_{y_i} + x_i.]
181. • Text classification example: x = "win the vote" sentence
Multiclass Perceptron Example
| feature | x | w_sports | w_politics | w_tech |
| BIAS | 1 | -2 | 1 | 2 |
| win | 1 | 4 | 2 | 0 |
| game | 0 | 4 | 0 | 2 |
| vote | 1 | 0 | 4 | 0 |
| the | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
x · w_sports = 2
x · w_politics = 7
x · w_tech = 2
Classified as "politics"
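The example can be reproduced with the weights and feature vector listed on the slide. A minimal sketch: the function names `predict` and `update` are illustrative, and `update` follows the multiclass perceptron rule described earlier (subtract x from the wrongly predicted class, add it to the correct class).

```python
import numpy as np

FEATURES = ["BIAS", "win", "game", "vote", "the"]
weights = {
    "sports":   np.array([-2, 4, 4, 0, 0]),
    "politics": np.array([1, 2, 0, 4, 0]),
    "tech":     np.array([2, 0, 2, 0, 0]),
}
x = np.array([1, 1, 0, 1, 1])   # "win the vote"

def predict(weights, x):
    """Highest activation w_y . x wins."""
    scores = {label: int(w @ x) for label, w in weights.items()}
    return max(scores, key=scores.get), scores

def update(weights, x, y_true):
    """Perceptron update: only changes the weights when the prediction is wrong."""
    y_pred, _ = predict(weights, x)
    if y_pred != y_true:
        weights[y_pred] = weights[y_pred] - x
        weights[y_true] = weights[y_true] + x
    return weights

label, scores = predict(weights, x)
print(label, scores)
```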
182. • The data is linearly separable with margin γ if
$$\exists w\; \forall t:\; y^t (w \cdot x^t) \geq \gamma > 0$$
Linearly separable (binary)
[Figure: two classes in the (x1, x2) plane.]
183. • Assume the data is separable with margin γ:
$$\exists w^* \text{ s.t. } \|w^*\|_2 = 1 \text{ and } \forall t:\; y^t (w^* \cdot x^t) \geq \gamma$$
• Also assume there is a number R such that
$$\forall t:\; \|x^t\|_2 \leq R$$
• Theorem: the number of mistakes (parameter updates) made by the perceptron is bounded:
$$\text{mistakes} \leq \frac{R^2}{\gamma^2}$$
Mistake Bound for Perceptron
184. • Noise: if the data isn't separable,
weights might thrash (averaging
weight vectors over time can help).
• Mediocre generalization: finds a
barely separating solution.
• Overtraining: test / hold-out
accuracy usually rises then falls.
Issues with Perceptrons
[Figure panels: separable; non-separable (thrashing); barely separable.]
185. • Find a linear function to separate the classes:
$$f(x) = g(w \cdot x + b)$$
• Define the hyperplane $tX - b = 0$, where $t$ is the normal to the hyperplane and $X$ is the matrix of all data points. Minimize $\|t\|$ s.t. $tX - b$ produces the correct label for all $X$.
Linear SVM Classifier
[Figure: two classes in the (x1, x2) plane separated by a hyperplane.]
186. • Find a linear function to separate the classes:
$$f(x) = g(w \cdot x + b)$$
• Define the hyperplane $tX - b = 0$, where $t$ is the normal to the hyperplane and $X$ is the matrix of all data points. Minimize $\|t\|$ s.t. $tX - b$ produces the correct label for all $X$.
Linear SVM Classifier
[Figure: the same separating hyperplane in the (x1, x2) plane, with the support vectors marked.]
187. • Some data sets are not linearly separable!
• Option 1:
• Use non-linear features, e.g., polynomial basis
functions
• Learn linear classifiers in a transformed, non-linear
feature space
• Option 2:
• Use non-linear classifiers (decision trees,
neural networks, nearest neighbors)
Nonlinear Classifiers
188. • Assign label of nearest training data point to
each test data point.
Nearest Neighbor Classifier
Duda, Hart and Stork, Pattern Classification
189. K-Nearest Neighbor Classifier
[Figure: four scatter plots in the (x1, x2) plane showing classes "x" and "o" with two query points "+": the training data, and the 1-nearest, 3-nearest and 5-nearest neighbor classifications.]
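A k-nearest-neighbor classifier can be sketched in a few lines. Assumptions in this sketch: Euclidean distance, a simple majority vote, and a made-up toy dataset in which one "o" point sits inside the "x" cluster so that the answer changes with k.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - np.asarray(query), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the last "o" point lies inside the "x" cluster.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [1.2, 1.2]])
y_train = np.array(["x", "x", "x", "o", "o", "o", "o"])

print(knn_predict(X_train, y_train, [1.0, 1.0], k=1))
print(knn_predict(X_train, y_train, [1.0, 1.0], k=3))
```

With k = 1 the single stray "o" point decides the label; with k = 3 the surrounding "x" points outvote it, which is why larger k tends to give smoother decision boundaries.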
190. • Data that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space!
Nonlinear SVMs
[Figure: 1-D data along the x axis, and the same data mapped to the (x, x²) plane.]
191. • Map the input space to some higher dimensional feature space where the training set is separable:
$$\Phi: x \to \varphi(x)$$
Nonlinear SVMs
192. • The kernel trick: instead of explicitly computing the lifting transformation $\varphi(x)$, define a kernel function K such that
$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$
• This gives a non-linear decision boundary in the original feature space:
$$\sum_i \alpha_i y_i\, \varphi(x_i) \cdot \varphi(x) + b = \sum_i \alpha_i y_i\, K(x_i, x) + b$$
• Common kernel function: Radial basis function kernel.
Nonlinear SVMs
194. • Histogram intersection kernel:
$$I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))$$
• Generalized Gaussian kernel:
$$K(h_1, h_2) = \exp\Big( -\frac{1}{A} D(h_1, h_2)^2 \Big)$$
D can be (inverse) L1 distance, Euclidean distance, χ² distance, etc.
Kernels for bags of features
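Both kernels are short to write down in NumPy. A minimal sketch: the two histograms are made up, `A = 1.0` is an arbitrary choice, and Euclidean distance is used as the default D purely for illustration.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """I(h1, h2) = sum_i min(h1(i), h2(i))"""
    return float(np.minimum(h1, h2).sum())

def generalized_gaussian_kernel(h1, h2, A=1.0, distance=None):
    """K(h1, h2) = exp(-D(h1, h2)^2 / A); D defaults to Euclidean distance in this sketch."""
    if distance is None:
        distance = lambda a, b: np.linalg.norm(a - b)
    return float(np.exp(-distance(h1, h2) ** 2 / A))

# Hypothetical normalized bag-of-features histograms.
h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.4, 0.4, 0.2])
print(histogram_intersection(h1, h2))
print(generalized_gaussian_kernel(h1, h2))
```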
195. • Combine multiple two-class SVMs
• One vs others:
• Training: learn an SVM for each class vs the others.
• Testing: apply each SVM to test example and
assign it to the class of the SVM that returns the
highest decision value.
• One vs one:
• Training: learn an SVM for each pair of classes
• Testing: each learned SVM votes for a class to
assign to the test example.
Multi-class SVM
196. • Pros:
• SVMs work very well in practice, even with very
small training sample sizes.
• Cons:
• No direct multi-class SVM; must combine two-class
SVMs.
• Computation and memory usage:
• Must compute matrix of kernel values for each
pair of examples.
• Learning can take a long time for large problems.
SVMs: Pros & Cons
197. • Prediction is done by sending the example down
the tree until a class assignment is reached.
Decision Tree Classifier
198. • Internal nodes: each tests a feature
• Leaf nodes: each assigns a classification
• Decision Trees divide the feature space into axis-parallel
rectangles and label each rectangle with
one of the K classes.
Decision Tree Classifier
199. • Goal: find a decision tree that achieves minimum
misclassification errors on the training data.
• Brute-force solution: create a tree with one path
from root to leaf for each training sample.
(problem: just memorizing, won't generalize.)
• Find the smallest tree that minimizes error.
(problem: this is NP-hard.)
Training Decision Trees
200. 1. Choose the best feature a* for the root of the tree.
2. Split training set S into subsets {S1, S2, ..., Sk}
where each subset Si contains examples having
the same value for a*.
3. Recursively apply the algorithm on each new
subset until all examples have the same class
label.
The problem is, what defines the "best" feature?
Top-down induction of Decision Tree
201. • Decision Tree feature selection based on
classification error.
Choosing Best Feature
Does not work well, since it doesn't reflect progress
towards a good tree.
202. • Choose the feature that gives the highest information gain (the X_j that has the highest mutual information with Y):
$$\arg\max_j I(X_j; Y) = \arg\max_j \big[ H(Y) - H(Y \mid X_j) \big] = \arg\min_j H(Y \mid X_j)$$
• Define $\tilde{J}(j)$ to be the expected remaining uncertainty about y after testing x_j:
$$\tilde{J}(j) = H(Y \mid X_j) = \sum_x p(X_j = x)\, H(Y \mid X_j = x)$$
Choosing Best Feature
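The feature-selection criterion can be sketched with empirical entropies. Assumptions in this sketch: discrete features, probabilities estimated by counting, and a made-up four-sample dataset with hypothetical feature names `outlook` and `windy`.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) in bits, estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y | X_j) = sum_x p(X_j = x) H(Y | X_j = x)"""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def best_feature(features, labels):
    """argmin_j H(Y | X_j), equivalent to the highest information gain."""
    return min(features, key=lambda name: conditional_entropy(features[name], labels))

# Hypothetical toy data.
labels = ["yes", "yes", "no", "no"]
features = {
    "outlook": ["sunny", "sunny", "rain", "rain"],   # perfectly predicts the label
    "windy":   ["t", "f", "t", "f"],                 # carries no information
}
print(best_feature(features, labels))
```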
203. • Before we start, we need to estimate the data
distribution and develop sampling strategies,
• figure out how to measure/quantify data, or, in
other words, represent them as features,
• figure out how to split the data into training and
validation sets.
• After we learn a model, we need to measure the
fit, or the error on the validation set.
• Finally, we need to evaluate how well our trained
model generalizes.
Steps for Supervised Learning
204. • Minimizing the misclassification rate
• Minimizing the expected loss
• The reject option
Decision Theory
205. • Decision boundary, or simply, in 1D, a threshold,
s.t. anything larger than the threshold is
classified as one class, and anything smaller than the
threshold as the other class.
Decision Boundary
206. • Different metrics & names used in different fields
for measuring ML performance; however, the
common cornerstones are:
• True positive (TP): sample is an apple,
classified as an apple.
• False positive (FP): sample is not an apple, but
classified as an apple.
• True negative (TN): sample is not an apple,
classified as not an apple.
• False negative (FN): sample is an apple, but
misclassified as "not an apple".
True/False, Positive/Negative
207. • Precision: of the (TP+FP) samples the classifier identified as apples, only TP are apples (aka positive predictive value).
$$\text{precision} = \frac{TP}{TP + FP}$$
• Recall: of the total (TP+FN) apples, the classifier identified TP (aka hit rate, sensitivity, true positive rate).
$$\text{recall} = \frac{TP}{TP + FN}$$
Precision vs Recall
208. • F-measure: harmonic mean of precision and recall. The F-measure is criticized outside the Information Retrieval field for neglecting the true negatives.
$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
• Accuracy (ACC): a weighted arithmetic mean of precision and inverse precision, as well as the weighted arithmetic mean of recall and inverse recall.
$$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
A single balanced metric?
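These metrics follow directly from the four confusion counts. A minimal sketch: the label lists are made up, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, TN, FN) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical ground truth and predictions.
y_true = ["apple", "apple", "apple", "pear", "pear", "pear", "pear", "apple"]
y_pred = ["apple", "apple", "pear", "apple", "pear", "pear", "pear", "apple"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="apple")
p, r = precision(tp, fp), recall(tp, fn)
print(tp, fp, tn, fn, p, r, f_measure(p, r), accuracy(tp, fp, tn, fn))
```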
210. • Different types of errors are weighted differently;
e.g., in medical examinations we minimize false
negatives but can tolerate false positives.
• Reformulate objectives from maximizing
probability to minimizing weighted loss
functions.
• The reject option: refrain from making decisions
on difficult cases (e.g., for samples within a
certain region inside the decision boundary.)
Minimizing the expected loss
211. • Minimizing training and validation error vs.
minimizing testing error.
• Memorizing every “practice exam” question ≠
doing well on new questions. Avoid overfitting.
Generalization
E.g., training a classifier
that recognizes trees
215. • Bias:
• Difference between the expected (or
averaged) prediction of our model and the
correct value.
• Error due to inaccurate assumptions/
simplifications.
• Variance:
• Amount by which the estimate of the target function
would change if different training data were used.
Generalization Error
217. • Model is too simple to represent all the relevant
class characteristics.
• High bias (few degrees of freedom, DoF) and
low variance.
• High training error and high test error.
Underfitting
218. • Model is too complex and fits
irrelevant noise in the data
• Low bias, high variance
• Low training error, high test error
Overfitting
219. Error (mean square error, MSE) = noise² + bias² + variance
• noise²: unavoidable error
• bias²: error due to incorrect assumptions made about the data
• variance: error due to variance of training samples
Bias-Variance Trade-off
224. 1. Create T bootstrap samples, {S1, ..., ST} of S as
follows:
• For each Si, randomly draw |S| examples from
S with replacement.
• With large |S|, each Si will contain about 1 - 1/e ≈
63.2% of the unique examples in S.
2. For each i=1, ..., T, hi = Learn (Si)
3. Output H = <{h1, ..., hT}, majority vote >
Bootstrap Aggregating (Bagging)
Leo Breiman, "Bagging Predictors", Machine Learning, 24, 123-140 (1996)
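The three bagging steps can be sketched with the standard library alone. Assumptions in this sketch: `learn` is any hypothetical function that maps a training set to a classifier (a callable), and the seed and set size are arbitrary. The last lines check the 63.2% figure empirically.

```python
import random
from collections import Counter

def bootstrap_sample(S, rng):
    """Draw |S| examples from S with replacement."""
    return [rng.choice(S) for _ in S]

def bagging(S, learn, T, rng):
    """Train one hypothesis h_i = learn(S_i) per bootstrap sample."""
    return [learn(bootstrap_sample(S, rng)) for _ in range(T)]

def majority_vote(hypotheses, x):
    """Ensemble output H(x): the most common prediction among the h_i."""
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]

# Empirical check of the fraction of unique examples in one bootstrap sample.
rng = random.Random(0)
S = list(range(10000))
unique_fraction = len(set(bootstrap_sample(S, rng))) / len(S)
print(unique_fraction)
```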
225. • A learning algorithm is unstable if small changes
in the training data produce large changes in
the output hypothesis.
• Bagging will have little benefit when used with
stable learning algorithms.
• Bagging works best when used with unstable
yet relatively accurate classifiers.
Learning Algorithm Stability
227. • Bagging: individual classifiers are independent
• Boosting: classifiers are learned iteratively
• Look at errors from previous classifiers to
decide what to focus on for the next iteration
over data.
• Each successive classifier depends on its
predecessors.
• Result: more weights on "hard" examples, i.e.,
the ones classified incorrectly in the previous
iterations.
Boosting
228. • Consider E = <{h1, h2, h3}, majority vote>
• If h1, h2, h3 have error rates less than e, the error rate of E is upper-bounded by g(e) = 3e² - 2e³, and g(e) < e for 0 < e < 1/2.
Error Upper Bound
[Figure: plot of e and 3e² - 2e³.]
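The expression 3e² - 2e³ can be checked by enumeration. A sketch under an explicit assumption that is not stated on the slide: the three classifiers err independently, each with probability exactly e. The ensemble errs when at least two of the three are wrong.

```python
from itertools import product

def majority_error(e):
    """P(at least 2 of 3 independent classifiers are wrong), each wrong with probability e."""
    total = 0.0
    for wrong in product([True, False], repeat=3):
        prob = 1.0
        for is_wrong in wrong:
            prob *= e if is_wrong else (1 - e)
        if sum(wrong) >= 2:
            total += prob
    return total

def g(e):
    return 3 * e**2 - 2 * e**3

for e in (0.1, 0.3, 0.45):
    print(e, majority_error(e), g(e))
```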
229. • Hypothesis of getting a classifier ensemble of
arbitrary accuracy, from weak classifiers.
Arbitrary Accuracy from Weak Classifiers
The original formulation of boosting learns too slowly.
Empirical studies show that Adaboost is highly effective.
230. • Adaboost works by learning many times on
different distributions over the training data.
• Modify learner to take distribution as input.
1. For each boosting round, learn on data set S
with distribution Dj to produce jth ensemble
member hj.
2. Compute the (j+1)th round distribution Dj+1 by
putting more weight on the instances that hj made
mistakes on.
3. Compute a voting weight wj for hj.
Adaboost
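The three steps above can be sketched with decision stumps as the weak learner. This sketch makes several assumptions that are not in the slides: labels are +1/-1, the weak learner is a threshold on a single 1-D feature, the voting weight and re-weighting formulas are those of the standard AdaBoost formulation, and the ten-point dataset is a toy example.

```python
import numpy as np

def stump_predict(X, thr, polarity):
    """Predict `polarity` where X >= thr, and -polarity elsewhere."""
    return np.where(X >= thr, polarity, -polarity)

def learn_stump(X, y, D):
    """Weak learner that takes the distribution D as input: pick the stump with lowest weighted error."""
    best = None
    for thr in np.unique(X):
        for polarity in (1, -1):
            err = float(np.sum(D[stump_predict(X, thr, polarity) != y]))
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(X, y, rounds=3):
    D = np.full(len(X), 1.0 / len(X))                   # start with a uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, thr, polarity = learn_stump(X, y, D)       # step 1: learn h_j on D_j
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard against division by zero
        w = 0.5 * np.log((1 - err) / err)               # step 3: voting weight w_j
        pred = stump_predict(X, thr, polarity)
        D = D * np.exp(-w * y * pred)                   # step 2: more weight on mistakes
        D = D / D.sum()
        ensemble.append((w, thr, polarity))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of all ensemble members."""
    score = sum(w * stump_predict(X, thr, pol) for w, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)

# Hypothetical toy data that no single stump can classify correctly.
X = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
ensemble = adaboost(X, y, rounds=3)
print((predict(ensemble, X) == y).mean())
```

No single stump does better than 70% on this data, so any gain in training accuracy comes from the weighted vote.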
236. • Suppose the base learner L is a weak learner,
with error rate slightly less than 0.5 (better than
random guess)
• Training error goes to zero exponentially fast!!!
Adaboost Properties
237. Semi-supervised Learning
Machine Learning Roadmap
| | unsupervised | supervised |
| continuous (predicting a quantity) | Dimension Reduction | Regression |
| discrete (predicting a category) | Clustering | Classification |
238. • When annotated data is costly to obtain.
• When data volume is HUGE!
When to use semi-
supervised learning?
239. • Assume that class boundary should go through
low density areas.
• Having unlabeled data helps find a better
decision boundary.
Why can unlabeled data help?
supervised learning
semi-supervised learning
240. • Assume that each
class contains a
coherent group of
points (e.g., Gaussian)
• Having unlabeled data
points can help learn
the distribution more
accurately.
Why can unlabeled data help?
241. • Generative models:
• Use unlabeled data to more accurately
estimate the models.
• Discriminative models:
• Assume that p(y|x) is locally smooth
• Graph/manifold regularization
• Multi-view approach: multiple independent
learners that agree on unlabeled data
• Cotraining
Semi-Supervised Learning (SSL)
242. SSL Bayes Gaussian Classifier
Without SSL: optimize $p(X_l, Y_l \mid \theta)$
With SSL: optimize $p(X_l, Y_l, X_u \mid \theta)$
243. • In SSL, the learned $\theta$ needs to explain the unlabeled data well, too.
• Find the MLE or MAP estimate of the joint and marginal likelihood:
$$p(X_l, Y_l, X_u \mid \theta) = \sum_{Y_u} p(X_l, Y_l, X_u, Y_u \mid \theta)$$
• Common mixture models used in SSL:
  • GMM
  • Mixture of Multinomials
SSL Bayes Gaussian Classifier
244. • Binary classification with GMM using MLE
• Using labeled data only, MLE is trivial:
$$\log p(X_l, Y_l \mid \theta) = \sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)$$
• With both labeled and unlabeled data, MLE is harder; use EM:
$$\log p(X_l, Y_l, X_u \mid \theta) = \sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) + \sum_{i=l+1}^{l+u} \log \Big( \sum_{y=1}^{2} p(y \mid \theta)\, p(x_i \mid y, \theta) \Big)$$
Estimating SSL GMM params
245. • Start with the MLE $\theta = \{w, \mu, \Sigma\}_{1:2}$ on $(X_l, Y_l)$:
  • $w_c$ = proportion of class c
  • $\mu_c$ = sample mean of class c
  • $\Sigma_c$ = sample covariance of class c
• The E-step: compute the expected label
$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$$
for all $x \in X_u$.
• The M-step: update the MLE $\theta$ with the (now labeled) $X_u$.
Semi-Supervised EM for GMM
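The E-step and M-step above can be sketched for the simplest case. This sketch is restricted to 1-D data with two classes (so a scalar variance replaces the covariance Σ), uses a fixed number of iterations instead of a convergence test, and runs on synthetic data; all of these are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mle(x, resp):
    """Weighted MLE of theta = {w, mu, var} from (soft) labels resp[i, c]."""
    nc = resp.sum(axis=0)
    w = nc / nc.sum()                                            # class proportions
    mu = (resp * x[:, None]).sum(axis=0) / nc                    # class means
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nc       # class variances
    return w, mu, var

def ssl_em(x_l, y_l, x_u, iterations=50):
    resp_l = np.eye(2)[y_l]                  # labeled points keep their one-hot labels
    w, mu, var = mle(x_l, resp_l)            # start with the MLE on (X_l, Y_l)
    x_all = np.concatenate([x_l, x_u])
    for _ in range(iterations):
        # E-step: p(y | x, theta) for all x in X_u
        joint = w * gaussian_pdf(x_u[:, None], mu, var)
        resp_u = joint / joint.sum(axis=1, keepdims=True)
        # M-step: update the MLE using the labeled and the (now softly labeled) unlabeled data
        w, mu, var = mle(x_all, np.vstack([resp_l, resp_u]))
    return w, mu, var

# Synthetic data: 4 labeled points and 400 unlabeled points from two Gaussians.
rng = np.random.default_rng(0)
x_l = np.array([-2.5, -1.5, 1.5, 2.5])
y_l = np.array([0, 0, 1, 1])
x_u = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
w, mu, var = ssl_em(x_l, y_l, x_u)
print(w, mu, var)
```

With only four labeled points the initial variance estimate is far too small; the unlabeled points pull it towards the spread of the full data, which is the effect the slides describe.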
246. • SSL is sensitive to assumptions!!!
• Cases when the assumption is wrong:
SSL GMM Discussions
247. So, where's Deep Learning?
[Figure: the Machine Learning Roadmap (Dimension Reduction, Clustering, Regression, Classification; continuous vs. discrete, supervised vs. unsupervised).]
248. Machine Learning Workflow
Classical Workflow:
1. Data collection
2. Feature Extraction
3. Dimension Reduction
4. Classifier (re)Design
5. Classifier Verification
6. Deploy
Modern workflow; brute-force deep learning
1. Data collection
2. Throw everything into a Deep Neural Network
3. Mommy, why doesn’t it work ???
250. Features Learned by modern
Deep Neural Networks
• Neurons act like “custom-trained filters”; react to
very different visual cues, depending on data.
251. • Does not “memorize” millions of viewed images.
• Extracts greatly reduced number of features that
are vital to classify different classes of data.
• Classifying data becomes a simple task when
the features measured are “good”.
What do DNNs learn?
252. More to follow in the
remainder of the semester
• Deep Learning
• Transfer Learning
• Reinforcement Learning
• Generative Adversarial Networks (GAN)
• ...
253. ML/AI is all about data.
Form the data cycle
and iterate rapidly
[Figure: the data cycle linking Business, Data and Technology, with "Speed" on every link.]
254. Choose your battle wisely, for
the sake of a strong data cycle
| Problem | Data | Use Case | Data Cycle |
| Face Recognition | User photos throughout the world | most users would correct label for free | ★★★★★ |
| Face Recognition | Surveillance video throughout China | police would correct label | ★★★★ |
| Face detection and beautification | Users with beautification app | need to sample then manually annotate | ★★ |
| Face detection for virtual make-up | Users with virtual make-up app | need to sample then manually annotate | ★★ |
255. Business stories of same AI
but different angle of approach
• Smart photo album
• Content moderation
• Merchandise recognition for o2o redirection
• Smart retail applications
• Training weak AI is hard already; knowing what
to do with the weak AI is even harder...