Machine learning uses patterns in data to make predictions without the computer being explicitly programmed. This document introduces machine learning concepts through a real-world project example. It discusses what data scientists do, including prediction, anomaly detection, gaining insights, and decision making. It then demonstrates machine learning applications in areas such as predicting flight delays or employee attrition, and covers key steps such as data preprocessing, feature engineering, and building predictive models with decision trees.
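The summary above mentions building predictive models with decision trees. As a hedged illustration of that idea, here is a minimal decision stump (a one-level decision tree) in plain Python; the flight-hour data, labels, and threshold search below are invented for illustration and are not taken from the project in the slides.

```python
# A decision stump: a one-level decision tree that picks the single
# feature threshold which best separates two classes.

def train_stump(xs, ys):
    """Find the threshold on a 1-D feature that minimizes training errors."""
    best = None
    # Candidate thresholds: midpoints between consecutive sorted values.
    pts = sorted(xs)
    cands = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    for t in cands:
        for label_above in (0, 1):
            preds = [label_above if x > t else 1 - label_above for x in xs]
            errs = sum(p != y for p, y in zip(preds, ys))
            if best is None or errs < best[0]:
                best = (errs, t, label_above)
    return best[1], best[2]

def predict(t, label_above, x):
    return label_above if x > t else 1 - label_above

# Toy data echoing the flight-delay example: flights departing later in
# the day (feature = departure hour) are delayed (label = 1) more often.
hours   = [6, 7, 8, 9, 15, 17, 19, 21]
delayed = [0, 0, 0, 0, 1, 1, 1, 1]
t, above = train_stump(hours, delayed)
print(predict(t, above, 18))  # classify a 6 pm flight
```

The stump learns a split around midday and predicts "delayed" for the 6 pm flight; real decision-tree learners repeat this threshold search recursively over many features.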
Introduction to Data Science and Analytics (Srinath Perera)
This webinar serves as an introduction to WSO2 Summer School. It will discuss how to build a pipeline for your organization and for each use case, and the technology and tooling choices that need to be made for the same.
This session will explore analytics under four themes:
Hindsight (what happened)
Oversight (what is happening)
Insight (why is it happening)
Foresight (what will happen)
Recording http://t.co/WcMFEAJHok
a) What is data
b) Types of data
c) The difference between data science, big data, and data analytics
d) The relationship between data and artificial intelligence
Advantages and Disadvantages of Machine Learning (business Corporate)
In this presentation we cover the advantages and disadvantages of machine learning, looking at its benefits and limitations to understand where machine learning should and should not be used.
Machine Learning: Replicating the Human Brain (Nishant Jain)
These slides show how humans make decisions and how, following the same pattern, machines are trained to learn and make decisions. They give an overview of all the steps involved in designing an efficient decision-making machine.
What Is Data Science? | Introduction to Data Science | Data Science For Begin... (Simplilearn)
This Data Science presentation will help you understand what Data Science is, why we need it, the prerequisites for learning it, what a Data Scientist does, the Data Science lifecycle with an example, and career opportunities in the Data Science domain. You will also learn the differences between Data Science and Business Intelligence. The role of data scientist has been called one of the sexiest jobs of the century. The demand for data scientists is high, the number of opportunities for certified data scientists is increasing, and studies project a continued shortfall of qualified candidates to fill these roles. So let us dive into Data Science and understand what it is all about.
This Data Science Presentation will cover the following topics:
1. Need for Data Science
2. What is Data Science?
3. Data Science vs Business intelligence
4. Prerequisites for learning Data Science
5. What does a Data scientist do?
6. Data Science life cycle with use case
7. Demand for Data scientists
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
The Data Science with Python course is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
5. Experienced professionals who would like to harness data science in their fields
Methods for Sentiment Analysis: A Literature Study (vivatechijri)
Sentiment analysis is a trending topic, as everyone has an opinion on everything. The systematic study of these opinions can yield information that proves valuable to many companies and industries. A huge number of users are online and share their opinions and comments regularly, and this information can be mined and used efficiently. Companies can review their own products using sentiment analysis and make the necessary changes. The data is huge, so it requires efficient processing to collect and analyze it and produce the required results.
In this paper, we discuss the various methods used for sentiment analysis. It also covers techniques such as the lexicon-based approach, SVM [10], convolutional neural networks, the morphological sentence pattern model [1], and the IML algorithm. The paper surveys studies on various data sets such as the Twitter API, Weibo, movie reviews, IMDb, a Chinese micro-blog database [9], and more, and reports the accuracy results obtained by each system.
Modeling and Predicting Cyber Hacking Breaches (Venkat Projects)
Analyzing cyber incident data sets is an important method for deepening our understanding of the evolution of the threat situation. This is a relatively new research topic, and many studies remain to be done. In this paper, we report a statistical analysis of a breach incident data set corresponding to 12 years (2005–2017) of cyber hacking activities that include malware attacks. We show that, in contrast to the findings reported in the literature, both hacking breach incident inter-arrival times and breach sizes should be modeled by stochastic processes, rather than by distributions because they exhibit autocorrelations. Then, we propose particular stochastic process models to, respectively, fit the inter-arrival times and the breach sizes. We also show that these models can predict the inter-arrival times and the breach sizes. In order to get deeper insights into the evolution of hacking breach incidents, we conduct both qualitative and quantitative trend analyses on the data set. We draw a set of cybersecurity insights, including that the threat of cyber hacks is indeed getting worse in terms of their frequency, but not in terms of the magnitude of their damage.
Applying different classification techniques to different types of datasets, such as text and image datasets. Here I have used machine learning and deep learning for the text and image datasets, respectively.
A PPT that gives a brief introduction to Machine Learning and to products developed using Machine Learning algorithms, using both text and a few images in the slides as part of the explanation. It includes examples of products like Google Cloud Platform, Cozmo (a tiny robot built using Artificial Intelligence), IBM Watson, and many more.
How ML Can Improve Purchase Conversions (Sudeep Shukla)
- What is Machine Learning and what problems can it solve?
- Basic Machine Learning models
- Data gathering and data cleaning
- Parameters for judging whether the model is performing well
- Making it easy for sales & marketing teams to use the ML program
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu... (Simplilearn)
This presentation on "Machine Learning Engineer Salary, Skills & Resume" will help you understand who a Machine Learning engineer is, the salary of a Machine Learning engineer, the skills required to become one, and what a Machine Learning engineer's resume should look like. Machine Learning is the study of algorithms and data models that computer systems use to perform specific tasks without explicit instructions, relying instead on patterns learned from previous data. To make this possible, a Machine Learning engineer is required. Now, let us get started and understand what the job of a Machine Learning engineer looks like.
Below are the topics that we will be discussing in the presentation:
1. Introduction to Machine Learning
2. Responsibilities of a Machine Learning engineer
3. Salary Trends of a Machine Learning engineer
4. Skills of a Machine Learning engineer
5. Resume of a Machine Learning engineer
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised, and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms, including deep learning, clustering, and recommendation systems.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at https://www.simplilearn.com/big-data-and-analytics/machine-learning-certification-training-course
Application of Python in Medical Science (Aditya Nag)
Python is an interpreted high-level general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs, as well as its object-oriented approach, aim to help programmers write clear, logical code for small and large-scale projects.
These slides are from a presentation on understanding Machine Learning at a high level. The talk touches on linear regression, neural networks, and how Deep Learning fits into Machine Learning.
Introduction to Machine Learning and Artificial Intelligence Technologies. Discover the basics surrounding this tech, including business uses and evolution over time.
The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data.
Unit I and II Machine Learning MCA CREC.pptx (trishipaul)
A Machine Learning presentation covering the following topics:
Unit I – Introduction: Towards Intelligent Machines, Well posed Problems, Example of Applications in diverse fields, Data Representation, Domain Knowledge for Productive use of Machine Learning, Diversity of Data: Structured / Unstructured, Forms of Learning, Machine Learning and Data Mining, Basic Linear Algebra in Machine Learning Techniques.
Unit II – Supervised Learning – Rationale and Basics: Learning from Observations: Why Learning Works, Bias and Variance: Computational Learning Theory, Occam's Razor Principle and Overfitting Avoidance, Heuristic Search in Inductive Learning, Estimating Generalization Errors, Metrics for Assessing Regression, Metrics for Assessing Classification.
SEAMLESS AUTOMATION AND INTEGRATION OF MACHINE LEARNING CAPABILITIES FOR BIG ... (ijdpsjournal)
The paper proposes a solution for designing and developing seamless automation and integration of machine learning capabilities for Big Data, with the following requirements: 1) the ability to seamlessly handle and scale very large amounts of unstructured and structured data from diversified and heterogeneous sources; 2) the ability to systematically determine the steps and procedures needed for analyzing Big Data datasets based on data characteristics, domain expert inputs, and a data pre-processing component; 3) the ability to automatically select the most appropriate libraries and tools to compute and accelerate the machine learning computations; and 4) the ability to perform Big Data analytics with high learning performance but minimal human intervention and supervision. The whole focus is to provide a seamless, automated, and integrated solution that can be effectively used to analyze Big Data with high-frequency and high-dimensional features across different data characteristics and application problem domains, with high accuracy, robustness, and scalability. The paper highlights the research methodologies and research activities that we propose Big Data researchers and practitioners conduct in order to develop and support seamless automation and integration of machine learning capabilities for Big Data analytics.
If you’re learning data science, you’re probably on the lookout for cool data science projects. Look no further! We have a wide variety of guided projects that’ll get you working with real data in real-world scenarios while also helping you learn and apply new data science skills.
The projects in the list below are also designed to help you get a job! Each project was designed by a data scientist on our content team, and they’re representative examples of the real projects working data analysts and data scientists do every day. They’re designed to guide you through the process while also challenging your skills, and they’re open-ended so that you can put your own twist on each project and use it for your data science portfolio.
You can complete each project right in your browser, or you can download the data set to your computer and work locally! If you work on our site, you’ll also be able to download your code at any time so that you can continue locally, or upload your project to GitHub.
The sky is the limit here and what you decide to look into further is completely up to you and your imagination!
1. Learning by Doing
Learning by doing refers to a theory of education expounded by the American philosopher John Dewey. It is a hands-on approach to learning: students must interact with their environment in order to adapt and learn. This way of learning sharpens your current skills and knowledge and also helps you gain new skills that can only be acquired by doing.
Driving a car is a perfect example. You can read as much as you like about the theory of driving and the rules, and this is very important; the more you understand the theory, the better you get at the practical part. But you will only drive better by applying this knowledge on a real road. In addition, some skills and knowledge can only be gained by actually driving.
Data science is the same as driving. It is very important to have solid theoretical knowledge and to deepen it regularly so you can get better while working on a project. However, you should always apply this theoretical knowledge to projects. By doing so, you will deepen your understanding of these concepts, gain a better view of how they work in real life, and also show others that you have strong theoretical knowledge and can put it into practice.
There are different types of guided projects, and they bring a lot of benefits:
They remove the barriers between you and doing projects.
They save you time thinking about the project and preparing the data.
They let you apply theoretical knowledge without getting distracted by obstacles.
They offer practical tips that can save you effort and time in the future.
Machine Learning: Need of Machine Learning, Its Challenges and Its Applications (Arpana Awasthi)
The BCA Department of JIMS Vasant Kunj-II is one of the best BCA colleges in Delhi NCR. The curriculum is kept up to date, and the subjects include all the latest in-demand technologies.
The JIMS BCA course teaches Python to second-semester students and Artificial Intelligence Using Python to sixth-semester students.
Here is a small article on the Future of Machine Learning, hope you will find it useful.
Machine Learning is a field of computer science in which computer systems are able to learn from past experiences, examples, and environments. With the help of various Machine Learning algorithms, computers are given the ability to sense data and produce relevant results.
Machine learning algorithms provide techniques for predicting future outcomes or classifying information from a given input so that appropriate decisions can be made.
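As a small sketch of "predicting future outcomes from given input," here is ordinary least-squares fitting of a line in plain Python. The hours-vs-score data points are made up for illustration; they are not from any of the presentations above.

```python
# Ordinary least squares for a single feature: fit y = a*x + b to past
# observations, then use the fitted line to predict an unseen input.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical past observations: hours studied vs. exam score.
hours  = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
a, b = fit_line(hours, scores)
print(round(a * 6 + b, 1))  # predict the score for 6 hours of study
```

The model is "trained" by computing the slope and intercept from past data, and the prediction for an unseen input is read off the fitted line, which is the simplest instance of the predict-from-data pattern described above.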
what-is-machine-learning-and-its-importance-in-todays-world.pdf (Temok IT Services)
Machine Learning is an AI method for teaching computers to learn from their mistakes. Machine learning algorithms can "learn" directly from data, without using a predetermined equation as a model, by employing computational methods.
Unlocking the Potential of Artificial Intelligence: Machine Learning in Pract... (eswaralaldevadoss)
Machine learning is a subset of artificial intelligence that involves training computers to learn from data and make predictions or decisions based on that data. It involves building algorithms and models that can learn patterns and relationships from data and use that knowledge to make predictions or take actions.
Here are some key concepts that can help beginners understand machine learning:
Data: Machine learning algorithms require data to learn from. This data can come from a variety of sources such as databases, spreadsheets, or sensors. The quality and quantity of data can greatly impact the accuracy and effectiveness of machine learning models.
Training: In machine learning, training involves feeding data into a model and adjusting its parameters until it can accurately predict outcomes. This process involves testing and tweaking the model to improve its accuracy.
Algorithms: There are many different algorithms used in machine learning, each with its own strengths and weaknesses. Common machine learning algorithms include decision trees, random forests, and neural networks.
Supervised vs. Unsupervised Learning: Supervised learning involves training a model on labeled data, where the desired outcome is already known. Unsupervised learning, on the other hand, involves training a model on unlabeled data and allowing it to identify patterns and relationships on its own.
Evaluation: After training a model, it's important to evaluate its accuracy and performance on new data. This involves testing the model on a separate set of data that it hasn't seen before.
Overfitting vs. Underfitting: Overfitting occurs when a model is too complex and fits the training data too closely, leading to poor performance on new data. Underfitting occurs when a model is too simple and fails to capture important patterns in the data.
Applications: Machine learning is used in a wide range of applications, from predicting stock prices to identifying fraudulent transactions. It's important to understand the specific needs and constraints of each application when building machine learning models.
Overall, machine learning is a powerful tool that can help businesses and organizations make more informed decisions based on data. By understanding the basic concepts and techniques of machine learning, beginners can begin to explore the potential applications and benefits of this exciting field.
A brief introduction to DataScience with explaining of the concepts, algorithms, machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, real-world applications etc.
It's part of a Data Science Corner Campaign where I will be discussing the fundamentals of DataScience, AIML, Statistics etc.
Mixed Methods Research in the Age of Big Data: A Primer for UX ResearchersUXPA International
What does UX research entail in what some are calling the “Age of Data Science?” Most would agree that some level of collaboration is needed -- Data Science results feeding UX Research and vice versa -- but can this be more meaningful than simply attending each other’s readouts?
In this session, you’ll hear some practical, approachable tips for qualitative UX Researchers to play a larger role in Big Data discussions. Stats expertise not required! These tips will help you break through the lexicon barriers between UX Research and Data Science, and provide a framework for collaboration that can lead to even more impactful research.
UXPA 2016: Mixed Methods Research in the Age of Big DataZachary Sam Zaiss
UX professionals have a long history of blending quantitative and qualitative research to better understand the customer experience. As Data Science has emerged as a discipline (with an increasing amount of hype), it's all too easy to engage only during results time, sharing information but working independently. At UXPA 2016, I made the case for deeper collaboration between UX professionals and Data Scientists during research and analysis time, for the sake of better Design outcomes for all.
Machine Learning with Azure and Databricks Virtual WorkshopCCG
Join CCG and Microsoft for a hands-on demonstration of Azure’s machine learning capabilities. During the workshop, we will:
- Hold a Machine Learning 101 session to explain what machine learning is and how it fits in the analytics landscape
- Demonstrate Azure Databricks’ capabilities for building custom machine learning models
- Take a tour of the Azure Machine Learning’s capabilities for MLOps, Automated Machine Learning, and code-free Machine Learning
By the end of the workshop, you’ll have the tools you need to begin your own journey to AI.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
3. What to expect
To get an introduction to machine learning by walking through the steps
of a real project as an example.
1. What does a data scientist do
2. Basic ideas in machine learning
3. Demo: a machine learning application
4. Data preparation for predictive modeling
5. Building predictive models: Decision trees and forests
6. Validating predictive models
7. Measuring performance of predictive models
7. Prediction
The ability to make reliable predictions about future events by using the
patterns seen in historical data.
Examples:
- Which one of my customers will end their contract
based on their mobile phone usage data?
- Given the friendship graph of my users,
what new connections are likely to be made?
8. Anomaly detection
Uncovering unusual events, potential frauds by noticing deviation of the data
from what is normal.
Examples:
- It could be suspicious if a customer suddenly
consumes much less power than is usual for
them, according to the data from the meters.
- By knowing which commands each user typically
issues, I am able to recognize weird, outlying
operations on a computer.
9. Gaining insights
Extracting hidden connections, knowledge about our customers, products,
business processes.
Examples:
- Based on the data about their visits, we can
discover typical segments of users and observe
in which respects they use our web site similarly
- We crawl Twitter for thousands of pieces of user
feedback and learn the general sentiment and
emotions towards our company
10. Making valid decisions
The ability to validate business-related hypotheses or compare
alternatives in a mathematical sense.
Examples:
- Will the subscription rate drop if we change the text
used in my email marketing campaign?
- How do I redesign my web page to maximize the time
spent by the visitors?
1) Define an experiment. 2) Measure the results on a
sample. 3) Infer the properties of the whole population.
11. What do we do?
We are building a data driven IT security product.
The software aims to find anomalies in IT security related system logs.
The behavior of the users of the IT system is analyzed, and if unusual
behavior is detected, alerts are raised.
This helps the work of a company's IT security experts by drawing their
attention to the most important events in the system.
13. What is Big Data?
From a technical point of view:
“a term for data sets that are so large or complex that traditional data processing
applications are inadequate” (Wikipedia) -> Infrastructure-wise Big Data
From a layman’s point of view:
“extremely large data sets that may be analysed computationally to reveal
patterns, trends, and associations, especially relating to human behaviour and
interactions” (Oxford Dictionary) -> Impact-wise Big Data
14. What is Big Data?
The 3 Vs (Gartner):
- Volume: the data to be processed takes GBs or TBs of space
- Velocity: new data arrives frequently, at high speed
- Variety: the data has a variety of formats and cannot be stored in
tabular (relational) form
15. Data scientist vs Big Data
Data scientists: professionals who deal with big data.
17. Types of learning
Understanding (meaningful learning):
What you learn converts to understanding of the concept.
Your knowledge is general: you can apply it to new situations.
Memorizing (rote learning):
What you learn can be quickly recalled, but is superficial and cannot
be applied in another context.
18. Computers and memorization
Computers are the best at accurately storing and recalling huge
amounts of data: documents, dictionaries, bits of video files, etc.
But this is solely memorizing. Do they understand what's going on?
If a student memorizes a question bank before the exam without
understanding a word, she might pass an exam containing the same
questions, but fail when answering new questions.
19. Machine learning
Machine learning is about making computers able to learn from
examples.
The goal is, after having seen many examples, to find patterns that
generalize well enough to be used in future situations.
Can a student actually gain understanding
by seeing questions and their answers?
20. Meaningful machine learning
As data scientists, while using machine learning as a tool, our most
important task is to prevent memorizing (called overfitting in this
context), because we want to use the acquired knowledge on new
examples in the future.
Although the machine will never "understand" the data,
we can motivate our algorithms to find trends,
correlation structures, and connections.
21. Nature of machine learning
A machine processes (learns from) more data than a human; it can deal
with amounts of data that we cannot.
With machines, learning can be automated;
machines deal with repetitive tasks more easily than humans.
The patterns found by the machines will never be perfect,
but given enough examples of appropriate quality and quantity, they will
be useful.
22. When to use machine learning
If the following conditions hold:
1) There is a pattern to be learned; a pattern between the questions
(inputs) and answers (output)
2) We cannot formulate the pattern mathematically
3) We have enough data (examples) for learning
(Abu-Mostafa: Learning from Data)
23. When not to use machine learning
To find out, for instance:
- The winning numbers of next week’s lottery (no pattern)
- The area of a triangle (can be formulated)
- The time of the next financial crisis (not enough data)
25. Learning game (Abu-Mostafa: Learning from Data)
Takeaways:
- There is no single solution, but there are many possible ones
- The number of examples seen during learning raises our
confidence in our solution
26. Key aspects to consider in a machine learning task
Data: What are the examples, and how do we get them?
Unit of observation: What is considered one example?
Observed features: What attributes do we store about an example?
Observed target variable: What is the attribute we want to be able to predict?
Outcome: What is the meaning of the predicted target variable?
Business case: How can we use the predictions?
27. Predict if an employee wants to quit
Data: Personal and work-related data from the HR database
Unit of observation: One employee
Observed features: Overtime, effectiveness, patterns in days off and sick days, commuting time, etc.
Observed target variable: Who quit in the past?
Outcome: What are the chances of someone quitting?
Business case: Prevent quitting by focused countermeasures, e.g. mentoring.
28. Predicting flight delays
Data: Air traffic data from airport systems
Unit of observation: A single flight from A to B
Observed features: Origin, destination, airline, day of year, weather
Observed target variable: Delay in minutes
Outcome: Prediction of punctuality
Business case: What is the expected loss on delays?
29. Biometric authentication with mouse dynamics
Data: Server logs about user sessions
Unit of observation: A single movement of the mouse cursor from A to B
Observed features: Length, straightness, speed
Observed target variable: The username of the user
Outcome: Anomaly level of a user session
Business case: Improved security with automatic alerts
30. Classify the mood of music
Data: 500 mp3 files
Unit of observation: A song in mp3
Observed features: ?
Observed target variable: Manually defined labels, either "cheerful" or "blue"
Outcome: ?
Business case: ?
31. How to represent in data table format
[Diagram: a data table whose header row names the features (observed
attributes) and the observed target variable; each further row is one
example (data point).]
34. Data preprocessing
A REPRODUCIBLE PROCESS turns raw data :( into data ready to be analyzed :).
Typical steps: parsing raw data, handling character encodings and date
formats, choosing data representations, joining data tables,
aggregations, pivoting.
37. Raw data of a session
record timestamp client timestamp button state x y
1434623080.316000 4053743.247000 NoButton Move 686 281
1434623080.419000 4053743.357000 NoButton Move 687 287
1434623080.615000 4053743.559000 Left Pressed 687 287
1434623080.745000 4053743.684000 Left Released 687 287
1434623081.557000 4053744.495000 NoButton Move 690 288
1434623081.667000 4053744.605000 NoButton Move 742 300
… (some 10k lines)
39. The target variable: what is the goal of the analysis?
[Diagram: the data table again, highlighting the observed target
variable column next to the feature columns; rows are the examples
(data points).]
40. The examples: what would be appropriate as an example?
[Diagram: the data table with the target column now labeled
"Made by user?", holding 1/0 values; rows are the examples
(data points).]
41. Gesture
A gesture: moving the cursor from one point to another in one go.
- Large enough to capture the mouse moving characteristics of a user,
- Small enough to have a lot of them to learn from.
A possible definition of a gesture:
We process the raw file from the beginning row-by-row. At each step,
if the time difference is larger than 0.3 sec, or a mouse button is
pressed, the current gesture ends and a new one starts.
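The gesture definition above can be sketched in code. This is a minimal sketch, assuming each event is simplified to a `(timestamp, button, state, x, y)` tuple (the raw rows shown earlier carry two timestamps; one is enough here), and the function name is illustrative:

```python
# Hypothetical sketch: split a stream of mouse events into gestures.
# A gesture ends when the time gap exceeds 0.3 s or a button is pressed.
def split_into_gestures(events, max_gap=0.3):
    gestures, current = [], []
    prev_ts = None
    for ts, button, state, x, y in events:
        gap_too_long = prev_ts is not None and ts - prev_ts > max_gap
        if current and (gap_too_long or state == "Pressed"):
            gestures.append(current)  # close the current gesture
            current = []
        current.append((ts, x, y))    # keep only what later steps need
        prev_ts = ts
    if current:
        gestures.append(current)
    return gestures
```

Each returned gesture is a list of `(timestamp, x, y)` points, ready for the feature engineering step that follows.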
43. The features: what are appropriate features of a gesture?
[Diagram: the data table with gestures as rows, feature columns still
to be defined, and the "Made by user?" target column.]
44. Feature engineering
What properties of gestures can be defined that might be useful in
differentiating between users?
[Diagram: a gesture as a sequence of points (ts0, x0, y0),
(ts1, x1, y1), ..., (tsn, xn, yn), ending in a click.]
45. Feature engineering
What properties of gestures can be defined that might be useful in
differentiating between users?
[Diagram: the same gesture as points (ts0, x0, y0) ... (tsn, xn, yn),
ending in a click.]
Duration: tsn - ts0
Path length: sum of distances between consecutive points
Avg. speed: path length / duration
Time to click: time spent between the last move and the click (if any)
Mean/std/etc. of (consecutive) speed/acceleration/etc. values
Also: angles
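The first three features listed above can be computed directly from a gesture's points. A hedged sketch (the function name is illustrative; a gesture is a list of `(timestamp, x, y)` points):

```python
import math

# Hypothetical sketch of the duration / path length / average speed
# features for one gesture.
def gesture_features(points):
    ts = [p[0] for p in points]
    duration = ts[-1] - ts[0]
    # path length: sum of distances between consecutive points
    path_length = sum(
        math.dist(points[i][1:], points[i + 1][1:])
        for i in range(len(points) - 1)
    )
    avg_speed = path_length / duration if duration > 0 else 0.0
    return {"duration": duration,
            "path_length": path_length,
            "avg_speed": avg_speed}
```

One such dictionary per gesture gives exactly the feature rows of the table on the next slide.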
46. The data set is now ready to be analyzed

Avg speed (pixel/sec)   Duration (sec)   ...   Made by user?
34.5                    5                ...   1
12.1                    3                ...   1
1.23                    12               ...   0
55.9                    3                ...   0

(Rows: gestures; columns: features (observed attributes) and the
observed target variable.)
48. Outline of predictive modeling
We have many observations about a certain event, process, etc. Each
observation pairs several features with a target variable.
With a learning algorithm and our data we aim to build a predictive
model that learns the typical value of the target for any
combination of feature values.
We can then use the model to predict the value of the target of
(new) observations solely based on their features.
49. Example: wine prices

Rain during harvest (mm)   Mean temperature in May (°C)   ...   Price (€)
18                         5                              ...   2.5
200                        4                              ...   16
180                        10                             ...   250
100                        2                              ...   9.5

(Rows: examples (data points); columns: features (observed attributes)
and the observed target variable.)
50. Example: the Titanic data set

sex      fare (£)   ...   survived
male     200        ...   1
female   40         ...   1
female   150        ...   1
male     40         ...   0

(Rows: examples (data points); columns: features (observed attributes)
and the observed target variable.)
51. Prediction problems
The two main types of prediction problems are:
- Classification: the target variable is a categorical variable (e.g.,
yes/no decision, letters to be recognized)
- Regression: the target variable is a continuous variable (e.g., age,
income, stock prices)
For both classification and regression, there are hundreds of learning
algorithms to choose from. Picking one is a problem in itself and
influences the success of the project.
52. Example of regression (1D)
Each blue point is an observation. We have to build a model that can tell
the income based on the age of the client.
[Scatter plot: monthly income (y) vs. age (x).]
53. Example of regression (1D)
The task translates to fitting a curve to the points that we see!
[Same scatter plot: monthly income (y) vs. age (x).]
54. Example of regression (1D)
1st solution: connecting the dots.
[Scatter plot with the points connected one to the next.]
55. Example of regression (1D)
2nd solution: draw a straight line through the points.
[Scatter plot with a straight line fitted through the points.]
58. Decision tree
We make predictions about the target (y) by answering
questions about the features (x1, ..., xn).
An answer to a question either leads to the next question or directly to a
prediction.
We store the series of decisions in a tree structure. The leaves contain the
predictions. Each node that is not a leaf contains a question.
59. Example: Titanic decision tree
Male?
  yes -> Age >= 10?
    yes -> dies
    no -> Family members on ship >= 3?
      yes -> dies
      no -> survives
  no -> survives
60. Building a tree (2D)
Let us build a decision tree to decide whether an article on a news portal will
be popular or not! We have two features: # of photos, # of paragraphs.
[Scatter plot: articles plotted by # photos and # paragraphs;
legend: popular / not popular.]
61. Building a tree (2D)
Cut the space based on # of paragraphs: # paragraphs > 10?
[Plot: the feature space cut at # paragraphs = 10.]
62. Building a tree (2D)
Cut the space based on # of paragraphs.
Tree so far:
# paragraphs > 10?
  yes -> not popular
[Plot: the # paragraphs > 10 region labeled "not popular".]
63. Building a tree (2D)
The rest is not homogeneous enough; we proceed with cutting.
[Plot: the # paragraphs <= 10 region still contains both popular and
not popular points.]
64. Building a tree (2D)
Cut the rest of the space based on # of photos: # photos > 6?
[Plot: the remaining region cut at # photos = 6.]
65. Building a tree (2D)
Cut the rest of the space based on # of photos.
Tree so far:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
[Plot: the # photos > 6 region labeled "popular".]
66. Building a tree (2D)
The rest is not homogeneous enough; we proceed with cutting.
[Plot: the region with # paragraphs <= 10 and # photos <= 6 is still
mixed; the next cut will be at # paragraphs = 2.]
67. Building a tree (2D)
Cut again based on # of paragraphs: # paragraphs < 2?
[Plot: the remaining region cut at # paragraphs = 2.]
68. Building a tree (2D)
Cut again based on # of paragraphs.
Tree so far:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
    no -> # paragraphs < 2?
      yes -> not popular
[Plot: the # paragraphs < 2 region labeled "not popular".]
69. Building a tree (2D)
The rest is homogeneous enough; we stop cutting the space.
The finished tree:
# paragraphs > 10?
  yes -> not popular
  no -> # photos > 6?
    yes -> popular
    no -> # paragraphs < 2?
      yes -> not popular
      no -> popular
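The finished article-popularity tree from the slides above translates directly into nested questions in code. A sketch (the function name is illustrative; the thresholds 10, 6 and 2 are the cuts made on the slides):

```python
# Hypothetical sketch: the decision tree from the slides as nested ifs.
def is_popular(n_paragraphs, n_photos):
    if n_paragraphs > 10:
        return False  # not popular
    if n_photos > 6:
        return True   # popular
    if n_paragraphs < 2:
        return False  # not popular
    return True       # popular
```

Following the ifs from top to bottom is exactly walking from the root of the tree down to a leaf.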
70. Using the tree for prediction
What will be the popularity of a new article according to the tree?
[Plot: a new article placed in the cut-up space; we answer the tree's
questions for it until we reach a leaf.]
71. What if we do not stop cutting the space?
Take this task as an example.
72. What if we do not stop cutting the space?
We cut the space to fully homogeneous areas.
73. What if we do not stop cutting the space?
You see that red area in the middle of a large blue one?
74. What if we do not stop cutting the space?
You see that red area in the middle of a large blue one? It is more like the
result of "getting lost in the details" than of seeing the true trend.
If a new point that is in fact blue accidentally falls there, it will be
classified as red.
75. Stopping criteria
A couple of rules for trees to prevent getting lost in the
details, i.e., growing too large and overfitting:
- They cannot have more levels than x,
- We do not cut areas with less than x points,
- We do not cut areas if one of the new areas would
have less than x points,
- We do not cut areas that are homogeneous enough
(as measured by entropy, Gini index etc.)
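The rules above can be sketched as a single check run before each cut. This is a minimal sketch with illustrative names; every numeric limit stands in for the "x" of the slide and is a placeholder, and homogeneity is approximated by the share of the majority class rather than entropy or the Gini index:

```python
MAX_DEPTH = 5             # "no more levels than x"
MIN_POINTS_TO_CUT = 10    # "do not cut areas with less than x points"
MIN_POINTS_PER_CHILD = 3  # "no new area with less than x points"
MAX_HOMOGENEITY = 0.95    # "do not cut areas homogeneous enough"

# Hypothetical sketch: decide whether to stop cutting at a node.
# `labels` are the class labels in the area; `split`, if given, is the
# pair of label lists the proposed cut would produce.
def should_stop(depth, labels, split=None):
    if depth >= MAX_DEPTH or len(labels) < MIN_POINTS_TO_CUT:
        return True
    majority = max(labels.count(c) for c in set(labels)) / len(labels)
    if majority >= MAX_HOMOGENEITY:  # homogeneous enough already
        return True
    if split is not None and min(len(s) for s in split) < MIN_POINTS_PER_CHILD:
        return True
    return False
```

Libraries expose the same ideas as hyperparameters (e.g. maximum depth, minimum samples per split or leaf).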
76. Random forests
In a forest there are several independent trees. Each tree grows seeing a
different random part of the whole data set.
When making a prediction, the trees of the forest vote on the answer.
[Diagram: many trees, each producing its own prediction for the same
"?" input.]
77. Aggregating the gesture-level predictions
After learning, the forest can predict whether a gesture was legal or not.

Avg. speed (pixel/sec)   Duration (sec)   ...   Made by user? (prediction)
34.5                     5                ...   1
12.1                     3                ...   1
1.23                     12               ...   0
55.9                     3                ...   0
78. Aggregating the gesture-level predictions
We need to make a decision about a whole session of a user!
For this, we aggregate (average) the predictions for the gestures in the
whole session:
- If the average is < 0.5, the session is regarded as illegal
- If the average is > 0.5, the session is regarded as legal

Avg. speed (pixel/sec)   Duration (sec)   ...   Made by user? (prediction)
34.5                     5                ...   1
12.1                     3                ...   1
1.23                     12               ...   0
55.9                     3                ...   0
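The session-level decision above can be sketched in a few lines. The function name is illustrative, the 0.5 threshold comes from the slide, and treating an exact tie as illegal is an assumption (the slide leaves the tie case open):

```python
# Hypothetical sketch: average the per-gesture 0/1 predictions of a
# session and compare against the 0.5 threshold.
def classify_session(gesture_predictions, threshold=0.5):
    avg = sum(gesture_predictions) / len(gesture_predictions)
    # ties fall on the cautious (illegal) side by assumption
    return "legal" if avg > threshold else "illegal"
```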
80. Overfitting is bad, what to do about it?
We are afraid of more complex models, but we need them!
How should we decide the amount of complexity that is JUST ENOUGH?
● A good model fits the known examples (obviously) but also fits unseen
examples
● That is the point: predicting the outcome of unseen examples is similar to
predicting examples from the future
● We can simulate having new examples by slicing the known dataset into two
parts:
○ Training dataset: examples only for training
○ Test dataset: examples only for measuring performance
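Slicing the known dataset into the two parts can be sketched with no library at all. The names, the 30% test ratio and the fixed seed are illustrative choices:

```python
import random

# Hypothetical sketch: shuffle the known examples, then slice off a
# test portion; the rest becomes the training set.
def train_test_split(examples, test_ratio=0.3, seed=42):
    rng = random.Random(seed)      # fixed seed -> reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # train, test
```

Shuffling first matters: if the file is sorted (e.g. by user or by date), an unshuffled slice would give the model a biased view.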
82. Validation
[Diagram: the table of every known example, with its feature and
target columns, split into a training set and a test set.]
83. How is it done?
One can increase the complexity of the learning model as long as the
goodness of fit on the UNSEEN data increases.
● The goodness of fit on the training set will increase until full overfitting
● The goodness of fit on the test set will increase, but only to a certain
point
We can visualize this with the learning curve.
84. The learning curve
"The goodness of fit on the training set will increase until full overfitting."
That is, the error will decrease on the training set until full overfitting.
[Plot: error of model (y) vs. amount of complexity we allow (x); the
training-dataset curve keeps decreasing.]
85. The learning curve
"The goodness of fit on the test set will increase but just to a certain point."
That is, the error on the test set will decrease but just to a certain point.
[Plot: same axes; the unseen-test-dataset curve decreases, then rises
again while the training-dataset curve keeps decreasing.]
86. The learning curve
After the optimal point, every "bit of knowledge" the model gains is not
general, but data-specific knowledge about the particular training
dataset it sees.
[Plot: same axes; the optimal complexity lies where the
unseen-test-dataset error is lowest.]
87. Validation
There are some techniques (e.g., cross validation) which try to eliminate this loss of
information by selecting different parts of the known data as training sets and then
aggregating the results of these different scenarios.
We sacrifice some data (and potentially
information) but we gain objective,
measurable knowledge about how well
our model will perform “out there”.
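The cross-validation idea above can be sketched as generating k train/test index splits; each fold serves once as the test set while the others form the training set, and the k evaluation results are then averaged. A minimal sketch (the function name is illustrative; real implementations also shuffle and can stratify by class):

```python
# Hypothetical sketch of k-fold cross validation index splits.
def k_fold_indices(n_examples, k=5):
    # fold i holds examples i, i+k, i+2k, ...
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for test_fold in folds:
        train_idx = [j for fold in folds if fold is not test_fold
                     for j in fold]
        yield train_idx, test_fold
```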
89. Measuring performance
What do we mean exactly by “goodness of fit”?
We would like to have minor differences between the predictions and the
real value of the target attributes from the test data set.
If our problem is regression (the truth is a continuous variable):
● Add up the differences between the prediction and the truth over all
examples;
● The smaller the sum, the better our model.
● An exact match is rare, but a close guess is usable.
● E.g.: RMSE, the root of the mean squared error
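The RMSE mentioned above can be computed directly: square each per-example difference, average the squares, and take the root. A sketch (the function name is illustrative):

```python
import math

# Hypothetical sketch: root of the mean squared error.
# Smaller RMSE means a better regression fit.
def rmse(predictions, truths):
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, truths))
        / len(truths)
    )
```

Squaring makes large misses count disproportionately more than many small ones, which is often the desired behavior.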
90. Measuring performance
What do we mean exactly by “goodness of fit”?
We would like to have minor differences between the predictions and the
real value of the target attributes from the test data set.
If our problem is classification:
● If the predicted class misses the true class, there is no magnitude of
error. Not correct is not correct. (There is no "slightly pregnant
woman".)
● Counting the rate of correct predictions seems like a good idea,
but it is not a great one.
91. What can a classifier model do?
Not so many things, considering two classes, namely: “positive” and “negative”:
Predicts “Positive” when the reality is “Positive”
Predicts “Positive” when the reality is “Negative”
Predicts “Negative” when the reality is “Negative”
Predicts “Negative” when the reality is “Positive”
92. Let's make a small table
If we rearrange the smiley faces into a table:

                Reality: +        Reality: -
Prediction +    True Positive     False Positive
Prediction -    False Negative    True Negative

The confusion matrix catches 'em all.
(The most important 2-by-2 matrix in machine learning.)
93. Let's make a small table
With a perfect model:

                Reality: +   Reality: -
Prediction +        5            0
Prediction -        0            5

● 5 positive and 5 negative cases in the dataset to be predicted.
● Every prediction is correct.
94. Let's make a small table
A more realistic scenario:

                Reality: +   Reality: -
Prediction +        4            1
Prediction -        1            4

● 5 positive and 5 negative cases in the dataset to be predicted.
● There is one misclassified case for each class.
95. Accuracy = the rate of the correctly classified cases.

                Reality: +   Reality: -
Prediction +      985            5
Prediction -        5            5

With the confusion matrix:
sum(blue cells, the correct predictions) / sum(all cells) = 990/1000 = 99%
Is this a good model?
Note that in a case like this, the model is likely to be
used to spot the NEGATIVE events. (Those are the
rare, interesting cases.)
This particular model performs awfully on
those cases: half of them are misclassified!
Remember the earlier comment on measuring
performance?
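Recomputing the imbalanced example above makes the trap explicit: 990 cases of the common class, 10 cases of the rare class the model is meant to spot, and half of the rare cases missed. A sketch (variable names are illustrative):

```python
# Hypothetical sketch: accuracy vs. rare-class recall on the
# imbalanced confusion matrix from the slide.
correct_common, total_common = 985, 990  # common class: 5 missed
correct_rare, total_rare = 5, 10         # rare class: half missed

accuracy = (correct_common + correct_rare) / (total_common + total_rare)
rare_recall = correct_rare / total_rare

# accuracy comes out at 99% while only half of the rare, interesting
# cases are caught, which is why accuracy alone can mislead.
```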
96. Some performance measures...
There are several other methods which use the values of the confusion matrix in
order to evaluate a classification model.
The method needs to be chosen carefully for the purpose of the application.
97. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: points placed along a 0%-100% probability axis.]
This means that one can decide which probability is high enough to predict a
particular label. If a song seems to be 95% "cheerful", it is a safer bet than
one which is 52% "cheerful".
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
98. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: points placed along a 0%-100% probability axis.]
The good news: we have a much more detailed view of how the model works,
and of the amount of confidence it has in each prediction it makes.
The bad news: in order to retrieve discrete predictions, the user must decide how
to transform the probabilities into classes, i.e., define a probability threshold which
separates the classes.
99. Many classifiers don't give strict verdicts
Though the target variable might be a discrete variable (orange/green), in practice
classifier models give class probabilities back (e.g., X% chance of
being green).
[Diagram: the 0%-100% axis with all orange points on the left and all
green points on the right.]
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
This is an amazing model! We can find a point in the middle which separates the
points into two groups that match the original two categories 100%.
100. Remember... we don't have perfect models :(
What should we do when our model outputs something more realistic, like this:
[Diagram: the 0%-100% axis; the two colors are separated at the sides
but mixed in the middle.]
Legend:
Color: the true class of a particular event, known from the test dataset
Position: the probability of being in the green class, as estimated by the model.
On the two sides, the picture is clear. But there are some borderline cases where
there is some confusion.
Two questions arise:
● How should we find a good threshold?
● How can we evaluate a model without a pre-defined threshold?
101. How should we find a good threshold?
Finding a threshold depends on the application and the problem domain itself, and
has little to do with machine learning.
- A threshold with a low false positive rate is needed before applying a risky treatment.
- A threshold with a low false negative rate is needed before a blood transfusion.
[Diagram: the 0%-100% probability axis with two candidate thresholds, A and B.]
- Towards the left-hand side, we classify every green correctly but misclassify a lot of
oranges as greens. This means a lot of false positives.
- Towards the right-hand side, we classify every orange correctly but misclassify a lot of
greens as oranges. This means a lot of false negatives.
102. How do we evaluate without a pre-defined threshold?
[Diagram: the 0%-100% axis, considering ALL thresholds between A and B.]
Every application has to deal with the
false positive vs. false negative trade-off,
and they deal with it differently.
Regardless of the application, we have to
be able to tell objectively whether one model
is better than another.
Why not compute the false positives
and false negatives for EVERY threshold,
and look at a particular model by
considering all these different scenarios?
103. ROC curve (Receiver Operating Characteristic)
Every point on the red curve shows the
corresponding rate of false positives and
true positives for a particular threshold.
The dotted line is a random model.
The further the red line is from the
dotted line, the better the model.
[Plot: "How many true positives at a threshold?" (y, 0%-100%) vs.
"How many false positives at a threshold?" (x, 0%-100%); the red ROC
curve bows above the dotted diagonal.]
104. ROC curve (Receiver Operating Characteristic)
AUC = "Area Under Curve"
The bigger the area under the red line,
the better the model.
The area under the dotted line: 0.5
The perfect model: 1.0
A perfect model is one where "you can decrease the false positive rate
to 0, and in that process, you don't generate any false negatives."
[Plot: same axes as on the previous slide.]
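AUC can also be computed without tracing the curve, via its rank interpretation: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting half). A sketch (the function name is illustrative):

```python
# Hypothetical sketch: AUC via the rank interpretation.
def auc(probs, labels):
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    # count positive-vs-negative comparisons won; ties count half
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A perfect separation gives 1.0, a model no better than chance gives about 0.5, matching the areas described above.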
105. Wrap up
- Data science as a field is big and diverse; machine learning is a key
tool to master
- Given enough examples, machines can learn
- Learning is more complex than memorizing
- A great effort is needed to prepare the examples (features and target)
- The bigger challenge is not fitting a model but avoiding overfitting
- Several key decisions have to be made after a model has been
constructed