SlideShare a Scribd company logo
1 of 12
Download to read offline
11. Machine Learning System
Design:
PRIORITIZING WHAT TO WORK ON: BUILDING A SPAM CLASSIFIER:
If a word is to be found in the email, we would assign its respective
entry a 1, else 0.
So how could you spend your time to improve the accuracy of this
classifier?
ERROR ANALYSIS:
We can try out different quick and dirty implementations of ML
algorithms and learning curves.
Lets take a high error example:
For example, assume that we have 500 emails and our algorithm
misclassifies a 100 of them. We could manually analyse the 100
emails and categorize them based on what type of emails they are.
We could then try to come up with new cues and features that
would help us classify these 100 emails correctly.
Hence, if most of our misclassified emails are those which try to
steal passwords, then we could find some features that are
particular to those emails and add them to our model.
It is very important to get error results as a single, numerical value.
Otherwise it is difficult to assess your algorithm's performance.
For example, if we use stemming, which is the process of treating
the same word with different forms (fail/failing/failed) as one word
(fail), and get a 3% error rate instead of 5%, then we should
definitely add it to our model.
However, if we try to distinguish between upper case and lower-
case letters and end up getting a 3.2% error rate instead of 3%,
then we should avoid using this new feature.
Hence, we should try new things, get a numerical value for our
error rate, and based on our result decide whether we want to
keep the new feature or not.
ERROR METRICS FOR SKEWED CLASSES:
Suppose we train a model and it gives an error of 1%.
But in the training examples, only 0.5 % people actually have
cancer… so predicting y=0 always is better choice as it gives only
0.5% error
This is an ambiguous form of judgement.
So, we come up with a new form of error metric: Precision and
Recall
Precision and Recall are defined by setting the rare class = 1not =0…
Now, if the algo predicts y=0 all the time, then this will give Recall =
0
➢Classifier with high precision and high recall is a good classifier
➢Even with very skewed classes, the algo can’t cheat
But:
So, to get high precision:
This will give high precision → as no of predicted positives is
decreased
And low recall
If:
Then:
How to compare precision recall values bw two algos:
Previously we had a single no metric for error analysis, but now we
have two nos.
How to determine which is better?
Taking average is not a good choice
A NEW EVALUATION METRICS:
F1 score:also called F score:
➢To pick a threshold, try different values of threshold and one
with the best F-score is to be picked.
IMPORTANCE OF DATA IN MACHINE LEARNING:
Problems like confusable words problem are hard to entertain.
So what can we do to solve that ?
These algos were tried on different sizes of training data:
All of them give approx. same accuracy on a large dataset
It shows that:
So, for confusable words problem:
We have enough features that capture the surrounding words
o So, that’s enough info to predict the right answer
As a counter example:
We don’t have features like: how old the house is?
Does it have furniture?
Location?
Etc…
We have just the size!
An approach to determine if its possible to predict the o/p with
given set of features:
➢֍ In case of confusable words, a human expert can easily tell
the missing word accurately
➢֍ In case of housing price example: even a real estate expert
cant tell the exact priec of house based on just the size.
So, can having the lot of data help?
➢Assuming we have many features and a rich class of functions
→ low bias
➢Assuming we have huge huge dataset → low variance
→ This makes up for a very high performance algorithm.

More Related Content

Similar to 11 ml system design

Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docx
Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docxMath 009 Final Examination Spring, 2015 1 Answer Sheet M.docx
Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docxandreecapon
 
Important Classification and Regression Metrics.pptx
Important Classification and Regression Metrics.pptxImportant Classification and Regression Metrics.pptx
Important Classification and Regression Metrics.pptxChode Amarnath
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffRaman Kannan
 
Dimd_m_004 DL.pdf
Dimd_m_004 DL.pdfDimd_m_004 DL.pdf
Dimd_m_004 DL.pdfjuan631
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
Aptitude questions all topics (1)
Aptitude questions all topics (1)Aptitude questions all topics (1)
Aptitude questions all topics (1)SnehaNallabheem
 
Performance Metrics, Baseline Model, and Hyper Parameter
Performance Metrics, Baseline Model, and Hyper ParameterPerformance Metrics, Baseline Model, and Hyper Parameter
Performance Metrics, Baseline Model, and Hyper ParameterIndraFransiskusAlam1
 
Regularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxRegularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxMohamed Essam
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 
30019781969295_report
30019781969295_report30019781969295_report
30019781969295_reportPIYALI ROY
 
AMCAT REPORT
AMCAT REPORTAMCAT REPORT
AMCAT REPORTabdulsafa
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfDatacademy.ai
 
Lecture11.pptx
Lecture11.pptxLecture11.pptx
Lecture11.pptxSanjarBey
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeOsama Ghandour Geris
 
Proportion
ProportionProportion
Proportiontvierra
 
Proportion
ProportionProportion
Proportiontvierra
 
Sampling methods theory and practice
Sampling methods theory and practice Sampling methods theory and practice
Sampling methods theory and practice Ravindra Sharma
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview QuestionsRock Interview
 

Similar to 11 ml system design (20)

Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docx
Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docxMath 009 Final Examination Spring, 2015 1 Answer Sheet M.docx
Math 009 Final Examination Spring, 2015 1 Answer Sheet M.docx
 
Important Classification and Regression Metrics.pptx
Important Classification and Regression Metrics.pptxImportant Classification and Regression Metrics.pptx
Important Classification and Regression Metrics.pptx
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
Dimd_m_004 DL.pdf
Dimd_m_004 DL.pdfDimd_m_004 DL.pdf
Dimd_m_004 DL.pdf
 
Sample Size Determination
Sample Size DeterminationSample Size Determination
Sample Size Determination
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Aptitude questions all topics (1)
Aptitude questions all topics (1)Aptitude questions all topics (1)
Aptitude questions all topics (1)
 
Performance Metrics, Baseline Model, and Hyper Parameter
Performance Metrics, Baseline Model, and Hyper ParameterPerformance Metrics, Baseline Model, and Hyper Parameter
Performance Metrics, Baseline Model, and Hyper Parameter
 
Regularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptxRegularization_BY_MOHAMED_ESSAM.pptx
Regularization_BY_MOHAMED_ESSAM.pptx
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
30019781969295_report
30019781969295_report30019781969295_report
30019781969295_report
 
AMCAT REPORT
AMCAT REPORTAMCAT REPORT
AMCAT REPORT
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Lecture11.pptx
Lecture11.pptxLecture11.pptx
Lecture11.pptx
 
Machine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-codeMachine learning-in-details-with-out-python-code
Machine learning-in-details-with-out-python-code
 
Proportion
ProportionProportion
Proportion
 
4 1 ratio
4 1 ratio4 1 ratio
4 1 ratio
 
Proportion
ProportionProportion
Proportion
 
Sampling methods theory and practice
Sampling methods theory and practice Sampling methods theory and practice
Sampling methods theory and practice
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview Questions
 

More from TanmayVijay1

18 application example photo ocr
18 application example  photo ocr18 application example  photo ocr
18 application example photo ocrTanmayVijay1
 
1 Introduction to Machine Learning
1 Introduction to Machine Learning1 Introduction to Machine Learning
1 Introduction to Machine LearningTanmayVijay1
 
17 large scale machine learning
17 large scale machine learning17 large scale machine learning
17 large scale machine learningTanmayVijay1
 
16 recommender systems
16 recommender systems16 recommender systems
16 recommender systemsTanmayVijay1
 
15 anomaly detection
15 anomaly detection15 anomaly detection
15 anomaly detectionTanmayVijay1
 
14 dimentionality reduction
14 dimentionality reduction14 dimentionality reduction
14 dimentionality reductionTanmayVijay1
 
13 unsupervised learning clustering
13 unsupervised learning   clustering13 unsupervised learning   clustering
13 unsupervised learning clusteringTanmayVijay1
 
12 support vector machines
12 support vector machines12 support vector machines
12 support vector machinesTanmayVijay1
 
10 advice for applying ml
10 advice for applying ml10 advice for applying ml
10 advice for applying mlTanmayVijay1
 
9 neural network learning
9 neural network learning9 neural network learning
9 neural network learningTanmayVijay1
 
8 neural network representation
8 neural network representation8 neural network representation
8 neural network representationTanmayVijay1
 
6 logistic regression classification algo
6 logistic regression   classification algo6 logistic regression   classification algo
6 logistic regression classification algoTanmayVijay1
 
4 linear regeression with multiple variables
4 linear regeression with multiple variables4 linear regeression with multiple variables
4 linear regeression with multiple variablesTanmayVijay1
 
3 linear algebra review
3 linear algebra review3 linear algebra review
3 linear algebra reviewTanmayVijay1
 
2 linear regression with one variable
2 linear regression with one variable2 linear regression with one variable
2 linear regression with one variableTanmayVijay1
 

More from TanmayVijay1 (17)

18 application example photo ocr
18 application example  photo ocr18 application example  photo ocr
18 application example photo ocr
 
1 Introduction to Machine Learning
1 Introduction to Machine Learning1 Introduction to Machine Learning
1 Introduction to Machine Learning
 
17 large scale machine learning
17 large scale machine learning17 large scale machine learning
17 large scale machine learning
 
16 recommender systems
16 recommender systems16 recommender systems
16 recommender systems
 
15 anomaly detection
15 anomaly detection15 anomaly detection
15 anomaly detection
 
14 dimentionality reduction
14 dimentionality reduction14 dimentionality reduction
14 dimentionality reduction
 
13 unsupervised learning clustering
13 unsupervised learning   clustering13 unsupervised learning   clustering
13 unsupervised learning clustering
 
12 support vector machines
12 support vector machines12 support vector machines
12 support vector machines
 
10 advice for applying ml
10 advice for applying ml10 advice for applying ml
10 advice for applying ml
 
9 neural network learning
9 neural network learning9 neural network learning
9 neural network learning
 
8 neural network representation
8 neural network representation8 neural network representation
8 neural network representation
 
7 regularization
7 regularization7 regularization
7 regularization
 
6 logistic regression classification algo
6 logistic regression   classification algo6 logistic regression   classification algo
6 logistic regression classification algo
 
5 octave tutorial
5 octave tutorial5 octave tutorial
5 octave tutorial
 
4 linear regeression with multiple variables
4 linear regeression with multiple variables4 linear regeression with multiple variables
4 linear regeression with multiple variables
 
3 linear algebra review
3 linear algebra review3 linear algebra review
3 linear algebra review
 
2 linear regression with one variable
2 linear regression with one variable2 linear regression with one variable
2 linear regression with one variable
 

Recently uploaded

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 

11 ml system design

  • 1. 11. Machine Learning System Design: PRIORITIZING WHAT TO WORK ON: BUILDING A SPAM CLASSIFIER: If a word is to be found in the email, we would assign its respective entry a 1, else 0.
  • 2. So how could you spend your time to improve the accuracy of this classifier? ERROR ANALYSIS: We can try out different quick and dirty implementations of ML algorithms and learning curves. Lets take a high error example: For example, assume that we have 500 emails and our algorithm misclassifies a 100 of them. We could manually analyse the 100 emails and categorize them based on what type of emails they are. We could then try to come up with new cues and features that would help us classify these 100 emails correctly.
  • 3. Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance.
  • 4. For example, if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower- case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not. ERROR METRICS FOR SKEWED CLASSES: Suppose we train a model and it gives an error of 1%. But in the training examples, only 0.5 % people actually have cancer… so predicting y=0 always is better choice as it gives only 0.5% error This is an ambiguous form of judgement.
  • 5. So, we come up with a new form of error metric: Precision and Recall Precision and Recall are defined by setting the rare class = 1not =0…
  • 6. Now, if the algo predicts y=0 all the time, then this will give Recall = 0 ➢Classifier with high precision and high recall is a good classifier ➢Even with very skewed classes, the algo can’t cheat But: So, to get high precision: This will give high precision → as no of predicted positives is decreased And low recall If:
  • 7. Then: How to compare precision recall values bw two algos: Previously we had a single no metric for error analysis, but now we have two nos.
  • 8. How to determine which is better? Taking average is not a good choice A NEW EVALUATION METRICS: F1 score:also called F score:
  • 9. ➢To pick a threshold, try different values of threshold and one with the best F-score is to be picked. IMPORTANCE OF DATA IN MACHINE LEARNING: Problems like confusable words problem are hard to entertain. So what can we do to solve that ?
  • 10. These algos were tried on different sizes of training data: All of them give approx. same accuracy on a large dataset It shows that: So, for confusable words problem: We have enough features that capture the surrounding words o So, that’s enough info to predict the right answer
  • 11. As a counter example: We don’t have features like: how old the house is? Does it have furniture? Location? Etc… We have just the size! An approach to determine if its possible to predict the o/p with given set of features: ➢֍ In case of confusable words, a human expert can easily tell the missing word accurately ➢֍ In case of housing price example: even a real estate expert cant tell the exact priec of house based on just the size.
  • 12. So, can having the lot of data help? ➢Assuming we have many features and a rich class of functions → low bias ➢Assuming we have huge huge dataset → low variance → This makes up for a very high performance algorithm.