What’s Cooking?
OUTLINE
 Problem Statement
 Techniques
 Bernoulli Naïve Bayes
 Problems Encountered
 Data Mining
 Performance
 Conclusion
THE PROBLEM
 What if a machine could behave like a food connoisseur?
 Essentially, this machine needs to predict a dish's cuisine category given its list of ingredients
 We “humans” took up this challenge to build our own food connoisseur
 Employed Machine Learning and Natural Language Processing
Cuisine’s Frequency
[Bar chart: number of recipes per cuisine, one bar for each of 20 cuisines. Bar values in slide order: 467, 804, 1546, 2673, 755, 2646, 1175, 3003, 667, 7838, 526, 1423, 830, 6438, 821, 489, 4320, 989, 1539, 825. Y-axis: Frequency, 0 to 9000.]
Techniques
 Supervised Learning
 Text mining
 Preprocessing
 Stemming
 Stop word removal
 Sparse term removal
 Term frequency
 Inverse document frequency
 Feature Selection
 Bag of Words Algorithm
 Classification
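A minimal sketch of the term frequency and inverse document frequency weighting listed above. The deck's own pipeline was built in R with the “tm” package; this is plain Python for illustration only. The toy recipes and the tf_idf helper are hypothetical, and the weight used here, tf * log(N / df), is one common variant and not necessarily the one “tm” applies.

```python
import math
from collections import Counter

# Hypothetical toy recipes; ingredient names are illustrative, not taken from the dataset.
recipes = [
    ["cheese", "tomato", "basil"],
    ["cumin", "tomato", "onion"],
    ["cheese", "onion", "cheese"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(recipes)
```

In this toy corpus, basil appears in only one recipe, so it gets a higher weight in the first recipe than tomato, which appears in two.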
Bernoulli Naïve Bayes
 Bayes Theorem (Prior and Posterior probabilities)
 Posterior: P(Italian|cheese) = P(Italian) * P(cheese|Italian) / P(cheese)
 Likelihood estimate with add-one smoothing: P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_class)
 Real-world applications: spam filtering, news classification, sentiment analysis
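A rough sketch of how the posterior and the add-one estimate above combine into a classifier. This is illustrative Python, not the deck's R code. The training rows and the fit/predict helpers are hypothetical. As an assumption, the unique-word count in the denominator is taken over the whole training vocabulary, which is the usual add-one form. The sketch scores only the ingredients present in a recipe; a strictly Bernoulli model would also score absent terms.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (cuisine, ingredients).
train_rows = [
    ("italian", ["cheese", "tomato", "basil"]),
    ("italian", ["cheese", "pasta"]),
    ("indian", ["cumin", "onion", "tomato"]),
]

def fit(rows):
    """Count recipes per class, word occurrences per class, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in rows:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def predict(model, words):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for w in words:
            # Add-one smoothed likelihood; vocabulary size is an assumption (see lead-in).
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train_rows)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the P(cheese) denominator is dropped because it is the same for every class.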
Problems Encountered
Example record:
Id    Cuisine    Ingredients
29    Thai       Sugar, hot chilli, Asian fish sauce, lime juice
 Possible class imbalance: Italian, Mexican, and Southern U.S. together make up ~50% of recipes
 Accuracy on the least frequent cuisines is very low
 Countering overfitting:
 Sampling
 Partitioning
 Preprocessing:
 Stemming, e.g. almond vs. almonds, thigh vs. thighs
 Concatenation, e.g. black olives vs. black_olives
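The ~50% class-imbalance figure can be checked against the bar values from the Cuisine’s Frequency chart. This is a quick illustrative Python sketch, not part of the deck's R pipeline. It assumes the three largest bars are Italian, Mexican, and Southern U.S., since the chart's category labels did not survive in the transcript.

```python
# Bar values from the "Cuisine's Frequency" chart, in slide order.
counts = [467, 804, 1546, 2673, 755, 2646, 1175, 3003, 667, 7838,
          526, 1423, 830, 6438, 821, 489, 4320, 989, 1539, 825]

total = sum(counts)
# Assumption: the three largest bars are the three cuisines named on the slide.
top_three = sorted(counts, reverse=True)[:3]
share = sum(top_three) / total

print(f"{total} recipes, top three classes: {share:.1%}")
```

The total matches the 39774 recipes reported later in the deck, and the top three classes come to a little under half of them.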
Problems Encountered
 Clustering of ingredients doesn’t work
 Disproportionate NULL values
 Poor TF-IDF accuracy
 Weighting of important ingredients
 Normalization: complex classification
 Least and most common ingredients
 Removed Sparse Terms
Data Mining (Import Data)
 Load the required packages, such as “tm”, “caret”, and “wordcloud”
 Convert the data from JSON format to a data frame using “jsonlite”
 Set the target feature, cuisine, as a factor so its levels are defined for classification
 cook$cuisine <- factor(cook$cuisine)
 Initial preprocessing: replace “-” and “ ” with “_” in all ingredients
 Create a corpus, then convert it to a document-term matrix
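The initial preprocessing and document-term matrix steps can be sketched as follows. The deck did this in R with “tm”; the Python below is only an illustration, and the normalize and document_term_matrix helpers and the sample ingredients are hypothetical. Lowercasing is an added assumption.

```python
import re

def normalize(ingredient):
    """Lowercase and replace runs of hyphens or whitespace with one underscore."""
    return re.sub(r"[-\s]+", "_", ingredient.strip().lower())

def document_term_matrix(recipes):
    """Return (sorted vocabulary, one row of term counts per recipe)."""
    docs = [[normalize(i) for i in recipe] for recipe in recipes]
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for term in doc:
            row[index[term]] += 1
    return vocab, matrix

vocab, dtm = document_term_matrix([
    ["Black olives", "feta cheese"],
    ["black-olives", "lime juice"],
])
```

Joining multi-word ingredients with an underscore keeps “black olives” as a single term, so it is not split into “black” and “olives” when the corpus is tokenized.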
Data Mining (Finalize Data)
 Final preprocessing using the “tm” package
 Remove numbers, stop words, and punctuation; apply stemming
 Remove the least and most common ingredients
 Draw random samples for the training and test datasets
 Total – 39774
 Training – 29830
 Testing – 9944
 Convert all observations to Boolean using the apply function
 From 0/1/2… to Yes/No
 Build a Naïve Bayes model and confusion matrix
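A sketch of the random split and the Boolean conversion described above, in illustrative Python (the deck used R's apply function). The 0.75 training fraction is inferred from the 29830/39774 sizes and is not stated on the slide. The seed and helper names are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def to_boolean(count_row):
    """Map term counts 0/1/2... to presence flags "No"/"Yes"."""
    return ["Yes" if c > 0 else "No" for c in count_row]

# Stand-in row ids for the 39774 recipes.
train_ids, test_ids = train_test_split(list(range(39774)))
```

With 39774 rows and a 0.75 fraction, the cut falls at 29830 training and 9944 test rows, matching the sizes on the slide.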
Performance
 Decision Tree
 Multiple classes, so the confusion matrix was the only evaluation option
 Overall accuracy of the model – 52.5%
 Random Forest
 Accuracy increased to 64.70%
 Out-of-bag error estimate: 36.11%
Performance
 Linear Support Vector Machines
 SVM implemented in R
 Accuracy decreased to 49.5%
 Naïve Bayes
 Model improves accuracy for multi-class classification
 Accuracy improved to 72.68%
Overall statistics: Accuracy – 0.7268, 95% CI (0.7179, 0.7355)
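The reported confidence interval can be roughly reproduced with a normal approximation. This illustrative Python sketch assumes the accuracy was measured on the 9944 test recipes. The slide's interval likely comes from an exact binomial method, so small differences in the fourth decimal are expected.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion measured on n samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# n = 9944 is the test-set size from the deck (assumed evaluation set).
low, high = accuracy_interval(0.7268, 9944)
```

The narrow interval reflects the large test set: with nearly ten thousand recipes, the half-width is under one percentage point.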
Word cloud
A visual representation of free-form text data. Tags are usually single words, and the importance of each tag is shown with font size or color.
[Word cloud images: Italian, Indian]
Task: Predict the cuisine for each recipe. The word clouds are examples of our project results. Cheese is the most used ingredient in Italian food. Cumin is also very important in Indian food.
Conclusion
 Accuracy of 72.68%; the highest on Kaggle for this problem is 82.17%
 Got a head start in Natural Language Processing and Machine Learning
 Looking forward to taking up more challenging and intriguing problems in these domains. The first would be sentiment analysis of Professor MacDonald’s real and fake Twitter/Facebook/LinkedIn profiles :D :p
THANK YOU!!!
