The document discusses building a machine learning model to predict the cuisine of a dish from its list of ingredients. It explores techniques such as Naive Bayes classification and several preprocessing steps. The best-performing model was Naive Bayes, with an accuracy of 72.68%. While not the highest score on the Kaggle leaderboard, the project provided valuable experience in natural language processing and machine learning.
3. THE PROBLEM
What if a machine could behave like a food connoisseur?
Essentially, this machine needs to predict a dish's cuisine given its list of ingredients
We "humans" took up this challenge to build our own food connoisseur
Employed Machine Learning and Natural Language Processing
5. Techniques
Supervised Learning
Text mining
Preprocessing
  Stemming
  Stop word removal
  Sparse term removal
  Term frequency / Inverse document frequency (TF-IDF)
Feature Selection
  Bag of Words algorithm
Classification
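The preprocessing-plus-bag-of-words pipeline listed above can be sketched in a few lines. The project itself was implemented in R; this is a minimal Python illustration with made-up recipes, not the project's code:

```python
# Toy recipes as ingredient lists (illustrative data, not the
# project's Kaggle dataset).
recipes = [
    ["tomato", "basil", "cheese"],          # Italian-style
    ["tortilla", "beans", "cheese"],        # Mexican-style
    ["fish sauce", "lime juice", "sugar"],  # Thai-style
]

# The vocabulary is the bag-of-words feature space.
vocab = sorted({ing for recipe in recipes for ing in recipe})

def bag_of_words(recipe, vocab):
    # Encode a recipe as a binary presence/absence vector, the same
    # Boolean encoding the Naive Bayes model consumes later.
    present = set(recipe)
    return [1 if term in present else 0 for term in vocab]

matrix = [bag_of_words(r, vocab) for r in recipes]
```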
6. Bernoulli Naïve Bayes
Bayes' theorem (prior and posterior probabilities)
Posterior: P(Italian | cheese) = P(Italian) * P(cheese | Italian) / P(cheese)
Laplace-smoothed likelihood: P(word | class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words)
Real-world applications: spam filtering, news classification, sentiment analysis
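As a rough sketch of the posterior and smoothed-likelihood formulas above (all counts and priors here are made-up toy values; the project itself used R):

```python
# Toy ingredient counts per cuisine and made-up priors; only the
# formulas mirror the slide, none of these numbers are real.
class_word_counts = {
    "italian": {"cheese": 30, "tomato": 25, "basil": 15},
    "thai":    {"fish sauce": 20, "lime": 18, "cheese": 2},
}
class_priors = {"italian": 0.6, "thai": 0.4}
vocab_size = 5  # unique words across all classes in this toy corpus

def likelihood(word, cls):
    # Laplace-smoothed P(word | class), as on the slide:
    # (word_count_in_class + 1) / (total_words_in_class + vocab_size)
    counts = class_word_counts[cls]
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + vocab_size)

def posterior(cls, word):
    # Bayes' theorem: P(class | word) = P(class) * P(word | class) / P(word),
    # where the evidence P(word) sums over all classes.
    evidence = sum(class_priors[c] * likelihood(word, c) for c in class_priors)
    return class_priors[cls] * likelihood(word, cls) / evidence
```

With these toy counts, posterior("italian", "cheese") comes out around 0.90: cheese strongly signals Italian.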
7. Problems Encountered
Id Cuisine Ingredients
29 Thai Sugar, hot chilli, Asian fish
sauce, lime juice.
Possible Class Imbalance: Italian, Mexican, Southern U.S ~50%
Accuracy on least frequent cuisines very low
Counter Over fitting:
Sampling
Partitioning
Preprocessing:
Stemming eg. Almond vs almonds, thigh vs thighs
Concatenation eg. Black olives vs black_olives
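A sketch of these two preprocessing fixes; the one-line suffix stripper below is a crude stand-in for the Porter stemmer that the tm package actually applies:

```python
def crude_stem(word):
    # Crude stand-in for a real stemmer: collapse simple plurals
    # so "almonds"/"almond" and "thighs"/"thigh" match.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(ingredient):
    # Concatenate multi-word ingredients with "_" so that
    # "black olives" stays one token instead of two.
    words = [crude_stem(w) for w in ingredient.lower().split()]
    return "_".join(words)
```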
8. Problems Encountered (continued)
Clustering of ingredients doesn't work
Disproportionate NULL values
Poor TF-IDF accuracy
  Weighting of important ingredients
  Normalization leads to complex classification
Least and most common ingredients
  Removed sparse terms
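For reference, TF-IDF weighting and sparse-term removal can be sketched as follows (toy documents; the min_df threshold of 2 is illustrative):

```python
import math
from collections import Counter

# Toy recipe documents (illustrative only).
docs = [
    ["cheese", "tomato", "basil"],
    ["cheese", "tortilla", "beans"],
    ["fish_sauce", "lime_juice", "sugar"],
]

# Document frequency: how many documents contain each term.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)       # term frequency
    idf = math.log(n_docs / df[term])     # inverse document frequency
    return tf * idf

# Sparse-term removal: drop terms appearing in fewer than
# min_df documents (threshold chosen for illustration).
min_df = 2
kept = {term for term, count in df.items() if count >= min_df}
```

Note how TF-IDF down-weights "cheese" precisely because it is frequent across documents, one way a genuinely important ingredient can lose weight, consistent with the poor TF-IDF accuracy noted above.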
9. Data Mining (Import Data)
Load all the required packages, such as "tm", "caret", "wordcloud", etc.
Convert the data from JSON format to a data frame using "jsonlite"
Set the levels of the target feature, cuisine, for better classification:
  cook$cuisine <- factor(cook$cuisine)
Initial preprocessing: replace "-" and " " with "_" in all ingredient names
Create a corpus and then convert it to a document-term matrix
10. Data Mining (Finalize Data)
Final preprocessing using the "tm" package:
  Remove numbers, stop words, and punctuation; apply stemming
  Remove the least and most common ingredients
Draw random samples for the train and test datasets:
  Total – 39774
  Training – 29830
  Testing – 9944
Convert all observations to Boolean using the apply function: from 0/1/2... to Yes/No
Build a Naïve Bayes model and a confusion matrix
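The split and Boolean-conversion steps were done in R with apply() and random sampling; here is a rough Python equivalent of the same idea, with a toy matrix standing in for the real 39774-recipe document-term matrix:

```python
import random

random.seed(0)  # fixed seed so the toy example is reproducible

# Toy document-term rows holding 0/1/2... term counts, standing in
# for the real document-term matrix.
rows = [[random.randint(0, 2) for _ in range(5)] for _ in range(100)]

# Boolean conversion, like the apply() step: from 0/1/2... to Yes/No.
boolean_rows = [["Yes" if c > 0 else "No" for c in row] for row in rows]

# Random 75/25 split, matching the deck's proportions
# (39774 total -> 29830 train, 9944 test).
indices = list(range(len(boolean_rows)))
random.shuffle(indices)
cut = int(0.75 * len(indices))
train = [boolean_rows[i] for i in indices[:cut]]
test = [boolean_rows[i] for i in indices[cut:]]
```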
11. Performance
Decision Tree
  With multiple classes, the confusion matrix is the only evaluation option
  Overall accuracy of the model – 52.5%
Random Forest
  Accuracy increased to 64.70%
  Out-of-bag error estimate – 36.11%
12. Performance (continued)
Linear Support Vector Machines
  SVM implemented in R
  Accuracy decreased – 49.5%
Naïve Bayes
  The model improves accuracy for multi-class classification
  Accuracy improved to 72.68%
Overall statistics: accuracy – 0.7268, 95% CI (0.7179, 0.7355)
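As a sanity check on the reported interval: caret computes an exact binomial CI, but a normal-approximation interval from the test-set size lands on almost the same digits:

```python
import math

p, n = 0.7268, 9944  # reported accuracy and test-set size from the deck
se = math.sqrt(p * (1 - p) / n)            # standard error of a proportion
lower, upper = p - 1.96 * se, p + 1.96 * se
# roughly (0.7180, 0.7356), vs. the reported exact CI (0.7179, 0.7355)
```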
13. Word Cloud
A word cloud is a visual representation of text data, used to visualize free-form text. Tags are usually single words, and the importance of each tag is shown with font size or color.
Task: predict the cuisine for each recipe. The word clouds on the right (Italian and Indian) are examples of our project results: cheese is the most used ingredient in Italian food, and cumin is very important in Indian food.
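The word clouds boil down to per-cuisine frequency tallies; a minimal sketch with made-up recipes (the real clouds were drawn with R's wordcloud package):

```python
from collections import Counter

# Toy recipes per cuisine, illustrative only.
by_cuisine = {
    "italian": [["cheese", "tomato"], ["cheese", "basil"]],
    "indian":  [["cumin", "turmeric"], ["cumin", "ginger"]],
}

# The most frequent ingredient per cuisine drives the largest tag.
top_ingredient = {
    cuisine: Counter(ing for recipe in recipes for ing in recipe).most_common(1)[0][0]
    for cuisine, recipes in by_cuisine.items()
}
```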
14. Conclusion
Accuracy of 72.68%; the highest on Kaggle for this problem is 82.17%
Got a head start in Natural Language Processing and Machine Learning
Looking forward to taking up more challenging and intriguing problems in these domains. The first one would be sentiment analysis on Professor MacDonald's real and fake Twitter/Facebook/LinkedIn profiles :D :p
THANK YOU!!!