The document discusses building a machine learning model to predict the cuisine of a dish from its list of ingredients. It explores techniques such as Naive Bayes classification and several preprocessing steps. The best-performing model was Naive Bayes, with an accuracy of 72.68%. While not the highest score on the Kaggle leaderboard, the project provided valuable experience in natural language processing and machine learning.
3. THE PROBLEM
What if a machine could behave like a food connoisseur?
Essentially, this machine needs to predict a dish's cuisine given its list of ingredients
We "humans" took up this challenge to build our own food connoisseur
Employed Machine Learning and Natural Language Processing
5. Techniques
Supervised Learning
Text mining
Preprocessing
  Stemming
  Stop word removal
  Sparse term removal
  Term frequency / Inverse document frequency (TF-IDF)
Feature Selection
  Bag of Words algorithm
Classification
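The preprocessing-plus-bag-of-words pipeline listed above can be sketched in a few lines. The project itself was implemented in R; this is a minimal Python illustration with made-up recipes, not the project's code:

```python
# Toy recipes as ingredient lists (illustrative data, not the
# project's Kaggle dataset).
recipes = [
    ["tomato", "basil", "cheese"],          # Italian-style
    ["tortilla", "beans", "cheese"],        # Mexican-style
    ["fish sauce", "lime juice", "sugar"],  # Thai-style
]

# The vocabulary is the bag-of-words feature space.
vocab = sorted({ing for recipe in recipes for ing in recipe})

def bag_of_words(recipe, vocab):
    # Encode a recipe as a binary presence/absence vector, the same
    # Boolean encoding the Naive Bayes model consumes later.
    present = set(recipe)
    return [1 if term in present else 0 for term in vocab]

matrix = [bag_of_words(r, vocab) for r in recipes]
```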
6. Bernoulli Naïve Bayes
Bayes' theorem (prior and posterior probabilities)
Posterior: P(Italian | cheese) = P(Italian) * P(cheese | Italian) / P(cheese)
Laplace-smoothed likelihood: P(word | class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words)
Real-world applications: spam filtering, news classification, sentiment analysis
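As a rough sketch of the posterior and smoothed-likelihood formulas above (all counts and priors here are made-up toy values; the project itself used R):

```python
# Toy ingredient counts per cuisine and made-up priors; only the
# formulas mirror the slide, none of these numbers are real.
class_word_counts = {
    "italian": {"cheese": 30, "tomato": 25, "basil": 15},
    "thai":    {"fish sauce": 20, "lime": 18, "cheese": 2},
}
class_priors = {"italian": 0.6, "thai": 0.4}
vocab_size = 5  # unique words across all classes in this toy corpus

def likelihood(word, cls):
    # Laplace-smoothed P(word | class), as on the slide:
    # (word_count_in_class + 1) / (total_words_in_class + vocab_size)
    counts = class_word_counts[cls]
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + vocab_size)

def posterior(cls, word):
    # Bayes' theorem: P(class | word) = P(class) * P(word | class) / P(word),
    # where the evidence P(word) sums over all classes.
    evidence = sum(class_priors[c] * likelihood(word, c) for c in class_priors)
    return class_priors[cls] * likelihood(word, cls) / evidence
```

With these toy counts, posterior("italian", "cheese") comes out around 0.90: cheese strongly signals Italian.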
7. Problems Encountered
Id Cuisine Ingredients
29 Thai Sugar, hot chilli, Asian fish
sauce, lime juice.
Possible Class Imbalance: Italian, Mexican, Southern U.S ~50%
Accuracy on least frequent cuisines very low
Counter Over fitting:
Sampling
Partitioning
Preprocessing:
Stemming eg. Almond vs almonds, thigh vs thighs
Concatenation eg. Black olives vs black_olives
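A sketch of these two preprocessing fixes; the one-line suffix stripper below is a crude stand-in for the Porter stemmer that the tm package actually applies:

```python
def crude_stem(word):
    # Crude stand-in for a real stemmer: collapse simple plurals
    # so "almonds"/"almond" and "thighs"/"thigh" match.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(ingredient):
    # Concatenate multi-word ingredients with "_" so that
    # "black olives" stays one token instead of two.
    words = [crude_stem(w) for w in ingredient.lower().split()]
    return "_".join(words)
```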
8. Problems Encountered (continued)
Clustering of ingredients doesn't work
Disproportionate NULL values
Poor TF-IDF accuracy
  Weighting of important ingredients
  Normalization leads to complex classification
Least and most common ingredients
  Removed sparse terms
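For reference, TF-IDF weighting and sparse-term removal can be sketched as follows (toy documents; the min_df threshold of 2 is illustrative):

```python
import math
from collections import Counter

# Toy recipe documents (illustrative only).
docs = [
    ["cheese", "tomato", "basil"],
    ["cheese", "tortilla", "beans"],
    ["fish_sauce", "lime_juice", "sugar"],
]

# Document frequency: how many documents contain each term.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)       # term frequency
    idf = math.log(n_docs / df[term])     # inverse document frequency
    return tf * idf

# Sparse-term removal: drop terms appearing in fewer than
# min_df documents (threshold chosen for illustration).
min_df = 2
kept = {term for term, count in df.items() if count >= min_df}
```

Note how TF-IDF down-weights "cheese" precisely because it is frequent across documents, one way a genuinely important ingredient can lose weight, consistent with the poor TF-IDF accuracy noted above.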
9. Data Mining (Import Data)
Load all the required packages, such as "tm", "caret", "wordcloud", etc.
Convert the data from JSON format to a data frame using "jsonlite"
Set the levels of the target feature, cuisine, for better classification:
  cook$cuisine <- factor(cook$cuisine)
Initial preprocessing: replace "-" and " " with "_" in all ingredient names
Create a corpus and then convert it to a document-term matrix
10. Data Mining (Finalize Data)
Final preprocessing using the "tm" package:
  Remove numbers, stop words, and punctuation; apply stemming
  Remove the least and most common ingredients
Draw random samples for the train and test datasets:
  Total – 39774
  Training – 29830
  Testing – 9944
Convert all observations to Boolean using the apply function: from 0/1/2... to Yes/No
Build a Naïve Bayes model and a confusion matrix
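The split and Boolean-conversion steps were done in R with apply() and random sampling; here is a rough Python equivalent of the same idea, with a toy matrix standing in for the real 39774-recipe document-term matrix:

```python
import random

random.seed(0)  # fixed seed so the toy example is reproducible

# Toy document-term rows holding 0/1/2... term counts, standing in
# for the real document-term matrix.
rows = [[random.randint(0, 2) for _ in range(5)] for _ in range(100)]

# Boolean conversion, like the apply() step: from 0/1/2... to Yes/No.
boolean_rows = [["Yes" if c > 0 else "No" for c in row] for row in rows]

# Random 75/25 split, matching the deck's proportions
# (39774 total -> 29830 train, 9944 test).
indices = list(range(len(boolean_rows)))
random.shuffle(indices)
cut = int(0.75 * len(indices))
train = [boolean_rows[i] for i in indices[:cut]]
test = [boolean_rows[i] for i in indices[cut:]]
```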
11. Performance
Decision Tree
  With multiple classes, the confusion matrix is the only evaluation option
  Overall accuracy of the model – 52.5%
Random Forest
  Accuracy increased to 64.70%
  Out-of-bag error estimate – 36.11%
12. Performance (continued)
Linear Support Vector Machines
  SVM implemented in R
  Accuracy decreased – 49.5%
Naïve Bayes
  The model improves accuracy for multi-class classification
  Accuracy improved to 72.68%
Overall statistics: accuracy – 0.7268, 95% CI (0.7179, 0.7355)
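As a sanity check on the reported interval: caret computes an exact binomial CI, but a normal-approximation interval from the test-set size lands on almost the same digits:

```python
import math

p, n = 0.7268, 9944  # reported accuracy and test-set size from the deck
se = math.sqrt(p * (1 - p) / n)            # standard error of a proportion
lower, upper = p - 1.96 * se, p + 1.96 * se
# roughly (0.7180, 0.7356), vs. the reported exact CI (0.7179, 0.7355)
```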
13. Word Cloud
A word cloud is a visual representation of text data, used to visualize free-form text. Tags are usually single words, and the importance of each tag is shown with font size or color.
Task: predict the cuisine for each recipe. The word clouds on the right (Italian and Indian) are examples of our project results: cheese is the most used ingredient in Italian food, and cumin is very important in Indian food.
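The word clouds boil down to per-cuisine frequency tallies; a minimal sketch with made-up recipes (the real clouds were drawn with R's wordcloud package):

```python
from collections import Counter

# Toy recipes per cuisine, illustrative only.
by_cuisine = {
    "italian": [["cheese", "tomato"], ["cheese", "basil"]],
    "indian":  [["cumin", "turmeric"], ["cumin", "ginger"]],
}

# The most frequent ingredient per cuisine drives the largest tag.
top_ingredient = {
    cuisine: Counter(ing for recipe in recipes for ing in recipe).most_common(1)[0][0]
    for cuisine, recipes in by_cuisine.items()
}
```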
14. Conclusion
Accuracy of 72.68%; the highest on Kaggle for this problem is 82.17%
Got a head start in Natural Language Processing and Machine Learning
Looking forward to taking up more challenging and intriguing problems in these domains. The first one would be sentiment analysis on Professor MacDonald's real and fake Twitter/Facebook/LinkedIn profiles :D :p
THANK YOU!!!