#### Talk 2: what we learned from our study group on ML
Speakers: your dear study group hosts Dave Snowdon (Software engineer at G-Research, ex VMware) and Jeremie Charlet (CTO at Trackener)
While working for several months on the Kaggle project on sentiment analysis of IMDB reviews, we tried multiple vectorization techniques (bag of words, TF-IDF), ran multiple models (from Random Forest to CNN and RNN), read many articles and research papers, compared the results and, above all, learned a great deal. Here we showcase our work and share what we learned.
1. MaM Machine Learning
Study group
Speaker Session - 17/04/2019
https://www.meetup.com/MaM-Machine-Learning-Study-Group/
2. Thanks to our host and sponsor
Imperial College Data Science Institute
Recworks Meet a Mentor community
https://recworks.co.uk/ https://meetamentor.co.uk
3. Agenda
Introduction to data analysis and data cleaning
By Mark Bell
Study group: the story so far
By Dave Snowdon and Jeremie Charlet
7. The power of the mob
Slack for group coordination
During a study session
● One person “drives”
● Everyone else contributes suggestions
● New “driver” each session
● Everyone literally on the same page
● Allows people to get up to speed without being put on the spot
10. Kaggle house prices
● Given a data set with information about houses and their area, predict the house price
● Lots of fumbling around in the dark
● Gradually got to grips with scikit-learn and pandas
● Learnt to use a methodology from machinelearningmastery.com
● Tried with and without outliers
● Tried various models: linear regression, random forest, simple NN
● Experimented with mapping categorical values
● Tried using RFE and random forest to find most important fields
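A minimal sketch of the model spot-checking and feature-ranking steps listed above, assuming scikit-learn is installed. The data is synthetic (generated with `make_regression`), not the Kaggle house-price set, and the model settings are illustrative assumptions rather than the ones the group used.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the house-price table (assumption): 200 rows,
# 10 numeric columns, only 4 of which actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

# Spot check: score a few very different models the same way.
models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    # Mean R^2 over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")

# Recursive feature elimination (RFE) with a random forest: repeatedly drop
# the least important column until 4 remain.
selector = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=4,
)
selector.fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("columns kept by RFE:", selected)
```

Scoring every model with the same cross-validation split keeps the comparison fair, which is the point of the spot-check step.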
11. Methodology
5-Step Systematic Process
1. Define the Problem
2. Prepare Data
3. Spot Check Algorithms
4. Improve Results
5. Present Results
https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/
14. It’s harder than it looks!
● Keep your data & labels separate!
○ Martin Goodson, “How to Prevent Catastrophic Failure in Production ML Systems”, QCon London 2019
● Need a methodology - or can get overwhelmed by all the possibilities
● If it looks too good to be true...
callingbullshit.org
16. Learnings - IMDB Reviews project
Problem: Given (long) movie review as plain text
Decide: is review positive or negative?
'This is one of the silliest movies I have ever had the misfortune to watch! I should have expected it, after seeing the first two, but I keep
getting suckered into these types of movies with the idea of "Maybe they did it right this time". Nope - not even close. Where do I
begin? How about with the special effects... To give you an idea of what passes for SFX in this movie, at one point a soldier is shooting
at a "Raptor" as it runs down a hallway. Even with less than a second of screen time, the viewer can easily see that it is just a man with
a tail apparently taped to him running around. Bad bad bad bad. How about the acting? If that's what you can call it. There is one
character who, I suppose, is supposed to be from the south. However, after living in the south for six years now, I have never heard this
way of talking. Perhaps he has some sort of weird disability - the inability to talk normally. I find it fascinating that the character does
nothing that requires him to have that accent - therefore there was no reason for the actor to try to do one. How about the plot? It’s
pretty basic - Raptors escape, people with guns must hunt them down. I’m starting to wonder why the dinosaurs in these movies
always seem to run into the nearest system of tunnels... wouldn’t they stay outside to hunt prey? ...
17. Learnings - IMDB Reviews project
processing text data
vectorization techniques
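TF-IDF, one of the vectorization techniques named in the talk abstract, can be sketched with the standard library alone. This uses the plain textbook formula (term frequency times log of inverse document frequency); the slides do not say which variant the group used, and scikit-learn's `TfidfVectorizer` applies smoothing by default, so its numbers would differ. The three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document (documents must be non-empty)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # tf = share of the document taken by the word,
            # idf = log(total documents / documents containing the word)
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


docs = ["great movie", "bad movie", "boring movie"]  # toy corpus (assumption)
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

A word that appears in every document ("movie" here) gets weight zero, while words specific to one review keep a positive weight.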
18. The dog is on the table
Learnings - IMDB Reviews project
processing text data
vectorization techniques
[ 0, 1, 1, 0, 0, 0, 2, … ]
[Cat dog table monkey movie man the … ]
[12, 908, 35, 45, 12, 13]
12 The
...
908 Dog
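The two encodings on this slide can be reproduced with a few lines of standard-library Python. This is a sketch, not the group's code: the vocabulary order is taken from the slide, while the index values for "is", "on" and "table" are inferred from the position of each number in the slide's sequence (only 12 = The and 908 = Dog are given explicitly).

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lower-case and split on whitespace.
    return text.lower().split()


def bag_of_words(text, vocabulary):
    # One count per vocabulary word; word order in the sentence is lost.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def to_index_sequence(text, word_index, unknown=0):
    # One integer per token; word order is preserved (0 marks unknown words).
    return [word_index.get(word, unknown) for word in tokenize(text)]


sentence = "The dog is on the table"

vocabulary = ["cat", "dog", "table", "monkey", "movie", "man", "the"]
bow = bag_of_words(sentence, vocabulary)
print(bow)  # [0, 1, 1, 0, 0, 0, 2]

# Index values: 12 and 908 are from the slide, the rest are inferred.
word_index = {"the": 12, "dog": 908, "is": 35, "on": 45, "table": 13}
seq = to_index_sequence(sentence, word_index)
print(seq)  # [12, 908, 35, 45, 12, 13]
```

The count vector suits models such as Random Forest, whereas the index sequence keeps word order and is the usual input to an embedding layer in front of a CNN or LSTM.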
21. Learnings - IMDB Reviews project
Read research papers to create models
Learned and practised with both scikit-learn for ML and Keras for DL
Discovered different neural network architectures: CNN, LSTM
22. Learnings - IMDB Reviews project
Compared multiple ML and DL architectures
24. Next steps
Start new project
Mob v2
● Pomodoros
● Mob + pair work
● With homework (reading articles or research papers)
Follow our methodology from start to finish
Join us: https://www.meetup.com/MaM-Machine-Learning-Study-Group/
Dealing with stop words and punctuation
Then applying a vectorization technique to transform a sentence into a numeric (vectorized) representation
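A minimal sketch of the cleaning step described in this note, using only the Python standard library. The stop-word list is a tiny hypothetical one for illustration; a real project would more likely use a list shipped with a library such as NLTK or scikit-learn.

```python
import string

# Tiny illustrative stop-word list (assumption, far from complete).
STOP_WORDS = {
    "a", "an", "the", "is", "of", "to", "i", "it",
    "this", "have", "had", "ever", "one",
}


def clean(text):
    # Lower-case, delete ASCII punctuation, then drop stop words.
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


review = "This is one of the silliest movies I have ever had the misfortune to watch!"
tokens = clean(review)
print(tokens)  # ['silliest', 'movies', 'misfortune', 'watch']
```

Deleting punctuation this way also turns "that's" into "thats", so whether to strip, split or keep apostrophes is a choice to make before vectorizing.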
We started giving ourselves homework: read a research paper before coming, then spend a session reading through the researcher’s code and rewriting it. We chose Keras, as recommended by our data scientist mentors
If needed, read a few articles on CNNs to understand how they work