Machine Learning
Support Vector Machines
Sjoerd Maessen
•  AFOL
•  Works at E-sites Breda
•  Stock market enthusiast
Titanic
The chance of survival
PassengerId  Survived  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                              male    22   1      0      A/5 21171         7.25            S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26   0      0      STON/O2. 3101282  7.925           S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0      113803            53.1     C123   S
5            0         3       Allen, Mr. William Henry                             male    35   0      0      373450            8.05            S
6            0         3       Moran, Mr. James                                     male         0      0      330877            8.4583          Q
7            0         1       McCarthy, Mr. Timothy J                              male    54   0      0      17463             51.8625  E46    S
8            0         3       Palsson, Master. Gosta Leonard                       male    2    3      1      349909            21.075          S
9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)    female  27   0      2      347742            11.1333         S
10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                  female  14   1      0      237736            30.0708         C
11           1         3       Sandstrom, Miss. Marguerite Rut                      female  4    1      1      PP 9549           16.7     G6     S
12           1         1       Bonnell, Miss. Elizabeth                             female  58   0      0      113783            26.55    C103   S
13           0         3       Saundercock, Mr. William Henry                       male    20   0      0      A/5. 2151         8.05            S
14           0         3       Andersson, Mr. Anders Johan                          male    39   1      5      347082            31.275          S
Challenge accepted
“Could you take siblings into account?”
Sure…
“Oh and the number of parents and children aboard…”
Of course!
“Could you add a ‘simple if’ for age as well?”
“Great! We are almost there!
But…”
“Field of study that gives computers
the ability to learn without being
explicitly programmed”
Arthur Lee Samuel
https://personality-insights-livedemo.mybluemix.net/
Alright!
Let's become data scientists!
A comparison
Traditional programming
input + program => output
Machine learning
input + output => program
new input + program => new output
Classification vs regression
Classification
Input         Output
0.98   68     0
0.76   42     0
1.23   78     1
1.91   109    1
Regression
Input         Output
0.98   68     0.23
0.76   42     0.15
1.23   78     4.74
1.91   109    7.98
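To make the difference concrete, here is a minimal sketch in plain Python that uses only the first input column of the two tables above. The threshold rule and the least-squares line are simple stand-ins chosen for brevity, not the SVM covered later, and all function names are hypothetical.

```python
from statistics import mean

# First input column and outputs copied from the two tables above.
xs = [0.98, 0.76, 1.23, 1.91]
class_labels = [0, 0, 1, 1]              # classification: discrete classes
reg_targets = [0.23, 0.15, 4.74, 7.98]   # regression: continuous values


def fit_threshold(xs, labels):
    """Learn a cut-off halfway between the classes (assumes they do not overlap)."""
    highest_negative = max(x for x, y in zip(xs, labels) if y == 0)
    lowest_positive = min(x for x, y in zip(xs, labels) if y == 1)
    return (highest_negative + lowest_positive) / 2


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


threshold = fit_threshold(xs, class_labels)
slope, intercept = fit_line(xs, reg_targets)


def classify(x):
    """Classification answers with one of a fixed set of labels."""
    return 1 if x > threshold else 0


def regress(x):
    """Regression answers with a number on a continuous scale."""
    return slope * x + intercept


print(classify(1.5))  # a class label, 0 or 1
print(regress(1.5))   # a real number
```

In both cases the "program" (the threshold, or the slope and intercept) comes out of the examples rather than being written by hand.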
Support Vector Machine
•  Automatically creates a “program” or model
•  Inputs are ‘features’
•  Model represents a space
•  New input fits somewhere
Support Vectors
•  Optimal hyperplane
•  Linear classifier
•  Maximum margin
•  Classification
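The bullets above can be sketched in a few lines of Python. A linear classifier is a weight vector w and a bias b; the hyperplane is where w·x + b = 0, and the margin is the gap between the lines w·x + b = +1 and w·x + b = -1. The values of `w` and `b` below are hypothetical, standing in for what training would produce.

```python
import math

# Hypothetical weights and bias of an already-trained linear classifier.
w = [2.0, -1.0]
b = -0.5


def decision_value(x):
    """Signed score w.x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(x) >= 0 else -1


def margin_width():
    """Distance between the margin lines w.x + b = +1 and w.x + b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


print(predict([1.0, 0.5]), predict([0.0, 1.0]), round(margin_width(), 3))
```

Training an SVM means picking the w and b that make this margin as wide as possible. The training points that end up exactly on the margin lines, such as [1.0, 0.5] here with a score of 1.0, are the support vectors.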
Linearly separable dataset
Non-linear decision boundary
The kernel trick
A whole new dimension
The kernel trick
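A small sketch of the idea, using made-up toy points: no straight line separates an inner cluster from the ring around it in 2-D, but adding one extra dimension makes a flat plane sufficient. The second half shows the actual trick: a kernel returns the same inner product as the explicit mapping without ever building the higher-dimensional vectors. All names below are hypothetical.

```python
import math

# Toy 2-D points: the inner class sits near the origin, the outer class around it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]


def lift(point):
    """Add a third coordinate z = x^2 + y^2 (squared distance from the origin)."""
    x, y = point
    return (x, y, x * x + y * y)


# In 3-D the flat plane z = 1 separates the two classes.
inner_below = all(lift(p)[2] < 1 for p in inner)
outer_above = all(lift(p)[2] > 1 for p in outer)
print(inner_below, outer_above)


def phi(point):
    """Explicit feature map that matches the degree-2 polynomial kernel."""
    x, y = point
    return (x * x, math.sqrt(2) * x * y, y * y)


def poly2_kernel(a, b):
    """(a.b)^2, which equals phi(a).phi(b) without computing phi at all."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2


a, b = (1.0, 2.0), (3.0, -1.0)
explicit = sum(u * v for u, v in zip(phi(a), phi(b)))
print(math.isclose(explicit, poly2_kernel(a, b)))
```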
Choosing a kernel
•  No kernel or linear kernel
•  Gaussian kernel (the most common radial basis function, or RBF, kernel)
•  Polynomial kernel
•  Sigmoid kernel
•  …
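For reference, the kernels in this list are short formulas. The sketch below writes them in plain Python; the parameter names `gamma`, `coef0` and `degree` follow the naming LIBSVM uses, while the default values are arbitrary placeholders, not recommendations.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b):
    """Plain inner product: the 'no kernel' case."""
    return dot(a, b)


def gaussian_kernel(a, b, gamma=0.5):
    """exp(-gamma * ||a - b||^2): 1.0 for identical points, near 0 for distant ones."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * a.b + coef0) ^ degree."""
    return (gamma * dot(a, b) + coef0) ** degree


def sigmoid_kernel(a, b, gamma=0.5, coef0=0.0):
    """tanh(gamma * a.b + coef0)."""
    return math.tanh(gamma * dot(a, b) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), gaussian_kernel(u, u), polynomial_kernel(u, v, degree=2))
```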
Choosing a kernel
Rule of thumb
•  N much bigger than M
=> linear kernel
•  N small, M intermediate
=> Gaussian kernel
N = number of features
M = number of training examples
Spam detection
•  N = 10,000 (bad words, # of URLs, …)
•  M = 250 (sample mails)
=> linear kernel
Validation of housing prices
•  N = 1-1,000 (# of rooms, m³, location, …)
•  M = 100,000 (# of transactions)
=> Gaussian kernel
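The rule of thumb and the two examples above can be written down as a tiny helper. This is only a sketch: the cut-offs `much_bigger` and `small_n` are hypothetical numbers picked so that both examples come out as on the slides, and real projects would compare kernels by validation instead.

```python
def suggest_kernel(n_features, n_examples, much_bigger=10, small_n=1000):
    """Rule-of-thumb kernel choice; both cut-offs are illustrative assumptions."""
    if n_features >= much_bigger * n_examples:
        return "linear"     # N much bigger than M
    if n_features <= small_n:
        return "gaussian"   # N small, M intermediate or larger
    return "undecided"      # the rule of thumb does not cover this case


print(suggest_kernel(10000, 250))     # spam detection example
print(suggest_kernel(1000, 100000))   # housing price example
```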
Features
It’s all about preparation
Features
•  Representation of raw data
•  The hardest part
Raw data
Pre-processing
Feature scaling
Feature extraction
Association discovery
OCR – Pre-processing
•  De-skew
•  Despeckle
•  Convert to black & white
•  Zoning
•  Character segmentation
OCR – Feature extraction
OCR – Feature scaling
How to scale the number of black pixels?
22 / 56 => 0.39286
Scale between 0 and 1
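As a sketch of that calculation: the bitmap below is a made-up 7 x 8 character image with 22 black pixels, so it reproduces the 22 / 56 from the slide. The helper `min_max_scale` is a hypothetical name for the general form of the same operation.

```python
# Hypothetical 7 x 8 bitmap of one character: '#' = black pixel, '.' = white pixel.
bitmap = [
    "..####..",
    ".#....#.",
    ".#....#.",
    ".######.",
    ".#....#.",
    ".#....#.",
    ".##..##.",
]


def min_max_scale(value, lowest, highest):
    """Map value from the range [lowest, highest] onto [0, 1]."""
    return (value - lowest) / (highest - lowest)


black = sum(row.count("#") for row in bitmap)
total = sum(len(row) for row in bitmap)
black_fraction = min_max_scale(black, 0, total)

print(black, total, round(black_fraction, 5))
```

Scaling every feature to the same 0 to 1 range keeps a feature with large raw numbers from dominating the distance calculations inside the SVM.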
Real life
OCR – Training the model
Training file
•  Labels
•  Features
Label  % Black Pixels  X1 %   X2 %   X3 %   …
0      0.33            0.546  0.840
1      0.78            0.123  0.567  0.347
1      0.75            0.512  0.543
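The slides do not show the file itself. Assuming the LIBSVM text format used by the tools listed under Resources (one example per line, written as a label followed by index:value pairs with indices counting up from 1), the rows above could be written like this. The function name is hypothetical.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' with 1-based indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:  # zero-valued features may be left out of this sparse format
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Label and feature values copied from the table above, in the order given.
rows = [
    (0, [0.33, 0.546, 0.840]),
    (1, [0.78, 0.123, 0.567, 0.347]),
    (1, [0.75, 0.512, 0.543]),
]

lines = [to_libsvm_line(label, features) for label, features in rows]
print("\n".join(lines))
```

Because each value is tied to its index, a given feature must keep the same number in the training file and in every later prediction request.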
Alice in Wonderland
Down the rabbit hole
Basic text features
•  Avg word length
•  Char frequency
•  …
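A minimal sketch of how these two features could be computed with the standard library only. The function name and the choice to count just the letters a to z are assumptions; accented characters and punctuation would need their own handling.

```python
import string
from collections import Counter


def text_features(text):
    """Return [average word length, relative frequency of a, b, ..., z]."""
    words = text.split()  # punctuation attached to a word counts toward its length
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    counts = Counter(letters)
    total = len(letters) or 1  # avoid dividing by zero on empty input
    frequencies = [counts[ch] / total for ch in string.ascii_lowercase]
    return [avg_word_length] + frequencies


features = text_features("ab abc")
print(len(features), features[0], features[1:4])
```

The letter frequencies already lie between 0 and 1; the average word length does not, so it would still need scaling before training.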
Training the model
Predicting the unknown
Testing the unknown
Input
“Thank you for contacting us. This is an automated response confirming the
receipt of your ticket. Our team will get back to you as soon as possible. When
replying, please make sure that the ticket ID is kept in the subject so that we
can track your replies.”
Output
This is an English text
Testing the unknown
Input
“Hierbij bevestigen wij de ontvangst en verwerking van uw e-mail met
ticketnummer PCL-98124-735. Uw vraag wordt opgepakt door één van onze
engineers. Wij streven ernaar spoedig een oplossing aan u terug te kunnen
koppelen.”
Output
This is a Dutch text
Testing the unknown
Titanic
The chance of survival
Feature extraction
•  Title (Mrs, Miss, Mr, Jonkheer, Capt, …)
•  Passenger class
•  Sex
•  Age
•  Siblings/spouses
•  Parents/children
•  Cabin
•  Port of embarkation
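A sketch of what extracting these features from one passenger row might look like. The numeric codes, the dictionary keys and the reduction of Cabin to a yes/no flag are assumptions made for illustration; the slides do not say how each column was encoded.

```python
import re

# Hypothetical numeric codes; any consistent mapping would do.
SEX_CODES = {"male": 0, "female": 1}
PORT_CODES = {"S": 0, "C": 1, "Q": 2}


def extract_title(name):
    """Pull the title out of 'Surname, Title. Given names'."""
    match = re.search(r",\s*([^.,]+)\.", name)
    return match.group(1).strip() if match else ""


def extract_features(row):
    """Turn one passenger record (a dict keyed by column name) into features."""
    return {
        "title": extract_title(row["Name"]),
        "pclass": int(row["Pclass"]),
        "sex": SEX_CODES[row["Sex"]],
        "age": float(row["Age"]) if row["Age"] else None,  # a blank stays missing
        "sibsp": int(row["SibSp"]),
        "parch": int(row["Parch"]),
        "has_cabin": 1 if row["Cabin"] else 0,  # assumption: only presence is used
        "embarked": PORT_CODES.get(row["Embarked"]),
    }


passenger = {
    "Name": "Braund, Mr. Owen Harris", "Pclass": "3", "Sex": "male", "Age": "22",
    "SibSp": "1", "Parch": "0", "Cabin": "", "Embarked": "S",
}
features = extract_features(passenger)
print(features)
```

The title would still have to be mapped to a number before it can go into a training file.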
Creating the training file
Preprocessing and scaling
Filling in blanks
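The slides do not say how the blanks were filled. One common choice, shown here purely as an assumption, is to replace a missing value with the median of the known ones. The ages are those of the 14 passengers in the table at the start, where Mr. Moran's age is blank.

```python
from statistics import median

# Ages of the 14 passengers shown earlier; None marks the blank entry.
ages = [22, 38, 26, 35, 35, None, 54, 2, 27, 14, 4, 58, 20, 39]


def fill_missing(values):
    """Replace every None with the median of the values that are present."""
    known = [v for v in values if v is not None]
    fallback = median(known)
    return [fallback if v is None else v for v in values]


filled_ages = fill_missing(ages)
print(filled_ages[5])
```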
Magic!
Result: 83.26% accuracy
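Accuracy is simply the share of predictions that match the known outcome. In the sketch below, `actual` is the Survived column of the 14 passengers shown earlier, while `predicted` is made up for illustration and is not the output of the model from the talk.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


actual = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0]  # hypothetical, two mistakes

score = accuracy(predicted, actual)
print(f"{score:.2%}")
```

The number only says something about the real world when it is measured on passengers the model has not seen during training.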
Common issues
•  Feature numbering
•  Training data ≠ real world
•  Overfitting
•  Feature selection
•  Multiclass classification
Next step?
Learn R, Python,…
Resources
•  https://www.csie.ntu.edu.tw/~cjlin/libsvm/
•  http://php.net/manual/en/book.svm.php
•  https://www.kaggle.com/
•  http://scikit-learn.org/stable/
•  https://packagist.org/packages/sjoerdmaessen/machinelearning
@sjoerdmaessen
linkedin.com/in/sjoerdmaessen
https://joind.in/talk/d921d
