Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Machine learning support vector machines

1,291 views

Published on

PHP talk for the Dutch PHP Conference 2016.

Published in: Data & Analytics
  • Be the first to comment

Machine learning support vector machines

  1. 1. Machine Learning Support Vector Machines
  2. 2. Sjoerd Maessen •  AFOL •  Works at E-sites Breda •  Stock market enthusiast
  3. 3. Titanic The chance of survival
  4. 4. PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 712.833 C85 C 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 S 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 S 6 0 3 Moran, Mr. James male 0 0 330877 84.583 Q 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 518.625 E46 S 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075 S 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 111.333 S 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 300.708 C 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7 G6 S 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.55 C103 S 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.05 S 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.275 S
  5. 5. Challenge accepted
  6. 6. “Could you take in account siblings?”
  7. 7. Sure…
  8. 8. “Oh and the number of parents and children aboard…”
  9. 9. Of course!
  10. 10. “Could you add a ‘simple if’ for age as well?”
  11. 11. “Great! We are almost there! But…”
  12. 12. “Field of study that gives computers the ability to learn without being explicitly programmed” Arthur Lee Samuel
  13. 13. hZps://personality-insights-livedemo.mybluemix.net/
  14. 14. Alright! Let's become data scientists!
  15. 15. A comparison Traditional programming Machine learning Input program Input output output programnew input new output
  16. 16. Classification vs regression Input Output 0.98 68 0 0.76 42 0 1.23 78 1 1.91 109 1 Input Output 0.98 68 0.23 0.76 42 0.15 1.23 78 4.74 1.91 109 7.98
  17. 17. Support Vector Machine •  Automatically creates a “program” or model •  Inputs are ‘features’ •  Model represents a space •  New input fits somewhere
  18. 18. Support Vectors •  Optimal hyperplane •  Linear classifier •  Maximum margin •  Classification
  19. 19. Linearly separable dataset
  20. 20. Non-linear decision boundary
  21. 21. The kernel trick A whole new dimension
  22. 22. The kernel trick
  23. 23. Choosing a kernel •  No kernel or linear kernel •  Gaussian kernel •  Polynomial kernel •  Sigmoid kernel •  Radial basis function kernel •  …
  24. 24. Choosing a kernel Rule of thumb •  N much bigger than M => linear kernel •  N small, M intermediate => gaussian kernel N = number of features M = number of training examples
  25. 25. Spam detection •  N = 10000 (bad words, # of urls,…) •  M = 250 (sample mails) => linear kernel
  26. 26. Validation of housing prices •  N = 1-1000 (# of rooms, m3, location,…) •  M = 100,000 (of transactions) => Gaussian kernel
  27. 27. Features It’s all about preparation
  28. 28. Features •  Representation of raw data •  The hardest part Raw data Pre- processing Feature scaling Feature extraction
  29. 29. Association discovery
  30. 30. OCR – Pre-processing •  De-skew
  31. 31. OCR – Pre-processing •  De-skew •  Despeckle
  32. 32. OCR – Pre-processing •  De-skew •  Despeckle •  Convert to black & white
  33. 33. OCR – Pre-processing •  De-skew •  Despeckle •  Convert to black & white •  Zoning
  34. 34. OCR – Pre-processing •  De-skew •  Despeckle •  Convert to black & white •  Zoning •  Character segmentation
  35. 35. OCR – Feature extraction
  36. 36. OCR – Feature scaling How to scale the number of black pixels? 22 / 56=> 0.39286 Scale between 0 - 1
  37. 37. Real life
  38. 38. OCR – Training the model Training file •  Labels •  Features Label % Black Pixels X1 % X2 % X3 % … 0 0.33 0.546 0.840 1 0.78 0.123 0.567 0.347 1 0.75 0.512 0.543
  39. 39. Alice in Wonderland Down the rabbit hole
  40. 40. •  Avg word length •  Char frequency •  … Basic text features
  41. 41. •  Avg word length •  Char frequency •  … Basic text features
  42. 42. Training the model
  43. 43. Predicting the unknown
  44. 44. Input “Thank you for contacting us. This is an automated response confirming the receipt of your ticket. Our team will get back to you as soon as possible. When replying, please make sure that the ticket ID is kept in the subject so that we can track your replies.” Testing the unknown
  45. 45. Input “Thank you for contacting us. This is an automated response confirming the receipt of your ticket. Our team will get back to you as soon as possible. When replying, please make sure that the ticket ID is kept in the subject so that we can track your replies.” Output This is an English text Testing the unknown
  46. 46. Input “Hierbij bevestigen wij de ontvangst en verwerking van uw e-mail met ticketnummer PCL-98124-735. Uw vraag wordt opgepakt door één van onze engineers. Wij streven ernaar spoedig een oplossing aan u terug te kunnen koppelen. “ Testing the unknown
  47. 47. Input “Hierbij bevestigen wij de ontvangst en verwerking van uw e-mail met ticketnummer PCL-98124-735. Uw vraag wordt opgepakt door één van onze engineers. Wij streven ernaar spoedig een oplossing aan u terug te kunnen koppelen. “ Output This is a Dutch text Testing the unknown
  48. 48. Titanic The chance of survival
  49. 49. PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 712.833 C85 C 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 S 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 S 6 0 3 Moran, Mr. James male 0 0 330877 84.583 Q 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 518.625 E46 S 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075 S 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 111.333 S 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 300.708 C 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7 G6 S 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.55 C103 S 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.05 S 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.275 S
  50. 50. Feature extraction •  Title (Mrs, Miss, Mr, Jonkheer, Capt,..) •  Passengerclass •  Sex •  Age •  Siblings/spouses •  Parent/children •  Cabin •  Port of embarkation
  51. 51. Creating the trainingfile
  52. 52. Preprocessing and scaling
  53. 53. Preprocessing and scaling
  54. 54. Preprocessing and scaling
  55. 55. Preprocessing and scaling
  56. 56. Filling in blanks
  57. 57. Magic! Result: 83,26% accuracy
  58. 58. Common issues •  Feature numbering •  Training data <> real world •  Overfitting •  Feature selection •  Multiclass classification
  59. 59. Next step?
  60. 60. Learn R, Python,…
  61. 61. Resources •  https://www.csie.ntu.edu.tw/~cjlin/libsvm/ •  http://php.net/manual/en/book.svm.php •  https://www.kaggle.com/ •  http://scikit-learn.org/stable/ •  https://packagist.org/packages/sjoerdmaessen/machinelearning
  62. 62. @sjoerdmaessen linkedin.com/in/sjoerdmaessen https://joind.in/talk/d921d

×