Vít Listík - Email.cz workshop

Machine Learning Prague 2016
www.mlprague.com

Published in: Technology

  1. 1. Email.cz workshop Vít Listík @tivwvit
  2. 2. Email stats ● 60M emails per day ● 3M users daily, 6M monthly ● 2 PB [Diagram: Email delivery process]
  3. 3. Antispam ● Fighting with the bad guys
  4. 4. Antispam sources ● Content ○ Text ○ Images ○ Attachments ○ Links ○ Headers ● Metadata ○ Traffic ○ Historic data (reputation) ○ Blacklists ○ Rules (DKIM, DMARC, SPF)
  5. 5. Grey email: Graymail is solicited bulk email messages that don't fit the definition of email spam (e.g., the recipient "opted into" receiving them). Recipient interest in this type of mailing tends to diminish over time, increasing the likelihood that recipients will report graymail as spam. In some cases, graymail can account for up to 82 percent of the average user's email inbox.
  6. 6. Antispam stats again
  7. 7. ML in antispam ● Topic ● Unsubscribe ● Phishing ● Domain keywords ● Images ● Personalized filter ● Link naturalness
  8. 8. Examples https://github.com/tivvit/ML-Prague-2016-email-workshop
  9. 9. Tools ● Jupyter ○ Visualizations ○ State ● HDF5 ● Pandas ● IPython cluster ● Cluster storage
  10. 10. Let's go (to) Jupyter
  11. 11. Topic categorization ● 16 categories ● Manually labeled dataset ● 2 languages (2 models) ● 7th version ● Overlapping classes
  12. 12. NLP ● Bag of words ● Lemmatization ● Stop words
      Example sentences:
      (1) John likes to watch movies. Mary likes movies too.
      (2) John also likes to watch football games.
      Vocabulary: [ "John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too" ]
      Count vectors:
      (1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
      (2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
  13. 13. SVM ● Classification ● Best split for classes ● Linear classifier (kernels). Multi-class version: ● One vs. all ● Winner takes all
  14. 14. Topic categorization
  15. 15. Image categorization ● Classes: spam vs. ham ● Based on user reaction ● Links analysis ● Low level image features ○ Size ○ DPI ○ Histograms ○ Exif ○ Compression ● Raw pixels
  16. 16. Spam roulette
  17. 17. User reactions ● Noisy ● Inconsistent ● Bots ● Low ratio
  18. 18. Image topics ● Caffe ● Pretrained network ● Same classes as for words ● Cleaned dataset of images from classified emails ● 400k images ● Slow on CPU [Example image labels: Loan non-bank, Pharmacy, Discount, Ebola]
  19. 19. Distributed learning ● Spark ● SparkNet (Caffe) ● Elephas (Keras)
  20. 20. Image types ● Trivial ○ Animated ○ Monitoring ○ Border ● Photo ● Graphics ● Photo with graphics [Example images: Graphics, Photo]
  21. 21. Image features. Extraction: ● PIL ● OpenCV ● ImageMagick. Features (142): ● Channel stats ○ Min, max, mean ○ Standard deviation ○ Skewness ○ Entropy
  22. 22. Learning ● Scipy - Decision Trees ● Keras (TensorFlow, Theano) ● 30k manually labeled samples
  23. 23. Trees vs. neurons
  24. 24. Message ● Gray email ● Explore (visualize) your data (in Jupyter) ● Use libraries ● Simple subtasks (boosting) may help ● Store intermediate results ● Store test results with the model
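
Sketch for slide 12 (NLP): the bag-of-words vectors on the slide can be reproduced with a few lines of standard-library Python. This is only a minimal illustration, not the workshop's code. The helper names `tokenize` and `bag_of_words` are hypothetical, the tokenizer is a simple letters-only regular expression, and lemmatization and stop-word removal (also listed on the slide) are deliberately left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens (letters only, case preserved)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]


# Vocabulary and sentences exactly as given on the slide.
vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]
doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."

vec1 = bag_of_words(doc1, vocabulary)
vec2 = bag_of_words(doc2, vocabulary)
print(vec1)
print(vec2)
```

The two printed vectors match the count vectors (1) and (2) shown on the slide. Word order is discarded, which is the defining trade-off of the representation.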
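
Sketch for slide 13 (SVM): "one vs. all" with "winner takes all" means training one binary linear classifier per class and predicting the class whose classifier gives the highest score. The example below shows only the prediction step. The category names, weights, and biases are made up for illustration (the slides do not list the 16 categories), and the function names are hypothetical. In practice the weights would come from a trained linear SVM.

```python
def decision_score(weights, bias, features):
    """Linear decision function: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_one_vs_all(classifiers, features):
    """Score the input with every per-class classifier and
    return the label with the highest score (winner takes all)."""
    scores = {
        label: decision_score(weights, bias, features)
        for label, (weights, bias) in classifiers.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical classifiers: label -> (weights, bias).
classifiers = {
    "travel": ([1.0, -0.5, 0.0], -0.2),
    "finance": ([-0.5, 1.0, 0.0], -0.1),
    "social": ([0.0, 0.0, 1.0], -0.3),
}

# A bag-of-words style count vector over a 3-word vocabulary.
label, scores = predict_one_vs_all(classifiers, [0, 2, 1])
print(label, scores)
```

Because each classifier is trained independently, their scores are not calibrated against each other. Taking the maximum is the simplest way to resolve that.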

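
Sketch for slide 21 (Image features): the channel statistics listed there (min, max, mean, standard deviation, skewness, entropy) can be computed from the pixel values of one color channel. The function below is an assumption-laden illustration in plain Python. It takes a flat list of integer intensities, which in practice would be extracted with a library such as PIL or OpenCV. The name `channel_stats` is hypothetical, the statistics use population (not sample) formulas, and entropy is the Shannon entropy of the value histogram in bits.

```python
import math
from collections import Counter


def channel_stats(values):
    """Summary statistics for one image channel given as a flat list of ints."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    # Skewness is undefined for a constant channel; report 0.0 in that case.
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    else:
        skewness = 0.0
    # Shannon entropy of the intensity histogram, in bits.
    histogram = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in histogram.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "entropy": entropy,
    }


# Toy channel: half black, half white pixels.
stats = channel_stats([0, 0, 255, 255])
print(stats)
```

Applying the same function to each channel of an image and concatenating the results gives a fixed-length feature vector suitable for a decision tree or a small neural network.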