JDD 2016 - Tomasz Lelek - Machine Learning With Apache Spark

How to use text data to draw conclusions about users of our website or forum?
This talk describes a solution to a particular problem, using Machine Learning and statistics. Based on a provided forum, we will create a program that learns the structure of posts using Natural Language Processing techniques. Once the Machine Learning models are trained, the program is able to answer, with a probability, which of the forum's users wrote a particular post.

We will go through all the steps required to create Machine Learning models for text. How do you use Natural Language Processing and Bag-of-Words techniques to analyse text? How do you prepare input data for further processing by Machine Learning models? I will answer those questions. The implementation is written in Apache Spark, so we will get to know that technology together with some important libraries such as Spark MLlib and the DataFrame API. From MLlib we will use the Gaussian Mixture Model and Logistic Regression.

  1. tomekl007 @tomekl007 Tomasz Lelek MACHINE LEARNING WITH APACHE SPARK
  2. What will we try to achieve? Find the author of a given post, based on the text of the post.
  3. Input data: a forum with a given structure of posts.
  4. Preparing data
  5. Tokenization • Input: Swimmer like to swim, so he swims. • Output: swimmer, like, to, swim, so, he, swims
  6. Remove Stop Words • Each language has stop words, e.g.: to, as, a, the, …
  7. Lemmatization - Morphological Analysis • mum: mums, mummies, mummy
  8. Load forum data
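
The original slide shows the loading code; here is a minimal sketch in Scala, assuming the forum posts are stored as line-delimited JSON with author, text and createdAt fields (the path and field names are assumptions, not taken from the slides):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("forum-author-detection")
      .getOrCreate()

    // Assumed layout: one JSON object per line with author, text and createdAt fields.
    val posts = spark.read.json("forum_posts.json")
    posts.printSchema()
    posts.show(5)
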
  9. Tokenize and Stop Words
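
A sketch of what this step looks like with Spark ML's feature transformers; the column names are assumptions:

    import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}

    // Split the raw post text into lower-cased tokens.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("tokens")

    // Drop common stop words such as "to", "as", "a", "the".
    val remover = new StopWordsRemover()
      .setInputCol("tokens")
      .setOutputCol("filteredTokens")

    val tokenized = tokenizer.transform(posts)
    val cleaned = remover.transform(tokenized)
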
  10. Transforming text into a vector of numbers
  11. Bag-of-Words 1. Jon likes watching movies. Mary likes movies too. 2. Jon also likes watching football games. Vocabulary: [“Jon”, “likes”, “watching”, “movies”, “also”, “football”, “games”, “Mary”, “too”] 1. [1, 2, 1, 2, 0, 0, 0, 1, 1] 2. [1, 1, 1, 0, 1, 1, 1, 0, 0]
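
In Spark the same counting can be done with CountVectorizer; a minimal sketch on the token column from the previous step (column names assumed):

    import org.apache.spark.ml.feature.CountVectorizer

    // Build a vocabulary from the tokens and count word occurrences per post.
    val countVectorizer = new CountVectorizer()
      .setInputCol("filteredTokens")
      .setOutputCol("bagOfWords")

    val bowModel = countVectorizer.fit(cleaned)
    val bagOfWords = bowModel.transform(cleaned)
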
  12. Word2Vec: closest words to FRANCE:
  13. Skip-Gram • Input: In Poland rain mainly in September. • Output: In rain, Poland mainly, rain in, mainly September
  14. Spark Word2Vec
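
The slide shows Spark's Word2Vec in code; a minimal sketch follows, with the vector size, minimum count and column names as assumptions. The findSynonyms call is what produces "closest words" lists like the FRANCE example:

    import org.apache.spark.ml.feature.Word2Vec

    // Learn a fixed-size vector per word and average the word vectors per post.
    val word2Vec = new Word2Vec()
      .setInputCol("filteredTokens")
      .setOutputCol("features")
      .setVectorSize(100)
      .setMinCount(0)

    val w2vModel = word2Vec.fit(cleaned)
    val vectorized = w2vModel.transform(cleaned)

    // Words whose vectors are closest to a given word.
    w2vModel.findSynonyms("france", 10).show()
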
  15. Machine Learning • Supervised Learning – input data needs to be labeled • Unsupervised Learning – data is not labeled, e.g. clustering
  16. Used techniques • Logistic Regression • Gaussian Mixture Model
  17. I. Logistic Regression • Supervised Learning • The data that we want to analyze carries a binary label (1 or 0) • The input could be a vector of numbers (text transformed using Word2Vec) with a binary label • The vector (text) was either written by the author (1) or not (0)
  18. Logistic Regression example input: Hours of Study vs passing of exam (1 or 0)
  19. Chart
  20. Example result
  21. II. Gaussian Mixture Model • Unsupervised learning • Used to draw conclusions from time data • Answers the question: what is the probability that some event occurred at a given time?
  22. Graphic representation (x axis: hour)
  23. Next steps to build the model • What do we want to achieve? • Find the author of a given post with some probability, based on the text of the post
  24. Input data for our algorithms • Word2Vec • Example sentence: “It is very important to plan for a future but also being in the moment” • The resulting vector may look like:
  25. Logistic Regression model per author
  26. Area under ROC
  27. Interpreting measures
  28. Prepare labeled data
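
A sketch of the labelling step for a single author, continuing the sketch above (the author column name and value are assumptions): the label is 1.0 when the post was written by that author and 0.0 otherwise.

    import org.apache.spark.sql.functions.{col, when}

    val authorX = "someAuthor" // hypothetical author name

    // Binary label for the one-vs-rest model of this author.
    val labeled = vectorized.withColumn(
      "label",
      when(col("author") === authorX, 1.0).otherwise(0.0))

    // Keep part of the data aside for validation.
    val Array(training, test) = labeled.randomSplit(Array(0.8, 0.2))
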
  29. Build model
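
Fitting the Logistic Regression model on the Word2Vec features; one such model is trained per author. The hyper-parameters below are assumptions, not values from the talk.

    import org.apache.spark.ml.classification.LogisticRegression

    // Binary classifier: does this post look like it was written by author X?
    val logisticRegression = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setMaxIter(100)
      .setRegParam(0.01)

    val lrModel = logisticRegression.fit(training)
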
  30. Model validation
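
Validation using the area under the ROC curve from the earlier slide; a sketch with Spark's BinaryClassificationEvaluator:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val predictions = lrModel.transform(test)

    // Area under ROC: 0.5 is as good as random guessing, 1.0 is perfect separation.
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setMetricName("areaUnderROC")

    println(s"Area under ROC: ${evaluator.evaluate(predictions)}")
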
  31. Add the time when the post was written to the model
  32. Time of day distribution for author X
  33. Preparing data for GMM
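
A sketch of the data preparation for the GMM, assuming a createdAt timestamp column: the single feature is the hour of day at which author X's posts were written.

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.{col, hour}

    // One-dimensional feature: the hour of day (0-23) when the post was created.
    val hoursOfAuthorX = posts
      .filter(col("author") === authorX)
      .withColumn("postHour", hour(col("createdAt")).cast("double"))

    val assembler = new VectorAssembler()
      .setInputCols(Array("postHour"))
      .setOutputCol("features")

    val gmmInput = assembler.transform(hoursOfAuthorX)
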
  34. Creating GMM
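
Fitting the Gaussian Mixture Model on those hours; the number of components k is a guess, not a value from the talk.

    import org.apache.spark.ml.clustering.GaussianMixture

    // Mixture of Gaussians over the posting hours of author X.
    val gmm = new GaussianMixture()
      .setK(2)
      .setFeaturesCol("features")

    val gmmModel = gmm.fit(gmmInput)

    // Each component has a weight, a mean hour and a variance.
    gmmModel.gaussians.zip(gmmModel.weights).foreach { case (gaussian, weight) =>
      println(s"weight=$weight mean=${gaussian.mean} cov=${gaussian.cov}")
    }
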
  35. Evaluating Logistic Regression together with the GMM model
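
The slides do not spell out how the two models are combined, so the sketch below is only one plausible option: weight the Logistic Regression probability for an author by the GMM likelihood of the posting hour, and pick the author with the highest combined score.

    import org.apache.spark.ml.clustering.GaussianMixtureModel
    import org.apache.spark.ml.linalg.Vectors

    // Likelihood of a posting hour under the author's mixture of Gaussians.
    def hourLikelihood(model: GaussianMixtureModel, hourOfDay: Double): Double = {
      val point = Vectors.dense(hourOfDay)
      model.gaussians.zip(model.weights)
        .map { case (gaussian, weight) => weight * gaussian.pdf(point) }
        .sum
    }

    // Assumed combination: score(author) = P(author | text) * likelihood(hour | author).
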
  36. Find the author for the post: • “Given that somebody could take that as a granted, I think we should” • The post was written at 18:00.
  37. Test run
  38. Result
  39. How could it be used?
  40. Thank you. Questions?

×