tomekl007 @tomekl007
Tomasz Lelek
MACHINE LEARNING WITH APACHE
SPARK
What are we trying to achieve?
Find the author of a given post, based on the text of the
post
Input data
Forum with given structure of posts:
Preparing data
Tokenization
• Input: Swimmer like to swim, so he swims.
• Output: swimmer, like, to, swim, so, he, swims
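A minimal sketch of the tokenization step in plain Python (not the Spark Tokenizer). The regular expression and the function name `tokenize` are assumptions chosen for illustration:

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Swimmer like to swim, so he swims."))
# ['swimmer', 'like', 'to', 'swim', 'so', 'he', 'swims']
```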
Remove Stop Words
• Each language has stop words, e.g.:
to, as, a, the, …
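Stop-word removal can be sketched as a simple filter. The tiny `STOP_WORDS` set below is hypothetical; a real list is language-specific and much longer:

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"to", "as", "a", "the", "so", "he"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stop_words]

tokens = ["swimmer", "like", "to", "swim", "so", "he", "swims"]
print(remove_stop_words(tokens))
# ['swimmer', 'like', 'swim', 'swims']
```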
Lemmatization -
Morphological Analysis
• mum:
mums
mummies
mummy
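As a rough illustration, lemmatization can be pictured as a lookup from word forms to one common form. The `LEMMAS` table is a hypothetical stand-in for a real morphological analyzer:

```python
# Hypothetical lookup table; a real lemmatizer uses a morphological dictionary.
LEMMAS = {"mums": "mum", "mummies": "mum", "mummy": "mum"}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token with its common form when one is known."""
    return [lemmas.get(token, token) for token in tokens]

print(lemmatize(["mummy", "mums", "likes"]))
# ['mum', 'mum', 'likes']
```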
Load forum data
Tokenize and Stop Words
Transforming text into vector
of numbers
Bag-of-Words
1. Jon likes watching movies. Mary likes movies
too.
2. Jon also likes watching football games.
[“Jon”, “likes”, “watching”, “movies”, “also”, “football”,
“games”, “Mary”, “too”]
1. [1, 2, 1, 2, 0, 0, 0, 1, 1]
2. [1, 1, 1, 0, 1, 1, 1, 0, 0]
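The counting behind Bag-of-Words can be sketched in plain Python. The vocabulary order is taken from the slide; the function name `bag_of_words` is an assumption. Note that "movies" occurs twice in the first document, so its count is 2:

```python
import re
from collections import Counter

# Vocabulary in the order used on the slide.
VOCABULARY = ["Jon", "likes", "watching", "movies", "also",
              "football", "games", "Mary", "too"]

def bag_of_words(text, vocabulary=VOCABULARY):
    """Return how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"\w+", text))
    return [counts[word] for word in vocabulary]

print(bag_of_words("Jon likes watching movies. Mary likes movies too."))
# [1, 2, 1, 2, 0, 0, 0, 1, 1]
print(bag_of_words("Jon also likes watching football games."))
# [1, 1, 1, 0, 1, 1, 1, 0, 0]
```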
Word2Vec
FRANCE closest words:
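Closeness between word vectors is typically measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real Word2Vec vectors are learned and much longer); the words and numbers are assumptions for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vectors, invented for this example.
VECTORS = {
    "france": [0.9, 0.1, 0.3],
    "spain": [0.8, 0.2, 0.35],
    "football": [0.1, 0.9, 0.2],
}

def closest_words(word, vectors=VECTORS, top_n=1):
    """Rank all other words by cosine similarity to the given word."""
    others = [(w, cosine_similarity(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:top_n]

print(closest_words("france"))
```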
Skip-Gram
• Input:
In Poland rain mainly in September.
• Output:
In rain, Poland mainly, rain in, mainly September
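The pairs listed above are bigrams that skip one word. A minimal sketch that reproduces them, assuming whitespace tokenization and a hypothetical helper `skip_bigrams`:

```python
def skip_bigrams(tokens, skip=1):
    """Pair each token with the token `skip` positions past its neighbour."""
    step = skip + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]

tokens = "In Poland rain mainly in September".split()
print(skip_bigrams(tokens))
# [('In', 'rain'), ('Poland', 'mainly'), ('rain', 'in'), ('mainly', 'September')]
```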
Spark Word2Vec
Machine Learning
• Supervised Learning – input data needs to be
labeled
• Unsupervised Learning – input data is not labeled (e.g. clustering).
Used techniques
• Logistic Regression
• Gaussian Mixture Model
I. Logistic Regression
• Supervised Learning
• The data we want to analyze has binary labels (1
or 0)
• The input could be a vector of numbers (text
transformed using Word2Vec) with a binary label
• The vector (text) was written by a given author (1) or not (0)
Logistic Regression example
input
Hours of study vs. passing the exam (1 or 0)
Chart
Example result
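A small sketch of the hours-of-study example, fitted with plain gradient descent rather than Spark. The data points, learning rate, and iteration count are all assumptions made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Hypothetical data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
w, b = fit_logistic(hours, passed)

def pass_probability(h):
    """Estimated probability of passing after h hours of study."""
    return sigmoid(w * h + b)

print(pass_probability(1.0), pass_probability(4.5))
```

With this data the fitted probability rises with the number of hours, which is the shape the chart is meant to show.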
II. Gaussian Mixture Model
• Unsupervised learning
• Used to draw conclusions from time data
• Answers the question: what is the probability that
some event occurred at a given time?
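To make the idea concrete, a mixture of Gaussians over the hour of day can be evaluated as below. The three components (weight, mean hour, standard deviation) are hypothetical values, not fitted ones, and the sketch ignores that hours wrap around at midnight:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical components: (weight, mean hour, standard deviation).
COMPONENTS = [(0.3, 8.0, 1.5), (0.1, 13.0, 1.5), (0.6, 20.0, 2.0)]

def mixture_density(hour, components=COMPONENTS):
    """Weighted sum of the component densities at the given hour."""
    return sum(weight * gaussian_pdf(hour, mean, std)
               for weight, mean, std in components)

for hour in (8, 13, 20):
    print(hour, round(mixture_density(hour), 4))
```

With these made-up parameters the density is highest in the evening, lower in the morning, and lowest around midday.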
Graphic representation (x-axis: hour)
Next steps to build model
• What do we want to achieve?
• Find the author of a given post with some
probability, based on the text of the post
Input data for our algorithms
• Word2Vec
• Example sentence: “It is very important to plan for
a future but also being in the moment”
• The resulting vector may look like:
Logistic Regression model
per author
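The one-model-per-author idea can be sketched as scoring a post with every model and sorting by probability. The author names, weights, and biases in `MODELS` are hypothetical; in practice each model would be trained on labeled posts:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    """Probability from one logistic regression model."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

# Hypothetical trained models: author -> (weights, bias).
MODELS = {
    "author_a": ([2.0, -1.0], 0.0),
    "author_b": ([-1.0, 2.0], 0.0),
}

def rank_authors(features, models=MODELS):
    """Score the post with every author's model, most probable first."""
    scored = [(author, score(weights, bias, features))
              for author, (weights, bias) in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_authors([1.0, 0.0]))
```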
Area under ROC
Interpreting measures
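Area under ROC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A direct, quadratic-time sketch of that definition (the function name is an assumption):

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

print(area_under_roc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
print(area_under_roc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1]))  # 0.75
print(area_under_roc([1, 0], [0.5, 0.5]))                  # 0.5
```

A value of 1.0 means every positive outranks every negative, while 0.5 is no better than guessing.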
Prepare labeled data
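Labeling for a single author's model might look like the sketch below, assuming posts are available as hypothetical `(author, vector)` pairs:

```python
def label_for_author(posts, author):
    """Turn (author, vector) pairs into (label, vector) pairs for one author."""
    return [(1 if post_author == author else 0, vector)
            for post_author, vector in posts]

# Hypothetical posts, already transformed to vectors.
posts = [("a", [0.1, 0.2]), ("b", [0.3, 0.4]), ("a", [0.5, 0.6])]
print(label_for_author(posts, "a"))
# [(1, [0.1, 0.2]), (0, [0.3, 0.4]), (1, [0.5, 0.6])]
```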
Build model
Model validation
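Validation relies on holding out part of the data. A minimal sketch of a seeded random split; the 70/30 ratio and the seed are assumptions:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```

The model is built on `train` only, and measures such as area under ROC are computed on `test`.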
Add time when post was
written to model
Time of day distribution for
author X
Preparing data for GMM
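Extracting the GMT hour from a post timestamp can be sketched as follows, assuming timestamps are Unix epoch seconds (the source does not state the format):

```python
from datetime import datetime, timezone

def hour_of_day_gmt(timestamp_seconds):
    """Hour (0-23) in GMT/UTC for a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(timestamp_seconds, tz=timezone.utc).hour

print(hour_of_day_gmt(18 * 3600))  # 18
```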
Creating GMM
Evaluating Logistic
Regression with GMM model.
Find author for post:
• “Given that somebody could take that as a
granted, I think we should”
• The post was written at 18:00.
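For each candidate author, the text vector and that author's time-of-day value are combined into one input for the logistic regression model. A sketch under the assumption that the hour model is any function from hour to a number; both names are hypothetical:

```python
def features_for_author(text_vector, hour, hour_density):
    """Append the author's time-of-day value to the post's text vector."""
    return list(text_vector) + [hour_density(hour)]

# Hypothetical hour model: this author mostly writes in the evening.
evening_author = lambda hour: 0.12 if 17 <= hour <= 22 else 0.01

print(features_for_author([0.1, 0.2], 18, evening_author))
# [0.1, 0.2, 0.12]
```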
Test run
Result
How could it be used?
Thank you, Questions?

Jdd machine learning

Editor's Notes

  1. We will solve a problem with Apache Spark, using machine learning techniques.
  2. We need to analyze the text of a post to find its author, so we will use Natural Language Processing techniques. We have only the text of the post, and we want to find which author, from a group of authors, wrote it.
  3. Let's say the forum has around 10,000 posts. Each author can have one or many posts: some authors have around 1,000 posts, others have just one. We are mainly interested in the post body, which we will use for further processing.
  4. Before we work with the data, we need to prepare and clean the text in the post bodies.
  5. We need to split the whole body into tokens. We want to represent our text as a vector, and this is the first step toward that.
  6. We need to remove stop words from the text. They carry little semantic value for text analysis.
  7. A word can be written in multiple forms. We want those forms reduced to one, so that the next processing step treats them as the same word. Otherwise they would be represented in the vector as different numbers, and our algorithms would work less effectively. In the given example, the word "mum" could be written as "mummy" or appear in the plural. We want all of those words to be represented as one word, so we transform them to one common form.
  8. We will use the DataFrame API. We want to load all the data from a MySQL database that was imported into our local MySQL instance.
  9. Given a DataFrame with the loaded data and a sqlContext, we create a Spark Tokenizer. It handles splitting the “body” column into an array of “words”. Then StopWordsRemover removes all stop words from the “words” array and puts the remaining words in the without-stop-word column of the DataFrame. We need to call setStopWords with the stop words specific to the language we analyze. We keep all stop words in the StopWords.allStopWords array.
  10. To run machine learning algorithms on our data, we need to transform the input arrays of words into vectors of numbers. Algorithms like Logistic Regression or GMM, which we will use later, operate on numbers. There are algorithms that will do this for us; we need to configure them properly. In reality, the vector for a set of words will be much longer and will have more dimensions.
  11. A simple method of representing text as a vector is Bag-of-Words. It is commonly used in document classification problems, where a document is assigned to the proper category. In this method, text is represented as a set of words and the frequency of each word's occurrence in the text. We have two documents. Each document is represented as a vector. An index corresponds to a word in the token array, and the number at a given index is the number of occurrences of that word in the document.
  12. Word2Vec is another, more complex algorithm that makes a vector of numbers from text. First it constructs a dictionary from the text data, then it represents that data as vectors. It can be used to find the closest (most similar) words. For example, given the word "France" as input, the words closest by cosine distance are:
  13. Word2Vec uses two algorithms to represent text as vectors: Continuous Bag-of-Words and Skip-gram. A skip-gram is a generalization of an N-gram in which the words need not be adjacent. Using skip-gram better preserves the semantics of the text, because information about where a word occurred is not lost when the text is transformed into a vector.
  14. Here we have a function that takes a DataFrame with a words column. We create a new Word2Vec instance. There are two important configuration parameters: vectorSize, which should be at least 100 to represent a document distinctively as a vector, and minCount.
  15. There are two groups of ML algorithms. Take an algorithm for finding faces in photos as an example. Supervised: the input data is a photo labeled with whether it contains a face or not (1 or 0).
  16. Now we will look at the techniques that will later be used to analyze the forum data. For now, we will examine them using simple examples.
  17. The input data is the number of hours studied and whether the student passed the exam (1) or not (0).
  18. The logistic regression result can be depicted by a chart like this. We see that the more hours are spent on learning, the higher the probability of passing the exam.
  19. Using logistic regression, we can draw such conclusions. This algorithm will be used later to analyze our forum data.
  20. As input, we have information about when an event occurred.
  21. Given an author and many of their posts, each with a timestamp of when it was written, we can create a very effective GMM. The most important parameter for a GMM is the number of clusters. In this example there are 3 clusters: we see three Gaussian distributions combined. The X axis shows the hour when a post was written. We see that the author most often wrote posts in the evening, less often in the morning, and seldom in the middle of the day. We can use this to analyze our forum data and find the author of a given post, because we know when each author prefers to write posts.
  22. Our input data is transformed into vectors of numbers using the Word2Vec algorithm.
  23. We want to build one logistic regression model per author, so for N authors there will be N models. The first model is created for author A. We see that the first post (represented as a vector) was written by author A, the second was not, and so on. The second model is built for author B. Now, when we have a post represented as a vector, we iterate over the models, asking whether the given post was written by author A, then B. Each model returns some probability that the post was written by its author. The model that returns the highest probability tells us that the author for which it was built wrote the post.
  24. We want to measure the accuracy of the models. The measure used for that is the area under the ROC curve. It measures how well the model fits.
  25. 0.50 means that you could just as well toss a coin.
  26. We prepare the input data for logistic regression: a vector with a label (1 or 0) indicating whether the post belongs to the given author or not.
  27. We split the data into a training set and a test set. The training set is used to build the model; the test set is used to validate it. From validation we get measures (e.g. area under ROC).
  28. We want to add information about when a post was written to our model. We are interested only in the hour of the post. Then we can see an author's tendencies, e.g. that they write posts mainly at midnight. Or, if some authors live in the USA, they will write posts at a totally different time of day than authors from Poland.
  29. Example distribution for an author: we see that the author most often writes posts from 3 to 12 and from 16 to 19 GMT. Once we create a GMM for the author, the model answers the question of whether the author could write posts at a given hour, as a probability. That one value is not enough to find the author of a given post, but it can be used as an additional dimension (value) in the Logistic Regression model input.
  30. Normalizing time basically takes a timestamp, creates a date in GMT from it, and takes hourOfDate from that date.
  31. The input for a GMM needs to be a vector. Then we create a GMM with three clusters.
  32. For the Logistic Regression model trained using the Word2Vec input and the GMM result probability, our model gave the following results. We see that the two top models are excellent!
  33. The input to our model is the post body and the hour when the post was written. We transform the text into a vector using Word2Vec, and the hour is passed to the GMM predict method. We append the two vectors and evaluate the logistic regression model. It returns the probability that the post written at 18 o’clock was written by the analyzed author. We do this for each author and sort by probability in descending order. The result at the top of the result set tells us which author wrote the post with the highest probability.
  34. There is a probability of 87% that author wild wrote that post.
  35. How could this be used? If we have a forum, we could find that one person has two accounts, because they write in exactly the same way. If someone writes anonymously but previously wrote posts as a logged-in user, we could identify that user.