[SmartNews] Globally Scalable Web Document Classification Using Word2Vec


  1. Globally Scalable Web Document Classification Using Word2Vec. Kohei Nakaji (SmartNews)
  2. keyword: machine learning for discovery
  3. SmartNews Demo
  4. About SmartNews. Japan: launched 2013, 4M+ monthly active users, 50% DAU/MAU, 100+ publishers, 2013 App of the Year. US: launched Oct 2014, 1M+ monthly active users, same engagement, 80+ publishers, top News-category app. International: launched Feb 2015, 10M downloads worldwide, same engagement, English beta, featured app. Funding: $50M
  5. Outline of our algorithm: Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → 1000+/day delivered
  6. Outline of our algorithm (same pipeline): Web Document Classification ⊂ Structure Analysis + Semantics Analysis
  7. Web Document Classification. Task definition: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set (ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD, …).
  8. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  9. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  10. Main Content Extraction. Two approaches: extract after rendering the whole page (easier, but takes time), or extract directly from the HTML (more difficult, but fast).
  11. Main Content Extraction. Two approaches: extract after rendering the whole page (easier, but takes time), or extract directly from the HTML (more difficult, but fast). Our approach is the latter.
  12. Main Content Extraction from HTML. Example:
      <html> <body> <div>click <a>here</a> for</div> <div> <a>tweet</a><a>share</a> <p>Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p> <a>you also like this</a> <p>So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html>
      The two <p> paragraphs are main content; the link texts ("click here for", "tweet", "share", "you also like this") are not.
  13. Main Content Extraction from HTML. A rule-based extraction algorithm is possible. English: Rule 1: a div with text length > 200 and fewer than 3 <a> tags is main content. Rule 2: a div with text length < 100 and more than 4 <p> tags is main content. … Rule N: …
  14. Main Content Extraction from HTML. Rule-based extraction is possible, but not scalable: Japanese (and every other language) needs its own rule set.
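As a sketch, the per-language rules on these slides might look like the following; the helper function is hypothetical, not SmartNews code, and the thresholds (200 characters, 3 links) come from the slide's English Rule 1:

```python
# A toy rule-based main-content detector in the spirit of slides 13-14:
# a block whose text is long and contains few links is taken as main content.
# A real system needs many such hand-tuned rules per language, which is
# exactly why this approach does not scale.
def is_main_content(block_text: str, num_a_tags: int) -> bool:
    return len(block_text) > 200 and num_a_tags < 3

nav = "click here for"                                 # short, link-heavy block
article = "Robert Bates was a volunteer deputy. " * 8  # long article text

assert not is_main_content(nav, 1)
assert is_main_content(article, 0)
```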
  15. Main Content Extraction from HTML. We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf). ① Training: block separation & feature extraction turn labeled HTML into (features, main), (features, not main), … pairs, which train a decision tree. ② Live data: block separation & feature extraction produce block1 (features), block2 (features), …, which the decision tree classifies.
  16. Main Content Extraction from HTML. (Same diagram as slide 15.)
  17. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. (Same example HTML as slide 12.)
  18. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. Step 2: extract local features for every text block, e.g. word count = 36, number of <a> tags = 0.
  19. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. Step 2: extract local features for every text block. Step 3: define the feature vector of each text block as a combination of local features, e.g. word count (current block) = 36, num of <a> (current block) = 0, word count (previous block) = 4, num of <a> (previous block) = 1.
  20. Main Content Extraction from HTML. (Same training/live-data diagram as slide 15; we are using a machine learning approach, see Christian Kohlschütter et al.)
  21. Main Content Extraction from HTML. (Same diagram as slide 15.)
  22. Making the Main Content Using the Decision Tree. block1 (features): not main; block2 (features): not main; block3 (features): main; block4 (features): not main; block5 (features): main.
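A sketch of how the labeled blocks become the main text: the decision tree marks each block main / not-main, and the main text is the concatenation of the "main" blocks. A single hand-written threshold stands in for the trained tree here; this is not the actual model.

```python
def tree_stub(features):
    # stands in for decision_tree.predict(features)
    return features["word_count"] > 10 and features["num_a"] < 3

def extract_main_text(blocks):
    # keep only the blocks the (stubbed) tree labels as main content
    kept = []
    for text, features in blocks:
        if tree_stub(features):
            kept.append(text)
    return " ".join(kept)

blocks = [
    ("tweet share", {"word_count": 2, "num_a": 2}),
    ("Robert Bates was a volunteer deputy who'd never led an arrest for Tulsa",
     {"word_count": 12, "num_a": 0}),
]
assert extract_main_text(blocks).startswith("Robert Bates")
```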
  23. Main Content Extraction from HTML. (Same diagram as slide 15; we are using a machine learning approach, see Christian Kohlschütter et al.)
  24. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  25. Text Classification. Ordinary text classification architecture: ① training: (features, entertainment), (features, sports), (features, politics), … → training algorithm → classifier. ② live data: feature extraction → features → classifier → e.g. sports.
  26. Text Classification. (Same diagram as slide 25.)
  27. Feature Extraction in Text Classification. 'Bag-of-words' is commonly used as the feature vector: "Will LeBron James deliver an NBA championship to Cleveland?" → {Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland}
  28. Feature Extraction in Text Classification. 'Bag-of-words' is commonly used as the feature vector, with some feature engineering: stop words removed, a sports-player dictionary mapping names to NBA_PLAYER, and tf-idf weighting.
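Bag-of-words with the light feature engineering shown on slide 28 can be sketched as follows; the stop-word and player lists are tiny illustrative stand-ins for real dictionaries:

```python
from collections import Counter

# Drop stop words and map known player names to an NBA_PLAYER token,
# then count the remaining words (a bag-of-words ignores word order).
STOP_WORDS = {"will", "an", "to"}
NBA_PLAYERS = {"lebron", "james"}

def bag_of_words(text):
    bag = Counter()
    for word in text.lower().split():
        word = word.strip("?.!,")
        if word in STOP_WORDS:
            continue
        bag["NBA_PLAYER" if word in NBA_PLAYERS else word] += 1
    return bag

bag = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
assert bag["NBA_PLAYER"] == 2 and "will" not in bag
```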
  29. Feature Extraction in Text Classification. The same approach is used in Japanese: "私は中路です。よろしくお願いします。" → {私, は, 中路, です, よろしく, お願い, し, ます}, with stop words, a person dictionary (中路 → PERSON), and tf-idf.
  30. Another Option: Paragraph Vector
  31. Paragraph Vector (dimension ~ a few hundred). Example: "Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]; "私は中路です。よろしくお願いします。" → [0.2, 0.3, …, 0.2]
  32. Outline of Distributed Representations. word2vec: every word is mapped to a unique word vector (https://code.google.com/p/word2vec/). paragraph vector: every document is mapped to a unique vector (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053).
  33. Outline of Distributed Representations. (Same outline as slide 32.)
  34. Word Vectors in the word2vec Model. Every word is mapped to a unique word vector with good properties: vGermany = [0.1, 0.2, …, 0.2], vBerlin = [0.1, 0.1, …, -0.1], vParis = [0.3, 0.4, …, 0], vFrance = [0.3, 0.3, …, 0.3], and "Germany - Berlin = France - Paris", i.e. vGermany - vBerlin ≈ vFrance - vParis.
  35. Procedure to Create Word Vectors (Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf). Prepare a set of documents ("A cat sat on the street.", …, "I love cat very much.", "He comes from Japan.", …) and label each word (w220, w221, …). Objective function (CBOW case): L = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}) = exp(u_{w_t} · v) / Σ_W exp(u_W · v), where v = Σ_{t′≠t, |t′−t|≤c} v_{w_{t′}}, and u_w, v_w are defined for each word w; v_w is the word vector for w. Procedure: ① prepare and label the documents; ② maximize L. Word vectors are trained so that they become good features for predicting surrounding words.
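The CBOW model on this slide can be evaluated directly on toy numbers; the 2-dimensional vectors below are made up solely to show that P(w_t | context) is a softmax over the inner products u_W · v:

```python
import math

# v is the sum of the context words' input vectors; the probability of the
# center word is a softmax over inner products with each output vector u_W.
u = {"on": [1.0, 0.0], "cat": [0.0, 1.0], "street": [0.5, 0.5]}     # output vectors
v_in = {"cat": [0.2, 0.1], "sat": [0.1, 0.3],
        "the": [0.0, 0.1], "street": [0.3, 0.2]}                    # input vectors

def p_center(center, context):
    v = [sum(v_in[w][i] for w in context) for i in range(2)]
    scores = {w: math.exp(sum(ui * vi for ui, vi in zip(uw, v)))
              for w, uw in u.items()}
    return scores[center] / sum(scores.values())

# probabilities over the (toy) vocabulary sum to 1, as a softmax must
probs = [p_center(w, ["cat", "sat", "the", "street"]) for w in u]
assert abs(sum(probs) - 1.0) < 1e-9
```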
  36. Outline of Distributed Representations. word2vec: every word is mapped to a unique word vector. paragraph vector: every document is mapped to a unique vector (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053).
  37. Paragraph Vectors (dimension ~ a few hundred). Example: "Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]; "私は中路です。よろしくお願いします。" → [0.2, 0.3, …, 0.2]
  38. Procedure to Create Paragraph Vectors. Training data: doc_1: "A cat sat on the street." …; doc_2: "I love cat very much." "He comes from Japan." … (words labeled w220, w221, …). Add a vector d_i to the model for each document. Objective function (dbow case): L = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}, doc_i). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}, doc_i) = exp(u_{w_t} · v) / Σ_W exp(u_W · v), where v = Σ_{t′≠t, |t′−t|≤c} v_{w_{t′}} + d_i, and doc_i is the document containing w_t. Procedure: ① maximize L; ② preserve u_w, v_w as ũ_w, ṽ_w. (Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf)
  39. Procedure to Create a Paragraph Vector. After training, we can get a good paragraph vector as a feature for a new document, e.g. doc: "We love SmartNews." "I love SmartNews very much." … Objective function (dbow case): L_doc = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}, doc). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}, doc) = exp(ũ_{w_t} · ṽ) / Σ_W exp(ũ_W · ṽ), where ṽ = Σ_{t′≠t, |t′−t|≤c} ṽ_{w_{t′}} + d. Procedure: ③ maximize L_doc for d (with ũ_w, ṽ_w fixed); ④ use d as the paragraph vector. (Training: ① maximize L, ② preserve u_w, v_w. Live data: ③ and ④.)
  40. Procedure to Create a Paragraph Vector. Feature extractor: maximize L at training time to fix ũ_w, ṽ_w; maximize L_doc on live data to get d; the paragraph vector is d = [0.2, 0.3, …, 0.2].
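The inference step (③ and ④) can be sketched as plain gradient ascent on L_doc with the word vectors frozen; the dimensions, vectors, and learning rate below are toy values, not the trained model:

```python
import math

# Word vectors u, v are frozen after training; for a new document only d,
# the paragraph vector, is optimised by gradient ascent on L_doc.
u = {"we": [0.5, 0.1], "love": [0.1, 0.6], "smartnews": [0.4, 0.4]}
v = {"we": [0.2, 0.0], "love": [0.0, 0.2], "smartnews": [0.1, 0.1]}

def infer_d(words, steps=50, lr=0.5):
    d = [0.0, 0.0]                       # the only free parameter
    for _ in range(steps):
        for t, w_t in enumerate(words):
            ctx = [w for i, w in enumerate(words) if i != t]
            # v_tilde = sum of frozen context vectors + d
            v_tilde = [sum(v[w][k] for w in ctx) + d[k] for k in range(2)]
            scores = {w: math.exp(sum(a * b for a, b in zip(uw, v_tilde)))
                      for w, uw in u.items()}
            z = sum(scores.values())
            # gradient of log P(w_t | ctx, d) with respect to d
            grad = [u[w_t][k] - sum(scores[w] / z * u[w][k] for w in u)
                    for k in range(2)]
            d = [d[k] + lr * grad[k] for k in range(2)]
    return d

d = infer_d(["we", "love", "smartnews"])
assert len(d) == 2 and all(math.isfinite(x) for x in d)
```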
  41. Text Classification. Ordinary text classification architecture, with paragraph vectors as the features: ① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier. ② live data: feature extraction → ([0.1, -0.1, …]) → classifier → e.g. sports.
  42. Benefits of Using the Paragraph Vector. Good: high precision in text classification (several percent better than Bag-of-Words with feature engineering on our Japanese/English data set; labeled: ~tens of thousands of documents, unlabeled: ~100,000); high scalability (we don't need to work hard on feature engineering for each language). Bad: difficulty in analyzing errors (it is hard to understand the meaning of each component of a paragraph vector).
  43. Benefits of Using the Paragraph Vector. Importantly, the Paragraph Vector has a different nature from Bag-of-Words. Why this matters: we can get a better classifier by combining two different types of classifiers.
  44. Our Use Case. Bag-of-Words-based classifier vs. Paragraph Vector-based classifier. Combination: use the more reliable result of the two classifiers. Validation: use one to validate the other.
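The "combination" use can be sketched as picking whichever classifier reports higher confidence; both classifiers below are stubs standing in for the real Bag-of-Words and Paragraph Vector models:

```python
# Each classifier returns a (label, confidence) pair; the combined
# classifier trusts whichever is more confident, per slide 44.
def bow_classifier(text):
    return ("sports", 0.55)        # stand-in for the Bag-of-Words model

def pv_classifier(text):
    return ("sports", 0.90)        # stand-in for the Paragraph Vector model

def combined(text):
    label_a, conf_a = bow_classifier(text)
    label_b, conf_b = pv_classifier(text)
    return (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)

label, conf = combined("Will LeBron James deliver an NBA championship?")
assert label == "sports" and conf == 0.90
```

The "validation" use is the same idea in reverse: when the two stubs disagree, the document is flagged for review instead of trusted.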
  45. Our Use Case (future). In multilingual localization, use only the Paragraph Vector-based classifier, without any feature engineering.
  46. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  47. The Challenge
  48. The Challenge. News is uncertainty-seeking for long-term value. What big-data firms typically do (exploitation): preference estimation and risk quantification. What SmartNews does (exploration): uncertainty-seeking discovery. What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
  49. The Challenge. Seeking a not optimal, but acceptable, form of exploration. Why? Humans are not rational enough to simply accept the optimum, and without acceptance users will never read SmartNews. We are developing: ① for better feature vectors of users and articles: topic extraction, image extraction; ② for human-acceptable exploration: a multi-armed-bandit-based scoring model. (Real-time feature vectors for articles × feature vectors for 10 million users.)
  50. We are building our engineering team in SF - please join us! We're hiring: ML/NLP Engineer, Data Science Engineer, …
  51. kohei.nakaji@smartnews.com
  52. References. Main Content Extraction: Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features"; BoilerPipe (Google Code). Text Classification: Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents"; Word2Vec (Google Code).
  53. References. About SmartNews: "Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S."; "SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S."; "Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M"; About our Company (SmartNews).

Editor's Notes

  • Hello, I am Kohei Nakaji, an engineer at SmartNews Inc.
    I develop the news delivery algorithm at SmartNews, using machine learning and natural language processing in particular. My research background is not in ML but in particle physics theory: the beginning of the universe, dark matter, and so on, so if you have an interest in physics I can also talk about that another day.
    Today I'm going to talk about this topic: 'Globally Scalable Web Document Classification Using Word2Vec'. Because this talk is based on the technology at SmartNews, I will give a brief introduction to our company first.
    We at SmartNews develop the iOS/Android application SmartNews.
  • How many of you use SmartNews? Very few. How many of you love machine learning? Great, then you will love SmartNews, because our app is built on machine learning. SmartNews is a news app for more than 100 countries, but we have no writers and no editors: the algorithm does everything.
    How many of you use a news app every day? Yes, most news apps fail. Some apps have great download numbers but poor engagement ratios. SmartNews has 10M downloads globally and more than 50% of users are active, so we have a real shot at being the successful news app.
    So what makes SmartNews different?
  • The keyword is 'machine learning for discovery'. Some apps rely on human editors; they are not scalable and they can be biased.
    Some apps use machine learning in their delivery algorithms, but they use it for personalization.
    We use machine learning so that everyone on earth can discover and learn new things they might not otherwise have seen. This is our mission: we are trying to develop algorithms that help users discover new things, and that is what keeps our engagement ratio high.
    Now let me show you a demo of our app.
  • Let me show you how it works. When you open the app, you see the top news right here: the latest important news chosen by our algorithm. Over here you have tabs for different categories, which are the most direct result of web document classification; you see the latest important news in each category, also chosen by our algorithm, so you can imagine how precise our web document classification has to be. One of the cool things is that when you find an article you want to read, you get this Smart View option. You'll like it because it looks very clean: no banners, no ads. Over here you can see the web view, an ordinary web browser, full of things you don't want to read; Smart View is simpler and cleaner. You can imagine how difficult it is to create Smart View from an arbitrary web site; I will introduce some of the algorithms behind it in this talk. Another cool thing about Smart View is that it works offline, so you can read in the metro, on an airplane, anywhere.
  • As I said, we have 10M downloads and more than 50% of users are active.
    There are three editions: the Japanese edition, the US edition, and the international edition. In the international edition, users can read English articles localized for more than 100 countries, but there is no editor for any country.
  • The UI is good and Smart View is cool, but as I said, what makes us different is the algorithm that finds articles through which users can discover new things.
    This is the outline of our algorithm for user discovery:
    URLs are found from signals on the Internet by our crawler;
    HTML structure is automatically analyzed, e.g. the title, main text, and images are extracted;
    then the semantics of each article are analyzed: its category, its subject, what is in its images, etc. Using the signals and semantics, an importance score for each article, for each category, in each country is calculated;
    the topics of the delivery list are diversified;
    and then we deliver the articles to users. The article list is refreshed in real time. We crawl 10 million URLs per day and deliver only the top 1000+ articles to users, about 100 per category per day. There is a lot to say about this algorithm, especially about how we do importance estimation, and whether we personalize or take another approach, because that is tied to our mission; I will come back to it later. Now let's get into today's main topic.
  • Web document classification is part of our structure analysis and semantics analysis. I chose it as today's topic because, for one thing, it is important to our application, as you have already seen, and for another, classification of unstructured data is a common task in many applications, from a simple spam filter to category tagging on an e-commerce site.
  • The task definition is very simple:
    when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
  • There are roughly two steps. 1. Main content extraction:
    we have to detect the main content of a news website. This is difficult because there are so many websites, and different websites have different structures.
    2. Text classification: we classify the main content into one category.
    First I will briefly show one of our algorithms for detecting the main content of a web document;
    then I will talk about text classification using a word2vec-extended model.
  • Let’s start from main content extraction. I want to add that in our app main content extraction is also important for making smart view we have seen.
  • when we do main content extraction, there are two approaches actually we use the bottom one. First approach is rendering all of the page loading all css, javascript and after that extract the main content. it is relatively easier because we can use the information of position, width, and height of each component but it takes time because we have to render all items. Second approach is extract main content directry from html. it is more difficult but needs much less computing resource comparing with first approach.
  • we use second approach in our algorithm, because we have to proceed 10 million articles per day, 100 article per second.
  • This is the example of main content extraction from html. It is the task to detect which is main content and which is not main content.
  • Rule based extraction algorithm is of course possible like div which has text length more than 200 is main content. Because there are so many websites, the number of rule tend to be large,
  • If we do it in multi-language, it becomes much harder.
  • So, as one of our algorithm to extract main content, we are using machine learning approach which is based on the paper in 2011. So today, let me introduce about this.
    In the training phase, first we prepare the sets of html document that main content is already labeled. In our case, we aggregate the articles by our crawler and annotator annotate main content. Next by using block separator, html is separated into each text block, and by using feature extractor, feature vector in each block is extracted.
  • let’s get into the block separation and feature extraction part.
  • For step one we separate html into text blocks. The definition of ‘text block’ in our case is roughly, the block which is sandwiched by block level tag.
  • For step 2, local features for each block is extracted. We use for example number of word, number of a tag, as local feature,
  • For Step3, we create feature vector of each block as the combination of local features of different blocks. In this example, feature vector of this text block has element of ‘word count and num of a tag in previous and current block’.
  • in training phase, after the block separation and feature extraction, we get sets of labeled feature vector. The label is binary value: main/not main.
    By using the labeled feature vector, decision tree is trained.

    When live data comes, html is separated into text blocks with features, and by using already trained decision tree, final result is obtained.
  • Let’s get into this part.
  • Feature vector in each block is classified into main/not main by using already trained decision tree. Then now, we know which text block is main content and which text block is not main content. By combining the result, we get the main text.
  • This is the end of main content extraction. easy, simple, but not bad. If you want to know more about it. please see the link,
    and also there is the library which includes already trained model in English, please try. I will share the reference later.
  • so let’s get into the text classification.
  • Probably you know everything already, but let me review the ordinary classification architecture.
    In the training phase, first we prepare sets of labeled texts as training data.
    by using feature extractor, sets of labeled feature vector is created,
    then using training algorithm, like SVM or logistic regression, classifier is trained.
    In bag-of-words feature extractor case, sets of word in the document is extracted as feature vector, and after training, roughly speaking, which word tends to show up in which category, is trained.
    when live data comes, feature vector is extracted and by using already trained classifier, category is determined.
  • Training algorithm itself is ordinary logistic regression in our application and there are many materials about it. So today, let’s focus on feature extraction part.
  • As a feature vector ‘Bag-of-words’ is commonly used. Bag-of-words is set of words in the document, it does not care about the order of words. very simple but not bad if we use it for text classification.
  • If we want to improve the quality of feature vector, we create, for example stop words dictionary for removing unnecessary words, create specific dictionary for adding a specific feature, or use tf-idf. But still Bag-of-Words are starting point.
  • In Japanese case, we have to use technique to separate words, but still Bag-of-Words with some feature engineering is commonly used.
    But Bag-of-Words definitely seems not perfect feature vector of text, for example it cannot include the information of word order. For another example we cannot use information that two words are close to each other or not. We wonder whether we can easily get better feature vector or not.
  • As a better feature vector, we use the Paragraph Vector, a word2vec-extended model. It is 'better' in the precision of text classification.
  • With the technique I will describe today, every document is mapped to one dense vector with a few hundred dimensions, called a paragraph vector.
  • Because the paragraph vector is a word2vec-extended model, I should start with word2vec. In word2vec, every word is mapped to a unique word vector;
    in the paragraph vector model, every document is mapped to a unique vector.
  • So let's get into word2vec.
  • Every word is mapped to a unique vector.
    In this example, France, Paris, Germany, and Berlin are each mapped to a unique vector.
    What is surprising is a property like Germany - Berlin = France - Paris. From this property, we can be confident that some semantics are embedded in the vectors.
  • This is a brief overview of training a word2vec model. First prepare a set of documents and label each word, like w1, w2, …;
    then maximize the objective function.
    The value of c is arbitrary; 2 or 3 is commonly used.
    Looking at the shape of the objective function, you can see that maximizing it means maximizing the probability of predicting a word from its surrounding words. In the example in the figure, the model is updated so that the probability of predicting 'on' from the surrounding words 'cat', 'sat', 'the', 'street' becomes higher.
    The probability model is as follows: for each word, two vectors are defined, an output vector u and an input vector v.

    Roughly speaking, when training converges, the more often a pair of words shows up in the same sentence, the bigger the inner product of u and v for that pair becomes.

    After training, we use v for each word as its word vector.
    Technically, training this model directly is very heavy because of the sum over the vocabulary, so two approximations, negative sampling and hierarchical softmax, are used; the details are beyond the scope of this talk.

    This is how we create word vectors with the word2vec model.
  • Then let’s get into paragraph vector.
  • As I told you, each document is mapped into one dense vector named paragraph vector.
  • The procedure to create paragraph vector is similar to word2vec case. Prepare sets of document. and label each word like w1, w2, we also label each document like doc_1, doc_2.
    Then, maximize this objective function. The difference from word2vec model is that, the objective function includes document_id where the word is included. So maximizing this objective function means maximizing the probability to predict a word not only from surrounding words but also from the document where the word is included.

    The model of the probability function is also a little bit different. Same as word2vec case, for each word outer vector u and inner vector v are defined. In addition, for each document, vector d_i is also defined.

    When training converge, we get optimized u, v for each word and d_i for each document.
    The final result of vector d_i is paragraph vector for each document. But what we really want to do is extracting paragraph vector from new document. For doing it we need one more step.
  • When new document comes, we label the words in the document, and maximize this objective function. In this time, T is the number of word in the document. We don’t need to maximize the objective function for u and v, we can use u and v which is already trained. All we have to do is just maximize objective function for d.

    After the objective function is maximized we get d as a paragraph vector for the document.
  • It was a little bit confusing, so I show a simple figure.
    First, we train the feature extractor by putting the large set of documents, and when new document comes, by using the already trained feature extractor, paragraph vector is extracted. very simple right?
  • By just using the paragraph vector as a feature vector, we can do ordinary text classification.
  • Compared with bag-of-words, the paragraph vector has two good points. ① High precision:
    on our Japanese/English data set, the result of a 10-fold validation test is several percent better than bag-of-words with feature engineering.
    ② High scalability: by just preparing a set of documents for each language, without feature engineering, we get good results.
    The bad point is the difficulty of error analysis: it is hard to understand the meaning of each component of a paragraph vector.

    Because there is a trade-off, I can't say which you should choose in your use case, even though the precision of text classification is several percent higher with the paragraph vector.
  • Still, I think it is worth trying the paragraph vector.
    The paragraph vector has a different nature from bag-of-words, so the combination of a bag-of-words classifier and a paragraph-vector-based classifier can be a much better classifier.
  • In our app, there are many types of classifiers, like a sports classifier and an entertainment classifier, besides the main category classifier.
    Depending on the purpose of each classification, in some cases we use the more reliable result of the bag-of-words-based and paragraph-vector-based classifiers; in other cases we validate the result of the bag-of-words-based classifier with the paragraph-vector-based one.

    Also, in the near future, when we expand into many, say 100, languages, it is quite possible that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  • Also, in the near future, when we expand into many, say 100, languages, it is quite possible that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  • That is the end of today's main topic, web document classification.
  • News is uncertainty-seeking for long-term value.
    What other big-data firms typically do is recommend what people are already interested in, using e.g. matrix factorization.
    What we do is not simply suggest what users like, but expand users' interests with our algorithm.
  • How to explore users' interest space and suggest something new to them is a very challenging problem.

    We are now polishing these two things.

    For a better understanding of the users' interest space,
    we are improving topic and subject extraction from articles,
    and improving the users' feature vectors.

    For good exploration,
    a multi-armed-bandit-based scoring model.

    Technically, we have to create and operate a good, reasonable model that includes feature vectors for 10 million users and real-time feature vectors for articles;
    it is really exciting.
    Currently five people are tackling these problems, including an ML PhD and a theoretical physics PhD, but we need many more people to tackle this difficult problem.
  • Then let’s get into paragraph vector.
  • Then let’s get into paragraph vector.
