[SmartNews] Globally Scalable Web Document Classification Using Word2Vec


  1. Globally Scalable Web Document Classification Using Word2Vec. Kohei Nakaji (SmartNews)
  2. keyword: machine learning for discovery
  3. SmartNews Demo
  4. About SmartNews. Japan: launched 2013, 4M+ monthly active users, 50% DAU/MAU, 100+ publishers, 2013 App of the Year. US: launched Oct 2014, 1M+ monthly active users, same engagement, 80+ publishers, top News-category app. International: launched Feb 2015, 10M downloads worldwide, same engagement, English beta, featured app. Funding: $50M
  5. Outline of our algorithm: Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → 1000+/day delivered
  6. Outline of our algorithm (same pipeline): Web Document Classification ⊂ Structure Analysis + Semantics Analysis
  7. Web Document Classification. Task definition: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set (ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD, …).
  8. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  9. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  10. Main Content Extraction. Two approaches: extract after rendering the whole page (easier, but takes time), or extract directly from the HTML (more difficult, but fast).
  11. Main Content Extraction. Two approaches: extract after rendering the whole page (easier, but takes time), or extract directly from the HTML (more difficult, but fast). Our approach is the latter.
  12. Main Content Extraction from HTML. Example:
      <html> <body> <div>click <a>here</a> for</div> <div> <a>tweet</a><a>share</a> <p>Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p> <a>you also like this</a> <p>So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html>
      The two <p> paragraphs are main content; the link texts ("click here for", "tweet", "share", "you also like this") are not.
  13. Main Content Extraction from HTML. A rule-based extraction algorithm is possible. English: Rule 1: a div with text length > 200 and fewer than 3 <a> tags is main content. Rule 2: a div with text length < 100 and more than 4 <p> tags is main content. … Rule N: …
  14. Main Content Extraction from HTML. Rule-based extraction is possible, but not scalable: Japanese (and every other language) needs its own rule set.
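As a sketch, the per-language rules on these slides might look like the following; the helper function is hypothetical, not SmartNews code, and the thresholds (200 characters, 3 links) come from the slide's English Rule 1:

```python
# A toy rule-based main-content detector in the spirit of slides 13-14:
# a block whose text is long and contains few links is taken as main content.
# A real system needs many such hand-tuned rules per language, which is
# exactly why this approach does not scale.
def is_main_content(block_text: str, num_a_tags: int) -> bool:
    return len(block_text) > 200 and num_a_tags < 3

nav = "click here for"                                 # short, link-heavy block
article = "Robert Bates was a volunteer deputy. " * 8  # long article text

assert not is_main_content(nav, 1)
assert is_main_content(article, 0)
```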
  15. Main Content Extraction from HTML. We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf). ① Training: block separation & feature extraction turn labeled HTML into (features, main), (features, not main), … pairs, which train a decision tree. ② Live data: block separation & feature extraction produce block1 (features), block2 (features), …, which the decision tree classifies.
  16. Main Content Extraction from HTML. (Same diagram as slide 15.)
  17. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. (Same example HTML as slide 12.)
  18. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. Step 2: extract local features for every text block, e.g. word count = 36, number of <a> tags = 0.
  19. Feature Extraction from HTML. Step 1: separate the HTML into 'text blocks'. Step 2: extract local features for every text block. Step 3: define the feature vector of each text block as a combination of local features, e.g. word count (current block) = 36, num of <a> (current block) = 0, word count (previous block) = 4, num of <a> (previous block) = 1.
  20. Main Content Extraction from HTML. (Same training/live-data diagram as slide 15; we are using a machine learning approach, see Christian Kohlschütter et al.)
  21. Main Content Extraction from HTML. (Same diagram as slide 15.)
  22. Making the Main Content Using the Decision Tree. block1 (features): not main; block2 (features): not main; block3 (features): main; block4 (features): not main; block5 (features): main.
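A sketch of how the labeled blocks become the main text: the decision tree marks each block main / not-main, and the main text is the concatenation of the "main" blocks. A single hand-written threshold stands in for the trained tree here; this is not the actual model.

```python
def tree_stub(features):
    # stands in for decision_tree.predict(features)
    return features["word_count"] > 10 and features["num_a"] < 3

def extract_main_text(blocks):
    # keep only the blocks the (stubbed) tree labels as main content
    kept = []
    for text, features in blocks:
        if tree_stub(features):
            kept.append(text)
    return " ".join(kept)

blocks = [
    ("tweet share", {"word_count": 2, "num_a": 2}),
    ("Robert Bates was a volunteer deputy who'd never led an arrest for Tulsa",
     {"word_count": 12, "num_a": 0}),
]
assert extract_main_text(blocks).startswith("Robert Bates")
```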
  23. Main Content Extraction from HTML. (Same diagram as slide 15; we are using a machine learning approach, see Christian Kohlschütter et al.)
  24. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  25. Text Classification. Ordinary text classification architecture: ① training: (features, entertainment), (features, sports), (features, politics), … → training algorithm → classifier. ② live data: feature extraction → features → classifier → e.g. sports.
  26. Text Classification. (Same diagram as slide 25.)
  27. Feature Extraction in Text Classification. 'Bag-of-words' is commonly used as the feature vector: "Will LeBron James deliver an NBA championship to Cleveland?" → {Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland}
  28. Feature Extraction in Text Classification. 'Bag-of-words' is commonly used as the feature vector, with some feature engineering: stop words removed, a sports-player dictionary mapping names to NBA_PLAYER, and tf-idf weighting.
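Bag-of-words with the light feature engineering shown on slide 28 can be sketched as follows; the stop-word and player lists are tiny illustrative stand-ins for real dictionaries:

```python
from collections import Counter

# Drop stop words and map known player names to an NBA_PLAYER token,
# then count the remaining words (a bag-of-words ignores word order).
STOP_WORDS = {"will", "an", "to"}
NBA_PLAYERS = {"lebron", "james"}

def bag_of_words(text):
    bag = Counter()
    for word in text.lower().split():
        word = word.strip("?.!,")
        if word in STOP_WORDS:
            continue
        bag["NBA_PLAYER" if word in NBA_PLAYERS else word] += 1
    return bag

bag = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
assert bag["NBA_PLAYER"] == 2 and "will" not in bag
```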
  29. Feature Extraction in Text Classification. The same approach is used in Japanese: "私は中路です。よろしくお願いします。" → {私, は, 中路, です, よろしく, お願い, し, ます}, with stop words, a person dictionary (中路 → PERSON), and tf-idf.
  30. Another Option: Paragraph Vector
  31. Paragraph Vector (dimension ~ a few hundred). Example: "Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]; "私は中路です。よろしくお願いします。" → [0.2, 0.3, …, 0.2]
  32. Outline of Distributed Representations. word2vec: every word is mapped to a unique word vector (https://code.google.com/p/word2vec/). paragraph vector: every document is mapped to a unique vector (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053).
  33. Outline of Distributed Representations. (Same outline as slide 32.)
  34. Word Vectors in the word2vec Model. Every word is mapped to a unique word vector with good properties: vGermany = [0.1, 0.2, …, 0.2], vBerlin = [0.1, 0.1, …, -0.1], vParis = [0.3, 0.4, …, 0], vFrance = [0.3, 0.3, …, 0.3], and "Germany - Berlin = France - Paris", i.e. vGermany - vBerlin ≈ vFrance - vParis.
  35. Procedure to Create Word Vectors (Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf). Prepare a set of documents ("A cat sat on the street.", …, "I love cat very much.", "He comes from Japan.", …) and label each word (w220, w221, …). Objective function (CBOW case): L = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}) = exp(u_{w_t} · v) / Σ_W exp(u_W · v), where v = Σ_{t′≠t, |t′−t|≤c} v_{w_{t′}}, and u_w, v_w are defined for each word w; v_w is the word vector for w. Procedure: ① prepare and label the documents; ② maximize L. Word vectors are trained so that they become good features for predicting surrounding words.
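The CBOW model on this slide can be evaluated directly on toy numbers; the 2-dimensional vectors below are made up solely to show that P(w_t | context) is a softmax over the inner products u_W · v:

```python
import math

# v is the sum of the context words' input vectors; the probability of the
# center word is a softmax over inner products with each output vector u_W.
u = {"on": [1.0, 0.0], "cat": [0.0, 1.0], "street": [0.5, 0.5]}     # output vectors
v_in = {"cat": [0.2, 0.1], "sat": [0.1, 0.3],
        "the": [0.0, 0.1], "street": [0.3, 0.2]}                    # input vectors

def p_center(center, context):
    v = [sum(v_in[w][i] for w in context) for i in range(2)]
    scores = {w: math.exp(sum(ui * vi for ui, vi in zip(uw, v)))
              for w, uw in u.items()}
    return scores[center] / sum(scores.values())

# probabilities over the (toy) vocabulary sum to 1, as a softmax must
probs = [p_center(w, ["cat", "sat", "the", "street"]) for w in u]
assert abs(sum(probs) - 1.0) < 1e-9
```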
  36. Outline of Distributed Representations. word2vec: every word is mapped to a unique word vector. paragraph vector: every document is mapped to a unique vector (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053).
  37. Paragraph Vectors (dimension ~ a few hundred). Example: "Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]; "私は中路です。よろしくお願いします。" → [0.2, 0.3, …, 0.2]
  38. Procedure to Create Paragraph Vectors. Training data: doc_1: "A cat sat on the street." …; doc_2: "I love cat very much." "He comes from Japan." … (words labeled w220, w221, …). Add a vector d_i to the model for each document. Objective function (dbow case): L = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}, doc_i). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}, doc_i) = exp(u_{w_t} · v) / Σ_W exp(u_W · v), where v = Σ_{t′≠t, |t′−t|≤c} v_{w_{t′}} + d_i, and doc_i is the document containing w_t. Procedure: ① maximize L; ② preserve u_w, v_w as ũ_w, ṽ_w. (Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf)
  39. Procedure to Create a Paragraph Vector. After training, we can get a good paragraph vector as a feature for a new document, e.g. doc: "We love SmartNews." "I love SmartNews very much." … Objective function (dbow case): L_doc = Σ_{t=1..T} log P(w_t | w_{t-c}, …, w_{t+c}, doc). Model (sum case): P(w_t | w_{t-c}, …, w_{t+c}, doc) = exp(ũ_{w_t} · ṽ) / Σ_W exp(ũ_W · ṽ), where ṽ = Σ_{t′≠t, |t′−t|≤c} ṽ_{w_{t′}} + d. Procedure: ③ maximize L_doc for d (with ũ_w, ṽ_w fixed); ④ use d as the paragraph vector. (Training: ① maximize L, ② preserve u_w, v_w. Live data: ③ and ④.)
  40. Procedure to Create a Paragraph Vector. Feature extractor: maximize L at training time to fix ũ_w, ṽ_w; maximize L_doc on live data to get d; the paragraph vector is d = [0.2, 0.3, …, 0.2].
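The inference step (③ and ④) can be sketched as plain gradient ascent on L_doc with the word vectors frozen; the dimensions, vectors, and learning rate below are toy values, not the trained model:

```python
import math

# Word vectors u, v are frozen after training; for a new document only d,
# the paragraph vector, is optimised by gradient ascent on L_doc.
u = {"we": [0.5, 0.1], "love": [0.1, 0.6], "smartnews": [0.4, 0.4]}
v = {"we": [0.2, 0.0], "love": [0.0, 0.2], "smartnews": [0.1, 0.1]}

def infer_d(words, steps=50, lr=0.5):
    d = [0.0, 0.0]                       # the only free parameter
    for _ in range(steps):
        for t, w_t in enumerate(words):
            ctx = [w for i, w in enumerate(words) if i != t]
            # v_tilde = sum of frozen context vectors + d
            v_tilde = [sum(v[w][k] for w in ctx) + d[k] for k in range(2)]
            scores = {w: math.exp(sum(a * b for a, b in zip(uw, v_tilde)))
                      for w, uw in u.items()}
            z = sum(scores.values())
            # gradient of log P(w_t | ctx, d) with respect to d
            grad = [u[w_t][k] - sum(scores[w] / z * u[w][k] for w in u)
                    for k in range(2)]
            d = [d[k] + lr * grad[k] for k in range(2)]
    return d

d = infer_d(["we", "love", "smartnews"])
assert len(d) == 2 and all(math.isfinite(x) for x in d)
```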
  41. Text Classification. Ordinary text classification architecture, with paragraph vectors as the features: ① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier. ② live data: feature extraction → ([0.1, -0.1, …]) → classifier → e.g. sports.
  42. Benefits of Using the Paragraph Vector. Good: high precision in text classification (several percent better than Bag-of-Words with feature engineering on our Japanese/English data set; labeled: ~tens of thousands of documents, unlabeled: ~100,000); high scalability (we don't need to work hard on feature engineering for each language). Bad: difficulty in analyzing errors (it is hard to understand the meaning of each component of a paragraph vector).
  43. Benefits of Using the Paragraph Vector. Importantly, the Paragraph Vector has a different nature from Bag-of-Words. Why this matters: we can get a better classifier by combining two different types of classifiers.
  44. Our Use Case. Bag-of-Words-based classifier vs. Paragraph Vector-based classifier. Combination: use the more reliable result of the two classifiers. Validation: use one to validate the other.
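The "combination" use can be sketched as picking whichever classifier reports higher confidence; both classifiers below are stubs standing in for the real Bag-of-Words and Paragraph Vector models:

```python
# Each classifier returns a (label, confidence) pair; the combined
# classifier trusts whichever is more confident, per slide 44.
def bow_classifier(text):
    return ("sports", 0.55)        # stand-in for the Bag-of-Words model

def pv_classifier(text):
    return ("sports", 0.90)        # stand-in for the Paragraph Vector model

def combined(text):
    label_a, conf_a = bow_classifier(text)
    label_b, conf_b = pv_classifier(text)
    return (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)

label, conf = combined("Will LeBron James deliver an NBA championship?")
assert label == "sports" and conf == 0.90
```

The "validation" use is the same idea in reverse: when the two stubs disagree, the document is flagged for review instead of trusted.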
  45. Our Use Case (future). In multilingual localization, use only the Paragraph Vector-based classifier, without any feature engineering.
  46. Web Document Classification. There are roughly two steps: ① Main Content Extraction ② Text Classification
  47. The Challenge
  48. The Challenge. News is uncertainty-seeking for long-term value. What big-data firms typically do (exploitation): preference estimation and risk quantification. What SmartNews does (exploration): uncertainty-seeking discovery. What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
  49. The Challenge. Seeking a not optimal, but acceptable, form of exploration. Why? Humans are not rational enough to simply accept the optimum, and without acceptance users will never read SmartNews. We are developing: ① for better feature vectors of users and articles: topic extraction, image extraction; ② for human-acceptable exploration: a multi-armed-bandit-based scoring model. (Real-time feature vectors for articles × feature vectors for 10 million users.)
  50. We are building our engineering team in SF - please join us! We're hiring: ML/NLP Engineer, Data Science Engineer, …
  51. kohei.nakaji@smartnews.com
  52. References. Main Content Extraction: Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, "Boilerplate Detection using Shallow Text Features"; BoilerPipe (Google Code). Text Classification: Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents"; Word2Vec (Google Code).
  53. References. About SmartNews: "Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S."; "SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S."; "Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M"; About our Company (SmartNews).

Editor's Notes

  • Hello, I am Kohei Nakaji, an engineer at SmartNews Inc.
    I develop the news delivery algorithm at SmartNews, using machine learning and natural language processing in particular. My research background is not in ML but in particle physics theory: the beginning of the universe, dark matter, and so on, so if you have an interest in physics I can also talk about that another day.
    Today I'm going to talk about this topic: 'Globally Scalable Web Document Classification Using Word2Vec'. Because this talk is based on the technology at SmartNews, I will give a brief introduction to our company first.
    We at SmartNews develop the iOS/Android application SmartNews.
  • How many of you use SmartNews? Very few. How many of you love machine learning? Great, then you will love SmartNews, because our app is built on machine learning. SmartNews is a news app for more than 100 countries, but we have no writers and no editors: the algorithm does everything.
    How many of you use a news app every day? Yes, most news apps fail. Some apps have great download numbers but poor engagement ratios. SmartNews has 10M downloads globally and more than 50% of users are active, so we have a real shot at being the successful news app.
    So what makes SmartNews different?
  • The keyword is 'machine learning for discovery'. Some apps rely on human editors; they are not scalable and they can be biased.
    Some apps use machine learning in their delivery algorithms, but they use it for personalization.
    We use machine learning so that everyone on earth can discover and learn new things they might not otherwise have seen. This is our mission: we are trying to develop algorithms that help users discover new things, and that is what keeps our engagement ratio high.
    Now let me show you a demo of our app.
  • Let me show you how it works. When you open the app, you see the top news right here: the latest important news chosen by our algorithm. Over here you have tabs for different categories, which are the most direct result of web document classification; you see the latest important news in each category, also chosen by our algorithm, so you can imagine how precise our web document classification has to be. One of the cool things is that when you find an article you want to read, you get this Smart View option. You'll like it because it looks very clean: no banners, no ads. Over here you can see the web view, an ordinary web browser, full of things you don't want to read; Smart View is simpler and cleaner. You can imagine how difficult it is to create Smart View from an arbitrary web site; I will introduce some of the algorithms behind it in this talk. Another cool thing about Smart View is that it works offline, so you can read in the metro, on an airplane, anywhere.
  • As I said, we have 10M downloads and more than 50% of users are active.
    There are three editions: the Japanese edition, the US edition, and the international edition. In the international edition, users can read English articles localized for more than 100 countries, but there is no editor for any country.
  • The UI is good and Smart View is cool, but as I said, what makes us different is the algorithm that finds articles through which users can discover new things.
    This is the outline of our algorithm for user discovery:
    URLs are found from signals on the Internet by our crawler;
    HTML structure is automatically analyzed, e.g. the title, main text, and images are extracted;
    then the semantics of each article are analyzed: its category, its subject, what is in its images, etc. Using the signals and semantics, an importance score for each article, for each category, in each country is calculated;
    the topics of the delivery list are diversified;
    and then we deliver the articles to users. The article list is refreshed in real time. We crawl 10 million URLs per day and deliver only the top 1000+ articles to users, about 100 per category per day. There is a lot to say about this algorithm, especially about how we do importance estimation, and whether we personalize or take another approach, because that is tied to our mission; I will come back to it later. Now let's get into today's main topic.
  • Web document classification is part of our structure analysis and semantics analysis. I chose it as today's topic because, for one thing, it is important to our application, as you have already seen, and for another, classification of unstructured data is a common task in many applications, from a simple spam filter to category tagging on an e-commerce site.
  • The task definition is very simple:
    when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
  • There are roughly two steps. 1. Main content extraction:
    we have to detect the main content of a news website. This is difficult because there are so many websites, and different websites have different structures.
    2. Text classification: we classify the main content into one category.
    First I will briefly show one of our algorithms for detecting the main content of a web document;
    then I will talk about text classification using a word2vec-extended model.
  • Let’s start from main content extraction. I want to add that in our app main content extraction is also important for making smart view we have seen.
  • when we do main content extraction, there are two approaches actually we use the bottom one. First approach is rendering all of the page loading all css, javascript and after that extract the main content. it is relatively easier because we can use the information of position, width, and height of each component but it takes time because we have to render all items. Second approach is extract main content directry from html. it is more difficult but needs much less computing resource comparing with first approach.
  • we use second approach in our algorithm, because we have to proceed 10 million articles per day, 100 article per second.
  • This is the example of main content extraction from html. It is the task to detect which is main content and which is not main content.
  • Rule based extraction algorithm is of course possible like div which has text length more than 200 is main content. Because there are so many websites, the number of rule tend to be large,
  • If we do it in multi-language, it becomes much harder.
  • So, as one of our algorithm to extract main content, we are using machine learning approach which is based on the paper in 2011. So today, let me introduce about this.
    In the training phase, first we prepare the sets of html document that main content is already labeled. In our case, we aggregate the articles by our crawler and annotator annotate main content. Next by using block separator, html is separated into each text block, and by using feature extractor, feature vector in each block is extracted.
  • let’s get into the block separation and feature extraction part.
  • For step one we separate html into text blocks. The definition of ‘text block’ in our case is roughly, the block which is sandwiched by block level tag.
  • For step 2, local features for each block is extracted. We use for example number of word, number of a tag, as local feature,
  • For Step3, we create feature vector of each block as the combination of local features of different blocks. In this example, feature vector of this text block has element of ‘word count and num of a tag in previous and current block’.
  • in training phase, after the block separation and feature extraction, we get sets of labeled feature vector. The label is binary value: main/not main.
    By using the labeled feature vector, decision tree is trained.

    When live data comes, html is separated into text blocks with features, and by using already trained decision tree, final result is obtained.
  • Let’s get into this part.
  • Feature vector in each block is classified into main/not main by using already trained decision tree. Then now, we know which text block is main content and which text block is not main content. By combining the result, we get the main text.
  • This is the end of main content extraction. easy, simple, but not bad. If you want to know more about it. please see the link,
    and also there is the library which includes already trained model in English, please try. I will share the reference later.
  • so let’s get into the text classification.
  • Probably you know everything already, but let me review the ordinary classification architecture.
    In the training phase, first we prepare sets of labeled texts as training data.
    by using feature extractor, sets of labeled feature vector is created,
    then using training algorithm, like SVM or logistic regression, classifier is trained.
    In bag-of-words feature extractor case, sets of word in the document is extracted as feature vector, and after training, roughly speaking, which word tends to show up in which category, is trained.
    when live data comes, feature vector is extracted and by using already trained classifier, category is determined.
  • Training algorithm itself is ordinary logistic regression in our application and there are many materials about it. So today, let’s focus on feature extraction part.
  • As a feature vector ‘Bag-of-words’ is commonly used. Bag-of-words is set of words in the document, it does not care about the order of words. very simple but not bad if we use it for text classification.
  • If we want to improve the quality of feature vector, we create, for example stop words dictionary for removing unnecessary words, create specific dictionary for adding a specific feature, or use tf-idf. But still Bag-of-Words are starting point.
  • In Japanese case, we have to use technique to separate words, but still Bag-of-Words with some feature engineering is commonly used.
    But Bag-of-Words definitely seems not perfect feature vector of text, for example it cannot include the information of word order. For another example we cannot use information that two words are close to each other or not. We wonder whether we can easily get better feature vector or not.
  • As a better feature vector, we use the Paragraph Vector, a word2vec-extended model. It is 'better' in the precision of text classification.
  • With the technique I will describe today, every document is mapped to one dense vector with a few hundred dimensions, called a paragraph vector.
  • Because the paragraph vector is a word2vec-extended model, I should start with word2vec. In word2vec, every word is mapped to a unique word vector;
    in the paragraph vector model, every document is mapped to a unique vector.
  • So let's get into word2vec.
  • Every word is mapped to a unique vector.
    In this example, France, Paris, Germany, and Berlin are each mapped to a unique vector.
    What is surprising is a property like Germany - Berlin = France - Paris. From this property, we can be confident that some semantics are embedded in the vectors.
  • This is a brief overview of training a word2vec model. First prepare a set of documents and label each word, like w1, w2, …;
    then maximize the objective function.
    The value of c is arbitrary; 2 or 3 is commonly used.
    Looking at the shape of the objective function, you can see that maximizing it means maximizing the probability of predicting a word from its surrounding words. In the example in the figure, the model is updated so that the probability of predicting 'on' from the surrounding words 'cat', 'sat', 'the', 'street' becomes higher.
    The probability model is as follows: for each word, two vectors are defined, an output vector u and an input vector v.

    Roughly speaking, when training converges, the more often a pair of words shows up in the same sentence, the bigger the inner product of u and v for that pair becomes.

    After training, we use v for each word as its word vector.
    Technically, training this model directly is very heavy because of the sum over the vocabulary, so two approximations, negative sampling and hierarchical softmax, are used; the details are beyond the scope of this talk.

    This is how we create word vectors with the word2vec model.
  • Then let’s get into paragraph vector.
  • As I told you, each document is mapped into one dense vector named paragraph vector.
  • The procedure to create paragraph vector is similar to word2vec case. Prepare sets of document. and label each word like w1, w2, we also label each document like doc_1, doc_2.
    Then, maximize this objective function. The difference from word2vec model is that, the objective function includes document_id where the word is included. So maximizing this objective function means maximizing the probability to predict a word not only from surrounding words but also from the document where the word is included.

    The model of the probability function is also a little bit different. Same as word2vec case, for each word outer vector u and inner vector v are defined. In addition, for each document, vector d_i is also defined.

    When training converge, we get optimized u, v for each word and d_i for each document.
    The final result of vector d_i is paragraph vector for each document. But what we really want to do is extracting paragraph vector from new document. For doing it we need one more step.
  • When new document comes, we label the words in the document, and maximize this objective function. In this time, T is the number of word in the document. We don’t need to maximize the objective function for u and v, we can use u and v which is already trained. All we have to do is just maximize objective function for d.

    After the objective function is maximized we get d as a paragraph vector for the document.
  • It was a little bit confusing, so I show a simple figure.
    First, we train the feature extractor by putting the large set of documents, and when new document comes, by using the already trained feature extractor, paragraph vector is extracted. very simple right?
  • By just using the paragraph vector as a feature vector, we can do ordinary text classification.
  • Compared with bag-of-words, the paragraph vector has two good points. ① High precision:
    on our Japanese/English data set, the result of a 10-fold validation test is several percent better than bag-of-words with feature engineering.
    ② High scalability: by just preparing a set of documents for each language, without feature engineering, we get good results.
    The bad point is the difficulty of error analysis: it is hard to understand the meaning of each component of a paragraph vector.

    Because there is a trade-off, I can't say which you should choose in your use case, even though the precision of text classification is several percent higher with the paragraph vector.
  • Still, I think it is worth trying the paragraph vector.
    The paragraph vector has a different nature from bag-of-words, so the combination of a bag-of-words classifier and a paragraph-vector-based classifier can be a much better classifier.
  • In our app, there are many types of classifiers, like a sports classifier and an entertainment classifier, besides the main category classifier.
    Depending on the purpose of each classification, in some cases we use the more reliable result of the bag-of-words-based and paragraph-vector-based classifiers; in other cases we validate the result of the bag-of-words-based classifier with the paragraph-vector-based one.

    Also, in the near future, when we expand into many, say 100, languages, it is quite possible that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  • Also, in the near future, when we expand into many, say 100, languages, it is quite possible that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  • That is the end of today's main topic, web document classification.
  • News is uncertainty-seeking for long-term value.
    What other big-data firms typically do is recommend what people are already interested in, using e.g. matrix factorization.
    What we do is not simply suggest what users like, but expand users' interests with our algorithm.
  • How to explore users' interest space and suggest something new to them is a very challenging problem.

    We are now polishing these two things.

    For a better understanding of the users' interest space,
    we are improving topic and subject extraction from articles,
    and improving the users' feature vectors.

    For good exploration,
    a multi-armed-bandit-based scoring model.

    Technically, we have to create and operate a good, reasonable model that includes feature vectors for 10 million users and real-time feature vectors for articles;
    it is really exciting.
    Currently five people are tackling these problems, including an ML PhD and a theoretical physics PhD, but we need many more people to tackle this difficult problem.
  • Then let’s get into paragraph vector.
  • Then let’s get into paragraph vector.
