Globally Scalable Web Document Classification
Using Word2Vec
Kohei Nakaji (SmartNews)
keyword: machine learning for discovery
SmartNews Demo
About SmartNews
Japan: launched 2013; 4M+ monthly active users; 50% DAU/MAU; 100+ publishers; 2013 App of the Year
US: launched Oct 2014; 1M+ monthly active users; same engagement; 80+ publishers; top News-category app
International: launched Feb 2015; 10M downloads worldwide; same engagement; English beta; featured app
Funding: $50M
Outline of our algorithm
Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → delivered articles (1000+/day)
Web Document Classification is one component (⊂) of the pipeline above.
Web Document Classification
Task definition: when an arbitrary web document arrives, choose exactly one category from a pre-determined category set (ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD, …).
Web Document Classification
There are roughly two steps:
① Main Content Extraction
② Text Classification
(document → ① main content text → ② a category such as ENTERTAINMENT)
Main Content Extraction
Two approaches:
・Extract after rendering the whole page: easier, but takes time
・Extract from the HTML: more difficult, but fast
Our Approach
Main Content Extraction from HTML
<html>
<body>

<div>click <a>here</a> for </div>

<div>

<a>tweet</a><a>share</a>
<p>
Robert Bates was a volunteer deputy who'd
never led an arrest for the Tulsa County Sheriff's
Office.

</p>

<a>you also like this</a>
<p>
So how did the 73-year-old insurance company
CEO end up joining a sting operation this month
that ended when he pulled out his handgun and
killed suspect Eric Harris instead of stunning
him with a Taser?</p>
</div>
</body>
</html>
Example: the two <p> passages are the main content; the link-heavy <div> and <a> elements around them are not.
Main Content Extraction from HTML
A rule-based extraction algorithm is possible.
English:
Rule 1: a div with text length > 200 and fewer than 3 <a> tags is main content.
Rule 2: a div with text length < 100 and more than 4 <p> tags is main content.
…
Rule N: …
But this is not scalable: Japanese, and every additional language, needs its own rule set.
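Rule 1 above can be transcribed directly as code. A minimal sketch: the block field names (`text_length`, `num_a`) are assumptions for illustration, not SmartNews's actual schema.

```python
def rule1_is_main(block):
    # Rule 1: text length > 200 and fewer than 3 <a> tags (field names assumed)
    return block["text_length"] > 200 and block["num_a"] < 3

print(rule1_is_main({"text_length": 450, "num_a": 1}))  # prints: True
```

The point of the slide is that dozens of such hand-written rules would be needed per language, which is why the next slide switches to a learned model.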
Main Content Extraction from HTML
We are using a machine learning approach:
① training: block separation & feature extraction → (features, main / not main) labels per block → train a decision tree
② live data: block separation & feature extraction → (features) per block → the decision tree predicts main / not main
See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
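The training step ① can be sketched in miniature. This is not SmartNews's model: it learns a one-split "decision stump" instead of a full decision tree, over made-up (features, label) pairs, but the shape of the data and the prediction interface are the same.

```python
def other(label):
    return "not main" if label == "main" else "main"

def train_stump(blocks):
    """blocks: list of (features_dict, label) with label in {"main", "not main"}.
    Finds the single feature/threshold split with the fewest training errors."""
    best = None
    for f in blocks[0][0]:
        for thr in sorted({feat[f] for feat, _ in blocks}):
            for above in ("main", "not main"):
                errors = sum(
                    1 for feat, label in blocks
                    if (above if feat[f] > thr else other(above)) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, above)
    _, f, thr, above = best
    return lambda feat: above if feat[f] > thr else other(above)

# Hypothetical labeled blocks (features per the slides: word count, <a> count)
training = [
    ({"word_count": 4,  "num_a": 1}, "not main"),
    ({"word_count": 36, "num_a": 0}, "main"),
    ({"word_count": 5,  "num_a": 2}, "not main"),
    ({"word_count": 48, "num_a": 0}, "main"),
]
classify = train_stump(training)
print(classify({"word_count": 30, "num_a": 0}))  # prints: main
```

A real decision tree applies this split search recursively; the Kohlschütter et al. paper shows that even shallow trees over such features work well.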
Feature Extraction from HTML
Step 1: Separate the HTML into "text blocks" (using the same example HTML as above).
Step 2: Extract local features for every text block.
ex: word count = 36, num of <a> = 0
Step 3: Define the feature of each text block as a combination of local features.
ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1
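Steps 1-3 can be sketched with only the standard library. This is a deliberately simplified block separator (treating every <p> and <div> as a block boundary is an assumption for illustration; the real block-separation rules are more involved):

```python
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    """Step 1 + Step 2: split HTML into text blocks and collect local features."""
    def __init__(self):
        super().__init__()
        self.blocks = []   # one feature dict per text block
        self._cur = None

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "div"):            # assumption: these delimit blocks
            self._cur = {"word_count": 0, "num_a": 0}
            self.blocks.append(self._cur)
        elif tag == "a" and self._cur is not None:
            self._cur["num_a"] += 1

    def handle_data(self, data):
        if self._cur is not None:
            self._cur["word_count"] += len(data.split())

def block_features(html):
    """Step 3: combine each block's local features with the previous block's."""
    parser = BlockExtractor()
    parser.feed(html)
    empty = {"word_count": 0, "num_a": 0}
    feats = []
    for i, b in enumerate(parser.blocks):
        prev = parser.blocks[i - 1] if i > 0 else empty
        feats.append({**b,
                      "prev_word_count": prev["word_count"],
                      "prev_num_a": prev["num_a"]})
    return feats

html = '<div>click <a>here</a> for</div><p>Robert Bates was a volunteer deputy.</p>'
print(block_features(html))
```

The second block's features here come out as word count 6 with a previous-block word count of 3 and one previous-block link, exactly the current-plus-previous combination described in Step 3.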
Making Main Content Using the Decision Tree
block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main
The blocks predicted "main" are kept as the main content.
Text Classification
Ordinary text classification architecture:
① training: feature extraction → (features, entertainment), (features, sports), (features, entertainment), (features, politics), … → training algorithm → classifier
② live data: feature extraction → (features) → classifier → a predicted category, e.g. sports
Feature Extraction in Text Classification
"Bag-of-words" is commonly used as a feature vector.
ex: "Will LeBron James deliver an NBA championship to Cleveland?" → { Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland }
In practice it is combined with some feature engineering:
・stop words (Will, an, to, …) are removed
・a sports-player dictionary maps "LeBron James" → NBA_PLAYER
・tf-idf weighting is applied
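The bag-of-words pipeline with those feature-engineering steps can be sketched as follows. The stop-word list and the player dictionary entries here are illustrative stand-ins, not SmartNews's real resources, and tf-idf weighting (which would rescale these counts by corpus statistics) is left out for brevity:

```python
from collections import Counter

STOP_WORDS = {"will", "an", "to", "a", "the"}          # illustrative subset
PLAYER_DICT = {"lebron james": "NBA_PLAYER"}           # hypothetical entry

def bag_of_words(text):
    text = text.lower().rstrip("?.!")
    # dictionary lookup: collapse known player names to one token
    for name, token in PLAYER_DICT.items():
        text = text.replace(name, token.lower())
    # stop-word removal, then raw term counts
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return Counter(tokens)

features = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
```

After this step, `features` contains one count each for nba_player, deliver, nba, championship, and cleveland, with the stop words gone.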
The same approach is used for Japanese.
ex: 私は中路です。よろしくお願いします。 ("I am Nakaji. Nice to meet you.")
tokens: 私 / は / 中路 / です / よろしく / お願い / し / ます
・stop words (は, です, し, ます, …) are removed
・a person dictionary maps 中路 → PERSON
・tf-idf weighting is applied
Another Option: Paragraph Vector
Example:
私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]
"Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]
The paragraph vector has a dimension of several hundred.
Outline of Distributed Representation
・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)
・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)
Word Vector in the word2vec Model
Every word is mapped to a unique word vector with good properties:
v_Germany = [0.1, 0.2, …, 0.2]
v_Berlin = [0.1, 0.1, …, -0.1]
v_Paris = [0.3, 0.4, …, 0]
v_France = [0.3, 0.3, …, 0.3]
…
"Germany - Berlin = France - Paris", i.e. v_Germany - v_Berlin ≈ v_France - v_Paris
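The analogy property can be demonstrated numerically. The 3-d vectors below are made up for illustration (real word2vec vectors have hundreds of dimensions), but the arithmetic is the standard one: compute v_Germany - v_Berlin + v_Paris and find the nearest word by cosine similarity.

```python
from math import sqrt

# Toy vectors chosen so that Germany - Berlin == France - Paris exactly
vecs = {
    "Germany": [0.9, 0.1, 0.8],
    "Berlin":  [0.9, 0.1, 0.2],
    "France":  [0.1, 0.9, 0.8],
    "Paris":   [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# v_Germany - v_Berlin + v_Paris should land nearest v_France
query = [g - b + p for g, b, p in
         zip(vecs["Germany"], vecs["Berlin"], vecs["Paris"])]
best = max((w for w in vecs if w != "Paris"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # prints: France
```

Excluding the query word ("Paris") from the candidates is the usual convention in analogy evaluation.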
Procedure to Create Word Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
Each vocabulary word gets an index (e.g. w_220 = "cat", w_221 = "sat"), and the model is trained on a corpus such as: "A cat sat on the street." … "I love cat very much." "He comes from Japan." …

Objective Function (cbow case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \neq t,\ |t'-t| \le c} v_{w_{t'}}

Procedure:
① Maximize L for u_w and v_w.
② v_w is the word vector for w.
Word vectors are trained so that they become good features for predicting surrounding words.
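The model above can be evaluated numerically for one window. The 2-d vectors are made-up toy values; for brevity the output vectors u are tied to the input vectors v (real word2vec keeps them as two separate matrices). The code computes v as the sum of the context word vectors and then the softmax P(w_t | context):

```python
from math import exp

# Toy input vectors v_w (assumed values, 2-d for readability)
v_in = {"a": [0.1, 0.2], "cat": [0.5, 0.1], "on": [0.0, 0.3],
        "the": [0.1, 0.0], "street": [0.2, 0.4], "sat": [0.4, 0.4]}
# Output vectors u_w; tied to v_in here, separate in the real model
u_out = {w: vec for w, vec in v_in.items()}

def p_center(center, context):
    # v = sum of context word vectors (the "sum case" above)
    v = [sum(col) for col in zip(*(v_in[w] for w in context))]
    # softmax over u_W . v across the whole vocabulary
    scores = {w: exp(sum(ui * vi for ui, vi in zip(u, v)))
              for w, u in u_out.items()}
    return scores[center] / sum(scores.values())

# probability of "sat" given its window in "A cat sat on the street."
print(round(p_center("sat", ["a", "cat", "on", "the"]), 3))
```

Training adjusts u and v by gradient ascent so that probabilities like this one are high for words in their true contexts; over the whole vocabulary the probabilities sum to 1.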
Procedure to Create Paragraph Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
Add a vector to the model for each document, e.g. doc_1: "A cat sat on the street." … doc_2: "I love cat very much." "He comes from Japan." …

Objective Function (dbow case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i)

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \neq t,\ |t'-t| \le c} v_{w_{t'}} + d_i

where d_i is the vector of the document containing w_t.

Procedure:
① Maximize L for u_w, v_w, and d_i.
② Preserve u_w and v_w as ũ_w and ṽ_w.
Procedure to Create Paragraph Vector (live data)
After training, we can get a good paragraph vector as a feature for a new document, e.g. doc: "We love SmartNews." "I love SmartNews very much." …

Objective Function (dbow case):
L_doc = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc)

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_W \cdot \tilde{v})}, \quad \tilde{v} = \sum_{t' \neq t,\ |t'-t| \le c} \tilde{v}_{w_{t'}} + d

Procedure:
③ Maximize L_doc for d, keeping ũ_w and ṽ_w fixed.
④ Use d as the paragraph vector.

In short, the feature extractor works in two phases: training maximizes L to fix ũ_w and ṽ_w; on live data, maximizing L_doc yields the paragraph vector d, e.g. [0.2, 0.3, …, 0.2].
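Steps ③-④ can be sketched as gradient ascent on d alone, with the word vectors frozen. Everything numeric here is a toy assumption (2-d vectors, a 3-word vocabulary, a full-sentence context window); the structure matches the procedure above: only d moves, and the converged d is the paragraph vector.

```python
from math import exp, log

# Frozen, made-up 2-d vectors standing in for the trained u~_w and v~_w
u_out = {"i": [0.2, 0.1], "love": [0.6, 0.2], "smartnews": [0.1, 0.7]}
v_in  = {"i": [0.1, 0.0], "love": [0.3, 0.1], "smartnews": [0.0, 0.4]}
VOCAB = list(u_out)

def _probs(words, t, d):
    # v~ = sum of the other words' input vectors, plus the document vector d
    ctx = [w for i, w in enumerate(words) if i != t]
    vt = [sum(v_in[c][k] for c in ctx) + d[k] for k in range(2)]
    scores = {W: exp(sum(u_out[W][k] * vt[k] for k in range(2))) for W in VOCAB}
    z = sum(scores.values())
    return {W: s / z for W, s in scores.items()}

def log_likelihood(words, d):
    # L_doc for this document at document vector d
    return sum(log(_probs(words, t, d)[w]) for t, w in enumerate(words))

def infer_paragraph_vector(words, steps=100, lr=0.1):
    d = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for t, w in enumerate(words):
            p = _probs(words, t, d)
            for k in range(2):
                # gradient of the log-softmax with respect to d
                grad[k] += u_out[w][k] - sum(p[W] * u_out[W][k] for W in VOCAB)
        d = [d[k] + lr * grad[k] for k in range(2)]  # ascent: only d moves
    return d

doc = ["i", "love", "smartnews"]
d = infer_paragraph_vector(doc)   # step ④: d is the paragraph vector
```

Because the word vectors are frozen, L_doc is concave in d and a small learning rate steadily improves it; the result is the feature vector handed to the classifier.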
Text Classification
The ordinary architecture, now with paragraph vectors as features:
① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier
② live data: [0.1, -0.1, …] → classifier → sports
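A minimal sketch of that classifier stage, with made-up vectors and a nearest-centroid rule standing in for whatever training algorithm is actually used: average the paragraph vectors per category, then assign a new vector to the closest centroid.

```python
# Hypothetical (paragraph vector, label) training pairs
training = [
    ([0.1, 0.3], "entertainment"), ([0.2, 0.4], "entertainment"),
    ([0.9, 0.1], "sports"), ([0.8, 0.2], "sports"),
]

def centroids(data):
    """① training: average the vectors of each category."""
    sums = {}
    for vec, label in data:
        s, n = sums.get(label, ([0.0] * len(vec), 0))
        sums[label] = ([a + b for a, b in zip(s, vec)], n + 1)
    return {lab: [x / n for x in s] for lab, (s, n) in sums.items()}

def classify(vec, cents):
    """② live data: pick the category with the nearest centroid (Euclidean)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(vec, cents[lab])))

cents = centroids(training)
print(classify([0.85, 0.15], cents))  # prints: sports
```

Any vector-space classifier (logistic regression, SVM, …) slots into the same ① / ② shape; nearest-centroid is used here only because it fits in a few lines.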
Benefits of Using Paragraph Vector
Good:
・High precision in text classification: several percent better than bag-of-words with feature engineering on our Japanese/English data sets (labeled: tens of thousands; unlabeled: ~100,000).
・High scalability: we don't need to work hard on feature engineering for each language.
Bad:
・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
It is also important that the paragraph vector has a different nature from bag-of-words: we can get a better classifier by combining two different types of classifiers.
Our Use Case
Validation: use one classifier to validate the other.
Combination: use the more reliable result of the two classifiers, bag-of-words-based vs. paragraph-vector-based.
Our Use Case (future)
In multilingual localization: use only the paragraph-vector-based classifier, without any feature engineering.
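One hypothetical reading of the "Combination" strategy in code (the slide doesn't specify how "more reliable" is measured; here each classifier is assumed to report a confidence score alongside its category):

```python
def combine(bow_result, pv_result):
    """Each argument is a (category, confidence) pair from one classifier;
    return the category of whichever classifier is more confident."""
    return max([bow_result, pv_result], key=lambda r: r[1])[0]

print(combine(("SPORTS", 0.55), ("ENTERTAINMENT", 0.91)))  # prints: ENTERTAINMENT
```

More elaborate schemes (weighted voting, stacking) fit the same interface.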
The Challenge
News is uncertainty-seeking for long-term value: exploitation vs. exploration.
What big-data firms typically do: preference estimation and risk quantification.
What SmartNews does: uncertainty seeking, i.e. discovery.
What if parents never fed vegetables to children who only like meat? What if you keep hearing only opinions that match your own?
We search for a form of exploration that is not optimal, but acceptable. Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews.
We are developing:
① For better feature vectors of users and articles: topic extraction and image extraction, feeding a feature vector of interests for 10 million users × a real-time feature vector for articles.
② For human-acceptable exploration: a multi-armed-bandit-based scoring model.
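For concreteness, here is an epsilon-greedy policy, one simple multi-armed-bandit variant (the slide does not say which algorithm SmartNews actually uses): mostly exploit the article with the best estimated value, but with probability epsilon explore a random one.

```python
import random

def choose_article(value_estimates, epsilon=0.1, rng=random):
    """value_estimates: {article_id: estimated value}. With probability
    epsilon pick a random article (explore); otherwise pick the best (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(value_estimates))
    return max(value_estimates, key=value_estimates.get)
```

The exploration term is what lets unfamiliar articles surface, which is the "discovery" side of the exploitation/exploration trade-off above.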
We are building our engineering team in SF. Please join us! We're hiring:
・ML/NLP Engineer
・Data Science Engineer
…
kohei.nakaji@smartnews.com
References
Main Content Extraction
・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl. "Boilerplate Detection using Shallow Text Features."
・BoilerPipe (Google Code)
Text Classification
・Quoc V. Le, Tomas Mikolov. "Distributed Representations of Sentences and Documents."
・Word2Vec (Google Code)
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

[SmartNews] Globally Scalable Web Document Classification Using Word2Vec

  • 1. Globally Scalable Web Document Classification Using Word2Vec Kohei Nakaji (SmartNews)
  • 2.
  • 5. About SmartNews Japan Launched 2013 4M+ Monthly Active Users 50% DAU/MAU 100+ Publishers 2013 App of The Year US Launched Oct 2014 1M+ Monthly Active Users Same engagement 80+ Publishers Top News Category App International Launched Feb 2015 10M Downloads WW Same engagement English beta Featured App Funding: $50M
  • 6. Outline of our algorithm Structure Analysis Semantics Analysis URLs Found Importance Estimation 10 million/day 1000+/day Diversification Signals on the Internet
  • 7. Outline of our algorithm Structure Analysis Semantics Analysis URLs Found Importance Estimation 10 million/day 1000+ /day Diversification Signals on the Internet Web Document Classification ⊂
  • 8. Web Document Classification ENTERTAINMENT SPORTS TECHNOLOGY LIFESTYLE SCIENCE … Task definition: When an arbitrary web document arrives, choose one category exclusively from a pre-determined category set. WORLD
  • 9. Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ② There are roughly two steps:
  • 10. There are roughly two steps: Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ②
  • 11. Main Content Extraction. Two approaches: ・Extract after rendering the whole page (easier, but takes time) ・Extract from HTML (difficult, but fast)
  • 12. Main Content Extraction. Two approaches: ・Extract after rendering the whole page (easier, but takes time) ・Extract from HTML (difficult, but fast) ← Our Approach
  • 13. Main Content Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html> Example: main content not main content
  • 14. Main Content Extraction from HTML. A rule-based extraction algorithm is possible. English: Rule 1: a div with text length > 200 and fewer than 3 ‘a’ tags is Main Content. Rule 2: a div with text length < 100 and more than 4 ‘p’ tags is Main Content. Rule N: …
  • 15. Main Content Extraction from HTML. A rule-based extraction algorithm is possible. English: Rule 1: a div with text length > 200 and fewer than 3 ‘a’ tags is Main Content. Rule 2: a div with text length < 100 and more than 4 ‘p’ tags is Main Content. Rule N: … But not scalable. Japanese: … … … …
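The rule list above can be sketched as a simple predicate. This is a minimal illustration, assuming per-div features (text length, tag counts) have already been computed; the thresholds are the ones quoted on the slide, everything else is made up:

```python
def is_main_content(text_length, num_a_tags, num_p_tags):
    """Toy version of the rule-based extractor; thresholds from the slide."""
    # Rule 1: long text with few links is main content
    if text_length > 200 and num_a_tags < 3:
        return True
    # Rule 2: short text but many paragraph tags is main content
    if text_length < 100 and num_p_tags > 4:
        return True
    # Rule N: ... every new site or language tends to need more rules,
    # which is exactly why this approach does not scale
    return False

print(is_main_content(text_length=350, num_a_tags=0, num_p_tags=1))  # True
print(is_main_content(text_length=40, num_a_tags=6, num_p_tags=0))   # False
```

The function is deliberately a flat list of if-statements: each new publisher layout adds another branch, which is the scalability problem the machine learning approach on the next slides avoids.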
  • 16. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 17. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 18. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Separate HTML into ‘text block’s Step1:
  • 19. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Step1: Separate HTML into ‘text block’s Step2: Extract local features for every text block ex: word count = 36, num of <a> = 0
  • 20. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Step1: Separate HTML into ‘text block’s Step2: Extract local features for every text block ex: word count = 36, num of <a> = 0 Step3: Define feature of each text block as combination of local features word count(current block) : 36, num of <a>(current block) : 0, word count (previous block) : 4, num of <a> (previous block) : 1 ex:
  • 21. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach: See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 22. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 23. Making Main Content Using Decision Tree (features)block1: not main (features)block2: not main (features)block3: main (features)block5: main (features)block4: not main
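Once each block carries a main / not-main label, assembling the main content is just concatenation. The block texts and labels below are illustrative stand-ins for what a trained decision tree would emit:

```python
# (block text, predicted label) per block, as a trained decision tree might label them
blocks = [
    ("click here for",                           False),
    ("tweet share",                              False),
    ("Robert Bates was a volunteer deputy ...",  True),
    ("you also like this",                       False),
    ("So how did the 73-year-old insurance ...", True),
]

# combine the blocks predicted as main content into the main text
main_text = " ".join(text for text, is_main in blocks if is_main)
print(main_text)
```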
  • 24. Main Content Extraction From HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 25. There are roughly two steps: Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ②
  • 26. Text Classification Ordinary text classification architecture: ② live data (features) ① training (features, entertainment) (features, sports) (features, entertainment) features ? ? … entertainment sports (features, politics) … sports training algorithm classifier feature extraction
  • 27. Text Classification Ordinary text classification architecture: ② live data (features) ① training (features, entertainment) (features, sports) (features, entertainment) features ? ? … entertainment sports (features, politics) … sports training algorithm classifier feature extraction
  • 28. Feature Extraction in Text Classification Will LeBron James deliver an NBA championship to Cleveland? ‘Bag-of-words’ is commonly used as a feature vector. Will deliver an NBA championship to Cleveland James LeBron
  • 29. Feature Extraction in Text Classification. ‘Bag-of-words’ is commonly used as a feature vector, with some feature engineering: stop words (Will, an, to) are removed, a sports-players dictionary maps LeBron James to NBA_PLAYER, and tf-idf weighting is applied. Example: Will LeBron James deliver an NBA championship to Cleveland?
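The engineered bag-of-words pipeline can be sketched in a few lines; the stop-word list and player dictionary here are toy assumptions, and tf-idf weighting is omitted for brevity:

```python
from collections import Counter

STOP_WORDS = {"will", "an", "to", "the", "a"}   # toy stop-word list
PLAYER_DICT = {"lebron", "james"}               # toy sports-player dictionary

def bag_of_words(text):
    """Tokenize, drop stop words, map dictionary hits to NBA_PLAYER, count."""
    tokens = [w.strip("?.,").lower() for w in text.split()]
    return Counter("NBA_PLAYER" if t in PLAYER_DICT else t
                   for t in tokens if t not in STOP_WORDS)

bow = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
print(bow)
```

Note that the counter keeps no word-order information at all, which is the weakness the paragraph vector addresses later in the deck.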
  • 30. Feature Extraction in Text Classification. Similarly used in Japanese. Example: 私は中路です。よろしくお願いします。 is segmented into 私 / は / 中路 / です / よろしく / お願い / し / ます; stop words are removed, a person dictionary maps 中路 to PERSON, and tf-idf is applied.
  • 32. Example: 私は中路です。 よろしくお願いします。 → [0.2, 0.3, ……0.2] Will LeBron James deliver an NBA championship to Cleveland? → [0.1, 0.4, ……0.1] Paragraph Vector (dimension: a few hundred)
  • 33. Outline of Distributed Representation ・word2vec ・paragraph vector every word is mapped to unique word vector. every document is mapped to unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053) (https://code.google.com/p/word2vec/)
  • 34. Outline of Distributed Representation ・word2vec ・paragraph vector every word is mapped to unique word vector. every document is mapped to unique vector. (https://code.google.com/p/word2vec/) (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
  • 35. Word Vector in the word2vec Model. Every word is mapped to a unique word vector with good properties, e.g. v_Germany = [0.1, 0.2, ……0.2], v_Berlin = [0.1, 0.1, ……-0.1], v_Paris = [0.3, 0.4, ……0], v_France = [0.3, 0.3, ……0.3], with “Germany - Berlin = France - Paris”, i.e. v_Germany - v_Berlin = v_France - v_Paris.
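The analogy property can be checked with plain vector arithmetic. The 3-d vectors below are purely illustrative (real word2vec vectors have a few hundred dimensions), chosen so that the analogy holds:

```python
import math

# toy 3-d vectors, purely illustrative
vec = {
    "Germany": [0.9, 0.1, 0.3],
    "Berlin":  [0.8, 0.1, 0.9],
    "France":  [0.1, 0.9, 0.3],
    "Paris":   [0.0, 0.9, 0.9],
    "Tokyo":   [0.5, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Germany - Berlin = France - Paris"  =>  Germany - Berlin + Paris ~ France
query = [g - b + p for g, b, p in zip(vec["Germany"], vec["Berlin"], vec["Paris"])]
nearest = max((w for w in vec if w not in {"Germany", "Berlin", "Paris"}),
              key=lambda w: cosine(vec[w], query))
print(nearest)
```

Excluding the query words themselves before taking the nearest neighbour mirrors how word2vec analogy evaluation is usually done.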
  • 36. Procedure to Create Word Vectors. Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf). Example corpus: A cat sat on the street. … I love cat very much. He comes from Japan. … Word vectors are trained so that they become good features for predicting surrounding words. Procedure: ① Maximize the objective function (CBOW case) $L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})$, with the model (sum case) $P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_{W} \cdot v)}$, where $v = \sum_{t' \neq t,\, -c \le t'-t \le c} v_{w_{t'}}$ and $u_w$, $v_w$ are defined for each word $w$. ② Use $v_w$ as the word vector for $w$.
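The model equation above (sum case) can be instantiated directly as a softmax over dot products. The tiny vocabulary and the vector values are made up for illustration; training (adjusting u and v to raise these probabilities) is not shown:

```python
import math

# toy output vectors u_w and input vectors v_w for a 4-word vocabulary
u = {"cat": [0.2, 0.1], "sat": [0.1, 0.3], "on": [0.4, 0.2], "the": [0.0, 0.1]}
v = {"cat": [0.3, 0.0], "sat": [0.1, 0.2], "on": [0.2, 0.2], "the": [0.1, 0.1]}

def p_center(center, context):
    """P(w_t | context) = exp(u_{w_t} . v) / sum_W exp(u_W . v),
    with v the sum of the context words' input vectors (sum case)."""
    vsum = [sum(v[w][k] for w in context) for k in range(2)]
    scores = {w: math.exp(sum(u[w][k] * vsum[k] for k in range(2))) for w in u}
    return scores[center] / sum(scores.values())

probs = {w: p_center(w, ["cat", "sat", "the"]) for w in u}
print(probs)
```

The denominator sums over the whole vocabulary, which is exactly the cost the slide's note says negative sampling and hierarchical softmax approximate away.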
  • 37. Outline of Distributed Representation ・word2vec ・paragraph vector every word is mapped to unique word vector. every document is mapped to unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
  • 38. Example: 私は中路です。 よろしくお願いします。 [0.2, 0.3, ……0.2] Will LeBron James deliver an NBA championship to Cleveland? [0.1, 0.4, ……0.1] Paragraph Vectors (dimension ∼ 100s)
  • 39. Procedure to Create Paragraph Vectors. Quoc V. Le, Tomas Mikolov (http://arxiv.org/abs/1405.4053). Add a vector $d_i$ to the model for each document (doc_1: A cat sat on the street. …, doc_2: I love cat very much. He comes from Japan. …). Procedure: ① Maximize the objective function $L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i)$, with the model (sum case) $P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_{W} \cdot v)}$, where $v = \sum_{t' \neq t,\, -c \le t'-t \le c} v_{w_{t'}} + d_i$ and $\mathrm{doc}_i$ is the document where $w_t$ is included. ② Preserve the trained $u_w$, $v_w$ as $\tilde{u}_w$, $\tilde{v}_w$.
  • 40. Procedure to Create a Paragraph Vector for a New Document. After training, we can get a good paragraph vector as a feature for a new document (live data, e.g. doc: We love SmartNews. … I love SmartNews very much.). Procedure: ③ Maximize $L_{doc} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc})$ for $d$ only, with the model (sum case) $P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_{W} \cdot \tilde{v})}$, where $\tilde{v} = \sum_{t' \neq t,\, -c \le t'-t \le c} \tilde{v}_{w_{t'}} + d$ and the trained $\tilde{u}_w$, $\tilde{v}_w$ are kept fixed. ④ Use $d$ as the paragraph vector.
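Step ③ (maximize L_doc over d only, with the trained word vectors frozen) can be sketched as plain gradient ascent on the log-softmax objective. All vectors, the vocabulary, and the hyperparameters below are illustrative assumptions:

```python
import math

# frozen "already trained" toy vectors u~_w and v~_w (2-d for readability)
u = {"love": [0.5, 0.1], "news": [0.1, 0.6], "very": [0.4, 0.4]}
v = {"love": [0.3, 0.2], "news": [0.2, 0.5], "very": [0.1, 0.1]}

def log_likelihood(words, d, window=1):
    """L_doc: sum over positions of log P(center | context, doc)."""
    total = 0.0
    for t, center in enumerate(words):
        ctx = [w for i, w in enumerate(words) if i != t and abs(i - t) <= window]
        vt = [sum(v[w][k] for w in ctx) + d[k] for k in range(2)]
        scores = {w: math.exp(sum(u[w][k] * vt[k] for k in range(2))) for w in u}
        total += math.log(scores[center] / sum(scores.values()))
    return total

def infer_paragraph_vector(words, steps=200, lr=0.1, window=1):
    d = [0.0, 0.0]                     # only d is optimized; u and v stay fixed
    for _ in range(steps):
        grad = [0.0, 0.0]
        for t, center in enumerate(words):
            ctx = [w for i, w in enumerate(words) if i != t and abs(i - t) <= window]
            vt = [sum(v[w][k] for w in ctx) + d[k] for k in range(2)]
            scores = {w: math.exp(sum(u[w][k] * vt[k] for k in range(2))) for w in u}
            z = sum(scores.values())
            for k in range(2):         # d logP(center)/dd = u_center - E[u_W]
                grad[k] += u[center][k] - sum(s / z * u[w][k]
                                              for w, s in scores.items())
        d = [d[k] + lr * grad[k] for k in range(2)]
    return d

doc = ["love", "news", "very"]
d = infer_paragraph_vector(doc)
```

The returned d is the paragraph vector of step ④; the log-likelihood of the document under the frozen model is strictly higher at d than at the zero initialization.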
  • 41. Procedure to Create a Paragraph Vector. Training: maximize $L$ over a large document set to learn the feature extractor ($\tilde{u}_w$, $\tilde{v}_w$). Live data: maximize $L_{doc}$ over $d$ to extract the paragraph vector, e.g. [0.2, 0.3, ……0.2].
  • 42. Text Classification Ordinary text classification architecture: ② live data ([0.1, -0.1, …]) ① training ([0.1, 0.3, …], entertainment) ([0.2, -0.3, …], sports) ([0.1, 0.1, …], entertainment) features ? ? … entertainment sports ([0.1, -0.2, …], politics) … sports training algorithm classifier feature extraction
  • 43. Benefits of Using Paragraph Vector. Good: ・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data set (labeled: tens of thousands of documents; unlabeled: ~100,000). ・High scalability: we don’t need to work hard on feature engineering for each language. Bad: ・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
  • 44. Benefits of Using Paragraph Vector. It is important that the Paragraph Vector has a different nature than Bag-of-Words. Reason: we can get a better classifier by combining two different types of classifiers.
  • 45. Our Use Case Validation Use one to validate the other. Combination Use the more reliable result of two classifiers: Bag-of-Words-based classifier vs. Paragraph Vector-based classifier
  • 46. Our Use Case (future). In multilingual localization, use only the Paragraph Vector-based classifier, without any feature engineering.
  • 47. Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ② There are roughly two steps:
  • 49. The Challenge News is uncertainty seeking for long-term values. Exploitation Exploration What SmartNews does: uncertainty seeking discovery What Big Data Firms typically do: preference estimation and risk quantification What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
  • 50. The Challenge. Searching for a not optimal, but acceptable, form of exploration. Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews. We are developing: ① for better feature vectors of users and articles (user interests; feature vectors for 10 million users; real-time feature vectors for articles): topic extraction and image extraction; ② for human-acceptable exploration: a multi-armed-bandit-based scoring model.
  • 51. We are building our engineering team in SF - please join us! We’re hiring: ・ML/NLP Engineer ・Data Science Engineer …
  • 53. References. Main Content Extraction: ・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, “Boilerplate Detection using Shallow Text Features” ・BoilerPipe (Google Code). Text Classification: ・Quoc V. Le, Tomas Mikolov, “Distributed Representations of Sentences and Documents” ・Word2Vec (Google Code)
  • 54. References About SmartNews ・Japan’s SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S. ・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S. ・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M ・About our Company SmartNews Articles about SmartNews

Editor's Notes

  1. Hello, I am Kohei Nakaji, an engineer at SmartNews Inc. I develop the news delivery algorithm at SmartNews, using machine learning and natural language processing in particular. My research background is not in ML but in particle physics theory: the beginning of the universe, dark matter, and so on, so if you are interested in physics I can talk about that another day, too. Anyway, today I am going to talk about this topic: 'Globally Scalable Web Document Classification Using Word2Vec'. Because this talk is based on the technology at SmartNews, I will give a brief introduction to our company. We at SmartNews develop the iOS/Android application SmartNews.
  2. How many of you use SmartNews? Very few people. How many of you love machine learning? Great, then you will love SmartNews, because our app is built by machine learning. SmartNews is a news app for more than 100 countries, but we have no writers and no editors; the algorithm does everything. How many of you use a news app every day? Yeah, most news apps fail. Some apps have great download numbers but are annoying, with a low engagement ratio. We at SmartNews have 10M downloads globally and more than 50% are active, so we have a chance to take the position of the successful news app. Then what makes SmartNews different?
  3. The keyword is ‘machine learning for discovery’. Some apps rely on human editors; they are not scalable and can also be biased. Some apps use machine learning in their delivery algorithms, but only for personalization. We use machine learning so that everyone on earth can discover and learn new things they might not otherwise have seen. This is our mission. We are trying to develop an algorithm that lets users discover new things, and that is what makes our engagement ratio high. Now let me show you a demo of our app.
  4. Let me show you how it works. First, when you open the app, you see the top news right here. Top news are the latest important news chosen by our algorithm. Over here you have tabs for different categories, which are the most straightforward result of web document classification. You see the latest important news in each category, chosen by our algorithm; you can understand how precise our web document classification needs to be. One of the cool things is that when you find something you want to read, for example this article right here, you have this option: the Smart View option. You will like it, because it looks very clean: no banners, no ads. Over here you can see the web view, which is an ordinary web browser; in the web view you see a lot of things you do not want to read, but Smart View is simpler and cleaner. You can imagine how difficult it is to create Smart View from an arbitrary website; I will introduce some of the algorithm in this talk. Another cool thing about Smart View is that it works even offline: you can read in the metro, on an airplane, anywhere.
  5. As I told you, we have 10M downloads and more than 50% are active. There are three editions: the Japanese edition, the US edition, and the international edition. In the international edition, users can read English articles localized for more than 100 countries, but there is no editor for each country.
  6. The UI is good and Smart View is cool, but as I told you, what makes us different is the algorithm that finds articles from which users can discover new things. This is the outline of our algorithm for users' discovery. URLs are found from signals on the Internet by our crawler; HTML structure is automatically analyzed, for example title, main text, and images are extracted; then the semantics of articles are analyzed: what category they have, what subject, what images, and so on. Using signals and semantics, an importance score for each article, for each category, in each country is calculated; the topics of the delivery list are diversified; then we deliver the articles to users. The list of articles is refreshed in real time. We crawl 10 million URLs per day and deliver only the top 1000 or so articles to users, around 100 per category per day. There are many things to say about this algorithm. In particular, how we do importance estimation, and whether we personalize or take another approach, is a key question because it is related to our mission. I will talk about that later; now let's get into today's main topic.
  7. Web document classification, which is part of our structure analysis and semantics analysis. The reason I chose web document classification for today's topic is that, for one thing, it is important for our application, as you have already seen, and for another, classification of unstructured data is a common task in many applications, from simple spam filters to category tagging on e-commerce sites.
  8. The task definition is very simple: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
  9. There are roughly two steps. 1. Main content extraction: we have to detect the main content of a news website. It is difficult because there are so many websites, and different websites have different structures. 2. Text classification: we classify the main content into one category. First I will briefly show one of our algorithms to detect main content in a web document; next I will talk about text classification using a word2vec-extended model.
  10. Let's start with main content extraction. I want to add that in our app, main content extraction is also important for making the Smart View we have seen.
  11. When we do main content extraction, there are two approaches; we actually use the second one. The first approach is rendering the whole page, loading all CSS and JavaScript, and extracting the main content after that. It is relatively easy because we can use the position, width, and height of each component, but it takes time because we have to render everything. The second approach is to extract the main content directly from the HTML. It is more difficult, but needs much less computing resource than the first approach.
  12. We use the second approach in our algorithm, because we have to process 10 million articles per day, over 100 articles per second.
  13. This is an example of main content extraction from HTML. The task is to detect which parts are main content and which are not.
  14. A rule-based extraction algorithm is of course possible, for example: a div whose text length is more than 200 is main content. But because there are so many websites, the number of rules tends to become large.
  15. If we do it in multiple languages, it becomes much harder.
  16. So, as one of our algorithms to extract main content, we use a machine learning approach based on the paper by Kohlschütter et al. Let me introduce it. In the training phase, we first prepare a set of HTML documents in which the main content is already labeled. In our case, we aggregate articles with our crawler and annotators label the main content. Next, the block separator splits the HTML into text blocks, and the feature extractor produces a feature vector for each block.
  17. let’s get into the block separation and feature extraction part.
  18. For step one, we separate the HTML into text blocks. The definition of a 'text block' in our case is, roughly, a block delimited by block-level tags.
  19. For step 2, local features are extracted for each block. We use, for example, the number of words and the number of <a> tags as local features.
  20. For step 3, we create the feature vector of each block as a combination of local features from different blocks. In this example, the feature vector of this text block contains the word count and the number of <a> tags of both the previous and the current block.
  21. In the training phase, after block separation and feature extraction, we get a set of labeled feature vectors. The label is a binary value: main / not main. Using the labeled feature vectors, a decision tree is trained. When live data comes, the HTML is separated into text blocks with features, and the final result is obtained using the already-trained decision tree.
  22. Let’s get into this part.
  23. The feature vector of each block is classified as main / not main by the already-trained decision tree. Now we know which text blocks are main content and which are not; by combining the results, we get the main text.
  24. This is the end of main content extraction: easy, simple, but not bad. If you want to know more about it, please see the link; there is also a library (BoilerPipe) with an already-trained English model, so please try it. I will share the references later.
  25. so let’s get into the text classification.
  26. You probably know all of this already, but let me review the ordinary classification architecture. In the training phase, we first prepare a set of labeled texts as training data. The feature extractor produces a set of labeled feature vectors, and then a training algorithm such as SVM or logistic regression trains the classifier. In the bag-of-words feature extractor case, the set of words in the document is extracted as the feature vector, and after training, roughly speaking, the classifier has learned which words tend to show up in which category. When live data comes, a feature vector is extracted, and the category is determined by the already-trained classifier.
  27. The training algorithm itself is ordinary logistic regression in our application, and there are many materials about it, so today let's focus on the feature extraction part.
  28. As a feature vector, 'bag-of-words' is commonly used. Bag-of-words is the set of words in a document; it does not care about word order. Very simple, but not bad when used for text classification.
  29. If we want to improve the quality of the feature vector, we create, for example, a stop-words dictionary for removing unnecessary words, build a specific dictionary for adding specific features, or use tf-idf. But bag-of-words is still the starting point.
  30. In the Japanese case, we have to use a technique to segment words, but bag-of-words with some feature engineering is still commonly used. Bag-of-words is definitely not a perfect feature vector for text, though: for example, it cannot capture word order, and we cannot use the information of whether two words are close to each other. We wondered whether we could easily get a better feature vector.
  31. As a better feature vector, we use the Paragraph Vector, a word2vec-extended model. It is 'better' in terms of precision of text classification.
  32. Using the technique I will talk about today, every document is mapped to one dense vector with a few hundred dimensions, called a paragraph vector.
  33. Because the paragraph vector is a kind of word2vec-extended model, I should start with word2vec. In the word2vec case, every word is mapped to a unique word vector; in the paragraph vector case, every document is mapped to a unique vector.
  34. So let’s get into word2vec.
  35. Every word is mapped to a unique vector. In this example, France, Paris, Germany, and Berlin are each mapped to a unique vector. What is surprising is the property Germany - Berlin = France - Paris. From this property, we can be sure that some semantics is embedded in the vector.
  36. This is a brief overview of training the word2vec model. First, prepare a set of documents and label each word w1, w2, and so on; then maximize the objective function. The value of c is arbitrary; 2 or 3 is commonly used. By looking at the shape of this objective function, you can see that maximizing it means maximizing the probability of predicting a word from its surrounding words. In the example in the right figure, the model is updated so that the probability of predicting 'on' from the surrounding words 'cat', 'sat', 'the', 'street' becomes higher. The model of the probability function is as shown. For each word, two types of vectors are defined: an output vector u and an input vector v. Roughly speaking, when training converges, the more often a pair of two words shows up in the same sentence, the bigger the inner product of u and v for those two words becomes. After training, we use v for each word as its word vector. Technically, training this model directly is really heavy because of this sum, and two types of approximation, negative sampling and hierarchical softmax, are used; the details of the approximations are beyond the scope of this talk. This is how we create word vectors with the word2vec model.
  37. Then let’s get into paragraph vector.
  38. As I told you, each document is mapped into one dense vector named paragraph vector.
  39. The procedure to create paragraph vectors is similar to the word2vec case. Prepare a set of documents, label each word w1, w2, and so on, and also label each document doc_1, doc_2, and so on. Then maximize this objective function. The difference from the word2vec model is that the objective function includes the id of the document the word belongs to, so maximizing it means maximizing the probability of predicting a word not only from its surrounding words but also from the document that contains it. The model of the probability function is also a little different. As in the word2vec case, an output vector u and an input vector v are defined for each word; in addition, a vector d_i is defined for each document. When training converges, we get optimized u and v for each word and d_i for each document. The final value of d_i is the paragraph vector of each document. But what we really want is to extract a paragraph vector from a new document, and for that we need one more step.
  40. When a new document comes, we label the words in the document and maximize this objective function, where T is now the number of words in the document. We do not need to maximize it over u and v; we can use the u and v that are already trained. All we have to do is maximize the objective function over d. Once it is maximized, we get d as the paragraph vector of the document.
  41. That was a little confusing, so here is a simple figure. First, we train the feature extractor by feeding it a large set of documents; when a new document comes, the already-trained feature extractor extracts its paragraph vector. Very simple, right?
  42. By just using the paragraph vector as a feature vector, we can do ordinary text classification.
  43. The good things about using paragraph vectors compared with bag-of-words are these two. ① High precision: on our Japanese/English data sets, the result of a 10-fold validation test is several percent better than with bag-of-words plus feature engineering. ② High scalability: just by preparing a set of documents for each language, without feature engineering, we can get good results. The bad thing is the difficulty of analyzing errors; it is hard to understand the meaning of each component of a paragraph vector. Because there is a trade-off, I do not know which you should choose in your use case, even if the precision of text classification is several percent higher with the paragraph vector.
  44. Still, I think it is worth trying the paragraph vector. It has a different nature from bag-of-words, so the combination of a bag-of-words classifier and a paragraph-vector-based classifier can be a much better classifier.
  45. In our app there are many types of classifiers beyond the main category classifier, such as a sports classifier and an entertainment classifier. Depending on the purpose of each classification, in some cases we use the more reliable result of the bag-of-words-based classifier and the paragraph-vector-based classifier; in other cases we validate the result of the bag-of-words-based classifier with the paragraph-vector-based classifier.
  46. In the near future, when we expand our business into many, let's say 100, languages, there is a good chance that we will use only the paragraph-vector-based classifier, because of its high scalability and high precision.
  47. This is the end of today's topic, web document classification.
  48. News is uncertainty-seeking for long-term value. What other big data firms typically do is recommend what people already have an interest in, using techniques like matrix factorization. What we are doing is not simply suggesting to users what they like, but expanding users' interests with our algorithm.
  49. How to explore users' interest space and suggest something new to users is a very challenging problem. We are now polishing these two things. For a better understanding of users' interest space, we are improving topic and subject extraction from articles and users' feature vectors. For good exploration, we use a multi-armed-bandit-based scoring model. Technically, we have to create and operate a good, reasonable model that includes feature vectors for 10 million users and real-time feature vectors for articles; it is really exciting. Actually, the number of people tackling these problems is five, including an ML PhD and a theoretical physics PhD, but we need many more people to tackle this difficult problem.