CountVectorizer_ApacheFlink

CountVectorizer in Flink

Insight Data Engineering Fellowship, Silicon Valley
Roshani Nagmote

Motivation

•  Open Source Contribution to Apache Flink

•  Add CountVectorizer to the Flink ML library to support NLP

•  Usage: document similarity / word cloud application

What is CountVectorizer

•  A class in scikit-learn, the Python ML library, that converts a collection of text documents into a matrix of token counts

Functionality Implemented

•  fit() (Estimator):

Input:
["this is a text",
"this is not a text document to document"]

Output of fit:
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
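
A minimal sketch of the fit step in plain Python, not the Flink ML code: each new token gets the next 1-based index in first-seen order, which reproduces the output shown for fit. Whitespace tokenization and the name `fit_vocabulary` are assumptions made for illustration.

```python
def fit_vocabulary(documents):
    """Map each distinct token to a 1-based index in first-seen order."""
    vocabulary = {}
    for document in documents:
        for token in document.split():
            if token not in vocabulary:
                # Next free index; the slides number terms from 1.
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


docs = ["this is a text", "this is not a text document to document"]
print(fit_vocabulary(docs))
# {'this': 1, 'is': 2, 'a': 3, 'text': 4, 'not': 5, 'document': 6, 'to': 7}
```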


•  transform() (Transformer):

Input:
["this is a text",
"this is not a text document to document"]

Output of transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]
[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]

•  get_feature_names():

Output:
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
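
The transform and feature-name steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the Flink ML implementation: tokens are split on whitespace, the vocabulary is hard-coded from the fit output above, pairs are returned sorted by index (the slide lists them in token order), and `feature_names` returns only the terms, as scikit-learn does, rather than (term, index) pairs.

```python
from collections import Counter

# Vocabulary copied from the fit output in the example.
VOCABULARY = {"this": 1, "is": 2, "a": 3, "text": 4, "not": 5, "document": 6, "to": 7}


def transform(document, vocabulary):
    """Return sorted (index, count) pairs for tokens found in the vocabulary."""
    counts = Counter(
        vocabulary[token] for token in document.split() if token in vocabulary
    )
    return sorted((index, float(count)) for index, count in counts.items())


def feature_names(vocabulary):
    """Return the terms ordered by their index."""
    return sorted(vocabulary, key=vocabulary.get)


print(transform("this is a text", VOCABULARY))
# [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)]
print(transform("this is not a text document to document", VOCABULARY))
# [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 2.0), (7, 1.0)]
print(feature_names(VOCABULARY))
# ['this', 'is', 'a', 'text', 'not', 'document', 'to']
```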


•  Parameters added to the CountVectorizer() constructor

-> setMinDF(2), setMaxDF(5)

MinDF = 2: The sun in the sky is bright. We can see the shining sun, the bright sun
MaxDF = 5: The sun in the sky is bright. We can see the shining sun, the bright sun

-> setStopwords(List("in", "the"))

The sun in the sky is bright. We can see the shining sun, the bright sun
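
A rough Python sketch of how these three parameters could interact. It assumes the semantics of scikit-learn's integer `min_df` and `max_df` (thresholds on the number of documents containing a term); whether the Flink version uses exactly these semantics is not stated on the slide. Treating the sample text as two documents, lowercasing, and the `[a-z]+` token pattern are also assumptions.

```python
import re


def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    frequency = {}
    for document in documents:
        # set() so a term repeated inside one document is counted once.
        for term in set(re.findall(r"[a-z]+", document.lower())):
            frequency[term] = frequency.get(term, 0) + 1
    return frequency


def filter_terms(documents, min_df=1, max_df=None, stopwords=()):
    """Keep terms with min_df <= document frequency <= max_df, minus stopwords."""
    frequency = document_frequency(documents)
    upper = len(documents) if max_df is None else max_df
    return sorted(
        term
        for term, count in frequency.items()
        if min_df <= count <= upper and term not in stopwords
    )


docs = [
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun",
]
print(filter_terms(docs, min_df=2))
# ['bright', 'sun', 'the']
print(filter_terms(docs, min_df=2, stopwords={"in", "the"}))
# ['bright', 'sun']
```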


•  Parameters
-> setNgramRange(List(1,3))
•  Wrote test cases for the functions above and compared their output
with scikit-learn's CountVectorizer for accuracy: 100%
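
To illustrate what an n-gram range of (1,3) produces, here is a small Python sketch. It assumes the range is inclusive at both ends, as in scikit-learn's `ngram_range`, and that n-grams are joined with a single space; the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams of the token list for min_n <= n <= max_n."""
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the tokens.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(ngrams("this is a text".split(), 1, 3))
# ['this', 'is', 'a', 'text', 'this is', 'is a', 'a text', 'this is a', 'is a text']
```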

Overview

CountVectorizer() parameters:

•  setStopwords: List["is","the"]
•  setMinDF: 3
•  setMaxDF: 10
•  setNgramRange: List[1,3]

Steps:

•  Fit(): reads the input data files and produces the output of fit
•  Transform: uses the output of fit and produces the output of Transform
•  getFeatureNames: returns the output of fit

Input:
["this is a text",
"this is not a text document to document"]

Output of fit:
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)

Output of Transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]
[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]

Fit(Text) function

•  Fit(): overrides fit from Estimator
•  Input: all documents; get parameters
•  Call Fit(text)
•  Validate parameters
•  Train Dataset function:
-> Regex parsing
-> Filter data on minDF, maxDF
-> Filter data on stopwords
-> Create tuples of ngram range
-> Add to hash set
-> zipWithIndex
•  Flink operators: Map, ReduceGroup
•  Output
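
The Map and ReduceGroup steps of this flow can be imitated in plain Python to show the shape of the computation. This is only an analogy, not Flink code: `map` stands in for the Flink Map operator, `functools.reduce` for ReduceGroup, and `enumerate` for zipWithIndex. The `\w+` token pattern, lowercasing, and sorting before indexing are assumptions; because terms are sorted, the indices differ from the first-seen order shown on the slides.

```python
import re
from functools import reduce


def distinct_terms(document):
    """Map step: tokenize one document and keep its distinct terms."""
    return set(re.findall(r"\w+", document.lower()))


def build_vocabulary(documents):
    """Reduce step: merge the per-document sets, then index the terms."""
    merged = reduce(set.union, map(distinct_terms, documents), set())
    # Sorting makes the numbering deterministic; a hash set has no fixed order.
    return {term: index for index, term in enumerate(sorted(merged), start=1)}


docs = ["this is a text", "this is not a text document to document"]
print(build_vocabulary(docs))
# {'a': 1, 'document': 2, 'is': 3, 'not': 4, 'text': 5, 'this': 6, 'to': 7}
```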

Transform(Text)

•  Transform(): overrides transform from Transformer
•  Inputs: model (fit output) and document
•  Provide required parameters
•  Transform Dataset function:
-> Regex parsing
-> Create tuples of ngram range
-> Matching: (index,1.0)
-> Convert to sparse vector
•  Sparse vector (Flink function): internally uses quicksort, and duplicate entries are summed to get the frequency
•  Output
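
The matching and sparse-vector steps can be sketched in Python as below: every matched token emits an (index, 1.0) pair, and the conversion sorts the pairs and sums those that share an index, which yields term frequencies. This imitates the behavior described for the Flink sparse vector rather than calling it; Python's `sorted` is not quicksort, and the hard-coded vocabulary and the name `to_sparse` are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def to_sparse(pairs):
    """Sort (index, value) pairs by index and sum values that share an index."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (index, sum(value for _, value in group))
        for index, group in groupby(ordered, key=itemgetter(0))
    ]


vocabulary = {"this": 1, "is": 2, "a": 3, "text": 4, "not": 5, "document": 6, "to": 7}
tokens = "this is not a text document to document".split()

# Matching step: one (index, 1.0) pair per token found in the vocabulary.
pairs = [(vocabulary[token], 1.0) for token in tokens if token in vocabulary]
print(to_sparse(pairs))
# [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 2.0), (7, 1.0)]
```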

Word Cloud

•  Used the Lightning Scala API for visualization
•  http://countvectorizer.us

About Me

Roshani Nagmote

●  Master's in Computer Science from the University of Utah

●  Data Platform Engineering intern at Ask.com

●  Web Developer at Tata Consultancy Services

Things I love:

●  Playing with my Rini

●  Hiking and visiting new places
