CountVectorizer in Flink

	
Insight Data Engineering Fellowship, Silicon Valley
Roshani Nagmote
Motivation

	
	
	
•  Open Source Contribution to Apache Flink
•  Add CountVectorizer to the Flink ML library to help NLP
•  Usage: Document similarity / Word Cloud application
What is CountVectorizer

•  A class in scikit-learn, the Python ML library, that converts a collection of text documents into a matrix of token counts
Functionality Implemented

	
•  fit() (Estimator):

Input:
["this is a text",
 "this is not a text document to document"]

Output of fit:
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
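As a rough illustration of what fit() produces, here is a minimal Python sketch (not the Flink implementation). It assumes whitespace tokenization, lowercasing, and 1-based indices assigned in order of first appearance, which is what the example output on the slide suggests. The function name `fit` and the variable names are chosen for illustration only.

```python
def fit(documents):
    """Build a term -> index vocabulary.

    Assumption: terms are numbered from 1 in order of first appearance,
    which reproduces the example output on the slide.
    """
    vocabulary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


docs = ["this is a text", "this is not a text document to document"]
vocab = fit(docs)
print(vocab)
```

Each distinct word receives exactly one index, no matter how often it occurs across the documents.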
Functionality Implemented

	
•  transform() (Transformer):

•  get_feature_names():

Output
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)

Input:
["this is a text",
 "this is not a text document to document"]

Output of Transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]

[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]
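The transform step can be sketched in Python along the same lines. This is an illustrative sketch, not the Flink code: it assumes whitespace tokenization and takes the vocabulary from the slide as given. The helper name `transform` is hypothetical.

```python
def transform(document, vocabulary):
    """Map a document to (index, count) pairs using a fitted vocabulary.

    Pairs are listed in order of each term's first appearance in the
    document; terms missing from the vocabulary are skipped.
    """
    counts = {}
    for token in document.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            counts[index] = counts.get(index, 0.0) + 1.0
    return list(counts.items())


# Vocabulary copied from the "Output of fit" example on the slide.
vocabulary = {"this": 1, "is": 2, "a": 3, "text": 4,
              "not": 5, "document": 6, "to": 7}
docs = ["this is a text", "this is not a text document to document"]
vectors = [transform(doc, vocabulary) for doc in docs]
for vector in vectors:
    print(vector)
```

The word "document" occurs twice in the second input, which is why index 6 carries the count 2.0.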
Functionality Implemented

	
•  Parameters added to the CountVectorizer() constructor

-> setMinDF(2), setMaxDF(5)

    MinDF = 2    The sun in the sky is bright. We can see the shining sun, the bright sun
    MaxDF = 5    The sun in the sky is bright. We can see the shining sun, the bright sun

-> setStopwords(List("in", "the"))

    The sun in the sky is bright. We can see the shining sun, the bright sun
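To make the effect of these parameters concrete, here is a hedged Python sketch. It assumes scikit-learn's integer semantics, where MinDF and MaxDF are bounds on the number of documents containing a term; the Flink implementation may define them differently. It also assumes the example text is split into two documents at the sentence boundary. The helper names `tokenize` and `filter_terms` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def filter_terms(documents, min_df=1, max_df=None, stopwords=()):
    """Keep terms whose document frequency lies in [min_df, max_df]
    and which are not stop words."""
    stop = set(stopwords)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    upper = len(documents) if max_df is None else max_df
    return {term for term, n in doc_freq.items()
            if min_df <= n <= upper and term not in stop}


# Assumption: the slide's example text is treated as two documents.
docs = ["The sun in the sky is bright.",
        "We can see the shining sun, the bright sun"]

kept = filter_terms(docs, min_df=2, max_df=5)
kept_without_stopwords = filter_terms(docs, min_df=2, max_df=5,
                                      stopwords=["in", "the"])
print(sorted(kept))
print(sorted(kept_without_stopwords))
```

Under these assumptions only "bright", "sun", and "the" occur in both documents, and the stop word list then removes "the".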
Functionality Implemented

•  Parameters
    -> setNgramRange(List(1,3))
•  Wrote test cases for the functions above and compared the output with scikit-learn's CountVectorizer: accuracy 100%
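The n-gram range parameter can be illustrated with a short Python sketch. It assumes that a range of (1, 3) means all contiguous word sequences of length 1 through 3, as in scikit-learn; the function name `ngrams` is chosen for illustration.

```python
def ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n,
    shortest first, each joined by a single space."""
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


grams = ngrams("this is a text".split(), 1, 3)
print(grams)
```

For the four-word input this yields four unigrams, three bigrams, and two trigrams.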
Sample Output
Overview

Input data files -> CountVectorizer() -> Fit() -> Transform -> getFeatureNames

Parameters:
setStopwords     List["is","the"]
setMinDF         3
setMaxDF         10
setNgramRange    List[1,3]

Input:
["this is a text",
 "this is not a text document to document"]

Output of fit:
(this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)

Output of Transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]

[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]
Fit(Text) function

Fit() overrides fit from Estimator

Input: All documents

Train Dataset Function:
•  GetParameters
•  Call Fit(text)
•  Validate Parameters
•  RegexParsing
•  Filter data on mindf, maxdf
•  Filter data on stopwords
•  Create tuples of ngram range
•  Add in hashset
•  Zipwithindex
•  Output

Flink Operators: Map, ReduceGroup
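The steps of the fit pipeline can be sketched end to end in plain Python. This is a single-machine illustration under stated assumptions, not the Flink code. Document frequency is computed on single tokens before n-grams are formed, following the order listed on the slide. The set of terms is sorted before indexing so the result is deterministic, whereas indexing a hash set directly would give an arbitrary order. All names are hypothetical.

```python
import re
from collections import Counter


def fit(documents, min_df=1, max_df=None, stopwords=(), ngram_range=(1, 1)):
    """Build a term -> index vocabulary following the slide's steps."""
    # Validate parameters.
    min_n, max_n = ngram_range
    if max_df is None:
        max_df = len(documents)
    if min_df < 1 or max_df < min_df:
        raise ValueError("expected 1 <= min_df <= max_df")
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")

    # Regex parsing: lowercase and split into word tokens.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]

    # Document frequency of each token, used for the min_df/max_df filter.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    stop = set(stopwords)
    terms = set()  # the "hashset" step: every term is stored once
    for tokens in tokenized:
        kept = [t for t in tokens
                if min_df <= doc_freq[t] <= max_df and t not in stop]
        # Create n-grams over the surviving tokens.
        for n in range(min_n, max_n + 1):
            for start in range(len(kept) - n + 1):
                terms.add(" ".join(kept[start:start + n]))

    # zipWithIndex analogue; sorted for determinism, 1-based as on the slides.
    return {term: index for index, term in enumerate(sorted(terms), start=1)}


docs = ["this is a text", "this is not a text document to document"]
vocab = fit(docs, min_df=2, stopwords=["is"], ngram_range=(1, 2))
print(vocab)
```

In a distributed setting the per-document work would correspond to a map step and the vocabulary assembly to a grouped reduce, which is presumably what the Map and ReduceGroup operators on the slide refer to.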
Transform(Text)

Transform() overrides transform from Transformer

Input: Model (Fit Output), Document
Provide required parameters

Transform Dataset Function:
•  RegexParsing
•  Create tuples of ngram range
•  Matching (index,1.0)
•  Convert to sparse vector
•  Output

Sparse vector (Flink function): internally uses quicksort, and duplicate entries are summed to get the frequency
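The matching and sparse-vector steps can be sketched as follows. This is an illustrative Python version of the behavior described on the slide: each matched term emits (index, 1.0), then the entries are sorted by index and duplicates are summed. The helper names `match` and `from_coo` and the vector size of 8 (index 0 left unused) are assumptions made for this sketch.

```python
import re


def match(document, vocabulary):
    """Emit one (index, 1.0) entry per token found in the vocabulary."""
    tokens = re.findall(r"\w+", document.lower())
    return [(vocabulary[t], 1.0) for t in tokens if t in vocabulary]


def from_coo(size, entries):
    """Build a sparse vector (indices, values) from coordinate entries.

    Entries are sorted by index and duplicate indices are summed,
    so repeated terms end up as a single frequency value.
    """
    indices, values = [], []
    for index, value in sorted(entries, key=lambda entry: entry[0]):
        if not 0 <= index < size:
            raise IndexError(f"index {index} out of range for size {size}")
        if indices and indices[-1] == index:
            values[-1] += value  # duplicate entry: accumulate the count
        else:
            indices.append(index)
            values.append(value)
    return indices, values


vocabulary = {"this": 1, "is": 2, "a": 3, "text": 4,
              "not": 5, "document": 6, "to": 7}
entries = match("this is not a text document to document", vocabulary)
indices, values = from_coo(8, entries)
print(indices, values)
```

Emitting 1.0 per occurrence and letting the vector construction sum duplicates means the transform step itself needs no separate counting pass.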
Word Cloud

•  Used Lightning Scala API for visualization purposes
•  http://countvectorizer.us
About Me

Roshani Nagmote

●  Master's in Computer Science from University of Utah

●  Data Platform Engineering intern at Ask.com

●  Web Developer at Tata Consultancy Services

Things I love:

●  Playing with my Rini

●  Hiking and visiting new places
