CountVectorizer_ApacheFlink
Roshani Nagmote
CountVectorizer in Flink
1.
CountVectorizer in Flink
Insight Data Engineering Fellowship, Silicon Valley
Roshani Nagmote
2.
Motivation
• Open source contribution to Apache Flink
• Add CountVectorizer to the Flink ML library to help with NLP
• Usage: document similarity / word cloud application
3.
What is CountVectorizer
• Class in scikit-learn, the Python ML library
• Converts a collection of text documents into a matrix of token counts
4.
Functionality Implemented
• fit() (Estimator)
Input: ["this is a text", "this is not a text document to document"]
Output of fit: (this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
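The fit() output above can be reproduced with a few lines of plain Python. This is only a sketch under assumptions, not the Flink ML code: it splits on whitespace and numbers tokens from 1 in order of first appearance, which is what the slide's output suggests. scikit-learn's own CountVectorizer behaves differently by default (it sorts the vocabulary alphabetically, indexes from 0, and drops single-character tokens such as "a").

```python
def fit(documents):
    """Build a token -> index vocabulary.

    Assumption: indices are 1-based and assigned in first-occurrence
    order, chosen only to match the output shown on the slide.
    """
    vocabulary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


docs = ["this is a text", "this is not a text document to document"]
vocab = fit(docs)
print(vocab)
```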
5.
Functionality Implemented
• transform() (Transformer)
• get_feature_names()
Output of get_feature_names: (this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
Input: ["this is a text", "this is not a text document to document"]
Output of transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]
[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]
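To read a transform output, the (index, count) pairs have to be mapped back to tokens through the feature names. The sketch below assumes the 1-based vocabulary shown on the slide; the helper names `get_feature_names` and `decode` are hypothetical and written here in Python only for illustration, not taken from the Flink implementation.

```python
# Vocabulary copied from the slide's fit output.
vocabulary = {"this": 1, "is": 2, "a": 3, "text": 4,
              "not": 5, "document": 6, "to": 7}


def get_feature_names(vocabulary):
    """Return the tokens ordered by their index."""
    return [token for token, _ in sorted(vocabulary.items(), key=lambda kv: kv[1])]


def decode(sparse_vector, vocabulary):
    """Map (index, count) pairs back to token -> count (indices are 1-based)."""
    names = get_feature_names(vocabulary)
    return {names[index - 1]: count for index, count in sparse_vector}


second_doc = [(1, 1.0), (2, 1.0), (5, 1.0), (3, 1.0), (4, 1.0), (6, 2.0), (7, 1.0)]
decoded = decode(second_doc, vocabulary)
print(decoded)
```

Here "document" decodes to a count of 2.0 because it occurs twice in the second input string.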
6.
Functionality Implemented
• Parameters added to the CountVectorizer() constructor
-> setMinDF(2), setMaxDF(5)
minDF = 2: The sun in the sky is bright. We can see the shining sun, the bright sun
maxDF = 5: The sun in the sky is bright. We can see the shining sun, the bright sun
-> setStopwords(List("in", "the"))
The sun in the sky is bright. We can see the shining sun, the bright sun
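A rough Python sketch of how these three parameters could interact on the example text. Assumptions: the text is treated as two documents split at the period, minDF and maxDF are absolute document counts with the same meaning as scikit-learn's min_df and max_df (keep a term when minDF <= document frequency <= maxDF), and the function names are illustrative only.

```python
import re


def document_frequencies(documents):
    """Count in how many documents each token occurs."""
    df = {}
    for doc in documents:
        for token in set(re.findall(r"\w+", doc.lower())):
            df[token] = df.get(token, 0) + 1
    return df


def filter_terms(documents, min_df=1, max_df=None, stopwords=()):
    """Keep tokens whose document frequency lies in [min_df, max_df]
    and which are not stopwords. Returned sorted for readability."""
    df = document_frequencies(documents)
    upper = len(documents) if max_df is None else max_df
    stop = {word.lower() for word in stopwords}
    return sorted(t for t, n in df.items() if min_df <= n <= upper and t not in stop)


docs = ["The sun in the sky is bright.",
        "We can see the shining sun, the bright sun"]
kept = filter_terms(docs, min_df=2, max_df=5, stopwords=["in", "the"])
print(kept)
```

With minDF = 2 only "the", "sun" and "bright" occur in both documents; the stopword list then removes "the".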
7.
Functionality Implemented
• Parameters
-> setNgramRange(List(1,3))
• Writing test cases for the above functions and comparing the output with scikit-learn's CountVectorizer: accuracy 100%
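An n-gram range of (1,3) means every run of one, two, or three consecutive tokens becomes a feature. A minimal Python sketch, assuming tokens are joined with a single space (the `ngrams` helper is hypothetical and not part of the Flink code):

```python
def ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams with low <= n <= high, shortest first."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = ngrams("this is a text".split(), (1, 3))
print(grams)
```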
8.
Sample Output
9.
Sample Output
10-13.
Overview
CountVectorizer()
• Parameters
-> setStopwords: List["is","the"]
-> setMinDF: 3
-> setMaxDF: 10
-> setNgramRange: List[1,3]
• Fit()
Input: data files, e.g. ["this is a text", "this is not a text document to document"]
Output of fit: (this,1) (is,2) (a,3) (text,4) (not,5) (document,6) (to,7)
• Transform
Output of transform:
[(1,1.0) (2,1.0) (3,1.0) (4,1.0)]
[(1,1.0) (2,1.0) (5,1.0) (3,1.0) (4,1.0) (6,2.0) (7,1.0)]
• getFeatureNames
14-23.
Fit(Text) function
Fit(): override fit from Estimator
Input: all documents
• Get parameters
• Call Fit(text)
Train Dataset function:
• Validate parameters
• Regex parsing
• Filter data on minDF, maxDF
• Filter data on stopwords
• Create tuples of ngram range
• Add to hash set
• zipWithIndex
Output
Flink operators: Map, ReduceGroup
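The fit steps listed above can be strung together in a short single-machine sketch. Assumptions, none confirmed by the slides: the steps run in the order they appear in the diagram, tokens are matched with the regex `\w+`, minDF and maxDF are absolute document counts, and the set of features is sorted before numbering so that the result is deterministic (a hash set has no fixed order). Indices start at 1 to match the sample output. The real implementation distributes this work with Flink operators such as Map and ReduceGroup.

```python
import re


def fit(documents, min_df=1, max_df=None, stopwords=(), ngram_range=(1, 1)):
    # Validate parameters.
    low, high = ngram_range
    if low < 1 or high < low:
        raise ValueError("ngram_range must satisfy 1 <= low <= high")
    upper = len(documents) if max_df is None else max_df
    if min_df < 1 or upper < min_df:
        raise ValueError("need 1 <= min_df <= max_df")
    stop = {word.lower() for word in stopwords}

    # Regex parsing: one token list per document.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]

    # Document frequency of every token.
    df = {}
    for tokens in tokenized:
        for token in set(tokens):
            df[token] = df.get(token, 0) + 1

    # Filter on minDF/maxDF, then on stopwords.
    kept = [[t for t in tokens if min_df <= df[t] <= upper and t not in stop]
            for tokens in tokenized]

    # Create n-grams and collect the distinct ones in a set.
    features = set()
    for tokens in kept:
        for n in range(low, high + 1):
            for start in range(len(tokens) - n + 1):
                features.add(" ".join(tokens[start:start + n]))

    # "zipWithIndex": give every distinct feature an index (1-based here).
    return {feature: i for i, feature in enumerate(sorted(features), start=1)}


docs = ["this is a text", "this is not a text document to document"]
model = fit(docs)
print(model)
```

Because the features are sorted here, the indices differ from the slide's sample output, which numbers tokens in order of first appearance.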
24-29.
Transform(Text)
Transform(): override from Transformer
Inputs: model (fit output), document
• Provide required parameters
Transform Dataset function:
• Regex parsing
• Create tuples of ngram range
• Matching: (index, 1.0)
• Convert to sparse vector
Output: sparse vector (Flink function). Internally it uses quicksort, and duplicate entries are summed to get the frequency.
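The matching and sparse-vector steps can be sketched in Python as follows. Each token found in the model emits an (index, 1.0) pair; sorting the pairs and summing values that share an index then yields the term frequencies. This imitates the behaviour the slide attributes to Flink's sparse vector construction and is not the Flink code itself; `to_sparse` and the `\w+` tokenizer are assumptions made for illustration.

```python
import re


def to_sparse(pairs):
    """Sort (index, value) pairs by index and sum the values of duplicates."""
    merged = []
    for index, value in sorted(pairs):
        if merged and merged[-1][0] == index:
            merged[-1] = (index, merged[-1][1] + value)
        else:
            merged.append((index, value))
    return merged


def transform(document, model, ngram_range=(1, 1)):
    """Emit (index, 1.0) for every n-gram present in the model, then merge."""
    tokens = re.findall(r"\w+", document.lower())
    low, high = ngram_range
    pairs = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            gram = " ".join(tokens[start:start + n])
            if gram in model:
                pairs.append((model[gram], 1.0))
    return to_sparse(pairs)


# Model copied from the slide's fit output.
model = {"this": 1, "is": 2, "a": 3, "text": 4, "not": 5, "document": 6, "to": 7}
vector = transform("this is not a text document to document", model)
print(vector)
```

The two (6, 1.0) pairs produced by "document" collapse into a single (6, 2.0) entry. The result is ordered by index, whereas the slide's sample output lists entries in order of first appearance.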
30.
Word Cloud
• Used the Lightning Scala API for visualization
• http://countvectorizer.us
31.
About Me
Roshani Nagmote
● Master's in Computer Science from the University of Utah
● Data Platform Engineering intern at Ask.com
● Web Developer at Tata Consultancy Services
Things I love:
● Playing with my Rini
● Hiking and visiting new places