Preprocessing
1. Steps involved in Preprocessing:
1. Tokenization:
● Tokenization: the process of breaking a stream of text into words.
● Removal of punctuation marks and numbers.
● Replacing ‘\n’ (newline characters) by spaces.
● Splitting the string using space as the delimiter.
● Output: tokens (words).
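The tokenization steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the original implementation: the function name `tokenize` and the rule "drop everything that is not a letter or whitespace" are assumptions made for the example.

```python
import re

def tokenize(text):
    """Break a stream of text into word tokens, following the listed steps."""
    # Step 1: remove punctuation marks and numbers
    # (assumption: any character that is not a letter or whitespace is dropped).
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # Step 2: replace newline characters by spaces.
    text = text.replace("\n", " ")
    # Step 3: split on spaces; repeated spaces leave empty strings, so drop them.
    return [tok for tok in text.split(" ") if tok]

tokens = tokenize("Fishing, in 2 rivers,\nis fun!")
print(tokens)  # ['Fishing', 'in', 'rivers', 'is', 'fun']
```

Removing punctuation outright (rather than replacing it with a space) joins words such as "don't" into "dont"; whether that is desirable depends on the corpus.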
2. Graphical view of steps in Tokenization:
Stream of text → Removal of punctuation marks → Replacing ‘\n’ by spaces → Using spaces as delimiter → Tokens (words)
3. 2. Removal of stop words:
● Input: the list of tokens.
● Removing unnecessary words (stop words) such as the, an, so, after, all, etc.
● Output: a list of meaningful words.
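A minimal Python sketch of stop-word removal is given below. The `STOP_WORDS` set is a small hypothetical list built around the examples in the text, and the lower-case comparison is an assumption; a real system would normally use a much longer list.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "an", "so", "after", "all", "a", "is", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    # Compare in lower case (an assumption) so that "The" is treated like "the".
    return [tok for tok in tokens if tok.lower() not in stop_words]

meaningful = remove_stop_words(["The", "fisher", "is", "fishing", "after", "all"])
print(meaningful)  # ['fisher', 'fishing']
```

Storing the stop words in a set keeps each membership check constant-time on average, which matters once the token lists become long.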
4. 3. Stemming:
● Stemming: the process of reducing inflected words to their stem, base or root form.
For example, a stemming algorithm may reduce “fishing”, “fished”, “fish” and “fisher” to the root word “fish”.
● Stemmer used: Porter Stemmer algorithm.
● Removes suffixes such as ‘-ee’, ‘-ed’, ‘-ing’, ‘-ence’, ‘-er’, etc., and adds ‘y’ or ‘i’ as required.
● Doesn’t always give linguistically accurate roots.
Example: stem(flying) = fli
stem(fly) = fli
● All inflected forms of a word get the same root, which serves our purpose.
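The idea of suffix stripping can be illustrated with the toy stemmer below. It is deliberately NOT the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the name `toy_stem`, the `min_stem` parameter and the reduced suffix list are assumptions made for the example.

```python
# A subset of the suffixes named in the text, tried longest first.
SUFFIXES = ("ence", "ing", "ed", "er")

def toy_stem(word, min_stem=3):
    """Strip one known suffix. Illustration only, not the Porter algorithm."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least `min_stem` letters so short words such as "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[: -len(suffix)]
            break
    # Porter-like touch: a trailing "y" becomes "i", so "fly" and "flying" meet.
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

for w in ["fishing", "fished", "fish", "fisher", "flying", "fly"]:
    print(w, "->", toy_stem(w))
```

As in the slide's example, "flying" and "fly" both end up as "fli": the root is not a dictionary word, but every inflected form maps to the same string, which is all the later steps need.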
5. 4. Vocabulary creation:
● Vocabulary: in general, a vocabulary is a set of words.
● Here, vocabulary = union of the words from all files.
● For each document: convert the list obtained after stemming into a set and take the union with the sets of the other documents.
● The vocabulary is processed further for tf-idf evaluation.
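Vocabulary creation as described above amounts to a union of sets. A minimal Python sketch follows; the function name `build_vocabulary` and the file names and stems in `docs` are hypothetical, chosen only to illustrate the step.

```python
def build_vocabulary(stemmed_docs):
    """Return the union of the stem sets of all documents."""
    vocabulary = set()
    for stems in stemmed_docs:
        # Convert the stemmed list to a set, then take the union.
        vocabulary |= set(stems)
    return vocabulary

# Hypothetical stemmed output for two files.
docs = {
    "doc1.txt": ["fish", "river", "fish"],
    "doc2.txt": ["fli", "river"],
}
vocabulary = build_vocabulary(docs.values())
print(sorted(vocabulary))  # ['fish', 'fli', 'river']
```

Sorting the vocabulary before further processing gives every word a stable index, which is convenient when building tf-idf vectors later.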