Preprocessing

Steps involved in preprocessing:

1. Tokenization:

● Tokenization: the process of breaking a stream of text into words.

● Remove punctuation marks and numbers.

● Replace newline characters (‘\n’) with spaces.

● Split the string using a space as the delimiter.

● Output: a list of tokens (words).

Graphical view of steps in tokenization:

Stream of text → Removal of punctuation marks → Replacing ‘\n’ by spaces → Using spaces as delimiter → Tokens (words)
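
The tokenization steps can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name tokenize and the choice to keep only ASCII letters are assumptions made for the example.

```python
import re


def tokenize(text):
    """Turn a stream of text into a list of word tokens."""
    # Step 1: remove punctuation marks and numbers.
    # Assumption: only ASCII letters and whitespace are kept.
    cleaned = re.sub(r"[^A-Za-z\s]", "", text)
    # Step 2: replace newline characters with spaces.
    cleaned = cleaned.replace("\n", " ")
    # Step 3: split on the space delimiter, dropping the empty
    # strings that consecutive spaces would otherwise produce.
    return [token for token in cleaned.split(" ") if token]


tokens = tokenize("Hello, world!\nIt is 2024.")
print(tokens)
```

Because punctuation is deleted rather than replaced, text such as "a,b" collapses into a single token; replacing punctuation with a space instead would be one way to avoid that.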

2. Removal of stop words:

● Input: the list of tokens.

● Remove uninformative words such as "the", "an", "so", "after", "all", etc. (stop words).

● Output: a list of meaningful words.
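
Stop-word removal amounts to filtering the token list against a fixed set. The sketch below assumes a tiny hand-written stop-word set and a case-insensitive comparison; both are illustrative choices, and a real stop-word list is much longer.

```python
# Illustrative stop-word set (an assumption; real lists contain
# a few hundred entries).
STOP_WORDS = {"the", "an", "so", "after", "all", "a", "is", "of"}


def remove_stop_words(tokens):
    """Return the tokens that are not stop words, preserving order."""
    # Compare in lower case so that "The" and "the" are both removed.
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["The", "cat", "sat", "after", "all"]))
```

Using a set rather than a list keeps each membership check at constant time, which matters once the stop-word list and the documents grow.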

3. Stemming:

● Stemming: the process of reducing inflected words to their stem, base or root form.
For example, a stemming algorithm reduces "fishing", "fished", "fish" and "fisher"
to the root word "fish".

● Stemmer used: Porter stemmer algorithm.

● Removes suffixes such as ‘-ee’, ‘-ed’, ‘-ing’, ‘-ence’, ‘-er’, etc., and adds ‘y’ or ‘i’ as required.

● Does not always give linguistically accurate roots.
Example: stem(flying) = fli
stem(fly) = fli

● All inflected forms map to the same root, which serves our purpose.
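
The idea that all inflected forms should collapse to one root can be shown with a deliberately simplified suffix stripper. This is NOT the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the function toy_stem, its suffix list and its length threshold are assumptions chosen only to reproduce the examples above. In practice an existing Porter implementation would be used instead.

```python
# Toy suffix list (an assumption; far smaller than Porter's rule set).
SUFFIXES = ("ence", "ing", "ed", "er")


def toy_stem(word):
    """Strip one common suffix, then map a final 'y' to 'i'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip only when at least three letters remain, so short
        # words such as "red" are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Replace a final 'y' with 'i' so "fly" and "flying" agree.
    if len(word) > 2 and word.endswith("y"):
        word = word[:-1] + "i"
    return word


for w in ["fishing", "fished", "fish", "fisher", "flying", "fly"]:
    print(w, "->", toy_stem(w))
```

The root "fli" is not a dictionary word, but that does not matter here: the later steps only need every form of a word to map to the same string.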

4. Vocabulary creation:

● Vocabulary: in general, the set of distinct words in a collection of text.

● Vocabulary = union of the words from all files.

● For each document: convert the list obtained after stemming into a set and
take the union across documents.

● The vocabulary is then processed further for tf-idf evaluation.
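
Vocabulary creation maps directly onto Python's built-in set type. The sketch below assumes each document has already been reduced to a list of stems; the function name build_vocabulary and the sample data are hypothetical.

```python
def build_vocabulary(stemmed_docs):
    """Return the union of the stems found in all documents."""
    vocabulary = set()
    for stems in stemmed_docs:
        # Converting the list to a set removes duplicates within the
        # document; |= merges it into the running union.
        vocabulary |= set(stems)
    return vocabulary


docs = [["fish", "fli", "fish"], ["fli", "cat"]]
print(sorted(build_vocabulary(docs)))
```

Sorting the result gives each word a stable position, which is convenient if the vocabulary is later used to index the columns of a tf-idf matrix.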
