Programming Assignment 2
Preprocessing
Before building the language model, the following preprocessing steps should be performed:
1. Input text should be split into sentences using an existing sentence boundary detector (e.g. the sentence tokenizer from NLTK, spaCy, etc.).
2. Each sentence should be split into words using an existing word tokenizer (e.g. the word tokenizer from NLTK, spaCy, etc.).
3. Sentences with fewer than n tokens should be discarded.
4. A start-of-sentence token and an end-of-sentence token should be added to each sentence.
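
The four steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the required solution: the marker strings `<s>` and `</s>` are assumed here because the handout text does not show the actual symbols, the function name `preprocess` is hypothetical, and the two regex-based tokenizers are naive stand-ins so that the sketch runs without any external library. The length check in step 3 is applied before the markers are added, so the markers do not count toward n.

```python
import re

# Assumed marker strings; the assignment text does not show the actual symbols.
BOS = "<s>"
EOS = "</s>"


def simple_sent_tokenize(text):
    """Naive stand-in for a real sentence boundary detector:
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(sentence):
    """Naive stand-in for a real word tokenizer:
    runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(text, n,
               sent_tokenize=simple_sent_tokenize,
               word_tokenize=simple_word_tokenize):
    """Return a list of token lists, one per kept sentence."""
    sentences = []
    for sent in sent_tokenize(text):          # step 1: sentence splitting
        tokens = word_tokenize(sent)          # step 2: word tokenization
        if len(tokens) < n:                   # step 3: drop short sentences
            continue
        sentences.append([BOS] + tokens + [EOS])  # step 4: add markers
    return sentences


if __name__ == "__main__":
    for sentence in preprocess("I like tea. Yes. She reads books daily.", n=3):
        print(sentence)
```

Since the assignment asks for an existing tokenizer, the stand-ins would normally be replaced, for example by passing `sent_tokenize` and `word_tokenize` imported from `nltk.tokenize` as the last two arguments (NLTK needs its tokenizer data downloaded first).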
