Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
Tokenizer: Breaks up a single string into smaller pieces, called tokens.
You define the splitting rules that work best for your data.
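As a rough illustration of "you define the rules", here is a minimal Python sketch in which the splitting rule is just a regular expression handed to a tokenizer factory. The names `make_tokenizer`, `split_on_commas`, and `split_on_non_alphanumerics` are made up for this example and do not come from any particular search library.

```python
import re

def make_tokenizer(pattern):
    """Return a tokenizer that splits a string on every match of `pattern`.

    Hypothetical helper for illustration only, not a library API.
    """
    splitter = re.compile(pattern)

    def tokenize(text):
        # re.split can yield empty strings at the edges; drop them.
        return [token for token in splitter.split(text) if token]

    return tokenize

# Two example rules (assumptions, chosen only to show that the rule is yours to pick).
split_on_commas = make_tokenizer(r",")
split_on_non_alphanumerics = make_tokenizer(r"[^A-Za-z0-9]+")

doc2 = "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
doc3 = ("And the final score is: no TARDIS, no screwdriver, "
        "two minutes to spare. Who da man?!")

print(split_on_non_alphanumerics(doc2))
print(split_on_commas(doc3))
```

The same input gives very different tokens depending on the rule: the second rule strips all punctuation from Doc2, while the comma rule leaves Doc3 in three long chunks with their spaces and punctuation intact.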
Whitespace Tokenizer
Just break the string into tokens wherever there is whitespace. So we get something like: