Tag similarity. Using the del.icio.us API, create a dataset of tags and items. Use this to calculate similarity between tags and see if you can find any that are almost identical. Find some items that could have been tagged “programming” but were not.
Word separation. The separatewords method currently considers any nonalphanumeric character to be a separator, meaning it will not properly index entries like “C++,” “$20,” “Ph.D.,” or “617-555-1212.” What is a better way to separate words? Does using whitespace as a separator work? Write a better word separation function.