What do we do?
• Zemanta – Personal Writing Assistant
  - on your current platform
• While bloggers write we suggest:
  - images
  - related articles
  - in-text links
  - tags
Some stats
• 80k bloggers monthly
• 1.3 million posts enhanced in 2011
How does it work
• Natural Language Processing
• Big database of “meanings” (entities, concepts, topics)
• Word Sense Disambiguation
  • Linking out to Wikipedia, Freebase, …
  • Categorization, Named Entity Recognition
• Information Retrieval
  • Solr based, using features from NLP
  • With some twists
“Text Understanding”
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is (semi) English language
- Has to work across all domains in the open world
  - music, celebrities, finance, entertainment, politics, gardening, parenting, …
Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
- Includes linguistic and semantic properties and unstructured data
- Present in two forms:
  - in “original” custom-built triple store on top of MySQL (150 GB)
  - processed into a 7 GB optimized “memory mapped dump”
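To illustrate why a memory-mapped dump is attractive for lookups, here is a minimal sketch in Python. The slide does not describe the actual dump format, so the record layout (a sorted array of fixed-size records holding an entity id and a prior score) and the names `RECORD`, `write_dump` and `lookup` are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: uint32 entity id + float32 prior score.
RECORD = struct.Struct("<If")

def write_dump(path, records):
    """Write (entity_id, prior) pairs sorted by id as fixed-size records."""
    with open(path, "wb") as f:
        for entity_id, prior in sorted(records):
            f.write(RECORD.pack(entity_id, prior))

def lookup(mm, entity_id):
    """Binary search over the mapped file; only the touched pages are read."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        found_id, prior = RECORD.unpack_from(mm, mid * RECORD.size)
        if found_id == entity_id:
            return prior
        if found_id < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "entities.bin")
        write_dump(path, [(42, 0.5), (7, 0.25), (1000, 0.125)])
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            print(lookup(mm, 42))
```

Because the file is mapped rather than parsed, several worker processes can share the same pages through the operating system's page cache, and start-up does not require loading the whole dump.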
Analysis pipeline
- Known phrases extraction (Aho-Corasick)
- Named Entity Extraction
- Triple store
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics etc.
- Disambiguated entities
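The known-phrase extraction step names Aho-Corasick, which finds every occurrence of many dictionary phrases in a single pass over the text. The following is a minimal textbook sketch in pure Python, not Zemanta's implementation; the function names and the sample phrases are made up for illustration.

```python
from collections import deque

def build_automaton(phrases):
    """Build the trie, failure links and output lists for a set of phrases."""
    goto, fail, out = [{}], [0], [[]]
    for phrase in phrases:
        node = 0
        for ch in phrase:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(phrase)
    # Breadth-first pass: each failure link points to the longest proper
    # suffix of the node's path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_phrases(text, automaton):
    """Return (start_offset, phrase) for every dictionary phrase in text."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches

if __name__ == "__main__":
    automaton = build_automaton(["new york", "york", "new york times"])
    print(find_phrases("the new york times", automaton))
```

Overlapping candidates such as "york" inside "new york times" are all reported; choosing among them is left to the later evaluation and disambiguation stages listed on the slide.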
Connecting content
• Indexing blogosphere and mediasphere
• Solr based index
  • Twist: complicated queries – 50 terms
• Filtering out spam is “fun”
• Probably best “related content” in terms of accuracy
• Coming soon: social signal
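A query with around 50 terms could be assembled from weighted entities extracted from the post being written. The sketch below only builds the request URL using standard Solr select parameters (`q`, `rows`, `fl`, `wt`); the field name `entities`, the core name `posts`, the weights and both function names are assumptions, since the slide does not show the actual query shape.

```python
from urllib.parse import urlencode

def build_related_query(weighted_terms, field="entities", max_terms=50):
    """Turn {term: weight} into a boosted OR query, keeping the top terms."""
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    clauses = []
    for term, weight in top[:max_terms]:
        # Escape backslashes and quotes so the term stays one quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    return " OR ".join(clauses)

def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler (the request is not sent)."""
    params = {"q": query, "rows": rows, "fl": "id,score", "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"

if __name__ == "__main__":
    # Hypothetical entity weights produced by the analysis step.
    query = build_related_query({"Solr": 2.0, "Apache Lucene": 1.5})
    print(query)
    print(build_select_url("http://localhost:8983/solr/posts", query))
```

Capping the number of clauses keeps query cost bounded, while the per-term boosts let the most salient entities dominate the ranking.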
But why just for bloggers? Let's open up the API!