6. Strip mine
• A system for Slovenian National television in 2006
• Closed captioning → web page for each episode of each show
• Natural Language Processing, Information Retrieval...
13. Roller coaster
• 12 August: Deadline
• 20 August: Shortlist
• 23 August: Phone interview
• 24 August: Results
• 3 September: London week start
• 7 September: London week end
• 16 September → London
20. And then ...
• Figuring out that the US is our target market
• Figuring out where in the US to be and who to have here
• Partnerships
• And naturally the business model
22. What do we do?
• Zemanta – Personal Writing Assistant
- on your current platform
• While bloggers write we suggest:
- images
- related articles
- in-text links
- tags
26. Some stats
• 80k bloggers monthly
• 1.3 million posts enhanced in 2011
27. How does it work
• Natural Language Processing
• Big database of “meanings” (entities, concepts, topics)
• Word Sense Disambiguation
• Linking out to Wikipedia, Freebase, …
• Categorization, Named Entity Recognition
• Information Retrieval
• Solr based, using features from NLP
• With some twists
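To make the "Word Sense Disambiguation" bullet concrete, here is a minimal sketch of one simple approach: pick the candidate meaning whose known context words overlap most with the surrounding text. This is only an illustration, not the method used in the product. The `CANDIDATES` table, the entity names, and the `disambiguate` function are all hypothetical.

```python
from collections import Counter

# Hypothetical miniature "meanings" database: each surface form maps to
# candidate entities, each described by a few context words.
CANDIDATES = {
    "jaguar": {
        "Jaguar_(animal)": ["cat", "jungle", "predator", "species"],
        "Jaguar_Cars": ["car", "engine", "luxury", "british"],
    },
}


def disambiguate(surface_form, context_words):
    """Return the candidate entity whose context words overlap most with the text."""
    context = Counter(word.lower() for word in context_words)
    best, best_score = None, -1
    for entity, keywords in CANDIDATES.get(surface_form.lower(), {}).items():
        # Score = how often the candidate's context words occur in the text.
        score = sum(context[keyword] for keyword in keywords)
        if score > best_score:
            best, best_score = entity, score
    return best


text = "The new Jaguar has a powerful engine and a luxury interior"
print(disambiguate("Jaguar", text.split()))
```

A real system would draw the candidates and their features from the background knowledge base rather than from a hand-written table.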
29. “Text Understanding”
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is (semi-)English language
- Has to work across all domains in the open world
  - music, celebrities, finance, entertainment, politics, gardening, parenting, …
31. Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
- Includes linguistic and semantic properties and unstructured data
- Present in two forms:
  - in “original” custom-built triple store on top of MySQL (150 GB)
  - processed into a 7 GB optimized “memory mapped dump”
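The idea of a "memory mapped dump" can be sketched as follows: fixed-width records are written to a file sorted by key, and lookups binary-search the mapped file without loading it into the Python heap. The record layout (`RECORD`), the file name, and the `write_dump` and `lookup` helpers are assumptions made for illustration; the actual dump format is not described in the slides.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 8-byte entity id followed by a 4-byte float score.
RECORD = struct.Struct("<Qf")


def write_dump(path, records):
    """Write (entity_id, score) pairs as fixed-width records sorted by id."""
    with open(path, "wb") as f:
        for entity_id, score in sorted(records):
            f.write(RECORD.pack(entity_id, score))


def lookup(mm, entity_id):
    """Binary-search the mapped file for entity_id; return its score or None."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        found_id, score = RECORD.unpack_from(mm, mid * RECORD.size)
        if found_id == entity_id:
            return score
        if found_id < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None


path = os.path.join(tempfile.mkdtemp(), "entities.dump")
write_dump(path, [(42, 0.5), (7, 0.25), (1000, 1.0)])
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, 42))
    mm.close()
```

Because the operating system pages the file in on demand, several worker processes can share one read-only copy of the data.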
32. Analysis pipeline
• Known phrases extraction (Aho-Corasick)
• Named Entity Extraction
• Triple store
• Surface form features evaluation
• Statistical comparison to background knowledge
• Semantic coherence and hand-tuned heuristics etc.
• Disambiguated entities
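The known-phrase extraction step names Aho-Corasick, which finds every occurrence of many phrases in a single pass over the text. Below is a compact, self-contained sketch of that algorithm. The function names and the sample phrases are hypothetical, and a production system would more likely rely on an optimized library than on this textbook version.

```python
from collections import deque


def build_automaton(phrases):
    """Build Aho-Corasick goto/fail/output tables for the given phrases."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: build a trie of all phrases.
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state (shorter suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_index, phrase) for every known phrase found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches


automaton = build_automaton(["new york", "york", "new york times"])
print(find_phrases("the new york times", automaton))
```

Overlapping matches are all reported, which leaves it to the later stages of the pipeline to decide which surface form to keep.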
34. Connecting content
• Indexing blogosphere and mediasphere
• Solr based index
• Twist: complicated queries – 50 terms
• Filtering out spam is “fun”
• Probably the best “related content” in terms of accuracy
• Coming soon: social signal
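As a rough illustration of a "complicated query" with up to 50 terms, the sketch below turns weighted terms (for example, features produced by NLP analysis) into a boosted query string in the standard Lucene/Solr syntax `field:"term"^boost`. The field name `text`, the returned fields, and the `build_related_query` helper are assumptions; the real index schema and queries are not given in the slides.

```python
from urllib.parse import urlencode


def build_related_query(weighted_terms, field="text", max_terms=50, rows=10):
    """Build URL-encoded Solr select parameters from {term: weight} pairs."""
    # Keep only the highest-weighted terms, up to max_terms.
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    clauses = []
    for term, weight in top[:max_terms]:
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    params = {"q": " OR ".join(clauses), "rows": rows, "fl": "id,score"}
    return urlencode(params)


terms = {"natural language processing": 2.0, "solr": 1.5, "blogging": 0.5}
print(build_related_query(terms))
```

The resulting string would be appended to a Solr core's `/select` endpoint; the boosts let stronger terms dominate the ranking.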
35. But why just for bloggers?
Let's open up the API!
40. Some takeaways
• Accelerators are good
• World is getting flatter, but it will never be flat
• Start monetizing soon – to learn, not to earn
• Be where your market is
• Many markets left to innovate in