Zemanta Tech Talk at Audible
Audible Tech Talk 23. April 2012 Andraz Tori email@example.com @andraz
Today's plan
• Short story of Zemanta
• The Zemanta technology
Where am I right now?
Wonders of modern communication
Strip mine
• A system for Slovenian National television in 2006
• Closed captioning → web page for each episode of each show
• Natural Language Processing, Information Retrieval...
Start-up? Why not?
Tour de Slovénie
Seedcamp
• First European program inspired by YC (2007)
• London based
• 3 months, 50.000 EUR / 10%
Roller coaster
12. August Deadline
20. August Shortlist
23. August Phone interview
24. August Results
3. September London week start
7. September London week end
16. September ==> London
3 months in London
Back to Ljubljana
And then ...
• Figuring out the US is our target market
• Figuring out where in the US to be and who to have here
• Partnerships
• And naturally the business model
What do we do?
• Zemanta – Personal Writing Assistant
  - on your current platform
• While bloggers write we suggest:
  - images
  - related articles
  - in-text links
  - tags
Some stats
• 80k bloggers monthly
• 1.3 million posts enhanced in 2011
How does it work
• Natural Language Processing
• Big database of “meanings” (entities, concepts, topics)
• Word Sense Disambiguation
  • Linking out to Wikipedia, Freebase, …
  • Categorization, Named Entity Recognition
• Information Retrieval
  • Solr based, using features from NLP
  • With some twists
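To make the word sense disambiguation step concrete, here is a toy sketch. It is not Zemanta's implementation: the candidate table, the prior values, the related-term sets, and the `coherence_weight` parameter are all invented for illustration. It only shows the general idea of combining a prior for each meaning with its overlap against the surrounding text.

```python
# Toy word sense disambiguation: pick the candidate meaning whose related
# terms overlap most with the other terms found in the text.
# All data below is hypothetical and exists only to illustrate the idea.

CANDIDATES = {
    "apple": [
        {"id": "Apple_Inc.", "prior": 0.6,
         "related": {"iphone", "steve jobs", "mac"}},
        {"id": "Apple_(fruit)", "prior": 0.4,
         "related": {"orchard", "pie", "cider"}},
    ],
}


def disambiguate(surface_form, context_terms, coherence_weight=1.0):
    """Return the best candidate id for surface_form, or None if unknown."""
    context = {t.lower() for t in context_terms}
    best_id, best_score = None, float("-inf")
    for cand in CANDIDATES.get(surface_form.lower(), []):
        # Score = how common this meaning is + how well it fits the context.
        overlap = len(cand["related"] & context)
        score = cand["prior"] + coherence_weight * overlap
        if score > best_score:
            best_id, best_score = cand["id"], score
    return best_id
```

With no context the more common meaning wins on its prior; context terms such as "pie" or "orchard" pull the result toward the fruit.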
“Text Understanding”
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is (semi) English language
- Has to work across all domains in the open world
  - music, celebrities, finance, entertainment, politics, gardening, parenting, …
Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
- Includes linguistic and semantic properties and unstructured data
- Present in two forms:
  - in “original” custom built triple store on top of MySQL (150 GB)
  - processed into 7 GB optimized “memory mapped dump”
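The idea behind a memory mapped dump can be sketched with Python's standard `mmap` and `struct` modules. The record layout here (an 8-byte entity id followed by a 4-byte score) and the function names are assumptions made for illustration; the actual format of Zemanta's dump is not described in the talk.

```python
import mmap
import struct

# Hypothetical fixed-width record: 8-byte entity id, 4-byte float score.
RECORD = struct.Struct("<Qf")


def write_dump(path, records):
    """Write (entity_id, score) pairs, sorted by id, as fixed-width records."""
    with open(path, "wb") as f:
        for entity_id, score in sorted(records):
            f.write(RECORD.pack(entity_id, score))


def open_dump(path):
    """Map the dump read-only; the OS pages data in on demand."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(mm, entity_id):
    """Binary-search the mapped file for entity_id; return its score or None."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, score = RECORD.unpack_from(mm, mid * RECORD.size)
        if key == entity_id:
            return score
        if key < entity_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because the records are sorted and fixed-width, a lookup touches only a handful of pages, and several processes mapping the same file share one copy in the OS page cache.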
Analysis pipeline
- Known phrases extraction (Aho-Corasick) and Named Entity Extraction
- Triple store
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics etc.
- Disambiguated entities
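The known-phrase extraction step names Aho-Corasick, which finds every occurrence of many phrases in a single pass over the text. Below is a minimal pure-Python sketch of that algorithm. The function names and the example phrases are illustrative, and it matches raw lowercase characters without any word-boundary or tokenization logic that a production system would likely need.

```python
from collections import deque


def build_automaton(phrases):
    """Build an Aho-Corasick automaton over lowercased phrases."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # phrases that end at each state
    for phrase in phrases:
        state = 0
        for ch in phrase.lower():
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase.lower())
    # Breadth-first pass: failure links point to the longest proper suffix
    # of the current prefix that is also a prefix in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_offset, phrase) for every known phrase in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches
```

A usage example: `find_phrases("The New York Times", build_automaton(["new york", "york", "new york times"]))` reports all three phrases, including the overlapping ones, which is what lets later stages choose among competing surface forms.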
Connecting content
• Indexing blogosphere and mediasphere
• Solr based index
  • Twist: complicated queries – 50 terms
• Filtering out spam is “fun”
• Probably best “related content” in terms of accuracy
• Coming soon: social signal
But why just for bloggers? Let's open up the API!
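A query with around 50 terms could be assembled from the weighted output of the NLP stage roughly as follows. This is a sketch under assumptions: the field name `text`, the weights, and the function name are hypothetical, and the talk does not say how Zemanta actually builds its queries. The `q`, `rows`, and `wt` parameters and the `^` boost operator are standard Solr.

```python
from urllib.parse import urlencode


def build_related_query(weighted_terms, field="text", max_terms=50, rows=10):
    """Build Solr select parameters from (term, weight) pairs."""
    # Keep only the highest-weighted terms, up to max_terms.
    top = sorted(weighted_terms, key=lambda tw: tw[1], reverse=True)[:max_terms]
    clauses = []
    for term, weight in top:
        # Escape backslashes and quotes so the term is a valid quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    return urlencode({"q": " OR ".join(clauses), "rows": rows, "wt": "json"})
```

The returned string can be appended to a Solr select URL. Boosting each term by its weight lets entities the analysis is confident about dominate the ranking while weaker terms still contribute.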
Some API users
Back to reality.
Age of “smart”
Blog me up, Scotty! 23. April 2012
Some takeaways
• Accelerators are good
• World is getting flatter
  But it will never be flat
• Start monetizing soon – to learn, not to earn
• Be where your market is
• Many markets left to innovate in