Zemanta Tech Talk at Audible



Tech talk about Zemanta's st





Presentation Transcript

  • Audible Tech Talk, 23 April 2012. Andraz Tori, andraz@zemanta.com, @andraz
  • Today's plan
      • Short story of Zemanta
      • The Zemanta technology
  • Where am I right now?
  • Wonders of modern communication
  • Ljubljana
  • Strip mine
      • A system for Slovenian National television in 2006
      • Closed captioning → web page for each episode of each show
      • Natural Language Processing, Information Retrieval...
  • Start-up? Why not?
  • Tour de Slovénie
  • Sales
  • Seedcamp
      • First European program inspired by YC (2007)
      • London based
      • 3 months, 50,000 EUR / 10%
  • Roller coaster
      • 12 August: Deadline
      • 20 August: Shortlist
      • 23 August: Phone interview
      • 24 August: Results
      • 3 September: London week start
      • 7 September: London week end
      • 16 September: → London
  • 3 months in London
  • Back to Ljubljana
  • And then ...
      • Figuring out that the US is our target market
      • Figuring out where in the US to be and who to have here
      • Partnerships
      • And naturally the business model
  • Technology
  • What do we do?
      • Zemanta: a personal writing assistant on your current platform
      • While bloggers write, we suggest:
          - images
          - related articles
          - in-text links
          - tags
  • Some stats
      • 80k bloggers monthly
      • 1.3 million posts enhanced in 2011
  • How does it work
      • Natural Language Processing
      • Big database of “meanings” (entities, concepts, topics)
      • Word Sense Disambiguation
          - Linking out to Wikipedia, Freebase, …
          - Categorization, Named Entity Recognition
      • Information Retrieval
          - Solr based, using features from NLP
          - With some twists
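To make the word sense disambiguation step concrete, here is a minimal sketch that picks a meaning for an ambiguous mention by combining a prior with keyword overlap against the surrounding text. The candidate records, their `prior` and `keywords` fields, and the scoring rule are all assumptions made for illustration; the slides do not describe how Zemanta actually scores candidates.

```python
def disambiguate(context_words, candidates):
    """Return the id of the candidate that best fits the context.

    Score = number of candidate keywords seen in the context + a prior.
    Both the fields and the scoring rule are hypothetical.
    """
    context = {word.lower() for word in context_words}

    def score(candidate):
        overlap = len(context & candidate["keywords"])
        return overlap + candidate["prior"]

    return max(candidates, key=score)["id"]


# Two hypothetical meanings of the mention "Apple".
apple_candidates = [
    {"id": "Apple_Inc.", "prior": 0.6, "keywords": {"iphone", "mac", "company"}},
    {"id": "Apple_(fruit)", "prior": 0.4, "keywords": {"fruit", "tree", "orchard", "pie"}},
]

sentence = "We baked a pie with fruit from the orchard".split()
chosen = disambiguate(sentence, apple_candidates)
```

With no supporting context the prior decides, so the more common meaning wins by default; context evidence can override it.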
  • [Diagram: Plain text (article) → Analysis → Semantic search → Content suggestions; Analysis draws on Background knowledge, Semantic search draws on Indexed content]
  • “Text Understanding”
      - Input is a meaningful chunk of text (not a keyword or a phrase)
      - Input is (semi) English language
      - Has to work across all domains in the open world: music, celebrities, finance, entertainment, politics, gardening, parenting, …
  • Background knowledge
      - Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
      - Includes linguistic and semantic properties and unstructured data
      - Present in two forms:
          - in the “original” custom-built triple store on top of MySQL (150 GB)
          - processed into a 7 GB optimized “memory mapped dump”
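A triple store on top of a relational database can be sketched as a single three-column table with indexes for the common lookup directions. The example below uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the table layout, index names, helper functions, and the `entity:ljubljana` identifier are assumptions for illustration, not Zemanta's actual schema.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE triples (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    );
    CREATE INDEX idx_subject_predicate ON triples (subject, predicate);
    CREATE INDEX idx_predicate_object ON triples (predicate, object);
""")


def add_triple(subject, predicate, obj):
    """Store one (subject, predicate, object) fact."""
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


def objects_of(subject, predicate):
    """All objects for a given subject and predicate, sorted."""
    rows = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in rows]


def subjects_with(predicate, obj):
    """Reverse lookup: which subjects have this predicate/object pair?"""
    rows = conn.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ? ORDER BY subject",
        (predicate, obj),
    )
    return [row[0] for row in rows]


# Hypothetical facts about one entity.
add_triple("entity:ljubljana", "type", "city")
add_triple("entity:ljubljana", "surface_form", "Ljubljana")
add_triple("entity:ljubljana", "same_as", "http://en.wikipedia.org/wiki/Ljubljana")
```

The reverse lookup on `surface_form` is the direction an analysis step would use: from a string seen in text to the entities it might refer to.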
  • Analysis pipeline
      • Named Entity Extraction
      • Known phrases extraction (Aho-Corasick)
      • Triple store
      • Surface form features evaluation
      • Statistical comparison to background knowledge
      • Semantic coherence and hand-tuned heuristics etc.
      • → Disambiguated entities
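The known-phrase extraction step names Aho-Corasick, which finds every occurrence of every phrase from a dictionary in a single pass over the text. Below is a minimal character-level sketch of that algorithm; the `PhraseMatcher` class and the sample phrases are made up for illustration, and the slides do not say which implementation Zemanta uses.

```python
from collections import deque


class PhraseMatcher:
    """Minimal Aho-Corasick automaton over lowercase characters."""

    def __init__(self, phrases):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link for each state
        self.out = [[]]    # phrases that end at each state
        for phrase in phrases:
            self._add(phrase.lower())
        self._build_failure_links()

    def _add(self, phrase):
        state = 0
        for ch in phrase:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(phrase)

    def _build_failure_links(self):
        # Breadth-first: depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit phrases that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start, end, phrase) for every occurrence in text."""
        text = text.lower()
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for phrase in self.out[state]:
                hits.append((i - len(phrase) + 1, i + 1, phrase))
        return hits


matcher = PhraseMatcher(["new york", "new york times", "york"])
hits = matcher.find("The New York Times reported")
```

Overlapping matches are all reported, so a later step (such as the disambiguation stages listed in the pipeline) would still have to decide which of the overlapping candidates to keep.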
  • Connecting content
      • Indexing blogosphere and mediasphere
      • Solr based index
          - Twist: complicated queries (50 terms)
      • Filtering out spam is “fun”
      • Probably the best “related content” in terms of accuracy
      • Coming soon: social signal
  • But why just for bloggers? Let's open up the API!
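A query of around 50 terms can be assembled from weighted phrases produced by the analysis step. The sketch below builds request parameters for Solr's standard query parser, using its `field:"phrase"^boost` syntax and the standard `q` and `rows` parameters. The `text` field name, the weights, and the `build_related_query` helper are assumptions; no request is sent.

```python
from urllib.parse import urlencode


def build_related_query(weighted_terms, field="text", max_terms=50, rows=10):
    """Turn {phrase: weight} into Solr select parameters.

    Keeps only the max_terms highest-weighted phrases and boosts each
    clause by its weight. The field name is a hypothetical schema field.
    """
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    clauses = []
    for term, weight in top:
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    return {"q": " OR ".join(clauses), "rows": rows}


params = build_related_query(
    {"ljubljana": 2.0, "seedcamp": 1.5, "start-up": 0.5},
    max_terms=2,
)
query_string = urlencode(params)  # ready to append to a /select URL
```

Capping the term count keeps the query bounded however long the input article is, while the boosts let the most salient entities dominate ranking.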
  • Some API users
  • Back to reality.
  • Age of “smart”
  • Blog me up, Scotty! 23. April 2012
  • Some takeaways
      • Accelerators are good
      • World is getting flatter, but it will never be flat
      • Start monetizing soon: to learn, not to earn
      • Be where your market is
      • Many markets left to innovate in
  • Thank you!