Zemanta Tech Talk at Audible

Audible Tech Talk
23. April 2012

Andraz Tori
andraz@zemanta.com
@andraz

Today's plan
• Short story of Zemanta
• The Zemanta technology

Wonders of modern communication

Strip mine
• A system for Slovenian National television in 2006
• Closed captioning → web page for each episode of each show
• Natural Language Processing, Information Retrieval...

Seedcamp

• First European program inspired by YC (2007)
• London based
• 3 months, 50,000 EUR / 10%

Roller coaster
12. August Deadline
20. August Shortlist
23. August Phone interview
24. August Results

3. September London week start
7. September London week end
16. September ==> London

And then ...

• Figuring out that the US is our target market
• Figuring out where in the US to be and who to have here
• Partnerships
• And naturally the business model

What do we do?
• Zemanta – Personal Writing Assistant
- on your current platform
• While bloggers write we suggest:
- images
- related articles
- in-text links
- tags

Some stats

• 80k bloggers monthly
• 1.3 million posts enhanced in 2011

How does it work
• Natural Language Processing
• Big database of “meanings” (entities, concepts, topics)
• Word Sense Disambiguation
• Linking out to Wikipedia, Freebase, …
• Categorization, Named Entity Recognition

• Information Retrieval
• Solr based, using features from NLP
• With some twists
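
To make the word sense disambiguation step more concrete, here is a toy sketch in Python. It is not Zemanta's implementation: the function name `disambiguate`, the candidate lists, and the `related` map are all hypothetical, hand-made data. It only illustrates the general idea of preferring the meaning that fits best with the other entities mentioned in the same text.

```python
def disambiguate(mentions, candidates, related):
    """Pick one candidate meaning per mention (toy example, assumed data model).

    mentions:   list of surface strings found in the text
    candidates: mention -> list of possible meanings
    related:    meaning -> set of related meanings (hypothetical background knowledge)
    """
    chosen = {}
    for mention in mentions:
        # Context = all candidate meanings of the other mentions plus their neighbours.
        context = set()
        for other in mentions:
            if other != mention:
                for cand in candidates[other]:
                    context.add(cand)
                    context |= related.get(cand, set())

        def score(cand):
            # Overlap between this candidate's neighbourhood and the context.
            return len((related.get(cand, set()) | {cand}) & context)

        # On ties, max() keeps the first candidate in the list.
        chosen[mention] = max(candidates[mention], key=score)
    return chosen


candidates = {
    "jaguar": ["Jaguar (animal)", "Jaguar Cars"],
    "bmw": ["BMW"],
}
related = {
    "Jaguar (animal)": {"Cat", "Amazon rainforest"},
    "Jaguar Cars": {"Automobile", "Luxury vehicle"},
    "BMW": {"Automobile", "Munich"},
}
result = disambiguate(["jaguar", "bmw"], candidates, related)
```

In this made-up example "jaguar" resolves to the car maker because it shares a neighbour with "BMW". A real system would combine such a coherence score with many other signals.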

[Diagram: Plain text (article) → Semantic analysis (uses Background knowledge) → search (over Indexed content) → Content suggestions]

“Text Understanding”
- Input is a meaningful chunk of text (not a keyword or a phrase)
- Input is (semi-)English text
- Has to work across all domains in the open world
- music, celebrities, finance, entertainment, politics, gardening, parenting, …

Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase… and the world wild web
- Includes linguistic and semantic properties and unstructured data
- Present in two forms:
  - in “original” custom-built triple store on top of MySQL (150 GB)
  - processed into a 7 GB optimized “memory mapped dump”

Analysis pipeline
[Diagram, reconstructed from the slide labels]
• Known phrases extraction (Aho-Corasick)
• Named Entity Extraction
• Triple store
• Surface form features evaluation
• Statistical comparison to background knowledge
• Semantic coherence and hand-tuned heuristics
• etc.
→ Disambiguated entities
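
The slide names Aho-Corasick for the known-phrase extraction step. Below is a minimal, self-contained sketch of that algorithm in Python, assuming a small in-memory phrase list. The class name `PhraseMatcher` and the sample phrases are illustrative only; the talk does not show the actual implementation, and this version ignores word boundaries and tokenization.

```python
from collections import deque


class PhraseMatcher:
    """Minimal character-level Aho-Corasick automaton (illustrative sketch)."""

    def __init__(self, phrases):
        self.goto = [{}]   # per-node transitions: char -> node id
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node phrases that end at this node
        for phrase in phrases:
            self._add(phrase.lower())
        self._build_failure_links()

    def _add(self, phrase):
        node = 0
        for ch in phrase:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(phrase)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit phrases that end at the failure target (suffix matches).
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return (start_offset, phrase) for every known phrase in text."""
        node, hits = 0, []
        for i, ch in enumerate(text.lower()):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for phrase in self.out[node]:
                hits.append((i - len(phrase) + 1, phrase))
        return hits


matcher = PhraseMatcher(["new york", "new york times", "york"])
hits = matcher.find("The New York Times")
```

The automaton scans the text once and reports overlapping matches, which is why it suits matching a very large phrase dictionary against each article. The later pipeline stages would then decide which of the overlapping candidates to keep.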

Connecting content
• Indexing blogosphere and mediasphere
• Solr based index
• Twist: complicated queries – 50 terms
• Filtering out spam is “fun”
• Probably best “related content” in terms of accuracy
• Coming soon: social signal
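
As a rough illustration of what a "complicated query" with many terms might look like, the sketch below turns weighted entities into a boosted Lucene-style query string for Solr's `q` parameter. The field name `entities`, the weights, and the helper names are assumptions made for this example; the talk does not describe the actual schema or query construction.

```python
from urllib.parse import urlencode


def build_related_query(weighted_terms, field="entities", max_terms=50, rows=10):
    """Build Solr request parameters from {term: weight} (hypothetical schema)."""
    # Keep only the highest-weighted terms, capped at max_terms.
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)[:max_terms]
    clauses = []
    for term, weight in top:
        # Escape backslashes and quotes so the term is a valid quoted phrase.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight:.2f}')
    return {"q": " OR ".join(clauses), "rows": rows}


def to_query_string(params):
    """URL-encode the parameters for a request to a Solr select handler."""
    return urlencode(params)


params = build_related_query({"Zemanta": 2.0, "Seedcamp": 1.0, "Solr": 0.5}, max_terms=2)
query_string = to_query_string(params)
```

Here `params["q"]` contains one boosted clause per entity, joined with `OR`, so documents matching more of the higher-weighted entities tend to score higher.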

But why just for bloggers?

Let's open up the API!

Blog me up, Scotty!
23. April 2012

Some takeaways
• Accelerators are good
• World is getting flatter
But it will never be flat
• Start monetizing soon – to learn, not to earn
• Be where your market is
• Many markets left to innovate in
