2. CONTENT INTELLIGENCE
idio, content marketing
for marketing & sales teams
• content insight
• user interest profile
• recommendation demo
content model
user model
topic performance chart
3. CONTENT INTELLIGENCE
idio, content marketing
• focus on interests, not socio-demographics / firmographics
• automatic text analysis, model building and recommendation
5. CONTENT INTELLIGENCE
Content analysis at idio
• Python
  • build automation (luigi)
  • web services
• Scala
  • entity linking, i.e. finding topics in texts
6. CONTENT INTELLIGENCE
How most discussions start at conferences
• me: “I work on Natural Language Processing.”
• other: “So you’re in the field of deep learning?”
• However, our topic extractor is based on statistical analysis.
8. CONTENT INTELLIGENCE
Entity Linking task
• find topics in a text written in a natural language
• For us: 1 topic = 1 URI of a Wikipedia article.
• Sometimes called “wikification”.
demo
spotlight demo
9. CONTENT INTELLIGENCE
Entity Linking jargon
Surface Form: word (or phrase) that refers to a topic
“hoverboard” and “hover board” are surface
forms for the topic “Hoverboard”, the fictional
levitating board used for personal transportation.
Context: words surrounding a surface form.
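A minimal sketch of what "context" can look like in code. The fixed window size and the function name `context_window` are assumptions made for illustration; the talk does not say how wide the context is.

```python
def context_window(tokens, start, end, size=3):
    """Return up to `size` tokens on each side of the surface form tokens[start:end]."""
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]
    return left + right


tokens = "Marty rides a hover board through the town square".split()
# The surface form "hover board" covers tokens[3:5].
context = context_window(tokens, 3, 5, size=2)
```

Here `context` holds the two words before and the two words after "hover board".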
13. CONTENT INTELLIGENCE
Build an annotation model
Data: wikipedia
• 1 link = 1 annotation
demo
edit mode of a wikipedia page
14. CONTENT INTELLIGENCE
Build an annotation model
Data: wikipedia
• 1 link = 1 annotation
Algorithm overview:
• find all potential Surface Forms
• decide to annotate or not
• decide which topic
So, what statistics do we need?
demo
spotlight demo: candidates
15. CONTENT INTELLIGENCE
Extract stats
Identify all known SFs
Decide to annotate or not
Decide which topic
‣ Set of SFs we’ve seen annotated
‣ P ( annotation | SF )
‣ number of annotations for each SF
‣ P ( topic | SF )
‣ P ( topic | annotation )
‣ P ( topic | context )
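Two of these probabilities can be estimated from plain counts. The sketch below assumes simple relative frequencies; the function names and the numbers are made up for illustration and are not taken from the real model.

```python
def p_annotation(annotated_count, total_count):
    """Estimate P(annotation | SF) as annotated occurrences / all occurrences."""
    return annotated_count / total_count if total_count else 0.0


def p_topic_given_sf(association_counts, sf):
    """Estimate P(topic | SF) from {(sf, topic): number of annotations}."""
    counts = {t: n for (s, t), n in association_counts.items() if s == sf}
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


# Illustrative counts only.
associations = {
    ("Python", "Python_(programming_language)"): 80,
    ("Python", "Pythonidae"): 20,
}
p_anno = p_annotation(annotated_count=30, total_count=120)
p_topics = p_topic_given_sf(associations, "Python")
```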
16. CONTENT INTELLIGENCE
Model
surface forms
• annotated count, total count
topics
• number of annotations
• context, i.e. surrounding words, with number of occurrences
surface form <-> topic associations
• number of annotations
17. CONTENT INTELLIGENCE
Model: Example
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
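Since 1 link = 1 annotation, the wiki markup above can be turned into (surface form, topic) pairs. This is a rough sketch with a simplified regular expression; real wikitext has more cases (templates, files, section links) that it ignores, and the helper names are hypothetical.

```python
import re

# [[target]] or [[target|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_annotations(wikitext):
    """Return one (surface form, topic) pair per wiki link."""
    annotations = []
    for match in WIKILINK.finditer(wikitext):
        target, label = match.group(1), match.group(2)
        surface_form = label if label is not None else target
        # Wikipedia titles use underscores and a capitalised first letter.
        topic = target.strip().replace(" ", "_")
        topic = topic[:1].upper() + topic[1:]
        annotations.append((surface_form, topic))
    return annotations


def strip_markup(wikitext):
    """Replace each link by its visible text."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), wikitext)


text = (
    "SciPy is an [[open source]] "
    "[[Python (programming language)|Python]] library "
    "used by scientists, analysts, and engineers doing "
    "[[scientific computing]] and technical computing."
)
annotations = extract_annotations(text)
plain = strip_markup(text)
```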
20. CONTENT INTELLIGENCE
Model
surface forms: annotated count, total count
• “is” / “an” / “and” are skipped because too common (not informative)
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
SF               anno  total
SciPy            0     1
Python           1     1
open             0     1
open source      1     1
engineers doing  0     1
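A sketch of how such a table could be counted, assuming candidate surface forms are all 1- and 2-word phrases and that annotated spans are given as token index ranges. The skip-word list and the function name are illustrative assumptions.

```python
from collections import Counter

SKIP_WORDS = {"is", "an", "and"}  # illustrative, not the real list


def count_surface_forms(tokens, annotated_spans, max_len=2):
    """Count, for each candidate phrase, total and annotated occurrences."""
    total, annotated = Counter(), Counter()
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            if n == 1 and gram[0] in SKIP_WORDS:
                continue  # too common to be informative
            sf = " ".join(gram)
            total[sf] += 1
            if (start, start + n) in annotated_spans:
                annotated[sf] += 1
    return annotated, total


tokens = ["SciPy", "is", "an", "open", "source", "Python", "library"]
spans = {(3, 5), (5, 6)}  # "open source" and "Python" were links
annotated, total = count_surface_forms(tokens, spans)
```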
21. CONTENT INTELLIGENCE
Model
topics: number of annotations, context
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
topic                          num. annotations  context
Open_source                    1                 SciPy, Python, library, engineer, technical, computing, use, science x2, analyst
Python_(programming_language)  1                 (same)
Scientific_computing           1                 (same)
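A sketch of the topic statistics: every topic linked in a paragraph gets one more annotation and shares the paragraph's context words. The input is assumed to be already stemmed and stripped of skip words; the dictionary layout is an assumption, not the real storage format.

```python
from collections import Counter, defaultdict


def add_paragraph(model, topics, context_words):
    """Update annotation counts and context word counts for each linked topic."""
    for topic in topics:
        entry = model[topic]
        entry["annotations"] += 1
        entry["context"].update(context_words)


model = defaultdict(lambda: {"annotations": 0, "context": Counter()})
context_words = ["SciPy", "Python", "library", "engineer", "technical",
                 "computing", "use", "science", "science", "analyst"]
topics = ["Open_source", "Python_(programming_language)", "Scientific_computing"]
add_paragraph(model, topics, context_words)
```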
22. CONTENT INTELLIGENCE
Model
surface form <-> topic associations: number of annotations
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
SF                    topic                          num. annotations
open source           Open_source                    1
Python                Python_(programming_language)  1
scientific computing  Scientific_computing           1
23. CONTENT INTELLIGENCE
Model
surface forms
• annotated count, total count
topics
• number of annotations
• context, i.e. surrounding words, with number of occurrences
surface form <-> topic associations
• number of annotations
24. CONTENT INTELLIGENCE
Text annotation
model
SF, topic and association statistics
input text
I'm on cloud 9 whenever I write Python code.
Known SFs: “cloud”, “9”, “cloud 9”, “write”, “Python”, “code”
Annotate?
• discard “9” and “write” because low anno probability
• “cloud” vs “cloud 9” overlap: keep higher anno probability
• keep “Python” and “code”
Which topic?
• SF “Python” is ambiguous: animal or programming language?
• The context supports the programming language.
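The two decisions above can be sketched as follows. The threshold, the probabilities, and the scoring formula (prior times one plus context overlap) are assumptions chosen for illustration; they are not the formula used by spotlight or by idio.

```python
def select_surface_forms(candidates, p_anno, threshold=0.1):
    """candidates: (start, end, sf) token spans. Drop unlikely SFs, then
    resolve overlaps by keeping the higher annotation probability."""
    kept = [c for c in candidates if p_anno.get(c[2], 0.0) >= threshold]
    kept.sort(key=lambda c: p_anno.get(c[2], 0.0), reverse=True)
    selected = []
    for start, end, sf in kept:
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, sf))
    return sorted(selected)


def pick_topic(sf, context_words, p_topic_given_sf, topic_context):
    """Pick the candidate topic whose stored context best matches the text."""
    def score(topic):
        prior = p_topic_given_sf[sf][topic]
        known = topic_context.get(topic, {})
        overlap = sum(known.get(w, 0) for w in context_words)
        return prior * (1 + overlap)
    return max(p_topic_given_sf[sf], key=score)


# Tokens: I'm(0) on(1) cloud(2) 9(3) whenever(4) I(5) write(6) Python(7) code(8)
candidates = [(2, 3, "cloud"), (3, 4, "9"), (2, 4, "cloud 9"),
              (6, 7, "write"), (7, 8, "Python"), (8, 9, "code")]
p_anno = {"cloud": 0.2, "9": 0.01, "cloud 9": 0.6,
          "write": 0.02, "Python": 0.5, "code": 0.3}  # made-up values
selected = select_surface_forms(candidates, p_anno)

priors = {"Python": {"Pythonidae": 0.6, "Python_(programming_language)": 0.4}}
contexts = {"Python_(programming_language)": {"code": 5, "write": 2},
            "Pythonidae": {"snake": 4}}
topic = pick_topic("Python", ["cloud", "9", "write", "code"], priors, contexts)
```

Even with a prior that favours the animal, the words "write" and "code" tip the choice towards the programming language.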
26. CONTENT INTELLIGENCE
Data preparation
• stemming: reduce words to their word stem
• “laptops” -> “laptop”, “worked” -> “work”
• skip words
• The Wikipedia dump is a single XML file of ~50 GB.
(See our blog post on “idio’s Wikipedia toolkits”)
extract stats: “It’s just counting.”
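A file of that size cannot be loaded in one go, so it has to be streamed. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse`; the sample document, its namespace, and the helper names are illustrative assumptions, and this is not the code from the toolkits mentioned above.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) one page at a time from a dump-like XML stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Drop the page's children and text. The emptied element stays
            # attached to the root, so a production version would detach it too.
            elem.clear()


SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>SciPy</title><revision><text>SciPy is an [[open source]] library.</text></revision></page>
  <page><title>Hoverboard</title><revision><text>A fictional board.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```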
27. CONTENT INTELLIGENCE
Challenges
Wikipedia is not representative of our clients’ articles.
• Errors in Wikipedia, although they are not the biggest problem
• Unwanted SF-topic associations
• Virtually all surface forms are ambiguous.
• Bad priors: “yesterday” -> Beatles song (even in lower-case)
• annotation of a sub-part of a SF
28. CONTENT INTELLIGENCE
How to fix?
• Whenever possible, find a pattern, understand the underlying issue
and make algorithm changes: give a boost to some SFs, some
topics, or some SF-topic associations, e.g. capitalised phrases.
• tweak the model/probabilities for isolated issues.
• We made a model editor for spotlight. Check our github.
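Both kinds of fix can be sketched in a few lines: a general boost for a pattern (here, capitalised surface forms) and a manual override for an isolated issue. The boost factor, the override table, and the function name are hypothetical; this is not the model editor's API.

```python
def adjusted_p_annotation(sf, p_anno, capitalised_boost=1.5, overrides=None):
    """Apply a manual override if one exists, else boost capitalised SFs."""
    overrides = overrides or {}
    if sf in overrides:
        return overrides[sf]  # isolated fix: tweak the probability directly
    if sf[:1].isupper():
        p_anno *= capitalised_boost  # pattern-based fix
    return min(p_anno, 1.0)


# Made-up values: never annotate lower-case "yesterday".
overrides = {"yesterday": 0.0}
lower = adjusted_p_annotation("yesterday", 0.4, overrides=overrides)
upper = adjusted_p_annotation("Yesterday", 0.4, overrides=overrides)
```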
29. CONTENT INTELLIGENCE
Takeaways
• Wikipedia is an awesome resource for NLP
• Statistics can solve some NLP tasks
• Adapt the formula with rules and boosts to make up for the
differences between the learning data set and the output we want