This is the deck presented at the Auckland Content Strategy Meetup in August 2015.
http://www.meetup.com/Auckland-content-strategy-meetup/events/223324647/
How can computers understand text? And what happens when they can? Come along to a Meetup that looks forward to a world where computers can truly understand human language.
Hear about what will be possible when 'text mining' machines can understand huge amounts of content. How much can today's computers understand from what we write? More than you think! And that's handy, given the massive amount of text data generated every day. How do they do it? And what can we puny humans do to help make text content easier (or harder) to analyse?
There's more research happening on this every day, all around the globe - including right here in Auckland. Dr. Anna Divoli will tell us all about it, and advise us how to take advantage of this automation. In our traditional style, we'll open up a wide-ranging chat afterwards.
===
Dr. ANNA DIVOLI is the Head of Research and Development at Pingar, a company that wants machines to learn as easily from text as they do from databases. Anna has been developing and evaluating algorithms and user interfaces for text mining systems since 2001. Her research has a wide range of applications including automatic database annotation, usability of search engines, knowledge acquisition, entity extraction and document clustering.
She has an MSc in Biosystems and Informatics from the University of Liverpool and a PhD in Biomedical Text Mining from the University of Manchester, and has held postdoctoral research positions in the prestigious School of Information at the University of California at Berkeley and later in the Department of Medicine at the University of Chicago.
How computers understand text content - by Anna Divoli
1. How computers understand
text content
a presentation for the Auckland content strategy meetup
by Anna Divoli
@annadivoli
Ph.D. in Biomedical Text Mining | Text Analytics Researcher | Head of R&D at Pingar
2. Who am I?
• 14 years in academia + 4 years in industry
• academically exposed to different disciplines:
biomedicine, bioinformatics,
computational linguistics, information retrieval,
information extraction, semantic technologies,
human-computer interaction, search user interface usability,
knowledge acquisition, visualizations
• lived in different countries:
Greece, UK, US, NZ
• learned English as a second language
(hint: I empathize with computer systems)
Anna Divoli Auckland content strategy meetup Aug 2015
3. Who are you?
• Marketing?
• Digital content?
• Information Architecture?
• Journalists?
• UX?
• Business Analysis?
• Software Development?
• CS research (incl. "text" people)?
• Other?
4. What is "text"? Where is it?
www.nailingit.com/images/websites.jpg
www.bu.edu/today/files/2012/10/t_journals1.jpg
web.clarku.edu/offices/its/images/filepile.jpg
www.flickr.com/photos/jlconfor/14191286471
5. Human - Text Content Interaction
Humans:
Slow, Inconsistent, Expensive
Text content:
Overwhelmingly fast growing,
Disseminated across multiple sources
6. NLP - Artificial Intelligence
Machine Learning
NLP
Computational Linguistics
Applied Text Analytics
Storage
Memory
Security
Friendly UIs
Visualizations
7. So, what's in the text?
• Entities
• Facts
• Relations
• Themes/topics
• Opinions & sentiment
• …
+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …
8. Named Entity Recognition
Find and classify names…
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
9. Named Entity Recognition
Find and classify names…
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
People
Locations
Organizations
Methods: lexicon-based (gazetteers)
grammar-based (rule-based)
- statistical models (machine learning: algorithms + features)
- hybrids
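As a rough illustration of the first two methods, here is a minimal Python sketch that combines a tiny lexicon with one hand-written rule. The gazetteer entries and the initial-plus-surname pattern are assumptions made up for this example, not part of any real system. Note that the lexicon alone cannot tell whether "Virginia" is a person or a place; it simply returns whatever the list says, which is one reason statistical and hybrid methods are used.

```python
import re

# Hypothetical gazetteers (lexicons); real systems use far larger lists.
GAZETTEER = {
    "Location": {"Washington"},
    "Organization": {"Smithsonian", "Eureka's Ltd"},
    "Person": {"John Smith", "Virginia"},
}

# One hand-written rule: an initial followed by a capitalised surname.
INITIAL_SURNAME = re.compile(r"\b[A-Z]\. [A-Z][a-z]+")


def tag_entities(text):
    """Return sorted (start, surface, label) tuples found in text."""
    found = []
    # Lexicon-based step: look up every gazetteer entry in the text.
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    # Rule-based step: apply the pattern for "S. Arlington"-style names.
    for match in INITIAL_SURNAME.finditer(text):
        found.append((match.start(), match.group(), "Person"))
    return sorted(found)


sentences = (
    "S. Arlington initiated partnership discussions during his visit to "
    "Eureka's Ltd offices last month. "
    "John Smith went to Washington to see the Smithsonian and also "
    "met up with Virginia for a coffee."
)
entities = tag_entities(sentences)
for _, surface, label in entities:
    print(f"{surface}: {label}")
```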
10. Named Entity Recognition
Find and classify names…
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
People, Locations, Organizations, Dates
Who? Where? When?
11. Disambiguation & Normalization: Word Sense Disambiguation & Text Normalization
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
Word Sense Disambiguation: identifying which sense/meaning
of a word is used in a sentence, when the word has multiple
meanings. Synonyms & homonyms. Use context!!
Text normalization: transforming text into a single canonical
form that it might not have had before.
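To make "use context" concrete, here is a small Python sketch that picks a sense by counting cue words near the ambiguous name. The cue-word sets and the window size are assumptions invented for this example; a real system would learn them from annotated text or derive them from dictionary definitions.

```python
import re

# Hypothetical cue words for each sense (illustrative only).
SENSE_CUES = {
    "Location": {"went", "to", "in", "from", "visit", "see"},
    "Person": {"met", "with", "said", "mr", "mrs", "coffee"},
}


def disambiguate(word, sentence, window=3):
    """Pick the sense whose cue words overlap most with nearby tokens."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    position = tokens.index(word.lower())
    # Context = up to `window` tokens on each side of the target word.
    context = set(tokens[max(0, position - window):position])
    context |= set(tokens[position + 1:position + 1 + window])
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    # On a tie, max() keeps the first sense listed in SENSE_CUES.
    return max(scores, key=scores.get)


sentence = (
    "John Smith went to Washington to see the Smithsonian and also "
    "met up with Virginia for a coffee."
)
washington_sense = disambiguate("Washington", sentence)
virginia_sense = disambiguate("Virginia", sentence)
print("Washington:", washington_sense)
print("Virginia:", virginia_sense)
```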
12. Word Sense Disambiguation & Text Normalization
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
Sam Arlington initiated partnership discussions during his visit to
Eureka offices in July.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
J. Smith went to Washington DC to see the Smithsonian Institute
and also met up with Virginia Peterson for a coffee.
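The first rewrite above can be sketched in Python as a lookup plus a relative-date rule. The alias table and the writing date (1 August 2015, chosen only because the deck dates from August 2015) are assumptions for illustration.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical alias table mapping surface forms to one canonical form.
CANONICAL = {
    "S. Arlington": "Sam Arlington",
    "Eureka's Ltd": "Eureka",
}


def previous_month_name(reference):
    """Name of the month before the reference date (January wraps to December)."""
    return MONTHS[(reference.month - 2) % 12]


def normalize(text, written_on):
    """Replace known aliases and resolve 'last month' against written_on."""
    for surface, canonical in CANONICAL.items():
        text = text.replace(surface, canonical)
    return text.replace("last month", "in " + previous_month_name(written_on))


original = (
    "S. Arlington initiated partnership discussions during his visit to "
    "Eureka's Ltd offices last month."
)
# Assumed writing date; the source does not state one.
normalized = normalize(original, written_on=date(2015, 8, 1))
print(normalized)
```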
13. Word Sense Disambiguation & Text Normalization
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
Sam Arlington initiated partnership discussions during his visit to
Eureka office in July.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
J. Smith went to Washington DC to see the Smithsonian Institute
and also met up with Virginia Peterson for a coffee.
14. Fact & Relationship extraction
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
What?
15. Deeper knowledge & Sentiment
S. Arlington initiated partnership discussions during his visit to
Eureka's Ltd offices last month.
John Smith went to Washington to see the Smithsonian and also
met up with Virginia for a coffee.
How? Why? How do we feel about it?
S. Arlington visited the Eureka's Ltd offices last month to initiate
partnership discussions.
John Smith was delighted to go to Washington to see the
Smithsonian and also met up with Virginia for a coffee.
16. Sentiment analysis & opinion mining
• Dictionary-based (e.g. LIWC)
• Statistical
• Hybrid
• Polarity & strength: Pos | Neu | Neg & score
• Feelings: Angry, sad…
• Mood: Happy, depressed…
• Aspects: Location, cleanliness…
• Who has this sentiment (source): Employees, customers…
• What is the target of the sentiment: Product, event, person…
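A dictionary-based approach with polarity and strength can be sketched in a few lines of Python. The mini-lexicon, its scores and the one-word negation rule are assumptions made up for this example; they are not taken from LIWC or any other published resource.

```python
import re

# Hypothetical mini-lexicon: word -> polarity strength (illustrative only).
LEXICON = {
    "delighted": 2, "love": 2, "happy": 1,
    "sad": -1, "upset": -1, "angry": -2,
}
NEGATORS = {"not", "never", "no"}


def score(text):
    """Sum lexicon strengths, flipping the sign right after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        total += value
    return total


def polarity(text):
    """Map the numeric score to Pos, Neu or Neg."""
    total = score(text)
    if total > 0:
        return "Pos"
    if total < 0:
        return "Neg"
    return "Neu"


examples = [
    "John Smith was delighted to go to Washington.",
    "John met with Nick. He was upset.",
    "S. Arlington visited the offices last month.",
    "I am not happy.",
]
results = [(polarity(text), score(text)) for text in examples]
for text, (label, value) in zip(examples, results):
    print(f"{label} ({value:+d}): {text}")
```

A word list like this has no notion of sarcasm or of who holds the sentiment, which is where the statistical and hybrid approaches come in.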
17. So, what's in the text?
• Entities
• Facts
• Relations
• Themes/topics → no training or ontologies needed!
can utilize web resources (e.g., Wikipedia)
• Opinions & sentiment
• …
+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …
19. So, what ELSE is in the text?
• Ambiguity: I want an apple.
• Metaphors: He drowned in a sea of grief.
• Sarcasm: George W Bush. Love him!
• Colloquialism/Slang: I slept like crap last night.
• Negation: I am not sure I want to go to NYC.
• Hedging: The results indicate this.
• Conditional statements: When it rains I feel sad.
• Inconsistencies/Bad grammar: I think your smart.
• Text speak: C u l8r @Jacks
• Anaphora: John met with Nick. He was upset.
• Humor: Did you take a bath today? No. Is one missing?
Consider: distributed information (dialogue), technical/scientific text,
legal text, creative/poetry…
20. Human language!
Eye drops off shelf.
Include your children when baking cookies.
Turn right here.
John saw the man on the mountain with a telescope.
He gave her cat food.
They are hunting dogs.
21. Examples: Biology…
Looking for: interactions between SAF and viral LTR elements
(SAF is a transcription factor, LTR stands for "long terminal repeat")
(Also: SAF = single and free, LTR = long term relationship)
Gene names:
tinman, lilliputian, dreadlocks, lush,
cheap date, methuselah, Van Gogh,
maggie, brainiac, grim, reaper,
cleopatra, swiss cheese, fucK, out cold,
ken and barbie, kenny, lava lamp,
hamlet, sonic hedgehog, werewolf, half
pint, drop dead, chardonnay, agnostic,
I'm not dead yet…
22. Current State of NLP
• Rule-based systems for high precision results
• Hybrid systems for more robust performance
(rules + dictionaries/ontologies + statistical models)
• Limitation: specialized systems perform better
(much like humans!)
• Workflows offer work-around for more generic systems
e.g., check language → check category → choose model
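The "check language, check category, choose model" workflow could look roughly like the Python sketch below. The stop-word lists, category terms and model names are all hypothetical placeholders; a real pipeline would use trained language and topic classifiers at each step.

```python
import re

# Hypothetical vocabularies and model names, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "to", "of", "with"},
    "de": {"der", "und", "die", "das", "mit"},
}
CATEGORY_TERMS = {
    "biology": {"gene", "transcription", "viral", "protein"},
    "business": {"partnership", "offices", "customers", "product"},
}
MODELS = {
    ("en", "biology"): "en-biomedical-model",
    ("en", "business"): "en-business-model",
}


def best_match(tokens, vocabularies, default):
    """Return the vocabulary name with the largest overlap, or the default."""
    scores = {name: len(tokens & vocab) for name, vocab in vocabularies.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def choose_model(text):
    """Check language, then category, then pick a specialised model."""
    tokens = set(re.findall(r"[a-zäöüß]+", text.lower()))
    language = best_match(tokens, STOPWORDS, "unknown")
    category = best_match(tokens, CATEGORY_TERMS, "general")
    # Fall back to a generic model when no specialised one is registered.
    return MODELS.get((language, category), "generic-model")


biology_model = choose_model(
    "Interactions between the SAF transcription factor and viral LTR elements"
)
business_model = choose_model(
    "S. Arlington initiated partnership discussions during his visit to the offices."
)
fallback_model = choose_model("Der Hund und die Katze")
print(biology_model, business_model, fallback_model)
```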
34. Take home messages
• Machines can do a lot of consistent, fast information extraction
• Specialization is needed in several fields but systems can have internal workflows
• Big data + statistics = magic!
• Always room for improvement
• Information management AND decisions AND predictions
35. Time for questions and discussion!
https://xkcd.com/1263/
@annadivoli