"Little Words in Big Data", Jessica Perri, Attensity Director Linguistic Technology
- 1. © Copyright 2013 Attensity All rights reserved
Little Words in Big Data
Jessica Perri
Dir. Linguistic Technology
Attensity Corp.
jperri@attensity.com
- 2.
Overview
Big Data – Social Media
Natural Language Parsing and Extraction
Sentiment
- 3.
We have more data available than
ever before…
- 4.
Big Data and Big Growth
• The amount of available data is growing exponentially
• Seeing a change in the discourse landscape
– Dramatic increase in personal narrative (blogs, reviews, Twitter, etc.)
– Shift in authorship and compositional methods (smartphones, tablets, etc.)
• Result: More variation in data than ever before
- 5.
But… more data does not
necessarily mean better data.
- 6.
Processing Challenges - Where Did the Data Come From?
• Signal-to-noise ratio is worse than ever
– ETL problems
– Spam, spam, spam
– Marketing materials
– Shills, employees, interns and unsavory types gaming the system
• Domain detection is critical for pragmatic assumptions
- 7.
Processing Challenges - What is the Data Composed Of?
• Text is “degraded”
– Missing/excessive punctuation
– Missing words
– Typographical errors
– Rapid topic shift
• Language is extremely varied, and constantly changing
– A million words for a single picture
– Productive, phonological rules for emphasis (loooooooooooool, uggghhhhh)
– Novel and coined terms
• Not business relevant
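One way to handle the emphatic lengthening mentioned above (loooooooooooool, uggghhhhh) is to collapse long character runs while remembering that emphasis was present. The following is a minimal sketch, not Attensity's method; the function name `normalize_elongation` and the threshold of three repeated characters are assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumption: a character repeated three or more times in a row signals
# emphatic lengthening rather than ordinary spelling ("good", "cool").
_ELONGATION = re.compile(r"(.)\1{2,}")


def normalize_elongation(token: str, keep: int = 1) -> Tuple[str, bool]:
    """Collapse runs of 3+ identical characters down to `keep` copies.

    Returns the normalized token plus a flag recording that emphasis was
    present, so the emphasis can still be used as a sentiment feature.
    """
    collapsed, replacements = _ELONGATION.subn(
        lambda match: match.group(1) * keep, token
    )
    return collapsed, replacements > 0


if __name__ == "__main__":
    for word in ["loooooooooooool", "uggghhhhh", "good"]:
        print(word, normalize_elongation(word))
```

Keeping the flag rather than discarding the repetition reflects the slide's point that the lengthening is a productive rule carrying meaning, not just noise.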
- 8.
Processing Challenges - Extralinguistic Cues
• People are opinionated
• People are sassy
• People are sarcastic
• People are clever
@jane: Obama won. I’m SO HAPPY to have a #socialist #communist president.
@jane: Poor Romney. I’m so sad that he has to go home to one of his 35 mansions. #not
@jane: It’s so great that Obama won. </sarcasm>
@jane: It’s so great that Obama won. #saidnooneever
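Three of the four tweets above carry an explicit sarcasm marker (#not, </sarcasm>, #saidnooneever); the first does not, and relies on the reader's knowledge of the author. A minimal sketch of catching only the explicit markers follows. The marker list and the name `has_sarcasm_marker` are hypothetical and seeded solely from the examples on this slide.

```python
import re

# Hypothetical marker list, taken only from the example tweets on this slide.
# A real system would need a much larger, regularly refreshed inventory.
SARCASM_MARKERS = re.compile(
    r"#not\b|#saidnooneever\b|</sarcasm>",
    re.IGNORECASE,
)


def has_sarcasm_marker(text: str) -> bool:
    """Return True when the text contains an explicit sarcasm marker."""
    return SARCASM_MARKERS.search(text) is not None


if __name__ == "__main__":
    tweets = [
        "Obama won. I'm SO HAPPY to have a #socialist #communist president.",
        "Poor Romney. I'm so sad that he has to go home to one of his 35 mansions. #not",
        "It's so great that Obama won. </sarcasm>",
        "It's so great that Obama won. #saidnooneever",
    ]
    for tweet in tweets:
        print(has_sarcasm_marker(tweet), tweet)
```

The first tweet is not flagged, which illustrates why surface markers alone are insufficient and extralinguistic cues remain a processing challenge.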
- 9.
We need to use existing data more
intelligently!
- 10.
What can we do with Big Data?
• “Looking for a needle in a haystack”
• Search for predefined scenarios: Recovery
• Implications for processing: Use a set of targeted patterns over all possible data
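The recovery approach, a fixed set of targeted patterns run over every document, can be sketched roughly as below. The scenario names and regular expressions are hypothetical stand-ins chosen for illustration; they are not Attensity patterns.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical predefined scenarios: each name maps to one targeted pattern.
SCENARIOS = {
    "password_reset": re.compile(r"\breset\b.*\bpassword\b", re.IGNORECASE),
    "price_complaint": re.compile(r"\b(lower|too high)\b.*\bprices?\b", re.IGNORECASE),
}


def recover(docs: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Scan every document and yield (document index, scenario name) hits."""
    for index, doc in enumerate(docs):
        for name, pattern in SCENARIOS.items():
            if pattern.search(doc):
                yield index, name


if __name__ == "__main__":
    corpus = [
        "I have never reset my password.",
        "Lower your prices.",
        "Nice scarf.",
    ]
    print(list(recover(corpus)))
```

Because the pattern set is small and known in advance, the cost per document is low, which is what makes running it over all available data feasible.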
- 11.
What can we do with Big Data?
• “Looking for the shape of the haystack”
• Look for trends and novel events: Discovery
– IDKWILFBIKIWISI (“I don’t know what I’m looking for, but I’ll know it when I see it”)
• Implications for processing: Use dynamic patterns over a sample of data (“exhaustive extraction”)
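As a rough illustration of discovery, the sketch below draws a random sample and surfaces the most frequent word pairs without any predefined scenario. Plain bigram counting is swapped in here as a simple stand-in; it is not the exhaustive extraction the slide refers to, and the function name `discover` and its parameters are assumptions.

```python
import random
import re
from collections import Counter
from typing import List, Sequence, Tuple


def discover(
    docs: Sequence[str], sample_size: int, top: int = 3, seed: int = 0
) -> List[Tuple[Tuple[str, str], int]]:
    """Count word bigrams over a random sample and return the most frequent."""
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    if sample_size >= len(docs):
        sample = list(docs)
    else:
        sample = rng.sample(list(docs), sample_size)

    counts: Counter = Counter()
    for doc in sample:
        words = re.findall(r"[a-z']+", doc.lower())
        counts.update(zip(words, words[1:]))
    return counts.most_common(top)


if __name__ == "__main__":
    corpus = [
        "free shipping please",
        "I want free shipping",
        "no free shipping again",
    ]
    print(discover(corpus, sample_size=10, top=1))
```

Working on a sample keeps the open-ended analysis affordable; whatever trends emerge can then be turned into targeted patterns for recovery over the full data set.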
- 12.
Attensity Exhaustive Extraction – Roles and Relationships
“I bought a beautiful Jimmy Choo scarf for my mom from Nordstrom.”
- 13.
Attensity Voice – Shades of Meaning
Indefinite Voice depicts the uncertainty of the statement:
I might stay here again.
Intent Voice indicates the plans of a customer:
We will definitely stay here in the future!
Conditional Voice reveals customer’s stipulations:
I would shop more often if I got free shipping.
Negation cancels out the statement:
I have never reset my password.
Recur Voice conveys the recurring manner of the event:
This is the third time I’ve emailed them.
Command Voice detects strong demands from a customer, distinguishing them from requests or statements of fact:
Lower your prices.
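The voice categories above depend largely on "little words" such as might, will, would, if, and never. A toy cue-based tagger makes this concrete. It is a sketch under stated assumptions: the cue lists are hypothetical and cover only the examples on this slide, whereas the source describes a full linguistic analysis rather than keyword matching.

```python
import re
from typing import List

# Hypothetical cue patterns, one per voice. Each is deliberately small and
# only intended to cover the example sentences on this slide.
VOICE_CUES = [
    ("indefinite", re.compile(r"\b(might|may|maybe|perhaps)\b", re.IGNORECASE)),
    ("intent", re.compile(r"\b(will|going to)\b", re.IGNORECASE)),
    ("conditional", re.compile(r"\bwould\b.*\bif\b", re.IGNORECASE)),
    ("negation", re.compile(r"\b(never|not|no)\b|n['’]t\b", re.IGNORECASE)),
    ("recur", re.compile(r"\b(second|third|fourth|\d+(st|nd|rd|th)) time\b", re.IGNORECASE)),
    # Assumption: a sentence-initial bare verb from a short list marks a command.
    ("command", re.compile(r"^(lower|fix|stop|give|cancel)\b", re.IGNORECASE)),
]


def detect_voices(sentence: str) -> List[str]:
    """Return the names of all voices whose cue pattern matches the sentence."""
    text = sentence.strip()
    return [name for name, pattern in VOICE_CUES if pattern.search(text)]


if __name__ == "__main__":
    examples = [
        "I might stay here again.",
        "We will definitely stay here in the future!",
        "I would shop more often if I got free shipping.",
        "I have never reset my password.",
        "This is the third time I've emailed them.",
        "Lower your prices.",
    ]
    for sentence in examples:
        print(detect_voices(sentence), sentence)
```

A keyword approach like this breaks quickly on real data (for example, "will" as a noun), which is one reason a syntactic parse is needed to assign voice reliably.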
- 14.
Domain Knowledge Models
• Narrow topic definition
– Data variability across domains
– Reconciling ambiguity
• Iterative refreshing
– What is relevant NOW
– Growth in the lexicon because of new products, etc.
• Life cycle
– Predefinition
– Expiration
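The life-cycle idea can be pictured as lexicon entries that become active on a planned date and stop applying after an expiration date. The sketch below is one possible reading of "predefinition" and "expiration", not a description of Attensity's models; the `LexiconEntry` class, its fields, and the sample terms and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LexiconEntry:
    term: str
    domain: str
    valid_from: date                 # predefinition: entry is created ahead of need
    expires: Optional[date] = None   # expiration: None means no planned end

    def is_active(self, today: date) -> bool:
        """True when `today` falls inside the entry's validity window."""
        started = self.valid_from <= today
        not_expired = self.expires is None or today < self.expires
        return started and not_expired


def active_terms(entries: Iterable[LexiconEntry], domain: str, today: date) -> List[str]:
    """Return the terms that are relevant NOW for one narrow domain."""
    return [e.term for e in entries if e.domain == domain and e.is_active(today)]


if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    lexicon = [
        LexiconEntry("debate", "politics", date(2012, 9, 1), date(2012, 11, 7)),
        LexiconEntry("inauguration", "politics", date(2013, 1, 1), date(2013, 2, 1)),
        LexiconEntry("free shipping", "retail", date(2012, 1, 1)),
    ]
    print(active_terms(lexicon, "politics", date(2012, 10, 15)))
```

Filtering by domain first mirrors the narrow topic definition, and the date window is a simple way to express iterative refreshing.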
- 16.
Sentiment Definitions
• Sentiment Type
– Opinion Mining (typically negative/positive)
– Emotion Detection
• Sentiment Scope
– Document level
– Sentence level
– Entity/aspect level
• A Couple of Sentiment Use Cases
– Marketing
– Newsmakers
- 17.
Sentiment Detection
• Attensity performs comprehensive language analysis
– Syntactic parse, providing linguistic analysis
– Semantic cues
– Pragmatic intelligence
• Single value for entities
• Sentiment features are weighted and combined to provide the final sentiment value and score for document-level sentiment
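The slide does not say how the features are weighted or combined. As one plausible illustration, the sketch below uses a weighted average of per-feature polarities and maps the result to a label. The feature names, weights, and the ±0.1 neutral band are assumptions, not Attensity's actual scheme.

```python
from typing import Dict, Tuple


def combine_sentiment(
    features: Dict[str, float],
    weights: Dict[str, float],
    neutral_band: float = 0.1,
) -> Tuple[str, float]:
    """Combine feature polarities in [-1, 1] into a (value, score) pair.

    The score is the weighted average of the feature polarities; features
    without a weight are ignored. The value is a coarse label derived from
    the score.
    """
    total_weight = sum(weights.get(name, 0.0) for name in features)
    if total_weight == 0:
        return "neutral", 0.0

    weighted_sum = sum(
        polarity * weights.get(name, 0.0) for name, polarity in features.items()
    )
    score = weighted_sum / total_weight

    if score > neutral_band:
        return "positive", score
    if score < -neutral_band:
        return "negative", score
    return "neutral", score


if __name__ == "__main__":
    # Hypothetical document: positive surface wording, but a pragmatic cue
    # (for example a sarcasm marker) pulls strongly in the other direction.
    doc_features = {"syntactic": 0.5, "semantic": 1.0, "pragmatic": -1.0}
    feature_weights = {"syntactic": 1.0, "semantic": 1.0, "pragmatic": 2.0}
    print(combine_sentiment(doc_features, feature_weights))
```

Giving the pragmatic feature a higher weight in this example shows how a sarcastic "It's so great" could still end up with a negative document-level value.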
- 18.
Marketing: A single picture is composed of thousands of words
- 19.
Political Newsmakers: Emotions
• Yahoo Social Media Widget “The Signal”
• Focused around Political Data for the 2012 Election
• Seven Emotions:
– Angry, Confused, Disengaged, Excited, Happy, Sad, and Worried
• Candidate- and issue-centric:
– Fundraising, Religion, Race, etc.
– Economy, Environment, Foreign Affairs, Health Care, Social Issues, etc.
• Segmented by state
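The segmentation described above amounts to counting emotion labels per topic within each state. A minimal sketch of that aggregation follows. The record layout `(state, topic, emotion)` and the function name `emotions_by_state` are assumptions, and the sample records are invented for illustration rather than taken from The Signal.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# The seven emotions listed on the slide.
EMOTIONS = {"angry", "confused", "disengaged", "excited", "happy", "sad", "worried"}


def emotions_by_state(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Counter]:
    """Count (topic, emotion) pairs per state.

    Each record is assumed to be (state, topic, emotion), where topic is a
    candidate or an issue and emotion is one of the seven labels.
    """
    by_state: Dict[str, Counter] = defaultdict(Counter)
    for state, topic, emotion in records:
        label = emotion.lower()
        if label not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion!r}")
        by_state[state][(topic, label)] += 1
    return dict(by_state)


if __name__ == "__main__":
    # Invented records for illustration only.
    sample = [
        ("OH", "Economy", "Worried"),
        ("OH", "Economy", "Worried"),
        ("OH", "Health Care", "Angry"),
        ("CA", "Environment", "Excited"),
    ]
    for state, counts in emotions_by_state(sample).items():
        print(state, counts.most_common())
```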