1. Master DMKM Presentation
Entity Aspect Analysis
By: Ahmed Kamel
Supervision: Ingmar Weber, Yahoo! Labs Barcelona
Marta Arias, Universitat Politècnica de Catalunya
Location: Yahoo! Labs Barcelona
5. Text Extraction
Boilerpipe
Stanford CoreNLP
Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
The son of a black man from Kenya and a white woman from Kansas, he is the first African-American to ascend
to the highest office in the land.
He defeated Hillary Rodham Clinton in a lengthy and bitter primary battle before defeating Senator John McCain,
the Arizona Republican, in November 2008.
…
6. Entity Recognition
Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
…
Entity Recognition
(Wikification)
Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
President_of_the_United_States||0.9707||president of the United States||0.9918
…
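The `||`-delimited wikification lines above can be read programmatically. The following is a minimal sketch, assuming each line holds an entity name and its score, followed by alternating surface-form and score fields. That field layout is inferred from the two example lines only, and `parse_wikification_line` is a hypothetical helper, not part of the in-house tool.

```python
def parse_wikification_line(line):
    """Parse one '||'-delimited line into (entity, score, surface_forms).

    Assumed layout (inferred from the slide, not documented):
    entity||score||surface||score||surface||score...
    """
    fields = line.strip().split("||")
    if len(fields) < 2 or len(fields) % 2 != 0:
        raise ValueError("expected an even number of '||'-separated fields")
    entity = fields[0]
    entity_score = float(fields[1])
    # Remaining fields come in (surface form, score) pairs.
    surface_forms = [
        (fields[i], float(fields[i + 1])) for i in range(2, len(fields), 2)
    ]
    return entity, entity_score, surface_forms


line = "Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907"
entity, score, forms = parse_wikification_line(line)
print(entity, score, forms)
```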
7. Aspect Extraction
Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
President_of_the_United_States||0.9707||president of the United States||0.9918
…
PoS tagging aspect extraction
Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT 44th/JJ
president/NN of/IN the/DT United/NNP States/NNPS on/IN Jan./NNP 20/CD ,/, 2009/CD ./.
…
Barack_Obama            President_of_the_United_States
Barack Hussein Obama    Barack Hussein Obama
president               president
United States           United States
Jan                     Jan
44th president          44th president
8. Sentiment Analysis
The iPhone is in general very good, however, its battery life is very bad
…
distance=10
SentiStrength
distance=3
The iPhone is in general very good[2][+1 booster word],however ,its battery life is very bad[-2][-1 booster word][sentence: 3,-3] [result: max + and - of any sentence]
very good||3||3
very bad||-3||10
Score = 3/3 + (-3)/10 = 0.7
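The score on this slide can be reproduced with a short sketch. It assumes the score is the sum of each sentiment phrase's strength divided by its distance, which is how the slide's formula reads. The slide does not define the distance unit, and `distance_weighted_score` is a hypothetical name.

```python
def distance_weighted_score(mentions):
    """Sum strength / distance over (strength, distance) pairs.

    Assumption: phrases farther from the mention contribute less,
    as suggested by the slide's "3/3 + -3/10".
    """
    total = 0.0
    for strength, distance in mentions:
        if distance <= 0:
            raise ValueError("distance must be positive")
        total += strength / distance
    return total


# Values from the slide: "very good" (+3, distance 3), "very bad" (-3, distance 10).
mentions = [(3, 3), (-3, 10)]
print(round(distance_weighted_score(mentions), 2))  # prints 0.7
```

With this weighting, a strongly negative phrase far from the mention lowers the score only slightly, so the example sentence ends up mildly positive.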
9. Our work is
• Applying the previous steps to
– Over 2 billion English pages
– Wikipedia entities (over 3.5 million entities)
• Mostly using
– Hadoop
– Pig
10. Experiments
• Lack of ground truth
• Correlations to real-world factors
• Three experiments
– Countries
– Countries’ economy
– Grammy award winners
11. Countries
• Travel
Costa Rica positive aspects
Top 10 positively mentioned
• Axis of Evil
• BBC Poll
Iran negative aspects
Top 10 negatively mentioned
Israel negative aspects
12. Countries’ economy
• Correlation between sentiment scores and
countries’ nominal GDP
• Normalized scores vs. non-normalized scores
14. Conclusion
• Analysis
– Methodology for correlating sentiments with other real-world factors
– Experiments
• Pipeline
– Big data
– Can be an online in-production system
• Future work
– Restricting the analysis to a subset of the Web, e.g., blogs
– Sentiment scoring scheme (taking the volume problem
into account)
Explosive growth
User Generated Content (UGC)
– Question-answering databases
– Digital video
– Blogging
– Social networks
– Wikis
Self-expression and opinionated content
Web of Concepts – or entities
Google 2008: over one trillion unique URLs
Indexed web: at least 8.47 billion pages
Opinion summaries allow for discovering all kinds of fun facts
– Messi vs. Ronaldo
– France’s economy vs. Spain’s economy
They also allow for something more interesting: further studies of the relationship between sentiments as discovered on the Web and other real-world factors
We build a system that
– Is a simple yet effective approach capable of handling sentiments from all over the Web
– Generates opinion summaries for entities
– Generates opinion summaries for entities’ aspects
The system we are building here “allows for interesting types of analysis”
The Web is mostly in HTML. We need to be able to get the text out of it.
Boilerpipe is a machine-learned classifier that uses shallow text features, such as word counts, to extract text from HTML.
Stanford CoreNLP allows for sentence splitting on common sentence ends such as full stops, question marks, and exclamation marks.
An in-house proprietary tool that uses machine learning to learn a model able to infer the topics of a given text.
Wikipedia entities allow for rich information about entities.
An aspect is a predefined sequence of PoS tags.
We use two main patterns: nouns, and adjectives followed by nouns.
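The two patterns can be sketched over the tagged sentence from the Aspect Extraction slide. This is a minimal illustration, assuming Penn Treebank tags, that a "noun" pattern means a maximal run of consecutive noun tokens, and that the adjective pattern is any adjectives directly preceding such a run. The names `parse_tagged` and `extract_aspects` are hypothetical and not taken from the actual pipeline.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def extract_aspects(pairs):
    """Return noun runs, plus adjective+noun runs where adjectives precede."""
    aspects = []
    i, n = 0, len(pairs)
    while i < n:
        j = i
        while j < n and pairs[j][1] in ADJ_TAGS:  # optional adjectives
            j += 1
        k = j
        while k < n and pairs[k][1] in NOUN_TAGS:  # maximal noun run
            k += 1
        if k > j:
            nouns = " ".join(word for word, _ in pairs[j:k])
            aspects.append(nouns)
            if j > i:
                adjectives = " ".join(word for word, _ in pairs[i:j])
                aspects.append(adjectives + " " + nouns)
            i = k
        else:
            i = max(j, i + 1)  # skip adjectives without a noun, or one token
    return aspects


tagged = (
    "Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT "
    "44th/JJ president/NN of/IN the/DT United/NNP States/NNPS on/IN "
    "Jan./NNP 20/CD ,/, 2009/CD ./."
)
print(extract_aspects(parse_tagged(tagged)))
```

On this sentence the sketch yields the same aspects as the slide's table, except that it keeps the trailing period in "Jan."; the slide shows "Jan", so the real pipeline presumably strips punctuation.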
Ranking countries by sentiments
Most frequent sentimental aspects
Normalized vs. non-normalized scores
RANKING
RANKINGS AND CORRELATIONS FOR RANKINGS
Are sentiments associated with Grammy Award winners different from those associated with other musicians?
Statistical tests
1. Correlations with Grammy
2. Inequality of scores
3. Positive score to predict a Grammy winner
Receiver Operating Characteristic (ROC) not shown
Analysis
Experiments
– Countries: really different, in the sense that we picked up a good signal whether we normalize or not
– GDP: unfortunately we didn’t get the expected results; frequency tended to dominate the sentiments. Maybe GDP is not the right criterion to compare against (unemployment rate might be a better one), or maybe the volume problem is just inherently there
– Grammy: it worked, though without a strong correlation, when restricting frequencies and normalizing
Sentiments vs. volume
Big Data
– If something can go wrong, it will definitely go wrong
– We had to choose simple, effective approaches that can scale easily
Online in-production system
– I imagine it running in parallel with the web crawlers, doing its analysis and updating the summaries
– The chosen methods also allow for continuous updates; generating the summaries doesn’t require the presence of the whole set of web pages at once
INTERNSHIP STILL GOING ON