Master DMKM Presentation: Entity Aspect Analysis. By: Ahmed Kamel. Supervision: Ingmar Weber, Yahoo! Labs Barcelona ...
Entity Aspect Analysis

DMKM Master Thesis Presentation

Published in: Technology, News & Politics
  • Explosive growth
    – User Generated Content (UGC): question-answering databases, digital video, blogging, social networks, wikis
    – Self-expression and opinionated content
    – Web of Concepts, or entities
    – Google, 2008: over one trillion unique URLs
    – Indexed web: at least 8.47 billion pages
  • Opinion summaries allow for discovering all kinds of fun facts
    – Messi vs. Ronaldo
    – France’s economy vs. Spain’s economy
    They also allow for something more interesting: further studies of the relationship between sentiments as discovered on the Web and other real-world factors.
  • We build a system that
    – Is a simple yet effective approach capable of handling sentiments from all over the Web
    – Generates opinion summaries for entities
    – Generates opinion summaries for entities’ aspects
    The system we are building here “allows for interesting types of analysis”.
  • The Web is mostly in HTML, so we need to be able to get the text out of it
    – Boilerpipe is a machine-learned classifier that uses shallow text features, such as word counts, to extract text from HTML
    – Stanford CoreNLP allows for sentence splitting on common sentence ends such as full stops, question marks and exclamation marks
  • Entity recognition uses an in-house proprietary tool that uses machine learning to learn a model able to infer the topics of a given text
    – Wikipedia entities allow for rich information about entities
  • An aspect is a predefined sequence of POS tags
    – We use two main patterns: nouns, and adjectives followed by nouns
  • Ranking countries by sentiments
    – Most frequent sentimental aspects
    – Normalized vs. non-normalized scores
    RANKING
  • RANKINGS AND CORRELATIONS FOR RANKINGS
  • Are sentiments associated with Grammy Award winners different from those associated with other musicians?
    Statistical tests
    1. Correlations with Grammy
    2. Inequality of scores
    3. Positive score to predict a Grammy winner: Receiver Operating Characteristic (ROC), not shown
  • Analysis
    – Experiments
      Countries: really different, in the sense that we picked up a good signal whether we normalize or not
      GDP: unfortunately we didn’t get the expected results; frequency tended to outweigh the sentiments. Maybe GDP is not the right criterion to compare against (unemployment rate might be better), or maybe the volume problem is just inherently there
      Grammy: it worked, though without a strong correlation, when restricting frequencies and normalizing
    – Sentiments vs. volume
    Big data
    – If something can go wrong, it will definitely go wrong
    – We had to choose simple, effective approaches that can scale easily
    Online in-production system
    – I imagine it running in parallel with the web crawlers, doing its analysis and updating the summaries
    – The chosen methods also allow for continuous updates: generating the summaries doesn’t require the whole set of web pages to be present at once
    INTERNSHIP STILL GOING ON
  • Entity Aspect Analysis
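The aspect definition in the notes above (a predefined sequence of POS tags, with noun and adjective-noun patterns) can be sketched in a few lines. This is only an illustrative sketch, not the pipeline's actual code: the function names `parse_tagged` and `extract_aspects` are hypothetical, the input is assumed to be `word/TAG` tokens with Penn Treebank tags as on slide 7, and stripping the trailing period from "Jan." is an assumption made to match the aspects listed on that slide.

```python
# Hypothetical sketch: extract aspects as noun runs and adjective+noun runs.

SENTENCE = (
    "Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT "
    "44th/JJ president/NN of/IN the/DT United/NNP States/NNPS on/IN "
    "Jan./NNP 20/CD ,/, 2009/CD ./."
)


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        # Assumption: drop trailing periods so that 'Jan.' becomes 'Jan'.
        pairs.append((word.rstrip("."), tag))
    return pairs


def extract_aspects(pairs):
    """Return noun runs, plus adjective+noun runs when adjectives precede."""
    aspects = []
    i, n = 0, len(pairs)
    while i < n:
        # Optional run of adjectives (JJ, JJR, JJS).
        j = i
        while j < n and pairs[j][1].startswith("JJ"):
            j += 1
        # Run of nouns (NN, NNS, NNP, NNPS).
        k = j
        while k < n and pairs[k][1].startswith("NN"):
            k += 1
        if k > j:
            aspects.append(" ".join(word for word, _ in pairs[j:k]))
            if j > i:
                aspects.append(" ".join(word for word, _ in pairs[i:k]))
            i = k
        else:
            # No noun run here: skip the adjectives or the current token.
            i = max(j, i + 1)
    return aspects


if __name__ == "__main__":
    for aspect in extract_aspects(parse_tagged(SENTENCE)):
        print(aspect)
```

On the slide's example sentence this yields the same five aspects the slide lists, each of which would then be paired with every entity recognized in the sentence.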

    1. Master DMKM Presentation: Entity Aspect Analysis. By: Ahmed Kamel. Supervision: Ingmar Weber, Yahoo! Labs Barcelona; Marta Arias, Universitat Politècnica de Catalunya. Location: Yahoo! Labs Barcelona
    2. The Web
    3. Opinion Summarization

       Entity             Freq     +Freq    -Freq   +Score    -Score     Score
       Lionel_Messi       378,076  283,450  94,626  89,386.5  -29,449.3  59,937.2
       Cristiano_Ronaldo  312,338  228,480  83,858  72,342.7  -27,883.2  44,459.5

       Entity  EFreq       Aspect   EAFreq  +EAFreq  -EAFreq  +Score  -Score  Score
       France  11,697,238  economy  2,633   1,452    1,181    469.2   -390.6  78.5
       Spain   6,602,450   economy  1,561   620      941      211.7   -312.2  -100.3
    4. Architecture
    5. Text Extraction: Boilerpipe, Stanford CoreNLP
       Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
       The son of a black man from Kenya and a white woman from Kansas, he is the first African-American to ascend to the highest office in the land.
       He defeated Hillary Rodham Clinton in a lengthy and bitter primary battle before defeating Senator John McCain, the Arizona Republican, in November 2008.
       …
    6. Entity Recognition
       Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
       …
       Entity Recognition (Wikification)
       Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
       Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
       President_of_the_United_States||0.9707||president of the United States||0.9918
       …
    7. Aspect Extraction
       Barack Hussein Obama was sworn in as the 44th president of the United States on Jan. 20, 2009.
       Barack_Obama||0.9727||Barack||0.9868||Barack Hussein Obama||0.9907
       President_of_the_United_States||0.9707||president of the United States||0.9918
       …
       PoS tagging, aspect extraction
       Barack/NNP Hussein/NNP Obama/NNP was/VBD sworn/VBN in/IN as/IN the/DT 44th/JJ president/NN of/IN the/DT United/NNP States/NNPS on/IN Jan./NNP 20/CD ,/, 2009/CD ./.
       …

       Barack_Obama          President_of_the_United_States
       Barack Hussein Obama  Barack Hussein Obama
       president             president
       United States         United States
       Jan                   Jan
       44th president        44th president
    8. Sentiment Analysis
       The iPhone is in general very good, however, its battery life is very bad…
       distance=10, SentiStrength, distance=3
       The iPhone is in general very good[2][+1 booster word], however, its battery life is very bad[-2][-1 booster word] [sentence: 3,-3] [result: max + and - of any sentence]
       very good||3||3
       very bad||-3||10
       Score = 3/3 + -3/10
    9. Our work is
       • Doing the previous for
         – Over 2 billion English pages
         – Wikipedia entities (over 3.5 million entities)
       • Mostly using
         – Hadoop
         – Pig
    10. Experiments
        • Lack of ground truth
        • Correlations to real-world factors
        • Three experiments
          – Countries
          – Countries’ economy
          – Grammy award winners
    11. Countries
        Top 10 positively mentioned
        • Travel
        Costa Rica positive aspects
        Top 10 negatively mentioned
        • Axis of Evil
        • BBC Poll
        Iran negative aspects
        Israel negative aspects
    12. Countries’ economy
        • Correlation between sentiment scores and countries’ nominal GDP
        • Normalized scores vs. non-normalized scores
    13. Grammy Award Winners
        Correlations with Grammy
        Inequality of scores
    14. Conclusion
        • Analysis
          – Methodology for correlating sentiments with other real-world factors
          – Experiments
        • Pipeline
          – Big data
          – Can be an online in-production system
        • Future work
          – Restricting the analysis to a subset of the Web, e.g., blogs
          – Sentiment scoring scheme (taking the volume problem into account)
    15. Thanks, Merci, Gràcies – Gracias, Danke, Teşekkürler
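The scoring shown on slide 8 (Score = 3/3 + -3/10) and the summary columns on slide 3 can be illustrated with a short sketch. It assumes, based only on those two slides, that each sentiment phrase contributes its strength divided by its distance to the entity, and that a summary's Score is the sum of its +Score and -Score. The names `mention_score` and `OpinionSummary` are hypothetical, and counting a mention as positive or negative by the sign of its score is an assumption, not something the slides state.

```python
from dataclasses import dataclass


def mention_score(phrases):
    """Sum strength / distance over (strength, distance) pairs.

    Assumes every distance is at least 1 (a phrase is never at distance 0).
    """
    return sum(strength / distance for strength, distance in phrases)


@dataclass
class OpinionSummary:
    """Running totals for one entity, or one entity-aspect pair."""
    pos_freq: int = 0
    neg_freq: int = 0
    pos_score: float = 0.0
    neg_score: float = 0.0

    def add(self, score):
        # Assumption: the sign of a mention's score decides its bucket;
        # mentions scoring exactly zero are ignored.
        if score > 0:
            self.pos_freq += 1
            self.pos_score += score
        elif score < 0:
            self.neg_freq += 1
            self.neg_score += score

    @property
    def freq(self):
        return self.pos_freq + self.neg_freq

    @property
    def score(self):
        return self.pos_score + self.neg_score


if __name__ == "__main__":
    # Slide 8: 'very good' (strength 3, distance 3), 'very bad' (-3, 10).
    iphone = mention_score([(3, 3), (-3, 10)])
    summary = OpinionSummary()
    summary.add(iphone)
    summary.add(mention_score([(-2, 4)]))
    print(summary.freq, summary.pos_freq, summary.neg_freq, summary.score)
```

Because the totals are plain running sums, a summary can be updated one page at a time, which fits the conclusion's point that the whole set of web pages need not be present at once.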
