Social Media = Big Data
Gartner 3V definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
- Twitter has ~20 000 000 users per month
- They write ~500 000 000 messages per day
Massive variety:
- Stock markets
- Earthquakes
- Social arrangements
- … Bieber
What resources do we have now?
- Large, content-rich, linked, digital streams of human communication
- We transfer knowledge via communication
- Sampling communication gives a sample of human knowledge
- You've only done that which you can communicate
- The metadata (time, place, imagery) gives a richer resource:
  → A sampling of human behaviour
Linking these resources
Why is it useful to link this data?
- Identifying subjects, themes, … entities
- What's the text about?
- Who'd be interested in it?
- How can we summarise it? (very important, given the information overload in social media!)
Pipeline cumulative effect
Good performance is important at each stage – not just entity linking
Language ID
Microblog:
  LADY GAGA IS BETTER THE 5th TIME OH BABY(:
Newswire:
  The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse.
Language ID difficulties
General accuracy on microblogs: 89.5%
Problems include switching language mid-text:
  je bent Jacques Cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it
  (Dutch: "you are not Jacques Cousteau, who has discovered a new species; it's obvious, they cover their faces." Then English: "Get over it")
New information in this format:
- Metadata: spatial information, linked URLs
- Emoticons: :) vs. ^_^, cu vs. 88
Accuracy when customised to genre: 97.4%
Tokenisation
General accuracy on microblogs: 80%
Goal is to convert a byte stream into readily-digestible word chunks
Word boundary discovery is a critical language acquisition task
Newswire:
  The LIBYAN AID Team successfully shipped these broadcasting equipment to Misrata last August 2011, to establish an FM Radio station ranging 600km, broadcasting to the west side of Libya to help overthrow Gaddafi's regime.
Microblog:
  RT @JosetteSheeran: @WFP #Libya breakthru! We move urgently needed #food (wheat, flour) by truck convoy into western Libya for 1st time :D
Tokenisation difficulties
- Not curated, so typos
- Improper grammar, e.g. apostrophe usage; live with it!
  doesnt → doesnt
  doesnt → does nt
- Smileys and emoticons
  I <3 you → I & lt ; you
  This piece ;,,( so emotional → this piece ; , , ( so emotional
  Loss of information (sentiment)
- Punctuation for emphasis
  *HUGS YOU* *KISSES YOU* → * HUGS YOU**KISSES YOU *
- Words run together
Tokenisation fixes
- Custom tokeniser!
- Apostrophe insertion
- Slang unpacking: Ima get u → I'm going to get you
- Emoticon rules: if we can spot them, we know not to break them
Customised accuracy on microblogs: 96%
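The "spot emoticons first, so the fallback punctuation rule can't break them" idea can be sketched as a single prioritised regex. This is a minimal illustration in our own naming, not any published tokeniser; the emoticon pattern covers only a handful of shapes.

```python
import re

# Emoticons are matched first, so the catch-all punctuation rule
# below never gets the chance to split them apart.
EMOTICON = r"[:;=][\-',]?[)(DPpOo|/\\]|\^_\^|<3"
TOKEN = re.compile(
    rf"({EMOTICON})"       # emoticons, kept whole
    r"|(@\w+|#\w+)"        # @-mentions and hashtags, kept whole
    r"|(\w+(?:'\w+)?)"     # words, with optional internal apostrophe
    r"|(\S)"               # any other single non-space character
)

def tokenise(text):
    return [m.group(0) for m in TOKEN.finditer(text)]
```

With the emoticon rule in place, `I <3 you :)` stays four tokens instead of shattering into punctuation, while genuine punctuation runs like `;,,(` still split character by character.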
Part of speech tagging
Goal is to assign words to classes (verb, noun, etc.)
General accuracy on newswire: 97.3% token, 56.8% sentence
General accuracy on microblogs: 73.6% token, 4.24% sentence
Sentence-level accuracy is important: without the whole sentence correct, it is difficult to extract syntax
Part of speech tagging difficulties
Many unknowns:
- Music bands: Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010
- Places: #LB #news: Silverado Park Pool Swim Lessons
Capitalisation way off:
- @thewantedmusic on my tv :) aka derek
- last day of sorting pope visit to birmingham stuff out
Slang:
- ~HAPPY B-DAY TAYLOR !!! LUVZ YA~
Orthographic errors:
- dont even have homwork today, suprising ?
Dialect:
- Shall we go out for dinner this evening?
- Ey yo wen u gon let me tap dat
Part of speech tagging fixes
- Slang dictionary for repair (won't cover previously-unseen slang)
- In-genre labelled data (expensive to create!)
- Leverage ML: existing taggers can handle unknown words; maximise use of these features!
General accuracy on microblogs: 73.6% token, 4.24% sentence
Accuracy when customised to genre: 88.4% token, 25.40% sentence
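The slang-dictionary repair step amounts to rewriting known noisy tokens to canonical forms before the tagger sees them. A toy sketch, with an invented handful of mappings (a real system would use a curated in-genre lexicon, and would still miss unseen slang):

```python
# Illustrative slang lexicon -- entries are our own examples
SLANG = {
    "u": "you", "ya": "you", "luvz": "loves",
    "wen": "when", "gon": "going to", "dont": "don't",
}

def repair(tokens):
    """Replace known slang tokens with canonical forms; a single
    slang token may expand to several words (gon -> going to)."""
    out = []
    for tok in tokens:
        out.extend(SLANG.get(tok.lower(), tok).split())
    return out
```

So the dialect example above, `Ey yo wen u gon ...`, is repaired to `Ey yo when you going to ...` before tagging; anything outside the dictionary passes through untouched, which is exactly the "previously-unseen slang" gap.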
Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
Microblog:
  Gotta dress up for london fashion week and party in style!!!
Newswire:
  London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
NER difficulties
- Rule-based systems get the bulk of entities (newswire: 77% F1)
- ML-based systems do well on the remainder (newswire: 89% F1)
- A small proportion of difficult entities; many complex issues
Using the improved pipeline:
- ML struggles, even with in-genre data: 49% F1
- Rules cut through microblog noise: 80% F1
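Why do rules cut through the noise? A rule-based recogniser can fire on a gazetteer hit even when capitalisation is gone, where an ML tagger trained on newswire leans on capitalisation features. A minimal sketch (gazetteer entries and the capitalised-sequence rule are illustrative, not a real system):

```python
import re

# Tiny illustrative gazetteer -- real systems use large curated lists
GAZETTEER = {"london fashion week": "EVENT", "misrata": "LOC", "libya": "LOC"}

def find_entities(text):
    """Gazetteer hits (case-insensitive) plus a crude rule that
    flags runs of two or more capitalised words as candidates."""
    found = []
    low = text.lower()
    for name, etype in GAZETTEER.items():
        if name in low:
            found.append((name, etype))
    for m in re.finditer(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text):
        found.append((m.group(0), "UNK"))
    return found
```

The gazetteer rule still catches `london fashion week` in the all-lowercase tweet above, which is exactly where capitalisation-driven ML features fail.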
NER on Facebook
- Longer texts than tweets
- Still has an informal tone
- MWEs are a problem, especially when all-capitalised:
  Green Europe Imperiled as Debt Crises Triggers Carbon Market Drop
- Difficult, though easier than Twitter: maybe due to the possibility of including more verbal context?
Entity linking
Goal is to find out which entity a mention refers to
  The murderer was Professor Plum, in the Library, with the Candlestick!
Which Professor Plum?
Disambiguation works by connecting the text to the web of data:
  dbpedia.org/resource/Professor_Plum_(astrophysicist)
Two tasks:
- Whole-text linking
- Entity-level linking
Aboutness
Goal: answer "What entities is this text about?"
Good for tweets:
- Lack of lexicalised context
- Not all related concepts are in the text
- Helpful for summarisation
- No concern for entity bounds (finding them is tough in microblogs!)
But:
- Added concern for themes in the text, e.g. marketing, US elections
Aboutness performance
Corpus: from Meij et al., "Adding semantics to microblog posts"
- 468 tweets
- From one to six concepts per tweet
Results:
- DBpedia Spotlight: highest recall (47.5)
- TextRazor: highest precision (64.6)
- Zemanta: highest F1 (41.0)
Zemanta is tuned for blog entries, so it compensates for some noise
Word-level linking
Goal is to link an entity, given:
- The entity mention
- Surrounding microblog context
No corpora exist for this exact task:
- Two commercially produced ones exist, but policy says no sharing
How can we approach this key task?
Word-level linking performance
Dataset: RepLab
- Task is to determine relatedness-or-not
- Six entities given
- A few hundred tweets per entity
- Detect mentions of the entity in tweets
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 around 70
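A common baseline for the disambiguation step is Lesk-style context overlap: link the mention to the candidate whose description shares the most words with the tweet. The candidate URIs and descriptions below are invented for illustration; real systems would pull abstracts from DBpedia.

```python
# Hypothetical candidate set for the mention "Lufthansa Cargo";
# descriptions are our own stand-ins for DBpedia abstracts.
CANDIDATES = {
    "dbpedia.org/resource/Lufthansa":
        "german airline passenger flights aviation",
    "dbpedia.org/resource/Lufthansa_Cargo":
        "cargo freight logistics airline subsidiary",
}

def disambiguate(tweet_context):
    """Pick the candidate whose description shares the most
    word types with the tweet context."""
    context = set(tweet_context.lower().split())
    def score(uri):
        return len(context & set(CANDIDATES[uri].split()))
    return max(CANDIDATES, key=score)
```

The weakness is visible immediately: a short tweet offers very few overlap words, which is the "lack of disambiguation context" problem raised on the next slide.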
Word-level linking issues
NER errors:
- Missed entities damage / destroy linking
Specificity problems:
- Lufthansa vs. Lufthansa Cargo: which organisation to choose?
- Requires good NER
Direct-linking chunking reduces precision:
  Apple trees in the home garden bit.ly/yOztKs
  Pipeline NER does not mark "Apple" as an entity here
Lack of disambiguation context is a problem!
Word-level linking issues
Automatic annotation:
  Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!
Actual:
  Branching out from Lincoln park after dark(ORG) ... Hello "Russian Navy", it's like the same thing but with glitter!
Clue in unusual collocations?
Whole pipeline: how to fix?
Common genre problems centre on mucky, uncurated text:
- Orthographic errors
- Slang
- Brevity
- Condensed language
- Non-Chicago punctuation
Maybe clearing this up will improve performance?
Normalisation
A general solution for overcoming linguistic noise
How to repair?
1. Gazetteer (quick & dirty); or
2. Noisy channel model: the task is to reverse-engineer the noise on the channel
Techniques: Brown clustering; double metaphone; automatic orthographic correction
  An honest, well-formed sentence → (noisy channel) → u wot m8 biber #lol
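The two repair routes can be sketched together: a gazetteer lookup for known slang, with a fuzzy lexicon match as a crude stand-in for channel decoding. Both word lists are illustrative only; real systems use large lexicons and proper channel models rather than `difflib` similarity.

```python
from difflib import get_close_matches

# Illustrative resources -- not from any real normalisation system
GAZETTEER = {"u": "you", "wot": "what", "m8": "mate", "2moro": "tomorrow"}
LEXICON = ["surprising", "homework", "tomorrow", "because", "people"]

def normalise(token):
    """Gazetteer lookup first (quick & dirty), then a fuzzy match
    against a lexicon of well-formed words; otherwise unchanged."""
    low = token.lower()
    if low in GAZETTEER:
        return GAZETTEER[low]
    close = get_close_matches(low, LEXICON, n=1, cutoff=0.8)
    return close[0] if close else token
```

The fuzzy step repairs `homwork → homework` and `suprising → surprising`, but the same mechanism is what introduces the meaning-change errors reported on the next slide: a near-miss match can rewrite a token that was already correct.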
Normalisation performance
NER on tweets:
Rule-based:
- No normalisation: F1 80%
- Gazetteer normalisation: F1 81%
- Noisy channel: F1 81%
ML-based:
- No normalisation: F1 49.1%
- Gazetteer normalisation: F1 47.6%
- Noisy channel: F1 49.3%
Negligible performance impact, and it introduces errors!
- Sentiment change: undisambiguable → disambiguable
- Meaning change: She has Huntingtons → She has Huntingdons
Future directions
MORE DATA!
- and better: no IAA for many resources
- Maybe from the crowd?
MORE CONTEXT!
- Not just linguistic: microblogs have a host of metadata
- Explicit: time, place, URIs, hashtags
- Implicit: friend network, previous messages
Thank you!
Thank you for listening!
Do you have any questions?