Challenging Retrieval Scenarios: Social Media and Linked Open Data

Invited talk given in April 2012 at USI in Lugano at the IR research group of Fabio Crestani. Review of the work on interestingness on Twitter and schema-based indices on Linked Open Data (SchemEX).

Slide notes
  • Online communities as mainly perceived in public: private users, social interaction, sharing of information, pictures, and online resources.
  • Business communities are slightly different: grouped around a business/enterprise, with the aim of adding value to the business (knowledge management, public relations, customer acquisition, support, etc.). Conclusion: communities have a value that needs to be taken care of. This value is endangered by risks (e.g. experts leaving) or might be increased by seizing opportunities (e.g. connecting people working on the same topic).
  • Main objectives as listed in the DoW (Description of Work).

Transcript

  • 1. Institute for Web Science & Technologies – WeST. Challenging Retrieval Scenarios: Social Media and Linked Open Data. Dr. Thomas Gottron, gottron@uni-koblenz.de
  • 2. Outline. The ROBUST project: background, use cases. Retrieval on microblogs: particularities of Twitter, interestingness, LiveTweet. Search on the LOD cloud: querying LOD as an IR task, schema extraction, SchemEX.
  • 3. Online Communities
  • 4. Business Communities. Information ecosystems: employees; business partners and customers; the general public. A valuable asset, subject to risks and opportunities.
  • 5. High Level Objectives. Risk Management: risk modelling, detection, automatic reaction. Community Analysis: contents, single users, entire communities. Community Forecasting: policies, prediction, decision support. Large Scale Processing: big data, realtime, parallel processing.
  • 6. Scenario 1: Social Media – Microblogs
  • 7. IBM Connections
  • 8. Twitter. Followers subscribe to a user; example tweet by @janedoe: "My dear @johndoe had troubles to wake up this #morning"
  • 9. Retrieval on Twitter: First Steps. 10 million tweets, standard retrieval engine, query: beer.

    Rank  User            Tweet
    1     LoriAG          beer
    2     Crushdwinebar   beer!!
    3     Skippertaylor   BEER
    4     BigMacScola     Beer
    5     VANiamore       beer.......
    6     CindyMcManis    To beer or not to beer on Beer Summit ?
    7     silverlakewine  beer beer beer beer beer beer beer. Simple 3pm
    8     eldoradobar     http://ping.fm/p/Bnra7 - In!!! BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER,
    9     tonx            Lompoc. beer beer beer beer beer beer beer beer beer beer. http://twitpic.com/l68ld
    10    punkeyfunky     Beer beer beer beer beer beer beer beer beer beer beer beer beer. Er, guess what Im looking forward to?
  • 10. Particularities of Twitter
  • 11. Twitter is different. Maximum length: 140 characters. [Histogram: number of tweets over tweet length in characters (original axis label "Zeichen"), lengths 1–141.]
  • 12. Twitter is different. 140 characters = few words; 85% of tweets contain each word only once, so term frequency is effectively a binary value. [Log-scale histogram: number of tweets over maximum term frequency within a tweet.]
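    To make the binary-TF observation concrete, a minimal sketch (not from the talk) of how one would measure the maximum term frequency per tweet; whitespace tokenisation is a simplifying assumption:

    from collections import Counter

    def max_tf(tweet):
        """Maximum term frequency within a single tweet.
        On the slides' data this is 1 for ~85% of tweets, so TF
        carries little signal beyond presence/absence."""
        counts = Counter(tweet.lower().split())
        return max(counts.values(), default=0)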
  • 13. Length normalisation. Why are some documents longer? Classic explanations: Verbosity hypothesis: long documents repeat themselves; short documents are preferred as they are more concise. Scope hypothesis: long documents address more topics; short documents are preferred as they are more focused. Intuition: neither holds for Twitter.
  • 14. Verbosity hypothesis and Twitter?
  • 15. Scope hypothesis and Twitter? Are long tweets broader in scope? LDA with 100 topics. Observations: 8.5% of tweets have no strong topic; of the remaining tweets, 77.1% are dominated by one topic and 99.6% by two topics.
  • 16. Length normalisation on tweets: not necessary! Negative impact? YES: short tweets are preferred ("Beer!") and long tweets are considered too wide in scope ("Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV").
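    A minimal sketch of what this means in scoring terms, assuming a BM25-style ranking function (BM25 is not named on the slides): setting the length-normalisation parameter b to 0 removes the penalty on longer tweets.

    import math

    def bm25(query_terms, doc_terms, df, n_docs, avg_len, k1=1.2, b=0.0):
        """BM25 with b=0: length normalisation disabled, matching the
        slide's conclusion for tweets (b=0.75 is the common default)."""
        doc_len = len(doc_terms)
        score = 0.0
        for term in query_terms:
            tf = doc_terms.count(term)
            if tf == 0 or term not in df:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * ((1 - b) + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        return score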
  • 17. Interestingness
  • 18. Interesting Content. Concept of "relevance" in IR: a document is about a topic. Additionally for Twitter: timeliness, current trends, informativeness. Interestingness: a tweet is about a topic AND is interesting! Question: how to determine what is interesting?
  • 19. Retweets. @janedoe tweets "My dear @johndoe had troubles to wake up this #morning"; a follower retweets it as "RT @janedoe: My dear @johndoe had troubles to wake up this #morning".
  • 20. Retweets. A retweet indicates quality: "of interest for others". It depends on content and on context (time, followers). Idea: learn to predict retweets! Use the likelihood of a retweet as a metric for interestingness.
  • 21. Retweets: Prediction model.

    Dataset               Users      Tweets      Retweets
    Choudhury             118,506    9,998,756   7.89%
    Choudhury (extended)  277,666    29,000,000  8.64%
    Petrovic              4,050,944  21,477,484  8.46%
  • 22. Logistic Regression: Weights.

    Feature group    Dimension       Weight
    Constant         (intercept)     -5.45
    Message feature  Direct message  -147.89
                     Username        146.82
                     Hashtag         42.27
                     URL             249.09
    Sentiment        Valence         -26.88
                     Arousal         33.97
                     Dominance       19.56
    Emoticons        Positive        -21.8
                     Negative        9.94
    Exclamation      Positive        13.66
                     Negative        8.72
    Punctuation      !               -16.85
                     ?               23.67
    Terms            Odds            19.79
  • 23. Logistic Regression: Topic Weights.

    Topic (top terms)                                           Weight
    social media market post site web tool traffic network      27.54
    follow thank twitter welcome hello check nice cool people   16.08
    credit money market business rate economy home              15.25
    christmas shop tree xmas present today wrap finish          2.87
    home work hour long wait airport week flight head           -14.43
    twitter update facebook account page set squidoo check      -14.43
    cold snow warm today degree weather winter morning          -26.56
    night sleep work morning time bed feel tired home           -75.19
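    A hedged sketch of the overall setup (not the actual ROBUST/LiveTweet code): logistic regression over per-tweet features, with the predicted retweet probability used as the interestingness score. The feature extraction below is a toy subset; the sentiment (valence/arousal/dominance) and topic features from the tables above would additionally require an affect lexicon and a trained LDA model.

    import re
    from sklearn.linear_model import LogisticRegression

    def tweet_features(text):
        # Structural message features, loosely following the weight table
        return [
            1 if text.startswith("@") else 0,           # direct message
            1 if "@" in text else 0,                    # mentions a username
            1 if "#" in text else 0,                    # contains a hashtag
            1 if re.search(r"https?://", text) else 0,  # contains a URL
            text.count("!"),                            # punctuation: !
            text.count("?"),                            # punctuation: ?
        ]

    # tweets: (text, was_retweeted) pairs, e.g. from one of the datasets above
    tweets = [("Check this out! http://example.org #news", 1),
              ("@johndoe good morning", 0)]
    X = [tweet_features(text) for text, _ in tweets]
    y = [label for _, label in tweets]

    model = LogisticRegression().fit(X, y)
    interestingness = model.predict_proba(X)[:, 1]  # retweet likelihood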
  • 24. Re-Ranking using Interestingness. Take the top-k relevant tweets and re-rank them based on interestingness:

    Rank  Username      Tweet
    1     BeeracrossTX  UK beer mag declares "the end of beer writing." @StanHieronymus says not so in the US. http://bit.ly/424HRQ #beer
    2     narmmusic     beer summit @bspward @jhinderaker no one had billy beer? heehee #narm - beer summit @bspward @jhinde http://tinyurl.com/n29oxj
    3     beeriety      Go green and turn those empty beer bottles into recycled beer glasses! | http://bit.ly/2src7F #beer #recycle (via: @td333)
    4     hblackmon     Great Divide beer dinner @ Porter Beer Bar on 8/19 - $45 for 3 courses + beer pairings. http://trunc.it/172wt
    5     nycraftbeer   Interesting Concept-Beer Petitions.com launches&hopes 2help craft beer drinkers enjoy beer they want @their fave pubs. http://bit.ly/11gJQN
    6     carichardson  Beer Cheddar Soup: Dish number two in my famed beer dinner series is Beer Cheddar Soup. I hadn't had too.. http://bit.ly/1diDdF
    7     BeerBrewing   New York City Beer Events - Beer Tasting - New York Beer Festivals - New York Craft Beer http://is.gd/39kXj #beer
    8     delphiforums  Love beer? Our member is trying to build up a new beer drinkers forum. Grab a #beer and join us: http://tr.im/pD1n
    9     Jamie_Mason   #Baltimore Beer Week continues w/ a beer brkfst, beer pioneers luncheon, drink & donate event, beer tastings & more. http://ping.fm/VyTwg
    10    carichardson  Seattle and Beer: I went to Seattle last weekend. It was my friend's stag - he likes beer - we drank beer.. http://tinyurl.com/cpb4n9
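    The re-ranking step itself is straightforward; a minimal sketch (function name and cutoff k are illustrative, not from the talk):

    def rerank_by_interestingness(ranked_tweets, score, k=100):
        """Keep the top-k relevance-ranked tweets, then reorder them by a
        precomputed interestingness score (e.g. predicted retweet likelihood)."""
        return sorted(ranked_tweets[:k], key=score, reverse=True)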
  • 25. Application
  • 26. LiveTweet. Data: Twitter streaming API sample, 1% of all tweets. Architecture: time slices over tweets, an analytical component with a REST API, and a web frontend for end users.
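    A minimal sketch of the time-slice idea; all names and parameters here are assumptions, since the slides give no implementation details:

    import time
    from collections import deque

    class TimeSlicedStore:
        """Group incoming tweets into fixed-length time slices; old slices
        expire, so the analytical component only sees the recent stream."""
        def __init__(self, slice_seconds=60, max_slices=60):
            self.slice_seconds = slice_seconds
            self.slices = deque(maxlen=max_slices)  # (slice_key, tweets)
            self.current_key = None

        def add(self, tweet, ts=None):
            key = int((ts if ts is not None else time.time()) // self.slice_seconds)
            if key != self.current_key:
                self.slices.append((key, []))
                self.current_key = key
            self.slices[-1][1].append(tweet)

        def recent(self, n_slices=5):
            return [t for _, batch in list(self.slices)[-n_slices:] for t in batch]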
  • 27. LiveTweet: http://livetweet.west.uni-koblenz.de/
  • 28. LiveTweet: What comes next? Retrieval: combine with other retrieval metrics, include interestingness in a learning-to-rank approach, exploit the social graph. System extension: personalisation, a public API, work with IBM data.
  • 29. Scenario 2: Linked Open Data
  • 30. Information needs requiring semantic structure. Examples: male persons who have a public profile document; computing science papers authored by social scientists; American actors who are also politicians and are married to a model. Maybe specific databases are available: person search engines, bibliographic databases, movie databases.
  • 31. Linked Data. Semantic Web technology to (1) provide structured data on the web and (2) link data across data sources. [Diagram: five data sources A–E, each containing things, connected by typed links.]
  • 32. Entities are identified via URIs; one statement = one triple (subject, predicate, object). Example:

    pd:cygri  rdf:type         foaf:Person
    pd:cygri  foaf:name        "Richard Cyganiak"
    pd:cygri  foaf:based_near  dbpedia:Berlin

    with pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri and dbpedia:Berlin = http://dbpedia.org/resource/Berlin. The last triple describes a link between two data sources.
  • 33. Resolving URIs. Resolving dbpedia:Berlin yields further triples from the target data source, e.g.:

    dbpedia:Berlin  dp:population  3,405,259
    dbpedia:Berlin  skos:subject   dp:Cities_in_Germany
  • 34. The LOD Cloud
  • 35. Querying linked data:

    SELECT ?x
    WHERE {
      ?x rdf:type foaf:Person .
      ?x rdf:type pim:Male .
      ?x foaf:maker ?y .
      ?y rdf:type foaf:PersonalProfileDocument .
    }
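    As an aside (not part of the talk), the query can be tried out locally with rdflib against a tiny in-memory graph; the example data and the pim prefix URI are assumptions:

    from rdflib import Graph

    data = """
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix pim:  <http://www.w3.org/2000/10/swap/pim/contact#> .
    @prefix ex:   <http://example.org/> .

    ex:alice a foaf:Person, pim:Male ;
        foaf:maker ex:profile .
    ex:profile a foaf:PersonalProfileDocument .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    query = """
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX pim:  <http://www.w3.org/2000/10/swap/pim/contact#>
    SELECT ?x WHERE {
        ?x rdf:type foaf:Person .
        ?x rdf:type pim:Male .
        ?x foaf:maker ?y .
        ?y rdf:type foaf:PersonalProfileDocument .
    }
    """
    for row in g.query(query):
        print(row.x)  # -> http://example.org/alice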
  • 36. Querying linked data – an IR task? Classic IR: information need → keyword query → documents → information ("here happens IR magic"). Linked data: information need → SPARQL query → data sources → entities ("here we need magic").
  • 37. Querying linked data – using an index:

    SELECT ?x
    WHERE {
      ?x rdf:type foaf:Person .
      ?x rdf:type pim:Male .
      ?x foaf:maker ?y .
      ?y rdf:type foaf:PersonalProfileDocument .
    }
  • 38. A Schema for LOD
  • 39. Idea. Schema index: define families of graph patterns, assign entities to graph patterns, map graph patterns to contexts/sources. Construction: stream-based for scalability, with little loss of accuracy. NOTE: the index is defined over entities, but it stores the contexts (sources).
  • 40. Input Data: n-quads, <subject> <predicate> <object> <context> . Example:

    <http://www.w3.org/People/Connolly/#me>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
      <http://xmlns.com/foaf/0.1/Person>
      <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .

    i.e. within the context http://dig.csail.mit.edu/2008/..., the entity w3p:#me has type foaf:Person.
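    A minimal parsing sketch (an assumed helper, not from the talk); the naive whitespace split only works for IRI-only quads like the example above, so real data with literals would need a proper parser such as rdflib:

    def parse_nquad(line):
        """Split one n-quad line into (subject, predicate, object, context).
        Naive: assumes four angle-bracketed IRIs terminated by ' .'."""
        s, p, o, c = line.strip().rstrip(".").split()[:4]
        return tuple(term.strip("<>") for term in (s, p, o, c))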
  • 41. Layer 1: RDF Classes – all entities of a particular type, e.g.

    SELECT ?x FROM … WHERE { ?x rdf:type foaf:Person . }

    [Diagram: class C1 mapped to data sources DS1–DS3; example: timbl:card#i has type foaf:Person in the sources http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card.]
  • 42. Layer 2: Type Clusters – all entities belonging to the same set of types, e.g.

    SELECT ?x FROM … WHERE {
      ?x rdf:type foaf:Person .
      ?x rdf:type pim:Male .
    }

    [Diagram: type cluster TC1 over classes C1 and C2, mapped to data sources DS1–DS3; example: tc4711 = {foaf:Person, pim:Male}, containing timbl:card#i from http://www.w3.org/People/Berners-Lee/card.]
  • 43. Layer 3: Equivalence Classes. Two entities are equivalent iff: they are in the same type cluster, they have the same properties, and the property targets are in the same type cluster. [Diagram: equivalence class EQC1 under type clusters TC1, TC2 and classes C1–C3, mapped to data sources DS1–DS3.]
  • 44. Layer 3: Equivalence Classes, e.g.

    SELECT ?x FROM … WHERE {
      ?x rdf:type foaf:Person .
      ?x rdf:type pim:Male .
      ?x foaf:maker ?y .
      ?y rdf:type foaf:PersonalProfileDocument .
    }

    [Diagram: equivalence class eqc0815 = type cluster tc4711 {foaf:Person, pim:Male} with property foaf:maker pointing into type cluster tc1234 {foaf:PPD}; example: timbl:card#i foaf:maker timbl:card, from http://www.w3.org/People/Berners-Lee/card.]
  • 45. Schema Index Overview: 3 layers – 3 different graph patterns.
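    To make the three layers concrete, a hedged sketch with assumed data structures (not the SchemEX implementation; layer 3 is simplified in that it ignores the type clusters of the property targets):

    from collections import defaultdict

    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

    def build_schema_index(quads):
        """Map schema elements to the data sources (contexts) that
        contain matching entities."""
        types = defaultdict(set)    # entity -> RDF classes    (layer 1)
        props = defaultdict(set)    # entity -> outgoing properties
        sources = defaultdict(set)  # entity -> contexts it appears in
        for s, p, o, c in quads:
            sources[s].add(c)
            if p == RDF_TYPE:
                types[s].add(o)
            else:
                props[s].add(p)

        index = defaultdict(set)
        for entity, classes in types.items():
            tc = frozenset(classes)               # layer 2: type cluster
            eqc = (tc, frozenset(props[entity]))  # layer 3 (simplified)
            index[eqc] |= sources[entity]
        return index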
  • 46. Schema Computation
  • 47. Building the Index from a Stream. A stream of n-quads (coming from a Linked Data crawler), … Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1, is processed through a FiFo window. [Diagram: quads in the window are aggregated into schema elements C1–C3.]
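    A minimal sketch of the stream-based computation under stated assumptions (cache size and eviction policy are illustrative): per-entity profiles live in a bounded FiFo cache, and an entity whose quads are spread wider than the cache gets summarised more than once, which is the small accuracy loss mentioned earlier.

    from collections import OrderedDict

    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

    def stream_schema(quads, cache_size=50_000):
        """Yield (entity, (types, properties, sources)) summaries from a
        quad stream using a FiFo cache of bounded size."""
        cache = OrderedDict()
        for s, p, o, c in quads:
            if s not in cache:
                if len(cache) >= cache_size:
                    yield cache.popitem(last=False)  # evict oldest entity
                cache[s] = (set(), set(), set())
            types, props, sources = cache[s]
            if p == RDF_TYPE:
                types.add(o)
            else:
                props.add(p)
            sources.add(c)
        yield from cache.items()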
  • 48. Does it work well? Comparison of the stream-based schema vs. a gold-standard schema on an 11 M triple dataset.
  • 49. Does it scale? Semantic Web Challenge: Billion Triples Track – provision of a large-scale RDF dataset crawled from LOD. Task: do something "useful", do it (web-)scalably, do it with at least 1 billion triples. Presentation at ISWC.
  • 50. BTC results.

                           1st billion  2nd billion  full BTC
    # triples              1 billion    1 billion    2.17 billion
    # instances            187.7 M      222.6 M      450.0 M
    # data sources         13.5 M       9.5 M        24.1 M
    # type clusters        208.5 k      248.5 k      448.6 k
    # equivalence classes  0.97 M       1.14 M       2.12 M
    # triples index        29.1 M       24.8 M       54.7 M
    Compression ratio      2.91%        2.48%        2.52%
    # triples/sec.         40.5 k       45.6 k       39.5 k
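    The compression ratio here is the index size over the input size in triples; a quick sanity check of the reported numbers:

    # index triples / input triples for the three runs in the table
    for n_input, n_index in [(1e9, 29.1e6), (1e9, 24.8e6), (2.17e9, 54.7e6)]:
        print(f"{n_index / n_input:.2%}")  # -> 2.91%, 2.48%, 2.52%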
  • 51. SchemEX: What comes next? Hierarchy of semantic information: type clusters, equivalence clusters, related types. Optimisation: smarter caching, performance (Hadoop), error correction.
  • 52. Conclusion
  • 53. Take-away message. The web is evolving in interesting directions: social networks and user-generated content, semantic data. Challenges for IR: different settings, different tasks, and the need to question basic assumptions.
  • 54. Thank you! Contact: WeST – Institute for Web Science and Technologies, Universität Koblenz-Landau, gottron@uni-koblenz.de
  • 55. Relevant Publications
    1. A. Che Alhadi, S. Staab, and T. Gottron. Exploring user purpose writing single tweets. In WebSci '11: Proceedings of the 3rd International Conference on Web Science, 2011.
    2. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. LiveTweet: Microblog retrieval based on interestingness. In TREC '11: Proceedings of the Text REtrieval Conference 2011, 2011.
    3. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. LiveTweet: Monitoring and predicting interesting microblog posts. In ECIR '12: Proceedings of the 34th European Conference on Information Retrieval, 2012. In preparation.
    4. T. Gottron and N. Lipka. A comparison of language identification approaches on short, query-style texts. In ECIR '10: Proceedings of the 32nd European Conference on Information Retrieval, pp. 611–614, Mar. 2010.
    5. M. Konrath, T. Gottron, and A. Scherp. SchemEX – web-scale indexed schema extraction of linked open data. In Semantic Web Challenge, Submission to the Billion Triple Track, 2011.
    6. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Bad news travel fast: A content-based analysis of interestingness on Twitter. In WebSci '11: Proceedings of the 3rd International Conference on Web Science, 2011.
    7. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Searching microblogs: Coping with sparsity and document quality. In CIKM '11: Proceedings of the 20th ACM Conference on Information and Knowledge Management, 2011.
  • 56. Attic
  • 57. Use Cases.

    SAP Community Network (SCN) – business partners / extranet:
    - Communities: customers, partners, suppliers, developers
    - Business value: product support, services, finding business partners
    - Volume: 6,000 posts/day, 1,700,000 subscribers, 16 GB log/day

    Lotus Connections – employees / intranet:
    - Communities: employees, working groups, interest groups, projects
    - Business value: task-relevant information, collaboration, innovation
    - Volume: 4,000 posts/day, 386,000 employees, 1.5 GB content/day

    MeaningMine – public domain / internet:
    - Communities: social media, news, web fora, public communities
    - Business value: topics, opinions, service for partners
    - Volume: 1,400,000 posts/day, 708,000 web sources, 45 GB content/day
  • 58. Twitter is different. Followers form a social graph – is PageRank applicable?! BUT: following is not (only) motivated by content, and it makes no statement about individual tweets!
  • 59. Information seeking behaviour on Twitter. Web: 2–4 query terms, broader terms; intentions: navigation, information, resources; goal: get to know a topic. Twitter: 1–2 query terms, specific terms; intentions: timely information, trends, people; goal: follow a topic.
  • 60. TREC Microblog Track 2011: 12,000,000 tweets over 2 weeks, 49 "topics" (queries); task: filtering. Constraints: no external knowledge, English tweets only, temporal order of topic & tweets; official extension of "relevance" to "interestingness" (!!!).
  • 61. WeST @ TREC Microblog Track. Basics: Lucene, no length normalisation, interestingness. 4 configurations: WESTfilter (retrieval via Lucene, filtering non-interesting tweets), WESTfilext (like WESTfilter, but with sentiments), WESTrelint (like WESTfilter, but re-ranking according to interestingness), WESTrlext (like WESTrelint, but with sentiments).
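    The filtering runs differ from the re-ranking sketch given earlier in that they keep the relevance order and simply drop uninteresting tweets; a minimal sketch (threshold and function names are assumptions, not TREC run details):

    def west_filter_style(ranked_tweets, interestingness, threshold=0.5):
        """Filtering run: keep the relevance order, drop tweets whose
        predicted interestingness falls below a threshold."""
        return [t for t in ranked_tweets if interestingness(t) >= threshold]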
  • 62. Results. Filtering is significantly better than re-ranking; sentiments are a disadvantage (not significant). [Bar chart: scores on P5, P10, P15, P20, P30, R-prec, bpref, MAP and nDCG for the runs WESTfilter, WESTfilext, WESTrelint, WESTrlext.]
  • 63. Results. Effective especially for shorter queries. [Plot: MAP over query length (word count, 1–7) for the runs WESTfilext, WESTfilter, WESTrelint, WESTrlext.]
  • 64. Schema representation using VoiD