Challenging Retrieval Scenarios: Social Media and Linked Open Data

Institute for Web Science & Technologies – WeST

Dr. Thomas Gottron
gottron@uni-koblenz.de

Outline

 The ROBUST project
 Background
 Use cases

 Retrieval on Microblogs
 Particularities of Twitter
 Interestingness
 LiveTweet

 Search on the LOD cloud
 Querying LOD as IR task
 Schema extraction
 SchemEX
Challenging Retrieval Scenarios, Thomas Gottron, Lugano, 23.4.2012

Online Communities


Business Communities

 Information ecosystems
 Employees
 Business Partners, Customers
 General Public
Valuable asset

Risks Opportunities


High Level Objectives

 Risk Management
 Risk modelling
 Detection
 Automatic reaction
 Community Analysis
 Contents
 Single users
 Entire communities
 Community Forecasting
 Policies
 Prediction
 Decision support
 Large Scale Processing
 Big Data
 Realtime
 Parallel processing


Scenario 1

Social Media - Microblogs


IBM Connections


Twitter

[Figure: a tweet by @janedoe is delivered to her followers]
@janedoe: My dear @johndoe had troubles to wake up this #morning


Retrieval on Twitter: First Steps

 10 million tweets
 Retrieval Engine
 Query: beer
Rank User Tweet
1 LoriAG beer
2 Crushdwinebar beer!!
3 Skippertaylor BEER
4 BigMacScola Beer
5 VANiamore beer.......
6 CindyMcManis To beer or not to beer on Beer Summit ?
7 silverlakewine beer beer beer beer beer beer beer. Simple 3pm
8 eldoradobar http://ping.fm/p/Bnra7 - In!!! BEER, BEER, BEER,
BEER, BEER, BEER, BEER, BEER, BEER, BEER,
9 tonx Lompoc. beer beer beer beer beer beer beer beer beer
beer. http://twitpic.com/l68ld
10 punkeyfunky Beer beer beer beer beer beer beer beer beer beer beer
beer beer. Er, guess what I'm looking forward to?


Particularities of Twitter


Twitter is different

 Maximum length: 140 characters

[Figure: histogram of tweet lengths; x-axis: characters (1 to 141), y-axis: number of tweets (0 to 500,000)]



 140 characters = few words
 85% of tweets contain each word only once

[Figure: number of tweets (log scale, 1 to 10,000,000) by maximum term frequency in a tweet (0 to 47); annotation: "Binary value!"]
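
A minimal sketch of how the "maximum term frequency per tweet" statistic could be computed. The whitespace tokenisation and the function names are assumptions made for illustration; the talk does not state how tweets were tokenised.

```python
from collections import Counter


def max_term_frequency(tweet: str) -> int:
    """Return the highest count of any single token in the tweet (0 if empty)."""
    tokens = tweet.lower().split()  # naive whitespace tokenisation (assumption)
    if not tokens:
        return 0
    return max(Counter(tokens).values())


def share_with_binary_tf(tweets: list[str]) -> float:
    """Fraction of tweets in which no token occurs more than once."""
    if not tweets:
        return 0.0
    binary = sum(1 for t in tweets if max_term_frequency(t) <= 1)
    return binary / len(tweets)


sample = [
    "My dear @johndoe had troubles to wake up this #morning",
    "To beer or not to beer on Beer Summit ?",
    "beer beer beer",
]
print([max_term_frequency(t) for t in sample])
print(share_with_binary_tf(sample))
```

If most tweets have a maximum term frequency of 1, the term frequency component of a ranking function carries little more than a present/absent signal.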


Length normalisation

 Why are some documents longer? (classic explanation)

 Verbosity hypothesis:
 Long documents repeat themselves
 Short documents are preferred as they are more concise

 Scope hypothesis:
 Long documents address more topics
 Short documents are preferred as they are more focussed

 Intuition:
 Not valid for Twitter


Verbosity hypothesis and Twitter?


Scope hypothesis and Twitter?

 Are long tweets broader in scope?

 LDA:
 100 topics

 Observations
 8.5% of tweets have no strong topic
 Remaining tweets:
• 77.1% are dominated by one topic
• 99.6% are dominated by two topics


Length normalisation on tweets

 Not necessary! … Negative impact?

 YES:
 Short tweets are preferred!

Beer!

 Long tweets are considered of too wide scope.

Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV
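
The effect on the two example tweets can be sketched with a toy scoring function. This is an illustration under stated assumptions, not the ranking function used in the talk: the score is a dampened term frequency, and the optional length norm divides by the square root of the number of tokens, a common form of length normalisation in TF-IDF style scoring.

```python
import math


def score(query_term: str, tweet: str, use_length_norm: bool) -> float:
    """Toy score: sqrt(tf), optionally divided by sqrt(number of tokens)."""
    tokens = [t.strip(".,!?:").lower() for t in tweet.split()]
    tokens = [t for t in tokens if t]
    tf = tokens.count(query_term.lower())
    if tf == 0:
        return 0.0
    value = math.sqrt(tf)
    if use_length_norm:
        value /= math.sqrt(len(tokens))  # penalises longer tweets
    return value


short_tweet = "Beer!"
long_tweet = "Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV"

for use_norm in (True, False):
    print(use_norm,
          round(score("beer", short_tweet, use_norm), 3),
          round(score("beer", long_tweet, use_norm), 3))
```

With the norm switched on, the one-word tweet outranks the informative one; with the norm switched off, both receive the same score and other signals can decide the order.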


Interestingness


Interesting Content

 Concept of "relevance" in IR:
 Document is about a topic

 Additionally for Twitter:
 Timeliness
 Current trend
 Informative

 Interestingness
 Tweet is about a topic AND is interesting!

 Question: How to determine what is interesting?


Retweets
[Figure: a follower of @janedoe retweets her tweet]
@janedoe: My dear @johndoe had troubles to wake up this #morning
Follower: RT @janedoe: My dear @johndoe had troubles to wake up this #morning


Retweets

 Retweet indicates quality
 "of interest for others"

 Depends on
 Content 
 Context (time, follower) 

 Idea:
 Learn to predict retweets!

Likelihood of retweet as
metric for Interestingness
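
A minimal sketch of the idea, assuming a logistic model over a few surface features of a tweet. The feature names, the weights and the intercept below are hypothetical values chosen for illustration; they are not the learned weights reported in the talk.

```python
import math

# Hypothetical weights for illustration only.
WEIGHTS = {
    "has_url": 1.2,
    "has_hashtag": 0.4,
    "has_mention": 0.6,
    "is_direct_message": -2.0,
}
INTERCEPT = -2.5


def extract_features(tweet: str) -> dict[str, float]:
    """Binary surface features of a tweet (simplified)."""
    tokens = tweet.split()
    return {
        "has_url": float(any(t.startswith(("http://", "https://")) for t in tokens)),
        "has_hashtag": float(any(t.startswith("#") and len(t) > 1 for t in tokens)),
        "has_mention": float(any(t.startswith("@") and len(t) > 1 for t in tokens[1:])),
        "is_direct_message": float(bool(tokens) and tokens[0].startswith("@")),
    }


def interestingness(tweet: str) -> float:
    """Predicted retweet likelihood in (0, 1), used as interestingness score."""
    features = extract_features(tweet)
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


print(interestingness("Beer!"))
print(interestingness("Pubs brewing their own beer http://bit.ly/w2GZrV #beer"))
```

The logistic function maps the weighted feature sum to a probability, so tweets can be compared and ranked by their predicted retweet likelihood.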


Retweets: Prediction model

Dataset Users Tweets Retweets
Choudhury 118,506 9,998,756 7.89%
Choudhury (extended) 277,666 29,000,000 8.64%
Petrovic 4,050,944 21,477,484 8.46%


Logistic Regression: Weights

Feature          Dimensions       Weight
Constant         (intercept)       -5.45
Message feature  Direct message  -147.89
                 Username         146.82
                 Hashtag           42.27
                 URL              249.09
Sentiment        Valence          -26.88
                 Arousal           33.97
                 Dominance         19.56
Emoticons        Positive         -21.8
                 Negative           9.94
Exclamation      Positive          13.66
                 Negative           8.72
Punctuation      !                -16.85
                 ?                 23.67
Terms            Odds              19.79


Logistic Regression: Topic Weights

Topic Weight
social media market post site web tool traffic network 27.54
follow thank twitter welcome hello check nice cool people 16.08
credit money market business rate economy home 15.25
christmas shop tree xmas present today wrap finish 2.87
home work hour long wait airport week flight head -14.43
twitter update facebook account page set squidoo check -14.43
cold snow warm today degree weather winter morning -26.56
night sleep work morning time bed feel tired home -75.19


Re-Ranking using Interestingness

 Top-k relevant tweets
 Re-rank based on interestingness
Rank Username Tweet
1 BeeracrossTX UK beer mag declares "the end of beer writing." @StanHieronymus says not so in the US.
http://bit.ly/424HRQ #beer
2 narmmusic beer summit @bspward @jhinderaker no one had billy beer? heehee #narm - beer summit
@bspward @jhinde http://tinyurl.com/n29oxj
3 beeriety Go green and turn those empty beer bottles into recycled beer glasses! | http://bit.ly/2src7F
#beer #recycle (via: @td333)
4 hblackmon Great Divide beer dinner @ Porter Beer Bar on 8/19 - $45 for 3 courses + beer pairings.
http://trunc.it/172wt
5 nycraftbeer Interesting Concept-Beer Petitions.com launches&hopes 2help craft beer drinkers enjoy beer
they want @their fave pubs. http://bit.ly/11gJQN
6 carichardson Beer Cheddar Soup: Dish number two in my famed beer dinner series is Beer Cheddar
Soup. I hadn’t had too.. http://bit.ly/1diDdF
7 BeerBrewing New York City Beer Events - Beer Tasting - New York Beer Festivals - New York Craft Beer
http://is.gd/39kXj #beer
8 delphiforums Love beer? Our member is trying to build up a new beer drinker's forum. Grab a #beer and
join us: http://tr.im/pD1n
9 Jamie_Mason #Baltimore Beer Week continues w/ a beer brkfst, beer pioneers luncheon, drink & donate
event, beer tastings & more. http://ping.fm/VyTwg
10 carichardson Seattle and Beer: I went to Seattle last weekend. It was my friend’s stag - he likes
beer - we drank beer.. http://tinyurl.com/cpb4n9
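
The two steps listed on this slide can be sketched as follows. The `ScoredTweet` type and all scores are hypothetical; in practice the relevance score would come from the retrieval engine and the interestingness score from the retweet prediction model.

```python
from typing import NamedTuple


class ScoredTweet(NamedTuple):
    user: str
    text: str
    relevance: float        # score from the retrieval engine
    interestingness: float  # predicted retweet likelihood


def rerank_top_k(candidates: list[ScoredTweet], k: int) -> list[ScoredTweet]:
    """Take the k most relevant tweets, then order them by interestingness."""
    top_k = sorted(candidates, key=lambda t: t.relevance, reverse=True)[:k]
    return sorted(top_k, key=lambda t: t.interestingness, reverse=True)


candidates = [
    ScoredTweet("a", "beer", 0.9, 0.1),
    ScoredTweet("b", "beer festival this weekend", 0.7, 0.8),
    ScoredTweet("c", "wine tasting tonight", 0.1, 0.95),
    ScoredTweet("d", "new beer bar opened", 0.8, 0.5),
]
for tweet in rerank_top_k(candidates, k=3):
    print(tweet.user, tweet.text)
```

Restricting the second sort to the top-k relevant tweets keeps off-topic but highly retweetable tweets (user "c" in the sample) out of the result list.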


Application


LiveTweet

 Data:
 Twitter streaming API: sample
 1% of all tweets

 Architecture:
 Time slices over tweets
 Analytical component with
REST API
 Web Frontend for end user


LiveTweet

http://livetweet.west.uni-koblenz.de/


LiveTweet: What comes next?

 Retrieval
 Incorporate with other retrieval metrics
 Include Interestingness in a learning to rank approach
 Social graph

 System extension
 Personalisation
 Public API
 Work with IBM data


Scenario 2

Linked Open Data


Information needs requiring semantic structure

 Examples
 Male persons who have a public profile document
 Computing science papers authored by social scientists
 American actors who are also politicians and are married
to a model.

 Maybe specific databases available:
 Person search engines
 Bibliographic databases
 Movie database


Linked Data

Semantic Web Technology to
1. Provide structured data on the web
2. Link data across data sources

[Figure: things in data sources A to E, connected across sources by typed links]


Entities are identified via URIs

One statement = one triple

Subject    Predicate        Object
pd:cygri   rdf:type         foaf:Person
pd:cygri   foaf:name        Richard Cyganiak
pd:cygri   foaf:based_near  dbpedia:Berlin

pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin

Description of a link between two data sources
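
A minimal sketch of the triple model in code: each statement is a (subject, predicate, object) tuple, and prefixed names expand to full URIs. The `pd` and `dbpedia` prefixes follow the slide; the helper names and the convention of marking literals with double quotes are assumptions for illustration.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
}


def expand(term: str) -> str:
    """Expand a prefixed name to a full URI; quoted literals are returned as-is."""
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return PREFIXES[prefix] + local


triples = [
    ("pd:cygri", "rdf:type", "foaf:Person"),
    ("pd:cygri", "foaf:name", '"Richard Cyganiak"'),
    ("pd:cygri", "foaf:based_near", "dbpedia:Berlin"),
]

for s, p, o in triples:
    print(expand(s), expand(p), expand(o))
```

The third triple is the interesting one: its subject lives in one data source and its object in another, which is what makes the data "linked".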


Resolving URIs

Subject         Predicate        Object
pd:cygri        rdf:type         foaf:Person
pd:cygri        foaf:name        Richard Cyganiak
pd:cygri        foaf:based_near  dbpedia:Berlin
dbpedia:Berlin  dp:population    3,405,259
dbpedia:Berlin  skos:subject     dp:Cities_in_Germany


The LOD Cloud


Querying linked data

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}


Querying linked data – an IR task?

Classic IR: Information need → Keyword query → Documents → Information (IR magic happens here)
Linked data: Information need → SPARQL query → Data sources → Entities (here we need magic)


Querying linked data – using an index

SELECT ?x
WHERE {
  ?x foaf:maker ?y .
  ?y rdf:type
}


A Schema for LOD


Idea

 Schema Index:
 Define families of graph patterns
 Assign entities to graph patterns
 Map graph patterns to context / source

 Construction:
 Stream-based for scalability
 Little loss of accuracy

 NOTE:
 Index defined over entities
 But: Index stores the contexts (sources)


Input Data

 n-Quads
<subject> <predicate> <object> <context> .
 Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .
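
A minimal sketch of reading such a line into its four components. It is deliberately simplified: it assumes all four terms are IRIs in angle brackets, whereas real N-Quads data also contains literals and blank nodes. The names `Quad` and `parse_iri_quad` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # the data source the statement was found in


# Four <...> terms separated by whitespace, terminated by a dot.
_IRI_QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$"
)


def parse_iri_quad(line: str) -> Optional[Quad]:
    """Parse a line whose four terms are all IRIs; return None otherwise."""
    match = _IRI_QUAD.match(line)
    return Quad(*match.groups()) if match else None


line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
print(parse_iri_quad(line))
```

The fourth component is what the schema index finally stores: it tells a query engine which source to fetch for a given graph pattern.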

[Figure: the entity w3p:#me, typed as foaf:Person, inside the context http://dig.csail.mit.edu/2008/...]


Layer 1: RDF Classes

 All entities of a particular type

[Figure: class C1 mapped to the data sources DS 1, DS 2 and DS 3]

SELECT ?x
FROM …
WHERE { foaf:Person
}

[Figure: foaf:Person with the example entity timbl:card#i and the sources http://dig.csail.mit.edu/2008/... and http://www.w3.org/People/Berners-Lee/card]


Layer 2: Type Clusters

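 All entities belonging to the same set of types

[Figure: classes C1 and C2 combined into type cluster TC1, which maps to the data sources DS 1, DS 2 and DS 3]

SELECT ?x
FROM …
WHERE { foaf:Person pim:Male
}

[Figure: foaf:Person and pim:Male form type cluster tc4711; example entity timbl:card#i, source http://www.w3.org/People/Berners-Lee/card]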
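
A minimal sketch of the type cluster layer, assuming quads are given as (subject, predicate, object, context) tuples with prefixed names. Entities are grouped by their exact set of types, and each type set is mapped to the contexts in which its entities were seen. The sample data and the `src:` identifiers are hypothetical, and this is an illustration of the idea rather than the SchemEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def type_clusters(quads):
    """Map each set of types (a type cluster) to the sources of its entities."""
    types = defaultdict(set)    # entity -> set of types
    sources = defaultdict(set)  # entity -> contexts it was seen in
    for subject, predicate, obj, context in quads:
        sources[subject].add(context)
        if predicate == RDF_TYPE:
            types[subject].add(obj)

    clusters = defaultdict(set)
    for entity, type_set in types.items():
        clusters[frozenset(type_set)].update(sources[entity])
    return dict(clusters)


quads = [
    ("timbl:card#i", "rdf:type", "foaf:Person", "src:card"),
    ("timbl:card#i", "rdf:type", "pim:Male", "src:card"),
    ("ex:alice", "rdf:type", "foaf:Person", "src:a"),
    ("ex:bob", "rdf:type", "pim:Male", "src:b"),
    ("ex:bob", "rdf:type", "foaf:Person", "src:b"),
]
for type_set, found_in in type_clusters(quads).items():
    print(sorted(type_set), sorted(found_in))
```

A query asking for entities that are both foaf:Person and pim:Male only needs the sources listed for that exact type cluster, here `src:card` and `src:b`.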


Layer 3: Equivalence Classes

 Two entities are equivalent iff:
 They are in the same TC
 They have the same properties
 The property targets are in the same TC

[Figure: classes C1, C2 and C3 form type clusters TC1 and TC2; equivalence class EQC1 maps to the data sources DS 1, DS 2 and DS 3]
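
The three conditions can be sketched as a grouping key: the entity's own type set plus the set of (property, type set of the target) pairs. This is a simplified illustration that assumes every non-type object is itself an entity (no literals); the sample triples and names are hypothetical, with foaf:PPD used as on the slides for foaf:PersonalProfileDocument.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def equivalence_classes(triples):
    """Group entities by (own type set, {(property, target type set)})."""
    types = defaultdict(set)  # entity -> set of types
    links = defaultdict(set)  # entity -> set of (property, target entity)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
        else:
            links[subject].add((predicate, obj))

    classes = defaultdict(set)
    for entity in set(types) | set(links):
        own_types = frozenset(types.get(entity, ()))
        pattern = frozenset(
            (prop, frozenset(types.get(target, ())))
            for prop, target in links.get(entity, ())
        )
        classes[(own_types, pattern)].add(entity)
    return dict(classes)


triples = [
    ("timbl:card#i", "rdf:type", "foaf:Person"),
    ("timbl:card#i", "rdf:type", "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", "rdf:type", "foaf:PPD"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:bobdoc"),
    ("ex:bobdoc", "rdf:type", "foaf:PPD"),
    ("ex:carl", "rdf:type", "foaf:Person"),
    ("ex:carl", "rdf:type", "pim:Male"),
]
for members in equivalence_classes(triples).values():
    print(sorted(members))
```

In the sample, `ex:carl` shares the type cluster of the other two persons but lands in a different equivalence class because he has no foaf:maker link to a profile document.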


Layer 3: Equivalence Classes
SELECT ?x
FROM …
WHERE {
  ?x foaf:maker ?y .
  ?y rdf:type
}

[Figure: type cluster tc4711 (foaf:Person, pim:Male) and type cluster tc1234 (foaf:PPD) are connected via foaf:maker in equivalence class eqc0815 (eqc0815 -maker- tc1234); example entities timbl:card#i and timbl:card, source http://www.w3.org/People/Berners-Lee/card]


Schema Index Overview

 3 Layers – 3 different graph patterns


Schema Computation


Building the Index from a Stream

 Stream of n-quads (coming from a LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: FiFo cache holding instances 1 to 6 together with their classes C1, C2 and C3]
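
A minimal sketch of stream processing with a bounded FIFO cache, showing where the "little loss of accuracy" can come from. The cache policy, the function name and the sample quads are assumptions for illustration; the slides do not describe the actual cache in detail.

```python
from collections import OrderedDict

RDF_TYPE = "rdf:type"


def stream_type_sets(quads, cache_size):
    """Yield (entity, types, contexts) whenever an entity leaves the FIFO cache."""
    cache = OrderedDict()  # entity -> (set of types, set of contexts)
    for subject, predicate, obj, context in quads:
        if subject not in cache:
            if len(cache) >= cache_size:
                # Cache full: emit the oldest entity with what is known so far.
                old, (old_types, old_contexts) = cache.popitem(last=False)
                yield old, frozenset(old_types), frozenset(old_contexts)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        contexts.add(context)
        if predicate == RDF_TYPE:
            types.add(obj)
    while cache:  # flush remaining entities at the end of the stream
        entity, (types, contexts) = cache.popitem(last=False)
        yield entity, frozenset(types), frozenset(contexts)


quads = [
    ("e1", "rdf:type", "C1", "d1"),
    ("e2", "rdf:type", "C2", "d1"),
    ("e3", "rdf:type", "C3", "d2"),
    ("e1", "rdf:type", "C2", "d3"),  # arrives after e1 was evicted (size 2)
]
print(list(stream_type_sets(quads, cache_size=2)))
print(list(stream_type_sets(quads, cache_size=4)))
```

With a cache of size 2, `e1` is emitted twice with incomplete type sets; with a cache of size 4, it is emitted once with both types. Statements about the same entity that arrive far apart in the stream are the source of error in this sketch.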


Does it work well?

Comparison of stream-based vs. gold-standard schema on an 11 M triple data set


Does it scale?

 Semantic Web Challenge: Billion Triples Track
 Provision of large scale RDF dataset
 Crawled from LOD

 Task:
 Do something "useful"
 Do it in a (web-)scalable way
 Do it with at least 1 billion triples

 Presentation at ISWC


BTC results

                        1st billion  2nd billion  full BTC
# triples               1 billion    1 billion    2.17 billion
# instances             187.7 M      222.6 M      450.0 M
# data sources          13.5 M       9.5 M        24.1 M
# type clusters         208.5 k      248.5 k      448.6 k
# equivalence classes   0.97 M       1.14 M       2.12 M
# triples index         29.1 M       24.8 M       54.7 M
Compression ratio       2.91%        2.48%        2.52%
# triples/sec.          40.5 k       45.6 k       39.5 k
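
The compression ratio row appears to be the number of index triples divided by the number of input triples. A short check with the figures from the table, assuming that reading of the row:

```python
def compression_ratio(index_triples: float, data_triples: float) -> float:
    """Size of the schema index relative to the indexed data."""
    return index_triples / data_triples


# (index triples, data triples) taken from the table above
runs = {
    "1st billion": (29.1e6, 1e9),
    "2nd billion": (24.8e6, 1e9),
    "full BTC": (54.7e6, 2.17e9),
}
for name, (index_triples, data_triples) in runs.items():
    print(f"{name}: {compression_ratio(index_triples, data_triples):.2%}")
```

The computed values match the reported 2.91%, 2.48% and 2.52%.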


SchemEX: What comes next?

 Hierarchy of semantic information:
 Type clusters
 Equivalence clusters
 Related types

 Optimization
 Smarter caching
 Performance – Hadoop
 Error correction


Conclusion


Take away message

 Web evolving in interesting directions
 Social networks, user generated content
 Semantic data

 Challenges for IR
 Different settings
 Different tasks
 Question basic assumptions


Thank you!

Contact:
WeST – Institute for Web Science and Technologies
Universität Koblenz-Landau
gottron@uni-koblenz.de




Attic


Use Cases

 SAP Community Network (SCN): business partners, extranet
 Communities: customers, partners, suppliers, developers
 Business value: products support, services, find business partners
 Volume: 6,000 posts/day, 1,700,000 subscribers, 16GB log/day
 Lotus Connections: employees, intranet
 Communities: employees, working groups, interest groups, projects
 Business value: task relevant information, collaboration, innovation
 Volume: 4,000 posts/day, 386,000 employees, 1.5GB content/day
 MeaningMine: public domain, internet
 Communities: social media, news, web fora, public communities
 Business value: topics, opinions, service for partners
 Volume: 1,400,000 posts/day, 708,000 web sources, 45GB content/day



 Followers form a social graph
 PageRank applicable?!

 BUT:
 Following is not (only) motivated by content
 No statement about tweets!


Information seeking behaviour on Twitter

 Web
 2-4 query terms
 Broader terms
 Intentions: navigation, information, resources
 Get to know a topic
 Twitter
 1-2 query terms
 Specific terms
 Intentions: timely information, trends, people
 Follow a topic


TREC

 Microblog Track 2011
 12,000,000 tweets
 2 weeks
 49 "topics" (queries)
 Task: Filtering

 Constraints
 No external knowledge!
 English tweets only
 Temporal order of topic & tweets
 Official extension of "relevance" to "interestingness" (!!!)


WeST @ TREC Microblog Track

 Basics:
 Lucene
 No length normalisation
 Interestingness

 4 configurations:
 WESTfilter: Retrieval via Lucene, filtering out non-interesting tweets
 WESTfilext: like WESTfilter, but with sentiments
 WESTrelint: like WESTfilter, but re-ranking according to
interestingness
 WESTrlext: like WESTrelint, but with sentiments
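
A minimal sketch of the filtering idea behind the WESTfilter configuration: the retrieval order is kept and tweets below an interestingness threshold are dropped. The threshold value, the tuple layout and the scores are hypothetical; the slides do not state how the cut-off was chosen.

```python
def filter_by_interestingness(ranked, threshold=0.5):
    """Keep the retrieval order, drop tweets with low interestingness.

    `ranked` is a list of (tweet_id, relevance, interestingness) tuples,
    already sorted by relevance.
    """
    return [item for item in ranked if item[2] >= threshold]


ranked = [
    ("t1", 0.9, 0.1),
    ("t2", 0.8, 0.7),
    ("t3", 0.7, 0.6),
    ("t4", 0.6, 0.2),
]
print(filter_by_interestingness(ranked))
```

Unlike re-ranking, filtering never promotes a tweet above a more relevant one; it only removes candidates.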


Results

 Filtering is significantly better than re-ranking
 Sentiment features are a disadvantage (not significant)

[Figure: scores (0 to 0.4) of WESTfilter, WESTfilext, WESTrelint and WESTrlext for the metrics P5, P10, P15, P20, P30, R-prec, bpref, MAP and nDCG]


Results

 Effective especially for shorter queries
[Figure: MAP (0 to 0.3) by query length (1 to 7 words) for WESTfilext, WESTfilter, WESTrelint and WESTrlext]


Schema representation using VoiD

