Invited talk given in April 2012 at USI in Lugano at the IR research group of Fabio Crestani. Review of the work on Interestingness on Twitter and schema based indices on Linked Open Data (SchemEX).
Challenging Retrieval Scenarios: Social Media and Linked Open Data
1. Institute for Web Science & Technologies – WeST
Challenging Retrieval Scenarios: Social Media and Linked Open Data
Dr. Thomas Gottron
gottron@uni-koblenz.de
2. Outline
The ROBUST project
Background
Use cases
Retrieval on Microblogs
Particularities of Twitter
Interestingness
LiveTweet
Search on the LOD cloud
Querying LOD as IR task
Schema extraction
SchemEX
Challenging Retrieval Scenarios Thomas Gottron Lugano, 23.4.2012 2
4. Business Communities
Information ecosystems
Employees
Business Partners, Customers
General Public
Valuable asset
Risks Opportunities
5. High Level Objectives
Risk Management
• Risk modelling
• Detection
• Automatic reaction
Community Analysis
• Contents
• Single users
• Entire communities
Community Forecasting
• Policies
• Prediction
• Decision support
Large Scale Processing
• Big Data
• Realtime
• Parallel Processing
6. Scenario 1
Social Media - Microblogs
12. Twitter is different
140 characters = few words
85% of tweets contain each word only once
Binary value!
[Figure: number of tweets (log scale, 1 to 10,000,000) by maximum term frequency (Max TF) in a tweet]
13. Length normalisation
Why are some documents longer? (classic explanation)
Verbosity hypothesis:
Long documents repeat themselves
Short documents preferred as they are more concise
Scope hypothesis:
Long documents address more topics
Short documents preferred as they are more focussed
Intuition:
Not valid for Twitter
14. Verbosity hypothesis and Twitter?
15. Scope hypothesis and Twitter?
Are long tweets broader in scope?
LDA:
100 topics
Observations
8.5% of tweets have no strong topic
Remaining tweets:
• 77.1% are dominated by one topic
• 99.6% are dominated by two topics
16. Length normalisation on tweets
Not necessary! … Negative impact?
YES:
Short tweets are preferred!
Beer!
Long tweets are considered too broad in scope.
Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV
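The effect can be illustrated with a minimal sketch, assuming a BM25-style term weight. The slides do not say which ranking function was analysed; BM25 and the parameters `k1` and `b` are assumptions chosen for illustration. Both example tweets contain the query term once, so only the length factor separates them.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component; b controls length normalisation."""
    norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)

# The two example tweets from the slide, tokenised on whitespace.
short_tweet = "Beer!".split()
long_tweet = "Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV".split()
avg_len = (len(short_tweet) + len(long_tweet)) / 2

for b in (0.75, 0.0):  # with and without length normalisation
    short_score = bm25_tf(1, len(short_tweet), avg_len, b=b)
    long_score = bm25_tf(1, len(long_tweet), avg_len, b=b)
    print(f"b={b}: short={short_score:.3f} long={long_score:.3f}")
```

With `b=0.75` the one-word tweet scores higher than the informative one; with `b=0.0` (no length normalisation) both receive the same score.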
18. Interesting Content
Concept of "relevance" in IR:
Document is about a topic
Additionally for Twitter:
Timeliness
Current trend
Informative
Interestingness
Tweet is about a topic AND is interesting!
Question: How to determine what is interesting???
19. Retweets
@janedoe: My dear @johndoe had troubles to wake up this #morning
Follower: RT @janedoe: My dear @johndoe had troubles to wake up this #morning
20. Retweets
Retweet indicates quality
"of interest for others"
Depends on
Content
Context (time, follower)
Idea:
Learn to predict retweets!
Likelihood of retweet as
metric for Interestingness
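The idea of learning to predict retweets can be sketched as follows. This is a toy illustration, not the model from the talk: the three binary features and the labels are hypothetical, and the plain gradient-descent training loop merely stands in for whatever learner was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, lr=0.5):
    """Fit a logistic regression model by batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        bias -= lr * grad_b / len(samples)
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
    return weights, bias

def interestingness(x, weights, bias):
    """Predicted retweet likelihood, used as the interestingness score."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical features: [contains URL, contains hashtag, personal status update]
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = tweet was retweeted (toy labels)

weights, bias = train(X, y)
print(interestingness([1, 1, 0], weights, bias))  # news-like tweet
print(interestingness([0, 0, 1], weights, bias))  # personal status update
```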
23. Logistic Regression: Topic Weights
Topic Weight
social media market post site web tool traffic network 27.54
follow thank twitter welcome hello check nice cool people 16.08
credit money market business rate economy home 15.25
christmas shop tree xmas present today wrap finish 2.87
home work hour long wait airport week flight head -14.43
twitter update facebook account page set squidoo check -14.43
cold snow warm today degree weather winter morning -26.56
night sleep work morning time bed feel tired home -75.19
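To read these weights, recall that a logistic regression model passes the weighted sum of its features through the sigmoid function. The sketch below uses the weights from the table, keyed by the first word of each topic. It assumes that the features are topic proportions of a tweet, that the intercept is zero, and that no other features exist; none of this is stated on the slide.

```python
import math

# Topic weights copied from the table, keyed by the first topic word.
TOPIC_WEIGHTS = {
    "social": 27.54,
    "follow": 16.08,
    "credit": 15.25,
    "christmas": 2.87,
    "home": -14.43,
    "twitter": -14.43,
    "cold": -26.56,
    "night": -75.19,
}

def retweet_probability(topic_mix, intercept=0.0):
    """Sigmoid of the intercept plus the weighted topic proportions."""
    z = intercept + sum(TOPIC_WEIGHTS[topic] * share for topic, share in topic_mix.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tweets in which a single topic accounts for 20% of the content.
print(retweet_probability({"social": 0.2}))  # social media marketing topic
print(retweet_probability({"night": 0.2}))   # going-to-bed topic
```

Under these assumptions a positive weight pushes the predicted retweet likelihood above 0.5 and a negative weight pushes it below.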
24. Re-Ranking using Interestingness
Top-k relevant tweets
Re-rank based on interestingness
Rank Username Tweet
1 BeeracrossTX UK beer mag declares "the end of beer writing." @StanHieronymus says not so in the US. http://bit.ly/424HRQ #beer
2 narmmusic beer summit @bspward @jhinderaker no one had billy beer? heehee #narm - beer summit @bspward @jhinde http://tinyurl.com/n29oxj
3 beeriety Go green and turn those empty beer bottles into recycled beer glasses! | http://bit.ly/2src7F #beer #recycle (via: @td333)
4 hblackmon Great Divide beer dinner @ Porter Beer Bar on 8/19 - $45 for 3 courses + beer pairings. http://trunc.it/172wt
5 nycraftbeer Interesting Concept-Beer Petitions.com launches&hopes 2help craft beer drinkers enjoy beer they want @their fave pubs. http://bit.ly/11gJQN
6 carichardson Beer Cheddar Soup: Dish number two in my famed beer dinner series is Beer Cheddar Soup. I hadn’t had too.. http://bit.ly/1diDdF
7 BeerBrewing New York City Beer Events - Beer Tasting - New York Beer Festivals - New York Craft Beer http://is.gd/39kXj #beer
8 delphiforums Love beer? Our member is trying to build up a new beer drinker's forum. Grab a #beer and join us: http://tr.im/pD1n
9 Jamie_Mason #Baltimore Beer Week continues w/ a beer brkfst, beer pioneers luncheon, drink & donate event, beer tastings & more. http://ping.fm/VyTwg
10 carichardson Seattle and Beer: I went to Seattle last weekend. It was my friend’s stag - he likes beer - we drank beer.. http://tinyurl.com/cpb4n9
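The two-step procedure (retrieve the top-k relevant tweets, then re-rank them by interestingness) might look like the sketch below. The scoring function `toy_score` is a hypothetical stand-in for a learned interestingness model, and the example tweets are invented.

```python
def rerank(top_k, interestingness):
    """Re-order the top-k relevant tweets by descending interestingness.

    sorted() is stable, so tweets with equal scores keep their relevance order.
    """
    return sorted(top_k, key=interestingness, reverse=True)

def toy_score(tweet):
    """Hypothetical interestingness: one point each for a link and a hashtag."""
    return ("http" in tweet) + ("#" in tweet)

# Invented result list, ordered by relevance to the query "beer".
top_k = [
    "beer beer beer",
    "New beer festival http://example.org #beer",
    "Beer Cheddar Soup recipe http://example.org",
]

for rank, tweet in enumerate(rerank(top_k, toy_score), start=1):
    print(rank, tweet)
```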
26. LiveTweet
Data:
Twitter streaming API: sample of 1% of all tweets
Architecture:
Time slices over tweets
Analytical component with REST API
Web Frontend for end user
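A minimal sketch of grouping a tweet stream into time slices. The slice length of one hour and the `(timestamp, text)` input format are assumptions; the slides do not specify how LiveTweet defines its slices.

```python
from collections import defaultdict

def slice_key(timestamp, slice_seconds=3600):
    """Start of the time slice that a UNIX timestamp falls into."""
    return timestamp - (timestamp % slice_seconds)

def build_slices(tweets, slice_seconds=3600):
    """Group (timestamp, text) pairs by time slice."""
    slices = defaultdict(list)
    for timestamp, text in tweets:
        slices[slice_key(timestamp, slice_seconds)].append(text)
    return dict(slices)

# Invented sample: two tweets in the first hour, one in the second.
sample = [(0, "first"), (10, "second"), (3600, "third")]
print(build_slices(sample))
```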
28. LiveTweet: What comes next?
Retrieval
Incorporate with other retrieval metrics
Include Interestingness in a learning to rank approach
Social graph
System extension
Personalisation
Public API
Work with IBM data
29. Scenario 2
Linked Open Data
30. Information needs requiring semantic structure
Examples
Male persons who have a public profile document
Computing science papers authored by social scientists
American actors who are also politicians and are married to a model
Maybe specific databases available:
Person search engines
Bibliographic databases
Movie database
31. Linked Data
Semantic Web Technology to
1. Provide structured data on the web
2. Link data across data sources
[Figure: things in data sources A, B, C, D and E, connected by typed links]
32. Entities are identified via URIs
One statement = one triple
Subject Predicate Object
pd:cygri rdf:type foaf:Person
pd:cygri foaf:name "Richard Cyganiak"
pd:cygri foaf:based_near dbpedia:Berlin
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
Description of a link between two data sources
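As a small illustration, the statements of this slide can be written as subject, predicate, object tuples and their prefixed names expanded to full URIs. The `pd` and `dbpedia` namespaces follow from the two expansions given on the slide; the `rdf` and `foaf` namespaces are the standard ones. The helper `expand` is hypothetical.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
}

def expand(term):
    """Expand a prefixed name to a full URI; leave literals and full URIs alone."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term

# One statement = one triple.
triples = [
    ("pd:cygri", "rdf:type", "foaf:Person"),
    ("pd:cygri", "foaf:name", "Richard Cyganiak"),
    ("pd:cygri", "foaf:based_near", "dbpedia:Berlin"),
]

for triple in triples:
    print(tuple(expand(term) for term in triple))
```

The last triple is the link between two data sources: its subject lives on richard.cyganiak.de, its object on dbpedia.org.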
36. Querying linked data – an IR task?
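Classic IR: Information need, Keyword query, Documents, Information
IR magic happens between the query and the documents
Linked data: Information need, SPARQL query, Data sources, Entities
Here we need magic: between the query and the data sources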
37. Querying linked data – using an index
SELECT ?x
WHERE {
?x rdf:type foaf:Person .
?x rdf:type pim:Male .
?x foaf:maker ?y .
?y rdf:type foaf:PersonalProfileDocument .
}
38. A Schema for LOD
39. Idea
Schema Index:
Define families of graph patterns
Assign entities to graph patterns
Map graph patterns to context / source
Construction:
Stream-based for scalability
Little loss of accuracy
NOTE:
Index defined over entities
But: Index stores the contexts (sources)
40. Input Data
n-Quads
<subject> <predicate> <object> <context> .
Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .
[Figure: within the context http://dig.csail.mit.edu/2008/..., the entity w3p:#me has the type foaf:Person]
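A minimal sketch of reading such a line, assuming all four terms are IRIs in angle brackets as in the example. It is not a full N-Quads parser: literals and blank nodes are not handled. `Quad` and `parse_quad` are hypothetical names.

```python
import re
from collections import namedtuple

Quad = namedtuple("Quad", "subject predicate object context")
IRI = re.compile(r"<([^>]*)>")

def parse_quad(line):
    """Split a line of four angle-bracketed IRIs into a Quad."""
    terms = IRI.findall(line)
    if len(terms) != 4:
        raise ValueError("expected four IRIs: " + line)
    return Quad(*terms)

# The example quad from the slide.
line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
quad = parse_quad(line)
print(quad.subject, "is described in", quad.context)
```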
41. Layer 1: RDF Classes
All entities of a particular type
SELECT ?x
FROM …
WHERE {
?x rdf:type foaf:Person .
}
[Figure: class C1 (foaf:Person) points to the data sources DS 1, DS 2 and DS 3. Labels in the example: http://dig.csail.mit.edu/2008/..., foaf:Person, timbl:card#i, http://www.w3.org/People/Berners-Lee/card]
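The first layer can be sketched as a mapping from each RDF class to the data sources (contexts) that contain instances of it. This is an illustration of the idea only, not the SchemEX implementation; the quads and the `ex:ds` source names are invented.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def build_class_index(quads):
    """Map each RDF class to the contexts that state an instance of it."""
    index = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return index

# Invented quads: (subject, predicate, object, context).
quads = [
    ("ex:alice", RDF_TYPE, "foaf:Person", "ex:ds1"),
    ("ex:bob", RDF_TYPE, "foaf:Person", "ex:ds2"),
    ("ex:paper", RDF_TYPE, "foaf:Document", "ex:ds2"),
]

class_index = build_class_index(quads)
print(sorted(class_index["foaf:Person"]))
```

A query for all entities of type foaf:Person then only needs to be sent to the sources listed under that class.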
42. Layer 2: Type Clusters
All entities belonging to the same set of types
SELECT ?x
FROM …
WHERE {
?x rdf:type foaf:Person .
?x rdf:type pim:Male .
}
[Figure: classes C1 and C2 are combined into type cluster TC1, which points to the data sources DS 1, DS 2 and DS 3. Labels in the example: type cluster tc4711 with foaf:Person and pim:Male, timbl:card#i, http://www.w3.org/People/Berners-Lee/card]
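A sketch of the second layer under the same assumptions as before (invented quads and source names, not the SchemEX code): entities are grouped by their complete set of types, and each type cluster points to the contexts of its entities.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def build_type_clusters(quads):
    """Map each set of types (type cluster) to the contexts of its entities."""
    types = defaultdict(set)     # entity -> set of its types
    contexts = defaultdict(set)  # entity -> contexts of its type statements
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            contexts[subject].add(context)
    index = defaultdict(set)
    for entity, type_set in types.items():
        index[frozenset(type_set)] |= contexts[entity]
    return index

quads = [
    ("ex:tim", RDF_TYPE, "foaf:Person", "ex:ds1"),
    ("ex:tim", RDF_TYPE, "pim:Male", "ex:ds1"),
    ("ex:ann", RDF_TYPE, "foaf:Person", "ex:ds2"),
]

clusters = build_type_clusters(quads)
print(sorted(clusters[frozenset({"foaf:Person", "pim:Male"})]))
```

Only `ex:ds1` is returned for the cluster of foaf:Person and pim:Male, because `ex:ann` lacks the second type.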
43. Layer 3: Equivalence Classes
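Two entities are equivalent iff:
They are in the same TC
They have the same properties
The property targets are in the same TC
[Figure: classes C1, C2, C3 are combined into type clusters TC1 and TC2; equivalence class EQC1 points to the data sources DS 1, DS 2 and DS 3]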
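The three conditions can be turned into a signature per entity: its own type cluster plus the set of (property, type cluster of the target) pairs. Entities with the same signature fall into the same equivalence class. The sketch below works on plain triples, uses invented data with a hypothetical property `ex:profile`, and favours readability over efficiency.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def type_cluster(entity, triples):
    """The set of types stated for an entity."""
    return frozenset(o for s, p, o in triples if s == entity and p == RDF_TYPE)

def equivalence_classes(triples):
    """Group entities by type cluster, properties and target type clusters."""
    classes = defaultdict(set)
    for entity in {s for s, _, _ in triples}:
        links = frozenset(
            (p, type_cluster(o, triples))
            for s, p, o in triples
            if s == entity and p != RDF_TYPE
        )
        classes[(type_cluster(entity, triples), links)].add(entity)
    return list(classes.values())

triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "ex:profile", "ex:docA"),
    ("ex:docA", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", "ex:profile", "ex:docB"),
    ("ex:docB", RDF_TYPE, "foaf:PersonalProfileDocument"),
    ("ex:carol", RDF_TYPE, "foaf:Person"),
]

eq_classes = equivalence_classes(triples)
for members in eq_classes:
    print(sorted(members))
```

Here `ex:alice` and `ex:bob` are equivalent, while `ex:carol` has the same type but no profile link and therefore ends up in a class of her own.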
47. Building the Index from a Stream
Stream of n-quads (coming from a LD crawler)
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FiFo cache holding instances 1 to 6 together with their classes C1, C2 and C3]
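A sketch of stream-based construction with a bounded FiFo cache, restricted to type clusters for brevity. The cache size, the invented quads and the function name `stream_index` are assumptions; the sketch only illustrates why a small window can cost some accuracy.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"

def stream_index(quads, cache_size=2):
    """Build a type-cluster index from a quad stream with a FIFO instance cache."""
    cache = OrderedDict()     # subject -> (types, contexts), oldest first
    index = defaultdict(set)  # type cluster -> contexts

    def flush(types, contexts):
        index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if predicate != RDF_TYPE:
            continue
        if subject not in cache:
            if len(cache) >= cache_size:
                # Evict the oldest instance and write what is known about it.
                _, (old_types, old_contexts) = cache.popitem(last=False)
                flush(old_types, old_contexts)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        types.add(obj)
        contexts.add(context)

    for types, contexts in cache.values():  # write the remaining instances
        flush(types, contexts)
    return index

# Invented stream: the second statement about ex:a arrives late.
quads = [
    ("ex:a", RDF_TYPE, "foaf:Person", "ex:ds1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "ex:ds2"),
    ("ex:c", RDF_TYPE, "foaf:Person", "ex:ds3"),
    ("ex:a", RDF_TYPE, "pim:Male", "ex:ds1"),
]

print(sorted(map(sorted, stream_index(quads, cache_size=3))))
print(sorted(map(sorted, stream_index(quads, cache_size=2))))
```

With a cache of three instances, `ex:a` is assigned to the cluster of foaf:Person and pim:Male. With a cache of two, `ex:a` is evicted before its second type arrives, so the combined cluster is missed.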
48. Does it work well?
Comparison of the stream-based schema vs. a gold standard schema on an 11 M triple data set
49. Does it scale?
Semantic Web Challenge: Billion Triples Track
Provision of large scale RDF dataset
Crawled from LOD
Task:
Do something "useful"
Do it in a (web-)scalable way
Do it with at least 1 billion triples
Presentation at ISWC
50. BTC results
                         1st billion   2nd billion   full BTC
# triples                1 billion     1 billion     2.17 billion
# instances              187.7 M       222.6 M       450.0 M
# data sources           13.5 M        9.5 M         24.1 M
# type clusters          208.5 k       248.5 k       448.6 k
# equivalence classes    0.97 M        1.14 M        2.12 M
# triples in index       29.1 M        24.8 M        54.7 M
Compression ratio        2.91%         2.48%         2.52%
# triples/sec.           40.5 k        45.6 k        39.5 k
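The compression ratio appears to be the number of index triples divided by the number of input triples. This reading is an assumption, but it reproduces the reported percentages from the other rows of the table:

```python
# (index triples, input triples) per column of the table.
results = {
    "1st billion": (29.1e6, 1.0e9),
    "2nd billion": (24.8e6, 1.0e9),
    "full BTC": (54.7e6, 2.17e9),
}

def compression_ratio(index_triples, data_triples):
    """Size of the index relative to the input data, in percent."""
    return 100.0 * index_triples / data_triples

for name, (index_triples, data_triples) in results.items():
    print(f"{name}: {compression_ratio(index_triples, data_triples):.2f}%")
```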
51. SchemEX: What comes next?
Hierarchy of semantic information:
Type clusters
Equivalence clusters
Related types
Optimization
Smarter caching
Performance – Hadoop
Error correction
53. Take away message
Web evolving in interesting directions
Social networks, user generated content
Semantic data
Challenges for IR
Different settings
Different tasks
Question basic assumptions
54. Thank you!
Contact:
WeST – Institute for Web Science and Technologies
Universität Koblenz-Landau
gottron@uni-koblenz.de
55. Relevant Publications
1. A. Che Alhadi, S. Staab, and T. Gottron. Exploring user purpose writing single tweets. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.
2. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. Livetweet: Microblog retrieval based on interestingness. In TREC ’11: Proceedings of the Text Retrieval Conference 2011, 2011.
3. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed. Livetweet: Monitoring and predicting interesting microblog posts. In ECIR ’12: Proceedings of the 34th European Conference on Information Retrieval, 2012. In preparation.
4. T. Gottron and N. Lipka. A comparison of language identification approaches on short, query-style texts. In ECIR ’10: Proceedings of the 32nd European Conference on Information Retrieval, pp. 611–614, Mar. 2010.
5. M. Konrath, T. Gottron, and A. Scherp. Schemex – web-scale indexed schema extraction of linked open data. In Semantic Web Challenge, Submission to the Billion Triple Track, 2011.
6. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.
7. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Searching microblogs: Coping with sparsity and document quality. In CIKM ’11: Proceedings of the 20th ACM Conference on Information and Knowledge Management, 2011.
57. Use Cases
SAP Community Network (SCN) (Business Partners, Extranet)
Communities: Customers, Partners, Suppliers, Developers
Business value: Products support, Services, Find business partners
Volume: 6,000 posts/day; 1,700,000 subscribers; 16GB log/day
Lotus Connections (Employees, Intranet)
Communities: Employees, Working groups, Interest Groups, Projects
Business value: Task relevant information, Collaboration, Innovation
Volume: 4,000 posts/day; 386,000 employees; 1.5GB content/day
MeaningMine (Public Domain, Internet)
Communities: Social media, News, Web fora, Public communities
Business value: Topics, Opinions, Service for partners
Volume: 1,400,000 posts/day; 708,000 web sources; 45GB content/day
58. Twitter is different
Followers form a social graph
PageRank applicable?!
BUT:
Following is not (only) motivated by content
No statement about tweets!
59. Information seeking behaviour on Twitter
Web
2-4 query terms
Broader terms
Intentions
• Navigation
• Information
• Resources
Get to know a topic
Twitter
1-2 query terms
Specific terms
Intentions
• Timely information
• Trends
• People
Follow a topic
60. TREC
Microblog Track 2011
12,000,000 tweets
2 weeks
49 "topics" (queries)
Task: Filtering
Constraints
No external knowledge!
English tweets only
Temporal order of topic & tweets
Official extension of "relevance" to "interestingness" (!!!)
61. WeST @ TREC Microblog Track
Basics:
Lucene
No length normalisation
Interestingness
4 configurations:
WESTfilter: Retrieval via Lucene, filtering non-interesting tweets
WESTfilext: like WESTfilter, but with sentiments
WESTrelint: like WESTfilter, but re-ranking according to interestingness
WESTrlext: like WESTrelint, but with sentiments
Online communities as mainly perceived in public:
Private users
Social interaction
Sharing of information, pictures, online resources
Business communities are slightly different:
Grouped around a business/enterprise
Aim: add value to the business (knowledge management, public relations, customer acquisition, support, etc.)
Conclusion: communities have a value that needs to be taken care of
Value is endangered by risks (e.g. experts leaving) or might be increased by seizing opportunities (e.g. connecting people working on the same topic)