Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Zen & the art of data mining
1. Old Dominion University
Department of Computer Science
Hany SalahEldeen
Hany SalahEldeen Khalil hany@cs.odu.edu
Zen & the Art of Data Mining
07-08-14
Social Media Data Collection and the path to
Modeling & Predicting User Intention
Web Science & Digital Libraries Lab
1
3. Hany SalahELdeen
Education:
• PhD Candidate
• Web Science and Digital Libraries Group
• Masters Degree in Computer Vision and Artificial Intelligence
• Universitat Autonoma de Barcelona
• Bachelors of Computer Systems Engineering
• University of Alexandria
Hany SalahEldeen 3
4. Research & Technical Experience
• Microsoft Research Cairo
• Google GmBH Zurich
• Microsoft Inc. Mountain View
• National University of Singapore
Hany SalahEldeen 4
5. Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
Web Mining Pattern Analysis Machine Learning
Human Behavioral Analysis
Social Media Analysis
So what am I investigating?
5
6. Publications
Hany SalahEldeen
Shanghai CIKM 2014 Conference
- 1 first author paper
- 1 second author paper
London DL 2014 Conference
- 1 third author paper
Malta TPDL 2013 Conference
- 1 first author paper
6
7. Publications
Hany SalahEldeen
Indianapolis JCDL 2013 Conference
- 1 first author paper
Rio de Janeiro WWW 2013 Conference
- 1 first author paper
Cyprus TPDL 2012 Conference
- 1 first author paper
7
8. Beside the perks of
travelling, our research has
been popular…
Hany SalahEldeen 8
16. Our Research’s Popularity
Hany SalahEldeen
• Local newspaper: The Virginia Pilot
• 4 x MIT Technology Review
• BBC
• Mashable
• The Atlantic
• Yahoo News
• Articles in > 11 different languages
• We have been called:
• The Internet Archeologists
• Web Time Travelers
16
18. Ok hold on, let’s go back to
the basics…
Hany SalahEldeen 18
19. Web 2.0
Definition: Web 2.0 is a concept that
takes the network as a platform for
information sharing, interoperability,
user-centered design, and collaboration
on the World Wide Web.*
* http://en.wikipedia.org/wiki/Web_2.0
Hany SalahEldeen 19
20. Web 2.0
• Yes, Web 2.0 is about “user-generated
content”
• But explicit content contributed by
users is just 20% of what “matters”
• 80% is in the implicitly contributed
data*
Hany SalahEldeen 20
*Toby Segaran, Programming Collective Intelligence, 2007
21. Systems & Web 2.0
• Google: Utilizes PageRank which is a
technique for extracting intelligence from
the link structure
• Flickr: Utilizes “interestingness” algorithm
• Amazon: Utilizes “people who bought this
product also bought” feature
• Pandora: Utilizes “similar artist radio”
• eBay: Utilizes “reputation system”
Hany SalahEldeen 21
22. So why do we even care
about all that?
Hany SalahEldeen 22
24. Power to the People!
• Because analyzing a huge dataset of
millions of users will yield a lot of potential
insights into:
• User Experience
• Marketing
• Personal Taste
• Human Behavior in general.
Hany SalahEldeen 24
26. Data Mining
• Definition: It is the computational process of
discovering patterns in large data
sets involving methods at the intersection
of artificial intelligence, machine
learning, statistics, and database systems. The
overall goal of the data mining process is to
extract information from a data set and
transform it into an understandable structure
for further use.
http://en.wikipedia.org/wiki/Data_mining
Hany SalahEldeen 26
27. Back to my goal:
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
27
28. Let’s breakdown the title first…
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
28
29. Let’s breakdown the title first…
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
29
31. Michael Jackson Dies
Hany SalahEldeen
Snapshot on: June 25th 2009
http://web.archive.org/web/20090625232522/http://www.cnn.com/
31
32. Jeff tweets about it…
Hany SalahEldeen
Published on: June 25th 2009
https://twitter.com/mdnitehk/status/2333993907
32
33. Jeff’s friend Jenny was on a vacation in Hawaii for a
month
Jenny is off the grid…
Hany SalahEldeen 33
34. When she came back she checked Jeff’s tweets and
was shocked!
Jenny starts catching up a month later
Hany SalahEldeen
Read on: July26th 2009
https://twitter.com/mdnitehk/status/2333993907
34
35. She quickly clicked on the link in the tweet…
Jenny follows the link on July 26th
Hany SalahEldeen
http://web.archive.org/web/20090726234411/http://www.cnn.com/
CNN page on:
July 26th 2009
35
36. • Implication:
• Jenny thought Jeff is making a joke about her
favorite singer and she got mad at him
• Problem:
• The tweet and the resource the tweet links
to have become unsynchronized.
Jenny is confused!
Hany SalahEldeen 36
39. Reading about it in Storify.com a year
later in March 2012
Hany SalahEldeen
http://storify.com/maq4sure/egypts-revolution
39
40. I noticed some shared images are missing
Hany SalahEldeen
http://storify.com/maq4sure/egypts-revolution
40
41. Some tweets are still intact
Hany SalahEldeen
https://twitter.com/miss_amy_qb/status/32477898581483521
41
42. …and some lost their meaning with
the disappearance of the images
Hany SalahEldeen
Missing ?
https://twitter.com/aishes/status/32485352102952960
https://twitter.com/omar_chaaban/status/32203697597452289
42
43. The tweet remains but the shared
image disappeared…
Hany SalahEldeen
http://yfrog.com/h5923xrvbqqvgzj
43
44. • Implication:
• The reader cannot understand what the
author of the tweet meant because the image
is not available.
• Problem:
• The post is available but the linked resource
(image) is completely missing.
Cairo….we have a problem!
Hany SalahEldeen 44
45. …back to the title
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
45
46. …back to the title
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
46
48. 48
The Anatomy of a Tweet
Author’s username
Other user mention
Tweet Body
Hash TagShortened URL
to resource
Publishing
timestamp
Social
Post
Shared Resource
Interaction
options
Hany SalahEldeen 48
49. 49
3 URIs = 3 Chances to fail
Hany SalahEldeen
http://news.blogs.cnn.com/2012/04/26/norwegian
s-sing-to-annoy-mass-killer/
https://twitter.com/KentEiler/status/19553574
9754527745
49
51. 51
If I click on a link in a tweet, which
version should I get?
ttweet or tclick ?
Hany SalahEldeen 51
52. 52
Sometimes you want a
previous version
The Correct Temporal
Intention
CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pm
Hany SalahEldeen 52
53. 53
Sometimes you want the
current version
The Correct Temporal
Intention
In this case the current state of the press releases page
Hany SalahEldeen 53
54. 54
Research Question
Can we estimate the users’
intention at the time of posting
and reading to predict and
maintain temporal consistency?
Hany SalahEldeen 54
55. 55
People rely on social media for most
updated information
Hany SalahEldeen 55
58. All tweets are equal…
…but some are more equal than the others
Hany SalahEldeen 58
59. Preliminary Research Questions:
1. How long would these last?
2. And if lost, are they archived?
3. Is this what the author intended?
Hany SalahEldeen 59
60. 60
Since tweets are considered the first draft
of history… the historical integrity of the
tweets could be compromised.
Hany SalahEldeen
Historical Integrity
60
63. 63
The life cycle of a social post
tweets Links to
Hany SalahEldeen 63
64. 64
The life cycle of a social post
tweets
What the
reader
receives
Links to
Same state
the author
intended
Hany SalahEldeen 64
65. 65
The life cycle of a social post
tweets
What the
reader
receives
Links to
Same state
the author
intended
Hany SalahEldeen
The resource
has disappeared
65
66. 66
The life cycle of a social post
tweets
What the
reader
receives
Links to
Same state
the author
intended
The resource
has disappeared
The resource
has changed
Hany SalahEldeen 66
67. 67
Same state
the author
intended
The Resource’s Possibilities
a bigger problem since the reader might not know.
What the
reader
receives
The resource
has disappeared
The resource
has changed
Hany SalahEldeen 67
69. 69
The attack on the embassy was in February
2013
Or the resource could change
Hany SalahEldeen 69
70. 70
Why do we want to detect the
Author’s Temporal Intention?
• Match: and convey the intended information.
• Notify:
– the author that the resource is prone to change.
– the reader that the resource has changed.
• Preserve: the resource by pushing snapshots into the
archive automatically.
• Retrieve: the closest archived version to maintain the
consistency.
Hany SalahEldeen 70
71. 71
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 71
72. 72
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 72
73. 73
Estimating Web Archiving Coverage
• Goal: Estimate how much of the public web is present in the public archives
and how many copies are available?
• Action:
– Getting 4 different datasets from 4 different sources:
• Search Engines Indices
• Bit.ly
• DMOZ
• Delicious.
• Results: *
• Publications:
– How much of the web is archived? JCDL '11
– http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-
archived.html
Hany SalahEldeen
16%-79% Archived according
to the source
73
74. 74
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 74
75. 75
The timeline of the resource
Hany SalahEldeen 75
http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
77. 77
Actual Vs. Estimated Dates
Hany SalahEldeen
• Successfully estimated the creation date
>75% of the resources
• >33% we estimated the exact date
77
78. 78
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 78
79. • From Twitter, Websites, Books:
• The Egyptian revolution
• From Twitter Only:
• Stanford’s SNAP dataset:
• Iranian elections
• H1N1 virus outbreak
• Michael Jackson’s death
• Obama’s Nobel Peace Prize
• Twitter API:
• The Syrian uprising
Six Socially Significant Events
Hany SalahEldeen 79
81. Revisiting after a year…
Hany SalahEldeen
• There is a nearly linear relationship between the amount
missing from the web and time.
• After 1 year ~11% is gone, and 0.02% is lost every day
81
83. First Attempts to Shared Content
Replacement
Hany SalahEldeen 83
• We performed an experiment to gauge how many
of the resources that are missing could be
replaced with other similar resources.
• Collected a dataset with available resources
which we assumed to be missing
• Used our method to extract the replacement
resources
• Measured the similarity with the original resource
84. First Attempts to Shared Content
Replacement
Hany SalahEldeen
We were able to extract another resource with >70% similarity
to the missing resource in >40% of the cases
84
85. 85
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 85
86. 86
Temporal Intention Relevancy Model
(TIRM)
Between ttweet and tclick:
The linked resource could have:
• Changed
• Not changed
The tweet and the linked resource could be:
• Still relevant
• No longer relevant
Hany SalahEldeen 86
87. 87
Resource is changed but relevant
• The resource changed
• But it is still relevant
Intention: need the current version of the resource at any time
Hany SalahEldeen 87
89. 89
Resource is changed and not relevant
Intention: need the past version of the resource at any time
• The resource changed
• But it is no longer relevant
Hany SalahEldeen 89
91. 91
Resource is not changed and relevant
Intention: need the past version of the resource at any time
• The resource is not changed
• And it is relevant
Hany SalahEldeen 91
93. 93
Resource is not changed and not relevant
Intention: I am not sure which version of the resource I need
• The resource is not changed
• But it is not relevant
Hany SalahEldeen 93
95. 95
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 95
96. 96
Feature extraction
• For each tweet we perform:
– Link analysis
– Social Media Mining
– Archival Existence
– Sentiment Analysis
– Content Similarity
– Entity Identification
Hany SalahEldeen 96
97. 97
1- Link analysis
• Since the tweets have embedded resources shortened by
Bit.ly we can extract:
– Total number of clicks
– Hourly click logs
– Creation dates
– Referring websites
– Referring countries
• We calculate the depth of the resource in relation to its domain
(either it is a leaf node or a root page)
– We calculated the number of backslashes in the resource’s URI
Hany SalahEldeen 97
98. 98
2- Social Media Mining
• Twitter:
– Using Topsy.com’s API to
extract:
• Total number of tweets.
• The most recent 500.
• Number of tweets by
influential users.
The collection of tweets extracted provided an extended context of the
resource authored by users in the twittersphere.
Hany SalahEldeen 98
99. 99
2- Social Media Mining
• Facebook:
– Mined too for likes, shares, posts, and clicks related to each
resource.
Hany SalahEldeen 99
100. 100
3- Archival Existence
• Using Memento Time
Maps we get:
– Total mementos
available
– Different archives count.
– The closest archived
version to the tweet
time.
Hany SalahEldeen 100
101. 101
4- Sentiment Analysis
• Using NLTK libraries of natural language text processing
• Extract the most prominent sentiment in the text
Hany SalahEldeen 101
102. 102
5- Content Similarity
• Steps:
– We download the content HTML using Lynx browser.
– We apply boilerplate removal algorithm and full text extraction.
– Calculate the cosine similarity between the two pages.
70% similarity
Hany SalahEldeen 102
103. 103
6- Entity Identification
• By visual inspection we observed that the majority of tweets about
celebrities are related to current events.
• We harvested Wikipedia for lists of actors, politicians, and athletes.
• Checked the existence of a celebrity mention in the tweets.
Actor: Johnny Depp
Hany SalahEldeen 103
104. 104
The trained classifier
• From the feature extraction phase we extracted 39
different features to train the classifier.
• Using 10-fold cross validation, the Cost Sensitive Classifier
Based on Random Forests gave the highest success rate =
90.32%
Hany SalahEldeen 104
105. 105
What’s Next for Hany?
• Finish up my dissertation
• Defend.
• Get a research/Data scientist position
• Interests:
– L3S Research Center Germany
– Microsoft Research
Hany SalahEldeen 105
106. 106
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen
Summary:
Email: hany@cs.odu.edu
Office: 3102
Website: http://www.cs.odu.edu/~hany/
Twitter: @hanysalaheldeen
106