Enabling Real-Time User Interests
for Next Generation
Activity-Oriented Social Networks
Prateek Agarwal (2005CS10173)
Sameer Madan (2005CS10181)
A Thesis submitted to the Indian Institute of Technology Delhi in conformity with the
requirements for the degree of Bachelor of Technology
under the guidance of
Dr. Bijendra Nath Jain
Dr. Koustuv Dasgupta
Dr. Dipanjan Chakraborty
Dr. Anupam Joshi
Department of Computer Science & Engineering
Indian Institute of Technology Delhi
This is to certify that the project titled “Enabling Real-Time User Interests for Next
Generation Activity-Oriented Social Networks” being submitted by Prateek Agarwal and
Sameer Madan to the Indian Institute of Technology Delhi, for the award of the degree
of Bachelor of Technology in Computer Science and Engineering, is a record of bona ﬁde
project work carried out by them under my guidance and supervision at the Department
of Computer Science and Engineering, Indian Institute of Technology Delhi and IBM
India Research Lab, New Delhi.
Prof. Bijendra Nath Jain
Dept. of Comp. Sc. and Engg.
We would like to express our gratitude to Prof. Bijendra N. Jain for his patience and
constant encouragement, help, guidance and invaluable suggestions. We are thankful to
him for the advice he gave us at diﬀerent steps of the literature survey, problem under-
standing and its implementation.
We would like to thank our mentors at IBM: Dr. Koustuv Dasgupta, Dr. Dipanjan
Chakraborty, Dr. Anupam Joshi, Sumit Mittal and Seema Nagar for their valuable time
and for agreeing to supervise us on this project. We are grateful to them for their invalu-
able inputs, especially helping us precisely deﬁne the problem statement and for guiding
us in the right direction.
Since its inception, online social networking has evolved from basic messaging and friend
ﬁnding to a diverse and multi-faceted entity that encompasses the entire personality of
an individual. Recent advancements including mobile-phone based access and real-time
status update facilities (such as micro-blogging) allow a person’s online presence to be
as ephemeral and dynamic in nature as his very thoughts and interests. With the
evolution of the mobile platform, these networks now allow easy, on-the-fly access
through hand-held mobile devices and PDAs. In this context, we consider an
activity-oriented social network called R-U-In?, currently under development at the IBM India
Research Laboratory, New Delhi, which exploits contextual reasoning and match-making
techniques to help users locate others based on the activity of interest. In the context of
real-time presence, we wish to enhance the functionality of this next generation network
in two ways. First, we enable a more up-to-date reflection of user interest and extend the
reach of the R-U-In? framework beyond desktops to a mobile platform, thereby allowing
easy access on the move. Second, we incorporate the complete online presence of
the person, including sources of real-time interests such as microblogs, to enhance
the match-making process through a better understanding of his/her tastes. Thus,
we present here a mobile-phone based framework for R-U-In? as well as an enhanced
version of the backend that is able to tag a user based on his/her real-time interests, as gleaned
from his/her real-time online presence and past activities on such platforms, specifically
Twitter, which we consider in this report.
The area of online social networking has experienced a phenomenal growth in the past
decade. Since the late 1990s when rudimentary forms of present day social networking
sites ﬁrst appeared, these group sites have bloomed to encompass a wide gamut of ser-
vices like dating, friend-ﬁnding, business networking, chatting, photo and video sharing,
blogging, mobile connectivity and much more. Apart from social networking sites that
cater to the general public, there are also certain sites which cater to a particular niche
of users. As the competition in this area heats up, recent times have seen a host of
new services coming up, with each site trying to capitalize on the market by creating the
next big thing in social networking.
1.1 Social Networks - A Brief History
Online social networking, as we know it, began in 1997 with a site called SixDegrees.com.
The name of this site is derived from the empirical observation that any two people in the world
are connected by a chain of friends at most six people long. At this point of time,
there existed pages on the web that allowed people to create their profiles, find friends
and affiliate themselves to a school. However, SixDegrees was the first to combine all
these into one consolidated platform. This site attracted many users and grew till 1998,
after which it declined and eventually shut down in 2000.
The next notable milestone was the advent of business networking. Starting in 2001, sites like
Ryze.com, LinkedIn and Tribe.net opened their doors to the world. Ryze was started
with the aim of networking with entrepreneurs and venture capitalists in the San Francisco
area, but it never really gained much popularity. LinkedIn, however, grew into a
major platform for business networking and is now the major player in that niche of
Figure 1.1: A brief history of online social networks
Friendster was the first site that showed exactly how popular social networking sites
were to become in the near future. Within a short time of its opening, the site experienced
extremely rapid growth in its user base. It could not, however, keep up with its own
growth, and it eventually collapsed and lost users to new players like MySpace. MySpace was
the first social network to gain worldwide popularity, and at present it boasts a user
base exceeding 250 million. Other social networking sites such as Orkut and Hi5
also came up around this time and gained popularity in other parts of the world. Orkut
has a very large user base in countries like India and Brazil.
Some social networking sites, such as Facebook, started off with the intention of providing
services to a specific niche of users. Facebook was limited to Harvard students at
the time of its creation, but it soon opened its doors to the general public. Some of
the features that led to the popularity of Facebook include its support for third party
applications as well as the emphasis laid on user privacy. Presently, Facebook is one of
the largest social networking sites in the world, with 200 million registered users.
Recently, social networking has diversiﬁed to other forms like blogging. Sites like Live-
Journal and Xanga provide a platform for users to express their thoughts and emotions.
A recent, more dynamic incarnation of this paradigm is that of micro-blogging. Sites such
as Twitter combine the personal touch of a blog with the spontaneity of status updates by
allowing users to post short one line blogs. Twitter is known for being a highly dynamic
and updated source of a user’s interests and activities.
1.2 R-U-In? - Next Generation Activity-Oriented
Online Social Networking System
There has been a significant enhancement in the features online social networks have
provided since their inception. Starting as a medium for bringing people together through
chat rooms and personal web pages, they are now being used in fields varying from business
to dating to medicine. Over the last few years, “contextual communications” has been
emerging as one of the key features of these networks, i.e. users now use various end
devices like mobile phones and PDAs, and clients like GoogleTalk, Facebook etc., to update
their presence, availability and mood. Rich presence, thus, is not just limited to availability
but now extends to the personality of the user, and with the help of technologies like Web
2.0 and converged communications, a whole new genre of real-time, communication-driven
social networking will come up.
But present social networks are yet to fully exploit the domain of collaboration.
R-U-In?, which is currently being developed at IBM India Research Lab, New Delhi,
leverages the strengths of Web 2.0 and converged network technologies to create a rich
next-generation service. It allows users to collaborate and participate in activities of
mutual interest by enabling them to search for other like-minded users. R-U-In? uses
contextual modeling and reasoning techniques to enable social search based on real time
user interests and ﬁnds potential matches for the proposed activity. It also exploits
next-generation presence and communication technologies to manage the entire activity
lifecycle in real time.
Given the real-time networking nature of such a system, it is required to capture user
thoughts as they appear. Thus, we have developed an Android-based framework for
R-U-In? so that the user is able to use this service, as well as update his/her interests, even
when on the move. Apart from these real-time updates, we would also like to
incorporate a user’s past online activity to build a better understanding of
his/her interests. Further, we need to expand the current ontology that R-U-In? recognizes
so that we are able to provide a match for a wide range of activities. Combining
all these and considering only a user’s updates on Twitter.com, we present an exhaustive
analysis of the nature and content of tweets by users. We also present a system that tags
users based on what they tweet about, employing an ontology developed from the
aforementioned analysis as well as from empirical observations of user tweets and status updates.
Consider the following example, which explains how such a real-time social network
works: Arya, while out, wants to watch a movie but doesn’t have company. He uses his
mobile to search the R-U-In? system for a potential match and is matched with Jane,
who is also interested in watching a movie at a theatre near his current location. He
immediately sends a request to Jane, who happens to like his profile and accepts his request.
A notification about the same is immediately displayed on Arya’s mobile. He can
now use the R-U-In? communication features like SMS and Click-to-Call to fix a meeting
point with Jane. The two eventually meet up and enjoy the movie.
The rest of the report is organized as follows: Chapter 2 presents some of the related
work in the mobile social networking domain. We present the design of our mobile frame-
work in Chapter 3 and a detailed use case in Chapter 4. Chapter 5 presents the detailed
microblog analysis followed by Conclusions and Further Work in Chapter 6.
R-U-In? Mobile Framework
R-U-In? is a real-time interest based service where users can update their interests and
notify the system of their intention to search for other like-minded people. Such a service
is most useful if the user is able to update his/her interest and search for other users as
the thought strikes, while he/she is on the move and around the intended activity
location. This allows the user to search for similar-minded users at that very
moment and plan the activity on the spot, rather than having to wait to get home,
log onto a PC and then come back to that location to follow up on
the activity. The mobile phone is with the user all the time, so it
also allows the user to use the system on the go whenever he/she finds the time, be it just
5 minutes in a bus or a train.
Mobile phones also offer other capabilities which PCs lack. Functions like the Global
Positioning System (GPS), which is now available in most phones, can be used to determine
the location of the user. This can help a user find a feasible location for the activity
by comparing the distances of candidate locations from his/her current location.
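A minimal sketch of such a distance comparison is given below. It is written in Python for brevity rather than in the Java used on Android, and it is an illustration only: the function names, the venue names and the coordinates are all hypothetical, and the great-circle (haversine) formula is one common choice for comparing distances between GPS fixes, not necessarily the one used by R-U-In?.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_venue(current, venues):
    """Return the (name, lat, lon) venue closest to the user's (lat, lon) fix."""
    return min(venues,
               key=lambda v: haversine_km(current[0], current[1], v[1], v[2]))


# Hypothetical coordinates, for illustration only.
user_fix = (28.545, 77.192)
candidate_venues = [("Venue A", 28.540, 77.167), ("Venue B", 28.632, 77.219)]
closest = nearest_venue(user_fix, candidate_venues)
```

Here `closest` is the candidate with the smallest distance from the last GPS fix; a real client would obtain that fix from the phone's location service.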
2.2 Mobile application over mobile web-site
While a possible solution could be just another mobile website for the system, this approach has
many limitations of its own. The inability of mobile web applications to access the local
capabilities of the mobile device can limit their ability to provide the same features as native
applications. Generally, web browsing software on mobile devices lacks the flexibility
and functionality present in PC-based web browsers. Such browsers often do not support features
like client-side scripting, style sheets, storage of cookies, etc., which are now widely used
in most web sites to enhance the user experience, for example by validating the data
entered by the page visitor. Thus, using the same web site on a mobile phone is not a
good solution. Also, mobile phones have a constrained display, and large web sites make
browsing uncomfortable for the user, who has to scroll every time to find information.
A dedicated application can also be linked to other features present in mobile
phones, like the calendar, alarm clock, etc., thus giving the user a common
terminal for managing his/her information, or to features like GPS, enhancing the user experience.
A dedicated application can also be sold preloaded on mobile phones, making
the application an integral feature of the wireless handset like an alarm clock, calendar
or mobile e-mail, and thus generating revenue for the developer.
2.3 Android Platform
Android is an open source software platform for mobile devices based on the Linux kernel.
It was initially developed by Google and later by the Open Handset Alliance.
Being an open source platform, Android is evolving very quickly. It oﬀers the advan-
tages of openness and collaborative development. Android’s radically diﬀerent approach
to mobile Linux application development oﬀers some unique advantages. The biggest
advantage is that it provides a very high level of uniformity. In theory, the vast majority
of Android applications will be able to run on virtually any Android-based device without
requiring any further modiﬁcation.
Android provides a comprehensive and well-organized variety of high-level APIs for build-
ing applications and leveraging the underlying functionality of the platform. These APIs
provide a very high level of abstraction, which makes them easy to use. The APIs truly
make it possible to build applications that integrate fully with the rest of the platform.
Some of the features oﬀered by the platform are:
Feature                  Description
Handset layouts          Supports larger, VGA, 2D graphics library, 3D graphics library
                         based on OpenGL ES 1.0 specifications, and traditional
                         smartphone layouts.
Connectivity             Technologies including GSM/EDGE, CDMA, EV-DO, UMTS,
                         Bluetooth, and Wi-Fi are supported.
Messaging                SMS and MMS are available forms of messaging, including
                         threaded text messaging.
Dalvik Virtual Machine   Software written in Java can be compiled into Dalvik bytecodes
                         and executed in the Dalvik virtual machine, which is a
                         specialized VM implementation designed for mobile device use,
                         although not technically a standard Java Virtual Machine.
Media Support            Supports the following audio/video/still media formats:
                         MPEG-4, H.264, MP3, AAC, MIDI, OGG, AMR, JPEG, PNG, GIF.
Additional Hardware      Android can utilize video/still cameras, touchscreens, GPS,
Support                  accelerometers, and accelerated 3D graphics.
Storage                  The database software SQLite is used for data storage.
Development environment  Includes a device emulator, tools for debugging, memory and
                         performance profiling, and a plugin for the Eclipse IDE.
Table 2.1: Features of Android Platform
2.4 Present MSN market
There seems to be huge potential for growth in this nascent market. Ziv et al. have discussed
the implications of mobile social networks for the wireless sector, content providers, technology
companies, and the users of the mobile platform, and presented a case study on
Dodgeball, a New York City based mobile social networking company, to exemplify user-centric
innovation on the mobile platform. Industry analysts have predicted huge demand
in this market, particularly from teens and young adults. Today about half the global
population (around 3.3 billion people) owns a handset. The International Telecommunication
Union (ITU) found that mobile subscriptions rose steadily, by 39% and 28%
per year in Africa and Asia respectively, from 2005 to 2007. eMarketer, a market analysis firm, forecasts
that mobile social networking will grow from 82 million users in 2007 to over 800
million worldwide by 2012.
A study conducted by Juniper Research predicts that revenues from user-generated content
will grow from $572 million in 2008 to over $5.7 billion in
2012, of which about 50% will be accounted for by social networking sites.
In a country like India, where mobile penetration exceeds PC penetration and will
continue to do so, accessing the internet from the mobile is definitely going to get more
and more popular. The Telecom Regulatory Authority of India (TRAI) recently found
that the total number of wireless subscribers in the country (as of end March 2009) is
391.76 million. With a significant number of these subscribers using “data-enabled” handsets
according to the TRAI data for the December 2008 quarter (101.1 million handsets),
the country may be expected to take a big step forward on the mobile Internet front. As
far as Indian users go, presently about half (48.9%) of all traffic to social networks comes via
mobile phones.
2.5 Existing Mobile Social Networking Services
Today, various web-based social networks (like MySpace and Facebook) are moving into the mobile domain,
while there are many which are being developed specifically for the mobile
Qeep is an online mobile social network developed by Blue Lion Mobile. It oﬀers its users
varied services like private messaging, live multi-player gaming, sound attacks, photo-
blogging with unlimited storage space and QMS, a special kind of text message designed
for qeep which costs a lot less than a normal SMS. Users are required to download a
java-based qeep application on their mobiles. Absence of a complex solution stack mini-
mizes the amount of memory needed for installation and running the program on mobile
phones. As of December 2008, qeep has a membership of over 750,000 users.
Dodgeball was a location based mobile social networking service started by Dennis Crowley
and Alex Rainert, and was acquired by Google in 2005. The service was used
by sending simple text messages to the system, and there was no need to download or
install anything on the phone. A user had to text his/her location to the service and was then
notified of friends and friends of friends present nearby. It was available in 22
cities before it was shut down in February 2009 and replaced with Google Latitude.
It is an aggregation portal which shows users their friends’ social updates from Facebook,
Twitter, Flickr, Hyves and blogs. It allows users to store their phone’s contacts, pictures,
text messages and calendar events online and get their friends’ updated contact details
automatically synced to their phone. Users are not required to install any software on their
Design of the Application
In this chapter, we shall present the complete design of the application along with argu-
ments for choosing the same.
3.1 Design Overview
Our main aim was to incorporate all the functionalities provided by the web portal of
R-U-In? and make it easier for the user to use the system. We tried to keep the application as close
as possible to the look and feel of the web portal so that the user feels familiar with the
system, while respecting the constraints that a mobile device imposes compared to a web application. The web
portal for R-U-In? is developed using Java Server Pages (JSP) and Ajax technologies.
These technologies are not available on the Android mobile platform. So we decided to
use simple HTTP requests to talk to the servlets on the server side and to parse the
response received using the inbuilt SAX XML parser.
Another major task was to get periodic updates from the server about different friends,
activities and invites. We preferred a client-pull over a server-push mechanism for this operation
as it allowed us to keep the server design simple: the server is not required
to keep track of different devices, their presence, their addresses, locations and status.
This way the client can also determine the polling rate and control it as required. There
are three threads running in the background which perform this periodic task of updating
the data. They are created when the user logs in and interrupted when the user exits the
application.
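The client-pull scheme with three background threads can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Android implementation: the `Poller` class, the `start_pollers` and `stop_pollers` helpers, the 30 second default interval and the `fetch` callables (stand-ins for the HTTP requests to the server) are all hypothetical names introduced here.

```python
import threading


class Poller(threading.Thread):
    """Background thread that calls `fetch` every `interval` seconds until stopped."""

    def __init__(self, name, fetch, interval):
        super().__init__(name=name, daemon=True)
        self._fetch = fetch              # stand-in for an HTTP request to the server
        self._interval = interval        # the client chooses its own polling rate
        self._stop_event = threading.Event()
        self.latest = None               # most recent response seen by this poller

    def run(self):
        while not self._stop_event.is_set():
            self.latest = self._fetch()
            # wait() returns early once stop() is called, so logout is prompt
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()


def start_pollers(fetchers, interval=30.0):
    """Create and start one poller per entry, e.g. friends, activities, invites."""
    pollers = [Poller(name, fetch, interval) for name, fetch in fetchers.items()]
    for poller in pollers:
        poller.start()
    return pollers


def stop_pollers(pollers):
    """Signal every poller to stop and wait for the threads to finish."""
    for poller in pollers:
        poller.stop()
    for poller in pollers:
        poller.join()
```

In this sketch `start_pollers` would be called at login with three fetchers and `stop_pollers` at exit. Because each client polls on its own schedule, the server needs no registry of devices or addresses.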
When a user queries the system, activities matching his/her query are marked on the map.
These results could be present anywhere on the map and, due to the constrained screen size
of the mobile phone, it becomes very uncomfortable for the user to check the responses
received. To solve this issue, we zoom out the map each time to a zoom level at which
all the matches are displayed on the screen without the display getting cluttered.
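One way to pick such a zoom level is sketched below in Python. It assumes the common Web Mercator tiling, in which the whole world is 256 pixels wide at zoom 0 and doubles in width with every zoom step. The function names, the maximum zoom of 19, the padding factor and the screen sizes are assumptions made for illustration, and matches that straddle the 180 degree meridian are not handled.

```python
import math

TILE_SIZE = 256   # world width in pixels at zoom 0 (Web Mercator convention)
MAX_ZOOM = 19     # assumed upper bound on the zoom level


def _mercator_y(lat):
    """Latitude in degrees -> normalised Web Mercator y (0.5 at the equator)."""
    s = math.sin(math.radians(lat))
    return 0.5 - math.log((1 + s) / (1 - s)) / (4 * math.pi)


def fit_zoom(points, width_px, height_px, padding=0.9):
    """Largest integer zoom at which every (lat, lon) point fits on the screen."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    x_span = (max(lons) - min(lons)) / 360.0          # fraction of world width
    y_span = abs(_mercator_y(min(lats)) - _mercator_y(max(lats)))
    zoom = MAX_ZOOM
    for span, size in ((x_span, width_px), (y_span, height_px)):
        if span > 0:
            # world is TILE_SIZE * 2**zoom pixels wide; the span must fit in `size`
            fit = math.floor(math.log2(size * padding / (TILE_SIZE * span)))
            zoom = min(zoom, fit)
    return max(0, zoom)


def bounds_centre(points):
    """Centre of the bounding box of the matches, used to position the map."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)
```

The `padding` factor leaves a margin so that markers do not sit on the screen edge, which is one simple way of keeping the display uncluttered.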
Certain tabs present in the web interface have been combined to allow convenient
access to the user in the mobile domain. When the user receives a new invite or request
for an activity, he gets a notiﬁcation popup on his mobile phone screen.
There are ﬁve major components constituting the whole application. We shall now de-
scribe each of them one by one.
3.2.1 Location Module
R-U-In? marks the location of the activities on the map. When the user queries for an
activity (interest), latitude/longitude corresponding to that activity’s (interest’s) location
are determined. The server is also queried for matching interests (activities) and the lat-
itude/longitude of those locations are also determined. These locations are then marked
with different markers on the map. In order to access the network for this operation,
the following permission must be declared in the AndroidManifest.xml file:
<uses-permission android:name="android.permission.INTERNET"/>
3.2.2 Display Management Module
R-U-In? uses the Google Maps interface to display the activity locations of the users. Due to the
constrained screen size, we decided to display the map on the whole screen. Google maps
are rendered on the screen using a set of APIs provided for Android. However, this is not a
standard package in the Android library. In order to use it, the following XML element
must be added as a child of the application element in the AndroidManifest.xml file:
<uses-library android:name="com.google.android.maps"/>
After a user logs in, this module queries the GPS module of the device for the last known
location of the user and centers the map on that location.
3.2.3 Friend Info Module
This module gets the data of the user’s friends. It starts in the background after the user
logs in and starts querying the server for the friends’ real-time data. When the user wants
to view the friend information, this module parses the response from the server using
the SAX XML parser and writes the data on the device in key-value pair format using
the SharedPreferences mechanism. This data is then read by the display management
module, which updates the friend information in the display window.
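The parse-and-store step can be illustrated with Python's standard `xml.sax` module, which follows the same event-driven model as the SAX parser on Android. The XML layout, the element and attribute names and the `friend.<id>.<field>` key scheme below are hypothetical, since the actual server response format is not shown here, and a plain dictionary stands in for SharedPreferences.

```python
import xml.sax


class FriendHandler(xml.sax.ContentHandler):
    """Collects the children of each <friend> element into flat key-value pairs."""

    def __init__(self):
        super().__init__()
        self.prefs = {}          # stand-in for the SharedPreferences store
        self._friend_id = None   # id of the <friend> currently being read
        self._field = None       # name of the child element currently open
        self._text = []

    def startElement(self, name, attrs):
        if name == "friend":
            self._friend_id = attrs.get("id")
        elif self._friend_id is not None:
            self._field = name
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "friend":
            self._friend_id = None
        elif self._field == name:
            key = "friend.%s.%s" % (self._friend_id, name)
            self.prefs[key] = "".join(self._text).strip()
            self._field = None


def parse_friends(xml_text):
    """Parse a (hypothetical) friends response into a dict of key-value pairs."""
    handler = FriendHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.prefs


SAMPLE_RESPONSE = """<friends>
  <friend id="42"><name>Jane</name><status>Watching a movie</status></friend>
  <friend id="7"><name>Arya</name><status>Available</status></friend>
</friends>"""

friend_prefs = parse_friends(SAMPLE_RESPONSE)
```

A flat key scheme of this kind suits a key-value store, since the display code can look up a single field without re-parsing the response.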
3.2.4 Activity Management Module
This module keeps track of the different activities a user is involved in at any moment.
It sends HTTP requests to the server to get the real-time status of all the activities. The
creator of the activity is then informed about the status of all the members in that activity,
and the members (other than the creator) are informed about the creator of that
activity.
3.2.5 Invite Management Module
A user can receive “invites” or “requests” from other users in the system with similar
interests. This module queries the server and notiﬁes the user of the same. A notiﬁcation
window is ﬂashed on the screen and the phone vibrates to attract the attention of its user.
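Because the client polls the server, the same invite may appear in several consecutive responses, so a notification should be raised only the first time an invite is seen. A minimal Python sketch of that bookkeeping follows; the dictionary fields (`id`, `from`, `activity`) and the function names are assumptions made for illustration, and `notify` only builds the message text where a phone would show a popup and vibrate.

```python
def unseen_invites(polled_invites, seen_ids):
    """Return invites whose id has not been notified yet and mark them as seen."""
    fresh = [inv for inv in polled_invites if inv["id"] not in seen_ids]
    seen_ids.update(inv["id"] for inv in fresh)
    return fresh


def notify(invite):
    """Stand-in for the popup and vibration: builds the notification text."""
    return "%s invites you to %s" % (invite["from"], invite["activity"])


# Two consecutive polls; the second repeats the first invite and adds a new one.
seen = set()
first_poll = [{"id": 1, "from": "Prateek", "activity": "Tennis"}]
second_poll = first_poll + [{"id": 2, "from": "Dipanjan", "activity": "Movie"}]
first_messages = [notify(inv) for inv in unseen_invites(first_poll, seen)]
second_messages = [notify(inv) for inv in unseen_invites(second_poll, seen)]
```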
Detailed use case
We shall now demonstrate a complete use case with the help of screenshots of the
application. There are five people: Prateek, Sameer, Koustuv, Dipanjan and Seema,
each of them accessing “R-U-In?” through a mobile device.
The following ﬁgure, Figure 4.1, is the Login screen which appears when the user
clicks on the application icon for the ﬁrst time in the mobile phone menu. If he selects
the “Remember Me” option, he’ll not be asked to enter the password next time he accesses
Figure 4.1: Login Screen
Seema is new in town and wants to play tennis in the evening. She has a busy schedule
today, so while she is on the move, she uses her mobile phone to update her interest (IIT,
Sport) and see who else is interested. Similarly, Koustuv updates his interest for a game
this evening (Figure 4.2).
Figure 4.2: Update Interest
Prateek, who also wants to play today and is willing to take the initiative of organizing,
searches the system for the activity “Tennis, JNU, 5/8/2009, 6:15 pm to 8:45 pm”
(Figure 4.3). R-U-In? searches for potential matches and returns Seema and Koustuv
as results (Figure 4.4).
Figure 4.3: Creating an Activity
Figure 4.4: User Interests Search Results
Prateek can look at the profiles of these users, and he invites Koustuv and Seema to the
game (Figure 4.5). The “Manage Activity” tab of the application displays the current
status of all the activities (Figure 4.6).
Figure 4.5: User Information Window
Figure 4.6: Manage Activity Tab
Seema and Koustuv get an immediate notiﬁcation on their phones as they receive the
invite (Figure 4.7). They can then look at the inviter’s proﬁle to know about the person
(Figure 4.8) and accept the invite.
Figure 4.7: Invite Received Notiﬁcation
Figure 4.8: Inviter Proﬁle
Prateek, after receiving the updates that Seema and Koustuv have accepted his invite, confirms
the activity, and the updated status of the activity is immediately displayed on all the
corresponding mobile phones (Figure 4.9 and Figure 4.10).
Figure 4.9: Manage Activity Tab
Figure 4.10: Manage Activity Tab
A person can also join an already existing activity. Dipanjan also wishes to play
a game now and updates his interest. R-U-In? returns the activity Prateek has
already created (Figure 4.11). He can look at the activity details by clicking on the
activity marker on the screen and send Prateek a request to join (Figure 4.12).
Figure 4.11: Update Interest Result
Figure 4.12: Activity Creator Proﬁle
We have also categorised users based on their real-time interests gleaned from their
tweets on the microblogging site Twitter.com (detailed discussion in the next chapter). So
Sameer, who has been tagged as a Sports lover from his tweets over the past one month,
is automatically shown activities of the Sports category when he clicks on the “I’m
bored!!” button in the application (Figure 4.13). Sameer can now request to join that
Figure 4.13: I’m bored. Suggest me something!!
Presently, the R-U-In? backend consists of a very limited set of words describing an ac-
tivity. Furthermore, the R-U-In? interface currently requires a structured input from the
user. In this section, we present an enhanced version of the backend that can recognize
user interests by looking into other sources of user data, speciﬁcally micro-blogging sites
Presently R-U-In? works only on structured input. The user is required to enter his/her
interest in a well-defined syntax, which involves explicit delineation of the activity category,
its time and location. We feel, however, that this functionality can be expanded to
allow for unstructured and, in some cases, vague input. It would greatly add value to
R-U-In? if we could automatically figure out users’ interests based on what they are sharing
with other users on different social networking portals. R-U-In? is already interfaced
with Facebook and can extract user-specified interests from there. However, Facebook
interests represent user interests in a very general sense and do not necessarily coincide
with his/her real-time interests. This motivated us to look to a more active source of
information regarding user interest, like Twitter.
Twitter is a free social networking and micro-blogging service that enables its users to
send and read other users’ updates known as tweets. Tweets are text-based posts of up
to 140 characters in length which are displayed on the user’s proﬁle page and delivered
to other users who have subscribed to them (known as “followers”). Senders can restrict
delivery to those in their circle of friends or, by default, allow anybody to access them.
Social forums like Twitter and GTalk (considering its facility of status messages) provide a
convenient platform for people to share their current thoughts with other people. It is
this data which we mine to extract information about user interest and intention, which
will allow the system to make a more informed decision about what the user is interested
in.
Although the immediate solution to such an analysis problem might seem to lie in the
field of deep natural language processing (NLP), given the amount of data involved
it becomes impractical to use deep NLP because of its computational intensiveness and
high processing time. Thus, we choose to apply shallow NLP, which does not involve text
understanding, i.e. semantic analysis of natural language input. Instead it focuses on extracting text
chunks, matching patterns or entities that contain the answer to user questions. We shall
try to use purely statistical methods in our approach to the problem.
5.2 Related Work
The area of social networks is a highly active field of research. We present here some of
the related work in this area, including the paper on R-U-In?, the application that we
have primarily dealt with in the course of our project.
Banerjee et al. present R-U-In?, a real-time social networking framework which
allows users to collaborate and participate in activities of mutual interest by enabling
them to search for people based on their real-time interests. R-U-In? leverages contextual
modeling and reasoning techniques to enable social search based on real-time user
interests and finds potential matches for the proposed activity.
Hirata et al. propose a system that aggregates a user’s multiple personal networks,
constructs a personal network that unifies their data, and adds activity
information for each user inside the unified personal network. The system also allows
transmission of user data within one’s own personal network using P2P.
Kelkar et al. present an activity-based perspective of collaborative tagging (where
activity is deﬁned as the act of associating a tag with a bookmark by a user) which is
based on certain deﬁned measures of the tagging activity. It has applications in identify-
ing trends and types of interests in web communities as well as expertise, staﬃng needs
and knowledge gaps in enterprise communities.
Java et al. present a topological and geographical study of Twitter’s social network
to show that people use microblogging to talk about their daily activities and to seek or
share information. They also analyze the user intentions associated at a community level
and show how users with similar intentions connect with each other.
5.3 Background Analysis Tools
We aim to develop a history proﬁle for every user in the backend. By doing an extensive
analysis of that user’s presence and posts on various forums, like Twitter, we can get a
better idea of the user’s real-time interests. In what follows, we shall detail the work done
by us along with all the data-sets used for each experiment. All the following analyses
incorporate two things:
• Porter’s Stemming Algorithm: Stemming is the process for reducing inﬂected
(or sometimes derived) words to their stem, base or root form - generally a written
word form. The stem need not be identical to the morphological root of the word;
it is usually suﬃcient that related words map to the same stem, even if this stem
is not in itself a valid root. The rationale behind including this in our study is that
tweets may contain diﬀerent forms of the same word. To be able to come up with
the correct number of occurrences of a given word in a dataset without having to
search individually for all its forms, we stem all the data in the first step. For this
purpose, we employ Porter’s Stemming Algorithm, which was written by Martin
Porter and published in the July 1980 issue of the journal Program. This stemmer
has been very widely used and has become the de facto standard algorithm for
English stemming.
As an example, consider these words: “Work”, “Working”, “Worked”. Running
these words through the stemming process reduces each of them to the stem
“work”. The algorithm also reduces all letters to their lowercase forms, and leaves
everything else untouched.
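The effect of stemming on word counts can be illustrated with a small sketch. This is a deliberately simplified suffix stripper, not the full Porter algorithm (which applies several rule phases); the suffix list and the function names `simple_stem` and `stem_counts` are assumptions made for illustration only.

```python
from collections import Counter

# Simplified illustration only: this strips a few common suffixes and is
# NOT the full Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # assumed, illustrative suffix list


def simple_stem(word: str) -> str:
    """Lowercase a word and strip one common suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_counts(words):
    """Count occurrences per stem so that all forms of a word are tallied together."""
    return Counter(simple_stem(w) for w in words)


counts = stem_counts(["Work", "Working", "Worked", "works"])
print(counts)
```

All four forms are counted under the single stem "work", which is the property the analyses in this chapter rely on.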
• Lucene: Lucene is a free/open source information retrieval library, originally cre-
ated in Java by Doug Cutting. It is supported by the Apache Software Foundation
and is released under the Apache Software License. While suitable for any appli-
cation which requires full text indexing and searching capability, Lucene has been
widely recognized for its utility in the implementation of Internet search engines
and local, single-site searching. At the core of Lucene’s logical architecture is the
idea of a document containing ﬁelds of text. This ﬂexibility allows Lucene’s API to
be independent of the ﬁle format. Text from PDFs, HTML, Microsoft Word, and
OpenDocument documents, as well as many others can all be indexed so long as
their textual information can be extracted.
For our purposes, we shall be considering every tweet to be a “document” and we
shall run our queries over an index of these documents.
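The idea of treating every tweet as a document and querying an index can be sketched without Lucene itself. The toy inverted index below is only an illustration of the concept; it does not use or reproduce the Lucene API, and the sample tweets and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(tweets):
    """Map each lowercase token to the set of tweet ids (documents) containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(tweets):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of tweets that contain the given term."""
    return sorted(index.get(term.lower(), set()))


tweets = [  # hypothetical sample tweets
    "want to watch a movie tonight",
    "lunch at the new cafe",
    "that movie was great",
]
index = build_index(tweets)
print(search(index, "movie"))
```

Counting the ids returned for a word gives the number of tweets containing it, which is the quantity reported in the experiments that follow.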
5.4 Creation of an Ontology
The ﬁrst step towards analysis of tweets was to develop a deeper understanding of what
users tweet about in general. We took a dataset consisting of 1.65 million tweets by users
from London gleaned over a period of one month. We also took a second dataset consisting
of all tweets from all over the world containing at least one of a set of core words. This
set consisted of more than 4 million tweets. The main purpose of this experiment was to
get an idea of the number of occurrences of “useful” words. A useful word for us is one
which gives us information about what the user is interested in. We therefore compiled
a list of “exclusion words” which consisted of all pronouns, prepositions, helping verbs,
question words etc.
After running both datasets and the exclusion list through the stemmer, we prepared
a plot of the top 15 useful words in each dataset. This process required a
few iterations, over which we removed more non-useful words like “just”, “so”, “have” etc.
after examining the results. The results for the two sets are presented in Figure 5.1 and
Figure 5.2.
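The counting step described above can be sketched as follows. The exclusion list here is a hypothetical, heavily abbreviated stand-in for the real one, the sample tweets are invented, and stemming is omitted for brevity.

```python
from collections import Counter

# Hypothetical, abbreviated exclusion list; the real list covered pronouns,
# prepositions, helping verbs, question words and more.
EXCLUSION_WORDS = {"i", "a", "the", "to", "is", "at", "just", "so", "have", "my"}


def top_useful_words(tweets, k):
    """Return the k most frequent tokens that are not on the exclusion list."""
    counts = Counter(
        token
        for text in tweets
        for token in text.lower().split()
        if token not in EXCLUSION_WORDS
    )
    return counts.most_common(k)


sample = [  # hypothetical tweets
    "i just have to watch the game",
    "the game is so good",
    "lunch at my desk",
]
top_words = top_useful_words(sample, 2)
print(top_words)
```

Iterating on the exclusion list, as done in the experiment, amounts to adding words to `EXCLUSION_WORDS` and re-running the count.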
Figure 5.1: Word distribution for London data
Figure 5.2: Word distribution for world-wide data
Based on the results of this experiment, we defined five broad categories of activities
of user interest. Each of these categories consisted of a set of words which define activities
or interests in that category. Apart from these five, we also defined a set of action
words and a set of temporal words that could be used to assign a location and time to a
potential activity. The five categories thus defined are:
• Movies
• Sports
• Dance
• Music
• Food
These categories contained a total of around 190 words. Over the course of our experiments,
we expanded these words based on empirical observations and our own experience
of using GTalk, Facebook and other platforms.
5.5 Category Word Distribution
Having deﬁned a set of words in ﬁve categories, the next step was to see what percentage
of tweets actually contain these words. So, we ran all these words as Lucene queries over
the London data mentioned in the previous section. The top 19 words that appeared are
presented in Figure 5.3.
Figure 5.3: Top category words
Individual word occurrences do not seem to be very high at first glance in this
graph. However, we summed up the result over all words in each category and plotted a
category-wise distribution. This is presented in Figure 5.4.
We noticed that a total of 15% of tweets contain at least one category word. One might
argue that a direct summation of percentages for each category word is not correct, since
a tweet may contain more than one word of the same category. To justify our approach,
we plotted a distribution of the number of words of a category that occur in a tweet. One
of these graphs is presented in Figure 5.5.
Figure 5.4: Category-wise distribution
Figure 5.5: Number of category words per tweet for one category
Notice that out of the tweets we consider for a category, 95% of those which contain
at least one word contain exactly one word. Roughly 5% contain 2 words. For higher
numbers, the percentages are negligible. Thus, as an approximation, a direct summation
over each category is fairly justified.
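The difference between direct summation and an exact count can be made concrete with a short sketch. The category subset and sample tweets are hypothetical; `coverage` is an illustrative helper, not part of the original tooling.

```python
def coverage(tweets, category_words):
    """Compare summed per-word percentages with the exact share of tweets
    that contain at least one category word."""
    total = len(tweets)
    token_sets = [set(text.lower().split()) for text in tweets]
    # Direct summation: a tweet with two category words is counted twice.
    summed = sum(
        sum(1 for tokens in token_sets if word in tokens)
        for word in category_words
    )
    # Exact count: each tweet is counted at most once.
    exact = sum(1 for tokens in token_sets if tokens & category_words)
    return 100.0 * summed / total, 100.0 * exact / total


food_words = {"lunch", "pizza", "coffee"}  # hypothetical category subset
sample = [  # hypothetical tweets
    "pizza for lunch",
    "coffee time",
    "stuck in traffic",
    "working late again",
]
summed_pct, exact_pct = coverage(sample, food_words)
print(summed_pct, exact_pct)
```

The gap between the two values is the overcount caused by tweets with several words of one category. When almost all matching tweets contain exactly one category word, as observed above, the two values nearly coincide.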
Given the results of this experiment, it was evident that we needed to improve the ontology
by adding more words so that we get a higher percentage of tweets containing category
words. Thus, based on an examination of around 1000 tweets as well as Facebook and
GTalk status updates, we included some more words in each category. Furthermore,
we added an entirely new category of Technology. The total set of category words now
contained around 260 words. The category of Movies was expanded with words implying
an interest in television soaps and shows as well. The category of Food was expanded
with certain specific instances, such as “sausage”, “pizza” and “cake”. The ontology words
are listed in the appendices. We then ran the same experiment with the new ontology,
this time on a dataset consisting of tweets from the previous week over ten cities (in
alphabetical order): Atlanta, Austin, Boston, Chicago, Los Angeles, London, New York,
San Francisco, Seattle and Toronto. These are the top ten cities in the world in terms of
Twitter usage. The dataset is described in Table 5.1.
We present here the result from one of these cities, i.e. New York. The results for the
other cities are presented in the appendices at the end of this report. The top 15 category
words that appeared are plotted in Figure 5.6. Figure 5.7 is the same plot, except that
the values are expressed as a percentage of tweets containing at least one category word.
City            Number of Tweets    Tweets with at least one category word (%)
Atlanta         186711              21.99
Austin          124489              24.37
Boston          163062              24.67
Chicago         231622              23.55
Los Angeles     339583              24.31
London          373992              25.29
New York        569668              23.37
San Francisco   226261              24.06
Seattle         151917              25.78
Toronto         189575              23.00
Table 5.1: Dataset For Co-Occurrence Analysis
Figure 5.6: Top 15 category words for New York as a percentage of total tweets
Figure 5.7: Top 15 category words for New York as a percentage of tweets
containing at least one category word
As done previously, we sum up all the category words over their respective categories
and plot a category-wise distribution. This is shown in Figure 5.8.
We notice that nearly 23.4% of tweets contain at least one category word, which is a
significant improvement over the previous case where we got around 15%. We obtained
similar values for each of the ten cities, with varying percentages in different categories.
These values are reported in Table 5.1. The relevant graphs are presented in the appendices.
5.6 Co-Occurrence Analysis
A binary co-occurrence measure tells us how many tweets contain two given words. In
the context of activity-oriented paradigms like R-U-In?, this kind of information is
useful if the person expresses his/her interest in a particular activity along with a location
or time for the event. For instance, if a tweet contains the words “movie” and “tonight”
within a few words of each other, then it is quite likely that
the user intends to watch a movie tonight. In what follows, we conduct a co-occurrence
analysis on all the ten cities mentioned in the previous section. We do this analysis in
three parts: Action-Category, Temporal-Category and Action-Temporal.
• Action-Category Co-Occurrence: In the context of R-U-In?, our primary interest
is to be able to parse an unstructured input and thus infer when a user expresses
interest in some activity. We examine the co-occurrence of action words with category
words to get an idea of the user’s intention. For example, if a tweet contains the
word “movie” then all we can say is that the user intends to say something about
a movie in particular or movies in general. However, if we are given the additional
information that the tweet also contains the word “watch” within a few words of the
word “movie”, then it can be inferred with a high probability that the user intends
to watch a movie. We again consider the dataset consisting of the ten cities
mentioned in the previous section. We present the top ten action-category word pairs
for New York in Figure 5.9. Figure 5.10 is the same plot, except that the values are
expressed as a percentage of tweets containing at least one category word.
Figure 5.8: Category-wise distribution for New York
Figure 5.9: Top ten action-category pairs for New York as a percentage of
total tweets
Movie + Watch Show + Love Game + Play Show + Watch TV + Watch
Show + Love Game + Watch Show + Go Movie + See Gym + Go
Table 5.2: Top Ten Values
Figure 5.10: Top ten action-category pairs for New York as a percentage of
tweets containing at least one category word
The most frequently occurring word pair is “movie” and “watch”. The Lucene query used for
a proximity search was given a default value of 5 as the maximum distance between
the two words. Although the individual occurrence of a particular pair is small,
when we sum up all the word pairs we find that around 6.7% of all
tweets contain at least one action-category word pair. The category-wise
distribution of action-category word pairs is plotted in Figure 5.11. Figure 5.12
is the same plot, except that the values are expressed as a percentage of tweets
containing at least one category word.
• Temporal-Category Co-Occurrence: A temporal-category co-occurrence
allows the system to assign a time-frame to the activity of interest
and can thus be useful in the match-making process. An analysis analogous to the
above was done on all ten cities. The top ten temporal-category pairs for New York
are presented in Figure 5.13. Figure 5.14 is the same plot, except that the
values are expressed as a percentage of tweets containing at least one category word.
Figure 5.11: Category-wise distribution of action-category pairs for New York
as a percentage of total tweets
Figure 5.12: Category-wise distribution of action-category pairs for New York
as a percentage of tweets containing at least one category word
Figure 5.13: Top ten temporal-category pairs for New York as a percentage of
total tweets
Show + Tonight Show + Today Show + Night Show + Time Show + Day
Dinner + Tonight Coﬀee + Morning Song + Day Movie + Time Blog + Today
Table 5.3: Top Ten Values
Figure 5.14: Top ten temporal-category pairs for New York as a percentage of
tweets containing at least one category word
The most frequently occurring word pair is “show” and “tonight”. The Lucene query used
for a proximity search was given a default value of 5 as the maximum distance
between the two words. Although the individual occurrence of a particular pair is
small, when we sum up all the word pairs we find that around 4% of
all tweets contain at least one temporal-category word pair. The category-wise
distribution of temporal-category word pairs is plotted in Figure 5.15.
Figure 5.16 is the same plot, except that the values are expressed as a percentage of
tweets containing at least one category word.
Figure 5.15: Category-wise distribution of temporal-category pairs for New
York as a percentage of total tweets
Figure 5.16: Category-wise distribution of temporal-category pairs for New
York as a percentage of tweets containing at least one category word
• Action-Temporal Co-Occurrence: Analyzing action-temporal co-occurrences
allows us to associate timeframes with words describing some action.
Figure 5.17 shows the top 20 action-temporal word pairs for New York (with a
default range of at most 5). We note that the total percentage of tweets with at least
one action-temporal co-occurrence comes out to be 9.29%.
The pair with the highest occurrence, as shown in Figure 5.17, is “work” and “day”.
Figure 5.17: Top 20 action-temporal word pairs for New York
Work + Day Go + Day Go + Am Go + Today Go + Tonight
Work + Today Go + Time Do + Time Do + Today Do + Day
Table 5.4: Top Ten Values
• The above analysis supports the underlying assumption of the system by showing
that users do tend to tweet about their real-time interests and in some cases even
mention an associated location or timeframe.
• A co-occurrence helps us to infer the actual user intent with a higher probability,
which would be very helpful in the matchmaking process. The fact that a fairly
large fraction of tweets containing a category word also contain a co-occurrence suggests
that co-occurrence is a useful indicator of user intent.
• This approach offers a more practical solution to the match-making problem for
large data sets compared to more intensive approaches such as deep NLP, which may
take a very long time.
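The proximity-based co-occurrence test used throughout this section can be sketched as below. Lucene's classic query syntax expresses such a search roughly as "movie watch"~5; the sketch does not call Lucene, and its distance rule (difference in token positions) is an assumption that may differ from Lucene's own slop semantics. The sample tweets are hypothetical.

```python
def cooccurs(text, word_a, word_b, max_distance=5):
    """Return True if word_a and word_b appear within max_distance token
    positions of each other in the text."""
    tokens = text.lower().split()
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(
        abs(i - j) <= max_distance for i in positions_a for j in positions_b
    )


sample = [  # hypothetical tweets
    "going to watch a movie tonight",
    "movie night",
    "i watch tv and later we might talk about that old movie",
]
matches = sum(cooccurs(text, "movie", "watch") for text in sample)
print(f"{100.0 * matches / len(sample):.1f}% of tweets contain the pair")
```

In the third tweet both words are present but too far apart, so it is not counted; this is the case the distance limit is meant to exclude.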
5.7 Manual Benchmarking and Validation
Once we have the results for the various co-occurrences as detailed in the above section,
the next step is to create a benchmark and manually verify some of the results
obtained. After manual benchmarking, we are able to assign a “confidence measure”
to a co-occurrence. This measure can be understood by a simple example. Consider 100
tweets by a user. Suppose that 20 of these contain the co-occurring pair “football” and
“play”. Suppose that 5 of these 20 actually imply an intention to play football in the near
future. Thus, we define:
Confidence Measure = P(Actual Intention | Co-Occurrence)
= P(Actual Intention and Co-Occurrence) / P(Co-Occurrence)
In this example, the confidence measure is simply 5/20 = 25%. We chose to prepare
a benchmark for the word pairs “movie” and “watch” in the action-category case
and “show” and “tonight” in the temporal-category case. The results obtained for New
York are given below:
• For the pair “movie” and “watch” the total number of tweets containing this pair
was 583. We observed four kinds of tweets in general:
1. Tweets which indicate that the movie has already been seen
2. Tweets which ask for a suggestion for a movie to watch
3. Tweets which express a deﬁnite intention to watch a movie in the near future
4. Tweets expressing a disinclination to watch a movie
Out of these, we observed only one tweet that fell into the last kind (i.e. expressed a
disinclination). In the context of R-U-In?, we consider only tweets of the third kind, i.e.
a definite intention to watch a movie in the near future. We found 114 out of the
583 tweets falling in this category, which implies a confidence measure of 114/583,
which is approximately 20%.
• For the pair “show” and “tonight” the total number of tweets containing this pair
was 409. We observed three main kinds of tweets in general:
1. Tweets which indicate that the show is already over
2. Tweets expressing interest in a TV/radio show
3. Tweets which express a deﬁnite intention to attend a show that night
There were also tweets that did not fall into any of these three main categories. There
were only 3 tweets that expressed a disinclination to attend a show that night. In
the context of R-U-In?, we consider only tweets of the third kind, i.e. a definite intention
to attend a show that night. We found 133 out of the 409 tweets falling in this category,
which implies a confidence measure of 133/409, which is approximately 32.5%.
The above manual benchmarking helped to establish the following:
1. People tend to be more assertive about their interests than negative about their
non-interests.
2. The co-occurrence measure yields a reasonable confidence measure, which supports
our earlier analysis based on co-occurrences as a first approximation.
3. We had noticed that around 8% of tweets for New York contained at least one
action-category pair. We had also obtained a 20% confidence measure for the pair “movie”
and “watch”. Combining the two results (assuming a 20% confidence measure for
all action-category pairs), we see that roughly 1.6% of all tweets contain an
action-category pair that actually signifies interest in that activity.
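The confidence measures and the combined estimate above can be reproduced with a few lines of arithmetic. The counts are the ones reported in this section; the function name `confidence_measure` is only illustrative.

```python
def confidence_measure(intent_count, cooccurrence_count):
    """P(actual intention | co-occurrence), estimated from manually labelled tweets."""
    return intent_count / cooccurrence_count


movie_watch = confidence_measure(114, 583)   # "movie" + "watch", New York
show_tonight = confidence_measure(133, 409)  # "show" + "tonight", New York
print(f"movie + watch:  {movie_watch:.1%}")
print(f"show + tonight: {show_tonight:.1%}")

# Combined estimate from the text: share of tweets with an action-category
# pair (about 8%) times an assumed 20% confidence for every such pair.
estimated_intent_share = 0.08 * 0.20
print(f"estimated share of tweets signalling real intent: {estimated_intent_share:.1%}")
```

The last figure depends entirely on the assumption that the 20% measured for one pair carries over to all action-category pairs.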
5.8 Speciﬁc Instance Occurrences
During our analysis, we observed that users often tend to tweet about specific instances
rather than category words in general. For instance, a user might tweet about Star Wars
or Stanley Kubrick rather than explicitly mention the word “movie” in his/her tweet. In
our ontology, the sports category already contains names of almost every known sport.
For the category of movies, where we saw this most often, we collected
a list of all movies since the year 2000 (source: Wikipedia) and ran them as Lucene
queries over the New York data. To avoid movie names that were also commonly occurring
words, like “Wanted”, we set a maximum limit of 500 on the number of occurrences of
a movie name, based on empirical observations. Recall that the entire movies category
accounted for roughly 5.1% of all the tweets in New York (Figure 5.8). However, when
we count only the specific instances, i.e. movie names, we see that these alone account
for 2.82% of all the tweets. Notice that the list of specific instances in the realm of movies
can be extremely vast and may include names of actors, directors, etc. All these can add
up to a significant number. Thus, we can interface a third-party solution with R-U-In?
which can use the tweet as a search query over an internet search engine after removing
all category words and thereby detect such instances and infer user interest as falling in
one of our defined categories. We leave this aspect, however, as future work.
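The filtering of movie names by a maximum occurrence count can be sketched as follows. This is a crude substring match rather than a Lucene phrase query, the sample tweets and titles are hypothetical, and the threshold is lowered from 500 to 2 only so that the tiny sample exercises the filter.

```python
def count_title_mentions(tweets, titles, max_occurrences=500):
    """Count tweets mentioning each title, dropping titles whose count
    exceeds max_occurrences (likely common words such as "wanted")."""
    lowered = [text.lower() for text in tweets]
    counts = {}
    for title in titles:
        needle = title.lower()
        n = sum(1 for text in lowered if needle in text)
        if 0 < n <= max_occurrences:
            counts[title] = n
    return counts


sample = [  # hypothetical tweets
    "wanted: a new laptop",
    "most wanted list",
    "help wanted downtown",
    "watching star wars again",
    "star wars marathon tonight",
]
mentions = count_title_mentions(sample, ["Wanted", "Star Wars"], max_occurrences=2)
print(mentions)
```

Here "Wanted" is discarded because it occurs more often than the limit, while "Star Wars" is kept as a specific instance of the movies category.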
5.9 Per-User Analysis and Determination of Principal Interest Category
Once we had all the background information about the data, we next turned our attention
to a per-user analysis. The motivation behind this is that our ultimate aim is to be
able to tag every user with his/her category of interest with a high confidence measure.
These tags can then be used for match-making in cases of unstructured or vague input.
For this particular experiment, we required a larger data set, i.e. a set consisting of more
tweets. Thus, we used the dataset (gleaned over one month from ten cities) detailed in
Table 5.5.
We first ran the experiment for the first 1000 users in the data set for one city. We
noticed that users with a higher number of tweets seem to have more consistency in their
interests. Thus, we re-ran the experiment, this time for all ten cities, only for the top
1000 users (in terms of number of tweets) in each city. For every user, we found out the
following:
• The number of tweets
• The top 5 category words for that user (based on the six categories defined previously)
• The principal category of interest based on the category words that appear in his/her
tweets
• The top action and temporal co-occurrences with category words
City            Number of Users     Number of Tweets
Atlanta         22201               911233
Austin          17202               687164
Boston          24000               912446
Chicago         29474               861152
Los Angeles     39314               1233271
London          42125               1065716
New York        64095               1254530
San Francisco   28815               1154659
Seattle         22302               832023
Toronto         24713               929683
Table 5.5: Dataset For Per-User Analysis
The results presented a highly consistent picture which allows us to identify user interest
easily in most cases. We present here two sample users and the results obtained
for them. The principal category is calculated as follows. We sum up the number of
occurrences of category words over each category. Given these six values, we take their
mean and report all categories whose number of occurrences exceeds this mean as
possible categories of interest:
1. This is a sample user from Atlanta. Figure 5.18 shows the category-wise distribution
of words in this user’s tweets.
• Username: ABC (real username masked for privacy reasons)
• No. of Tweets: 2945
• Top 5 category words with corresponding occurrences:
(a) internet (585 occurrences)
(b) blog (250 occurrences)
(c) googl (stemmed form of “google”) (66 occurrences)
(d) site (47 occurrences)
(e) websit (stemmed form of “website”) (45 occurrences)
• Principal Category of Interest: Technology (38.88% of this user’s tweets)
• Top action-category pair: “internet” + “busi” (“Internet” + “Busy”) (102 occurrences)
• Top temporal-category pair: “internet” + “time” (19 occurrences)
Figure 5.18: Category-wise word distribution for ABC
2. This is a sample user from Los Angeles. Figure 5.19 shows the category-wise dis-
tribution of words in this user’s tweets.
• Username: DEF (masked)
• No. of Tweets: 11712
• Top 5 category words with corresponding occurrences:
(a) nba (752 occurrences)
(b) golf (418 occurrences)
(c) race (396 occurrences)
(d) footbal (stemmed form of “football”) (319 occurrences)
(e) game (318 occurrences)
• Principal Category of Interest: Sports (25.96% of this user’s tweets)
• Top action-category pair: “game” + “plai” (“Game” + “Play”) (16 occurrences)
• Top temporal-category pair: “nba” + “time” (“NBA” + “Time”) (12 occurrences)
Figure 5.19: Category-wise word distribution for DEF
For the above two users, the interest category was very easy to infer, since all of the
top ﬁve category words in their tweets belonged to the same category. Not all users were
so obviously biased, however. Consider for example the following user from Los Angeles.
Figure 5.20 shows the category-wise distribution of words in this user’s tweets.
• Username: GHI (masked)
• No. of Tweets: 2471
• Top 5 category words with corresponding occurrences:
1. rock (30 occurrences)
2. site (13 occurrences)
3. food (9 occurrences)
4. technolog (stemmed form of “technology”) (8 occurrences)
5. hotel (7 occurrences)
• Principal Categories of Interest: Music and Technology (1.54% of this user’s tweets).
For this particular user, we see from Figure 5.20 that there is no clear single peak.
So, we apply our mean-value based heuristic and get a tag containing not one
but two principal categories of interest.
• Top action-category pair: “site” + “need” (9 occurrences)
• Top temporal-category pair: none
Figure 5.20: Category-wise word distribution for GHI
• We have developed a scheme to tag users according to their interests as inferred
from their tweets.
• The category-wise distribution varies vastly across users. For certain users,
the interest peaks are very obvious, but this is not the case for other users. For the
latter set of users, we report all categories above the mean as possible categories of
interest and use them while matchmaking in case of vague inputs to R-U-In?
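The mean-value heuristic used in this section to pick principal categories can be sketched as follows. The per-category counts are hypothetical (the per-category totals of the sample users are not listed in the text), and `principal_categories` is an illustrative name.

```python
def principal_categories(category_counts):
    """Return every category whose count exceeds the mean count across categories."""
    mean = sum(category_counts.values()) / len(category_counts)
    return [name for name, count in category_counts.items() if count > mean]


# Hypothetical per-category occurrence counts for two users.
single_peak = {
    "Movies": 40, "Sports": 25, "Dance": 5,
    "Music": 60, "Food": 80, "Technology": 1145,
}
no_clear_peak = {
    "Movies": 3, "Sports": 2, "Dance": 0,
    "Music": 14, "Food": 6, "Technology": 13,
}

print(principal_categories(single_peak))
print(principal_categories(no_clear_peak))
```

For a user with one dominant category the heuristic returns a single tag; for a flatter distribution it returns every category above the mean, which is how a user can end up tagged with two principal categories.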
5.10 Validation of Result Stability by Measuring KL Divergence
In probability theory and information theory, the Kullback-Leibler divergence (also
information divergence, information gain, or relative entropy) is a non-symmetric measure
of the difference between two probability distributions P and Q. Given two probability
distributions, P and Q, of a discrete random variable X, the Kullback-Leibler Divergence
(or the KL Divergence) is defined as:
$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)}$$
We define the six categories as the six discrete values of the variable X. We define the
probability distributions as follows. Consider a user A. For each category word which
appears in any of A’s tweets, we sum up the number of occurrences of words for each
category. Let the values obtained for the six categories be x_1, x_2, ..., x_6. Note that the
x_i are all integers. We now normalize these values by dividing each by the sum of the
x_i. The values thus obtained form the required distribution. For this experiment, we
present the result for San Francisco. Tweets for this city were gleaned over the period
from 18th March 2009 to 10th April 2009. We split this period into 4 parts of six days
each. For each of these parts, we calculated the per-user distribution of X. We did this
for the top 1000 users (in terms of number of tweets) for San Francisco.
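The normalization and divergence computation described above can be sketched as follows. The category counts are hypothetical, and returning `None` when the divergence is undefined is an assumption of this sketch rather than the procedure used in the experiment.

```python
import math


def normalize(counts):
    """Turn per-category occurrence counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in bits; returns None when Q(i) = 0 for
    some i with P(i) > 0, where the divergence is undefined."""
    result = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_i == 0:
            return None
        result += p_i * math.log2(p_i / q_i)
    return result


# Hypothetical counts x_1..x_6 for one user in two consecutive periods.
period_1 = normalize([1, 1, 0, 2, 4, 8])
period_2 = normalize([1, 1, 0, 2, 8, 4])

print(kl_divergence(period_1, period_1))
print(kl_divergence(period_1, period_2))
```

A value of zero means the two distributions are identical, and larger values indicate a larger shift in the user's category distribution between the two periods.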
The main aim behind such a calculation is twofold:
• To get a better understanding of how user interests vary with time.
• To validate the history profiling of a person by showing that a person’s interests
do not fluctuate very often. By showing this, we can say that the history profiles of
users created by our analysis of their tweets are a fairly accurate indicator
of their interests.
Firstly, we removed users with non-admissible values of X from our set of observations.
This leaves us with 944 users. For these four sets of data over six days each, we calculate
the KL divergence of each pair of consecutive six-day periods (labelled as weeks in the
figures). These are plotted as scatter plots
with the user number appearing on the x-axis and the corresponding KL divergence
values appearing on the y-axis. These plots are shown in Figure 5.21, Figure 5.22 and
Figure 5.23. Furthermore, to get an idea of the variation over a larger time period, we
did a similar plot for the first and last six-day periods. This plot is shown in Figure 5.24.
Figure 5.21: KL Divergence between weeks 1 and 2
Figure 5.22: KL Divergence between weeks 2 and 3
Figure 5.23: KL Divergence between weeks 3 and 4
Figure 5.24: KL Divergence between weeks 1 and 4
We notice that there is a band concentrated near the value of 0.0 in the first three plots.
This band seems to spread in the last plot. However, the values obtained remain
within a small range close to zero for most of the users. Thus, we can reasonably say that
user interests do not fluctuate with a very high frequency and that our analysis using a
Twitter-feed based history is valid.
Conclusion and Further Work
In the upcoming ﬁeld of real-time activity-oriented social networking, we feel that the
success of a venture would depend upon the product’s portability and ease of access.
Another crucial factor is the ability of the service to deliver correct matches for a given
interest. Considering the case of R-U-In?, we have successfully demonstrated two en-
• Firstly, we have presented a mobile phone based framework for R-U-In?. We have
successfully demonstrated its development on the Android platform.
• We have shown that it is possible to create user proﬁles based on their complete
online presence. As part of this project, we consider the user’s presence on Twitter.
We have successfully shown that:
– Users do tweet about their real-time activities and interests
– Using a co-occurrence based shallow NLP approach we have also shown that
out of the tweets made by a user for words of a certain category, there are
a signiﬁcant number of tweets that contain a co-occurrence with descriptive
– By manually developing a benchmark, we have demonstrated a reasonable level
of confidence in the results obtained by co-occurrence measures
– After conducting a per-user analysis of tweets we have successfully been able
to tag users by identifying their top category of interest based on their twitter
– Finally, we validated our analysis by calculating a KL divergence over a set of
data spanning almost one month.
This area holds a lot of potential for future work. Some of this is mentioned below:
• The concept of mobile-phone based real-time social networking can be extended to
other operating systems like Symbian in a fashion analogous to that employed by
• There are various other sources of contextual communication which have a huge
database of information. Such sources can be similarly exploited to develop a rich
presence based system. We can thus create an aggregation of user information based
on such sources.
• User interests may diverge from month to month, so data collected over longer
durations of time (one year) could help in studying the variations of user interests with
time, though the strategy of analysis would follow a similar path to the one we have shown.
• A more thorough analysis of the data may be carried out by employing techniques
from deep NLP. A trade-off, however, would be that the time taken would be much
longer.
The following tables show the six categories and their constituent words along with their
stemmed forms.
Movies                      Sports                      Dance
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
movie         movi          sport         sport         dance         dance
oscar         oscar         game          game          latin         latin
hollywood     hollywood     fishing       fish          salsa         salsa
action        action        rugby         rugbi         ballroom      ballroom
adventure     adventur      soccer        soccer        jazz          jazz
animated      anim          football      footbal       ballet        ballet
traditional   tradit        swimming      swim          modern        modern
stop-motion   stop-motion   diving        dive          swing         swing
biography     biographi     archery       archeri       interpretive  interpret
comedy        comedi        race          race          tap           tap
children      children      climbing      climb         lyrical       lyric
crime         crime         skiing        ski           hip-hop       hip-hop
disaster      disast        biking        bike          hiphop        hiphop
drama         drama         baseball      basebal       ensemble      ensembl
fantasy       fantasi       ball          ball          point         point
horror        horror        cricket       cricket       flamenco      flamenco
sci-fi        sci-fi        surfing       surf          club          club
short         short         boarding      board
thriller      thriller      skating       skate
war           war           bowling       bowl
western       western       cycling       cycl
film          film          wrestling     wrestl
theatre       theatr        judo          judo
popcorn       popcorn       karate        karate
tv            tv            fencing       fenc
television    televis       boxing        box
show          show          billiards     billiard
sitcom        sitcom        pool          pool
soap          soap          snooker       snooker
episode       episod        country       countri
series        seri          gym           gym
cnn           cnn           gymkhana      gymkhana
nbc           nbc           jumping       jump
channel       channel       golf          golf
Table A.1: Ontology (part 1)
Music                       Sports                      Food
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
music         music         handball      handbal       coffee        coffe
rock          rock          hockey        hockey        bar           bar
classical     classic       rally         ralli         lunch         lunch
indian        indian        kayaking      kayak         restaurant    restaur
fusion        fusion        canoe         cano          cafe          cafe
metal         metal         rafting       raft          hotel         hotel
blues         blue          rowing        row           snack         snack
african       african       tennis        tenni         dinner        dinner
folk          folk          badminton     badminton     meal          meal
rap           rap           running       run           food          food
pop           pop           walking       walk          hungry        hungri
song          song          chess         chess         wine          wine
hit           hit           sudoku        sudoku        vodka         vodka
chartbuster   chartbust     bat           bat
album         album         match         match         whisky        whiski
band          band          superbowl     superbowl     breakfast     breakfast
guitar        guitar        nba           nba           soup          soup
drummer       drummer       fifa          fifa          sausage       sausag
bassist       bassist       cup           cup           pie           pie
bass          bass          league        leagu         chicken       chicken
guitarist     guitarist     jogging       jog           noodle        noodl
singer        singer        olympics      olymp         bread         bread
vocalist      vocalist                                  sauce         sauc
Table A.2: Ontology (part 2)
Technology                  Temporal                    Action
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
computer      comput        morning       morn          busy          busi
keyboard      keyboard      evening       even          available     avail
mouse         mouse         afternoon     afternoon     feel          feel
cd            cd            noon          noon          play          plai
internet      internet      night         night         work          work
net           net           hour          hour          see           see
site          site          today         todai         watch         watch
website       websit        tonight       tonight       love          love
facebook      facebook      tonite        tonit         hate          hate
orkut         orkut         yesterday     yesterdai     look          look
disk          disk          minute        minut         need          need
windows       window        year          year          thank         thank
linux         linux         month         month         party         parti
mac           mac           day           dai           think         think
unix          unix          time          time          cook          cook
blog          blog          monday        mondai        sleep         sleep
email         email         tuesday       tuesdai       class         class
gmail         gmail         wednesday     wednesdai     office        offic
google        google        thursday      thursdai      drive         drive
microsoft     microsoft     friday        fridai        trek          trek
mobile        mobile        saturday      saturdai      read          read
youtube       youtub        sunday        sundai        write         write
ipod          ipod          week          week          type          type
iphone        iphon         weekday       weekdai       wait          wait
ebay          ebai          weekend       weekend       go            go
laptop        laptop        am            am            do            do
notebook      notebook      pm            pm            fight         fight
desktop       desktop                                   eat           eat
Table A.3: Ontology (part 3)
Charts For The Remaining Nine Cities
B.1 Category-wise Distribution For These Cities
Figure B.1: Atlanta
Figure B.2: Austin
Figure B.3: Boston
Figure B.4: Chicago
Figure B.5: Los Angeles
Figure B.6: London
Figure B.7: San Francisco
Figure B.8: Seattle
Figure B.9: Toronto
B.2 Category-wise Action and Temporal Co-Occurrences For These Cities
Figure B.10: Atlanta - Action
Figure B.11: Atlanta - Temporal
Figure B.12: Austin - Action
Figure B.13: Austin - Temporal
Figure B.14: Boston - Action
Figure B.15: Boston - Temporal
Figure B.16: Chicago - Action
Figure B.17: Chicago - Temporal
Figure B.18: Los Angeles - Action
Figure B.19: Los Angeles - Temporal
Figure B.20: London - Action
Figure B.21: London - Temporal
Figure B.22: San Francisco - Action
Figure B.23: San Francisco - Temporal
Figure B.24: Seattle - Action
Figure B.25: Seattle - Temporal
Figure B.26: Toronto - Action
Figure B.27: Toronto - Temporal
Akshay Java, Xiaodan Song, Tim Finin, Belle Tseng. Why we twitter: Understanding
microblogging usage and communities. pages 56–65, 2007.
http://jcmc.indiana.edu/vol13/issue1/boyd.ellison.html. Social networks timeline.
Nilanjan Banerjee, Dipanjan Chakraborty, Koustuv Dasgupta, Sumit Mittal, Seema Nagar,
Saguna. R-u-in? - exploiting rich presence and converged communications
for next-generation activity-oriented social networking. MDM, to appear.
Nina D. Ziv, Bala Mulloth. An exploration on mobile social networking: Dodgeball as
a case in point. Copenhagen, Denmark, 2006.
Shreeharsh Kelkar, Ajita John, Doree Seligmann. An activity-based perspective of
collaborative tagging. 2007.
Toshiyuki Hirata, Ikki Ohmukai, Hideaki Takeda, Susumu Kunifuji. Personal network
aggregation system for real-time communication support. 2007.