Bibliography

[1] Akshay Java, Xiaodan Song, Tim Finin, Belle Tseng. Why we twitter: Understanding
    microblogging usage a...
Transcript of "Enabling Real-Time User Interests for Next Generation ..."

  1. Enabling Real-Time User Interests for Next Generation Activity-Oriented Social Networks

     Prateek Agarwal (2005CS10173) cs1050173@cse.iitd.ac.in
     Sameer Madan (2005CS10181) cs1050181@cse.iitd.ac.in

     A Thesis submitted to the Indian Institute of Technology Delhi in conformity with the requirements for the degree of Bachelor of Technology

     under the guidance of
     Dr. Bijendra Nath Jain bnj@cse.iitd.ac.in
     Dr. Koustuv Dasgupta kdasgupta@in.ibm.com
     Dr. Dipanjan Chakraborty cdipanjan@in.ibm.com
     Dr. Anupam Joshi anupam.joshi@in.ibm.com
     Sumit Mittal sumittal@in.ibm.com

     Department of Computer Science & Engineering
     Indian Institute of Technology Delhi
  2. Certificate

     This is to certify that the project titled “Enabling Real-Time User Interests for Next Generation Activity-Oriented Social Networks” being submitted by Prateek Agarwal and Sameer Madan to the Indian Institute of Technology Delhi, for the award of the degree of Bachelor of Technology in Computer Science and Engineering, is a record of bona fide project work carried out by them under my guidance and supervision at the Department of Computer Science and Engineering, Indian Institute of Technology Delhi and IBM India Research Lab, New Delhi.

     Prof. Bijendra Nath Jain
     Dept. of Comp. Sc. and Engg.
     IIT Delhi
  3. Acknowledgements

     We would like to express our gratitude to Prof. Bijendra N. Jain for his patience and constant encouragement, help, guidance and invaluable suggestions. We are thankful to him for the advice he gave us at different steps of the literature survey, problem understanding and its implementation.

     We would like to thank our mentors at IBM: Dr. Koustuv Dasgupta, Dr. Dipanjan Chakraborty, Dr. Anupam Joshi, Sumit Mittal and Seema Nagar for their valuable time and for agreeing to supervise us on this project. We are grateful to them for their invaluable inputs, especially for helping us precisely define the problem statement and for guiding us in the right direction.

     Prateek Agarwal
     Sameer Madan
  4. Abstract

     Since its inception, online social networking has evolved from basic messaging and friend finding to a diverse and multi-faceted entity that encompasses the entire personality of an individual. Recent advancements, including mobile-phone based access and real-time status update facilities (such as micro-blogging), allow a person’s online presence to be as ephemeral and dynamic as his very thoughts and interests. With the evolution of the mobile platform, these networks now allow easy, on-the-fly access through hand-held mobile devices and PDAs.

     In this context, we consider an activity-oriented social network called R-U-In?, currently under development at the IBM India Research Laboratory, New Delhi, which exploits contextual reasoning and match-making techniques to help users locate others based on the activity of interest. In the context of real-time presence, we wish to enhance the functionality of this next generation network in two ways. First, we enable a more up-to-date reflection of user interest and expand the range of the R-U-In? framework beyond desktops using a mobile platform, thereby allowing easy access on the move. Second, we incorporate the complete online presence of the person, including sources of real-time interests such as microblogs, to enhance the match-making process by developing a better understanding of his/her tastes.

     Thus, we present here a mobile-phone based framework for R-U-In? as well as an enhanced version of the backend that is able to tag a user based on his real-time interests, as gleaned from his real-time online presence as well as his past activities on platforms, specifically Twitter, which we consider in this report.
  5. Contents

     1 Introduction
       1.1 Social Networks - A Brief History
       1.2 R-U-In? - Next Generation Activity-Oriented Online Social Networking System
     2 R-U-In? Mobile Framework
       2.1 Motivation
       2.2 Mobile application over mobile web-site
       2.3 Android Platform
         2.3.1 Features
       2.4 Present MSN market
       2.5 Existing Mobile Social Networking Services
         2.5.1 Qeep
         2.5.2 Dodgeball
         2.5.3 Zyb
     3 Design of the Application
       3.1 Design Overview
       3.2 Components
         3.2.1 Location Module
         3.2.2 Display Management Module
         3.2.3 Friend Info Module
         3.2.4 Activity Management Module
         3.2.5 Invite Management Module
     4 Detailed use case
     5 MicroBlog Analysis
       5.1 Motivation
       5.2 Related Work
  6.   5.3 Background Analysis Tools
       5.4 Creation of an Ontology
       5.5 Category Word Distribution
       5.6 Co-Occurrences
       5.7 Manual Benchmarking and Validation
       5.8 Specific Instance Occurrences
       5.9 Per-User Analysis and Determination on Principal Interest Category
       5.10 Validation of Result Stability by measuring KL Divergence
     6 Conclusion and Further Work
     Appendices
     A Our Ontology
     B Charts For The Remaining Nine Cities
       B.1 Category-wise Distribution For These Cities
       B.2 Category-wise Action and Temporal Co-Occurrences For These Cities
  7. List of Figures

     1.1 A brief history of online social networks [2]
     4.1 Login Screen
     4.2 Update Interest
     4.3 Creating an Activity
     4.4 User Interests Search Results
     4.5 User Information Window
     4.6 Manage Activity Tab
     4.7 Invite Received Notification
     4.8 Inviter Profile
     4.9 Manage Activity Tab
     4.10 Manage Activity Tab
     4.11 Update Interest Result
     4.12 Activity Creator Profile
     4.13 I’m bored. Suggest me something!!
     5.1 Word distribution for London data
     5.2 Word distribution for world-wide data
     5.3 Top category words
     5.4 Category-wise distribution
     5.5 Number of category words per tweets for one category
     5.6 Top 15 category words for New York as a percentage of total tweets
     5.7 Top 15 category words for New York as a percentage of tweets containing at least one category word
     5.8 Category-wise distribution for New York
     5.9 Top ten action-category pairs for New York as a percentage of total tweets
     5.10 Top ten action-category pairs for New York as a percentage of tweets containing at least one category word
  8. 5.11 Category-wise distribution of action-category pairs for New York as a percentage of total tweets
     5.12 Category-wise distribution of action-category pairs for New York as a percentage of tweets containing at least one category word
     5.13 Top ten temporal-category pairs for New York as a percentage of total tweets
     5.14 Top ten temporal-category pairs for New York as a percentage of tweets containing at least one category word
     5.15 Category-wise distribution of temporal-category pairs for New York as a percentage of total tweets
     5.16 Category-wise distribution of temporal-category pairs for New York as a percentage of tweets containing at least one category word
     5.17 Top 20 action-temporal word pairs for New York
     5.18 Category-wise word distribution for ABC
     5.19 Category-wise word distribution for DEF
     5.20 Category-wise word distribution for GHI
     5.21 KL Divergence between weeks 1 and 2
     5.22 KL Divergence between weeks 2 and 3
     5.23 KL Divergence between weeks 3 and 4
     5.24 KL Divergence between weeks 1 and 4
     B.1 Atlanta
     B.2 Austin
     B.3 Boston
     B.4 Chicago
     B.5 Los Angeles
     B.6 London
     B.7 San Francisco
     B.8 Seattle
     B.9 Toronto
     B.10 Atlanta - Action
     B.11 Atlanta - Temporal
     B.12 Austin - Action
     B.13 Austin - Temporal
     B.14 Boston - Action
     B.15 Boston - Temporal
     B.16 Chicago - Action
  9. B.17 Chicago - Temporal
     B.18 Los Angeles - Action
     B.19 Los Angeles - Temporal
     B.20 London - Action
     B.21 London - Temporal
     B.22 San Francisco - Action
     B.23 San Francisco - Temporal
     B.24 Seattle - Action
     B.25 Seattle - Temporal
     B.26 Toronto - Action
     B.27 Toronto - Temporal
  10. List of Tables

      2.1 Features of Android Platform
      5.1 Dataset For Co-Occurrence Analysis
      5.2 Top Ten Values
      5.3 Top Ten Values
      5.4 Top Ten Values
      5.5 Dataset For Per-User Analysis
      A.1 Ontology (part 1)
      A.2 Ontology (part 2)
      A.3 Ontology (part 3)
  11. Chapter 1

      Introduction

      The area of online social networking has experienced phenomenal growth in the past decade. Since the late 1990s, when rudimentary forms of present day social networking sites first appeared, these sites have bloomed to encompass a wide gamut of services like dating, friend-finding, business networking, chatting, photo and video sharing, blogging, mobile connectivity and much more. Apart from social networking sites that cater to the general public, there are also sites which cater to a particular niche of users. As the competition in this area heats up, recent times have seen a host of new services coming up, with each site trying to capitalize on the market by creating the next big thing in social networking.

      1.1 Social Networks - A Brief History

      Online social networking, as we know it, began in 1997 with a site called SixDegrees.com. The name of this site is derived from the empirical law that any two people in the world are connected by a chain of friends of length at most six people. At that point in time, there existed pages on the web that allowed people to create their profiles, find friends and affiliate themselves with a school. However, SixDegrees was the first to combine all these into one consolidated platform. This site attracted many users and grew till 1998, after which it declined and eventually shut down in 2000.

      The next notable milestone was the advent of business networking. Starting in 2001, sites like Ryze.com, and later LinkedIn and Tribe.net, opened their doors to the world. Ryze was started with the aim of networking with entrepreneurs and venture capitalists in the San Francisco area, but it never really gained much popularity. LinkedIn, however, grew into a major platform for business networking and is now the major player in that niche of social networking.
  12. Figure 1.1: A brief history of online social networks [2]

      Friendster was the first site that showed exactly how popular social networking sites were to become in the near future. Within a short time of its opening, the site experienced extremely rapid growth in its user base. It could not, however, keep up with its own growth, and it eventually collapsed and lost users to new players like MySpace. MySpace was the first social network to gain worldwide popularity and at present it boasts a user base exceeding 250 million users. Other social networking sites such as Orkut and Hi5 also came up around this time and gained popularity in other parts of the world. Orkut has a very large user base in countries like India and Brazil.

      Some social networking sites, such as Facebook, started off with the intention of providing services to a specific niche of users. Facebook was limited to Harvard at the time of its creation, but it soon opened its doors to the general public. Some of the features that led to the popularity of Facebook include its support for third party applications as well as the emphasis laid on user privacy. Presently, Facebook is one of the largest social networking sites in the world with 200 million registered users.

      Recently, social networking has diversified to other forms like blogging. Sites like LiveJournal and Xanga provide a platform for users to express their thoughts and emotions. A recent, more dynamic incarnation of this paradigm is that of micro-blogging. Sites such
  13. as Twitter combine the personal touch of a blog with the spontaneity of status updates by allowing users to post short one-line blogs. Twitter is known for being a highly dynamic and up-to-date source of a user’s interests and activities.

      1.2 R-U-In? - Next Generation Activity-Oriented Online Social Networking System

      There has been a significant enhancement in the features online social networks have provided since their inception. Starting as a medium for bringing people together through chat rooms and personal web pages, they are now being used in fields varying from business to dating to medicine. Over the last few years, “contextual communications” has been emerging as one of the key features of these networks: users now use various end devices like mobile phones and PDAs, and clients like GoogleTalk and Facebook, to update their presence, availability and mood. Rich presence is thus no longer limited to availability but extends to the personality of the user, and with the help of technologies like Web 2.0 and converged communications, a whole new genre of real-time, communication-driven social networking will come up.

      But present social networks are yet to fully exploit the domain of collaboration. R-U-In?, which is currently being developed at IBM India Research Lab, New Delhi [3], leverages the strengths of Web 2.0 and converged network technologies to create a rich next-generation service. It allows users to collaborate and participate in activities of mutual interest by enabling them to search for other like-minded users. R-U-In? uses contextual modeling and reasoning techniques to enable social search based on real-time user interests and finds potential matches for the proposed activity. It also exploits next-generation presence and communication technologies to manage the entire activity lifecycle in real time.

      Given the real-time networking nature of such a system, it is required to capture user thoughts as they appear. Thus, we have developed an Android-based framework for R-U-In? so that the user is able to use this service as well as update his/her interest even when on the move. Apart from these real-time updates, we would also like to incorporate a user’s past online activity in order to build a better understanding of his/her interests. Further, we need to expand the current ontology that R-U-In? recognizes so that we are able to provide a match for a wide range of activities. Combining
  14. all these and considering only a user’s updates on Twitter.com, we present an exhaustive analysis of the nature and content of tweets by users, and also a system that is able to tag users based on what they tweet about by employing an ontology developed from the aforementioned analysis as well as empirical observations of user tweets and status updates.

      Consider the following example, which explains how such a real-time social network works: Arya, while out, wants to watch a movie but doesn’t have company. He uses his mobile to search the R-U-In? system for a potential match and is matched with Jane, who is also interested in watching a movie at a theatre near his current location. He immediately sends a request to Jane, who happens to like his profile and accepts his request. A notification about this is immediately displayed on Arya’s mobile. He can now use the R-U-In? communication features like SMS and Click-to-Call to fix a meeting point with Jane. The two eventually meet up and enjoy the movie.

      The rest of the report is organized as follows: Chapter 2 presents some of the related work in the mobile social networking domain. We present the design of our mobile framework in Chapter 3 and a detailed use case in Chapter 4. Chapter 5 presents the detailed microblog analysis, followed by Conclusions and Further Work in Chapter 6.
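
      The Arya and Jane scenario above can be illustrated with a small sketch. This is not the R-U-In? matching engine, which the report describes only as using contextual modeling and reasoning; it is a simplified stand-in that assumes a match means "same declared interest, within a given radius". The names `User`, `haversine_km`, `find_matches` and the 5 km default radius are all hypothetical, chosen for illustration only.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


@dataclass
class User:
    # Hypothetical record: a name, one declared interest, and a GPS position.
    name: str
    interest: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def find_matches(seeker, candidates, max_km=5.0):
    """Return candidates sharing the seeker's interest within max_km, nearest first."""
    nearby = []
    for other in candidates:
        if other.name == seeker.name or other.interest != seeker.interest:
            continue
        distance = haversine_km(seeker.lat, seeker.lon, other.lat, other.lon)
        if distance <= max_km:
            nearby.append((distance, other))
    nearby.sort(key=lambda pair: pair[0])
    return [other for _, other in nearby]


if __name__ == "__main__":
    arya = User("Arya", "movie", 28.6139, 77.2090)
    jane = User("Jane", "movie", 28.6200, 77.2100)
    print([u.name for u in find_matches(arya, [jane])])
```

      A real system would also weigh time of day, profile preferences and the richer interest categories discussed in Chapter 5; the sketch only shows how interest and proximity can be combined into one query.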
  15. Chapter 2

      R-U-In? Mobile Framework

      2.1 Motivation

      R-U-In? is a real-time, interest-based service where users can update their interest and notify the system of their intention to search for other like-minded people. Such a service is most useful if the user is able to update his interest and search for other users as the thought strikes him, while he is on the move and around the intended activity location. This allows him to search for like-minded users at that very moment and plan his activity on the spot, rather than having to wait till he gets home, log onto a PC and then come back to that location to follow up on the activity. The mobile phone is with the user all the time, so it also allows him/her to use the system on the go whenever he/she finds the time, be it just 5 minutes in a bus or a train.

      Mobile phones also offer other capabilities which PCs lack. Functions like the Global Positioning System (GPS), which is now available in most phones, can be used to determine the location of the user. This can help a user find a feasible location for the activity by comparing the distances of candidate locations from his current location.

      2.2 Mobile application over mobile web-site

      While a possible solution could be just another mobile website for the system, this approach has many limitations of its own. The inability of mobile web applications to access the local capabilities of the mobile device can limit their ability to provide the same features as native applications. Generally, web browsing software on mobile devices lacks the flexibility and functionality present in PC-based web browsers. Mobile browsers do not support features like client-side scripting, style sheets, storage of cookies, etc. which are now widely used
  16. in most web sites for enhancing the user experience, for example by facilitating the validation of data entered by the page visitor. Thus using the same web-site on a mobile phone is not a good solution. Also, mobile phones have a constrained display, and large web-sites make it uncomfortable for the user, as he has to scroll every time to find information.

      A dedicated application can also be linked to other features present in mobile phones like the calendar, alarm clock, etc., giving the user a common terminal for managing his information, or to features like GPS, enhancing his user experience. A dedicated application can also be sold preloaded on mobile phones, making the application an integral feature of the wireless handset like an alarm clock, calendar, or mobile e-mail, and thus providing a source of revenue for the developer.

      2.3 Android Platform

      Android is an open source software platform for mobile devices based on the Linux kernel. It was initially developed by Google and later by the Open Handset Alliance.

      Being an open source platform, Android is evolving very quickly. It offers the advantages of openness and collaborative development. Android’s radically different approach to mobile Linux application development offers some unique advantages. The biggest advantage is that it provides a very high level of uniformity. In theory, the vast majority of Android applications will be able to run on virtually any Android-based device without requiring any further modification.

      2.3.1 Features

      Android provides a comprehensive and well-organized variety of high-level APIs for building applications and leveraging the underlying functionality of the platform. These APIs provide a very high level of abstraction, which makes them easy to use. The APIs make it possible to build applications that integrate fully with the rest of the platform. Some of the features offered by the platform are:
Feature                       Description
Handset layouts               Supports larger, VGA, 2D graphics library, 3D graphics library based on OpenGL ES 1.0 specifications, and traditional smartphone layouts
Connectivity                  Technologies including GSM/EDGE, CDMA, EV-DO, UMTS, Bluetooth, and Wi-Fi are supported
Messaging                     SMS and MMS are available forms of messaging, including threaded text messaging
Dalvik Virtual Machine        Software written in Java can be compiled into Dalvik bytecode and executed in the Dalvik virtual machine, which is a specialized VM implementation designed for mobile device use, although not technically a standard Java Virtual Machine
Media Support                 Supports the following audio/video/still media formats: MPEG-4, H.264, MP3, AAC, MIDI, OGG, AMR, JPEG, PNG, GIF
Additional Hardware Support   Android can utilize video/still cameras, touchscreens, GPS, accelerometers, and accelerated 3D graphics
Storage                       The SQLite database software is used for data storage purposes
Development environment       Includes a device emulator, tools for debugging, memory and performance profiling, and a plugin for the Eclipse IDE

Table 2.1: Features of Android Platform
source: android.com

2.4 Present MSN market

There seems to be a huge potential for growth in this nascent market. Ziv et al[4] have discussed the implications of mobile social networks for the wireless sector, content providers, technology companies, and the users of the mobile platform, and presented a case study on Dodgeball, a New York City based mobile social networking company, to exemplify user-centric innovation on the mobile platform. Industry analysts have predicted huge demand for this market, particularly from teens and young adults. Today about half the global population (around 3.3 billion people) owns a handset. The International Telecommunications Union (ITU) found that mobile subscriptions rose steadily from 2005 to 2007, by 39% and 28% per year in Africa and Asia respectively. eMarketer, a market analysis firm, forecasts that mobile social networking will grow from 82 million users in 2007 to over 800 million worldwide by 2012.

A study conducted by Juniper Research predicts that revenues from user-generated content will grow from $572 million in 2008 to over $5.7 billion in 2012, of which about 50% will be accounted for by social networking sites.

In a country like India, where mobile penetration exceeds PC penetration and will continue to do so, accessing the internet from the mobile phone is certain to become more and more popular. The Telecom Regulatory Authority of India (TRAI) recently found that the total number of wireless subscribers in the country (as of end March 2009) is 391.76 million. With a significant number of these subscribers using “data-enabled” handsets according to the TRAI data for the December 2008 quarter (101.1 million handsets), the country may be expected to take a big step forward on the mobile Internet front. As far as Indian users go, roughly half (48.9%) of all visits to social networks are presently made via mobile phones.

2.5 Existing Mobile Social Networking Services

Today, various web-based social networks (like Myspace, Facebook etc.) are moving into the mobile domain, while many others are being developed specifically for it.
2.5.1 Qeep

Qeep is an online mobile social network developed by Blue Lion Mobile. It offers its users varied services like private messaging, live multi-player gaming, sound attacks, photo-blogging with unlimited storage space and QMS, a special kind of text message designed for qeep which costs a lot less than a normal SMS. Users are required to download a Java-based qeep application on their mobiles. The absence of a complex solution stack minimizes the amount of memory needed for installing and running the program on mobile phones. As of December 2008, qeep has a membership of over 750,000 users.

2.5.2 Dodgeball

Dodgeball was a location-based mobile social networking service started by Dennis Crowley and Alex Rainert and later acquired by Google in 2005. The service was used by sending simple text messages to the system, and there was no need to download or install anything on the phone. A user had to text his location to the service and was then notified of friends and friends of friends present nearby. It was available in 22 cities before it was shut down in February 2009 and replaced with Google Latitude.

2.5.3 Zyb

Zyb is an aggregation portal which shows users their friends' social updates from Facebook, Twitter, Flickr, Hyves and blogs. It allows users to store their phone's contacts, pictures, text messages and calendar events online and get their friends' updated contact details automatically synced to their phone. Users are not required to install any software on their phone.
Chapter 3

Design of the Application

In this chapter, we present the complete design of the application along with the arguments for choosing it.

3.1 Design Overview

Our main aim was to incorporate all the functionality provided by the web portal of R-U-In? and make it easier for the user to use the system. We tried to keep the application as close as possible to the look and feel of the web portal so that the user feels familiar with the system, while respecting the constraints a mobile device has compared to a web application. The web portal for R-U-In? is developed using Java Server Pages (JSP) and Ajax technologies. These technologies are not available for the Android mobile platform. So we decided to use simple HTTP requests to talk to the servlets on the server side and to parse the response received using the inbuilt SAX XML parser.

Another major task was to get periodic updates from the server about different friends, activities and invites. We preferred a client-pull over a server-push mechanism for this operation, as it allowed us to keep the server design simple: the server is not required to keep track of different devices, their presence, their addresses, locations and status. This way the client can also determine the polling rate and control it as required. There are three threads running in the background which perform this periodic task of updating the data. They are created when the user logs in and interrupted when the user exits the application.

When a user queries the system, activities matching his query are marked on the map. These results could be present anywhere on the map and, due to the constrained screen size
of the mobile phone, it becomes very uncomfortable for the user to check the responses received. To solve this issue, we zoom the map out each time to a level at which all the matches are displayed on the screen without the display becoming cluttered.

Certain tabs present in the web interface have been combined to allow convenient access in the mobile domain. When the user receives a new invite or request for an activity, he gets a notification popup on his mobile phone screen.

3.2 Components

There are five major components constituting the whole application. We shall now describe each of them one by one.

3.2.1 Location Module

R-U-In? marks the location of the activities on the map. When the user queries for an activity (interest), the latitude/longitude corresponding to that activity's (interest's) location is determined. The server is also queried for matching interests (activities), and the latitude/longitude of those locations is determined as well. These locations are then marked with different markers on the map. In order to access the network for this operation, the following permission is required in the AndroidManifest.xml file:

    <uses-permission android:name="android.permission.INTERNET"/>

3.2.2 Display Management Module

R-U-In? uses the Google Maps interface to display the activity locations of the users. Due to the constrained screen size, we decided to display the map on the whole screen. Google maps are rendered on the screen using the set of APIs provided by Android. However, this is not a standard package in the Android library. In order to use it, the following XML element must be added as a child of the application element in the AndroidManifest.xml file:

    <uses-library android:name="com.google.android.maps"/>

After a user logs in, this module queries the GPS module of the device for the last known location of the user and centers the map on that location.
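
The zoom-to-fit behaviour described in the design overview can be illustrated with a small sketch. This is not the application's Android code. It is a minimal Python illustration in which the function name `fit_span`, the padding factor and the sample coordinates are all assumptions introduced here. It computes the centre and the latitude/longitude span a map view would need to show every match at once.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def fit_span(points: Iterable[Point], padding: float = 1.2) -> Tuple[Point, Point]:
    """Return (centre, span) of the bounding box covering all points.

    `padding` > 1 leaves a margin so markers do not sit on the screen edge.
    Longitudes are treated as plain numbers, so this sketch does not handle
    result sets that straddle the 180 degree meridian.
    """
    pts = list(points)
    if not pts:
        raise ValueError("at least one point is required")
    lats = [p[0] for p in pts]
    lons = [p[1] for p in pts]
    centre = ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)
    span = ((max(lats) - min(lats)) * padding, (max(lons) - min(lons)) * padding)
    return centre, span


# Hypothetical match locations, for illustration only.
matches = [(28.54, 77.16), (28.55, 77.19), (28.58, 77.17)]
centre, span = fit_span(matches)
print("centre:", centre, "span:", span)
```

A map widget would then be centred on `centre` and zoomed until `span` fits on the screen. The padding keeps markers away from the edges, which is one way to address the clutter concern mentioned above.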
3.2.3 Friend Info Module

This module fetches the data of the user's friends. It starts in the background after the user logs in and begins querying the server for the friends' real-time data. When the user wants to view the friend information, this module parses the response from the server using the SAX XML parser and writes the data to the device in key-value pair format using the SharedPreferences mechanism. This data is then read by the display management module, which updates the friend information in the display window.

3.2.4 Activity Management Module

This module keeps track of the different activities a user is involved in at any moment. It sends HTTP requests to the server to get the real-time status of all the activities. The creator of an activity is then informed about the status of all the members in that activity, and the members (other than the creator) are informed about the creator of that activity.

3.2.5 Invite Management Module

A user can receive “invites” or “requests” from other users in the system with similar interests. This module queries the server and notifies the user of them. A notification window is flashed on the screen and the phone vibrates to attract the attention of its user.
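
The parse-then-store step of the Friend Info Module can be sketched as follows. The report does not show the server's XML format, so the `<friends>`/`<friend>` layout, the field names and the `name.field` key scheme below are hypothetical. The sketch uses Python's standard `xml.sax` module as a stand-in for Android's SAX parser, and a plain dictionary as a stand-in for SharedPreferences.

```python
import xml.sax


class FriendHandler(xml.sax.ContentHandler):
    """Collects the fields of each <friend> element into flat key-value pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._friend = None  # name of the friend currently being read
        self._field = None   # tag of the field currently being read
        self._text = []

    def startElement(self, name, attrs):
        if name == "friend":
            self._friend = attrs.get("name")
        elif self._friend is not None:
            self._field = name
            self._text = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "friend":
            self._friend = None
        elif name == self._field:
            self.pairs[f"{self._friend}.{name}"] = "".join(self._text).strip()
            self._field = None


def parse_friends(xml_text: str) -> dict:
    """Parse a (hypothetical) server response into key-value pairs."""
    handler = FriendHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.pairs


# Hypothetical response, for illustration only.
sample = (
    "<friends>"
    "<friend name='Seema'><interest>Tennis</interest><status>online</status></friend>"
    "</friends>"
)
print(parse_friends(sample))
```

Flattening to string keys and values mirrors the key-value format the module is described as writing, so a display component can read individual fields without re-parsing the XML.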
Chapter 4

Detailed use case

We shall now demonstrate a complete use case with the help of screen shots of the application. There are five people, Prateek, Sameer, Koustuv, Dipanjan and Seema, each of them accessing “R-U-In?” through a mobile device.

Figure 4.1 shows the Login screen, which appears when the user clicks on the application icon in the mobile phone menu for the first time. If he selects the “Remember Me” option, he will not be asked to enter the password the next time he accesses the application.

Figure 4.1: Login Screen

Seema is new in town and wants to play tennis in the evening. She has a busy schedule today, so while she is on the move, she uses her mobile phone to update her interest (IIT,
Sport) to see who else is interested. Similarly, Koustuv updates his interest for a game this evening (Figure 4.2).

Figure 4.2: Update Interest

Prateek, who also wants to play today and is willing to take the initiative of organizing, searches the system for the activity “Tennis, JNU, 5/8/2009, 6:15 pm to 8:45 pm” (Figure 4.3). R-U-In? searches for potential matches and returns Seema and Koustuv as results (Figure 4.4).

Figure 4.3: Creating an Activity
Figure 4.4: User Interests Search Results

Prateek can look at the profiles of these users, and he invites Koustuv and Seema for the game (Figure 4.5). The “Manage Activity” tab of the application displays the current status of all the activities (Figure 4.6).

Figure 4.5: User Information Window
Figure 4.6: Manage Activity Tab

Seema and Koustuv get an immediate notification on their phones as they receive the invite (Figure 4.7). They can then look at the inviter's profile to know about the person (Figure 4.8) and accept the invite.

Figure 4.7: Invite Received Notification
Figure 4.8: Inviter Profile

Prateek, after receiving the updates of Seema and Koustuv accepting his invite, confirms the activity, and the updated status of the activity is immediately displayed on all the corresponding mobile phones (Figure 4.9 and Figure 4.10).

Figure 4.9: Manage Activity Tab
Figure 4.10: Manage Activity Tab

A person can also join an already existing activity. Dipanjan also wishes to play a game now and updates his interest. R-U-In? returns the activity Prateek has already created (Figure 4.11). He can look at the activity details by clicking on the activity marker on the screen and send Prateek a request to join (Figure 4.12).

Figure 4.11: Update Interest Result
Figure 4.12: Activity Creator Profile

We have also categorised users based on their real-time interests gleaned from their tweets on the microblogging site Twitter.com (detailed discussion in the next chapter). So Sameer, who has been tagged as a Sports lover from his tweets over the past one month, is automatically shown activities of the Sports category when he clicks on the “I'm bored!!” button in the application (Figure 4.13). Sameer can now request to join that activity.

Figure 4.13: I'm bored. Suggest me something!!
Chapter 5

MicroBlog Analysis

Presently, the R-U-In? backend consists of a very limited set of words describing an activity. Furthermore, the R-U-In? interface currently requires a structured input from the user. In this chapter, we present an enhanced version of the backend that can recognize user interests by looking into other sources of user data, specifically micro-blogging sites.

5.1 Motivation

Presently R-U-In? works only on a structured input. The user is required to enter his/her interest in a well-defined syntax, which involves explicit delineation of the activity category, its time and location. We feel, however, that this functionality can be expanded to allow for unstructured and in some cases vague input. It would greatly add value to R-U-In? if we could automatically figure out users' interests based on what they are sharing with other users on different social networking portals. R-U-In? is already interfaced with Facebook and can extract user-specified interests from there. However, Facebook interests represent user interests in a very general sense and do not necessarily coincide with the user's real-time interests. This motivated us to look to a more active source of information regarding user interest, like Twitter.

Twitter is a free social networking and micro-blogging service that enables its users to send and read other users' updates, known as tweets. Tweets are text-based posts of up to 140 characters in length which are displayed on the user's profile page and delivered to other users who have subscribed to them (known as “followers”). Senders can restrict delivery to those in their circle of friends or, by default, allow anybody to access them.

Social forums like Twitter and GTalk (considering the facility of status messages) provide a convenient platform for people to share their current thoughts with other people. It is
this data which we mine to extract information about user interest and intention, which will allow the system to make a more informed decision about what the user is interested in.

Although the immediate solution to such an analysis problem might seem to lie in the field of deep natural language processing (NLP), the amount of data involved makes deep NLP impractical because of its computational intensiveness and high processing time. Thus, we choose to apply shallow NLP, which does not involve text understanding, i.e. semantic analysis of natural language input. Instead it focuses on extracting text chunks and matching patterns or entities that contain the answer to user questions. We shall try to use purely statistical methods in our approach to the problem.

5.2 Related Work

The area of social networks is a highly active field of research. We present here some of the related work in this area, including the paper on R-U-In?, the application that we have primarily dealt with in the course of our project.

Banerjee et al[3] present R-U-In?, a real-time social networking framework which allows users to collaborate and participate in activities of mutual interest by enabling them to search for people based on their real-time interests. R-U-In? leverages contextual modeling and reasoning techniques to enable social search based on real-time user interests and finds potential matches for the proposed activity.

Hirata et al[6] propose a system that aggregates a user's multiple personal networks, constructs a personal network that unifies their data, and adds activity information for each user inside the unified personal network. The system also allows transmission of user data within one's own personal network using P2P.

Kelkar et al[5] present an activity-based perspective of collaborative tagging (where activity is defined as the act of associating a tag with a bookmark by a user) which is based on certain defined measures of the tagging activity. It has applications in identifying trends and types of interests in web communities as well as expertise, staffing needs and knowledge gaps in enterprise communities.

Java et al[1] present a topological and geographical study of Twitter's social network to show that people use microblogging to talk about their daily activities and to seek or
share information. They also analyze the user intentions associated at a community level and show how users with similar intentions connect with each other.

5.3 Background Analysis Tools

We aim to develop a history profile for every user in the backend. By doing an extensive analysis of that user's presence and posts on various forums, like Twitter, we can get a better idea of the user's real-time interests. In what follows, we detail the work done by us along with the data-sets used for each experiment. All the following analyses incorporate two things:

• Porter's Stemming Algorithm: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The rationale behind including this in our study is that tweets may contain different forms of the same word. To obtain the correct number of occurrences of a given word in a dataset without having to search individually for all its forms, we stem all the data in the first step. For this purpose, we employ Porter's Stemming Algorithm, which was written by Martin Porter and published in the July 1980 issue of the journal Program. This stemmer has been very widely used and has become the de-facto standard algorithm for English stemming. As an example, consider the words “Work”, “Working” and “Worked”. Running these words through the stemming process reduces each of them to the root, i.e. “work”. The algorithm also reduces all letters to their lowercase forms, and leaves everything else untouched.

• Lucene: Lucene is a free/open source information retrieval library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.
While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to
be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others, can all be indexed so long as their textual information can be extracted. For our purposes, we consider every tweet to be a “document” and run our queries over an index of these documents.

5.4 Creation of an Ontology

The first step towards the analysis of tweets was to develop a deeper understanding of what users tweet about in general. We took a dataset consisting of 1.65 million tweets by users from London, gleaned over a period of one month. We also took a second dataset consisting of all tweets from all over the world containing at least one of a set of core words. This set consisted of more than 4 million tweets.

The main purpose of this experiment was to get an idea of the number of occurrences of “useful” words. A useful word for us is one which gives us information about what the user is interested in. We therefore compiled a list of “Exclusion words” which consisted of all pronouns, prepositions, helping verbs, question words etc.

After running both the dataset and the exclusion list through the stemmer, we prepared a plot of the top 15 useful words in each dataset. This process required a few iterations, over which we removed more non-useful words like “just”, “so”, “have” etc. after examining the results. The results for both sets are presented in Figure 5.1 and Figure 5.2.
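
The stem, exclude and count procedure described above can be sketched in a few lines. The sketch below is illustrative only: `toy_stem` is a deliberately crude suffix stripper standing in for Porter's algorithm (it is not the Porter stemmer), and the function names and sample tweets are assumptions made here. As in the text, both the tweets and the exclusion list pass through the same stemmer before counting.

```python
import re
from collections import Counter


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: lowercase and strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_useful_words(tweets, exclusion_words, n=15):
    """Return the n most frequent stems that are not in the exclusion list."""
    excluded = {toy_stem(w) for w in exclusion_words}
    counts = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-zA-Z']+", tweet):
            stem = toy_stem(token)
            if stem not in excluded:
                counts[stem] += 1
    return counts.most_common(n)


# Hypothetical tweets and exclusion list, for illustration only.
tweets = ["Working on the project", "I worked all day", "Work is fun"]
exclusion = ["on", "the", "i", "all", "is"]
print(top_useful_words(tweets, exclusion, n=3))
```

Iterating as the text describes amounts to inspecting the output, appending further non-useful words such as “just” or “so” to the exclusion list, and running the count again.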
Figure 5.1: Word distribution for London data

Figure 5.2: Word distribution for world-wide data
Based on the results of this experiment, we defined five broad categories of activities of user interest. Each of these categories consisted of a set of words which define activities or interests in that category. Apart from these five, we also defined a set of action words and a set of temporal words that could be used to assign a location and time to a potential activity. The five categories thus defined are:

• Movies
• Sports
• Dance
• Music
• Food

These categories contained a total of around 190 words. Over the course of our experiments, we expanded these words based on empirical observations and our own experience of using GTalk, Facebook and other platforms.

5.5 Category Word Distribution

Having defined a set of words in five categories, the next step was to see what percentage of tweets actually contain these words. So, we ran all these words as Lucene queries over the London data mentioned in the previous section. The top 19 words that appeared are presented in Figure 5.3.
Figure 5.3: Top category words

Individual word occurrences do not seem to be very high at first glance. However, we summed up the results over all words in each category and plotted a category-wise distribution. This is presented in Figure 5.4. We noticed that a total of 15% of tweets contain at least one category word.

One might argue that a direct summation of percentages for each category word is not correct, since a tweet may contain more than one word of the same category. To justify our approach, we plotted a distribution of the number of words of a category that occur in a tweet. One of these graphs is presented in Figure 5.5.
Figure 5.4: Category-wise distribution

Figure 5.5: Number of category words per tweet for one category
Notice that, out of the tweets we consider for a category, 95% of those which contain at least one category word contain exactly one. Roughly 5% contain two words. For higher numbers, the percentages are negligible. Thus, as an approximation, a direct summation over each category is fairly justified.

Given the results of this experiment, it was evident that we needed to improve the ontology by adding more words, so that a higher percentage of tweets would contain category words. Thus, based on an examination of around 1000 tweets as well as Facebook and GTalk status updates, we included some more words in each category. Furthermore, we added an entirely new category, Technology. The total set of category words now contained around 260 words. The category of Movies was expanded with words implying an interest in television soaps and shows as well. The category of Food was expanded with certain specific instances, such as “sausage”, “pizza” and “cake”. The ontology words are listed in the appendices.

We then ran the same experiment with the new ontology, this time on a dataset consisting of tweets from the previous one week over ten cities (in alphabetical order): Atlanta, Austin, Boston, Chicago, Los Angeles, London, New York, San Francisco, Seattle and Toronto. These are the top ten cities in the world in terms of Twitter usage. The dataset is described in Table 5.1.

We present here the result from one of these cities, i.e. New York. The results for the other cities are presented in the appendices at the end of this report. The top 15 category words that appeared are plotted in Figure 5.6. Figure 5.7 is the same plot, except that the values are expressed as a percentage of tweets containing at least one category word.

City            Number of Tweets   Percentage of Tweets with at Least One Category Word
Atlanta         186711             21.99
Austin          124489             24.37
Boston          163062             24.67
Chicago         231622             23.55
Los Angeles     339583             24.31
London          373992             25.29
New York        569668             23.37
San Francisco   226261             24.06
Seattle         151917             25.78
Toronto         189575             23.00

Table 5.1: Dataset For Co-Occurrence Analysis
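
The direct-summation approximation used in this section can be checked with a short calculation. Summing per-word counts counts a tweet once for every category word it contains, so the sum exceeds the number of distinct tweets by the average number of category words per matching tweet. The sketch below is illustrative: the function names and the three sample tweets are assumptions, while the 95%/5% split is the one reported above.

```python
def overcount_factor(words_per_tweet):
    """Ratio of the direct per-word sum to the true number of matching tweets.

    `words_per_tweet` maps k (category words in a tweet) to the share of
    matching tweets that contain exactly k such words.
    """
    total = sum(words_per_tweet.values())
    return sum(k * share for k, share in words_per_tweet.items()) / total


def direct_sum_and_true_count(tweets, category_words):
    """Compare the per-word sum with the count of tweets having any category word."""
    words = set(category_words)
    hits = [len(words & set(t.lower().split())) for t in tweets]
    return sum(hits), sum(1 for h in hits if h > 0)


# Split reported in the text: about 95% one word, about 5% two words.
print(round(overcount_factor({1: 0.95, 2: 0.05}), 2))

# Hypothetical tweets, for illustration only.
print(direct_sum_and_true_count(
    ["pizza and cake", "pizza tonight", "nothing relevant"], ["pizza", "cake"]))
```

With the reported split the factor is about 1.05, so direct summation overstates a category's share by roughly 5% in relative terms, which is consistent with treating it as a fair approximation.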
Figure 5.6: Top 15 category words for New York as a percentage of total tweets

Figure 5.7: Top 15 category words for New York as a percentage of tweets containing at least one category word
As done previously, we sum up all the category words over their respective categories and plot a category-wise distribution. This is shown in Figure 5.8. We notice that nearly 23.4% of tweets contain at least one category word, which is a significant improvement over the previous case, where we got around 15%. We obtained similar values for each of the ten cities, with varying percentages in different categories. These values are reported in Table 5.1. The relevant graphs are presented in the appendices.

5.6 Co-Occurrences

A binary co-occurrence measure tells us how many tweets contain two given words. In the context of activity-oriented paradigms like R-U-In?, this kind of information is useful if the person expresses his/her interest in a particular activity along with a location or time for the event. For instance, if a tweet contains the words “movie” and “tonight” within a few words of each other, then it can be said with very good probability that the user intends to watch a movie tonight. In what follows, we conduct a co-occurrence analysis on all ten cities mentioned in the previous section. We do this analysis in three parts: Action-Category, Temporal-Category and Action-Temporal.

• Action-Category Co-Occurrence: In the context of R-U-In?, our primary interest is to be able to parse an unstructured input and thus infer when a user expresses interest in some activity. We examine the co-occurrence of action words with category words to get an idea of the user's intention. For example, if a tweet contains the word “movie”, then all we can say is that the user intends to say something about a movie in particular or movies in general. However, if we are given the additional information that the tweet also contains the word “watch” within a few words of the word “movie”, then it can be inferred with high probability that the user intends to watch a movie. We again consider the dataset consisting of the ten cities mentioned in the previous section.
We present the top ten action-category word pairs for New York in Figure 5.9. Figure 5.10 is the same plot, except that the values are expressed as a percentage of tweets containing at least one category word.
Figure 5.8: Category-wise distribution for New York

Figure 5.9: Top ten action-category pairs for New York as a percentage of total tweets
Movie + Watch
Show + Love
Game + Play
Show + Watch
TV + Watch
Show + Love
Game + Watch
Show + Go
Movie + See
Gym + Go

Table 5.2: Top Ten Values

Figure 5.10: Top ten action-category pairs for New York as a percentage of tweets containing at least one category word

The most frequently occurring word pair is “movie” and “watch”. The Lucene query used for the proximity search was given a default value of 5 as the maximum distance between the two words. Although the individual occurrence of a particular pair is small, when we sum up all the word pairs we find that around 6.7% of all tweets contain at least one action-category word pair. The category-wise distribution of action-category word pairs is plotted in Figure 5.11. Figure 5.12 is the same plot, except that the values are expressed as a percentage of tweets containing at least one category word.

• Temporal-Category Co-Occurrence: A temporal-category co-occurrence allows the system to assign a time-frame to the activity of interest and can thus be useful in the match-making process. An analysis analogous to the
above was done on all ten cities. The top ten temporal-category pairs for New York are presented in Figure 5.13. Figure 5.14 is the same plot, except that the values are expressed as a percentage of tweets containing at least one category word.

Figure 5.11: Category-wise distribution of action-category pairs for New York as a percentage of total tweets
Figure 5.12: Category-wise distribution of action-category pairs for New York as a percentage of tweets containing at least one category word

Figure 5.13: Top ten temporal-category pairs for New York as a percentage of total tweets
Show + Tonight
Show + Today
Show + Night
Show + Time
Show + Day
Dinner + Tonight
Coffee + Morning
Song + Day
Movie + Time
Blog + Today

Table 5.3: Top Ten Values

Figure 5.14: Top ten temporal-category pairs for New York as a percentage of tweets containing at least one category word

The most frequently occurring word pair is “show” and “tonight”. The Lucene query used for the proximity search was given a default value of 5 as the maximum distance between the two words. Although the individual occurrence of a particular pair is small, when we sum up all the word pairs we find that around 4% of all tweets contain at least one temporal-category word pair. The category-wise distribution of temporal-category word pairs is plotted in Figure 5.15. Figure 5.16 is the same plot, except that the values are expressed as a percentage of tweets containing at least one category word.

• Action-Temporal Co-Occurrence: Analyzing action-temporal co-occurrences allows us to associate timeframes with words describing some action. Figure 5.17 shows the top 20 action-temporal word pairs for New York (with a
Figure 5.15: Category-wise distribution of temporal-category pairs for New York as a percentage of total tweets

Figure 5.16: Category-wise distribution of temporal-category pairs for New York as a percentage of tweets containing at least one category word
default range of at most 5). We note that the total percentage of tweets with at least one action-temporal co-occurrence comes out to be 9.29%. The pair with the highest occurrence, as shown in Figure 5.17, is “work” and “day”.

Figure 5.17: Top 20 action-temporal word pairs for New York

Work + Day
Go + Day
Go + Am
Go + Today
Go + Tonight
Work + Today
Go + Time
Do + Time
Do + Today
Do + Day

Table 5.4: Top Ten Values

To summarize:

• The above analysis validates the underlying assumption of the system by showing that users do tend to tweet about their real-time interests and in some cases even mention an associated location or timeframe.

• A co-occurrence helps us infer the actual user intent with high probability, which would be very helpful in the matchmaking process. The fact that a fairly
large share of tweets containing a category word also contain a co-occurrence suggests that co-occurrence is a useful indicator of user intent.

• This approach offers a more practical solution to the match-making problem for large data sets compared to more intensive approaches such as deep NLP, which may take a very long time.

5.7 Manual Benchmarking and Validation

Once we have the results for the various co-occurrences as detailed in the above section, the next step is to create a benchmark and manually verify some of the results obtained. After manual benchmarking, we are able to assign a “confidence measure” to a co-occurrence. This measure can be understood through a simple example. Consider 100 tweets by a user. Suppose that 20 of these contain the co-occurring pair “football” and “play”. Suppose that 5 of these 20 actually imply an intention to play football in the near future. We define:

    Confidence Measure = P(Actual Intention | Co-Occurrence)
                       = P(Actual Intention and Co-Occurrence) / P(Co-Occurrence)

In this example, the confidence measure is simply 5/20 = 25%. We chose to prepare a benchmark for the word pairs “movie” and “watch” in the action-category case and “show” and “tonight” in the temporal-category case. The results obtained for New York are given below:

• For the pair “movie” and “watch”, the total number of tweets containing this pair was 583. We observed four kinds of tweets in general:

  1. Tweets which indicate that the movie has already been seen
  2. Tweets which ask for a suggestion for a movie to watch
  3. Tweets which express a definite intention to watch a movie in the near future
  4. Tweets expressing a disinclination to watch a movie

  Out of these, we observed only one tweet that fell into the last kind (i.e. expressed a disinclination). In the context of R-U-In?, we consider only tweets of the third kind, i.e. a definite intention to watch a movie in the near future. We found 114 out of the
583 tweets falling in this category, which implies a confidence measure of 114/583, or approximately 20%.
• For the pair “show” and “tonight”, the total number of tweets containing this pair was 409. We observed three main kinds of tweets in general:
  1. Tweets which indicate that the show is already over
  2. Tweets expressing interest in a TV/radio show
  3. Tweets which express a definite intention to attend a show that night
  There were also tweets that did not fall into any of these three main categories. There were only 3 tweets that expressed a disinclination to attend a show that night. In the context of R-U-In?, we consider only tweets of the third kind, i.e. a definite intention to attend a show that night. We found 133 out of the 409 tweets falling in this category, which implies a confidence measure of 133/409, or approximately 32.5%.

The above manual benchmarking helped to establish the following:

1. People tend to be more assertive about their interests than negative about their non-interests.
2. The co-occurrence measure yields a reasonable confidence measure, which supports our earlier co-occurrence-based analysis as a first approximation.
3. We had noticed that around 8% of tweets for New York contained at least one action-category pair. We had also obtained a 20% confidence measure for the pair “movie” and “watch”. Combining the two results (assuming a 20% confidence measure for all action-category pairs), we see that roughly 1.6% of all tweets contain an action-category pair that actually signifies interest in that activity.

5.8 Specific Instance Occurrences

During our analysis, we observed that users often tend to tweet about specific instances rather than category words in general. For instance, a user might tweet about Star Wars or Stanley Kubrick rather than explicitly mention the word “movie” in his/her tweet. In our ontology, the sports category already contains names of almost every known sport.
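Returning briefly to the confidence measure of Section 5.7, the reported figures can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the project: the function name `confidence_measure` and the variable names are assumptions made for illustration, while the counts (114 of 583, 133 of 409) and the 8% share are the values quoted above.

```python
def confidence_measure(intent_tweets: int, pair_tweets: int) -> float:
    """Estimate P(actual intention | co-occurrence) from manual labels.

    intent_tweets: tweets containing the pair that were manually judged
                   to express a definite intention.
    pair_tweets:   all tweets containing the co-occurring pair.
    """
    if pair_tweets <= 0:
        raise ValueError("pair_tweets must be positive")
    if not 0 <= intent_tweets <= pair_tweets:
        raise ValueError("intent_tweets must lie between 0 and pair_tweets")
    return intent_tweets / pair_tweets


# Counts reported for New York in Section 5.7.
movie_watch = confidence_measure(114, 583)
show_tonight = confidence_measure(133, 409)

# Share of New York tweets containing at least one action-category pair
# (about 8%), scaled by the "movie"/"watch" confidence measure. Applying
# one pair's confidence to all pairs is the simplifying assumption stated
# in the text.
pair_share = 0.08
intent_share = pair_share * movie_watch

print(f"movie + watch:  {movie_watch:.1%}")
print(f"show + tonight: {show_tonight:.1%}")
print(f"tweets signalling real interest: {intent_share:.2%}")
```

Using the unrounded 114/583 instead of 20% gives a value slightly below the 1.6% quoted in the text, which is consistent with that figure being a rough estimate.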
For the category of movies, where this pattern surfaced most often, we collected a list of all movies released since the year 2000 (source: Wikipedia) and ran them as Lucene queries over the New York data. To avoid movie names that are also commonly occurring words, like “Wanted”, we set a maximum limit of 500 on the number of occurrences of a movie name, based on empirical observations. Recall that the entire movies category accounted for roughly 5.1% of all the tweets in New York (Figure 5.8). However, when we count only the specific instances, i.e. movie names, we see that these alone account for 2.82% of all the tweets. Notice that the list of specific instances in the realm of movies can be extremely vast and may include names of actors, directors, etc. All these can add up to a significant number. Thus, we can interface a third-party solution with R-U-In? which uses the tweet as a search query over an internet search engine after removing all category words, and thereby detects such instances and infers user interest as falling in one of our defined categories. We leave this aspect, however, as future work.

5.9 Per-User Analysis and Determination of Principal Interest Category

Once we had all the background information about the data, we next focused our attention on a per-user analysis. The motivation behind this is that our ultimate aim is to be able to tag every user with his/her category of interest with a high confidence measure. These tags can then be used for matchmaking in cases of unstructured or vague input. For this particular experiment, we required a larger data set, i.e. a set consisting of more tweets. Thus, we used the dataset (gleaned over one month from ten cities) detailed in Table 5.5. We first ran the experiment for the first 1000 users in the data set for one city. We noticed that users with a higher number of tweets seem to have more consistency in their interests.
Thus, we re-ran the experiment, this time for all ten cities, restricted to the top 1000 users (in terms of number of tweets) for each city. For every user, we determined the following:

• The number of tweets
• The top 5 category words for that user (based on the six categories defined previously)
City            Number of Users   Number of Tweets
Atlanta         22201             911233
Austin          17202             687164
Boston          24000             912446
Chicago         29474             861152
Los Angeles     39314             1233271
London          42125             1065716
New York        64095             1254530
San Francisco   28815             1154659
Seattle         22302             832023
Toronto         24713             929683

Table 5.5: Dataset For Per-User Analysis

• The principal category of interest, based on the category words that appear in the user’s tweets
• The top action and temporal co-occurrences with category words

The results presented a highly consistent picture, which allows us to identify user interest very easily in most cases. We present here two sample users and the results obtained for them. The principal category is calculated as follows. We sum up the number of occurrences of category words over each category. Given these six values, we take their mean and report all categories whose number of occurrences exceeds this mean as possible categories of interest:

1. This is a sample user from Atlanta. Figure 5.18 shows the category-wise distribution of words in this user’s tweets.
   • Username: ABC (real username masked for privacy reasons)
   • No. of Tweets: 2945
   • Top 5 category words with corresponding occurrences:
     (a) internet (585 occurrences)
     (b) blog (250 occurrences)
     (c) googl (stemmed form of “google”) (66 occurrences)
     (d) site (47 occurrences)
     (e) websit (stemmed form of “website”) (45 occurrences)
   • Principal Category of Interest: Technology (38.88% of this user’s tweets)
Figure 5.18: Category-wise word distribution for ABC

   • Top action-category pair: “internet” + “busi” (“Internet” + “Busy”) (102 occurrences)
   • Top temporal-category pair: “internet” + “time” (19 occurrences)

2. This is a sample user from Los Angeles. Figure 5.19 shows the category-wise distribution of words in this user’s tweets.
   • Username: DEF (real username masked for privacy reasons)
   • No. of Tweets: 11712
   • Top 5 category words with corresponding occurrences:
     (a) nba (752 occurrences)
     (b) golf (418 occurrences)
     (c) race (396 occurrences)
     (d) footbal (stemmed form of “football”) (319 occurrences)
     (e) game (318 occurrences)
   • Principal Category of Interest: Sports (25.96% of this user’s tweets)
   • Top action-category pair: “game” + “plai” (“Game” + “Play”) (16 occurrences)
   • Top temporal-category pair: “nba” + “time” (“NBA” + “Time”) (12 occurrences)
Figure 5.19: Category-wise word distribution for DEF

For the above two users, the interest category was very easy to infer, since all of the top five category words in their tweets belonged to the same category. Not all users showed such a clear preference, however. Consider, for example, the following user from Los Angeles. Figure 5.20 shows the category-wise distribution of words in this user’s tweets.

• Username: GHI (real username masked for privacy reasons)
• No. of Tweets: 2471
• Top 5 category words with corresponding occurrences:
  1. rock (30 occurrences)
  2. site (13 occurrences)
  3. food (9 occurrences)
  4. technolog (stemmed form of “technology”) (8 occurrences)
  5. hotel (7 occurrences)
• Principal Category of Interest: Music and Technology (1.54% of this user’s tweets). For this particular user, we see from Figure 5.20 that there is no clear single peak. So, we apply our mean-value-based heuristic and obtain a tag containing not one but two principal categories of interest.
• Top action-category pair: “site” + “need” (9 occurrences)
Figure 5.20: Category-wise word distribution for GHI

• Top temporal-category pair: none

To summarize:

• We have developed a scheme to tag users according to their interests as inferred from their tweets.
• The category-wise distribution varies vastly between users. For certain users, the interest peaks are very obvious, but this is not the case for other users. For the latter set of users, we report all categories above the mean as possible categories of interest and use them while matchmaking in case of vague inputs to R-U-In?

5.10 Validation of Result Stability by Measuring KL Divergence

In probability theory and information theory, the Kullback-Leibler divergence (also information divergence, information gain, or relative entropy) is a non-symmetric measure of the difference between two probability distributions P and Q. Given two probability distributions, P and Q, of a discrete random variable X, the Kullback-Leibler divergence (or KL divergence) is defined as:

\[ D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)} \]
We define the six categories as the six discrete values of the variable X. We define the probability distributions as follows. Consider a user A. For each category word which appears in any of A’s tweets, we sum up the number of occurrences of words for each category. Let the values obtained for the six categories be x1, x2, ..., x6. Note that the xi are all integers. We now normalize these values by dividing each by the sum of the xi. The values thus obtained form the required distribution.

For this experiment, we present the result for San Francisco. Tweets for this city were gleaned over the period from 18th March 2009 to 10th April 2009. We split this period into 4 parts of six days each. For each of these parts, we calculated the per-user distribution of X. We did this for the top 1000 users (in terms of number of tweets) for San Francisco. The aim behind such a calculation is twofold:

• To get a better understanding of how user interests vary with time.
• To validate the history profiling of a person by showing that a person’s interests do not fluctuate very often. By showing this, we can say that the history profile of a user, as created by our analysis of the user’s tweets, is a fairly accurate indicator of the user’s interests.

Firstly, we removed users with non-admissible values of X from our set of observations. This leaves us with 944 users. For these four sets of data over six days each, we calculate the KL divergence for each pair of consecutive six-day periods (referred to as weeks in the figures). These are plotted as scatter plots with the user number on the x-axis and the corresponding KL divergence value on the y-axis. These plots are shown in Figure 5.21, Figure 5.22 and Figure 5.23. Furthermore, to get an idea of the variation over a larger time period, we did a similar plot for the first and last six-day periods. This plot is shown in Figure 5.24.
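The normalization and divergence computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names and the per-category counts are hypothetical, and the rule for rejecting a user (an all-zero count vector, or a category that has probability zero in Q but not in P) is only one plausible reading of “non-admissible values of X”.

```python
import math
from typing import List, Optional, Sequence


def to_distribution(counts: Sequence[int]) -> Optional[List[float]]:
    """Normalize per-category occurrence counts x1..x6 into probabilities.

    Returns None when the user has no category words at all (assumed here
    to be one kind of non-admissible value).
    """
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> Optional[float]:
    """D_KL(P || Q) in bits, summed over the discrete categories.

    Terms with P(i) == 0 contribute nothing. If Q(i) == 0 while P(i) > 0
    the divergence is unbounded, so None is returned (assumed inadmissible).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return None
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical category counts for one user in two consecutive six-day
# periods (order: Movies, Sports, Dance, Music, Food, Technology).
period_1 = [12, 30, 0, 5, 8, 45]
period_2 = [10, 28, 0, 7, 9, 46]

p = to_distribution(period_1)
q = to_distribution(period_2)
divergence = kl_divergence(p, q) if p and q else None
print("KL divergence between the two periods:", divergence)
```

Because the measure is not symmetric, `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; the sketch treats the earlier period as P and the later one as Q, which is an assumption rather than something the text specifies.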
Figure 5.21: KL Divergence between weeks 1 and 2

Figure 5.22: KL Divergence between weeks 2 and 3
Figure 5.23: KL Divergence between weeks 3 and 4

Figure 5.24: KL Divergence between weeks 1 and 4

We notice that there is a band centred around the value 0.0 in the first three plots. This band appears to spread in the last plot. However, the values obtained remain within a small range close to zero for most of the users. Thus, for this data set, user interests do not appear to fluctuate with a very high frequency, which supports our analysis based on a Twitter-feed history.
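As a recap of the tagging heuristic from Section 5.9, whose per-category counts are the same ones normalized in Section 5.10, the mean-value rule can be sketched as below. The function name `principal_categories` and both sets of counts are hypothetical and chosen only to illustrate a single-peak and a two-peak user; they are not data from the study.

```python
from statistics import mean
from typing import Dict, List


def principal_categories(counts: Dict[str, int]) -> List[str]:
    """Return every category whose occurrence count exceeds the mean.

    counts maps each category to the summed occurrences of its category
    words in one user's tweets. Results are ordered by descending count.
    """
    if not counts:
        return []
    threshold = mean(list(counts.values()))
    above = [category for category, n in counts.items() if n > threshold]
    return sorted(above, key=lambda category: counts[category], reverse=True)


# Hypothetical user with one dominant interest.
single_peak = {"Movies": 40, "Sports": 25, "Dance": 3,
               "Music": 60, "Food": 30, "Technology": 1100}

# Hypothetical user without a clear single peak.
two_peaks = {"Movies": 4, "Sports": 2, "Dance": 0,
             "Music": 30, "Food": 9, "Technology": 21}

print(principal_categories(single_peak))
print(principal_categories(two_peaks))
```

A strict comparison (`n > threshold`) follows the wording “exceeding this value”; a user whose counts are all equal would therefore receive no tag under this sketch.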
Chapter 6

Conclusion and Further Work

In the upcoming field of real-time activity-oriented social networking, we feel that the success of a venture would depend upon the product’s portability and ease of access. Another crucial factor is the ability of the service to deliver correct matches for a given interest. Considering the case of R-U-In?, we have demonstrated two enhancements.

• Firstly, we have presented a mobile-phone-based framework for R-U-In? and demonstrated its development on the Android platform.
• Secondly, we have shown that it is possible to create user profiles based on users’ complete online presence. As part of this project, we consider the user’s presence on Twitter. We have shown that:
  – Users do tweet about their real-time activities and interests.
  – Using a co-occurrence-based shallow NLP approach, out of the tweets made by a user containing words of a certain category, a significant number contain a co-occurrence with descriptive action/temporal words.
  – By manually developing a benchmark, we have demonstrated a reasonable level of confidence in the results obtained by co-occurrence measures.
  – After conducting a per-user analysis of tweets, we have been able to tag users by identifying their top category of interest based on their Twitter feeds.
  – Finally, we validated our analysis by calculating the KL divergence over a set of data spanning almost one month.
This area holds a lot of potential for future work. Some of this is mentioned below:

• The concept of mobile-phone-based real-time social networking can be extended to other operating systems, like Symbian, in a fashion analogous to that employed by us.
• There are various other sources of contextual communication which have a huge database of information. Such sources can be similarly exploited to develop a rich-presence-based system. We can thus create an aggregation of user information based on such sources.
• User interests may diverge from month to month, so data collected over longer durations (for example, one year) could help in studying the variation of user interests with time, though the strategy of analysis would follow a path similar to the one we have shown.
• A more thorough analysis of the data may be carried out by employing techniques from deep NLP. The trade-off, however, would be that the time taken would be much larger.
Appendices

Appendix A

Our Ontology

The following tables show the six categories and their constituent words along with their stemmed versions.
Movies                      Sports                      Dance
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
movie         movi          sport         sport         dance         dance
oscar         oscar         game          game          latin         latin
hollywood     hollywood     fishing       fish          salsa         salsa
action        action        rugby         rugbi         ballroom      ballroom
adventure     adventur      soccer        soccer        jazz          jazz
animated      anim          football      footbal       ballet        ballet
traditional   tradit        swimming      swim          modern        modern
stop-motion   stop-motion   diving        dive          swing         swing
biography     biographi     archery       archeri       interpretive  interpret
comedy        comedi        race          race          tap           tap
children      children      climbing      climb         lyrical       lyric
crime         crime         skiing        ski           hip-hop       hip-hop
disaster      disast        biking        bike          hiphop        hiphop
drama         drama         baseball      basebal       ensemble      ensembl
fantasy       fantasi       ball          ball          point         point
horror        horror        cricket       cricket       flamenco      flamenco
sci-fi        sci-fi        surfing       surf          club          club
short         short         boarding      board
thriller      thriller      skating       skate
war           war           bowling       bowl
western       western       cycling       cycl
film          film          wrestling     wrestl
theatre       theatr        judo          judo
popcorn       popcorn       karate        karate
tv            tv            fencing       fenc
television    televis       boxing        box
show          show          billiards     billiard
sitcom        sitcom        pool          pool
soap          soap          snooker       snooker
episode       episod        country       countri
series        seri          gym           gym
cnn           cnn           gymkhana      gymkhana
nbc           nbc           jumping       jump
channel       channel       golf          golf

Table A.1: Ontology (part 1)

Music                       Sports (continued)          Food
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
music         music         handball      handbal       coffee        coffe
rock          rock          hockey        hockey        bar           bar
classical     classic       rally         ralli         lunch         lunch
indian        indian        kayaking      kayak         restaurant    restaur
fusion        fusion        canoe         cano          cafe          cafe
metal         metal         rafting       raft          hotel         hotel
blues         blue          rowing        row           snack         snack
african       african       tennis        tenni         dinner        dinner
folk          folk          badminton     badminton     meal          meal
rap           rap           running       run           food          food
pop           pop           walking       walk          hungry        hungri
song          song          chess         chess         wine          wine
hit           hit           sudoku        sudoku        vodka         vodka
chartbuster   chartbust     bat           bat
album         album         match         match         whisky        whiski
band          band          superbowl     superbowl     breakfast     breakfast
guitar        guitar        nba           nba           soup          soup
drummer       drummer       fifa          fifa          sausage       sausag
bassist       bassist       cup           cup           pie           pie
bass          bass          league        leagu         chicken       chicken
guitarist     guitarist     jogging       jog           noodle        noodl
singer        singer        olympics      olymp         bread         bread
vocalist      vocalist                                  sauce         sauc
                                                        cake          cake
                                                        pizza         pizza
                                                        sandwich      sandwich
                                                        salad         salad
                                                        tea           tea
                                                        milk          milk

Table A.2: Ontology (part 2)

Technology                  Temporal                    Action
Root Word     Stemmed Word  Root Word     Stemmed Word  Root Word     Stemmed Word
computer      comput        morning       morn          busy          busi
keyboard      keyboard      evening       even          available     avail
mouse         mouse         afternoon     afternoon     feel          feel
cd            cd            noon          noon          play          plai
internet      internet      night         night         work          work
net           net           hour          hour          see           see
site          site          today         todai         watch         watch
website       websit        tonight       tonight       love          love
facebook      facebook      tonite        tonit         hate          hate
orkut         orkut         yesterday     yesterdai     look          look
disk          disk          minute        minut         need          need
windows       window        year          year          thank         thank
linux         linux         month         month         party         parti
mac           mac           day           dai           think         think
unix          unix          time          time          cook          cook
blog          blog          monday        mondai        sleep         sleep
email         email         tuesday       tuesdai       class         class
gmail         gmail         wednesday     wednesdai     office        offic
google        google        thursday      thursdai      drive         drive
microsoft     microsoft     friday        fridai        trek          trek
mobile        mobile        saturday      saturdai      read          read
youtube       youtub        sunday        sundai        write         write
ipod          ipod          week          week          type          type
iphone        iphon         weekday       weekdai       wait          wait
ebay          ebai          weekend       weekend       go            go
laptop        laptop        am            am            do            do
notebook      notebook      pm            pm            fight         fight
desktop       desktop                                   eat           eat
                                                        buy           bui
                                                        sell          sell
                                                        download      download
                                                        upload        upload
                                                        upgrade       upgrad
                                                        listen        listen
                                                        speak         speak
                                                        meet          meet
                                                        enjoy         enjoi
                                                        search        search
                                                        wait          wait
                                                        sing          sing
                                                        report        report

Table A.3: Ontology (part 3)
Appendix B

Charts For The Remaining Nine Cities

B.1 Category-wise Distribution For These Cities

Figure B.1: Atlanta
Figure B.2: Austin
Figure B.3: Boston
Figure B.4: Chicago
Figure B.5: Los Angeles
Figure B.6: London
Figure B.7: San Francisco
Figure B.8: Seattle
Figure B.9: Toronto

B.2 Category-wise Action and Temporal Co-Occurrences For These Cities

Figure B.10: Atlanta - Action
Figure B.11: Atlanta - Temporal
Figure B.12: Austin - Action
Figure B.13: Austin - Temporal
Figure B.14: Boston - Action
Figure B.15: Boston - Temporal
Figure B.16: Chicago - Action
Figure B.17: Chicago - Temporal
Figure B.18: Los Angeles - Action
Figure B.19: Los Angeles - Temporal
Figure B.20: London - Action
Figure B.21: London - Temporal
Figure B.22: San Francisco - Action
Figure B.23: San Francisco - Temporal
Figure B.24: Seattle - Action
Figure B.25: Seattle - Temporal
Figure B.26: Toronto - Action
Figure B.27: Toronto - Temporal
Bibliography

[1] Tim Finin Belle Tseng Akshay Java, Xiaodan Song. Why we twitter: Understanding microblogging usage and communities. pages 56–65, 2007.
[2] http://jcmc.indiana.edu/vol13/issue1/boyd.ellison.html. Social networks timeline.
[3] Koustuv Dasgupta Sumit Mittal Seema Nagar Saguna Nilanjan Banerjee, Dipanjan Chakraborty. R-u-in? - exploiting rich presence and converged communications for next-generation activity-oriented social networking. MDM to appear.
[4] Bala Mulloth Nina D. Ziv. An exploration on mobile social networking: Dodgeball as a case in point. Copenhagen, Denmark, 2006.
[5] Doree Seligmann Shreeharsh Kelkar, Ajita John. An activity-based perspective of collaborative tagging. 2007.
[6] Hideaki Takeda Susumu Kunifuji Toshiyuki Hirata, Ikki Ohmukai. Personal network aggregation system for real-time communication support. 2007.
