An Improved People-Search Technique for Directed Social Network Graphs


Authors: Vacharasintopchai, Thiti and Nguyen-Huu, Phong

Issue Date: 11-Dec-2009

Type: Article

Series/Report no.: Proc. 2nd International Conference on Robotics, Informatics, and Intelligent Technology (RIIT2009);

Abstract: Social networks offer users remarkable opportunities to create content and share their experiences, and the number of users joining these networks has been rising dramatically. However, several users in a social network may share the same name. This causes name ambiguity, in which a search engine returns homogeneous search results for each queried name. To solve this problem, we propose an approach that improves search results for finding friends within a large social network by using friendships among users as the backbone feature. Our approach selects the highest-ranked vertices as seeds using the PageRank algorithm before computing approximate shortest paths in a directed graph. We also retrieve real data from the social network Twitter to verify our approach. Results show that our approach outperforms the SeedBase approach, which selects seeds randomly, by a large margin.

URI: http://dspace.siu.ac.th/handle/1532/718


Transcript

The 2nd International Conference on Robotics, Informatics, and Intelligent Technology (RIIT2009)
December 11-14, 2009 at Bangkok, Thailand

AN IMPROVED PEOPLE-SEARCH TECHNIQUE FOR DIRECTED SOCIAL NETWORK GRAPHS

Thiti Vacharasintopchai, Nguyen Huu Phong
School of Technology, Shinawatra University
Pathumthani, Thailand 12160
Email: thitiv@siu.ac.th, phongnh174@yahoo.com.vn

ABSTRACT

Social networks offer users remarkable opportunities to create content and share their experiences. The number of users joining these social networks has been rising dramatically. However, in a social network several users may share the same name. This causes name ambiguity, in which a search engine returns homogeneous search results for each queried name. To solve this problem, we propose an approach to improve search results for finding friends within a large social network by using friendships among users as our backbone feature. Our approach selects the highest-ranked vertices as seeds using the PageRank algorithm before computing approximate shortest paths in a directed graph. We also retrieved real data from the social network Twitter to verify our approach. The results show that our approach outperforms the SeedBase approach, which selects seeds randomly, by a large margin.

Index Terms: Search algorithm, social network analysis, authority analysis, shortest-path algorithm, graph algorithm

1. INTRODUCTION

Recently, social networks have seen explosive growth in popularity on the Web, and the number of people joining these networks is increasing significantly. These social networks help users create a network of friends, maintain relationships with long-distance friends, find friends, and share information across networks. Moreover, in the very near future, social network sites will become an important source of knowledge and information [1].

Popular social networks on the Web include MySpace and Twitter. These sites help users create and customize their personal information, blogs, multimedia, groups and other features. MySpace began in July 2003 and was the largest social network in the world in November 2006, with more than 130 million users [2]. Twitter, a microblogging service, has increased its number of users significantly since it began in October 2006 [3].

Research demonstrates that the majority of user activity on social networks consists of searching for friends with whom users already have offline relationships, in order to reconnect with them [4, 5]. Users can find their friends by providing their names in addition to other information. However, searches on large popular sites return longer lists of irrelevant users than one might imagine.

In this paper, we propose an approach to improve searching for friends in a social network. Our approach employs the PageRank algorithm to find seeds, which are then used to compute approximate shortest paths within the social network. We use the friendship among users as the backbone feature. We also conduct experiments to verify the effectiveness of the proposed approach.

The rest of this paper is organized as follows. First, we review previous developments in people-search techniques in Section 2. Then we present our approach to searching for friends in a social network in Section 3. Results are presented in Section 4. Finally, conclusions are discussed in Section 5.

2. RELATED WORK

The top web search providers, such as Google and Yahoo, offer standard search services in which users search by entering keywords. A keyword may be the lyrics of a song, a movie trailer, the show time of a fashion show, the title of a textbook, or the name of a friend. In the traditional method, search engines match the provided keyword against the contents of their database and return a list of homogeneous search results to users.

In the web search and information retrieval area, the accuracy of the search results provided by search engines is determined by a ranking algorithm. PageRank is one of the well-known algorithms in this area [6, 7]. This algorithm takes the number of forward links and the number of back links of a web page as important factors in ranking each web page [8]. In this way, the search engine returns to users a list of ordered and ranked web pages for each particular keyword. To get better search results, a user can provide further information about the searched keyword. For instance, when looking for a friend, users could add the name of the high school where they studied together and the school year to the input, so that search engines are able to filter out irrelevant data and return more relevant results.

Furthermore, search results can be improved by using
implicit user information such as social annotation [7] and relationship queries [9, 10].

To improve web search results, the authors of [7] discuss two algorithms, namely SocialSimRank and SocialPageRank. The former is based on the observation that when users browse and annotate a web page, the annotations can be a good indicator of the web page content [7]. The latter is based on another observation: the number of users who annotate a web page can indicate the quality of the web page [7]. That research shows that both algorithms can improve web search significantly [7].

Other research observes that the top-ranked web pages for a pair of entities may contain relationships between the two entities, and that these relationships can be used to improve web search [10].

In the social network context, search engines could apply the same patterns as in the web search area for the particular purpose of people search, as mentioned in [5]. However, this approach can run into the same problem as in web search, namely returning the same search result to every user who provides the same keyword [5, 9].

In general, a social network can be represented as a graph in which vertices represent users and edges represent their relationships. The simplest form of relationship is friendship, where a user is a friend of another. In a social network, when users search for a person's name, they are more likely to recognize people who have a closer relationship with them. In other words, the friend a person is looking for is more likely to be the person who has the shortest path to them in the relationship graph. Figure 1 shows an example of searching for a person's name in the social network. The user named Thuan searches for his friend whose name is Huyen. Two results are returned: the first person is at distance 1 and the second person is at distance 2. The correct result should be the first person, since she is closer to Thuan than the second person.

[Figure 1: Searching for People Name in a Social Network]

For this case, the authors of [5] use the approximate shortest path in a friendship graph as a factor in their ranking algorithms. These algorithms include on-the-fly ranking and seed-based ranking. The former computes distances at scoring time by running Breadth First Search (BFS) from a finder to all users [5]. The amount of computation is limited by stopping the BFS after a desired number of hops. The latter uses seeds and computes approximate shortest distances from these seeds to all users [5]. The research demonstrates that the seed-based ranking algorithm outperforms the other algorithms in terms of performance and precision [5].

3. PROPOSED APPROACH

In this section, we propose an approach that selects better seeds than the approach discussed in Section 2, in which seeds are selected randomly before computing approximate shortest paths. In our approach, all vertices are first ranked using the PageRank algorithm. Then these vertices are sorted in descending order of rank, and the top-ranked vertices are selected as seeds. After that, the shortest paths from these seeds to all vertices are computed.

3.1 Seed Distances

According to [11, 12], for a graph with n vertices and m edges, when distance information is precomputed, a query for the distance between any pair of vertices takes less time and space than computing all-pairs shortest paths.

The authors of [5] applied this concept by randomly selecting a small fraction of vertices. These seeds (landmarks) are used as navigational beacons in their friendship graph. The shortest paths from these seeds to all vertices are then computed. Later, the shortest path between any pair of vertices can be queried.

Figure 2 shows an example of the seed distance approach. Suppose that we need to find the shortest path between Vertex 1 and Vertex 7, and that Vertex 5 is chosen as a seed. We first find the shortest path between Vertex 5 and Vertex 1, which is D51 = 1. In the same manner, the shortest path between Vertex 5 and Vertex 7 is D57 = 1. Finally, the shortest path between Vertex 1 and Vertex 7 is estimated as the sum of these two shortest paths, D17 = D51 + D57 = 2. In this case, the estimated shortest path is correct. However, consider the case in which we need to compute the shortest path between Vertex 1 and Vertex 2, again with Vertex 5 as the seed. With this approach, the shortest path between Vertex 1 and Vertex 2 is estimated as D12 = D51 + D52 = 3, which is incorrect.

Table 1 shows the pre-processing result from seed vector 1. From this table, we can find the seed distance between any pair of vertices by computing the sum of their distances to the seed.
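The seed-distance estimate in the worked example above can be sketched in Python. This is only an illustration under stated assumptions: the edge list is hypothetical (the text does not list the edges of Figure 2) and was chosen so that the distances match the quoted values D51 = 1, D57 = 1 and D52 = 2; edges are treated as undirected and unweighted, so plain breadth-first search yields the exact hop distances; the function names are made up for this sketch.

```python
from collections import deque

# Hypothetical edge list (an assumption): Vertex 1 is adjacent to
# Vertices 2-6, and Vertex 5 is also adjacent to Vertex 7.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (5, 7)]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Exact hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def seed_estimate(seed_dist, u, v):
    """Estimate the u-v distance as the sum of their distances to the seed."""
    return seed_dist[u] + seed_dist[v]


adj = build_adjacency(EDGES)
d5 = bfs_distances(adj, 5)          # precomputed once for seed Vertex 5

print(seed_estimate(d5, 1, 7))      # 2, equal to the exact distance
print(seed_estimate(d5, 1, 2))      # 3, an overestimate
print(bfs_distances(adj, 1)[2])     # 1, the exact distance for comparison
```

The estimate can never be shorter than the true distance, because it is the length of a real route through the seed; it is exact only when some shortest path between the two vertices passes through the seed.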
[Figure 2: Example Graph of Vertices and Edges]

In a social network with 100 million users, we would need up to 10^16 computations to know the distances among all vertices. By using a small fraction of vertices as seeds, the required runtime and space can be reduced significantly.

[Table 1: Example of Pre-processing Seed Distance]

The approach in [5] selects the seeds randomly, which may result in lower accuracy than if better seeds were chosen. Therefore, we propose an approach that selects the most important vertices as seeds instead of choosing seeds randomly. Since our social network forms a directed graph, which shares a common feature with the web links used by the PageRank algorithm, we decided to use the PageRank algorithm to select seeds in our approach.

3.2 PageRank

The PageRank algorithm is used to rank web pages based on the number of forward links and back links of a web page [8, 13, 14, 15, 16]. The intuition behind this algorithm is that when users link from one web page to other web pages, this can indicate endorsement of the linked content [13, 14]. We observed that in our social network, Twitter, one tweeter may follow other tweeters (friends) and may be followed by others (followers). Therefore, applying the PageRank algorithm could help to find better seeds.

We use the PageRank algorithm in our approach to rank each tweeter in our social network, treating friends as forward links and followers as back links. According to [8, 15], the PageRank algorithm can be stated as follows. Given a graph of Twitter, the number of Followers (F) that follow a Tweeter (T) is denoted n. The parameter d is the damping factor, which ranges between 0 and 1 and is set to 0.85. The number of Friends that a Tweeter follows is defined as C(T). The PageRank of a Tweeter, PR(T), is computed as follows:

PR(T) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(F_i)}{C(F_i)}    (1)

where F_i is the i-th Follower of T.

To compute the PageRank of all Tweeters, we first set the PageRank of each of them to one. Then we iterate over all tweeters and compute their PageRank using Equation 1. Figure 3 shows an example directed graph in Twitter. In this graph, the user named Thuan has five followers. Each of them also has some other followers. From Equation 1, Thuan has the highest rank since he is followed by many important followers. The next ranked is Huyen, since she has more followers than Hanh and Thanh. The remaining followers are ranked equally.

[Figure 3: Example Directed Graph in Twitter]

3.3 Vectors Distances

The distance vectors of seeds consist of the distances from a fraction of the vertices (the seeds) to all vertices. First, the vertices are ranked using the PageRank algorithm as described in Section 3.2. Then they are sorted in descending order so that the highest-ranked ones come first. Next, a fraction of the vertices is selected from the top as seeds. Later, the seed distances from these selected seeds to all vertices are computed. The exact shortest path between two given vertices is computed using the classical Dijkstra's algorithm as described in [17, 18], instead of the BFS and Map-reduce computation presented in [5]. Although there are several ways to achieve faster running times, such as implementing Dijkstra's algorithm with a Fibonacci heap [11, 19], this goes beyond our scope. Finally, the approximate shortest path between two given vertices is the smallest sum of the shortest paths from these vertices to the selected seeds.

In Figure 2, suppose that we choose Vertex 1 and Vertex 5 as seeds and need to find the approximate shortest
path between Vertex 5 and Vertex 7. The seed distances from Vertex 1 to all vertices are D1 = [0, 1, 1, 1, 1, 1, 2]. Likewise, the distances from Vertex 5 to all vertices are D5 = [1, 2, 2, 2, 0, 2, 1]. The approximate shortest paths between the two vertices using the two seed distance vectors are 3 and 1, respectively. The approximate shortest path as described above is the smaller distance, 1.

3.4 Experimental Setup

First, we selected a pair of vertices randomly, since we did not have access to logs of name queries from the data source (Section 3.5). We then computed the approximate shortest path between them using our approach. This result was compared to the on-the-fly ranking described in [5], since it yields 100% accuracy.

To assess the performance of our approach, we implemented the seed-based ranking algorithm (SeedBase) as described in [5]. For comparative purposes, we modified this algorithm by replacing its combination of BFS and Map-reduce with Dijkstra's algorithm.

We ran each experiment 10 times to compute accuracies and running times. We performed the experiments on a virtual machine with a 1.8 GHz processor and 512 MB of RAM.

3.5 Data Collection

We evaluated the accuracy of our approach with real data from the social network Twitter. These data were gathered using the snowball method described in [2]. The algorithm was executed in several steps: selecting tweeters as initial seeds, then running a BFS over all of their friends until it reaches a desired number of hops.

First, some tweeters were selected as initial seeds. These tweeters were retrieved from the Twitter public timeline, in which a list of 20 tweeters is generated randomly. As a result, these tweeters may not be connected to each other. Since the focus of this research is to examine the relationships among users, we decided to pick only one tweeter from the public timeline at a time.

Second, the number of friends of each chosen tweeter varies. Twitter limits the number of tweeters that one can follow to 2000. Even though this limit can be lifted as the number of one's followers increases, it can be observed that such tweeters are very likely to represent an organization rather than an individual. For this reason, these tweeters were not retrieved.

Finally, the number of hops can also be varied, depending on the maximum desired size of the community. For example, if each tweeter has 10 friends and the number of hops is equal to 5, the maximum size of this community is 1 x 10^5.

4. EXPERIMENTS AND RESULTS

In this section, we present our results and a discussion of the two methods: SeedBase and PageRank.

In Experiment 1, two sub-experiments were conducted using two different datasets. The maximum size of each dataset was set to 125. The first dataset contained 87 vertices and 103 edges. The second dataset contained 107 vertices and 120 edges. The number of vertices was less than the maximum of 125 because some tweeters had protected data, which are only accessible to those in their friend lists. The number of seeds was varied from 1 to 10 in increments of 1. The mean accuracies of SeedBase and PageRank were compared to the on-the-fly ranking, which yields 100% accuracy. The outcomes are presented in Table 2 and Table 3, respectively. These data are also plotted in Figure 4 for ease of comparison.

Table 2: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 1

#seed   SeedBase (%)   PageRank (%)
1       14             46
2       14             78
3       21             74
4       20             83
5       27             85
6       38             80
7       35             81
8       37             95
9       47             95
10      34             99

Table 3: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 2

#seed   SeedBase (%)   PageRank (%)
1       4              41
2       41             15
3       21             52
4       17             81
5       26             78
6       30             83
7       27             79
8       43             86
9       39             88
10      41             87

With both datasets, the accuracies of both PageRank and SeedBase rose as the number of seeds increased. However, PageRank outperformed SeedBase by a large margin. Even with only one seed,
PageRank's accuracy was between 40% and 50%, whereas the accuracy of SeedBase was below 20%. PageRank's accuracy increased slowly at first, then grew rapidly after two or four seeds. After that, it kept increasing slowly and reached above 90% when the number of seeds was between eight and ten. SeedBase's accuracy increased steadily as the number of seeds increased but reached only about 40%, which is much lower than that of PageRank.

[Figure 4: Accuracies of SeedBase and PageRank from Experiment 1 Dataset 1 and Dataset 2]

In Experiment 2, two different datasets were also used. However, the maximum size of each dataset was increased to 1000. The first dataset contained 181 vertices and 230 edges. The second dataset contained 482 vertices and 550 edges. The number of seeds was varied from 1 to 25 in increments of 1. The mean accuracies of SeedBase and PageRank were compared to the on-the-fly ranking. The results are plotted in Figure 5.

[Figure 5: Accuracies of SeedBase and PageRank from Experiment 2 Dataset 1 and Dataset 2 (x-axis: Number of Seeds)]

In these experiments, the accuracies of both PageRank and SeedBase also rose as the number of seeds increased. PageRank's accuracy was between 18% and 30% with the first seed, whereas the accuracy of SeedBase was lower than 10%. PageRank's accuracy increased steadily at first, then grew rapidly after seven or thirteen seeds, and then kept increasing slowly until it reached above 90% when the number of seeds was between nineteen and twenty-five. SeedBase's accuracy also increased steadily but reached only about 66% and 32%.

In summary, with all datasets PageRank outperformed SeedBase significantly. In addition, the accuracies of both PageRank and SeedBase tended to increase as the number of seeds grew.

Our Twitter network forms a directed graph in which the edges from one tweeter to others have a direction. As a result, a tweeter has a higher rank than others when many highly ranked tweeters follow that tweeter. Our results are also in agreement with the results from [20], where a centrality method is used for choosing seeds (landmarks) in undirected graphs, and where vertices at the center of the graph, through which many shortest paths pass, are important.

The PageRank method takes a longer runtime than SeedBase. The reason is that, with the PageRank Equation 1, each vertex may be traversed several times to rank all vertices before seeds are picked, so that in the worst case the runtime is O(n^2), whereas in SeedBase the runtime is a constant O(1), since it is spent only on picking seeds. However, in our social network Twitter, the number of friends that a tweeter follows is many times smaller than the number of all tweeters, so our approach is reasonable and effective.

5. CONCLUSIONS

In this research, the approximate shortest path between tweeters in Twitter is used as the backbone factor in ranking search results. We applied our strategy by using the PageRank algorithm to select the most important tweeters before computing approximate shortest paths among them. In terms of accuracy, the results show that our strategy outperforms the SeedBase method in [5], which selects seeds randomly, by a large margin. The results also show that high accuracy can be achieved with a small fraction of seeds. Our approach uses a small number of seeds (about 2%-5% of the vertices) but yields very high accuracy. Applying this approach in social networks will make the search results for finding friends more effective. Future work includes reducing the preprocessing time by speeding up the seed-ranking process. The source code, implemented in the PHP programming language, has been made available.
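The strategy summarized above (rank tweeters with PageRank, keep the top-ranked ones as seeds, then take the smallest sum of seed distances) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' PHP implementation: the toy follow graph, the account names `u1` to `u5`, the function names and the iteration count are all hypothetical; the rank update is the unnormalized PageRank form of [15] with d = 0.85 and initial ranks of one; and hop distances are taken on an undirected view of the graph using breadth-first search rather than Dijkstra's algorithm, which gives the same distances when every edge has unit weight.

```python
from collections import deque


def pagerank(follows, d=0.85, iterations=50):
    """follows[u] is the set of accounts that u follows (forward links)."""
    nodes = set(follows) | {v for vs in follows.values() for v in vs}
    followers = {u: set() for u in nodes}
    for u, vs in follows.items():
        for v in vs:
            followers[v].add(u)
    rank = {u: 1.0 for u in nodes}          # every rank starts at one
    for _ in range(iterations):
        rank = {
            t: (1 - d) + d * sum(rank[f] / len(follows[f]) for f in followers[t])
            for t in nodes
        }
    return rank


def top_seeds(rank, k):
    """Return the k highest-ranked accounts, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]


def hop_distances(follows, source):
    """BFS hop distances on the undirected view of the graph (an assumption)."""
    adj = {}
    for u, vs in follows.items():
        for v in vs:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def approx_distance(seed_dists, u, v):
    """Smallest sum of distances from u and v to any seed."""
    return min(
        (dist[u] + dist[v] for dist in seed_dists.values() if u in dist and v in dist),
        default=float("inf"),
    )


# Hypothetical follow graph: Thuan has five followers, Huyen two, Hanh one.
follows = {
    "Huyen": {"Thuan"}, "Hanh": {"Thuan"}, "Thanh": {"Thuan"},
    "u1": {"Thuan"}, "u2": {"Thuan"},
    "u3": {"Huyen"}, "u4": {"Huyen"}, "u5": {"Hanh"},
}

rank = pagerank(follows)
seeds = top_seeds(rank, 2)
seed_dists = {s: hop_distances(follows, s) for s in seeds}

print(seeds)                                   # ['Thuan', 'Huyen']
print(approx_distance(seed_dists, "u3", "u4"))  # 2
```

In this toy graph the estimate through Thuan alone would be 4, while the second seed, Huyen, lies on the true shortest path between `u3` and `u4` and yields the exact distance 2, which mirrors why adding well-placed seeds improves accuracy.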
6. REFERENCES

[1] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Measurement and analysis of online social networks", Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC 07, pp. 29-42, 2007.

[2] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon and H. Jeong, "Analysis of topological characteristics of huge online social networking services", Proceedings of the 16th International Conference on World Wide Web, WWW 07, pp. 835-844, 2007.

[3] A. Java, X. Song, T. Finin, and B. Tseng, "Why we twitter: understanding microblogging usage and communities", Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD 07, pp. 56-65, 2007.

[4] A. N. Joinson, "Looking at, looking up or keeping up with people?: motives and use of facebook", Proceeding of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, CHI 08, pp. 1027-1036, 2008.

[5] M. V. Vieira, B. M. Fonseca, R. Damazio, P. B. Golgher, D. d. Reis and B. Ribeiro-Neto, "Efficient search ranking in social networks", Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 07, pp. 563-572, 2007.

[6] E. Amitay, D. Carmel, N. Har'El, S. Ofek-Koifman, A. Soffer, S. Yogev and N. Golbandi, "Social search and discovery using a unified approach", Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, HT 09, pp. 199-208, 2009.

[7] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, "Optimizing web search using social annotations", Proceedings of the 16th International Conference on World Wide Web, WWW 07, pp. 501-510, 2007.

[8] L. Page, S. Brin, R. Motwani and T. Winograd, "The pagerank citation ranking: Bringing order to the web", Technical report, Stanford Digital Library Technologies Project, 1998.

[9] D. V. Kalashnikov, R. Nuray-Turan and S. Mehrotra, "Towards breaking the quality curse: a web-querying approach to web people search", Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 08, pp. 27-34, 2008.

[10] G. Luo, C. Tang and Y. Tian, "Answering relationship queries on the web", Proceedings of the 16th International Conference on World Wide Web, WWW 07, pp. 561-570, 2007.

[11] M. Thorup and U. Zwick, "Approximate distance oracles", Journal of the ACM 52, 1, pp. 1-24, 2005.

[12] S. Baswana and S. Sen, "Approximate distance oracles for unweighted graphs in expected O(n^2) time", ACM Transactions on Algorithms 2, 4, pp. 557-577, 2006.

[13] L. Ding, T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi and J. Sachs, "Swoogle: a search and metadata engine for the semantic web", Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM 04, pp. 652-659, 2004.

[14] M. Richardson, A. Prakash and E. Brill, "Beyond PageRank: machine learning for static ranking", Proceedings of the 15th International Conference on World Wide Web, WWW 06, pp. 707-715, 2006.

[15] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Comput. Netw. ISDN Syst. 30, 1-7, pp. 107-117, 1998.

[16] Y. Zhang, L. Zhang, Y. Zhang and X. Li, "XRank: Learning More from Web User Behaviors", Sixth IEEE International Conference on Computer and Information Technology (CIT'06), pp. 36, 2006.

[17] M. T. Goodrich and R. Tamassia, "Data Structures and Algorithms in Java", John Wiley & Sons, Inc., ISBN 0471644528, 2004.

[18] E. Dijkstra, "A note on two problems in connexion with graphs", Numerische Mathematik, 1: pp. 269-271, 1959.

[19] M. Holzer, F. Schulz, D. Wagner and T. Willhalm, "Combining speed-up techniques for shortest-path computations", ACM Journal of Experimental Algorithmics 10, 2.5, 2005.

[20] M. Potamias, F. Bonchi, C. Castillo and A. Gionis, "Fast Shortest Path Distance Estimation in Large Networks", to appear in Proceedings of the Eighteenth ACM Conference on Information and Knowledge Management, CIKM 09, 2009.