Sup (Semantic User Profiling)



Published in: Technology, Education

  1. 1. SUP – Semantic User Profiling

Emanuela Boroș, Alexandru-Lucian Gînscă
UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania
{emanuela.boros, lucian.ginsca}

Abstract. In this report we present a model of a user's profile based on multiple social network accounts and influence services. In the modeling process we rely on well-established vocabularies, and we also define our own model, especially for data regarding influence. We built a web application that offers an accessible interface to the knowledge base and lets users have their social graph semantically modeled.

1 Introduction

Using the information provided by current social networks (Twitter and Facebook), SUP (Semantic User Profiling) is a Web platform for managing user profiles. A user profile is modeled semantically and exposed according to the related standards. The platform also provides means for estimating a user's reputation based on multiple criteria, using social scoring services such as Klout and PeerIndex. Users can view their social graph, which can also be queried through a SPARQL service. The core principle behind the application is a visually attractive way of presenting a user's semantic profile.

The remaining principles concern the functional properties of the application. SUP extends a standard CRUD architecture into a more sophisticated web application: the presentation and data model logic are properly separated (clients provide the user interface, and servers handle storage and application modeling logic); storage is handled by the Virtuoso triple store; data is consistent end to end (JSON/JavaScript); communication between client and server is smooth in both directions; and interfaces are cleanly encapsulated behind lightweight RESTful web services.
The final result is a web application with an effective user experience that brings together the cumulative advances of modern JavaScript and web architecture design patterns: JSON, RDF, AJAX, the REST style, and a thin server architecture.

2 Global Architecture

The primary purpose of this data-driven application is to visualize its data in the most pleasing way possible. A query is passed to the application, and it
  2. 2. returns a set of matching responses, ordered by relevance and mapped in a standardized way. This process requires a lightweight updater for the web page (that is, asynchronous functionality), a creative way of visualizing the updates, end-to-end consistency in data, and a lightweight CRUD-style data provider. To obtain this, the architecture of SUP (Semantic User Profiling) has been designed following a three-tier approach, similar to a light model-view-controller. The architecture combines technologies from the JavaScript/jQuery/Ajax and Java worlds. The presentation layer is JavaScript-driven, with Ajax for exchanging information with the server, while the business and data layers are realized through Java EE technologies. In this way, the application takes the best of both worlds: the dynamic, personalized user experience we expect of immersive Web applications and the simple, scalable architecture we expect from RESTful applications. Below we provide further details about the three tiers.

Figure 1: SUP global architecture
  3. 3. Presentation Layer

This layer has been developed as a single web page. The parent page serves the common user of the application, who is looking for a creative way of visualizing personal data; the child page serves specialized users, who are looking for a representation of the results of their SPARQL queries. Communication between the two higher tiers is carried out through Ajax: the client submits requests to the logic tier and receives back JSON data representing the content of the response, which is then parsed and used to drive the appropriate interaction in the user interface. Data received from the server is presented in two ways: as a graph visualization, and as the raw result of SPARQL queries, which arrives in XML format.

The main keywords for this tier are: HTML, CSS, JavaScript, Ajax, Protovis, Twitter@Anywhere, Facebook JavaScript SDK.

First of all, there is an important need for maintaining a user's profile data. Because the server mostly pushes data, processing is distributed to the clients, which makes the application properly scalable. Because Ajax allows interaction with the server without a full page refresh, a stateful client becomes an option again. This has profound implications for the architectural possibilities of dynamic, immersive Web applications. The RESTful services (the Visualization and SPARQL web services) are the data providers for the Ajax updates. The primary response type we use is JSON, because it is human-readable and easy to process.

The business and functional components of the application require minimal information from the main social networks that are used as data providers. This information is obtained using Twitter@Anywhere and the Facebook JavaScript SDK. Twitter@Anywhere is an easy-to-deploy solution for bringing the Twitter communication platform to a web page.
It is used to build the "Connect to Twitter" integration. The Facebook JavaScript SDK provides simple client-side functionality for making Facebook API calls. The social plugins are used to obtain an access token for communication with Facebook.

The graphs needed for visualizing the data of every semantic profile are created and populated with Protovis. The common forms of visualization are the social graph and the timeline. These are fed with JSON results once the RESTful services have themselves received query-specific results (this discussion continues in the next section).
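The hand-off from the RESTful services to the graph visualization can be sketched as follows. This is a minimal illustration in Python, not the authors' Java/JavaScript code; the function name `to_node_link`, the input shape, and the field names `nodeName`, `source`, `target` and `value` are assumptions chosen to resemble a typical node-link JSON payload.

```python
import json


def to_node_link(user, friend_scores):
    """Build a node-link structure for one user and his scored friends.

    friend_scores maps a friend's name to a numeric score (hypothetical input).
    """
    nodes = [{"nodeName": user}]
    links = []
    for name, score in sorted(friend_scores.items()):
        nodes.append({"nodeName": name})
        # Links refer to nodes by their index in the nodes list;
        # index 0 is always the profile owner.
        links.append({"source": 0, "target": len(nodes) - 1, "value": score})
    return {"nodes": nodes, "links": links}


# Serialized form, as it could be returned to an Ajax client.
payload = json.dumps(to_node_link("alice", {"bob": 0.8, "carol": 0.3}))
```

Keeping the profile owner at index 0 makes every link a spoke from the owner to one friend, which is the shape a social graph centered on one user needs.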
  4. 4. Business Logic Layer

The business logic of the application is implemented as a collection of Java RESTful web services deployed on a Tomcat 6 server. The services send SPARQL queries to the Virtuoso triple store and receive the corresponding responses, which are then processed and reshaped for consumption by the user interface. REST web services are lightweight (no complex markup), return human-readable results, and are easy to build, with no toolkits required. We use them as a CRUD-style way of getting the data we need for creating semantic profiles.

Data Layer

The data tier is mainly represented by a component for accessing and managing the RDF/OWL model. This component queries and manages RDF triples with OpenLink Software's Virtuoso, a database server that can also store (and, as part of its original specialty, serve as an efficient interface to databases of) relational data and XML. The primary data, which consists of details of users' profiles from different social networks and various scores of their online influence, is gathered using client implementations for commonly used social media and social scoring applications: Twitter, Facebook, Klout and PeerIndex.

For Klout and PeerIndex, we created our own API client implementations. These services are the main providers of influence scores. For Twitter, we used Twitter4J, a library for easy integration with the Twitter service, with built-in OAuth support and zero dependencies; for Facebook, we chose RestFB, a simple and flexible Facebook Graph API and Old REST API client written in Java.

The reasoning over specific data is explained in the Data Acquisition and Influence Model sections.

3 General Model and Vocabularies

Vocabularies. Besides the rdf, rdfs and owl vocabularies and our own vocabulary, developed for modeling influence information, we mainly use the foaf and sioc vocabularies.
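A query that a service in the business logic layer might send to the triple store, and the reshaping of its answer, can be sketched as follows. This is an illustration under stated assumptions, not the SUP implementation: the code is Python rather than Java, the user URIs are hypothetical, and the result is assumed to arrive in the standard SPARQL JSON results format instead of the XML mentioned earlier. Only `foaf:knows` and `foaf:nick` are taken from the vocabularies used in the model.

```python
import json

FOAF_PREFIX = "PREFIX foaf: <http://xmlns.com/foaf/0.1/>"


def friends_query(user_uri):
    """Build a SPARQL query listing the people a user knows and their nicknames."""
    return (
        f"{FOAF_PREFIX}\n"
        "SELECT ?friend ?nick WHERE {\n"
        f"  <{user_uri}> foaf:knows ?friend .\n"
        "  OPTIONAL { ?friend foaf:nick ?nick }\n"
        "}"
    )


def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON results document into plain dictionaries."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound OPTIONAL variables are simply absent from a binding.
        rows.append({var: binding[var]["value"] for var in variables if var in binding})
    return rows
```

The flattened rows are the kind of simplified structure a user interface can consume directly, without knowing about RDF term types.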
  5. 5. Table 1: Used terms sample

SIOC              FOAF                 FOAF
sioc:user         foaf:Agent           foaf:birthdate
sioc:follows      foaf:onlineAccount   foaf:firstName
sioc:userAccount  foaf:knows           foaf:lastName
sioc:avatar       foaf:nick            foaf:homepage
sioc:creatorOf    foaf:img
sioc:post         foaf:mbox

In Figure 2, we can see a part of the model, containing information about three users and their friends. The visualization was done with Gravity, using the RDF generated by the Jena API.

Figure 2: Model sample with Gravity

In Figure 3, the same snippet of the model is visualized, this time with Welkin. A node is highlighted to show more information.
  6. 6. Figure 3: Model sample with Welkin

4 Data acquisition

Data acquisition feeds the knowledge model of SUP. The raw data is obtained through client implementations of the main social networks' APIs and is imported directly from the web, mainly from Twitter and Facebook. For Twitter and Facebook data acquisition, we created wrappers around the libraries we use, tailored to our data needs. Both services require the application to be registered in advance in order to acquire consumer keys and consumer secrets.

The Twitter API consists of three parts: two REST APIs and a Streaming API. The Twitter REST API is the core API set: it allows developers to access core Twitter data, including update timelines, status data, and user information; it contains most of the methods that would be used to work with Twitter data in an application; and it supports three formats for each method: XML, Atom, and JSON. The Search API gives developers methods to interact with Twitter Search and trends data. Our main concerns are rate limiting and the output format, which can easily become important issues when using this API. We use Twitter4J, a Java library recognized by Twitter, as a simple implementation of the Twitter REST API. The data extracted with the library consists mainly of users' personal information, details about friends and followers, and the latest tweets. The resources that Twitter offers follow this pattern:

Resource URL:
  7. 7. GET followers/ids
Returns an array of numeric IDs for every user following the specified user. This method is powerful when used in conjunction with users/lookup.

GET friends/ids
Returns an array of numeric IDs for every user the specified user is following. This method is powerful when used in conjunction with users/lookup.

GET users/show
Returns extended information about a given user, specified by ID or screen name as per the required id parameter. The author's most recent status will be returned inline.

Users follow their interests on Twitter through both one-way and mutual following relationships.

The responses we are aiming for have the following JSON structure:

{
  "profile_image_url": "",
  "location": "San Francisco, CA",
  "follow_request_sent": false,
  "id_str": "6253282",
  "profile_link_color": "0000ff",
  "is_translator": false,
  "contributors_enabled": true,
  "url": "",
  "favourites_count": 15,
  "id": 6253282
}

The Facebook Graph API presents a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people, photos, events, and pages) and the connections between them (e.g., friend relationships, shared content, and photo tags). For Facebook data acquisition, we use the RestFB Java library. RestFB already maps objects to JSON, so the data is received in this format:

{
  "id": "220439",
  "name": "Facebook User",
  "first_name": "Facebook",
  "last_name": "User",
  "link": "",
  "username": "facebook.user",
  "gender": "male",
  "locale": "en_US"
}

To use this library conveniently, we created a wrapper with built-in Facebook Graph-specific queries. This way, we minimized the effort of repeatedly writing different queries. In the end, Facebook provides us with personal data, extended details about friends, and the personal feed.
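One way to carry such a JSON response into the semantic model is to map selected keys to FOAF properties. The sketch below is a hypothetical illustration in Python, not the authors' Jena-based code: the base URI `http://example.org/user/`, the function name and the key-to-property mapping are all assumptions, and triples are represented as plain tuples.

```python
import json

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical mapping from Facebook-style JSON keys to FOAF properties.
FIELD_TO_FOAF = {
    "name": FOAF + "name",
    "first_name": FOAF + "firstName",
    "last_name": FOAF + "lastName",
    "gender": FOAF + "gender",
}


def profile_to_triples(raw_json, base_uri="http://example.org/user/"):
    """Turn a profile JSON document into (subject, predicate, object) tuples."""
    profile = json.loads(raw_json)
    subject = base_uri + str(profile["id"])
    triples = []
    for key, predicate in FIELD_TO_FOAF.items():
        value = profile.get(key)
        if value:  # skip missing or empty fields such as "link": ""
            triples.append((subject, predicate, value))
    return triples
```

Keys without a mapping, such as `locale`, are ignored, so the function can be applied to richer responses without changes.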
  8. 8. The process of data acquisition, combined with the social scores, is shown in the figure below.

Figure 4: Data acquisition workflow

5 Influence model

We are interested in discovering features related to a user's influence on a certain social network and the influence of his friends, and in creating a model for these influence components using RDFS and OWL. We use two services that are known for their work in social network influence analysis, Klout and PeerIndex.

Klout. Besides the Klout score, we included in our model other influence-related concepts that Klout offers. Next, we present the four influence scores that Klout provides. Most of the descriptions were taken from Klout's website and are meant to give a better understanding of the notions regarding influence that are being introduced in the model.
  9. 9. Klout Score: The Klout Score is the measurement of the user's overall online influence. The score ranges from 1 to 100, with higher scores representing a wider and stronger sphere of influence.

Amplification Probability: Klout describes the Amplification Probability as "the likelihood that your content will be acted upon. The ability to create content that compels others to respond and high-velocity content that spreads into networks beyond your own is a key component of influence."

Network: The network effect that an author has; it is a measure of the influence of the people the author is reaching. Klout describes it as "the influence level of your engaged audience."

True Reach: The True Reach score from Klout measures how many people an author influences.

In Figure 5, a snippet from the RDF/XML file describing the Klout score is shown.

Figure 5: Klout score in RDF

Next, we present some of the 17 Klout classes. In our model, the Klout class concept is defined using the owl:oneOf construct, enumerating the instances.

Broadcaster: The user broadcasts appreciated content that spreads fast. He is an essential information source in his industry. He has a large and diverse audience.

Celebrity: The user has reached the maximum point of audience. People share his content in great numbers. He is probably famous in real life and has numerous fans.

Curator: The user highlights the most interesting people, finds the best content on the web and shares it with a wide audience. He is a critical information source.

Feeder: The user's audience relies on him for a steady flow of information about his industry or topic.

Observer: The user doesn't share very much, but follows the social web. He prefers observing to sharing.

Klout also offers a list of at most five influencers and a list of at most five people the user influences. We captured this aspect in the isInfluencedBy and influences relations, as seen in Figure 6.
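The two influence relations can be illustrated with a small sketch that turns the lists returned by a scoring service into triples. The relation names `isInfluencedBy` and `influences` come from the model described above; the namespace `http://example.org/sup#`, the function name and the tuple representation are assumptions made for this Python illustration.

```python
SUP = "http://example.org/sup#"  # hypothetical namespace for the influence vocabulary

MAX_LIST_SIZE = 5  # each list holds at most five entries


def influence_triples(user, influencers, influencees):
    """Emit (subject, predicate, object) tuples for a user's influence lists."""
    triples = []
    for other in influencers[:MAX_LIST_SIZE]:
        triples.append((user, SUP + "isInfluencedBy", other))
    for other in influencees[:MAX_LIST_SIZE]:
        triples.append((user, SUP + "influences", other))
    return triples
```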
  10. 10. Figure 6: Klout influence relations in RDF

PeerIndex. Although PeerIndex relies on fewer data sources than Klout, we wanted to have an alternative to the Klout score. Next, we present descriptions of the four influence scores, as given by PeerIndex.

PeerIndex Score: A user's overall PeerIndex score is a relative measure of his online authority. The PeerIndex Score reflects the impact of his online activities, and the extent to which he has built up social and reputational capital on the web.

In Figure 7, a snippet from the RDF/XML file describing the PeerIndex score is shown.

Figure 7: PeerIndex score in RDF

Authority Score: Authority is the measure of trust, calculating how much others rely on the user's recommendations and opinions, in general and on particular topics. PeerIndex calculates the authority in eight benchmark topics for every profile. These are used to generate the overall Authority Score as well as to produce the PeerIndex Footprint diagram. The Authority Score is a relative positioning against everyone else in each benchmark topic. The rank is a normalized measure against all the other authorities in the topic area.

Audience Score: The Audience Score is a normalized indication of the user's reach, taking into account the relative size of his audience to the size of the audiences of
  11. 11. others. In calculating his Audience Score, PeerIndex does not simply use the number of people who follow him, but instead derives it from the number of people who are impacted by his actions and are receptive to what he is saying. If the user has an "audience" consisting of a large number of spam accounts, bots, or inactive accounts, his Audience Score will reflect this.

Activity Score: The Activity Score is the measure of how much the user does that is related to the topic communities he is part of. If he is too active, his topic community members tend to get fatigued and may stop engaging with him. The Activity Score takes this behavior into account. Like the other scores, the Activity Score is calculated relative to the user's communities. If he is part of a community that has a large amount of activity, his level of activity and engagement will need to be higher to achieve the same relative score as in a topic that has less activity.

In Figure 8, we see a visualization of the model with Welkin.

Figure 8: Influence model visualized with Welkin

6 Topic Semantic Similarity

A user is associated with different topics drawn from multiple sources, which give an overview of the concepts he discusses most or of his interests. In our current implementation, topics are gathered from the Klout and PeerIndex services. While PeerIndex returns a straightforward list of topics for a certain user, Klout has a particular understanding of the concept of “topic”. Next, we present Klout's method of finding topics.
  12. 12. Klout topics are gathered from the Twitter stream, and in some cases they seem to have nothing to do with what the tweets are about. Klout looks for specific keywords in the user's tweets that received a certain amount of attention, such as numerous replies or retweets. If the user replies to someone's tweet and the response generates a lot of interest, Klout looks back to the original tweet for keywords. Once the keywords that draw influence are obtained, Klout uses a dictionary to identify relevant terms. More details regarding this dictionary and how the terms are correlated do not seem to be publicly available. Klout then compares the user's influence on these terms to see whether he is generating significant influence within his network. If Klout determines that a user has influence on a specific term, that term will appear on his list of topics. For a better understanding of this process, we give a small example. If a user has at least 10 tweets about cats each day, but no one ever replies to them, the term “cat” will not appear on his topic list; but if a user publishes a tweet about “war” and this tweet generates tens of replies and gets retweeted many times, then the term “war” will most likely be found in his list of topics.

For computing the semantic similarity between two terms, we use three WordNet semantic similarity algorithms: Wu and Palmer, Resnik, and Lin. Next, we give more details about these measures and present results computed on 5 Klout topics extracted from our knowledge base.

Wu and Palmer measure. The Wu & Palmer measure [3] calculates semantic similarity by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the least common subsumer. The formula is as follows:

$$ sim_{wup}(s_1, s_2) = \frac{2 \cdot depth(lcs(s_1, s_2))}{depth(s_1) + depth(s_2)} $$

s1: the synset of the first term;
s2: the synset of the second term;
lcs(s1, s2): the synset of the least common subsumer.

This means that $0 < sim_{wup}(s_1, s_2) \le 1$.
The score can never be zero because the depth of the least common subsumer is never zero: the depth of the root of a taxonomy is one. The score is one if the two input synsets are the same.

Table 2: Wu and Palmer

terms        internet  design  web    education  philosophy
internet     1.0       0.631   0.909  0.222      0.21
design       0.631     1.0     0.75   0.8        0.75
web          0.909     0.75    1.0    0.461      0.428
education    0.222     0.8     0.8    1.0        0.8
philosophy   0.21      0.75    0.428  0.8        1.0
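The depth-based computation can be made concrete with a small sketch. It uses a hand-written toy taxonomy instead of WordNet, so the node names and the resulting scores are illustrative only and are not comparable with Table 2; it also assumes that every node has a single parent, which real WordNet synsets do not guarantee.

```python
# Toy taxonomy (hypothetical, not WordNet): each node maps to its parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "cat": "animal",
    "dog": "animal",
    "artifact": "entity",
    "car": "artifact",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors, ending at the root."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth one."""
    return len(path_to_root(node))


def least_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(a, b):
    """Wu and Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, `wu_palmer("cat", "dog")` is 2·2/(3+3) = 2/3, while `wu_palmer("cat", "car")` drops to 2·1/(3+3) = 1/3 because the only shared ancestor is the root.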
  13. 13. Resnik measure. This measure also relies on the idea of a least common subsumer (LCS), the most specific concept that is a shared ancestor of the two concepts [4]. The Resnik measure [1] simply uses the Information Content (IC) of the LCS as the similarity value:

$$ sim_{res}(t_1, t_2) = IC(lcs(t_1, t_2)), \qquad IC(t) = -\log \frac{freq(t)}{maxFreq} $$

lcs(t1, t2): the least common subsumer;
freq(t): the frequency of term t in a corpus;
maxFreq: the maximum frequency of a term from the same corpus.

The Resnik measure is considered somewhat coarse, since many different pairs of concepts may share the same LCS. However, it is less likely to suffer from zero counts (and resulting undefined values), since in general the LCS of two concepts will not be a very specific concept.

Table 3: Resnik

terms        internet  design  web    education  philosophy
internet     10.37     0.631   10.37  0.0        0.0
design       2.49      11.76   2.49   3.39       3.39
web          10.37     2.49    11.76  2.87       0.77
education    0.0       3.39    2.87   10.66      3.39
philosophy   0.0       3.39    0.77   3.39       11.76

Lin measure. The Lin measure [2] augments the information content of the LCS with the sum of the information content of concepts A and B themselves. The Lin measure scales the information content of the LCS by this sum.
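The verbal description of the Lin measure corresponds to its standard formulation in [2], written here with IC for information content and lcs for the least common subsumer of the two concepts A and B. The symbol names are chosen for this note; the factor 2 is part of the standard definition.

```latex
\[
  sim_{lin}(A, B) = \frac{2 \cdot IC(lcs(A, B))}{IC(A) + IC(B)}
\]
```

With this normalization the score lies between 0 and 1 and equals 1 when A and B are the same concept, since then lcs(A, B) = A and the numerator and denominator are both 2·IC(A).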
  14. 14. Table 4: Lin

terms        internet  design  web    education  philosophy
internet     1.0       0.28    0.32   0.0        0.0
design       0.28      1.0     0.27   0.46       0.48
web          0.32      0.27    1.0    0.09       0.09
education    0.0       0.46    0.09   1.0        0.46
philosophy   0.0       0.48    0.09   0.46       1.0

Topic set similarity. For computing the semantic similarity between the topics of interest of two users using one of the three measures described above, we first generate the stem of each term, using an open-source implementation of the Porter Stemmer. The final similarity score is obtained using a weighted average over the maximum score obtained by applying a semantic similarity measure to each combination of a term from the first user's topic set and one from the second user's topic set.

T1: first user's topic set;
T2: second user's topic set;
sim(t1, t2): one of the Wu and Palmer, Resnik or Lin similarity measures.

7 Visualization

We mentioned the use of Protovis to create the graphics for visualizing a semantic profile. Protovis draws images in the Scalable Vector Graphics (SVG) format, which every modern desktop and mobile browser, including IE 9, can render. We used two types of graphs: a force-directed graph and a timeline. In the case of the force-directed graph, an intuitive approach to network layout is to model the graph as a physical system: nodes are charged particles that repel each other, and links are dampened springs that pull related nodes together. A physical simulation of these forces then determines node positions; approximation techniques that avoid computing all pairwise forces enable the layout of large numbers of nodes. In addition, interactivity allows the user to direct the layout and jiggle nodes to disambiguate links. A graph structure of this type has been developed for representing friendship scoring between a user and his friends.
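The physical model behind a force-directed layout can be sketched in a few lines. The code below is a simplified illustration and not the Protovis implementation: it computes all pairwise repulsions directly (no approximation technique), omits velocity and damping, and uses made-up parameter names and default values.

```python
import math


def layout_step(positions, links, repulsion=1.0, stiffness=0.1,
                rest_length=1.0, step=0.05):
    """Run one iteration of a naive force-directed layout.

    positions maps a node name to an (x, y) tuple; links is a list of
    (node, node) pairs. Returns a new positions dictionary.
    """
    forces = {node: [0.0, 0.0] for node in positions}
    nodes = list(positions)

    # Charged particles: every pair of nodes repels with force ~ 1 / distance^2.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = positions[a][0] - positions[b][0]
            dy = positions[a][1] - positions[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            push = repulsion / dist ** 2
            forces[a][0] += push * dx / dist
            forces[a][1] += push * dy / dist
            forces[b][0] -= push * dx / dist
            forces[b][1] -= push * dy / dist

    # Springs: linked nodes are pulled toward the rest length.
    for a, b in links:
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        dist = math.hypot(dx, dy) or 1e-9
        pull = stiffness * (dist - rest_length)
        forces[a][0] += pull * dx / dist
        forces[a][1] += pull * dy / dist
        forces[b][0] -= pull * dx / dist
        forces[b][1] -= pull * dy / dist

    # Move each node a small step along its net force.
    return {
        node: (positions[node][0] + step * forces[node][0],
               positions[node][1] + step * forces[node][1])
        for node in positions
    }
```

Calling `layout_step` repeatedly lets linked nodes settle near the rest length while unlinked nodes drift apart; the all-pairs loop is quadratic in the number of nodes, which is why practical layouts rely on approximations for large graphs.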
  15. 15. Figure 9: Graph

The timeline is a common way of showing a user's activity over time. Screenshots are shown below.

Figure 10: Timeline
  16. 16. Figure 11: SPARQL endpoint

8 Use Cases

We distinguish two main types of use cases: one involving an inexperienced user who just wants to find information about his social graph or his friends' graphs, and another in which a user with SPARQL knowledge can write his own queries and view the results in table form, or select one of the predefined queries that generate interactive graphs and modify it.

9 Conclusion

Semantic modeling deserves further involvement from our team, and it is important to continue investigating new means of computing influence more accurately. A larger collection of triples would be needed, along with a more complex semantic model. Future work includes completing SUP with a semantic similarity computation between users' topics. The module has been implemented using WordNet-based semantic similarity algorithms but is not yet included in the main workflow. In conclusion, we will focus on improving the semantic model and on exploring new ways of visualizing data properly.
  17. 17. References

1. Philip Resnik. 1995. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Confer
2. D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August.
3. Wu and M. Palmer. 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New Mexico.
4. Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - measuring the relatedness of concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04). AAAI Press, Cambridge, MA, pages 1024–1025.