Making sense of users' Web activities
1. Making sense of users’ Web activities. Mathieu d'Aquin, Knowledge Media Institute, The Open University, UK
2. A bit of sci-fi to start with: “… from people who are afraid that someone else knows information that they don’t and is gaining an unfair advantage by it. For all the claims one hears about the liberating impact of the data-net, the truth is that it wished on most of us a brand-new reason for paranoia.” John Brunner, The Shockwave Rider, 1975
3. What we don’t know that they know. Simple important things: What are all the websites that know my e-mail address? And more complex important things: What does amazon.co.uk or the website of my favorite airline know about me?
4. Is this Personal Information Management? Yes, but… we are looking at an individual user’s information exchange and, more generally, activities on the Web. This is: big, heterogeneous, distributed, fragmented, sometimes implicit, and hard to collect!
5. So, what do we do? Unrestricted monitoring of information exchange on the Web by an individual user
6. Local Logging Proxy. [Diagram: local Web agents (e.g., browser) exchange HTTP requests and HTTP responses with external Web sites through the local logging proxy, which records the Web exchange as RDF logs.]
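A minimal sketch of what the logging step of such a proxy might produce, assuming a hypothetical `http://example.org/weblog#` namespace and property names loosely borrowed from the HTTP ontology slide (`toURL`, `toHost`, `hasResponse`, `responseCode`; `method` is an added assumption). The slides do not show the actual log format, so this is illustrative only; it uses the Python standard library and emits N-Triples-style lines.

```python
from urllib.parse import urlsplit

NS = "http://example.org/weblog#"  # hypothetical namespace, not from the talk
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def exchange_to_triples(exchange_id, method, url, response_code):
    """Turn one request/response pair into N-Triples-style lines."""
    req = f"<{NS}request/{exchange_id}>"
    resp = f"<{NS}response/{exchange_id}>"
    host = urlsplit(url).hostname  # host part of the requested URL
    return [
        f'{req} <{NS}method> "{method}" .',
        f"{req} <{NS}toURL> <{url}> .",
        f'{req} <{NS}toHost> "{host}" .',
        f"{req} <{NS}hasResponse> {resp} .",
        f'{resp} <{NS}responseCode> "{response_code}"^^<{XSD_INT}> .',
    ]


if __name__ == "__main__":
    for line in exchange_to_triples(1, "GET", "http://www.imdb.com/title/tt0387808/", 200):
        print(line)
```

Keeping one request node and one response node per exchange mirrors the request/response split in the diagram, so later queries can join on the host or on the response code.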
8. What this talk is about: using ontologies and external datasets to generate abstractions of this low-level data, enrich it with external knowledge and models, and interpret it to give back useful information to the user.
9. Online Activities Ontology. [Diagram components: HTTP Ontology; Parameters and Website info.; Personal Information; Web Site Information; Trust Model; Location Information]
10. HTTP Ontology. Built bottom-up from the data. Can help infer simple things from it and answer questions through SPARQL queries. [Diagram labels, in extraction order: InternetPoint (time: DateTime); origine; Request (time: DateTime, toURL: URL, referer: URL); toHost; WebHost (domain: String); User-Agent; WebAgent (ID: String); hasResponse; Content; Content-Type; Response (time: DateTime, responseCode: int); DataFile (ID: String); Content; Content-Type; DataFormat (MineID: String)]
12. Integrating basic info: domain name, IP, location. “What!? What requests have I made to websites in Nigeria? What data did I send?” This can be answered with a SPARQL query.
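The kind of question on this slide boils down to joining the request log with location information about each host. A small Python sketch of that join, under stated assumptions: the host-to-country table and the log entries are hypothetical placeholders (the talk derives location from the domain name and IP, and runs the query in SPARQL over the RDF logs).

```python
# Hypothetical host -> country-code table; a real system would derive this
# from the domain name and IP address of each host.
HOST_COUNTRY = {"shop.example": "NG", "news.example": "GB"}

# Hypothetical log entries: (host, data sent with the request).
LOG = [
    ("shop.example", {"email": "me@example.org"}),
    ("news.example", {}),
    ("shop.example", {"q": "flights"}),
]


def requests_to_country(log, host_country, country_code):
    """Return the (host, data) pairs whose host is located in country_code."""
    return [(host, data) for host, data in log
            if host_country.get(host) == country_code]


if __name__ == "__main__":
    for host, data in requests_to_country(LOG, HOST_COUNTRY, "NG"):
        print(host, data)
```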
13. More information about websites. The linked data cloud is full of it. Using the domain name to address this information: CONSTRUCT {<domain_name> ?p ?y} WHERE {{{?x dbpedia:homepage <http://domain_name>}. {?x ?p ?y}} UNION {{?x owl:sameAs ?z}. {?x dbpedia:homepage <http://domain_name>}. {?x ?p ?y}}}
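A sketch of how the query above could be instantiated for a concrete domain name. The graph pattern follows the slide; the `dbpedia:` prefix IRI is an assumption (the slide does not declare its prefixes), the `owl:` prefix is the standard OWL namespace, and `build_site_query` is a hypothetical helper name. The domain is validated before substitution so that arbitrary text cannot be injected into the query.

```python
import re

QUERY_TEMPLATE = """PREFIX dbpedia: <{dbpedia_ns}>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {{ <{domain}> ?p ?y }}
WHERE {{
  {{ ?x dbpedia:homepage <http://{domain}> . ?x ?p ?y }}
  UNION
  {{ ?x owl:sameAs ?z . ?x dbpedia:homepage <http://{domain}> . ?x ?p ?y }}
}}"""


def build_site_query(domain, dbpedia_ns="http://dbpedia.org/property/"):
    """Fill the slide's CONSTRUCT query for one domain name.

    dbpedia_ns is an assumed prefix IRI; adjust it to the dataset in use.
    """
    if not re.fullmatch(r"[A-Za-z0-9.-]+", domain):
        raise ValueError(f"not a plain domain name: {domain!r}")
    return QUERY_TEMPLATE.format(domain=domain, dbpedia_ns=dbpedia_ns)


if __name__ == "__main__":
    print(build_site_query("www.youtube.com"))
```

The resulting string can be sent to any SPARQL endpoint that hosts the relevant linked data; which endpoint to use is left open here.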
14. Examples. [Diagram: linked-data descriptions from DBpedia and Freebase of www.google-analytics.com, www.youtube.com and www.google.com. Labels: Google Services; Entertainment Websites; Web Analytics; Internet Search Engine; subject/category; Video sharing; Video Hosting; Company; developer; Web Search Engine; Search Engine; type; google; owner; subsediaryOf; parent]
15. Activities Can we now understand the user activities? Based on website categories and on their parameters: GET http://uk.search.yahoo.com/beacon/module?p=idiocracy&url=http%3A%2F%2Fwww.imdb.com%2Ftitle%2Ftt0387808%2F POST format=JSON&method=fql%2Emultiquery&api%5Fkey=51d350e8d92da1f5623512a9e801da2b&v =1%2E0&queries=%7B%22query2%22%3A%22SELECT%20app%5Fid%2C%20display%5Fname%20FROM %20application%20WHERE%20app%5Fid%20IN%20%28SELECT%20app%5Fid%20FROM%20%23query1 %29%22%2C%22query1%22%3A%22SELECT%20post%5Fid%2C%20source%5Fid%2C%20created%5Ftime%2C%20updated%5Ftime%2C%20actor%5Fid%2C%20target%5Fid%2C%20app%5Fid%2C%20message%2C%20attachment%2C%20comments%2C%20likes%2C%20permalink%2C%20attribution%2C%20type%20FROM%20stream%20WHERE%20filter%5Fkey%20IN%20%28SELECT%20filter%5Fkey%20FROM%20stream%5Ffilter%20WHERE%20uid%20%3D%20605559235%20AND%20type%20%3D%20%27newsfeed%27%29%20AND%20%28created%5Ftime%20%3E%3D%201257443596%29%20AND%20%28%28created%5Ftime%20%3E%201257945423%29%20OR%20%28updated%5Ftime%20%21%3D%20created%5Ftime%29%29%20ORDER%20BY%20created%5Ftime%20DESC%20LIMIT%20200%22%7D&call%5Fid=12565739074246102&sig=01a13a72825ed83ed6d23bdf2791ad1a&session%5Fkey=be312ffdf9b9e1a5ec6c5768%2D605559235
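To make the parameters of such requests readable, the query string can be decoded with the Python standard library. This sketch uses the GET request shown above; `request_parameters` is a hypothetical helper name, not something from the talk.

```python
from urllib.parse import parse_qs, urlsplit


def request_parameters(url):
    """Return (host, {parameter: first decoded value}) for a request URL."""
    parts = urlsplit(url)
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return parts.hostname, params


if __name__ == "__main__":
    host, params = request_parameters(
        "http://uk.search.yahoo.com/beacon/module"
        "?p=idiocracy&url=http%3A%2F%2Fwww.imdb.com%2Ftitle%2Ftt0387808%2F"
    )
    print(host)    # uk.search.yahoo.com
    print(params)  # {'p': 'idiocracy', 'url': 'http://www.imdb.com/title/tt0387808/'}
```

Once decoded, the request reads as a search for "idiocracy" together with a target URL on imdb.com, which is the kind of signal an activity classification can build on.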
16. Activities in an Ontology. Derived in a bottom-up way from categories of activities/requests. Can be used to characterize overall activities, individual activities or correlations between activities. [Diagram: class hierarchy with ActivityBasedRequest, ImplicitActivity, ExplicitActivity, ReportToAnalytics, Search, CheckStatusFeed, SearchVideo, SearchImage, AutoCheckStatusFeed, FollowLink, ManualCheckStatusFeed, FollowSearchResult]
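A rough sketch of how a request might be assigned to one of these activity classes from the website's categories and the request's parameters. The class names come from the slide; the rules themselves (which category or parameter triggers which class, and the use of ActivityBasedRequest as the fallback) are hypothetical and only illustrate the idea.

```python
def classify_request(site_categories, params):
    """Map a request to an activity class name; the rules are illustrative only."""
    if "Web Analytics" in site_categories:
        return "ReportToAnalytics"
    if "Search Engine" in site_categories:
        if "url" in params:
            # A target URL sent back to the search engine suggests a clicked result.
            return "FollowSearchResult"
        if "p" in params or "q" in params:
            # Assumed names of query-string parameters carrying search terms.
            return "Search"
    # Assumed to be the most general class in the hierarchy.
    return "ActivityBasedRequest"


if __name__ == "__main__":
    print(classify_request({"Web Analytics"}, {}))
    print(classify_request({"Search Engine"}, {"p": "idiocracy"}))
```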
21. Tool used to create mappings between the data sent to websites (from logs, on the right) and the user profile (left), effectively reconstructing the profile from the data.
22. User profile reconstructed from Web activities: 36 attributes, 1,080 values, sent to 123 domains. A model of what piece of personal information was sent where (can answer the questions).
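One possible shape for such a model, sketched in Python: a mapping from profile attribute to value to the set of domains that received it. The function names and the sample observations are hypothetical; the talk's actual model is expressed with ontologies over the RDF logs.

```python
from collections import defaultdict


def build_profile(observations):
    """Build attribute -> value -> set of domains from (attribute, value, domain) triples."""
    profile = defaultdict(lambda: defaultdict(set))
    for attribute, value, domain in observations:
        profile[attribute][value].add(domain)
    return profile


def domains_knowing(profile, attribute):
    """Answer 'which websites know my <attribute>?' as a sorted list of domains."""
    values = profile.get(attribute, {})
    return sorted({d for domains in values.values() for d in domains})


if __name__ == "__main__":
    profile = build_profile([
        ("email", "me@example.org", "shop.example"),
        ("email", "me@example.org", "airline.example"),
        ("postcode", "AB1 2CD", "shop.example"),
    ])
    print(domains_knowing(profile, "email"))
```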
23. What that tells us about trust. Taking the point of view of an external observer, we can derive an observed model of trust and of the criticality of data. If a piece of data is critical to you and you give it to Bob, you must trust Bob. If you give a piece of data to many untrusted people, you probably don’t consider it critical.
24. Formally: Trust in a domain = max of the criticality of the data it received. Criticality of a piece of data = 1 / (1 + Σ (1 − trust in each website that received the data)). Obviously, these two formulas are interdependent. They are treated as a sequence, with initial values at 0.5.
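A small Python sketch of iterating the two formulas, under stated assumptions: the criticality denominator is read as 1 + Σ(1 − trust), each round updates criticality first and then trust from the new criticality values, and the number of rounds is fixed. The slide does not specify the update order or a stopping rule, and the sample data is hypothetical.

```python
def observed_trust(sent, rounds=50, initial=0.5):
    """Iterate the trust/criticality formulas.

    sent maps each piece of data to the set of domains that received it.
    Returns (trust per domain, criticality per piece of data).
    """
    domains = {d for recipients in sent.values() for d in recipients}
    trust = {d: initial for d in domains}
    criticality = {item: initial for item in sent}
    for _ in range(rounds):
        # Criticality drops as data is given to more, less trusted domains.
        criticality = {
            item: 1.0 / (1.0 + sum(1.0 - trust[d] for d in recipients))
            for item, recipients in sent.items()
        }
        # A domain is trusted as much as the most critical data it received.
        trust = {
            d: max(criticality[item] for item, recipients in sent.items() if d in recipients)
            for d in domains
        }
    return trust, criticality


if __name__ == "__main__":
    trust, criticality = observed_trust({
        "email": {"a.example", "b.example", "c.example"},
        "card": {"a.example"},
    })
    print(trust)
    print(criticality)
```

In this sample, the card number goes to a single domain and so comes out as more critical than the widely shared e-mail address, and the domain that received it comes out as more trusted than the others.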
25. Interacting with the model. Expose the user to his own behavior as observed, so that he can try to align it with his intended behavior.
27. Conclusion. A first set of tools exploiting logs of personal Web activity. They demonstrate the need for ways to abstract and interpret activity data in order to support Web users. They also demonstrate the ability of semantic technologies, ontologies and enrichment through external data to provide such abilities.
28. So much more to do Can I collect this tweet? From HTTPS? From my mobile phone? Can I link it to where I am? To what I’m doing? To what I have been doing? To the abstract of the presentation? To the slides on SlideShare.net? To blogs mentioning it? Can I cope with the scale of all this information? Can I decide what to share? Can I store all this securely? Can I get usable access to it? Can I learn something from it?