UCIAD overview
1. User Centric Integration of Activity Data Mathieu d’Aquin, Stuart Brown, Salman Elahi, Enrico Motta The Open University
2. Agenda Introduction of the Team Objectives and Hypothesis Overview of technical realization Challenges Summary of results so far and dissemination
3. Team Dr Mathieu d’Aquin – Research fellow, KMi – project director Stuart Brown – Web developments and online communities, communication services – member of the steering group, liaison with online services Salman Elahi – Research assistant and PhD student, KMi – developer/researcher Prof Enrico Motta – Professor of knowledge technologies, KMi – Chair of the steering group
4. Objectives and Hypothesis Hypothesis Taking a user centric point of view can allow different types of analysis of logs/activity data, which are valuable to the organisation and the user Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources
6. At the Open University An analytics system building aggregated data from the university’s various websites Based on manually defined sitemaps Good for website optimization, marketing campaigns, etc. But because the data is pre-aggregated, it is limited in what it can do Limited control No user view
7. User Centric Activity Data Activity analysis for and by individual users [Diagram: Logs 1 to 4 from Websites 1 to 4 go through consolidation, integration and interpretation, supported by ontologies, for the organisation and for users]
8. Ontologies Formal conceptual models of a domain Here, the domain is online user activity At the basis of Semantic Web technologies Standard languages for expressing ontologies and ontological data (RDF, OWL) Tools to manipulate and work with ontologies and semantic data (NeOn Toolkit, OWLIM) Many ontologies to reuse (cf. Watson) Adhere to a logical formalism Enable inferences on the data
9. Objectives and Deliverables Build the technical infrastructure that can hold traces of activity data as semantic data Includes a triple store with reasoning capability, log parsers for different formats of logs, and renderers as semantic data (RDF) Build the ontologies to interpret and reason upon activity data Including various aspects of activity data in a way which is extensible Tools to support users in analyzing their own activity data Recognize a user from the different settings and provide a view of his/her own data Allow him/her to customize the view, by customizing the ontology Test, validate, deploy, distribute
11. Technical infrastructure Development of parsers for different kinds of log formats Currently handles Apache web server log files, parameterized from the Apache configuration Easily extensible for dedicated log formats Provides a common data structure serialized in RDF by the RDF renderer Each server produces a daily extract from the logs in RDF, which is used to populate the semantic triple store The triple store includes multiple repositories and sub-spaces depending on time/user/server
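The parser and RDF renderer described on this slide can be illustrated with a minimal sketch. It assumes the Apache "combined" log format and a hypothetical `http://example.org/uciad/` namespace with made-up property names; the project's actual parsers, vocabulary and serialization may differ.

```python
import re
from datetime import datetime

# Apache "combined" log format; a real deployment may use another LogFormat.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical namespace and property names, used only for this illustration.
NS = "http://example.org/uciad/"


def parse_line(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(fields, trace_id):
    """Render one parsed log entry as a list of N-Triples lines."""
    trace = f"<{NS}trace/{trace_id}>"
    props = {
        "ip": fields["ip"],
        "agent": fields["agent"],
        "method": fields["method"],
        "path": fields["path"],
        "status": fields["status"],
        "time": fields["time"].isoformat(),
    }
    return [
        f'{trace} <{NS}{name}> "{escape(value)}" .'
        for name, value in props.items()
    ]
```

Each call to `to_ntriples` yields plain-literal triples for one trace; a daily extract would simply concatenate the output for every line of that day's log.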
12. Ontologies Key concepts to be represented: Actors (human users and robots) Sitemaps Traces (broad notion of logs) Activities Reusing existing ontologies FOAF: for people and documents Time Ontology: for traces Action ontology: for traces and activities (Planned) OPO: Online presence (Planned) SIOC: Online communities
14. Iterative and extensible construction of the ontologies Provide a base with actors, sitemaps and traces Specific extensions with typologies of activities, depending on user and site Dynamically building and integrating
15. Tool for analysis Need a tool which, given a set of ontologies and a data repository (which can be the overall one, one restricted by time, or one for a given user), can provide a meaningful and interactive overview of the activity data To be used to: provide an ontology-specific view of data analytics, support the iterative development of the ontologies, provide a user centric view of the data
17. Example In the ontology: /robots.txt is a RobotTXT page A Spider is a RobotAgent (ActorAgent) An agent used to access a RobotTXT is a Spider An AutomaticActivity is a Trace realized by a RobotAgent Result: Thousands of traces automatically classified as automatic activities.
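The chain of definitions in this example can be mimicked procedurally. The sketch below is not ontology reasoning (the project uses a triple store with inference); it only reproduces the same two-step deduction over plain dictionaries, with the RobotTXT page path passed as a parameter and the trace field names `agent` and `path` assumed for illustration.

```python
def classify_traces(traces, robots_path="/robots.txt"):
    """Label each trace as 'AutomaticActivity' or 'Trace'.

    traces: list of dicts with (assumed) keys 'agent' and 'path'.
    """
    # Step 1: an agent used to access the RobotTXT page is a Spider.
    spiders = {t["agent"] for t in traces if t["path"] == robots_path}
    # Step 2: a trace realized by a Spider (a RobotAgent) is an AutomaticActivity.
    return [
        "AutomaticActivity" if t["agent"] in spiders else "Trace"
        for t in traces
    ]
```

Note that every trace of a detected agent is labeled, not only the one that fetched the RobotTXT page, which is how a single rule can classify thousands of traces.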
18. Example In the ontology: UCIAD-Blog and LUCERO-Blog are Blogs (Website) A BlogPage is a page which is part of a Blog An activity onBlog is an activity happening on a Blog Page Result: Can look specifically at activities happening on a Blog and specialize them (same applies to Wikis, and other types of websites)
19. Example In the ontology: A SPARQLEndpoint is a specific type of Webpage AccessingSPARQLEndpoint is an activity on a SPARQLEndpoint SPARQLQueryParameter is a parameter with the name “query” used in an AccessingSPARQLEndpoint activity ExecutingSPARQLQuery is an AccessingSPARQLEndpoint activity attached to a SPARQLQueryParameter Result: Can explore the specific activity of executing SPARQL queries and its parameters Can combine: Detect the activity of Automatically Accessing a SPARQL endpoint: an automatic activity and accessing a SPARQL endpoint.
20. Next step: User support Allow users to log in, detect their setting, bring up the relevant data and explore it But also to customize the view of the data, to extend the ontologies, to provide a personalized analysis of activity data, and to export (interpreted) activity data for reuse
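As a rough procedural counterpart to these definitions, the sketch below labels a requested URL using only the standard library. The endpoint path `/sparql` is a hypothetical example; in the project the set of SPARQLEndpoint pages would come from the ontology, not from a hard-coded set.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical set of pages declared as SPARQLEndpoint.
SPARQL_ENDPOINTS = {"/sparql"}


def classify_request(url, endpoints=SPARQL_ENDPOINTS):
    """Return (labels, query_text) for a requested URL."""
    parts = urlsplit(url)
    labels = []
    query_text = None
    if parts.path in endpoints:
        labels.append("AccessingSPARQLEndpoint")
        # A parameter named "query" makes it an ExecutingSPARQLQuery activity.
        values = parse_qs(parts.query).get("query")
        if values:
            labels.append("ExecutingSPARQLQuery")
            query_text = values[0]
    return labels, query_text
```

Combining this with a robot check on the agent would give the "automatically accessing a SPARQL endpoint" case mentioned on the slide.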
21. User support [Flowchart] The user logs in or registers, and the setting (agent + IP) is detected. Known setting for the user: display activity data related to all known settings of the user. Unknown setting: ask “It is the first time you log into UCIAD with this setting (detail), do you want to attach it to your account?” (yes/no). On yes, check the setting: if non-ambiguous, add the setting to the known settings; if ambiguous, register the setting as ambiguous. Then display activity data related to all known settings of the user.
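One possible reading of this flow is sketched below. The slide does not define how ambiguity is checked, so the sketch assumes a setting is ambiguous when another user has already attached it or when it was previously registered as ambiguous; the function name, data structures and `confirm` callback are all hypothetical.

```python
def register_setting(user, setting, known, ambiguous, confirm):
    """Process one login and return the user's known settings.

    user:      user identifier
    setting:   (agent, ip) tuple detected at login
    known:     dict mapping user -> set of known settings (updated in place)
    ambiguous: set of settings registered as ambiguous (updated in place)
    confirm:   callable asking the user whether to attach a new setting
    """
    mine = known.setdefault(user, set())
    if setting not in mine and confirm(setting):
        # Assumed ambiguity test: the setting is shared with another account.
        claimed_elsewhere = any(
            setting in settings
            for other, settings in known.items()
            if other != user
        )
        if claimed_elsewhere or setting in ambiguous:
            ambiguous.add(setting)
        else:
            mine.add(setting)
    # Activity data is displayed for all known settings of the user.
    return mine
```

Keeping ambiguous settings out of `known` means shared computers never contribute traces to a personal view until the ambiguity is resolved some other way.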
22. User support: data for a user For a user <u> the SPARQL query CONSTRUCT {?trace ?p ?y. ?y ?q ?z} WHERE {<u> actor:hasKnownSetting ?s. ?trace trace:hasSetting ?s. ?trace ?p ?y. OPTIONAL {?y ?q ?z}} builds the traces of activities around the known settings of <u> Used to populate a specific repository with sub-spaces for each registered user
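The query can be generated per user before being sent to the triple store. The sketch below only builds the query string; it places the second-level pattern `?y ?q ?z` in an OPTIONAL so that trace properties whose values have no further description are still returned, which is an assumption about the intent. The prefix IRIs are hypothetical placeholders.

```python
import re

# Hypothetical prefix IRIs; the real ontology namespaces are not given here.
PREFIXES = (
    "PREFIX actor: <http://example.org/uciad/actor#>\n"
    "PREFIX trace: <http://example.org/uciad/trace#>\n"
)

# Characters that must not appear inside an IRI written between < and >.
_UNSAFE = re.compile(r'[<>"{}|^`\\\s]')


def user_traces_query(user_iri):
    """Build the CONSTRUCT query collecting traces for one user's settings."""
    if _UNSAFE.search(user_iri):
        raise ValueError(f"not a safe IRI: {user_iri!r}")
    return (
        PREFIXES
        + "CONSTRUCT { ?trace ?p ?y . ?y ?q ?z }\n"
        + "WHERE {\n"
        + f"  <{user_iri}> actor:hasKnownSetting ?s .\n"
        + "  ?trace trace:hasSetting ?s .\n"
        + "  ?trace ?p ?y .\n"
        + "  OPTIONAL { ?y ?q ?z }\n"
        + "}\n"
    )
```

Rejecting unsafe characters up front keeps a user-supplied identifier from altering the structure of the query.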
23. Deployment, test, validation At the moment, testing for websites of projects and events hosted on KMi servers: sssw.org, sssw09.org, loted.eu, lucero-project.info, uciad.info, data.open.ac.uk, lucero.open.ac.uk, … Next level up, websites/systems from the main Open University website: www.open.ac.uk, study at the OU, podcasts.open.ac.uk, VLE Extend to deployment of instances for specific projects with distributed websites
24. Challenges Scalability: the OWLIM triple store can handle billions of triples, but struggles with millions when inference is “on”. Current setup: 1 repository without inference with all historical data, 1 with inference with 1 week of data only, and 1 with inference for registered users. User management and privacy: ensuring that the user who logs in from a particular setting is the one having the activity is difficult (e.g., in the case of shared computers). Is this really a problem? Check ambiguity – ask verification questions – moderate? Distribution and IPR: code and ontologies under open licenses (small uncertainty regarding code developed in other projects). Overall data: privacy issues (is k-anonymity actually applicable? Would it work?). Overall data: institutional issues (can we show the traffic on our websites to everybody?). User data export: what license?
25. Summary and dissemination Promising initial results Can create new ways of analysis at run-time by editing the ontologies! Mechanisms to provide personal views on own activity data across websites First version of the ontologies: ongoing task First version of the tools: test and validate! Dissemination Blog / Twitter #uciad KMi’s internal newsletter (KMi Planet) Salman’s paper at the ESWC 2011 PhD symposium: “Personal Semantics: Personal information management in the Web with Semantic Technologies” Position paper at the W3C Web tracking and privacy workshop: “Self-Tracking on the Web: Why and How” Submission to the Personal Semantic Data workshop at K-CAP 2011
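The three-repository split used to work around the inference cost can be pictured as a simple routing rule. This is a sketch under assumptions: the repository names are invented, the one-week window is taken from the slide, and "registered users" is read as "the trace belongs to a known setting of a registered user".

```python
from datetime import date, timedelta


def repositories_for(trace_day, today, owner_registered, window_days=7):
    """Return the (hypothetical) repositories that should hold a trace."""
    # All historical data goes to the repository without inference.
    targets = ["history"]
    # Only recent data goes to the general repository with inference.
    if timedelta(0) <= today - trace_day < timedelta(days=window_days):
        targets.append("recent-inferred")
    # Traces of registered users go to the per-user repository with inference.
    if owner_registered:
        targets.append("users-inferred")
    return targets
```

The point of the split is that the expensive reasoning only ever runs over a bounded slice of the data, while the full history stays queryable without inference.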
26. More info UCIAD Blog: http://uciad.info Code base: http://github.com/uciad Twitter: #uciad @mdaquin