UCIAD overview

User Centric Integration of Activity Data
Mathieu d’Aquin, Stuart Brown, Salman Elahi, Enrico Motta
The Open University

Agenda Introduction of the Team Objectives and Hypothesis Overview of technical realization Challenges Summary of results so far and dissemination

Team
Dr Mathieu d’Aquin: Research fellow, KMi; project director
Stuart Brown: Web developments and online communities, communication services; member of the steering group, liaison with online services
Salman Elahi: Research assistant and PhD student, KMi; developer/researcher
Prof Enrico Motta: Professor of knowledge technologies, KMi; chair of the steering group

Objectives and Hypothesis
Hypothesis:
Taking a user centric point of view enables different types of analysis of logs/activity data, which are valuable to both the organisation and the user
Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources

Organisation Centric Activity Data
[Diagram: Users, Websites 1 to 4, Logs 1 to 4, Consolidation, Analytics = aggregated stats, Organisation]

At the Open University
An analytics system builds aggregated data from various university websites
Based on manually defined sitemaps
Good for website optimization, marketing campaigns, etc.
But because the data is pre-aggregated, it is limited in what it can do
  Limited control
  No user view

User Centric Activity Data
Activity analysis for and by individual users
[Diagram: Users, Websites 1 to 4, Logs 1 to 4, Ontologies, Consolidation, Integration, Interpretation, Organisation]

Ontologies Formal conceptual models of a domain Here, the domain is online user activity At the basis of Semantic Web technologies Standard languages for expressing ontologies and ontological data (RDF, OWL) Tools to manipulate and work with ontologies and semantic data (NeOn Toolkit, OWLIM) Many ontologies to reuse (cf. Watson) Adhere to a logical formalism Enable inferences on the data

Objectives and Deliverables
Build the technical infrastructure that can hold traces of activity data as semantic data
  Includes a triple store with reasoning capability, log parsers for different log formats, and renderers producing semantic data (RDF)
Build the ontologies to interpret and reason upon activity data
  Including various aspects of activity data in a way which is extensible
Tools to support users in analyzing their own activity data
  Recognize a user from the different settings and provide a view of his/her own data
  Allow him/her to customize the view by customizing the ontology
Test, validate, deploy, distribute

Technical infrastructure
[Diagram: Logs and Applications on Server1, Server2 and Server3; Parser/RDF renderer; Daily RDF traces; Scheduler/Manager; Semantic Triple Store]

Technical infrastructure
Development of parsers for different kinds of log formats
  Currently handles Apache web server log files, parameterized from the Apache configuration
  Easily extensible for dedicated log formats
  Provides a common data structure, serialized in RDF by the RDF renderer
Each server produces a daily extract from the logs in RDF, which is used to populate the semantic triple store
The triple store includes multiple repositories and sub-spaces depending on time/user/server
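
The parse-then-render step described above can be sketched as follows. This is a minimal illustration, not the UCIAD code: it assumes the standard Apache "combined" log format (the actual parser is parameterized from the Apache configuration), and the namespace `TRACE_NS`, the property names and the sample log line are hypothetical.

```python
import re
from datetime import datetime

# Apache "combined" log format, assumed here for illustration only.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

TRACE_NS = "http://example.org/uciad/trace#"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_line(line):
    """Parse one log line into a common dict structure, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parsed = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["time"] = parsed.isoformat()
    return record


def literal(value):
    """Serialize a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(record, trace_id, server):
    """Render one parsed record as a list of N-Triples lines."""
    subject = f"<{TRACE_NS}{server}/{trace_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{TRACE_NS}Trace> ."]
    for key in ("ip", "time", "method", "path", "status", "agent"):
        lines.append(f"{subject} <{TRACE_NS}{key}> {literal(record[key])} .")
    return lines


# Made-up sample line, not taken from any real log.
SAMPLE = ('127.0.0.1 - - [10/Oct/2011:13:55:36 +0100] '
          '"GET /robots.txt HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"')
record = parse_line(SAMPLE)
triples = to_ntriples(record, "t1", "server1")
```

Keeping the parser and the renderer separate mirrors the slide: a new log format only needs a new `parse_line`, while the RDF output stays the same.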

Ontologies
Key concepts to be represented:
  Actors (human users and robots)
  Sitemaps
  Traces (broad notion of logs)
  Activities
Reusing existing ontologies:
  FOAF: for people and documents
  Time Ontology: for traces
  Action ontology: for traces and activities
  (Planned) OPO: online presence
  (Planned) SIOC: online communities

Iterative and extensible construction of the ontologies Provide a base with actors, sitemaps and traces Specific extensions with typologies of activities, depending on user and site Dynamically building and integrating

Tool for analysis
Need a tool which, given
  a set of ontologies
  a data repository (which can be the overall one, one restricted by time, or one for a given user)
can provide a meaningful and interactive overview of the activity data
To be used to
  provide an ontology-specific view of data analytics
  support the iterative development of the ontologies
  provide a user centric view of the data

Example
In the ontology:
  /robots.txt is a RobotTXT page
  A Spider is a RobotAgent (ActorAgent)
  An agent used to access a RobotTXT is a Spider
  An AutomaticActivity is a Trace realized by a RobotAgent
Result: thousands of traces automatically classified as automatic activities.
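
The chain of definitions in this example can be mimicked in plain Python to show how one access to the robots file classifies every trace of the same agent. This is only a sketch of the reasoning, not how the triple store computes it: the dict-based trace structure is hypothetical, and the RobotTXT page is assumed to be the conventional `/robots.txt`.

```python
ROBOTS_PATH = "/robots.txt"  # assumption: the conventional crawler-policy file


def classify(traces):
    """Flag each trace as an automatic activity, following the example's rules."""
    # An agent used to access a RobotTXT page is a Spider, and a Spider is a RobotAgent.
    robot_agents = {t["agent"] for t in traces if t["path"] == ROBOTS_PATH}
    # An AutomaticActivity is a trace realized by a RobotAgent.
    return [dict(t, automatic=t["agent"] in robot_agents) for t in traces]


# Hypothetical traces: one crawler and one browser.
traces = [
    {"agent": "ExampleBot/1.0", "path": "/robots.txt"},
    {"agent": "ExampleBot/1.0", "path": "/index.html"},
    {"agent": "Firefox", "path": "/index.html"},
]
classified = classify(traces)
```

Note that the second trace is classified as automatic even though it does not touch the robots file itself; the agent was recognized through another trace.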

Example In the ontology: UCIAD-Blog and LUCERO-Blog are Blogs (Website) A BlogPage is a page which is part of a Blog An activity onBlog is an activity happening on a Blog Page Result: Can look specifically at activities happening on a Blog and specialize them (same applies to Wikis, and other types of websites)

Example
In the ontology:
  A SPARQLEndpoint is a specific type of Webpage
  AccessingSPARQLEndpoint is an activity on a SPARQLEndpoint
  SPARQLQueryParameter is a parameter with the name “query” used in an AccessingSPARQLEndpoint activity
  ExecutingSPARQLQuery is an AccessingSPARQLEndpoint activity attached to a SPARQLQueryParameter
Result: can explore the specific activity of executing SPARQL queries and its parameters
Can combine: detect the activity of automatically accessing a SPARQL endpoint, i.e. an activity that is both an automatic activity and accessing a SPARQL endpoint.
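
A rough procedural equivalent of these definitions, including the combination with automatic activities, might look like the sketch below. The endpoint path in `SPARQL_ENDPOINTS`, the trace structure and the combined class name `AutomaticallyAccessingSPARQLEndpoint` are all hypothetical; in UCIAD these classifications come from ontology reasoning rather than from hand-written code.

```python
from urllib.parse import parse_qs, urlsplit

SPARQL_ENDPOINTS = {"/sparql"}  # hypothetical set of pages typed as SPARQLEndpoint


def activity_types(trace):
    """Return the set of activity classes a single trace belongs to."""
    parts = urlsplit(trace["url"])
    types = set()
    if parts.path in SPARQL_ENDPOINTS:
        types.add("AccessingSPARQLEndpoint")
        # A parameter named "query" makes this an ExecutingSPARQLQuery activity.
        if "query" in parse_qs(parts.query):
            types.add("ExecutingSPARQLQuery")
    if trace.get("automatic"):
        types.add("AutomaticActivity")
    # Combined class: member of both AutomaticActivity and AccessingSPARQLEndpoint.
    if {"AutomaticActivity", "AccessingSPARQLEndpoint"} <= types:
        types.add("AutomaticallyAccessingSPARQLEndpoint")
    return types


bot_query = activity_types({"url": "/sparql?query=SELECT+%2A", "automatic": True})
page_view = activity_types({"url": "/index.html"})
```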

Next step: User support
Allow users to log in
  detect setting
  bring up the relevant data
  explore it
But also
  to customize the view of the data
  to extend the ontologies to provide a personalized analysis of activity data
  to export (interpreted) activity data for reuse

User support
[Flowchart: the user logs in or registers; the setting (agent + IP) is detected.
Known setting for user: display activity data related to all known settings of the user.
Unknown setting: prompt “It is the first time you log into UCIAD with this setting (detail). Do you want to attach it to your account?” (yes/no); check setting; if non-ambiguous, add setting to known settings; if ambiguous, register setting as ambiguous.]
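
The decision flow above could be sketched as a small registry. This is an illustration only: the slide does not say how ambiguity is checked, so the sketch assumes a setting is ambiguous when another user already has it among their known settings (for example, a shared computer). The `Registry` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    known: dict = field(default_factory=dict)    # user -> set of (agent, ip) settings
    ambiguous: set = field(default_factory=set)  # settings claimed by several users

    def login(self, user, setting, attach):
        """Handle a login and return the settings whose activity data to display.

        `attach` is the user's yes/no answer to the "attach this setting?" prompt;
        it is only consulted when the setting is not yet known for this user.
        """
        mine = self.known.setdefault(user, set())
        if setting not in mine and attach:
            # Assumed ambiguity check: does another user already own this setting?
            shared = any(setting in settings
                         for other, settings in self.known.items() if other != user)
            if shared:
                self.ambiguous.add(setting)
            else:
                mine.add(setting)
        return sorted(mine)


registry = Registry()
laptop = ("Firefox", "10.0.0.1")  # hypothetical (agent, IP) pair
alice_view = registry.login("alice", laptop, attach=True)
bob_view = registry.login("bob", laptop, attach=True)
```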

User support: data for a user
For a user <u>, the SPARQL query
  CONSTRUCT { ?trace ?p ?y . ?y ?q ?z }
  WHERE { <u> actor:hasKnownSetting ?s . ?trace trace:hasSetting ?s . ?trace ?p ?y . ?y ?q ?z }
builds the traces of activities around the known settings of <u>
Used to populate a specific repository with sub-spaces for each registered user
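
To make the effect of this query concrete, the sketch below evaluates the same graph pattern over an in-memory set of triples in plain Python, with no triple store involved. It assumes the second-hop pattern is `?y ?q ?z`, as the CONSTRUCT template suggests; the sample identifiers (`u:alice`, `set:1`, `setting:agent`, and so on) are invented for illustration.

```python
def traces_for_user(triples, user):
    """Return the triples the CONSTRUCT query would build for `user`."""
    # <u> actor:hasKnownSetting ?s
    settings = {o for s, p, o in triples
                if s == user and p == "actor:hasKnownSetting"}
    # ?trace trace:hasSetting ?s
    traces = {s for s, p, o in triples
              if p == "trace:hasSetting" and o in settings}
    result = set()
    # ?trace ?p ?y . ?y ?q ?z : both patterns must match for a solution to exist.
    for trace, p, y in triples:
        if trace not in traces:
            continue
        for subject, q, z in triples:
            if subject == y:
                result.add((trace, p, y))
                result.add((y, q, z))
    return result


# Hypothetical data: two traces, only one of which belongs to alice's setting.
triples = {
    ("u:alice", "actor:hasKnownSetting", "set:1"),
    ("trace:1", "trace:hasSetting", "set:1"),
    ("set:1", "setting:agent", "Firefox"),
    ("trace:2", "trace:hasSetting", "set:2"),
    ("set:2", "setting:agent", "ExampleBot"),
}
alice_graph = traces_for_user(triples, "u:alice")
```

As in SPARQL, a trace property is only included when its value has outgoing properties of its own, so plain literal values attached directly to a trace would not be exported by this pattern.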

Deployment, test, validation
At the moment, testing on websites of projects and events hosted on KMi servers:
  sssw.org, sssw09.org, loted.eu, lucero-project.info, uciad.info, data.open.ac.uk, lucero.open.ac.uk, …
Next level up, websites/systems from the main Open University website:
  www.open.ac.uk, study at the OU, podcasts.open.ac.uk, VLE
Extend to deployment of instances for specific projects with distributed websites

Challenges
Scalability
  The OWLIM triple store can handle billions of triples
  But it struggles with millions when inference is “on”
  => 1 repository without inference with all historical data, 1 with inference with 1 week of data only, and 1 with inference for registered users
User management and privacy
  Ensuring that the user who logs in from a particular setting is the one who performed the activity is difficult (e.g., in the case of shared computers)
  Is this really a problem? Check ambiguity, ask verification questions, moderate?
Distribution and IPR
  Code and ontologies under open licenses (small uncertainty regarding code developed in other projects)
  Overall data: privacy issues (is k-anonymity actually applicable? Would it work?)
  Overall data: institutional issues (can we show the traffic on our websites to everybody?)
  User data export: what license?
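
The three-repository split listed under Scalability amounts to a routing decision for each incoming trace. A minimal sketch, assuming the one-week window is counted in days from the load date; the repository names are hypothetical labels, not actual OWLIM repository identifiers.

```python
from datetime import date, timedelta

INFERENCE_WINDOW = timedelta(days=7)  # "1 week of data only"


def target_repositories(trace_date, user_registered, today):
    """Decide which repositories a trace should be loaded into."""
    # All historical data goes to the repository without inference.
    repositories = ["history-no-inference"]
    # Only recent traces go to the general repository with inference.
    if today - trace_date < INFERENCE_WINDOW:
        repositories.append("last-week-inference")
    # Traces of registered users are also kept with inference.
    if user_registered:
        repositories.append("registered-users-inference")
    return repositories


recent = target_repositories(date(2011, 6, 1), False, today=date(2011, 6, 3))
old_registered = target_repositories(date(2011, 1, 1), True, today=date(2011, 6, 3))
```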

Summary and dissemination
Promising initial results
  Can create new ways of analysis at run-time by editing the ontologies!
  Mechanisms to provide personal views on own activity data across websites
First version of the ontologies: ongoing task
First version of the tools: test and validate!
Dissemination
  Blog / Twitter #uciad
  KMi’s internal newsletter (KMi Planet)
  Salman’s paper at the ESWC 2011 PhD symposium: “Personal Semantics: Personal information management in the Web with Semantic Technologies”
  Position paper at the W3C Web tracking and privacy workshop: “Self-Tracking on the Web: Why and How”
  Submission to the Personal Semantic Data workshop at K-CAP 2011

More info UCIAD Blog: http://uciad.info Code base: http://github.com/uciad Twitter: #uciad @mdaquin
