Hafslund SESAM - Semantic integration in practice
 

    Presentation Transcript

    • SESAM
      Lars Marius Garshol, Baksia, 2013-02-13
      larsga@bouvet.no, http://twitter.com/larsga
    • Agenda
      • Project background
      • Functional overview
      • Under the hood
      • Similar projects (if time)
    • Lars Marius Garshol
      • Consultant in Bouvet since 2007
        – focus on information architecture and semantics
      • Worked with semantic technologies since 1999
        – mostly with Topic Maps
        – co-founder of Ontopia, later CTO
        – editor of several Topic Maps ISO standards from 2001 on
        – co-chair of the TMRA conference 2006-2011
        – developed several key Topic Maps technologies
        – consultant in a number of Topic Maps projects
      • Published a book on XML with Prentice-Hall
      • Implemented Unicode support in the Opera web browser
    • My role on the project
      • The overall architecture is the brainchild of Axel Borge
      • SDshare came from an idea by Graham Moore
      • I only contributed parts of the design
        – and some parts of the implementation
      • Don’t actually know the whole system
    • Hafslund SESAM
    • Hafslund ASA
      • Norwegian energy company
        – founded 1898
        – 53% owned by the city of Oslo
        – responsible for the energy grid around Oslo
        – 1.4 million customers
      • A conglomerate of companies
        – Nett (electricity grid)
        – Fjernvarme (district heating)
        – Produksjon (power generation)
        – Venture
        – ...
    • What if...?
      [diagram: ERP and CRM systems linked to work orders, bills, transformers, cables, meters, and customers]
    • Hafslund SESAM
      • An archive system, really
      • Generally, archive systems are glorified trash cans
        – putting a document in the archive effectively means hiding it
      • Because archives are not important, are they?
      • Except when you need that contract from 1937 about the right to build a power line across...
    • Problems with archives
      • Many documents aren’t there
        – even though they should be
        – because entering metadata is too much hassle
      • Poor metadata, because nobody bothers to enter it properly
        – yet much of the metadata exists in the user context
      • Not used by anybody
        – a strange, separate system with a poor interface
        – (and the metadata is poor, too)
      • Contains only documents
        – not connected to anything else
    • Goals for the SESAM project
      • Increase the percentage of archived documents
        – to do this, archiving must be made easy
      • Increase metadata quality
        – by automatically harvesting metadata
      • Make it easy to find archived documents
        – by building a well-tuned search application
    • What SESAM does
      • Automatically enrich document metadata
        – to do that we have to collect background data from the source systems
      • Connect document metadata with structured business data
        – this is a side-effect of the enrichment
      • Provide search across the whole information space
        – once we’ve collected the data this part is easy
    • As seen by the customer [screenshot]
    • High-level architecture
      [diagram: ERP, CRM, and Sharepoint feed the triple store over SDshare; documents flow to the archive over CMIS; the search engine is fed from the triple store over SDshare]
    • Main principle of data extraction
      • No canonical model!
        – instead, the data reflects the model of the source system
      • One ontology per source system
        – subtyped from the core ontology where possible
      • Vastly simplifies data extraction
        – for search purposes it loses us nothing
        – and translation is easier once the data is in the triple store
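    A minimal Turtle sketch of what per-source subtyping could look like; the erp: and crm: namespaces and the class and property names are invented for illustration:

      @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
      @prefix core: <http://example.com/core/> .
      @prefix erp:  <http://example.com/erp/> .
      @prefix crm:  <http://example.com/crm/> .

      # Each source system keeps its own classes and properties,
      # but ties them into the shared core ontology.
      erp:Supplier  rdfs:subClassOf    core:Contact .
      crm:Customer  rdfs:subClassOf    core:Contact .
      erp:name      rdfs:subPropertyOf core:name .
      crm:fullName  rdfs:subPropertyOf core:name .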
    • Simplified core ontology
    • Data structure in triple store
      [diagram: ERP, CRM, Sharepoint, and Archive data sets connected by sameAs links]
    • Connecting data
      • Contacts exist in many different systems
        – ERP has one set of contacts
        – these are mirrored in the archive
      • Different IDs in these two systems
        – using “sameAs” we know which contacts are the same
        – can do queries across the systems
        – can also translate IDs from one system to the other
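    For example, a SPARQL query along these lines could translate an ERP contact ID into the corresponding archive resource; the URIs are hypothetical:

      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX erp: <http://example.com/erp/>

      # Given an ERP contact, find the corresponding archive resource.
      SELECT ?archiveContact WHERE {
        erp:contact-4711 owl:sameAs ?archiveContact .
        # (in practice the link may point either way, so a real query
        #  might also match ?archiveContact owl:sameAs erp:contact-4711)
      }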
    • Duplicate suppression
      [diagram: ERP (suppliers, customers, companies), CRM (customers), and Billing (customers) feed into Duke, which writes owl:sameAs links to the data hub over SDshare]

      Field      Record 1    Record 2      Probability
      Name       acme inc    acme inc      0.9
      Assoc no   177477707                 0.5
      Zip code   9161        9161          0.6
      Country    norway      norway        0.51
      Address 1  mb 113      mailbox 113   0.49
      Address 2                            0.5

      http://code.google.com/p/duke/
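    Duke derives per-field match probabilities like those above and combines them Bayes-style into one overall probability. A minimal Python sketch of that combination step, not Duke’s actual code:

      def combine(probabilities):
          """Combine per-field match probabilities naive-Bayes style:
          evidence for a match is the product of the p's, evidence
          against is the product of the (1-p)'s. A p of 0.5 is
          neutral and does not move the result."""
          p_match, p_nonmatch = 1.0, 1.0
          for p in probabilities:
              p_match *= p
              p_nonmatch *= 1.0 - p
          return p_match / (p_match + p_nonmatch)

      # The field probabilities from the slide's example comparison:
      print(combine([0.9, 0.5, 0.6, 0.51, 0.49, 0.5]))  # ~0.93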
    • When archiving
      • The user works on the document in some system
        – ERP, CRM, whatever
      • This system knows the context
        – what user, project, equipment, etc. is involved
      • This information is passed to the CMIS server
        – it uses information already gathered in the triple store to attach more metadata
    • Auto-tagging
      [diagram: a document sent to the archive is automatically tagged with the related manager, project, customer, work order, and equipment]
    • Archive integration
      • Clients deliver documents via CMIS
        – an OASIS standard for CMS interfaces
        – lots of available implementations
      • Metadata translation
        – not only auto-tagging
        – client vocabulary translated to the archive vocabulary
        – required static metadata added automatically
      • Benefits
        – clients can reuse existing software for connecting
        – the interface is independent of the actual archive
        – the archive integration is reusable
      [diagram: clients deliver over CMIS to the archive]
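    A sketch of what document delivery over CMIS can look like from the client side, here using the open source cmislib library; the endpoint URL, credentials, folder path, file name, and properties are placeholders, not the project’s actual setup:

      from cmislib import CmisClient

      # Connect to a CMIS-compliant repository (URL and credentials
      # are placeholders; any CMIS server exposes an AtomPub endpoint).
      client = CmisClient('http://archive.example.com/cmis/atom',
                          'user', 'secret')
      repo = client.defaultRepository

      # Deliver a document with whatever metadata the client knows;
      # the CMIS server can then enrich it from the triple store.
      folder = repo.getObjectByPath('/incoming')
      with open('contract.pdf', 'rb') as f:
          folder.createDocument('contract.pdf',
                                properties={'cmis:objectTypeId': 'cmis:document'},
                                contentFile=f,
                                contentType='application/pdf')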
    • Archive integration elsewhere
      • Generally, archive integrations are hard
        – web site integration: 400 hours
        – integration #2: performance problems
      • They also reproduce the same functionality over and over again
      • Hard bindings against
        – the internal archive model
        – the archive interface
      • Switching archive system will be very hard...
        – even upgrading is hard
      [diagram: System #1, System #2, the regulation system, Outlook, and the web site all bound directly to the archive]
    • Showing context in the ERP system
    • Access control
      • Users only see objects they’re allowed to see
      • Implemented by the search engine
        – all objects have lists of users/groups allowed to see them
        – on login, a SPARQL query lists the user’s access control group memberships
        – the search engine uses this to filter search results
      • In some cases, complex access rules are run to resolve ACLs before loading into the triple store
        – e.g. the archive system
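    The login-time lookup could be a SPARQL query of roughly this shape; the ac: vocabulary and the username property are invented for illustration:

      PREFIX ac: <http://example.com/accesscontrol/>

      # List all access control groups the logged-in user belongs to,
      # so the search engine can filter results against them.
      SELECT ?group WHERE {
        ?user ac:username "jdoe" ;
              ac:member-of ?group .
      }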
    • Data volumes
      Graph             Statements
      IFS data            5,417,260
      Public 360 data     3,725,963
      GeoNIS data            44,242
      Tieto CAB data    138,521,810
      Hummingbird 1      32,619,140
      Hummingbird 2     165,671,179
      Hummingbird 3     192,930,188
      Hummingbird 4      48,623,178
      Address data        2,415,315
      Siebel data        36,117,786
      Duke links              4,858
      Total             626,090,919
    • Under the hood
    • RDF
      • The data hub is an RDF database
        – RDF = Resource Description Framework
        – a W3C standard for data
        – also an interchange format and a query language (SPARQL)
      • Many implementations of the standards
        – lots of alternatives to choose between
      • The technology is relatively well known
        – books, specifications, courses, conferences, ...
    • How RDF works
      ‘PERSON’ table
      ID  NAME                 EMAIL
      1   Stian Danenbarger    stian.danenbarger@
      2   Lars Marius Garshol  larsga@bouvet.no
      3   Axel Borge           axel.borge@bouvet

      RDF-ized data
      SUBJECT                      PROPERTY  OBJECT
      http://example.com/person/1  rdf:type  ex:Person
      http://example.com/person/1  ex:name   Stian Danenbarger
      http://example.com/person/1  ex:email  stian.danenbarger@
      http://example.com/person/2  rdf:type  ex:Person
      http://example.com/person/2  ex:name   Lars Marius Garshol
      ...                          ...       ...
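    The same data again in Turtle, a more compact RDF syntax than the tabular view; the ex: namespace URI is invented to match the slide’s example prefix, and the truncated email is kept as shown:

      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
      @prefix ex:  <http://example.com/ontology/> .

      # One resource per row of the PERSON table; columns become properties.
      <http://example.com/person/1> rdf:type ex:Person ;
          ex:name  "Stian Danenbarger" ;
          ex:email "stian.danenbarger@" .

      <http://example.com/person/2> rdf:type ex:Person ;
          ex:name  "Lars Marius Garshol" ;
          ex:email "larsga@bouvet.no" .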
    • RDF/XML
      • The standard XML format for RDF
        – not really very nice
      • However, it’s a generic format
        – that is, it’s the same format regardless of what data you’re transmitting
      • Therefore
        – all data flows are in the same format
        – absolutely no syntax transformations whatsoever
    • Plug-and-play
      • Data source integrations are pluggable
        – if you need another data source, connect and pull in the data
      • The RDF database is schemaless
        – no need to create tables and columns beforehand
      • No common data model
        – no need to transform data to store it
      [diagram: System #1 and System #2 plugged into the data hub]
    • Product queue
      • Since sources are pluggable we can guide the project with a product queue
        – consisting of new sources and data elements
      • We analyse the items in the queue
        – estimate complexity and amount of work
        – also to see how each item fits into what’s already there
      • The customer then decides what to do, and in what order
    • Don’t push!
      • The general IT approach is to push data
        – the source calls services provided by the recipient
      • Leads to high complexity
        – two-way dependency between systems
        – have to extend the source system with extra logic
        – extra triggers/threads in the source system
        – many moving parts
    • Pull!
      • We let the recipient pull
        – always using the same protocol and the same format
      • Wrap the source
        – the wrapper must support 3 simple functions
      • A different kind of solution
        – one-way dependency
        – often zero code
        – the wrapper is thin and stateless
        – data moving is done by reused code
    • The data integration
      • All data transport is done by SDshare
      • A simple Atom-based specification for synchronizing RDF data
        – http://www.sdshare.org
      • Provides two main features
        – snapshots of the data
        – fragments for each updated resource
    • Basics of SDshare
      • The source offers
        – a dump of the entire data set
        – a list of fragments changed since time t
        – a dump of each fragment
      • Completely generic solution
        – always the same protocol
        – always the same data format (RDF/XML)
      • A generic SDshare client then transfers the data
        – to the recipient, whatever it is
    • SDshare service structure
    • Typical usage of SDshare
      • The client downloads the snapshot
        – the client now has the complete data set
      • The client polls the fragment feed
        – each time asking for new fragments since the last check
        – the client keeps track of the time of the last check
        – fragments are applied to the data, keeping the copies in sync
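    A minimal sketch of such a polling client in Python, under the assumption of a fragment feed URL taking a since parameter; a real client would handle errors and actually apply the RDF fragments to the local store:

      import time
      import urllib.request
      import xml.etree.ElementTree as ET

      ATOM = '{http://www.w3.org/2005/Atom}'
      FEED = 'http://source.example.com/sdshare/fragments'  # placeholder URL

      def apply_fragment(rdf_xml):
          """Stub: replace all statements about the fragment's resource
          in the local store with the statements from rdf_xml."""
          pass

      last_check = '1970-01-01T00:00:00Z'
      while True:
          # Ask only for fragments changed since the last check.
          with urllib.request.urlopen(FEED + '?since=' + last_check) as resp:
              feed = ET.parse(resp).getroot()
          for entry in feed.findall(ATOM + 'entry'):
              link = entry.find(ATOM + "link[@rel='fragment']")
              with urllib.request.urlopen(link.get('href')) as frag:
                  apply_fragment(frag.read())  # RDF/XML payload
          updated = feed.find(ATOM + 'updated')
          if updated is not None:
              last_check = updated.text
          time.sleep(60)  # poll interval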
    • Implementing the fragment feed

      select objid, objtype, change_time
      from history_log
      where change_time > :since:
      order by change_time asc

      <atom>
        <title>Fragments for ...</title>
        ...
        <entry>
          <title>Change to 34121</title>
          <link rel="fragment" href="..."/>
          <sdshare:resource>http://...</sdshare:resource>
          <updated>2012-09-06T08:22:23</updated>
        </entry>
        <entry>
          <title>Change to 94857</title>
          <link rel="fragment" href="..."/>
          <sdshare:resource>http://...</sdshare:resource>
          <updated>2012-09-06T08:22:24</updated>
        </entry>
        ...
    • The SDshare client
      [diagram: a frontend and core, with a SPARQL backend writing to the triple store and a POST backend writing to web services]
      http://code.google.com/p/sdshare-client/
    • Getting data out of the triple store
      • Set up SPARQL queries to extract the data
      • The SDshare server does the rest
      • Queries can be configured to produce
        – any subset of the data
        – data in any shape
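    For example, a configured CONSTRUCT query might expose just the contacts, reshaped into a feed vocabulary on the way out; all the names here are illustrative:

      PREFIX core: <http://example.com/core/>
      PREFIX out:  <http://example.com/feed/>

      # Expose only contacts, reshaped into the feed vocabulary.
      CONSTRUCT {
        ?c a out:Contact ;
           out:name ?name .
      } WHERE {
        ?c a core:Contact ;
           core:name ?name .
      }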
    • Contacts into the archive
      • We want some resources in the triple store to be written into the archive as “contacts”
        – we need to select which resources to include
        – we must also transform from the source data model
      • How to achieve this without hard-wiring anything?
    • Contacts solution
      • Create a generic archive object writer
        – the type of the RDF resource specifies the type of object to create
        – the name of the RDF property (within a namespace) specifies which property to set
      • Set up an RDF mapping from the source data
        – type1 maps-to type2
        – prop1 maps-to prop2
        – only mapped types/properties are included
      • Use SPARQL to
        – create the SDshare feed
        – do the data translation with a CONSTRUCT query
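    A sketch of what the mapping-driven CONSTRUCT query could look like; map:maps-to stands in for whatever mapping property the project actually used. Unmapped types and properties simply fail to bind, so they are excluded automatically:

      PREFIX map: <http://example.com/mapping/>

      # Translate source resources into archive objects, driven by
      # the RDF mapping rather than hard-wired vocabulary names.
      CONSTRUCT {
        ?resource a ?archiveType ;
                  ?archiveProp ?value .
      } WHERE {
        ?resource a ?sourceType ;
                  ?sourceProp ?value .
        ?sourceType map:maps-to ?archiveType .
        ?sourceProp map:maps-to ?archiveProp .
      }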
    • Properties of the system
      • Uniform integration approach
        – everything is done the same way
      • Really simple integration
        – setting up a data source is generally very easy
      • Loose bindings
        – components can easily be replaced
      • Very little state
        – most components are stateless (or have little state)
      • Idempotent
        – applying a fragment 1 or many times: same result
      • Clear and reload
        – can delete everything and reload at any time
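    The idempotency comes from how fragments are applied: all old statements about the resource are replaced wholesale, so reapplying the same fragment changes nothing. In SPARQL Update terms, roughly (the resource URI and data are invented):

      # Step 1: drop everything currently known about the resource.
      DELETE WHERE {
        <http://example.com/erp/contact-4711> ?p ?o .
      } ;

      # Step 2: insert the statements from the fragment.
      INSERT DATA {
        <http://example.com/erp/contact-4711>
            a <http://example.com/erp/Contact> ;
            <http://example.com/erp/name> "Acme Inc" .
      }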
    • Conclusion: SESAM
    • Project outcome
      • Big, innovative project
        – many developers over a long time period
        – innovated a number of techniques and tools
      • Delivered on time, and working
        – despite an evolving project context
      • Very satisfied customer
        – they claim the project has already paid for itself in cost savings at the document center
    • Archive of the year 2012
      ”The organization has been innovative and strategic in its use of technology, both for the old paper archives and for the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter a single piece of metadata. They are now down to two – everything else is added automatically by the solution – but the goal is still to get down to just one.”
      http://www.arkivrad.no/utdanning.html?newsid=10677
    • Highly flexible solution
      • The ERP integration was replaced twice
        – without affecting other components
      • The CRM system was replaced while in production
        – all customers got new IDs
        – Sesam hid this from users, making the transition 100% transparent
    • Status now
      • In production since autumn 2011
      • Used by
        – Hafslund Fjernvarme
        – the Hafslund support centre
    • Other implementations
      • A very similar system has been implemented for Statkraft
      • A system based on the same architecture has been implemented for DSS
        – basically an intranet system
    • DSS Intranet
    • DSS project background
      • Delivers IT services and platform to the ministries
        – archive, intranet, email, ...
      • The intranet is currently based on EPiServer
        – in addition, they want to use Sharepoint
        – and of course there are many other related systems...
      • How to integrate all this?
        – integrations must be loose, as systems in 13 ministries come and go
    • Metadata structure
      [diagram, translated from Norwegian: persons, organizational units (section, department, ministry, enterprise), topics/work areas, archive keys, projects/cases (with case numbers), case/process types, and documents/content, connected by relations such as “is responsible for”, “is about”, “belongs to / has a role in”, “is involved in”, “is relevant for”, “is expert on”, “works with”, and “has written”; documents carry title, status, document type, file type, and date, while persons carry title, telephone, and address]
    • Project scope
      [diagram: EPiServer (intranet), Sharepoint, and IDM, exchanging user information and access groups]
    • How we built it
    • The web part
      • Displays data given
        – a query
        – an XSLT stylesheet
      • Can run anywhere
        – standard protocol to the database
      • Easy to include in other systems, too
      [diagram: EPiServer and Sharepoint web parts querying the Virtuoso RDF DB over SPARQL; Novell IDM, Active Directory, and Regjeringen.no feed the database over SDshare]
    • Data flow (simplified)
      [diagram: Sharepoint (workspaces, documents, lists), EPiServer (org structure, categories, page structure), and Novell IDM (employees, org structure) all flow into the Virtuoso RDF DB]
    • Independence of source
      • Solutions need to be independent of the data source
        – when we extract data, or
        – when we run queries against the data hub
      • Independence isolates clients from changes to the sources
        – allows us to replace sources completely, or
        – change the source models
      • Of course, total independence is impossible
        – if the structures are too different it won’t work
        – but surprisingly often they’re not
    • Independence of source #1
      [diagram: a core model where core:Person is linked to core:Project by core:participant; in IDM, idm:Person is linked to idm:Project by idm:has-member; in Sharepoint, sp:Person is linked to sp:Project by sp:member-of]
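    If idm:has-member and sp:member-of are declared as rdfs:subPropertyOf core:participant, one query against the core vocabulary covers both sources on a store with RDFS inference; without inference, a property path can enumerate the alternatives. A sketch with invented namespace URIs and link direction:

      PREFIX core: <http://example.com/core/>
      PREFIX idm:  <http://example.com/idm/>
      PREFIX sp:   <http://example.com/sp/>

      # Who participates in which project, regardless of source system?
      SELECT ?person ?project WHERE {
        # With RDFS inference this would just be:
        #   ?project core:participant ?person .
        ?project (core:participant|idm:has-member|sp:member-of) ?person .
      }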
    • Independence of source #2
      • “Facebook streams” require queries across sources
        – every source has its own model
      • We can reduce the differences with annotation
        – i.e. we describe the data with RDF
      • Allows data models to
        – change, or
        – be plugged in
        – with no changes to the query

      # "opprettet" = created, "endret" = updated, "ble ansatt i" = was hired by
      sp:doc-created a dss:StreamEvent;
        rdfs:label "opprettet";
        dss:time-property sp:Created;
        dss:user-property sp:Author.

      sp:doc-updated a dss:StreamEvent;
        rdfs:label "endret";
        dss:time-property sp:Modified;
        dss:user-property sp:Editor.

      idm:ny-ansatt a dss:StreamEvent;
        rdfs:label "ble ansatt i";
        dss:time-property idm:ftAnsattFra .
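    A stream query can then be driven entirely by the annotations; a sketch, with invented namespace URIs:

      PREFIX dss:  <http://example.com/dss/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

      # Build the event stream generically: the annotations say which
      # properties hold the timestamp and the user for each event type.
      SELECT ?thing ?label ?user ?time WHERE {
        ?event a dss:StreamEvent ;
               rdfs:label ?label ;
               dss:time-property ?timeProp .
        ?thing ?timeProp ?time .
        OPTIONAL {
          ?event dss:user-property ?userProp .
          ?thing ?userProp ?user .
        }
      }
      ORDER BY DESC(?time)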
    • Conclusion
    • Execution
      • The customer asked for a start on August 15 and delivery January 1
        – sources: EPiServer, Sharepoint, IDM
      • We offered a fixed price, done November 1
      • Delivered October 20
        – also integrated with Active Directory
        – added Regjeringen.no for extra value
      • Integration with ACOS Websak
        – has been analyzed and described
        – but not implemented
    • The data sources
      Source            Connector
      Active Directory  LDAP
      IDM               LDAP
      Intranet          EPiServer
      Regjeringen.no    EPiServer
      Sharepoint        Sharepoint
    • Conclusion
      • We could do this so quickly because
        – the architecture is right
        – we have components we can reuse
        – the architecture allows us to plug the components together in different settings
      • Once we have the data, the rest is usually simple
        – it’s getting the data that’s hard