Hafslund SESAM - Semantic integration in practice
  1. SESAM. Lars Marius Garshol, Baksia, 2013-02-13. larsga@bouvet.no, http://twitter.com/larsga
  2. Agenda • Project background • Functional overview • Under the hood • Similar projects (if time)
  3. Lars Marius Garshol • Consultant in Bouvet since 2007 – focus on information architecture and semantics • Worked with semantic technologies since 1999 – mostly with Topic Maps – co-founder of Ontopia, later CTO – editor of several Topic Maps ISO standards from 2001 – co-chair of the TMRA conference 2006-2011 – developed several key Topic Maps technologies – consultant in a number of Topic Maps projects • Published a book on XML with Prentice-Hall • Implemented Unicode support in the Opera web browser
  4. My role on the project • The overall architecture is the brainchild of Axel Borge • SDshare came from an idea by Graham Moore • I only contributed parts of the design – and some parts of the implementation • Don't actually know the whole system
  5. Hafslund SESAM
  6. Hafslund ASA • Norwegian energy company – founded 1898 – 53% owned by the city of Oslo – responsible for the energy grid around Oslo – 1.4 million customers • A conglomerate of companies – Nett (electricity grid) – Fjernvarme (district heating) – Produksjon (power generation) – Venture – ...
  7. What if...? (diagram: an ERP system holding work orders, transformers, cables and meters, and a CRM system holding bills and customers, linked together)
  8. Hafslund SESAM • An archive system, really • Generally, archive systems are glorified trash cans – putting it in the archive effectively means hiding it • Because archives are not important, are they? • Except when you need that contract from 1937 about the right to build a power line across...
  9. Problems with archives • Many documents aren't there – even though they should be – because entering metadata is too much hassle • Poor metadata, because nobody bothers to enter it properly – yet much of the metadata exists in the user context • Not used by anybody – strange, separate system with poor interface – (and the metadata is poor, too) • Contains only documents – not connected to anything else
  10. Goals for the SESAM project • Increase the percentage of archived documents – to do this, archiving must be made easy • Increase metadata quality – by automatically harvesting metadata • Make it easy to find archived documents – by building a well-tuned search application
  11. What SESAM does • Automatically enrich document metadata – to do that we have to collect background data from the source systems • Connect document metadata with structured business data – this is a side-effect of the enrichment • Provide search across the whole information space – once we've collected the data this part is easy
  12. As seen by customer
  13. High-level architecture (diagram: ERP, CRM and Sharepoint feed the triple store via SDshare; documents are delivered to the archive via CMIS; the search engine is fed from the triple store via SDshare)
  14. Main principle of data extraction • No canonical model! – instead, data reflects the model of the source system • One ontology per source system – subtyped from the core ontology where possible • Vastly simplifies data extraction – for search purposes it loses us nothing – and translation is easier once the data is in the triple store
  15. Simplified core ontology
  16. Data structure in triple store (diagram: ERP, CRM and Sharepoint resources connected to archive resources by sameAs links)
  17. Connecting data • Contacts in many different systems – ERP has one set of contacts – these are mirrored in the archive • Different IDs in these two systems – using "sameAs" we know which are the same – can do queries across the systems – can also translate IDs from one system to the other
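The sameAs linking used for ID translation can be sketched in a few lines of Python. This is purely illustrative, not SESAM code; all URIs and prefixes below are invented:

```python
def same_as_clusters(pairs):
    """Group URIs connected by sameAs links into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:          # path compression
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return clusters

def translate(clusters, uri, system_prefix):
    """Find the equivalent of `uri` in another system, by URI prefix."""
    for cluster in clusters.values():
        if uri in cluster:
            for other in cluster:
                if other.startswith(system_prefix):
                    return other
    return None

# Example: an ERP contact mirrored in the archive, which is in turn
# linked to a CRM customer; all three identifiers end up in one cluster.
clusters = same_as_clusters([("erp:contact/42", "arch:contact/99"),
                             ("arch:contact/99", "crm:customer/7")])
```

Because sameAs is transitive, translation works even when two systems are only connected through a third, as in the example above.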
  18. Duplicate suppression • Suppliers and customers from ERP, CRM and Billing are compared by Duke (http://code.google.com/p/duke/), which writes owl:sameAs links into the data hub over SDshare. Example comparison:

     Field      Record 1   Record 2     Probability
     Name       acme inc   acme inc     0.9
     Assoc no   177477707               0.5
     Zip code   9161       9161         0.6
     Country    norway     norway       0.51
     Address 1  mb 113     mailbox 113  0.49
     Address 2                          0.5
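Duke combines the per-field probabilities Bayesian-style, with 0.5 meaning "no evidence either way". A minimal sketch of that combination (this mirrors Duke's documented approach, not its actual code), applied to the numbers in the table above:

```python
def combine(probabilities):
    """Naive Bayesian combination of per-field match probabilities:
    0.5 is neutral, higher values pull towards a match."""
    p_match, p_nonmatch = 1.0, 1.0
    for p in probabilities:
        p_match *= p
        p_nonmatch *= 1.0 - p
    return p_match / (p_match + p_nonmatch)

# Field probabilities from the slide's example comparison:
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
```

With these numbers the combined probability comes out above 0.9, so the two records would be linked with owl:sameAs despite the mismatched address fields.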
  19. When archiving • The user works on the document in some system – ERP, CRM, whatever • This system knows the context – what user, project, equipment, etc. is involved • This information is passed to the CMIS server – it uses already gathered information from the triple store to attach more metadata
  20. Auto-tagging (diagram: a document sent to the archive is automatically tagged with the work order, and through it the project, manager, customer and equipment)
  21. Archive integration • Clients deliver documents via CMIS – an OASIS standard for CMS interfaces – lots of available implementations • Metadata translation – not only auto-tagging – client vocabulary translated to archive vocabulary – required static metadata added automatically • Benefits – clients can reuse existing software for connecting – interface is independent of the actual archive – archive integration is reusable
  22. Archive integration elsewhere • Generally, archive integrations are hard – web site integration: 400 hours – #2 integration: performance problems • They also reproduce the same functionality over and over again • Hard bindings against – internal archive model – archive interface • Switching archive system will be very hard... – even upgrading is hard (diagram: a regulation system, Outlook, a web site and other systems each bound directly to the archive)
  23. Showing context in the ERP system
  24. Access control • Users only see objects they're allowed to see • Implemented by the search engine – all objects have lists of users/groups allowed to see them – on login, a SPARQL query lists the user's access control group memberships – the search engine uses this to filter search results • In some cases, complex access rules are run to resolve ACLs before loading into the triple store – e.g. the archive system
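The filtering step can be sketched as a simple intersection test. This is a toy model, not the search engine's implementation, and the field names are made up:

```python
def filter_hits(hits, user_principals):
    """Keep only hits whose allow-list intersects the set of
    users/groups the logged-in user belongs to."""
    return [hit for hit in hits
            if user_principals & set(hit["allowed"])]

# Hypothetical search results, each carrying its allow-list:
hits = [{"id": 1, "allowed": ["grp-fjernvarme"]},
        {"id": 2, "allowed": ["grp-nett", "larsga"]}]
```

In practice the search engine does this natively via filter queries, which is why the SPARQL lookup of group memberships only needs to run once, at login.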
  25. Data volumes

     Graph             Statements
     IFS data            5,417,260
     Public 360 data     3,725,963
     GeoNIS data            44,242
     Tieto CAB data    138,521,810
     Hummingbird 1      32,619,140
     Hummingbird 2     165,671,179
     Hummingbird 3     192,930,188
     Hummingbird 4      48,623,178
     Address data        2,415,315
     Siebel data        36,117,786
     Duke links              4,858
     Total             626,090,919
  27. Under the hood
  28. RDF • The data hub is an RDF database – RDF = Resource Description Framework – a W3C standard for data – also an interchange format and query language (SPARQL) • Many implementations of the standards – lots of alternatives to choose between • Technology relatively well known – books, specifications, courses, conferences, ...
  29. How RDF works

     'PERSON' table
     ID  NAME                 EMAIL
     1   Stian Danenbarger    stian.danenbarger@
     2   Lars Marius Garshol  larsga@bouvet.no
     3   Axel Borge           axel.borge@bouvet

     RDF-ized data
     SUBJECT                      PROPERTY  OBJECT
     http://example.com/person/1  rdf:type  ex:Person
     http://example.com/person/1  ex:name   Stian Danenbarger
     http://example.com/person/1  ex:email  stian.danenbarger@
     http://example.com/person/2  rdf:type  ex:Person
     http://example.com/person/2  ex:name   Lars Marius Garshol
     ...                          ...       ...
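The table-to-triples conversion shown above can be sketched generically. The URI scheme and the `ex:` prefix are invented for the example, matching the made-up names on the slide:

```python
def row_to_triples(base, table, row_id, row):
    """Turn one relational row into RDF triples: one rdf:type triple,
    plus one triple per non-empty column."""
    subject = "%s/%s/%s" % (base, table.lower(), row_id)
    triples = [(subject, "rdf:type", "ex:" + table.capitalize())]
    for column, value in row.items():
        if value:
            triples.append((subject, "ex:" + column.lower(), value))
    return triples
```

Because the mapping is purely mechanical, every source table can be "RDF-ized" the same way without designing a schema first; this is what makes the no-canonical-model approach cheap.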
  30. RDF/XML • Standard XML format for RDF – not really very nice • However, it's a generic format – that is, it's the same format regardless of what data you're transmitting • Therefore – all data flows are in the same format – absolutely no syntax transformations whatsoever
  31. Plug-and-play • Data source integrations are pluggable – if you need another data source, connect and pull in the data • RDF database is schemaless – no need to create tables and columns beforehand • No common data model – do not need to transform data to store it (diagram: System #1 and System #2 plugged into the data hub)
  32. Product queue • Since sources are pluggable we can guide the project with a product queue – consisting of new sources and data elements • We analyse the items in the queue – estimate complexity and amount of work – also to see how it fits into what's already there • Customer then decides what to do, and in what order
  33. Don't push! • The general IT approach is to push data – source calls services provided by the recipient • Leads to high complexity – two-way dependency between systems – have to extend the source system with extra logic – extra triggers/threads in the source system – many moving parts
  34. Pull! • We let the recipient pull – always using the same protocol and the same format • Wrap the source – the wrapper must support 3 simple functions • A different kind of solution – one-way dependency – often zero code – wrapper is thin and stateless – data moving done by reused code
  35. The data integration • All data transport done by SDshare • A simple Atom-based specification for synchronizing RDF data – http://www.sdshare.org • Provides two main features – snapshot of the data – fragments for each updated resource
  36. Basics of SDshare • Source offers – a dump of the entire data set – a list of fragments changed since time t – a dump of each fragment • Completely generic solution – always the same protocol – always the same data format (RDF/XML) • A generic SDshare client then transfers the data – to the recipient, whatever it is
  37. SDshare service structure
  38. Typical usage of SDshare • Client downloads snapshot – client now has the complete data set • Client polls the fragment feed – each time asking for new fragments since the last check – client keeps track of the time of the last check – fragments are applied to the data, keeping them in sync
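The polling step can be sketched with stdlib XML parsing. This is an illustrative client fragment, not the actual SDshare client; the sdshare namespace URI is an assumption (check the spec at sdshare.org), and the sample feed is made up:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
SDSHARE = "{http://www.sdshare.org/2012/core}"  # assumed namespace URI

def parse_fragment_feed(xml_text):
    """Extract (resource, updated) pairs from a fragment feed, plus the
    newest timestamp, which becomes the `since` value for the next poll."""
    entries, latest = [], None
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        resource = entry.findtext(SDSHARE + "resource")
        updated = entry.findtext(ATOM + "updated")
        entries.append((resource, updated))
        if latest is None or updated > latest:
            latest = updated
    return entries, latest

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sdshare="http://www.sdshare.org/2012/core">
  <entry><sdshare:resource>http://example.com/r/34121</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated></entry>
  <entry><sdshare:resource>http://example.com/r/94857</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated></entry>
</feed>"""
```

The client would fetch each listed resource's fragment, apply it, and persist the latest timestamp so the next poll only asks for newer changes.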
  39. Implementing the fragment feed

     select objid, objtype, change_time
     from history_log
     where change_time > :since:
     order by change_time asc

     <atom>
       <title>Fragments for ...</title>
       ...
       <entry>
         <title>Change to 34121</title>
         <link rel="fragment" href="..."/>
         <sdshare:resource>http://...</sdshare:resource>
         <updated>2012-09-06T08:22:23</updated>
       </entry>
       <entry>
         <title>Change to 94857</title>
         <link rel="fragment" href="..."/>
         <sdshare:resource>http://...</sdshare:resource>
         <updated>2012-09-06T08:22:24</updated>
       </entry>
       ...
  40. The SDshare client (diagram: a frontend and core with pluggable backends – a SPARQL backend writing to a triple store, and a POST backend writing to web services) http://code.google.com/p/sdshare-client/
  41. Getting data out of the triple store • Set up SPARQL queries to extract the data • Server does the rest • Queries can be configured to produce – any subset of data – data in any shape (diagram: an SDshare server running SPARQL queries against the RDF store)
  42. Contacts into the archive • We want some resources in the triple store to be written into the archive as "contacts" – need to select which resources to include – must also transform from the source data model • How to achieve this without hard-wiring anything?
  43. Contacts solution • Create a generic archive object writer – type of RDF resource specifies the type of object to create – name of RDF property (within a namespace) specifies which property to set • Set up an RDF mapping from the source data – type1 maps-to type2 – prop1 maps-to prop2 – only mapped types/properties included • Use SPARQL to – create the SDshare feed – do the data translation with a CONSTRUCT query
  44. Properties of the system • Uniform integration approach – everything is done the same way • Really simple integration – setting up a data source is generally very easy • Loose bindings – components can easily be replaced • Very little state – most components are stateless (or have little state) • Idempotent – applying a fragment 1 or many times: same result • Clear and reload – can delete everything and reload at any time
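The maps-to translation above (which the CONSTRUCT query performs in SESAM) can be sketched as a simple filter-and-rename pass over triples. All vocabulary names here are invented for illustration:

```python
# Hypothetical mapping from a source vocabulary to the archive vocabulary:
TYPE_MAP = {"erp:Supplier": "arch:Contact"}
PROP_MAP = {"erp:companyName": "arch:name",
            "erp:orgNumber": "arch:assocNo"}

def translate_fragment(triples):
    """Keep only mapped types and properties, renamed to the archive
    vocabulary; everything unmapped is silently dropped."""
    out = []
    for s, p, o in triples:
        if p == "rdf:type" and o in TYPE_MAP:
            out.append((s, p, TYPE_MAP[o]))
        elif p in PROP_MAP:
            out.append((s, PROP_MAP[p], o))
    return out
```

Because the mapping itself is data (RDF statements in the hub), new source types can be routed to the archive by adding maps-to statements, with no code changes.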
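The idempotency property is easy to demonstrate: a fragment replaces everything known about one resource, so reapplying it is a no-op. A toy model with triples as tuples (not the actual triple-store code):

```python
def apply_fragment(store, resource, fragment):
    """Apply an SDshare fragment: drop all triples about `resource`,
    then add the fragment's triples. Applying twice == applying once."""
    store[:] = [t for t in store if t[0] != resource]
    store.extend(fragment)
    return store
```

This delete-then-insert discipline is also what makes "clear and reload" safe: since fragments carry complete per-resource state, no history needs replaying.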
  45. Conclusion: SESAM
  46. Project outcome • Big, innovative project – many developers over a long time period – innovated a number of techniques and tools • Delivered on time, and working – despite an evolving project context • Very satisfied customer – they claim the project has already paid for itself in cost savings at the document center
  47. Archive of the year 2012. "The organisation has been innovative and strategic in its use of technology, both for the old paper archives and for the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter a single piece of metadata. They are now down to two (everything else is added automatically by the solution), but the goal is still to get down to just one." http://www.arkivrad.no/utdanning.html?newsid=10677
  48. Highly flexible solution • ERP integration replaced twice – without affecting other components • CRM system replaced while in production – all customers got new IDs – Sesam hid this from users, making the transition 100% transparent
  49. Status now • In production since autumn 2011 • Used by – Hafslund Fjernvarme – Hafslund support centre
  50. Other implementations • A very similar system has been implemented for Statkraft • A system based on the same architecture has been implemented for DSS – basically an intranet system
  51. DSS Intranet
  52. DSS project background • DSS delivers IT services and platform to the ministries – archive, intranet, email, ... • Intranet currently based on EPiServer – in addition, they want to use Sharepoint – and of course many other related systems... • How to integrate all this? – integrations must be loose, as systems in 13 ministries come and go
  53. Metadata structure (diagram, in Norwegian: persons, organisational units (section, department, ministry, agency), topics/work areas, archive keys, projects/cases with case numbers and case/process types, and documents (content, description, title, status, dates, document type, file type), connected by relations such as "is responsible for", "is about", "is involved in", "belongs to", "is relevant for", "works on" and "is expert on")
  54. Project scope (diagram: EPiServer (intranet), Sharepoint and IDM, exchanging user information and access groups)
  55. How we built it
  56. The web part • Displays data given – a query – an XSLT stylesheet • Can run anywhere – standard protocol to the database • Easy to include in other systems, too (diagram: web parts in EPiServer and Sharepoint query a Virtuoso RDF database over SPARQL; the database is fed via SDShare from Novell IDM, Active Directory and Regjeringen.no)
  57. Data flow (simplified) (diagram: Sharepoint contributes workspaces, documents and lists; EPiServer contributes page structure and categories; Novell IDM contributes employees and org structure; everything flows into the Virtuoso RDF DB)
  58. Independence of source • Solutions need to be independent of the data source – when we extract data, or – when we run queries against the data hub • Independence isolates clients from changes to the sources – allows us to replace sources completely, or – change the source models • Of course, total independence is impossible – if structures are too different it won't work – but surprisingly often they're not
  59. Independence of source #1 (diagram: a core model with core:Person and core:Project linked by core:participant; idm:Person and idm:Project (linked by idm:has-member, from IDM) and sp:Person and sp:Project (linked by sp:member-of, from Sharepoint) are subtyped from the core model)
  60. Independence of source #2 • "Facebook streams" require queries across sources – every source has its own model • We can reduce the differences with annotation – i.e. we describe the data with RDF • Allows data models to – change, or – be plugged in – with no changes to the query

     sp:doc-created a dss:StreamEvent;
       rdfs:label "opprettet";
       dss:time-property sp:Created;
       dss:user-property sp:Author.

     sp:doc-updated a dss:StreamEvent;
       rdfs:label "endret";
       dss:time-property sp:Modified;
       dss:user-property sp:Editor.

     idm:ny-ansatt a dss:StreamEvent;
       rdfs:label "ble ansatt i";
       dss:time-property idm:ftAnsattFra .
  61. Conclusion
  62. Execution • Customer asked for a start on August 15, delivery January 1 – sources: EPiServer, Sharepoint, IDM • We offered fixed price, done November 1 • Delivered October 20 – also integrated with Active Directory – added Regjeringen.no for extra value • Integration with ACOS Websak – has been analyzed and described – but not implemented
  63. The data sources

     Source             Connector
     Active Directory   LDAP
     IDM                LDAP
     Intranet           EPiServer
     Regjeringen.no     EPiServer
     Sharepoint         Sharepoint
  64. Conclusion • We could do this so quickly because – the architecture is right – we have components we can reuse – the architecture allows us to plug the components together in different settings • Once we have the data, the rest is usually simple – it's getting the data that's hard
