Hafslund SESAM - Semantic integration in practice
1. SESAM
Lars Marius Garshol, Baksia, 2013-02-13
larsga@bouvet.no, http://twitter.com/larsga
2. Agenda
• Project background
• Functional overview
• Under the hood
• Similar projects (if time)
3. Lars Marius Garshol
• Consultant in Bouvet since 2007
– focus on information architecture and semantics
• Worked with semantic technologies since 1999
– mostly with Topic Maps
– co-founder of Ontopia, later CTO
– editor of several Topic Maps ISO standards 2001-
– co-chair of TMRA conference 2006-2011
– developed several key Topic Maps technologies
– consultant in a number of Topic Maps projects
• Published a book on XML with Prentice-Hall
• Implemented Unicode support in the Opera web browser
4. My role on the project
• The overall architecture is the brainchild of Axel
Borge
• SDshare came from an idea by Graham Moore
• I only contributed parts of the design
– and some parts of the implementation
• Don’t actually know the whole system
6. Hafslund ASA
• Norwegian energy company
– founded 1898
– 53% owned by the city of Oslo
– responsible for energy grid around Oslo
– 1.4 million customers
• A conglomerate of companies
– Nett (electricity grid)
– Fjernvarme (district heating)
– Produksjon (power generation)
– Venture
– ...
7. What if...?
[Diagram: ERP and CRM connected to the business objects they describe: work orders, bills, transformers, cables, meters, customers]
8. Hafslund SESAM
• An archive system, really
• Generally, archive systems are glorified trash cans
– putting it in the archive effectively means hiding it
• Because archives are not important, are they?
• Except, when you need that contract from 1937
about the right to build a power line across...
10. Problems with archives
• Many documents aren’t there
– even though they should be,
– because entering metadata is too much hassle
• Poor metadata, because nobody bothers to enter
it properly
– yet, much of the metadata exists in the user context
• Not used by anybody
– strange, separate system with poor interface
– (and the metadata is poor, too)
• Contains only documents
– not connected to anything else
11. Goals for the SESAM project
• Increase percentage of archived documents
– to do this, archiving must be made easy
• Increase metadata quality
– by automatically harvesting metadata
• Make it easy to find archived documents
– by building a well-tuned search application
12. What SESAM does
• Automatically enrich document metadata
– to do that we have to collect background data from the
source systems
• Connect document metadata with structured
business data
– this is a side-effect of the enrichment
• Provide search across the whole information
space
– once we’ve collected the data this part is easy
14. High-level architecture
[Diagram: ERP, CRM, and Sharepoint feed the triple store over SDshare; the archive connects via CMIS and SDshare; a search engine is loaded from the triple store over SDshare]
15. Main principle of data extraction
• No canonical model!
– instead, data reflects model of source system
• One ontology per source system
– subtyped from core ontology where possible
• Vastly simplifies data extraction
– for search purposes it loses us nothing
– and translation is easier once the data is in the triple store
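A minimal sketch (Python with rdflib; all namespaces and class names are invented for illustration) of how a per-source class can be subtyped from the core ontology:

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

CORE = Namespace("http://example.com/core/")   # shared core ontology
IFS  = Namespace("http://example.com/ifs/")    # ERP source ontology
P360 = Namespace("http://example.com/p360/")   # archive source ontology

g = Graph()
# Each source keeps its own class, declared as a subclass of the core class.
g.add((IFS.Supplier, RDFS.subClassOf, CORE.Organization))
g.add((P360.Contact, RDFS.subClassOf, CORE.Organization))
# A query for core:Organization (with RDFS inference, or by expanding
# subclasses) now finds organizations from every source.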
17. Data structure in triple store
[Diagram: ERP, CRM, Archive, and Sharepoint data in the triple store, linked by sameAs statements]
18. Connecting data
• Contacts in many different systems
– ERP has one set of contacts,
– these are mirrored in the archive
• Different IDs in these two systems
– using “sameAs” we know which are the same
– can do queries across the systems
– can also translate IDs from one system to the other
[Diagram: an ERP contact and an archive contact linked by sameAs]
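As a sketch of what this enables (Python with rdflib; the ID property names are hypothetical), the sameAs link lets one query translate an ERP contact ID into the corresponding archive ID:

from rdflib import Graph

g = Graph()
g.parse("hub-extract.ttl")   # assumed local extract of the data hub

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex:  <http://example.com/>

SELECT ?archiveId WHERE {
  ?erp  ex:erp-id "12345" .        # the ID we know
  ?erp  owl:sameAs ?arch .         # the cross-system link
  ?arch ex:archive-id ?archiveId .
}
"""
for row in g.query(query):
    print(row.archiveId)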
19. Duplicate suppression
[Diagram: suppliers, customers, and companies from ERP, CRM, and Billing flow over SDshare into the data hub, where Duke generates owl:sameAs links]

Field      Record 1    Record 2     Probability
Name       acme inc    acme inc     0.9
Assoc no   177477707                0.5
Zip code   9161        9161         0.6
Country    norway      norway       0.51
Address 1  mb 113      mailbox 113  0.49
Address 2                           0.5

http://code.google.com/p/duke/
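The per-field probabilities are combined into an overall match probability. A minimal sketch of the naive-Bayes style combination used by record linkage tools like Duke, with the numbers from the table above (the decision threshold is illustrative):

def combine(probabilities):
    """Naive Bayes combination of independent field probabilities."""
    p_same, p_diff = 1.0, 1.0
    for p in probabilities:
        p_same *= p
        p_diff *= 1.0 - p
    return p_same / (p_same + p_diff)

fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
if combine(fields) > 0.8:        # the threshold is configurable
    print("emit owl:sameAs link")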
20. When archiving
• The user works on the document in some system
– ERP, CRM, whatever
• This system knows the context
– what user, project, equipment, etc is involved
• This information is passed to the CMIS server
– it uses already gathered information from the triple store
to attach more metadata
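A sketch of that enrichment step (Python with SPARQLWrapper; the endpoint and vocabulary are hypothetical): the work order ID from the user's context is used to look up the related project, customer, and manager, which become metadata on the archived document.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://hub.example.com/sparql")  # assumed endpoint
sparql.setQuery("""
PREFIX ex: <http://example.com/>
SELECT ?project ?customer ?manager WHERE {
  ?wo ex:workorder-id "WO-4711" ;
      ex:part-of ?project .
  ?project ex:customer ?customer ;
           ex:manager  ?manager .
}
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    # These values become extra metadata on the archived document.
    print(b["project"]["value"], b["customer"]["value"], b["manager"]["value"])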
21. Auto-tagging
[Diagram: a document sent to the archive is auto-tagged with the work order and its related project, manager, customer, and equipment]
22. Archive integration
• Clients deliver documents via CMIS
– an OASIS standard for CMS interfaces
– lots of available implementations
• Metadata translation
– not only auto-tagging
– client vocabulary translated to archive vocabulary
– required static metadata added automatically
• Benefits
– clients can reuse existing software for connecting
– interface is independent of the actual archive
– archive integration is reusable
[Diagram: clients talk CMIS to the integration, which talks to the archive]
23. Archive integration elsewhere
• Generally, archive integrations are hard
– web site integration: 400 hours
– #2 integration: performance problems
• They also reproduce the same functionality over and over again
• Hard bindings against
– internal archive model
– archive interface
• Switching archive system will be very hard...
– even upgrading is hard
[Diagram: Outlook, a regulation system, a web site, and systems #1 and #2 each bound directly to the archive]
25. Access control
• Users only see objects they’re allowed to see
• Implemented by search engine
– all objects have lists of users/groups allowed to see them
– on login a SPARQL query lists user’s access control group
memberships
– search engine uses this to filter search results
• In some cases, complex access rules are run to
resolve ACLs before loading into triple store
– e.g. the archive system
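A sketch of the two steps (the vocabulary and the Solr-style filter syntax are assumptions):

# 1. On login: a SPARQL query lists the user's access control groups.
membership_query = """
PREFIX ex: <http://example.com/>
SELECT ?group WHERE {
  ?user ex:username "larsga" ;
        ex:member-of ?group .
}
"""

# 2. User + groups become a filter, so the search engine only returns
#    objects whose ACL field lists the user or one of their groups.
def acl_filter(user, groups):
    allowed = [user] + list(groups)
    return "acl:(" + " OR ".join('"%s"' % a for a in allowed) + ")"

print(acl_filter("larsga", ["grid-engineers", "archive-readers"]))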
26. Data volumes
Graph Statements
IFS data 5,417,260
Public 360 data 3,725,963
GeoNIS data 44,242
Tieto CAB data 138,521,810
Hummingbird 1 32,619,140
Hummingbird 2 165,671,179
Hummingbird 3 192,930,188
Hummingbird 4 48,623,178
Address data 2,415,315
Siebel data 36,117,786
Duke links 4,858
Total 626,090,919
34. RDF
• The data hub is an RDF database
– RDF = Resource Description Framework
– a W3C standard for data
– also interchange format and query language (SPARQL)
• Many implementations of the standards
– lots of alternatives to choose between
• Technology relatively well known
– books, specifications, courses, conferences, ...
35. How RDF works
‘PERSON’ table
ID NAME EMAIL
1 Stian Danenbarger stian.danenbarger@
2 Lars Marius Garshol larsga@bouvet.no
3 Axel Borge axel.borge@bouvet
RDF-ized data
SUBJECT PROPERTY OBJECT
http://example.com/person/1 rdf:type ex:Person
http://example.com/person/1 ex:name Stian Danenbarger
http://example.com/person/1 ex:email stian.danenbarger@
http://example.com/person/2 rdf:type ex:Person
http://example.com/person/2 ex:name Lars Marius Garshol
... ... ...
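A sketch with rdflib of the RDF-izing shown above: each column of a row becomes one statement about the row's subject URI (the ex: namespace is illustrative).

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.com/")

g = Graph()
row = {"id": 2, "name": "Lars Marius Garshol", "email": "larsga@bouvet.no"}
subject = URIRef("http://example.com/person/%d" % row["id"])
g.add((subject, RDF.type, EX.Person))
g.add((subject, EX.name, Literal(row["name"])))
g.add((subject, EX.email, Literal(row["email"])))
print(g.serialize(format="turtle"))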
36. RDF/XML
• Standard XML format for RDF
– not really very nice
• However, it’s a generic format
– that is, it’s the same format regardless of what data
you’re transmitting
• Therefore
– all data flows are in the same format
– absolutely no syntax transformations whatsoever
37. Plug-and-play
• Data source integrations are pluggable
– if you need another data source, connect and pull in the data
• RDF database is schemaless
– no need to create tables and columns beforehand
• No common data model
– do not need to transform data to store it
[Diagram: systems #1 and #2 plugged into the data hub]
38. Product queue
• Since sources are pluggable we can guide
the project with a product queue
– consisting of new sources and data elements
• We analyse the items in the queue
– estimate complexity and amount of work
– also to see how it fits into what’s already there
• Customer then decides what to do, and in
what order
39. Don’t push!
• The general IT approach is to push data
– source calls services provided by recipient
• Leads to high complexity
– two-way dependency between systems
– have to extend source system with extra logic
– extra triggers/threads in source system
– many moving parts
40. Pull!
• We let the recipient pull
– always using same protocol and same format
• Wrap the source
– the wrapper must support 3 simple functions
• A different kind of solution
– one-way dependency
– often zero code
– wrapper is thin and stateless
– data moving done by reused code
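A minimal sketch of such a wrapper (Python; the table and column names are assumptions modelled on the history log shown on slide 45, and serialization to RDF/XML is omitted):

import sqlite3

db = sqlite3.connect("source.db")   # assumed source database

def snapshot():
    """1: a dump of the entire data set."""
    return db.execute("select * from contacts").fetchall()

def changed_since(since):
    """2: resources changed since time t; drives the Atom fragment feed."""
    return db.execute(
        "select objid, change_time from history_log where change_time > ?",
        (since,)).fetchall()

def fragment(objid):
    """3: a dump of a single resource."""
    return db.execute("select * from contacts where id = ?",
                      (objid,)).fetchone()

Note that the wrapper holds no state: the client tracks how far it has read.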
41. The data integration
• All data transport done by SDshare
• A simple Atom-based specification for
synchronizing RDF data
– http://www.sdshare.org
• Provides two main features
– snapshot of the data
– fragments for each updated resource
42. Basics of SDshare
• Source offers
– a dump of the entire data set
– a list of fragments changed since time t
– a dump of each fragment
• Completely generic solution
– always the same protocol
– always the same data format (RDF/XML)
• A generic SDshare client then transfers the data
– to the recipient, whatever it is
44. Typical usage of SDshare
• Client downloads snapshot
– client now has complete data set
• Client polls fragment feed
– each time asking for new fragments since last check
– client keeps track of time of last check
– fragments are applied to data, keeping them in sync
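A sketch of that loop in Python (the feed URL and the since parameter name are assumptions; fragment application is omitted):

import time
import xml.etree.ElementTree as ET
import requests

FEED = "http://source.example.com/fragments"
since = "1970-01-01T00:00:00"     # client-side state: time of last check

while True:
    resp = requests.get(FEED, params={"since": since})
    feed = ET.fromstring(resp.text)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in feed.findall("atom:entry", ns):
        updated = entry.find("atom:updated", ns).text
        # fetch the fragment via its link and apply it to the local copy
        since = max(since, updated)
    time.sleep(60)                # poll for fragments newer than `since`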
45. Implementing the fragment feed
select objid, objtype, change_time
from history_log
where change_time > :since:
order by change_time asc

<atom>
  <title>Fragments for ...</title>
  ...
  <entry>
    <title>Change to 34121</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated>
  </entry>
  <entry>
    <title>Change to 94857</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated>
  </entry>
  ...
46. The SDshare client
[Diagram: SDshare client architecture: a frontend feeds the core, which writes through pluggable backends: a SPARQL backend to the triple store, a POST backend, and a web service backend]

http://code.google.com/p/sdshare-client/
47. Getting data out of the triple store
• Set up SPARQL queries to extract the data
• Server does the rest
• Queries can be configured to produce
– any subset of the data
– data in any shape
[Diagram: the SDshare server runs SPARQL queries against the RDF store]
48. Contacts into the archive
• We want some resources in the triple store to be
written into the archive as “contacts”
– need to select which resources to include
– must also transform from source data model
• How to achieve this without hard-wiring anything?
49. Contacts solution
• Create a generic archive object writer
– type of RDF resource specifies type of object to create
– name of RDF property (within namespace) specifies which
property to set
• Set up RDF mapping from source data
– type1 maps-to type2
– prop1 maps-to prop2
– only mapped types/properties included
• Use SPARQL to
– create SDshare feed
– do data translation with CONSTRUCT query
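A sketch of such a CONSTRUCT query (both vocabularies are hypothetical); only mapped types and properties make it into the output, which then feeds the archive writer:

translation = """
PREFIX ifs:  <http://example.com/ifs/>
PREFIX arch: <http://example.com/archive/>

CONSTRUCT {
  ?s a arch:Contact ;          # type1 maps-to type2
     arch:name  ?name ;        # prop1 maps-to prop2
     arch:email ?email .
} WHERE {
  ?s a ifs:Supplier ;
     ifs:name  ?name ;
     ifs:email ?email .
}
"""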
50. Properties of the system
• Uniform integration approach
– everything is done the same way
• Really simple integration
– setting up a data source is generally very easy
• Loose bindings
– components can easily be replaced
• Very little state
– most components are stateless (or have little state)
• Idempotent
– applying a fragment once or many times: same result
• Clear and reload
– can delete everything and reload at any time
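The idempotency falls out of how a fragment is applied; a sketch with rdflib, assuming the common SDshare practice of replacing a resource's complete description:

from rdflib import Graph, URIRef

def apply_fragment(store: Graph, fragment: Graph, resource: URIRef):
    # Remove everything currently known about the resource...
    store.remove((resource, None, None))
    # ...then add the fragment. Applying the same fragment once or many
    # times leaves the store in exactly the same state.
    for triple in fragment:
        store.add(triple)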
52. Project outcome
• Big, innovative project
– many developers over a long time period
– innovated a number of techniques and tools
• Delivered on time, and working
– despite an evolving project context
• Very satisfied customer
– they claim the project has already paid for itself in
cost savings at the document center
53. Archive of the year 2012
“The organization has been innovative and strategic in its use of technology, both with the old paper archives and with the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter a single piece of metadata. They are now down to two (everything else is added automatically by the solution), but the goal is still to get down to just one.”

http://www.arkivrad.no/utdanning.html?newsid=10677
54. Highly flexible solution
• ERP integration replaced twice
– without affecting other components
• CRM system replaced while in production
– all customers got new IDs
– Sesam hid this from users, making the transition 100% transparent
55. Status now
• In production since autumn 2011
• Used by
– Hafslund Fjernvarme
– Hafslund support centre
56. Other implementations
• A very similar system has been
implemented for Statkraft
• A system based on the same architecture
has been implemented for DSS
– basically an intranet system
58. DSS project background
• Delivers IT services and platform to
ministries
– archive, intranet, email, ...
• Intranet currently based on EPiServer
– in addition, want to use Sharepoint
– of course, many other related systems...
• How to integrate all this?
– integrations must be loose, as systems in 13
ministries come and go
59. Metadata structure
[Diagram: metadata model. Persons belong to or have roles in organizational units (section, department, ministry, enterprise) and projects/cases, work on and are responsible for or have written documents/content, and are experts on topics/work areas. Organizational units are responsible for topics and projects/cases. Projects/cases have case numbers and case/process types, belong to an archive key, and are about topics. Documents/content are about topics, are relevant for projects/cases, and carry title, status, date, document type, and file type. Person attributes include description/competence, phone, and address.]
60. Project scope
[Diagram: project scope: EPiServer (intranet), Sharepoint, and IDM; IDM supplies user information and access groups]
80. The web part
• Displays data given
– a query
– an XSLT stylesheet
• Can run anywhere
– standard protocol to the database
• Easy to include in other systems too
[Diagram: EPiServer and Sharepoint web parts query a Virtuoso RDF database over SPARQL; Novell IDM, Active Directory, and Regjeringen.no feed the database over SDShare]
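A sketch of what the web part does (Python; the endpoint, query, and stylesheet names are illustrative): run a query against the RDF database over the standard SPARQL protocol, then render the XML results with XSLT.

import requests
from lxml import etree

results = requests.get(
    "http://virtuoso.example.com/sparql",          # assumed endpoint
    params={"query": "SELECT ?s WHERE { ?s a ?type } LIMIT 10",
            "format": "application/sparql-results+xml"})

transform = etree.XSLT(etree.parse("webpart.xsl"))  # assumed stylesheet
print(str(transform(etree.fromstring(results.content))))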
82. Independence of source
• Solutions need to be independent of the
data source
– when we extract data, or
– when we run queries against the data hub
• Independence isolates clients from changes
to the sources
– allows us to replace sources completely, or
– change the source models
• Of course, total independence is impossible
– if structures are too different it won’t work
– but surprisingly often they’re not
83. Independence of source #1
[Diagram: core model: core:Person linked to core:Project by core:participant; IDM has idm:Person, idm:has-member, idm:Project; Sharepoint has sp:Person, sp:member-of, sp:Project; the source-specific types and properties map onto the core model]
84. Independence of source #2
• “Facebook streams” require queries across sources
– every source has its own model
• We can reduce the differences with annotation
– i.e., we describe the data with RDF
• Allows data models to
– change, or
– be plugged in
• with no changes to the query

sp:doc-created a dss:StreamEvent;
  rdfs:label "opprettet";
  dss:time-property sp:Created;
  dss:user-property sp:Author.

sp:doc-updated a dss:StreamEvent;
  rdfs:label "endret";
  dss:time-property sp:Modified;
  dss:user-property sp:Editor.

idm:ny-ansatt a dss:StreamEvent;
  rdfs:label "ble ansatt i";
  dss:time-property idm:ftAnsattFra .
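A sketch of the generic query these annotations enable (the dss: namespace URI is assumed): one query renders the activity stream across every source, and a new source plugs in by adding its own dss:StreamEvent descriptions, with no change to the query.

stream_query = """
PREFIX dss:  <http://example.com/dss/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?doc ?label ?time WHERE {
  ?event a dss:StreamEvent ;
         rdfs:label ?label ;
         dss:time-property ?timeprop .
  ?doc ?timeprop ?time .
}
ORDER BY DESC(?time)
"""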
86. Execution
• Customer asked for start August 15,
delivery January 1
– sources: EPiServer, Sharepoint, IDM
• We offered fixed price, done November 1
• Delivered October 20
– also integrated with ActiveDirectory
– added Regjeringen.no for extra value
• Integration with ACOS Websak
– has been analyzed and described
– but not implemented
87. The data sources
Source Connector
Active Directory LDAP
IDM LDAP
Intranet EPiServer
Regjeringen.no EPiServer
Sharepoint Sharepoint
88. Conclusion
• We could do this so quickly because
– the architecture is right
– we have components we can reuse
– the architecture allows us to plug together the
components in different settings
• Once we have the data, the rest is usually
simple
– it’s getting the data that’s hard