Hafslund SESAM - Semantic integration in practice
1. SESAM
Lars Marius Garshol, Baksia, 2013-02-13
larsga@bouvet.no, http://twitter.com/larsga
2. Agenda
• Project background
• Functional overview
• Under the hood
• Similar projects (if time)
3. Lars Marius Garshol
• Consultant in Bouvet since 2007
– focus on information architecture and semantics
• Worked with semantic technologies since 1999
– mostly with Topic Maps
– co-founder of Ontopia, later CTO
– editor of several Topic Maps ISO standards 2001-
– co-chair of TMRA conference 2006-2011
– developed several key Topic Maps technologies
– consultant in a number of Topic Maps projects
• Published a book on XML with Prentice-Hall
• Implemented Unicode support in the Opera web browser
4. My role on the project
• The overall architecture is the brainchild of Axel
Borge
• SDshare came from an idea by Graham Moore
• I only contributed parts of the design
– and some parts of the implementation
• Don’t actually know the whole system
6. Hafslund ASA
• Norwegian energy company
– founded 1898
– 53% owned by the city of Oslo
– responsible for energy grid around Oslo
– 1.4 million customers
• A conglomerate of companies
– Nett (electricity grid)
– Fjernvarme (district heating)
– Produksjon (power generation)
– Venture
– ...
7. What if...?
[Diagram: ERP and CRM connected to the business objects they describe: work orders, bills, transformers, cables, meters, customers]
8. Hafslund SESAM
• An archive system, really
• Generally, archive systems are glorified trash cans
– putting it in the archive effectively means hiding it
• Because archives are not important, are they?
• Except, when you need that contract from 1937
about the right to build a power line across...
10. Problems with archives
• Many documents aren’t there
– even though they should be,
– because entering metadata is too much hassle
• Poor metadata, because nobody bothers to enter
it properly
– yet, much of the metadata exists in the user context
• Not used by anybody
– strange, separate system with poor interface
– (and the metadata is poor, too)
• Contains only documents
– not connected to anything else
11. Goals for the SESAM project
• Increase percentage of archived documents
– to do this, archiving must be made easy
• Increase metadata quality
– by automatically harvesting metadata
• Make it easy to find archived documents
– by building a well-tuned search application
12. What SESAM does
• Automatically enrich document metadata
– to do that we have to collect background data from the
source systems
• Connect document metadata with structured
business data
– this is a side-effect of the enrichment
• Provide search across the whole information
space
– once we’ve collected the data this part is easy
14. High-level architecture
[Diagram: ERP, CRM, and Sharepoint feed the triple store over SDshare; the archive connects via CMIS and SDshare; a search engine is loaded from the triple store over SDshare]
15. Main principle of data extraction
• No canonical model!
– instead, data reflects model of source system
• One ontology per source system
– subtyped from core ontology where possible
• Vastly simplifies data extraction
– for search purposes it loses us nothing
– and translation is easier once the data is in the triple store
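A minimal sketch (Python with rdflib; all namespaces and class names are invented for illustration) of how a per-source class can be subtyped from the core ontology:

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

CORE = Namespace("http://example.com/core/")   # shared core ontology
IFS  = Namespace("http://example.com/ifs/")    # ERP source ontology
P360 = Namespace("http://example.com/p360/")   # archive source ontology

g = Graph()
# Each source keeps its own class, declared as a subclass of the core class.
g.add((IFS.Supplier, RDFS.subClassOf, CORE.Organization))
g.add((P360.Contact, RDFS.subClassOf, CORE.Organization))
# A query for core:Organization (with RDFS inference, or by expanding
# subclasses) now finds organizations from every source.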
17. Data structure in triple store
[Diagram: ERP, CRM, Archive, and Sharepoint data in the triple store, linked by sameAs statements]
18. Connecting data
• Contacts in many different systems
– ERP has one set of contacts,
– these are mirrored in the archive
• Different IDs in these two systems
– using “sameAs” we know which are the same
– can do queries across the systems
– can also translate IDs from one system to the other
[Diagram: an ERP contact and an archive contact linked by sameAs]
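As a sketch of what this enables (Python with rdflib; the ID property names are hypothetical), the sameAs link lets one query translate an ERP contact ID into the corresponding archive ID:

from rdflib import Graph

g = Graph()
g.parse("hub-extract.ttl")   # assumed local extract of the data hub

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX ex:  <http://example.com/>

SELECT ?archiveId WHERE {
  ?erp  ex:erp-id "12345" .        # the ID we know
  ?erp  owl:sameAs ?arch .         # the cross-system link
  ?arch ex:archive-id ?archiveId .
}
"""
for row in g.query(query):
    print(row.archiveId)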
19. Duplicate suppression
[Diagram: suppliers, customers, and companies from ERP, CRM, and Billing flow over SDshare into the data hub, where Duke generates owl:sameAs links]

Field      Record 1    Record 2     Probability
Name       acme inc    acme inc     0.9
Assoc no   177477707                0.5
Zip code   9161        9161         0.6
Country    norway      norway       0.51
Address 1  mb 113      mailbox 113  0.49
Address 2                           0.5

http://code.google.com/p/duke/
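The per-field probabilities are combined into an overall match probability. A minimal sketch of the naive-Bayes style combination used by record linkage tools like Duke, with the numbers from the table above (the decision threshold is illustrative):

def combine(probabilities):
    """Naive Bayes combination of independent field probabilities."""
    p_same, p_diff = 1.0, 1.0
    for p in probabilities:
        p_same *= p
        p_diff *= 1.0 - p
    return p_same / (p_same + p_diff)

fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
if combine(fields) > 0.8:        # the threshold is configurable
    print("emit owl:sameAs link")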
20. When archiving
• The user works on the document in some system
– ERP, CRM, whatever
• This system knows the context
– what user, project, equipment, etc is involved
• This information is passed to the CMIS server
– it uses already gathered information from the triple store
to attach more metadata
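A sketch of that enrichment step (Python with SPARQLWrapper; the endpoint and vocabulary are hypothetical): the work order ID from the user's context is used to look up the related project, customer, and manager, which become metadata on the archived document.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://hub.example.com/sparql")  # assumed endpoint
sparql.setQuery("""
PREFIX ex: <http://example.com/>
SELECT ?project ?customer ?manager WHERE {
  ?wo ex:workorder-id "WO-4711" ;
      ex:part-of ?project .
  ?project ex:customer ?customer ;
           ex:manager  ?manager .
}
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    # These values become extra metadata on the archived document.
    print(b["project"]["value"], b["customer"]["value"], b["manager"]["value"])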
21. Auto-tagging
[Diagram: a document sent to the archive is auto-tagged with the work order and its related project, manager, customer, and equipment]
22. Archive integration
• Clients deliver documents via CMIS
– an OASIS standard for CMS interfaces
– lots of available implementations
• Metadata translation
– not only auto-tagging
– client vocabulary translated to archive vocabulary
– required static metadata added automatically
• Benefits
– clients can reuse existing software for connecting
– interface is independent of the actual archive
– archive integration is reusable
[Diagram: clients talk CMIS to the integration, which talks to the archive]
23. Archive integration elsewhere
• Generally, archive integrations are hard
– web site integration: 400 hours
– #2 integration: performance problems
• They also reproduce the same functionality over and over again
• Hard bindings against
– internal archive model
– archive interface
• Switching archive system will be very hard...
– even upgrading is hard
[Diagram: Outlook, a regulation system, a web site, and systems #1 and #2 each bound directly to the archive]
25. Access control
• Users only see objects they’re allowed to see
• Implemented by search engine
– all objects have lists of users/groups allowed to see them
– on login a SPARQL query lists user’s access control group
memberships
– search engine uses this to filter search results
• In some cases, complex access rules are run to
resolve ACLs before loading into triple store
– e.g. the archive system
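A sketch of the two steps (the vocabulary and the Solr-style filter syntax are assumptions):

# 1. On login: a SPARQL query lists the user's access control groups.
membership_query = """
PREFIX ex: <http://example.com/>
SELECT ?group WHERE {
  ?user ex:username "larsga" ;
        ex:member-of ?group .
}
"""

# 2. User + groups become a filter, so the search engine only returns
#    objects whose ACL field lists the user or one of their groups.
def acl_filter(user, groups):
    allowed = [user] + list(groups)
    return "acl:(" + " OR ".join('"%s"' % a for a in allowed) + ")"

print(acl_filter("larsga", ["grid-engineers", "archive-readers"]))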
26. Data volumes
Graph Statements
IFS data 5,417,260
Public 360 data 3,725,963
GeoNIS data 44,242
Tieto CAB data 138,521,810
Hummingbird 1 32,619,140
Hummingbird 2 165,671,179
Hummingbird 3 192,930,188
Hummingbird 4 48,623,178
Address data 2,415,315
Siebel data 36,117,786
Duke links 4,858
Total 626,090,919
34. RDF
• The data hub is an RDF database
– RDF = Resource Description Framework
– a W3C standard for data
– also interchange format and query language (SPARQL)
• Many implementations of the standards
– lots of alternatives to choose between
• Technology relatively well known
– books, specifications, courses, conferences, ...
35. How RDF works
‘PERSON’ table
ID NAME EMAIL
1 Stian Danenbarger stian.danenbarger@
2 Lars Marius Garshol larsga@bouvet.no
3 Axel Borge axel.borge@bouvet
RDF-ized data
SUBJECT PROPERTY OBJECT
http://example.com/person/1 rdf:type ex:Person
http://example.com/person/1 ex:name Stian Danenbarger
http://example.com/person/1 ex:email stian.danenbarger@
http://example.com/person/2 rdf:type ex:Person
http://example.com/person/2 ex:name Lars Marius Garshol
... ... ...
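A sketch with rdflib of the RDF-izing shown above: each column of a row becomes one statement about the row's subject URI (the ex: namespace is illustrative).

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.com/")

g = Graph()
row = {"id": 2, "name": "Lars Marius Garshol", "email": "larsga@bouvet.no"}
subject = URIRef("http://example.com/person/%d" % row["id"])
g.add((subject, RDF.type, EX.Person))
g.add((subject, EX.name, Literal(row["name"])))
g.add((subject, EX.email, Literal(row["email"])))
print(g.serialize(format="turtle"))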
36. RDF/XML
• Standard XML format for RDF
– not really very nice
• However, it’s a generic format
– that is, it’s the same format regardless of what data
you’re transmitting
• Therefore
– all data flows are in the same format
– absolutely no syntax transformations whatsoever
37. Plug-and-play
• Data source integrations are pluggable
– if you need another data source, connect and pull in the data
• RDF database is schemaless
– no need to create tables and columns beforehand
• No common data model
– do not need to transform data to store it
[Diagram: systems #1 and #2 plugged into the data hub]
38. Product queue
• Since sources are pluggable we can guide
the project with a product queue
– consisting of new sources and data elements
• We analyse the items in the queue
– estimate complexity and amount of work
– also to see how it fits into what’s already there
• Customer then decides what to do, and in
what order
39. Don’t push!
• The general IT approach is to push data
– source calls services provided by recipient
• Leads to high complexity
– two-way dependency between systems
– have to extend source system with extra logic
– extra triggers/threads in source system
– many moving parts
40. Pull!
• We let the recipient pull
– always using same protocol and same format
• Wrap the source
– the wrapper must support 3 simple functions
• A different kind of solution
– one-way dependency
– often zero code
– wrapper is thin and stateless
– data moving done by reused code
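A minimal sketch of such a wrapper (Python; the table and column names are assumptions modelled on the history log shown on slide 45, and serialization to RDF/XML is omitted):

import sqlite3

db = sqlite3.connect("source.db")   # assumed source database

def snapshot():
    """1: a dump of the entire data set."""
    return db.execute("select * from contacts").fetchall()

def changed_since(since):
    """2: resources changed since time t; drives the Atom fragment feed."""
    return db.execute(
        "select objid, change_time from history_log where change_time > ?",
        (since,)).fetchall()

def fragment(objid):
    """3: a dump of a single resource."""
    return db.execute("select * from contacts where id = ?",
                      (objid,)).fetchone()

Note that the wrapper holds no state: the client tracks how far it has read.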
41. The data integration
• All data transport done by SDshare
• A simple Atom-based specification for
synchronizing RDF data
– http://www.sdshare.org
• Provides two main features
– snapshot of the data
– fragments for each updated resource
42. Basics of SDshare
• Source offers
– a dump of the entire data set
– a list of fragments changed since time t
– a dump of each fragment
• Completely generic solution
– always the same protocol
– always the same data format (RDF/XML)
• A generic SDshare client then transfers the data
– to the recipient, whatever it is
44. Typical usage of SDshare
• Client downloads snapshot
– client now has complete data set
• Client polls fragment feed
– each time asking for new fragments since last check
– client keeps track of time of last check
– fragments are applied to data, keeping them in sync
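A sketch of that loop in Python (the feed URL and the since parameter name are assumptions; fragment application is omitted):

import time
import xml.etree.ElementTree as ET
import requests

FEED = "http://source.example.com/fragments"
since = "1970-01-01T00:00:00"     # client-side state: time of last check

while True:
    resp = requests.get(FEED, params={"since": since})
    feed = ET.fromstring(resp.text)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in feed.findall("atom:entry", ns):
        updated = entry.find("atom:updated", ns).text
        # fetch the fragment via its link and apply it to the local copy
        since = max(since, updated)
    time.sleep(60)                # poll for fragments newer than `since`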
45. Implementing the fragment feed
select objid, objtype, change_time
from history_log
where change_time > :since:
order by change_time asc

<atom>
  <title>Fragments for ...</title>
  ...
  <entry>
    <title>Change to 34121</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:23</updated>
  </entry>
  <entry>
    <title>Change to 94857</title>
    <link rel="fragment" href="..."/>
    <sdshare:resource>http://...</sdshare:resource>
    <updated>2012-09-06T08:22:24</updated>
  </entry>
  ...
46. The SDshare client
[Diagram: SDshare client architecture: a frontend feeds the core, which writes through pluggable backends: a SPARQL backend to the triple store, a POST backend, and a web service backend]

http://code.google.com/p/sdshare-client/
47. Getting data out of the triple store
• Set up SPARQL queries to extract the data
• Server does the rest
• Queries can be configured to produce
– any subset of the data
– data in any shape
[Diagram: the SDshare server runs SPARQL queries against the RDF store]
48. Contacts into the archive
• We want some resources in the triple store to be
written into the archive as “contacts”
– need to select which resources to include
– must also transform from source data model
• How to achieve this without hard-wiring anything?
49. Contacts solution
• Create a generic archive object writer
– type of RDF resource specifies type of object to create
– name of RDF property (within namespace) specifies which
property to set
• Set up RDF mapping from source data
– type1 maps-to type2
– prop1 maps-to prop2
– only mapped types/properties included
• Use SPARQL to
– create SDshare feed
– do data translation with CONSTRUCT query
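A sketch of such a CONSTRUCT query (both vocabularies are hypothetical); only mapped types and properties make it into the output, which then feeds the archive writer:

translation = """
PREFIX ifs:  <http://example.com/ifs/>
PREFIX arch: <http://example.com/archive/>

CONSTRUCT {
  ?s a arch:Contact ;          # type1 maps-to type2
     arch:name  ?name ;        # prop1 maps-to prop2
     arch:email ?email .
} WHERE {
  ?s a ifs:Supplier ;
     ifs:name  ?name ;
     ifs:email ?email .
}
"""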
50. Properties of the system
• Uniform integration approach
– everything is done the same way
• Really simple integration
– setting up a data source is generally very easy
• Loose bindings
– components can easily be replaced
• Very little state
– most components are stateless (or have little state)
• Idempotent
– applying a fragment once or many times: same result
• Clear and reload
– can delete everything and reload at any time
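The idempotency falls out of how a fragment is applied; a sketch with rdflib, assuming the common SDshare practice of replacing a resource's complete description:

from rdflib import Graph, URIRef

def apply_fragment(store: Graph, fragment: Graph, resource: URIRef):
    # Remove everything currently known about the resource...
    store.remove((resource, None, None))
    # ...then add the fragment. Applying the same fragment once or many
    # times leaves the store in exactly the same state.
    for triple in fragment:
        store.add(triple)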
52. Project outcome
• Big, innovative project
– many developers over a long time period
– innovated a number of techniques and tools
• Delivered on time, and working
– despite an evolving project context
• Very satisfied customer
– they claim the project has already paid for itself in
cost savings at the document center
53. Archive of the year 2012
“The organization has been innovative and strategic in its use of technology, both with the old paper archives and with the digital archive. They have a strong focus on increasing data capture and simplifying the use of metadata. Their goal has been that users should only have to enter a single piece of metadata. They are now down to two (everything else is added automatically by the solution), but the goal is still to get down to just one.”

http://www.arkivrad.no/utdanning.html?newsid=10677
54. Highly flexible solution
• ERP integration replaced twice
– without affecting other components
• CRM system replaced while in production
– all customers got new IDs
– Sesam hid this from users, making the transition 100% transparent
55. Status now
• In production since autumn 2011
• Used by
– Hafslund Fjernvarme
– Hafslund support centre
56. Other implementations
• A very similar system has been
implemented for Statkraft
• A system based on the same architecture
has been implemented for DSS
– basically an intranet system
58. DSS project background
• Delivers IT services and platform to
ministries
– archive, intranet, email, ...
• Intranet currently based on EPiServer
– in addition, want to use Sharepoint
– of course, many other related systems...
• How to integrate all this?
– integrations must be loose, as systems in 13
ministries come and go
59. Metadata structure
[Diagram: metadata model. Persons belong to or have roles in organizational units (section, department, ministry, enterprise) and projects/cases, work on and are responsible for or have written documents/content, and are experts on topics/work areas. Organizational units are responsible for topics and projects/cases. Projects/cases have case numbers and case/process types, belong to an archive key, and are about topics. Documents/content are about topics, are relevant for projects/cases, and carry title, status, date, document type, and file type. Person attributes include description/competence, phone, and address.]
60. Project scope
[Diagram: project scope: EPiServer (intranet), Sharepoint, and IDM; IDM supplies user information and access groups]
80. The web part
• Displays data given
– a query
– an XSLT stylesheet
• Can run anywhere
– standard protocol to the database
• Easy to include in other systems too
[Diagram: EPiServer and Sharepoint web parts query a Virtuoso RDF database over SPARQL; Novell IDM, Active Directory, and Regjeringen.no feed the database over SDShare]
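A sketch of what the web part does (Python; the endpoint, query, and stylesheet names are illustrative): run a query against the RDF database over the standard SPARQL protocol, then render the XML results with XSLT.

import requests
from lxml import etree

results = requests.get(
    "http://virtuoso.example.com/sparql",          # assumed endpoint
    params={"query": "SELECT ?s WHERE { ?s a ?type } LIMIT 10",
            "format": "application/sparql-results+xml"})

transform = etree.XSLT(etree.parse("webpart.xsl"))  # assumed stylesheet
print(str(transform(etree.fromstring(results.content))))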
82. Independence of source
• Solutions need to be independent of the
data source
– when we extract data, or
– when we run queries against the data hub
• Independence isolates clients from changes
to the sources
– allows us to replace sources completely, or
– change the source models
• Of course, total independence is impossible
– if structures are too different it won’t work
– but surprisingly often they’re not
83. Independence of source #1
[Diagram: core model: core:Person linked to core:Project by core:participant; IDM has idm:Person, idm:has-member, idm:Project; Sharepoint has sp:Person, sp:member-of, sp:Project; the source-specific types and properties map onto the core model]
84. Independence of source #2
• “Facebook streams” require queries across sources
– every source has its own model
• We can reduce the differences with annotation
– i.e., we describe the data with RDF
• Allows data models to
– change, or
– be plugged in
• with no changes to the query

sp:doc-created a dss:StreamEvent;
  rdfs:label "opprettet";
  dss:time-property sp:Created;
  dss:user-property sp:Author.

sp:doc-updated a dss:StreamEvent;
  rdfs:label "endret";
  dss:time-property sp:Modified;
  dss:user-property sp:Editor.

idm:ny-ansatt a dss:StreamEvent;
  rdfs:label "ble ansatt i";
  dss:time-property idm:ftAnsattFra .
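A sketch of the generic query these annotations enable (the dss: namespace URI is assumed): one query renders the activity stream across every source, and a new source plugs in by adding its own dss:StreamEvent descriptions, with no change to the query.

stream_query = """
PREFIX dss:  <http://example.com/dss/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?doc ?label ?time WHERE {
  ?event a dss:StreamEvent ;
         rdfs:label ?label ;
         dss:time-property ?timeprop .
  ?doc ?timeprop ?time .
}
ORDER BY DESC(?time)
"""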
86. Execution
• Customer asked for start August 15,
delivery January 1
– sources: EPiServer, Sharepoint, IDM
• We offered fixed price, done November 1
• Delivered October 20
– also integrated with ActiveDirectory
– added Regjeringen.no for extra value
• Integration with ACOS Websak
– has been analyzed and described
– but not implemented
87. The data sources
Source Connector
Active Directory LDAP
IDM LDAP
Intranet EPiServer
Regjeringen.no EPiServer
Sharepoint Sharepoint
88. Conclusion
• We could do this so quickly because
– the architecture is right
– we have components we can reuse
– the architecture allows us to plug together the
components in different settings
• Once we have the data, the rest is usually
simple
– it’s getting the data that’s hard