2. What is semantics?
Although animals do not use language, they are capable of many of the same kinds of cognition as we are; much of our experience is at a non-verbal level.
Semantics is the bridge between surface forms used in language and what we do and experience.
Language understanding depends on world knowledge (e.g., "the pig is in the pen" vs. "the ink is in the pen")
3. Machine to Machine Communication
[Diagram: message exchange between two systems]
Underlying the systems are different databases; the ability to "get something done" is like a non-verbalized ability, but to work with other systems we need to formulate messages in an artificial language.
Understanding human language is a big problem. What chunk can we break off that will be useful and can be done today?
Key insight: the semantic problem of communication between business IT systems isn't that different from the semantic problem of communication between animals.
4. Natural Language to support M2M
[Diagram: internal databases connected through an industry-standard message format, with machine-readable and human-readable specifications]
Capture critical knowledge in a graph database; perhaps 80% of the process can be automated, but human effort is part of a structured process that clearly links specification to implementation.
Captured specifications are used to compile data transformation rules; the graph model is used as a "universal solvent".
5. More generally…
[Diagram: requirements, regulations, and policies flow into programs that implement behaviors]
We might not be ready for executives to specify
policies themselves, but we can make the process
from specification to behavior more automated,
linked to precise vocabulary, and more traceable.
Advances such as SBVR and an English serialization for ISO Common Logic mean that executives and line workers can understand why the system does certain things, or verify that policies and regulations are implemented.
Logged Decision Process
Focusing on the execution of tasks is the road to real semantics; anything that does a useful job solves the "grounding problem." Children can't learn language by watching television, only by talking with others.
6. Making Expressive Reasoning Scalable
[Architecture diagram] A scalable fabric combines background knowledge, rules, models, algorithms, and heuristics. The scalable system merges data from siloed sources and constructs graph(s) of facts relevant to specific records and entities; a scalable profiler lets the system discover "ground truth" about the data to inform generated rules and behaviors. Supporting functions: vocabulary management, version control, exception handling, business rules management, case management, concept matching, behavior traceable to requirements, multilingual support, and enriched linked data.
7. People are looking for better tools
Unconstructive Criticism of the Semantic Web is Common
Blanket dismissals displace real thinking, particularly the "gap analysis" of what is actually missing.
Yet certain unworkable standards (OWL) have also displaced real progress.
8. History of RDF is about evolution
good stuff survives, bad ideas (slowly) fade away
Timeline: RDF/XML → RDFS → OWL → SPARQL → SPIN → Linked Data → ISO Common Logic, with N-Triples, Turtle, and RDF* along the way.
• RDF/XML: early work built on XML; it had natural representations for ordered collections but was pedagogically awful (where are the triples?)
• N-Triples and Turtle: Turtle is a human-friendly format but isn't scalable to billions of triples
• RDFS and OWL: competition among schema/inference languages left two winners
• SPARQL: a full-featured query language changed everything, but ordered collections go "under the bus"
• SPIN: new inference and transformation languages emerge
• Linked Data: in the Linked Data era we can handle billions of triples, but collections and blank nodes become awkward
• RDF*: RDF* and SPARQL* let us make statements about statements and query them; this increases expressiveness and can be used for data management
• ISO Common Logic: in the long term we'll see highly expressive languages forward-compatible with RDF
9. We can be optimistic because…
multiple communities have been working on similar things in parallel
[Diagram: puzzle pieces from parallel communities]
• Semantic web: RDF / SPARQL
• Diagramming and representation of data structures, processes, systems, models, etc.
• Common Logic and message vocabularies
• SUMO upper ontology
• Commercial Master Data Management products that accurately match entities
• Vocabularies and message formats for business
When you look at the pieces of the puzzle developed by communities that don’t really talk to
each other, you see that the “state of the art” is better than it appears…
10. Common data models
• Relational data model
  • Fundamentally tabular, like a CSV file
• Object-relational model
  • A column can contain rows
  • This is like XML or JSON
• Graph Model
  • Highly general
  • Hypergraphs
  • "Property Graphs" and RDF*
These models are compatible in that you can represent a graph with relational tables, break up an XML record into multiple relational tables, or even embed a hypergraph inside a graph, but there are big differences when it comes to efficiency when you need a certain set of facts in one place.
11. Predicate Calculus
RDF is a special case of the “predicate calculus”
Statement of arity 2
Predicate Calculus:
A(:Dog,:Fido)
RDF:
:Fido a :Dog .
Statement of arity 3
Predicate Calculus:
:Population(:Nigeria,2013,173.6e6)
RDF:
[
a :Population .
:where :Nigeria .
:when 2013 .
:amount 173.6e6
]
It’s not too hard to write this in
Turtle
This implementation, however,
is structurally unstable, since
we went from one triple to four
triples
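To see the instability concretely, here is a minimal Apache Jena sketch (vocabulary URIs and file name hypothetical) of the SELECT needed to get the arity-3 fact back out: what was one triple becomes a four-pattern join.

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class PopulationQuery {
    public static void main(String[] args) {
        // Load the four triples from the blank-node encoding above
        Model model = ModelFactory.createDefaultModel().read("population.ttl");
        String sparql =
            "PREFIX : <http://example.com/vocab#> " +
            "SELECT ?amount WHERE { " +
            "  ?r a :Population ; :where :Nigeria ; :when 2013 ; :amount ?amount " +
            "}";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSetFormatter.out(qe.execSelect()); // one row with the population value
        }
    }
}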
12. How to think about RDF
• The basic element of RDF is the Node
• This borrows heavily from XML in that
• Terms come out of a URL-based namespace so we can throw everything in a big pot
• We get the basic types from XML schema
• Plus we can even use XML literals
• A triple is just a tuple with (i) three nodes, and (ii) set semantics
• Higher-arity predicates are tuples with >3 nodes
• SPARQL result sets and intermediate results are tuples of Nodes
• Official serialization formats exist for SPARQL result sets
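A minimal Jena sketch of the "triple = tuple of three Nodes with set semantics" view (URIs hypothetical):

import org.apache.jena.graph.*;
import org.apache.jena.rdf.model.ModelFactory;

public class TripleDemo {
    public static void main(String[] args) {
        Node s = NodeFactory.createURI("http://example.com/Fido");
        Node p = NodeFactory.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
        Node o = NodeFactory.createURI("http://example.com/Dog");

        Graph g = ModelFactory.createDefaultModel().getGraph();
        g.add(Triple.create(s, p, o));
        g.add(Triple.create(s, p, o)); // set semantics: the duplicate is a no-op
        System.out.println(g.size());  // prints 1
    }
}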
ISO Common Logic is the obvious upgrade path, since it uses the same data types as RDF and can handle RDF
triples, as well as higher-order predicates and intuitively obvious inference.
13. ISO Common Logic
Next step in evolution
• Uses RDF Node as basic data type with all benefits thereof
• RDF triples are just arity 2 predicates and can be used directly
• First order logic operators supported; typed logic allows some
“beyond first order logic” capabilities
• OWL and RDFS can be implemented as a theory in FOL
• Builds on the KIF Knowledge Interchange Format
• Foundation for additional developments
• Controlled English Format for Common Logic Statements
• Modal logics: SBVR
• Interchange language for knowledge-based systems of all kinds
14. The Old RDF: Expressive but not scalable
Early RDF:
RDF/XML serialization, heavy use of blank nodes, extreme
expressiveness:
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object rdfs:Class ;
sp:predicate rdf:type ;
sp:subject _:b1
] [ a sp:SubQuery ;
sp:query
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object _:b2 ;
sp:predicate rdfs:label ;
sp:subject _:b1
])
]
])
]
This is a representation of a SPARQL
query in RDF!
This example uses Turtle, where square brackets create blank nodes and parentheses create lists.
With this graph in the Jena framework you can easily manipulate this as an abstract syntax tree.
Very complex relationships, such as mathematical equations, can be built this way; blank nodes can be used to write high-arity predicates.
Accessing it through SPARQL would
not be so easy!
15. Linked Data: New Focus
Linked data source
Blank nodes are discouraged because it’s hard for
a distributed community to talk about something
without a name.
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object rdfs:Class ;
sp:predicate rdf:type ;
sp:subject _:b1
] [ a sp:SubQuery ;
sp:query
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object _:b2 ;
sp:predicate rdfs:label ;
sp:subject _:b1
])
]
])
]
Turtle and RDF/XML (which
have sweet syntax for blank
nodes) are not scalable
because the parser cannot be
restarted after a failure: if
you have billions of triples, a
few will be bad
<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show" .
<http://example.org/show/218> <http://example.org/show/localName> "That Seventies Show"@en .
<http://example.org/show/218> <http://example.org/show/localName> "Cette Série des Années Septante"@fr-be .
<http://example.org/#spiderman> <http://example.org/text> "This is a multi-line\nliteral with many quotes (\"\"\"\"\")\nand two apostrophes ('')." .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/atomicNumber> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/specificGravity> "1.663E-4"^^<http://www.w3.org/2001/XMLSchema#double> .
N-Triples is practical for large databases such as Freebase and DBpedia because records are isolated, but blank nodes must be named, and triple-centric modelling is encouraged.
We now have a great query language, SPARQL. SPARQL supports the
same shorthand for blank nodes as Turtle. Some blank node patterns
work naturally, but it is particularly hard to ask questions about
ordered collections.
Blank nodes, collections, etc. are out of fashion.
16. Old Approaches To Reification:
Named Graphs
:graph :subject :predicate :object .
Adding an extra node to a triple is simple, practical and useful for
many purposes.
For instance, I could take in triple data from various sources and
keep them apart by putting them in different graphs.
The trouble is that this is a one-trick pony: I can't take collections of named graphs from different sources and keep them apart using named graphs.
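A hedged Jena sketch of that one useful trick — keeping sources apart — with graph names and file names hypothetical:

import org.apache.jena.query.*;
import org.apache.jena.riot.RDFDataMgr;

public class NamedGraphDemo {
    public static void main(String[] args) {
        // Load each source into its own named graph of an in-memory dataset
        Dataset ds = DatasetFactory.create();
        ds.addNamedModel("http://example.com/graphs/sourceA", RDFDataMgr.loadModel("sourceA.ttl"));
        ds.addNamedModel("http://example.com/graphs/sourceB", RDFDataMgr.loadModel("sourceB.ttl"));

        // The GRAPH keyword reports which source each fact came from
        String sparql = "SELECT ?g ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } }";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, ds)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}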
For practical logic we need to be able to qualify statements to manage:
• Provenance
• Access Controls
• Metadata
• Modal relationships
• Time
17. Old Approaches to Reification
Reification with Blank Nodes
[
  rdf:type rdf:Statement ;
  rdf:subject :Tolkien ;
  rdf:predicate :wrote ;
  rdf:object :LordOfTheRings ;
  :said :Wikipedia
] .
http://stackoverflow.com/questions/1312741/simple-example-of-reification-in-rdf
This isn't too hard to write in Turtle, but it breaks SPARQL queries and inference for reified triples.
The number of triples is at least quadrupled (four reification triples per original statement), and the triple store is unlikely to be able to optimize for common use cases.
18. A new standard that unifies RDF with the property graph model
RDF*/SPARQL* (Reification Done Right)
Turtle facts:
:bob foaf:name "Bob" .
<<:bob foaf:age 23>> dct:creator <http://example.com/crawlers#c1> ;
    dct:source <http://example.net/homepage-listing.html> .
SPARQL query:
SELECT ?age ?src WHERE {
?bob foaf:name "Bob" .
<<?bob foaf:age ?age>> dct:source ?src .
}
This is huge! So far products based on property graphs have been ad-hoc, without a
formal model. SPARQL* brings rich queries to the property graph model and the reverse
mapping means RDF* can be processed with traversal-based languages like Gremlin.
20. Schemas as Documentation
Humans write code to insert data and write queries: schemas tell us
how the data is organized.
Automated systems can also use schemas to drive code generation
(consider object-relational mapping)
21. Schemas can preserve integrity
SQL:
create table customer (
  id integer primary key,
  username varchar(16) unique not null,
  email varchar(64) not null
);
SQL prevents attempts to insert records
with non-existing fields or lacking
required fields. SQL can enforce key
integrity and other constraints.
You can (often) code algorithms and take
it for granted that data structures satisfy
invariants required for those algorithms
to work.
RDF:
RDFS and OWL, implemented with the standard semantics,
do not validate data.
Practically, RDF users will use types and properties across a
wide range of standard and proprietary namespaces, and it can
be hard to keep track of them all.
For instance, rdfs:label is defined in RDFS, despite the fact that you can have labels without schemas, while terms that are the bread and butter of RDFS, such as rdf:type and rdf:Property, are defined in the RDF namespace.
It's an easy mistake to get the "s" wrong in either writing data or queries, and if you do, you run a query, get zero results, and could easily chase your tail looking for other causes.
You can (and should) define an alternate semantics for RDFS and OWL which rejects types and properties that are not listed in either data or queries, but this is nonstandard.
These issues are addressed in the "RDF Data Shapes" effort, to be completed in 2017.
22. Schemas can promote efficiency
JSON (and XML)
Structural information is repeated in each record:
{
  "red": 201,
  "green": 99,
  "blue": 82,
  "alpha": 115
}
(85 bytes)
C
typedef unsigned char byte;
struct color {
    byte red;
    byte green;
    byte blue;
    byte alpha;
};
Defines the meaning of: 201 99 82 115
(4 bytes)
20x compression!
In numerical work, it often takes longer to convert a million numbers from ASCII to float than you spend working on the floats. The speed of text parsing is a limiting factor in
electronic trading systems and many other applications.
GZIP compression of repetitive data helps, but you get a smaller file if you apply GZIP to binary data. You pay a CPU price for data compression plus a large price in string parsing.
Textual data formats have been fashionable in the Internet Age, because it is easy to get string parsing code to “almost work;” one of the reasons we are hearing about security
breaches every day is that it’s extremely difficult to write correct string parsing code.
RDF standards do not address binary serialization; however, Vital AI can create binary formats based on OWL schemas.
23. Schemas and Inference
the unique value of RDF!
A-BOX:
:Joe myvocab:emailAddress "joe@example.com" .
dbpedia:Some_Body db:eMailAddress <mailto:sb@example.com> .
basekb:m.3137q basekb:organization.email_contact.email_address "3137q@example.com" .
:Lily schemaOrg:email "lily@example.com" .
Look at four RDF vocabularies and find four ways to write an e-mail address.
T-BOX:
myvocab:emailAddress rdfs:subPropertyOf foaf:email .
db:eMailAddress rdfs:subPropertyOf foaf:email .
basekb:organization.email_contact.email_address rdfs:subPropertyOf foaf:email .
schemaOrg:email rdfs:subPropertyOf foaf:email .
Inferred facts:
:Joe foaf:email "joe@example.com" .
dbpedia:Some_Body foaf:email <mailto:sb@example.com> .
basekb:m.3137q foaf:email "3137q@example.com" .
:Lily foaf:email "lily@example.com" .
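A minimal sketch of that inference with Jena's built-in RDFS reasoner (file names hypothetical; A-box and T-box as above):

import org.apache.jena.rdf.model.*;

public class EmailInference {
    public static void main(String[] args) {
        Model abox = ModelFactory.createDefaultModel().read("emails-abox.ttl");
        Model tbox = ModelFactory.createDefaultModel().read("emails-tbox.ttl");

        // Apply RDFS entailment (rdfs:subPropertyOf) over the union of data and schema
        InfModel inf = ModelFactory.createRDFSModel(tbox.union(abox));

        Property foafEmail = inf.createProperty("http://xmlns.com/foaf/0.1/", "email");
        inf.listStatements(null, foafEmail, (RDFNode) null)
           .forEachRemaining(System.out::println); // the four inferred foaf:email facts
    }
}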
24. It looks like an answer for data integration,
but…
:Joe foaf:email "joe@example.com" .
dbpedia:Some_Body foaf:email <mailto:sb@example.com> .
basekb:m.3137q foaf:email "3137q@example.com" .
:Lily foaf:email "lily@example.com" .
There are two reasonable ways to write an email address:
as a string or as a URI
foaf:email rdfs:range owl:Thing .
According to the FOAF spec, only the URI form is correct since "In OWL DL literals are disjoint from owl:Thing" (at least if we are using OWL DL…)
Any ETL tool can apply a function to data; it's not hard at all to write code to translate a string to a mailto: URI, as sketched below.
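A trivial sketch in plain Java (helper name hypothetical):

public class EmailCanonicalizer {
    // Canonicalize the two reasonable representations to the URI form
    static String toMailtoUri(String email) {
        String e = email.trim();
        return e.startsWith("mailto:") ? e : "mailto:" + e;
    }
    public static void main(String[] args) {
        System.out.println(toMailtoUri("joe@example.com")); // mailto:joe@example.com
    }
}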
RDFS and OWL, however, can’t do simple format conversion. For instance, it is reasonable for people to
specify temperatures in Fahrenheit or Centigrade or Kelvin, but OWL inference can’t “multiply by
something and add” – even though it can state that properties “mean the same thing”, it can’t specify
simple transformations.
Something like OWL may be necessary for data integration, but OWL is not sufficient.
25. Other things OWL Can’t Do
• We can’t reject data
• Reject things that we don’t agree with
• Reject things we don’t need; let’s use Freebase to seed…
• A directory of ski areas
• The spatial hierarchy of Africa
• A biomedical ontology
We don’t want to pay to store stuff we don’t need, or wait for it to be
processed, do quality control on it, or deal with any problems it might
create
26. OWL is unintuitive
Here’s an excerpt from the FIBO (Finance) ontology:
Organization:
A social unit of people, systematically structured and managed to meet a need or pursue collective goals on a continuing basis.
Autonomous Agent:
An agent is an autonomous individual that can
adapt to and interact with its environment.
Property Restriction 1:
Set of things that must have property "has
member" at least 2 taken from "autonomous
agent"
Property Restriction 2:
Set of things that may have property "has part"
taken from "organization"
Property Restriction 3:
Set of things that must have property "has" at least
1 taken from "goal"
Property Restriction 4:
Set of things that may have property "has" taken
from "postal address”
How do you explain this to your boss? To the programmer who just joined the team? What kind of inference does this entail?
(I think two people with a goal are an organization, but is there a real difference between a DBA filing for a person who is self-employed and one that has an additional employee?)
27. It’s not always obvious how to do things in
OWL
You can’t say
“The United States Has 50 States”
But you can say
“Anything that has 50 states is the United States”
You can get close to what you want to say by
“The United States is a member of an anonymous
class that contains anything with 50 states.”
You can get some entailments from that, but nothing
happens if only 47 states are on the list (it’s an open
world, we just don’t know about…)
Thus:
It’s not obvious what exactly can be specified in
OWL.
If you talk to an expert, you’ll find that he can do a
lot of things you might think aren’t possible.
28. Production Rules and First-Order Logic
Many 1970s “expert systems” were driven by production rules; these are now widespread in “Business Rules
Engines”.
Condition -> Action
Common data transformations can be easily written with production rules:
Weight(person,weight) and Height(person,height) -> BodyMassIndex(person,weight/height^2)
BodyMassIndex(person,bmi) and bmi<18.5 -> Underweight(person)
BodyMassIndex(person,bmi) and 18.5<=bmi<25 -> NormalWeight(person)
BodyMassIndex(person,bmi) and 25<=bmi<30 -> Overweight(person)
BodyMassIndex(person,bmi) and 30<=bmi -> Obese(person)
"You can't do this in OWL. You just can't" — almost. You could easily miss it reading the documentation, but the threshold classifications can be stated in OWL by using XML Schema constraints on data types; the arithmetic that computes the BMI cannot.
29. Production Rules vs. imperative programming languages
The BMI example could easily be written in (say) Java…
BUT
You have to get the steps in the right order; this is trivial to do in a simple case, but it gets increasingly
harder as complexity goes up. This is one of the reasons why programming is a specialized skill.
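A sketch of what that Java looks like — correct only because the steps appear in the right order:

public class Bmi {
    static String classify(double weightKg, double heightM) {
        double bmi = weightKg / (heightM * heightM); // must happen before any classification
        if (bmi < 18.5) return "Underweight";
        if (bmi < 25)   return "NormalWeight";       // each test relies on the ones above failing
        if (bmi < 30)   return "Overweight";
        return "Obese";
    }
    public static void main(String[] args) {
        System.out.println(classify(70, 1.75)); // 70/1.75^2 ≈ 22.9 -> NormalWeight
    }
}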
Production Rules constrain the conditions so the engine can quickly determine which rules are fired when the state changes…
… but the actions are written in a conventional programming language like LISP or Java, so we can use a full spectrum of programming techniques and a lot of existing code.
Note: rules engines have advanced greatly since the “golden age of AI”, and now
100,000+ rules and 10 million+ facts are practical.
30. Production Rules in the Wider Picture
Drools Expert: Execution of production rules
Drools Fusion: Complex event processing
jBPM: Business Process Management; coordination of asynchronous human
and automated behaviors – controlled by rules
Optaplanner: Multi-objective combinatorial optimization for tasks such as scheduling, vehicle routing, box packing – controlled by rules
This is the JBoss stack; products such as Blaze Advisor and IBM ILOG do all this and more.
The use of production rules to control business processes, particularly in scenarios involving complex
workflows and complex multiple requirements is well established.
This is an emerging research topic in the
semweb community, but in the business rules
world this is a mature technology
31. "Impedance Mismatch" between Business Rules and RDF is minimal
Most Java Rules Engines (like JESS and Drools) can reason
about ordinary Java objects
RDF data can be converted to specialized predicate
objects for performance or convenience, but it is
very possible to insert objects from the Jena
framework such as Nodes, Triples and Models directly
into a rules engine.
32. OWL and RDFS implementations often use
production rules
OWL 2 RL dialect
Forward chaining
The semantics of RDFS and most of OWL can be
implemented with production rules; RETE and
Post-RETE algorithms can evaluate these efficiently.
Popular reasoners such as Jena and OWLIM often use a box of production rules to implement RDFS and OWL, and expose this functionality so you can implement custom inference.
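A hedged sketch of that exposed functionality in Jena — a custom forward-chaining rule in Jena's own rule language (vocabulary and file name hypothetical):

import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.rulesys.*;

public class CustomRuleDemo {
    public static void main(String[] args) {
        // Map a proprietary property onto foaf:email, RDFS-style, with one rule
        String rules =
            "[emailRule: (?s <http://example.com/emailAddress> ?e) " +
            "         -> (?s <http://xmlns.com/foaf/0.1/email> ?e)]";
        Reasoner reasoner = new GenericRuleReasoner(Rule.parseRules(rules));

        Model data = ModelFactory.createDefaultModel().read("data.ttl");
        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        inf.listStatements().forEachRemaining(System.out::println); // base + inferred triples
    }
}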
OWL 2 QL dialect
Backwards chaining
RDFS, and another major subset of OWL, can be
implemented by rewriting SPARQL queries.
Since SPARQL is based on relational algebra, the
whole bag of tricks used to optimize relational
database queries can be used to efficiently answer
queries.
33. OWL dialects have “computer science”
advantages
(i.e., algorithms exist to answer queries in bounded time, with scaling
that looks good on paper)
More expressive logics that are undecidable sound scary…
However, many things about conventional programming languages are undecidable…
For instance, you can't solve the halting problem for conventional programming languages, yet that doesn't drive most people to use languages that lack recursion and unbounded loops.
Algorithms to exactly solve common optimization problems (travelling salesman problem, etc.) are computationally intractable, but approximate algorithms are fine for the real world.
(Evaluation of production rules is not decidable in finite time, since it is possible to create an "infinite loop".)
34. Logical Theorem Proving
ex. VAMPIRE
If we constrain the action fields of rules a bit, we can prove theorems,
a highly flexible form of reasoning. There are other ways to do it, but
one effective method is the saturation solver.
[Diagram: the axioms plus the logical negation of the statement to prove (S) feed the solver, which emits conclusions]
If S is true, then NOT S is false. Eventually the solver will find a contradiction and produce the conclusion false.
Since you can derive an infinite number of conclusions from most theories, this process is not guaranteed to finish. A lucky or clever algorithm could reach false with a short chain. State-of-the-art reasoners use multiple search strategies that work well in many real-life cases.
35. Real-life OWL and RDFS performance doesn’t
satisfy
RDFS inference, done according to the book, generates a vast number
of trivial and uninteresting conclusions; practical reasoners usually
don’t implement the complete standard
36. Requirements for Practical logic
One long term goal for logic is
“capture 100% of critical knowledge in business documents”
It might sound like science fiction, but if we hire a team of programmers to implement a policy or to make a
system that complies with regulation and requirements, it is the goal. Can we (i) reduce team size, (ii) speed
up the project, and (iii) be able to show the rules being enforced to management in a way they can
understand?
Plain first-order logic does not cover all the bases.
We need:
• Modal logic (CAN, SHOULD, MUST, IT WAS TRUE THAT, HARRY BELIEVES THAT)
• Temporal logic (things change at different times)
• Default and Defeasible logic
• Higher-order logic: (for all statements S) or (there exists a statement S)
These logics are not as
mature as FOL, but we
can often use tricks to
simulate them
37. Modal logic
Key for Law, Contracts, Requirements, …
A modal operator qualifies a
statement:
MUST(S) -> S is necessarily true in any situation
USUALLY(S) -> S is usually true
PERMISSIBLE(S) -> it is permissible that S is true
BELIEVES(person,S) -> the specified person believes S is true
PREVIOUSLY(S) -> S was true in the past
Some modal logic problems can be addressed by rewriting the
problem, for instance if S(x,y) is a simple predicate we could
define a predicate like
BELIEVES_S(person,x,y)
We can’t express arbitrary statements this way, but we may
be able to express all the ones that we’ll really use.
Systems like SUMO use tricks like this to punch above their
weight
38. Temporal Logic
Change is the one thing that is constant. The population of Las Vegas was 25 in 1990 and 583,736 in 2010.
Since laws change over time, to know if a set of actions was illegal, we need to know when the actions were performed and what the law was at the time, and answer questions like "What did the President know and when did he know it?"
A complete theory is not fully developed, but some pretty good tools are available
The Allen Algebra
Time intervals are closer to reality than points in time; with time intervals we can specify that a meeting
starts at 6:00 pm on a certain day and goes on for 1 hour. We could ask if this overlaps with the interval of
another meeting to know if I need to choose between one meeting and the other.
Allen Algebra doesn’t cover all temporal reasoning cases, but it works well with production rule systems,
and is widely used in complex event processing.
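A minimal sketch of the meeting-conflict test (closed-open intervals in minutes; the full Allen algebra distinguishes thirteen relations, not just "overlaps"):

public class Intervals {
    // Two intervals intersect iff each starts before the other ends
    static boolean conflicts(long startA, long endA, long startB, long endB) {
        return startA < endB && startB < endA;
    }
    public static void main(String[] args) {
        // 6:00-7:00 pm vs. 6:30-7:30 pm on the same day
        System.out.println(conflicts(18 * 60, 19 * 60, 18 * 60 + 30, 19 * 60 + 30)); // true
    }
}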
39. Default and Defeasible Reasoning
The following logical chain leads to a bad
result:
Flies(Bird)
A(Penguin,Bird)
Flies(Penguin)
Exceptions are widespread in real life:
“A year divisible by 4 is a leap year, unless the year is divisible by
100; however, if the year is divisible by 400 it IS a leap year”
“An amateur radio operator may not transmit music unless they
are retransmitting a signal from the International Space Station.”
We could write
Any(x): A(x,Bird) and NOT(A(x,Penguin)) -> Flies(x)
But this gets hard to maintain when we find out about ostriches, domestic ducks, etc. It would be worse yet to
maintain a list of flying birds.
Default logic adds features that let us express defaults
Defeasible logic allows us to retract a conclusion if we find contrary evidence later
40. Logical Negation
ALL APPROACHES ARE SOMEWHAT PROBLEMATIC
There are many ways to implement logical negation, but there is no universal answer to the problem.
For instance, suppose we add
NOT(Underweight(person)) -> WellFed(person)
to the rules we’ve been working on.
If this rule is activated before we have (i) gotten height and weight information, (ii) computed the BMI, and (iii) classified this person, it will fire improperly. This might not be a problem if it has no real-world consequences and is retracted when it becomes false, but it's not the behavior we want.
41. Logic Programming
Practical Concessions
Phase I: Extract Information About
Height and Weight
Phase II: Compute BMI and
classify
Phase III: Make additional conclusions
knowing ALL Phase II conclusions
With the agenda mechanism in most
Business Rules Systems, each phase
can get a complete view of what
happened in the last phase,
meaning that negation, counting and
similar operations work as expected
(At the cost that we need to assign
rules to the right phases)
42. What about SPIN?
SPIN is similar in expressiveness to production rules.
ex:Person
a rdfs:Class ;
rdfs:label "Person"^^xsd:string ;
rdfs:subClassOf owl:Thing ;
spin:rule
[ a sp:Construct ;
sp:text """
CONSTRUCT {
?this ex:grandParent ?grandParent .
}
WHERE {
?parent ex:child ?this .
?grandParent ex:child ?parent .
}"""
] .
This is like a production rule written in reverse: we infer the triples in the CONSTRUCT clause based on matching the WHERE clause.
TopBraid Composer implements most inference through primitive forward chaining (a fixed-point algorithm; RETE cannot be used because the order of rule firing is unpredictable).
Backwards chaining can be
accomplished through the definition of
“magic properties” (something similar
can be done with Drools too)
SPIN has support for query templates, in some ways like Decision Tables but possibly more palatable for coders and for semantic apps.
Control of execution order, negation, and non-monotonic reasoning are not settled; less is known about how to implement them.
43. Linked Data
“Trough of Disillusionment”
The dream of linked data is that you can
easily “mash up” data from multiple sources
to answer questions.
If you want to get the right answers,
however, it is not so easy.
If you didn’t have a lot of experience in the corporate
world you might blame data publishers, RDF, and the
incentive structures around linked data for this,
however…
44. Corporate Data
… real life data in business is frequently bad;
80% of effort in data mining projects goes
into data preparation and cleaning.
[Diagram: a business analyst's tools fed by a sprawl of systems – ERP, POS, CRM, email, web, factory automation, Wiki, HR, SharePoint, CMS, custom apps, inventory, social, SaaS apps – with duplicate ERP, CRM, and CMS instances across business units]
A large business has multiple business units running a huge number
of applications written at different times by different people
Businesses grow by acquisition; to the extent that
customers and employees are aware of different IT
systems and their histories, customer service sucks,
employees underperform and costs are high
Businesses face the same problem as the Linked Data Community but
these problems happen behind closed doors and people are cursing
COBOL and SAP instead of RDF and SPARQL
45. While Linked Data was emerging, Enterprise IT developed
“Master Data Management” to enable a “Customer Centric” enterprise
Personal Account
Paul’s Business Account A
Paul’s Business Account B
Olivia’s Business Account
Child’s Account
Paul’s IRA
Olivia’s IRA
SEP IRA
Home Equity Line
Houseguest
Tenant A Personal
Tenant A Corporate
Tenant B
Traditional business systems are “account centric”, which is
enough to get by but not enough to thrive. To really serve me
well, my credit union needs a complete picture of the relationship
I have with it. (It took me a while to remember how many
accounts I have and I might have missed one)
Financial institutions are under legal pressure to “know your
customer” (KYC) and linking accounts that belong to a customer is
necessary to prevent monkey business
[Callouts: "My name is on this column of accounts, but not the others" … "but I own shares in this one!"]
46. Dominant paradigm for master data management:
Objects are clustered based on a distance metric; objects are "blocked" beforehand to avoid the N² cost of computing distances…
… this is effective in the case of matching different records for the same customer, but is NOT effective in cases where we have a ground truth and can know rather than guess …
Tyrol / Tirol
Two variants that differ by a letter can be fuzzy-matched, but it's hard to guess arbitrary things like:
• AT-7 (ISO 3166-2)
• AT33 (NUTS)
• AU07 (FIPS 10-4)
• 蒂罗尔州 (Chinese)
… and why guess when you can just look them up in a quality-controlled database?
Conventional MDM focuses on resolving customers (people or businesses); in some cases it involves resolving products. Generally the objects being matched are "equal" to each other in ontological status, such as two customer records.
Semantic MDM covers a wider range of concepts and often imports large amounts of knowledge from general databases, or involves alignment with industry ontologies. In some cases we are discovering new concepts and maintaining the ontology, but more often we are matching surface forms to underlying concepts.
47. Do we clean data before or after query time?
Weather station reports temperature in
centigrade, reports -999 upon error
32.1 34.6 36.3 -999 33.8
Let’s say we want to compute the average…
If we use the arithmetic mean, we get -172.4° C. Outrageously wrong!
If we know this device reports -999 on error, or that temperatures can never be less than -273.15, we can reject the bad value and get 34.2° C.
If we use the median instead of the mean, the outlier is automatically ignored and we get 33.8° C, close to the cleaned mean.
In this case it's reasonable to clean the data or use an algorithm that is robust to outliers – they teach kids in elementary school that the median is robust, but how many other robust algorithms are on the tip of everyone's tongue?
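A sketch of the cleaning approach from the example above (thresholds from this slide; a real pipeline would also log what it rejected):

import java.util.Arrays;

public class RobustMean {
    public static void main(String[] args) {
        double[] readings = {32.1, 34.6, 36.3, -999, 33.8};
        double mean = Arrays.stream(readings)
                .filter(t -> t != -999 && t > -273.15) // drop the error code / impossible temps
                .average()
                .orElse(Double.NaN);
        System.out.println(mean); // ≈ 34.2
    }
}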
48. Ahead-of-time data preparation
[Diagram: ahead-of-time data preparation – a data quality team runs the data through TEST CASES; test failure blocks further analysis, and error reports are thrown "over the wall" back to the data quality team, while the business analyst runs queries downstream]
The line drawn between data processing and data use establishes a test perimeter and makes the process scalable in human terms.
49. Fixing up at query time will drive you nuts
Scenario: Business Analyst writes
queries while talking to co-workers to
quickly build collective understanding.
Requirement: easy to write queries off
the cuff and get the right answer!
“joe@example.com”
<mailto:sb@example.com>
“3137q@example.com”
“lily@example.com”
It’s not hard to canonicalize two
variant forms of an e-mail address in
either a query or in processing the result
set
[Chart: effort rising steeply with query complexity]
A real query might be querying tens of values; some are used in conditions, others end up in the results. If many things are being joined (i.e., you're using SPARQL) the query will explode exponentially in complexity.
Will you trust the answer?
Some kind of query rewriting
(like the implementation of OWL 2
QL) might help, but we still lack a
perimeter where we can test the
system and give it a clean bill of
health
50. Ordered collections are awkward in RDF
• Two ways to do it because neither one is satisfying
RDF Containers
:Missions a rdf:Seq ;
rdf:_1 :Mercury ;
rdf:_2 :Gemini ;
rdf:_3 :Apollo .
This could generate huge numbers of predicates; also, nothing stops one from accidentally using a numbered label more than once, and the facts comprising this list could be spread across a system.
RDF Collections
:Missions a rdf:List ;
  rdf:first :Mercury ;
  rdf:rest _:n1 .
_:n1 rdf:first :Gemini ;
  rdf:rest _:n2 .
_:n2 rdf:first :Apollo ;
  rdf:rest rdf:nil .
Operations on a LISP-style list are slow because you need to follow lots of pointers. The use of blank nodes can protect Collections from modification (important in the OWL spec).
Neither construction is easy to query in (standard) SPARQL
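With SPARQL 1.1 property paths you can at least get the members out of an rdf:List, though the result set loses the order — a hedged Jena sketch (file name hypothetical, data as above):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class ListQuery {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel().read("missions.ttl");
        String sparql =
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
            "PREFIX : <http://example.com/> " +
            // walk the rdf:rest chain, then pick off each rdf:first
            "SELECT ?member WHERE { :Missions rdf:rest*/rdf:first ?member }";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSetFormatter.out(qe.execSelect()); // Mercury, Gemini, Apollo (order not guaranteed)
        }
    }
}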
51. Yet, some RDF syntaxes look almost the same
as JSON/XML
JSON
{
  "missions": ["Mercury", "Gemini", "Apollo"]
}
TURTLE
:Missions :members (:Mercury :Gemini :Apollo) .
Most RDF tools will expand this into a LISP-style list with blank nodes, but in the Turtle format the physical layout is the same as JSON.
Collections and Containers are described as "non-normative" in RDF 1.1; advanced tools may use special efficient representations (as would be used for JSON).
It’s awkward to work with ordered collections in the
common “client-server” model that revolves around
SPARQL engines, but for small graphs in memory, the
situation is different – the Jena framework provides a
facility for accessing Collections that feels a lot like
accessing data in JSON
Ordered collections are critical for dealing with external data formats that support ordered collections AND critical for many traditional RDF use cases such as metadata (you'll find scientists are pretty sensitive to the order of authors on a paper).
52. Another Bad Idea in Linked Data
DEREFERENCING
In principle a client could ask questions about individual items and "follow its nose" to discover related information.
In practice, however, you miss data quality problems that are obvious when you look at data holistically (e.g., 47 instead of 50 states).
If the data were clean ahead of time, and if we understood the structure of the data completely ahead of time, dereferencing might work.
Since Linked Data does not enforce quality standards, however,
dereferencing is one of those dangerous things that “almost works”.
53. [Sample CSV data]
John  | Martin  | T | 34  | $17.50  | I first met…
Barry | Robnson | F | 17  | $12.76  | Barry has…
Mary  | Capps   | T | 104 | $541.99 | Sometimes …
Eric  | Kramer  | T | 95  | $214.22 | Nobody who …
Matt  | Butts   | F | 32  | $6.54   | I've never …
Imagine we find a CSV file without any specification as to format…
Column annotations: most values in the first column match a list of common first names; most in the second match a list of common last names; the third column looks like Boolean values; the fourth are all integers; the fifth look like monetary values; the last fields appear to contain free text.
In the last example, we were able to make some pretty good guesses by looking at the
data, not knowing anything about the names of the headers. This could go a long way
towards interpreting this file in an automated way.
Add knowledge about the problem domain and we’re cooking with gas…
PROFILING
For best results, do analysis against ALL of the data!
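A toy sketch of the idea — guessing a column's type from the data alone (the patterns are hypothetical and far cruder than a real profiler's):

import java.util.List;

public class ColumnGuesser {
    static String guess(List<String> column) {
        if (column.stream().allMatch(v -> v.matches("[TF]")))             return "boolean";
        if (column.stream().allMatch(v -> v.matches("-?\\d+")))           return "integer";
        if (column.stream().allMatch(v -> v.matches("\\$\\d+\\.\\d{2}"))) return "money";
        return "free text"; // a real profiler would also match name lists, dates, places, etc.
    }
    public static void main(String[] args) {
        System.out.println(guess(List.of("T", "F", "T")));       // boolean
        System.out.println(guess(List.of("$17.50", "$12.76")));  // money
    }
}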
54. Traditional Data Warehousing
[Diagram: data from four different point-of-sale systems (A–D), used in different parts of a company, feeding a CANONICAL DATA MODEL]
The good: analysts work with consistent, clean data.
The bad: the burden of normalizing the data when it is generated is felt acutely; in the worst case we could do this work and never end up analyzing the data.
The ugly: since the normalization was done before the requirements for analysis were known, the normalized data may not satisfy the analysts' requirements.
55. Data Lake Enabled by Hadoop
• Ingestion is simple because we simply copy raw data of any kind to HDFS; development and operations are not burdened by ingestion requirements.
• Data import is lossless.
• Compute and data are tightly coupled; we can "full scan" the data quickly at any time.
• Data cleanup can be performed to meet requirements of specific uses AND can be informed by inspection of the complete data set.
• Analysis can be performed on text and other kinds of data which cannot be normalized conventionally.
56. We can square this circle…
[Diagram: raw data flows into the Data Lake on the operations side – not perfect, but not damaged by the import process! – and out through projects: queries, predictive analytics, machine learning, and other projects]
Data preparation is driven by requirements; no wasted time and no compromises.
Ontologies, taxonomies, and logic programming mean
an increasing amount of work can be shared between
projects
57. Putting Knowledge To Work
(UNIT CONVERSION ONCE AGAIN)
EnglishTemp(location,amount) -> INSERT(MetricTemp(location,(5/9)*(amount-32)))
Conversion of a unit
represented by a predicate is
one simple rule that could be
written by hand
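One way such a generated rule could be realized is as a SPARQL CONSTRUCT — a hedged sketch in Jena (vocabulary and file name hypothetical):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class UnitConversion {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel().read("temps.ttl");
        String sparql =
            "PREFIX : <http://example.com/> " +
            "CONSTRUCT { ?loc :metricTemp ?c } " +
            "WHERE { ?loc :englishTemp ?f . BIND((?f - 32) * 5 / 9 AS ?c) }";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            qe.execConstruct().write(System.out, "TURTLE"); // the converted metric facts
        }
    }
}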
[Diagram: input and output data specifications are analyzed; analysis of the input and output schemas reveals the need for unit conversion, and the system gets the conversion rule out of a world-knowledge library (general, industry-specific, company-specific) and specializes it for code generation]
58. Intelligent Data Preparation
[Diagram: documentation and machine-readable schemas describe the data lake; a scalable/parallel profiler and a transformer feed consumers, guided by ontologies and requirements; a knowledge base about instances (e.g., places) and common patterns in data expression (e.g., date formats) spans broad-spectrum, vertical-specific, company-specific, and application-specific knowledge; an iterative development process generates and tests hypotheses]