Making the Semantic Web
Work
Reasoning beyond OWL
What is semantics?
Although animals do not use language, they are capable of many of the same kinds of cognition as us; much
of our experience is at a non-verbal level.
Semantics is the bridge between surface forms used in language and what we do and experience.
Language understanding depends on world knowledge (e.g. “the pig is in the pen” vs. “the ink is in the pen”)
Machine to Machine Communication
message
exchange
Underlying the systems are different
databases; the ability to “get something
done” is like a non-verbalized ability, but to
work with other systems we need to
formulate messages in an artificial
language.
Understanding human language is a big problem.
What chunk can we break off that will be useful
and can be done today?
Key insight:
The semantic problem of communication between
business IT systems isn’t that different from the
semantic problem of communication between
animals
Natural Language to support M2M
Internal database
Industry standard
message format
Machine-readable and human readable
specifications
Capture critical knowledge in graph database; perhaps 80% of
process can be automated, but human effort is part of a structured
process that clearly links specification to implementation
Captured specifications are used to
compile data transformation rules.
Graph model is used as “universal solvent”
More generally…
requirements, regulations, policies
Programs that
implement behaviors
We might not be ready for executives to specify
policies themselves, but we can make the process
from specification to behavior more automated,
linked to precise vocabulary, and more traceable.
Advances such as SBVR and an English
serialization for ISO Common Logic mean that
executives and line workers can understand why
the system does certain things, or verify that
policies and regulations are implemented.
Logged Decision Process
Focusing on the execution of tasks is
the road to real semantics; anything
that does a useful job solves the
“grounding problem;” Children can’t
learn language by watching television,
only by talking with others.
Making Expressive Reasoning Scalable
Scalable fabric
BACKGROUND KNOWLEDGE
RULES MODELS
ALGORITHMS
HEURISTICS
Scalable system merges data from siloed sources;
constructs graph(s) of facts relevant to specific
records and entities
profiler
VOCABULARY MANAGEMENT
VERSION CONTROL
EXCEPTION HANDLING
BUSINESS RULES MANAGEMENT
CASE MANAGEMENT
CONCEPT MATCHING
BEHAVIOR TRACEABLE
TO REQUIREMENTS
MULTILINGUAL SUPPORT
ENRICHED LINKED DATA
Scalable profiler lets system discover
“ground truth” about data to inform
generated rules and behaviors
People are looking for better tools
Unconstructive Criticism of the Semantic Web is Common
Blanket dismissals displace real thinking, particularly a “gap analysis” as to what is missing.
Yet, certain unworkable standards (OWL) have also displaced real progress.
History of RDF is about evolution
good stuff survives, bad ideas (slowly) fade away
RDF/XML
RDFS
OWL
SPARQL
SPIN
Linked Data
ISO Common
Logic
Turtle
Early work built on XML, had natural
representations for ordered
collections but was pedagogically
awful (where are the triples?)
N-Triples
Turtle is a human friendly format but
isn’t scalable to billions of triples
Competition for
schema/inference languages left
two winners
A full-featured query language
changed everything: but ordered
collections go “under the bus”
New inference and transformation
languages emerge
In the Linked Data era we can
handle billions of triples, but
collections and blank nodes
become awkward
In the long-term we’ll see highly
expressive languages forward
compatible with RDF
RDF*
RDF* and SPARQL* let us make
statements about statements
and query them; this increases
expressiveness and can be used for
data management
We can be optimistic because…
multiple communities have been working on similar things in parallel
Semantic web
RDF / SPARQL
Diagramming and
representation of
data structures,
processes,
systems, models,
etc.
Common Logic
and
Message
Vocabularies
SUMO
Upper ontology
Commercial
Master Data
Management
products
accurately match
entities
Vocabularies and
message formats
for business
When you look at the pieces of the puzzle developed by communities that don’t really talk to
each other, you see that the “state of the art” is better than it appears…
Common data models
• Relational data model
• Fundamentally tabular, like a CSV file
• Object-relational model
• A column can contain rows
• This is like XML or JSON
• Graph Model
• Highly general
• Hypergraphs
• “Property Graphs” and RDF*
These models are compatible in that
you can represent a graph with
relational tables, break up an XML
record into multiple relational tables,
or even embed a hypergraph inside a
graph, but there are big differences
when it comes to efficiency when you
need a certain set of facts in one
place.
Predicate Calculus
RDF is a special case of the “predicate calculus”
Statement of arity 2
Predicate Calculus:
A(:Dog,:Fido)
RDF:
:Fido a :Dog .
Statement of arity 3
Predicate Calculus:
:Population(:Nigeria,2013,173.6e6)
RDF:
[
a :Population ;
:where :Nigeria ;
:when 2013 ;
:amount 173.6e6
] .
It’s not too hard to write this in
Turtle
This implementation, however,
is structurally unstable, since
we went from one triple to four
triples
How to think about RDF
• The basic element of RDF is the Node
• This borrows heavily from XML in that
• Terms come out of a URL-based namespace so we can throw everything in a big pot
• We get the basic types from XML schema
• Plus we can even use XML literals
• A triple is just a tuple with (i) three nodes, and (ii) set semantics
• Higher-arity predicates are tuples with >3 nodes
• SPARQL result sets and intermediate results are tuples of Nodes
• Official serialization formats exist for SPARQL result sets
ISO Common Logic is the obvious upgrade path, since it uses the same data types as RDF and can handle RDF
triples, as well as higher-arity predicates and intuitively obvious inference.
ISO Common Logic
Next step in evolution
• Uses RDF Node as basic data type with all benefits thereof
• RDF triples are just arity 2 predicates and can be used directly
• First order logic operators supported; typed logic allows some
“beyond first order logic” capabilities
• OWL and RDFS can be implemented as a theory in FOL
• Builds on the KIF Knowledge Interchange Format
• Foundation for additional developments
• Controlled English Format for Common Logic Statements
• Modal logics: SBVR
• Interchange language for knowledge-based systems of all kinds
The Old RDF: Expressive but not scalable
Early RDF:
RDF/XML serialization, heavy use of blank nodes, extreme
expressiveness:
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object rdfs:Class ;
sp:predicate rdf:type ;
sp:subject _:b1
] [ a sp:SubQuery ;
sp:query
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object _:b2 ;
sp:predicate rdfs:label ;
sp:subject _:b1
])
]
])
]
This is a representation of a SPARQL
query in RDF!
This example uses Turtle, where
square brackets create blank nodes
and parentheses create lists.
With this graph in the JENA
framework you can easily manipulate
this as an abstract syntax tree.
Very complex relationships, such as
mathematical equations, can be built
this way; blank nodes can be used to
write high-arity predicates.
Accessing it through SPARQL would
not be so easy!
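As a sketch of what that tree-walking looks like in practice (Turtle shortened to one pattern; Jena 3+ API; everything beyond the standard SPIN namespace is illustrative):

```java
// A minimal sketch of walking a SPIN graph as a syntax tree with Jena.
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import java.io.StringReader;

public class SpinWalk {
    static final String SP = "http://spinrdf.org/sp#";

    public static void main(String[] args) {
        String ttl = String.join("\n",
            "@prefix sp: <http://spinrdf.org/sp#> .",
            "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
            "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
            "[ a sp:Select ;",
            "  sp:resultVariables (_:b2) ;",
            "  sp:where ([ sp:subject _:b1 ; sp:predicate rdf:type ; sp:object rdfs:Class ]) ] .");
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TURTLE");

        // The root of the tree is the (blank) node typed sp:Select.
        Resource selectType = m.createResource(SP + "Select");
        m.listResourcesWithProperty(RDF.type, selectType)
         .forEachRemaining(select -> dump(select, 0));
    }

    // Recursively print the tree under a node, descending into blank nodes
    // (which also covers the rdf:first/rdf:rest cells of lists).
    static void dump(Resource node, int depth) {
        node.listProperties().forEachRemaining(stmt -> {
            System.out.printf("%s%s -> %s%n", "  ".repeat(depth),
                    stmt.getPredicate().getLocalName(), stmt.getObject());
            if (stmt.getObject().isAnon()) {
                dump(stmt.getObject().asResource(), depth + 1);
            }
        });
    }
}
```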
Linked Data: New Focus
Linked data source
Blank nodes are discouraged because it’s hard for
a distributed community to talk about something
without a name.
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object rdfs:Class ;
sp:predicate rdf:type ;
sp:subject _:b1
] [ a sp:SubQuery ;
sp:query
[ a sp:Select ;
sp:resultVariables (_:b2) ;
sp:where ([ sp:object _:b2 ;
sp:predicate rdfs:label ;
sp:subject _:b1
])
]
])
]
Turtle and RDF/XML (which
have sweet syntax for blank
nodes) are not scalable
because the parser cannot be
restarted after a failure: if
you have billions of triples, a
few will be bad
<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show" .
<http://example.org/show/218> <http://example.org/show/localName> "That Seventies Show"@en .
<http://example.org/show/218> <http://example.org/show/localName> "Cette Série des Années Septante"@fr-be .
<http://example.org/#spiderman> <http://example.org/text> "This is a multi-line\nliteral with many quotes (\"\"\"\"\")\nand two apostrophes ('')." .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/atomicNumber> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/specificGravity> "1.663E-4"^^<http://www.w3.org/2001/XMLSchema#double> .
N-Triples is practical for large databases such as Freebase and DBpedia because records are
isolated, but blank nodes must be named and triple-centric modelling is encouraged
We now have a great query language, SPARQL. SPARQL supports the
same shorthand for blank nodes as Turtle. Some blank node patterns
work naturally, but it is particularly hard to ask questions about
ordered collections.
Blank nodes, collections, etc. are out of fashion.
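For what it's worth, the usual workaround is the SPARQL 1.1 property path rdf:rest*/rdf:first, which reaches every member of a list but loses their order. A minimal Jena sketch (example URIs illustrative):

```java
// Querying an RDF list with a property path: members come back, order does not.
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import java.io.StringReader;

public class ListQuery {
    public static void main(String[] args) {
        String ttl = String.join("\n",
            "@prefix : <http://example.com/> .",
            ":Missions :members (:Mercury :Gemini :Apollo) .");
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TURTLE");

        String q = String.join("\n",
            "PREFIX : <http://example.com/>",
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
            "SELECT ?mission WHERE {",
            "  :Missions :members/rdf:rest*/rdf:first ?mission .",  // walks the list cells
            "}");
        try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
            qe.execSelect().forEachRemaining(row -> System.out.println(row.get("mission")));
        }
    }
}
```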
Old Approaches To Reification:
Named Graphs
:graph :subject :predicate :object .
Adding an extra node to a triple is simple, practical and useful for
many purposes.
For instance, I could take in triple data from various sources and
keep them apart by putting them in different graphs.
The trouble is that this is a one trick pony: I can’t take collections
of named graphs from different sources and keep them apart
using named graphs
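A minimal Jena sketch of the one trick the pony does well: keeping two sources apart and exposing the graph name at query time (all URIs illustrative):

```java
// Two sources disagree about Fido; named graphs keep the claims separate.
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class NamedGraphs {
    public static void main(String[] args) {
        Model a = ModelFactory.createDefaultModel();
        a.add(a.createResource("http://example.com/Fido"),
              a.createProperty("http://example.com/species"), "dog");
        Model b = ModelFactory.createDefaultModel();
        b.add(b.createResource("http://example.com/Fido"),
              b.createProperty("http://example.com/species"), "cat");

        Dataset ds = DatasetFactory.create();
        ds.addNamedModel("http://example.com/source-a", a);
        ds.addNamedModel("http://example.com/source-b", b);

        // The GRAPH keyword surfaces provenance alongside each answer.
        String q = "SELECT ?g ?species WHERE { GRAPH ?g "
                 + "{ ?s <http://example.com/species> ?species } }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
            qe.execSelect().forEachRemaining(System.out::println);
        }
    }
}
```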
For practical logic we need to be able to qualify statements to manage:
• Provenance
• Access Controls
• Metadata
• Modal relationships
• Time
Old Approaches to Reification
Reification with Blank Nodes
[
rdf:type rdf:Statement ;
rdf:subject :Tolkien ;
rdf:predicate :wrote ;
rdf:object :LordOfTheRings ;
:said :Wikipedia
] .
http://stackoverflow.com/questions/1312741/simple-example-of-reification-in-rdf
This isn’t too hard to write in Turtle, but it breaks
SPARQL queries and inference for reified triples.
The number of triples is at the very least tripled; the
triple store is unlikely to be able to optimize for
common use cases.
a new standard that unifies RDF with the property graph model
RDF*/SPARQL* (Reification Done Right)
Turtle facts:
:bob foaf:name "Bob" .
<<:bob foaf:age 23>> dct:creator <http://example.com/crawlers#c1> ;
    dct:source <http://example.net/homepage-listing.html> .
SPARQL query:
SELECT ?age ?src WHERE {
?bob foaf:name "Bob" .
<<?bob foaf:age ?age>> dct:source ?src .
}
This is huge! So far products based on property graphs have been ad-hoc, without a
formal model. SPARQL* brings rich queries to the property graph model and the reverse
mapping means RDF* can be processed with traversal-based languages like Gremlin.
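A sketch of the slide's data and query running in Jena, assuming a release with RDF* support (Jena 4.x or later); data abridged to the dct:source annotation:

```java
// Parsing Turtle-star and running a SPARQL-star query with Jena.
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import java.io.StringReader;

public class RdfStarDemo {
    public static void main(String[] args) {
        String ttl = String.join("\n",
            "@prefix : <http://example.com/> .",
            "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
            "@prefix dct: <http://purl.org/dc/terms/> .",
            ":bob foaf:name \"Bob\" .",
            "<< :bob foaf:age 23 >> dct:source <http://example.net/homepage-listing.html> .");
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TURTLE");  // Turtle-star accepted in Jena 4.x

        String q = String.join("\n",
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
            "PREFIX dct: <http://purl.org/dc/terms/>",
            "SELECT ?age ?src WHERE {",
            "  ?bob foaf:name \"Bob\" .",
            "  << ?bob foaf:age ?age >> dct:source ?src .",  // match on the quoted triple
            "}");
        try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
            qe.execSelect().forEachRemaining(System.out::println);
        }
    }
}
```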
Roles of Schemas
• Documentation
• Integrity Preservation
• Efficiency
• Inference
Schemas as Documentation
Humans write code to insert data and write queries: schemas tell us
how the data is organized.
Automated systems can also use schemas to drive code generation
(consider object-relational mapping)
Schemas can preserve integrity
SQL:
create table customer (
    id integer primary key,
    username varchar(16) unique not null,
    email varchar(64) not null
)
SQL prevents attempts to insert records
with non-existing fields or lacking
required fields. SQL can enforce key
integrity and other constraints.
You can (often) code algorithms and take
it for granted that data structures satisfy
invariants required for those algorithms
to work.
RDF:
RDFS and OWL, implemented with the standard semantics,
do not validate data.
Practically, RDF users will use types and properties across a
wide range of standard and proprietary namespaces, and it can
be hard to keep track of them all.
For instance, rdfs:label is defined in RDFS, despite the fact that
you can have labels without schemas. Terms that are the
bread-and-butter of RDFS, such as rdf:type and rdf:Property,
are defined in the RDF specification.
It’s an easy mistake to get the “s” wrong when writing either data or
queries; if you do, you run a query, get zero results, and can
easily chase your tail looking for other causes.
You can (and should) define an alternate semantics for RDFS and
OWL that rejects types and properties appearing in data or queries
without being declared in a schema, but this is nonstandard.
These issues are addressed in the “RDF Data Shapes” effort (SHACL),
completed in 2017.
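A sketch of what such a nonstandard "closed vocabulary" check might look like with Jena — a homegrown guard, not SHACL and not the standard semantics, and deliberately simplistic (it only looks for rdf:Property declarations):

```java
// Flag any predicate used in the data that the schema does not declare.
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import java.util.*;

public class VocabCheck {
    public static Set<Property> undeclaredPredicates(Model data, Model schema) {
        Set<Property> bad = new HashSet<>();
        data.listStatements().forEachRemaining(stmt -> {
            Property p = stmt.getPredicate();
            if (!schema.contains(p, RDF.type, RDF.Property)) {  // not declared anywhere
                bad.add(p);
            }
        });
        return bad;
    }

    public static void main(String[] args) {
        Model schema = ModelFactory.createDefaultModel();  // empty schema for the demo
        Model data = ModelFactory.createDefaultModel();
        data.add(data.createResource("http://example.com/x"),
                 // the classic mistake: rdf:label where rdfs:label was meant
                 data.createProperty("http://www.w3.org/1999/02/22-rdf-syntax-ns#label"),
                 "oops");
        System.out.println(undeclaredPredicates(data, schema));
    }
}
```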
Schemas can promote efficiency
JSON (and XML)
Structural information is repeated in each
record
{
  "red": 201,
  "green": 99,
  "blue": 82,
  "alpha": 115
}
(85 bytes)
C
typedef unsigned char byte;
struct color {
    byte red;
    byte green;
    byte blue;
    byte alpha;
};
Defines the meaning of
201 99 82 115
(4 bytes)
20x compression!
In numerical work, it often takes longer to convert a million numbers from ASCII to float than you spend working on the floats. The speed of text parsing is a limiting factor in
electronic trading systems and many other applications.
GZIP compression of repetitive data helps, but you get a smaller file if you apply GZIP to binary data. You pay a CPU price for data compression plus a large price in string parsing.
Textual data formats have been fashionable in the Internet Age, because it is easy to get string parsing code to “almost work;” one of the reasons we are hearing about security
breaches every day is that it’s extremely difficult to write correct string parsing code.
RDF standards do not address binary serialization; however, Vital AI can create binary formats based on OWL schemas
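A toy Java illustration of the struct-packing idea above — not a standard RDF serialization, just the 85-bytes-to-4-bytes arithmetic made concrete:

```java
// Pack the color record into 4 bytes; the schema, not the record,
// carries the field names.
import java.nio.ByteBuffer;

public class PackedColor {
    static byte[] pack(int red, int green, int blue, int alpha) {
        return ByteBuffer.allocate(4)
                .put((byte) red).put((byte) green)
                .put((byte) blue).put((byte) alpha)
                .array();
    }

    public static void main(String[] args) {
        byte[] rec = pack(201, 99, 82, 115);
        System.out.println(rec.length + " bytes");  // prints "4 bytes"
    }
}
```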
Schemas and Inference
the unique value of RDF!
:Joe myvocab:emailAddress "joe@example.com" .
dbpedia:Some_Body db:eMailAddress <mailto:sb@example.com> .
basekb:m.3137q basekb:organization.email_contact.email_address "3137q@example.com" .
:Lily schemaOrg:email "lily@example.com" .
Look at 4 RDF vocabularies and find 4 ways to write an e-mail address
myvocab:emailAddress rdfs:subPropertyOf foaf:email .
db:eMailAddress rdfs:subPropertyOf foaf:email .
basekb:organization.email_contact.email_address rdfs:subPropertyOf foaf:email .
schemaOrg:email rdfs:subPropertyOf foaf:email .
:Joe foaf:email "joe@example.com" .
dbpedia:Some_Body foaf:email <mailto:sb@example.com> .
basekb:m.3137q foaf:email "3137q@example.com" .
:Lily foaf:email "lily@example.com" .
A-BOX
T-BOX
Inferred
facts
It looks like an answer for data integration,
but…
:Joe foaf:email "joe@example.com" .
dbpedia:Some_Body foaf:email <mailto:sb@example.com> .
basekb:m.3137q foaf:email "3137q@example.com" .
:Lily foaf:email "lily@example.com" .
There are two reasonable ways to write an email address:
as a string or as a URI
foaf:email rdfs:range owl:Thing .
According to the FOAF spec, only the URI form is correct since
"In OWL DL literals are disjoint from owl:Thing" (at least if
we are using OWL DL…)
Any ETL tool has an ability to apply a function to data (it’s not hard at all to write code to translate a string to
a mailto: URI)
RDFS and OWL, however, can’t do simple format conversion. For instance, it is reasonable for people to
specify temperatures in Fahrenheit or Centigrade or Kelvin, but OWL inference can’t “multiply by
something and add” – even though it can state that properties “mean the same thing”, it can’t specify
simple transformations.
Something like OWL may be necessary for data integration, but OWL is not sufficient.
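For comparison, here is the kind of one-line ETL step OWL cannot express, sketched with Jena (the FOAF property URI is real; the wrapper class is illustrative):

```java
// Canonicalize foaf:email values: string literals become mailto: URIs.
import org.apache.jena.rdf.model.*;
import java.util.List;

public class CanonicalizeEmail {
    static final String FOAF_EMAIL = "http://xmlns.com/foaf/0.1/email";

    public static void canonicalize(Model m) {
        Property email = m.createProperty(FOAF_EMAIL);
        // Collect first so we can modify the model while iterating safely.
        List<Statement> literals = m.listStatements(null, email, (RDFNode) null)
                .filterKeep(s -> s.getObject().isLiteral()).toList();
        for (Statement s : literals) {
            String addr = s.getString();
            m.remove(s);
            m.add(s.getSubject(), email, m.createResource("mailto:" + addr));
        }
    }
}
```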
Other things OWL Can’t Do
• We can’t reject data
• Reject things that we don’t agree with
• Reject things we don’t need; let’s use Freebase to seed…
• A directory of ski areas
• The spatial hierarchy of Africa
• A biomedical ontology
We don’t want to pay to store stuff we don’t need, or wait for it to be
processed, do quality control on it, or deal with any problems it might
create
OWL is unintuitive
Here’s an excerpt from the FIBO (Finance) ontology:
Organization:
A social unit of people, systematically structured
and managed to meet a need or pursue collective
goals on a continuing basis.
Autonomous Agent:
An agent is an autonomous individual that can
adapt to and interact with its environment.
Property Restriction 1:
Set of things that must have property "has
member" at least 2 taken from "autonomous
agent"
Property Restriction 2:
Set of things that may have property "has part"
taken from "organization"
Property Restriction 3:
Set of things that must have property "has" at least
1 taken from "goal"
Property Restriction 4:
Set of things that may have property "has" taken
from "postal address"
LEGEND: an arrow means “has parent”
How do you explain this to your boss? To the programmer that just joined the team? What kind of inference does this entail?
(I think two people with a goal are an organization, but is there a real difference between a DBA (“doing business as”) filing for a person
who is self-employed and one that has an additional employee?)
It’s not always obvious how to do things in
OWL
You can’t say
“The United States Has 50 States”
But you can say
“Anything that has 50 states is the United States”
You can get close to what you want to say by
“The United States is a member of an anonymous
class that contains anything with 50 states.”
You can get some entailments from that, but nothing
happens if only 47 states are on the list (it’s an open
world, we just don’t know about…)
Thus:
It’s not obvious what exactly can be specified in
OWL.
If you talk to an expert, you’ll find that he can do a
lot of things you might think aren’t possible.
Production Rules and First-Order Logic
Many 1970s “expert systems” were driven by production rules; these are now widespread in “Business Rules
Engines”.
Condition -> Action
Common data transformations can be easily written with production rules:
Weight(person,weight) and Height(person,height) -> BodyMassIndex(person,weight/height^2)
BodyMassIndex(person,bmi) and bmi<18.5 -> Underweight(person)
BodyMassIndex(person,bmi) and 18.5<=bmi<25 -> NormalWeight(person)
BodyMassIndex(person,bmi) and 25<=bmi<30 -> Overweight(person)
BodyMassIndex(person,bmi) and 30<=bmi -> Obese(person)
You could easily miss it reading the documentation,
but the threshold classifications can be stated in OWL
by using XML Schema constraints on data types.
The arithmetic in the first rule, however, can't be done
in OWL. You just can't.
Production Rules
vs imperative programming languages
The BMI example could easily be written in (say) Java…
BUT
You have to get the steps in the right order; this is trivial to do in a simple case, but it gets increasingly
harder as complexity goes up. This is one of the reasons why programming is a specialized skill.
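Here is a minimal imperative version of the BMI rules, with the sequencing made explicit:

```java
// The BMI rules written imperatively; the ordering a rule engine
// figures out for us is now our responsibility.
public class Bmi {
    static String classify(double weightKg, double heightM) {
        double bmi = weightKg / (heightM * heightM);  // step 1 must come first
        if (bmi < 18.5) return "Underweight";         // step 2 depends on step 1
        if (bmi < 25.0) return "NormalWeight";
        if (bmi < 30.0) return "Overweight";
        return "Obese";
    }

    public static void main(String[] args) {
        System.out.println(classify(70.0, 1.75));  // NormalWeight (bmi ~22.9)
    }
}
```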
Production Rules constrain the conditions so the engine can quickly determine which rules
are fired when the state changes…
… but the actions are written in a conventional programming language like LISP or
Java, so we can use a full spectrum of programming techniques and a lot of
existing code.
Note: rules engines have advanced greatly since the “golden age of AI”, and now
100,000+ rules and 10 million+ facts are practical.
Production Rules in the Wider Picture
Drools Expert: Execution of production rules
Drools Fusion: Complex event processing
jBPM: Business Process Management; coordination of asynchronous human
and automated behaviors – controlled by rules
Optaplanner: Multi-objective combinatorial optimization for tasks such as
scheduling, vehicle routing, box packing – controlled by rules
This is the JBOSS stack; products such as Blaze Advisor and iLog do all this and more.
The use of production rules to control business processes, particularly in scenarios involving complex
workflows and multiple complex requirements, is well established.
This is an emerging research topic in the
semweb community, but in the business rules
world this is a mature technology
“Impedance Mismatch” between Business
Rules and RDF is minimal
Most Java Rules Engines (like JESS and Drools) can reason
about ordinary Java objects
RDF data can be converted to specialized predicate
objects for performance or convenience, but it is
very possible to insert objects from the Jena
framework such as Nodes, Triples and Models directly
into a rules engine.
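A hedged sketch of that wiring, assuming a Drools KieSession configured on the classpath (the session name and data are illustrative):

```java
// Feed Jena Statements into a Drools session as plain facts;
// rules in the knowledge base can then match on their fields.
import org.apache.jena.rdf.model.*;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class TriplesToRules {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.add(m.createResource("http://example.com/Fido"),
              m.createProperty("http://example.com/species"), "dog");

        KieServices ks = KieServices.Factory.get();
        KieContainer kc = ks.getKieClasspathContainer();
        KieSession session = kc.newKieSession("triplesSession");  // name is an assumption

        m.listStatements().forEachRemaining(session::insert);  // each Statement is a fact
        session.fireAllRules();
        session.dispose();
    }
}
```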
OWL and RDFS implementations often use
production rules
OWL 2 RL dialect
Forward chaining
The semantics of RDFS and most of OWL can be
implemented with production rules; RETE and
Post-RETE algorithms can evaluate these efficiently.
Popular reasoners such as Jena and OWLIM often use
a box of production rules to implement RDFS and
OWL and expose this functionality so you can
implement custom inference.
OWL 2 QL dialect
Backwards chaining
RDFS, and another major subset of OWL, can be
implemented by rewriting SPARQL queries.
Since SPARQL is based on relational algebra, the
whole bag of tricks used to optimize relational
database queries can be used to efficiently answer
queries.
OWL dialects have “computer science”
advantages
(i.e., algorithms exist to answer queries in bounded time, with scaling
that looks good on paper)
More expressive logics that are undecidable sound scary…
However, many things about conventional programming languages are undecidable.
For instance, you can't solve the halting problem for conventional programming languages, yet that
doesn't push people to use languages that lack recursion and unbounded loops.
Algorithms to exactly solve common optimization problems (travelling salesman problem, etc.) are
computationally intractable, but approximate algorithms are fine for the real world.
(Evaluation of production rules is not guaranteed to terminate, since it is possible to create an “infinite loop”)
Logical Theorem Proving
ex. VAMPIRE
If we constrain the action fields of rules a bit, we can prove theorems,
a highly flexible form of reasoning. There are other ways to do it, but
one effective method is the saturation solver.
Axioms
(S) Statement to
prove
Logical Negation Solver
conclusions
If S is true, then not S is false. Eventually the solver
will find a contradiction and produce the conclusion
false.
Since you can derive an infinite number of conclusions from
most theories, this process is not guaranteed to finish. A lucky
or clever algorithm could reach false with a short chain.
State of the art reasoners use multiple search strategies that
work well in many real-life cases.
Real-life OWL and RDFS performance doesn’t
satisfy
RDFS inference, done according to the book, generates a vast number
of trivial and uninteresting conclusions; practical reasoners usually
don’t implement the complete standard
Requirements for Practical logic
One long term goal for logic is
“capture 100% of critical knowledge in business documents”
It might sound like science fiction, but if we hire a team of programmers to implement a policy or to make a
system that complies with regulation and requirements, it is the goal. Can we (i) reduce team size, (ii) speed
up the project, and (iii) be able to show the rules being enforced to management in a way they can
understand?
Plain first-order logic does not cover all the bases.
We need:
• Modal logic (CAN, SHOULD, MUST, IT WAS TRUE THAT, HARRY BELIEVES THAT)
• Temporal logic (things change at different times)
• Default and Defeasible logic
• Higher-order logic (for all statements / there exists a statement)
These logics are not as
mature as FOL, but we
can often use tricks to
simulate them
Modal logic
Key for Law, Contracts, Requirements, …
A modal operator qualifies a
statement:
MUST(S) -> S is necessarily true in
any situation
USUALLY(S) -> S is usually true
PERMISSIBLE(S) -> It is
permissible that S is true
BELIEVES(person,S) -> specified
person believes S is true
PREVIOUSLY(S) -> S was true in
the past
Some modal logic problems can be addressed by rewriting the
problem, for instance if S(x,y) is a simple predicate we could
define a predicate like
BELIEVES_S(person,x,y)
We can’t express arbitrary statements this way, but we may
be able to express all the ones that we’ll really use.
Systems like SUMO use tricks like this to punch above their
weight
Temporal Logic
Change is the one thing that is constant. The population of Las Vegas was 25 in 1900 and 583,736 in 2010.
Since laws change over time, to know if a set of actions was illegal, we need to know when the actions
were taken and what the law was at the time, and answer questions like “What did the President know and when
did he know it?”
A complete theory is not fully developed, but some pretty good tools are available
The Allen Algebra
Time intervals are closer to reality than points in time; with time intervals we can specify that a meeting
starts at 6:00 pm on a certain day and goes on for 1 hour. We could ask if this overlaps with the interval of
another meeting to know if I need to choose between one meeting and the other.
Allen Algebra doesn’t cover all temporal reasoning cases, but it works well with production rule systems,
and is widely used in complex event processing.
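A minimal sketch of the interval test in the meeting example; the full Allen algebra distinguishes 13 relations (before, meets, overlaps, starts, during, finishes, equals, and inverses), and this shows just the basic intersection check:

```java
// Do two meeting intervals intersect? Times are illustrative.
import java.time.LocalDateTime;

public class Intervals {
    record Interval(LocalDateTime start, LocalDateTime end) {
        boolean intersects(Interval other) {
            // Non-empty overlap: each starts before the other ends.
            return start.isBefore(other.end) && other.start.isBefore(end);
        }
    }

    public static void main(String[] args) {
        Interval a = new Interval(LocalDateTime.parse("2017-01-05T18:00"),
                                  LocalDateTime.parse("2017-01-05T19:00"));
        Interval b = new Interval(LocalDateTime.parse("2017-01-05T18:30"),
                                  LocalDateTime.parse("2017-01-05T19:30"));
        System.out.println(a.intersects(b));  // true: I must choose one meeting
    }
}
```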
Default and Defeasible Reasoning
The following logical chain leads to a bad
result:
Flies(Bird)
A(Penguin,Bird)
Flies(Penguin)
Exceptions are widespread in real life:
“A year divisible by 4 is a leap year, unless the year is divisible by
100; however, if the year is divisible by 400 it IS a leap year”
“An amateur radio operator may not transmit music unless they
are retransmitting a signal from the International Space Station.”
We could write
Any(x): A(x,Bird) and NOT(A(x,Penguin)) -> Flies(x)
But this gets hard to maintain when we find out about ostriches, domestic ducks, etc. It would be worse yet to
maintain a list of flying birds.
Default logic adds features that let us express defaults
Defeasible logic allows us to retract a conclusion if we find contrary evidence later
Logical Negation
ALL APPROACHES ARE SOMEWHAT PROBLEMATIC
There are many ways to implement logical negation, but there is no universal answer to the problem.
For instance, suppose we add
NOT(Underweight(person)) -> WellFed(person)
to the rules we’ve been working on.
If this rule is activated before we have: (i) gotten height and weight information, (ii) computed the BMI,
and (iii) classified this person, it will fire improperly. This might not be a problem if the conclusion has no real-world
consequences and is retracted when it becomes false, but it's not the behavior we want.
Logic Programming
Practical Concessions
Phase I: Extract Information About
Height and Weight
Phase II: Compute BMI and
classify
Phase III: Make additional conclusions
knowing ALL Phase II conclusions
With the agenda mechanism in most
Business Rules Systems, each phase
can get a complete view of what
happened in the last phase,
meaning that negation, counting and
similar operations work as expected
(At the cost that we need to assign
rules to the right phases)
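A hedged sketch of the phase mechanism using Drools agenda groups (session and group names are illustrative; the rules themselves would be tagged with their group in the rule file):

```java
// Give each phase focus in turn so phase III only fires once
// all phase II conclusions exist.
import org.kie.api.KieServices;
import org.kie.api.runtime.KieSession;

public class PhasedRules {
    public static void main(String[] args) {
        KieSession session = KieServices.Factory.get()
                .getKieClasspathContainer().newKieSession("bmiSession");

        // Focus is a stack: the group pushed last fires first,
        // so we push the phases in reverse order.
        session.getAgenda().getAgendaGroup("phase3-conclusions").setFocus();
        session.getAgenda().getAgendaGroup("phase2-classify").setFocus();
        session.getAgenda().getAgendaGroup("phase1-extract").setFocus();
        session.fireAllRules();
        session.dispose();
    }
}
```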
What about SPIN?
SPIN is similar in expressiveness to production rules.
ex:Person
a rdfs:Class ;
rdfs:label "Person"^^xsd:string ;
rdfs:subClassOf owl:Thing ;
spin:rule
[ a sp:Construct ;
sp:text """
CONSTRUCT {
?this ex:grandParent ?grandParent .
}
WHERE {
?parent ex:child ?this .
?grandParent ex:child ?parent .
}"""
] .
This is like a production rule written in
reverse: we infer triples from the
CONSTRUCT clause based on matching
the WHERE clause.
TopBraid Composer implements most
inference through primitive forward
chaining (a fixed-point algorithm; RETE
cannot be used because the order of
rule firing is unpredictable).
Backwards chaining can be
accomplished through the definition of
“magic properties” (something similar
can be done with Drools too)
SPIN has support for query templates, in some ways like Decision
Tables but possibly more palatable for coders and for semantic apps
Control of execution order, negation, and non-monotonic
reasoning are not settled; less is known about how to implement them.
Linked Data
“Trough of Disillusionment”
The dream of linked data is that you can
easily “mash up” data from multiple sources
to answer questions.
If you want to get the right answers,
however, it is not so easy.
If you didn’t have a lot of experience in the corporate
world you might blame data publishers, RDF, and the
incentive structures around linked data for this,
however…
Corporate Data
… real life data in business is frequently bad;
80% of effort in data mining projects goes
into data preparation and cleaning.
tools
Business
analyst
ERP
POS
email
ERP
CRM
web
Factory
automation
CRM
Wiki
HR
Sharepoint
CMS
Custom
apps
Inventory
Social
CMS
SAAS Apps
A large business has multiple business units running a huge number
of applications written at different times by different people
Businesses grow by acquisition; to the extent that
customers and employees are aware of different IT
systems and their histories, customer service sucks,
employees underperform and costs are high
Businesses face the same problem as the Linked Data Community but
these problems happen behind closed doors and people are cursing
COBOL and SAP instead of RDF and SPARQL
While Linked Data was emerging, Enterprise IT developed
“Master Data Management” to enable a “Customer Centric” enterprise
Personal Account
Paul’s Business Account A
Paul’s Business Account B
Olivia’s Business Account
Child’s Account
Paul’s IRA
Olivia’s IRA
SEP IRA
Home Equity Line
Houseguest
Tenant A Personal
Tenant A Corporate
Tenant B
Traditional business systems are “account centric”, which is
enough to get by but not enough to thrive. To really serve me
well, my credit union needs a complete picture of the relationship
I have with it. (It took me a while to remember how many
accounts I have and I might have missed one)
Financial institutions are under legal pressure to “know your
customer” (KYC) and linking accounts that belong to a customer is
necessary to prevent monkey business
but I own shares in
this one!
My name is on this column of accounts,
but not the others
Dominant paradigm for master data management:
Objects are clustered based on a
distance metric; objects are
“blocked” beforehand to avoid the
O(N²) cost of computing distances
… this is effective in the case of matching different records
for the same customer, but is NOT effective in cases where
we have a ground truth and can know rather than guess …
Tyrol / Tirol
Two variants that differ by a letter
can be fuzzy matched, but it’s
hard to guess arbitrary things like:
AT-7 (ISO 3166-2)
AT33 (NUTS)
AU07 (FIPS 10-4)
蒂罗尔州 (Chinese)
… and why guess when you can just look them up in a quality
controlled database?
Conventional MDM focuses on resolving
customers (people or businesses); in some
cases it involves resolving products.
Generally the objects being matched are
“equal” to each other in ontological status,
such as two customer records.
Semantic MDM covers a wider range of
concepts and often imports large amounts of
knowledge from general databases or
involves alignment with industry ontologies.
In some cases we are discovering new
concepts and maintaining the ontology, but
more often we are matching surface forms to
underlying concepts.
Do we clean data before or after query time?
Weather station reports temperature in
centigrade, reports -999 upon error
32.1 34.6 36.3 -999 33.8
Let’s say we want to compute the average…
If we use the arithmetic mean, we get
-172.4° C. Outrageously wrong!
If we know this device reports -999 on error,
or that temperatures can never be less than
-273.15, we can reject the bad value and get
34.2° C
If we use the median instead of the mean,
the outlier is automatically ignored and we
get 33.8° C, close to the cleaned-up mean
In this case it’s reasonable to clean the data or use an algorithm that is
robust to outliers – they teach kids in elementary school that the median is robust,
but how many other robust algorithms are on the tip of everyone’s tongue?
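The weather-station example in code — raw mean, cleaned mean, and median:

```java
// The -999 sentinel wrecks the mean; filtering recovers it;
// the median shrugs the outlier off.
import java.util.Arrays;

public class RobustStats {
    public static void main(String[] args) {
        double[] temps = {32.1, 34.6, 36.3, -999, 33.8};

        double mean = Arrays.stream(temps).average().orElse(Double.NaN);
        double cleanMean = Arrays.stream(temps)
                .filter(t -> t > -273.15)            // physically possible only
                .average().orElse(Double.NaN);

        double[] sorted = temps.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];   // odd-length case

        System.out.printf("mean=%.2f cleanMean=%.2f median=%.1f%n",
                mean, cleanMean, median);            // -172.44, 34.20, 33.8
    }
}
```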
Ahead-of-time data preparation
TEST
CASES
Test failure blocks
further analysis
Queries business
analyst
Error reports thrown
“over the wall”
data quality team
Line drawn between data processing
and data use establishes test perimeter
and makes process scalable in human
terms
Fixing up at query time will drive you nuts
Scenario: Business Analyst writes
queries while talking to co-workers to
quickly build collective understanding.
Requirement: easy to write queries off
the cuff and get the right answer!
“joe@example.com”
<mailto:sb@example.com>
“3137q@example.com”
“lily@example.com”
It’s not hard to canonicalize two
variant forms of an e-mail address in
either a query or in processing the result
set
query complexity
effort
A real query might be querying tens
of values, some are used in conditions,
others end up in the results. If many
things are being joined (i.e. you’re using
SPARQL) the query will explode
exponentially in complexity.
Will you trust the answer?
Some kind of query rewriting
(like the implementation of OWL 2
QL) might help, but we still lack a
perimeter where we can test the
system and give it a clean bill of
health
Ordered collections are awkward in RDF
• Two ways to do it because neither one is satisfactory
RDF Containers
:Missions a rdf:Seq ;
rdf:_1 :Mercury ;
rdf:_2 :Gemini ;
rdf:_3 :Apollo .
This could generate huge numbers of predicates,
also nothing stops one from accidentally using a
numbered label more than once. The facts
comprising this list could be spread across a
system.
RDF Collections
:Missions a rdf:List ;
    rdf:first :Mercury ;
    rdf:rest _:n1 .
_:n1 rdf:first :Gemini ;
    rdf:rest _:n2 .
_:n2 rdf:first :Apollo ;
    rdf:rest rdf:nil .
Operations on a LISP-style list are slow because you
need to follow lots of pointers. The use of blank
nodes can protect Collections from modification
(important in the OWL spec.)
Neither construction is easy to query in (standard) SPARQL
Yet, some RDF syntaxes look almost the same
as JSON/XML
JSON
{
  "missions": [ "Mercury", "Gemini", "Apollo" ]
}
TURTLE
:Missions :members (:Mercury :Gemini :Apollo) .
Most RDF tools will expand this into a LISP-list
with blank nodes, but in TURTLE format the
physical layout is the same as JSON.
Collections and Containers are described as “non-
normative” in RDF 1.1; advanced tools may use
special efficient representations (like those used for
JSON).
It’s awkward to work with ordered collections in the
common “client-server” model that revolves around
SPARQL engines, but for small graphs in memory, the
situation is different – the Jena framework provides a
facility for accessing Collections that feels a lot like
accessing data in JSON
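A sketch of that Jena facility: the blank-node chain behind a Turtle list reads back almost like a JSON array (example URIs illustrative):

```java
// Read the (:Mercury :Gemini :Apollo) list through Jena's RDFList view.
import org.apache.jena.rdf.model.*;
import java.io.StringReader;

public class ReadList {
    public static void main(String[] args) {
        String ttl = String.join("\n",
            "@prefix : <http://example.com/> .",
            ":Missions :members (:Mercury :Gemini :Apollo) .");
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TURTLE");

        Resource missions = m.getResource("http://example.com/Missions");
        Property members = m.createProperty("http://example.com/members");
        RDFList list = missions.getPropertyResourceValue(members).as(RDFList.class);
        list.iterator().forEachRemaining(System.out::println);  // members, in order
    }
}
```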
Ordered collections are critical for dealing with external data
formats that support ordered collections AND critical for many
traditional RDF use cases such as metadata (you'll find
scientists are pretty sensitive to the order of authors on a
paper)
Another Bad Idea in Linked Data
DEREFERENCING
In principle a client could ask questions about individual items and
“follow its nose” to discover related information.
In practice, however, you miss data quality problems that are obvious
when you look at data holistically (e.g. 47 instead of 50 states).
If the data was clean ahead of time, and if we understood the structure
of data completely ahead of time, dereferencing might work.
Since Linked Data does not enforce quality standards, however,
dereferencing is one of those dangerous things that “almost works”.
John Martin T 34 $17.50 I first met…
Barry Robnson F 17 $12.76 Barry has…
Mary Capps T 104 $541.99 Sometimes …
Eric Kramer T 95 $214.22 Nobody who …
Matt Butts F 32 $6.54 I’ve never …
Imagine we find a CSV file without any specification as to format…
Most of these
match a list of
common first
names
Most of these
match a list of
common last
names
These look like
Boolean values
All of these are
integers
These look like
monetary
values
These fields
appear to
contain free text
In the last example, we were able to make some pretty good guesses by looking at the
data, not knowing anything about the names of the headers. This could go a long way
towards interpreting this file in an automated way.
Add knowledge about the problem domain and we’re cooking with gas…
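A toy sketch of such a profiler — the patterns and thresholds are arbitrary illustrations, not a real product's heuristics:

```java
// Guess a CSV column's type from the fraction of values matching
// simple patterns, in the spirit of the example above.
import java.util.List;
import java.util.regex.Pattern;

public class ColumnProfiler {
    static final Pattern BOOL  = Pattern.compile("T|F|true|false");
    static final Pattern INT   = Pattern.compile("-?\\d+");
    static final Pattern MONEY = Pattern.compile("\\$\\d+\\.\\d{2}");

    static String guessType(List<String> column) {
        if (fractionMatching(column, BOOL)  > 0.9) return "boolean";
        if (fractionMatching(column, INT)   > 0.9) return "integer";
        if (fractionMatching(column, MONEY) > 0.9) return "money";
        return "free text";  // a real profiler would also consult name lists
    }

    static double fractionMatching(List<String> column, Pattern p) {
        return column.stream().filter(v -> p.matcher(v).matches()).count()
                / (double) column.size();
    }

    public static void main(String[] args) {
        System.out.println(guessType(List.of("$17.50", "$12.76", "$541.99")));  // money
    }
}
```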
PROFILING
For best results, do analysis against ALL of the data!
Traditional Data Warehousing
POS sales data B
POS sales data C
POS sales data D
POS sales data A
Data from four different point-of-sale
systems used in different parts of a company
CANONICAL
DATA
MODEL
The good: analysts work with consistent, clean data
The bad: the burden of normalizing the data when it is
generated is felt acutely; in a worst case we could do this
work and never end up analyzing the data.
The ugly: Since the normalization was done before the
requirements for analysis were known, normalized data may
not satisfy requirements of analysts
Data Lake Enabled by Hadoop
Ingestion is simple
because we simply
copy raw data of any
kind to HDFS.
Development and
operations are not
burdened by ingestion
requirements
Data import is lossless.
Compute and data are
tightly coupled; we can
“full scan” the data quickly
at any time.
Data cleanup can be
performed to meet
requirements of specific
uses AND can be informed
by inspection of the
complete data set.
Analysis can be performed
on text and other kinds of
data which cannot be
normalized conventionally.
We can square this circle…
Data Lake
operations
raw data
Not perfect, but not damaged by
import process! project
Data preparation is driven by requirements;
no wasted time and no compromises
Queries
Predictive
Analytics
Machine
Learning
Other
projects
Ontologies, taxonomies, and logic programming mean
an increasing amount of work can be shared between
projects
Data Lake
Putting Knowledge To Work
(UNIT CONVERSION ONCE AGAIN)
EnglishTemp(location,amount) -> INSERT(MetricTemp(location,(5/9)*(amount-32)))
Conversion of a unit
represented by a predicate is
one simple rule that could be
written by hand
Input data specification
Output data specification
Analysis of input and output schema reveals need for unit
conversion; system gets conversion rule out of world
knowledge library and specializes it
World Knowledge Libraries
General, Industry-Specific, Company-Specific
Code generation
Intelligent Data Preparation
Data Lake
Documentation
Machine readable schemas
describes
Scalable/Parallel
profiler
transformer
consumers
Ontologies Requirements
Knowledge base about instances (ex. Places) and common
patterns in data expression (ex. Date formats)
broad spectrum
vertical specific
company specific
application specific
Iterative Development
Process Generates and
tests hypotheses
Prefix casting versus as-casting in c#
 
Paul houle resume
Paul houle resumePaul houle resume
Paul houle resume
 
Keeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacksKeeping track of state in asynchronous callbacks
Keeping track of state in asynchronous callbacks
 
Embrace dynamic PHP
Embrace dynamic PHPEmbrace dynamic PHP
Embrace dynamic PHP
 
Once asynchronous, always asynchronous
Once asynchronous, always asynchronousOnce asynchronous, always asynchronous
Once asynchronous, always asynchronous
 
What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?What do you do when you’ve caught an exception?
What do you do when you’ve caught an exception?
 
Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#Extension methods, nulls, namespaces and precedence in c#
Extension methods, nulls, namespaces and precedence in c#
 
Pro align snap 2
Pro align snap 2Pro align snap 2
Pro align snap 2
 

Recently uploaded

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Recently uploaded (20)

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Making the semantic web work

  • 1. Making the Semantic Web Work Reasoning beyond OWL
  • 2. What is semantics? Although animals do not use language, they are capable of many of the same kinds of cognition as us; much of our experience is at a non-verbal level. Semantics is the bridge between surface forms used in language and what we do and experience. Language understanding depends on world knowledge (i.e. “the pig is in the pen” vs. “the ink is in the pen”)
  • 3. Machine to Machine Communication message exchange Underlying the systems are different databases; the ability to “get something done” is like a non-verbalized ability, but to work with other systems we need to formulate messages in an artificial language. Understanding human language is a big problem. What chunk can we break off that will be useful and can be done today? Key insight: The semantic problem of communications between business IT systems aren’t that different from the semantic problem of communication between animals
  • 4. Natural Language to support M2M Internal database Industry standard message format Machine-readable and human readable specfications Capture critical knowledge in graph database; perhaps 80% of process can be automated, but human effort is part of a structured process that clearly links specification to implementation Captured specifications are used to compile data transformation rules. Graph model is used as “universal solvent”
  • 5. More generally… requirements regulationspolicies Programs that implement behaviors We might not be ready for executives to specify policies themselves, but we can make the process from specification to behavior more automated, linked to precise vocabulary, and more traceable. Advances such as SVBR and an English serialization for ISO Common Logic means that executives and line workers can understand why the system does certain things, or verify that policies and regulations are implemented Logged Decision Process Focusing on the execution of tasks is the road to real semantics; anything that does a useful job solves the “grounding problem;” Children can’t learn language by watching television, only by talking with others.
  • 6. Making Expressive Reasoning Scalable Scalable fabric BACKGROUND KNOWLEDGE RULES MODELS ALGORITHMS HEURISTICS Scalable system merges data from siloed sources; constructs graph(s) of facts relevant to specific records and entities profiler VOCABULARY MANAGEMENT VERSION CONTROL EXCEPTION HANDLING BUSINESS RULES MANAGEMENT CASE MANAGEMENT CONCEPT MATCHING BEHAVIOR TRACEABLE TO REQUIREMENTS MULTILINGUAL SUPPORT ENRICHED LINKED DATAScalable profiler lets system discover “ground truth” about data to inform generated rules and behaviors
  • 7. People are looking for better tools Unconstructive Criticism of the Semantic Web is Common Blanket dismissals displace real thinking, particularly a “gap analysis” as to what is missing. Yet, certain unworkable standards (OWL) have also displaced real progress.
  • 8. History of RDF is about evolution good stuff survives, bad ideas (slowly) fade away RDF/XML RDFS OWL SPARQL SPIN Linked Data ISO Common Logic Turtle Early work built on XML, had natural representations for ordered collections but was pedagogically awful (where are the triples?) N-Triples Turtle is a human friendly format but isn’t scalable to billions of triples Competition for schema/inference lanuages left a two winners A full-featured query language changed everything: but ordered collections go “under the bus” New inference and transformation languages emerge In the Linked Data era we can handle billions of triples, but collections and blank nodes become awkward In the long-term we’ll see highly expressive languages forward compatible with RDF RDF* RDF* and SPARQL* let us make statements about statements and query them; this increases expressive and can be used for data management
  • 9. We can be optimistic because… multiple communities have been working on similar things in parallel: the semantic web (RDF / SPARQL); diagramming and representation of data structures, processes, systems, and models; Common Logic and message vocabularies; the SUMO upper ontology; commercial Master Data Management products that accurately match entities; vocabularies and message formats for business. When you look at the pieces of the puzzle developed by communities that don’t really talk to each other, you see that the “state of the art” is better than it appears…
  • 10. Common data models • Relational data model • Fundamentally tabular, like a CSV file • Object-relational model • A column can contain rows • This is like XML or JSON • Graph Model • Highly general • Hypergraphs • “Property Graphs” and RDF* These models are compatible in that you can represent a graph with relational tables, break up an XML record into multiple relational tables, or even embed a hypergraph inside a graph, but there are big differences in efficiency when you need a certain set of facts in one place.
  • 11. Predicate Calculus RDF is a special case of the “predicate calculus” Statement of arity 2 Predicate Calculus: A(:Dog,:Fido) RDF: :Fido A :Dog . Statement of arity 3 Predicate Calculus: :Population(:Nigeria,2013,173.6e6) RDF: [ a :Population ; :where :Nigeria ; :when 2013 ; :amount 173.6e6 ] . It’s not too hard to write this in Turtle. This implementation, however, is structurally unstable, since we went from one triple to four triples
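To make the "one triple became four" point concrete, here is a minimal sketch with Apache Jena (the :Population vocabulary and the example.org namespace are hypothetical stand-ins from the slide) that builds the arity-3 statement as a blank node carrying one property arc per argument plus an rdf:type arc:

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;

public class NaryDemo {
    public static void main(String[] args) {
        String EX = "http://example.org/vocab#";
        Model m = ModelFactory.createDefaultModel();
        // The arity-3 fact Population(Nigeria, 2013, 173.6e6) becomes an
        // anonymous node with one triple per argument plus one for the type.
        Resource stmt = m.createResource();   // blank node
        stmt.addProperty(RDF.type, m.createResource(EX + "Population"));
        stmt.addProperty(m.createProperty(EX, "where"), m.createResource(EX + "Nigeria"));
        stmt.addLiteral(m.createProperty(EX, "when"), 2013);
        stmt.addLiteral(m.createProperty(EX, "amount"), 173.6e6);
        m.write(System.out, "TURTLE");        // four triples, one logical fact
    }
}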
  • 12. How to think about RDF • The basic element of RDF is the Node • This borrows heavily from XML in that • Terms come out of a URL-based namespace so we can throw everything in a big pot • We get the basic types from XML schema • Plus we can even use XML literals • A triple is just a tuple with (i) three nodes, and (ii) set semantics • Higher-arity predicates are tuples with >3 nodes • SPARQL result sets and intermediate results are tuples of Nodes • Official serialization formats exist for SPARQL result sets ISO Common Logic is the obvious upgrade path, since it uses the same data types as RDF and can handle RDF triples, as well as higher-order predicates and intuitively obvious inference.
  • 13. ISO Common Logic Next step in evolution • Uses RDF Node as basic data type with all benefits thereof • RDF triples are just arity 2 predicates and can be used directly • First order logic operators supported; typed logic allows some “beyond first order logic” capabilities • OWL and RDFS can be implemented as a theory in FOL • Builds on the KIF Knowledge Interchange Format • Foundation for additional developments • Controlled English Format for Common Logic Statements • Modal logics: SVBR • Interchange language for knowledge-based systems of all kinds
  • 14. The Old RDF: Expressive but not scalable Early RDF: RDF/XML serialization, heavy use of blank nodes, extreme expressiveness: [ a sp:Select ; sp:resultVariables (_:b2) ; sp:where ([ sp:object rdfs:Class ; sp:predicate rdf:type ; sp:subject _:b1 ] [ a sp:SubQuery ; sp:query [ a sp:Select ; sp:resultVariables (_:b2) ; sp:where ([ sp:object _:b2 ; sp:predicate rdfs:label ; sp:subject _:b1 ]) ] ]) ] This is a representation of a SPARQL query in RDF! This example uses Turtle, where square brackets create blank nodes and parentheses create lists. With this graph in the JENA framework you can easily manipulate this as an abstract syntax tree. Very complex relationships, such as mathematical equations, can be built this way; blank nodes can be used to write high-arity predicates. Accessing it through SPARQL would not be so easy!
  • 15. Linked Data: New Focus Linked data source Blank nodes are discouraged because it’s hard for a distributed community to talk about something without a name. [ a sp:Select ; sp:resultVariables (_:b2) ; sp:where ([ sp:object rdfs:Class ; sp:predicate rdf:type ; sp:subject _:b1 ] [ a sp:SubQuery ; sp:query [ a sp:Select ; sp:resultVariables (_:b2) ; sp:where ([ sp:object _:b2 ; sp:predicate rdfs:label ; sp:subject _:b1 ]) ] ]) ] Turtle and RDF/XML (which have sweet syntax for blank nodes) are not scalable because the parser cannot be restarted after a failure: if you have billions of triples, a few will be bad <http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show"^^<http://www.w3.org/2001/XMLSchema#string> . <http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show" . <http://example.org/show/218> <http://example.org/show/localName> "That Seventies Show"@en . <http://example.org/show/218> <http://example.org/show/localName> "Cette Série des Années Septante"@fr-be . <http://example.org/#spiderman> <http://example.org/text> "This is a multi-line\nliteral with many quotes (\"\"\"\"\")\nand two apostrophes ('')." . <http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/atomicNumber> "2"^^<http://www.w3.org/2001/XMLSchema#integer> . <http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/specificGravity> "1.663E-4"^^<http://www.w3.org/2001/XMLSchema#double> . N-Triples is practical for large databases such as Freebase and DBpedia because records are isolated, but blank nodes must be named and triple-centric modelling is encouraged We now have a great query language, SPARQL. SPARQL supports the same shorthand for blank nodes as Turtle. Some blank node patterns work naturally, but it is particularly hard to ask questions about ordered collections. Blank nodes, collections, etc. are out of fashion.
  • 16. Old Approaches To Reification: Named Graphs :graph :subject :predicate :object . Adding an extra node to a triple is simple, practical and useful for many purposes. For instance, I could take in triple data from various sources and keep them apart by putting them in different graphs. The trouble is that this is a one trick pony: I can’t take collections of named graphs from different sources and keep them apart using named graphs For practical logic we need to be able to qualify statements to manage: • Provenance • Access Controls • Metadata • Modal relationships • Time
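A minimal Jena sketch of the trick: each source's triples go into their own named graph in a Dataset, and a SPARQL query can report which graph a fact came from (the graph URIs here are hypothetical):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class NamedGraphDemo {
    public static void main(String[] args) {
        Dataset ds = DatasetFactory.create();
        String EX = "http://example.org/";
        // Keep triples from two sources apart by graph name.
        for (String source : new String[]{"wikipedia", "crawler"}) {
            Model m = ModelFactory.createDefaultModel();
            m.add(m.createResource(EX + "Tolkien"),
                  m.createProperty(EX, "wrote"),
                  m.createResource(EX + "LordOfTheRings"));
            ds.addNamedModel(EX + "graphs/" + source, m);
        }
        // GRAPH ?g exposes the provenance of each matched triple.
        String q = "SELECT ?g WHERE { GRAPH ?g { ?s <" + EX + "wrote> ?o } }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}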
  • 17. Old Approaches to Reification: Reification with Blank Nodes [ rdf:type rdf:Statement ; rdf:subject :Tolkien ; rdf:predicate :wrote ; rdf:object :LordOfTheRings ; :said :Wikipedia ] . http://stackoverflow.com/questions/1312741/simple-example-of-reification-in-rdf This isn’t too hard to write in Turtle, but it breaks SPARQL queries and inference for reified triples. The number of triples is at the very least tripled; the triple store is unlikely to be able to optimize for common use cases.
  • 18. a new standard that unifies RDF with the property graph model RDF*/SPARQL* (Reification Done Right) Turtle facts: :bob foaf:name "Bob" . <<:bob foaf:age 23>> dct:creator <http://example.com/crawlers#c1> ; dct:source <http://example.net/homepage-listing.html> . Sparql query: SELECT ?age ?src WHERE { ?bob foaf:name "Bob" . <<?bob foaf:age ?age>> dct:source ?src . } This is huge! So far products based on property graphs have been ad-hoc, without a formal model. SPARQL* brings rich queries to the property graph model, and the reverse mapping means RDF* can be processed with traversal-based languages like Gremlin.
  • 19. Roles of Schemas • Documentation • Integrity Preservation • Efficiency • Inference
  • 20. Schemas as Documentation Humans write code to insert data and write queries: schemas tell us how the data is organized. Automated systems can also use schemas to drive code generation (consider object-relational mapping)
  • 21. Schemas can preserve integrity SQL: create table customer ( id integer primary key, username varchar(16) unique key not null, email varchar(64) not null ) SQL prevents attempts to insert records with non-existing fields or lacking required fields. SQL can enforce key integrity and other constraints. You can (often) code algorithms and take it for granted that data structures satisfy invariants required for those algorithms to work. RDF: RDFS and OWL, implemented with the standard semantics, do not validate data. Practically, RDF users will use types and properties across a wide range of standard and proprietary namespaces, and it can be hard to keep track of them all. For instance, rdfs:label is defined in RDFS, despite the fact that you can have labels without schemas. Terms that are the bread-and-butter of RDFS, such as rdf:type and rdf:Property, are defined in the RDF specification. It’s an easy mistake to get the “s” wrong in either writing data or queries, and if you do, you run a query, get zero results, and could easily chase your tail looking for other causes. You can (and should) define an alternate semantics for RDFS and OWL, which rejects types and properties that are not listed in either data or queries, but this is nonstandard. These issues are addressed in the “RDF Data Shapes” effort (SHACL), to be completed in 2017.
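For a taste of where Data Shapes landed, here is a hedged sketch using the SHACL module that recent Apache Jena releases ship; the shape and data are hypothetical, and the validator flags the customer record that is missing its required email, playing the role SQL's NOT NULL plays above:

import java.io.StringReader;
import org.apache.jena.rdf.model.*;
import org.apache.jena.shacl.*;
import org.apache.jena.shacl.lib.ShLib;

public class ShapeDemo {
    static Model parse(String ttl) {
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TTL");
        return m;
    }
    public static void main(String[] args) {
        String prefixes = "@prefix sh: <http://www.w3.org/ns/shacl#> . "
                        + "@prefix ex: <http://example.org/> . ";
        // A shape requiring every ex:Customer to have at least one ex:email.
        Model shapes = parse(prefixes
            + "ex:CustomerShape a sh:NodeShape ; sh:targetClass ex:Customer ; "
            + "  sh:property [ sh:path ex:email ; sh:minCount 1 ] .");
        Model data = parse(prefixes + "ex:joe a ex:Customer .");  // no email!
        ValidationReport report = ShaclValidator.get()
            .validate(Shapes.parse(shapes.getGraph()), data.getGraph());
        ShLib.printReport(report);   // reports the sh:minCount violation
        System.out.println("conforms: " + report.conforms());
    }
}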
  • 22. Schemas can promote efficiency JSON (and XML) Structural information is repeated in each record { "red": 201, "green": 99, "blue": 82, "alpha": 115 } (85 bytes) C typedef unsigned char byte; struct color { byte red; byte green; byte blue; byte alpha; }; Defines meaning of 201 99 82 115 (4 bytes) 20x compression! In numerical work, it often takes longer to convert a million numbers from ASCII to float than you spend working on the floats. The speed of text parsing is a limiting factor in electronic trading systems and many other applications. GZIP compression of repetitive data helps, but you get a smaller file if you apply GZIP to binary data. You pay a CPU price for data compression plus a large price in string parsing. Textual data formats have been fashionable in the Internet Age because it is easy to get string parsing code to “almost work;” one of the reasons we are hearing about security breaches every day is that it’s extremely difficult to write correct string parsing code. RDF standards do not address binary serialization; however, Vital AI can create binary formats based on OWL schemas
  • 23. Schemas and Inference: the unique value of RDF! Look at 4 RDF vocabularies and find 4 ways to write an e-mail address. A-BOX: :Joe myvocab:emailAddress "joe@example.com" . dbpedia:Some_Body db:eMailAddress <mailto:sb@example.com> . basekb:m.3137q basekb:organization.email_contact.email_address "3137q@example.com" . :Lily schemaOrg:email "lily@example.com" . T-BOX: myvocab:emailAddress rdfs:subPropertyOf foaf:email . db:eMailAddress rdfs:subPropertyOf foaf:email . basekb:organization.email_contact.email_address rdfs:subPropertyOf foaf:email . schemaOrg:email rdfs:subPropertyOf foaf:email . Inferred facts: :Joe foaf:email "joe@example.com" . dbpedia:Some_Body foaf:email <mailto:sb@example.com> . basekb:m.3137q foaf:email "3137q@example.com" . :Lily foaf:email "lily@example.com" .
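Below is a minimal Jena sketch of exactly this T-BOX at work: one rdfs:subPropertyOf triple in a schema model lets a plain RDFS reasoner surface a proprietary property as foaf:email (the myvocab: namespace is a hypothetical stand-in; foaf:email follows the slide's usage):

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDFS;

public class SubPropertyDemo {
    public static void main(String[] args) {
        String EX = "http://example.org/myvocab#";
        String FOAF = "http://xmlns.com/foaf/0.1/";
        Model schema = ModelFactory.createDefaultModel();
        Model data = ModelFactory.createDefaultModel();
        Property emailAddress = schema.createProperty(EX, "emailAddress");
        Property foafEmail = schema.createProperty(FOAF, "email");
        // T-BOX: myvocab:emailAddress rdfs:subPropertyOf foaf:email .
        schema.add(emailAddress, RDFS.subPropertyOf, foafEmail);
        // A-BOX: a fact stated with the proprietary property.
        Resource joe = data.createResource(EX + "Joe");
        data.add(joe, emailAddress, "joe@example.com");
        // RDFS inference makes the foaf:email triple queryable.
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        System.out.println(inf.contains(joe, foafEmail, "joe@example.com")); // true
    }
}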
  • 24. It looks like an answer for data integration, but… :Joe foaf:email "joe@example.com" . dbpedia:Some_Body foaf:email <mailto:sb@example.com> . basekb:m.3137q foaf:email "3137q@example.com" . :Lily foaf:email "lily@example.com" . There are two reasonable ways to write an email address: as a string or as a URI foaf:email rdfs:domain owl:Thing . According to the foaf spec, only the URI is correct since “In OWL DL literals are disjoint from owl:Thing” (at least if we are using OWL DL…) Any ETL tool has an ability to apply a function to data (it’s not hard at all to write code to translate a string to a mailto: URI) RDFS and OWL, however, can’t do simple format conversion. For instance, it is reasonable for people to specify temperatures in Fahrenheit or Centigrade or Kelvin, but OWL inference can’t “multiply by something and add”; even though it can state that properties “mean the same thing”, it can’t specify simple transformations. Something like OWL may be necessary for data integration, but OWL is not sufficient.
  • 25. Other things OWL Can’t Do • We can’t reject data • Reject things that we don’t agree with • Reject things we don’t need; let’s use Freebase to seed… • A directory of ski areas • The spatial hierarchy of Africa • A biomedical ontology We don’t want to pay to store stuff we don’t need, or wait for it to be processed, do quality control on it, or deal with any problems it might create
  • 26. OWL is unintuitive Here’s an excerpt from the FIBO (Finance) ontology: Organization: A social unit of people, systematically structured and managed to meet a need or pursue collective goals on a continuing basis. Autonomous Agent: An agent is an autonomous individual that can adapt to and interact with its environment. Property Restriction 1: Set of things that must have property "has member" at least 2 taken from "autonomous agent" Property Restriction 2: Set of things that may have property "has part" taken from "organization" Property Restriction 3: Set of things that must have property "has" at least 1 taken from "goal" Property Restriction 4: Set of things that may have property "has" taken from "postal address" (Diagram legend: an arrow means "has parent.") How do you explain this to your boss? To the programmer that just joined the team? What kind of inference does this entail? (I think two people with a goal are an organization, but is there a real difference between a DBA filing for a person who is self-employed and one that has an additional employee?)
  • 27. It’s not always obvious how to do things in OWL You can’t say “The United States Has 50 States” But you can say “Anything that has 50 states is the United States” You can get close to what you want to say by “The United States is a member of an anonymous class that contains anything with 50 states.” You can get some entailments from that, but nothing happens if only 47 states are on the list (it’s an open world, we just don’t know about…) Thus: It’s not obvious what exactly can be specified in OWL. If you talk to an expert, you’ll find that he can do a lot of things you might think aren’t possible.
  • 28. Production Rules and First-Order Logic Many 1970s “expert systems” were driven by production rules; these are now widespread in “Business Rules Engines”. Condition -> Action Common data transformations can be easily written with production rules: Weight(person,weight) and Height(person,height) -> BodyMassIndex(person,weight/height^2) BodyMassIndex(person,bmi) and bmi<18.5 -> Underweight(person) BodyMassIndex(person,bmi) and 18.5<=bmi<25 -> NormalWeight(person) BodyMassIndex(person,bmi) and 25<=bmi<30 -> Overweight(person) BodyMassIndex(person,bmi) and 30<=bmi -> Obese(person) You can’t do the arithmetic in the first rule in OWL. You just can’t. (The range classifications are a different story: you could easily miss it reading the documentation, but it’s possible to state them in OWL by using XML Schema constraints on data types.)
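For a runnable flavor of the BMI rules, here is a sketch in the rule language of Jena's general-purpose rule engine (slide 32 notes that Jena exposes exactly this machinery); the ex: vocabulary is hypothetical, and the arithmetic is done by the engine's built-ins:

import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class BmiRules {
    public static void main(String[] args) {
        String EX = "http://example.org/";
        // Rule 1 computes the BMI; rule 2 classifies it.
        String rules =
            "[bmi:   (?p <" + EX + "weightKg> ?w) (?p <" + EX + "heightM> ?h) " +
            "        product(?h, ?h, ?h2) quotient(?w, ?h2, ?b) " +
            "        -> (?p <" + EX + "bmi> ?b)] " +
            "[under: (?p <" + EX + "bmi> ?b) lessThan(?b, 18.5) " +
            "        -> (?p <" + EX + "category> <" + EX + "Underweight>)]";
        Model data = ModelFactory.createDefaultModel();
        Resource joe = data.createResource(EX + "Joe");
        joe.addLiteral(data.createProperty(EX, "weightKg"), 62.0);
        joe.addLiteral(data.createProperty(EX, "heightM"), 1.88);
        GenericRuleReasoner reasoner = new GenericRuleReasoner(Rule.parseRules(rules));
        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        // 62 / 1.88^2 is about 17.5, so Joe lands in ex:Underweight.
        System.out.println(inf.getProperty(joe, data.createProperty(EX, "category")));
    }
}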
  • 29. Production Rules vs imperative programming languages The BMI example could easily be written in (say) Java… BUT you have to get the steps in the right order; this is trivial to do in a simple case, but it gets increasingly harder as complexity goes up. This is one of the reasons why programming is a specialized skill. Production Rules constrain the conditions so the engine can quickly determine which rules are fired when the state changes… … but the actions are written in a conventional programming language like LISP or Java, so we can use a full spectrum of programming techniques and a lot of existing code. Note: rules engines have advanced greatly since the “golden age of AI”, and now 100,000+ rules and 10 million+ facts are practical.
  • 30. Production Rules in the Wider Picture Drools Expert: Execution of production rules Drools Fusion: Complex event processing jBPM: Business Process Management; coordination of asynchronous human and automated behaviors – controlled by rules Optaplanner: Multi-objective combinatorial optimization for tasks such as scheduling, vehicle routing, box packing – controlled by rules This is the JBOSS stack; products such as Blaze Advisor and iLog do all this and more. The use of production rules to control business processes, particularly in scenarios involving complex workflows and multiple interacting requirements, is well established. This is an emerging research topic in the semweb community, but in the business rules world this is a mature technology
  • 31. “Impedance Mismatch” between Business Rules and RDF is minimal Most Java Rules Engines (like JESS and Drools) can reason about ordinary Java objects RDF data can be converted to specialized predicate objects for performance or convenience, but it is entirely possible to insert objects from the Jena framework such as Nodes, Triples and Models directly into a rules engine.
  • 32. OWL and RDFS implementations often use production rules OWL 2 RL dialect Forward chaining The semantics of RDFS and most of OWL can be implemented with production rules; RETE and post-RETE algorithms can evaluate these efficiently. Popular reasoners such as Jena and OWLIM often use a box of production rules to implement RDFS and OWL and expose this functionality so you can implement custom inference. OWL 2 QL dialect Backwards chaining RDFS, and another major subset of OWL, can be implemented by rewriting SPARQL queries. Since SPARQL is based on relational algebra, the whole bag of tricks used to optimize relational database queries can be used to efficiently answer queries.
  • 33. OWL dialects have “computer science” advantages (i.e., algorithms exist to answer queries in bounded time, with scaling that looks good on paper) More expressive logics that are undecidable sound scary… However, many things about conventional programming languages are undecidable. For instance, you can’t solve the halting problem for conventional programming languages, yet that doesn’t frighten most people into using languages that lack recursion and unbounded loops. Algorithms to exactly solve common optimization problems (travelling salesman problem, etc.) are computationally intractable, but approximate algorithms are fine for the real world. (Evaluation of production rules is not decidable in finite time since it is possible to create an “infinite loop”)
  • 34. Logical Theorem Proving ex. VAMPIRE If we constrain the action fields of rules a bit, we can prove theorems, a highly flexible form of reasoning. There are other ways to do it, but one effective method is the saturation solver. (Diagram: the axioms, plus the logical negation of the statement S to prove, feed the solver, which derives conclusions.) If S is true, then not S is false. Eventually the solver will find a contradiction and produce the conclusion false. Since you can derive an infinite number of conclusions from most theories, this process is not guaranteed to finish. A lucky or clever algorithm could reach false with a short chain. State of the art reasoners use multiple search strategies that work well in many real-life cases.
  • 35. Real-life OWL and RDFS performance doesn’t satisfy RDFS inference, done according to the book, generates a vast number of trivial and uninteresting conclusions; practical reasoners usually don’t implement the complete standard
  • 36. Requirements for Practical logic One long-term goal for logic is “capture 100% of critical knowledge in business documents” It might sound like science fiction, but if we hire a team of programmers to implement a policy or to make a system that complies with regulation and requirements, it is the goal. Can we (i) reduce team size, (ii) speed up the project, and (iii) be able to show the rules being enforced to management in a way they can understand? Plain first-order logic does not cover all the bases. We need: • Modal logic (CAN, SHOULD, MUST, IT WAS TRUE THAT, HARRY BELIEVES THAT) • Temporal logic (things change at different times) • Default and Defeasible logic • Higher-order logic (for all statements / there exists a statement) These logics are not as mature as FOL, but we can often use tricks to simulate them
  • 37. Modal logic Key for Law, Contracts, Requirements, … A modal operator qualifies a statement: MUST(S) -> S is necessarily true in any situation USUALLY(S) -> S is usually true PERMISSIBLE(S) -> It is permissible that S is true BELIEVES(person,S) -> the specified person believes S is true PREVIOUSLY(S) -> S was true in the past Some modal logic problems can be addressed by rewriting the problem, for instance if S(x,y) is a simple predicate we could define a predicate like BELIEVES_S(person,x,y) We can’t express arbitrary statements this way, but we may be able to express all the ones that we’ll really use. Systems like SUMO use tricks like this to punch above their weight
  • 38. Temporal Logic Change is the one thing that is constant. The population of Las Vegas was 25 in 1900 and 583,736 in 2010. Since laws change over time, to know if a set of actions was illegal, we need to know when the actions were taken and what the law was at the time, and answer questions like “What did the President know and when did he know it?” The Allen Algebra Time intervals are closer to reality than points in time; with time intervals we can specify that a meeting starts at 6:00 pm on a certain day and goes on for 1 hour. We could ask if this overlaps with the interval of another meeting to know if I need to choose between one meeting and the other. The Allen Algebra doesn’t cover all temporal reasoning cases, but it works well with production rule systems and is widely used in complex event processing. A complete theory is not fully developed, but some pretty good tools are available
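A few of Allen's thirteen interval relations are easy to state in plain code; this sketch (times as minutes-since-midnight, a representation chosen here for brevity) checks whether a 6:00-7:00 pm meeting conflicts with a 6:30-7:30 pm one:

public class AllenDemo {
    record Interval(long start, long end) {            // assumes start < end
        boolean before(Interval o)   { return end < o.start; }   // X ends, gap, Y starts
        boolean meets(Interval o)    { return end == o.start; }  // back to back
        boolean overlaps(Interval o) { return start < o.start && o.start < end && end < o.end; }
        boolean during(Interval o)   { return o.start < start && end < o.end; }
        boolean conflictsWith(Interval o) {            // any shared time at all
            return start < o.end && o.start < end;
        }
    }
    public static void main(String[] args) {
        Interval mine  = new Interval(18 * 60, 19 * 60);            // 6:00 pm - 7:00 pm
        Interval other = new Interval(18 * 60 + 30, 19 * 60 + 30);  // 6:30 pm - 7:30 pm
        System.out.println(mine.overlaps(other));      // true: I must choose
        System.out.println(mine.conflictsWith(other)); // true
    }
}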
  • 39. Default and Defeasible Reasoning The following logical chain leads to a bad result: Flies(Bird) A(Penguin,Bird) Flies(Penguin) Exceptions are widespread in real life: “A year divisible by 4 is a leap year, unless the year is divisible by 100; however, if the year is divisible by 400 it IS a leap year” “An amateur radio operator may not transmit music unless they are retransmitting a signal from the International Space Station.” We could write Any(x): A(x,Bird) and NOT(A(x,Penguin)) -> Flies(x) But this gets hard to maintain when we find out about ostriches, domestic ducks, etc. It would be worse yet to maintain a list of flying birds. Default logic adds features that let us express defaults. Defeasible logic allows us to retract a conclusion if we find contrary evidence later
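The leap-year rule is a nice miniature of default-plus-exceptions reasoning: checking the most specific exception first is exactly the priority ordering that a default or defeasible logic makes explicit. A plain-Java sketch:

public class LeapYear {
    // Default: divisible by 4 is a leap year.
    // Exception: divisible by 100 is not.
    // Exception to the exception: divisible by 400 is.
    static boolean isLeapYear(int year) {
        if (year % 400 == 0) return true;   // most specific rule wins
        if (year % 100 == 0) return false;
        return year % 4 == 0;
    }
    public static void main(String[] args) {
        System.out.println(isLeapYear(2000)); // true  (divisible by 400)
        System.out.println(isLeapYear(1900)); // false (divisible by 100 only)
        System.out.println(isLeapYear(2016)); // true  (divisible by 4)
    }
}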
  • 40. Logical Negation ALL APPROACHES ARE SOMEWHAT PROBLEMATIC There are many ways to implement logical negation, but there is no universal answer to the problem. For instance, suppose we add NOT(Underweight(person)) -> WellFed(person) to the rules we’ve been working on. If this rule is activated before we have (i) gotten height and weight information, (ii) computed the BMI, and (iii) classified this person, it will fire improperly. This might not be a problem if it has no real-world consequences and is retracted when it becomes false, but it’s not the behavior we want.
  • 41. Logic Programming Practical Concessions Phase I: Extract Information About Height and Weight Phase II: Compute BMI and classify Phase III: Make additional conclusions knowing ALL Phase II conclusions With the agenda mechanism in most Business Rules Systems, each phase can get a complete view of what happened in the last phase, meaning that negation, counting and similar operations work as expected (At the cost that we need to assign rules to the right phases)
  • 42. What about SPIN? SPIN is similar in expressiveness to production rules. ex:Person a rdfs:Class ; rdfs:label "Person"^^xsd:string ; rdfs:subClassOf owl:Thing ; spin:rule [ a sp:Construct ; sp:text """ CONSTRUCT { ?this ex:grandParent ?grandParent . } WHERE { ?parent ex:child ?this . ?grandParent ex:child ?parent . }""" ] . This is like a production rule written in reverse: we infer the triples in the CONSTRUCT clause when the WHERE clause matches. TopBraid Composer implements most inference through primitive forward chaining (a fixed-point algorithm; RETE cannot be used because the order of rule firing is unpredictable.) Backwards chaining can be accomplished through the definition of “magic properties” (something similar can be done with Drools too) SPIN has support for query templates, in some ways like Decision Tables but possibly more palatable for coders and for semantic apps Control of execution order, negation, and non-monotonic reasoning are not settled; less is known about how to implement them
  • 43. Linked Data “Trough of Disillusionment” The dream of linked data is that you can easily “mash up” data from multiple sources to answer questions. If you want to get the right answers, however, it is not so easy. If you didn’t have a lot of experience in the corporate world you might blame data publishers, RDF, and the incentive structures around linked data for this, however…
  • 44. Corporate Data … real-life data in business is frequently bad; 80% of effort in data mining projects goes into data preparation and cleaning. (Diagram: a business analyst’s tools sit atop a sprawl of systems: ERP, POS, email, CRM, web, factory automation, wiki, HR, SharePoint, CMS, custom apps, inventory, social, SaaS.) A large business has multiple business units running a huge number of applications written at different times by different people Businesses grow by acquisition; to the extent that customers and employees are aware of different IT systems and their histories, customer service sucks, employees underperform and costs are high Businesses face the same problems as the Linked Data Community, but these problems happen behind closed doors and people are cursing COBOL and SAP instead of RDF and SPARQL
  • 45. While Linked Data was emerging, Enterprise IT developed “Master Data Management” to enable a “Customer Centric” enterprise (Diagram: one customer’s web of accounts: a personal account, Paul’s business accounts A and B, Olivia’s business account, a child’s account, Paul’s and Olivia’s IRAs, a SEP IRA, a home equity line, and houseguest/tenant accounts. My name is on one column of accounts but not the others, but I own shares in some of those too!) Traditional business systems are “account centric”, which is enough to get by but not enough to thrive. To really serve me well, my credit union needs a complete picture of the relationship I have with it. (It took me a while to remember how many accounts I have and I might have missed one) Financial institutions are under legal pressure to “know your customer” (KYC) and linking accounts that belong to a customer is necessary to prevent monkey business
  • 46. Dominant paradigm for master data management: Objects are clustered based on a distance metric; objects are “blocked” beforehand to avoid the N² cost of computing distances … this is effective in the case of matching different records for the same customer, but is NOT effective in cases where we have a ground truth and can know rather than guess … Tyrol / Tirol: two variants that differ by a letter can be fuzzy matched, but it’s hard to guess arbitrary things like AT-7 (ISO 3166-2), AT33 (NUTS), AU07 (FIPS 10-4), or 蒂罗尔州 (Chinese) … and why guess when you can just look them up in a quality-controlled database? Conventional MDM focuses on resolving customers (people or businesses); in some cases it involves resolving products. Generally the objects being matched are “equal” to each other in ontological status, such as two customer records. Semantic MDM covers a wider range of concepts and often imports large amounts of knowledge from general databases or involves alignment with industry ontologies. In some cases we are discovering new concepts and maintaining the ontology, but more often we are matching surface forms to underlying concepts.
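A sketch of why “Tyrol”/“Tirol” is the easy case: a one-letter difference gives edit distance 1, so a distance metric catches it, while codes like AT-7 or 蒂罗尔州 are many edits away from the English name and only a lookup table can connect them. (Assumes the classic dynamic-programming Levenshtein distance as the metric.)

public class MatchDemo {
    // Minimum number of single-character edits turning a into b.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        return d[a.length()][b.length()];
    }
    public static void main(String[] args) {
        System.out.println(levenshtein("Tyrol", "Tirol"));    // 1: fuzzy match works
        System.out.println(levenshtein("Tyrol", "AT-7"));     // large: hopeless to guess
        System.out.println(levenshtein("Tyrol", "蒂罗尔州")); // large: hopeless to guess
    }
}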
  • 47. Do we clean data before or after query time? Weather station reports temperature in centigrade, reports -999 upon error 32.1 34.6 36.3 -999 33.8 Let’s say we want to compute the average… If we use the arithmetic mean of all five readings, we get -172.44° C. Outrageously wrong! If we know this device reports -999 on error, or that temperatures can never be less than -273.15, we can reject the bad value and get 34.2° C If we use the median instead of the mean, the outlier is pushed to the end of the sort order and largely ignored: we get 33.8° C, close to the clean mean In this case it’s reasonable to clean the data or use an algorithm that is robust to outliers; they teach kids in elementary school that the median is robust, but how many other robust algorithms are on the tip of everyone’s tongue?
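A tiny plain-Java check of the numbers above: rejecting the sentinel gives the honest mean, while the median of the raw readings shrugs the outlier off:

import java.util.Arrays;

public class RobustDemo {
    static double mean(double[] v) { return Arrays.stream(v).average().orElse(Double.NaN); }
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }
    public static void main(String[] args) {
        double[] raw = {32.1, 34.6, 36.3, -999, 33.8};
        System.out.println(mean(raw));                    // about -172.44: outrageously wrong
        double[] clean = Arrays.stream(raw)
                               .filter(t -> t > -273.15)  // reject physically impossible values
                               .toArray();
        System.out.println(mean(clean));                  // about 34.2
        System.out.println(median(raw));                  // 33.8: outlier ignored automatically
    }
}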
  • 48. Ahead-of-time data preparation (Diagram: TEST CASES guard the pipeline, and a test failure blocks further analysis; the business analyst runs queries, and error reports are thrown “over the wall” to the data quality team.) The line drawn between data processing and data use establishes a test perimeter and makes the process scalable in human terms
  • 49. Fixing up at query time will drive you nuts Scenario: Business Analyst writes queries while talking to co-workers to quickly build collective understanding. Requirement: easy to write queries off the cuff and get the right answer! “joe@example.com” <mailto:sb@example.com> “3137q@example.com” “lily@example.com” It’s not hard to canonicalize two variant forms of an e-mail address in either a query or in processing the result set (Chart: effort explodes as query complexity grows.) A real query might be querying tens of values; some are used in conditions, others end up in the results. If many things are being joined (i.e. you’re using SPARQL) the query will explode exponentially in complexity. Will you trust the answer? Some kind of query rewriting (like the implementation of OWL 2 QL) might help, but we still lack a perimeter where we can test the system and give it a clean bill of health
  • 50. Ordered collections are awkward in RDF • Two ways to do it because neither one is satisfying RDF Containers :Missions a rdf:Seq ; rdf:_1 :Mercury ; rdf:_2 :Gemini ; rdf:_3 :Apollo . This could generate huge numbers of predicates, and nothing stops one from accidentally using a numbered property more than once. The facts comprising this list could be spread across a system. RDF Collections :Missions a rdf:List ; rdf:first :Mercury ; rdf:rest _:n1 . _:n1 rdf:first :Gemini ; rdf:rest _:n2 . _:n2 rdf:first :Apollo ; rdf:rest rdf:nil . Operations on a LISP-style list are slow because you need to follow lots of pointers. The use of blank nodes can protect Collections from modification (important in the OWL spec.) Neither construction is easy to query in (standard) SPARQL
  • 51. Yet, some RDF syntaxes look almost the same as JSON/XML JSON { "missions": [ "Mercury", "Gemini", "Apollo" ] } TURTLE :Missions :members (:Mercury :Gemini :Apollo) . Most RDF tools will expand this into a LISP-list with blank nodes, but in TURTLE format the physical layout is the same as JSON. Collections and Containers are described as “non-normative” in RDF 1.1; advanced tools may use special efficient representations (like would be used for JSON). It’s awkward to work with ordered collections in the common “client-server” model that revolves around SPARQL engines, but for small graphs in memory the situation is different – the Jena framework provides a facility for accessing Collections that feels a lot like accessing data in JSON Ordered collections are critical for dealing with external data formats that support ordered collections AND critical for many traditional RDF use cases such as metadata (you’ll find scientists are pretty sensitive to the order of authors on a paper)
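A sketch of that Jena facility, plus the standard SPARQL 1.1 property-path idiom for reaching list members (the :Missions/:members vocabulary is the slide's own, in a hypothetical example.org namespace); note the RDFList API preserves the ordering while the property path loses it:

import java.util.List;
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class ListDemo {
    public static void main(String[] args) {
        String EX = "http://example.org/";
        Model m = ModelFactory.createDefaultModel();
        RDFList list = m.createList(new RDFNode[]{
            m.createResource(EX + "Mercury"),
            m.createResource(EX + "Gemini"),
            m.createResource(EX + "Apollo")});
        m.add(m.createResource(EX + "Missions"), m.createProperty(EX, "members"), list);
        // JSON-array feel, order preserved:
        List<RDFNode> members = list.asJavaList();
        members.forEach(System.out::println);
        // Standard SPARQL 1.1 reaches the members via a property path, but order is lost:
        String q = "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                   "SELECT ?member WHERE { <" + EX + "Missions> <" + EX + "members>" +
                   "/rdf:rest*/rdf:first ?member }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}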
  • 52. Another Bad Idea in Linked Data DEREFERENCING http In principle a client could ask questions about individual items and “follow its nose” to discover related information. In practice, however, you miss data quality problems that are obvious when you look at data holistically. (i.e. 47 instead of 50 states) If the data was clean ahead of time, and if we understood the structure of the data completely ahead of time, dereferencing might work. Since Linked Data does not enforce quality standards, however, dereferencing is one of those dangerous things that “almost works”.
  • 53. John Martin T 34 $17.50 I first met… Barry Robnson F 17 $12.76 Barry has… Mary Capps T 104 $541.99 Sometimes … Eric Kramer T 95 $214.22 Nobody who … Matt Butts F 32 $6.54 I’ve never … Imagine we find a CSV file without any specification as to format… Most of these match a list of common first names Most of these match a list of common last names These look like Boolean values All of these are integers These look like monetary values These fields appear to contain free text In the last example, we were able to make some pretty good guesses by looking at the data, not knowing anything about the names of the headers. This could go a long way towards interpreting this file in an automated way. Add knowledge about the problem domain and we’re cooking with gas… PROFILING For best results, do analysis against ALL of the data!
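A toy version of the profiler: guess a type for each cell, then let each column's majority vote name the column. (The name list and regexes are stand-ins for the reference data and pattern library a real profiler would use.)

import java.util.*;

public class ProfilerDemo {
    static final Set<String> FIRST_NAMES = Set.of("John", "Barry", "Mary", "Eric", "Matt");
    static String guess(String v) {
        if (v.matches("[TF]")) return "boolean";
        if (v.matches("-?\\d+")) return "integer";
        if (v.matches("\\$\\d+(\\.\\d{2})?")) return "money";
        if (FIRST_NAMES.contains(v)) return "first name";
        return "free text";
    }
    public static void main(String[] args) {
        String[][] rows = {
            {"John",  "T", "34",  "$17.50",  "I first met..."},
            {"Barry", "F", "17",  "$12.76",  "Barry has..."},
            {"Mary",  "T", "104", "$541.99", "Sometimes..."}};
        for (int col = 0; col < rows[0].length; col++) {
            Map<String, Integer> votes = new HashMap<>();
            for (String[] row : rows)
                votes.merge(guess(row[col]), 1, Integer::sum);
            // The most common guess becomes the column's inferred type.
            String best = Collections.max(votes.entrySet(),
                                          Map.Entry.comparingByValue()).getKey();
            System.out.println("column " + col + ": " + best);
        }
    }
}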
  • 54. Traditional Data Warehousing POS sales data B POS sales data C POS sales data D POS sales data A Data from four different point-of-sale systems used in different parts of a company CANONICAL DATA MODEL The good: analysts work with consistent, clean data The bad: the burden of normalizing the data when it is generated is felt acutely; in a worst case we could do this work and never end up analyzing the data. The ugly: since the normalization was done before the requirements for analysis were known, normalized data may not satisfy the requirements of analysts
  • 55. Data Lake Enabled by Hadoop Ingestion is simple because we simply copy raw data of any kind to HDFS. Development and operations are not burdened by ingestion requirements Data import is lossless. Compute and data are tightly coupled; we can “full scan” the data quickly at any time. Data cleanup can be performed to meet requirements of specific uses AND can be informed by inspection of the complete data set. Analysis can be performed on text and other kinds of data which cannot be normalized conventionally.
  • 56. We can square this circle… (Diagram: raw data flows into a Data Lake run by operations: not perfect, but not damaged by the import process! Each project draws from the lake, so data preparation is driven by requirements, with no wasted time and no compromises; projects feed queries, predictive analytics, machine learning, and other work.) Ontologies, taxonomies, and logic programming mean an increasing amount of work can be shared between projects
  • 57. Putting Knowledge To Work (UNIT CONVERSION ONCE AGAIN) EnglishTemp(location,amount) -> INSERT(MetricTemp(location,(5/9)*(amount-32))) Conversion of a unit represented by a predicate is one simple rule that could be written by hand Input data specification Output data specification Analysis of the input and output schemas reveals the need for unit conversion; the system gets the conversion rule out of a world knowledge library and specializes it World Knowledge Libraries General Industry-Specific Company-Specific Code generation
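The hand-written version of that rule, runnable with the same Jena rule engine as the BMI sketch above; ex:tempF and ex:tempC are hypothetical stand-ins for EnglishTemp and MetricTemp:

import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class UnitRule {
    public static void main(String[] args) {
        String EX = "http://example.org/";
        // MetricTemp = (5/9) * (EnglishTemp - 32), spelled with the engine's arithmetic builtins.
        String rules =
            "[f2c: (?loc <" + EX + "tempF> ?f) " +
            "      difference(?f, 32.0, ?d) product(?d, 5.0, ?n) quotient(?n, 9.0, ?c) " +
            "      -> (?loc <" + EX + "tempC> ?c)]";
        Model data = ModelFactory.createDefaultModel();
        Resource vegas = data.createResource(EX + "LasVegas");
        vegas.addLiteral(data.createProperty(EX, "tempF"), 98.6);
        InfModel inf = ModelFactory.createInfModel(
            new GenericRuleReasoner(Rule.parseRules(rules)), data);
        // (98.6 - 32) * 5 / 9 is about 37.0
        System.out.println(inf.getProperty(vegas, data.createProperty(EX, "tempC")));
    }
}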
  • 58. Intelligent Data Preparation Data Lake Documentation Machine-readable schemas describe the data Scalable/Parallel profiler transformer consumers Ontologies Requirements Knowledge base about instances (ex. places) and common patterns in data expression (ex. date formats): broad-spectrum, vertical-specific, company-specific, application-specific Iterative Development Process generates and tests hypotheses