Semantic Web Standards and the Variety “V” of Big Data

© Copyright 2014 TopQuadrant Inc.
Bob DuCharme
August 20, 2014

Three Vs of Big Data
• Volume
• Velocity
• Variety

Gartner, September 2013

Which dimensions did people struggle with the most?
• Volume: 35%
• Velocity: 16%
• Variety: 49%

Why is variety hard?
Data sources: Furniture Inventory, Protein Database, ?

Customer Database      Conference Attendees?
-----------------      ---------------------
Surname                last_name
GivenName              first_name
LastPurchase           is_speaker
ZipCode                postal_code
Email                  email

Schemas
Good thing:
• Ensure data quality
• Make query writing* easier
• Add efficiency
*And essentially, all application development

Annoying thing:
• Can’t add property values someone didn’t see coming
• Changing schema (and data with it) slow and expensive
• Often tied too closely to specific implementation
Inflexibility × 3.

Schemaless NoSQL databases
• Can’t add property values someone didn’t see coming?
• Changing schema (and data with it) slow and expensive?
• Often tied too closely to specific implementation?

Schemaless: how do applications know what properties are available?
• By any means necessary
• Documentation
• Query for properties that got used
• App possibly written by same person or team
• Responsibility shifted from database (designer) to application (designer)
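
"Query for properties that got used" could look roughly like the sketch below. It is an illustration only: the records are invented (the field names come from the earlier slide), and plain Python dictionaries stand in for whatever schemaless store is in use.

```python
from collections import Counter

# Hypothetical schemaless records; the values are invented for illustration.
records = [
    {"Surname": "Smith", "GivenName": "James", "Email": "jsmith@somecompany.com"},
    {"Surname": "Jones", "ZipCode": "22901"},
    {"last_name": "Smith", "first_name": "Jim", "is_speaker": True},
]

def properties_in_use(docs):
    """Count how many records use each property name."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.keys())
    return counts

usage = properties_in_use(records)
for name, count in sorted(usage.items()):
    print(name, count)
```

The application has to run something like this (or read the documentation) before it can know which properties it may ask for.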

Schema: all or nothing?
[Figure: the Customer Database fields (Surname, GivenName, LastPurchase, ZipCode, Email) beside the Conference Attendees fields (last_name, first_name, is_speaker, postal_code, email)]
ETL (Extract-Transform-Load)?

RDF Schema (RDFS)
• W3C Standard since 2004
• Often overshadowed by superset standard OWL
• Describes RDF, written using RDF syntaxes
[Figure labels: Semantic Web, Linked Data]

RDF
• www.w3.org/RDF (second sentence!):
“RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.”

Sample schema
@prefix cust: <http://companyX.com/ns/customer#> .
@prefix ca: <http://companyY.com/ns/confAttendees#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# Customer Database
cust:Surname a rdf:Property .
# or: cust:Surname rdf:type rdf:Property .
cust:GivenName a rdf:Property .
cust:ZipCode a rdf:Property .
cust:Email a rdf:Property .

# Conference Attendees
ca:last_name a rdf:Property .
ca:first_name a rdf:Property .
ca:postal_code a rdf:Property .
ca:email a rdf:Property .

# LastPurchase and is_speaker: don't care (for now)!

Relating properties
# assuming prefix declarations from previous slide
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
cust:Surname rdfs:subPropertyOf schema:familyName .
ca:last_name rdfs:subPropertyOf schema:familyName .
cust:GivenName rdfs:subPropertyOf schema:givenName .
ca:first_name rdfs:subPropertyOf schema:givenName .
cust:Email rdfs:subPropertyOf schema:email .
ca:email rdfs:subPropertyOf schema:email .
cust:ZipCode rdfs:subPropertyOf schema:postalCode .
ca:postal_code rdfs:subPropertyOf schema:postalCode .
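
What an RDFS-aware tool does with these rdfs:subPropertyOf statements can be sketched in a few lines of Python. This is a simplified illustration, not how any particular triplestore implements it: triples are plain tuples, prefixed names are plain strings, and the instance data is an assumption that borrows resource names used later in this deck.

```python
# Triples as (subject, predicate, object) tuples; prefixed names are plain strings.
SUBPROP = "rdfs:subPropertyOf"

schema_triples = [
    ("cust:Surname", SUBPROP, "schema:familyName"),
    ("ca:last_name", SUBPROP, "schema:familyName"),
    ("cust:Email", SUBPROP, "schema:email"),
    ("ca:email", SUBPROP, "schema:email"),
]

# Illustrative instance data from the two sources.
data_triples = [
    ("ex:cust401", "cust:Surname", "Smith"),
    ("ex:ca04395", "ca:last_name", "Smith"),
    ("ex:ca04395", "ca:email", "jsmith@somecompany.com"),
]

def infer_superproperties(schema, data):
    """Apply the rule: (s p o) and (p subPropertyOf q) implies (s q o)."""
    parents = {}
    for sub, pred, sup in schema:
        if pred == SUBPROP:
            parents.setdefault(sub, set()).add(sup)
    result = set(data)
    frontier = list(result)
    while frontier:  # keep going so chains of subproperties are followed too
        s, p, o = frontier.pop()
        for q in parents.get(p, ()):
            triple = (s, q, o)
            if triple not in result:
                result.add(triple)
                frontier.append(triple)
    return result

merged = infer_superproperties(schema_triples, data_triples)
```

After this step both sources can be queried through the shared schema.org property names, while the original triples stay untouched.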

Using the combined data
# SPARQL query: where should we open
# a government relations office?
PREFIX schema: <http://schema.org/>
SELECT ?postalCode
WHERE {
  ?person schema:email ?email .
  FILTER(STRENDS(?email, ".gov"))
  ?person schema:postalCode ?postalCode .
}
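
For readers who do not know SPARQL, the same lookup can be approximated in Python over tuples. This is a sketch under assumptions: the data is invented and is presumed to already use the shared schema.org property names.

```python
# Triples as they might look once both sources use schema.org names; values are invented.
triples = [
    ("ex:p1", "schema:email", "pat@agency.gov"),
    ("ex:p1", "schema:postalCode", "20500"),
    ("ex:p2", "schema:email", "jsmith@somecompany.com"),
    ("ex:p2", "schema:postalCode", "22901"),
    ("ex:p3", "schema:email", "lee@bureau.gov"),  # no postal code on record
]

def gov_postal_codes(graph):
    """Postal codes of everyone whose email address ends in .gov."""
    gov_people = {s for s, p, o in graph
                  if p == "schema:email" and o.endswith(".gov")}
    return sorted(o for s, p, o in graph
                  if p == "schema:postalCode" and s in gov_people)

codes = gov_postal_codes(triples)
print(codes)
```

As in the SPARQL version, a person with a .gov address but no postal code contributes nothing to the result.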

Middleware to treat RDBMS as RDF
[Diagram: Application -> SPARQL query -> Mapping Middleware (e.g. D2R, Ultrawrap) -> SQL query -> Customers database; relational results flow back through the middleware and reach the application as SPARQL query results]
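
The row-to-triple view that such middleware exposes can be sketched with Python's built-in sqlite3 module. This is a hypothetical mapping, not D2R's or Ultrawrap's actual behavior: real middleware translates SPARQL into SQL on the fly rather than copying rows, and the table layout, the sample row, and the ex:/cust: naming scheme are assumptions.

```python
import sqlite3

# An in-memory stand-in for the Customers database; the row is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, Surname TEXT, GivenName TEXT, ZipCode TEXT, Email TEXT)"
)
conn.execute(
    "INSERT INTO customers VALUES "
    "(401, 'Smith', 'James', '22901', 'jsmith@somecompany.com')"
)

def rows_as_triples(connection, table, prefix):
    """Yield one (subject, predicate, object) triple per non-null column value."""
    # The table name is a trusted constant here, never user input.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        subject = f"ex:{prefix}{row[0]}"  # the first column is the key
        for column, value in zip(columns[1:], row[1:]):
            if value is not None:
                yield (subject, f"{prefix}:{column}", value)

triples = list(rows_as_triples(conn, "customers", "cust"))
for triple in triples:
    print(triple)
```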

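[Diagram: the same architecture extended to two relational sources. The application sends a SPARQL query; SQL queries go to the Customers and Conference Attendees databases, which return relational results; a triplestore holds the schema metadata; the application receives SPARQL query results]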

Further enhancement
ex:Person a rdfs:Class .
schema:familyName rdfs:domain ex:Person .
schema:givenName rdfs:domain ex:Person .
schema:email rdfs:domain ex:Person .
schema:postalCode rdfs:domain ex:Person .
schema:postalCode rdfs:label "postal code" .
schema:postalCode rdfs:comment
    "Zip code in the USA, postcode in the UK." .
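
The practical effect of rdfs:domain is that anything using one of these properties can be inferred to be an ex:Person. A minimal Python sketch of that rule follows; the tuple representation and the instance data are assumptions made for illustration.

```python
DOMAIN = "rdfs:domain"
TYPE = "rdf:type"

schema_triples = [
    ("schema:familyName", DOMAIN, "ex:Person"),
    ("schema:email", DOMAIN, "ex:Person"),
]

# Illustrative instance data, already expressed with schema.org property names.
data_triples = [
    ("ex:cust401", "schema:familyName", "Smith"),
    ("ex:ca04395", "schema:email", "jsmith@somecompany.com"),
]

def infer_types(schema, data):
    """Apply the rule: (s p o) and (p rdfs:domain C) implies (s rdf:type C)."""
    domains = {}
    for prop, pred, cls in schema:
        if pred == DOMAIN:
            domains.setdefault(prop, set()).add(cls)
    return {(s, TYPE, cls) for s, p, _ in data for cls in domains.get(p, ())}

types = infer_types(schema_triples, data_triples)
```

Note that this is an inference, not a validation: the domain statement does not reject data, it adds a type.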

Adding more with OWL
Equipment
equipment code   room
X1703            main kitchen
Z0439            cold storage

Room addresses
room             building
main kitchen     98 Main St.
cold storage     14 Broad St.

eq:room rdfs:subPropertyOf ex:locatedIn .
rmaddr:building rdfs:subPropertyOf ex:locatedIn .
ex:locatedIn a owl:TransitiveProperty .
rmaddr:98MainSt a ex:Building .
eq:X1703 eq:room eq:mainKitchen .
eq:mainKitchen rmaddr:building rmaddr:98MainSt .

Query for which building
# SPARQL query: what building is
# equipment piece X1703 in?
SELECT ?building
WHERE {
  ?building a ex:Building .
  eq:X1703 ex:locatedIn ?building .
}
[Figure: two chained "located in" arrows]
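
The query only finds an answer because a reasoner chains the two "located in" steps. A rough Python sketch of that transitive step is shown below. It assumes the eq:room and rmaddr:building facts have already been lifted to ex:locatedIn through rdfs:subPropertyOf, and it uses a naive closure loop rather than anything a production reasoner would do.

```python
# (thing, place) pairs for ex:locatedIn, taken from the slide's two facts.
located_in = {
    ("eq:X1703", "eq:mainKitchen"),
    ("eq:mainKitchen", "rmaddr:98MainSt"),
}
buildings = {"rmaddr:98MainSt"}  # resources typed as ex:Building

def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new_pairs:
            return closure
        closure |= new_pairs

def building_of(item):
    """Buildings the item is located in, directly or indirectly."""
    return sorted(place for thing, place in transitive_closure(located_in)
                  if thing == item and place in buildings)

answer = building_of("eq:X1703")
print(answer)
```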

A little more OWL
schema:email a owl:InverseFunctionalProperty .
ex:cust401 cust:GivenName "James" .
ex:cust401 cust:Surname "Smith" .
ex:cust401 cust:Email "jsmith@somecompany.com" .
ex:ca04395 ca:first_name "Jim" .
ex:ca04395 ca:last_name "Smith" .
ex:ca04395 ca:email "jsmith@somecompany.com" .
# Inferred (given the rdfs:subPropertyOf statements from earlier):
ex:cust401 owl:sameAs ex:ca04395 .
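
The reasoning behind that owl:sameAs statement can be sketched in Python: if a property is inverse functional, two resources sharing a value for it must be the same thing. The sketch assumes the email values have already been lifted to schema:email, and the third record is a hypothetical addition to show a non-match.

```python
from collections import defaultdict
from itertools import combinations

triples = [
    ("ex:cust401", "schema:email", "jsmith@somecompany.com"),
    ("ex:ca04395", "schema:email", "jsmith@somecompany.com"),
    ("ex:cust777", "schema:email", "other@somecompany.com"),  # hypothetical extra record
]
inverse_functional = {"schema:email"}

def infer_same_as(graph, ifps):
    """Resources that share a value for an inverse functional property are the same."""
    subjects_by_value = defaultdict(set)
    for s, p, o in graph:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)
    same = set()
    for subjects in subjects_by_value.values():
        for a, b in combinations(sorted(subjects), 2):
            same.add((a, "owl:sameAs", b))
    return same

same = infer_same_as(triples, inverse_functional)
```

Because owl:sameAs is symmetric, the sketch emits each pair once, in sorted order; a reasoner would treat both directions as true.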

What OWL adds to RDFS
• RDFS gives you properties to describe your properties, classes, and instances (i.e. your resources)
• OWL gives you:
  – More properties to describe your resources
  – Classes that you can use to describe resources
  – The ability to define your own classes that you can use to describe resources

Descriptive vs. Prescriptive schemas
• Not rules to follow
  – e.g. “Employee must have a first and last name!”
  – Other ways to implement constraints
• Machine-readable guides to what you’ve got to work with
  – Data types
  – Relationships to other resources and classes of resources
• Metadata!
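
Since the schema itself does not enforce rules such as “Employee must have a first and last name”, one of the other ways to implement that constraint is a check in the application layer. The Python sketch below is only one such option under stated assumptions: the employee data and the choice of required properties are invented for illustration.

```python
REQUIRED = {"schema:givenName", "schema:familyName"}

# Hypothetical employee data as (subject, predicate, object) tuples.
triples = [
    ("ex:emp1", "schema:givenName", "James"),
    ("ex:emp1", "schema:familyName", "Smith"),
    ("ex:emp2", "schema:givenName", "Jim"),
]

def missing_properties(graph, required):
    """Map each subject to the required properties it lacks (if any)."""
    present = {}
    for s, p, _ in graph:
        present.setdefault(s, set()).add(p)
    return {s: required - props
            for s, props in present.items()
            if required - props}

report = missing_properties(triples, REQUIRED)
print(report)
```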

Whose schemas?
• Your own schemas can describe what you need from the data you’re using
• Standardized schemas (e.g. schema.org, GoodRelations) can tie together your data with data from other sources
• Tie together your custom schemas with (subsets that you’re interested in of) standardized schemas
• Tie together (subsets that you’re interested in of) different data sets from different sources

Top-down or bottom-up schema development?
• Whichever you like
• I like bottom-up
  – (Hey Cyc project: good luck with that!)
• Lots of data to deal with?
  – Model just enough to drive a simple, proof-of-concept application
  – Build the model (schema) a little at a time, then add more to your application
  – Connect that model to models of (subsets of) other data sets

Who is doing this now?
• Pharma
• Oil and gas
• Publishing

TopQuadrant Products and Solutions
Solutions:
• Asset Management
• Search / Content Enrichment
• Master Data Management
• Information Discovery for Life Sciences
• Information Exchange
• Compose your own
TopBraid Platform: Solution Engine, IDE
TopQuadrant offers configurable, out-of-the-box solutions enabling organizations to evolve their information infrastructure into a semantic ecosystem

TopBraid Insight™ (TBI)
Connect the dots for new insights. Ease Big Data Variety
• Dynamic Interactive Exploration: Search, Query, Filter, Browse, Navigate, Visualize, Share
• Logical Data Warehouse: Flexible, Adaptive Information Structuring

TopBraid Insight: Connects the Dots
• Tames Big Data to empower businesses
• Offers on-demand integrated access to diverse data, making it possible to discover information just in time
• Delivers new levels of creativity and infrastructure flexibility

Photo credits
• Volume: (CC BY-NC 2.0) Fabrizio Monti
https://www.flickr.com/photos/delphaber/3514894189
• Velocity: (CC BY 2.0) Gabriel
https://www.flickr.com/photos/cod_gabriel/1332225362
• Variety: (CC BY-NC-SA 2.0) IRRI Photos
https://www.flickr.com/photos/ricephotos/4753359957

“A wonderful harmony is created when we join
together the seemingly unconnected.”
- Heraclitus
Bob DuCharme bducharme@topquadrant.com
Thank you!
