Sindice warehousing meetup

Real Time Semantic
Warehousing: Sindice.com
technology for the enterprise
Giovanni Tummarello, Ph.D
Data Intensive Infrastructure UNIT -
DERI.ie

CEO SindiceTech

How we started: Sindice.com

80 billion triples, 500,000,000 RDF graphs, 5 TB of data.
The Sindice Suite powers Sindice.com. Online with 99.9%+ uptime.

Semantic Sandboxes on: Sindice.com

Data Sandboxes in Sindice.com – Powered by CloudSpaces

And then we met people asking:
“Can you do it for us?”

Example story (pharmaceutical company)
To stay competitive, pharmaceutical companies need to leverage all the data available from
internal sources as well as from the growing number of public HCLS data sources. Because
this data is diverse in nature, format, and quality, integration is complex.
Traditional data warehousing technology requires extensive upfront design and is handled
within a company through the “go via the IT department” approach. This does not meet the needs of
data scientists, who are the only ones who can do the complex cross-use-case thinking required.
Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:

• The ability to speed up “In silico” scientific workflows (interrelation of diverse large
datasets) by orders of magnitude by relying on a data warehousing approach.
• The ability to create large-scale “data maps” or “aggregated views” that allow
researchers to see “trends” and gather high-level insights that would not be possible with
data accessed via single lookups.
• The ability to receive recommendations and suggestions for new data connections based on
an ever-evolving ecosystem of available experimental datasets.
• The ability to provide their R&D departments with superior tools for investigating internal
knowledge: search engines and data browsing tools that provide unified views of multiple,
evolving, live datasets without leaking specific “queries” to the outside world, which would
reveal internal research trends.
• The ability to leverage the ever-increasing body of public, crowd-curated open data.


Linked Data clouds for the Enterprise

– Strategic knowledge spaces, where new
databases can be added and “leveraged” with an
unprecedented ease
– Integration “Pay as you go” : explore now, fine
tune later.
– It's Big Data (Cluster+Clouds) meets RDF and
Semantic Technologies

Because you need Semantic SandBoxes

A Dataspace Template

A typical implementation template.
Semantic Web Data
Dataspaces own:
• Resources
• Services
• Datasets for others to reuse

Dataspace Composition

Scalable cascading semantic “Dataspaces”
• Resources allocated in public/private clouds
• Allow users to take Sindice data and mix or process it for private purposes


Cloud powered!
<dataspace id="iphonedataspace">

<dependencies>
http://ecommerce01.dataspace.sindice.net/
http://price01.dataspace.sindice.net/
</dependencies>

<resources>
<mysql name="sql" />
<hbase size="10g" />
<siren name="index" />
<triplestore name="sparql" kind="virtuoso" />
</resources>

<!-- retention: see later -->
<retention>
<update-rate>1D</update-rate>
<timeout>1D</timeout>
</retention>
</dataspace>
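
A descriptor of this kind could be read with a few lines of Python. This is only a sketch: it assumes the descriptor is well-formed XML with the dependency URLs given as whitespace-separated text, and the function name `parse_dataspace` and the returned dictionary layout are hypothetical, not part of the Sindice tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed descriptor, mirroring the slide's example.
DESCRIPTOR = """
<dataspace id="iphonedataspace">
  <dependencies>
    http://ecommerce01.dataspace.sindice.net/
    http://price01.dataspace.sindice.net/
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention>
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
"""


def parse_dataspace(xml_text):
    """Turn a dataspace descriptor into a plain dictionary (assumed layout)."""
    root = ET.fromstring(xml_text.strip())
    # Dependencies are assumed to be whitespace-separated URLs.
    dependencies = root.findtext("dependencies", default="").split()
    # Each child of <resources> is one resource: (kind, attributes).
    resources = [(el.tag, dict(el.attrib)) for el in root.find("resources")]
    # Each child of <retention> is one setting: name -> value.
    retention = {el.tag: el.text for el in root.find("retention")}
    return {
        "id": root.get("id"),
        "dependencies": dependencies,
        "resources": resources,
        "retention": retention,
    }


space = parse_dataspace(DESCRIPTOR)
print(space["id"], len(space["dependencies"]), space["retention"])
```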


Scale is only 1 dimension

Multiple dimensions of Web data integration
• RDF tool stack → flexibility
• Cluster scalable processing → scalability
• “Cloud” Pipelines → dynamicity

Full JSON-like search.
On Solr.
All operators supported.

What is SIREn?

• Plugin to Solr
• Built for searching and operating on
semi-structured data and relational
data structures

SIREn: Semantic IR Engine

• Extension to Enterprise Search Engine Solr
• Semantic, full-text, incremental updates,
distributed search
(Diagram labels: Semantic, SIREn, Databases)

Constant time

Limitations of Apache Solr

• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion

Dictionary Size Explosion

Record 1
label: Renaud Delbru
name: Renaud Delbru

Dictionary
label:renaud
label:delbru
name:renaud
name:delbru

• Dictionary construction
• Concatenation of attribute name and term
• N * M complexity (worst case)
• 2 attributes * 2 terms = 4 dictionary entries
• 100K attributes * 1B terms = 100T (10^14) entries
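
The dictionary construction described above can be mimicked in a few lines of Python. This is a simplified sketch, not Solr or Lucene code: the helper `build_dictionary` is hypothetical and tokenisation is reduced to lowercasing and splitting on whitespace.

```python
def build_dictionary(records):
    """Return the set of 'attribute:term' entries for a list of records."""
    entries = set()
    for record in records:
        for attribute, value in record.items():
            # One dictionary entry per (attribute, term) pair.
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


def worst_case_entries(n_attributes, n_terms):
    """Upper bound when every term occurs under every attribute."""
    return n_attributes * n_terms


records = [{"label": "Renaud Delbru", "name": "Renaud Delbru"}]
dictionary = build_dictionary(records)
print(sorted(dictionary))
# ['label:delbru', 'label:renaud', 'name:delbru', 'name:renaud']
print(worst_case_entries(2, 2))
# 4
```

Every new attribute name multiplies the entries for the terms it contains, which is why highly heterogeneous data hits this limit quickly.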


– Query clause explosion when searching across all attributes
• Limited support for structured queries
– Multi-valued attributes

Multi-valued attributes
• No support in Solr for "all words must match
in the same value of a multi-valued field".
• A field value is a bag of words
– No distinction between multiple values
• Query example
– label : man’s friend
– Solr returns Record 1 & 2 as results

Record 1
label: “man's best friend”, “pooch”

Record 2
label: “man's worst enemy”, “friend to no one”
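
The difference between the two matching semantics can be sketched in plain Python. This mimics the behaviour described on the slide and does not call Solr; the function names are hypothetical.

```python
def tokens(text):
    """Lowercase whitespace tokenisation (a deliberate simplification)."""
    return set(text.lower().split())


def bag_of_words_match(values, query):
    """All query words appear somewhere in the field, across any values."""
    bag = set().union(*(tokens(v) for v in values))
    return tokens(query) <= bag


def same_value_match(values, query):
    """All query words appear together in at least one single value."""
    return any(tokens(query) <= tokens(v) for v in values)


records = {
    "Record 1": ["man's best friend", "pooch"],
    "Record 2": ["man's worst enemy", "friend to no one"],
}
query = "man's friend"

bag_hits = [r for r, vals in records.items() if bag_of_words_match(vals, query)]
same_hits = [r for r, vals in records.items() if same_value_match(vals, query)]
print(bag_hits)   # ['Record 1', 'Record 2']
print(same_hits)  # ['Record 1']
```

Record 2 matches under bag-of-words semantics only because “man's” and “friend” come from two different values of the same field.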


– No full-text search on attribute names

Full-text search on attribute names
• No support in Solr for “keyword search in
attribute names”.
• Query example
– (name OR label) = “Renaud Delbru”
– Solr is unable to find the records without the exact
attribute name
Record 1: rdfs:label = Renaud Delbru
Record 2: foaf:name = Renaud Delbru
Record 3: sioc:name = Renaud Delbru
Record 4: full_name = Renaud Delbru
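
A keyword search over attribute names could look like the sketch below. It is an illustration of the idea only, not SIREn's implementation: the helpers `name_tokens` and `search` are hypothetical, and attribute names are simply split on non-alphanumeric characters.

```python
import re


def name_tokens(attribute):
    """Split an attribute name such as 'rdfs:label' or 'full_name' into keywords."""
    return set(re.split(r"[^a-z0-9]+", attribute.lower())) - {""}


def search(records, name_keywords, value):
    """Find records where an attribute whose name contains any keyword equals value."""
    wanted = {k.lower() for k in name_keywords}
    hits = []
    for record_id, fields in records.items():
        for attribute, field_value in fields.items():
            if name_tokens(attribute) & wanted and field_value == value:
                hits.append(record_id)
                break
    return hits


records = {
    "Record 1": {"rdfs:label": "Renaud Delbru"},
    "Record 2": {"foaf:name": "Renaud Delbru"},
    "Record 3": {"sioc:name": "Renaud Delbru"},
    "Record 4": {"full_name": "Renaud Delbru"},
}
hits = search(records, ["name", "label"], "Renaud Delbru")
print(hits)
# ['Record 1', 'Record 2', 'Record 3', 'Record 4']
```

An exact-match lookup on the attribute name `name` alone would find none of these four records.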

– No 1:N relationship materialisation

Relationship materialization

• It's JSON-like indexing and searching

• Materialize the relationships between your
entities and others.

Some numbers: SIREn on Sindice

Data Collection
• 500M web data documents (RDF, RDFa, Microformat, etc.)
• 200K datasets
• 50B triples

Settings
• Cluster of 4 nodes
• 2 nodes for indexing
• 2 nodes for querying
• Replication

Indexing Performance
• Full index construction takes approx 24 hours
• 436K triples / second

Services
• Keyword and structured queries
• Dataset search
• >> 99% uptime

Introducing large scale RDF “Summaries”

We do it for:
• Data exploration
– How to find datasets about movies?
• Assisted SPARQL Query Editor
– What is the data structure?
• Dataset Quality
– How to differentiate relevant from irrelevant
datasets?

Large Scale RDF summaries

Class Level
12M relationships

10B relationships
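
One way to read a class-level summary is as a count of relationships between classes rather than between individual entities. The sketch below follows that reading; the slides do not describe how Sindice actually computes its summaries, so the function `class_level_summary`, the `ex:` identifiers, and the sample triples are all illustrative assumptions.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def class_level_summary(triples):
    """Count (subject class, predicate, object class) edges over a list of triples."""
    # First pass: collect the classes of each typed entity.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    # Second pass: lift each entity-level relationship to class level.
    summary = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for s_class in types.get(s, ()):
            for o_class in types.get(o, ()):
                summary[(s_class, p, o_class)] += 1
    return summary


# Hypothetical sample data.
triples = [
    ("ex:film1", "rdf:type", "ex:Movie"),
    ("ex:film2", "rdf:type", "ex:Movie"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:film1", "ex:director", "ex:alice"),
    ("ex:film2", "ex:director", "ex:alice"),
]
summary = class_level_summary(triples)
print(summary[("ex:Movie", "ex:director", "ex:Person")])
# 2
```

A summary like this is much smaller than the data it describes, which is what makes questions such as “which datasets are about movies?” cheap to answer.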

Sindice Analytics Widget Demo

• http://test01.sindice.net:9001/sindice-stats-webapp/

• http://test01.sindice.net/szydan/dataset-view/dataset/default/www.bbc.co.uk

Relational Faceted Browsing. At the speed of light

Patent Pending

SPARQL is awesome.
And now your guys can actually use it.

Thank you

Sindice.com team April 2012

With the contribution of
