Linked Data Compression
Miguel A. Martínez-Prieto Antonio Fariña
Univ. of Valladolid (Spain) Univ. of A Coruña (Spain)
migumar2@infor.uva.es fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– What is Linked Data? –
Linked Data
Linked Data is simply about using the Web to create typed links
between data from different sources [3].
Linked Data refers to a set of best practices for publishing and
connecting data on the Web.
These best practices have been adopted by an increasing number of data
providers, leading to the creation of a global data space:
Data are machine-readable.
Data meaning is explicitly defined.
Data are linked from/to external datasets.
The resulting Web of Data connects data from different domains:
Publications, movies, multimedia, government data, statistical data, etc.
The Web... of Data
The emergence of the Web was a genuine revolution 15 years ago:
Changed the way we consume information.
Changed human relationships.
Changed businesses.
...
The Web
The Web is a global space comprising linked HTML documents:
Web pages are the atoms of the Web.
Each page is uniquely identified by its URL.
The Web
Where are the (raw) data in the Web?
Web pages “cook” raw data in a human-readable way.
This is, probably, the main problem of the WWW.
The Web
- I was excited about the Keystone Training School and looked
for information about this nice country.
- I wrote “malta” in a web search engine, and...
I found some relevant results for my query! :)
But others seem a little strange to my (current) expectations... :(
The Web... of Data
Raw data are hidden within web page contents:
In general, data are written in HTML paragraphs.
In the best case, they are structured as HTML tables or
published as additional documents (CSV, XML...).
In any case, HTML is not expressive enough to describe and link individual
data entities in the Web:
HTML-based descriptions lose the semantics and structure of the raw
data.
This makes automatic data processing in the Web very difficult.
The Web... of Data
The Web of Data [8] converts raw data into first-class citizens of the
Web...
Data entities are the atoms of the Web of Data.
Each entity has its own identity.
...and uses existing infrastructure:
It uses HTTP as communication protocol.
Entities are named using URIs.
The Web of Data is a cloud of data-to-data hyperlinks [5]:
These are labelled hyperlinks in contrast to the “plain” ones used
in the Web.
Thus, hyperlinks also provide semantics to data descriptions.
The Web... of Data
Linked Data builds a Web of Data using the Internet infrastructure:
Data providers can publish their raw data in a standardized way.
These data can be interconnected using labelled hyperlinks.
The resulting cloud of data can be navigated using specific query
languages.
Linked Data achievements:
Knowledge from different fields can be easily integrated and universally
shared.
Automatic processes can exploit this knowledge to build innovative
software systems.
Semantic Search Engine
For instance, a semantic search engine would allow us to retrieve only the entities
which describe “malta” as a country, but not those describing it as a cereal.
– Linked Data Principles –
Tim Berners-Lee [2] suggests four basic principles for Linked Data:
1 Use URIs as names for things.
2 Use HTTP URIs so that people can look up those names.
3 When someone looks up a URI, provide useful information, using the
standards (RDF, SPARQL).
4 Include links to other URIs, so that they can discover more things.
1. URIs as names
What is his name?
For humans, his name is Clint Eastwood...
... but http://dataweb.infor.uva.es/movies/people/Clint_Eastwood is a
better name for machines.
The use of URIs enables real-world entities (or their relationships with
other entities) to be identified at universal scale.
This principle ensures any class of data has its own identity in the global
space of the Web of Data.
2. HTTP URIs
All entities must be described using dereferenceable URIs:
These URIs are accessible via HTTP.
This principle exploits HTTP features to retrieve all data related to a
given URI.
3. Standards
This principle states that all
stakeholders “must speak the same
languages” for effective
understanding.
RDF [10] provides a simple logical
model for data description.
SPARQL [12] describes a specific
language for querying RDF data.
Serialization formats, ontology
languages, etc.
4. Linking URIs
This principle materializes the aim of data integration in Linked Data:
Linking two URIs establishes a particular connection between two existing
entities.
Linking URIs
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood names the entity which
describes “Clint Eastwood”.
http://dataweb.infor.uva.es/movies/film/Mystic_River names the entity which describes
the movie “Mystic River”.
A hyperlink between these two URIs states that the entity “Clint Eastwood” is
related to the entity “Mystic River”... how?
The labelled link provides a semantic relationship between the entities.
In this case, http://dataweb.infor.uva.es/movies/property/director tags the
“director” relationship between “Clint Eastwood” and “Mystic River”.
– Linked Open Data –
The Linked Open Data (LOD) project (http://linkeddata.org/) promotes
publishing Linked Data as Open Data:
LOD is released under an open license which does not impede its reuse for
free [2].
LOD is the highest level in the 5-star scheme (http://5stardata.info/)
for Open Data publication:
The dataset is available on the Web under an open license.
The dataset is available as structured data.
The dataset is encoded using a non-proprietary format.
The dataset names entities using URIs.
The dataset is linked to other datasets.
LOD (2007-2011)
LOD (2014)
Current Statistics (July, 2015)
9,960 datasets are openly available (http://stats.lod2.eu/):
90 billion statements from 3,308 datasets.
6,639 datasets could not be crawled for different reasons.
LOD Laundromat (http://lodlaundromat.org/) provides access to more than
38 billion statements from 650K “cleaned” datasets.
DBpedia 2014 contains more than 3 billion statements:
538 million statements from the English Wikipedia.
2.46 billion statements from other language editions.
50 million statements linking to external datasets.
More and more datasets are released, and they are getting bigger:
The largest ones are in the order of hundreds of GB.
– Overview –
Semantic Technologies (in middle
layers) exploit features from the Web
infrastructure (low layers):
RDF is used for resource
description.
RDFS is used for describing
semantic vocabularies.
OWL extends RDFS and is used
for building ontologies.
SPARQL is the query language for
RDF data.
RIF is used for describing rules.
RDF & SPARQL
RDF & SPARQL are the most relevant technologies for our current aims:
Both standards are based on labelled directed graph features.
– RDF –
RDF [10] is a framework for describing resources of any class:
People, movies, cities, proteins, statistical data...
Resources are described in the form of triples:
Subject: the resource being described.
Predicate: a property of that resource.
Object: the value for the corresponding property.
Example triples:
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
http://dataweb.infor.uva.es/movies/property/name, “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/film/Mystic_River,
http://dataweb.infor.uva.es/movies/property/title, “Mystic River”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
http://dataweb.infor.uva.es/movies/property/director,
http://dataweb.infor.uva.es/movies/film/Mystic_River)
RDF Triples
An RDF triple is a labelled directed subgraph in which the subject and object
nodes are linked by a particular (predicate) edge:
The subject node contains the URI which names the resource.
The predicate edge labels the relationship using a URI whose semantics is
described by some vocabulary/ontology.
The object node may contain a URI or a (string) Literal value.
RDF links (between entities) also take the form of RDF triples.
RDF Graph
This graph view is only a mental model:
RDF graphs must be serialized!!
But the RDF Recommendation does not restrict the format to be used.
RDF Serialization Formats
Traditional plain formats are commonly used:
RDF/XML, NTriples, Turtle...
These formats are very verbose in practice:
Data are serialized in a (more or less) human-readable way.
Large RDF files are finally compressed using gzip or bzip2.
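As an illustration (our own, not taken from the slides; the prefix names are assumptions chosen for readability), the three example triples could be serialized in Turtle, where shared URI prefixes are declared once:

@prefix people: <http://dataweb.infor.uva.es/movies/people/> .
@prefix prop:   <http://dataweb.infor.uva.es/movies/property/> .
@prefix film:   <http://dataweb.infor.uva.es/movies/film/> .

people:Clint_Eastwood prop:name "Clint Eastwood" ;
                      prop:director film:Mystic_River .
film:Mystic_River     prop:title "Mystic River" .

The “;” groups triples sharing the same subject, already hinting at the syntactic redundancy discussed later.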
– SPARQL –
SPARQL [12] is a query language for RDF.
It is based on graph pattern matching:
Triple patterns are RDF triples in which subject, predicate and object may
be variable.
SPARQL supports more complex queries: joins, unions, filters...
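A hedged example (ours, reusing the hypothetical movie-dataset prefixes from above): retrieving every film directed by Clint Eastwood with a single triple pattern, where ?film is a variable:

PREFIX people: <http://dataweb.infor.uva.es/movies/people/>
PREFIX prop:   <http://dataweb.infor.uva.es/movies/property/>
SELECT ?film WHERE { people:Clint_Eastwood prop:director ?film . }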
SPARQL Resolution
[Figures: step-by-step resolution of a SPARQL triple pattern by graph pattern matching over the RDF graph.]
What is the problem?
RDF excels at the logical level:
Structured and semi-structured data can be described using RDF triples.
Entities are also linked in the form of RDF triples.
But it is a source of redundancy at the physical level:
Serialization formats are highly verbose.
RDF data are redundant at three levels: semantic, symbolic, and
syntactic.
– Semantic Compression –
Semantic redundancy occurs when the same meaning can be conveyed
using fewer triples.
(http://dataweb.infor.uva.es/movies/property/name,
http://www.w3.org/2000/01/rdf-schema#domain,
http://dataweb.infor.uva.es/movies/classes/person)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
http://dataweb.infor.uva.es/movies/property/name, “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
http://dataweb.infor.uva.es/movies/classes/person)
The third triple is redundant because the first one states that the URI
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood describes an entity in the
domain of http://dataweb.infor.uva.es/movies/classes/person.
Semantic Compression
Semantic compressors work at the logical level:
They detect redundant triples and remove them from the original dataset.
Semantic compressors [9, 11, 13] are not very effective by themselves...
... but they may be combined with symbolic and syntactic compressors!
– Symbolic Compression –
Symbolic redundancy is due to symbol repetitions in triples:
This is the “traditional” source of redundancy removed by universal
compressors.
Symbolic redundancy in RDF is mainly due to URIs:
URIs tend to be very long strings which share long prefixes.
http://dataweb.infor.uva.es/movies/film/Bird
http://dataweb.infor.uva.es/movies/film/Million_Dollar_Baby
http://dataweb.infor.uva.es/movies/film/Mystic_River
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood
...
... but literals also contribute to this redundancy.
Symbolic Compression
The most prominent RDF compressors remove symbolic redundancy (see the sketch below):
All different URIs/literals are indexed in a string dictionary.
Each string is identified by a unique integer ID.
Triples are rewritten by replacing each string by its corresponding ID.
Symbolic redundancy is, in general, the most important redundancy in RDF,
and leaves much room for optimization.
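A minimal sketch of this idea (ours, in Python; not the actual implementation of any of the cited compressors):

def encode(triples):
    dictionary, id_triples = {}, []
    def term_id(term):
        # assign the next free integer ID to each distinct URI/literal
        if term not in dictionary:
            dictionary[term] = len(dictionary) + 1
        return dictionary[term]
    for s, p, o in triples:
        id_triples.append((term_id(s), term_id(p), term_id(o)))
    return dictionary, id_triples

Real compressors use separate dictionaries per term role and compress the string dictionary itself, as HDT does (next section).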
– Syntactic Compression –
Syntactic redundancy depends on the RDF graph serialization:
For instance, a serialized subset of n triples describing the same
resource writes the subject value n times; it can be abbreviated.
... and also on the underlying graph structure:
For instance, resources of the same class are described using (almost)
the same subgraph structure.
Syntactic compression also leaves much room for optimization.
Syntactic Compression
HDT [7], k²-triples [1], and RDFCSA [4] are syntactic compressors
reporting good numbers:
They are combined with symbolic compression.
In practice, they compress RDF triples in the form of ID-triples.
Semantic compressors such as SSP [11] also remove symbolic and
syntactic redundancy.
– What is HDT? –
HDT was the first binary serialization format for RDF:
It was acknowledged as a W3C Member Submission [6] in 2011.
It exploits symbolic and syntactic redundancy:
It reduces the space used by traditional formats by up to 15 times [7].
HDT is a core building block in some Linked Data applications:
It reports good compression numbers, but also provides efficient data
retrieval.
– Components –
HDT encodes RDF data into three components:
The Header (H) comprises descriptive metadata.
The Dictionary (D) maps different strings (from nodes and edges) to IDs:
It manages four independent mappings: subjects-objects, subjects, objects, and
predicates.
The Triples (T) component encodes the inner structure as a graph of IDs.
HDT Components
The Dictionary is encoded using specific compression techniques for string
dictionaries.
Triple IDs are organized into a forest of trees (one per different subject)...
...which is encoded using two bit sequences and two ID sequences (see the worked example below).
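A small worked example (ours, following the HDT BitmapTriples layout; the ID values are invented). Take the ID-triples, sorted by subject: (1,2,3) (1,2,4) (1,5,4) (2,2,3).

Predicate level: subject 1 uses predicates 2,5; subject 2 uses predicate 2:
  Sp = 2 5 2       Bp = 0 1 1   (a 1 closes the predicate list of each subject)
Object level: pairs (1,2), (1,5), (2,2) have objects {3,4}, {4}, {3}:
  So = 3 4 4 3     Bo = 0 1 1 1 (a 1 closes the object list of each pair)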
– Conclusions –
HDT integrates RDF serialization and compression into a practical
format:
HDT saves storage space and enables efficient data parsing/retrieval
using bit operations.
Symbolic redundancy is addressed by the Dictionary component:
The collection of strings (in the dictionary) has high symbolic
redundancy...
The dictionary itself is highly compressible!
Syntactic redundancy is removed by the Triples component:
HDT Triples is a straightforward compressor.
Its effectiveness can be improved using optimized graph compression
techniques.
Bibliography I
[1] Sandra Álvarez-García, Nieves Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto, and Gonzalo
Navarro.
Compressed Vertical Partitioning for Efficient RDF Management.
Knowledge and Information Systems (KAIS), 44(2):439–474, 2015.
[2] Tim Berners-Lee.
Linked Data, 2006.
http://www.w3.org/DesignIssues/LinkedData.html.
[3] Christian Bizer, Tom Heath, and Tim Berners-Lee.
Linked Data - The Story So Far.
International Journal of Semantic Web and Information Systems, 5(3):1–22, 2009.
[4] Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro.
A Compact RDF Store using Suffix Arrays.
In Proceedings of SPIRE, 2015.
To appear.
[5] Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez.
Management of Big Semantic Data.
In Big Data Computing, chapter 4. Taylor and Francis/CRC, 2013.
[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, and Axel Polleres.
Binary RDF Representation for Publication and Exchange.
W3C Member Submission, 2011.
www.w3.org/Submission/HDT/.
Bibliography II
[7] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias.
Binary RDF Representation for Publication and Exchange.
Journal of Web Semantics, 19:22–41, 2013.
[8] Tom Heath and Christian Bizer.
Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool, 1 edition, 2011.
http://linkeddatabook.com/.
[9] Amit K. Joshi, Pascal Hitzler, and Guozhu Dong.
Logical Linked Data Compression.
In Proceedings of ESWC, pages 170–184, 2013.
[10] Frank Manola and Eric Miller.
RDF Primer.
W3C Recommendation, 2004.
www.w3.org/TR/rdf-primer/.
[11] Jeff Z. Pan, José Manuel Gómez-Pérez, Yuan Ren, Honghan Wu, and Man Zhu.
SSP: Compressing RDF data by Summarisation, Serialisation and Predictive Encoding.
Technical report, 2014.
Available at http://www.kdrive-project.eu/wp-content/uploads/2014/06/WP3-TR2-2014 SSP.pdf.
[12] Eric Prud’hommeaux and Andy Seaborne.
SPARQL Query Language for RDF.
W3C Recommendation, 2008.
http://www.w3.org/TR/rdf-sparql-query/.
Bibliography III
[13] Gayathri V. and P. Sreenivasa Kumar.
Horn-Rule based Compression Technique for RDF Data.
In Proceedings of SAC, pages 396–401, 2015.
This presentation has been made available only for learning/teaching purposes.
The pictures used in the slides may be owned by other parties; their ownership remains exclusively with their authors.
Onto some basics of:
compression, Compact Data Structures, and
indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
Outline
Introduction
Basic compression
Sequences
Bit sequences
Integer sequences
A brief Review about Indexing
Introduction
Why compression?
• Disks are cheap!! But they are also slow!
– Compression can help more data fit in main memory
(access to memory is around 10⁶ times faster than HDD).
• CPU speed is increasing fast.
– We can trade processing time (needed to uncompress data) for space.
• Compression does not only reduce space!
– I/O access on disks and networks.
– Processing time (less data has to be processed).
• ... if appropriate methods are used.
– For example: methods allowing us to handle data in compressed form all the time.
[Figure: a text collection (100%) vs. a compressed collection (~30%) vs. a p7zip-compressed collection (~20%), all searched for “Malta”.]
Introduction
Why indexing?
• Indexing permits sublinear search time.
[Figure: the text collection plus an inverted index (an extra 5-30% of space) whose vocabulary entry for “Malta” points to its occurrences.]
Introduction
Why Compact Data Structures?
• Self-indexes:
– sublinear search time
– text implicitly kept
[Figure: a self-index (WT, WCSA, ...) replacing both the text and the index.]
Basic Compression
Modeling & Coding
• A compressor could use as a source alphabet:
– A fixed number of symbols (statistical compressors)
• 1 char, 1 word
– A variable number of symbols (dictionary-based compressors)
• 1st occurrence of ‘a’ encoded alone, 2nd occurrence encoded together with the next symbol as ‘ax’
• Codes are built using symbols of a target alphabet:
– Fixed-length codes (1 bit, 10 bits, 1 byte, 2 bytes, ...)
– Variable-length codes (1, 2, 3, 4 bits/bytes ...)
• Classification (fixed-to-variable, variable-to-fixed, var2var, ...)
[Figure: input alphabet (fixed/variable) vs. target alphabet (fixed/variable); statistical compressors are fixed-to-variable, dictionary-based ones variable-to-fixed.]
Basic Compression
Main families of compressors
• Taxonomy
– Dictionary based (gzip, compress, p7zip, ...)
– Grammar based (BPE, Repair)
– Statistical compressors (Huffman, arithmetic, PPM, ...)
• Statistical compressors
– Gather the frequencies of the source symbols.
– Assign shorter codewords to the most frequent symbols to obtain compression.
Basic Compression
Dictionary-based compressors
• How do they achieve compression?
– Assign fixed-length codewords to variable-length symbols (text substrings).
– The longer the replaced substring → the better the compression.
• Well-known representatives: the Lempel-Ziv family
– LZ77 (1977): GZIP, PKZIP, ARJ, P7zip
– LZ78 (1978)
• LZW (1984): Compress, GIF images
Basic Compression
LZW
• Starts with an initial dictionary D (contains the symbols in Σ).
• For a given position of the text:
– While D contains w, read a prefix w = w0 w1 w2 ...
– If w0...wk wk+1 is not in D (but w0...wk is!):
• output i = entryPos(w0...wk) (note: codeword length = ⌈log₂|D|⌉ bits)
• add w0...wk wk+1 to D
• continue from wk+1 on (included)
• Dictionary has a limited length? Policies: LRU, truncate & go, ...
(A sketch follows.)
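A minimal LZW-encoder sketch (ours, in Python; not the slides' figure):

def lzw_encode(text, alphabet):
    D = {c: i for i, c in enumerate(sorted(alphabet))}
    out, w = [], ""
    for c in text:
        if w + c in D:
            w += c                 # D still contains the prefix: keep extending
        else:
            out.append(D[w])       # emit the entry of the longest known prefix
            D[w + c] = len(D)      # add the new string w+c to D
            w = c                  # continue from c on (included)
    if w:
        out.append(D[w])
    return out

# Example: lzw_encode("abababc", "abc") -> [0, 1, 3, 3, 2]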
Basic Compression
Grammar based – BPE / Repair
• Replaces pairs of symbols by a new one, until no pair repeats twice.
– Each replacement adds a rule to a dictionary of rules.
Source sequence:   A B C D E A B D E F D E D E F A B E C D
Rule DE → G:       A B C G A B G F G G F A B E C D
Rule AB → H:       H C G H G F G G F H E C D
Rule GF → I:       H C G H I G I H E C D   (final Repair sequence)
Dictionary of rules: DE → G, AB → H, GF → I.
(A naive sketch of this idea follows.)
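A naive Re-Pair sketch (ours; real implementations track pair frequencies incrementally to run in linear time):

def repair(seq):
    rules, next_sym = {}, 0
    while True:
        pairs = {}
        for a, b in zip(seq, seq[1:]):
            pairs[a, b] = pairs.get((a, b), 0) + 1
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)
        if pairs[pair] < 2:          # no pair repeats twice: stop
            break
        next_sym += 1
        new = "G%d" % next_sym       # fresh nonterminal
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):          # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

# repair(list("ABCDEABDEFDEDEFABECD")) reproduces the three rules above.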
Basic Compression
Statistical Compressors
• Assign shorter codewords to the most frequent symbols.
– Must gather the frequency nc of each symbol c in Σ.
– Compression is lower bounded by the (zero-order) empirical entropy of
the sequence S:
H0(S) = Σ_{c∈Σ} (nc/n) log₂(n/nc), where n = number of symbols in S.
H0(S) ≤ log₂|Σ|, and n·H0(S) is a lower bound on the size of S
compressed with a zero-order compressor.
• Most representative method: Huffman coding.
Basic Compression
Statistical Compressors: Huffman coding
• Optimal prefix-free coding
– No codeword is a prefix of another.
• Decoding requires no look-ahead!
– Asymptotically optimal: |Huffman(S)| ≤ n(H0(S)+1).
• Typically uses bit-wise codewords
– Yet D-ary Huffman variants exist (D = 256: byte-wise).
• Builds a Huffman tree to generate the codewords.
Basic Compression
Statistical Compressors: Huffman coding
• Sort symbols by frequency: S = ADBAAAABBBBCCCCDDEEE.
[Figures: bottom-up Huffman tree construction over the symbol frequencies, branch labeling (0/1), and code assignment.]
Basic Compression
Statistical Compressors: Huffman coding
• Compression of the sequence S = ADB...
• ADB... → 01 000 10 ...
(A sketch of a Huffman coder follows.)
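A compact Huffman-coding sketch (ours): build the tree with a min-heap of (frequency, tie-breaker, subtree) entries, then walk it assigning 0 to left branches and 1 to right branches.

import heapq
from collections import Counter

def huffman_codes(s):
    heap = [(f, i, c) for i, (c, f) in enumerate(Counter(s).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))  # merge the two lightest trees
        i += 1
    codes = {}
    def walk(t, code):
        if isinstance(t, tuple):
            walk(t[0], code + "0"); walk(t[1], code + "1")
        else:
            codes[t] = code or "0"  # degenerate single-symbol alphabet
    walk(heap[0][2], "")
    return codes

# codes = huffman_codes("ADBAAAABBBBCCCCDDEEE")
# encoded = "".join(codes[c] for c in "ADB")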
Basic Compression
Burrows-Wheeler Transform (BWT)
• Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M
with all circular rotations of S, (2) sorting the rows of M, and (3) taking
the last column.
Rotations:      Sorted rows (F = first column, L = last column = BWT(S)):
mississippi$    $mississippi
$mississippi    i$mississipp
i$mississipp    ippi$mississ
pi$mississip    issippi$miss
ppi$mississi    ississippi$m
ippi$mississ    mississippi$
sippi$missis    pi$mississip
ssippi$missi    ppi$mississi
issippi$miss    sippi$missis
sissippi$mis    sissippi$mis
ssissippi$mi    ssippi$missi
ississippi$m    ssissippi$mi
So L = BWT(S) = ipssm$pissii.
Basic Compression
Burrows-Wheeler Transform: reversible (BWT⁻¹)
• Given L = BWT(S), we can recover S = BWT⁻¹(L).
 i : 1 2 3 4  5 6 7 8 9  10 11 12
 F : $ i i i  i m p p s  s  s  s
 L : i p s s  m $ p i s  s  i  i
 LF: 2 7 9 10 6 1 8 3 11 12 4  5
Steps:
1. Sort L to obtain F.
2. Build the LF mapping so that if L[i] = ‘c’, k = the number of times ‘c’
occurs in L[1..i], and j = the position in F of the kth occurrence of ‘c’,
then LF[i] = j.
Example: L[7] = ‘p’ is the 2nd ‘p’ in L → LF[7] = 8,
the position of the 2nd occurrence of ‘p’ in F.
3. Recover the source sequence S right to left, in n steps.
Initially p = 6 (the position of $ in L); i = 0; n = 12.
In each step: S[n−i] = L[p]; p = LF[p]; i = i+1.
Step i=0: S[12] = L[6] = ‘$’; p = LF[6] = 1.
Step i=1: S[11] = L[1] = ‘i’; p = LF[1] = 2.
Step i=2: S[10] = L[2] = ‘p’; p = LF[2] = 7.
...
After 12 steps the whole of S = “mississippi$” has been recovered.
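A small sketch (ours, 0-based Python) of the BWT and its inversion via the LF mapping:

def bwt(s):                       # s must end with a unique sentinel, e.g. '$'
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    # F[j] comes from L[order[j]]; inverting this gives the LF mapping
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    p, out = L.index("$"), []     # start at the row holding the sentinel
    for _ in range(n):            # recover S right to left (step 3 above)
        out.append(L[p])
        p = LF[p]
    return "".join(reversed(out))

# assert ibwt(bwt("mississippi$")) == "mississippi$"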
Basic Compression
Bzip2: Burrows-Wheeler Transform (BWT)
• BWT: many similar symbols appear adjacent.
• MTF:
– Output the position of the current symbol within Σ′.
– Keep the alphabet Σ′ = {a,b,c,d,e,...} organized so that the last used
symbol is moved to the beginning of Σ′.
• RLE:
– If a value appears several times (000000 → six 0s),
replace the run by a pair <value,times> → <0,6>.
• Huffman stage.
Why does it work?
In a text it is likely that “he” is preceded by “t”, “ssisii” by “i”, ...
(A small MTF sketch follows.)
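A move-to-front sketch (ours): runs of equal symbols, as the BWT tends to produce, become runs of 0s, which RLE then compresses well.

def mtf_encode(s, alphabet):
    alpha = sorted(alphabet)
    out = []
    for c in s:
        i = alpha.index(c)            # current position of the symbol
        out.append(i)
        alpha.insert(0, alpha.pop(i)) # move it to the front
    return out

# mtf_encode("aaabbb", "ab") -> [0, 0, 0, 1, 0, 0]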
Sequences
Plain Representation of Data
• Given a sequence of n integers with maximum value m,
we can represent it with n·⌈log₂(m+1)⌉ bits:
– 16 symbols × 3 bits per symbol = 48 bits → an array of two 32-bit ints.
– Direct access (access to an integer + bit operations).
S = 4   1   4   4   4   4   1   4   2   4   1   1   2   3   4   4
    100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100
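A sketch of such a plain representation with direct access (ours), packing k-bit symbols into a single Python integer:

def pack(values, k):
    x = 0
    for i, v in enumerate(values):
        x |= v << (i * k)          # symbol i occupies bits [i*k, (i+1)*k)
    return x

def access(x, i, k):
    return (x >> (i * k)) & ((1 << k) - 1)

S = [4, 1, 4, 4, 4, 4, 1, 4, 2, 4, 1, 1, 2, 3, 4, 4]
packed = pack(S, 3)
assert access(packed, 8, 3) == 2   # the 9th symbol (0-based index 8)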
Sequences
Compressed Representation of Data (H0)
• Is it compressible?
Symbol:            4  1  2  3
Occurrences (nc):  9  4  2  1
• H0(S) = 1.59 bits per symbol.
• Huffman: 1.62 bits per symbol, e.g. with codes 4 → 1, 1 → 01, 2 → 000, 3 → 001:
S  = 4 1  4 4 4 4 1  4 2   4 1  1  2   3   4 4
S′ = 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1
26 bits: no direct access!
(but we could add sampling)
Sequences
Summary: plain/compressed representations → access/rank/select
• Operations of interest:
– access(i): value of the ith symbol.
– rank_s(i): number of occurrences of symbol s up to position i (count).
– select_s(i): where is the ith occurrence of symbol s? (locate)
Bit Sequences
access/rank/select on bitmaps
B = 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
    (positions 1..21)
rank1(6) = 3      rank0(10) = 5
select0(10) = 15  access(19) = 0
Bit Sequences
Applications
• Bitmaps are a basic part of most Compact Data Structures.
• Example (we will see it later in the CSA):
S: AAABBCCCCCCCCDDDEEEEEEEEEEFG → n log σ bits
B: 1001010000000100100000000011 → n bits
D: ABCDEFG → σ log σ bits
– Saves space.
– Fast access/rank/select is of interest!!
• Where is the 2nd C?
• How many Cs up to position k?
Bit Sequences
Reaching O(1) rank with o(n) extra bits
• Jacobson, Clark, Munro
– Variant by Fariña et al., assuming a 32-bit machine word.
• Step 1: split the bitmap into superblocks of 256 bits, and store in Ds
the number of 1s up to each position 1+256k.
– O(1) time to reach the superblock counter. Space: n/256 superblocks,
one int each.
[Figure: superblocks with 35, 27, and 45 ones; Ds accumulates 0, 35, 62 = 35+27, ...]
Bit Sequences
• Step 2: divide each 256-bit superblock into 8 blocks of 32 bits each
(machine-word size), and store in Db the number of ones from the
beginning of the superblock.
– O(1) time to reach the block counter; 8 blocks per superblock, 1 byte each.
[Figure: blocks with 4, 6, ..., 8 ones inside a superblock; Db accumulates 0, 4, 10, ...]
Bit Sequences
• Step 3: rank within a 32-bit block, finally solving:
rank1(D, p) = Ds[p / 256] + Db[p / 32] + rank1(blk, i), where i = p mod 32.
– Ex: rank1(D, 300) = 35 + 4 + 4 = 43.
– Yet, how do we compute rank1(blk, i) in constant time?
blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1
      (positions 1..32)
Bit Sequences
• How do we compute rank1(blk, i) in constant time?
– Option 1: popcount within a machine word.
– Option 2: a universal table onesInByte (the answer for each byte),
with only 256 entries storing values in [0..8]:
Val  binary    onesInByte
0    00000000  0
1    00000001  1
2    00000010  1
3    00000011  2
...  ...       ...
252  11111100  6
253  11111101  7
254  11111110  7
255  11111111  8
• For rank1(blk, 12): shift blk by 32 − 12 = 20 positions to keep the
first 12 bits, then sum onesInByte over the (up to) 4 bytes of blk.
• Overall space: 1.375n bits. (A sketch follows.)
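A sketch (ours) of the two-level counters plus byte table just described, storing the bitmap as a bytes object; bit i is bit (7 − i mod 8) of byte i//8, most-significant bit first, matching the left-to-right drawings.

POPC = [bin(b).count("1") for b in range(256)]   # the onesInByte table

class RankBitmap:
    SB = 32                                      # superblock = 32 bytes = 256 bits
    def __init__(self, data):
        self.data = data
        self.Ds, self.Db = [], []                # superblock / in-superblock counters
        ones = base = 0
        for i, b in enumerate(data):
            if i % self.SB == 0:
                self.Ds.append(ones)             # ones before this superblock
                base = ones
            self.Db.append(ones - base)          # ones since the superblock start
            ones += POPC[b]
    def rank1(self, p):                          # ones in bit positions 0..p
        byte, off = divmod(p, 8)
        mask = (0xFF << (7 - off)) & 0xFF        # keep only the first off+1 bits
        return (self.Ds[byte // self.SB] + self.Db[byte]
                + POPC[self.data[byte] & mask])

# B = bytes([0b01001101, 0b10000000]); RankBitmap(B).rank1(7) == 4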
Bit Sequences
select in O(log n) with the same structures
select1(p): in practice, a binary search using rank.
– Binary search on the superblocks, O(log n), to find the superblock s
containing the pth 1 → retval = Ds[s].
– Sequential search within its (at most 8) blocks until reaching the
block d that contains the position → retval += Db[d].
– Sequential search (1 byte at a time) within the last 32 bits, using
the onesInByte[] table, until reaching the byte b that contains the
position; in each iteration retval += onesInByte[b].
– A final lookup over a select-in-byte table selb[] for the last byte b
→ retval += selb[b].
– Return retval.
Bit Sequences
• Compressed bitmap representations exist.
– Compressed  [Raman et al]
– For very sparse bitmaps [Okanohara and Sadakane]
– …
Compressed representations
Integer Sequences
access/rank/select on general sequences
S = 4 4 3 2 6 2 4 2 4 1 1 2 3 5
    (positions 1..14)
rank2(9) = 3    select4(3) = 7    access(13) = 3
Integer Sequences
Wavelet tree (construction)
• Grossi et al.
• Given a sequence of symbols and an encoding, the bits of the code of
each symbol are distributed along the different levels of the tree.
Data: A B A C D A C, with codes A = 00, B = 01, C = 10, D = 11.
Root bitmap Broot (first code bit):  A B A C D A C → 0 0 0 1 1 0 1
Left child B0 (symbols A, B):        A B A A → 0 1 0 0
Right child B1 (symbols C, D):       C D C → 0 1 0
Integer Sequences
Wavelet tree (select): searching for the 1st occurrence of ‘D’
• D = 11, so D is represented in the right child B1 by a 1-bit.
• Where is the 1st ‘1’ in B1 = 0 1 0? → at position 2.
• That is the 2nd bit of B1, so we need the 2nd ‘1’ of the root bitmap
Broot = 0 0 0 1 1 0 1: where is it? → at position 5.
• Hence select_D(1) = 5.
Integer Sequences
Wavelet tree (access): which symbol appears at position 6?
• Broot = 0 0 0 1 1 0 1: position 6 holds a 0, so we descend to B0.
• How many ‘0’s up to position 6? It is the 4th ‘0’, so we look at
position 4 in B0.
• B0 = 0 1 0 0: position 4 holds a 0.
• The codeword read is ‘00’ → A.
Integer Sequences
Wavelet tree (access): which symbol appears at position 7?
• Broot: position 7 holds a 1; it is the 3rd ‘1’, so we look at
position 3 in B1.
• B1 = 0 1 0: position 3 holds a 0.
• The codeword read is ‘10’ → C.
Integer Sequences
Wavelet tree (rank): how many C’s up to position 7?
• C = 10: there are 3 ‘1’s in Broot up to position 7, so we move to
position 3 in B1.
• How many ‘0’s up to position 3 in B1 = 0 1 0? → 2!!
• Hence rank_C(7) = 2.
• Select locates a symbol bottom-up; access and rank proceed top-down.
Integer Sequences
Wavelet tree (space and times)
• Each level contains n + o(n) bits; with ⌈log σ⌉-bit codes there are
⌈log σ⌉ levels, for n⌈log σ⌉(1 + o(1)) bits overall.
• rank/select/access take O(log σ) time. (A sketch follows.)
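A didactic (binary, balanced) wavelet-tree sketch (ours): bitmaps are plain lists and rank is computed by counting, so the operations are not really O(log σ) here; it only illustrates the top-down navigation.

class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                   # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left = set(self.alphabet[:mid])
        self.bits = [0 if c in left else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in left], self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left], self.alphabet[mid:])

    def access(self, i):                       # 0-based position
        node = self
        while node.bits is not None:
            b = node.bits[i]
            i = node.bits[:i].count(b)         # rank_b(i) maps i into the child
            node = node.left if b == 0 else node.right
        return node.alphabet[0]

    def rank(self, c, i):                      # occurrences of c in seq[0..i]
        node, k = self, i + 1
        while node.bits is not None:
            mid = len(node.alphabet) // 2
            b = 0 if c in node.alphabet[:mid] else 1
            k = node.bits[:k].count(b)
            node = node.left if b == 0 else node.right
        return k

# wt = WaveletTree("ABACDAC"); wt.access(5) == 'A'; wt.rank('C', 6) == 2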
Integer Sequences
Huffman-shaped (or otherwise skewed) Wavelet tree
• Using Huffman coding (or other prefix codes) yields an unbalanced tree
taking nH0(S) + o(n) bits.
• rank/select/access then take time proportional to the codeword length,
O(H0(S)) on average.
Example codes over A B A C D A C: A = 1, B = 000, C = 01, D = 001
(encoded: 1 000 1 01 001 1 01).
[Figure: the unbalanced wavelet tree induced by these codes.]
A brief Review about Indexing
Text Indexing: well-known structures
• Traditional indexes (with or without compression):
– Inverted indexes, suffix arrays, ...
– They are auxiliary structures kept alongside the explicit text.
• Compressed self-indexes:
– Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, ...
– The text is kept implicitly inside the index.
A brief Review about Indexing
Inverted indexes: space-time trade-off
Indexed text (example): “DCC is held at the Cliff Lodge convention center. It
is an international forum for current work on data compression and related
applications. DCC addresses not only compression methods for specific types of
data (text, image, video, audio, space, graphics, web content, etc.), but also
the use of techniques from information theory and data compression in
networking, communications, and storage applications involving large datasets
(including image and information mining, retrieval, archiving, backup,
communications, and HCI).”
[Figure: a full-positional inverted index (vocabulary: Cliff, Lodge, DCC,
communications, compression, data, image, information, ...; posting lists with
word offsets such as 0 142, 104 165 341, ...) vs. a block-addressing inverted
index (posting lists with block numbers 1, 2).]
Searches:
– Word → posting list of that word.
– Phrase → intersection of posting lists.
Compression:
– Indexed text (Huffman, ...).
– Posting lists (Rice, ...).
A brief Review about Indexing
Inverted indexes: compressing the posting lists
• Lists contain increasing integers.
• Gaps between consecutive integers are smaller in the longest lists.
Original posting list:  4 10 15 25 29 40 46 54 57 70 79 82
Differences (d-gaps):   4  6  5 10  4 11  6  8  3 13  9  3
• Variable-length coding of all gaps (c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3)
→ full decompression needed.
• Absolute sampling (4 ... 29 ... 57 ...) + variable-length coding of the
remaining gaps → direct access and partial decompression.
(A variable-byte sketch follows.)
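A variable-byte coding sketch for d-gaps (ours): 7 data bits per byte, with the high bit marking the last byte of each codeword.

def vbyte_encode(nums):                # nums must be strictly increasing
    out, prev = bytearray(), 0
    for x in nums:
        gap, prev = x - prev, x
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)         # terminator byte
    return bytes(out)

def vbyte_decode(data):
    nums, cur, gap, shift = [], 0, 0, 0
    for b in data:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            cur += gap
            nums.append(cur)
            gap, shift = 0, 0
        else:
            shift += 7
    return nums

# vbyte_decode(vbyte_encode([4, 10, 15, 25])) == [4, 10, 15, 25]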
A brief Review about Indexing
Suffix Arrays
• Sort all the suffixes of T lexicographically.
T = a b r a c a d a b r a $
    1 2 3 4 5 6 7 8 9 10 11 12
A = 12 11 8 1 4 6 9 2 5 7 10 3
Sorted suffixes:
A[1] = 12: $
A[2] = 11: a$
A[3] = 8:  abra$
A[4] = 1:  abracadabra$
A[5] = 4:  acadabra$
A[6] = 6:  adabra$
A[7] = 9:  bra$
A[8] = 2:  bracadabra$
A[9] = 5:  cadabra$
A[10] = 7: dabra$
A[11] = 10: ra$
A[12] = 3:  racadabra$
A brief Review about Indexing
Suffix Arrays: searching
• Binary search for any pattern, e.g. P = ab: two binary searches over A
delimit the interval of suffixes starting with P.
[Figures: successive binary-search steps narrowing the interval for “ab”.]
• Result: the interval A[3..4], so noccs = (4 − 3) + 1 = 2 occurrences,
at positions Occs = A[3], A[4] = {8, 1}.
• Fast but large: search time O(m log n) (plus O(noccs) to report the
occurrences); space around 4n bytes for A, plus the text T.
(A sketch follows.)
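A suffix-array sketch (ours): naive construction plus binary search for the interval of suffixes starting with a pattern p (0-based, half-open interval).

def suffix_array(t):
    # 1-based text positions, as in the slides; O(n^2 log n), for illustration only
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def sa_interval(t, A, p):
    lo, hi = 0, len(A)
    while lo < hi:                           # first suffix >= p
        mid = (lo + hi) // 2
        if t[A[mid] - 1:] < p: lo = mid + 1
        else: hi = mid
    s, hi = lo, len(A)
    while lo < hi:                           # first suffix not prefixed by p
        mid = (lo + hi) // 2
        if t[A[mid] - 1:][:len(p)] <= p: lo = mid + 1
        else: hi = mid
    return s, lo

# t = "abracadabra$"; A = suffix_array(t)  # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
# sa_interval(t, A, "ab") -> (2, 4): noccs = 2, occurrences A[2], A[3] = 8, 1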
Basic Compression
BWT → FM-index
• BWT(S) + a few extra structures → it is an index.
• C[c]: for each char c in Σ, the number of occurrences in S of the
chars that are lexicographically smaller than c.
C[$]=0  C[i]=1  C[m]=5  C[p]=6  C[s]=8
• Occ(c, k): number of occurrences of char c in the prefix L[1, k].
For k in [1..12]:
Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
• The char L[i] occurs in F at position LF(i): LF(i) = C[L[i]] + Occ(L[i], i).
Basic Compression
BWT → FM-index: Count(S[1,u], P[1,p])
• Backward search: process P right to left, maintaining the interval
[sp, ep] of rows of the sorted matrix prefixed by the current suffix of P,
using only the C and Occ structures above (see the sketch below).
Basic Compression
BWT → FM-index
• Representing L with a wavelet tree: rank over L replaces the Occ tables.
Bibliography
1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical
Report 124, Digital Systems Research Center, 1994.
http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/.
2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th
SPIRE, LNCS 5280, pages 176–187, 2008.
3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc.
12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-
581, 2005.
5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994
6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text
indexing. In Proc. 17th SODA, pages 368–373, 2006.
7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th
SODA, pages 841–850, 2003.
Bibliography
8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the
Institute of Radio Engineers, 40(9):1098-1101, 1952
9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE,
88(11):1722–1732, 2000
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp.,
22(5):935–948, 1993
11. Alistair Moffat, Andrew Turpin: Compression and Coding Algorithms .Kluwer 2002, ISBN 0-7923-
7668-4
12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996.
13. Gonzalo Navarro , Veli Mäkinen, Compressed full-text indexes, ACM Computing Surveys (CSUR),
v.39 n.1, p.2-es, 2007
14. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th
ALENEX, 2007.
15. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding
k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
Bibliography
16. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and
flexible word searching on compressed text. ACM Transactions on Information Systems,
18(2):113–139, 2000.
17. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and
Indexing Documents and Images. Morgan Kaufmann, 1999.
18. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory 23, 3, 337–343.
19. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE
Transactions on Information Theory 24, 5, 530–536.
Dictionary Compression
Miguel A. Martínez-Prieto Antonio Fariña
Univ. of Valladolid (Spain) Univ. of A Coruña (Spain)
migumar2@infor.uva.es fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
– What is a String Dictionary? –
String Dictionary
A string dictionary is a serializable data structure
which organizes all the different strings (the vocabulary)
used in a dataset.
The vocabulary of a natural language text (its lexicon) comprises all the
different words used in it.
T = “la tarara sí la tarara no la tarara niña que la he visto yo”
V = {he, la, niña, no, que, sí, tarara, visto, yo}
What is a String Dictionary?
The dictionary implements a bijective function that maps
strings to identifiers (IDs, generally integer values) and back.
It must provide, at least, two complementary operations:
string-to-ID: locates the ID for a given string.
ID-to-string: extracts the string identified by a given ID.
What is a String Dictionary?
String dictionaries are a simple and effective tool:
Enable replacing (long, variable-length) strings by simple
numbers (their IDs).
T = “la tarara sí la tarara no la tarara niña que la he visto yo”
T′ = 2 7 6 2 7 4 2 7 3 5 2 1 8 9
The resulting IDs are more compact to represent and easier
and more efficient to handle:
T = 59 chars × 1 byte/char = 59 bytes
T′ = 14 IDs × ⌈log₂ 9⌉ bits/ID = 7 bytes
(plus the cost of dictionary encoding)
A compact dictionary which provides efficient mapping
between strings and IDs saves storage space, and
processing/transmission costs, in data-intensive
applications.
Compressing String Dictionaries
The growing volume of the datasets has led to increasingly large
dictionaries:
The dictionary size is a bottleneck for applications running under
restrictions of main memory.
Dictionary management is becoming a scalability issue by itself.
Dictionary compression aims to achieve competitive space/time tradeoffs:
Compact serialization.
Small memory footprint.
Efficient query resolution.
We focus on static dictionaries, which do not change along the
execution:
Many applications use dictionaries that either are static or are rebuilt only
sparingly.
– Operations –
A string dictionary is a data structure that represents a sequence of n
distinct strings, D = s1, s2, . . . , sn .
It provides a mapping between ID numbers i and strings si :
- locate(p) = i, if p = si for some i ∈ [1, n]; 0 otherwise.
- extract(i) returns the string si, for i ∈ [1, n].
Some other operations can be useful in specific applications:
Prefix-based locate / extract operations.
Substring-based locate / extract operations.
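A minimal sketch (ours) of a sorted-array string dictionary implementing the two core operations; IDs are the 1-based lexicographic ranks of the strings:

import bisect

class StringDictionary:
    def __init__(self, strings):
        self.strings = sorted(set(strings))
    def locate(self, p):
        i = bisect.bisect_left(self.strings, p)
        return i + 1 if i < len(self.strings) and self.strings[i] == p else 0
    def extract(self, i):
        return self.strings[i - 1]

# D = StringDictionary("la tarara sí la tarara no la tarara niña que la he visto yo".split())
# D.locate("tarara") == 7 and D.extract(2) == "la"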
Prefix-based Operations
- locatePrefix(p) = {i, ∃y, si = py}.
This result set is a contiguous ID range for lexicographically sorted
dictionaries.
- extractPrefix(p) = {si , ∃y, si = py}.
It is equivalent to composing locatePrefix(p) with individual
extract(i) operations.
Finding all URIs in a given domain is an example of prefix-based
operation:
Look for all properties used in http://dataweb.infor.uva.es/movies:
http://dataweb.infor.uva.es/movies/property/director (4).
http://dataweb.infor.uva.es/movies/property/name (7).
http://dataweb.infor.uva.es/movies/property/title (12).
...
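A locatePrefix sketch (ours) over the sorted dictionary above, exploiting that the answer is a contiguous ID range; the chr(0x10FFFF) sentinel assumes that codepoint never appears in the strings:

import bisect

def locate_prefix(sorted_strings, p):
    lo = bisect.bisect_left(sorted_strings, p)
    hi = bisect.bisect_right(sorted_strings, p + chr(0x10FFFF))
    return range(lo + 1, hi + 1)   # 1-based ID range

# With V as above: list(locate_prefix(sorted(V), "n")) == [3, 4]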
Substring-based Operations
- locateSubstring(p) = {i, ∃x, y, si = xpy}.
It is very similar to the problem solved by full-text indexes.
- extractSubstring(p) = {si , ∃x, y, si = xpy}.
It is equivalent to composing locateSubstring(p) with individual
extract(i) operations.
Both operations may return duplicate results which must be removed
before reporting the ID result set.
regex query resolution in SPARQL is an example of substring-based
operation:
Look for all literals containing the substring Eastwood:
“Clint Eastwood” (2544).
“Jayne Eastwood is a Canadian actress...” (10584).
“Kyle Eastwood” (13847).
...
Summary
- locate(“tarara”) = 7
- extract(2) = la
- locatePrefix(“n”) = 3,4
- extractPrefix(“n”) = niña, no
- locateSubstring(“a”) = 2,3,7
- extractSubstring(“a”) = la, niña, tarara
– RDF Dictionaries –
An RDF dictionary comprises all different terms used in the dataset:
RDF terms are drawn from three disjoint vocabularies: URIs, Literals, and
blank nodes.
Serialized (uncompressed) RDF vocabularies need up to 3 times more
space than (uncompressed) ID-triples [13].
URIs and Literals should be compressed and managed independently:
Their structure is very different and they are queried in a different way.
URIs
URIs are medium-size strings sharing long prefixes:
Compressed dictionaries for URIs must exploit the continuous repetition of
such prefixes.
Prefix-based compression.
locate operations are common when the dictionary is used for lookup
purposes (e.g. RDF stores, semantic search engines, etc.).
extract operations are common when the dictionary is used for data
access purposes (e.g. decompression, result retrieval, etc.).
locatePrefix and extractPrefix are also useful for URI dictionaries.
Literals
Literals tend to be long strings with no predictable features:
The name “Clint Eastwood”.
The genome from an individual of any species.
The full text from “El Quijote”.
...
Literal dictionaries must be based on universal compression.
locate and extract are used like in URI dictionaries.
locateSubstring and extractSubstring are useful because of SPARQL needs (e.g. regex filters).
Practical Configuration
A role-based partition is first performed:
Subjects are encoded in the range [1,|S|].
Predicates are encoded in the range [1,|P|].
Objects are encoded in the range [1,|O|].
URIs appearing as both subject and object are encoded
once:
IDs in [1,|SO|] encode subjects and objects.
Subjects are encoded in [|SO|+1,|S|].
Objects are encoded using two dictionaries:
1 [|SO|+1,|Ox|] encodes URIs that only act as objects.
2 [|Ox|+1,|O|] encodes Literals.
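A sketch of the role-based assignment on toy data (for brevity it does not separate literal objects into their own subrange, unlike the full configuration above):

triples = [("s1", "p1", "o1"), ("o1", "p2", "lit1")]
S = {s for s, _, _ in triples}
O = {o for _, _, o in triples}
P = {p for _, p, _ in triples}
SO = S & O                                        # terms acting as subject and object

subj_ids = {t: i + 1 for i, t in enumerate(sorted(SO) + sorted(S - SO))}
obj_ids  = {t: i + 1 for i, t in enumerate(sorted(SO) + sorted(O - SO))}
pred_ids = {t: i + 1 for i, t in enumerate(sorted(P))}

print(subj_ids["o1"] == obj_ids["o1"] == 1)       # True: shared URIs are encoded once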
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Compressed String Dictionaries
All the dictionaries reviewed here combine notions from universal compression and
compact data structures.
Universal compressors must enable fast decompression and comparison of
individual strings:
Huffman [8] and Hu-Tucker [7, 9] codes.
Re-Pair [10].
The serialized vocabulary Tdict concatenates all strings in lexicographic
order:
A special symbol $ is used as a separator.
T =“alabar a la alabada alabarda”
Tdict = a$alabada$alabar$alabarda$la$
– Front-Coding –
Front-Coding [15] is a folklore compression technique for lexicographically
sorted dictionaries.
It exploits the fact that consecutive entries are likely to share a common
prefix:
Each entry in the dictionary is differentially encoded with respect to the
preceding one.
It needs two values:
An integer encoding the length of the shared prefix.
The remaining characters of the current entry.
a$alabada$alabar$alabarda$la$
→ (0,a$); (1,labada$); (5, r$); (6, da$); (0, la$)
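The differential encoding is easy to reproduce; a sketch over the example vocabulary (the $ terminators are omitted):

def front_code(strings):
    out, prev = [], ""
    for s in strings:                       # strings must be sorted
        l = 0
        while l < min(len(s), len(prev)) and s[l] == prev[l]:
            l += 1                          # length of the shared prefix
        out.append((l, s[l:]))              # (prefix length, remaining characters)
        prev = s
    return out

print(front_code(["a", "alabada", "alabar", "alabarda", "la"]))
# [(0, 'a'), (1, 'labada'), (5, 'r'), (6, 'da'), (0, 'la')]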
Front-Coding
The vocabulary is divided into buckets of b strings:
The first string of each bucket (header) is explicitly stored.
The remaining b − 1 internal strings are differentially encoded.
Front-Coding Operations
locate(p):
1 Headers are binary searched until finding the bucket Bx where p must lie:
If the header is p, locate(p) = (b × (Bx − 1)) + 1.
2 The internal strings are sequentially decoded:
If the i-th internal string is p, locate(p) = (b × (Bx − 1)) + i.
If the bucket is fully decoded with no result, p is not in the dictionary.
extract(i):
1 The string is encoded in the bucket Bx = ⌈i/b⌉.
2 ((i − 1) mod b) internal strings are decoded to obtain the answer.
Prefix-based operations exploit the lexicographic order:
Their results are contiguous ranges in the dictionary.
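A sketch of bucketed locate, reusing the front_code() helper above with a hypothetical bucket size b = 4:

from bisect import bisect_right

def build(strings, b=4):                    # (header, encoded internals) per bucket
    return [(strings[i], front_code(strings[i:i + b])[1:])
            for i in range(0, len(strings), b)]

def locate(buckets, p, b=4):
    x = bisect_right([h for h, _ in buckets], p) - 1   # binary search on headers
    if x < 0:
        return 0
    s, internals = buckets[x][0], buckets[x][1]
    if s == p:
        return x * b + 1                    # p is the bucket header
    for i, (l, suffix) in enumerate(internals):        # sequential decoding
        s = s[:l] + suffix
        if s == p:
            return x * b + i + 2
    return 0                                # bucket exhausted: p is not in D

buckets = build(["a", "alabada", "alabar", "alabarda", "la"])
print(locate(buckets, "alabar"), locate(buckets, "la"))   # 3 5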
Plain Front-Coding (PFC)
PFC is a straightforward byte-oriented Front-Coding implementation:
It uses VByte [14] to encode the length of the common prefix.
The remaining string is encoded with one byte per character, plus the
terminator $.
PFC is serialized as a byte array (Tpfc ) and a ptrs structure:
Both structures are directly mapped to main memory for data retrieval
purposes.
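For reference, a sketch of one common VByte convention (7 data bits per byte; this variant flags the last byte of each value with the high bit, which is one of several layouts used in practice):

def vbyte_encode(n):
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)     # 7 low-order data bits
        n >>= 7
    out.append(n | 0x80)         # stopper bit marks the final byte
    return bytes(out)

def vbyte_decode(data, pos=0):
    n, shift = 0, 0
    while True:
        b = data[pos]; pos += 1
        n |= (b & 0x7F) << shift
        if b & 0x80:
            return n, pos
        shift += 7

print(vbyte_encode(5).hex())             # '85'
print(vbyte_decode(vbyte_encode(300)))   # (300, 2)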
HuTucker Front-Coding (HTFC)
HTFC is algorithmically similar to PFC, but it takes advantage of the Tpfc
redundancy to achieve a more compressed representation:
Operations are slightly slower than for PFC.
Headers are encoded using HuTucker:
It allows compressed headers to be directly compared with the query
pattern.
Internal strings are encoded using Huffman or Re-Pair compression.
HTFC is serialized as a bit array (Thtfc ) and also a ptrs structure:
Pointers in HTFC use fewer bits because Thtfc is smaller than Tpfc .
– Hashing –
Hashing [3] is a folklore method to implement dictionaries:
A hash function transforms the string into an index x in the hash table.
A collision arises when two different strings are mapped to the same cell
in the table.
String dictionaries perform better with closed hashing [2]:
If the corresponding cell is not empty, one successively probes other cells
until finding a free cell.
The next cell to be probed is determined using double hashing.
Hash dictionaries provide very efficient locate and may support extract,
but the table size discourages their use for managing large vocabularies.
Compressed hash dictionaries focus on compacting the table, but
also the vocabulary itself:
The vocabulary can be effectively compressed using Huffman or Re-Pair.
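A sketch of closed hashing with double hashing on a toy table (h2 is a hypothetical secondary function, forced odd so every probe sequence covers the power-of-two table):

TABLE_SIZE = 16
table = [None] * TABLE_SIZE

def h1(s): return hash(s) % TABLE_SIZE
def h2(s): return (hash(s[::-1]) % (TABLE_SIZE - 1)) | 1   # odd, non-zero step

def insert(s):
    x, step = h1(s), h2(s)
    while table[x] is not None:        # probe until a free cell is found
        x = (x + step) % TABLE_SIZE
    table[x] = s

def locate(s):
    x, step = h1(s), h2(s)
    while table[x] is not None:
        if table[x] == s:
            return x
        x = (x + step) % TABLE_SIZE
    return None                        # an empty cell ends the probe sequence

for w in ["alabar", "a", "la", "alabada", "alabarda"]:
    insert(w)
print(locate("alabada") is not None)   # True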
Vocabulary Compression
Table Compression (I)
Table Compression (II)
Improving Data Access
Hashing Operations (locate)
locate(p):
1 The pattern p is compressed using Huffman: cp.
2 cp is “hashed” to a position x in the (original) hash table.
3 x is mapped to its corresponding position y in the compressed
representation.
4 The string pointed to by y is decompressed and compared to p.
locate(“alabada”)
1 Huffman(“alabada$”)=cp
2 hash(cp)=5
3 if B[5] = 1, rank1(B, 5)=4
if B[5] = 0, “alabada” is not in D.
4 strcmp(DAC[4],cp)=true → 4
strcmp(DAC[4],cp)=false → collision
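The step from x to y relies on dropping empty cells and keeping a bitmap B of the non-empty ones; a sketch with hypothetical contents:

B = [0, 1, 0, 0, 1, 1, 0, 1]       # 1 = non-empty cell in the original table
compacted = ["a", "alabar", "alabada", "la"]   # their strings, in table order

def rank1(B, i):                   # naive rank; compact structures answer in O(1)
    return sum(B[:i + 1])

def cell_to_string(x):
    if B[x] == 0:
        return None                # the probed cell was empty
    return compacted[rank1(B, x) - 1]

print(cell_to_string(5))           # 'alabada' (the 3rd non-empty cell)
print(cell_to_string(2))           # None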
Hashing Operations (extract)
extract(i):
1 The string is directly extracted from DAC[i].
– Self-Indexed Dictionaries –
A self-index stores the original text T and provides indexed searches on
it, using space proportional to the statistical entropy of T.
Self-indexes support two operations:
locate(p), returns all the positions in T where p occurs.
extract(i, j), retrieves the substring T [i, j].
A string dictionary can be easily self-indexed:
The corresponding self-index is built on the text Tdict .
The dictionary primitives (and also prefix and substring based queries) are
implemented using the self-index operations.
We choose the FM-Index [4, 5] because it is the most space-efficient
self-index in practice:
A $ symbol is prepended to the original Tdict .
The BWT (L) is represented with a wavelet tree (“plain” [5] and “compressed” [11]).
C is a simple array.
FM-Index Dictionary
FM-Index Dictionary (locate)
The i-th string is encoded between the i-th and the (i+1)-th occurrences of $.
locate(p) performs a backwards search of $p$:
The pattern is searched from right to left until reaching the corresponding $.
locate(p) takes O(|p| log σ) time.
FM-Index Dictionary (locate)
locate('la'): Looking for $la$.
1. Range: [C($), C(a)−1] = [0,5].
Count the number of a before the range: occs0 = rank_a(L, 0) = 0
Count the number of a to the end of the range: occs1 = rank_a(L, 5) = 4
2. Range: [C(a)+occs0, C(a)+occs1−1] = [6,9].
Count the number of l before the range: occs0 = rank_l(L, 6) = 0
Count the number of l to the end of the range: occs1 = rank_l(L, 9) = 1
3. Range: [C(l)+occs0, C(l)+occs1−1] = [24,24].
Count the number of $ before the range: occs0 = rank_$(L, 24) = 5
Count the number of $ to the end of the range: occs1 = rank_$(L, 25) = 6
4. Range: [C($)+occs0, C($)+occs1−1] = [5,5].
'la' is identified by 5.
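The walkthrough can be replayed with a naive, runnable sketch of backward search (count() stands in for the O(log σ) rank of the wavelet tree):

def bwt_of(t):
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def backward_search(L, C, pattern):
    sp, ep = 0, len(L) - 1
    for c in reversed(pattern):            # process the pattern right to left
        occ0 = L[:sp].count(c)             # occurrences of c before the range
        occ1 = L[:ep + 1].count(c)         # occurrences of c up to its end
        sp, ep = C[c] + occ0, C[c] + occ1 - 1
        if sp > ep:
            return None                    # the pattern does not occur
    return sp, ep

Tdict = "$a$alabada$alabar$alabarda$la$"   # $ prepended, as described above
L = bwt_of(Tdict)
C = {c: sum(x < c for x in Tdict) for c in set(Tdict)}
print(backward_search(L, C, "$la$"))       # (5, 5): 'la' is identified by 5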
FM-Index Dictionary (extract)
extract(i) retrieves the symbols from the (i+1)-th $ back to the i-th $:
It takes O(|si| log σ) time.
extract(5):
1. The process starts from position 0.
Extract the symbol at this position: access(L, 0) = a
Count the number of a's up to this position: occs = rank_a(L, 0) = 1
2. Position: C(a) + 1 − 1 = 6.
Extract the symbol at this position: access(L, 6) = l
Count the number of l's up to this position: occs = rank_l(L, 6) = 1
3. Position: C(l) + 1 − 1 = 24.
Extract the symbol at this position: access(L, 24) = $
The 5-th string is la.
FM-Index Dictionary (prefix & substring operations)
locatePrefix(p) is similar to locate:
It looks for $p and finds the range [sp,ep] where all strings si that
start with p are encoded.
Substring-based operations generalize prefix-based ones:
locateSubstring(p) looks for p to obtain the range [sp,ep] containing
all strings si that contain p.
For each match, the backwards search continues until determining the
corresponding ID (using a sampling structure).
Duplicate IDs are finally removed.
extractPrefix(p) and extractSubstring(p) perform extract
operations in the corresponding ranges.
– Other Dictionaries (Tries)–
Tries [9] are tree-shaped structures which perform
efficiently for dictionary purposes:
Strings are located from root to leaves.
IDs are extracted from the corresponding leaf to the
root.
Tries use too much space for managing large dictionaries.
Some compressed trie-based dictionaries exist in the
state of the art:
Compressed tries based on path decomposition [6].
LZ-compressed tries [1].
Self-indexed tries (XBW) [2].
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Experimental Setup
Two RDF real-world dictionaries:
26,948,638 URIs from Uniprot:
Average length: 51.04 chars per URI.
Highly repetitive.
27,592,013 Literals from DBpedia:
Average length: 60.45 chars per Literal.
We analyze compression effectiveness and retrieval speed:
locate, extract.
Prefix-based operations (URIs)
Substring-based operations (Literals).
In practice, extract is the most important query:
It is used many times as results are retrieved from the compressed dataset.
– URIs –
Compressed tries (LexRP and CentRP)
obtain the best compression results and
report better numbers for locate:
≈ 4.5 % of the original space.
≈ 2 − 3µs/string.
> 2µs/ID.
HTFC uses slightly more space, but it is
faster for extract:
≈ 5 − 13 % of the original space.
≈ 2.2-3 µs/string.
≈ 0.7-1.6 µs/ID.
The best tradeoff is for PFC:
≈ 9 − 19 % of the original space.
≈ 1.6 µs/string.
≈ 0.3-0.6 µs/ID.
Prefix-based Operations
PFC is the best choice for prefix-based operations:
Although it uses more space, it reports the best performance.
– Literals –
Compressed tries (LexRP and CentRP)
obtain better compression results and
report better numbers for locate:
≈ 12 % of the original space.
≈ 2-2.5 µs/string.
> 2.5 µs/ID.
HTFC reports the best compression ratios,
but its performance is less competitive:
≈ 9 − 17 % of the original space.
≈ 4.5-40 µs/string.
≈ 3 − 20µs/ID.
The best tradeoff is for Hash:
≈ 15 % of the original space.
≈ 1.5 µs/string.
≈ 1µs/ID.
Substring-based Operations
– Conclusions –
RDF dictionaries are highly compressible:
URIs are very redundant and Literals also show non-negligible symbolic
redundancy.
This redundancy can be detected and removed within specific data
structures for dictionaries:
Structures for URIs use up to 20 times less space than the original
dictionaries.
For Literals, the corresponding structures use 6 − 8 times less space than
the original dictionaries.
All these structures report data retrieval performance at microsecond
level:
This functionality includes both simple and advanced operations.
GitHub
All dictionaries explained in this lecture (and some more [12]) are
available in the libCSD C++ library:
https://github.com/migumar2/libCSD
Beta version: suggestions are welcome ;)
Bibliography I
[1] Julian Arz and Johannes Fischer.
LZ-compressed string dictionaries.
In Proceedings of DCC, pages 322–331, 2014.
[2] Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro.
Compressed string dictionaries.
In Proceedings of SEA, pages 136–147, 2011.
[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms.
MIT Press and McGraw-Hill, 2nd edition, 2001.
[4] Paolo Ferragina and Giovanni Manzini.
Indexing compressed text.
Journal of the ACM, 52(4):552–581, 2005.
[5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro.
Compressed representations of sequences and full-text indexes.
ACM Transactions on Algorithms, 3(2):article 20, 2007.
[6] Roberto Grossi and Giuseppe Ottaviano.
Fast Compressed Tries through Path Decompositions.
In Proceedings of ALENEX, pages 65–74, 2012.
[7] T.C. Hu and Alan C. Tucker.
Optimal Computer-Search Trees and Variable-Length Alphabetic Codes.
SIAM Journal of Applied Mathematics, 21:514–532, 1971.
Bibliography II
[8] David A. Huffman.
A method for the construction of minimum-redundancy codes.
Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
[9] Donald E. Knuth.
The Art of Computer Programming, volume 3: Sorting and Searching.
Addison Wesley, 1973.
[10] N. Jesper Larsson and Alistair Moffat.
Offline dictionary-based compression.
Proceedings of the IEEE, 88:1722–1732, 2000.
[11] Veli Mäkinen and Gonzalo Navarro.
Dynamic entropy-compressed sequences and full-text indexes.
ACM Transactions on Algorithms, 4(3):article 32, 2008.
[12] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro.
Practical compressed string dictionaries.
Information Systems, 2015.
Under review.
[13] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas.
Querying RDF Dictionaries in Compressed Space.
SIGAPP Applied Computing Review, 12(2):64–77, 2012.
[14] Hugh E. Williams and Justin Zobel.
Compressing integers for fast file access.
The Computer Journal, 42:193–201, 1999.
Bibliography III
[15] Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann, 1999.
This presentation has been made only for learning/teaching purposes.
The pictures used in the slides may be owned by other parties; they remain the exclusive property of their authors.
Triples Compression and Indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
Outline
RDF management overview
K2-Tree structure
K2-Triples
Compressed Suffix Array (CSA-Sad)
RDF-CSA
Experiments
RDF management Overview
Dictionary + triples-IDs
[Figure: an RDF graph linking SPIRE, London, UK, Finland, M.Lalmas, R.Raman, A.Gionis and inv-speaker through the predicates held on, capital of, lives in, works in, attends and position.]
Original Triples:
(SPIRE, held on, London)
(London, capital of, UK)
(A.Gionis, attends, SPIRE)
(R.Raman, attends, SPIRE)
(M.Lalmas, attends, SPIRE)
(M.Lalmas, lives in, UK)
(M.Lalmas, works in, London)
(A.Gionis, lives in, Finland)
(R.Raman, lives in, UK)
(R.Raman, position, inv-speaker)
Dictionary Encoding:
SO: London = 1, SPIRE = 2
S: A.Gionis = 3, M.Lalmas = 4, R.Raman = 5
O: Finland = 3, inv-speaker = 4, UK = 5
P: attends = 1, capital of = 2, held on = 3, lives in = 4, position = 5, works in = 6
ID-based Triples:
(2,3,1), (1,2,5), (3,1,2), (5,1,2), (4,1,2), (4,4,5), (4,6,1), (3,4,3), (5,4,5), (5,5,4)
K2-Tree
• Structure for representing an adjacency matrix
• Originally designed for web graphs
– Simple directed graph
[Figure: a simple directed web graph.] Its 11×11 adjacency matrix:
0 1 0 0 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 1 0
0 0 0 0 0 0 1 0 1 0 1
0 0 0 0 0 0 1 0 0 1 0
Motivation
The matrix is expanded with 0s to 16×16 (the next power of k = 2), so large empty areas can be represented by a single 0 bit in the tree.
Example with K=2:
T = 101111010100100011001000000101011110
L = 010000110010001010101000011000100100
K2-Tree
Construction process
[Figure: each level splits the current submatrix into k² = 4 parts; a node stores 1 if its submatrix contains some 1, and 0 otherwise. The internal levels are concatenated into the bitmap T and the last level into L.]
T = 101111010100100011001000000101011110
L = 010000110010001010101000011000100100
Direct neighbour operation:
children(i) = rank1(T, i) × k²
children(2) = rank1(T, 2) × k² = 2 × 4 = 8
children(9) = rank1(T, 9) × k² = 7 × 4 = 28
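A sketch of this navigation with the T and L bitmaps above (naive rank1; compact implementations answer it in constant time):

T = "101111010100100011001000000101011110"
L = "010000110010001010101000011000100100"
K2 = 4                                   # k = 2, so k*k children per node

def rank1(bits, i):                      # number of 1s in bits[0..i]
    return bits[:i + 1].count("1")

def children(i):                         # offset of node i's children in T:L
    return rank1(T, i) * K2

print(children(2))   # 8
print(children(9))   # 28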
K2-Triples
Data Structure
• Dictionary encoding: RDF triples mapped to ID-triples.
• Vertical partitioning: one K2-tree per predicate.
Example (S,P,O) triples:
(8,5,4), (4,2,3), (4,4,6), (4,1,7), (7,2,3), (3,3,5), (5,2,1), (1,3,5), (6,2,2), (2,3,5)
[Figure: five K2-trees, one per predicate P1..P5, each representing the S×O adjacency matrix of that predicate.]
K2-Triples
Operations
• SPO → checking a cell. Example: (4,2,3) → (4,2,3).
• SP? → direct neighbours. Example: (4,2,?) → (4,2,3).
• ?PO → reverse neighbours. Example: (?,2,3) → (4,2,3), (7,2,3).
• S?O → checking |P| cells. Example: (4,?,6) → (4,4,6).
• S?? → |P| direct neighbours. Example: (4,?,?) → (4,1,7), (4,2,3), (4,4,6).
• ??O → |P| reverse neighbours. Example: (?,?,4) → (8,5,4).
• ?P? → full adjacency matrix of the predicate. Example: (?,2,?) → (4,2,3), (5,2,1), (6,2,2), (7,2,3).
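A sketch of this pattern taxonomy over the toy triples, with each predicate's K2-tree stood in by a plain set of (s, o) pairs (the point is which trees are visited, not the bit-level navigation):

triples = [(8,5,4), (4,2,3), (4,4,6), (4,1,7), (7,2,3), (3,3,5),
           (5,2,1), (1,3,5), (6,2,2), (2,3,5)]
by_pred = {}                            # one "matrix" per predicate
for s, p, o in triples:
    by_pred.setdefault(p, set()).add((s, o))

def solve(s, p, o):                     # None stands for '?'
    preds = [p] if p is not None else by_pred   # bounded P: one tree; else all |P|
    return sorted((si, pi, oi)
                  for pi in preds for (si, oi) in by_pred[pi]
                  if (s is None or si == s) and (o is None or oi == o))

print(solve(4, 2, None))     # SP?: [(4, 2, 3)]
print(solve(None, 2, None))  # ?P?: [(4, 2, 3), (5, 2, 1), (6, 2, 2), (7, 2, 3)]
print(solve(4, None, None))  # S??: [(4, 1, 7), (4, 2, 3), (4, 4, 6)]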
K2-Triples
Indexes SP & OP
• Weakness of vertical partitioning → unbounded predicates:
– (S,?,?), (?,?,O), (S,?,O)
– Checking the |P| K2-trees!
• They proposed SP and OP indexes; the SP index maps each subject to its predicate list:
S → Predicates: 1 → 3; 2 → 3; 3 → 3; 4 → 1,2,4; 5 → 2; 6 → 2; 7 → 2; 8 → 5
– Statistically compressed.
– Direct access with DAC.
K2-Triples
Indexes SP & OP
• Query (4,?,?): the SP index returns the predicate list of subject 4 (1, 2, 4), so only the K2-trees of P1, P2 and P4 need to be checked.
K2-Triples
Joins
• Independent join
• Chain join
• Interactive join
K2-Triples
Joins
• They implemented three join strategies
– Taking advantage of the K2-triples structure
Example query: (8,5,?X) (?X,2,?)
Best strategy depends on the dataset and the type of join.
K2-Triples
Joins > Interactive Join
Query: (8,5,?X) (?X,2,?)
[Figure: the K2-trees of P5 and P2 are traversed in parallel, interactively narrowing the candidate ranges X[1-4] and X[5-8] of the join variable.]
K2-Triples
Experiments
• Real datasets from different domains:
Dataset | Size (MB) | #Triples | #Predicates | #Subjects | #Objects
Jamendo | 144.18 | 1,049,639 | 28 | 335,926 | 440,604
DBLP | 7.58 | 46,597,620 | 27 | 2,840,639 | 19,639,731
Geonames | 12,347.70 | 112,235,492 | 26 | 8,147,136 | 41,111,569
Dbpedia | 33,912.71 | 232,542,405 | 39,672 | 18,425,128 | 65,200,769
• Space results (MB); blank cells had no figure on the original slide:
Dataset | MonetDB | RDF-3X | Hexastore | K2-triples | K2-triples+
Jamendo | 8.76 | 37.73 | 1,371.25 | 0.74 | 1.28
DBLP | 358.44 | 1,643.31 | — | 82.48 | 99.24
Geonames | 859.66 | 3,584.80 | — | 152.20 | 188.63
Dbpedia | 1,811.74 | 9,757.58 | — | 931.44 | 1,178.38
Experiments > triple patterns
• Triple patterns (DBPEDIA): [Figure: query times per triple pattern.]
Experiments > Join
[Figure: join performance across strategies.]
Compressed Suffix Array (CSA-SAD)
Back to Suffix Arrays
• Binary search for any pattern, e.g. P = "ab":
T = a b r a c a d a b r a $ (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Noccs = (4−3)+1 = 2; Occs = A[3]..A[4] = {8, 1}
• Fast but space-hungry: O(m lg n) search time (O(m lg n + noccs) to locate), using roughly 4n bytes for A on top of T.
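A runnable sketch of that binary search (the explicit suffix list is materialized only for clarity; a real implementation compares against T in place):

from bisect import bisect_left

T = "abracadabra$"
A = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based suffix array

def sa_locate(P):
    suffixes = [T[i - 1:] for i in A]
    lo = bisect_left(suffixes, P)
    hi = bisect_left(suffixes, P + "\uffff")   # sentinel past any P-prefixed suffix
    return A[lo:hi]                            # noccs = hi - lo occurrences

print(A)                # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(sa_locate("ab"))  # [8, 1]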
The suffixes of T in lexicographic order: $, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$. The two suffixes starting with P = ab occupy the contiguous range A[3..4].
CSA basics
• Can we reduce the space needs of a Suffix Array?
• Ψ: A[Ψ(i)] = A[i] + 1
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
• Example: Ψ(10) = 3, so A[Ψ(10)] = A[3] = A[10] + 1 = 8.
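Ψ can be derived from A and its inverse, and checked against the identity above; a short sketch:

T = "abracadabra$"
n = len(T)
A = sorted(range(1, n + 1), key=lambda i: T[i - 1:])
inv = {A[i]: i for i in range(n)}                 # A⁻¹, as 0-based rows

psi = [inv[A[i] % n + 1] + 1 for i in range(n)]   # 1-based Ψ, cycling at n
print(psi)                  # [4, 1, 7, 8, 9, 10, 11, 12, 6, 3, 2, 5]
print(A[psi[10 - 1] - 1])   # 8: A[Ψ(10)] = A[3] = A[10] + 1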
CSA basics
• Ψ and F, the first symbols of the lexicographically sorted suffixes:
F = $ a a a a a b b c d r r
• Ψ and F are enough to perform the binary search and to recover the source data!!
• Reducing the space of F: a bitmap D marks the cells where a new symbol starts, and S stores the sorted alphabet:
D = 1 1 0 0 0 0 1 0 1 1 1 0
S = $ a b c d r
• Representing F: F[i] = S[rank1(D, i)]. Example: F[8] = S[rank1(D, 8)] = S[3] = ‘b’.
• rank1(D, i) takes O(1) time, by using o(n) extra space.
Compressed Suffix Array (CSA-SAD)
Compressing Ψ
• Absolute samples (k = sample period).
• Gap encoding on the increasing values, combining:
– delta codes
– Huffman-based codes
– run encoding
• Huffman with an N-entry dictionary:
– k reserved Huffman codes encode 1-runs of size s ∈ [1..k−1].
– 32 + 32 Huffman codes represent the size (in bits) of large values [+ or −]; they are followed by that value encoded with log(v) bits.
– The remaining N − k − 32 − 32 entries correspond to the most frequent gap values.
[Figure: Ψ is sampled every k positions (sΨ) and the differences Δ between consecutive values are entropy-encoded.]
Complete structure
– Ψ (sampled), D, S → count
– A (sampled) → locate
– A⁻¹ (sampled) → extract
T = a b r a c a d a b r a $
A = 12 11 8 1 4 6 9 2 5 7 10 3
A⁻¹ = 4 8 12 5 9 6 10 3 7 11 2 1
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
D = 1 1 0 0 0 0 1 0 1 1 1 0
S = $ a b c d r
Parameters: space/time trade-off.
RDF-CSA
Building RDF-CSA
• Step 1 → Integer dictionary encoding of s, p, o.
• Step 2 → Ordered list of n triples (a sequence of 3n elements):
We first sort by subject, then by predicate, and finally by object.
…
RDF-CSA
Building RDF-CSA
• Step 3 → Sid is transformed into S, in order to keep disjoint alphabets:
Range [1, ns] for subjects.
Range [ns+1, ns+np] for predicates.
Range [ns+np+1, ns+np+no] for objects.
Due to this alphabet mapping, every subject is smaller than every predicate, and this in turn is smaller than every object!!
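A sketch of the alphabet mapping of Steps 2-3 on two hypothetical id-triples (ns = np = no = 2):

triples = [(1, 2, 2), (2, 1, 1)]          # toy (s, p, o) ids
ns = max(s for s, _, _ in triples)
np_ = max(p for _, p, _ in triples)

S = []
for s, p, o in sorted(triples):           # sort by subject, predicate, object
    S.extend([s, ns + p, ns + np_ + o])   # shift p and o into disjoint ranges

print(S)   # [1, 4, 6, 2, 3, 5]: subjects < predicates < objects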
RDF-CSA
Building RDF-CSA
• Step 4 → We build an iCSA on S:
– A has three ranges: each range points to suffixes starting with a subject, a predicate, or an object.
– Ψ cycles around the components of the same triple; that is, the object of a triple k does not point to the subject of triple k+1 in S, but to the subject of the same triple → we can start at position A[i], pointing to any place within a triple (s,p,o), and recover the triple by successive applications of Ψ.
RDF-CSA
Searching for triple patterns
• (S,P,O), (?S,P,O), (S,?P,O), (S,P,?O), (?S,?P,O), (S,?P,?O), (?S,P,?O), (?S,?P,?O)
– Patterns with just one bounded element are directly solved using select on D.
– Pattern (?S,?P,?O) retrieves all the triples, so it can be solved by retrieving every i-th triple, using Ψ.
– For the rest of the patterns: binary iCSA search:
• SPO → bsearch(SPO, 3)
• ?SOP → bsearch(OP, 2) … S?PO → bsearch(OS, 2)
– Optimizations (applicable to pattern (S,P,O) and to those with just one unbounded term!!):
• D-select+forward-check strategy: find valid intervals in the S, P and O ranges, and check matches with Ψ within those intervals, starting from the shortest one.
• D-select+backward-check strategy: use binary search to limit valid intervals, instead of sequentially verifying each position of the shortest interval.
RDF-CSA
(S,P,O) optimizations
[Figure: both strategies traced for the triple (S=8, P=4, O=261), delimiting the valid intervals for SP, SPO and PO in A.]
RDF-CSA
Experiments (dbpedia): space (%) vs µs/occ
[Figure: space/time trade-off of RDF-CSA against alternative triple stores on DBpedia.]
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined

More Related Content

What's hot

An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) DataAli Khalili
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked datavafopoulos
 
Quick Linked Data Introduction
Quick Linked Data IntroductionQuick Linked Data Introduction
Quick Linked Data IntroductionMichael Hausenblas
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityunivTope Omitola
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataNattiya Kanhabua
 
Linked Data Management
Linked Data ManagementLinked Data Management
Linked Data ManagementMarin Dimitrov
 
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)ITWS 4310: Building and Consuming the Web of Data (Fall 2013)
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)Rensselaer Polytechnic Institute
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale Bernadette Hyland-Wood
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsArmin Haller
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesOpen Data Support
 
Linked open data project
Linked open data projectLinked open data project
Linked open data projectFaathima Fayaza
 

What's hot (20)

An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) Data
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
Sanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUDSanderson Shout It Out: LOUD
Sanderson Shout It Out: LOUD
 
Quick Linked Data Introduction
Quick Linked Data IntroductionQuick Linked Data Introduction
Quick Linked Data Introduction
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
Linked library data
Linked library dataLinked library data
Linked library data
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving Data
 
Engineering a Semantic Web (Spring 2018)
Engineering a Semantic Web (Spring 2018)Engineering a Semantic Web (Spring 2018)
Engineering a Semantic Web (Spring 2018)
 
Linked Data Management
Linked Data ManagementLinked Data Management
Linked Data Management
 
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)ITWS 4310: Building and Consuming the Web of Data (Fall 2013)
ITWS 4310: Building and Consuming the Web of Data (Fall 2013)
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web Applications
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and Examples
 
Linked open data project
Linked open data projectLinked open data project
Linked open data project
 

Viewers also liked

1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon Challenge1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon ChallengeJoel Azzopardi
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceJosé Ramón Ríos Viqueira
 
Data Compression for Multi-dimentional Data Warehouses
Data Compression for Multi-dimentional Data WarehousesData Compression for Multi-dimentional Data Warehouses
Data Compression for Multi-dimentional Data WarehousesMushfiqur Rahman
 
How GZIP works... in 10 minutes
How GZIP works... in 10 minutesHow GZIP works... in 10 minutes
How GZIP works... in 10 minutesRaul Fraile
 
G zip compresser ppt
G zip compresser pptG zip compresser ppt
G zip compresser pptgaurav kumar
 
Data Compression Project Presentation
Data Compression Project PresentationData Compression Project Presentation
Data Compression Project PresentationMyuran Kanga, MS, MBA
 
Chapter 5 - Data Compression
Chapter 5 - Data CompressionChapter 5 - Data Compression
Chapter 5 - Data CompressionPratik Pradhan
 
Data compression techniques
Data compression techniquesData compression techniques
Data compression techniquesDeep Bhatt
 
Text compression in LZW and Flate
Text compression in LZW and FlateText compression in LZW and Flate
Text compression in LZW and FlateSubeer Rangra
 
data compression technique
data compression techniquedata compression technique
data compression techniqueCHINMOY PAUL
 
Fundamentals of Data compression
Fundamentals of Data compressionFundamentals of Data compression
Fundamentals of Data compressionM.k. Praveen
 
LinkedIn SlideShare: Knowledge, Well-Presented
LinkedIn SlideShare: Knowledge, Well-PresentedLinkedIn SlideShare: Knowledge, Well-Presented
LinkedIn SlideShare: Knowledge, Well-PresentedSlideShare
 

Viewers also liked (19)

1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon Challenge1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon Challenge
 
Curse of Dimensionality and Big Data
Curse of Dimensionality and Big DataCurse of Dimensionality and Big Data
Curse of Dimensionality and Big Data
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
 
School intro
School introSchool intro
School intro
 
Information Retrieval Evaluation
Information Retrieval EvaluationInformation Retrieval Evaluation
Information Retrieval Evaluation
 
Data Compression for Multi-dimentional Data Warehouses
Data Compression for Multi-dimentional Data WarehousesData Compression for Multi-dimentional Data Warehouses
Data Compression for Multi-dimentional Data Warehouses
 
How GZIP works... in 10 minutes
How GZIP works... in 10 minutesHow GZIP works... in 10 minutes
How GZIP works... in 10 minutes
 
Compression
CompressionCompression
Compression
 
G zip compresser ppt
G zip compresser pptG zip compresser ppt
G zip compresser ppt
 
Data Compression Project Presentation
Data Compression Project PresentationData Compression Project Presentation
Data Compression Project Presentation
 
Chapter 5 - Data Compression
Chapter 5 - Data CompressionChapter 5 - Data Compression
Chapter 5 - Data Compression
 
Data compression techniques
Data compression techniquesData compression techniques
Data compression techniques
 
Text compression in LZW and Flate
Text compression in LZW and FlateText compression in LZW and Flate
Text compression in LZW and Flate
 
Data compression
Data compressionData compression
Data compression
 
Compression techniques
Compression techniquesCompression techniques
Compression techniques
 
data compression technique
data compression techniquedata compression technique
data compression technique
 
Data compression
Data compressionData compression
Data compression
 
Fundamentals of Data compression
Fundamentals of Data compressionFundamentals of Data compression
Fundamentals of Data compression
 
LinkedIn SlideShare: Knowledge, Well-Presented
LinkedIn SlideShare: Knowledge, Well-PresentedLinkedIn SlideShare: Knowledge, Well-Presented
LinkedIn SlideShare: Knowledge, Well-Presented
 

Similar to Keystone summer school_2015_miguel_antonio_ldcompression_4-joined

RDFa From Theory to Practice
RDFa From Theory to PracticeRDFa From Theory to Practice
RDFa From Theory to PracticeAdrian Stevenson
 
On line footprint @upc
On line footprint @upcOn line footprint @upc
On line footprint @upcSilvia Puglisi
 
Towards cross-domain interoperation in the internet of FAIR data and services
Towards cross-domain interoperation in the internet of FAIR data and servicesTowards cross-domain interoperation in the internet of FAIR data and services
Towards cross-domain interoperation in the internet of FAIR data and servicesLuiz Olavo Bonino da Silva Santos
 
Linked Data: so what?
Linked Data: so what?Linked Data: so what?
Linked Data: so what?MIUR
 
Introducing the Linked Data Research Centre
Introducing the Linked Data Research CentreIntroducing the Linked Data Research Centre
Introducing the Linked Data Research CentreMichael Hausenblas
 
Linked Data and the Semantic Web - Mimas Seminar
Linked Data and the Semantic Web - Mimas SeminarLinked Data and the Semantic Web - Mimas Seminar
Linked Data and the Semantic Web - Mimas SeminarAdrian Stevenson
 
Linked Open Data and data-driven journalism
Linked Open Data and data-driven journalismLinked Open Data and data-driven journalism
Linked Open Data and data-driven journalismPia Jøsendal
 
Conclusions - Linked Data
Conclusions - Linked DataConclusions - Linked Data
Conclusions - Linked DataJuan Sequeda
 
Linked Data and the Semantic Web: What Are They and Should I Care?
Linked Data and the Semantic Web: What Are They and Should I Care?Linked Data and the Semantic Web: What Are They and Should I Care?
Linked Data and the Semantic Web: What Are They and Should I Care?Adrian Stevenson
 
Putting the L in front: from Open Data to Linked Open Data
Putting the L in front: from Open Data to Linked Open DataPutting the L in front: from Open Data to Linked Open Data
Putting the L in front: from Open Data to Linked Open DataMartin Kaltenböck
 
Linked dataresearch
Linked dataresearchLinked dataresearch
Linked dataresearchTope Omitola
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutionsOpen Data Support
 
Discovering Resume Information using linked data  
Discovering Resume Information using linked data  Discovering Resume Information using linked data  
Discovering Resume Information using linked data  dannyijwest
 
The linked data value chain atif
The linked data value chain atifThe linked data value chain atif
The linked data value chain atifAtif Latif
 

Similar to Keystone summer school_2015_miguel_antonio_ldcompression_4-joined (20)

Linked Data In Action
Linked Data In ActionLinked Data In Action
Linked Data In Action
 
RDFa From Theory to Practice
RDFa From Theory to PracticeRDFa From Theory to Practice
RDFa From Theory to Practice
 
Introducción a Linked Open Data (espacios enlazados y enlazables)
Introducción a Linked Open Data (espacios enlazados y enlazables)Introducción a Linked Open Data (espacios enlazados y enlazables)
Introducción a Linked Open Data (espacios enlazados y enlazables)
 
On line footprint @upc
On line footprint @upcOn line footprint @upc
On line footprint @upc
 
Towards cross-domain interoperation in the internet of FAIR data and services
Towards cross-domain interoperation in the internet of FAIR data and servicesTowards cross-domain interoperation in the internet of FAIR data and services
Towards cross-domain interoperation in the internet of FAIR data and services
 
Linked Data: so what?
Linked Data: so what?Linked Data: so what?
Linked Data: so what?
 
Introducing the Linked Data Research Centre
Introducing the Linked Data Research CentreIntroducing the Linked Data Research Centre
Introducing the Linked Data Research Centre
 
Linked Data and the Semantic Web - Mimas Seminar
Linked Data and the Semantic Web - Mimas SeminarLinked Data and the Semantic Web - Mimas Seminar
Linked Data and the Semantic Web - Mimas Seminar
 
Linked Open Data and data-driven journalism
Linked Open Data and data-driven journalismLinked Open Data and data-driven journalism
Linked Open Data and data-driven journalism
 
Conclusions - Linked Data
Conclusions - Linked DataConclusions - Linked Data
Conclusions - Linked Data
 
Linked Data
Linked DataLinked Data
Linked Data
 
Linked Data and the Semantic Web: What Are They and Should I Care?
Linked Data and the Semantic Web: What Are They and Should I Care?Linked Data and the Semantic Web: What Are They and Should I Care?
Linked Data and the Semantic Web: What Are They and Should I Care?
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Putting the L in front: from Open Data to Linked Open Data
Putting the L in front: from Open Data to Linked Open DataPutting the L in front: from Open Data to Linked Open Data
Putting the L in front: from Open Data to Linked Open Data
 
Broad Data
Broad DataBroad Data
Broad Data
 
Linked dataresearch
Linked dataresearchLinked dataresearch
Linked dataresearch
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutions
 
Jarrar: Linked Data
Jarrar: Linked DataJarrar: Linked Data
Jarrar: Linked Data
 
Discovering Resume Information using linked data  
Discovering Resume Information using linked data  Discovering Resume Information using linked data  
Discovering Resume Information using linked data  
 
The linked data value chain atif
The linked data value chain atifThe linked data value chain atif
The linked data value chain atif
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 

Keystone summer school_2015_miguel_antonio_ldcompression_4-joined

  • 1. Linked Data Semantic Technologies RDF Compression HDT Linked Data Compression Miguel A. Mart´ınez-Prieto Antonio Fari˜na Univ. of Valladolid (Spain) Univ. of A Coru˜na (Spain) migumar2@infor.uva.es fari@udc.es Keyword search over Big Data. – 1st KEYSTONE Training School –. July 22nd, 2015. Faculty of ICT, Malta. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 1/53
  • 2. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data Outline 1 Linked Data 2 Semantic Technologies 3 RDF Compression 4 HDT Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 2/53
  • 3. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data – What is Linked Data? – Linked Data Linked Data is simply about using the Web to create typed links between data from different sources [3]. Linked Data refers to a set of best practices for publishing and connecting data on the Web. These best practices have been adopted by an increasing number of data providers, leading to the creation of a global data space: Data are machine-readable. Data meaning is explicitly defined. Data are linked from/to external datasets. The resulting Web of Data connects data from different domains: Publications, movies, multimedia, government data, statistical data, etc. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 3/53
  • 4. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data What is Linked Data? Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 4/53
  • 5. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web... of Data The emergence of the Web was an authentic revolution 15 years ago: Changed the way we consume information. Changed human relationships. Changed businesses. ... Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 5/53
  • 6. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web The Web is a global space comprising linked HTML documents: Web pages are the atoms of the Web. Each page is univocally identified by their URL. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 6/53
  • 7. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web Where are (raw) data in the Web? Web pages “cook” raw data in a human-readable way. It is, probably, the main problem of the WWW. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 7/53
  • 8. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web - I was excited for the Keystone Training School and looked for information about this nice country. - I wrote “malta” in a web search engine, and... I found some relevant results for my query! :) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 8/53
  • 9. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web - I was excited for the Keystone Training School and looked for information about this nice country. - I wrote “malta” in a web search engine, and... But others seem a little strange to my (current) expectations... :( Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 9/53
  • 10. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web... of Data Raw data are hidden among web pages contents: In general, data are written in HTML paragraphs. In the best case, they are structured in the form of HTML tables or published as additional documents (CSV, XML...) Anyway, HTML is not enough expressive to describe and link individual data entities in the Web: HTML-based descriptions lose semantics and structure from the raw data. This fact makes very difficult automatic data processing in the Web. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 10/53
  • 11. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web... of Data The Web of Data [8] converts raw data into first-class citizens of the Web... Data entities are the atoms of the Web of Data. Each entity has its own identity. ...and uses existing infrastructure: It uses HTTP as communication protocol. Entities are named using URIs. The Web of Data is a cloud of data-to-data hyperlinks [5]: These are labelled hyperlinks in contrast to the “plain” ones used in the Web. Thus, hyperlinks also provide semantics to data descriptions. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 11/53
  • 12. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data The Web... of Data Linked Data builds a Web of Data using the Internet infrastructure: Data providers can publish their raw data in a standardized way. These data can be interconnected using labelled hyperlinks. The resulting cloud of data can be navigated using specific query languages. Linked Data achievements: Knowledge from different fields can be easily integrated and universally shared. Automatic processes can exploit these knowledge to build innovative software systems. Semantic Search Engine For instance, a semantic search engine would allow us for only retrieving entities which describe “malta” as a country but not as a cereal. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 12/53
  • 13. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data – Linked Data Principles – Tim Berners-Lee [2] suggests four basic principles for Linked Data: 1 Use URIs as names for things. 2 Use HTTP URIs so that people can look up those names. 3 When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). 4 Include links to other URIs, so that they can discover more things. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 13/53
  • 14. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data 1. URIs as names What is his name? For humans, his name is Clint Eastwood... ... but http://dataweb.infor.uva.es/movies/people/Clint Eastwood is a better name for machines. The use of URIs enables real-world entities (or their relationships with other entities) to be identifed at universal scale. This principle ensures any class of data has its own identity in the global space of the Web of Data. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 14/53
  • 15. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data 2. HTTP URIs All entities must be described using dereferenceable URIs: These URIs are accesible via HTTP. This principle exploits HTTP features to retrieve all data related to a given URI. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 15/53
  • 16. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data 3. Standards This principle states that all stakeholders “must speak the same languages” for effective understanding. RDF [10] provides a simple logical model for data description. SPARQL [12] describes a specific language for querying RDF data. Serialization formats, ontology languages, etc. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 16/53
  • 17. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data 4. Linking URIs This principle materializes the aim of data integration in Linked Data: Linking two URIs establishes a particular connection between two existing entities. Linking URIs http://dataweb.infor.uva.es/movies/people/Clint Eastwood names the entity which describes “Clint Eastwood”. http://dataweb.infor.uva.es/movies/film/Mystic River names the entity which describes the movie “Mystic River”. An hyperlink between these two URIs state that the entity “Clint Eastwood” is related to the entity “Mystic River”... how? The labelled link provides a semantic relationship between entities. In this case, http://dataweb.infor.uva.es/movies/property/director tags the “director” relationship between “Clint Eastwood” and “Mystic River”. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 17/53
  • 18. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data – Linked Open Data – The Linked Open Data (LOD) project1 promotes Linked Data to be published as Open Data: LOD is released under an open license which does not impede its reuse for free [2]. LOD is the highest-level in the 5-star scheme2 for Open Data publication. The dataset is available on the Web under an open license. The dataset is available as structured data. The dataset is encoded using a non-propietary format. The dataset names entities using URIs. The dataset is linked to other datasets. 1 http://linkeddata.org/; http://5stardata.info/ Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 18/53
  • 19. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data LOD (2007-2011) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 19/53
  • 20. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data LOD (2014) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 20/53
  • 21. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data LOD (2014) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 21/53
  • 22. Linked Data Semantic Technologies RDF Compression HDT What is Linked Data? Linked Data Principles Linked Open Data Current Statistics (July, 2015) 9,960 datasets are openly available2 : 90 billion statements from 3,308 datasets. 6,639 datasets could not be crawled for different reasons. LOD Laundromat4 provides access to more tha 38 billion statements from 650K “cleaned” datasets. DBpedia 2014 contains more than 3 billion statements: 538 million statements from English Wikipedia. 2.46 billion statements from other language editions. 50 million statements linking to external datasets. More and more datasets are released and these are getting bigger: The largest ones are in the order of hundreds of GB. 2 http://stats.lod2.eu/; http://lodlaundromat.org/ Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 22/53
  • 23. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL Outline 1 Linked Data 2 Semantic Technologies 3 RDF Compression 4 HDT Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 23/53
  • 24. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL – Overview – Semantic Technologies (in middle layers) exploit features from the Web infrastructure (low layers): RDF is used for resource description. RDFS is used for describing semantic vocabularies. OWL extends RDFS and is used for building ontologies. SPARQL is the query language for RDF data. RIF is used for describing rules. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 24/53
  • 25. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF & SPARQL RDF & SPARQL are the most relevant technologies for our current aims: Both standards are based on labelled directed graph features. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 25/53
  • 26. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL – RDF –    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //dataweb.infor.uva.es/movies/property/name Clint Eastwood    http : //dataweb.infor.uva.es/movies/film/Mystic River http : //dataweb.infor.uva.es/movies/property/title Mystic River    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //dataweb.infor.uva.es/movies/property/director http : //dataweb.infor.uva.es/movies/film/Mystic River RDF [10] is a framework for describing resources of any class: People, movies, cities, proteins, statistical data... Resources are described in the form of triples: Subject: the resource being described. Predicate: a property of that resource. Object: the value for the corresponding property. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 26/53
  • 27. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Triples    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //dataweb.infor.uva.es/movies/property/name Clint Eastwood    http : //dataweb.infor.uva.es/movies/film/Mystic River http : //dataweb.infor.uva.es/movies/property/title Mystic River    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //dataweb.infor.uva.es/movies/property/director http : //dataweb.infor.uva.es/movies/film/Mystic River An RDF triple is a labelled directed subgraph in which subject and object nodes are linked by a particular (predicate) edge: The subject node contains the URI which names the resource. The predicate edge labels the relationship using a URI whose semantics is described by any vocabulary/ontology. The object node may contain a URI or a (string) Literal value. RDF links (between entities) also take the form of RDF triples. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 27/53
  • 28. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Triples Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 28/53
  • 29. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Triples Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 29/53
  • 30. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Triples Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 30/53
  • 31. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Graph This graph view is only a mental model: RDF graphs must be serialized!! But the RDF Recommendation does not restrict the format to be used. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 31/53
  • 32. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL RDF Serialization Formats Traditional plain formats are commonly used: RDF/XML, NTriples, Turtle... These formats are very verbose in practice: Data are serialized in a (more or less) human-readable way. Large RDF files are finally compressed using gzip or bzip2. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 32/53
  • 33. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL – SPARQL – SPARQL [12] is a query language for RDF. It is based on graph pattern matching: Triple patterns are RDF triples in which subject, predicate and object may be variable. SPARQL supports more complex queries: joins, unions, filters... Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 33/53
  • 34. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL SPARQL Resolution Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 34/53
  • 35. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL SPARQL Resolution Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 35/53
  • 36. Linked Data Semantic Technologies RDF Compression HDT Overview RDF SPARQL SPARQL Resolution Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 36/53
  • 37. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression Outline 1 Linked Data 2 Semantic Technologies 3 RDF Compression 4 HDT Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 37/53
  • 38. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression What is the problem? RDF excels at logical level: Structured and semi-structured data can be described using RDF triples. Entities are also linked in the form of RDF triples. But it is a source of redundancy at physical level Serialization formats are highly verbose. RDF data are redundant at three levels: semantic, symbolic, and syntactic. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 38/53
  • 39. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression – Semantic Compression – Semantic redundancy occurs when the same meaning can be conveyed using less triples.    http : //dataweb.infor.uva.es/movies/property/name http : //www.w3.org/2000/01/rdf − schema#domain http : //dataweb.infor.uva.es/movies/classes/person    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //dataweb.infor.uva.es/movies/property/name Clint Eastwood    http : //dataweb.infor.uva.es/movies/people/Clint Eastwood http : //www.w3.org/1999/02/22 − rdf − syntax − ns#type http : //dataweb.infor.uva.es/movies/classes/person The third triple is redundant because the first one state that the URI http://dataweb.infor.uva.es/movies/people/Clint Eastwood describes an entity in the domain of http://dataweb.infor.uva.es/movies/classes/person. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 39/53
  • 40. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression Semantic Compression Semantic compressors perform at logical level: Detect redundant triples and remove them from the original dataset. Semantic compressors [9, 11, 13] are not so effective by themselves... ... but may be combined with symbolic and syntactic compressors! Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 40/53
  • 41. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression – Symbolic Compression – Symbolic redundancy is due to symbol repetitions in triples: This is the “traditional” source of redundancy removed by universal compressors. Symbolic redundancy in RDF is mainly due to URIs: URIs tend to be very large strings which share long prefixes. http://dataweb.infor.uva.es/movies/film/Bird http://dataweb.infor.uva.es/movies/film/Million Dollar Baby http://dataweb.infor.uva.es/movies/film/Mystic River http://dataweb.infor.uva.es/movies/people/Clint Eastwood ... ... but literals also contibute to this redundancy. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 41/53
  • 42. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression Symbolic Compression The most prominent RDF compressors remove symbolic redundancy: All different URIs/literals are indexed in a string dictionary. Each string is identified by a unique integer ID. - Triples are rewritten by replacing strings by their corresponding IDs. Symbolic is, in general, the most important redundancy in RDF and has (many) room for optimization. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 42/53
  • 43. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression – Syntactic Compression – Syntactic redundancy depends on the RDF graph serialization: For instance, a serialized subset of n triples (which describes the same resource) writes n times the subject value. It can be abbr. ... and also on the underlying graph structure: For instance, resources of the same classes are described using (almost) the same sub-graph structure. Syntactic compression also has (many) room for optimization. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 43/53
  • 44. Linked Data Semantic Technologies RDF Compression HDT Semantic Compression Symbolic Compression Syntactic Compression Syntactic Compression HDT [7], k2 -triples [1], or RDFCSA [4] are syntactic compressors reporting good numbers: They are combined with symbolic compression. In practice, they compress RDF triples in the form of ID triples. Semantic compressors such as SSP [11] also remove symbolic and syntactic redundancy. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 44/53
  • 45. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions Outline 1 Linked Data 2 Semantic Technologies 3 RDF Compression 4 HDT Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 45/53
  • 46. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions – What is HDT? – HDT was the first binary serialization format for RDF: It was acknowledged as W3C Member Submission [6] in 2011. It exploits symbolic and syntactic redundancy: It reduces up to 15 times the space used by traditional formats [7]. HDT is a core building block in some Linked Data applications: It reports good compression numbers, but also provides efficient data retrieval. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 46/53
  • 47. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions – Components – HDT encodes RDF data into three components: The Header (H) comprises descriptive metadata. The Dictionary (D) maps different strings (from nodes and edges) to IDs: It manages four independent mappings: subjects-objects, subjects, objects, and predicates. The Triples (T) component encodes the inner structure as a graph of IDs. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 47/53
  • 48. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions HDT Components The Dictionary is encoded using specific compression techniques for string dictionaries. Triple IDs are organized into a forest of trees (one per different subject)... ...which is encoded using two bitsequences and two ID sequences. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 48/53
  • 49. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions – Conclusions – HDT integrates RDF serialization and compression into a practical format: HDT saves space storage and enables efficient data parsing/retrieval using bit operations. Symbolic rendundancy is addressed by the Dictionary component: The collection of strings (in the dictionary) has high symbolic redundancy... The own dictionary is highly compressible! Syntactic rendundancy is removed by the Triples component: HDT triples is a straightforward compressor. Their effectiveness can be improved using optimized graph compression techniques. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 49/53
  • 50. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions Bibliography I [1] Sandra ´Alvarez-Garc´ıa, Nieves Brisaboa, Javier D. Fern´andez, Miguel A. Mart´ınez-Prieto, and Gonzalo Navarro. Compressed Vertical Partitioning for Efficient RDF Management. Knowledge and Information Systems (KAIS), 44(2):439–474, 2015. [2] Tim Berners-Lee. Linked Data, 2006. http://www.w3.org/DesignIssues/LinkedData.html. [3] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal of Semantic Web and Information Systems, 5(3):1–22, 2009. [4] Nieves Brisaboa, Ana Cerdeira, Antonio Fari˜na, and Gonzalo Navarro. A Compact RDF Store using Suffix Arrays. In Proceedings of SPIRE, 2015. To appear. [5] Javier D. Fern´andez, Mario Arias, Miguel A. Mart´ınez-Prieto, and Claudio Guti´errez. Management of Big Semantic Data. In Big Data Computing, chapter 4. Taylor and Francis/CRC, 2013. [6] Javier D. Fern´andez, Miguel A. Mart´ınez-Prieto, Claudio Guti´errez, and Axel Polleres. Binary RDF Representation for Publication and Exchange. W3C Member Submission, 2011. www.w3.org/Submission/HDT/. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 50/53
  • 51. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions Bibliography II [7] Javier D. Fern´andez, Miguel A. Mart´ınez-Prieto, Claudio Guti´errez, Axel Polleres, and Mario Arias. Binary RDF Representation for Publication and Exchange. Journal of Web Semantics, 19:22–41, 2013. [8] Tom Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 1 edition, 2011. http://linkeddatabook.com/. [9] Amit K. Joshi, Pascal Hitzler, and Guozhu Dong. Logical Linked Data Compression. In Proceedings of ESWC, pages 170–184, 2013. [10] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation, 2004. www.w3.org/TR/rdf-primer/. [11] Jeff Z. Pan, Jos´e Manuel G´omez-P´erez, Yuan Ren, Honghan Wu, and Man Zhu. SSP: Compressing RDF data by Summarisation, Serialisation and Predictive Encoding. Technical report, 2014. Available at http://www.kdrive-project.eu/wp-content/uploads/2014/06/WP3-TR2-2014 SSP.pdf. [12] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 51/53
  • 52. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions Bibliography III [13] Gayathri V. and P. Sreenivasa Kumar. Horn-Rule based Compression Technique for RDF Data. In Proceedings of SAC, pages 396–401, 2015. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 52/53
  • 53. Linked Data Semantic Technologies RDF Compression HDT Basics Components Conclusions This presentation has been made available only for learning/teaching purposes. The pictures used in the slides may be owned by other parties, so their property is exclusively of their authors. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Linked Data Compression 53/53
  • 54. Onto some basics of: compression, Compact Data Structures, and indexing 1st KEYSTONE Training School July 22th, 2015. Faculty of ICT, Malta Antonio Fariña Miguel A Martínez Prieto
  • 56. • Disks are cheap !! But they are also slow! – Compression can help more data to fit in main memory. (access to memory is around 106 times faster than HDD) • CPU speed is increasing faster – We can trade processing time (needed to uncompress data) by space. Introduction Why compression?
  • 57. • Compression does not only reduce space! – I/O access on disks and networks – Processing time* (less data has to be processed) • … If appropriate methods are used – For example: Allowing handling data compressed all the time. Introduction Why compression? Text collection (100%) Doc 1 Doc 2 Doc 3 Doc n Compressed Text collection (30%) Doc 1 Doc 2 Doc 3 Doc n Compressed Text collection (20%) P7zip, others Doc 1 Doc 2 Doc 3 Doc n Let’s search for “Malta"
  • 58. • Indexing permits sublinear search time Introduction Why indexing? Text collection (100%) Doc 1 Doc 2 Doc 3 Doc n Compressed Text collection (30%) Doc 1 Doc 2 Doc 3 Doc n Let’s search for “Malta" term 1 … Malta … term n (> 5-30%) Index
  • 59. • Self-indexes: – sublinear search time – Text implicitly kept Introduction Why Compact Data Structures? Text collection Doc 1 Doc 2 Doc 3 Doc n Let’s search for “Malta" term 1 … Malta … term n (> 5-30%) Index 0 0 0 01 1 0 1 0 1 0 10 0 1 0 Self-index (WT, WCSA,…) term 1 … Malta … term n
  • 61. Basic Compression • A compressor could use as a source alphabet: – A fixed number of symbols (statistical compressors) • 1 char, 1 word – A variable number of symbols (dictionary-based compressors) • 1st occ of ‘a’ encoded alone, 2nd occ encoded with next one ‘ax’ • Codes are built using symbols of an target alphabet: – Fixed length codes (1 bit, 10 bits, 1 byte, 2 bytes, …) – Variable length codes (1,2,3,4 bits/bytes …) • Classification (fixed-to-variable, variable-to-fixed,…) Modeling & Coding -- statistical Input alphabet dictionary var2var Target alphabet fixed var fixed var
  • 62. Basic Compression • Taxonomy – Dictionary based (gzip, compress, p7zip… ) – Grammar based (BPE, Repair) – Statistical compressors (Huffman, arithmetic, PPM,… ) • Statistical compressors – Gather the frequencies of the source symbols. – Assign shorter codewords to the most frequent symbols. Obtain compression Main families of compressors
  • 63. Basic Compression • How do they achieve compression – Assign fixed-length codewords to variable length symbols (text substrings) – The longer the replaced substring  the better compression • Well-known representatives: Lempel-Ziv family – LZ77 (1977): GZIP, PKZIP, ARJ, P7zip – LZ89 (1978) • LZW (1984): Compress, GIF images Dictionary-based compressors
  • 64. Basic Compression • Starts with an initial dictionary D (contains symbols in Σ) • For a given position of the text. – while D contains w, reads prefix w=w0 w1 w2 … – If w0 …wk wk+1 is not in D (w0 …wk does!) • output (i = entryPos(w0 …wk)) (Note: codeword = log2 (|D|)) • Add w0 …wk wk+1 to D • Continue from wk+1 on (included) • Dictionary has limited length? Policies: LRU, truncate& go, … LZW EXAMPLE
  • 65. Basic Compression • Starts with an initial dictionary D (contains symbols in Σ) • For a given position of the text. – while D contains w, reads prefix w=w0 w1 w2 … – If w0 …wk wk+1 is not in D (w0 …wk does!) • output (i = entryPos(w0 …wk)) (Note: codeword = log2 (|D|)) • Add w0 …wk wk+1 to D • Continue from wk+1 on (included) • Dictionary has limited length? Policies: LRU, truncate& go, … LZW EXAMPLE
  • 66. Basic Compression • Replaces pairs of symbols by a new one, until no pair repeats twice – Adds a rule to a Dictionary. Grammar Based – BPE - Repair A B C D E A B D E F D E D E F A B E C D A B C G A B G F G G F A B E C D H C G H G F G G F H E C D H C G H I G I H E C D DE G AB  H GF  I Source sequence Dictionary of Rules Final Repair Sequence
  • 67. Basic Compression • Assign shorter codewords to the most frequent symbols – Must gather symbol frequencies for each symbol c in Σ. – Compression is lower bounded by the (zero-order) empirical entropy of the sequence (S). • Most representative method: Huffman coding Statistical Compressors n= num of symbols nc= occs of symbol c H0(S) <= log (|Σ|) n H0(S) = lower bound of the size of S compressed with a zero-order compressor
  • 68. Basic Compression • Optimal prefix free coding – No codeword is a prefix of one another. • Decoding requires no look-ahead! – Asymptotically optimal: |Huffman(S)| <= n(H0(S)+1) • Typically using bit-wise codewords – Yet D-ary Huffman variants exist (D=256 byte-wise) • Builds a Huffman tree to generate codewords Statistical Compressors: Huffman coding
  • 69. Basic Compression • Sort symbols by frequency: S=ADBAAAABBBBCCCCDDEEE Statistical Compressors: Huffman coding
  • 70. Basic Compression • Bottom – Up tree construction Statistical Compressors: Huffman coding
  • 71. Basic Compression • Bottom – Up tree construction Statistical Compressors: Huffman coding
  • 72. Basic Compression • Bottom – Up tree construction Statistical Compressors: Huffman coding
  • 73. Basic Compression • Bottom – Up tree construction Statistical Compressors: Huffman coding
  • 74. Basic Compression • Bottom – Up tree construction Statistical Compressors: Huffman coding
  • 75. Basic Compression • Branch labeling Statistical Compressors: Huffman coding
  • 76. Basic Compression • Code assignment Statistical Compressors: Huffman coding
  • 77. Basic Compression • Compression of sequence S= ADB… • ADB…  01 000 10 … Statistical Compressors: Huffman coding
  • 78. Basic Compression • Given S= mississipii$, BWT(S) is obtained by: (1) creating a Matrix M with all circular permutations of S$, (2) sorting the rows of M, and (3) taking the last column. Burrows-Wheeler Transform (BWT) mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi sort L = BWT(S)F
  • 79. Basic Compression • Given L=BWT(S), we can recover S=BWT-1(L) Burrows-Wheeler Transform: reversible (BWT -1) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi LF 1 2 3 4 5 6 7 8 9 10 11 12 2 7 9 10 6 1 8 3 11 12 4 5 LF Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]=‘c’, and k= the number of times ‘c’ occurs in L[1..i], and j=position in F of the kth occurrence of ‘c’ Then set LF[i]=j Example: L[7] = ‘p’, it is the 2nd ‘p’ in L  LF[7] = 8 which is the 2nd occ of ‘p’ in F
  • 80. Basic Compression • Given L=BWT(S), we can recover S=BWT-1(L) Burrows-Wheeler Transform: reversible (BWT -1) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi LF 1 2 3 4 5 6 7 8 9 10 11 12 2 7 9 10 6 1 8 3 11 12 4 5 LF Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]=‘c’, and k= the number of times ‘c’ occurs in L[1..i], and j=position in F of the kth occurrence of ‘c’ Then set LF[i]=j Example: L[7] = ‘p’, it is the 2nd ‘p’ in L  LF[7] = 8 which is the 2nd occ of ‘p’ in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; In each step: S[n-i] = L[p]; p = LF[p]; i = i+1; - - - - - - - - - - - $ S
  • 81. Basic Compression • Given L=BWT(S), we can recover S=BWT-1(L) Burrows-Wheeler Transform: reversible (BWT -1) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi LF 1 2 3 4 5 6 7 8 9 10 11 12 2 7 9 10 6 1 8 3 11 12 4 5 LF Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]=‘c’, and k= the number of times ‘c’ occurs in L[1..i], and j=position in F of the kth occurrence of ‘c’ Then set LF[i]=j Example: L[7] = ‘p’, it is the 2nd ‘p’ in L  LF[7] = 8 which is the 2nd occ of ‘p’ in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=0: S[n-i] = L[p]; S[12]=‘$’ p = LF[p]; p = 1 i = i+1; i=1 - - - - - - - - - - - $ S
  • 82. Basic Compression • Given L=BWT(S), we can recover S=BWT-1(L) Burrows-Wheeler Transform: reversible (BWT -1) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi LF 1 2 3 4 5 6 7 8 9 10 11 12 2 7 9 10 6 1 8 3 11 12 4 5 LF Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]=‘c’, and k= the number of times ‘c’ occurs in L[1..i], and j=position in F of the kth occurrence of ‘c’ Then set LF[i]=j Example: L[7] = ‘p’, it is the 2nd ‘p’ in L  LF[7] = 8 which is the 2nd occ of ‘p’ in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]=‘i’ p = LF[p]; p = 2 i = i+1; i=2 - - - - - - - - - - i $ S
  • 83. Basic Compression • Given L=BWT(S), we can recover S=BWT-1(L) Burrows-Wheeler Transform: reversible (BWT -1) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi LF 1 2 3 4 5 6 7 8 9 10 11 12 2 7 9 10 6 1 8 3 11 12 4 5 LF Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]=‘c’, and k= the number of times ‘c’ occurs in L[1..i], and j=position in F of the kth occurrence of ‘c’ Then set LF[i]=j Example: L[7] = ‘p’, it is the 2nd ‘p’ in L  LF[7] = 8 which is the 2nd occ of ‘p’ in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]=‘i’ p = LF[p]; p = 2 i = i+1; i=2 m i s s i s s i p i i $ S
  • 84. Basic Compression • BWT. Many similar symbols appear adjacent • MTF. – Output the position o the current symbol within Σ ‘ – Keep the alphabet Σ ‘= {a,b,c,d,e,… } sorted so that the last used symbol is moved to the begining of Σ ‘ . • RLE. – If a value (0) appears several times (000000  6 times) – replace it by a pair <value,times>  <0,6> • Huffman stage. Bzip2: Burrows-Wheeler Transform (BWT) Why does it work? In a text it is likely that “he” is preceeded by “t”, “ssisii” by “i”, …
  • 86. Sequences • Given a Sequence of – n integers – m = maximum value • We can representing it with n ⌈log2(m+1)⌉ bits – 16 symbols x3 bits per symbol = 48 bits  array of 2 32-bit ints – Direct access (access to an integer + bit operations) Plain Representation of Data 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  • 87. Sequences • Is it compressible? • Ho(S) = 1.59 (bits per symbol) • Huffman: 1.62 bits per symbol 26 bits: No direct access! (but we could add sampling) Compresed Representation of Data (H0) Symbol 4 1 2 3 Occurrences (nc) 9 4 2 1 0 1 16 7 1 43 0 1 2 0 1 2 3 1 4 9 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 01 000 0011 1 1 1 01 1 000 1 01 01 1 1 1 5 10 15 20 25
  • 88. Sequences • Operations of interest: – Access(i) : Value of the ith symbol – Ranks(i) : Number of occs of symbol s up to position i (count) – Selects (i) : Where the ith occ of symbol s? (locate) Summary: Plain/compressed  acess/rank/select () 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100 1 4 5 10 13 16 19 22 25 28 31 34 37 40 43 46 1 01 000 0011 1 1 1 01 1 000 1 01 01 1 1 1 5 10 15 20 25
  • 90. Bit Sequences Rank1(6) = 3 Rank0(10) = 5 Access/rank/select on bitmaps 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 1415161718 19 20 21 B= select0(10) =15 access (19) = 0
  • 91. Bit Sequences • Bitmaps a basic part of most Compact Data Structures • Example: (We will see it later in the CSA) S: AAABBCCCCCCCCDDDEEEEEEEEEEFG  n log σ bits B: 1001010000000100100000000011  n bits D: ABCDEFG  σ log σ bits – Saves space: – Fast access/rank/select is of interest !! • Where is the 2nd C? • How many Cs up to position k? Applications
  • 92. Bit Sequences • Jacobson, Clark, Munro – Variant by Fariña et al. • Assuming 32 bit machine-word • Step 1: Split de Bitmap into superblocks of 256 bits, and store de number of 1s up to positions 1+256k – O(1) time to superblock. Space: n/256 superblock and 1 int each Reaching O(1) Rank y o(n) bits of extra space 0 1 0 ... 1 1 2 3 256 35 bits set to 1 1 ... 1 257 512 27 bits set to 1 350 1 2 Ds = 62 3 0 ... 1 513 768 45 bits set to 1 ... 97 3 ...
  • 93. Bit Sequences • Step 2: For each superblock of 256 bits – Divide it into 8 block of 32 bits each (machine word size) – Store the number of ones from the beginning of the superblock – O(1) time to the blocks, 8 blocks per superblock, 1 byte each Reaching O(1) Rank y o(n) bits of extra space 1 1 0 ... 1 1 2 3 256 35 bits set to 1 1 ... 0 257 512 27 bits set to 1 350 1 2 Ds = 62 3 0 ... 1 513 768 45 bits set to 1 ... 97 3 ... 1 1 0 ... 1 1 2 3 32 4 bits set to 1 0 ... 1 33 64 6 bits set to 1 ... 40 1 2 Db = 10 3 ... 1 ... 0 224 256 8 bits set to 1
  • 94. Bit Sequences • Step 3: Rank within a 32 bit block Finally solving: rank1( D , p ) = Ds[ p / 256 ] + Db[ p / 32 ] + rank1(blk, i) where i= p mod 32 – Ex: rank1(D,300) = 35 + 4 + 4 = 43 – Yet, how to compute rank1(blk, i) in constant time ? Reaching O(1) Rank y o(n) bits of extra space 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1blk = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
  • 95. Bit Sequences • how to compute rank1 (blk, i) in constant time ? – Option 1: popcount within a machine word – Option 2: Universal Table onesInByte (solution for each byte) Only 256 entries storing values [0..8] • Finally sum value onesInByte for the 4 bytes in blk • Overall space: 1.375 n bits Reaching O(1) Rank y o(n) bits of extra space 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1blk = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0blks = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 1 0 0 1 0 0 0 0 1 1 0 0 Shift 32 – 12 = 20 posicións Rank1(blk,12) Val binary OnesInByte 0 00000000 0 1 00000001 1 2 00000010 1 3 00000011 2 252 11111100 6 253 11111101 7 254 11111110 7 255 11111111 8 ... ... ...
  • 96. Bit Sequences select1(p) • In practice, binary search using rank – Binary search on superblocks O(log(n)) to find the superblock s containing the pth 1  retval = Ds[s] – Sequential search [uint <=256] within the in blocks until reaching the block d that contains the position  retval += Db[d] – Sequential search (1 byte at a time) within the last 32 bits, using onesInByte[] table until reaching the byte b that contains the position. • In each iteration: retval += onesInByte[b] – Table lookup over a new selb[] table over the last “byte” b • retval += selb[b] – Return retval Select in O(log (n)) with the same structures
  • 97. Bit Sequences • Compressed bitmap representations exist. – Compressed  [Raman et al] – For very sparse bitmaps [Okanohara and Sadakane] – … Compressed representations
  • 99. Integer Sequences Access/rank/select on general sequences Rank2(9) = 3 S= select4(3) =7 access (13) = 3 4 4 3 2 6 2 4 2 4 1 1 2 3 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  • 100. Integer Sequences • Grossi et al. • Given a sequence of symbols and an encoding – The bits of the code of each symbol are distributed along the different levels of the tree 00 01 00 10 11 00 A B A C D A C 0 0 0 0 10 1 1 0 1 A B A A C D C 0 1 0 10 0 1 0 Wavelet tree (construction) DATA SYMBOL CODE WAVELET TREEA B A C D A C C D 00 01 10 11 B A
  • 101. • Searching for the 1st occurrence of ‘D’? Integer Sequences DATA SYMBOL CODE WAVELET TREEA B A C D A C C D 00 01 10 11 B A A B A C D A C 0 0 0 01 1 0 1 A B A A C D C 0 1 0 10 0 it is the 2nd bit in B1 Where is the 2nd ‘1’?  at pos 5. 0 1 Where is the 1st ‘1’?  at pos 2. Wavelet tree (select) Broot B0 B1
  • 102. Integer Sequences • Recovering Data: extracting the next symbol – Which symbol appears in the 6th position? A B A C D A C 0 0 0 01 1 0 1 A B A A C D C 0 1 0 10 0 Which bit occus at position 4 in B0? How many ‘0’s are there up to pos 6? it is the 4th ‘0’ 0 1 It is set to 0 The codeword read is ’00’ A Wavelet tree (access) DATA SYMBOL CODE WAVELET TREEA B A C D A C C D 00 01 10 11 B A Broot B0 B1 Broot B0 B1 Broot B0
  • 103. Integer Sequences • Recovering Data: extracting the next symbol – Which symbol appears in the 7th position? A B A C D A C 0 0 0 01 1 0 1 A B A A C D C 0 1 0 10 0 Which bit occurs at position 3 in B1? How many ‘1’s are there up to pos 7? it is the 3rd ‘1’ 0 1 It is set to 0 The codeword read is ’10’  C TEXT SYMBOL CODE WAVELET TREEA B A C D A C C D 00 01 10 11 B A Wavelet tree (access) B1 Broot B0
  • 104. Integer Sequences • How many C’s up to position 7? A B A C D A C 0 0 0 01 1 0 1 A B A A C D C 0 1 0 10 0 How many 0s up to position 3 in B1? How many ‘1’s are there up to pos 7? it is the 3rd ‘1’ 0 1 2 !! TEXT SYMBOL CODE WAVELET TREEA B A C D A C C D 00 01 10 11 B A Wavelet tree (Rank) B1 Broot B0 Select (locate symbol) Access and Rank:
  • 105. Integer Sequences • Each level contains n + o(n) bits • Rank/select/access expected O(log σ) time A B A C D A C 0 0 0 01 1 0 1 A B A A C D C 0 1 0 10 0 1 0 Wavelet tree (Space and times) WAVELET TREE 00 01 00 10 11 00 10 DATA SYMBOL CODE A B A C D A C C D 00 01 10 11 B A n + o(n) bits n + o(n) bits n ⌈log σ⌉ (1 + o(1)) bits
  • 106. Integer Sequences • Using Huffman coding (or others)  umbalanced • Rank/select/access  O(nH0(S)) time Huffman-shaped (or others) Wavelet tree A B A C D A C 1 0 1 10 0 0 1 B C D C A A A 0 1 0 0 0 WAVELET TREE 1 000 1 01 001 1 01 DATA SYMBOL CODE A B A C D A C C D 1 000 01 001 B A nH0(S) + o(n) bits 0 1 B D C C 1 0
  • 108. A brief Review about Indexing • Traditional indexes (with or without compression) – Inverted Indexes, Suffix Arrays,... • Compressed Self-indexes – Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, … Text Indexing: Well-known structures from The Web implicit text auxiliar structure explicit text
  • 109. A brief Review about Indexing Inverted indexes Space-time trade-off DCC communications compression image data information Cliff Logde 0 142 104 165 341 506368 219 445 DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, etc.), but also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI). 99 207 336 128 395 19 25 Vocabulary Posting Lists Indexed text Searches Word  posting of that word Phrase  intersection of postings Block1Block2 Compression - Indexed text (Huffman,...) - Posting lists (Rice,...) 1 1 2 2 1 2 1 2 1 2 1 1 DCC communications compression image data information Cliff Lodge Vocabulary Posting Lists Full-positional information Block-addressing inverted index
  • 110. A brief Review about Indexing • Lists contain increasing integers • Gaps between integers are smaller in the longest lists Inverted indexes 4 10 15 25 29 40 46 54 57 70 79 82Posting list original 1 2 3 4 5 6 7 8 9 10 11 12 4 6 5 10 4 11 6 8 3 13 9 3Diferenc. 4 c6 c5 c10 29 c11 c6 c8 57 c13 c9 c3 Sampling absoluto + codif long. variable  Acceso directo Descompresión parcial c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3Codif long. variable Descompresión completa
  • 111. A brief Review about Indexing • Sorting all the suffix of T lexicographically Suffix Arrays a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A = abracadabra$ acadabra$ $ a$ adabra$ bra$ bracadabra$ cadabra$ abra$ dabra$ ra$ racadabra$
  • 112. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A = P = a b
  • 113. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A =
  • 114. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A =
  • 115. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A =
  • 116. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A =
  • 117. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A =
  • 118. A brief Review about Indexing • Binary search for any pattern: “ab” Suffix Arrays P = a b a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 T = 12 11 8 1 4 6 9 2 5 7 10 3 1 2 3 4 5 6 7 8 9 10 11 12 A = locations Noccs = (4-3)+1 Occs = A[3] .. A[4] = { 8, 1} Fast space O(m lg n) O(4n) O(m lg n + noccs) + T
  • 119. Basic Compression • BWT(S) + other structures  it is an index BWT  FM-index • C[c] : for each char c in Σ , stores the number of occs in S of the chars that are lexicographically smaller than c. C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8 • OCC(c, k): Number of occs of char c the prefix of L: L (1, k) For k in [1..12] Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1 Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4 Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1 Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2 Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4 • Char L[i] occurs in F at position LF(i): LF(i) = C[L[i]] + Occ(L[i],i)
  • 120. Basic Compression • Count (S[1,u], P[1,p]) BWT  FM-index C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8 Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1 Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4 Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1 Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2 Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
  • 121. Basic Compression • Representing L with a wavelet tree BWT  FM-index
  • 122. Bibliography 1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994. http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/. 2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, pages 176–187, 2008. 3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001. 4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552- 581, 2005. 5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994 6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages 368–373, 2006. 7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.
  • 123. Bibliography 8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098-1101, 1952 9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000 10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp., 22(5):935–948, 1993 11. Alistair Moffat, Andrew Turpin: Compression and Coding Algorithms .Kluwer 2002, ISBN 0-7923- 7668-4 12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996. 13. Gonzalo Navarro , Veli Mäkinen, Compressed full-text indexes, ACM Computing Surveys (CSUR), v.39 n.1, p.2-es, 2007 14. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007. 15. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
  • 124. Bibliography 16. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000. 17. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999. 18. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3, 337–343. 19. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5, 530–536.
  • 125. Onto some basics of: compression, Compact Data Structures, and indexing 1st KEYSTONE Training School July 22th, 2015. Faculty of ICT, Malta Antonio Fariña Miguel A Martínez Prieto
  • 126. Introduction Compressed String Dictionaries Experimental Evaluation Dictionary Compression Miguel A. Mart´ınez-Prieto Antonio Fari˜na Univ. of Valladolid (Spain) Univ. of A Coru˜na (Spain) migumar2@infor.uva.es fari@udc.es Keyword search over Big Data. – 1st KEYSTONE Training School –. July 22nd, 2015. Faculty of ICT, Malta. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 1/47
  • 127. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Outline 1 Introduction 2 Compressed String Dictionaries 3 Experimental Evaluation Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 2/47
  • 128. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries – What is a String Dictionary – String Dictionary A string dictionary is a serializable data structure which organizes all different strings (vocabulary) used in a dataset. The vocabulary of a natural language text (lexicon) comprises all different words used in it. T= “la tarara s´ı la tarara no la tarara ni~na que la he visto yo” V= {he, la, ni~na, no, que, s´ı, tarara, visto, yo} Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 3/47
  • 129. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries What is a String Dictionary? The dictionary implements a bijective function that maps strings to identifiers (IDs, generally integer values) and back. It must provide, at least, two complementary operations: string-to-ID: locates the ID for a given string. ID-to-string: extracts the string identified by a given ID. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 4/47
• 130. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries What is a String Dictionary? String dictionaries are a simple and effective tool: Enable replacing (long, variable-length) strings by simple numbers (their IDs). T= “la tarara sí la tarara no la tarara niña que la he visto yo” T’= 2 7 6 2 7 4 2 7 3 5 2 1 8 9 The resulting IDs are more compact to represent and easier and more efficient to handle: T= 59 chars × 1 byte/char = 59 bytes T’= 14 IDs × ⌈log2(9)⌉ = 4 bits/ID = 56 bits = 7 bytes (plus the cost of dictionary encoding) A compact dictionary which provides efficient mapping between strings and IDs saves storage space, and processing/transmission costs, in data-intensive applications. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 5/47
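To make this encoding concrete, here is a minimal sketch (mine, not from the slides; the helper name build_dictionary is illustrative) that assigns lexicographic IDs to the vocabulary and re-encodes the text:

    from math import ceil, log2

    def build_dictionary(text):
        # map each distinct word to an integer ID, in lexicographic order
        vocab = sorted(set(text.split()))
        return {w: i + 1 for i, w in enumerate(vocab)}

    text = "la tarara sí la tarara no la tarara niña que la he visto yo"
    d = build_dictionary(text)              # {'he': 1, 'la': 2, 'niña': 3, ...}
    ids = [d[w] for w in text.split()]      # T' = [2, 7, 6, 2, 7, 4, ...]

    bits_per_id = ceil(log2(len(d)))        # ceil(log2(9)) = 4 bits per ID
    print(len(ids) * bits_per_id / 8)       # 14 IDs x 4 bits = 56 bits = 7 bytes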
• 131. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Compressing String Dictionaries The growing volume of datasets has led to increasingly large dictionaries: The dictionary size is a bottleneck for applications running under main-memory restrictions. Dictionary management is becoming a scalability issue in itself. Dictionary compression aims to achieve competitive space/time tradeoffs: Compact serialization. Small memory footprint. Efficient query resolution. We focus on static dictionaries, which do not change during execution: Many applications use dictionaries that either are static or are rebuilt only sparingly. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 6/47
• 132. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries – Operations – A string dictionary is a data structure that represents a sequence of n distinct strings, D = s1, s2, . . . , sn. It provides a mapping between ID numbers i and strings si: - locate(p) = i if p = si for some i ∈ [1, n], and 0 otherwise. - extract(i) returns the string si, for i ∈ [1, n]. Some other operations can be useful in specific applications: Prefix-based locate / extract operations. Substring-based locate / extract operations. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 7/47
  • 133. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Prefix-based Operations - locatePrefix(p) = {i, ∃y, si = py}. This result set is a contiguous ID range for lexicographically sorted dictionaries. - extractPrefix(p) = {si , ∃y, si = py}. It is equivalent to composing locatePrefix(p) with individual extract(i) operations. Finding all URIs in a given domain is an example of prefix-based operation: Look for all properties used in http://dataweb.infor.uva.es/movies: http://dataweb.infor.uva.es/movies/property/director (4). http://dataweb.infor.uva.es/movies/property/name (7). http://dataweb.infor.uva.es/movies/property/title (12). ... Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 8/47
• 134. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Substring-based Operations - locateSubstring(p) = {i, ∃x, y, si = xpy}. It is very similar to the problem solved by full-text indexes. - extractSubstring(p) = {si , ∃x, y, si = xpy}. It is equivalent to composing locateSubstring(p) with individual extract(i) operations. Both operations may return duplicate results, which must be removed before reporting the ID result set. regex query resolution in SPARQL is an example of a substring-based operation: Look for all literals containing the substring Eastwood: ‘‘Clint Eastwood’’ (2544). ‘‘Jayne Eastwood is a Canadian actress...’’ (10584). ‘‘Kyle Eastwood’’ (13847). ... Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 9/47
• 135. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Summary - locate(“tarara”) = 7 - extract(2) = la - locatePrefix(“n”) = 3,4 - extractPrefix(“n”) = niña, no - locateSubstring(“a”) = 2,3,7 - extractSubstring(“a”) = la, niña, tarara Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 10/47
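These operations are easy to prototype over a plain sorted vocabulary. The sketch below is illustrative only (it is not any of the compressed structures discussed next) and reproduces the answers of this summary:

    import bisect

    class SortedDict:
        """Toy string dictionary: IDs are 1-based lexicographic positions."""
        def __init__(self, words):
            self.v = sorted(set(words))

        def locate(self, p):                    # string-to-ID, 0 if absent
            i = bisect.bisect_left(self.v, p)
            return i + 1 if i < len(self.v) and self.v[i] == p else 0

        def extract(self, i):                   # ID-to-string
            return self.v[i - 1]

        def locatePrefix(self, p):              # contiguous range of IDs
            lo = bisect.bisect_left(self.v, p)
            hi = bisect.bisect_left(self.v, p + "\uffff")
            return list(range(lo + 1, hi + 1))

        def locateSubstring(self, p):           # linear scan; real systems self-index
            return [i + 1 for i, s in enumerate(self.v) if p in s]

    d = SortedDict("la tarara sí la tarara no la tarara niña que la he visto yo".split())
    print(d.locate("tarara"), d.extract(2))     # 7 la
    print(d.locatePrefix("n"))                  # [3, 4]
    print(d.locateSubstring("a"))               # [2, 3, 7]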
  • 136. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries – RDF Dictionaries – An RDF dictionary comprises all different terms used in the dataset: RDF terms are drawn from three disjoint vocabularies: URIs, Literals, and blank nodes. Serialized (uncompressed) RDF vocabularies need up to 3 times more space than (uncompressed) ID-triples [13]. URIs and Literals should be compressed and managed independently: Their structure is very different and they are queried in a different way. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 11/47
  • 137. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries URIs URIs are medium-size strings sharing long prefixes: Compressed dictionaries for URIs must exploit the continuous repetition of such prefixes. Prefix-based compression. locate operations are common when the dictionary is used for lookup purposes (e.g. RDF stores, semantic search engines, etc.). extract operations are common when the dictionary is used for data access purposes (e.g. decompression, result retrieval, etc.). locatePrefix and extractPrefix are also useful for URI dictionaries. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 12/47
• 138. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Literals Literals tend to be long strings with no predictable features: The name “Clint Eastwood”. The genome from an individual of any species. The full text from “El Quijote” ... Literal dictionaries must be based on universal compression. locate and extract are used like in URI dictionaries. locateSubstring and extractSubstring are useful to support SPARQL regex queries. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 13/47
• 139. Introduction Compressed String Dictionaries Experimental Evaluation What is a String Dictionary? Operations RDF Dictionaries Practical Configuration A role-based partition is first performed: Subjects are encoded in the range [1,|S|]. Predicates are encoded in the range [1,|P|]. Objects are encoded in the range [1,|O|]. URIs playing as subject and object are encoded once: IDs in [1,|SO|] encode subjects and objects. Subjects are encoded in [|SO|+1,|S|]. Objects are encoded using two dictionaries: 1 [|SO|+1,|Ox|] encodes URIs that appear only as objects. 2 [|Ox|+1,|O|] encodes Literals. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 14/47
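A hedged sketch of this role-based partition follows; function and variable names are illustrative, and literals are assumed to be distinguishable from URIs (here, by a leading double quote):

    def assign_ids(triples):
        """Role-based ID assignment: SO terms share the lowest IDs."""
        S = {s for s, p, o in triples}
        O = {o for s, p, o in triples}
        P = sorted({p for s, p, o in triples})
        SO     = sorted(S & O)                               # terms acting as both
        S_only = sorted(S - O)
        O_uri  = sorted(t for t in O - S if not t.startswith('"'))
        O_lit  = sorted(t for t in O - S if t.startswith('"'))  # assumed quoted
        sid = {t: i + 1 for i, t in enumerate(SO + S_only)}  # [1,|SO|], then up to |S|
        oid = {t: i + 1 for i, t in enumerate(SO + O_uri + O_lit)}
        pid = {t: i + 1 for i, t in enumerate(P)}            # [1,|P|]
        return sid, pid, oid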
  • 140. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Outline 1 Introduction 2 Compressed String Dictionaries 3 Experimental Evaluation Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 15/47
• 141. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Compressed String Dictionaries All the dictionaries reviewed here combine notions from universal compression and compact data structures. Universal compressors must enable fast decompression and comparison of individual strings: Huffman [8] and Hu-Tucker [7, 9] codes. Re-Pair [10]. The serialized vocabulary Tdict concatenates all strings in lexicographic order: A special symbol $ is used as separator. T =“alabar a la alabada alabarda” Tdict = a$alabada$alabar$alabarda$la$ Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 16/47
  • 142. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries – Front-Coding – Front-Coding [15] is a folklore compression technique for lexicographically sorted dictionaries. It exploits the fact that consecutive entries are likely to share a common prefix: Each entry in the dictionary is differentially encoded with respect to the preceding one. It needs two values: × An integer encoding the length of the shared prefix. × The remaining characters of the current entry. a$alabada$alabar$alabarda$la$ → (0,a$); (1,labada$); (5, r$); (6, da$); (0, la$) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 17/47
  • 143. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Front-Coding The vocabulary is divided into buckets of b strings: The first string of each bucket (header) is explicitly stored. The remaining b − 1 internal strings are differentially encoded. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 18/47
• 144. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Front-Coding Operations locate(p): 1 Headers are binary searched until finding the bucket Bx where p must lie: If the header is p, locate(p) = (b × (Bx − 1)) + 1. 2 The internal strings are sequentially decoded: If p is found as the ith string of the bucket, locate(p) = (b × (Bx − 1)) + i. If the bucket is fully decoded with no result, p is not in the dictionary. extract(i): 1 The string is encoded in the bucket Bx = ⌈i/b⌉. 2 ((i − 1) mod b) internal strings are decoded to obtain the answer. Prefix-based operations exploit the lexicographic order: Their results are contiguous ranges in the dictionary. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 19/47
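The sketch below implements bucketed Front-Coding and the two operations just described (a simplified PFC-like layout without VByte; all names are mine):

    import bisect

    def lcp(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def fc_encode(strings, b):
        """Buckets of b strings: explicit header + (lcp, remaining suffix) pairs."""
        buckets = []
        for i in range(0, len(strings), b):
            chunk = strings[i:i + b]
            internals = []
            for prev, cur in zip(chunk, chunk[1:]):
                l = lcp(prev, cur)
                internals.append((l, cur[l:]))     # differential encoding
            buckets.append((chunk[0], internals))  # header stored explicitly
        return buckets

    def fc_locate(buckets, b, p):
        x = bisect.bisect_right([h for h, _ in buckets], p) - 1  # search headers
        if x < 0:
            return 0
        s, internals = buckets[x]
        if s == p:
            return b * x + 1
        for i, (l, suffix) in enumerate(internals, start=2):     # decode in order
            s = s[:l] + suffix
            if s == p:
                return b * x + i
        return 0

    def fc_extract(buckets, b, i):
        s, internals = buckets[(i - 1) // b]
        for l, suffix in internals[:(i - 1) % b]:
            s = s[:l] + suffix
        return s

    strs = ["a", "alabada", "alabar", "alabarda", "la"]
    B = fc_encode(strs, 4)   # [('a', [(1,'labada'), (5,'r'), (6,'da')]), ('la', [])]
    print(fc_locate(B, 4, "alabar"), fc_extract(B, 4, 4))        # 3 alabarda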
  • 145. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Plain Front-Coding (PFC) PFC is a straightforward byte-oriented Front-Coding implementation: It uses VByte [14] to encode the length of the common prefix. The remaining string is encoded with one byte per character, plus the terminator $. PFC is serialized as a byte array (Tpfc ) and a ptrs structure: Both structures are directly mapped to main memory for data retrieval purposes. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 20/47
• 146. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Hu-Tucker Front-Coding (HTFC) HTFC is algorithmically similar to PFC, but it takes advantage of the Tpfc redundancy to achieve a more compressed representation: Operations are slightly slower than for PFC. Headers are encoded using Hu-Tucker: It allows compressed headers to be directly compared with the query pattern. Internal strings are encoded using Huffman or Re-Pair compression. HTFC is serialized as a bit array (Thtfc ) and also a ptrs structure: Pointers in HTFC use fewer bits because Thtfc is smaller than Tpfc . Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 21/47
• 147. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries – Hashing – Hashing [3] is a folklore method to implement dictionaries: A hash function transforms the string into an index x in the hash table. A collision arises when two different strings are mapped to the same cell in the table. String dictionaries perform better with closed hashing [2]: If the corresponding cell is not empty, one successively probes other cells until finding a free cell. The next cell to be probed is determined using double hashing. Hash dictionaries provide very efficient locate, may support extract, but the table size dissuades their use for managing large vocabularies. Compressed hash dictionaries focus on compacting the table, as well as the vocabulary itself: The vocabulary can be effectively compressed using Huffman or Re-Pair. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 22/47
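The probing scheme can be sketched as follows (an illustrative toy, not the compressed layout of the next slides; the two hash functions are arbitrary choices):

    class DoubleHashTable:
        """Closed hashing with double hashing; the table is assumed never full."""
        def __init__(self, m=11):              # m prime, so probes visit every cell
            self.m = m
            self.cells = [None] * m

        def _h1(self, s):
            return hash(s) % self.m

        def _h2(self, s):                      # probe step; must never be 0
            return 1 + (hash(s[::-1]) % (self.m - 1))

        def insert(self, s, value):
            x, step = self._h1(s), self._h2(s)
            while self.cells[x] is not None and self.cells[x][0] != s:
                x = (x + step) % self.m        # collision: probe the next cell
            self.cells[x] = (s, value)

        def locate(self, s):
            x, step = self._h1(s), self._h2(s)
            while self.cells[x] is not None:
                if self.cells[x][0] == s:
                    return self.cells[x][1]
                x = (x + step) % self.m
            return 0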
  • 148. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Vocabulary Compression Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 23/47
  • 149. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Table Compression (I) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 24/47
  • 150. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Table Compression (II) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 25/47
  • 151. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Improving Data Access Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 26/47
• 152. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Hashing Operations (locate) locate(p): 1 The pattern p is compressed using Huffman: cp. 2 cp is “hashed” to a position x in the (original) hash table. 3 x is mapped to its corresponding position y in the compressed representation. 4 The string pointed to by y is decompressed and compared to p. locate(“alabada”) 1 Huffman(“alabada$”)=cp 2 hash(cp)=5 3 if B[5] = 1, rank1(B, 5)=4 if B[5] = 0, “alabada” is not in D. 4 strcmp(DAC[4],cp)=true → 4 strcmp(DAC[4],cp)=false → collision Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 27/47
• 153. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries Hashing Operations (extract) extract(i): 1 The string is directly extracted from DAC[i]. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 28/47
• 154. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries – Self-Indexed Dictionaries – A self-index stores the original text T and provides indexed searches on it, using space proportional to the statistical entropy of T. Self-indexes support two operations: locate(p), returns all the positions in T where p occurs. extract(i, j), retrieves the substring T [i, j]. A string dictionary can be easily self-indexed: The corresponding self-index is built on the text Tdict . The dictionary primitives (and also prefix- and substring-based queries) are implemented using the self-index operations. We choose the FM-Index [4, 5] because it is the most space-efficient self-index in practice: A $ symbol is prepended to the original Tdict . The BWT (L) is represented with a wavelet tree (“plain” [5] or “compressed” [11]). C is a simple array. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 29/47
  • 155. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries FM-Index Dictionary Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 30/47
• 156. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries FM-Index Dictionary (locate) The ith string is encoded between the ith and the (i + 1)th $. locate(p) performs a backwards search of $p$: The pattern is searched from right to left until reaching the corresponding $. locate(p) takes O(|p| log σ) time. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 31/47
• 157. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries FM-Index Dictionary (locate) locate(’la’): Looking for $la$. 1. Range: [C($),C(a)-1]=[0,5]. Count the number of a before the range: occs0=ranka(L, 0) = 0 Count the number of a to the end of the range: occs1=ranka(L, 5) = 4 2. Range: [C(a)+occs0,C(a)+occs1-1]=[6,9]. Count the number of l before the range: occs0=rankl (L, 6) = 0 Count the number of l to the end of the range: occs1=rankl (L, 9) = 1 3. Range: [C(l)+occs0,C(l)+occs1-1]=[24,25]. Count the number of $ before the range: occs0=rank$(L, 24) = 5 Count the number of $ to the end of the range: occs1=rank$(L, 25) = 6 4. Range: [C($)+occs0,C($)+occs1-1]=[5,5]. ’la’ is identified by 5. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 32/47
• 158. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries FM-Index Dictionary (extract) extract(i) retrieves symbols from the (i + 1)-th $ back to the i-th $: It takes O(|si | log σ) time. extract(5): 1. The search process starts from Position: 0. Extract the symbol in this position: access(L, 0) =a Count the number of as up to the position: occs=ranka(L, 0) = 1 2. Position: C(a) + 1 − 1 = 6. Extract the symbol in this position: access(L, 6) =l Count the number of ls up to the position: occs=rankl (L, 6) = 1 3. Position: C(l) + 1 − 1 = 24. Extract the symbol in this position: access(L, 24) =$ The 5-th string is la. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 33/47
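Both walkthroughs follow standard FM-Index backward-search mechanics. The sketch below reproduces them over Tdict (rank is an O(n) scan purely for clarity; the slides use a wavelet tree, and row numbering conventions may differ from the figures):

    T = "$a$alabada$alabar$alabarda$la$"          # $ prepended to Tdict
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])   # sorted rotations
    L = "".join(T[(i - 1) % n] for i in rows)              # BWT (last column)
    C = {c: sum(x < c for x in T) for c in set(T)}         # symbols smaller than c

    def rank(c, i):                      # occurrences of c in L[0..i-1]
        return L[:i].count(c)

    def backward_search(p):              # interval of rows prefixed by p
        sp, ep = 0, n
        for c in reversed(p):
            sp = C[c] + rank(c, sp)
            ep = C[c] + rank(c, ep)
            if sp >= ep:
                return None
        return sp, ep

    print(backward_search("$la$"))       # a singleton interval: 'la' is in D
    # extract walks the LF-mapping backwards from the delimiter, as in the slide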
• 159. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries FM-Index Dictionary (prefix & substring operations) locatePrefix(p) is similar to locate: It looks for $p and finds the area [sp,ep] where all strings si that start with p are encoded. Substring-based operations generalize prefix-based ones: locateSubstring(p) looks for p to obtain the area [sp,ep] containing all strings si that contain p. For each match, the backwards search continues until determining the corresponding ID (using a sampling structure). Duplicate IDs are finally removed. extractPrefix(p) and extractSubstring(p) perform extract operations in the corresponding ranges. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 34/47
• 160. Introduction Compressed String Dictionaries Experimental Evaluation Front-Coding Hashing Self-Indexed Dictionaries Other Dictionaries – Other Dictionaries (Tries) – Tries [9] are tree-shaped structures which perform efficiently for dictionary purposes: Strings are located from the root to the leaves. IDs are extracted from the corresponding leaf up to the root. Plain tries use too much space for managing large dictionaries. Some compressed trie-based dictionaries exist in the state of the art: Compressed tries based on path decomposition [6]. LZ-compressed tries [1]. Self-indexed tries (XBW) [2]. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 35/47
  • 161. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Outline 1 Introduction 2 Compressed String Dictionaries 3 Experimental Evaluation Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 36/47
• 162. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Experimental Setup Two real-world RDF dictionaries: 26,948,638 URIs from Uniprot: Average length: 51.04 chars per URI. Highly repetitive. 27,592,013 Literals from DBpedia: Average length: 60.45 chars per Literal. We analyze compression effectiveness and retrieval speed: locate, extract. Prefix-based operations (URIs). Substring-based operations (Literals). In practice, extract is the most important query: It is used many times as results are retrieved from the compressed dataset. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 37/47
  • 163. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions – URIs – Compressed tries (LexRP and CentRP) obtain the best compression results and report better numbers for locate: ≈ 4.5 % of the original space. ≈ 2 − 3µs/string. > 2µs/ID. HTFC uses slightly more space, but it is faster for extract: ≈ 5 − 13 % of the original space. ≈ 2.2-3 µs/string. ≈ 0.7-1.6 µs/ID. The best tradeoff is for PFC: ≈ 9 − 19 % of the original space. ≈ 1.6 µs/string. ≈ 0.3-0.6 µs/ID. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 38/47
  • 164. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Prefix-based Operations PFC is the best choice for prefix-based operations: Although it uses more space, it reports the best performance. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 39/47
  • 165. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions – Literals – Compressed tries (LexRP and CentRP) obtain better compression results and report better numbers for locate: ≈ 12 % of the original space. ≈ 2-2.5 µs/string. > 2.5 µs/ID. HTFC reports the best compression ratios, but its performance is less competitive: ≈ 9 − 17 % of the original space. ≈ 4.5-40 µs/string. ≈ 3 − 20µs/ID. The best tradeoff is for Hash: ≈ 15 % of the original space. ≈ 1.5 µs/string. ≈ 1µs/ID. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 40/47
  • 166. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Substring-based Operations Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 41/47
  • 167. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions – Conclusions – RDF dictionaries are highly compressible: URIs are very redundant and Literals also show non-negligible symbolic redundancy. This redundancy can be detected and removed within specific data structures for dictionaries: Structures for URIs use up to 20 times less space than the original dictionaries. For Literals, the corresponding structures use 6 − 8 times less space than the original dictionaries. All these structures report data retrieval performance at microsecond level: This functionality includes both simple and advanced operations. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 42/47
  • 168. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions GitHub All dictionaries explained in this lecture (and some more [12]) are available in the libCSD C++ library: https://github.com/migumar2/libCSD Beta version: suggestions are accepted ;) Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 43/47
• 169. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Bibliography I [1] Julian Arz and Johannes Fischer. LZ-compressed string dictionaries. In Proceedings of DCC, pages 322–331, 2014. [2] Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Compressed string dictionaries. In Proceedings of SEA, pages 136–147, 2011. [3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd edition, 2001. [4] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. [5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007. [6] Roberto Grossi and Giuseppe Ottaviano. Fast Compressed Tries through Path Decompositions. In Proceedings of ALENEX, pages 65–74, 2012. [7] T.C. Hu and Alan C. Tucker. Optimal Computer-Search Trees and Variable-Length Alphabetic Codes. SIAM Journal of Applied Mathematics, 21:514–532, 1971. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 44/47
• 170. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Bibliography II [8] David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952. [9] Donald E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison Wesley, 1973. [10] N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. Proceedings of the IEEE, 88:1722–1732, 2000. [11] Veli Mäkinen and Gonzalo Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms, 4(3):article 32, 2008. [12] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro. Practical compressed string dictionaries. Information Systems, 2015. Under review. [13] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas. Querying RDF Dictionaries in Compressed Space. SIGAPP Applied Computing Review, 12(2):64–77, 2012. [14] Hugh E. Williams and Justin Zobel. Compressing integers for fast file access. The Computer Journal, 42:193–201, 1999. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 45/47
• 171. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions Bibliography III [15] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 46/47
• 172. Introduction Compressed String Dictionaries Experimental Evaluation URIs Literals Conclusions This presentation has been made only for learning/teaching purposes. The pictures used in the slides may be owned by other parties, so they remain the exclusive property of their authors. Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 47/47
• 173. Triples Compression and Indexing 1st KEYSTONE Training School July 22nd, 2015. Faculty of ICT, Malta Antonio Fariña Miguel A. Martínez-Prieto
  • 174. Outline RDF management overview K2-Tree structure K2-Triples Compressed Suffix Array (CSA-Sad) RDF-CSA Experiments
• 175. RDF management Overview: Dictionary + ID-triples
[Figure: RDF graph with nodes UK, London, M.Lalmas, R.Raman, A.Gionis, inv-speaker, Finland and SPIRE, connected by edges held on, capital of, lives in, position, attends and works in]
Original triples: (SPIRE, held on, London), (London, capital of, UK), (A.Gionis, attends, SPIRE), (R.Raman, attends, SPIRE), (M.Lalmas, attends, SPIRE), (M.Lalmas, lives in, UK), (M.Lalmas, works in, London), (A.Gionis, lives in, Finland), (R.Raman, lives in, UK), (R.Raman, position, inv-speaker)
Dictionary encoding: SO = {London=1, SPIRE=2}; S = {A.Gionis=3, M.Lalmas=4, R.Raman=5}; O = {Finland=3, inv-speaker=4, UK=5}; P = {attends=1, capital of=2, held on=3, lives in=4, position=5, works in=6}
ID-based triples: (2,3,1), (1,2,5), (3,1,2), (5,1,2), (4,1,2), (4,4,5), (4,6,1), (3,4,3), (5,4,5), (5,5,4)
  • 176. Outline RDF management overview K2-Tree Data Structure K2-Triples Compressed Suffix Array (CSA-Sad) RDF-CSA Experiments
• 177. K2-Tree > Motivation
• Structure for representing an adjacency matrix
• Originally designed for web graphs – simple directed graphs
[Figure: an 11-node directed graph (nodes 1–11) and its 11×11 binary adjacency matrix]
• 178. K2-Tree > Construction process
Example with K=2: the matrix is recursively divided into K² submatrices; the bits of the internal levels are concatenated into T, and the bits of the last level into L:
T = 101111010100100011001000000101011110
L = 010000110010001010101000011000100100
[Figure: the adjacency matrix (padded to 16×16) and the resulting K²-ary tree of node bitmaps]
• 179. K2-Tree > Direct neighbour operation
T = 101111010100100011001000000101011110
L = 010000110010001010101000011000100100
Navigation uses rank1 on T: children(x) = rank1(T,x) · k²
children(2) = rank1(T,2) · k² = 2·4 = 8
children(9) = rank1(T,9) · k² = 7·4 = 28
[Figure: the tree bitmaps with the traversed positions highlighted]
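A cell query can be sketched directly from these two bitmaps. The code below is illustrative (rank1 is a linear scan here rather than the O(1), o(n)-extra-space rank directory of the real structure) and checks whether a given matrix cell is set:

    T = "101111010100100011001000000101011110"
    L = "010000110010001010101000011000100100"
    K, SIZE = 2, 16                       # 11x11 matrix padded to 16x16

    def rank1(B, i):                      # number of 1s in B[0..i], inclusive
        return B[:i + 1].count("1")

    def cell(row, col):
        """True iff the adjacency matrix has a 1 at (row, col), 0-based."""
        x, size = 0, SIZE
        while True:
            size //= K
            x += (row // size) * K + (col // size)   # child covering (row, col)
            if x >= len(T):                          # leaf level: answer is in L
                return L[x - len(T)] == "1"
            if T[x] == "0":                          # empty submatrix: prune
                return False
            row, col = row % size, col % size
            x = rank1(T, x) * K * K                  # children(x) = rank1(T,x) * k^2

    print(cell(0, 1))    # True: the example matrix has a 1 at row 0, column 1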
  • 180. Outline RDF management overview K2-Tree structure K2-Triples Compressed Suffix Array (CSA-Sad) RDF-CSA Experiments
• 181. K2-Triples > Data Structure
• Dictionary encoding
• Triples as a set of identifiers
[Figure: RDF triples mapped to ID-triples through the dictionary]
• 182. K2-Triples > Data Structure
• Vertical partitioning
• One K2-tree per predicate
(S,P,O) triples: (8,5,4), (4,2,3), (4,4,6), (4,1,7), (7,2,3), (3,3,5), (5,2,1), (1,3,5), (6,2,2), (2,3,5)
[Figure: five subject×object matrices, one per predicate P1–P5, with one cell set per triple; e.g. (4,1,7) sets cell S=4, O=7 in P1]
• 183. K2-Triples > Operations. Query: (4,2,3). Result: (4,2,3). [Figure: cell (S=4,O=3) checked in the K2-tree of P2] • SPO → checking a cell • SP? • ?PO • S?O • S?? • ??O • ?P?
• 184. K2-Triples > Operations. Query: (4,2,?). Result: (4,2,3). [Figure: row S=4 traversed in the K2-tree of P2] • SPO → checking a cell • SP? → direct neighbours • ?PO • S?O • S?? • ??O • ?P?
• 185. K2-Triples > Operations. Query: (?,2,3). Result: (4,2,3), (7,2,3). [Figure: column O=3 traversed in the K2-tree of P2] • SPO → checking a cell • SP? → direct neighbours • ?PO → reverse neighbours • S?O • S?? • ??O • ?P?
• 186. K2-Triples > Operations. Query: (4,?,6). Result: (4,4,6). [Figure: cell (S=4,O=6) checked in every K2-tree P1–P5] • SPO → checking a cell • SP? → direct neighbours • ?PO → reverse neighbours • S?O → checking |P| cells • S?? • ??O • ?P?
• 187. K2-Triples > Operations. Query: (4,?,?). Result: (4,1,7), (4,2,3), (4,4,6). [Figure: row S=4 traversed in every K2-tree P1–P5] • SPO → checking a cell • SP? → direct neighbours • ?PO → reverse neighbours • S?O → checking |P| cells • S?? → |P| direct neighbours • ??O • ?P?
• 188. K2-Triples > Operations. Query: (?,?,4). Result: (8,5,4). [Figure: column O=4 traversed in every K2-tree P1–P5] • SPO → checking a cell • SP? → direct neighbours • ?PO → reverse neighbours • S?O → checking |P| cells • S?? → |P| direct neighbours • ??O → |P| reverse neighbours • ?P?
• 189. K2-Triples > Operations. Query: (?,2,?). Result: (4,2,3), (5,2,1), (6,2,2), (7,2,3). [Figure: the full K2-tree of P2 traversed] • SPO → checking a cell • SP? → direct neighbours • ?PO → reverse neighbours • S?O → checking |P| cells • S?? → |P| direct neighbours • ??O → |P| reverse neighbours • ?P? → full adjacency matrix
• 190. K2-Triples > Indexes SP & OP
• Weakness of vertical partitioning → unbounded predicates
– (S,?,?), (?,?,O), (S,?,O)
– Checking the |P| K2-trees!
• They proposed indexes SP and OP
(S,P,O) triples: (8,5,4), (4,2,3), (4,4,6), (4,1,7), (7,2,3), (3,3,5), (5,2,1), (1,3,5), (6,2,2), (2,3,5)
SP INDEX (subject → list of predicates): 1→{3}; 2→{3}; 3→{3}; 4→{1,2,4}; 5→{2}; 6→{2}; 7→{2}; 8→{5}
Statistically compressed; direct access with DACs
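Building the SP index amounts to collecting, for each subject, the sorted list of predicates it occurs with (the slides then compress these lists statistically and access them with DACs). An illustrative sketch:

    from collections import defaultdict

    triples = [(8,5,4), (4,2,3), (4,4,6), (4,1,7), (7,2,3),
               (3,3,5), (5,2,1), (1,3,5), (6,2,2), (2,3,5)]

    sp = defaultdict(set)
    for s, p, o in triples:
        sp[s].add(p)
    sp_index = {s: sorted(ps) for s, ps in sorted(sp.items())}
    print(sp_index)   # {1: [3], 2: [3], 3: [3], 4: [1, 2, 4], 5: [2], ..., 8: [5]}

    # a query (4,?,?) now visits only the K2-trees of predicates 1, 2 and 4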
• 191. K2-Triples > Indexes SP & OP
• Query (4,?,?): the SP index returns the predicate list of subject 4 → {1,2,4}, so only the K2-trees of P1, P2 and P4 need to be checked
[Figure: the five K2-trees, with only P1, P2 and P4 accessed]
• 193. K2-Triples > Joins
• They implemented three join strategies, taking advantage of the K2-triples structure:
– Independent join
– Chain join
– Interactive join
Query: (8,5,?X) ⋈ (?X,2,?)
[Figure: the K2-trees of P5 and P2 involved in the join]
The best strategy depends on the dataset and the type of join
• 194. K2-Triples > Joins > Interactive Join
Query: (8,5,?X) ⋈ (?X,2,?)
The two K2-trees are traversed in coordination, refining the candidate ranges for ?X (X[1-4], X[5-8]) level by level
[Figure: synchronized descent over the K2-trees of P5 and P2]
• 195. K2-Triples > Experiments
• Real datasets from different domains:
Dataset    Size (MB)   #Triples      #Predicates   #Subjects    #Objects
Jamendo    144.18      1,049,639     28            335,926      440,604
DBLP       7.58        46,597,620    27            2,840,639    19,639,731
Geonames   12,347.70   112,235,492   26            8,147,136    41,111,569
DBpedia    33,912.71   232,542,405   39,672        18,425,128   65,200,769
• Space results (MB):
Dataset    MonetDB    RDF-3X     Hexastore   K2-triples   K2-triples+
Jamendo    8.76       37.73      1,371.25    0.74         1.28
DBLP       358.44     1,643.31   –           82.48        99.24
Geonames   859.66     3,584.80   –           152.20       188.63
DBpedia    1,811.74   9,757.58   –           931.44       1,178.38
• 196. K2-Triples > Experiments > Triple patterns
• Triple patterns (DBpedia)
[Figure: query times for the different triple patterns on DBpedia]
  • 198. Outline RDF management overview K2-Tree Data Structure K2-Triples Compressed Suffix Array (CSA-Sad) RDF-CSA Experiments
• 199. Compressed Suffix Array (CSA-SAD) > Back to Suffix Arrays
• Binary search for any pattern, e.g. P = ab:
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3   (locations of the lexicographically sorted suffixes)
noccs = (4−3)+1 = 2; occs = A[3]..A[4] = {8, 1}
Fast: O(m lg n) search, O(m lg n + noccs) to report the occurrences; but space is O(4n) bytes for A, plus T
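The sketch below rebuilds A for this example and runs the O(m log n) binary search (naive O(n² log n) construction for brevity; the key= argument of bisect needs Python ≥ 3.10):

    from bisect import bisect_left, bisect_right

    T = "abracadabra$"
    A = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions
    assert A == [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]

    def occurrences(P):
        # binary-search the interval of A whose suffixes start with P
        pref = lambda i: T[i - 1:i - 1 + len(P)]
        lo = bisect_left(A, P, key=pref)
        hi = bisect_right(A, P, key=pref)
        return A[lo:hi]

    print(occurrences("ab"))    # [8, 1] -> noccs = 2, as in the slide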
• 200. Compressed Suffix Array (CSA-SAD) > CSA basics
• Can we reduce the space needs of a Suffix Array?
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
[Figure: A aligned with the sorted suffixes ($, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$) and the binary search for P = ab]
• 201. Compressed Suffix Array (CSA-SAD) > CSA basics
• Ψ: A[Ψ(i)] = A[i] + 1
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
[Figure: arrows linking each suffix to the suffix starting one text position later]
• 203. Compressed Suffix Array (CSA-SAD) > CSA basics
• Ψ example: Ψ(10) = 3, so A[Ψ(10)] = A[3] = 8 = A[10] + 1 = 7 + 1
[Figure: T, A and Ψ with the involved entries highlighted]
• 204. Compressed Suffix Array (CSA-SAD) > CSA basics
• Ψ and F:
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
F = $ a a a a a b b c d r r   (first symbol of each sorted suffix)
• Ψ and F are enough to perform the binary search and to recover the source data!!
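That claim is easy to test: the sketch below derives Ψ from A and then regenerates T using only Ψ and F (the real iCSA stores neither A nor T explicitly):

    T = "abracadabra$"
    n = len(T)
    A = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
    inv = {v: i + 1 for i, v in enumerate(A)}         # A^{-1}
    psi = [inv[A[i] % n + 1] for i in range(n)]       # A[psi(i)] = A[i] + 1 (cyclic)
    F = "".join(sorted(T))                            # "$aaaaabbcdrr"

    i, out = inv[1], []                               # row of the full-text suffix
    for _ in range(n):
        out.append(F[i - 1])                          # first symbol of suffix i
        i = psi[i - 1]                                # jump to the next text position
    print("".join(out))                               # abracadabra$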
• 205. Compressed Suffix Array (CSA-SAD) > CSA basics
• Ψ and F (reducing space needs):
F = $ a a a a a b b c d r r   (positions 1..12)
D = 1 1 0 0 0 0 1 0 1 1 1 0   (bitmap marking the first occurrence of each distinct symbol in F)
S = $ a b c d r   (sorted alphabet)
• 206. Compressed Suffix Array (CSA-SAD) > Representing F
F = $ a a a a a b b c d r r; D = 1 1 0 0 0 0 1 0 1 1 1 0 (bitmap); S = $ a b c d r (sorted alphabet)
• Example: F[8] = S[rank1(D, 8)] = S[3] = ‘b’
• rank1(D,i): O(1) time, using o(n) extra space
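A two-line check of this representation (rank1 is a plain scan here; a real bitmap answers it in O(1) time with o(n) extra bits):

    D = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
    S = "$abcdr"

    def rank1(B, i):                 # 1s in B[1..i] (1-based, inclusive)
        return sum(B[:i])

    F = lambda i: S[rank1(D, i) - 1]
    print(F(8))                      # 'b' = S[rank1(D,8)] = S[3]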
• 207. Compressed Suffix Array (CSA-SAD) > Compressing Ψ
• Absolute samples (k = sample period)
• Gap encoding on the increasing values: Huffman & run encoding
• Huffman with an N-entry dictionary:
– k reserved Huffman codes encode 1-runs of size s ∈ [1..k−1]
– 32 + 32 Huffman codes represent the size (in bits) of large values [+ or −]; each is followed by the value itself, encoded with log(v) bits
– The remaining N − k − 32 − 32 entries correspond to the most frequent gap values
[Figure: a sampled Ψ, its gaps Δ, and the resulting encoded bit stream]
• 208. Compressed Suffix Array (CSA-SAD) > Complete structure
– Ψ (sampled; gaps compressed with delta codes, Huffman-based codes and run encoding), D, S → count
– A (sampled) → locate
– A⁻¹ (sampled) → extract
Parameters give a space/time trade-off
[Figure: T, A, A⁻¹, Ψ, D and S for the running example]
  • 209. Outline RDF management overview K2-Tree Data Structure K2-Triples Compressed Suffix Array (CSA-Sad) RDF-CSA Experiments
• 210. RDF-CSA > Building RDF-CSA
• Step 1 → Integer dictionary encoding of s, p, o
• Step 2 → Ordered list of n triples (a sequence Sid of 3n elements); we first sort by subject, then by predicate, and finally by object
• 211. RDF-CSA > Building RDF-CSA
• Step 3 → Sid is transformed into S, in order to keep disjoint alphabets:
Range [1, ns] for subjects
Range [ns+1, ns+np] for predicates
Range [ns+np+1, ns+np+no] for objects
Due to this alphabet mapping, every subject is smaller than every predicate, which in turn is smaller than every object!!!
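The remapping of Steps 2–3 can be sketched in a few lines (names are illustrative):

    def build_S(id_triples, ns, npred):
        """Sort triples and flatten into the 3n-sequence S with disjoint ranges."""
        S = []
        for s, p, o in sorted(id_triples):            # sort by s, then p, then o
            S += [s, p + ns, o + ns + npred]          # shift predicates and objects
        return S

    # e.g. with ns=5, npred=3: the triple (4,2,6) becomes the subsequence [4, 7, 14]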
• 212. RDF-CSA > Building RDF-CSA
• Step 4 → We build an iCSA on S
– A has three ranges: each range points to suffixes starting with a subject, a predicate, or an object
– Ψ cycles around the components of the same triple; that is, the object of a triple k does not point to the subject of triple k+1 in S, but to the subject of the same triple
– We can start at position A[i], pointing to any place within a triple (s,p,o), and recover the whole triple by successive applications of Ψ
• 213. RDF-CSA > Searching for triple patterns
• (S,P,O), (?S,P,O), (S,?P,O), (S,P,?O), (?S,?P,O), (S,?P,?O), (?S,P,?O), (?S,?P,?O)
– Patterns with just one bounded element are solved directly using select on D
– Pattern (?S,?P,?O) retrieves all the triples, so it can be solved by retrieving every ith triple using Ψ
– For the rest of the patterns: binary iCSA search
• (S,P,O) → bsearch(SPO, 3)
• (?S,P,O) → bsearch(PO, 2) … (S,?P,O) → bsearch(OS, 2)
– Optimizations:
• D-select+forward-check strategy: find the valid intervals in the S, P and O ranges, and check matches with Ψ in those intervals, starting from the shortest one
• D-select+backward-check strategy: use binary search to limit the valid intervals, instead of sequentially verifying each position of the shortest interval
Optimizations are applicable to pattern (S,P,O) and to those with just one unbounded term!!
• 214. RDF-CSA > Searching for triple patterns
• (S,P,O) optimizations:
– D-select+forward-check strategy: find the valid intervals in the S, P and O ranges, and check matches with Ψ in those intervals, starting from the shortest one
– D-select+backward-check strategy: use binary search to limit the valid intervals, instead of sequentially verifying each position of the shortest interval
[Figure: example with S=8, P=4, O=261, showing how the SP/SPO (forward) and PO/SPO (backward) intervals are refined]