Linked Data Compression
Miguel A. Martínez-Prieto (Univ. of Valladolid, Spain), migumar2@infor.uva.es
Antonio Fariña (Univ. of A Coruña, Spain), fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Miguel A. Martínez-Prieto & Antonio Fariña, Linked Data Compression, 1/53
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– What is Linked Data? –
Linked Data
Linked Data is simply about using the Web to create typed links
between data from different sources [3].
Linked Data refers to a set of best practices for publishing and
connecting data on the Web.
These best practices have been adopted by an increasing number of data
providers, leading to the creation of a global data space:
Data are machine-readable.
Data meaning is explicitly defined.
Data are linked from/to external datasets.
The resulting Web of Data connects data from different domains:
Publications, movies, multimedia, government data, statistical data, etc.
The Web... of Data
The emergence of the Web was a genuine revolution 15 years ago:
Changed the way we consume information.
Changed human relationships.
Changed businesses.
...
The Web
The Web is a global space comprising linked HTML documents:
Web pages are the atoms of the Web.
Each page is uniquely identified by its URL.
The Web
Where are the (raw) data on the Web?
Web pages “cook” raw data in a human-readable way.
This is, arguably, the main problem of the WWW.
The Web
- I was excited about the Keystone Training School and looked
for information about this nice country.
- I typed “malta” into a web search engine, and...
I found some relevant results for my query! :)
But others seem a little strange to my (current) expectations... :(
The Web... of Data
Raw data are hidden within web page contents:
In general, data are written in HTML paragraphs.
In the best case, they are structured as HTML tables or
published as additional documents (CSV, XML...).
In any case, HTML is not expressive enough to describe and link individual
data entities on the Web:
HTML-based descriptions lose the semantics and structure of the raw
data.
This makes automatic data processing on the Web very difficult.
The Web of Data [8] converts raw data into first-class citizens of the
Web...
Data entities are the atoms of the Web of Data.
Each entity has its own identity.
...and uses existing infrastructure:
It uses HTTP as communication protocol.
Entities are named using URIs.
The Web of Data is a cloud of data-to-data hyperlinks [5]:
These are labelled hyperlinks in contrast to the “plain” ones used
in the Web.
Thus, hyperlinks also provide semantics to data descriptions.
Linked Data builds a Web of Data using the Internet infrastructure:
Data providers can publish their raw data in a standardized way.
These data can be interconnected using labelled hyperlinks.
The resulting cloud of data can be navigated using specific query
languages.
Linked Data achievements:
Knowledge from different fields can be easily integrated and universally
shared.
Automatic processes can exploit this knowledge to build innovative
software systems.
Semantic Search Engine
For instance, a semantic search engine would allow us to retrieve only the entities
that describe “malta” as a country, not as a cereal.
– Linked Data Principles –
Tim Berners-Lee [2] suggests four basic principles for Linked Data:
1 Use URIs as names for things.
2 Use HTTP URIs so that people can look up those names.
3 When someone looks up a URI, provide useful information, using the
standards (RDF, SPARQL).
4 Include links to other URIs, so that they can discover more things.
1. URIs as names
What is his name?
For humans, his name is Clint Eastwood...
... but http://dataweb.infor.uva.es/movies/people/Clint_Eastwood is a
better name for machines.
The use of URIs enables real-world entities (or their relationships with
other entities) to be identified at universal scale.
This principle ensures any class of data has its own identity in the global
space of the Web of Data.
2. HTTP URIs
All entities must be described using dereferenceable URIs:
These URIs are accessible via HTTP.
This principle exploits HTTP features to retrieve all data related to a
given URI.
3. Standards
This principle states that all
stakeholders “must speak the same
languages” for effective
understanding.
RDF [10] provides a simple logical
model for data description.
SPARQL [12] describes a specific
language for querying RDF data.
Serialization formats, ontology
languages, etc.
4. Linking URIs
This principle materializes the aim of data integration in Linked Data:
Linking two URIs establishes a particular connection between two existing
entities.
Linking URIs
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood names the entity which
describes “Clint Eastwood”.
http://dataweb.infor.uva.es/movies/film/Mystic_River names the entity which describes
the movie “Mystic River”.
A hyperlink between these two URIs states that the entity “Clint Eastwood” is
related to the entity “Mystic River”... how?
The labelled link provides a semantic relationship between entities.
In this case, http://dataweb.infor.uva.es/movies/property/director tags the
“director” relationship between “Clint Eastwood” and “Mystic River”.
– Linked Open Data –
The Linked Open Data (LOD) project (http://linkeddata.org/) promotes
publishing Linked Data as Open Data:
LOD is released under an open license which does not impede its reuse for
free [2].
LOD is the highest level in the 5-star scheme (http://5stardata.info/) for
Open Data publication:
The dataset is available on the Web under an open license.
The dataset is available as structured data.
The dataset is encoded using a non-proprietary format.
The dataset names entities using URIs.
The dataset is linked to other datasets.
LOD (2007-2011)
LOD (2014)
Current Statistics (July, 2015)
9,960 datasets are openly available (http://stats.lod2.eu/):
90 billion statements from 3,308 datasets.
6,639 datasets could not be crawled for different reasons.
LOD Laundromat (http://lodlaundromat.org/) provides access to more than
38 billion statements from 650K “cleaned” datasets.
DBpedia 2014 contains more than 3 billion statements:
538 million statements from the English Wikipedia.
2.46 billion statements from other language editions.
50 million statements linking to external datasets.
More and more datasets are released, and they are getting bigger:
The largest ones are in the order of hundreds of GB.
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– Overview –
Semantic Technologies (the middle layers of the Semantic Web stack)
exploit features from the Web infrastructure (the lower layers):
RDF is used for resource
description.
RDFS is used for describing
semantic vocabularies.
OWL extends RDFS and is used
for building ontologies.
SPARQL is the query language for
RDF data.
RIF is used for describing rules.
RDF & SPARQL
RDF & SPARQL are the most relevant technologies for our current aims:
Both standards are based on labelled directed graph features.
– RDF –
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/name,
 “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/film/Mystic_River,
 http://dataweb.infor.uva.es/movies/property/title,
 “Mystic River”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/director,
 http://dataweb.infor.uva.es/movies/film/Mystic_River)
RDF [10] is a framework for describing resources of any class:
People, movies, cities, proteins, statistical data...
Resources are described in the form of triples:
Subject: the resource being described.
Predicate: a property of that resource.
Object: the value for the corresponding property.
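The triple model can be sketched with plain tuples; the “mv:” prefix below is our own shorthand for http://dataweb.infor.uva.es/movies/ (an illustration, not a standard serialization):

```python
# The three example triples from this slide, as (subject, predicate, object)
# tuples. "mv:" is a hypothetical abbreviation used only for readability.
triples = [
    ("mv:people/Clint_Eastwood", "mv:property/name", "Clint Eastwood"),
    ("mv:film/Mystic_River", "mv:property/title", "Mystic River"),
    ("mv:people/Clint_Eastwood", "mv:property/director", "mv:film/Mystic_River"),
]

def describe(subject, triples):
    # Collect every (predicate, object) pair describing the given resource.
    return [(p, o) for s, p, o in triples if s == subject]
```

All properties of a resource are then just the triples sharing its subject.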
RDF Triples
An RDF triple is a labelled directed subgraph in which subject and object
nodes are linked by a particular (predicate) edge:
The subject node contains the URI which names the resource.
The predicate edge labels the relationship using a URI whose semantics is
described by some vocabulary/ontology.
The object node may contain a URI or a (string) Literal value.
RDF links (between entities) also take the form of RDF triples.
RDF Graph
This graph view is only a mental model:
RDF graphs must be serialized!
However, the RDF Recommendation does not restrict the format to be used.
RDF Serialization Formats
Traditional plain formats are commonly used:
RDF/XML, N-Triples, Turtle...
These formats are very verbose in practice:
Data are serialized in a (more or less) human-readable way.
Large RDF files are typically compressed using gzip or bzip2.
– SPARQL –
SPARQL [12] is a query language for RDF.
It is based on graph pattern matching:
Triple patterns are RDF triples in which the subject, predicate, and/or object
may be variables.
SPARQL supports more complex queries: joins, unions, filters...
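Triple-pattern evaluation, the core of SPARQL resolution, can be sketched as follows (variables start with “?”; the data and names are illustrative, and repeated variables are not handled in this sketch):

```python
def match(pattern, triples):
    # Evaluate one triple pattern: positions starting with "?" are variables,
    # any other position must equal the triple's value exactly.
    # Returns one binding dict per matching triple.
    results = []
    for triple in triples:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            results.append(binding)
    return results

# Hypothetical toy dataset (shortened names, not real URIs).
triples = [
    ("Clint_Eastwood", "director", "Mystic_River"),
    ("Clint_Eastwood", "director", "Million_Dollar_Baby"),
    ("Mystic_River", "title", "Mystic River"),
]
```

For example, the pattern `("Clint_Eastwood", "director", "?film")` binds `?film` once per movie he directed; joins, unions, and filters are built on top of such pattern matches.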
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
What is the problem?
RDF excels at the logical level:
Structured and semi-structured data can be described using RDF triples.
Entities are also linked in the form of RDF triples.
But it is a source of redundancy at the physical level:
Serialization formats are highly verbose.
RDF data are redundant at three levels: semantic, symbolic, and
syntactic.
– Semantic Compression –
Semantic redundancy occurs when the same meaning can be conveyed
using fewer triples.
(http://dataweb.infor.uva.es/movies/property/name,
 http://www.w3.org/2000/01/rdf-schema#domain,
 http://dataweb.infor.uva.es/movies/classes/person)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/name,
 “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
 http://dataweb.infor.uva.es/movies/classes/person)
The third triple is redundant because the first one states that the URI
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood describes an entity in the
domain of http://dataweb.infor.uva.es/movies/classes/person.
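A toy version of this single rule can be sketched in Python (the prefixed names such as rdfs:domain are abbreviations; real semantic compressors [9, 11, 13] apply many such rules):

```python
RDFS_DOMAIN = "rdfs:domain"
RDF_TYPE = "rdf:type"

def remove_entailed_types(triples):
    # Illustrative rule: if (p, rdfs:domain, C) is stated and some triple
    # (s, p, o) uses predicate p, then (s, rdf:type, C) is entailed by RDFS
    # and can be dropped; a reasoner can re-derive it on demand.
    domains = {s: o for s, p, o in triples if p == RDFS_DOMAIN}
    entailed = {(s, RDF_TYPE, domains[p]) for s, p, o in triples if p in domains}
    return [t for t in triples if t not in entailed]
```

Applied to the three triples above, the explicit rdf:type triple is removed and only two triples remain.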
Semantic Compression
Semantic compressors operate at the logical level:
They detect redundant triples and remove them from the original dataset.
Semantic compressors [9, 11, 13] are not very effective by themselves...
... but they may be combined with symbolic and syntactic compressors!
– Symbolic Compression –
Symbolic redundancy is due to symbol repetitions in triples:
This is the “traditional” source of redundancy removed by universal
compressors.
Symbolic redundancy in RDF is mainly due to URIs:
URIs tend to be very long strings which share long prefixes.
http://dataweb.infor.uva.es/movies/film/Bird
http://dataweb.infor.uva.es/movies/film/Million_Dollar_Baby
http://dataweb.infor.uva.es/movies/film/Mystic_River
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood
...
... but literals also contribute to this redundancy.
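One classic way to exploit those shared prefixes is front-coding over the sorted strings; a minimal sketch (illustrative, not any specific system's format):

```python
def front_code(sorted_strings):
    # Store each string as (length of the prefix shared with the previous
    # string, remaining suffix): long common URI prefixes are kept only once.
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    # Rebuild each string from the previous one plus the stored suffix.
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

On the four URIs above, only the first one is stored in full; the rest store a couple of dozen characters at most.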
Symbolic Compression
The most prominent RDF compressors remove symbolic redundancy:
All distinct URIs/literals are indexed in a string dictionary.
Each string is identified by a unique integer ID.
Triples are then rewritten by replacing each string with its corresponding ID.
Symbolic redundancy is, in general, the most important redundancy in RDF,
and it leaves much room for optimization.
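The dictionary step can be sketched as follows (one shared mapping is an assumption to keep the sketch short; real RDF compressors keep separate mappings per role):

```python
def dict_encode(triples):
    # Build a string dictionary assigning consecutive integer IDs on first
    # sight, then rewrite every triple over those IDs.
    ids = {}

    def get_id(term):
        if term not in ids:
            ids[term] = len(ids) + 1
        return ids[term]

    id_triples = [(get_id(s), get_id(p), get_id(o)) for s, p, o in triples]
    return ids, id_triples
```

Each long string is stored once in the dictionary; the (much smaller) ID triples are what the syntactic stage compresses further.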
– Syntactic Compression –
Syntactic redundancy depends on the RDF graph serialization:
For instance, a serialized subset of n triples (which describe the same
resource) writes the subject value n times. This can be abbreviated.
... and also on the underlying graph structure:
For instance, resources of the same class are described using (almost)
the same sub-graph structure.
Syntactic compression also leaves much room for optimization.
Syntactic Compression
HDT [7], k2-triples [1], and RDFCSA [4] are syntactic compressors
reporting good numbers:
They are combined with symbolic compression.
In practice, they compress RDF triples in the form of ID triples.
Semantic compressors such as SSP [11] also remove symbolic and
syntactic redundancy.
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– What is HDT? –
HDT was the first binary serialization format for RDF:
It was acknowledged as a W3C Member Submission [6] in 2011.
It exploits symbolic and syntactic redundancy:
It reduces the space used by traditional formats by up to 15 times [7].
HDT is a core building block in some Linked Data applications:
It reports good compression numbers, but also provides efficient data
retrieval.
– Components –
HDT encodes RDF data into three components:
The Header (H) comprises descriptive metadata.
The Dictionary (D) maps different strings (from nodes and edges) to IDs:
It manages four independent mappings: subjects-objects, subjects, objects, and
predicates.
The Triples (T) component encodes the inner structure as a graph of IDs.
HDT Components
The Dictionary is encoded using specific compression techniques for string
dictionaries.
Triple IDs are organized into a forest of trees (one per different subject)...
...which is encoded using two bitsequences and two ID sequences.
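The forest-of-trees idea can be sketched in Python; this is a simplified illustration of HDT's BitmapTriples layout (1 marks the last element of each adjacency list), not the exact on-disk format:

```python
def bitmap_triples(id_triples):
    # Sort triples by (s, p, o); subjects become implicit (1, 2, ...).
    # Predicates and objects are stored as two ID sequences (sp, so) plus
    # two bitsequences (bp, bo) where 1 closes an adjacency list.
    sp, bp, so, bo = [], [], [], []
    prev_s, prev_sp = None, None
    for s, p, o in sorted(id_triples):
        if s != prev_s:            # a new subject tree starts
            if bp:
                bp[-1] = 1         # close the previous subject's predicate list
            prev_s, prev_sp = s, None
        if (s, p) != prev_sp:      # a new predicate under this subject
            if bo:
                bo[-1] = 1         # close the previous predicate's object list
            sp.append(p)
            bp.append(0)
            prev_sp = (s, p)
        so.append(o)
        bo.append(0)
    if bp:
        bp[-1] = 1
    if bo:
        bo[-1] = 1
    return sp, bp, so, bo
```

Subject IDs never need to be stored: the i-th predicate list (delimited by the 1s in bp) belongs to subject i.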
– Conclusions –
HDT integrates RDF serialization and compression into a practical
format:
HDT saves storage space and enables efficient data parsing/retrieval
using bit operations.
Symbolic redundancy is addressed by the Dictionary component:
The collection of strings (in the dictionary) has high symbolic
redundancy...
The dictionary itself is highly compressible!
Syntactic redundancy is removed by the Triples component:
HDT Triples is a straightforward compressor.
Its effectiveness can be improved using optimized graph compression
techniques.
Bibliography I
[1] Sandra Álvarez-García, Nieves Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto, and Gonzalo Navarro.
Compressed Vertical Partitioning for Efficient RDF Management.
Knowledge and Information Systems (KAIS), 44(2):439–474, 2015.
[2] Tim Berners-Lee.
Linked Data, 2006.
http://www.w3.org/DesignIssues/LinkedData.html.
[3] Christian Bizer, Tom Heath, and Tim Berners-Lee.
Linked Data - The Story So Far.
International Journal of Semantic Web and Information Systems, 5(3):1–22, 2009.
[4] Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro.
A Compact RDF Store using Suffix Arrays.
In Proceedings of SPIRE, 2015. To appear.
[5] Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez.
Management of Big Semantic Data.
In Big Data Computing, chapter 4. Taylor and Francis/CRC, 2013.
[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, and Axel Polleres.
Binary RDF Representation for Publication and Exchange.
W3C Member Submission, 2011. www.w3.org/Submission/HDT/.
Bibliography II
[7] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias.
Binary RDF Representation for Publication and Exchange.
Journal of Web Semantics, 19:22–41, 2013.
[8] Tom Heath and Christian Bizer.
Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool, 1st edition, 2011. http://linkeddatabook.com/.
[9] Amit K. Joshi, Pascal Hitzler, and Guozhu Dong.
Logical Linked Data Compression.
In Proceedings of ESWC, pages 170–184, 2013.
[10] Frank Manola and Eric Miller.
RDF Primer.
W3C Recommendation, 2004. www.w3.org/TR/rdf-primer/.
[11] Jeff Z. Pan, José Manuel Gómez-Pérez, Yuan Ren, Honghan Wu, and Man Zhu.
SSP: Compressing RDF data by Summarisation, Serialisation and Predictive Encoding.
Technical report, 2014. Available at http://www.kdrive-project.eu/wp-content/uploads/2014/06/WP3-TR2-2014 SSP.pdf.
[12] Eric Prud'hommeaux and Andy Seaborne.
SPARQL Query Language for RDF.
W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/.
Bibliography III
[13] Gayathri V. and P. Sreenivasa Kumar.
Horn-Rule based Compression Technique for RDF Data.
In Proceedings of SAC, pages 396–401, 2015.
This presentation has been made available only for learning/teaching purposes.
The pictures used in the slides may be owned by other parties; their ownership remains exclusively with their authors.
Onto some basics of:
compression, Compact Data Structures, and
indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
Introduction
Why compression?
• Disks are cheap!! But they are also slow!
– Compression can help more data fit in main memory
(access to memory is around 10^6 times faster than HDD).
• CPU speed is increasing faster.
– We can trade processing time (needed to uncompress
data) for space.
Introduction
Why compression?
• Compression does not only reduce space!
– I/O access on disks and networks
– Processing time* (less data has to be processed)
• ... if appropriate methods are used
– For example: allowing data to be handled in compressed form all the time.
(Figure: a text collection (100%) of documents Doc 1 ... Doc n; a compressed
version takes 30%, and a p7zip-compressed version takes 20%; we want to
search them for “Malta”.)
Introduction
Why indexing?
• Indexing permits sublinear search time.
(Figure: the text collection (100%) or its compressed version (30%), plus an
index over the terms (term 1 ... Malta ... term n) that adds a further > 5-30%
of space; the search for “Malta” is answered through the index.)
Introduction
Why Compact Data Structures?
• Self-indexes:
– sublinear search time
– the text is implicitly kept
(Figure: a self-index (WT, WCSA, ...), built from compact bitsequences,
replaces both the text collection and the term index (> 5-30%); the search
for “Malta” runs directly on the compact structure.)
Basic Compression
Modeling & Coding
• A compressor could use as a source alphabet:
– A fixed number of symbols (statistical compressors)
• 1 char, 1 word
– A variable number of symbols (dictionary-based compressors)
• 1st occurrence of ‘a’ encoded alone, 2nd occurrence encoded with the next one: ‘ax’
• Codes are built using symbols of a target alphabet:
– Fixed-length codes (1 bit, 10 bits, 1 byte, 2 bytes, ...)
– Variable-length codes (1, 2, 3, 4 bits/bytes ...)
• Classification (fixed-to-variable, variable-to-fixed, ...):
– fixed input alphabet, variable target alphabet: statistical compressors
– variable input alphabet, fixed target alphabet: dictionary-based compressors
– variable input alphabet, variable target alphabet: var2var compressors
Basic Compression
Main families of compressors
• Taxonomy
– Dictionary-based (gzip, compress, p7zip...)
– Grammar-based (BPE, Repair)
– Statistical compressors (Huffman, arithmetic, PPM, ...)
• Statistical compressors
– Gather the frequencies of the source symbols.
– Assign shorter codewords to the most frequent symbols.
This is how they obtain compression.
Basic Compression
Dictionary-based compressors
• How do they achieve compression?
– Assign fixed-length codewords to variable-length symbols (text
substrings).
– The longer the replaced substring, the better the compression.
• Well-known representatives: the Lempel-Ziv family
– LZ77 (1977): GZIP, PKZIP, ARJ, P7zip
– LZ78 (1978)
• LZW (1984): compress, GIF images
Basic Compression
LZW
• Starts with an initial dictionary D (containing the symbols in Σ).
• From a given position of the text:
– While D contains it, read a growing prefix w = w0 w1 w2 ...
– When w0...wk wk+1 is not in D (but w0...wk is!):
• output i = entryPos(w0...wk) (note: the codeword takes log2(|D|) bits)
• add w0...wk wk+1 to D
• continue from wk+1 on (included)
• Dictionary has limited length? Policies: LRU, truncate & go, ...
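A minimal LZW encoder following the steps above (a byte alphabet is assumed; real implementations also bound the dictionary size with one of the policies mentioned):

```python
def lzw_encode(text):
    # Initial dictionary D: one entry per symbol of the source alphabet.
    d = {chr(i): i for i in range(256)}
    w = ""
    out = []
    for c in text:
        if w + c in d:
            w += c             # keep growing the prefix while it is in D
        else:
            out.append(d[w])   # emit the entry of the longest known prefix
            d[w + c] = len(d)  # add the new string w0...wk wk+1 to D
            w = c              # continue from the mismatching symbol
    if w:
        out.append(d[w])       # flush the pending prefix
    return out
```

For example, `lzw_encode("abababab")` emits five codewords instead of eight symbols, since "ab" and "aba" become dictionary entries as the text is scanned.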
Basic Compression
Grammar-based: BPE / Repair
• Replaces pairs of symbols by a new one, until no pair repeats twice.
– Each replacement adds a rule to a dictionary of rules.
Source sequence:    A B C D E A B D E F D E D E F A B E C D
Apply rule DE → G:  A B C G A B G F G G F A B E C D
Apply rule AB → H:  H C G H G F G G F H E C D
Apply rule GF → I:  H C G H I G I H E C D
Dictionary of rules: G → DE, H → AB, I → GF
Final Repair sequence: H C G H I G I H E C D
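The loop above can be sketched as follows; the nonterminal names (R0, R1, ...) and the tie-breaking among equally frequent pairs are arbitrary choices of this sketch:

```python
from collections import Counter

def repair(seq):
    # Repeatedly replace the most frequent adjacent pair by a fresh
    # nonterminal until no pair occurs twice; returns (sequence, rules).
    rules, fresh = {}, 0
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        new = f"R{fresh}"
        fresh += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):          # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(seq, rules):
    # Undo the rules to recover the original sequence.
    out = []
    for sym in seq:
        if sym in rules:
            out.extend(expand(list(rules[sym]), rules))
        else:
            out.append(sym)
    return out
```

Expanding the compressed sequence with the rule dictionary always recovers the source, which is what makes the grammar a lossless representation.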
Basic Compression
Statistical Compressors
• Assign shorter codewords to the most frequent symbols.
– Must gather the frequencies of the source symbols: for each symbol c in Σ,
n_c = number of occurrences of c, with n = number of symbols in S.
– Compression is lower bounded by the (zero-order) empirical entropy of
the sequence S:
H0(S) = Σ_{c in Σ} (n_c / n) log2(n / n_c)
– H0(S) <= log2(|Σ|).
– n·H0(S) is a lower bound on the size of S compressed with a zero-order
compressor.
• Most representative method: Huffman coding.
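H0 can be computed directly from the definition above:

```python
from collections import Counter
from math import log2

def h0(s):
    # Zero-order empirical entropy: sum over symbols c of (n_c/n)*log2(n/n_c),
    # in bits per symbol.
    n = len(s)
    return sum(nc / n * log2(n / nc) for nc in Counter(s).values())
```

A one-symbol sequence has H0 = 0 (nothing to encode), and a uniform binary sequence reaches the log2 |Σ| = 1 bit upper bound.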
Basic Compression
Statistical Compressors: Huffman coding
• Optimal prefix-free coding:
– No codeword is a prefix of another.
• Decoding requires no look-ahead!
– Asymptotically optimal: |Huffman(S)| <= n(H0(S)+1).
• Typically uses bit-wise codewords.
– Yet D-ary Huffman variants exist (D=256: byte-wise).
• Builds a Huffman tree to generate the codewords.
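A compact sketch of the tree construction using a min-heap (the integer counter only breaks frequency ties so heap comparisons stay well-defined; the exact codewords depend on tie-breaking, the code lengths do not):

```python
import heapq
from collections import Counter

def huffman_codes(s):
    # Build the Huffman tree bottom-up by repeatedly merging the two least
    # frequent nodes, then read codewords off the tree (left=0, right=1).
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(s).items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}       # degenerate one-symbol alphabet
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):    # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                          # leaf: a source symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

On "aaaabbc" the frequent 'a' gets a 1-bit codeword while 'b' and 'c' get 2 bits, and no codeword is a prefix of another.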
Basic Compression
Burrows-Wheeler Transform (BWT)
• Given S = mississippi$, BWT(S) is obtained by: (1) creating
a matrix M with all circular rotations of S, (2) sorting
the rows of M, and (3) taking the last column.
Rotations of S:   Sorted rows of M:
mississippi$      $mississippi
$mississippi      i$mississipp
i$mississipp      ippi$mississ
pi$mississip      issippi$miss
ppi$mississi      ississippi$m
ippi$mississ      mississippi$
sippi$missis      pi$mississip
ssippi$missi      ppi$mississi
issippi$miss      sippi$missis
sissippi$mis      sissippi$mis
ssissippi$mi      ssippi$missi
ississippi$m      ssissippi$mi
F = first column = $iiiimppssss
L = last column = BWT(S) = ipssm$pissii
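The three steps translate directly to code (sorting full rotations is quadratic in space; real implementations use suffix arrays instead):

```python
def bwt(s):
    # (1) append the terminator and form all rotations, (2) sort them,
    # (3) concatenate the last symbol of each sorted rotation.
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)
```

Running it on the slide's example reproduces L = ipssm$pissii.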
Basic Compression
Burrows-Wheeler Transform: reversible (BWT^-1)
• Given L = BWT(S), we can recover S = BWT^-1(L).
 i   sorted row (F ... L)   LF[i]
 1   $mississippi            2
 2   i$mississipp            7
 3   ippi$mississ            9
 4   issippi$miss           10
 5   ississippi$m            6
 6   mississippi$            1
 7   pi$mississip            8
 8   ppi$mississi            3
 9   sippi$missis           11
10   sissippi$mis           12
11   ssippi$missi            4
12   ssissippi$mi            5
Steps:
1. Sort L to obtain F.
2. Build the LF mapping so that
   if L[i] = ‘c’, and
   k = the number of times ‘c’ occurs in L[1..i], and
   j = the position in F of the k-th occurrence of ‘c’,
   then set LF[i] = j.
   Example: L[7] = ‘p’ is the 2nd ‘p’ in L, so LF[7] = 8,
   which is the 2nd occurrence of ‘p’ in F.
3. Recover the source sequence S in n steps:
   Initially p = 6 (the position of ‘$’ in L); i = 0; n = 12.
   In each step: S[n-i] = L[p]; p = LF[p]; i = i+1.
   Step i = 0: S[n-i] = L[p] sets S[12] = ‘$’;
               then p = LF[p] = 1 and i = 1.
82. Basic Compression
• Given L=BWT(S), we can recover S=BWT-1(L)
Burrows-Wheeler Transform: reversible (BWT -1)
$mississippi
i$mississipp
ippi$mississ
issippi$miss
ississippi$m
mississippi$
pi$mississip
ppi$mississi
sippi$missis
sissippi$mis
ssippi$missi
ssissippi$mi
LF
1
2
3
4
5
6
7
8
9
10
11
12
2
7
9
10
6
1
8
3
11
12
4
5
LF
Steps:
1. Sort L to obtain F
2. Build LF mapping so that
If L[i]=‘c’, and
k= the number of times ‘c’ occurs in L[1..i], and
j=position in F of the kth occurrence of ‘c’
Then set LF[i]=j
Example: L[7] = ‘p’, it is the 2nd ‘p’ in L LF[7] = 8
which is the 2nd occ of ‘p’ in F
3. Recover the source sequence S in n steps:
Initially p=l=6 (position of $ in L); i=0; n=12;
Step i=1: S[n-i] = L[p]; S[11]=‘i’
p = LF[p]; p = 2
i = i+1; i=2
-
-
-
-
-
-
-
-
-
-
i
$
S
83. Basic Compression
• Given L=BWT(S), we can recover S=BWT-1(L)
Burrows-Wheeler Transform: reversible (BWT -1)
$mississippi
i$mississipp
ippi$mississ
issippi$miss
ississippi$m
mississippi$
pi$mississip
ppi$mississi
sippi$missis
sissippi$mis
ssippi$missi
ssissippi$mi
LF
1
2
3
4
5
6
7
8
9
10
11
12
2
7
9
10
6
1
8
3
11
12
4
5
LF
Steps:
1. Sort L to obtain F
2. Build LF mapping so that
If L[i]=‘c’, and
k= the number of times ‘c’ occurs in L[1..i], and
j=position in F of the kth occurrence of ‘c’
Then set LF[i]=j
Example: L[7] = ‘p’, it is the 2nd ‘p’ in L LF[7] = 8
which is the 2nd occ of ‘p’ in F
3. Recover the source sequence S in n steps:
Initially p=l=6 (position of $ in L); i=0; n=12;
Step i=1: S[n-i] = L[p]; S[11]=‘i’
p = LF[p]; p = 2
i = i+1; i=2
m
i
s
s
i
s
s
i
p
i
i
$
S
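The LF-walk above can be sketched directly; this didactic version rebuilds the LF mapping with a stable sort of L instead of the C/Occ counters used later in the lecture:

```python
def ibwt(L):
    """Invert the BWT: build the LF mapping (last column -> sorted
    first column F) via a stable sort of L, then walk n steps
    backwards from the row whose last character is the sentinel."""
    n = len(L)
    # F[j] = index in L of the j-th character of the sorted column F;
    # inverting this permutation gives LF: LF[i] = row of F fed by L[i].
    F = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for j, i in enumerate(F):
        LF[i] = j
    p = L.index("$")            # start at the row whose L-char is '$'
    out = []
    for _ in range(n):          # collects S from last char to first
        out.append(L[p])
        p = LF[p]
    return "".join(reversed(out))
```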
84. Basic Compression
• BWT. Many similar symbols appear adjacent
• MTF.
  – Output the position of the current symbol within Σ′
  – Keep the alphabet Σ′ = {a,b,c,d,e,…} ordered so that the last used
    symbol is moved to the beginning of Σ′.
• RLE.
  – If a value (0) appears several times (000000: 6 times),
    replace it by a pair <value,times>: <0,6>
• Huffman stage.
Bzip2: Burrows-Wheeler Transform (BWT)
Why does it work?
In a text it is likely that “he” is preceded by “t”, “ssi” by “i”, …
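The MTF and zero-run RLE stages of the pipeline can be sketched as follows (a minimal version; bzip2's actual run-length coding is more elaborate):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: output each symbol's current position in the
    alphabet, then move that symbol to the front. Runs of equal
    symbols (frequent after the BWT) become runs of 0s."""
    alpha = list(alphabet)
    out = []
    for c in s:
        i = alpha.index(c)
        out.append(i)
        alpha.insert(0, alpha.pop(i))   # move the used symbol to the front
    return out

def rle_zeros(seq):
    """Replace each run of 0s by a pair (0, run length)."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == 0:
            j = i
            while j < len(seq) and seq[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out
```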
91. Bit Sequences
• Bitmaps are a basic part of most Compact Data Structures
• Example: (we will see it later in the CSA)
  S: AAABBCCCCCCCCDDDEEEEEEEEEEFG   n log σ bits
  B: 1001010000000100100000000011   n bits
  D: ABCDEFG                        σ log σ bits
  – Saves space
  – Fast access/rank/select is of interest !!
    • Where is the 2nd C?
    • How many Cs up to position k?
Applications
92. Bit Sequences
• Jacobson, Clark, Munro
  – Variant by Fariña et al.
• Assuming a 32-bit machine word
• Step 1: Split the bitmap into superblocks of 256 bits, and
  store the number of 1s up to positions 1+256k
  – O(1) time to reach a superblock. Space: n/256 superblocks, 1 int each
Reaching O(1) rank and o(n) bits of extra space

[Figure: a bitmap divided into 256-bit superblocks containing 35, 27,
45, … bits set to 1; the directory Ds stores the cumulative 1-counts at
the superblock boundaries, e.g. Ds = ⟨35, 62, …⟩]
93. Bit Sequences
• Step 2: For each superblock of 256 bits
  – Divide it into 8 blocks of 32 bits each (machine-word size)
  – Store the number of 1s from the beginning of the superblock
  – O(1) time to reach the blocks; 8 blocks per superblock, 1 byte each
Reaching O(1) rank and o(n) bits of extra space

[Figure: each 256-bit superblock divided into 32-bit blocks containing
4, 6, … bits set to 1; the directory Db stores, for each block, the
number of 1s from the beginning of its superblock, e.g. Db = ⟨4, 10, …⟩]
94. Bit Sequences
• Step 3: Rank within a 32-bit block
Finally solving:
  rank1( D , p ) = Ds[ p / 256 ] + Db[ p / 32 ] + rank1(blk, i)
  where i = p mod 32
  – Ex: rank1(D,300) = 35 + 4 + 4 = 43
  – Yet, how to compute rank1(blk, i) in constant time?
Reaching O(1) rank and o(n) bits of extra space
blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1
      (bit positions 1..32)
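The three-level rank directory can be sketched in Python with the same superblock/block layout; counts are kept as plain ints rather than packed bytes, and the final in-block count is done naively instead of with a popcount table:

```python
class RankBitmap:
    """Two-level counting directory for O(1) rank1, as on the slides:
    Ds stores absolute 1-counts at 256-bit superblock boundaries,
    Db stores 1-counts relative to the enclosing superblock at each
    32-bit block boundary."""
    SB, B = 256, 32

    def __init__(self, bits):
        self.bits = bits                     # list of 0/1 ints
        self.Ds, self.Db = [], []
        acc = sb_acc = 0
        for i in range(0, len(bits), self.B):
            if i % self.SB == 0:
                self.Ds.append(acc)          # ones before this superblock
                sb_acc = acc
            self.Db.append(acc - sb_acc)     # ones since superblock start
            acc += sum(bits[i:i + self.B])

    def rank1(self, p):
        """Number of 1s in bits[0..p] (0-based, inclusive)."""
        blk = p // self.B
        return (self.Ds[p // self.SB] + self.Db[blk]
                + sum(self.bits[blk * self.B : p + 1]))
```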
96. Bit Sequences
select1(p)
• In practice, binary search using rank
  – Binary search on superblocks: O(log(n)) to find the superblock s
    containing the pth 1 → retval = Ds[s]
  – Sequential search within its blocks until reaching the block d that
    contains the position → retval += Db[d]
  – Sequential search (1 byte at a time) within the last 32 bits, using a
    onesInByte[] table, until reaching the byte b that contains the
    position.
    • In each iteration: retval += onesInByte[b]
  – Table lookup over a selb[] table for the last byte b
    • retval += selb[b]
  – Return retval
Select in O(log(n)) with the same structures
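The core idea, binary searching rank for the k-th 1, can be sketched without the block-level refinements of the slide (here rank is simply precomputed with a prefix sum; assumes the k-th 1 exists):

```python
from itertools import accumulate

def select1(bits, k):
    """0-based position of the k-th 1-bit (k >= 1), found by binary
    search on cumulative rank values -- the O(log n) strategy of the
    slide, minus the sequential byte scans and lookup tables."""
    ranks = list(accumulate(bits))     # ranks[p] = rank1 up to p
    lo, hi = 0, len(bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if ranks[mid] < k:
            lo = mid + 1
        else:
            hi = mid
    return lo
```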
97. Bit Sequences
• Compressed bitmap representations exist.
– Compressed [Raman et al]
– For very sparse bitmaps [Okanohara and Sadakane]
– …
Compressed representations
100. Integer Sequences
• Grossi et al.
• Given a sequence of symbols and an encoding
  – The bits of the code of each symbol are distributed along the
    different levels of the tree
Wavelet tree (construction)

DATA: A B A C D A C  →  00 01 00 10 11 00 10
SYMBOL CODES: A=00, B=01, C=10, D=11

WAVELET TREE:
  Broot:  A B A C D A C
          0 0 0 1 1 0 1       (first bit of each codeword)
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0 (second bit of each codeword)
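The construction above, plus the access and rank operations of the next slides, can be sketched as a toy class (balanced tree hard-wired to the 2-bit codes of the example, with naive bitmaps instead of rank/select structures):

```python
class WaveletTree:
    """Balanced wavelet tree for the slides' 4-symbol example
    (codes A=00, B=01, C=10, D=11). Each node stores one bitmap;
    access and rank traverse one level per code bit."""
    CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}

    def __init__(self, seq, depth=0):
        self.bits = [int(self.CODE[c][depth]) for c in seq]
        if depth + 1 < 2:   # codes have 2 bits -> one more level below
            self.children = (
                WaveletTree([c for c in seq if self.CODE[c][depth] == "0"], depth + 1),
                WaveletTree([c for c in seq if self.CODE[c][depth] == "1"], depth + 1),
            )
        else:
            self.children = None

    def access(self, i, prefix=""):
        """Symbol at position i (0-based): read one bit per level and
        descend into the child that holds this position."""
        b = self.bits[i]
        prefix += str(b)
        if self.children is None:
            return next(s for s, code in self.CODE.items() if code == prefix)
        j = self.bits[:i].count(b)     # position of i inside the child
        return self.children[b].access(j, prefix)

    def rank(self, c, i):
        """Occurrences of symbol c in positions 0..i (inclusive)."""
        node, pos, cnt = self, i, 0
        for b in (int(x) for x in self.CODE[c]):
            cnt = node.bits[:pos + 1].count(b)
            if cnt == 0:
                return 0
            pos = cnt - 1
            if node.children is not None:
                node = node.children[b]
        return cnt

wt = WaveletTree("ABACDAC")
```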
101. Integer Sequences
• Searching for the 1st occurrence of ‘D’ (code 11)?
Wavelet tree (select)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Bottom-up: ‘D’ is the 2nd bit in B1.
Where is the 1st ‘1’ in B1? At pos 2.
Where is the 2nd ‘1’ in Broot? At pos 5.
→ the 1st ‘D’ occurs at position 5 of the data.
102. Integer Sequences
• Recovering data: extracting a symbol
  – Which symbol appears in the 6th position?
Wavelet tree (access)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Broot[6]=0; how many ‘0’s are there up to pos 6? It is the 4th ‘0’.
Which bit occurs at position 4 in B0? It is set to 0.
The codeword read is ’00’ → A
103. Integer Sequences
• Recovering data: extracting a symbol
  – Which symbol appears in the 7th position?
Wavelet tree (access)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Broot[7]=1; how many ‘1’s are there up to pos 7? It is the 3rd ‘1’.
Which bit occurs at position 3 in B1? It is set to 0.
The codeword read is ’10’ → C
104. Integer Sequences
• How many C’s (code 10) up to position 7?
Wavelet tree (rank)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

How many ‘1’s are there up to pos 7 in Broot? 3 → it maps to the 3rd
position of B1.
How many ‘0’s up to position 3 in B1? 2 !!
(select locates a symbol bottom-up; access and rank descend top-down)
105. Integer Sequences
• Each level contains n + o(n) bits
• Rank/select/access in O(log σ) time
Wavelet tree (space and times)

DATA: A B A C D A C  →  00 01 00 10 11 00 10
SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1       n + o(n) bits
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0  n + o(n) bits
Total: n ⌈log σ⌉ (1 + o(1)) bits
106. Integer Sequences
• Using Huffman coding (or others) → unbalanced tree
• Rank/select/access in O(H0(S)+1) time on average
Huffman-shaped (or others) wavelet tree

DATA: A B A C D A C  →  1 000 1 01 001 1 01
SYMBOL CODES: A=1, B=000, C=01, D=001

  Broot:  A B A C D A C
          1 0 1 0 0 1 0     (‘1’ → leaf of A)
  B0: B C D C
      0 1 0 1               (‘1’ → leaf of C)
  B00: B D
       0 1                  (leaves of B and D)
Space: nH0(S) + o(n) bits
108. A brief Review about Indexing
• Traditional indexes (with or without compression)
  – Inverted Indexes, Suffix Arrays, ... (auxiliary structure + explicit text)
• Compressed Self-indexes
  – Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, … (implicit text)
Text Indexing: well-known structures from the Web
109. A brief Review about Indexing
Inverted indexes
Space-time trade-off

Indexed text:
  “DCC is held at the Cliff Lodge convention center. It is an
  international forum for current work on data compression and related
  applications. DCC addresses not only compression methods for specific
  types of data (text, image, video, audio, space, graphics, web
  content, etc.), but also the use of techniques from information
  theory and data compression in networking, communications, and
  storage applications involving large datasets (including image and
  information mining, retrieval, archiving, backup, communications,
  and HCI).”

Two variants of vocabulary + posting lists over this text:
  – Full-positional information: each vocabulary word (DCC,
    communications, compression, image, data, information, Cliff,
    Lodge, …) stores the offsets of all its occurrences,
    e.g. DCC → 0, 142.
  – Block-addressing: the text is divided into blocks (Block1, Block2)
    and each word stores only the blocks where it occurs,
    e.g. DCC → Block1.

Searches:
  Word → fetch the posting list of that word
  Phrase → intersection of postings
Compression:
  - Indexed text (Huffman, ...)
  - Posting lists (Rice, ...)
111. A brief Review about Indexing
• Sorting all the suffixes of T lexicographically
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12

A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12

Sorted suffixes:
  12: $
  11: a$
   8: abra$
   1: abracadabra$
   4: acadabra$
   6: adabra$
   9: bra$
   2: bracadabra$
   5: cadabra$
   7: dabra$
  10: ra$
   3: racadabra$
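The array A above can be built with a one-line sketch (naive O(n² log n) construction; practical suffix-array builders are far more sophisticated):

```python
def suffix_array(t):
    """Starting positions (1-based, as on the slide) of all suffixes
    of t, listed in lexicographic order of the suffixes."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

A = suffix_array("abracadabra$")
```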
112. A brief Review about Indexing
• Binary search for any pattern: “ab”
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12
A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12
P = a b
118. A brief Review about Indexing
• Binary search for any pattern: “ab”
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12
A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12
P = a b

The search narrows to the range A[3..4]:
  locations: Occs = A[3] .. A[4] = { 8, 1 };  Noccs = (4-3)+1 = 2
  Fast, but space-demanding: counting in O(m lg n) time, reporting in
  O(m lg n + noccs); A needs about 4n bytes on top of the text T.
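The binary search for the pattern range can be sketched as two bisections over the suffix array (naive O(m log n) comparisons, 1-based suffix positions as on the slides):

```python
def sa_locate(t, sa, p):
    """Occurrences of pattern p in t: binary search the suffix array
    for the contiguous range of suffixes that start with p, then
    report the sa entries in that range."""
    def bound(upper):
        # smallest index whose suffix-prefix is >= p (or > p if upper)
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = t[sa[mid] - 1 : sa[mid] - 1 + len(p)]
            if pref < p or (upper and pref == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sa[bound(False):bound(True)]
```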
119. Basic Compression
• BWT(S) + other structures → an index: the FM-index
• C[c]: for each char c in Σ, stores the number of
  occs in S of the chars that are lexicographically
  smaller than c.
  C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8
• Occ(c, k): number of occs of char c in the prefix
  L[1, k] of L.
  For k in [1..12]:
  Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
  Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
  Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
  Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
  Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
• Char L[i] occurs in F at position LF(i):
  LF(i) = C[L[i]] + Occ(L[i],i)
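The C and Occ structures and the LF formula can be sketched as below (0-based indices; Occ is computed by scanning, whereas a real FM-index would answer it with rank structures over a wavelet tree):

```python
from collections import Counter

def fm_structures(L):
    """C, Occ and LF for L = BWT(S). C[c] counts symbols smaller
    than c; occ(c, k) counts c in the first k chars of L; in
    0-based terms LF(i) = C[L[i]] + occ(L[i], i+1) - 1."""
    freq = Counter(L)
    C, acc = {}, 0
    for c in sorted(freq):        # chars in lexicographic order
        C[c] = acc                # number of smaller chars in L
        acc += freq[c]
    def occ(c, k):
        return L[:k].count(c)
    def lf(i):
        return C[L[i]] + occ(L[i], i + 1) - 1
    return C, occ, lf
```

With L = BWT(mississippi$) = "ipssm$pissii", this reproduces the slide's values (C[p]=6, and the 2nd ‘p’, at 1-based position 7, maps to 1-based position 8 of F).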
122. Bibliography
1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical
Report 124, Digital Systems Research Center, 1994.
http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/.
2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th
SPIRE, LNCS 5280, pages 176–187, 2008.
3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc.
12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM,
52(4):552–581, 2005.
5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994
6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text
indexing. In Proc. 17th SODA, pages 368–373, 2006.
7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th
SODA, pages 841–850, 2003.
123. Bibliography
8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings
of the IRE, 40(9):1098–1101, 1952.
9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE,
88(11):1722–1732, 2000
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp.,
22(5):935–948, 1993
11. Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002.
ISBN 0-7923-7668-4.
12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996.
13. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys,
39(1), article 2, 2007.
14. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th
ALENEX, 2007.
15. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding
k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
124. Bibliography
16. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and
flexible word searching on compressed text. ACM Transactions on Information Systems,
18(2):113–139, 2000.
17. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and
Indexing Documents and Images. Morgan Kaufmann, 1999.
18. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory 23, 3, 337–343.
19. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE
Transactions on Information Theory 24, 5, 530–536.
125. Onto some basics of:
compression, Compact Data Structures, and
indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A Martínez Prieto
126. Introduction
Compressed String Dictionaries
Experimental Evaluation
Dictionary Compression
Miguel A. Martínez-Prieto Antonio Fariña
Univ. of Valladolid (Spain) Univ. of A Coruña (Spain)
migumar2@infor.uva.es fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 1/47
127. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 2/47
128. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– What is a String Dictionary? –
String Dictionary
A string dictionary is a serializable data structure
which organizes all different strings (vocabulary) used
in a dataset.
The vocabulary of a natural language text (lexicon) comprises all different
words used in it.
T= “la tarara sí la tarara no la tarara niña que la he visto yo”
V= {he, la, niña, no, que, sí, tarara, visto, yo}
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 3/47
129. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
What is a String Dictionary?
The dictionary implements a bijective function that maps
strings to identifiers (IDs, generally integer values) and back.
It must provide, at least, two complementary operations:
string-to-ID: locates the ID for a given string.
ID-to-string: extracts the string identified by a given ID.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 4/47
130. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
What is a String Dictionary?
String dictionaries are a simple and effective tool:
Enable replacing (long, variable-length) strings by simple
numbers (their IDs).
T= “la tarara sí la tarara no la tarara niña que la he visto yo”
T’= 2 7 6 2 7 4 2 7 3 5 2 1 8 9
The resulting IDs are more compact to represent and easier
and more efficient to handle:
T= 59 chars × 1 byte/char = 59 bytes
T’= 14 IDs × ⌈log(9)⌉ bits/ID = 7 bytes
(plus the cost of dictionary encoding)
A compact dictionary which provides efficient mapping
between strings and IDs saves storage space, and
processing/transmission costs, in data-intensive
applications.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 5/47
131. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Compressing String Dictionaries
The growing volume of the datasets has led to increasingly large
dictionaries:
The dictionary size is a bottleneck for applications running under
restrictions of main memory.
Dictionary management is becoming a scalability issue by itself.
Dictionary compression aims to achieve competitive space/time tradeoffs:
Compact serialization.
Small memory footprint.
Efficient query resolution.
We focus on static dictionaries, which do not change during the
execution:
Many applications use dictionaries that either are static or are rebuilt only
sparingly.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 6/47
132. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– Operations –
A string dictionary is a data structure that represents a sequence of n
distinct strings, D = s1, s2, . . . , sn .
It provides a mapping between ID numbers i and strings si :
- locate(p) = i, if p = si for some i ∈ [1, n];
              0 otherwise.
- extract(i) returns the string si , for i ∈ [1, n].
Some other operations can be useful in specific applications:
Prefix-based locate / extract operations.
Substring-based locate / extract operations.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 7/47
133. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Prefix-based Operations
- locatePrefix(p) = {i, ∃y, si = py}.
This result set is a contiguous ID range for lexicographically sorted
dictionaries.
- extractPrefix(p) = {si , ∃y, si = py}.
It is equivalent to composing locatePrefix(p) with individual
extract(i) operations.
Finding all URIs in a given domain is an example of prefix-based
operation:
Look for all properties used in http://dataweb.infor.uva.es/movies:
http://dataweb.infor.uva.es/movies/property/director (4).
http://dataweb.infor.uva.es/movies/property/name (7).
http://dataweb.infor.uva.es/movies/property/title (12).
...
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 8/47
134. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Substring-based Operations
- locateSubstring(p) = {i, ∃x, y, si = xpy}.
It is very similar to the problem solved by full-text indexes.
- extractSubstring(p) = {si , ∃x, y, si = xpy}.
It is equivalent to composing locateSubstring(p) with individual
extract(i) operations.
Both operations may return duplicate results which must be removed
before reporting the ID result set.
regex query resolution in SPARQL is an example of substring-based
operation:
Look for all literals containing the substring Eastwood:
‘‘Clint Eastwood’’ (2544).
‘‘Jayne Eastwood is a Canadian actress...’’ (10584).
‘‘Kyle Eastwood’’ (13847).
...
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 9/47
135. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Summary
- locate(“tarara”) = 7
- extract(2) = la
- locatePrefix(“n”) = 3,4
- extractPrefix(“n”) = niña, no
- locateSubstring(“a”) = 2,3,7
- extractSubstring(“a”) = la, niña, tarara
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 10/47
136. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– RDF Dictionaries –
An RDF dictionary comprises all different terms used in the dataset:
RDF terms are drawn from three disjoint vocabularies: URIs, Literals, and
blank nodes.
Serialized (uncompressed) RDF vocabularies need up to 3 times more
space than (uncompressed) ID-triples [13].
URIs and Literals should be compressed and managed independently:
Their structure is very different and they are queried in a different way.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 11/47
137. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
URIs
URIs are medium-size strings sharing long prefixes:
Compressed dictionaries for URIs must exploit the continuous repetition of
such prefixes.
Prefix-based compression.
locate operations are common when the dictionary is used for lookup
purposes (e.g. RDF stores, semantic search engines, etc.).
extract operations are common when the dictionary is used for data
access purposes (e.g. decompression, result retrieval, etc.).
locatePrefix and extractPrefix are also useful for URI dictionaries.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 12/47
138. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Literals
Literals tend to be large-size strings with no predictable features:
The name “Clint Eastwood”.
The genome from an individual of any species.
The full text from “El Quijote”
...
Literal dictionaries must be based on universal compression.
locate and extract are used like in URI dictionaries.
locateSubstring and extractSubstring are useful because of
SPARQL needs.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 13/47
139. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Practical Configuration
A role-based partition is first performed:
Subjects are encoded in the range [1,|S|].
Predicates are encoded in the range [1,|P|].
Objects are encoded in the range [1,|O|].
URIs playing both subject and object roles are encoded
once:
IDs in [1,|SO|] encode subjects and objects.
Subjects are encoded in [|SO|+1,|S|].
Objects are encoded using two dictionaries:
1 [|SO|+1,|Ox|] encodes URIs which only act
as objects.
2 [|Ox|+1,|O|] encodes Literals.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 14/47
140. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 15/47
141. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Compressed String Dictionaries
All revised dictionaries combine notions from universal compression and
compact data structures.
Universal compressors must enable fast decompression and comparison of
individual strings:
Huffman [8] and Hu-Tucker [7, 9] codes.
Re-Pair [10].
The serialized vocabulary Tdict concatenates all strings in lexicographic
order:
A special symbol $ is used as separator.
T =“alabar a la alabada alabarda”
Tdict = a$alabada$alabar$alabarda$la$
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 16/47
142. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Front-Coding –
Front-Coding [15] is a folklore compression technique for lexicographically
sorted dictionaries.
It exploits the fact that consecutive entries are likely to share a common
prefix:
Each entry in the dictionary is differentially encoded with respect to the
preceding one.
It needs two values:
× An integer encoding the length of the shared prefix.
× The remaining characters of the current entry.
a$alabada$alabar$alabarda$la$
→ (0,a$); (1,labada$); (5, r$); (6, da$); (0, la$)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 17/47
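The differential encoding of the slide's example can be sketched directly (here the $ terminators are left implicit, so only the shared-prefix length and the remaining suffix are stored):

```python
def front_code(words):
    """Front-Coding over a lexicographically sorted list: each word
    is stored as (length of prefix shared with the previous word,
    remaining suffix)."""
    out, prev = [], ""
    for w in words:
        lcp = 0
        while lcp < min(len(w), len(prev)) and w[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, w[lcp:]))
        prev = w
    return out

def front_decode(pairs):
    """Rebuild the word list from its Front-Coded form."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        w = prev[:lcp] + suffix
        out.append(w)
        prev = w
    return out

pairs = front_code(["a", "alabada", "alabar", "alabarda", "la"])
```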
143. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Front-Coding
The vocabulary is divided into buckets of b strings:
The first string of each bucket (header) is explicitly stored.
The remaining b − 1 internal strings are differentially encoded.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 18/47
144. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Front-Coding Operations
locate(p):
1 Headers are binary searched until finding the bucket Bx where p must lie:
If the header is p, locate(p) = (b × (Bx − 1)) + 1.
2 The internal strings are sequentially decoded:
If the ith internal string is p, locate(p) = (b × (Bx − 1)) + i.
If the bucket is fully decoded with no result, p is not in the dictionary.
extract(i):
1 The string is encoded in the bucket Bx = ⌈i/b⌉.
2 ((i − 1) mod b) internal strings are decoded to obtain the answer.
Prefix-based operations exploit the lexicographic order:
Their results are contiguous ranges in the dictionary.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 19/47
145. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Plain Front-Coding (PFC)
PFC is a straightforward byte-oriented Front-Coding implementation:
It uses VByte [14] to encode the length of the common prefix.
The remaining string is encoded with one byte per character, plus the
terminator $.
PFC is serialized as a byte array (Tpfc ) and a ptrs structure:
Both structures are directly mapped to main memory for data retrieval
purposes.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 20/47
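PFC stores the shared-prefix length as a VByte. This is a sketch of one common VByte convention (7 data bits per byte, with the high bit set on the terminating byte; actual implementations differ in where they put the flag):

```python
def vbyte_encode(n):
    """Encode a non-negative integer as a VByte sequence: low 7 bits
    per byte, least-significant chunk first; the high bit marks the
    last byte of the number."""
    out = []
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)       # flag bit on the final byte
    return bytes(out)

def vbyte_decode(bs):
    """Decode a single VByte-encoded integer."""
    n = shift = 0
    for b in bs:
        n |= (b & 0x7F) << shift
        shift += 7
    return n
```

Small values, the common case for shared-prefix lengths, take a single byte.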
146. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hu-Tucker Front-Coding (HTFC)
HTFC is algorithmically similar to PFC, but it takes advantage of the Tpfc
redundancy to achieve a more compressed representation:
Operations are slightly slower than for PFC.
Headers are encoded using Hu-Tucker codes:
It allows compressed headers to be directly compared with the query
pattern.
Internal strings are encoded using Huffman or Re-Pair compression.
HTFC is serialized as a bit array (Thtfc ) and also a ptrs structure:
Pointers in HTFC use fewer bits because Thtfc is smaller than Tpfc .
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 21/47
147. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Hashing –
Hashing [3] is a folklore method to implement dictionaries:
A hash function transforms the string into an index x in the hash table.
A collision arises when two different strings are mapped to the same cell
in the table.
String dictionaries perform better with closed hashing [2]:
If the corresponding cell is not empty, one successively probes other cells
until finding a free cell.
The next cell to be probed is determined using double hashing.
Hash dictionaries provide very efficient locate, may support extract,
but the table size dissuades their use for managing large vocabularies.
Compressed hash dictionaries focus on compacting the table,
and also the vocabulary itself:
The vocabulary can be effectively compressed using Huffman or Re-Pair.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 22/47
148. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Vocabulary Compression
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 23/47
149. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Table Compression (I)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 24/47
150. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Table Compression (II)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 25/47
151. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Improving Data Access
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 26/47
152. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hashing Operations (locate)
locate(p):
1 The pattern p is compressed using Huffman: cp.
2 cp is “hashed” to a position x in the (original) hash table.
3 x is mapped to its corresponding position y in the compressed
representation.
4 The string pointed to by y is decompressed and compared to p.
locate(“alabada”)
1 Huffman(“alabada$”)=cp
2 hash(cp)=5
3 if B[5] = 1, rank1(B, 5)=4
if B[5] = 0, “alabada” is not in D.
4 strcmp(DAC[4],cp)=true → 4
strcmp(DAC[4],cp)=false → collision
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 27/47
153. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hashing Operations (extract)
extract(i):
1 The string is directly extracted from DAC[i].
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 28/47
154. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Self-Indexed Dictionaries –
A self-index stores the original text T and provides indexed searches on
it, using space proportional to the statistical entropy of T.
Self-indexes support two operations:
locate(p), returns all the positions in T where p occurs.
extract(i, j), retrieves the substring T [i, j].
A string dictionary can be easily self-indexed:
The corresponding self-index is built on the text Tdict .
The dictionary primitives (and also prefix and substring based queries) are
implemented using the self-index operations.
We choose the FM-Index [4, 5] because it is the most space-efficient
self-index in practice:
A $ symbol is prepended to the original Tdict .
The BWT (L) is represented with a wavelet tree (“plain” [5] and “compressed” [11]).
C is a simple array.
155. FM-Index Dictionary
156. FM-Index Dictionary (locate)
The i-th string is encoded between the (i+1)-th and the (i+2)-th $.
locate(p) performs a backwards search of $p$:
The pattern is searched from right to left until reaching the corresponding $.
locate(p) runs in O(|p| log σ) time.
157. FM-Index Dictionary (locate)
locate(’la’): looking for $la$.
1. Range: [C($), C(a)−1] = [0,5].
Number of a’s before the range: occs0 = rank_a(L, 0) = 0.
Number of a’s up to the end of the range: occs1 = rank_a(L, 5) = 4.
2. Range: [C(a)+occs0, C(a)+occs1−1] = [6,9].
Number of l’s before the range: occs0 = rank_l(L, 6) = 0.
Number of l’s up to the end of the range: occs1 = rank_l(L, 9) = 1.
3. Range: [C(l)+occs0, C(l)+occs1−1] = [24,25].
Number of $’s before the range: occs0 = rank_$(L, 24) = 5.
Number of $’s up to the end of the range: occs1 = rank_$(L, 25) = 6.
4. Range: [C($)+occs0, C($)+occs1−1] = [5,5].
’la’ is identified by 5.
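The backward search above can be sketched end to end. This toy version computes the BWT by sorting rotations and uses naive O(n) rank scans where a real FM-Index would use wavelet-tree rank in O(log σ); the dictionary content and all function names are illustrative.

```python
def bwt(text):
    n = len(text)
    rots = sorted(range(n), key=lambda i: text[i:] + text[:i])
    return ''.join(text[i - 1] for i in rots)   # last column L

def c_array(L):
    # C[c] = number of symbols in the text smaller than c
    return {c: sum(1 for x in L if x < c) for c in set(L)}

def rank(L, c, i):
    return L[:i].count(c)                   # occurrences of c in L[0..i-1]

def backward_search(L, C, p):
    sp, ep = 0, len(L)                      # current half-open range
    for c in reversed(p):                   # pattern read right to left
        if c not in C:
            return None
        sp = C[c] + rank(L, c, sp)
        ep = C[c] + rank(L, c, ep)
        if sp >= ep:
            return None
    return sp, ep

# Toy dictionary {ala, alabada, la} serialized with $ separators:
tdict = "$ala$alabada$la$"
L = bwt(tdict)
C = c_array(L)
assert backward_search(L, C, "$la$") is not None   # 'la' is in D
assert backward_search(L, C, "$lo$") is None       # 'lo' is not
```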
158. FM-Index Dictionary (extract)
extract(i) retrieves the symbols from the (i+1)-th $ back to the i-th $:
It takes O(|s_i| log σ) time.
Example: extract(5)
1. The search starts from position 0.
Extract the symbol at this position: access(L, 0) = a.
Number of a’s up to the position: occs = rank_a(L, 0) = 1.
2. Position: C(a) + 1 − 1 = 6.
Extract the symbol at this position: access(L, 6) = l.
Number of l’s up to the position: occs = rank_l(L, 6) = 1.
3. Position: C(l) + 1 − 1 = 24.
Extract the symbol at this position: access(L, 24) = $.
The 5-th string is ’la’.
159. FM-Index Dictionary (prefix & substring operations)
locatePrefix(p) is similar to locate:
It looks for $p and finds the area [sp,ep] where all strings s_i that
start with p are encoded.
Substring-based operations generalize prefix-based ones:
locateSubstring(p) looks for p to obtain the area [sp,ep] containing all
strings s_i that contain p.
For each match, the backwards search continues until determining the
corresponding ID (sampling structure).
Duplicate IDs are finally removed.
extractPrefix(p) and extractSubstring(p) perform extract operations in
the corresponding ranges.
160. – Other Dictionaries (Tries) –
Tries [9] are tree-shaped structures that perform efficiently for
dictionary purposes:
Strings are located from the root to the leaves.
IDs are extracted from the corresponding leaf back to the root.
However, tries use too much space for managing large dictionaries.
Some compressed trie-based dictionaries exist in the state of the art:
Compressed tries based on path decomposition [6].
LZ-compressed tries [1].
Self-indexed tries (XBW) [2].
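For intuition, a toy uncompressed trie with the locate operation (the compressed variants above offer the same interface in far less space; the class and method names here are our own, not from the cited works):

```python
class Trie:
    def __init__(self):
        self.root = {}
        self.next_id = 1

    def insert(self, s):
        node = self.root
        for ch in s:                        # walk/extend root-to-leaf path
            node = node.setdefault(ch, {})
        node['$'] = self.next_id            # terminal marker stores the ID
        self.next_id += 1

    def locate(self, s):                    # string -> ID, root to leaf
        node = self.root
        for ch in s:
            if ch not in node:
                return None
            node = node[ch]
        return node.get('$')                # None if s is only a prefix
```

extract (ID to string) walks leaf-to-root in the real structures; it is omitted here since this plain dict-of-dicts has no parent pointers.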
161. Introduction
Compressed String Dictionaries
Experimental Evaluation
URIs
Literals
Conclusions
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
162. Experimental Setup
Two real-world RDF dictionaries:
26,948,638 URIs from Uniprot:
Average length: 51.04 chars per URI.
Highly repetitive.
27,592,013 Literals from DBpedia:
Average length: 60.45 chars per Literal.
We analyze compression effectiveness and retrieval speed:
locate, extract.
Prefix-based operations (URIs).
Substring-based operations (Literals).
In practice, extract is the most important query:
It is used many times as results are retrieved from the compressed dataset.
163. – URIs –
Compressed tries (LexRP and CentRP) obtain the best compression results
and report the best numbers for locate:
≈ 4.5% of the original space.
≈ 2–3 µs/string.
> 2 µs/ID.
HTFC uses slightly more space, but it is faster for extract:
≈ 5–13% of the original space.
≈ 2.2–3 µs/string.
≈ 0.7–1.6 µs/ID.
The best tradeoff is for PFC:
≈ 9–19% of the original space.
≈ 1.6 µs/string.
≈ 0.3–0.6 µs/ID.
164. Prefix-based Operations
PFC is the best choice for prefix-based operations:
Although it uses more space, it reports the best performance.
165. – Literals –
Compressed tries (LexRP and CentRP) obtain competitive compression
results and report the best numbers for locate:
≈ 12% of the original space.
≈ 2–2.5 µs/string.
> 2.5 µs/ID.
HTFC reports the best compression ratios, but its performance is less
competitive:
≈ 9–17% of the original space.
≈ 4.5–40 µs/string.
≈ 3–20 µs/ID.
The best tradeoff is for Hash:
≈ 15% of the original space.
≈ 1.5 µs/string.
≈ 1 µs/ID.
167. – Conclusions –
RDF dictionaries are highly compressible:
URIs are very redundant, and Literals also show non-negligible symbolic
redundancy.
This redundancy can be detected and removed with specific data
structures for dictionaries:
Structures for URIs use up to 20 times less space than the original
dictionaries.
For Literals, the corresponding structures use 6–8 times less space than
the original dictionaries.
All these structures retrieve data at the microsecond level:
This functionality includes both simple and advanced operations.
168. GitHub
All dictionaries explained in this lecture (and some more [12]) are
available in the libCSD C++ library:
https://github.com/migumar2/libCSD
Beta version: suggestions are accepted ;)
169. Bibliography I
[1] Julian Arz and Johannes Fischer.
LZ-compressed string dictionaries.
In Proceedings of DCC, pages 322–331, 2014.
[2] Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro.
Compressed string dictionaries.
In Proceedings of SEA, pages 136–147, 2011.
[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms.
MIT Press and McGraw-Hill, 2nd edition, 2001.
[4] Paolo Ferragina and Giovanni Manzini.
Indexing compressed text.
Journal of the ACM, 52(4):552–581, 2005.
[5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro.
Compressed representations of sequences and full-text indexes.
ACM Transactions on Algorithms, 3(2):article 20, 2007.
[6] Roberto Grossi and Giuseppe Ottaviano.
Fast Compressed Tries through Path Decompositions.
In Proceedings of ALENEX, pages 65–74, 2012.
[7] T.C. Hu and Alan C. Tucker.
Optimal Computer-Search Trees and Variable-Length Alphabetic Codes.
SIAM Journal on Applied Mathematics, 21:514–532, 1971.
170. Bibliography II
[8] David A. Huffman.
A method for the construction of minimum-redundancy codes.
Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
[9] Donald E. Knuth.
The Art of Computer Programming, volume 3: Sorting and Searching.
Addison Wesley, 1973.
[10] N. Jesper Larsson and Alistair Moffat.
Offline dictionary-based compression.
Proceedings of the IEEE, 88:1722–1732, 2000.
[11] Veli Mäkinen and Gonzalo Navarro.
Dynamic entropy-compressed sequences and full-text indexes.
ACM Transactions on Algorithms, 4(3):article 32, 2008.
[12] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro.
Practical compressed string dictionaries.
Information Systems, 2015.
Under review.
[13] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas.
Querying RDF Dictionaries in Compressed Space.
SIGAPP Applied Computing Review, 12(2):64–77, 2012.
[14] Hugh E. Williams and Justin Zobel.
Compressing integers for fast file access.
The Computer Journal, 42:193–201, 1999.
171. Bibliography III
[15] Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann, 1999.
172. This presentation has been made for learning/teaching purposes only.
The pictures used in the slides may be owned by other parties; their ownership remains exclusively with their authors.
173. Triples Compression and Indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
175. RDF Management Overview
Dictionary + ID-triples
Original triples:
(SPIRE, held on, London)
(London, capital of, UK)
(A.Gionis, attends, SPIRE)
(R.Raman, attends, SPIRE)
(M.Lalmas, attends, SPIRE)
(M.Lalmas, lives in, UK)
(M.Lalmas, works in, London)
(A.Gionis, lives in, Finland)
(R.Raman, lives in, UK)
(R.Raman, position, inv-speaker)
Dictionary encoding (terms grouped by role, each role with its own ID range):
SO (subjects & objects): London=1, SPIRE=2
S (subjects only): A.Gionis=3, M.Lalmas=4, R.Raman=5
O (objects only): Finland=3, inv-speaker=4, UK=5
P (predicates): attends=1, capital of=2, held on=3, lives in=4, position=5, works in=6
ID-based triples:
(2,3,1)
(1,2,5)
(3,1,2)
(5,1,2)
(4,1,2)
(4,4,5)
(4,6,1)
(3,4,3)
(5,4,5)
(5,5,4)
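This dictionary encoding can be reproduced with a short sketch. The SO/S/O/P partition and the case-insensitive lexicographic ordering are inferred from the example on the slide; `encode` is an illustrative name, not part of any library.

```python
def encode(triples):
    subjects = {s for s, p, o in triples}
    objects = {o for s, p, o in triples}
    low = str.lower                                 # ordering used in the example
    so = sorted(subjects & objects, key=low)        # shared subject-objects
    s_only = sorted(subjects - objects, key=low)
    o_only = sorted(objects - subjects, key=low)
    preds = sorted({p for s, p, o in triples}, key=low)
    # SO terms take IDs [1..|SO|]; S-only and O-only continue from |SO|+1,
    # so a shared term gets the same ID whether it acts as subject or object
    s_id = {t: i + 1 for i, t in enumerate(so + s_only)}
    o_id = {t: i + 1 for i, t in enumerate(so + o_only)}
    p_id = {t: i + 1 for i, t in enumerate(preds)}  # predicates: own range
    return [(s_id[s], p_id[p], o_id[o]) for s, p, o in triples]
```

Running it on the ten triples of the slide yields exactly the ID-based triples shown, starting with (2,3,1) for (SPIRE, held on, London).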
193. K2-Triples: Joins
• They implemented three join strategies, taking advantage of the K2-Triples structure:
– Independent join
– Chain join
– Interactive join
• Example query: (8,5,?X) (?X,2,?), evaluated over the matrices of predicates P5 and P2.
• The best strategy depends on the dataset and the type of join.
199. Compressed Suffix Array (CSA-SAD)
Back to Suffix Arrays
• Binary search for any pattern, e.g. P = “ab”:
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Occurrences: Occs = A[3]..A[4] = {8, 1}; noccs = (4−3)+1 = 2.
Fast, but large: O(m lg n) search time (O(m lg n + noccs) to locate), using
O(4n) bytes of space plus the text T.
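The slide's example can be checked with a few lines, using naive construction by sorting suffixes (fine for small texts; real builders are linear-time):

```python
from bisect import bisect_left

T = "abracadabra$"
# 1-based suffix array: starting positions sorted by their suffixes
A = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
suffixes = [T[i - 1:] for i in A]             # only conceptual in practice

def occurrences(P):
    sp = bisect_left(suffixes, P)             # first suffix >= P
    ep = bisect_left(suffixes, P + "\uffff")  # first suffix past P-prefixed ones
    return A[sp:ep]

assert A == [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
assert occurrences("ab") == [8, 1]            # the slide's A[3]..A[4]
assert occurrences("zz") == []
```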
200. Compressed Suffix Array (CSA-SAD)
CSA basics
• Can we reduce the space needs of a Suffix Array?
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
A points to the sorted suffixes: $, a$, abra$, abracadabra$, acadabra$,
adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$.
201. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ: A[Ψ(i)] = A[i] + 1
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
(Figure: the sorted suffixes, with Ψ(i) linking each suffix to the one
starting one position later in T.)
203. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ: A[Ψ(10)] = A[3] = A[10] + 1 = 8
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ(10) = 3: A[10] = 7 points to the suffix “dabra$”, and A[3] = 8 points to
the next one, “abra$”.
204. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ and F are enough to perform the binary search and to recover the
source data!
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
F = $ a a a a a b b c d r r
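That claim can be verified in a few lines: build A naively, derive Ψ and F from it, and then reconstruct T using only Ψ and F (0-based ranks here, versus the 1-based slides):

```python
T = "abracadabra$"
n = len(T)
A = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based positions
inv = {A[i]: i for i in range(n)}                      # A^-1 (0-based ranks)
psi = [inv[A[i] % n + 1] for i in range(n)]            # A[psi[i]] = A[i]+1 (cyclic)
F = ''.join(T[A[i] - 1] for i in range(n))             # first column

# The '$' suffix has rank 0, so the suffix starting at text position 1 has
# rank psi[0]; following psi while reading F spells out T left to right.
i, out = psi[0], []
for _ in range(n):
    out.append(F[i])
    i = psi[i]
assert ''.join(out) == T                               # T recovered from Ψ and F
```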
205. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ and F (reducing space needs)
F = $ a a a a a b b c d r r   (positions 1..12)
D = 1 1 0 0 0 0 1 0 1 1 1 0   (bitmap marking where F changes)
S = $ a b c d r   (sorted alphabet)
206. Compressed Suffix Array (CSA-SAD)
Representing F
F = $ a a a a a b b c d r r   (positions 1..12)
D = 1 1 0 0 0 0 1 0 1 1 1 0   (bitmap)
S = $ a b c d r   (sorted alphabet)
• Example: F[8] = S[rank1(D, 8)] = S[3] = ‘b’.
rank1(D, i) takes O(1) time, using o(n) extra space.
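The D + S representation of F can be sketched directly (with a naive linear rank standing in for the O(1) structure):

```python
F = "$aaaaabbcdrr"                       # first column; 1-based in the slides
D = [1 if i == 0 or F[i] != F[i - 1] else 0 for i in range(len(F))]
S = sorted(set(F))                       # sorted alphabet: $ a b c d r

def rank1(D, i):                         # ones in D[1..i] (1-based); O(1)
    return sum(D[:i])                    # with o(n) extra space in practice

def F_at(i):                             # F[i] = S[rank1(D, i)], 1-based
    return S[rank1(D, i) - 1]            # -1 only because S is 0-based here

assert D == [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
assert F_at(8) == 'b'                    # the slide's S[rank1(D,8)] = S[3]
```

Storing D and S instead of F replaces n symbols by n bits plus σ symbols.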
207. Compressed Suffix Array (CSA-SAD)
Compressing Ψ
• Absolute samples (k = sample period).
• Gap encoding on increasing values: Huffman & run encoding.
• Huffman with an N-entry dictionary:
– k reserved Huffman codes encode 1-runs of size s ∈ [1..k−1].
– 32 + 32 Huffman codes represent the size (in bits) of large values [+ or −];
each is followed by the value itself, encoded with log(v) bits.
– The remaining N − k − 32 − 32 entries correspond to the most frequent gap values.
Example: Ψ = 11 6 7 12 1 4 9 10 8 2 3 5, stored as absolute samples sΨ plus
an encoded gap stream Δ (bit stream omitted).
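A stripped-down version of the sampling idea: absolute samples every k values and plain integer gaps in between. The Huffman and run-length layers of the slide are left out, so gaps here are raw (possibly negative) integers; `build` and `access` are illustrative names.

```python
def build(psi, k):
    samples = psi[::k]                              # absolute samples
    gaps = [psi[i] - psi[i - 1]                     # k-1 gaps per block
            for i in range(len(psi)) if i % k]
    return samples, gaps

def access(samples, gaps, k, i):                    # recover psi[i]
    v = samples[i // k]                             # nearest sample <= i
    base = (i // k) * (k - 1)                       # start of its gap block
    for j in range(i % k):                          # add at most k-1 gaps
        v += gaps[base + j]
    return v

psi = [11, 6, 7, 12, 1, 4, 9, 10, 8, 2, 3, 5]       # the slide's toy Ψ
samples, gaps = build(psi, 4)
assert all(access(samples, gaps, 4, i) == psi[i] for i in range(12))
```

The sample period k trades space (fewer samples) against access time (more gaps to sum), which is the space/time parameter mentioned on the next slide.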
208. Compressed Suffix Array (CSA-SAD)
Complete structure
• Components (parameters set a space/time trade-off):
– Ψ (sampled) + gap encoding (delta codes, Huffman-based codes, run encoding), D, and S → count.
– A (sampled) → locate.
– A^-1 (sampled) → extract.
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
A^-1 = 4 8 12 5 9 6 10 3 7 11 2 1
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
D = 1 1 0 0 0 0 1 0 1 1 1 0
S = $ a b c d r
210. RDF-CSA
Building RDF-CSA
• Step 1: Integer dictionary encoding of s, p, o.
• Step 2: Ordered list of n triples (a sequence Sid of 3n elements).
We first sort by subject, then by predicate, and finally by object.
211. RDF-CSA
Building RDF-CSA
• Step 3: Sid is transformed into S in order to keep disjoint alphabets:
Range [1, ns] for subjects.
Range [ns+1, ns+np] for predicates.
Range [ns+np+1, ns+np+no] for objects.
Due to this alphabet mapping, every subject is smaller than every predicate,
which in turn is smaller than every object!
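A one-liner makes Step 3 concrete (ns, np, no stand for the numbers of distinct subjects, predicates, and objects; `remap` is our name for the transformation):

```python
def remap(id_triples, ns, np):
    # subjects keep [1, ns]; predicates move to [ns+1, ns+np];
    # objects move to [ns+np+1, ns+np+no]
    return [(s, ns + p, ns + np + o) for s, p, o in id_triples]

mapped = remap([(2, 3, 1), (1, 2, 5)], ns=5, np=6)
assert mapped == [(2, 8, 12), (1, 7, 16)]
# disjoint alphabets: every subject id < every predicate id < every object id
assert max(s for s, p, o in mapped) < min(p for s, p, o in mapped)
assert max(p for s, p, o in mapped) < min(o for s, p, o in mapped)
```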
212. RDF-CSA
Building RDF-CSA
• Step 4: We build an iCSA on S.
A has three ranges: each range points to suffixes starting with a subject, a
predicate, or an object.
Ψ cycles around the components of the same triple; that is, the object of a
triple k does not point to the subject of triple k+1 in S, but to the subject
of the same triple.
We can start at position A[i], pointing to any place within a triple (s,p,o),
and recover the triple by successive applications of Ψ.
213. RDF-CSA
Searching for triple patterns
• Patterns: (S,P,O), (?S,P,O), (S,?P,O), (S,P,?O), (?S,?P,O), (S,?P,?O),
(?S,P,?O), (?S,?P,?O).
– Patterns with just one bounded element are directly solved using select on D.
– Pattern (?S,?P,?O) retrieves all the triples, so it can be solved by
retrieving every i-th triple using Ψ.
– The rest of the patterns use binary iCSA search:
• (S,P,O): bsearch(SPO, 3).
• (?S,P,O): bsearch(PO, 2) … (S,?P,O): bsearch(OS, 2).
– Optimizations:
• D-select+forward-check strategy: find the valid intervals in the S, P, and O
ranges, and check matches with Ψ within those intervals, starting from the
shortest one.
• D-select+backward-check strategy: use binary search to limit the valid
intervals, instead of sequentially verifying each position of the shortest
interval.
Optimizations are applicable to pattern (S,P,O) and to those with just one
unbounded term!
214. RDF-CSA
Searching for triple patterns
• (S,P,O) optimizations:
– D-select+forward-check strategy: find the valid intervals in the S, P, and O
ranges, and check matches with Ψ within those intervals, starting from the
shortest one.
– D-select+backward-check strategy: use binary search to limit the valid
intervals, instead of sequentially verifying each position of the shortest
interval.
(Figure: worked example for the pattern S=8, P=4, O=261, showing the
SP/SPO intervals checked forwards and the SPO/PO intervals checked
backwards.)