Linked Data Compression
Miguel A. Martínez-Prieto (Univ. of Valladolid, Spain), migumar2@infor.uva.es
Antonio Fariña (Univ. of A Coruña, Spain), fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Miguel A. Martínez-Prieto & Antonio Fariña, Linked Data Compression, 1/53
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– What is Linked Data? –
Linked Data
Linked Data is simply about using the Web to create typed links
between data from different sources [3].
Linked Data refers to a set of best practices for publishing and
connecting data on the Web.
These best practices have been adopted by an increasing number of data
providers, leading to the creation of a global data space:
Data are machine-readable.
Data meaning is explicitly defined.
Data are linked from/to external datasets.
The resulting Web of Data connects data from different domains:
Publications, movies, multimedia, government data, statistical data, etc.
The Web... of Data
The emergence of the Web was a genuine revolution 15 years ago:
Changed the way we consume information.
Changed human relationships.
Changed businesses.
...
The Web
The Web is a global space comprising linked HTML documents:
Web pages are the atoms of the Web.
Each page is uniquely identified by its URL.
The Web
Where are the (raw) data on the Web?
Web pages “cook” raw data in a human-readable way.
This is, arguably, the main problem of the WWW.
The Web
- I was excited about the Keystone Training School and looked
for information about this nice country.
- I typed “malta” into a web search engine, and...
I found some relevant results for my query! :)
But others seem a little strange to my (current) expectations... :(
The Web... of Data
Raw data are hidden within web page contents:
In general, data are written in HTML paragraphs.
In the best case, they are structured as HTML tables or
published as additional documents (CSV, XML...).
In any case, HTML is not expressive enough to describe and link individual
data entities on the Web:
HTML-based descriptions lose the semantics and structure of the raw
data.
This makes automatic data processing on the Web very difficult.
The Web of Data [8] converts raw data into first-class citizens of the
Web...
Data entities are the atoms of the Web of Data.
Each entity has its own identity.
...and uses existing infrastructure:
It uses HTTP as communication protocol.
Entities are named using URIs.
The Web of Data is a cloud of data-to-data hyperlinks [5]:
These are labelled hyperlinks in contrast to the “plain” ones used
in the Web.
Thus, hyperlinks also provide semantics to data descriptions.
Linked Data builds a Web of Data using the Internet infrastructure:
Data providers can publish their raw data in a standardized way.
These data can be interconnected using labelled hyperlinks.
The resulting cloud of data can be navigated using specific query
languages.
Linked Data achievements:
Knowledge from different fields can be easily integrated and universally
shared.
Automatic processes can exploit this knowledge to build innovative
software systems.
Semantic Search Engine
For instance, a semantic search engine would allow us to retrieve only the entities
that describe “malta” as a country, not as a cereal.
– Linked Data Principles –
Tim Berners-Lee [2] suggests four basic principles for Linked Data:
1 Use URIs as names for things.
2 Use HTTP URIs so that people can look up those names.
3 When someone looks up a URI, provide useful information, using the
standards (RDF, SPARQL).
4 Include links to other URIs, so that they can discover more things.
1. URIs as names
What is his name?
For humans, his name is Clint Eastwood...
... but http://dataweb.infor.uva.es/movies/people/Clint_Eastwood is a
better name for machines.
The use of URIs enables real-world entities (or their relationships with
other entities) to be identified at universal scale.
This principle ensures any class of data has its own identity in the global
space of the Web of Data.
2. HTTP URIs
All entities must be described using dereferenceable URIs:
These URIs are accessible via HTTP.
This principle exploits HTTP features to retrieve all data related to a
given URI.
3. Standards
This principle states that all
stakeholders “must speak the same
languages” for effective
understanding.
RDF [10] provides a simple logical
model for data description.
SPARQL [12] describes a specific
language for querying RDF data.
Serialization formats, ontology
languages, etc.
4. Linking URIs
This principle materializes the aim of data integration in Linked Data:
Linking two URIs establishes a particular connection between two existing
entities.
Linking URIs
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood names the entity which
describes “Clint Eastwood”.
http://dataweb.infor.uva.es/movies/film/Mystic_River names the entity which describes
the movie “Mystic River”.
A hyperlink between these two URIs states that the entity “Clint Eastwood” is
related to the entity “Mystic River”... how?
The labelled link provides a semantic relationship between entities.
In this case, http://dataweb.infor.uva.es/movies/property/director tags the
“director” relationship between “Clint Eastwood” and “Mystic River”.
– Linked Open Data –
The Linked Open Data (LOD) project (http://linkeddata.org/) promotes
publishing Linked Data as Open Data:
LOD is released under an open license which does not impede its reuse for
free [2].
LOD is the highest level in the 5-star scheme (http://5stardata.info/) for
Open Data publication:
The dataset is available on the Web under an open license.
The dataset is available as structured data.
The dataset is encoded using a non-proprietary format.
The dataset names entities using URIs.
The dataset is linked to other datasets.
LOD (2007-2011)
LOD (2014)
Current Statistics (July, 2015)
9,960 datasets are openly available (http://stats.lod2.eu/):
90 billion statements from 3,308 datasets.
6,639 datasets could not be crawled for different reasons.
LOD Laundromat (http://lodlaundromat.org/) provides access to more than
38 billion statements from 650K “cleaned” datasets.
DBpedia 2014 contains more than 3 billion statements:
538 million statements from the English Wikipedia.
2.46 billion statements from other language editions.
50 million statements linking to external datasets.
More and more datasets are released, and they are getting bigger:
The largest ones are in the order of hundreds of GB.
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– Overview –
Semantic Technologies (the middle layers of the Semantic Web stack)
exploit features from the Web infrastructure (the lower layers):
RDF is used for resource
description.
RDFS is used for describing
semantic vocabularies.
OWL extends RDFS and is used
for building ontologies.
SPARQL is the query language for
RDF data.
RIF is used for describing rules.
RDF & SPARQL
RDF & SPARQL are the most relevant technologies for our current aims:
Both standards are based on labelled directed graph features.
– RDF –
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/name,
 “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/film/Mystic_River,
 http://dataweb.infor.uva.es/movies/property/title,
 “Mystic River”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/director,
 http://dataweb.infor.uva.es/movies/film/Mystic_River)
RDF [10] is a framework for describing resources of any class:
People, movies, cities, proteins, statistical data...
Resources are described in the form of triples:
Subject: the resource being described.
Predicate: a property of that resource.
Object: the value for the corresponding property.
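The triple model can be sketched with plain tuples; the “mv:” prefix below is our own shorthand for http://dataweb.infor.uva.es/movies/ (an illustration, not a standard serialization):

```python
# The three example triples from this slide, as (subject, predicate, object)
# tuples. "mv:" is a hypothetical abbreviation used only for readability.
triples = [
    ("mv:people/Clint_Eastwood", "mv:property/name", "Clint Eastwood"),
    ("mv:film/Mystic_River", "mv:property/title", "Mystic River"),
    ("mv:people/Clint_Eastwood", "mv:property/director", "mv:film/Mystic_River"),
]

def describe(subject, triples):
    # Collect every (predicate, object) pair describing the given resource.
    return [(p, o) for s, p, o in triples if s == subject]
```

All properties of a resource are then just the triples sharing its subject.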
RDF Triples
An RDF triple is a labelled directed subgraph in which subject and object
nodes are linked by a particular (predicate) edge:
The subject node contains the URI which names the resource.
The predicate edge labels the relationship using a URI whose semantics is
described by some vocabulary/ontology.
The object node may contain a URI or a (string) Literal value.
RDF links (between entities) also take the form of RDF triples.
RDF Graph
This graph view is only a mental model:
RDF graphs must be serialized!
However, the RDF Recommendation does not restrict the format to be used.
RDF Serialization Formats
Traditional plain formats are commonly used:
RDF/XML, N-Triples, Turtle...
These formats are very verbose in practice:
Data are serialized in a (more or less) human-readable way.
Large RDF files are typically compressed using gzip or bzip2.
– SPARQL –
SPARQL [12] is a query language for RDF.
It is based on graph pattern matching:
Triple patterns are RDF triples in which the subject, predicate, and/or object
may be variables.
SPARQL supports more complex queries: joins, unions, filters...
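Triple-pattern evaluation, the core of SPARQL resolution, can be sketched as follows (variables start with “?”; the data and names are illustrative, and repeated variables are not handled in this sketch):

```python
def match(pattern, triples):
    # Evaluate one triple pattern: positions starting with "?" are variables,
    # any other position must equal the triple's value exactly.
    # Returns one binding dict per matching triple.
    results = []
    for triple in triples:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            results.append(binding)
    return results

# Hypothetical toy dataset (shortened names, not real URIs).
triples = [
    ("Clint_Eastwood", "director", "Mystic_River"),
    ("Clint_Eastwood", "director", "Million_Dollar_Baby"),
    ("Mystic_River", "title", "Mystic River"),
]
```

For example, the pattern `("Clint_Eastwood", "director", "?film")` binds `?film` once per movie he directed; joins, unions, and filters are built on top of such pattern matches.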
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
What is the problem?
RDF excels at the logical level:
Structured and semi-structured data can be described using RDF triples.
Entities are also linked in the form of RDF triples.
But it is a source of redundancy at the physical level:
Serialization formats are highly verbose.
RDF data are redundant at three levels: semantic, symbolic, and
syntactic.
– Semantic Compression –
Semantic redundancy occurs when the same meaning can be conveyed
using fewer triples.
(http://dataweb.infor.uva.es/movies/property/name,
 http://www.w3.org/2000/01/rdf-schema#domain,
 http://dataweb.infor.uva.es/movies/classes/person)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://dataweb.infor.uva.es/movies/property/name,
 “Clint Eastwood”)
(http://dataweb.infor.uva.es/movies/people/Clint_Eastwood,
 http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
 http://dataweb.infor.uva.es/movies/classes/person)
The third triple is redundant because the first one states that the URI
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood describes an entity in the
domain of http://dataweb.infor.uva.es/movies/classes/person.
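A toy version of this single rule can be sketched in Python (the prefixed names such as rdfs:domain are abbreviations; real semantic compressors [9, 11, 13] apply many such rules):

```python
RDFS_DOMAIN = "rdfs:domain"
RDF_TYPE = "rdf:type"

def remove_entailed_types(triples):
    # Illustrative rule: if (p, rdfs:domain, C) is stated and some triple
    # (s, p, o) uses predicate p, then (s, rdf:type, C) is entailed by RDFS
    # and can be dropped; a reasoner can re-derive it on demand.
    domains = {s: o for s, p, o in triples if p == RDFS_DOMAIN}
    entailed = {(s, RDF_TYPE, domains[p]) for s, p, o in triples if p in domains}
    return [t for t in triples if t not in entailed]
```

Applied to the three triples above, the explicit rdf:type triple is removed and only two triples remain.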
Semantic Compression
Semantic compressors operate at the logical level:
They detect redundant triples and remove them from the original dataset.
Semantic compressors [9, 11, 13] are not very effective by themselves...
... but they may be combined with symbolic and syntactic compressors!
– Symbolic Compression –
Symbolic redundancy is due to symbol repetitions in triples:
This is the “traditional” source of redundancy removed by universal
compressors.
Symbolic redundancy in RDF is mainly due to URIs:
URIs tend to be very long strings which share long prefixes.
http://dataweb.infor.uva.es/movies/film/Bird
http://dataweb.infor.uva.es/movies/film/Million_Dollar_Baby
http://dataweb.infor.uva.es/movies/film/Mystic_River
http://dataweb.infor.uva.es/movies/people/Clint_Eastwood
...
... but literals also contribute to this redundancy.
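One classic way to exploit those shared prefixes is front-coding over the sorted strings; a minimal sketch (illustrative, not any specific system's format):

```python
def front_code(sorted_strings):
    # Store each string as (length of the prefix shared with the previous
    # string, remaining suffix): long common URI prefixes are kept only once.
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    # Rebuild each string from the previous one plus the stored suffix.
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

On the four URIs above, only the first one is stored in full; the rest store a couple of dozen characters at most.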
Symbolic Compression
The most prominent RDF compressors remove symbolic redundancy:
All distinct URIs/literals are indexed in a string dictionary.
Each string is identified by a unique integer ID.
Triples are then rewritten by replacing each string with its corresponding ID.
Symbolic redundancy is, in general, the most important redundancy in RDF,
and it leaves much room for optimization.
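The dictionary step can be sketched as follows (one shared mapping is an assumption to keep the sketch short; real RDF compressors keep separate mappings per role):

```python
def dict_encode(triples):
    # Build a string dictionary assigning consecutive integer IDs on first
    # sight, then rewrite every triple over those IDs.
    ids = {}

    def get_id(term):
        if term not in ids:
            ids[term] = len(ids) + 1
        return ids[term]

    id_triples = [(get_id(s), get_id(p), get_id(o)) for s, p, o in triples]
    return ids, id_triples
```

Each long string is stored once in the dictionary; the (much smaller) ID triples are what the syntactic stage compresses further.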
– Syntactic Compression –
Syntactic redundancy depends on the RDF graph serialization:
For instance, a serialized subset of n triples (which describe the same
resource) writes the subject value n times. This can be abbreviated.
... and also on the underlying graph structure:
For instance, resources of the same class are described using (almost)
the same sub-graph structure.
Syntactic compression also leaves much room for optimization.
Syntactic Compression
HDT [7], k2-triples [1], and RDFCSA [4] are syntactic compressors
reporting good numbers:
They are combined with symbolic compression.
In practice, they compress RDF triples in the form of ID triples.
Semantic compressors such as SSP [11] also remove symbolic and
syntactic redundancy.
Outline
1 Linked Data
2 Semantic Technologies
3 RDF Compression
4 HDT
– What is HDT? –
HDT was the first binary serialization format for RDF:
It was acknowledged as a W3C Member Submission [6] in 2011.
It exploits symbolic and syntactic redundancy:
It reduces the space used by traditional formats by up to 15 times [7].
HDT is a core building block in some Linked Data applications:
It reports good compression numbers, but also provides efficient data
retrieval.
– Components –
HDT encodes RDF data into three components:
The Header (H) comprises descriptive metadata.
The Dictionary (D) maps different strings (from nodes and edges) to IDs:
It manages four independent mappings: subjects-objects, subjects, objects, and
predicates.
The Triples (T) component encodes the inner structure as a graph of IDs.
HDT Components
The Dictionary is encoded using specific compression techniques for string
dictionaries.
Triple IDs are organized into a forest of trees (one per different subject)...
...which is encoded using two bitsequences and two ID sequences.
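The forest-of-trees idea can be sketched in Python; this is a simplified illustration of HDT's BitmapTriples layout (1 marks the last element of each adjacency list), not the exact on-disk format:

```python
def bitmap_triples(id_triples):
    # Sort triples by (s, p, o); subjects become implicit (1, 2, ...).
    # Predicates and objects are stored as two ID sequences (sp, so) plus
    # two bitsequences (bp, bo) where 1 closes an adjacency list.
    sp, bp, so, bo = [], [], [], []
    prev_s, prev_sp = None, None
    for s, p, o in sorted(id_triples):
        if s != prev_s:            # a new subject tree starts
            if bp:
                bp[-1] = 1         # close the previous subject's predicate list
            prev_s, prev_sp = s, None
        if (s, p) != prev_sp:      # a new predicate under this subject
            if bo:
                bo[-1] = 1         # close the previous predicate's object list
            sp.append(p)
            bp.append(0)
            prev_sp = (s, p)
        so.append(o)
        bo.append(0)
    if bp:
        bp[-1] = 1
    if bo:
        bo[-1] = 1
    return sp, bp, so, bo
```

Subject IDs never need to be stored: the i-th predicate list (delimited by the 1s in bp) belongs to subject i.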
– Conclusions –
HDT integrates RDF serialization and compression into a practical
format:
HDT saves storage space and enables efficient data parsing/retrieval
using bit operations.
Symbolic redundancy is addressed by the Dictionary component:
The collection of strings (in the dictionary) has high symbolic
redundancy...
The dictionary itself is highly compressible!
Syntactic redundancy is removed by the Triples component:
HDT Triples is a straightforward compressor.
Its effectiveness can be improved using optimized graph compression
techniques.
Bibliography I
[1] Sandra Álvarez-García, Nieves Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto, and Gonzalo Navarro.
Compressed Vertical Partitioning for Efficient RDF Management.
Knowledge and Information Systems (KAIS), 44(2):439–474, 2015.
[2] Tim Berners-Lee.
Linked Data, 2006.
http://www.w3.org/DesignIssues/LinkedData.html.
[3] Christian Bizer, Tom Heath, and Tim Berners-Lee.
Linked Data - The Story So Far.
International Journal of Semantic Web and Information Systems, 5(3):1–22, 2009.
[4] Nieves Brisaboa, Ana Cerdeira, Antonio Fariña, and Gonzalo Navarro.
A Compact RDF Store using Suffix Arrays.
In Proceedings of SPIRE, 2015. To appear.
[5] Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez.
Management of Big Semantic Data.
In Big Data Computing, chapter 4. Taylor and Francis/CRC, 2013.
[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, and Axel Polleres.
Binary RDF Representation for Publication and Exchange.
W3C Member Submission, 2011. www.w3.org/Submission/HDT/.
Bibliography II
[7] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias.
Binary RDF Representation for Publication and Exchange.
Journal of Web Semantics, 19:22–41, 2013.
[8] Tom Heath and Christian Bizer.
Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool, 1st edition, 2011. http://linkeddatabook.com/.
[9] Amit K. Joshi, Pascal Hitzler, and Guozhu Dong.
Logical Linked Data Compression.
In Proceedings of ESWC, pages 170–184, 2013.
[10] Frank Manola and Eric Miller.
RDF Primer.
W3C Recommendation, 2004. www.w3.org/TR/rdf-primer/.
[11] Jeff Z. Pan, José Manuel Gómez-Pérez, Yuan Ren, Honghan Wu, and Man Zhu.
SSP: Compressing RDF data by Summarisation, Serialisation and Predictive Encoding.
Technical report, 2014. Available at http://www.kdrive-project.eu/wp-content/uploads/2014/06/WP3-TR2-2014 SSP.pdf.
[12] Eric Prud'hommeaux and Andy Seaborne.
SPARQL Query Language for RDF.
W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/.
Bibliography III
[13] Gayathri V. and P. Sreenivasa Kumar.
Horn-Rule based Compression Technique for RDF Data.
In Proceedings of SAC, pages 396–401, 2015.
This presentation has been made available only for learning/teaching purposes.
The pictures used in the slides may be owned by other parties; their ownership remains exclusively with their authors.
Onto some basics of:
compression, Compact Data Structures, and
indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
Introduction
Why compression?
• Disks are cheap!! But they are also slow!
– Compression can help more data fit in main memory
(access to memory is around 10^6 times faster than HDD).
• CPU speed is increasing faster.
– We can trade processing time (needed to uncompress
data) for space.
Introduction
Why compression?
• Compression does not only reduce space!
– I/O access on disks and networks
– Processing time* (less data has to be processed)
• ... if appropriate methods are used
– For example: allowing data to be handled in compressed form all the time.
(Figure: a text collection (100%) of documents Doc 1 ... Doc n; a compressed
version takes 30%, and a p7zip-compressed version takes 20%; we want to
search them for “Malta”.)
Introduction
Why indexing?
• Indexing permits sublinear search time.
(Figure: the text collection (100%) or its compressed version (30%), plus an
index over the terms (term 1 ... Malta ... term n) that adds a further > 5-30%
of space; the search for “Malta” is answered through the index.)
Introduction
Why Compact Data Structures?
• Self-indexes:
– sublinear search time
– the text is implicitly kept
(Figure: a self-index (WT, WCSA, ...), built from compact bitsequences,
replaces both the text collection and the term index (> 5-30%); the search
for “Malta” runs directly on the compact structure.)
Basic Compression
Modeling & Coding
• A compressor could use as a source alphabet:
– A fixed number of symbols (statistical compressors)
• 1 char, 1 word
– A variable number of symbols (dictionary-based compressors)
• 1st occurrence of ‘a’ encoded alone, 2nd occurrence encoded with the next one: ‘ax’
• Codes are built using symbols of a target alphabet:
– Fixed-length codes (1 bit, 10 bits, 1 byte, 2 bytes, ...)
– Variable-length codes (1, 2, 3, 4 bits/bytes ...)
• Classification (fixed-to-variable, variable-to-fixed, ...):
– fixed input alphabet, variable target alphabet: statistical compressors
– variable input alphabet, fixed target alphabet: dictionary-based compressors
– variable input alphabet, variable target alphabet: var2var compressors
Basic Compression
Main families of compressors
• Taxonomy
– Dictionary-based (gzip, compress, p7zip...)
– Grammar-based (BPE, Repair)
– Statistical compressors (Huffman, arithmetic, PPM, ...)
• Statistical compressors
– Gather the frequencies of the source symbols.
– Assign shorter codewords to the most frequent symbols.
This is how they obtain compression.
Basic Compression
Dictionary-based compressors
• How do they achieve compression?
– Assign fixed-length codewords to variable-length symbols (text
substrings).
– The longer the replaced substring, the better the compression.
• Well-known representatives: the Lempel-Ziv family
– LZ77 (1977): GZIP, PKZIP, ARJ, P7zip
– LZ78 (1978)
• LZW (1984): compress, GIF images
Basic Compression
LZW
• Starts with an initial dictionary D (containing the symbols in Σ).
• From a given position of the text:
– While D contains it, read a growing prefix w = w0 w1 w2 ...
– When w0...wk wk+1 is not in D (but w0...wk is!):
• output i = entryPos(w0...wk) (note: the codeword takes log2(|D|) bits)
• add w0...wk wk+1 to D
• continue from wk+1 on (included)
• Dictionary has limited length? Policies: LRU, truncate & go, ...
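A minimal LZW encoder following the steps above (a byte alphabet is assumed; real implementations also bound the dictionary size with one of the policies mentioned):

```python
def lzw_encode(text):
    # Initial dictionary D: one entry per symbol of the source alphabet.
    d = {chr(i): i for i in range(256)}
    w = ""
    out = []
    for c in text:
        if w + c in d:
            w += c             # keep growing the prefix while it is in D
        else:
            out.append(d[w])   # emit the entry of the longest known prefix
            d[w + c] = len(d)  # add the new string w0...wk wk+1 to D
            w = c              # continue from the mismatching symbol
    if w:
        out.append(d[w])       # flush the pending prefix
    return out
```

For example, `lzw_encode("abababab")` emits five codewords instead of eight symbols, since "ab" and "aba" become dictionary entries as the text is scanned.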
Basic Compression
Grammar-based: BPE / Repair
• Replaces pairs of symbols by a new one, until no pair repeats twice.
– Each replacement adds a rule to a dictionary of rules.
Source sequence:    A B C D E A B D E F D E D E F A B E C D
Apply rule DE → G:  A B C G A B G F G G F A B E C D
Apply rule AB → H:  H C G H G F G G F H E C D
Apply rule GF → I:  H C G H I G I H E C D
Dictionary of rules: G → DE, H → AB, I → GF
Final Repair sequence: H C G H I G I H E C D
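The loop above can be sketched as follows; the nonterminal names (R0, R1, ...) and the tie-breaking among equally frequent pairs are arbitrary choices of this sketch:

```python
from collections import Counter

def repair(seq):
    # Repeatedly replace the most frequent adjacent pair by a fresh
    # nonterminal until no pair occurs twice; returns (sequence, rules).
    rules, fresh = {}, 0
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        new = f"R{fresh}"
        fresh += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):          # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(seq, rules):
    # Undo the rules to recover the original sequence.
    out = []
    for sym in seq:
        if sym in rules:
            out.extend(expand(list(rules[sym]), rules))
        else:
            out.append(sym)
    return out
```

Expanding the compressed sequence with the rule dictionary always recovers the source, which is what makes the grammar a lossless representation.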
Basic Compression
Statistical Compressors
• Assign shorter codewords to the most frequent symbols.
– Must gather the frequencies of the source symbols: for each symbol c in Σ,
n_c = number of occurrences of c, with n = number of symbols in S.
– Compression is lower bounded by the (zero-order) empirical entropy of
the sequence S:
H0(S) = Σ_{c in Σ} (n_c / n) log2(n / n_c)
– H0(S) <= log2(|Σ|).
– n·H0(S) is a lower bound on the size of S compressed with a zero-order
compressor.
• Most representative method: Huffman coding.
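H0 can be computed directly from the definition above:

```python
from collections import Counter
from math import log2

def h0(s):
    # Zero-order empirical entropy: sum over symbols c of (n_c/n)*log2(n/n_c),
    # in bits per symbol.
    n = len(s)
    return sum(nc / n * log2(n / nc) for nc in Counter(s).values())
```

A one-symbol sequence has H0 = 0 (nothing to encode), and a uniform binary sequence reaches the log2 |Σ| = 1 bit upper bound.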
Basic Compression
Statistical Compressors: Huffman coding
• Optimal prefix-free coding:
– No codeword is a prefix of another.
• Decoding requires no look-ahead!
– Asymptotically optimal: |Huffman(S)| <= n(H0(S)+1).
• Typically uses bit-wise codewords.
– Yet D-ary Huffman variants exist (D=256: byte-wise).
• Builds a Huffman tree to generate the codewords.
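A compact sketch of the tree construction using a min-heap (the integer counter only breaks frequency ties so heap comparisons stay well-defined; the exact codewords depend on tie-breaking, the code lengths do not):

```python
import heapq
from collections import Counter

def huffman_codes(s):
    # Build the Huffman tree bottom-up by repeatedly merging the two least
    # frequent nodes, then read codewords off the tree (left=0, right=1).
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(s).items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}       # degenerate one-symbol alphabet
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):    # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                          # leaf: a source symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

On "aaaabbc" the frequent 'a' gets a 1-bit codeword while 'b' and 'c' get 2 bits, and no codeword is a prefix of another.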
Basic Compression
Burrows-Wheeler Transform (BWT)
• Given S = mississippi$, BWT(S) is obtained by: (1) creating
a matrix M with all circular rotations of S, (2) sorting
the rows of M, and (3) taking the last column.
Rotations of S:   Sorted rows of M:
mississippi$      $mississippi
$mississippi      i$mississipp
i$mississipp      ippi$mississ
pi$mississip      issippi$miss
ppi$mississi      ississippi$m
ippi$mississ      mississippi$
sippi$missis      pi$mississip
ssippi$missi      ppi$mississi
issippi$miss      sippi$missis
sissippi$mis      sissippi$mis
ssissippi$mi      ssippi$missi
ississippi$m      ssissippi$mi
F = first column = $iiiimppssss
L = last column = BWT(S) = ipssm$pissii
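The three steps translate directly to code (sorting full rotations is quadratic in space; real implementations use suffix arrays instead):

```python
def bwt(s):
    # (1) append the terminator and form all rotations, (2) sort them,
    # (3) concatenate the last symbol of each sorted rotation.
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)
```

Running it on the slide's example reproduces L = ipssm$pissii.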
Basic Compression
Burrows-Wheeler Transform: reversible (BWT^-1)
• Given L = BWT(S), we can recover S = BWT^-1(L).
 i   sorted row (F ... L)   LF[i]
 1   $mississippi            2
 2   i$mississipp            7
 3   ippi$mississ            9
 4   issippi$miss           10
 5   ississippi$m            6
 6   mississippi$            1
 7   pi$mississip            8
 8   ppi$mississi            3
 9   sippi$missis           11
10   sissippi$mis           12
11   ssippi$missi            4
12   ssissippi$mi            5
Steps:
1. Sort L to obtain F.
2. Build the LF mapping so that
   if L[i] = ‘c’, and
   k = the number of times ‘c’ occurs in L[1..i], and
   j = the position in F of the k-th occurrence of ‘c’,
   then set LF[i] = j.
   Example: L[7] = ‘p’ is the 2nd ‘p’ in L, so LF[7] = 8,
   which is the 2nd occurrence of ‘p’ in F.
3. Recover the source sequence S in n steps:
   Initially p = 6 (the position of ‘$’ in L); i = 0; n = 12.
   In each step: S[n-i] = L[p]; p = LF[p]; i = i+1.
   Step i = 0: S[n-i] = L[p] sets S[12] = ‘$’;
               then p = LF[p] = 1 and i = 1.
82. Basic Compression
• Given L=BWT(S), we can recover S=BWT-1(L)
Burrows-Wheeler Transform: reversible (BWT -1)
$mississippi
i$mississipp
ippi$mississ
issippi$miss
ississippi$m
mississippi$
pi$mississip
ppi$mississi
sippi$missis
sissippi$mis
ssippi$missi
ssissippi$mi
LF
1
2
3
4
5
6
7
8
9
10
11
12
2
7
9
10
6
1
8
3
11
12
4
5
LF
Steps:
1. Sort L to obtain F
2. Build LF mapping so that
If L[i]=‘c’, and
k= the number of times ‘c’ occurs in L[1..i], and
j=position in F of the kth occurrence of ‘c’
Then set LF[i]=j
Example: L[7] = ‘p’, it is the 2nd ‘p’ in L LF[7] = 8
which is the 2nd occ of ‘p’ in F
3. Recover the source sequence S in n steps:
Initially p=l=6 (position of $ in L); i=0; n=12;
Step i=1: S[n-i] = L[p]; S[11]=‘i’
p = LF[p]; p = 2
i = i+1; i=2
-
-
-
-
-
-
-
-
-
-
i
$
S
83. Basic Compression
• Given L=BWT(S), we can recover S=BWT-1(L)
Burrows-Wheeler Transform: reversible (BWT -1)
$mississippi
i$mississipp
ippi$mississ
issippi$miss
ississippi$m
mississippi$
pi$mississip
ppi$mississi
sippi$missis
sissippi$mis
ssippi$missi
ssissippi$mi
LF
1
2
3
4
5
6
7
8
9
10
11
12
2
7
9
10
6
1
8
3
11
12
4
5
LF
Steps:
1. Sort L to obtain F
2. Build LF mapping so that
If L[i]=‘c’, and
k= the number of times ‘c’ occurs in L[1..i], and
j=position in F of the kth occurrence of ‘c’
Then set LF[i]=j
Example: L[7] = ‘p’, it is the 2nd ‘p’ in L LF[7] = 8
which is the 2nd occ of ‘p’ in F
3. Recover the source sequence S in n steps:
Initially p=l=6 (position of $ in L); i=0; n=12;
Step i=1: S[n-i] = L[p]; S[11]=‘i’
p = LF[p]; p = 2
i = i+1; i=2
m
i
s
s
i
s
s
i
p
i
i
$
S
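The LF-walk above can be sketched directly; this didactic version rebuilds the LF mapping with a stable sort of L instead of the C/Occ counters used later in the lecture:

```python
def ibwt(L):
    """Invert the BWT: build the LF mapping (last column -> sorted
    first column F) via a stable sort of L, then walk n steps
    backwards from the row whose last character is the sentinel."""
    n = len(L)
    # F[j] = index in L of the j-th character of the sorted column F;
    # inverting this permutation gives LF: LF[i] = row of F fed by L[i].
    F = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for j, i in enumerate(F):
        LF[i] = j
    p = L.index("$")            # start at the row whose L-char is '$'
    out = []
    for _ in range(n):          # collects S from last char to first
        out.append(L[p])
        p = LF[p]
    return "".join(reversed(out))
```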
84. Basic Compression
• BWT. Many similar symbols appear adjacent
• MTF.
  – Output the position of the current symbol within Σ′
  – Keep the alphabet Σ′ = {a,b,c,d,e,…} ordered so that the last used
    symbol is moved to the beginning of Σ′.
• RLE.
  – If a value (0) appears several times (000000: 6 times),
    replace it by a pair <value,times>: <0,6>
• Huffman stage.
Bzip2: Burrows-Wheeler Transform (BWT)
Why does it work?
In a text it is likely that “he” is preceded by “t”, “ssi” by “i”, …
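The MTF and zero-run RLE stages of the pipeline can be sketched as follows (a minimal version; bzip2's actual run-length coding is more elaborate):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: output each symbol's current position in the
    alphabet, then move that symbol to the front. Runs of equal
    symbols (frequent after the BWT) become runs of 0s."""
    alpha = list(alphabet)
    out = []
    for c in s:
        i = alpha.index(c)
        out.append(i)
        alpha.insert(0, alpha.pop(i))   # move the used symbol to the front
    return out

def rle_zeros(seq):
    """Replace each run of 0s by a pair (0, run length)."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == 0:
            j = i
            while j < len(seq) and seq[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append(seq[i])
            i += 1
    return out
```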
91. Bit Sequences
• Bitmaps are a basic part of most Compact Data Structures
• Example: (we will see it later in the CSA)
  S: AAABBCCCCCCCCDDDEEEEEEEEEEFG   n log σ bits
  B: 1001010000000100100000000011   n bits
  D: ABCDEFG                        σ log σ bits
  – Saves space
  – Fast access/rank/select is of interest !!
    • Where is the 2nd C?
    • How many Cs up to position k?
Applications
92. Bit Sequences
• Jacobson, Clark, Munro
  – Variant by Fariña et al.
• Assuming a 32-bit machine word
• Step 1: Split the bitmap into superblocks of 256 bits, and
  store the number of 1s up to positions 1+256k
  – O(1) time to reach a superblock. Space: n/256 superblocks, 1 int each
Reaching O(1) rank and o(n) bits of extra space

[Figure: a bitmap divided into 256-bit superblocks containing 35, 27,
45, … bits set to 1; the directory Ds stores the cumulative 1-counts at
the superblock boundaries, e.g. Ds = ⟨35, 62, …⟩]
93. Bit Sequences
• Step 2: For each superblock of 256 bits
  – Divide it into 8 blocks of 32 bits each (machine-word size)
  – Store the number of 1s from the beginning of the superblock
  – O(1) time to reach the blocks; 8 blocks per superblock, 1 byte each
Reaching O(1) rank and o(n) bits of extra space

[Figure: each 256-bit superblock divided into 32-bit blocks containing
4, 6, … bits set to 1; the directory Db stores, for each block, the
number of 1s from the beginning of its superblock, e.g. Db = ⟨4, 10, …⟩]
94. Bit Sequences
• Step 3: Rank within a 32-bit block
Finally solving:
  rank1( D , p ) = Ds[ p / 256 ] + Db[ p / 32 ] + rank1(blk, i)
  where i = p mod 32
  – Ex: rank1(D,300) = 35 + 4 + 4 = 43
  – Yet, how to compute rank1(blk, i) in constant time?
Reaching O(1) rank and o(n) bits of extra space
blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1
      (bit positions 1..32)
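The three-level rank directory can be sketched in Python with the same superblock/block layout; counts are kept as plain ints rather than packed bytes, and the final in-block count is done naively instead of with a popcount table:

```python
class RankBitmap:
    """Two-level counting directory for O(1) rank1, as on the slides:
    Ds stores absolute 1-counts at 256-bit superblock boundaries,
    Db stores 1-counts relative to the enclosing superblock at each
    32-bit block boundary."""
    SB, B = 256, 32

    def __init__(self, bits):
        self.bits = bits                     # list of 0/1 ints
        self.Ds, self.Db = [], []
        acc = sb_acc = 0
        for i in range(0, len(bits), self.B):
            if i % self.SB == 0:
                self.Ds.append(acc)          # ones before this superblock
                sb_acc = acc
            self.Db.append(acc - sb_acc)     # ones since superblock start
            acc += sum(bits[i:i + self.B])

    def rank1(self, p):
        """Number of 1s in bits[0..p] (0-based, inclusive)."""
        blk = p // self.B
        return (self.Ds[p // self.SB] + self.Db[blk]
                + sum(self.bits[blk * self.B : p + 1]))
```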
96. Bit Sequences
select1(p)
• In practice, binary search using rank
  – Binary search on superblocks: O(log(n)) to find the superblock s
    containing the pth 1 → retval = Ds[s]
  – Sequential search within its blocks until reaching the block d that
    contains the position → retval += Db[d]
  – Sequential search (1 byte at a time) within the last 32 bits, using a
    onesInByte[] table, until reaching the byte b that contains the
    position.
    • In each iteration: retval += onesInByte[b]
  – Table lookup over a selb[] table for the last byte b
    • retval += selb[b]
  – Return retval
Select in O(log(n)) with the same structures
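The core idea, binary searching rank for the k-th 1, can be sketched without the block-level refinements of the slide (here rank is simply precomputed with a prefix sum; assumes the k-th 1 exists):

```python
from itertools import accumulate

def select1(bits, k):
    """0-based position of the k-th 1-bit (k >= 1), found by binary
    search on cumulative rank values -- the O(log n) strategy of the
    slide, minus the sequential byte scans and lookup tables."""
    ranks = list(accumulate(bits))     # ranks[p] = rank1 up to p
    lo, hi = 0, len(bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if ranks[mid] < k:
            lo = mid + 1
        else:
            hi = mid
    return lo
```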
97. Bit Sequences
• Compressed bitmap representations exist.
– Compressed [Raman et al]
– For very sparse bitmaps [Okanohara and Sadakane]
– …
Compressed representations
100. Integer Sequences
• Grossi et al.
• Given a sequence of symbols and an encoding
  – The bits of the code of each symbol are distributed along the
    different levels of the tree
Wavelet tree (construction)

DATA: A B A C D A C  →  00 01 00 10 11 00 10
SYMBOL CODES: A=00, B=01, C=10, D=11

WAVELET TREE:
  Broot:  A B A C D A C
          0 0 0 1 1 0 1       (first bit of each codeword)
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0 (second bit of each codeword)
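The construction above, plus the access and rank operations of the next slides, can be sketched as a toy class (balanced tree hard-wired to the 2-bit codes of the example, with naive bitmaps instead of rank/select structures):

```python
class WaveletTree:
    """Balanced wavelet tree for the slides' 4-symbol example
    (codes A=00, B=01, C=10, D=11). Each node stores one bitmap;
    access and rank traverse one level per code bit."""
    CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}

    def __init__(self, seq, depth=0):
        self.bits = [int(self.CODE[c][depth]) for c in seq]
        if depth + 1 < 2:   # codes have 2 bits -> one more level below
            self.children = (
                WaveletTree([c for c in seq if self.CODE[c][depth] == "0"], depth + 1),
                WaveletTree([c for c in seq if self.CODE[c][depth] == "1"], depth + 1),
            )
        else:
            self.children = None

    def access(self, i, prefix=""):
        """Symbol at position i (0-based): read one bit per level and
        descend into the child that holds this position."""
        b = self.bits[i]
        prefix += str(b)
        if self.children is None:
            return next(s for s, code in self.CODE.items() if code == prefix)
        j = self.bits[:i].count(b)     # position of i inside the child
        return self.children[b].access(j, prefix)

    def rank(self, c, i):
        """Occurrences of symbol c in positions 0..i (inclusive)."""
        node, pos, cnt = self, i, 0
        for b in (int(x) for x in self.CODE[c]):
            cnt = node.bits[:pos + 1].count(b)
            if cnt == 0:
                return 0
            pos = cnt - 1
            if node.children is not None:
                node = node.children[b]
        return cnt

wt = WaveletTree("ABACDAC")
```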
101. Integer Sequences
• Searching for the 1st occurrence of ‘D’ (code 11)?
Wavelet tree (select)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Bottom-up: ‘D’ is the 2nd bit in B1.
Where is the 1st ‘1’ in B1? At pos 2.
Where is the 2nd ‘1’ in Broot? At pos 5.
→ the 1st ‘D’ occurs at position 5 of the data.
102. Integer Sequences
• Recovering data: extracting a symbol
  – Which symbol appears in the 6th position?
Wavelet tree (access)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Broot[6]=0; how many ‘0’s are there up to pos 6? It is the 4th ‘0’.
Which bit occurs at position 4 in B0? It is set to 0.
The codeword read is ’00’ → A
103. Integer Sequences
• Recovering data: extracting a symbol
  – Which symbol appears in the 7th position?
Wavelet tree (access)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

Broot[7]=1; how many ‘1’s are there up to pos 7? It is the 3rd ‘1’.
Which bit occurs at position 3 in B1? It is set to 0.
The codeword read is ’10’ → C
104. Integer Sequences
• How many C’s (code 10) up to position 7?
Wavelet tree (rank)

DATA: A B A C D A C    SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0

How many ‘1’s are there up to pos 7 in Broot? 3 → it maps to the 3rd
position of B1.
How many ‘0’s up to position 3 in B1? 2 !!
(select locates a symbol bottom-up; access and rank descend top-down)
105. Integer Sequences
• Each level contains n + o(n) bits
• Rank/select/access in O(log σ) time
Wavelet tree (space and times)

DATA: A B A C D A C  →  00 01 00 10 11 00 10
SYMBOL CODES: A=00, B=01, C=10, D=11
  Broot:  A B A C D A C
          0 0 0 1 1 0 1       n + o(n) bits
  B0: A B A A       B1: C D C
      0 1 0 0           0 1 0  n + o(n) bits
Total: n ⌈log σ⌉ (1 + o(1)) bits
106. Integer Sequences
• Using Huffman coding (or others) → unbalanced tree
• Rank/select/access in O(H0(S)+1) time on average
Huffman-shaped (or others) wavelet tree

DATA: A B A C D A C  →  1 000 1 01 001 1 01
SYMBOL CODES: A=1, B=000, C=01, D=001

  Broot:  A B A C D A C
          1 0 1 0 0 1 0     (‘1’ → leaf of A)
  B0: B C D C
      0 1 0 1               (‘1’ → leaf of C)
  B00: B D
       0 1                  (leaves of B and D)
Space: nH0(S) + o(n) bits
108. A brief Review about Indexing
• Traditional indexes (with or without compression)
  – Inverted Indexes, Suffix Arrays, ... (auxiliary structure + explicit text)
• Compressed Self-indexes
  – Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, … (implicit text)
Text Indexing: well-known structures from the Web
109. A brief Review about Indexing
Inverted indexes
Space-time trade-off

Indexed text:
  “DCC is held at the Cliff Lodge convention center. It is an
  international forum for current work on data compression and related
  applications. DCC addresses not only compression methods for specific
  types of data (text, image, video, audio, space, graphics, web
  content, etc.), but also the use of techniques from information
  theory and data compression in networking, communications, and
  storage applications involving large datasets (including image and
  information mining, retrieval, archiving, backup, communications,
  and HCI).”

Two variants of vocabulary + posting lists over this text:
  – Full-positional information: each vocabulary word (DCC,
    communications, compression, image, data, information, Cliff,
    Lodge, …) stores the offsets of all its occurrences,
    e.g. DCC → 0, 142.
  – Block-addressing: the text is divided into blocks (Block1, Block2)
    and each word stores only the blocks where it occurs,
    e.g. DCC → Block1.

Searches:
  Word → fetch the posting list of that word
  Phrase → intersection of postings
Compression:
  - Indexed text (Huffman, ...)
  - Posting lists (Rice, ...)
111. A brief Review about Indexing
• Sorting all the suffixes of T lexicographically
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12

A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12

Sorted suffixes:
  12: $
  11: a$
   8: abra$
   1: abracadabra$
   4: acadabra$
   6: adabra$
   9: bra$
   2: bracadabra$
   5: cadabra$
   7: dabra$
  10: ra$
   3: racadabra$
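The array A above can be built with a one-line sketch (naive O(n² log n) construction; practical suffix-array builders are far more sophisticated):

```python
def suffix_array(t):
    """Starting positions (1-based, as on the slide) of all suffixes
    of t, listed in lexicographic order of the suffixes."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

A = suffix_array("abracadabra$")
```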
112. A brief Review about Indexing
• Binary search for any pattern: “ab”
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12
A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12
P = a b
118. A brief Review about Indexing
• Binary search for any pattern: “ab”
Suffix Arrays

T = a b r a c a d a b r a  $
    1 2 3 4 5 6 7 8 9 10 11 12
A = 12 11 8 1 4 6 9 2 5 7 10 3
    1  2  3 4 5 6 7 8 9 10 11 12
P = a b

The search narrows to the range A[3..4]:
  locations: Occs = A[3] .. A[4] = { 8, 1 };  Noccs = (4-3)+1 = 2
  Fast, but space-demanding: counting in O(m lg n) time, reporting in
  O(m lg n + noccs); A needs about 4n bytes on top of the text T.
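The binary search for the pattern range can be sketched as two bisections over the suffix array (naive O(m log n) comparisons, 1-based suffix positions as on the slides):

```python
def sa_locate(t, sa, p):
    """Occurrences of pattern p in t: binary search the suffix array
    for the contiguous range of suffixes that start with p, then
    report the sa entries in that range."""
    def bound(upper):
        # smallest index whose suffix-prefix is >= p (or > p if upper)
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = t[sa[mid] - 1 : sa[mid] - 1 + len(p)]
            if pref < p or (upper and pref == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sa[bound(False):bound(True)]
```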
119. Basic Compression
• BWT(S) + other structures → an index: the FM-index
• C[c]: for each char c in Σ, stores the number of
  occs in S of the chars that are lexicographically
  smaller than c.
  C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8
• Occ(c, k): number of occs of char c in the prefix
  L[1, k] of L.
  For k in [1..12]:
  Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
  Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
  Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
  Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
  Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
• Char L[i] occurs in F at position LF(i):
  LF(i) = C[L[i]] + Occ(L[i],i)
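The C and Occ structures and the LF formula can be sketched as below (0-based indices; Occ is computed by scanning, whereas a real FM-index would answer it with rank structures over a wavelet tree):

```python
from collections import Counter

def fm_structures(L):
    """C, Occ and LF for L = BWT(S). C[c] counts symbols smaller
    than c; occ(c, k) counts c in the first k chars of L; in
    0-based terms LF(i) = C[L[i]] + occ(L[i], i+1) - 1."""
    freq = Counter(L)
    C, acc = {}, 0
    for c in sorted(freq):        # chars in lexicographic order
        C[c] = acc                # number of smaller chars in L
        acc += freq[c]
    def occ(c, k):
        return L[:k].count(c)
    def lf(i):
        return C[L[i]] + occ(L[i], i + 1) - 1
    return C, occ, lf
```

With L = BWT(mississippi$) = "ipssm$pissii", this reproduces the slide's values (C[p]=6, and the 2nd ‘p’, at 1-based position 7, maps to 1-based position 8 of F).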
122. Bibliography
1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical
Report 124, Digital Systems Research Center, 1994.
http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/.
2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th
SPIRE, LNCS 5280, pages 176–187, 2008.
3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc.
12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM,
52(4):552–581, 2005.
5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994
6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text
indexing. In Proc. 17th SODA, pages 368–373, 2006.
7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th
SODA, pages 841–850, 2003.
123. Bibliography
8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings
of the IRE, 40(9):1098–1101, 1952.
9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE,
88(11):1722–1732, 2000
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp.,
22(5):935–948, 1993
11. Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002.
ISBN 0-7923-7668-4.
12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996.
13. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys,
39(1), article 2, 2007.
14. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th
ALENEX, 2007.
15. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding
k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
124. Bibliography
16. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and
flexible word searching on compressed text. ACM Transactions on Information Systems,
18(2):113–139, 2000.
17. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and
Indexing Documents and Images. Morgan Kaufmann, 1999.
18. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory 23, 3, 337–343.
19. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE
Transactions on Information Theory 24, 5, 530–536.
125. Onto some basics of:
compression, Compact Data Structures, and
indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A Martínez Prieto
126. Introduction
Compressed String Dictionaries
Experimental Evaluation
Dictionary Compression
Miguel A. Martínez-Prieto Antonio Fariña
Univ. of Valladolid (Spain) Univ. of A Coruña (Spain)
migumar2@infor.uva.es fari@udc.es
Keyword search over Big Data.
– 1st KEYSTONE Training School –.
July 22nd, 2015. Faculty of ICT, Malta.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 1/47
127. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 2/47
128. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– What is a String Dictionary? –
String Dictionary
A string dictionary is a serializable data structure
which organizes all different strings (vocabulary) used
in a dataset.
The vocabulary of a natural language text (lexicon) comprises all different
words used in it.
T= “la tarara sí la tarara no la tarara niña que la he visto yo”
V= {he, la, niña, no, que, sí, tarara, visto, yo}
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 3/47
129. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
What is a String Dictionary?
The dictionary implements a bijective function that maps
strings to identifiers (IDs, generally integer values) and back.
It must provide, at least, two complementary operations:
string-to-ID: locates the ID for a given string.
ID-to-string: extracts the string identified by a given ID.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 4/47
130. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
What is a String Dictionary?
String dictionaries are a simple and effective tool:
Enable replacing (long, variable-length) strings by simple
numbers (their IDs).
T= “la tarara sí la tarara no la tarara niña que la he visto yo”
T’= 2 7 6 2 7 4 2 7 3 5 2 1 8 9
The resulting IDs are more compact to represent and easier
and more efficient to handle:
T= 59 chars × 1 byte/char = 59 bytes
T’= 14 IDs × ⌈log(9)⌉ bits/ID = 7 bytes
(plus the cost of dictionary encoding)
A compact dictionary which provides efficient mapping
between strings and IDs saves storage space, and
processing/transmission costs, in data-intensive
applications.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 5/47
131. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Compressing String Dictionaries
The growing volume of the datasets has led to increasingly large
dictionaries:
The dictionary size is a bottleneck for applications running under
restrictions of main memory.
Dictionary management is becoming a scalability issue by itself.
Dictionary compression aims to achieve competitive space/time tradeoffs:
Compact serialization.
Small memory footprint.
Efficient query resolution.
We focus on static dictionaries, which do not change during the
execution:
Many applications use dictionaries that either are static or are rebuilt only
sparingly.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 6/47
132. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– Operations –
A string dictionary is a data structure that represents a sequence of n
distinct strings, D = s1, s2, . . . , sn .
It provides a mapping between ID numbers i and strings si :
- locate(p) = i, if p = si for some i ∈ [1, n];
              0 otherwise.
- extract(i) returns the string si , for i ∈ [1, n].
Some other operations can be useful in specific applications:
Prefix-based locate / extract operations.
Substring-based locate / extract operations.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 7/47
133. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Prefix-based Operations
- locatePrefix(p) = {i, ∃y, si = py}.
This result set is a contiguous ID range for lexicographically sorted
dictionaries.
- extractPrefix(p) = {si , ∃y, si = py}.
It is equivalent to composing locatePrefix(p) with individual
extract(i) operations.
Finding all URIs in a given domain is an example of prefix-based
operation:
Look for all properties used in http://dataweb.infor.uva.es/movies:
http://dataweb.infor.uva.es/movies/property/director (4).
http://dataweb.infor.uva.es/movies/property/name (7).
http://dataweb.infor.uva.es/movies/property/title (12).
...
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 8/47
134. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Substring-based Operations
- locateSubstring(p) = {i, ∃x, y, si = xpy}.
It is very similar to the problem solved by full-text indexes.
- extractSubstring(p) = {si , ∃x, y, si = xpy}.
It is equivalent to composing locateSubstring(p) with individual
extract(i) operations.
Both operations may return duplicate results which must be removed
before reporting the ID result set.
regex query resolution in SPARQL is an example of substring-based
operation:
Look for all literals containing the substring Eastwood:
‘‘Clint Eastwood’’ (2544).
‘‘Jayne Eastwood is a Canadian actress...’’ (10584).
‘‘Kyle Eastwood’’ (13847).
...
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 9/47
135. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Summary
- locate(“tarara”) = 7
- extract(2) = la
- locatePrefix(“n”) = 3,4
- extractPrefix(“n”) = niña, no
- locateSubstring(“a”) = 2,3,7
- extractSubstring(“a”) = la, niña, tarara
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 10/47
136. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
– RDF Dictionaries –
An RDF dictionary comprises all different terms used in the dataset:
RDF terms are drawn from three disjoint vocabularies: URIs, Literals, and
blank nodes.
Serialized (uncompressed) RDF vocabularies need up to 3 times more
space than (uncompressed) ID-triples [13].
URIs and Literals should be compressed and managed independently:
Their structure is very different and they are queried in a different way.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 11/47
137. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
URIs
URIs are medium-size strings sharing long prefixes:
Compressed dictionaries for URIs must exploit the continuous repetition of
such prefixes.
Prefix-based compression.
locate operations are common when the dictionary is used for lookup
purposes (e.g. RDF stores, semantic search engines, etc.).
extract operations are common when the dictionary is used for data
access purposes (e.g. decompression, result retrieval, etc.).
locatePrefix and extractPrefix are also useful for URI dictionaries.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 12/47
138. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Literals
Literals tend to be large-size strings with no predictable features:
The name “Clint Eastwood”.
The genome from an individual of any species.
The full text from “El Quijote”
...
Literal dictionaries must be based on universal compression.
locate and extract are used like in URI dictionaries.
locateSubstring and extractSubstring are useful because of
SPARQL needs.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 13/47
139. Introduction
Compressed String Dictionaries
Experimental Evaluation
What is a String Dictionary?
Operations
RDF Dictionaries
Practical Configuration
A role-based partition is first performed:
Subjects are encoded in the range [1,|S|].
Predicates are encoded in the range [1,|P|].
Objects are encoded in the range [1,|O|].
URIs playing both subject and object roles are encoded
once:
IDs in [1,|SO|] encode subjects and objects.
Subjects are encoded in [|SO|+1,|S|].
Objects are encoded using two dictionaries:
1 [|SO|+1,|Ox|] encodes URIs which only act
as objects.
2 [|Ox|+1,|O|] encodes Literals.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 14/47
140. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 15/47
141. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Compressed String Dictionaries
All revised dictionaries combine notions from universal compression and
compact data structures.
Universal compressors must enable fast decompression and comparison of
individual strings:
Huffman [8] and Hu-Tucker [7, 9] codes.
Re-Pair [10].
The serialized vocabulary Tdict concatenates all strings in lexicographic
order:
A special symbol $ is used as separator.
T =“alabar a la alabada alabarda”
Tdict = a$alabada$alabar$alabarda$la$
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 16/47
142. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Front-Coding –
Front-Coding [15] is a folklore compression technique for lexicographically
sorted dictionaries.
It exploits the fact that consecutive entries are likely to share a common
prefix:
Each entry in the dictionary is differentially encoded with respect to the
preceding one.
It needs two values:
× An integer encoding the length of the shared prefix.
× The remaining characters of the current entry.
a$alabada$alabar$alabarda$la$
→ (0,a$); (1,labada$); (5, r$); (6, da$); (0, la$)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 17/47
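The differential encoding of the slide's example can be sketched directly (here the $ terminators are left implicit, so only the shared-prefix length and the remaining suffix are stored):

```python
def front_code(words):
    """Front-Coding over a lexicographically sorted list: each word
    is stored as (length of prefix shared with the previous word,
    remaining suffix)."""
    out, prev = [], ""
    for w in words:
        lcp = 0
        while lcp < min(len(w), len(prev)) and w[lcp] == prev[lcp]:
            lcp += 1
        out.append((lcp, w[lcp:]))
        prev = w
    return out

def front_decode(pairs):
    """Rebuild the word list from its Front-Coded form."""
    out, prev = [], ""
    for lcp, suffix in pairs:
        w = prev[:lcp] + suffix
        out.append(w)
        prev = w
    return out

pairs = front_code(["a", "alabada", "alabar", "alabarda", "la"])
```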
143. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Front-Coding
The vocabulary is divided into buckets of b strings:
The first string of each bucket (header) is explicitly stored.
The remaining b − 1 internal strings are differentially encoded.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 18/47
144. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Front-Coding Operations
locate(p):
1 Headers are binary searched until finding the bucket Bx where p must lie:
If the header is p, locate(p) = (b × (Bx − 1)) + 1.
2 The internal strings are sequentially decoded:
If the ith internal string is p, locate(p) = (b × (Bx − 1)) + i.
If the bucket is fully decoded with no result, p is not in the dictionary.
extract(i):
1 The string is encoded in the bucket Bx = ⌈i/b⌉.
2 ((i − 1) mod b) internal strings are decoded to obtain the answer.
Prefix-based operations exploit the lexicographic order:
Their results are contiguous ranges in the dictionary.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 19/47
145. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Plain Front-Coding (PFC)
PFC is a straightforward byte-oriented Front-Coding implementation:
It uses VByte [14] to encode the length of the common prefix.
The remaining string is encoded with one byte per character, plus the
terminator $.
PFC is serialized as a byte array (Tpfc ) and a ptrs structure:
Both structures are directly mapped to main memory for data retrieval
purposes.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 20/47
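PFC stores the shared-prefix length as a VByte. This is a sketch of one common VByte convention (7 data bits per byte, with the high bit set on the terminating byte; actual implementations differ in where they put the flag):

```python
def vbyte_encode(n):
    """Encode a non-negative integer as a VByte sequence: low 7 bits
    per byte, least-significant chunk first; the high bit marks the
    last byte of the number."""
    out = []
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)       # flag bit on the final byte
    return bytes(out)

def vbyte_decode(bs):
    """Decode a single VByte-encoded integer."""
    n = shift = 0
    for b in bs:
        n |= (b & 0x7F) << shift
        shift += 7
    return n
```

Small values, the common case for shared-prefix lengths, take a single byte.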
146. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hu-Tucker Front-Coding (HTFC)
HTFC is algorithmically similar to PFC, but it takes advantage of the Tpfc
redundancy to achieve a more compressed representation:
Operations are slightly slower than for PFC.
Headers are encoded using Hu-Tucker codes:
It allows compressed headers to be directly compared with the query
pattern.
Internal strings are encoded using Huffman or Re-Pair compression.
HTFC is serialized as a bit array (Thtfc ) and also a ptrs structure:
Pointers in HTFC use fewer bits because Thtfc is smaller than Tpfc .
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 21/47
147. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Hashing –
Hashing [3] is a folklore method to implement dictionaries:
A hash function transforms the string into an index x in the hash table.
A collision arises when two different strings are mapped to the same cell
in the table.
String dictionaries perform better with closed hashing [2]:
If the corresponding cell is not empty, one successively probes other cells
until finding a free cell.
The next cell to be probed is determined using double hashing.
Hash dictionaries provide very efficient locate, may support extract,
but the table size dissuades their use for managing large vocabularies.
Compressed hash dictionaries focus on compacting the table,
and also the vocabulary itself:
The vocabulary can be effectively compressed using Huffman or Re-Pair.
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 22/47
148. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Vocabulary Compression
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 23/47
149. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Table Compression (I)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 24/47
150. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Table Compression (II)
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 25/47
151. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Improving Data Access
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 26/47
152. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hashing Operations (locate)
locate(p):
1 The pattern p is compressed using Huffman: cp.
2 cp is “hashed” to a position x in the (original) hash table.
3 x is mapped to its corresponding position y in the compressed
representation.
4 The string pointed to by y is decompressed and compared to p.
locate(“alabada”)
1 Huffman(“alabada$”)=cp
2 hash(cp)=5
3 if B[5] = 1, rank1(B, 5)=4
if B[5] = 0, “alabada” is not in D.
4 strcmp(DAC[4],cp)=true → 4
strcmp(DAC[4],cp)=false → collision
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 27/47
153. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
Hashing Operations (extract)
extract(i):
1 The string is directly extracted from DAC[i].
Miguel A. Mart´ınez-Prieto & Antonio Fari˜na Dictionary Compression 28/47
154. Introduction
Compressed String Dictionaries
Experimental Evaluation
Front-Coding
Hashing
Self-Indexed Dictionaries
Other Dictionaries
– Self-Indexed Dictionaries –
A self-index stores the original text T and provides indexed searches on
it, using space proportional to the statistical entropy of T.
Self-indexes support two operations:
locate(p), returns all the positions in T where p occurs.
extract(i, j), retrieves the substring T [i, j].
A string dictionary can be easily self-indexed:
The corresponding self-index is built on the text Tdict .
The dictionary primitives (and also prefix and substring based queries) are
implemented using the self-index operations.
We choose the FM-Index [4, 5] because it is the most space-efficient
self-index in practice:
A $ symbol is prepended to the original Tdict .
The BWT (L) is represented with a wavelet tree (“plain” [5] and “compressed” [11]).
C is a simple array.
155. FM-Index Dictionary
156. FM-Index Dictionary (locate)
The i-th string is encoded between the (i+1)-th and the (i+2)-th $.
locate(p) performs a backwards search of $p$:
The pattern is searched from right to left until reaching the corresponding $.
locate(p) runs in O(|p| log σ) time.
157. FM-Index Dictionary (locate)
locate(’la’): looking for $la$.
1. Range: [C($), C(a)−1] = [0,5].
Number of a’s before the range: occs0 = rank_a(L, 0) = 0.
Number of a’s up to the end of the range: occs1 = rank_a(L, 5) = 4.
2. Range: [C(a)+occs0, C(a)+occs1−1] = [6,9].
Number of l’s before the range: occs0 = rank_l(L, 6) = 0.
Number of l’s up to the end of the range: occs1 = rank_l(L, 9) = 1.
3. Range: [C(l)+occs0, C(l)+occs1−1] = [24,25].
Number of $’s before the range: occs0 = rank_$(L, 24) = 5.
Number of $’s up to the end of the range: occs1 = rank_$(L, 25) = 6.
4. Range: [C($)+occs0, C($)+occs1−1] = [5,5].
’la’ is identified by 5.
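The backward search above can be sketched end to end. This toy version computes the BWT by sorting rotations and uses naive O(n) rank scans where a real FM-Index would use wavelet-tree rank in O(log σ); the dictionary content and all function names are illustrative.

```python
def bwt(text):
    n = len(text)
    rots = sorted(range(n), key=lambda i: text[i:] + text[:i])
    return ''.join(text[i - 1] for i in rots)   # last column L

def c_array(L):
    # C[c] = number of symbols in the text smaller than c
    return {c: sum(1 for x in L if x < c) for c in set(L)}

def rank(L, c, i):
    return L[:i].count(c)                   # occurrences of c in L[0..i-1]

def backward_search(L, C, p):
    sp, ep = 0, len(L)                      # current half-open range
    for c in reversed(p):                   # pattern read right to left
        if c not in C:
            return None
        sp = C[c] + rank(L, c, sp)
        ep = C[c] + rank(L, c, ep)
        if sp >= ep:
            return None
    return sp, ep

# Toy dictionary {ala, alabada, la} serialized with $ separators:
tdict = "$ala$alabada$la$"
L = bwt(tdict)
C = c_array(L)
assert backward_search(L, C, "$la$") is not None   # 'la' is in D
assert backward_search(L, C, "$lo$") is None       # 'lo' is not
```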
158. FM-Index Dictionary (extract)
extract(i) retrieves the symbols from the (i+1)-th $ back to the i-th $:
It takes O(|s_i| log σ) time.
Example: extract(5)
1. The search starts from position 0.
Extract the symbol at this position: access(L, 0) = a.
Number of a’s up to the position: occs = rank_a(L, 0) = 1.
2. Position: C(a) + 1 − 1 = 6.
Extract the symbol at this position: access(L, 6) = l.
Number of l’s up to the position: occs = rank_l(L, 6) = 1.
3. Position: C(l) + 1 − 1 = 24.
Extract the symbol at this position: access(L, 24) = $.
The 5-th string is ’la’.
159. FM-Index Dictionary (prefix & substring operations)
locatePrefix(p) is similar to locate:
It looks for $p and finds the area [sp,ep] where all strings s_i that
start with p are encoded.
Substring-based operations generalize prefix-based ones:
locateSubstring(p) looks for p to obtain the area [sp,ep] containing all
strings s_i that contain p.
For each match, the backwards search continues until determining the
corresponding ID (sampling structure).
Duplicate IDs are finally removed.
extractPrefix(p) and extractSubstring(p) perform extract operations in
the corresponding ranges.
160. – Other Dictionaries (Tries) –
Tries [9] are tree-shaped structures that perform efficiently for
dictionary purposes:
Strings are located from the root to the leaves.
IDs are extracted from the corresponding leaf back to the root.
However, tries use too much space for managing large dictionaries.
Some compressed trie-based dictionaries exist in the state of the art:
Compressed tries based on path decomposition [6].
LZ-compressed tries [1].
Self-indexed tries (XBW) [2].
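For intuition, a toy uncompressed trie with the locate operation (the compressed variants above offer the same interface in far less space; the class and method names here are our own, not from the cited works):

```python
class Trie:
    def __init__(self):
        self.root = {}
        self.next_id = 1

    def insert(self, s):
        node = self.root
        for ch in s:                        # walk/extend root-to-leaf path
            node = node.setdefault(ch, {})
        node['$'] = self.next_id            # terminal marker stores the ID
        self.next_id += 1

    def locate(self, s):                    # string -> ID, root to leaf
        node = self.root
        for ch in s:
            if ch not in node:
                return None
            node = node[ch]
        return node.get('$')                # None if s is only a prefix
```

extract (ID to string) walks leaf-to-root in the real structures; it is omitted here since this plain dict-of-dicts has no parent pointers.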
161. Introduction
Compressed String Dictionaries
Experimental Evaluation
URIs
Literals
Conclusions
Outline
1 Introduction
2 Compressed String Dictionaries
3 Experimental Evaluation
162. Experimental Setup
Two real-world RDF dictionaries:
26,948,638 URIs from Uniprot:
Average length: 51.04 chars per URI.
Highly repetitive.
27,592,013 Literals from DBpedia:
Average length: 60.45 chars per Literal.
We analyze compression effectiveness and retrieval speed:
locate, extract.
Prefix-based operations (URIs).
Substring-based operations (Literals).
In practice, extract is the most important query:
It is used many times as results are retrieved from the compressed dataset.
163. – URIs –
Compressed tries (LexRP and CentRP) obtain the best compression results
and report the best numbers for locate:
≈ 4.5% of the original space.
≈ 2–3 µs/string.
> 2 µs/ID.
HTFC uses slightly more space, but it is faster for extract:
≈ 5–13% of the original space.
≈ 2.2–3 µs/string.
≈ 0.7–1.6 µs/ID.
The best tradeoff is for PFC:
≈ 9–19% of the original space.
≈ 1.6 µs/string.
≈ 0.3–0.6 µs/ID.
164. Prefix-based Operations
PFC is the best choice for prefix-based operations:
Although it uses more space, it reports the best performance.
165. – Literals –
Compressed tries (LexRP and CentRP) obtain competitive compression
results and report the best numbers for locate:
≈ 12% of the original space.
≈ 2–2.5 µs/string.
> 2.5 µs/ID.
HTFC reports the best compression ratios, but its performance is less
competitive:
≈ 9–17% of the original space.
≈ 4.5–40 µs/string.
≈ 3–20 µs/ID.
The best tradeoff is for Hash:
≈ 15% of the original space.
≈ 1.5 µs/string.
≈ 1 µs/ID.
167. – Conclusions –
RDF dictionaries are highly compressible:
URIs are very redundant, and Literals also show non-negligible symbolic
redundancy.
This redundancy can be detected and removed with specific data
structures for dictionaries:
Structures for URIs use up to 20 times less space than the original
dictionaries.
For Literals, the corresponding structures use 6–8 times less space than
the original dictionaries.
All these structures retrieve data at the microsecond level:
This functionality includes both simple and advanced operations.
168. GitHub
All dictionaries explained in this lecture (and some more [12]) are
available in the libCSD C++ library:
https://github.com/migumar2/libCSD
Beta version: suggestions are accepted ;)
169. Bibliography I
[1] Julian Arz and Johannes Fischer.
LZ-compressed string dictionaries.
In Proceedings of DCC, pages 322–331, 2014.
[2] Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro.
Compressed string dictionaries.
In Proceedings of SEA, pages 136–147, 2011.
[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms.
MIT Press and McGraw-Hill, 2nd edition, 2001.
[4] Paolo Ferragina and Giovanni Manzini.
Indexing compressed text.
Journal of the ACM, 52(4):552–581, 2005.
[5] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro.
Compressed representations of sequences and full-text indexes.
ACM Transactions on Algorithms, 3(2):article 20, 2007.
[6] Roberto Grossi and Giuseppe Ottaviano.
Fast Compressed Tries through Path Decompositions.
In Proceedings of ALENEX, pages 65–74, 2012.
[7] T.C. Hu and Alan C. Tucker.
Optimal Computer-Search Trees and Variable-Length Alphabetic Codes.
SIAM Journal on Applied Mathematics, 21:514–532, 1971.
170. Bibliography II
[8] David A. Huffman.
A method for the construction of minimum-redundancy codes.
Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
[9] Donald E. Knuth.
The Art of Computer Programming, volume 3: Sorting and Searching.
Addison Wesley, 1973.
[10] N. Jesper Larsson and Alistair Moffat.
Offline dictionary-based compression.
Proceedings of the IEEE, 88:1722–1732, 2000.
[11] Veli Mäkinen and Gonzalo Navarro.
Dynamic entropy-compressed sequences and full-text indexes.
ACM Transactions on Algorithms, 4(3):article 32, 2008.
[12] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro.
Practical compressed string dictionaries.
Information Systems, 2015.
Under review.
[13] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas.
Querying RDF Dictionaries in Compressed Space.
SIGAPP Applied Computing Review, 12(2):64–77, 2012.
[14] Hugh E. Williams and Justin Zobel.
Compressing integers for fast file access.
The Computer Journal, 42:193–201, 1999.
171. Bibliography III
[15] Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
Managing Gigabytes: Compressing and Indexing Documents and Images.
Morgan Kaufmann, 1999.
172. This presentation has been made for learning/teaching purposes only.
The pictures used in the slides may be owned by other parties; their ownership remains exclusively with their authors.
173. Triples Compression and Indexing
1st KEYSTONE Training School
July 22nd, 2015. Faculty of ICT, Malta
Antonio Fariña
Miguel A. Martínez-Prieto
175. RDF Management Overview
Dictionary + ID-triples
Original triples:
(SPIRE, held on, London)
(London, capital of, UK)
(A.Gionis, attends, SPIRE)
(R.Raman, attends, SPIRE)
(M.Lalmas, attends, SPIRE)
(M.Lalmas, lives in, UK)
(M.Lalmas, works in, London)
(A.Gionis, lives in, Finland)
(R.Raman, lives in, UK)
(R.Raman, position, inv-speaker)
Dictionary encoding (terms grouped by role, each role with its own ID range):
SO (subjects & objects): London=1, SPIRE=2
S (subjects only): A.Gionis=3, M.Lalmas=4, R.Raman=5
O (objects only): Finland=3, inv-speaker=4, UK=5
P (predicates): attends=1, capital of=2, held on=3, lives in=4, position=5, works in=6
ID-based triples:
(2,3,1)
(1,2,5)
(3,1,2)
(5,1,2)
(4,1,2)
(4,4,5)
(4,6,1)
(3,4,3)
(5,4,5)
(5,5,4)
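This dictionary encoding can be reproduced with a short sketch. The SO/S/O/P partition and the case-insensitive lexicographic ordering are inferred from the example on the slide; `encode` is an illustrative name, not part of any library.

```python
def encode(triples):
    subjects = {s for s, p, o in triples}
    objects = {o for s, p, o in triples}
    low = str.lower                                 # ordering used in the example
    so = sorted(subjects & objects, key=low)        # shared subject-objects
    s_only = sorted(subjects - objects, key=low)
    o_only = sorted(objects - subjects, key=low)
    preds = sorted({p for s, p, o in triples}, key=low)
    # SO terms take IDs [1..|SO|]; S-only and O-only continue from |SO|+1,
    # so a shared term gets the same ID whether it acts as subject or object
    s_id = {t: i + 1 for i, t in enumerate(so + s_only)}
    o_id = {t: i + 1 for i, t in enumerate(so + o_only)}
    p_id = {t: i + 1 for i, t in enumerate(preds)}  # predicates: own range
    return [(s_id[s], p_id[p], o_id[o]) for s, p, o in triples]
```

Running it on the ten triples of the slide yields exactly the ID-based triples shown, starting with (2,3,1) for (SPIRE, held on, London).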
193. K2-Triples: Joins
• They implemented three join strategies, taking advantage of the K2-Triples structure:
– Independent join
– Chain join
– Interactive join
• Example query: (8,5,?X) (?X,2,?), evaluated over the matrices of predicates P5 and P2.
• The best strategy depends on the dataset and the type of join.
199. Compressed Suffix Array (CSA-SAD)
Back to Suffix Arrays
• Binary search for any pattern, e.g. P = “ab”:
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Occurrences: Occs = A[3]..A[4] = {8, 1}; noccs = (4−3)+1 = 2.
Fast, but large: O(m lg n) search time (O(m lg n + noccs) to locate), using
O(4n) bytes of space plus the text T.
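The slide's example can be checked with a few lines, using naive construction by sorting suffixes (fine for small texts; real builders are linear-time):

```python
from bisect import bisect_left

T = "abracadabra$"
# 1-based suffix array: starting positions sorted by their suffixes
A = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
suffixes = [T[i - 1:] for i in A]             # only conceptual in practice

def occurrences(P):
    sp = bisect_left(suffixes, P)             # first suffix >= P
    ep = bisect_left(suffixes, P + "\uffff")  # first suffix past P-prefixed ones
    return A[sp:ep]

assert A == [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
assert occurrences("ab") == [8, 1]            # the slide's A[3]..A[4]
assert occurrences("zz") == []
```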
200. Compressed Suffix Array (CSA-SAD)
CSA basics
• Can we reduce the space needs of a Suffix Array?
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
A points to the sorted suffixes: $, a$, abra$, abracadabra$, acadabra$,
adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$.
201. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ: A[Ψ(i)] = A[i] + 1
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
(Figure: the sorted suffixes, with Ψ(i) linking each suffix to the one
starting one position later in T.)
203. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ: A[Ψ(10)] = A[3] = A[10] + 1 = 8
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ(10) = 3: A[10] = 7 points to the suffix “dabra$”, and A[3] = 8 points to
the next one, “abra$”.
204. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ and F are enough to perform the binary search and to recover the
source data!
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
F = $ a a a a a b b c d r r
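That claim can be verified in a few lines: build A naively, derive Ψ and F from it, and then reconstruct T using only Ψ and F (0-based ranks here, versus the 1-based slides):

```python
T = "abracadabra$"
n = len(T)
A = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based positions
inv = {A[i]: i for i in range(n)}                      # A^-1 (0-based ranks)
psi = [inv[A[i] % n + 1] for i in range(n)]            # A[psi[i]] = A[i]+1 (cyclic)
F = ''.join(T[A[i] - 1] for i in range(n))             # first column

# The '$' suffix has rank 0, so the suffix starting at text position 1 has
# rank psi[0]; following psi while reading F spells out T left to right.
i, out = psi[0], []
for _ in range(n):
    out.append(F[i])
    i = psi[i]
assert ''.join(out) == T                               # T recovered from Ψ and F
```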
205. Compressed Suffix Array (CSA-SAD)
CSA basics
• Ψ and F (reducing space needs)
F = $ a a a a a b b c d r r   (positions 1..12)
D = 1 1 0 0 0 0 1 0 1 1 1 0   (bitmap marking where F changes)
S = $ a b c d r   (sorted alphabet)
206. Compressed Suffix Array (CSA-SAD)
Representing F
F = $ a a a a a b b c d r r   (positions 1..12)
D = 1 1 0 0 0 0 1 0 1 1 1 0   (bitmap)
S = $ a b c d r   (sorted alphabet)
• Example: F[8] = S[rank1(D, 8)] = S[3] = ‘b’.
rank1(D, i) takes O(1) time, using o(n) extra space.
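The D + S representation of F can be sketched directly (with a naive linear rank standing in for the O(1) structure):

```python
F = "$aaaaabbcdrr"                       # first column; 1-based in the slides
D = [1 if i == 0 or F[i] != F[i - 1] else 0 for i in range(len(F))]
S = sorted(set(F))                       # sorted alphabet: $ a b c d r

def rank1(D, i):                         # ones in D[1..i] (1-based); O(1)
    return sum(D[:i])                    # with o(n) extra space in practice

def F_at(i):                             # F[i] = S[rank1(D, i)], 1-based
    return S[rank1(D, i) - 1]            # -1 only because S is 0-based here

assert D == [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
assert F_at(8) == 'b'                    # the slide's S[rank1(D,8)] = S[3]
```

Storing D and S instead of F replaces n symbols by n bits plus σ symbols.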
207. Compressed Suffix Array (CSA-SAD)
Compressing Ψ
• Absolute samples (k = sample period).
• Gap encoding on increasing values: Huffman & run encoding.
• Huffman with an N-entry dictionary:
– k reserved Huffman codes encode 1-runs of size s ∈ [1..k−1].
– 32 + 32 Huffman codes represent the size (in bits) of large values [+ or −];
each is followed by the value itself, encoded with log(v) bits.
– The remaining N − k − 32 − 32 entries correspond to the most frequent gap values.
Example: Ψ = 11 6 7 12 1 4 9 10 8 2 3 5, stored as absolute samples sΨ plus
an encoded gap stream Δ (bit stream omitted).
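A stripped-down version of the sampling idea: absolute samples every k values and plain integer gaps in between. The Huffman and run-length layers of the slide are left out, so gaps here are raw (possibly negative) integers; `build` and `access` are illustrative names.

```python
def build(psi, k):
    samples = psi[::k]                              # absolute samples
    gaps = [psi[i] - psi[i - 1]                     # k-1 gaps per block
            for i in range(len(psi)) if i % k]
    return samples, gaps

def access(samples, gaps, k, i):                    # recover psi[i]
    v = samples[i // k]                             # nearest sample <= i
    base = (i // k) * (k - 1)                       # start of its gap block
    for j in range(i % k):                          # add at most k-1 gaps
        v += gaps[base + j]
    return v

psi = [11, 6, 7, 12, 1, 4, 9, 10, 8, 2, 3, 5]       # the slide's toy Ψ
samples, gaps = build(psi, 4)
assert all(access(samples, gaps, 4, i) == psi[i] for i in range(12))
```

The sample period k trades space (fewer samples) against access time (more gaps to sum), which is the space/time parameter mentioned on the next slide.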
208. Compressed Suffix Array (CSA-SAD)
Complete structure
• Components (parameters set a space/time trade-off):
– Ψ (sampled) + gap encoding (delta codes, Huffman-based codes, run encoding), D, and S → count.
– A (sampled) → locate.
– A^-1 (sampled) → extract.
T = a b r a c a d a b r a $   (positions 1..12)
A = 12 11 8 1 4 6 9 2 5 7 10 3
A^-1 = 4 8 12 5 9 6 10 3 7 11 2 1
Ψ = 4 1 7 8 9 10 11 12 6 3 2 5
D = 1 1 0 0 0 0 1 0 1 1 1 0
S = $ a b c d r
210. RDF-CSA
Building RDF-CSA
• Step 1: Integer dictionary encoding of s, p, o.
• Step 2: Ordered list of n triples (a sequence Sid of 3n elements).
We first sort by subject, then by predicate, and finally by object.
211. RDF-CSA
Building RDF-CSA
• Step 3: Sid is transformed into S in order to keep disjoint alphabets:
Range [1, ns] for subjects.
Range [ns+1, ns+np] for predicates.
Range [ns+np+1, ns+np+no] for objects.
Due to this alphabet mapping, every subject is smaller than every predicate,
which in turn is smaller than every object!
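A one-liner makes Step 3 concrete (ns, np, no stand for the numbers of distinct subjects, predicates, and objects; `remap` is our name for the transformation):

```python
def remap(id_triples, ns, np):
    # subjects keep [1, ns]; predicates move to [ns+1, ns+np];
    # objects move to [ns+np+1, ns+np+no]
    return [(s, ns + p, ns + np + o) for s, p, o in id_triples]

mapped = remap([(2, 3, 1), (1, 2, 5)], ns=5, np=6)
assert mapped == [(2, 8, 12), (1, 7, 16)]
# disjoint alphabets: every subject id < every predicate id < every object id
assert max(s for s, p, o in mapped) < min(p for s, p, o in mapped)
assert max(p for s, p, o in mapped) < min(o for s, p, o in mapped)
```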
212. RDF-CSA
Building RDF-CSA
• Step 4: We build an iCSA on S.
A has three ranges: each range points to suffixes starting with a subject, a
predicate, or an object.
Ψ cycles around the components of the same triple; that is, the object of a
triple k does not point to the subject of triple k+1 in S, but to the subject
of the same triple.
We can start at position A[i], pointing to any place within a triple (s,p,o),
and recover the triple by successive applications of Ψ.
213. RDF-CSA
Searching for triple patterns
• Patterns: (S,P,O), (?S,P,O), (S,?P,O), (S,P,?O), (?S,?P,O), (S,?P,?O),
(?S,P,?O), (?S,?P,?O).
– Patterns with just one bounded element are directly solved using select on D.
– Pattern (?S,?P,?O) retrieves all the triples, so it can be solved by
retrieving every i-th triple using Ψ.
– The rest of the patterns use binary iCSA search:
• (S,P,O): bsearch(SPO, 3).
• (?S,P,O): bsearch(PO, 2) … (S,?P,O): bsearch(OS, 2).
– Optimizations:
• D-select+forward-check strategy: find the valid intervals in the S, P, and O
ranges, and check matches with Ψ within those intervals, starting from the
shortest one.
• D-select+backward-check strategy: use binary search to limit the valid
intervals, instead of sequentially verifying each position of the shortest
interval.
Optimizations are applicable to pattern (S,P,O) and to those with just one
unbounded term!
214. RDF-CSA
Searching for triple patterns
• (S,P,O) optimizations:
– D-select+forward-check strategy: find the valid intervals in the S, P, and O
ranges, and check matches with Ψ within those intervals, starting from the
shortest one.
– D-select+backward-check strategy: use binary search to limit the valid
intervals, instead of sequentially verifying each position of the shortest
interval.
(Figure: worked example for the pattern S=8, P=4, O=261, showing the
SP/SPO intervals checked forwards and the SPO/PO intervals checked
backwards.)