SlideShare a Scribd company logo
Positional Data Organization
and Compression in Web
Inverted Indexes
Leonidas Akritidis
Panayiotis Bozanis
Department of Computer & Communication
Engineering, University of Thessaly, Greece
23rd International Conference on Database and Expert Systems Applications
DEXA 2012, September 3-7, Vienna, Austria
University of Thessaly, Greece DEXA 2012
Introduction
 We present a method for the organization
and compression of the positional data in
Web Inverted Indexes.
 Important problem since the positions enable:
 Evaluation of ranked proximity queries.
 Evaluation of exact phrase queries.
University of Thessaly, Greece DEXA 2012
Block-based index setup
University of Thessaly, Greece DEXA 2012
Block-based index setup
 It allows partial decompression of the docID
blocks, without touching the rest of the index
data (frequencies, positions, etc).
 Useful during DAAT/TAAT query processing
where we are interested in locating docIDs
smaller than or equal to a given value.
 It requires:
 A skip table to jump forward in the inverted list.
 A positions look-up structure (more later).
University of Thessaly, Greece DEXA 2012
Inverted index compression
 Many algorithms, divided in two categories:
 Individual integer compression methods
 Applicable to single integer values.
 Variable Byte, Elias Gamma/Delta, Golomb/Rice.
 Group integer compression methods
 Applicable to entire bundles of integers.
 VSEncoding, PForDelta (P4D), Optimized P4D.
 Fast, but require decoding the entire block.
University of Thessaly, Greece DEXA 2012
Positional data
 The group encoding methods are ideal for
docIDs
 But what about the positional data?
 The positional data dominate the index: Each
posting includes on average 2.5-3.5
positional values.
 Hence, decoding all the positions for all of the
candidate postings is very expensive.
University of Thessaly, Greece DEXA 2012
Positional data access during
query processing
 Query processing optimization.
 Divide the processing in two stages:
 On the first step, quickly identify the most suitable
results by using docIDs and frequencies only.
 On the second step, perform a more refined
processing by accessing the positional data for
the K best results of the previous phase only.
 Apparently, we are only interested in
accessing the positions of specific postings.
 Decoding entire blocks is redundant.
University of Thessaly, Greece DEXA 2012
Positional Data Organization
 Consequently, the group encoding methods
are rendered inappropriate.
 Moreover, we do not know exactly where the
positional values of a posting are stored.
 For this reason, two strategies have been
proposed:
 Create a tree-based positions look-up structure.
 Construct indexed lists.
University of Thessaly, Greece DEXA 2012
Positional data Look-ups
 The look-up structure requires one look-up
per posting, thus decelerating processing.
 The indexed list requires one pointer per
posting pointing to the respective positional
data, thus it is prohibitively expensive.
University of Thessaly, Greece DEXA 2012
PFBC: Positions Fixed-Bit
Compression
 PFBC is designed to overcome all these
problems.
 For each inverted list block, it employs a fixed
number of bits to binary-encode each
positional value.
 Then, by using the frequency values, we are
able to calculate the exact location of the
positions for each posting, without look-ups.
 We decode only the data actually required.
University of Thessaly, Greece DEXA 2012
PFBC: Index organization
 PFBC is compatible to all state-of-the-art inverted
list partitioning approaches.
 Compressed Embedded Skip Lists, Skips every 128
postings, Skips every postings (Ft: number of
documents containing t), etc.
 We store all the positional values contiguously.
 For each inverted list block Bi we store:
 A pointer RBi pointing to the beginning of the positional
data of the postings of the block Bi.
 A CBi value representing the number of bits used to
encode the positions of the block Bi.
tF
University of Thessaly, Greece DEXA 2012
PFBC: Index organization
University of Thessaly, Greece DEXA 2012
PFBC: Compression Phase
 Encode a bundle of K
positional values.
 Steps 3-8: Identify the
largest positional value pmax.
 Step 9: Calculate C, i.e. the
number of bits required to
represent all K values.
 Step 13: The function
write() stores an integer
into a compressed
sequence P by using C bits.
University of Thessaly, Greece DEXA 2012
PFBC: Decompression Phase
 Decode the positional data
of posting j, of the block Bi.
 Steps 2-5: Sum up all the
frequency values of the
previous postings of Bi.
 Step 6: Locate the bit where
the positions of j start from.
 Step 9: The read() function
reads each encoded position
from the P vector.
University of Thessaly, Greece DEXA 2012
PFBC Gains
 It facilitates direct access to the positional data
without requiring expensive look-ups. Therefore,
query processing is accelerated.
 It saves the space cost of maintaining a separate
look-up structure; the required data is stored within
the skip table.
 It needs much fewer pointers than the indexed lists.
 It enables decoding of the information actually
needed, without decompressing entire blocks of
integers.
University of Thessaly, Greece DEXA 2012
Experiments
 Compression experiments: PFBC against
OptP4D and VSEncoding.
 Organization experiments: PFBC against the
tree look-up structure of [Yan et.al] and the
indexed list of [Transier and Sanders].
 Document collection: ClueWeb-09B dataset
(~50 million documents, ~1.5 TB).
University of Thessaly, Greece DEXA 2012
Experiments: Index size
 PFBC introduces a slight loss in the overall
index size (increase of 1%-2%).
 This loss is amortized by the reduced size of
the accompanying data structures (skip table,
pointers to positions).
University of Thessaly, Greece DEXA 2012
Experiments: Query
throughput
 Recall: The query processing is performed in two
phases; the positions are retrieved for the K best results
of the first phase only.
 PFBC touches much fewer data.
 PFBC is look-up free.
 PFBC offers ~5 times faster decompression.
University of Thessaly, Greece DEXA 2012
Conclusions
 We presented PFBC, a method for effective
organization and compression of the positional
data in Web inverted indexes.
 PFBC uses a fixed number of bits to encode
the positions of an inverted list block.
 The slight increase in the overall index size is
amortized by the slight decrease in the
assistant data structures sizes.
 Much faster decompression.
 No look-ups for the positions are required.
 Allows us to decode the data actually needed.
University of Thessaly, Greece DEXA 2012
The end
Thank you for watching!
Any questions?

More Related Content

What's hot

Lecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic SortingLecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic Sorting
Haitham El-Ghareeb
 
Hadoop in sigmod 2011
Hadoop in sigmod 2011Hadoop in sigmod 2011
Hadoop in sigmod 2011
Bin Cai
 
Data structure
Data structureData structure
Data structure
Mohd Arif
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
koolkampus
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnet
caise2013vlc
 

What's hot (17)

Algorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial OperationsAlgorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial Operations
 
Lecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic SortingLecture 07 Data Structures - Basic Sorting
Lecture 07 Data Structures - Basic Sorting
 
Hadoop in sigmod 2011
Hadoop in sigmod 2011Hadoop in sigmod 2011
Hadoop in sigmod 2011
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
 
Data structure
Data structureData structure
Data structure
 
ICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and ProcessingICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and Processing
 
Introduction to data structure
Introduction to data structureIntroduction to data structure
Introduction to data structure
 
Trees
TreesTrees
Trees
 
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
2. Linear Data Structure Using Arrays - Data Structures using C++ by Varsha P...
 
Duplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPathDuplicate Detection in Hierarchical Data Using XPath
Duplicate Detection in Hierarchical Data Using XPath
 
An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF Graphs
 
Programming in C++ and Data Strucutres
Programming in C++ and Data StrucutresProgramming in C++ and Data Strucutres
Programming in C++ and Data Strucutres
 
Symmetry 13-00195-v2
Symmetry 13-00195-v2Symmetry 13-00195-v2
Symmetry 13-00195-v2
 
An ai planning approach for generating
An ai planning approach for generatingAn ai planning approach for generating
An ai planning approach for generating
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
 
ModelDR - the tool that untangles complex information
ModelDR - the tool that untangles complex informationModelDR - the tool that untangles complex information
ModelDR - the tool that untangles complex information
 
Henning agt talk-caise-semnet
Henning agt   talk-caise-semnetHenning agt   talk-caise-semnet
Henning agt talk-caise-semnet
 

Similar to Positional Data Organization and Compression in Web Inverted Indexes

Part2- The Atomic Information Resource
Part2- The Atomic Information ResourcePart2- The Atomic Information Resource
Part2- The Atomic Information Resource
JEAN-MICHEL LETENNIER
 
Indexing for Large DNA Database sequences
Indexing for Large DNA Database sequencesIndexing for Large DNA Database sequences
Indexing for Large DNA Database sequences
CSCJournals
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
ijseajournal
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrieval
Amjad Ali
 

Similar to Positional Data Organization and Compression in Web Inverted Indexes (20)

ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
 
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIESENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES
 
Enhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologiesEnhancing keyword search over relational databases using ontologies
Enhancing keyword search over relational databases using ontologies
 
Column store databases approaches and optimization techniques
Column store databases  approaches and optimization techniquesColumn store databases  approaches and optimization techniques
Column store databases approaches and optimization techniques
 
Search Approach - ES, GraphDB
Search Approach - ES, GraphDBSearch Approach - ES, GraphDB
Search Approach - ES, GraphDB
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Normalisation in Database management System (DBMS)
Normalisation in Database management System (DBMS)Normalisation in Database management System (DBMS)
Normalisation in Database management System (DBMS)
 
Part2- The Atomic Information Resource
Part2- The Atomic Information ResourcePart2- The Atomic Information Resource
Part2- The Atomic Information Resource
 
Bo4301369372
Bo4301369372Bo4301369372
Bo4301369372
 
QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE
 
Indexing for Large DNA Database sequences
Indexing for Large DNA Database sequencesIndexing for Large DNA Database sequences
Indexing for Large DNA Database sequences
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
 
Efficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl FrameworkEfficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl Framework
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrieval
 
A Practical Dynamic Single Assignment Transformation
A Practical Dynamic Single Assignment TransformationA Practical Dynamic Single Assignment Transformation
A Practical Dynamic Single Assignment Transformation
 
I explore
I exploreI explore
I explore
 
ADAPTER
ADAPTERADAPTER
ADAPTER
 

More from Leonidas Akritidis

More from Leonidas Akritidis (8)

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
 
A Self-Pruning Classification Model for News
A Self-Pruning Classification Model for NewsA Self-Pruning Classification Model for News
A Self-Pruning Classification Model for News
 
Effective Products Categorization with Importance Scores and Morphological An...
Effective Products Categorization with ImportanceScores and Morphological An...Effective Products Categorization with ImportanceScores and Morphological An...
Effective Products Categorization with Importance Scores and Morphological An...
 
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product Titles
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research Articles
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
 

Recently uploaded

Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
Kamal Acharya
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 

Recently uploaded (20)

fundamentals of drawing and isometric and orthographic projection
fundamentals of drawing and isometric and orthographic projectionfundamentals of drawing and isometric and orthographic projection
fundamentals of drawing and isometric and orthographic projection
 
fluid mechanics gate notes . gate all pyqs answer
fluid mechanics gate notes . gate all pyqs answerfluid mechanics gate notes . gate all pyqs answer
fluid mechanics gate notes . gate all pyqs answer
 
Scaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltageScaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltage
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdf
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker project
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES  INTRODUCTION UNIT-IENERGY STORAGE DEVICES  INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 
Introduction to Casting Processes in Manufacturing
Introduction to Casting Processes in ManufacturingIntroduction to Casting Processes in Manufacturing
Introduction to Casting Processes in Manufacturing
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Kraków
 

Positional Data Organization and Compression in Web Inverted Indexes

  • 1. Positional Data Organization and Compression in Web Inverted Indexes Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication Engineering, University of Thessaly, Greece 23rd International Conference on Database and Expert Systems Applications DEXA 2012, September 3-7, Vienna, Austria
  • 2. University of Thessaly, Greece DEXA 2012 Introduction  We present a method for the organization and compression of the positional data in Web Inverted Indexes.  Important problem since the positions enable:  Evaluation of ranked proximity queries.  Evaluation of exact phrase queries.
  • 3. University of Thessaly, Greece DEXA 2012 Block-based index setup
  • 4. University of Thessaly, Greece DEXA 2012 Block-based index setup  It allows partial decompression of the docID blocks, without touching the rest of the index data (frequencies, positions, etc).  Useful during DAAT/TAAT query processing where we are interested in locating docIDs smaller than or equal to a given value.  It requires:  A skip table to jump forward in the inverted list.  A positions look-up structure (more later).
  • 5. University of Thessaly, Greece DEXA 2012 Inverted index compression  Many algorithms, divided in two categories:  Individual integer compression methods  Applicable to single integer values.  Variable Byte, Elias Gamma/Delta, Golomb/Rice.  Group integer compression methods  Applicable to entire bundles of integers.  VSEncoding, PForDelta (P4D), Optimized P4D.  Fast, but require decoding the entire block.
  • 6. University of Thessaly, Greece DEXA 2012 Positional data  The group encoding methods are ideal for docIDs  But what about the positional data?  The positional data dominate the index: Each posting includes on average 2.5-3.5 positional values.  Hence, decoding all the positions for all of the candidate postings is very expensive.
  • 7. University of Thessaly, Greece DEXA 2012 Positional data access during query processing  Query processing optimization.  Divide the processing in two stages:  On the first step, quickly identify the most suitable results by using docIDs and frequencies only.  On the second step, perform a more refined processing by accessing the positional data for the K best results of the previous phase only.  Apparently, we are only interested in accessing the positions of specific postings.  Decoding entire blocks is redundant.
  • 8. University of Thessaly, Greece DEXA 2012 Positional Data Organization  Consequently, the group encoding methods are rendered inappropriate.  Moreover, we do not know exactly where the positional values of a posting are stored.  For this reason, two strategies have been proposed:  Create a tree-based positions look-up structure.  Construct indexed lists.
  • 9. University of Thessaly, Greece DEXA 2012 Positional data Look-ups  The look-up structure requires one look-up per posting, thus decelerating processing.  The indexed list requires one pointer per posting pointing to the respective positional data, thus it is prohibitively expensive.
  • 10. University of Thessaly, Greece DEXA 2012 PFBC: Positions Fixed-Bit Compression  PFBC is designed to overcome all these problems.  For each inverted list block, it employs a fixed number of bits to binary-encode each positional value.  Then, by using the frequency values, we are able to calculate the exact location of the positions for each posting, without look-ups.  We decode only the data actually required.
  • 11. University of Thessaly, Greece DEXA 2012 PFBC: Index organization  PFBC is compatible to all state-of-the-art inverted list partitioning approaches.  Compressed Embedded Skip Lists, Skips every 128 postings, Skips every postings (Ft: number of documents containing t), etc.  We store all the positional values contiguously.  For each inverted list block Bi we store:  A pointer RBi pointing to the beginning of the positional data of the postings of the block Bi.  A CBi value representing the number of bits used to encode the positions of the block Bi. tF
  • 12. University of Thessaly, Greece DEXA 2012 PFBC: Index organization
  • 13. University of Thessaly, Greece DEXA 2012 PFBC: Compression Phase  Encode a bundle of K positional values.  Steps 3-8: Identify the largest positional value pmax.  Step 9: Calculate C, i.e. the number of bits required to represent all K values.  Step 13: The function write() stores an integer into a compressed sequence P by using C bits.
  • 14. University of Thessaly, Greece DEXA 2012 PFBC: Decompression Phase  Decode the positional data of posting j, of the block Bi.  Steps 2-5: Sum up all the frequency values of the previous postings of Bi.  Step 6: Locate the bit where the positions of j start from.  Step 9: The read() function reads each encoded position from the P vector.
  • 15. University of Thessaly, Greece DEXA 2012 PFBC Gains  It facilitates direct access to the positional data without requiring expensive look-ups. Therefore, query processing is accelerated.  It saves the space cost of maintaining a separate look-up structure; the required data is stored within the skip table.  It needs much fewer pointers than the indexed lists.  It enables decoding of the information actually needed, without decompressing entire blocks of integers.
  • 16. University of Thessaly, Greece DEXA 2012 Experiments  Compression experiments: PFBC against OptP4D and VSEncoding.  Organization experiments: PFBC against the tree look-up structure of [Yan et.al] and the indexed list of [Transier and Sanders].  Document collection: ClueWeb-09B dataset (~50 million documents, ~1.5 TB).
  • 17. University of Thessaly, Greece DEXA 2012 Experiments: Index size  PFBC introduces a slight loss in the overall index size (increase of 1%-2%).  This loss is amortized by the reduced size of the accompanying data structures (skip table, pointers to positions).
  • 18. University of Thessaly, Greece DEXA 2012 Experiments: Query throughput  Recall: The query processing is performed in two phases; the positions are retrieved for the K best results of the first phase only.  PFBC touches much fewer data.  PFBC is look-up free.  PFBC offers ~5 times faster decompression.
  • 19. University of Thessaly, Greece DEXA 2012 Conclusions  We presented PFBC, a method for effective organization and compression of the positional data in Web inverted indexes.  PFBC uses a fixed number of bits to encode the positions of an inverted list block.  The slight increase in the overall index size is amortized by the slight decrease in the assistant data structures sizes.  Much faster decompression.  No look-ups for the positions are required.  Allows us to decode the data actually needed.
  • 20. University of Thessaly, Greece DEXA 2012 The end Thank you for watching! Any questions?