CPM2013-tabei201306

Yasuo Tabei
Yasuo TabeiResearcher, Software Developer at Japan Science and Technology Agency
24th Annual Symposium on Combinatorial Pattern Matching,
Bad Herrenalb, 2013/6/19

A Succinct Grammar Compression
Yasuo Tabei1,
Yoshimasa Takabatake2,
Hiroshi Sakamoto2

1. Japan Science and Technology Agency
2. Kyusyu Institute of Technology
Straight Line Program (SLP)
• Canonical form of a CFG deriving a single string
• Every production rule satisfies
– Right hand side of a production rule is a digram
– Subscript of the left symbol is larger than subscrips of
the right symbols, i.e., Xk ➝ XiXj (k>i,j)

Example:

X1 → ab
X2 → aX1
X3 → bX1
X4 → X3b
X5 → X2X4

X5
X2
a

X4

X1
ab

b

X3
b

X1
ab
Grammar compression
• Builds up an SLP from a given string
• Two crucial data structures to access a production rule
Xk ➝ XiXj
1. Dictionary (Array) : Given Xk, return XiXj
2. Reverse dictionary (Hash table) : Given XiXj, return Xk if
Xk ➝ XiXj is registered in the dictionary
X1 → ab
X2 → aX1
X3 → bX1
X4 → X3b
X5 → X2X4

Access Xk ➝ A[2k-2]A[2k-1]

Space : 2nlogn bits (n : #variables)
Three open problems about an optimal
encoding of an SLP
1. An nontrivial information theoretic lower bound
for encoding an SLP
2. Optimal encoding of an SLP
– Standard array : 2nlogn bits (n:#variables)
– Present an encoding asymptotically equivalent to the
lower bound, while supporting fast random access

3. Space-efficient data structure for reverse
dictionary
– Hash table uses O(nlog(n)) bits
– Present a data structure of 2nlogn(1+o(1)) bits
An information theoretic lower bound for
representing an SLP
An information theoretic lower bound for
representing an SLP of n variables :
logn! + 2n + o(n) bits
• Use two techniques for the proof
1. Spanning tree decomposition for representing an
SLP as two ordered trees
2. Right most expansion for completely enumerating
ordered trees

• First introduce these two techniques, and then
show a sketch of the proof
Spanning tree decomposition [SPIRE11]
• Any SLP can be represented as left and right
spanning trees.

X5

X5
X2

X2
a

X4

X4

X2

X3

X1

b

X3
b

X1
a b

X5

X5

X4

a b

Spanning trees

DAG representation

Parse tree

X2
X3

X3

X1

X1

X1

a

X4

b
s

Indegree(s) = 2σ

a

b
s

b

a
s
Right most expansion [KDD02,SDM02]
• Build trees of (m+1) nodes from a tree of m
nodes
– Add a node to the nodes on the right most path
Example :

: right most path
Search space : All trees can be enumerated by
applying the right most expansion, recursively

level1 2

3

4
Sketch of the proof (detail)
• Basic idea: Consider a super set S(n) of DAG(n) without
the restriction of the in-degree 2σ of the sink, and count
|S(n)| by the induction
– Get
• Decompose S(n) into the left and right trees by the
spanning tree decomposition
• Count the number of left trees and the right trees of
(n+1) nodes by induction
ー Apply the right most expansion to the left tree
–
• Get the information-theoretic minimum bits for
representing G∈DAG(n) : logn!+2n+o(n) bits
An optimal encoding of an SLP
An optimal encoding of an SLP
• Basic idea : Encode the left and right symbols of the right
hand side of the production rules, respectively
• Rename the variables by traversing the left tree in the
breadth first manner
X5

X5
X4

X2

X4 Rename

X2

b
s

X1
X3
X2

X2

b

a
s

X5

X3

X1

X1
a

X5

X1

X3

X3

X4

X4

a

b
s

b

a
s
Encoding the left symbols
• Left symbols are monotonically increasing
• Apply gap encoding to the left symbols
• Use rank/select dictionary for O(1)-time access
X1 → a
X2 → a
X3 → b
X4 → X1
X5 → X3

X2
b
X2
X5
b

00013
Gap encoding

0010010010(1-0)10(3-1)1
11101001
n + o(n) bits (n: #variables)
O(1) time access
Encoding the right symbols D (detail)
• Extract subarrays si of monotonically increasing and
decreasing elements from D
• Use two integer arrays
,
and two bit arrays B,b
•
bits, and
access
time
Space-efficient data structure for reverse
dictionary
Space-efficient data structure for reverse
dictionary
• Recap : Reverse dictionary D-1

• Basic idea: Build a wavelet tree (WT) consisting
of right symbols XiXj, and simulate reverse
dictionary on the WT.
• Access and update time: O(logn), Space:
2nlogn(1+o(1)) bits
Build WT from digrams : The range of a digram is split
into the higher half (right) and the lower half (left)
Accessing Xk➝XiXj
EX) Access X3X2

rank1

rank0

• Start from the root B1 as
i = n (#variables)
• Apply rank1(Bj,i) for the right
child and rank0(Bj,i) for the
left child
• After reaching a leaf, go up to
the root by applying select
operation
• Solution: Xk = select0/1(B1,i)
Conclusion
• Three open problems related to an optimal
encoding of an SLP
1. an information theoretic lower bound
2. an optimal encoding
3. a dynamic data structure for reverse dictionary

• Novel challenges : Developing succinct data
structures of an SLP for various applications
e.g., self-index, pattern mining, q-gram mining etc
Fully-online grammar compression (FOLCA)
Maruyama, Tabei, Sakamoto, Sadakane (submitted)
• Builds and directly encodes an SLP into a
succinct representation of the partial parse tree
from a give text in an online fashion
Partial parse tree

Text : abaababb

Succinct
representation
12345678910
B:0010101011
L:abaX1X2

• Space : nlogn + 2n + o(n) bits, asymptotically
equivalent to our information theoretic lower bound
1 of 20

Recommended

SPIRE2013-tabei20131009 by
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
4.9K views25 slides
DCC2014 - Fully Online Grammar Compression in Constant Space by
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceYasuo Tabei
1.1K views21 slides
Finding similar items in high dimensional spaces locality sensitive hashing by
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashingDmitriy Selivanov
1.2K views20 slides
Big O Notation by
Big O NotationBig O Notation
Big O NotationMarcello Missiroli
1.7K views23 slides
3 - Finding similar items by
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
4.1K views49 slides
lecture 13 by
lecture 13lecture 13
lecture 13sajinsc
378 views20 slides

More Related Content

What's hot

Locality sensitive hashing by
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashingSameera Horawalavithana
2K views17 slides
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing... by
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...Tomonari Masada
284 views37 slides
LSH by
LSHLSH
LSHHsiao-Fei Liu
1.7K views30 slides
Big o notation by
Big o notationBig o notation
Big o notationhamza mushtaq
5.8K views16 slides
Sortsearch by
SortsearchSortsearch
Sortsearch Krishna Chaytaniah
952 views26 slides
Real Time Big Data Management by
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
1.3K views67 slides

What's hot(20)

Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing... by Tomonari Masada
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...
Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing...
Tomonari Masada284 views
Real Time Big Data Management by Albert Bifet
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
Albert Bifet1.3K views
New zealand bloom filter by xlight
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
xlight1.5K views
Monad presentation scala as a category by samthemonad
Monad presentation   scala as a categoryMonad presentation   scala as a category
Monad presentation scala as a category
samthemonad2K views
Introducing: A Complete Algebra of Data by Inside Analysis
Introducing: A Complete Algebra of DataIntroducing: A Complete Algebra of Data
Introducing: A Complete Algebra of Data
Inside Analysis1.1K views
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data by Thomas Gottron
Of Sampling and Smoothing: Approximating Distributions over Linked Open DataOf Sampling and Smoothing: Approximating Distributions over Linked Open Data
Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
Thomas Gottron496 views
Computer Science Engineering : Data structure & algorithm, THE GATE ACADEMY by klirantga
Computer Science Engineering : Data structure & algorithm, THE GATE ACADEMYComputer Science Engineering : Data structure & algorithm, THE GATE ACADEMY
Computer Science Engineering : Data structure & algorithm, THE GATE ACADEMY
klirantga1.4K views
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ... by T. E. BOGALE
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
Joint CSI Estimation, Beamforming and Scheduling Design for Wideband Massive ...
T. E. BOGALE136 views
Practical Two-level Homomorphic Encryption in Prime-order Bilinear Groups by MITSUNARI Shigeo
Practical Two-level Homomorphic Encryption in Prime-order Bilinear GroupsPractical Two-level Homomorphic Encryption in Prime-order Bilinear Groups
Practical Two-level Homomorphic Encryption in Prime-order Bilinear Groups
MITSUNARI Shigeo1.4K views
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti... by Fabian Pedregosa
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Fabian Pedregosa157 views

Similar to CPM2013-tabei201306

zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ... by
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...Alex Pruden
791 views48 slides
03.01 hash tables by
03.01 hash tables03.01 hash tables
03.01 hash tablesJyotsna Suryadevara
129 views16 slides
Analysis Of Algorithms - Hashing by
Analysis Of Algorithms - HashingAnalysis Of Algorithms - Hashing
Analysis Of Algorithms - HashingSam Light
544 views53 slides
Randamization.pdf by
Randamization.pdfRandamization.pdf
Randamization.pdfPrashanth460337
5 views46 slides
Towards typesafe deep learning in scala by
Towards typesafe deep learning in scalaTowards typesafe deep learning in scala
Towards typesafe deep learning in scalaTongfei Chen
474 views43 slides
Longest Common Subsequence by
Longest Common SubsequenceLongest Common Subsequence
Longest Common SubsequenceSwati Swati
1.5K views30 slides

Similar to CPM2013-tabei201306(20)

zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ... by Alex Pruden
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
Alex Pruden791 views
Analysis Of Algorithms - Hashing by Sam Light
Analysis Of Algorithms - HashingAnalysis Of Algorithms - Hashing
Analysis Of Algorithms - Hashing
Sam Light544 views
Towards typesafe deep learning in scala by Tongfei Chen
Towards typesafe deep learning in scalaTowards typesafe deep learning in scala
Towards typesafe deep learning in scala
Tongfei Chen474 views
Longest Common Subsequence by Swati Swati
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
Swati Swati1.5K views
Dixon Deep Learning by SciCompIIT
Dixon Deep LearningDixon Deep Learning
Dixon Deep Learning
SciCompIIT521 views
Dimension Reduction Introduction & PCA.pptx by RohanBorgalli
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptx
RohanBorgalli120 views
lecture 11 by sajinsc
lecture 11lecture 11
lecture 11
sajinsc455 views
Succinct Data Structure for Analyzing Document Collection by Preferred Networks
Succinct Data Structure for Analyzing Document CollectionSuccinct Data Structure for Analyzing Document Collection
Succinct Data Structure for Analyzing Document Collection
Preferred Networks10.4K views
Ee693 questionshomework by Gopi Saiteja
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomework
Gopi Saiteja643 views
Introduction to Ultra-succinct representation of ordered trees with applications by Yu Liu
Introduction to Ultra-succinct representation of ordered trees with applicationsIntroduction to Ultra-succinct representation of ordered trees with applications
Introduction to Ultra-succinct representation of ordered trees with applications
Yu Liu511 views

More from Yasuo Tabei

Space-efficient Feature Maps for String Alignment Kernels by
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsYasuo Tabei
253 views15 slides
SISAP17 by
SISAP17SISAP17
SISAP17Yasuo Tabei
325 views32 slides
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices by
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesYasuo Tabei
4.4K views26 slides
Kdd2015reading-tabei by
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabeiYasuo Tabei
1.1K views18 slides
NIPS2013読み会: Scalable kernels for graphs with continuous attributes by
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
8.3K views22 slides
GIW2013 by
GIW2013GIW2013
GIW2013Yasuo Tabei
1.1K views26 slides

More from Yasuo Tabei(17)

Space-efficient Feature Maps for String Alignment Kernels by Yasuo Tabei
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment Kernels
Yasuo Tabei253 views
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices by Yasuo Tabei
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Yasuo Tabei4.4K views
Kdd2015reading-tabei by Yasuo Tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
Yasuo Tabei1.1K views
NIPS2013読み会: Scalable kernels for graphs with continuous attributes by Yasuo Tabei
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
Yasuo Tabei8.3K views
WABI2012-SuccinctMultibitTree by Yasuo Tabei
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
Yasuo Tabei4.3K views
Mlab2012 tabei 20120806 by Yasuo Tabei
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806
Yasuo Tabei2.8K views
Ibisml2011 06-20 by Yasuo Tabei
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
Yasuo Tabei896 views
Gwt presen alsip-20111201 by Yasuo Tabei
Gwt presen alsip-20111201Gwt presen alsip-20111201
Gwt presen alsip-20111201
Yasuo Tabei631 views
Lgm pakdd2011 public by Yasuo Tabei
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
Yasuo Tabei995 views
Gwt sdm public by Yasuo Tabei
Gwt sdm publicGwt sdm public
Gwt sdm public
Yasuo Tabei2.6K views
Lgm saarbrucken by Yasuo Tabei
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
Yasuo Tabei1.2K views
Sketch sort sugiyamalab-20101026 - public by Yasuo Tabei
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
Yasuo Tabei632 views
Sketch sort ochadai20101015-public by Yasuo Tabei
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
Yasuo Tabei1.4K views

Recently uploaded

State of the Union - Rohit Yadav - Apache CloudStack by
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStackShapeBlue
218 views53 slides
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...ShapeBlue
121 views15 slides
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...ShapeBlue
105 views15 slides
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueShapeBlue
63 views15 slides
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericShapeBlue
58 views9 slides
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...The Digital Insurer
40 views52 slides

Recently uploaded(20)

State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue121 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue105 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue63 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue58 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue81 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue120 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue110 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays49 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue97 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE67 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu287 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson142 views

CPM2013-tabei201306

  • 1. 24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, 2013/6/19 A Succinct Grammar Compression Yasuo Tabei1, Yoshimasa Takabatake2, Hiroshi Sakamoto2 1. Japan Science and Technology Agency 2. Kyusyu Institute of Technology
  • 2. Straight Line Program (SLP) • Canonical form of a CFG deriving a single string • Every production rule satisfies – Right hand side of a production rule is a digram – Subscript of the left symbol is larger than subscrips of the right symbols, i.e., Xk ➝ XiXj (k>i,j) Example: X1 → ab X2 → aX1 X3 → bX1 X4 → X3b X5 → X2X4 X5 X2 a X4 X1 ab b X3 b X1 ab
  • 3. Grammar compression • Builds up an SLP from a given string • Two crucial data structures to access a production rule Xk ➝ XiXj 1. Dictionary (Array) : Given Xk, return XiXj 2. Reverse dictionary (Hash table) : Given XiXj, return Xk if Xk ➝ XiXj is registered in the dictionary X1 → ab X2 → aX1 X3 → bX1 X4 → X3b X5 → X2X4 Access Xk ➝ A[2k-2]A[2k-1] Space : 2nlogn bits (n : #variables)
  • 4. Three open problems about an optimal encoding of an SLP 1. An nontrivial information theoretic lower bound for encoding an SLP 2. Optimal encoding of an SLP – Standard array : 2nlogn bits (n:#variables) – Present an encoding asymptotically equivalent to the lower bound, while supporting fast random access 3. Space-efficient data structure for reverse dictionary – Hash table uses O(nlog(n)) bits – Present a data structure of 2nlogn(1+o(1)) bits
  • 5. An information theoretic lower bound for representing an SLP
  • 6. An information theoretic lower bound for representing an SLP of n variables : logn! + 2n + o(n) bits • Use two techniques for the proof 1. Spanning tree decomposition for representing an SLP as two ordered trees 2. Right most expansion for completely enumerating ordered trees • First introduce these two techniques, and then show a sketch of the proof
  • 7. Spanning tree decomposition [SPIRE11] • Any SLP can be represented as left and right spanning trees. X5 X5 X2 X2 a X4 X4 X2 X3 X1 b X3 b X1 a b X5 X5 X4 a b Spanning trees DAG representation Parse tree X2 X3 X3 X1 X1 X1 a X4 b s Indegree(s) = 2σ a b s b a s
  • 8. Right most expansion [KDD02,SDM02] • Build trees of (m+1) nodes from a tree of m nodes – Add a node to the nodes on the right most path Example : : right most path
  • 9. Search space : All trees can be enumerated by applying the right most expansion, recursively level1 2 3 4
  • 10. Sketch of the proof (detail) • Basic idea: Consider a super set S(n) of DAG(n) without the restriction of the in-degree 2σ of the sink, and count |S(n)| by the induction – Get • Decompose S(n) into the left and right trees by the spanning tree decomposition • Count the number of left trees and the right trees of (n+1) nodes by induction ー Apply the right most expansion to the left tree – • Get the information-theoretic minimum bits for representing G∈DAG(n) : logn!+2n+o(n) bits
  • 11. An optimal encoding of an SLP
  • 12. An optimal encoding of an SLP • Basic idea : Encode the left and right symbols of the right hand side of the production rules, respectively • Rename the variables by traversing the left tree in the breadth first manner X5 X5 X4 X2 X4 Rename X2 b s X1 X3 X2 X2 b a s X5 X3 X1 X1 a X5 X1 X3 X3 X4 X4 a b s b a s
  • 13. Encoding the left symbols • Left symbols are monotonically increasing • Apply gap encoding to the left symbols • Use rank/select dictionary for O(1)-time access X1 → a X2 → a X3 → b X4 → X1 X5 → X3 X2 b X2 X5 b 00013 Gap encoding 0010010010(1-0)10(3-1)1 11101001 n + o(n) bits (n: #variables) O(1) time access
  • 14. Encoding the right symbols D (detail) • Extract subarrays si of monotonically increasing and decreasing elements from D • Use two integer arrays , and two bit arrays B,b • bits, and access time
  • 15. Space-efficient data structure for reverse dictionary
  • 16. Space-efficient data structure for reverse dictionary • Recap : Reverse dictionary D-1 • Basic idea: Build a wavelet tree (WT) consisting of right symbols XiXj, and simulate reverse dictionary on the WT. • Access and update time: O(logn), Space: 2nlogn(1+o(1)) bits
  • 17. Build WT from digrams : The range of a digram is split into the higher half (right) and the lower half (left)
  • 18. Accessing Xk➝XiXj EX) Access X3X2 rank1 rank0 • Start from the root B1 as i = n (#variables) • Apply rank1(Bj,i) for the right child and rank0(Bj,i) for the left child • After reaching a leaf, go up to the root by applying select operation • Solution: Xk = select0/1(B1,i)
  • 19. Conclusion • Three open problems related to an optimal encoding of an SLP 1. an information theoretic lower bound 2. an optimal encoding 3. a dynamic data structure for reverse dictionary • Novel challenges : Developing succinct data structures of an SLP for various applications e.g., self-index, pattern mining, q-gram mining etc
  • 20. Fully-online grammar compression (FOLCA) Maruyama, Tabei, Sakamoto, Sadakane (submitted) • Builds and directly encodes an SLP into a succinct representation of the partial parse tree from a give text in an online fashion Partial parse tree Text : abaababb Succinct representation 12345678910 B:0010101011 L:abaX1X2 • Space : nlogn + 2n + o(n) bits, asymptotically equivalent to our information theoretic lower bound