Motivation: Privacy-preserving data mining
Share textual data for mutual benefit, general good or contractual reasons
But not all of it:
text analytics on private documents
marketplace scenarios [Cancedda ACL 2012]
copyright concerns
Problem
1. Given the n-gram information of a document d, how well can we reconstruct d?
2. If I want or have to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?
Example
s = $ a rose rose is a rose is a rose #
2-grams (with counts):
  $ a        1
  a rose     3
  rose rose  1
  rose is    2
  is a       2
  rose #     1
Note that the same 2-grams are obtained starting from:
s = $ a rose is a rose rose is a rose #
s = $ a rose is a rose is a rose rose #
⟹ Find large chunks of text whose presence we are certain of
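A quick way to check this claim is to count the 2-grams of the three strings. The sketch below is illustrative only: the function name `ngram_counts` and the whitespace tokenisation are assumptions, not something taken from the talk.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count the n-grams of a whitespace-tokenised string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

original = "$ a rose rose is a rose is a rose #"
alternatives = [
    "$ a rose is a rose rose is a rose #",
    "$ a rose is a rose is a rose rose #",
]

counts = ngram_counts(original)
for alt in alternatives:
    # Different token orders, identical 2-gram statistics.
    assert ngram_counts(alt) == counts

for gram, freq in counts.items():
    print(" ".join(gram), freq)
```

The 2-gram counts alone therefore cannot tell these three strings apart, which is why the question becomes which chunks are shared by every candidate.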
Problem Encoding
An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, in which edges correspond to n-grams.
[Figure: graph on nodes 0 to 4 with edges labelled (n-gram, count): ($ a, 1), (a rose, 3), (rose rose, 1), (rose is, 2), (is a, 2), (rose #, 1)]
Example: the node sequence [2, 2, 3, 1] spells "rose rose is a".
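The encoding can be sketched in a few lines. This is a minimal illustration under stated assumptions: nodes are identified by the (n-1)-grams themselves rather than by the slide's numeric ids, and the helper names `build_graph` and `spell` are hypothetical.

```python
def build_graph(ngram_counts):
    """Map each n-gram to an edge (prefix -> suffix) carrying its count."""
    edges = {}
    for gram, count in ngram_counts.items():
        prefix, suffix = gram[:-1], gram[1:]
        edges[(prefix, suffix)] = count
    return edges

def spell(node_walk):
    """Overlapping concatenation of the (n-1)-gram nodes along a walk."""
    tokens = list(node_walk[0])
    for node in node_walk[1:]:
        tokens.append(node[-1])
    return " ".join(tokens)

bigram_counts = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
edges = build_graph(bigram_counts)

# Assumed reading of the slide's example walk: rose -> rose -> is -> a.
walk = [("rose",), ("rose",), ("is",), ("a",)]
# Every step of the walk must be an edge of the graph.
assert all((u, v) in edges for u, v in zip(walk, walk[1:]))
print(spell(walk))
```

For 2-grams each node is a single token, so the graph has one node per distinct token and one weighted edge per distinct 2-gram.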
Problem encoding
Given such a graph, each Eulerian path gives a plausible reconstruction
Problem: Find those parts that are common to all of them
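For a toy graph the Eulerian paths can simply be enumerated. The brute-force sketch below is not the method of the talk; it only illustrates that every path using each 2-gram exactly as often as counted is a candidate text. The function name and the `$` / `#` boundary markers as start and end are assumptions.

```python
def reconstructions(counts, start="$", end="#"):
    """All token sequences that use every 2-gram exactly as often as counted."""
    remaining = dict(counts)
    total = sum(counts.values())
    results = []

    def extend(path):
        if len(path) == total + 1:          # every edge has been used
            if path[-1] == end:
                results.append(" ".join(path))
            return
        for (u, v) in list(remaining):
            if u == path[-1] and remaining[(u, v)] > 0:
                remaining[(u, v)] -= 1       # consume one copy of the edge
                extend(path + [v])
                remaining[(u, v)] += 1       # backtrack

    extend([start])
    return results

bigrams = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
candidates = reconstructions(bigrams)
for text in candidates:
    print(text)
```

Enumeration is exponential in general, so it is only useful as a sanity check on very small inputs.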
BEST Theorem, 1951
Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is
$$T_w(G) \prod_{v \in V} (d(v) - 1)!$$
where $T_w(G)$ is the number of trees directed towards the root at a fixed node w.
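As a worked instance, the formula can be evaluated on the rose example. The sketch rests on several assumptions that are not in the slides: the path is closed into a cycle by an extra edge # → $, T_w(G) is computed with the matrix-tree theorem (determinant of a Laplacian minor), d(v) is taken as the out-degree, and parallel edges are treated as distinguishable, so the count is divided by the factorials of the multiplicities to get distinct texts.

```python
from fractions import Fraction
from math import factorial

def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return int(det)

def count_eulerian_cycles(nodes, edges, root):
    """T_w(G) * prod_v (d(v) - 1)!, with edges given as {(u, v): multiplicity}."""
    out_deg = {v: 0 for v in nodes}
    for (u, _), k in edges.items():
        out_deg[u] += k
    others = [v for v in nodes if v != root]
    index = {v: i for i, v in enumerate(others)}
    lap = [[0] * len(others) for _ in others]
    for (u, v), k in edges.items():
        if u == v or u == root:
            continue  # self-loops cancel out; the root's row is deleted
        lap[index[u]][index[u]] += k
        if v != root:
            lap[index[u]][index[v]] -= k
    result = determinant(lap)  # number of spanning trees directed towards root
    for v in nodes:
        result *= factorial(out_deg[v] - 1)
    return result

nodes = ["$", "a", "rose", "is", "#"]
edges = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
    ("#", "$"): 1,  # assumed closing edge that turns the path into a cycle
}
cycles = count_eulerian_cycles(nodes, edges, root="$")
parallel = 1
for k in edges.values():
    parallel *= factorial(k)
distinct_texts = cycles // parallel
print(cycles, distinct_texts)
```

For this graph the division yields 3, the number of distinct strings that share the example's 2-gram counts.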
Problem Encoding
Example: the node sequence [0, 1, 2] spells "$ a rose".
Definitions
$ec(G)$: the set of all Eulerian paths of G
Given the path $c = e_1, \dots, e_n$: $\ell(c) = [label(e_1), \dots, label(e_n)]$
$s(c) = label(e_1).label(e_2).\,\dots\,.label(e_n)$ (overlapping concatenation)
Given G, we want $G^*$ such that:
1. $G^*$ is equivalent to G: $\{s(c) : c \in ec(G)\} = \{s(c) : c \in ec(G^*)\}$
2. $G^*$ is irreducible: $\nexists\, e_1, e_2 \in E^*$ such that $[label(e_1), label(e_2)]$ appears in every $\ell(c)$, $c \in ec(G^*)$
Given $G^*$ we can just read maximal blocks from the labels.
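The overlapping concatenation s(c) can be sketched as follows. It assumes labels are whitespace-separated token strings and that consecutive labels share exactly n - 1 tokens (with n ≥ 2); the name `overlap_concat` is hypothetical.

```python
def overlap_concat(labels, n=2):
    """s(c): glue labels together, merging the n-1 tokens shared by neighbours."""
    tokens = labels[0].split()
    for label in labels[1:]:
        nxt = label.split()
        # Consecutive labels must agree on their n-1 boundary tokens.
        assert tokens[-(n - 1):] == nxt[:n - 1], "labels do not overlap"
        tokens.extend(nxt[n - 1:])
    return " ".join(tokens)

# Labels along one Eulerian path of the 2-gram graph.
path_labels = ["$ a", "a rose", "rose rose", "rose is", "is a",
               "a rose", "rose is", "is a", "a rose", "rose #"]
text = overlap_concat(path_labels)

# Longer labels, as they would appear on merged edges, concatenate the same way.
merged_labels = ["$ a rose", "rose rose", "rose is a rose",
                 "rose is a rose", "rose #"]
merged_text = overlap_concat(merged_labels)
print(text)
print(merged_text)
```

Both label sequences spell the same string, which is the sense in which a graph with longer edge labels can be equivalent to the original one.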
Example
s = $ a rose rose is a rose is a rose #
[Figure: reduced graph on nodes 0, 2, 4 with edges labelled (block, count): ($ a rose, 1), (rose rose, 1), (rose is a rose, 2), (rose #, 1)]
Rule 1 (Pigeonhole rule)
α.δ occurs at least 4 times
Rule 2: non-local information
x is an “articulation point” [Tarjan 1971]
α.β occurs at least once
Main Result
Theorem
Both rules are correct and complete: their application on G leads to a
graph G∗ that is equivalent to G and irreducible.
Experiments
Project Gutenberg: out-of-copyright (US) books; 1,000 random single books.
[Figure: mean of average and maximal block size; panels: average, maximal]
Increasing Diversity
Instead of running on a single book, run on concatenation of k books.
Average number of large blocks (≥ 100)
Remove completeness assumption
Remove those n-grams whose frequency is < M.
[Figures (n = 5): mean / max vs M; error rate vs M]
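The thresholding step itself is a one-line filter. The sketch below uses the toy 2-gram counts from the example rather than the 5-grams of the experiments, and the function name `drop_rare` is an assumption.

```python
def drop_rare(counts, min_freq):
    """Keep only the n-grams whose frequency is at least min_freq (the slide's M)."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}

bigrams = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
shared = drop_rare(bigrams, min_freq=2)
for gram, freq in shared.items():
    print(" ".join(gram), freq)
```

After filtering, the counts are no longer complete, so the resulting graph need not contain an Eulerian path any more.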
A better noisifying strategy
Instead of removing n-grams, add strategically chosen n-grams
[Figure: removing edges vs adding edges]
Keep utility
[Figures: Removing; Adding]
Conclusions
How well can textual documents be reconstructed from their list of n-grams?
Resilience of reconstruction to the standard noisifying approach
Better noisifying by adding (instead of removing) n-grams
Questions?
Appendix
Rule 1 (Pigeonhole rule)
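Incoming edges of x: $(\langle v_1, x, \ell_1 \rangle, p_1), \dots, (\langle v_n, x, \ell_n \rangle, p_n)$
Outgoing edges of x: $(\langle x, w_1, t_1 \rangle, k_1), \dots, (\langle x, w_m, t_m \rangle, k_m)$
If $\exists i, j$ such that $p_i > d(x) - k_j$, then
$E' = \big(E \setminus \{(\langle v_i, x, \ell_i \rangle, a), (\langle x, w_j, t_j \rangle, a)\}\big) \cup \{(\langle v_i, w_j, \ell_i.t_j \rangle, a)\}$, where $a = p_i - (d(x) - k_j)$.
If $a = d(x)$ then $V' = V \setminus \{x\}$, else $V' = V$.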
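The test in Rule 1 can be sketched for a single node. This only lists the pairs that the inequality forces on an unmodified node and does not rewrite the graph; the function name and the label-to-multiplicity dictionaries are assumptions made for illustration.

```python
def pigeonhole_merges(incoming, outgoing):
    """Pairs (in_label, out_label, a) that Rule 1 forces to be adjacent at a node.

    incoming / outgoing map edge labels to their multiplicities p_i / k_j.
    """
    degree = sum(incoming.values())
    assert degree == sum(outgoing.values()), "node must be balanced"
    forced = []
    for in_label, p in incoming.items():
        for out_label, k in outgoing.items():
            a = p - (degree - k)   # copies that cannot go anywhere else
            if a > 0:
                forced.append((in_label, out_label, a))
    return forced

# Node "a" of the rose example: d(a) = 3, and "a rose" is the only way out.
at_a = pigeonhole_merges({"$ a": 1, "is a": 2}, {"a rose": 3})

# Node "rose": d(rose) = 4, including the self-loop "rose rose".
at_rose = pigeonhole_merges(
    {"a rose": 3, "rose rose": 1},
    {"rose rose": 1, "rose is": 2, "rose #": 1},
)
print(at_a)
print(at_rose)
```

At node "a" both incoming edges are forced onto "a rose", so every reconstruction contains "$ a rose" once and "is a rose" twice.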
Rule 2: non-local information
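x is a division point dividing G into components $G_1$, $G_2$. If $\hat{d}^{in}_{G_1}(x) = 1$ and $\hat{d}^{out}_{G_2}(x) = 1$ (edges $(\langle v, x, \ell \rangle, p)$ and $(\langle x, w, t \rangle, k)$), then
$E' = \big(E \setminus \{(\langle v, x, \ell \rangle, 1), (\langle x, w, t \rangle, 1)\}\big) \cup \{(\langle v, w, \ell.t \rangle, 1)\}$
$V' = V$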
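Candidate division points can be found with a standard depth-first low-link computation on the undirected version of the graph. The sketch below covers only that first step; it does not check the degree conditions of Rule 2 or perform the merge, and the adjacency representation is an assumption.

```python
def articulation_points(adjacency):
    """Articulation points of an undirected graph given as {node: set(neighbours)}."""
    index, low, points = {}, {}, set()

    def visit(node, parent):
        index[node] = low[node] = len(index)
        children = 0
        for nxt in adjacency[node]:
            if nxt == parent:
                continue
            if nxt in index:
                # Back edge: node can reach an earlier vertex.
                low[node] = min(low[node], index[nxt])
            else:
                children += 1
                visit(nxt, node)
                low[node] = min(low[node], low[nxt])
                if parent is not None and low[nxt] >= index[node]:
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)

    for node in adjacency:
        if node not in index:
            visit(node, None)
    return points

# Undirected view of the rose graph (the self-loop on "rose" is dropped).
rose_graph = {
    "$": {"a"},
    "a": {"$", "rose", "is"},
    "rose": {"a", "is", "#"},
    "is": {"rose", "a"},
    "#": {"rose"},
}
cut_nodes = articulation_points(rose_graph)
print(sorted(cut_nodes))
```

The recursion is fine for small graphs; an iterative version would be needed for graphs built from whole books.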
Increasing Diversity
Instead of running on a single book, run on concatenation of k books.
[Figure: mean of average block size]

Reconstructing Textual Documents from n-grams
