Motivation: Privacy-preserving data mining
Share textual data for mutual benefit, general good or contractual reasons
But not all of it:
text analytics on private documents
marketplace scenarios [Cancedda ACL 2012]
copyright concerns
Problem
1. Given the n-gram information of a document d, how well can we reconstruct d?
2. If I want (or have) to share n-gram statistics, what is a good strategy to avoid reconstruction while preserving the utility of the data?
Example
s = $ a rose rose is a rose is a rose #
2-grams (with counts):
  $ a        1
  a rose     3
  rose rose  1
  rose is    2
  is a       2
  rose #     1
Note that the same 2-grams are obtained starting from:
s = $ a rose is a rose rose is a rose #
s = $ a rose is a rose is a rose rose #
⇒ Find large chunks of text of whose presence we are certain
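The ambiguity in the example can be checked directly. The sketch below is illustrative only: the helper name `ngram_counts` and the whitespace tokenisation are assumptions, not part of the talk. It counts the 2-grams of the three sentences and confirms that they coincide.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count the n-grams of a whitespace-tokenised string (hypothetical helper)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

s1 = "$ a rose rose is a rose is a rose #"
s2 = "$ a rose is a rose rose is a rose #"
s3 = "$ a rose is a rose is a rose rose #"

counts = ngram_counts(s1)
# All three sentences share exactly the same 2-gram statistics.
same = ngram_counts(s2) == counts and ngram_counts(s3) == counts
print(counts[("a", "rose")], same)
```

Because the counts are identical, no method working from the 2-grams alone can tell the three sentences apart.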
Problem Encoding
An n-gram corpus is encoded as a graph, a subgraph of the de Bruijn graph, where edges correspond to n-grams.
[Figure: graph of the example. Nodes: 0 = $, 1 = a, 2 = rose, 3 = is, 4 = #. Edges (label, count): 0→1 ($ a, 1); 1→2 (a rose, 3); 2→2 (rose rose, 1); 2→3 (rose is, 2); 2→4 (rose #, 1); 3→1 (is a, 2)]
A sequence of nodes spells out text: [2, 2, 3, 1] → rose rose is a
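A minimal sketch of this encoding follows, assuming whitespace tokenisation and nodes represented as (n-1)-gram tuples rather than the integer ids used on the slide. The names `build_graph` and `spell` are hypothetical.

```python
from collections import Counter

def build_graph(text, n=2):
    """Map each n-gram to an edge between its (n-1)-gram prefix and suffix,
    weighted by how often the n-gram occurs."""
    tokens = text.split()
    edges = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        edges[(gram[:-1], gram[1:])] += 1
    return edges

def spell(path):
    """Read the text spelled by a sequence of nodes: the first node in full,
    then the last token of every following node."""
    tokens = list(path[0])
    for node in path[1:]:
        tokens.append(node[-1])
    return " ".join(tokens)

graph = build_graph("$ a rose rose is a rose is a rose #")
rose, is_, a = ("rose",), ("is",), ("a",)
print(len(graph), graph[(a, rose)])
print(spell([rose, rose, is_, a]))
```

With tuple nodes the same code should work for n > 2, although only the n = 2 example from the slides has been traced here.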
Problem encoding
Given such a graph, each Eulerian path gives a plausible reconstruction.
Problem: find those parts that are common to all of them.
BEST Theorem, 1951
Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is
$T_w(G) \prod_{v \in V} (d(v) - 1)!$
where $T_w(G)$ is the number of trees directed towards the root at a fixed node w.
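The formula can be evaluated on the running example. The sketch below makes several assumptions that are not in the slides: a closing edge # → $ is added so that the Eulerian path becomes a cycle, $T_w(G)$ is computed with the matrix-tree theorem (a minor of the out-degree Laplacian), and parallel edges are counted as distinct, so the result is divided by the factorials of the edge counts to obtain distinct texts. That division is safe here because the closing edge occurs exactly once.

```python
from fractions import Fraction
from math import factorial, prod

def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    size = len(rows)
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if rows[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            det = -det
        det *= rows[col][col]
        for r in range(col + 1, size):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, size):
                rows[r][c] -= factor * rows[col][c]
    return int(det)

def count_eulerian_cycles(edges, nodes):
    """BEST theorem: T_w(G) * prod_v (d(v) - 1)!, with w = nodes[0]."""
    out_deg = {v: 0 for v in nodes}
    for (u, _), count in edges.items():
        out_deg[u] += count
    index = {v: i for i, v in enumerate(nodes)}
    # Out-degree Laplacian L = D_out - A; self-loops cancel on the diagonal.
    lap = [[0] * len(nodes) for _ in nodes]
    for v in nodes:
        lap[index[v]][index[v]] = out_deg[v]
    for (u, v), count in edges.items():
        lap[index[u]][index[v]] -= count
    # Deleting the row and column of the root gives the number of in-trees.
    minor = [row[1:] for row in lap[1:]]
    return determinant(minor) * prod(factorial(d - 1) for d in out_deg.values())

nodes = ["$", "a", "rose", "is", "#"]
edges = {("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
         ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
         ("#", "$"): 1}  # assumed closing edge

cycles = count_eulerian_cycles(edges, nodes)
distinct_texts = cycles // prod(factorial(c) for c in edges.values())
print(cycles, distinct_texts)  # 72 3
```

The three distinct texts match the three sentences of the example, which share the same 2-grams.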
Problem Encoding
[0, 1, 2] → $ a rose
[Figure: the example 2-gram graph; edges (label, count): $ a, 1; a rose, 3; rose rose, 1; rose is, 2; rose #, 1; is a, 2]
Definitions
ec(G): the set of all Eulerian paths of G
Given the path $c = e_1, \dots, e_n$: $\ell(c) = [\mathrm{label}(e_1), \dots, \mathrm{label}(e_n)]$
$s(c) = \mathrm{label}(e_1).\mathrm{label}(e_2). \cdots .\mathrm{label}(e_n)$ (overlapping concatenation)
Given G, we want $G^*$ such that:
1. $G^*$ is equivalent to G: $\{s(c) : c \in ec(G)\} = \{s(c) : c \in ec(G^*)\}$
2. $G^*$ is irreducible: $\nexists\, e_1, e_2 \in E^*$ such that $[\mathrm{label}(e_1), \mathrm{label}(e_2)]$ appears in all $\ell(c)$, $c \in ec(G^*)$
Given $G^*$ we can just read maximal blocks from the labels.
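As an illustration of the overlapping concatenation s(c), the sketch below joins a list of edge labels, assuming each label shares its first n-1 tokens with the end of the previous one. The function name `overlapping_concat` and the tuple representation of labels are assumptions made for this example.

```python
def overlapping_concat(labels, n=2):
    """s(c): keep the first label whole, then append each later label
    without the n-1 tokens it shares with its predecessor."""
    tokens = list(labels[0])
    for label in labels[1:]:
        tokens.extend(label[n - 1:])
    return " ".join(tokens)

# Labels may have different lengths once edges have been merged.
path_labels = [
    ("$", "a", "rose"),
    ("rose", "rose"),
    ("rose", "is", "a", "rose"),
    ("rose", "is", "a", "rose"),
    ("rose", "#"),
]
text = overlapping_concat(path_labels)
print(text)  # $ a rose rose is a rose is a rose #
```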
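Example
s = $ a rose rose is a rose is a rose #
[Figure: the reduced graph for s. Remaining nodes: 0 ($), 2 (rose), 4 (#). Edges (label, count): 0→2 ($ a rose, 1); 2→2 (rose rose, 1); 2→2 (rose is a rose, 2); 2→4 (rose #, 1)]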
Rule 1 (Pigeonhole rule)
α.δ occurs at least 4 times
Rule 2: non-local information
x is an “articulation point” [Tarjan 1971]
α.β occurs at least once
Main Result
Theorem
Both rules are correct and complete: their application on G leads to a
graph G∗ that is equivalent to G and irreducible.
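The equivalence claimed by the theorem can be sanity-checked by brute force on the running example. The sketch below is not the authors' algorithm: it enumerates every Eulerian path of a small graph for n = 2 and collects the spelled strings, once for the original 2-gram graph and once for a hand-written reduced graph. The function name and the edge representation `(source, target, label)` are assumptions.

```python
def reconstructions(edges, start):
    """All distinct strings spelled by Eulerian paths from `start` (n = 2).
    edges: {(source, target, label_tuple): multiplicity}."""
    remaining = dict(edges)
    total = sum(remaining.values())
    found = set()

    def walk(node, tokens, used):
        if used == total:
            found.add(" ".join(tokens))
            return
        for (u, v, label), count in list(remaining.items()):
            if u == node and count > 0:
                remaining[(u, v, label)] -= 1
                # Skip the first token of the label: it overlaps the text so far.
                walk(v, tokens + list(label[1:]), used + 1)
                remaining[(u, v, label)] += 1

    walk(start, [start], 0)
    return found

original = {
    ("$", "a", ("$", "a")): 1, ("a", "rose", ("a", "rose")): 3,
    ("rose", "rose", ("rose", "rose")): 1, ("rose", "is", ("rose", "is")): 2,
    ("is", "a", ("is", "a")): 2, ("rose", "#", ("rose", "#")): 1,
}
reduced = {
    ("$", "rose", ("$", "a", "rose")): 1,
    ("rose", "rose", ("rose", "rose")): 1,
    ("rose", "rose", ("rose", "is", "a", "rose")): 2,
    ("rose", "#", ("rose", "#")): 1,
}

texts = reconstructions(original, "$")
print(len(texts), texts == reconstructions(reduced, "$"))  # 3 True
```

The enumeration is exponential in general, so it is only useful as a check on toy inputs.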
Experiments
Project Gutenberg: out-of-copyright (US) books. 1,000 random single books.
[Figure: mean of average and maximal block size; panels: average, maximal]
Increasing Diversity
Instead of running on a single book, run on the concatenation of k books.
[Figure: average number of large blocks (≥ 100)]
Remove completeness assumption
Remove those n-grams whose frequency is < M.
[Figures (n = 5): mean / max vs M; error rate vs M]
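A minimal sketch of this frequency cut-off, applied to the toy 2-gram counts of the earlier example rather than to the 5-gram book data behind the plots. The function name `drop_rare` is hypothetical.

```python
def drop_rare(counts, threshold):
    """Keep only the n-grams whose frequency is at least `threshold` (M)."""
    return {gram: c for gram, c in counts.items() if c >= threshold}

counts = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}
kept = drop_rare(counts, threshold=2)
print(sorted(kept))
```

With M = 2 only `a rose`, `is a` and `rose is` survive, so the remaining graph no longer contains an Eulerian path from $ to #.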
A better noisifying strategy
Instead of removing n-grams, add strategically chosen n-grams
[Figure: removing edges vs adding edges]
Keep utility
[Figure panels: Removing, Adding]
Conclusions
How well can textual documents be reconstructed from their list of n-grams?
Resilience to the standard noisifying approach
Better noisifying by adding (instead of removing) n-grams
Questions?
Appendix
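Rule 1 (Pigeonhole rule)
Incoming edges of x: $(\langle v_1, x, \ell_1 \rangle, p_1), \dots, (\langle v_n, x, \ell_n \rangle, p_n)$
Outgoing edges of x: $(\langle x, w_1, t_1 \rangle, k_1), \dots, (\langle x, w_m, t_m \rangle, k_m)$
If $\exists\, i, j$ such that $p_i > d(x) - k_j$, then
$E' = (E \setminus \{(\langle v_i, x, \ell_i \rangle, a), (\langle x, w_j, t_j \rangle, a)\}) \cup \{(\langle v_i, w_j, \ell_i.t_j \rangle, a)\}$ where $a = p_i - (d(x) - k_j)$.
If $a = d(x)$ then $V' = V \setminus \{x\}$, else $V' = V$.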
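The bound behind the rule can be computed per node. The sketch below only evaluates $a = p_i - (d(x) - k_j)$ for every incoming/outgoing pair and reports the pairs with a > 0. It does not rewrite the graph, and the function name and dictionary layout are assumptions.

```python
def pigeonhole_merges(incoming, outgoing):
    """incoming / outgoing: {label: multiplicity} of the edges entering / leaving x.
    Returns {(in_label, out_label): a} for each pair that must be consecutive
    at least a > 0 times, with a = p_i - (d(x) - k_j)."""
    degree = sum(outgoing.values())  # d(x)
    forced = {}
    for in_label, p in incoming.items():
        for out_label, k in outgoing.items():
            a = p - (degree - k)
            if a > 0:
                forced[(in_label, out_label)] = a
    return forced

# Node "a" of the running example: two incoming edges, one outgoing edge.
forced_at_a = pigeonhole_merges({"$ a": 1, "is a": 2}, {"a rose": 3})
print(forced_at_a)
```

For node `a` this gives `$ a rose` at least once and `is a rose` at least twice, which agrees with the labels of the reduced graph shown earlier.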
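Rule 2: non-local information
x is a division point dividing G into components $G_1$, $G_2$. If $\hat{d}_{in}^{G_1}(x) = 1$ and $\hat{d}_{out}^{G_2}(x) = 1$ (edges $(\langle v, x, \ell \rangle, p)$ and $(\langle x, w, t \rangle, k)$), then
$E' = (E \setminus \{(\langle v, x, \ell \rangle, 1), (\langle x, w, t \rangle, 1)\}) \cup \{(\langle v, w, \ell.t \rangle, 1)\}$
$V' = V$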
Increasing Diversity
Instead of running on a single book, run on the concatenation of k books.
[Figure: mean of average block size]

Reconstructing Textual Documents from n-grams
