Ultra-efficient algorithms for testing well-parenthesised expressions by Tatiana Starikovskaya

Ultra-efficient algorithms for testing
well-parenthesised expressions
Tatiana Starikovskaya (ENS Paris)
Joint work with Eldar Fisher (Technion) and Frédéric Magniez (Paris-Diderot)
WiMLDS Paris, November 24, 2017

Pattern matching: you use it every time you search for something
More general: algorithms on strings (= sequences of characters)
My research area

My research area
Applications
• Bioinformatics
• Information Retrieval
• …
Classical approaches
• We can read the whole input
• We can afford to store linear-space
data structures
In the Big Data world, we must do better!

My research area
Streaming algorithms
We receive the input as a stream, and must
process it on the fly, without storing it
Property testing algorithms
We must decide if the input has a property P,
but we can read only a small part of the input
We need efficient algorithms for string processing!
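To make the streaming setting concrete, here is a minimal sketch for the single-type language D1 only: a one-pass check that keeps just a depth counter and never stores the input. The function name `stream_in_d1` is an illustrative choice, not something from the talk, and a lone counter is not enough once there are two or more parenthesis types.

```python
from typing import Iterable


def stream_in_d1(stream: Iterable[str]) -> bool:
    """One-pass check that a stream of '(' and ')' is well-parenthesised.

    Only a single counter is kept, so the memory used is O(log n) bits.
    """
    depth = 0
    for ch in stream:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with no partner to its left
                return False
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return depth == 0  # every opening parenthesis must have been closed


if __name__ == "__main__":
    print(stream_in_d1(iter("()(()())()")))
    print(stream_in_d1(iter("()(()()(()")))
```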

Property testers
Wait a second! How can we make the
decision without reading the whole input?
Well, in general, we cannot…
For example, we cannot say if the input is
well-parenthesised by reading just a small
fraction of it
Task: We must decide if the input has a property
P, but we can read only a small part of the input
Objective: Save time
()(()())()
()(()()(()
The queried parentheses are identical in both strings

Property testers
We must
1. accept, if the input has the property P
2. reject, if the input is far from having the
property
3. accept or reject otherwise
Far = we must fix at least εn characters of the
input so that the property is satisfied
The output must be correct with probability at least 2/3
Example: ε = 0.2, n = 10, εn = 2
()(()())() has the property: accept
()(()()((( is far from the property: reject
()(()()(() is neither: accept or reject
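The three example strings can be checked with a small sketch, assuming that "fixing" a character means substituting one parenthesis for another and that there is a single parenthesis type. Under that assumption the distance to D1 of an even-length string is ceil(c/2) + ceil(o/2), where c and o count the unmatched closing and opening parentheses. The names `substitutions_to_balance` and `classify` are illustrative, and this code reads the whole input, so it only illustrates the definition and is not a property tester.

```python
def substitutions_to_balance(s: str) -> int:
    """Minimum number of substitutions that turn s into a string of D1."""
    if len(s) % 2:
        raise ValueError("an odd-length string cannot be balanced by substitutions")
    unmatched_open = unmatched_close = 0
    for ch in s:
        if ch == "(":
            unmatched_open += 1
        elif ch == ")":
            if unmatched_open:
                unmatched_open -= 1  # matches an earlier opening parenthesis
            else:
                unmatched_close += 1
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    # Flip half of each unmatched run (rounded up).
    return (unmatched_close + 1) // 2 + (unmatched_open + 1) // 2


def classify(s: str, eps: float) -> str:
    """Which of the three cases of the tester definition applies to s."""
    d = substitutions_to_balance(s)
    if d == 0:
        return "has the property: must accept"
    if d >= eps * len(s):
        return "far: must reject"
    return "close: may accept or reject"


if __name__ == "__main__":
    for s in ["()(()())()", "()(()()(((", "()(()()(()"]:
        print(s, substitutions_to_balance(s), classify(s, 0.2))
```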

Well-parenthesised expressions
Dm = well-balanced strings on parentheses of m types
Task: develop a property tester that decides whether
the input is in Dm
Example (m = 2): ()([]())[]([]) is in Dm, ()(([][)()((([] is not
1. It accepts all inputs that are in Dm with
probability at least 2/3
2. It rejects all inputs that are ε-far from Dm
with probability at least 2/3
Time = number of read characters!
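For reference, exact membership in Dm is easy when the whole input can be read: the classical stack check below is a baseline sketch, not the tester from the talk, and it costs n reads under the "time = number of read characters" measure. The bracket alphabet and the name `in_dm` are assumptions made for illustration.

```python
# Closing parenthesis -> the opening parenthesis it must match (m = 3 here).
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENING = set(PAIRS.values())


def in_dm(s: str) -> bool:
    """Exact check that s is well-parenthesised; reads every character."""
    stack = []
    for ch in s:
        if ch in OPENING:
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False  # nothing to close, or the types disagree
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return not stack  # no opening parenthesis may stay unmatched


if __name__ == "__main__":
    print(in_dm("()([]())[]([])"))
    print(in_dm("()(([][)()((([]"))
```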

Simplicity: simplest context-free language
Universality: any context-free language can be expressed
through it (Chomsky-Schützenberger theorem)
Practicality: processing of semi-structured documents
• Visibly pushdown languages
• Nested strings
Why is it interesting?

What do we know
m = 1, e.g. ()(()())() and ()(()()(((
    T = const. [Alon et al.’01]
m ≥ 2, e.g. ()([]())([]) and ()(([)()(([]
    $c\, n^{1/11} < T < C\, n^{2/3}$ [Parnas et al.’03]
    $c\, n^{1/5} < T < C\, n^{2/5+\delta}$ NEW!

New tester for Dm-membership
Hmmm… does not look like a simple property to test!
Let’s start with a property tester for strawberries
()({()})([]){((([]())([])([{}]())}([])))
red
sweet
yellow seeds
simple
properties,
easy to test!

If we replace all opening parentheses with (, and all closing
parentheses with ), the resulting string must be in D1
()({()})([]){((([]())([])([{}]()))([]))} becomes
()((()))(())((((()())(())((())()))(())))
And we know how to test this in O(1) time [Alon et al.’01]!
Not sufficient: ()({{)}) becomes ()((()))
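The type-erasure step and its limitation can be reproduced with a short sketch. It reads the whole string, so it only illustrates why the D1 condition is necessary but not sufficient and says nothing about the O(1)-query test of Alon et al. The helper names `erase_types` and `in_d1` are illustrative.

```python
OPENING = "([{"
CLOSING = ")]}"


def erase_types(s: str) -> str:
    """Replace every opening parenthesis by '(' and every closing one by ')'."""
    out = []
    for ch in s:
        if ch in OPENING:
            out.append("(")
        elif ch in CLOSING:
            out.append(")")
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return "".join(out)


def in_d1(s: str) -> bool:
    """Exact membership in D1 with a single depth counter."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


if __name__ == "__main__":
    bad = "()({{)})"  # not well-parenthesised: '{' is closed by ')'
    print(erase_types(bad), in_d1(erase_types(bad)))
```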

Each block is Dm-consistent = is a substring of a string in Dm
We test that the blocks are Dm-consistent by running our
Dm-test in a recursive fashion
()({()})([ ]){((([]() )([])([{}] ()))([]))}
Block length: $b = n^{4/5}$
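What Dm-consistency means for a single block can be sketched with an exact check that reads the whole block; the talk instead tests this recursively with few queries, which this sketch does not attempt. A block is a substring of some string in Dm exactly when no closing parenthesis meets an opening parenthesis of the wrong type inside the block; unmatched parentheses at either end are fine. The name `is_dm_consistent` is illustrative.

```python
# Closing parenthesis -> the opening parenthesis it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENING = set(PAIRS.values())


def is_dm_consistent(block: str) -> bool:
    """True iff block is a substring of some well-parenthesised string."""
    stack = []
    for ch in block:
        if ch in OPENING:
            stack.append(ch)
        elif ch in PAIRS:
            # With an empty stack the closing parenthesis is simply unmatched
            # inside the block; its partner would lie to the left of the block.
            if stack and stack.pop() != PAIRS[ch]:
                return False  # type mismatch inside the block
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return True  # leftover openings can be closed to the right of the block


if __name__ == "__main__":
    for block in ["()({()})([", "]){((([]()", "{)"]:
        print(block, is_dm_consistent(block))
```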

We have checked that the string is good locally, but can we
guarantee that it is good globally?
()({()})([ ]){((([]() )([])([{}] ()))([]))}
Block length: $b = n^{4/5}$

Approximate matching graph: nodes = blocks, edge (B1,B2) =
many excess parentheses in block B1 must be matched with excess
parentheses in block B2
()({()})([ ]){((([]() )([])([{}] ()))([]))}
Block length: $b = n^{4/5}$

1. Build an approximate matching graph
2. Run a recursive inter-block matching procedure
()({()})([ ]){((([]() )([])([{}] ()))([]))}
Block length: $b = n^{4/5}$

()({()})([ ]){((([]() )([])([{}] ())}([])))
S = ]){((([]()
S without types = ))((((()()
A string in D1 containing it: (())((((()()))))
e1(S) = 2 - excess closing parentheses
e0(S) = 4 - excess opening parentheses
$T_1, T_2, \dots, T_{n/b}$ - blocks of the input
Parentheses in $T_i$ that must be matched with parentheses in $T_j$:
$\min\bigl(e_0(T_i),\, e_1(T_{i+1} T_{i+2} \dots T_j)\bigr) - e_1(T_{i+1} T_{i+2} \dots T_{j-1})$
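The block-to-block formula can be evaluated exactly on small inputs with the sketch below. It works on type-erased blocks, reads everything, and computes exact values rather than the approximations used for the matching graph. Clamping the result at zero is an assumption added here for the case where the intermediate blocks already use up all excess openings of the first block; the slide shows the formula without it. All function names are illustrative.

```python
from typing import List, Tuple


def excesses(s: str) -> Tuple[int, int]:
    """Return (e0, e1): excess opening and excess closing parentheses of s."""
    e0 = e1 = 0
    for ch in s:
        if ch == "(":
            e0 += 1
        elif ch == ")":
            if e0:
                e0 -= 1  # matched inside s
            else:
                e1 += 1  # needs a partner to the left of s
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return e0, e1


def matched_between(blocks: List[str], i: int, j: int) -> int:
    """Excess openings of blocks[i] matched with closings of blocks[j], i < j."""
    e0_i = excesses(blocks[i])[0]
    e1_through_j = excesses("".join(blocks[i + 1 : j + 1]))[1]
    e1_before_j = excesses("".join(blocks[i + 1 : j]))[1]
    # max(0, ...) is the clamping assumption described in the lead-in.
    return max(0, min(e0_i, e1_through_j) - e1_before_j)


if __name__ == "__main__":
    print(excesses("))((((()()"))
    print(matched_between(["((", "()", "))"], 0, 2))
```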

Observation: $e_1(S) = \max_{S' \text{ prefix of } S} \bigl(n_1(S') - n_0(S')\bigr)$
$n_1(S')$ = number of closing parentheses in S'
$n_0(S')$ = number of opening parentheses in S'
Lemma: By querying $x^2/\Delta^2$ positions of a string S of length x,
we can compute a Δ-additive approximation of $n_1(S')$ for any
substring S' of S correctly w.h.p.
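The observation translates directly into a prefix-maximum computation. The sketch below is exact and reads the whole string; counting the empty prefix (value 0) and deriving e0 from the totals are conventions chosen here, not statements from the slide. The function names are illustrative.

```python
from itertools import accumulate


def excess_closing(s: str) -> int:
    """e1(S): the maximum over prefixes S' of n1(S') - n0(S')."""
    if any(ch not in "()" for ch in s):
        raise ValueError("expected a type-erased string over '(' and ')'")
    steps = [1 if ch == ")" else -1 for ch in s]
    # The empty prefix contributes 0, so the result is never negative.
    return max(0, max(accumulate(steps), default=0))


def excess_opening(s: str) -> int:
    """e0(S), derived from the totals: n0(S) - n1(S) + e1(S)."""
    return s.count("(") - s.count(")") + excess_closing(s)


if __name__ == "__main__":
    s = "))((((()()"  # the type-erased example block from the slide
    print(excess_closing(s), excess_opening(s))
```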

()({()})([ ]){((([]() )([])([{}] ())}([])))
Proof
Query $x^2/\Delta^2$ positions of S uniformly at random
If |S'| ≤ Δ, output Δ
Otherwise, |S'| = yΔ, where y > 1
S' contains ~ $yx/\Delta$ of the queried positions

()({()})([ ]){((([]() )([])([{}] ())}([])))
Proof (cont.)
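$X_i = 1$ iff the i-th queried position is a closing parenthesis
$E\bigl[(\Delta^2/x) \cdot \sum_i X_i\bigr] = (\Delta^2/x) \cdot n_1(S') \cdot (yx/\Delta) / (y\Delta) = n_1(S')$
By additive Chernoff bound,
$P\bigl[\,\bigl|(\Delta^2/x) \cdot \sum_i X_i - n_1(S')\bigr| > \Delta\,\bigr] < 2e^{-2}$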
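The estimator from the proof can be sketched as follows, under a few assumptions: positions are sampled with replacement, the sample size is rounded up to an integer, and the rescaling factor is written as x/q, which equals Δ²/x when q = x²/Δ² exactly. The name `make_estimator` and its interface are illustrative. Only the sampled positions of S are ever read.

```python
import math
import random
from typing import Callable


def make_estimator(s: str, delta: int, rng: random.Random) -> Callable[[int, int], float]:
    """Sample about x^2/delta^2 positions of s once; return an estimator of n1.

    The returned function estimates the number of closing parentheses in
    s[lo:hi] using only the sampled positions.
    """
    x = len(s)
    q = math.ceil(x * x / (delta * delta))  # number of queried positions
    sample = [rng.randrange(x) for _ in range(q)]
    scale = x / q  # plays the role of delta^2 / x in the proof

    def estimate(lo: int, hi: int) -> float:
        if hi - lo <= delta:
            # Short substring: n1 lies in [0, delta], so delta is within delta.
            return float(delta)
        hits = sum(1 for p in sample if lo <= p < hi and s[p] == ")")
        return scale * hits

    return estimate


if __name__ == "__main__":
    s = "()" * 5000
    est = make_estimator(s, delta=1000, rng=random.Random(0))
    print(est(0, len(s)), "vs. true value", s.count(")"))
```

With x = 10000 and Δ = 1000 the sketch reads 100 positions instead of 10000; the saving only appears when Δ is larger than √x.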

If we replace all opening parentheses with (, and all
closing parentheses with ), the resulting string ∈ D1
Test that the blocks are Dm-consistent by running
the test in a recursive fashion
()({()})([ ]){((([]() )([])([{}] ())}([])))
Block length: $b = n^{4/5}$
Costs shown on the slide: $O(1)$, $O(\sqrt{b})$, $O(n^2/b^2)$
Complexity: $O(n^{2/5})$
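As a quick check of the arithmetic, assuming the two non-constant terms on the slide are evaluated at the chosen block length, both come out to the stated bound:

```latex
\[
  b = n^{4/5}
  \quad\Longrightarrow\quad
  \sqrt{b} = \bigl(n^{4/5}\bigr)^{1/2} = n^{2/5},
  \qquad
  \frac{n^2}{b^2} = n^{\,2 - 8/5} = n^{2/5}.
\]
```

A block length of $n^{4/5}$ is therefore the value at which these two terms are equal.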

Take-home message
• Streaming or property testing settings
• We have new, ultra-efficient algorithms for string
processing
• It is enough to use polylogarithmic space or to read a
constant number of data items in the input to solve
a problem with good guarantees
Questions? Comments?
