Estimation and Sampling in Slow Mixing Markov Processes
Ramezan Paravi | Ph.D. Candidate
EE Department, UH Manoa
Advisor: Dr. Santhanam
August 2015
Slide 2: Overview
Markov sources in the slow mixing regime:
Empirical counts do not reflect the stationary distribution
Analysis of samples before mixing happens
Part I: Statistical properties of finite samples
Part II: Modeling using slow mixing Markov processes
[Figure: portrait of Andrey Markov; a three-state Markov chain on {A, B, C} with fractional transition probabilities]
Slide 3: Motivation
Any structure?
Reorganizing
• Social Networks
• Biological Networks
• Recommender Systems
Slide 4: Motivation
[Figure: random walk on a 15-node graph, shown three times with different walk dynamics]
Random walk on a graph.
Uniform random walk: explores the state space fast.
Non-uniform random walk: polarized state space; walks starting inside one polarized region will be trapped.
Slide 5: Overview of Theoretical Results
[Figure: depth-3 binary context tree with leaf transition probabilities p(1|111), p(1|011), p(1|101), p(1|001)]
Given a sample Y_1, ..., Y_n from an unknown binary Markov source p:
Transition probabilities?
Stationary probabilities?
What we do: fixed sample, best answer
What we are not doing: e.a.s. (eventually almost surely) results
Slide 6: Complications
Two major sources of difficulty:
• Long memory
• Slow mixing
We may not be able to estimate accurately/completely given n samples.
Rather, we want the best possible answer from the sample at hand.
Slide 7: Overview of Theoretical Results
Two samples with the same number of 0's and 1's:
A 11111111111111000 vs. B 11011110111111011
Q1: Which one is generated by a memory-1 Markov source vs. an i.i.d. source?
Ans: Probably B ∼ i.i.d., while A ∼ Markov memory-1
Q2: If B ∼ i.i.d., what can be said?
Ans: P(1) ≫ P(0) (only 3 zeros in the sample)
Q3: If A ∼ Markov memory-1, what about the transition probabilities?
Ans: P(1|1) is high
Q4: If B ∼ i.i.d., then what about P(0)?
Ans: More 1's than 0's. With high confidence, P(0) is small.
Q5: If A ∼ Markov memory-1, what about P(0)?
Ans: Cannot judge from a finite sample.
Q6: If B ∼ i.i.d. and we see more bits, then what?
Ans: Likely lots of 1's, few 0's. P(0) ≈ (#0's)/length is small.
Q7: If A ∼ Markov memory-1 and we see more bits, can we say (#0's)/length is small?
Ans: NO. P(0) could still be arbitrarily close to 1!
Slide 8: Overview of Theoretical Results
Transition probabilities:
Given a sample x, a string w, and a ∈ A, can P(a|w) be estimated from x?
Easy: if |w| ≥ memory:
The subsequence following w is i.i.d., so (#wa)/(#w) ≈ P(a|w).
Harder: if the memory is unknown, but the source has mixed:
Both #w and #wa reflect P(w) and P(wa); still (#wa)/(#w) ≈ P(a|w).
Difficult: if the memory is unknown and the source has not mixed:
Non-trivial.
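Where the easy or mixed case applies, the estimator is plain conditional counting. A minimal sketch (Python; the function name and the overlapping-count convention are mine, for illustration only):

def transition_estimate(x: str, w: str):
    """Count-based estimate of P(a|w) from a binary sample x.
    Counts overlapping occurrences of w and the symbol following each one."""
    followers = [x[i + len(w)] for i in range(len(x) - len(w))
                 if x[i:i + len(w)] == w]
    n_w = len(followers)
    if n_w == 0:
        return None  # w is never followed by anything: nothing to estimate
    return {a: followers.count(a) / n_w for a in "01"}

# Example: sample A from slide 7; P(1|1) comes out high (13/14)
print(transition_estimate("11111111111111000", "1"))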
Slide 9: Results on Estimation
Given a length-n sample from a binary model with dying dependencies:
The amount of information two bits provide about each other, conditioned on the bits between them, diminishes the farther apart they are.
Identify from the data which parameters can be estimated:
A set G̃ of “good” strings w of length Θ(log n)
Only those that occur frequently in the sample
Provide (confidence and) accuracy bounds:
Transition probabilities of w ∈ G̃
Universal compression + combinatorial arguments
(#wa)/(#w) ≈ P(a|w)
Slide 10: Results on Estimation
Surprises:
!! The bound can hold even if neither (#wa) nor (#w) reflects the stationary probabilities
! The bound may not hold for shorter substrings of w
[Diagram contrasting stationary and transition estimates]
Slide 11: Results on Estimation
Provide (confidence and) accuracy bounds:
Stationary probabilities of w ∈ G̃
Doob martingale + coupling Markov chains + Azuma
(#w)/ñ ≈ p(w)/p(G̃) for w ∈ G̃,
where ñ = number of times strings in G̃ appear in the sample
Slide 12: Results on Estimation
All bounds are entirely data dependent:
Confidence/accuracy obtained w.h.p. from the sample
Knowledge of the source is not required
Slide 13: Modeling Using Slow Mixing
Provide a randomized algorithm for community detection
Build slow mixing random walks on graphs
Mixing properties of the walk reveal the community structure
Framework: Coupling From the Past
Simulation results on benchmark networks
Slide 14: Part I
Estimation in Slow Mixing Markov Processes
Slide 15: Estimation Challenges: Arbitrary Mass
All strings in a finite sample can have arbitrarily small mass.
[Figure: two-state chain on {0, 1} with an ε-small escape probability from state 1 and stationary probabilities p(0) = m/(m+1), p(1) = 1/(m+1)]
p(1) can be arbitrarily small for m large enough.
If ε = o(1/n), then starting from 1 we see a sequence of O(1/ε) 1's w.h.p.
p(any sequence of 1's) ≤ 1/(m+1), which can be arbitrarily small.
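A quick way to see the trap is to simulate the initial run of 1's. A hedged sketch (Python; the stay-probability 1 − ε at state 1 is my reading of the garbled figure):

import random

def run_of_ones(eps: float, seed: int = 0) -> int:
    """Length of the initial run of 1's when the chain starts at state 1
    and leaves state 1 with probability eps per step."""
    rng = random.Random(seed)
    run = 0
    while rng.random() >= eps:  # stay at state 1 w.p. 1 - eps
        run += 1
    return run

# With eps = o(1/n), a length-n sample is typically all 1's:
print(run_of_ones(1e-6))  # on the order of 1/eps steps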
Slide 16: Estimation Challenges: Long Memory
[Figure: depth-k binary context tree with every transition probability equal to 1/2; edges labeled ↑1, ↓0]
A Bernoulli(1/2) i.i.d. source, over-parameterized.
Slide 17: Estimation Challenges: Long Memory
[Figure: the same depth-k context tree, now with one perturbed transition probability 1 − ε at depth k]
If k = ω(log n), then starting from 1 we will not see a run of k − 1 consecutive 0's w.h.p. in a sample of size n.
Therefore, all bits are generated w.p. 1/2.
Cannot distinguish from an i.i.d. Bernoulli(1/2) source w.h.p.
Slide 18: Estimation Challenges: Slow Mixing
[Figure: two two-state chains on {0, 1} with ε-small switching probabilities; one has stationary p(0) = p(1) = 1/2, the other p′(1) = 2/3, p′(0) = 1/3]
Starting from 1, w.h.p. both chains generate a sequence of O(1/ε) 1's. If ε = o(1/n), w.h.p. they cannot be distinguished using a length-n sample.
Caution!
Slow mixing only hurts estimation, not compression!
Good compression is possible for memory-1 sources, slow mixing or not.
Slide 19: Prior Work
Any consistent estimator converges pointwise (NOT uniformly) over the class of stationary and ergodic Markov models.
Extensive work on consistency, e.a.s. results, and finite-sample but model-dependent results
[Bühlmann, Csiszár, Galves, Garivier, Leonardi, Marton, Maume-Deschamps, Morvai, Rissanen, Schmitt, Shields, Talata, Weiss, Wyner]
Our Philosophy
Can we look at a length-n sample and identify what, if anything, can be estimated accurately?
Slide 20: Dependencies
The problem is futile if the dependencies are arbitrary.
We assume dependencies die down.
Formalize with d : N → R+. [Figure: context tree with sibling nodes of the same color, labeled d(4)]
Siblings s, s′ (nodes of the same color) satisfy |p(1|s)/p(1|s′) − 1| ≤ d(4).
M_d = {sources satisfying the above for all siblings}.
Slide 21: Dependencies Die Down
Information-theoretic interpretation:
I(Y_0; Y_{i+1} | Y_1^i) ≤ log(1 + d(i)).
[Figure: bits b1 and b2 separated by an intervening block]
Not related to the mixing properties of the source.
No bound on the memory of the source.
Need d(i) summable over i; equivalently, δ_j = Σ_{i≥j} d(i) → 0.
Slide 22: Aggregation
Unknown set of states T
Consider a coarser model (aggregation)
Ask p(1|w), where |w| = k_n
Set k_n = Θ(log n)
It makes sense to ask for length-3 aggregations of a memory-2 source (they are the source itself).
Slide 23: Estimation Goal
Unknown source p in M_d
For any length-k_n string w (k_n = Θ(log n) is known):
• Estimate the transition probabilities p(·|w)
• Estimate the stationary probability p(w)
[Figure: the unknown source p in M_d and its aggregation p̃ in the space of memory-k_n sources]
Caution
The problem is not the same as estimating a memory-k_n source.
We never see samples from the aggregated model.
Slide 24: Naive Estimators
Example
Suppose Y_{-∞}^0 = ···00 and Y_1^n = 11010010011.
Depth-2 aggregated parameters, e.g., the transition probability from w = 10.
···00, 11010010011 with each occurrence of 10 marked; the subsequence following 10 is 1110.
This is not i.i.d. in general, because the sample is from the true model, not the aggregated model!
Naive (“1110 i.i.d.”): p̂(1|w) = 3/4 and p̂(0|w) = 1/4.
Slide 25: Deviation Bounds for Conditional Probabilities
Theorem
With confidence ≥ 1 − 1/2^{2^{k_n+1} log n} (conditioned on any past Y_{-∞}^0), for all w ∈ {0,1}^{k_n} simultaneously,
‖ (#w·)/(#w) − p(·|w) ‖_1 ≤ √( 2(ln 2)(2^{k_n+1} log n + n δ_{k_n}) / N_w ).
Again, δ_{k_n} = Σ_{i≥k_n} d(i).
The more occurrences, the stronger the bound.
If d(i) decreases exponentially as γ^i and N_w = n^β, the right-hand side diminishes as O(1/√(n^{β−1−γ log γ})).
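To make the bound concrete, here is a hedged numeric sketch (Python; it assumes exponentially decaying dependencies d(i) = γ^i, so δ_{k_n} = γ^{k_n}/(1 − γ), and base-2 logs: both my assumptions, not stated on the slide):

import math

def conditional_dev_bound(n, k_n, N_w, gamma):
    """Right-hand side of the L1 deviation bound for p(.|w),
    assuming d(i) = gamma**i so delta_{k_n} = gamma**k_n / (1 - gamma)."""
    delta_kn = gamma ** k_n / (1.0 - gamma)
    log_n = math.log2(n)
    return math.sqrt(2 * math.log(2) * (2 ** (k_n + 1) * log_n + n * delta_kn) / N_w)

# Example: n = 10**6 sample, contexts of length k_n = 10, w seen N_w times
print(conditional_dev_bound(n=10**6, k_n=10, N_w=500_000, gamma=0.5))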
Slide 26: Proof Idea
The result is built on the following facts:
• The source belongs to M_d, i.e., dependencies die down.
• A compression result on M_d reminiscent of the method of types.
• Arguments relating strong compression results to the variational distance between estimators.
Slide 27: Surprises
Proof order: Stationary → Transition
The bound may hold for w
! even without empirical frequencies ≈ stationary probabilities
!! but not for its suffixes
Possible: p(1 | a run of 100 zeros) can be estimable even when p(1|0) is not.
Slide 28: Good States
Good “states” G̃ are length-k_n strings appearing frequently enough:
G̃ = { w : count(w) ≥ max{ n δ_{k_n} log(1/δ_{k_n}), 2^{k_n+1} log² n } }
The concentration bound decays at least as fast as 1/√(log n).
If d(i) decreases exponentially as γ^i, the concentration bound is polynomial in n.
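A hedged sketch of extracting G̃ from a sample (Python; again assuming d(i) = γ^i for the δ term and natural logs, which the slide leaves generic):

import math
from collections import Counter

def good_states(x: str, k_n: int, gamma: float):
    """Length-k_n substrings of x whose counts clear the slide's threshold
    max(n * delta * log(1/delta), 2**(k_n+1) * log(n)**2)."""
    n = len(x)
    counts = Counter(x[i:i + k_n] for i in range(n - k_n + 1))
    delta = gamma ** k_n / (1.0 - gamma)  # delta_{k_n} under d(i) = gamma**i
    threshold = max(n * delta * math.log(1.0 / delta),
                    2 ** (k_n + 1) * math.log(n) ** 2)
    return {w for w, c in counts.items() if c >= threshold}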
Slide 29: Stationary Probabilities
A sensitive function of the conditional probabilities.
How should we interpret the counts of w ∈ G̃ in the sample?
Slide 30: Deviation Bounds for Stationary Probabilities
Theorem
For any t > 0, any past Y_{-∞}^0, and any w ∈ G̃,
P( | (#w)/ñ − p(w)/p(G̃) | ≥ t | Y_{-∞}^0 ) ≤ 2 exp( −(ñt − B)² / (2 ñ B²) ),
where ñ is the total count of states in G̃ and B depends on n and on how quickly the dependencies d(i) die off.
B = O(log n) if d(i) = γ^i.
For the bound to be non-vacuously true, d must be “twice summable”: δ_i = Σ_{j≥i} d(j) must itself be summable.
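The right-hand side is easy to evaluate once B is pinned down. A small sketch (Python; the choice B = c·log n for exponentially decaying d follows the slide, but the constant c is my placeholder):

import math

def stationary_tail_bound(n_tilde, t, n, c=1.0):
    """Upper bound 2*exp(-(nt - B)^2 / (2*n*B^2)) with n = n_tilde
    and B = c*log(n); meaningful only once n_tilde*t exceeds B."""
    B = c * math.log(n)
    if n_tilde * t <= B:
        return 1.0  # bound is vacuous in this regime
    return 2.0 * math.exp(-(n_tilde * t - B) ** 2 / (2.0 * n_tilde * B ** 2))

print(stationary_tail_bound(n_tilde=10**5, t=0.1, n=10**6))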
Slide 31: Estimation Along a Sequence of Stopping Times
Restriction of the process to good states.
[Figure: sample path Y_{-∞}^0, Y_1, ..., Y_n = 0 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 0 1 1, with G̃ = {01, 10}]
At each stopping time τ_m the process enters a good state w ∈ G̃ (here, alternating occurrences of 10 and 01 in the sample), and we record the corresponding value Z_m of the restricted process.
This yields the restricted sequence Z_0, Z_1, ..., Z_ñ. (A small sketch of this restriction follows below.)
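A minimal sketch of the restriction (Python; the windowing convention, recording the symbol after each good window, is my guess at the slide's animation):

def restrict_to_good_states(y, good, k):
    """Scan y and record, at each position where a length-k window
    lands in `good`, the symbol that follows (the restricted process Z)."""
    stops, z = [], []
    for i in range(len(y) - k):
        w = y[i:i + k]
        if w in good:
            stops.append(i + k)      # stopping time tau_m
            z.append(y[i + k])       # observed next symbol Z_m
    return stops, z

y = "01110001111001110011"  # the sample from the figure
print(restrict_to_good_states(y, {"01", "10"}, k=2))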
Slide 32: Proof Outline
1. Construct a natural Doob martingale: V_m = E(#01 | Z_0, Z_1, ..., Z_m), m = 0, ..., ñ
2. Bound |V_m − V_{m−1}| using a coupling argument
3. Azuma's inequality closes the bound
Slide 33: Coupling Technique
[Figure: two chains X_0, X_1, ..., X_n and Y_0, Y_1, ..., Y_n evolving in parallel under shared randomness]
Jointly evolve according to ω.
Encourage the evolution so that X_t = Y_t for all t after a (random) τ steps.
P(τ > i) = ω(X_i ≠ Y_i)
E(τ) = Σ_{i≥1} P(τ > i) = Σ_{i≥1} ω(X_i ≠ Y_i)
Sample size ≫ E(τ) ⇒ empirical ≈ stationary [Aldous]
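A toy illustration of the coalescence time τ (Python; a two-state chain under a shared-randomness monotone coupling, my own construction, not the thesis's coupling):

import random

def coalescence_time(eps=0.2, seed=1):
    """Run two copies of a two-state chain from different starts,
    driven by the same uniforms; return the step at which they meet."""
    rng = random.Random(seed)
    x, y = 0, 1
    t = 0
    while x != y:
        u = rng.random()
        # move to state 1 iff u < p(1|current state)
        x = 1 if u < (1 - eps if x == 1 else eps) else 0
        y = 1 if u < (1 - eps if y == 1 else eps) else 0
        t += 1
    return t

times = [coalescence_time(seed=s) for s in range(1000)]
print(sum(times) / len(times))  # Monte Carlo estimate of E(tau)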
Slide 34: Proof Outline (contd.)
• Run two coupled copies Z_j and Z′_j of the restricted process
• Long and unknown memory ⇒ they do not coalesce the usual way
• Instead, approximate coalescence ⇒ the longer they evolve together, the harder it is to separate them
• Reason: dependencies die down
|V_m − V_{m−1}| ≤ Σ_{j=m+1}^{ñ} ω(Z_j ≉ Z′_j)
Slide 35: Part II
Sampling From Slow Mixing Markov Processes
Slide 36: Coupling From the Past (CFTP)
Markov chain over S
Stationary distribution is π
Goal: sample exactly from π(·)
Idea (Propp and Wilson):
Expose all states to a shared source of randomness
Simulate chains backward in time
Wait until all chains merge
[Figure: three-state chain on {A, B, C} with fractional transition probabilities; π(A) = 9/19, π(B) = 4/19, π(C) = 6/19]
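A generic Propp-Wilson sketch (Python; the update function phi and the three-state transition table below are illustrative stand-ins of my own, not the slide's exact chain):

import random

def cftp(states, phi, seed=0):
    """Coupling From the Past: extend shared randomness backward in time
    until one composed update maps every start state to the same value."""
    rng = random.Random(seed)
    noise = []  # noise[k] drives the step starting at time -(k+1)
    T = 1
    while True:
        while len(noise) < T:
            noise.append(rng.random())
        # run all chains from time -T to 0 using the SAME randomness
        ends = set()
        for s in states:
            x = s
            for k in range(T - 1, -1, -1):
                x = phi(x, noise[k])
            ends.add(x)
        if len(ends) == 1:
            return ends.pop()  # exact sample from pi
        T *= 2  # look further into the past

def phi(x, u):  # illustrative update rule for a 3-state chain
    table = {"A": [("A", 0.5), ("B", 0.25), ("C", 0.25)],
             "B": [("A", 0.5), ("C", 0.5)],
             "C": [("A", 0.5), ("C", 0.5)]}
    acc = 0.0
    for s, p in table[x]:
        acc += p
        if u < acc:
            return s
    return s

print(cftp(["A", "B", "C"], phi))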
Slide 37: Coupling
[Figure: the three-state chain again; a single uniform draw from [0, 1] moves all three states at once, e.g. one draw maps (A, B, C) → (A, C, B) and another maps (A, B, C) → (A, A, B), using thresholds such as 1/3]
Slide 38: Example
[Figure: CFTP run over times −6, −5, −4, −3, −2, −1, 0; chains are started from A, B, and C at successively earlier times, driven by the same randomness, until all paths agree]
Coalescence: once all chains have merged before time 0, the common value at time 0 is the output ∼ π(·).
Slide 39: Graph Clustering
Similarity graph
Reorganizing
Complexity: NP-hard
Slide 40: Community Detection on Graphs
Number of clusters?
Usually not known in advance.
What is a good measure?
Nodes within a cluster are tightly connected.
Nodes in disparate clusters are loosely connected.
Correlation Clustering
Cost: graph clustering distance
Minimum # of edge additions/deletions s.t. graph G becomes a union of disjoint cliques
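The clustering-distance cost is straightforward to compute for a candidate clustering. A hedged sketch (Python; the helper name is mine):

def clustering_distance(edges, clusters):
    """Edits (additions + deletions) needed to turn the graph into the
    disjoint cliques described by `clusters` (a list of node sets)."""
    edge_set = {frozenset(e) for e in edges}
    label = {v: i for i, c in enumerate(clusters) for v in c}
    # deletions: edges that cross clusters
    deletions = sum(1 for e in edge_set
                    if len({label[v] for v in e}) > 1)
    # additions: intra-cluster pairs that are not edges
    additions = sum(1 for c in clusters
                    for u in c for v in c if u < v
                    and frozenset((u, v)) not in edge_set)
    return additions + deletions

edges = [(1, 2), (2, 3), (3, 4)]
print(clustering_distance(edges, [{1, 2, 3}, {4}]))  # delete (3,4), add (1,3) -> 2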
Slide 41: Approaches
Spectral Clustering
• Eigenvectors of the Laplacian of the similarity graph
+ Simple implementation
− Laplacian can be ill-conditioned
− # of clusters needs to be known in advance
[Ng, Jordan, White, Smyth, Weiss, Shi, Malik, Kannan, Vempala, Vetta, Meila, ...]
Semi-Definite Programming (SDP)
• LP relaxation
+ Asymptotically optimal for Stochastic Block Models
− Implicit assumption on the generative model
− # of clusters needs to be known in advance
[Abbe, Sandon, Hajek, Bandeira, Hall, Decelle, Mossel, Neeman, Sly, Rao, Chen, Wu, Xu, ...]
Slide 42: Core Idea
[Figure: similarity graph on nodes a through i]
Similarity graph.
Define a random walk: non-uniform, with the probability of following a link ∝ # of common neighbors.
Couple random walks starting from different nodes.
Adapt the CFTP algorithm:
We do not care about the exact sample.
Identify clusters “before coalescence happens”.
The restricted walk within a cluster mixes faster. (A sketch of the walk construction follows below.)
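A hedged sketch of the common-neighbor-weighted walk (Python/NumPy; the "+1" smoothing that keeps bridge edges alive is my addition, not from the slides):

import numpy as np

def common_neighbor_walk(A):
    """Row-stochastic transition matrix for a walk on adjacency matrix A
    where P(i -> j) is proportional to the number of common neighbors
    of i and j (plus 1 so bridge edges keep positive mass)."""
    A = np.asarray(A, dtype=float)
    common = A @ A                 # common[i, j] = # of shared neighbors
    W = A * (common + 1.0)         # weight only existing edges
    return W / W.sum(axis=1, keepdims=True)

# Two triangles joined by one bridge: within-triangle moves dominate
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(common_neighbor_walk(A).round(2))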
Slide 43: Algorithm Overview
[Figure: the state space S coalescing in stages]
Initially: all singletons
Partial coalescence: small clusters form
Critical times: clusters merge
Full coalescence
Slide 44: Partial Coalescence
[Figure: timeline from −10 to 0; the state space S is split into G and G^c]
Paths starting from G ⊂ S
Slide 45: Remarks
The algorithm can stop at critical times
Still yields a clustering
Choose the clustering with the optimal cost
Purely based on the random walk
Uses the mixing properties of the walk
Circumvents issues with ill-conditioned matrices in spectral approaches
No prior assumption on the generative model
No prior assumption on the number of clusters
Slide 46: Benchmark Networks
Stochastic Block Models (SBM)
Communities of equal size
Equal average degree for all nodes
LFR models
More realistic
# of communities and their sizes follow a power law
Real-world networks
Slide 47: Stochastic Block Model
Edges within a community: Ber(p)
Edges across communities: Ber(q)
p > q
[Figure: block structure of the adjacency matrix]
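Generating such an instance takes a few lines. A sketch (Python/NumPy; the helper is my own):

import numpy as np

def sbm(sizes, p, q, seed=0):
    """Symmetric SBM adjacency matrix: within-block edges ~ Ber(p),
    cross-block edges ~ Ber(q), no self-loops."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)

# The realization on the next slide: 500 nodes, 5 communities, p = 0.5, q = 0.1
A = sbm([100] * 5, p=0.5, q=0.1)
print(A.shape, A.sum() // 2)  # number of edges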
Slide 48: SBM Realization
500 nodes, p = 0.5, q = 0.1, 5 communities, randomly permuted.
Slide 49: Identifying Communities
[Figure]
Slide 50: CC-PIVOT Output
[Figure]
Slide 50: LFR Model
200 nodes, 6 communities
[Figure]
Slide 51: Identifying Communities
[Figure]
Slide 52: CC-PIVOT Output
[Figure]
Slide 53: American College Football
115 teams, divided into conferences.
Slide 54: Identified Communities
[Figure]
Slide 55: CC-PIVOT Output
[Figure]
Slide 56: Future Directions
Open: theoretical guarantees for recovery in SBMs
Open: stationary estimation is more difficult (needs twice summability of d) than transition estimation (just summability of d)
Theoretical foundations of the proposed algorithm
Extension to a broader set of community detection problems
Mahalo!
Slide 58: Acknowledgement
My Advisor: Prasad
Committee Members
Dr. Rui Zhang
Office Mates:
• Maryam and Meysam
My Friends:
• Masoud, Saeed, Navid, Elyas, Harir Chee, Reza, Ehsaneh, Ali, Ashkan, Hamed, Ehsan, Seyed, Alireza, ...
My Family:
• My parents, my sister Nasrin and my brother Naser
More Related Content

Viewers also liked

Legge di Bilancio 2017 Iperammortamento 250%
Legge di Bilancio 2017 Iperammortamento 250%Legge di Bilancio 2017 Iperammortamento 250%
Legge di Bilancio 2017 Iperammortamento 250%Mattia Pilosio
 
Toovey_et_al-2016-DMCN-casecontrolstudy
Toovey_et_al-2016-DMCN-casecontrolstudyToovey_et_al-2016-DMCN-casecontrolstudy
Toovey_et_al-2016-DMCN-casecontrolstudyDr Rachel Toovey
 
Compiler Design Introduction
Compiler Design IntroductionCompiler Design Introduction
Compiler Design IntroductionRicha Sharma
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual IntroductionLukas Masuch
 
La organización dentro de la administración
La organización dentro de la administraciónLa organización dentro de la administración
La organización dentro de la administraciónCris Guaicara
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsBuhwan Jeong
 
A1. MEP 1 - Mod 1 - Tema1
A1. MEP 1 - Mod 1 - Tema1A1. MEP 1 - Mod 1 - Tema1
A1. MEP 1 - Mod 1 - Tema1Poliana Bellan
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Grigory Sapunov
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
 

Viewers also liked (9)

Legge di Bilancio 2017 Iperammortamento 250%
Legge di Bilancio 2017 Iperammortamento 250%Legge di Bilancio 2017 Iperammortamento 250%
Legge di Bilancio 2017 Iperammortamento 250%
 
Toovey_et_al-2016-DMCN-casecontrolstudy
Toovey_et_al-2016-DMCN-casecontrolstudyToovey_et_al-2016-DMCN-casecontrolstudy
Toovey_et_al-2016-DMCN-casecontrolstudy
 
Compiler Design Introduction
Compiler Design IntroductionCompiler Design Introduction
Compiler Design Introduction
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
 
La organización dentro de la administración
La organización dentro de la administraciónLa organización dentro de la administración
La organización dentro de la administración
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
 
A1. MEP 1 - Mod 1 - Tema1
A1. MEP 1 - Mod 1 - Tema1A1. MEP 1 - Mod 1 - Tema1
A1. MEP 1 - Mod 1 - Tema1
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 

Similar to RamezanPhDSlides

Answering complex open domain questions with multi-hop dense retrieval
Answering complex open domain questions with multi-hop dense retrievalAnswering complex open domain questions with multi-hop dense retrieval
Answering complex open domain questions with multi-hop dense retrievalSan Kim
 
CS221: HMM and Particle Filters
CS221: HMM and Particle FiltersCS221: HMM and Particle Filters
CS221: HMM and Particle Filterszukun
 
A measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisA measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisDaniel Oberski
 
Complex sampling in latent variable models
Complex sampling in latent variable modelsComplex sampling in latent variable models
Complex sampling in latent variable modelsDaniel Oberski
 
Csr2011 june14 11_00_aaronson
Csr2011 june14 11_00_aaronsonCsr2011 june14 11_00_aaronson
Csr2011 june14 11_00_aaronsonCSR2011
 
An Introduction to Quantum Programming Languages
An Introduction to Quantum Programming LanguagesAn Introduction to Quantum Programming Languages
An Introduction to Quantum Programming LanguagesDavid Yonge-Mallo
 
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowPaper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowMin-Yih Hsu
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Ana Luísa Pinho
 
The Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and BackThe Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and BackMoshe Vardi
 
Fast ALS-based matrix factorization for explicit and implicit feedback datasets
Fast ALS-based matrix factorization for explicit and implicit feedback datasetsFast ALS-based matrix factorization for explicit and implicit feedback datasets
Fast ALS-based matrix factorization for explicit and implicit feedback datasetsGravity - Rock Solid Recommendations
 
SPIE Conference V3.0
SPIE Conference V3.0SPIE Conference V3.0
SPIE Conference V3.0Robert Fry
 
Scalable Dynamic Graph Summarization
Scalable Dynamic Graph SummarizationScalable Dynamic Graph Summarization
Scalable Dynamic Graph SummarizationIoanna Tsalouchidou
 
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)Jialin LIU
 
Introduction to recommender systems
Introduction to recommender systemsIntroduction to recommender systems
Introduction to recommender systemsArnaud de Myttenaere
 
Pwl rewal-slideshare
Pwl rewal-slidesharePwl rewal-slideshare
Pwl rewal-slidesharepalvaro
 

Similar to RamezanPhDSlides (20)

Answering complex open domain questions with multi-hop dense retrieval
Answering complex open domain questions with multi-hop dense retrievalAnswering complex open domain questions with multi-hop dense retrieval
Answering complex open domain questions with multi-hop dense retrieval
 
CS221: HMM and Particle Filters
CS221: HMM and Particle FiltersCS221: HMM and Particle Filters
CS221: HMM and Particle Filters
 
A measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysisA measure to evaluate latent variable model fit by sensitivity analysis
A measure to evaluate latent variable model fit by sensitivity analysis
 
Complex sampling in latent variable models
Complex sampling in latent variable modelsComplex sampling in latent variable models
Complex sampling in latent variable models
 
Csr2011 june14 11_00_aaronson
Csr2011 june14 11_00_aaronsonCsr2011 june14 11_00_aaronson
Csr2011 june14 11_00_aaronson
 
An Introduction to Quantum Programming Languages
An Introduction to Quantum Programming LanguagesAn Introduction to Quantum Programming Languages
An Introduction to Quantum Programming Languages
 
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data FlowPaper Study - Demand-Driven Computation of Interprocedural Data Flow
Paper Study - Demand-Driven Computation of Interprocedural Data Flow
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
 
The Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and BackThe Automated-Reasoning Revolution: from Theory to Practice and Back
The Automated-Reasoning Revolution: from Theory to Practice and Back
 
Fast ALS-based matrix factorization for explicit and implicit feedback datasets
Fast ALS-based matrix factorization for explicit and implicit feedback datasetsFast ALS-based matrix factorization for explicit and implicit feedback datasets
Fast ALS-based matrix factorization for explicit and implicit feedback datasets
 
Pnp
PnpPnp
Pnp
 
SM2701 Class 11
SM2701 Class 11SM2701 Class 11
SM2701 Class 11
 
SPIE Conference V3.0
SPIE Conference V3.0SPIE Conference V3.0
SPIE Conference V3.0
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Scalable Dynamic Graph Summarization
Scalable Dynamic Graph SummarizationScalable Dynamic Graph Summarization
Scalable Dynamic Graph Summarization
 
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)
Algorithm Portfolios for Noisy Optimization: Compare Solvers Early (LION8)
 
Introduction to recommender systems
Introduction to recommender systemsIntroduction to recommender systems
Introduction to recommender systems
 
Pwl rewal-slideshare
Pwl rewal-slidesharePwl rewal-slideshare
Pwl rewal-slideshare
 
ilp-nlp-slides.pdf
ilp-nlp-slides.pdfilp-nlp-slides.pdf
ilp-nlp-slides.pdf
 
Fine Grained Complexity
Fine Grained ComplexityFine Grained Complexity
Fine Grained Complexity
 

RamezanPhDSlides

  • 1. ESTIMATION AND sampling IN SLOW MIXING MARKOV PROCESSES Ramezan Paravi | Ph.D. Candidate EE Department, UH Manoa Advisor: Dr. Santhanam August 2015
  • 2. 2 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview Markov Sources in slow mixing regime: Empirical counts do not reflect stationary Analysis of samples before mixing happens Part I: Statistical properties of finite samples Part II: Modeling using slow mixing Markov process Andrey Markov A B C 1 3 2 31 2 3 4 1 4 1 2
  • 3. 3 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation Any structure?
  • 4. 3 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation Any structure? Reorganizing
  • 5. 3 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation Any structure? Reorganizing • Social Networks • Biological Networks • Recommender Systems
  • 6. 4 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation 1 2 3 4 5 6 7 8 9 10 11 1213 1415 Random walk on graph
  • 7. 4 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation 1 2 3 4 5 6 7 8 9 10 11 1213 1415 Uniform random walk. Explore state space fast.
  • 8. 4 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Motivation 1 2 3 4 5 6 7 8 9 10 11 1213 1415 Non-uniform random walk. Polarized state space. Walks starting here will be trapped.
  • 9. 5 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results p(1|111) p(1|011) p(1|101) p(1|001) 1 0 1 0 1 0 1 Given sample Y1, . . . ,Yn from unknown binary Markov source p Transition probabilities? Stationary probabilities? What we do: fixed sample, best answer What we are not doing: e.a.s. results
  • 10. 6 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Complications Two major sources of difficulties: Long memory Slow mixing May not estimate accurately/completely given n samples Rather, want best possible answer with sample
  • 11. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011
  • 12. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source?
  • 13. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1
  • 14. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q2: If B ∼ i.i.d, what can be said?
  • 15. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q2: If B ∼ i.i.d, what can be said? Ans: P(1) P(0), (only 3 zeros in the sample)
  • 16. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q2: If B ∼ i.i.d, what can be said? Ans: P(1) P(0), (only 3 zeros in the sample) Q3: If A ∼ Markov memory-1, what about transition probs?
  • 17. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q2: If B ∼ i.i.d, what can be said? Ans: P(1) P(0), (only 3 zeros in the sample) Q3: If A ∼ Markov memory-1, what about transition probs? Ans: P(1|1) is high
  • 18. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q4: If B ∼ i.i.d, then what about P(0)?
  • 19. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q4: If B ∼ i.i.d, then what about P(0)? Ans: More 1’s than 0’s. With high confidence, P(0) small.
  • 20. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q4: If B ∼ i.i.d, then what about P(0)? Ans: More 1’s than 0’s. With high confidence, P(0) small. Q5: If A ∼ Markov memory-1, what about P(0)?
  • 21. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q4: If B ∼ i.i.d, then what about P(0)? Ans: More 1’s than 0’s. With high confidence, P(0) small. Q5: If A ∼ Markov memory-1, what about P(0)? Ans: Can not judge with finite sample.
  • 22. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q6: If B ∼ i.i.d and see more bits, then what?
  • 23. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q6: If B ∼ i.i.d and see more bits, then what? Ans: Likely lots of 1’s, few 0’s. P(0) ≈ #0’s len is small.
  • 24. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q6: If B ∼ i.i.d and see more bits, then what? Ans: Likely lots of 1’s, few 0’s. P(0) ≈ #0’s len is small. Q7: If A ∼ Markov memory-1 and see more bits, can we say #0’s len is small?
  • 25. 7 Intro Challenges Formal Transition Stationary Sampling Community Algorithm Finally Overview of Theoretical Results Two Samples with same # of 0’s and 1s: A 11111111111111000 v.s. B 11011110111111011 Q1: Which one is generated by a memory-1 Markov source v.s an i.i.d source? Ans: Probably, B ∼ i.i.d, while A ∼ Markov memory-1 Q6: If B ∼ i.i.d and see more bits, then what? Ans: Likely lots of 1’s, few 0’s. P(0) ≈ #0’s len is small. Q7: If A ∼ Markov memory-1 and see more bits, can we say #0’s len is small? Ans: NO. P(0) could be arbitrarily large!!
Overview of Theoretical Results
Transition probabilities: given a sample x, a string w, and a ∈ A, can P(a|w) be estimated from x?
Easy, if |w| ≥ memory: the subsequence following w is i.i.d., so (#wa)/(#w) ≈ P(a|w).
Harder, if the memory is unknown but the source has mixed: both #w and #wa reflect P(w) and P(wa), so still (#wa)/(#w) ≈ P(a|w).
Difficult, if the memory is unknown and the source has not mixed: non-trivial.
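A minimal sketch of this count-ratio estimator for the easy and mixed cases (the function name and the binary-string sample representation are illustrative choices, not from the slides):

```python
def transition_estimate(x, w):
    """Count-ratio estimator p_hat(a|w) = (#wa)/(#w) on a binary sample x.

    Scans every position where the context w occurs and tallies the symbol
    that follows it. Returns None if w never occurs with a follower.
    """
    followers = [x[i + len(w)]
                 for i in range(len(x) - len(w))
                 if x[i:i + len(w)] == w]
    if not followers:
        return None
    return {a: followers.count(a) / len(followers) for a in "01"}

# e.g. transition_estimate("1101010100", "10") -> {'1': 0.75, '0': 0.25}
```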
Results on Estimation
Given a length-n sample from a binary model with dying dependencies: the amount of information two bits provide about each other, conditioned on the bits between them, diminishes the farther apart they are.
Identify from the data which parameters can be estimated: a set G̃ of "good" strings w of length Θ(log n), namely those that occur frequently enough in the sample.
Provide confidence and accuracy bounds: transition probabilities of w ∈ G̃, via universal compression plus combinatorial arguments, satisfy (#wa)/(#w) ≈ P(a|w).
Results on Estimation
Surprises: the transition bound for w may hold even if (#w)/n or (#wa)/n do not approximate the stationary probabilities, and it may fail for shorter substrings of w. (Proof order: transition first, then stationary.)
Results on Estimation
Provide confidence and accuracy bounds: stationary probabilities of w ∈ G̃, via a Doob martingale, coupled Markov chains, and Azuma's inequality, satisfy
(#w)/ñ ≈ p(w)/p(G̃) for w ∈ G̃, where ñ is the number of times strings in G̃ appear in the sample.
Results on Estimation
All bounds are entirely data-dependent: confidence and accuracy are obtained w.h.p. from the sample, and knowledge of the source is not required.
Modeling using slow mixing
We provide a randomized algorithm for community detection:
Build slow mixing random walks on graphs.
Mixing properties of the walk reveal the community structure.
Framework: Coupling From the Past.
Simulation results on benchmark networks.
Part I: Estimation in Slow Mixing Markov Processes
Estimation Challenges: Arbitrary Mass
All strings in a finite sample can have arbitrarily small mass.
Example: a two-state chain on {0, 1} with stationary probabilities p(0) = m/(m+1) and p(1) = 1/(m+1); state 1 is left with probability ε and state 0 with probability ε/m.
p(1) can be arbitrarily small for m large enough.
If ε ∈ o(1/n), then starting from 1 we see a run of O(1/ε) 1's w.h.p., so the entire length-n sample is 1's.
p(any sequence of 1's) ≤ 1/(m+1), which can be arbitrarily small.
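A small simulation of this chain. The 0→1 rate ε/m is reconstructed from the diagram and should be treated as an assumption; it is the choice under which detailed balance gives the stated stationary distribution:

```python
import random

def run_chain(eps, m, n, seed=0):
    """Two-state chain: leave state 1 w.p. eps, leave state 0 w.p. eps/m.
    Detailed balance then gives pi(0) = m/(m+1), pi(1) = 1/(m+1)."""
    rng = random.Random(seed)
    state, out = 1, []
    for _ in range(n):
        out.append(state)
        leave = eps if state == 1 else eps / m
        if rng.random() < leave:
            state = 1 - state
    return out

# With eps in o(1/n), a walk started at 1 emits only 1's w.h.p.,
# even though pi(0) = m/(m+1) can be made arbitrarily close to 1.
sample = run_chain(eps=1e-6, m=100, n=1000)
print(sum(sample), "ones out of", len(sample))   # almost surely 1000
```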
Estimation Challenges: Long Memory
A binary context tree of depth k with every transition probability equal to 1/2: this is just a Bernoulli(1/2) i.i.d. source, over-parameterized.
Estimation Challenges: Long Memory
The same depth-k tree, except that one deep context's transition probability is perturbed (1 − 2ε in place of 1/2); that context is reached only after k − 1 consecutive 0's.
If k ∈ ω(log n), then starting from 1 we will not see a run of k − 1 consecutive 0's w.h.p. in a sample of size n.
Therefore all bits are generated w.p. 1/2, and the source cannot be distinguished from i.i.d. Bernoulli(1/2) w.h.p.
Estimation Challenges: Slow Mixing
Two memory-1 chains: one with p(0) = p(1) = 1/2, the other with p′(1) = 2/3 and p′(0) = 1/3, both leaving state 1 only with a tiny probability of order ε (self-loops 1 − ε and 1 − 2ε in the diagrams).
Starting from 1, w.h.p. both generate a run of O(1/ε) 1's.
If ε ∈ o(1/n), w.h.p. they cannot be distinguished using a length-n sample.
Caution! Slow mixing only hurts estimation, not compression: good compression is possible for memory-1 sources, slow mixing or not.
Prior Work
Any consistent estimator converges pointwise (NOT uniformly) over the class of stationary and ergodic Markov models.
Extensive work exists on consistency, e.a.s. results, and finite-sample but model-dependent results
[Bühlmann, Csiszár, Galves, Garivier, Leonardi, Marton, Maume-Deschamps, Morvai, Rissanen, Schmitt, Shields, Talata, Weiss, Wyner].
Our philosophy: can we look at a length-n sample and identify what, if anything, can be estimated accurately?
Dependencies
The problem is futile if the dependencies are arbitrary, so we assume dependencies die down.
Formalize with d : N → R⁺. Sibling contexts s, s′ (nodes of the same color in the tree figure, which depicts the depth-4 case) satisfy
|p(1|s)/p(1|s′) − 1| ≤ d(4),
and analogously with d(i) at depth i. M_d = {sources satisfying the above for all siblings}.
Dependencies Die Down
Information-theoretic interpretation: I(Y_0; Y_{i+1} | Y_1^i) ≤ log(1 + d(i)).
Not related to the mixing properties of the source, and no bound on the memory of the source.
We need d(i) summable over i; equivalently, δ_j = Σ_{i≥j} d(i) → 0.
Aggregation
Unknown set of states T. Consider a coarser model (aggregation): ask for p(1|w), where |w| = k_n, and set k_n = Θ(log n).
(It makes sense to ask for length-3 aggregations of a memory-2 source: the source itself answers them.)
Estimation Goal
Unknown source p in M_d. For any length-k_n string w (with known k_n = Θ(log n)):
• Estimate the transition p(·|w)
• Estimate the stationary p(w)
(Figure: the unknown source p in M_d and its aggregation in the space of memory-k_n sources.)
Caution: this is not the same as estimating a memory-k_n source; we never see samples from the aggregated model.
Naive Estimators
Example: suppose Y_{−∞}^0 = ···00 and Y_1^n = 1101010100, and consider depth-2 aggregated parameters, e.g., the transition probability from w = 10.
The subsequence following the occurrences of 10 is 1110.
This is not i.i.d. in general, because the sample comes from the true model, not the aggregated model!
Naive ("1110 i.i.d."): p̂(1|w) = 3/4 and p̂(0|w) = 1/4.
Deviation Bounds for Conditional Probabilities
Theorem. With confidence ≥ 1 − 1/(2^{2k_n+1} log n) (conditioned on any past Y_{−∞}^0), for all w ∈ {0,1}^{k_n} simultaneously,
‖(#w·)/(#w) − p(·|w)‖_1 ≤ sqrt( 2(ln 2)(2^{k_n+1} log n + n δ_{k_n}) / N_w ).
Again, δ_{k_n} = Σ_{i≥k_n} d(i). The more occurrences, the stronger the bound: if d(i) decreases exponentially as γ^i and N_w = n^β, the RHS diminishes as O(1/√(n^{β−1−γ log γ})).
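Evaluating the right-hand side, as reconstructed above, is a one-liner; taking logs base 2 is an assumption, since the slide does not specify the base:

```python
import math

def l1_accuracy(k_n, n, delta_kn, N_w):
    """RHS of the theorem as reconstructed:
    sqrt(2 ln2 * (2^(k_n+1) * log n + n * delta_kn) / N_w)."""
    return math.sqrt(2 * math.log(2) *
                     (2 ** (k_n + 1) * math.log2(n) + n * delta_kn) / N_w)
```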
Proof Idea
The result is built on the following facts:
• The source belongs to M_d, i.e., dependencies die down.
• A compression result on M_d reminiscent of the method of types.
• Arguments relating strong compression results to the variational distance between estimators.
Surprises
(Proof order: transition first, then stationary.)
The bound may hold for w even without the empirical frequencies approximating the stationary probabilities, but it may fail for w's suffixes: it is possible that p(1|100 zeros) is estimable while p(1|0) is not.
Good States
Good "states" G̃ are the length-k_n strings appearing frequently enough:
G̃ = { w : count(w) ≥ max{ n δ_{k_n} log(1/δ_{k_n}), 2^{k_n+1} log² n } }.
The concentration bound is at least as fast as 1/√(log n); if d(i) decreases exponentially as γ^i, the concentration bound is polynomial in n.
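A sketch of how G̃ could be computed from a sample, assuming δ_{k_n} is available (in the analysis it comes from the known dependence bound d); logs are taken base 2 here as an assumption:

```python
from collections import Counter
import math

def good_states(x, k_n, delta_kn):
    """Return the set of length-k_n strings occurring often enough in the
    binary sample x, per the threshold on the slide."""
    counts = Counter(x[i:i + k_n] for i in range(len(x) - k_n + 1))
    n = len(x)
    thresh = max(n * delta_kn * math.log2(1 / delta_kn),
                 2 ** (k_n + 1) * math.log2(n) ** 2)
    return {w for w, c in counts.items() if c >= thresh}
```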
Stationary Probabilities
Stationary probabilities are a sensitive function of the conditional probabilities. How should counts of w ∈ G̃ in the sample be interpreted?
Deviation Bounds for Stationary Probabilities
Theorem. For any t > 0, any past Y_{−∞}^0, and any w ∈ G̃,
P( |(#w)/ñ − p(w)/p(G̃)| ≥ t | Y_{−∞}^0 ) ≤ 2 exp( −(ñt − B)² / (2ñB²) ),
where ñ is the total count of states in G̃, and B depends on n and on how quickly the dependencies d(i) die off.
B = O(log n) if d(i) = γ^i. For the bound to be non-vacuously true, d must be "twice summable", i.e., δ_i = Σ_{j≥i} d(j) must itself be summable.
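The tail probability, as reconstructed, is easy to tabulate; note it says nothing until ñ·t exceeds B:

```python
import math

def stationary_tail(t, n_tilde, B):
    """2 * exp(-(n_tilde*t - B)^2 / (2 * n_tilde * B^2)), clipped at 1
    when n_tilde*t <= B, where the bound is vacuous."""
    if n_tilde * t <= B:
        return 1.0
    return min(1.0, 2 * math.exp(-(n_tilde * t - B) ** 2
                                 / (2 * n_tilde * B ** 2)))
```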
Estimation Along a Sequence of Stopping Times
Restriction of the process to good states, illustrated with G̃ = {01, 10}: scanning Y_1, …, Y_n, the stopping times τ_0 < τ_1 < ⋯ < τ_ñ mark the successive positions where the current length-2 context lies in G̃, and Z_0, Z_1, …, Z_ñ record which good state (w = 10 or w = 01) occurs at each.
Proof Outline
1. Construct a natural Doob martingale V_m = E(#01 | Z_0, Z_1, …, Z_m), m = 0, …, ñ.
2. Bound |V_m − V_{m−1}| using a coupling argument.
3. Azuma's inequality closes the bound.
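For reference, the standard Azuma–Hoeffding inequality invoked in step 3: if the martingale differences satisfy |V_m − V_{m−1}| ≤ c_m, then

```latex
P\big(|V_{\tilde n} - V_0| \ge t\big) \;\le\; 2\exp\!\Big(-\frac{t^2}{2\sum_{m=1}^{\tilde n} c_m^2}\Big).
```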
Coupling Technique
Two copies X_0, X_1, …, X_n and Y_0, Y_1, …, Y_n evolve jointly according to a coupling ω.
Encourage the evolution so that X_t = Y_t for all t after a (random) number of steps τ:
P(τ > i) = ω(X_i ≠ Y_i), so E(τ) = Σ_{i≥1} P(τ > i) = Σ_{i≥1} ω(X_i ≠ Y_i).
Sample size ≫ E(τ) ⇒ empirical ≈ stationary [Aldous].
Proof Outline (contd)
• Run two coupled copies Z_j and Z′_j of the restricted process.
• Long and unknown memory ⇒ they do not coalesce the usual way.
• Instead, approximate coalescence: the longer they stay together, the harder it is to separate them.
• Reason: dependencies die down.
|V_m − V_{m−1}| ≤ Σ_{j=m+1}^{ñ} ω(Z_j ≉ Z′_j)
Part II: Sampling From Slow Mixing Markov Processes
Coupling From the Past (CFTP)
A Markov chain over S with stationary distribution π. Goal: sample exactly from π(·).
Idea (Propp and Wilson): expose all states to a shared source of randomness, simulate the chains backward in time, and wait until all chains merge.
Example: the three-state chain on {A, B, C} from the introduction, with π(A) = 9/19, π(B) = 4/19, π(C) = 6/19.
Coupling
A single uniform draw U ∈ [0, 1] drives all chains at once: each state's transition probabilities partition [0, 1] (the figure marks the breakpoint 1/3), so the same U determines the next state of A, B, and C simultaneously.
Example
Run chains from all of A, B, C starting at times −1, −2, …, −6, reusing the shared randomness at each time step; going far enough back, all trajectories merge by time 0. Coalescence ⇒ the output at time 0 is distributed exactly as π(·).
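A complete CFTP sketch for this example. The stationary probabilities 9/19, 4/19, 6/19 are from the slides; the specific edge targets below are reconstructed from the diagram's probability labels (1/3, 2/3, 3/4, 1/4, 1/2, 1/2) and should be treated as an assumption, chosen so that the stationary distribution comes out to exactly (9/19, 4/19, 6/19):

```python
import random
from collections import Counter

# Transition rules: edge targets are an assumption reconstructed from the
# diagram; this choice has stationary distribution (9/19, 4/19, 6/19).
P = {
    "A": [("A", 1/3), ("C", 2/3)],
    "B": [("A", 3/4), ("B", 1/4)],
    "C": [("A", 1/2), ("B", 1/2)],
}

def step(state, u):
    """Grand coupling: all chains consume the same uniform u each step."""
    acc = 0.0
    for nxt, prob in P[state]:
        acc += prob
        if u < acc:
            return nxt
    return P[state][-1][0]          # guard against float round-off

def cftp(rng):
    """Propp-Wilson: simulate from the past, doubling the horizon and
    reusing the randomness, until all start states coalesce at time 0."""
    us = []                         # us[i] drives the step ending at time -i
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        states = {s: s for s in P}  # one chain per start state at time -T
        for t in range(T, 0, -1):
            states = {s: step(x, us[t - 1]) for s, x in states.items()}
        if len(set(states.values())) == 1:
            return next(iter(states.values()))  # exact sample from pi
        T *= 2

# Empirical check against pi = (9/19, 4/19, 6/19) ~ (0.47, 0.21, 0.32):
print(Counter(cftp(random.Random(i)) for i in range(5000)))
```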
Graph Clustering
Similarity graph → reorganize into clusters. Complexity: NP-hard.
Community Detection on Graphs
Number of clusters? Usually not known in advance.
What is a good measure? Nodes within a cluster are tightly connected; nodes in disparate clusters are loosely connected.
Correlation clustering cost (graph clustering distance): the minimum number of edge additions/deletions such that graph G becomes a disjoint union of cliques.
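A direct O(|V|²) computation of this cost for a given candidate clustering; the function name and the edge representation are illustrative:

```python
from itertools import combinations

def clustering_cost(nodes, edges, label):
    """Edge additions + deletions needed to turn the graph into the
    disjoint cliques described by label (a dict: node -> cluster id)."""
    E = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = label[u] == label[v]
        has_edge = frozenset((u, v)) in E
        if same_cluster and not has_edge:
            cost += 1               # missing intra-cluster edge: add it
        elif not same_cluster and has_edge:
            cost += 1               # stray inter-cluster edge: delete it
    return cost
```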
Approaches
Spectral clustering: eigenvectors of the Laplacian of the similarity graph. Simple to implement, but the Laplacian can be ill-conditioned and the number of clusters needs to be known in advance. [Ng, Jordan, White, Smyth, Weiss, Shi, Malik, Kannan, Vempala, Vetta, Meila, …]
Semi-definite programming (SDP): LP relaxation. Asymptotically optimal for stochastic block models, but makes an implicit assumption on the generative model, and the number of clusters needs to be known in advance. [Abbe, Sandon, Hajek, Bandeira, Hall, Decelle, Mossel, Neeman, Sly, Rao, Chen, Wu, Xu, …]
Core Idea
Start from the similarity graph and define a non-uniform random walk: the probability of following a link is proportional to the number of common neighbors of its endpoints.
Couple random walks starting from different nodes and adapt the CFTP algorithm.
We do not care about the exact sample; instead, identify clusters "before coalescence happens".
A restricted walk within a cluster mixes faster. (A sketch of the walk's transition matrix follows.)
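A sketch of such a walk's transition matrix. The slides state only the proportionality; the +1 smoothing (so that an edge whose endpoints share no neighbors still gets positive weight) is an assumption:

```python
import numpy as np

def common_neighbor_walk(A):
    """Row-stochastic transition matrix with P(u -> v), for each edge (u, v),
    proportional to 1 + (# common neighbors of u and v).
    Assumes the adjacency matrix A has no isolated nodes."""
    A = np.asarray(A, dtype=float)
    common = A @ A                  # (A @ A)[u, v] = # common neighbors
    W = A * (common + 1.0)          # keep weight only on existing edges
    return W / W.sum(axis=1, keepdims=True)
```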
Algorithm Overview
Initially: all singletons.
Partial coalescence: small clusters form.
Critical times: clusters merge (possibly at several such times).
Finally: full coalescence.
Partial Coalescence
(Figure: backward times −10, …, −1, 0; paths starting from G ⊂ S coalesce among themselves, separately from those starting in G^c.)
Remarks
The algorithm can stop at critical times and still yield a clustering; choose the clustering with the optimal cost.
Purely based on a random walk and its mixing properties, which circumvents issues with ill-conditioned matrices in spectral approaches.
No prior assumption on the generative model, and no prior assumption on the number of clusters.
Benchmark Networks
Stochastic block models (SBM): community sizes equal; average degree equal for all nodes.
LFR models: more realistic; the number and sizes of communities follow a power law.
Real-world networks.
Stochastic Block Model
Edges within a community are Ber(p); edges across communities are Ber(q); p > q.
SBM Realization
500 nodes, p = 0.5, q = 0.1, 5 communities, randomly permuted.
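A generator matching this realization's parameters (the seed and helper name are arbitrary):

```python
import numpy as np

def sbm(n=500, k=5, p=0.5, q=0.1, seed=0):
    """Adjacency matrix of an SBM with k equal communities:
    Ber(p) within a community, Ber(q) across; nodes randomly permuted."""
    rng = np.random.default_rng(seed)
    z = np.repeat(np.arange(k), n // k)       # planted community labels
    probs = np.where(z[:, None] == z[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, 1)
    A = (upper | upper.T).astype(int)         # symmetric, zero diagonal
    perm = rng.permutation(n)                 # hide the block structure
    return A[np.ix_(perm, perm)], z[perm]

A, labels = sbm()                             # 500 nodes, 5 communities
```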
Identifying communities (figure).
CC-PIVOT output (figure).
LFR model: 200 nodes, 6 communities (figure).
Identifying communities (figure).
CC-PIVOT output (figure).
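CC-PIVOT here is presumably the classic randomized pivot algorithm for correlation clustering (Ailon, Charikar, and Newman's CC-Pivot, a 3-approximation in expectation); a minimal sketch under that assumption:

```python
import random

def cc_pivot(nodes, edges, rng=random.Random(0)):
    """Pick a random unclustered pivot, cluster it with its unclustered
    neighbors, remove them from consideration, and repeat."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining, clusters = set(nodes), []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (adj[pivot] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```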
American College Football
115 teams, divided into conferences.
Identified communities (figure).
CC-PIVOT output (figure).
Future Directions
Open: theoretical guarantees for recovery in SBMs.
Open: why stationary estimation (which needs twice summability of d) is harder than transition estimation (which needs only summability of d).
A theoretical foundation for the proposed algorithm.
Extension to a broader set of community detection problems.
Acknowledgement
My advisor: Prasad. My committee members. Dr. Rui Zhang.
Office mates: Maryam and Meysam.
My friends: Masoud, Saeed, Navid, Elyas, Harir, Chee, Reza, Ehsaneh, Ali, Ashkan, Hamed, Ehsan, Seyed, Alireza, …
My family: my parents, my sister Nasrin, and my brother Naser.