Cohort Representation and Exploration

Behrooz Omidvar-Tehrani
omidvar.info
May 29, 2019 @ SLIDE

My research
Data exploration (enabling interactions with data) for customized analysis, which originated from Exploratory Data Analysis (EDA) in statistics.

Exploring Data Exploration
1. PhD in mining, exploring, and visualizing user groups (with Sihem Amer-Yahia)
[CIKM’15] [EDBT’15] [PKDD’16] [DEXA’17] [ICDE Demo’18] [VLDBJ’18] [EDBT’18] [CIKM Tutorial’18] [TKDE’19] [HILDA’19] [SIGMOD Tutorial’19]
2. Mini-postdoc in exploring student progression in the French medical education system (with Marie-Christine Rousset)
[J. AI in Med ’19]
3. Postdoc in exploring spatiotemporal data (with Arnab Nandi at OSU)
[ICDE Demo’17] [UrbanGIS’17] [SIGSPATIAL’17]
4. Postdoc in exploring user data and medical data (with Sihem Amer-Yahia)
[DMAH’18] [DSAA’18] [VLDB Demo’19]
5. Research scientist in exploring POI recommendations (with Naver Labs Europe)

Medical data analysis
Before: public health analysis
Now: precision medicine

Public Health Analysis
Overall health-related conclusions
about the masses

Precision Medicine
A medical model for the customization of healthcare, with medical decisions and treatments being tailored to the individual patient.
Source: National Institutes of Health (NIH)

Precision medicine for medical cohorts
• Public health analysis: How does air pollution affect people?
• Precision medicine: How does air pollution affect Julia’s health?
• Precision medicine for cohorts (explore cohorts of patients and their trajectories): How does air pollution affect the health status of middle-aged females in Paris suffering from sleep apnea?
The progression is towards more customization.

Why patient cohorts?
Julia is suffering from epilepsy and is trying to understand what has caused an increase in the frequency of her absence seizures.
Cohort: patients suffering from epilepsy.
• External benefit, for hospitals (a better cohort): medical units prepare special care for the specific needs of the cohort members.
• Internal benefit, for Julia (a better me): Julia observes the profiles of look-alike (or like-minded) patients and sees where she falls relative to the norm.

Medical cohort analysis
Medical cohort analysis shows how the health of a set of patients is affected by treatments and diseases.
• Representation: what has happened?
• Exploration: what happened to similar cohorts?
• Prediction: what will happen next?
Example cohort: middle-aged females in Paris suffering from sleep apnea.

Cohort representation
Help medical experts understand which sequences of treatments and diseases lead to a final health status.
Example questions for the cohort of middle-aged females in Paris suffering from sleep apnea:
• Which sequence of treatments is the most relevant to death?
• Which treatment, administered right after admission, kept cohort members alive for a longer period?
• What changes in Body Mass Index (BMI) lead to death?

Cohort exploration
Enable navigation in the space of cohorts to compare their representations and see how their health evolves.
Example: starting from the cohort of patients inside Grenoble suffering from sleep apnea, explore the cohort of patients outside Grenoble. How do their sequences of treatments and diseases evolve?

Outline
0 Introduction
1 Health-care data model
2 Cohort representation
3 Cohort exploration
4 Experiments

1 Health-care data model

Health-care datasets
Dataset | AGIR-à-Dom | Rambam
Number of patients | 56,286 | 260,099
Number of actions | 1,845 | 428
Number of events | 1,543,263 | 1,300,987
Time period | January 2000 to December 2018 | January 2004 to October 2007
Demographic attributes | gender, age, location, life status | gender, age, life status
Types of actions | treatment, etiology, BMI marker, sleepiness marker, hospitalization | visit outcome, visited hospital department, visited ward, emergency stay, hospitalization

Star schema of health-care datasets
AGIR-à-Dom dataset for 56,284 patients with respiratory diseases.
• Patient: Patient ID, Gender, Birth year, Postal code, Life status, Death date
• Health service (324,017 records): Patient ID, Service duration, Service ID
• Etiology (223,142 records): Patient ID, Etiology date, Etiology label
• Compliance (659,361 records): Patient ID, Compliance date, Compliance ID
• Hospitalization (28,712 records): Patient ID, Hospitalization duration, Hospital name, Service ID
• Insurance (149,624 records): Patient ID, Insurance duration, Insurer, Insurance type, Exemption
• Fatigue marker (217,581 records): Patient ID, Observation date, Marker value
• BMI marker (390,028 records): Patient ID, Observation date, Height value, Weight value, Marker value
• Respiration marker (151,810 records): Patient ID, Observation date, Marker value

Patient events
We want to express all health-care events in the following form.
event: ⟨patient, action, timestamp⟩

Patient trajectories
• A patient trajectory is a list of temporally sorted events for the patient.
Example trajectory, where t0 is the admission time (the first appearance of the patient in our health-care database):
t0: BMI_obese, epworth_sleepy
t1: hospitalization_begin
t2: oxygen_begin
t10: hospitalization_end
t14: BMI_obese, epworth_normal, oxygen_end
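The event form and the trajectory definition above can be sketched in a few lines of Python. This is only an illustration: the names `Event` and `trajectory`, the integer timestamps, and the sample events are assumptions made for the example, not part of the talk.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One health-care event in the form <patient, action, timestamp>."""
    patient: str
    action: str
    timestamp: int  # assumed: time units elapsed since admission (t0 = 0)


def trajectory(events: List[Event], patient: str) -> List[Event]:
    """Return the events of one patient, sorted by timestamp.

    sorted() is stable, so events sharing a timestamp keep their input order.
    """
    own = [e for e in events if e.patient == patient]
    return sorted(own, key=lambda e: e.timestamp)


# Hypothetical, unordered event log for two patients.
events = [
    Event("p1", "oxygen_begin", 2),
    Event("p1", "BMI_obese", 0),
    Event("p1", "hospitalization_begin", 1),
    Event("p2", "BMI_obese", 0),
    Event("p1", "epworth_sleepy", 0),
]
traj = trajectory(events, "p1")
```

Keeping events as flat tuples means any table of the star schema can be turned into events by choosing which column plays the role of the action and which the timestamp.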

Patients’ cohort
• A cohort is a set of patients defined with a set of predicates.
• For instance, the cohort of middle-aged females in Paris suffering from sleep apnea has the following 3 patients.
[Figure: trajectories of Patient 1, Patient 2, and Patient 3 on timelines starting at t0, with events such as BMI_obese, BMI_overweight, oxygen_begin, oxygen_end, hospitalization_begin, hospitalization_end, and epworth_borderline.]
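A cohort defined by predicates can be read as a conjunction of attribute-value tests over a patient table. The sketch below assumes a toy table and a hypothetical helper `select_cohort`; the rows are illustrative values, not data from the datasets.

```python
from typing import Dict, List

# Toy demographic table (illustrative values only).
patients: Dict[str, Dict[str, str]] = {
    "p1": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
    "p4": {"gender": "female", "location": "Paris", "age": "middle", "life": "dead"},
    "p5": {"gender": "male", "location": "Grenoble", "age": "old", "life": "alive"},
    "p6": {"gender": "female", "location": "Paris", "age": "old", "life": "dead"},
    "p7": {"gender": "male", "location": "Grenoble", "age": "old", "life": "alive"},
}


def select_cohort(table: Dict[str, Dict[str, str]], predicates: Dict[str, str]) -> List[str]:
    """Return the IDs of patients satisfying every <attribute, value> predicate."""
    return sorted(
        pid
        for pid, demographics in table.items()
        if all(demographics.get(attr) == value for attr, value in predicates.items())
    )


females_in_paris = select_cohort(patients, {"gender": "female", "location": "Paris"})
middle_aged_females_in_paris = select_cohort(
    patients, {"gender": "female", "location": "Paris", "age": "middle"}
)
```

Adding a predicate can only shrink the cohort, which is why more specific cohorts (towards more customization) contain fewer patients.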

2 Cohort representation

• Goal. Help medical experts understand the sequence of treatments and diseases for patients over time.
• Challenge. Cohorts often consist of hundreds of patients with various types of events.
[Figure: trajectories of Patient 1 to Patient 500, a cohort with 500 patients.]

Ideal representation of a cohort
• An ideal representation for medical experts should describe a single end-to-end story for the cohort, and be succinct and readable.
Expert: What has happened to the cohort of middle-aged females in Paris with bronchitis who stay alive?
Cohort representation system: The cohort starts with oxygen therapy, follows with ventilation after 11 months, and takes another ventilation treatment one month later.

Finding ideal cohort representations
• First off, how can we compare two patient trajectories?
Patient 1’s trajectory: A B C C …
Patient 2’s trajectory: A A C D A C …
Representation:
• Set-based: {A,C}
• Order-based: A→C→C
• Time-based: A (0.7) → C (1.0) → C (0.6)
Aggregated event
event: ⟨patient, action, timestamp⟩
aggregated event: ⟨patients, action, timestamps⟩


Quality of aggregated events
• Dispersion: an aggregated event is dispersed if its timestamps are spread in time.
• Significance: an aggregated event is more significant if it contains several different timestamps.
Example: given two aggregated events ē1 = ⟨P, x, {t0, t6}⟩ and ē2 = ⟨P′, x, {t1, t1, t3, t4}⟩, ē2 is more significant than ē1 because it contains more events. Note that while ē2 has higher significance than ē1, the latter has more dispersion.
dispersion(ē1) = 6, significance(ē1) = 2
dispersion(ē2) = 3, significance(ē2) = 4
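The four values above are consistent with reading dispersion as the span between the earliest and latest timestamp, and significance as the number of timestamps in the aggregated event. The sketch below encodes that reading; it is an inference from the example, and the patient sets are hypothetical stand-ins for P and P′.

```python
from typing import FrozenSet, NamedTuple, Tuple


class AggregatedEvent(NamedTuple):
    """Aggregated event in the form <patients, action, timestamps>."""
    patients: FrozenSet[str]
    action: str
    timestamps: Tuple[int, ...]


def dispersion(e: AggregatedEvent) -> int:
    """Assumed reading: span between the earliest and the latest timestamp."""
    return max(e.timestamps) - min(e.timestamps)


def significance(e: AggregatedEvent) -> int:
    """Assumed reading: number of timestamps (with repetitions) in the event."""
    return len(e.timestamps)


# e1 = <P, x, {t0, t6}> and e2 = <P', x, {t1, t1, t3, t4}>; patient IDs are made up.
e1 = AggregatedEvent(frozenset({"p1", "p2"}), "x", (0, 6))
e2 = AggregatedEvent(frozenset({"p1", "p2", "p3", "p4"}), "x", (1, 1, 3, 4))
```

Under this reading the two measures pull in different directions: merging more trajectories raises significance but can only keep or widen the timestamp span.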

Problem of cohort representation
• Given a cohort c and a significance threshold σ, the problem of cohort representation is to find all aggregated events Ēc where, for each ē ∈ Ēc, the following two conditions are satisfied:
1. significance(ē) ≥ σ,
2. dispersion(ē) is minimized.
• The problem is NP-complete, by a reduction from the Multiple Sequence Alignment (MSA) problem.

Our cohort representation methodology
0. EHR dataset (patients)
patient | gender | location | age | life
p1 | female | Grenoble | old | alive
p4 | female | Paris | middle | dead
p5 | male | Grenoble | old | alive
p6 | female | Paris | old | dead
p7 | male | Grenoble | old | alive
1. Cohort specification: the medical expert specifies a cohort.
cohort demographics: ⟨gender, female⟩ ⋀ ⟨location, Grenoble⟩ ⋀ ⟨age, old⟩ ⋀ ⟨life, alive⟩
cohort actions: ∅
cohort members: p1, p2, p3
2. Trajectories of the cohort’s members
[Figure: trajectories of p1, p2, and p3 on a timeline from the admission time t0 (time zero) to t11. Each trajectory starts with action A; later actions include B, C, D, E, and H.]
3. Sequence matching: aggregated events
between T(p1) and T(p2): ⟨{p1,p2}, A, {t0,t0}⟩, ⟨{p1,p2}, B, {t5,t6}⟩, ⟨{p1,p2}, C, {t10,t10}⟩
between T(p2) and T(p3): ⟨{p2,p3}, A, {t0,t0}⟩, ⟨{p2,p3}, B, {t5,t6}⟩
between T(p1) and T(p3): ⟨{p1,p3}, A, {t0,t0}⟩, ⟨{p1,p3}, B, {t5,t5}⟩
4. Compute significance score
aggregated event | count | final score
⟨{p1,p2,p3}, A, {t0,t0,t0,t0,t0,t0}⟩ | 3 | 1.00
⟨{p2,p3}, B, {t5,t5}⟩ | 1 | 0.33
⟨{p1,p2,p3}, B, {t5,t5,t5,t5,t5,t6}⟩ | 3 | 1.00
⟨{p1,p2}, C, {t10,t10}⟩ | 1 | 0.33
5. Cohort representation given σ = 0.8: A, B
6. Cohort representation given σ = 0.3: A, B, C
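Steps 4 to 6 can be illustrated with a small sketch. The scores on the slide are consistent with dividing the number of trajectory pairs in which an action was matched by the total number of pairs (3); that reading is an assumption. For brevity the sketch groups matches by action only and ignores timestamps, so it does not distinguish the two B events of the table, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Actions matched between each pair of trajectories (step 3 of the methodology).
pairwise_matches: Dict[Tuple[str, str], List[str]] = {
    ("p1", "p2"): ["A", "B", "C"],
    ("p2", "p3"): ["A", "B"],
    ("p1", "p3"): ["A", "B"],
}


def significance_scores(matches: Dict[Tuple[str, str], List[str]]) -> Dict[str, float]:
    """Score each action by the fraction of trajectory pairs in which it matched."""
    counts = Counter(a for actions in matches.values() for a in set(actions))
    return {action: count / len(matches) for action, count in counts.items()}


def representation(scores: Dict[str, float], sigma: float) -> List[str]:
    """Keep the actions whose score reaches the significance threshold sigma.

    Actions are returned in alphabetical order, not in temporal order.
    """
    return sorted(action for action, score in scores.items() if score >= sigma)


scores = significance_scores(pairwise_matches)
strict = representation(scores, sigma=0.8)
loose = representation(scores, sigma=0.3)
```

Lowering σ from 0.8 to 0.3 admits C, which was matched in only one of the three pairs.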

Generation of aggregated events using sequence matching
• Sequence matching was initially proposed to find matches between protein sequences and discover homologous protein pairs.
• In the Needleman-Wunsch algorithm, we first build the Manhattan graph using cost tables. Then we back-chain to obtain alignments.
• In the absence of cost tables, we use a default cost template: +1 for a match, −1 for a mismatch, and −2 for a gap.
Patient 1’s trajectory: A (t0), G (t15), C (t42)
Patient 2’s trajectory: A, A, A, C, with timestamps t0, t3, t12, and t72
Manhattan graph (columns: Patient 1’s actions; rows: Patient 2’s actions):
      | BEGIN | A  | G  | C
BEGIN | 0     | -2 | -4 | -6
A     | -2    | 1  | -1 | -3
A     | -4    | -1 | 0  | -2
A     | -6    | -3 | -2 | -1
C     | -8    | -5 | -4 | -1
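The score table above can be reproduced with a textbook Needleman-Wunsch implementation using the default costs (+1 match, −1 mismatch, −2 gap). The sketch below is a generic version of the algorithm, not the authors' code; the function names are hypothetical, and ties in the back-chaining step are broken in favour of the diagonal move.

```python
from typing import List, Tuple


def nw_matrix(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> List[List[int]]:
    """Fill the Needleman-Wunsch score table for two action sequences."""
    m = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        m[i][0] = i * gap
    for j in range(1, len(cols) + 1):
        m[0][j] = j * gap
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + step, m[i - 1][j] + gap, m[i][j - 1] + gap)
    return m


def nw_align(rows: str, cols: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> Tuple[str, str]:
    """Back-chain from the bottom-right cell to recover one optimal alignment."""
    m = nw_matrix(rows, cols, match, mismatch, gap)
    i, j = len(rows), len(cols)
    a: List[str] = []
    b: List[str] = []
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and rows[i - 1] == cols[j - 1] else mismatch
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + step:
            a.append(rows[i - 1]); b.append(cols[j - 1]); i -= 1; j -= 1
        elif i > 0 and m[i][j] == m[i - 1][j] + gap:
            a.append(rows[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(cols[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


table = nw_matrix("AAAC", "AGC")          # rows: Patient 2, columns: Patient 1
alignment = nw_align("AAAC", "AGC")
```

Aligned positions that carry the same action in both trajectories (here A and C) are the candidates for aggregated events; their two timestamps form the timestamp set.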

Why is cohort representation useful?
• Storytelling for the cohort: starting with event A, the cohort follows up with B after five months and then C after six months.
Cohort representation: A (t0) → B (t5) → C (t11)
• Decrease hospitalization costs in the first 5-month period of treatment.
• Prepare the medical unit relative to events A, B, and C and not other events.
• Organize care subscriptions in periods of 5-6 months.

Efficiency concerns of cohort representation
• The worst-case complexity of cohort representation is O(|c|² × n²).
• To improve the efficiency of cohort representation, we need to reduce either the length of trajectories (i.e., n) or the number of trajectory comparisons (i.e., |c|).
• Techniques, grouped under “reduction of trajectory length” and “reduction of trajectory comparisons”: filtering out unnecessary comparisons, trajectory families, and stratified sampling.

Trajectory families
• Generate non-overlapping clusters of trajectories in a time-agnostic way.
• Employ k-medoids as the clustering approach to obtain k families.
• Abstract patient trajectories with the medoid of their family.
[Figure: trajectory family F1 with medoid med(F1) = T(p5) and member distances 0.2, 0.4, 0.3, 0.1; trajectory family F2 with medoid med(F2) = T(p3) and member distances 0.4, 0.6, 0.3, 0.1. The other trajectories shown are T(p1), T(p2), T(p4), T(p6), T(p7), T(p8), T(p9), and T(p10).]
Fig. 4. Trajectory families. The distance between patient trajectories is shown on edges as the inverse of their similarity. The upper bound of precision loss for F1 and F2 is loss(F1, F2) = max(0.4, 0.6) = 0.6.
Stratified sampling
• Generate a sampled cohort c′ ⊂ c by picking at random (r × |c|) / |dom(a)| members from each demographic group of a given stratification attribute a.
• A demographic group is identified with an attribute-value pair, e.g., ⟨gender, female⟩ and ⟨gender, male⟩ in case the stratification attribute is “gender”.
• Stratified sampling ensures that the sampled cohort members represent all demographic groups.
• The sampling ratio r is a value between 0 and 1 (exclusive), where values closer to 0 lead to more reduction. The value of r defines the tradeoff between efficiency and precision.
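A minimal sketch of stratified sampling follows, assuming that dom(a) can be approximated by the attribute values present in the cohort and that each group contributes at least one member. Both choices, the function name, and the sample cohort are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(cohort: Dict[str, Dict[str, str]], attribute: str, r: float, seed: int = 0) -> List[str]:
    """Pick about r * |c| / |dom(a)| members at random from each group of `attribute`."""
    if not 0.0 < r < 1.0:
        raise ValueError("the sampling ratio r must lie strictly between 0 and 1")
    groups: Dict[str, List[str]] = defaultdict(list)
    for pid, demographics in cohort.items():
        groups[demographics[attribute]].append(pid)
    # Assumption: at least one member per group, so every group stays represented.
    per_group = max(1, round(r * len(cohort) / len(groups)))
    rng = random.Random(seed)
    sample: List[str] = []
    for value in sorted(groups):
        members = sorted(groups[value])
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical cohort of 10 patients: 6 females and 4 males.
cohort = {f"p{i}": {"gender": "female" if i < 6 else "male"} for i in range(10)}
sampled = stratified_sample(cohort, "gender", r=0.4)
```

With r = 0.4 and two gender groups, each group contributes two members, so the sampled cohort has four patients and both groups remain visible to sequence matching.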

3 Cohort exploration

Ideal cohort exploration
• Experts may be interested in finding cohorts that are similar to a given cohort.
[Figure: from the cohort of patients inside Grenoble, explore towards a candidate cohort for exploration.]

Comparison of cohorts: similarity
• The similarity between two cohorts c1 and c2 is the similarity between their representations Ec1 and Ec2.
• Two cohorts are similar if their common events are aligned, i.e., less dispersed. Hence we consider an inverse definition of dispersion, called “concentration” (Equation 3):
concentration(ē) = 1.0 − dispersion(ē) / Π
• In Equation 3, Π is a normalization factor equal to the largest possible dispersion, i.e., the largest size of the trajectories for patients in ē.P. For instance, given an aggregated event ē1 = ⟨{p1, p2}, x, {t0, t2}⟩ and Π = max(|p1.trajectory|, |p2.trajectory|) = 10, we obtain concentration(ē1) = 1.0 − 0.2 = 0.8, which means ē1 is highly concentrated.
• Given Equation 3, we define the similarity between two cohorts c1 and c2 as follows (Equation 4):
similarity(c1, c2) = average[concentration(ē) s.t. ē1 = ⟨c1, x, τ1⟩ ∈ T(c1) ⋀ ē2 = ⟨c2, x, τ2⟩ ∈ T(c2) ⋀ τ = τ1 ∪ τ2 ⋀ ē = ⟨c1 ∪ c2, x, τ⟩ ∈ Ē]
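The concentration formula and the averaging in the similarity definition can be sketched as below. Several simplifications are assumptions: a cohort representation is reduced to a mapping from action to timestamps, a single normalization factor Π is used for all events, and cohorts with no common action get similarity 0.0 (which matches sim(c, c2) = 0.0 in the running example but is not stated as a rule).

```python
from typing import Dict, List


def concentration(timestamps: List[int], pi: float) -> float:
    """concentration = 1.0 - dispersion / pi, with dispersion taken as the timestamp span."""
    return 1.0 - (max(timestamps) - min(timestamps)) / pi


def similarity(rep1: Dict[str, List[int]], rep2: Dict[str, List[int]], pi: float) -> float:
    """Average concentration of the merged events over the actions common to both cohorts."""
    common = set(rep1) & set(rep2)
    if not common:
        return 0.0  # assumption: no common action means no similarity
    merged = [concentration(list(rep1[x]) + list(rep2[x]), pi) for x in common]
    return sum(merged) / len(merged)


# The slide's example: timestamps {t0, t2} and pi = 10.
example = concentration([0, 2], pi=10)

# Hypothetical representations of two cohorts.
rep_c1 = {"A": [0], "B": [5]}
rep_c2 = {"A": [2], "B": [6]}
sim = similarity(rep_c1, rep_c2, pi=10)
```

For the two hypothetical cohorts, A merges to a span of 2 and B to a span of 1, giving concentrations 0.8 and 0.9 and an average of 0.85.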

Comparison of cohorts: entropy
• Cohort exploration should return a limited number of well-separated similar cohorts to help experts explore different directions in their health-care data.
• The expert-defined parameter ω defines the size of the exploration set.
• Given a set of cohorts C where |C| ≤ ω, we measure the amount of information that C conveys using its Shannon’s information entropy (Equation 5):
entropy(C) = − Σ over (ci, cj ∈ C) of π(ci, cj) × log_|C| π(ci, cj)
• π(ci, cj) captures the amount of patient overlap between the cohorts.
• Equation 5 is an adaptation of edge-weighted graph entropy.

Problem of cohort exploration
• Given a cohort c, a similarity threshold θ, and a size threshold ω, return at most ω cohorts C = {c1, c2, ..., cω} where similarity(c, ci) ≥ θ and entropy(C) is maximized.
• The problem is NP-complete, by a reduction from the Maximum Sub-Edge Graph problem.

Multi-staged cohort exploration
1. Limit the set of candidates to contrast cohorts and sort them in increasing order of peripherality.
2. Compute the similarity of each identified contrast cohort to the given cohort.
3. Return at most ω similar cohorts with maximized entropy.
Fig. 6. Running example for cohort exploration.
0. Input cohort c. Demographics: ⟨gender, female⟩ ⋀ ⟨location, Grenoble⟩ ⋀ ⟨age, old⟩ ⋀ ⟨life, alive⟩. Cohort events: {A,Y,M}.
1. Contrast cohorts and their events. c1 demographics: ⟨gender, male⟩ ⋀ …, events {A,Y,C}. c2 demographics: ⟨location, Lyon⟩ ⋀ …, events {E,B,L}. c3 demographics: ⟨age, young⟩ ⋀ …, events {B,Y,M,D}.
2. Similarity checking and sorting, with θ = 0.5: sim(c,c3) = 0.9, sim(c,c1) = 0.7, sim(c,c2) = 0.0. c2 is discarded by “event sets”.
3. Entropy maximization. Given ω = 2, Cω = {c1, c3}: the entropy of c1 and c3 is maximal.

Contrast cohorts
• A μ-contrast cohort differs in at most μ attribute values or actions.
• We consider μ-contrast cohorts as exploration candidates, as they are similar to the input cohort but also different enough to provide additional insights.
• Materialization of contrast cohorts. Given a cohort c = ⟨demogs, actions⟩, the function contrast(c, μ) returns the set of all μ-contrast cohorts of c, following Equation 11:
contrast(c, μ) = {c′ ∈ C s.t. (|c.demogs ∩ c′.demogs| = |c.demogs| − μ1 ⋀ ∃⟨a, v⟩ ∈ c, ⟨a, v′⟩ ∈ c′, v ≠ v′) ⋁ (|c.actions ∩ c′.actions| = |c.actions| − μ2)}, where μ1 + μ2 = μ
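The notion of a μ-contrast cohort can be made concrete with a short sketch. The dictionary layout of a cohort, the attribute domains, and the decision to exclude the input cohort itself from its own contrast cohorts are assumptions; the helper names are hypothetical.

```python
from typing import Dict, List

Cohort = Dict[str, object]  # {"demogs": {attribute: value}, "actions": set of actions}


def contrast_difference(c: Cohort, other: Cohort) -> int:
    """Number of c's attribute values and actions that `other` does not share."""
    demog_diff = sum(1 for a, v in c["demogs"].items() if other["demogs"].get(a) != v)
    action_diff = len(set(c["actions"]) - set(other["actions"]))
    return demog_diff + action_diff


def is_mu_contrast(c: Cohort, other: Cohort, mu: int) -> bool:
    """True if `other` differs from c in at least one and at most mu values or actions."""
    return 0 < contrast_difference(c, other) <= mu


def one_contrast_by_demographics(c: Cohort, domains: Dict[str, List[str]]) -> List[Cohort]:
    """Enumerate the cohorts obtained by changing exactly one attribute value of c."""
    result: List[Cohort] = []
    for attribute, value in c["demogs"].items():
        for alternative in domains[attribute]:
            if alternative != value:
                result.append({"demogs": {**c["demogs"], attribute: alternative},
                               "actions": set(c["actions"])})
    return result


# Input cohort of the running example; the attribute domains are hypothetical.
c = {"demogs": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
     "actions": set()}
domains = {"gender": ["female", "male"], "location": ["Grenoble", "Lyon", "Paris"],
           "age": ["young", "middle", "old"], "life": ["alive", "dead"]}
candidates = one_contrast_by_demographics(c, domains)
c1 = {"demogs": {**c["demogs"], "gender": "male"}, "actions": set()}
c13 = {"demogs": {**c["demogs"], "gender": "male", "age": "young"}, "actions": set()}
```

Even with four attributes and small domains there are already six 1-contrast cohorts, and the count grows quickly with μ, which is why the later stages prune candidates before computing similarities.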

Entropy maximization
• We can achieve an optimal execution of our greedy algorithm (and ensure a 1−1/e approximation guarantee) only if there is a total order between the exploration candidates.
• The peripherality function is a common centrality measure in social network analysis.
• The intuition behind this measure is that more peripheral cohorts (i.e., cohorts that are farther from other cohorts) contribute more to entropy.
• The peripherality of a cohort from an input cohort is defined as the inverse of their closeness, i.e., their amount of overlap.
• Greedy algorithm (outline of the listing shown on the slide): build the event sets of the candidate cohorts; keep the candidates that share events with the input cohort and pass the similarity check; sort the candidates; then repeatedly take the next candidate c* and add it to the result set.
• The algorithm terminates in two different cases: (i) the result set is filled with ω cohorts, or (ii) no better option exists to improve entropy.
Efficiency concerns of cohort exploration
• The bottleneck of cohort exploration is similarity computation.
• By avoiding similarity computation for “irrelevant” contrast cohorts, the execution time improves drastically.
• Inspired by double dictionary encoding, we build an event set for each contrast cohort regardless of the events’ time of occurrence. Event sets enable early pruning of irrelevant cohorts.
• In the running example (Fig. 6), c2 with events {E,B,L} shares no event with the input cohort’s events {A,Y,M}, so it is discarded by “event sets” before any similarity computation.
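Event-set pruning amounts to a cheap set-intersection test performed before the expensive similarity computation. The sketch below replays the running example; the function names are hypothetical, and the similarity values are the ones given in the example rather than computed.

```python
from typing import Dict, List, Set


def prune_by_event_sets(input_events: Set[str], candidates: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep only the contrast cohorts that share at least one event with the input cohort."""
    return {name: events for name, events in candidates.items() if input_events & events}


def rank_by_similarity(similarities: Dict[str, float], theta: float) -> List[str]:
    """Keep the cohorts reaching the threshold theta, most similar first."""
    kept = [name for name, sim in similarities.items() if sim >= theta]
    return sorted(kept, key=lambda name: similarities[name], reverse=True)


input_events = {"A", "Y", "M"}
contrast_events = {"c1": {"A", "Y", "C"}, "c2": {"E", "B", "L"}, "c3": {"B", "Y", "M", "D"}}
survivors = prune_by_event_sets(input_events, contrast_events)

# Similarities are computed only for the survivors; values taken from the example.
ranked = rank_by_similarity({"c1": 0.7, "c3": 0.9}, theta=0.5)
```

Because the event sets ignore timestamps, the test costs one set intersection per candidate, whereas each similarity computation requires sequence matching over trajectories.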

4 Experiments

Performance of cohort representation
[Figure: execution time (ms) versus cohort size for Agir and Rambam.] Fig. 8. Execution time of cohort representation with 10, 50, 100, and 200 trajectory families, respectively from left to right.
[Figure: execution time (ms) versus cohort size (50, 100, 200) for the stratification attributes age, life, and gender, and for random sampling.] Execution time of cohort representation with stratified sampling on AGIR with sampling ratios of 0.2, 0.4, 0.6, and 0.8, respectively from left to right.
• For Rambam, the maximum execution times are smaller and do not exceed 100 ms, 150 ms, and 610 ms, respectively. The reason is that Agir has longer trajectories and requires more time to aggregate events and build representations.
• In case of 200 trajectory families, a cohort may end up with all 200 medoids, which results in a large number of trajectory comparisons. Although cohorts of size 5000 or smaller can be executed in attention-preserving latency, larger cohorts may exceed this latency threshold.
• While trajectory families and stratified sampling improve execution time, we still need to verify how much loss they entail. When varying the number of trajectory families from 5 to 200, the precision loss decreases in both datasets as the number of families increases. For instance, having only 5 trajectory families leads to an 83% and 81% loss.

Quality of representations
• Fitness verifies if trajectories of cohort members have a footprint in the representation.
• Replayability verifies if all actions in the trajectories of cohort members are observed in the representation.
• Specificity verifies how specific the representation is to the cohort members.
[Figure: Fig. 14. Quality of cohort representation. Panels plot fitness, replayability, and specificity against cohort size (50, 100, 200) for threshold values 0.2, 0.5, and 0.8, and fitness against the number of trajectory families and the sampling ratio, for Agir and Rambam.]

Performance of cohort exploration
[Figure: Fig. 16. Execution time (ms) of cohort exploration as a function of the contrast cohort difference μ (top), the number of exploration options ω (middle), and the similarity threshold θ, on Agir (left) and Rambam (right). The panels for μ compare execution with and without event sets.]
• Between μ = 1 and μ = 2, the number of possible cohorts grows by 3 orders of magnitude, but the execution time is only one order of magnitude worse.

Conclusion
• A data-driven framework for medical cohort representation and exploration.
• Representation builds a concise representation of a cohort by pruning insignificant events.
• Exploration relies on finding contrast cohorts as exploration candidates.
• For an efficient computation of cohort representation, we employ “trajectory families” and “stratified sampling”, and for cohort exploration, we employ “event sets”.
• We plan to deploy a distributed infrastructure where different components of representation and exploration can be performed in parallel.

Thank you!
Cohort Representation and Exploration
May 29, 2019 @ SLIDE
