Cohort Representation and Exploration
Behrooz Omidvar-Tehrani
omidvar.info
May 29, 2019 @ SLIDE
Cohort Representation and Exploration | Behrooz Omidvar-Tehrani
Data Exploration (enabling interactions with data) for customized analysis, originated from Exploratory Data Analysis (EDA) in statistics
My research
Exploring Data Exploration
1. PhD in mining, exploring, and visualizing user groups (with Sihem Amer-Yahia)
[CIKM’15] [EDBT’15] [PKDD’16] [DEXA’17] [ICDE Demo’18] [VLDBJ’18] [EDBT’18] [CIKM Tutorial’18] [TKDE’19] [HILDA’19] [SIGMOD Tutorial’19]
2. Mini-Postdoc in exploring student progression in the French medical education system (with Marie-Christine Rousset)
[J. AI in Med ’19]
3. Postdoc in exploring spatiotemporal data (with Arnab Nandi at OSU)
[ICDE Demo’17] [UrbanGIS’17] [SIGSPATIAL’17]
4. Postdoc in exploring user data and medical data (with Sihem Amer-Yahia)
[DMAH’18] [DSAA’18] [VLDB Demo’19]
5. Research scientist in exploring POI recommendations (with Naver Labs Europe)
Medical data analysis
Before: public health analysis
Now: precision medicine
Public Health Analysis
Overall health-related conclusions
about the masses
Precision Medicine
A medical model for the customization of healthcare, with medical decisions and treatments being tailored to the individual patient.
National Institutes of Health (NIH)
Precision medicine for medical cohorts
Public health analysis: How does air pollution affect people?
Precision medicine: How does air pollution affect Julia’s health?
Towards more customization: explore cohorts of patients and their trajectories.
How does air pollution affect the health status of middle-aged females in Paris suffering from sleep apnea?
Why patient cohorts?
Julia, who is suffering from epilepsy, is trying to understand what has caused an increase in the frequency of her absence seizures.
Cohort of patients suffering from epilepsy.
External benefit, for hospitals (a better cohort): medical units prepare special care for the specific needs of the cohort members.
Internal benefit, for Julia (a better me): Julia observes the profiles of look-alike (or like-minded) patients and sees where she falls relative to the norm.
Medical cohort analysis
Medical cohort analysis exhibits how the health of a set of patients is affected by treatments and diseases.
Representation: what has happened?
Exploration: what happened to similar cohorts?
Prediction: what will happen next?
Example: the cohort of middle-aged females in Paris suffering from sleep apnea
Cohort representation
Help medical experts understand which sequences of treatments and diseases lead to a final health status.
Example questions for the cohort of middle-aged females in Paris suffering from sleep apnea:
• Which sequence of treatments is the most relevant to death?
• Which treatment, administered right after admission, kept cohort members alive for a longer period?
• What changes in Body Mass Index (BMI) lead to death?
Cohort exploration
Enable navigation in the space of cohorts to compare their representations and see how their health evolves.
Explore: from the cohort of patients inside Grenoble suffering from sleep apnea to the cohort of patients outside Grenoble suffering from sleep apnea.
How do their sequences of treatments and diseases evolve?
Outline
0 Introduction
1 Health-care data model
2 Cohort representation
3 Cohort exploration
4 Experiments
Health-care datasets
Number of patients: 56,286 (Agir); 260,099 (Rambam)
Number of actions: 1,845 (Agir); 428 (Rambam)
Number of events: 1,543,263 (Agir); 1,300,987 (Rambam)
Time period: January 2000 to December 2018 (Agir); January 2004 to October 2007 (Rambam)
Demographic attributes: gender, age, location, life status (Agir); gender, age, life status (Rambam)
Types of actions: treatment, etiology, BMI marker, sleepiness marker, hospitalization (Agir); visit outcome, visited hospital department, visited ward, emergency stay, hospitalization (Rambam)
Star schema of health-care datasets
AGIR-à-Dom dataset for 56,284 patients with respiratory diseases.
Patient: Patient ID, Gender, Birth year, Postal code, Life status, Death date
Health service (324,017 records): Patient ID, Service duration, Service ID
Etiology (223,142 records): Patient ID, Etiology date, Etiology label
Compliance (659,361 records): Patient ID, Compliance date, Compliance ID
Hospitalization (28,712 records): Patient ID, Hospitalization duration, Hospital name, Service ID
Insurance (149,624 records): Patient ID, Insurance duration, Insurer, Insurance type, Exemption
Fatigue marker (217,581 records): Patient ID, Observation date, Marker value
BMI marker (390,028 records): Patient ID, Observation date, Height value, Weight value, Marker value
Respiration marker (151,810 records): Patient ID, Observation date, Marker value
Patient events
We want to express all health-care events in the following form.
event: ⟨patient, action, timestamp⟩
Patient trajectories
• A patient trajectory is a list of temporally sorted events for the patient.
Example timeline with timestamps t0, t1, t2, …, t10, t14, where t0 is the admission time (first appearance of the patient in our health-care database). Events shown, in order: BMI_obese, epworth_sleepy; hospitalization_begin; oxygen_begin; hospitalization_end; BMI_obese, epworth_normal, oxygen_end.
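The event triple and the trajectory definition above can be sketched in a few lines of Python. This is only an illustration: the class name `Event`, the helper `build_trajectories`, and the use of integer timestamps counted from admission are assumptions made for the example, not part of the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One health-care event: <patient, action, timestamp>."""
    patient: str
    action: str
    timestamp: int  # assumed: time units elapsed since admission (t0 = 0)


def build_trajectories(events):
    """Group events by patient and sort each group by timestamp."""
    by_patient = defaultdict(list)
    for event in events:
        by_patient[event.patient].append(event)
    return {
        patient: sorted(patient_events, key=lambda e: e.timestamp)
        for patient, patient_events in by_patient.items()
    }


# Toy events, deliberately out of order.
events = [
    Event("p1", "hospitalization_begin", 1),
    Event("p1", "BMI_obese", 0),
    Event("p1", "oxygen_begin", 2),
    Event("p2", "BMI_overweight", 0),
]
trajectories = build_trajectories(events)
print([e.action for e in trajectories["p1"]])
```

Keeping the event immutable (`frozen=True`) makes it safe to share the same event objects between several cohorts.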
Patients’ cohort
• A cohort is a set of patients defined with a set of predicates.
• For instance, the cohort of middle-aged females in Paris suffering from sleep apnea has the following 3 patients.
Figure: trajectories of Patient 1, Patient 2, and Patient 3 on timelines starting at t0 (timestamps shown: t0, t3, t8, t14, t19; t0, t5, t11, t15; t0, t3, t9, t15, t19), with events such as BMI_obese, BMI_overweight, oxygen_begin, oxygen_end, hospitalization_begin, hospitalization_end, and epworth_borderline.
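A cohort defined by predicates can be illustrated as a simple filter over demographic attributes. The sketch below is an assumption-laden example: the function name `select_cohort` and the dictionary layout are hypothetical, and the toy patient table mirrors the seven-patient example shown later in the deck on the methodology slide.

```python
# Toy demographics table (attribute names follow the methodology slide).
patients = {
    "p1": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
    "p2": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
    "p3": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
    "p4": {"gender": "female", "location": "Paris", "age": "middle", "life": "dead"},
    "p5": {"gender": "male", "location": "Grenoble", "age": "old", "life": "alive"},
    "p6": {"gender": "female", "location": "Paris", "age": "old", "life": "dead"},
    "p7": {"gender": "male", "location": "Grenoble", "age": "old", "life": "alive"},
}


def select_cohort(patients, predicates):
    """Return the ids of patients satisfying every <attribute, value> predicate."""
    return sorted(
        pid
        for pid, demographics in patients.items()
        if all(demographics.get(attr) == value for attr, value in predicates.items())
    )


cohort = select_cohort(
    patients,
    {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
)
print(cohort)
```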
Cohort representation
• Goal. Help medical experts understand the sequence of treatments and diseases for patients over time.
• Challenge. Cohorts often consist of hundreds of patients with various types of events.
Figure: a cohort with 500 patients (the trajectories of Patient 1 through Patient 500).
Ideal representation of a cohort
• An ideal representation for medical experts should describe a single end-to-end story for the cohort, and be succinct and readable.
Expert: What has happened to the cohort of middle-aged females in Paris with bronchitis who stay alive?
Cohort representation system: The cohort starts with oxygen therapy, follows with ventilation after 11 months, and takes another ventilation treatment one month later.
Finding ideal cohort representations
• First off, how can we compare two patient trajectories?
Patient 1’s trajectory: A B C C
Patient 2’s trajectory: A A C D A C
Other event shown in the figure: E
Representation:
Set-based: {A,C}
Order-based: A→C→C
Time-based: A (0.7) → C (1.0) → C (0.6)
Aggregated event
event: ⟨patient, action, timestamp⟩
aggregated event: ⟨patients, action, timestamps⟩
Quality of aggregated events
• Dispersion: an aggregated event is dispersed if its timestamps are spread in time.
• Significance: an aggregated event is more significant if it contains several different timestamps.
For instance, given two aggregated events ē1 = ⟨P, x, {t0, t6}⟩ and ē2 = ⟨P′, x, {t1, t1, t3, t4}⟩ (P, P′ ⊆ 𝒫), ē2 is more significant than ē1: as ē2 contains more events, it is more significant. Note that while ē2 has higher significance than ē1, the latter has more dispersion.
dispersion(ē1) = 6, significance(ē1) = 2
dispersion(ē2) = 3, significance(ē2) = 4
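The two quality measures can be checked against the worked numbers on this slide. The sketch below assumes, based only on those numbers, that dispersion is the span between the earliest and latest timestamp and that significance is the number of timestamps with duplicates counted; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregatedEvent:
    """Aggregated event: <patients, action, timestamps>."""
    patients: frozenset
    action: str
    timestamps: tuple  # multiset of timestamps, duplicates are kept


def dispersion(event):
    """Assumed definition: span between the earliest and latest timestamp."""
    return max(event.timestamps) - min(event.timestamps)


def significance(event):
    """Assumed definition: number of timestamps, duplicates included."""
    return len(event.timestamps)


# The two aggregated events from the slide, with t_i encoded as the integer i.
e1 = AggregatedEvent(frozenset({"p1", "p2"}), "x", (0, 6))
e2 = AggregatedEvent(frozenset({"p1", "p2"}), "x", (1, 1, 3, 4))

print(dispersion(e1), significance(e1))
print(dispersion(e2), significance(e2))
```

Under these assumptions the code reproduces dispersion 6 and significance 2 for ē1, and dispersion 3 and significance 4 for ē2.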
Problem of cohort representation
• Given a cohort c and a significance threshold σ, the problem of cohort representation is to find all aggregated events Ēc where, for each ē ∈ Ēc, the following two conditions are satisfied:
1. significance(ē) ≥ σ,
2. dispersion(ē) is minimized.
• The problem is NP-complete by a reduction from the Multiple Sequence Alignment (MSA) problem.
Our cohort representation methodology
0. EHR dataset (patients)
patient | gender | location | age | life
p1 | female | Grenoble | old | alive
p2 | female | Grenoble | old | alive
p3 | female | Grenoble | old | alive
p4 | female | Paris | middle | dead
p5 | male | Grenoble | old | alive
p6 | female | Paris | old | dead
p7 | male | Grenoble | old | alive
1. Cohort specification
A medical expert specifies a cohort.
cohort demographics: ⟨gender, female⟩ ∧ ⟨location, Grenoble⟩ ∧ ⟨age, old⟩ ∧ ⟨life, alive⟩
cohort actions: ∅
cohort members: p1, p2, p3
2. Trajectories of the cohort’s members
Trajectories of p1, p2, and p3 on a common timeline (t0, …, t4, t5, t6, t7, t8, t9, t10, t11), where t0 is the admission time (time zero). Events shown: A, B, C, D, E, H.
3. Sequence matching
Aggregated events between T(p1) and T(p2): ⟨{p1,p2}, A, {t0,t0}⟩, ⟨{p1,p2}, B, {t5,t6}⟩, ⟨{p1,p2}, C, {t10,t10}⟩
Aggregated events between T(p2) and T(p3): ⟨{p2,p3}, A, {t0,t0}⟩, ⟨{p2,p3}, B, {t5,t6}⟩
Aggregated events between T(p1) and T(p3): ⟨{p1,p3}, A, {t0,t0}⟩, ⟨{p1,p3}, B, {t5,t5}⟩
4. Compute significance score
aggregated event | count | final score
⟨{p1,p2,p3}, A, {t0,t0,t0,t0,t0,t0}⟩ | 3 | 1.00
⟨{p2,p3}, B, {t5,t5}⟩ | 1 | 0.33
⟨{p1,p2,p3}, B, {t5,t5,t5,t5,t5,t6}⟩ | 3 | 1.00
⟨{p1,p2}, C, {t10,t10}⟩ | 1 | 0.33
5. Cohort representation given σ = 0.8: events A and B
6. Cohort representation given σ = 0.3: events A, B, and C
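Steps 4 to 6 of the figure can be replayed with a few lines of Python. The sketch assumes that the final score is the count divided by the number of pairwise trajectory comparisons (3 for a cohort of three patients), which is consistent with the scores 1.00 and 0.33 in the table; the function name `representation` and the tuple layout are illustrative.

```python
PAIR_COUNT = 3  # number of trajectory pairs among p1, p2, p3

# (action, patients, count) rows copied from the score table of the figure.
scored_events = [
    ("A", ("p1", "p2", "p3"), 3),
    ("B", ("p2", "p3"), 1),
    ("B", ("p1", "p2", "p3"), 3),
    ("C", ("p1", "p2"), 1),
]


def representation(scored, sigma, pair_count=PAIR_COUNT):
    """Keep the actions whose normalized score reaches the threshold sigma."""
    kept = []
    for action, _patients, count in scored:
        score = count / pair_count  # assumed normalization
        if score >= sigma and action not in kept:
            kept.append(action)
    return kept


print(representation(scored_events, 0.8))
print(representation(scored_events, 0.3))
```

A high threshold keeps only the events shared by all pairs, while a lower threshold lets less frequent events such as C enter the representation.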
Generation of aggregated events using sequence matching
• Sequence matching was initially proposed to find matches between protein sequences and discover homologous protein pairs.
• In the Needleman-Wunsch algorithm, we first build the Manhattan graph using cost tables. Then we back-chain to obtain alignments.
• In the absence of cost tables, we use default cost templates, i.e., +1 for match, -1 for mismatch, and -2 for gap.
Patient 1’s trajectory: A, G, C (timestamps t0, t15, t42)
Patient 2’s trajectory: A, A, A, C (timestamps t0, t3, t12, t72)
Manhattan graph (columns: Patient 1’s trajectory; rows: Patient 2’s trajectory):
- | BEGIN | A | G | C
BEGIN | 0 | -2 | -4 | -6
A | -2 | 1 | -1 | -3
A | -4 | -1 | 0 | -2
A | -6 | -3 | -2 | -1
C | -8 | -5 | -4 | -1
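The score table above can be reproduced with a textbook Needleman-Wunsch implementation using the default costs from the slide (+1 match, -1 mismatch, -2 gap). The sketch below is a generic version of the algorithm, not the authors' code; the function names and the tie-breaking order in the back-chaining step (diagonal first) are choices made for the example.

```python
def needleman_wunsch_matrix(rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment score matrix for two action sequences."""
    n, m = len(rows), len(cols)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two actions
                score[i - 1][j] + gap,       # gap in the column sequence
                score[i][j - 1] + gap,       # gap in the row sequence
            )
    return score


def backchain(score, rows, cols, match=1, mismatch=-1, gap=-2):
    """Trace back from the bottom-right cell to recover one alignment."""
    i, j = len(rows), len(cols)
    aligned = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned.append((rows[i - 1], cols[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((rows[i - 1], None))
            i -= 1
        else:
            aligned.append((None, cols[j - 1]))
            j -= 1
    return aligned[::-1]


patient1 = ["A", "G", "C"]        # columns of the table
patient2 = ["A", "A", "A", "C"]   # rows of the table
matrix = needleman_wunsch_matrix(patient2, patient1)
alignment = backchain(matrix, patient2, patient1)
matches = [a for a, b in alignment if a == b]
print(matrix[-1])
print(matches)
```

The matched pairs (here the actions A and C) are the candidates for aggregated events between the two trajectories.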
Why is cohort representation useful?
• Storytelling for the cohort: starting with event A, the cohort follows up with B after five months and then C after six more months.
Cohort representation: A (t0) → B (t5) → C (t11)
Decrease hospitalization costs in the first 5-month period of treatment.
Prepare the medical unit relative to events A, B, and C and not other events.
Organize care subscriptions in periods of 5-6 months.
Efficiency concerns of cohort representation
• The worst-case complexity of cohort representation is O(|c|² × n²).
• To improve the efficiency of cohort representation, we need to reduce either the length of trajectories (i.e., n) or the number of trajectory comparisons (i.e., |c|).
Techniques shown: filtering out unnecessary comparisons, trajectory families, and stratified sampling, grouped under reduction of trajectory length and reduction of trajectory comparisons.
Trajectory families
• Generate non-overlapping clusters of trajectories in a time-agnostic way.
• Employ k-medoids as the clustering approach to obtain k families.
• Abstract patient trajectories with the medoid of their family.
Fig. 4. Trajectory families. The distance between patient trajectories is shown on edges as the inverse of their similarity. The upper bound of precision loss for F1 and F2 is loss(F1, F2) = max(0.4, 0.6) = 0.6. It originates from the …
Figure content: trajectory family F1 with med(F1) = T(p5) and trajectory family F2 with med(F2) = T(p3); the other trajectories shown are T(p1), T(p2), T(p4), T(p6), T(p7), T(p8), T(p9), and T(p10), with edge distances between 0.1 and 0.6.
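A minimal sketch of the family-building step follows, assuming a simple alternating k-medoids procedure and a Jaccard distance over the sets of actions as the time-agnostic distance. Both choices are stand-ins for illustration; the talk does not specify the distance function or the k-medoids variant, and all names are hypothetical.

```python
def jaccard_distance(a, b):
    """Time-agnostic distance between two trajectories given as action sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def k_medoids(items, dist, initial_medoids, max_iter=100):
    """Alternate between assigning items to medoids and re-electing medoids."""
    medoids = list(initial_medoids)
    families = {}
    for _ in range(max_iter):
        families = {m: [] for m in medoids}
        for item in sorted(items):
            nearest = min(medoids, key=lambda m: dist(items[item], items[m]))
            families[nearest].append(item)
        new_medoids = [
            min(members, key=lambda c: sum(dist(items[c], items[x]) for x in members))
            for members in families.values()
            if members
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return families


def loss_upper_bound(families, items, dist):
    """Largest distance between any trajectory and the medoid of its family."""
    return max(
        dist(items[member], items[medoid])
        for medoid, members in families.items()
        for member in members
    )


# Toy trajectories reduced to their action sets.
trajectories = {
    "p1": {"A", "B", "C"},
    "p2": {"A", "B"},
    "p3": {"A", "B", "C", "D"},
    "p4": {"X", "Y"},
    "p5": {"X", "Y", "Z"},
    "p6": {"X"},
}
families = k_medoids(trajectories, jaccard_distance, ["p1", "p4"])
print(families)
print(loss_upper_bound(families, trajectories, jaccard_distance))
```

Once families are built, only the medoids need to be compared pairwise, which is where the reduction in trajectory comparisons comes from.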
Stratified sampling
• Generate a sampled cohort c′ ⊂ c by picking at random (r × |c|) / |dom(a)| members from each demographic group of a given stratification attribute a.
• A demographic group is identified with an attribute-value pair, e.g., ⟨gender, female⟩ and ⟨gender, male⟩ in case the stratification attribute is “gender”.
• Stratified sampling ensures that the sampled cohort members represent all demographic groups.
• The sampling ratio r is a value between 0 and 1 (exclusive), where values closer to 0 lead to more reduction. The value of r defines the tradeoff between efficiency and precision.
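The sampling rule can be sketched as follows. The example assumes that |dom(a)| is the number of attribute values actually present in the cohort, that the per-group quota is rounded to the nearest integer (at least 1), and that a group smaller than its quota contributes all of its members; the function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(cohort, attribute, r, seed=0):
    """Pick about r*|c|/|dom(a)| members at random from each group of `attribute`."""
    if not 0 < r < 1:
        raise ValueError("sampling ratio r must be strictly between 0 and 1")
    groups = defaultdict(list)
    for pid, demographics in cohort.items():
        groups[demographics[attribute]].append(pid)
    quota = max(1, round(r * len(cohort) / len(groups)))
    rng = random.Random(seed)  # seeded for reproducibility of the example
    sample = []
    for value in sorted(groups):
        members = sorted(groups[value])
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy cohort: 6 females and 4 males.
cohort = {f"p{i}": {"gender": "female" if i <= 6 else "male"} for i in range(1, 11)}
sample = stratified_sample(cohort, "gender", r=0.4)
print(sample)
```

With r = 0.4 and two gender groups, the quota is 2 members per group, so both groups remain represented in the 4-member sample.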
Ideal cohort exploration
• Experts may be interested in finding cohorts that are similar to a given cohort.
Explore: from the cohort of patients inside Grenoble suffering from sleep apnea to a candidate cohort for exploration (?).
Comparison of cohorts: similarity
• The similarity between two cohorts c1 and c2 is the similarity between their representations Ec1 and Ec2.
In the following, we first provide formal definitions for “similarity” and “entropy”, and then define the problem of cohort exploration.
Similarity in cohort exploration. Two cohorts are similar if their common events are aligned, i.e., less dispersed. Hence we consider an inverse definition of dispersion, called “concentration”, as follows.
\[ \mathrm{concentration}(\bar{e}) = 1.0 - \frac{\mathrm{dispersion}(\bar{e})}{\Pi} \tag{3} \]
In Equation 3, Π is a normalization factor equal to the largest possible dispersion, i.e., the largest size of the trajectories for patients in ē.P. For instance, given an aggregated event ē1 = ⟨{p1, p2}, x, {t0, t2}⟩ and Π = max(|p1.trajectory|, |p2.trajectory|) = 10, we obtain concentration(ē1) = 1.0 − 0.2 = 0.8, which means ē1 is highly concentrated.
Given Equation 3, we now define similarity between two cohorts c1 and c2 as follows.
\[ \mathrm{similarity}(c_1, c_2) = \mathrm{average}\big[\, \mathrm{concentration}(\bar{e}) \ \text{s.t.}\ \bar{e}_1 = \langle c_1, x, \tau_1 \rangle \in T(c_1) \wedge \bar{e}_2 = \langle c_2, x, \tau_2 \rangle \in T(c_2) \wedge \tau = \tau_1 \cup \tau_2 \wedge \bar{e} = \langle c_1 \cup c_2, x, \tau \rangle \in \bar{E} \,\big] \tag{4} \]
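Equations 3 and 4 can be illustrated with a short sketch. It assumes that a cohort representation is given as a mapping from an action to its list of timestamps, that the normalization factor Π is passed in by the caller, and that two cohorts with no common action have similarity 0.0 (which matches sim(c, c2) = 0.0 in the running example of the deck). The function names are illustrative.

```python
def dispersion(timestamps):
    """Span between the earliest and latest timestamp."""
    return max(timestamps) - min(timestamps)


def concentration(timestamps, pi):
    """Equation 3: 1.0 minus the dispersion normalized by pi."""
    return 1.0 - dispersion(timestamps) / pi


def similarity(rep1, rep2, pi):
    """Equation 4 (sketch): average concentration of the merged common events."""
    common = sorted(set(rep1) & set(rep2))
    if not common:
        return 0.0  # assumption: no common event means no similarity
    merged = [concentration(list(rep1[a]) + list(rep2[a]), pi) for a in common]
    return sum(merged) / len(merged)


# Worked example from the slide: timestamps {t0, t2} and pi = 10.
print(concentration([0, 2], 10))

# Two toy representations sharing the actions A and B.
rep1 = {"A": [0], "B": [5]}
rep2 = {"A": [0], "B": [6], "C": [10]}
print(similarity(rep1, rep2, 10))
```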
Comparison of cohorts: entropy
• Cohort exploration should return a limited number of well-separated similar cohorts to help experts explore different directions in their health-care data.
• Given a set of cohorts C where |C| ≤ ω, we measure the amount of information that C conveys using its Shannon’s information entropy.
We consider an expert-defined parameter ω which defines the size of the exploration set. Given a set of cohorts C̄_c^ω ⊂ C where |C̄_c^ω| ≤ ω, we measure the amount of information that C̄_c^ω conveys using its Shannon’s information entropy, defined as follows.
\[ \mathrm{entropy}(\bar{C}^{\omega}_{c}) = -\sum_{(c_i, c_j \in \bar{C}^{\omega}_{c})} \pi(c_i, c_j) \times \log_{|\bar{C}^{\omega}_{c}|} \pi(c_i, c_j) \tag{5} \]
Equation 5 is an adaptation of edge-weighted graph entropy introduced in [18].
Annotation on the equation: it captures the amount of patient overlap between the cohorts.
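Equation 5 can be sketched in code once a pairwise function π is available. The definition of π is not recoverable from the slide, so the example below plugs in a hypothetical stand-in (the Jaccard overlap of the patient sets); it also assumes the sum runs over unordered pairs and that a set with fewer than two cohorts has entropy 0.

```python
import math
from itertools import combinations


def entropy(cohorts, pi):
    """Equation 5 (sketch): -sum of pi * log_|C|(pi) over pairs of cohorts."""
    k = len(cohorts)
    if k < 2:
        return 0.0  # assumption: the logarithm base |C| needs at least 2 cohorts
    total = 0.0
    for ci, cj in combinations(cohorts, 2):
        p = pi(ci, cj)
        if p > 0:
            total -= p * math.log(p, k)
    return total


def jaccard_overlap(ci, cj):
    """Hypothetical stand-in for pi: share of patients the two cohorts have in common."""
    union = ci | cj
    return len(ci & cj) / len(union) if union else 0.0


cohort_a = frozenset({"p1", "p2"})
cohort_b = frozenset({"p2", "p3"})
print(entropy([cohort_a, cohort_b], jaccard_overlap))
```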
Problem of cohort exploration
• Given a cohort c, a similarity threshold θ, and a size threshold ω, return at most ω cohorts C = {c1, c2, ..., cω} where similarity(c, ci) ≥ θ and entropy(C) is maximized.
• The problem is NP-complete by a reduction from the Maximum Sub-Edge Graph problem.
Multi-staged cohort exploration
1. Limit the set of candidates to contrast cohorts and sort them in increasing order of peripherality.
2. Compute the similarity of each identified contrast cohort to the given cohort.
3. Return ω similar cohorts with maximized entropy.
Fig. 6 Running example for cohort exploration.
0. Input cohort
cohort demographics: ⟨gender, female⟩ ∧ ⟨location, Grenoble⟩ ∧ ⟨age, old⟩ ∧ ⟨life, alive⟩
cohort events: {A,Y,M}
1. Contrast cohorts and their events
c1 demographics: ⟨gender, male⟩ ∧ …; c1 events: {A,Y,C}
c2 demographics: ⟨location, Lyon⟩ ∧ …; c2 events: {E,B,L}
c3 demographics: ⟨age, young⟩ ∧ …; c3 events: {B,Y,M,D}
…
2. Similarity checking and sorting (θ = 0.5)
sim(c,c3) = 0.9
sim(c,c1) = 0.7
sim(c,c2) = 0.0 (discarded by “event sets”)
3. Entropy maximization
Given ω = 2, Cω = {c1, c3}. The entropy of c1 and c3 is maximal.
Contrast cohorts
• A μ-contrast cohort differs in at most μ attribute values or actions.
• We consider μ-contrast cohorts as exploration candidates, as they are similar to the input cohort but also different enough to provide additional insights.
The algorithm terminates in two different cases: (i) C̄_c^ω is filled with ω cohorts (line 10), (ii) no better option exists to improve entropy (line 12). The flow of the algorithm is detailed as follows.
Materialization of contrast cohorts. Given a cohort c = ⟨demogs, actions⟩, the function contrast(c, μ) returns the set of all μ-contrast cohorts of c, following Equation 11.
\[ \mathrm{contrast}(c, \mu) = \{\, c' \in C \ \text{s.t.}\ (|c.\mathit{demogs} \cap c'.\mathit{demogs}| = |c.\mathit{demogs}| - \mu_1 \wedge \exists \langle a, v \rangle \in c, \langle a, v' \rangle \in c', v \neq v') \vee (|c.\mathit{actions} \cap c'.\mathit{actions}| = |c.\mathit{actions}| - \mu_2) \,\}, \quad \mu_1 + \mu_2 = \mu \tag{11} \]
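The demographic half of Equation 11 can be illustrated by enumerating the cohorts that differ from the input cohort in exactly μ attribute values (that is, μ1 = μ and μ2 = 0). This is a simplified sketch: the function name, the attribute domains, and the decision to ignore actions are assumptions made for the example.

```python
from itertools import combinations, product


def contrast_cohorts(demogs, domains, mu):
    """Enumerate cohorts whose demographics differ in exactly `mu` attribute values."""
    results = []
    for changed in combinations(sorted(demogs), mu):
        # For each changed attribute, list the values other than the current one.
        alternatives = [[v for v in domains[a] if v != demogs[a]] for a in changed]
        for values in product(*alternatives):
            candidate = dict(demogs)
            candidate.update(zip(changed, values))
            results.append(candidate)
    return results


# Hypothetical attribute domains for the example.
domains = {
    "gender": ["female", "male"],
    "location": ["Grenoble", "Lyon", "Paris"],
    "age": ["young", "middle", "old"],
}
input_cohort = {"gender": "female", "location": "Grenoble", "age": "old"}

one_contrast = contrast_cohorts(input_cohort, domains, 1)
two_contrast = contrast_cohorts(input_cohort, domains, 2)
print(len(one_contrast), len(two_contrast))
```

The number of candidates grows quickly with μ, which is why the later pruning steps matter.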
Entropy maximization
• We can achieve an optimal execution of our greedy algorithm (and ensure a 1 − 1/e approximation guarantee) only if there is a total order between the exploration candidates.
• The peripherality function is a common centrality measure in social network analysis.
• The intuition behind this measure is that more peripheral cohorts (i.e., those farther from other cohorts) contribute more to entropy.
• The peripherality of a cohort from an input cohort is defined as the inverse of their closeness, i.e., their amount of overlap.
Algorithm excerpt (lines 2 to 18):
2 S(C̄) ← events of (C̄)
3 for c′ ∈ C̄ do
4   if S(c) ∩ S(c′) ≠ ∅ ∧ sim… then
5     C̄ ← C̄ \ c′
6   end
7 end
8 C̄ ← sort(C̄, …)
9 C̄_c^ω ← ∅
10 while |C̄_c^ω| < ω do
11   c* ← get_next(C̄)
12   if entropy(C̄_c^ω ∪ c*) … then
13     return (C̄_c^ω)
14   end
15   C̄_c^ω ← C̄_c^ω ∪ c*
16   C̄ ← C̄ \ c*
17 end
18 return C̄_c^ω
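The greedy loop (lines 9 to 18 of the excerpt) can be sketched as below. Since the stopping condition on line 12 is truncated in the source, the sketch assumes it means "adding the next candidate does not improve entropy". The candidates are assumed to be already sorted, and the entropy function is passed in as a parameter; the toy entropy used in the example (number of distinct events covered) is only a placeholder, not the measure defined in the talk.

```python
def greedy_explore(candidates, omega, entropy_fn):
    """Add candidates in order until omega are selected or entropy stops improving."""
    selected = []
    remaining = list(candidates)  # assumed already sorted, e.g. by peripherality
    while len(selected) < omega and remaining:
        nxt = remaining.pop(0)
        # Assumed stopping rule: no better option exists to improve entropy.
        if selected and entropy_fn(selected + [nxt]) <= entropy_fn(selected):
            return selected
        selected.append(nxt)
    return selected


def toy_entropy(cohorts):
    """Placeholder information measure: number of distinct events covered."""
    return len(set().union(*cohorts))


c3 = frozenset({"B", "Y", "M", "D"})
c1 = frozenset({"A", "Y", "C"})
duplicate = frozenset({"B", "Y"})

print(greedy_explore([c3, c1, duplicate], 2, toy_entropy))
print(greedy_explore([c3, c1, duplicate], 3, toy_entropy))
```

With ω = 3 the loop stops early because the third candidate covers no new event, which mirrors termination case (ii).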
Efficiency concerns of cohort exploration
• The bottleneck of cohort exploration is similarity computation.
• By avoiding similarity computation for “irrelevant” contrast cohorts, the execution time improves drastically.
• Inspired by double dictionary encoding, we build an event set for each contrast cohort, regardless of the events’ time of occurrence. Event sets enable early pruning of irrelevant cohorts.
Fig. 6 Running example for cohort exploration: the input cohort has events {A,Y,M}, and the contrast cohorts c1, c2, and c3 have events {A,Y,C}, {E,B,L}, and {B,Y,M,D}. With θ = 0.5, sim(c,c3) = 0.9 and sim(c,c1) = 0.7, while c2 (sim(c,c2) = 0.0) is discarded by “event sets”.
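The pruning idea can be sketched as a set-intersection test that runs before any similarity computation. The example assumes that a contrast cohort is "irrelevant" when its event set shares no event with the input cohort, which is consistent with c2 being discarded in the running example; the function name is illustrative.

```python
def prune_by_event_sets(input_events, candidates):
    """Keep only the contrast cohorts that share at least one event with the input."""
    input_set = set(input_events)
    return {
        name: events
        for name, events in candidates.items()
        if input_set & set(events)
    }


# Event sets from the running example (time of occurrence is ignored).
input_events = {"A", "Y", "M"}
contrast_events = {
    "c1": {"A", "Y", "C"},
    "c2": {"E", "B", "L"},
    "c3": {"B", "Y", "M", "D"},
}

relevant = prune_by_event_sets(input_events, contrast_events)
print(sorted(relevant))
```

Only the surviving cohorts (c1 and c3 here) need the more expensive sequence-matching similarity.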
Performance of cohort representation
Fig. 8 Execution time of cohort representation with 10, 50, 100, and 200 trajectory families, respectively from left to right (x-axis: cohort size; y-axis: execution time in ms; series: Agir and Rambam).
Figure: execution time of cohort representation with stratified sampling on AGIR with sampling ratios of 0.2, 0.4, 0.6, and 0.8, respectively from left to right (x-axis: cohort size; y-axis: execution time in ms; series: age, life, gender, and random).
Excerpts shown on the slide (Behrooz Omidvar-Tehrani et al., “Cohort Analytics: Efficiency and Applicability”):
… applied, respectively. For Rambam, the maximum execution times are smaller and do not exceed 100ms, 150ms, and 610ms, respectively. The reason is that the former has longer trajectories and requires more time to aggregate events and build representations. In case of 200 trajectory families, a cohort may end up with all 200 medoids, which results in a large number of trajectory comparisons. Although cohorts of size 5000 or smaller can be executed in attention-preserving latency, larger cohorts may exceed this latency threshold.
… stratification attribute, as the same health situation may happen for both genders. While trajectory families and stratified sampling improve execution time, we still need to verify how much loss they entail (defined in Equations 9 and 10, respectively). Figure 10-left shows the precision loss by varying the number of trajectory families from 5 to 200. We observe that in both datasets, the loss decreases when increasing the number of trajectory families. For instance, while having only 5 trajectory families leads to an 83% and 81% loss, 200 families lead to an 11% …
Quality of representations
• Fitness verifies if trajectories of cohort members have a footprint in the representation.
• Replayability verifies if all actions in the trajectories of cohort members are observed in the representation.
• Specificity verifies how specific the representation is to the cohort members.
Fig. 14 Quality of cohort representation … Rambam (right). Panels plot fitness, replayability, and specificity against cohort size and the number of trajectory families (series include Agir and Rambam).
Performance of cohort exploration
Fig. 16 Execution time of cohort … the contrast cohort difference µ (top), … options ω (middle), and the similar… on Agir (left) and Rambam (right). The top panels compare execution with and without event sets (y-axis: execution time in ms); the other panels vary the number of exploration options ω and the similarity threshold θ.
Excerpt shown on the slide (clipped in the source): Another important observation … difference is that the execution time … the same order of magnitude as th… contrast cohorts. For instance in … possible cohorts grows by 3 orders … between µ = 1 and µ = 2, but the e… one order of magnitude worse. Thi… our materialization policy for cont… loading. The only piece of informa… for each contrast cohort c′ ∈ C̄ i… This retrieval can be instantaneou… index in the database on patient …
Conclusion
• A data-driven framework for medical cohort representation and exploration.
• Representation builds a concise representation of a cohort by pruning insignificant events.
• Exploration relies on finding contrast cohorts as exploration candidates.
• For an efficient computation of cohort representation, we employ “trajectory families” and “stratified sampling”, and for cohort exploration, we employ “event sets”.
• We plan to deploy a distributed infrastructure where different components of representation and exploration can be performed in parallel.
Thank you!

Cohort Representation and Exploration

  • 1. Cohort Representation and Exploration. Behrooz Omidvar-Tehrani, omidvar.info. May 29, 2019 @ SLIDE
  • 2. My research: Data Exploration (enabling interactions with data) for customized analysis, which originated from Exploratory Data Analysis (EDA) in statistics.
  • 3. Exploring Data Exploration
      1. PhD in mining, exploring, and visualizing user groups (with Sihem Amer-Yahia) [CIKM’15] [EDBT’15] [PKDD’16] [DEXA’17] [ICDE Demo’18] [VLDBJ’18] [EDBT’18] [CIKM Tutorial’18] [TKDE’19] [HILDA’19] [SIGMOD Tutorial’19]
      2. Mini-postdoc in exploring student progression in the French medical education system (with Marie-Christine Rousset) [J. AI in Med ’19]
      3. Postdoc in exploring spatiotemporal data (with Arnab Nandi at OSU) [ICDE Demo’17] [UrbanGIS’17] [SIGSPATIAL’17]
      4. Postdoc in exploring user data and medical data (with Sihem Amer-Yahia) [DMAH’18] [DSAA’18] [VLDB Demo’19]
      5. Research scientist in exploring POI recommendations (with Naver Labs Europe)
  • 4. Medical data analysis. Before: public health analysis. Now: precision medicine.
  • 5. Public Health Analysis: overall health-related conclusions about the masses.
  • 6. Precision Medicine: a medical model for the customization of healthcare, with medical decisions and treatments being tailored to the individual patient. Source: National Institutes of Health (NIH).
  • 7. Precision medicine for medical cohorts. Public health analysis: How does air pollution affect people? Precision medicine: How does air pollution affect Julia’s health? Explore cohorts of patients and their trajectories: How does air pollution affect the health status of middle-aged females in Paris suffering from sleep apnea? (Towards more customization.)
  • 8. Why patient cohorts? Julia is suffering from epilepsy and is trying to understand what has caused an increase in the frequency of her absence seizures. Her cohort: patients suffering from epilepsy. External benefit, for hospitals (a better cohort): medical units prepare special care for the specific needs of the cohort members. Internal benefit, for Julia (a better me): Julia observes the profiles of look-alike (or like-minded) patients and sees where she falls relative to the norm.
  • 9. Medical cohort analysis. Medical cohort analysis exhibits how the health of a set of patients is affected by treatments and diseases. Representation: what has happened? Exploration: what happened to similar cohorts? Prediction: what will happen next? Example cohort: middle-aged females in Paris suffering from sleep apnea.
  • 10. Cohort representation. Help medical experts understand which sequences of treatments and diseases lead to a final health status. Example cohort: middle-aged females in Paris suffering from sleep apnea. Which sequence of treatments is the most relevant to death? Which treatment, administered right after admission, kept cohort members alive for a longer period? What changes in Body Mass Index (BMI) lead to death?
  • 11. Cohort exploration. Enable navigation in the space of cohorts to compare their representations and see how their health evolves. Example: explore from the cohort of patients inside Grenoble suffering from sleep apnea to the cohort of patients outside Grenoble suffering from sleep apnea. How do their sequences of treatments and diseases evolve?
  • 12. Outline: 0 Introduction, 1 Health-care data model, 2 Cohort representation, 3 Cohort exploration, 4 Experiments
  • 13. Outline: 0 Introduction, 1 Health-care data model, 2 Cohort representation, 3 Cohort exploration, 4 Experiments
  • 14. Health-care datasets (two datasets; values listed as first | second)
      Number of patients: 56,286 | 260,099
      Number of actions: 1,845 | 428
      Number of events: 1,543,263 | 1,300,987
      Time period: January 2000 to December 2018 | January 2004 to October 2007
      Demographic attributes: gender, age, location, life status | gender, age, life status
      Types of actions: treatment, etiology, BMI marker, sleepiness marker, hospitalization | visit outcome, visited hospital dep, visited ward, emergency stay, hospitalization
  • 15. Star schema of health-care datasets. AGIR-à-Dom dataset for 56,284 patients with respiratory diseases.
      Patient (central table): Patient ID, Gender, Birth year, Postal code, Life status, Death date
      Health service: Patient ID, Service duration, Service ID (324,017 records)
      Etiology: Patient ID, Etiology date, Etiology label (223,142 records)
      Compliance: Patient ID, Compliance date, Compliance ID (659,361 records)
      Hospitalization: Patient ID, Hospital. duration, Hospital name, Service ID (28,712 records)
      Insurance: Patient ID, Insurance duration, Insurer, Insurance type, Exemption (149,624 records)
      Fatigue marker: Patient ID, Observation date, Marker value (217,581 records)
      BMI marker: Patient ID, Observation date, Height value, Weight value, Marker value (390,028 records)
      Respiration marker: Patient ID, Observation date, Marker value (151,810 records)
  • 16. Patient events. We want to express all health-care events in the following form. event: ⟨patient, action, timestamp⟩
  • 17. Patient trajectories. A patient trajectory is a list of temporally sorted events for the patient. The example timeline starts at admission time t0 (the first appearance of the patient in our health-care database) and shows the timestamps t0, t1, t2, t10, t14, … with the events BMI_obese, epworth_sleepy; hospitalization_begin; oxygen_begin; hospitalization_end; BMI_obese, epworth_normal, oxygen_end.
  • 18. Patients’ cohort. A cohort is a set of patients defined with a set of predicates. For instance, the cohort of middle-aged females in Paris suffering from sleep apnea has the following 3 patients. The trajectories of Patient 1, Patient 2, and Patient 3 are drawn on the timelines t0, t3, t8, t14, t19, …; t0, t5, t11, t15, …; and t0, t3, t9, t15, t19, …, with events such as BMI_obese, BMI_overweight, epworth_borderline, oxygen_begin, oxygen_end, hospitalization_begin, and hospitalization_end.
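The event, trajectory, and cohort definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Event`, `build_trajectories`, and `select_cohort` are hypothetical, timestamps are taken to be integers, a predicate is modeled as an attribute-value pair that must match exactly, and the three events are made up. The demographic rows reuse the seven-patient example table shown on the methodology slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One health-care event: (patient, action, timestamp)."""
    patient: str
    action: str
    timestamp: int  # assumption: an integer time unit, e.g. months


def build_trajectories(events):
    """Group events per patient, sort them in time, and shift to admission time."""
    by_patient = defaultdict(list)
    for event in events:
        by_patient[event.patient].append(event)
    trajectories = {}
    for patient, evs in by_patient.items():
        evs.sort(key=lambda e: e.timestamp)
        admission = evs[0].timestamp  # first appearance of the patient
        trajectories[patient] = [(e.timestamp - admission, e.action) for e in evs]
    return trajectories


def select_cohort(demographics, predicates):
    """Return the patients whose attributes satisfy every predicate."""
    return {
        patient
        for patient, attrs in demographics.items()
        if all(attrs.get(attr) == value for attr, value in predicates.items())
    }


def row(gender, location, age, life):
    return {"gender": gender, "location": location, "age": age, "life": life}


demographics = {
    "p1": row("female", "Grenoble", "old", "alive"),
    "p2": row("female", "Grenoble", "old", "alive"),
    "p3": row("female", "Grenoble", "old", "alive"),
    "p4": row("female", "Paris", "middle", "dead"),
    "p5": row("male", "Grenoble", "old", "alive"),
    "p6": row("female", "Paris", "old", "dead"),
    "p7": row("male", "Grenoble", "old", "alive"),
}
# Made-up events, deliberately out of order to show the sorting step.
events = [
    Event("p1", "oxygen_begin", 12),
    Event("p1", "BMI_obese", 10),
    Event("p1", "hospitalization_begin", 11),
]
cohort = select_cohort(
    demographics,
    {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
)
trajectories = build_trajectories(events)
```

Shifting every trajectory so that admission is time zero is what makes trajectories of different patients comparable in the later alignment steps.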
  • 19. Outline: 0 Introduction, 1 Health-care data model, 2 Cohort representation, 3 Cohort exploration, 4 Experiments
  • 20. Cohort representation. Goal: help medical experts understand the sequence of treatments and diseases for patients over time. Challenge: cohorts often consist of hundreds of patients with various types of events, e.g., a cohort with 500 patients (Patient 1, Patient 2, …, Patient 500).
  • 21. Ideal representation of a cohort. An ideal representation for medical experts should describe a single end-to-end story for the cohort and be succinct and readable. Expert: What has happened to the cohort of middle-aged females in Paris with bronchitis who stay alive? Cohort representation system: The cohort starts with oxygenotherapy, follows with ventilation after 11 months, and takes another ventilation treatment one month later.
  • 22. Finding ideal cohort representations. First off, how can we compare two patient trajectories? The example shows the trajectories of Patient 1 and Patient 2 over the actions A, B, C, D, and E. Representation: set-based {A,C}; order-based A→C→C; time-based A (0.7) → C (1.0) → C (0.6).
  • 23. Aggregated event. event: ⟨patient, action, timestamp⟩; aggregated event: ⟨patients, action, timestamps⟩
  • 24. Finding ideal cohort representations. First off, how can we compare two patient trajectories? The example shows the trajectories of Patient 1 and Patient 2 over the actions A, B, C, D, and E, with the timestamps t5 and t7 shown for the time-based comparison. Representation: set-based {A,C}; order-based A→C→C; time-based A (0.7) → C (1.0) → C (0.6).
  • 25. Quality of aggregated events. Dispersion: an aggregated event is dispersed if its timestamps are spread in time. Significance: an aggregated event is more significant if it contains several different timestamps. For instance, given two aggregated events ē1 = ⟨P, x, {t0, t6}⟩ and ē2 = ⟨P′, x, {t1, t1, t3, t4}⟩, where P and P′ are sets of patients, we have dispersion(ē1) = 6, significance(ē1) = 2, dispersion(ē2) = 3, and significance(ē2) = 4. As ē2 contains more events, it is more significant. Note that while ē2 has higher significance than ē1, the latter has more dispersion.
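The slide gives the four values but not explicit formulas. A minimal sketch, assuming that dispersion is the span between the earliest and latest timestamp and that significance is the number of timestamps (counting repeats), reproduces them. The function names and the list encoding of timestamps are hypothetical.

```python
def dispersion(timestamps):
    """Assumed definition: span between the earliest and the latest timestamp."""
    return max(timestamps) - min(timestamps)


def significance(timestamps):
    """Assumed definition: number of timestamps in the aggregated event."""
    return len(timestamps)


# Timestamps of the two aggregated events from the slide, with t_i encoded as i.
e1 = [0, 6]
e2 = [1, 1, 3, 4]

summary = {
    "e1": (dispersion(e1), significance(e1)),
    "e2": (dispersion(e2), significance(e2)),
}
```

Under these assumptions `summary` holds (6, 2) for ē1 and (3, 4) for ē2, matching the annotations on the slide.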
  • 26. Problem of cohort representation. Given a cohort c and a significance threshold σ, the problem of cohort representation is to find all aggregated events Ēc where, for each ē ∈ Ēc, the following two conditions are satisfied: (1) significance(ē) ≥ σ, (2) dispersion(ē) is minimized. The problem is NP-complete by a reduction from the Multiple Sequence Alignment (MSA) problem.
  • 27. Our cohort representation methodology (running example)
      0. EHR dataset (patients), with columns gender, location, age, life:
         p1 female Grenoble old alive
         p2 female Grenoble old alive
         p3 female Grenoble old alive
         p4 female Paris middle dead
         p5 male Grenoble old alive
         p6 female Paris old dead
         p7 male Grenoble old alive
      1. Cohort specification: the medical expert specifies a cohort. Cohort demographics: ⟨gender, female⟩ ∧ ⟨location, Grenoble⟩ ∧ ⟨age, old⟩ ∧ ⟨life, alive⟩; cohort actions: ∅; cohort members: p1, p2, p3.
      2. Trajectories of the cohort’s members: the trajectories of p1, p2, and p3 are drawn on a timeline t0 … t11 that starts at admission time (time zero), with the actions A, B, C, D, E, and H.
      3. Sequence matching produces aggregated events between each pair of trajectories.
         Between T(p1) and T(p2): ⟨{p1,p2}, A, {t0,t0}⟩, ⟨{p1,p2}, B, {t5,t6}⟩, ⟨{p1,p2}, C, {t10,t10}⟩
         Between T(p2) and T(p3): ⟨{p2,p3}, A, {t0,t0}⟩, ⟨{p2,p3}, B, {t5,t6}⟩
         Between T(p1) and T(p3): ⟨{p1,p3}, A, {t0,t0}⟩, ⟨{p1,p3}, B, {t5,t5}⟩
      4. Compute significance score (aggregated event, count, final score):
         ⟨{p1,p2,p3}, A, {t0,t0,t0,t0,t0,t0}⟩, 3, 1.00
         ⟨{p2,p3}, B, {t5,t5}⟩, 1, 0.33
         ⟨{p1,p2,p3}, B, {t5,t5,t5,t5,t5,t6}⟩, 3, 1.00
         ⟨{p1,p2}, C, {t10,t10}⟩, 1, 0.33
      5. Cohort representation given σ = 0.8: events A and B.
      6. Cohort representation given σ = 0.3: events A, B, and C.
  • 28. Generation of aggregated events using sequence matching. Sequence matching was initially proposed to find matches between protein sequences and discover homologous protein pairs. In the Needleman-Wunsch algorithm, we first build the Manhattan graph using cost tables; then we back-chain to obtain alignments. In the absence of cost tables, we use default cost templates, i.e., +1 for match, −1 for mismatch, and −2 for gap. Example: Patient 1’s trajectory is A, G, C (timestamps t0, t15, t42) and Patient 2’s trajectory is A, A, A, C (timestamps t0, t3, t12, t72). Manhattan graph (columns: Patient 1, rows: Patient 2):
               BEGIN    A    G    C
      BEGIN      0     -2   -4   -6
      A         -2      1   -1   -3
      A         -4     -1    0   -2
      A         -6     -3   -2   -1
      C         -8     -5   -4   -1
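As a hedged illustration of the step above, the following sketch fills the Needleman-Wunsch score table with the default costs from the slide (+1 match, −1 mismatch, −2 gap) and back-chains one alignment. The function names are hypothetical, trajectories are reduced to their action letters (timestamps are ignored here), and ties are broken in favor of the diagonal move, which is an assumption rather than something the slide specifies.

```python
def nw_table(rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment score table (the 'Manhattan graph' of the slide)."""
    n, m = len(rows), len(cols)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two actions
                score[i - 1][j] + gap,       # gap in the column sequence
                score[i][j - 1] + gap,       # gap in the row sequence
            )
    return score


def nw_backchain(score, rows, cols, match=1, mismatch=-1, gap=-2):
    """Walk back from the bottom-right cell; None marks a gap."""
    i, j = len(rows), len(cols)
    pairs = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if rows[i - 1] == cols[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((rows[i - 1], cols[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((rows[i - 1], None))
            i -= 1
        else:
            pairs.append((None, cols[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


patient1 = "AGC"   # columns of the table on the slide
patient2 = "AAAC"  # rows of the table on the slide
table = nw_table(patient2, patient1)
alignment = nw_backchain(table, patient2, patient1)
```

With these inputs the table equals the one on the slide, and the matched pairs (A, A) and (C, C) are the places where an aggregated event could be formed from the two patients' timestamps.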
  • 29. Why is cohort representation useful? Storytelling for the cohort: starting with event A, the cohort follows up with B after five months and then C after six months (cohort representation: A at t0, B at t5, C at t11). This helps to decrease hospitalization costs in the first 5-month period of treatment, to prepare the medical unit relative to events A, B, and C and not other events, and to organize care subscriptions in periods of 5-6 months.
  • 30. Efficiency concerns of cohort representation. The worst-case complexity of cohort representation is O(|c|² × n²). To improve the efficiency of cohort representation, we need to reduce either the length of trajectories (i.e., n) or the number of trajectory comparisons (i.e., |c|). The slide groups three techniques (filtering out unnecessary comparisons, trajectory families, and stratified sampling) under reduction of trajectory length and reduction of trajectory comparisons.
  • 31. Trajectory families. Generate non-overlapping clusters of trajectories in a time-agnostic way. Employ k-medoids as the clustering approach to obtain k families. Abstract patient trajectories with the medoid of their family. Fig. 4: Trajectory families over the trajectories T(p1) to T(p10), with med(F1) = T(p5) for trajectory family F1 and med(F2) = T(p3) for trajectory family F2. The distance between patient trajectories is shown on edges as the inverse of their similarity. The upper bound of precision loss for F1 and F2 is loss(F1, F2) = max(0.4, 0.6) = 0.6.
  • 32. Stratified sampling. Generate a sampled cohort c’ ⊂ c by picking at random r×|c| / |dom(a)| members from each demographic group of a given stratification attribute a, where dom(a) is the set of values of a. A demographic group is identified with an attribute-value pair, e.g., ⟨gender, female⟩ and ⟨gender, male⟩ in case the stratification attribute is “gender”. Stratified sampling ensures that the sampled cohort members represent all demographic groups. The sampling ratio r is a value between 0 and 1 (exclusive), where values closer to 0 lead to more reduction.
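A minimal sketch of stratified sampling, assuming a per-group quota of r×|c| / |dom(a)| rounded to the nearest integer, with at least one member per group and never more than the group holds. The function name, the dictionary encoding of patients, the seed parameter, and the rounding and capping rules are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(cohort, attribute, ratio, seed=0):
    """Pick roughly ratio * |cohort| patients, spread evenly over attribute values."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must be strictly between 0 and 1")
    if not cohort:
        return []
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    groups = defaultdict(list)
    for patient in cohort:
        groups[patient[attribute]].append(patient)
    # Assumed quota: an equal share of the sample for each demographic group.
    quota = max(1, round(ratio * len(cohort) / len(groups)))
    sample = []
    for value in sorted(groups):
        members = groups[value]
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy cohort: 6 female and 4 male patients (made-up identifiers).
cohort = [{"id": f"p{i}", "gender": "female" if i <= 6 else "male"} for i in range(1, 11)]
sample = stratified_sample(cohort, "gender", ratio=0.4)
sampled_genders = {patient["gender"] for patient in sample}
```

Here the quota is 0.4 × 10 / 2 = 2 patients per gender, so both demographic groups stay represented in the four sampled patients.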
  • 33. Outline: 0 Introduction, 1 Health-care data model, 2 Cohort representation, 3 Cohort exploration, 4 Experiments
  • 34. Ideal cohort exploration. Experts may be interested in finding cohorts that are similar to a given cohort. Example: exploring from the cohort of patients inside Grenoble suffering from sleep apnea to a candidate cohort for exploration.
  • 35. Comparison of cohorts: similarity. The similarity between two cohorts c1 and c2 is the similarity between their representations Ec1 and Ec2. Two cohorts are similar if their common events are aligned, i.e., less dispersed. Hence we consider an inverse definition of dispersion, called “concentration” (Equation 3): concentration(ē) = 1.0 − dispersion(ē) / Π, where Π is a normalization factor equal to the largest possible dispersion, i.e., the largest size of the trajectories for patients in ē.P. For instance, given an aggregated event ē1 = ⟨{p1, p2}, x, {t0, t2}⟩ and Π = max(|p1.trajectory|, |p2.trajectory|) = 10, we obtain concentration(ē1) = 1.0 − 0.2 = 0.8, which means ē1 is highly concentrated. Given Equation 3, similarity(c1, c2) is defined (Equation 4) as the average of concentration(ē) over the aggregated events ē ∈ Ē that merge an event of T(c1) and an event of T(c2) with the same action x, taking the union of their timestamps.
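The concentration formula and the averaging step can be sketched as follows. This is an illustration under assumptions: a cohort representation is encoded as a dictionary from action to a list of integer timestamps, dispersion is taken as the span of the merged timestamps, the normalization factor Π is passed in by the caller, and two cohorts with no common action get similarity 0.0. None of these encodings come from the slides.

```python
def concentration(timestamps, max_dispersion):
    """concentration = 1.0 - dispersion / Pi, with dispersion as the timestamp span."""
    spread = max(timestamps) - min(timestamps)
    return 1.0 - spread / max_dispersion


def similarity(rep1, rep2, max_dispersion):
    """Average concentration over the actions that both representations share."""
    common = sorted(set(rep1) & set(rep2))
    if not common:
        return 0.0  # assumption: no common event means no similarity
    scores = [concentration(rep1[a] + rep2[a], max_dispersion) for a in common]
    return sum(scores) / len(scores)


# Example from the slide: timestamps {t0, t2} with Pi = 10.
example = concentration([0, 2], 10)

# Two made-up representations sharing the actions A and B.
rep_c1 = {"A": [0], "B": [5]}
rep_c2 = {"A": [0], "B": [6], "C": [10]}
sim = similarity(rep_c1, rep_c2, 10)
```

In the made-up example, A is perfectly aligned (concentration 1.0) and B is off by one time unit (0.9), so the similarity is their average, 0.95.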
  • 36. Comparison of cohorts: entropy. Cohort exploration should return a limited number of well-separated similar cohorts to help experts explore different directions in their health-care data. We consider an expert-defined parameter ω which defines the size of the exploration set. Given a set of cohorts C where |C| ≤ ω, we measure the amount of information that C conveys using its Shannon information entropy, which Equation 5 defines as a sum, over the pairs of cohorts ci, cj ∈ C, of the terms π(ci, cj) × log_|C| π(ci, cj). Equation 5 is an adaptation of the edge-weighted graph entropy introduced in [18]. It captures the amount of patient overlap between the cohorts.
  • 37. Problem of cohort exploration. Given a cohort c, a similarity threshold θ, and a size threshold ω, return at most ω cohorts C = {c1, c2, …, cω} where similarity(c, ci) ≥ θ and entropy(C) is maximized. The problem is NP-complete by a reduction from the Maximum Sub-Edge Graph problem.
  • 38. Multi-staged cohort exploration
      1. Limit the set of candidates to contrast cohorts and sort them in increasing order of peripherality.
      2. Compute the similarity of each identified contrast cohort to the given cohort.
      3. Return ω-similar cohorts with maximized entropy.
      Fig. 6 Running example for cohort exploration:
      0. Input cohort. Cohort demographics: ⟨gender, female⟩ ∧ ⟨location, Grenoble⟩ ∧ ⟨age, old⟩ ∧ ⟨life, alive⟩; cohort events: {A,Y,M}.
      1. Contrast cohorts and their events. c1 demographics: ⟨gender, male⟩ ∧ …, events: {A,Y,C}; c2 demographics: ⟨location, Lyon⟩ ∧ …, events: {E,B,L}; c3 demographics: ⟨age, young⟩ ∧ …, events: {B,Y,M,D}; …
      2. Similarity checking and sorting with θ = 0.5: sim(c,c3) = 0.9, sim(c,c1) = 0.7, sim(c,c2) = 0.0; c2 is discarded by “event sets”.
      3. Entropy maximization. Given ω = 2, Cω = {c1, c3}; the entropy of c1, c3 is maximal.
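The three stages can be sketched as a small greedy loop. This is a hedged illustration, not the authors' algorithm: the function `explore`, the pre-sorted candidate list, and the precomputed similarity values are hypothetical, and the entropy measure is passed in as a function. The stand-in `toy_entropy` below (the number of distinct events covered) is only there to make the sketch runnable and is not the entropy defined in the talk.

```python
def explore(cohort_events, candidates, similarity, entropy, theta, omega):
    """Filter candidates by event sets and similarity, then add them greedily."""
    # Stages 1-2: drop candidates sharing no event with the input cohort
    # ("event sets") or falling below the similarity threshold theta.
    kept = [
        name
        for name, events in candidates
        if cohort_events & events and similarity[name] >= theta
    ]
    # Stage 3: scan candidates in the given order; stop when the result holds
    # omega cohorts or when the next candidate does not improve entropy.
    selected = []
    for name in kept:
        if len(selected) >= omega:
            break
        if entropy(selected + [name]) <= entropy(selected):
            break
        selected.append(name)
    return selected


# Running example from the slide, candidates listed in the sorted order shown there.
input_events = {"A", "Y", "M"}
candidates = [
    ("c3", {"B", "Y", "M", "D"}),
    ("c1", {"A", "Y", "C"}),
    ("c2", {"E", "B", "L"}),
]
sims = {"c3": 0.9, "c1": 0.7, "c2": 0.0}
events_of = dict(candidates)


def toy_entropy(selected):
    """Stand-in measure (assumption): number of distinct events covered."""
    covered = set()
    for name in selected:
        covered |= events_of[name]
    return len(covered)


result = explore(input_events, candidates, sims, toy_entropy, theta=0.5, omega=2)
```

With θ = 0.5 and ω = 2 the sketch discards c2 and returns c3 and c1, the same pair as in the running example.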
  • 39. Contrast cohorts • A μ-contrast cohort differs in at most μ attribute values or actions. • We consider μ-contrast cohorts as exploration candidates, as they are similar to the input cohort but also different enough to provide additional insights. • Materialization of contrast cohorts. Given a cohort c = ⟨demogs, actions⟩, the function contrast(c, μ) returns the set of all μ-contrast cohorts of c, following Equation 11: $\mathrm{contrast}(c, \mu) = \{c' \in C \ \text{s.t.}\ (|c.\mathit{demogs} \cap c'.\mathit{demogs}| = |c.\mathit{demogs}| - \mu_1 \wedge \exists \langle a, v \rangle \in c, \langle a, v' \rangle \in c', v \neq v') \vee (|c.\mathit{actions} \cap c'.\mathit{actions}| = |c.\mathit{actions}| - \mu_2)\}$, with $\mu_1 + \mu_2 = \mu$ (11). • The algorithm terminates in two different cases: (i) $\bar{C}^{\omega}_{c}$ is filled with ω cohorts (line 10), (ii) no better option exists to improve entropy (line 12).
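To make the μ-contrast idea concrete, here is a small sketch that counts how far a candidate cohort is from an input cohort. The dictionary layout and the names `difference` and `is_contrast` are hypothetical. The sketch follows the slide's informal wording ("differs in at most μ attribute values or actions") by adding up the demographic differences μ1 and the missing actions μ2; how Equation 11 combines the two parts exactly is not settled by the transcript, so treat the combination rule as an assumption.

```python
from typing import Any, Dict


def difference(c: Dict[str, Any], other: Dict[str, Any]) -> int:
    """mu1 + mu2: attribute values of c not shared by other, plus actions of c missing in other."""
    mu1 = sum(1 for attr, val in c["demogs"].items() if other["demogs"].get(attr) != val)
    mu2 = len(c["actions"] - other["actions"])
    return mu1 + mu2


def is_contrast(c: Dict[str, Any], other: Dict[str, Any], mu: int) -> bool:
    """True when other differs from c, but in at most mu values or actions."""
    return 0 < difference(c, other) <= mu


c = {"demogs": {"gender": "female", "location": "Grenoble", "age": "old", "life": "alive"},
     "actions": {"A", "Y", "M"}}
# Same cohort description with one demographic value flipped.
near = {"demogs": dict(c["demogs"], gender="male"), "actions": {"A", "Y", "M"}}
# Two demographic values changed and two actions missing.
far = {"demogs": dict(c["demogs"], gender="male", location="Lyon"), "actions": {"A"}}
```

Here `near` is a 1-contrast cohort of `c`, while `far` differs in four places and only qualifies for μ ≥ 4.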
  • 40. Entropy maximization • We can achieve an optimal execution of our greedy algorithm (and ensure a 1 − 1/e approximation guarantee) only if there is a total order between the exploration candidates. • The peripherality function is a common centrality measure in social network analysis. • The intuition behind this measure is that more peripheral cohorts (i.e., those farther from other cohorts) contribute more to entropy. • The peripherality of a cohort from an input cohort is defined as the inverse of their closeness, i.e., their amount of overlap. • Algorithm listing (partly clipped on the slide): lines 3 to 7 loop over the candidates $c' \in \bar{C}$ and test their event sets S(c), S(c′) and their similarity; line 8 sorts $\bar{C}$; line 9 starts with an empty result set $\bar{C}^{\omega}_{c}$; while $|\bar{C}^{\omega}_{c}| < \omega$ (line 10), the next candidate $c^{*}$ is fetched with get_next($\bar{C}$) (line 11); line 12 tests entropy($\bar{C}^{\omega}_{c} \cup c^{*}$) and may return $\bar{C}^{\omega}_{c}$ early (line 13); otherwise $c^{*}$ is added to $\bar{C}^{\omega}_{c}$ (line 15) and removed from $\bar{C}$ (line 16); line 18 returns $\bar{C}^{\omega}_{c}$. • Stratified sampling (paper excerpt visible on the slide). Given a cohort c = ⟨demogs, actions⟩, a stratification attribute a, and a sampling ratio r, stratified sampling generates a sampled cohort $\hat{c} \subset c$ by picking at random $\frac{r \times |c|}{|dom(a)|}$ members from each demographic group of a. A demographic group is identified with an attribute-value pair, e.g., ⟨gender, female⟩ and ⟨gender, male⟩ when the stratification attribute is “gender”. Stratified sampling ensures that the sampled cohort members represent different demographic groups. The sampling ratio r is a value between 0 and 1, where values closer to 0 lead to more reduction. The value of r defines the tradeoff between efficiency and precision. For a cohort c, the upper bound of the precision loss with the sampling ratio r is denoted sampling loss and is computed using Equation 10.
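The stratified sampling step described in the excerpt can be sketched as below. This is an illustrative sketch only: the function name `stratified_sample`, the list-of-dicts member layout, and the `seed` parameter are hypothetical, dom(a) is approximated by the attribute values actually observed in the cohort, and the per-group quota r × |c| / |dom(a)| is rounded down, which the excerpt does not specify.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Optional


def stratified_sample(members: List[Dict[str, Any]], attribute: str, ratio: float,
                      seed: Optional[int] = None) -> List[Dict[str, Any]]:
    """Pick about ratio * |c| / |dom(a)| members at random from each demographic group."""
    if not members:
        return []
    rng = random.Random(seed)
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for member in members:
        groups[member[attribute]].append(member)  # one group per attribute value
    quota = int(ratio * len(members) / len(groups))  # rounded down (assumption)
    sample: List[Dict[str, Any]] = []
    for group in groups.values():
        # A group smaller than the quota contributes all of its members.
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# A cohort of 20 members, half female and half male, sampled with r = 0.5.
cohort = [{"id": i, "gender": "female" if i < 10 else "male"} for i in range(20)]
sampled = stratified_sample(cohort, "gender", 0.5, seed=0)
```

With r = 0.5 the quota is 5 per gender, so both demographic groups stay represented in the sampled cohort.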
  • 41. Efficiency concerns of cohort exploration • The bottleneck of cohort exploration is similarity computation. • By avoiding similarity computation for “irrelevant” contrast cohorts, the execution time improves drastically. • Inspired by double dictionary encoding, we build an event set for each contrast cohort, regardless of the events' time of occurrence. Event sets enable early pruning of irrelevant cohorts. • Fig. 6 Running example for cohort exploration (the figure from slide 38, repeated): with input events {A,Y,M} and θ = 0.5, c2 (events {E,B,L}, sim(c,c2) = 0.0) is discarded by “event sets”, while c1 (events {A,Y,C}, sim(c,c1) = 0.7) and c3 (events {B,Y,M,D}, sim(c,c3) = 0.9) are kept.
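A minimal sketch of event-set pruning, assuming trajectories are lists of (time, event) pairs and that a contrast cohort is "irrelevant" when its event set shares nothing with the input cohort's event set. That criterion is inferred from the running example, where c2 is the only discarded cohort, and is not stated explicitly on the slide; `event_set` and `prune` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

Trajectory = List[Tuple[int, str]]  # (time of occurrence, event label)


def event_set(trajectories: List[Trajectory]) -> Set[str]:
    """Collect the distinct events of a cohort, ignoring when they occurred."""
    return {event for trajectory in trajectories for _, event in trajectory}


def prune(input_events: Set[str], candidates: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop candidates sharing no event with the input cohort (assumed criterion)."""
    return {name: ev for name, ev in candidates.items() if ev & input_events}


# Event sets from the running example in Fig. 6; the time stamps are made up.
input_events = event_set([[(0, "A"), (11, "Y")], [(3, "A"), (12, "M")]])
candidates = {"c1": {"A", "Y", "C"}, "c2": {"E", "B", "L"}, "c3": {"B", "Y", "M", "D"}}
relevant = prune(input_events, candidates)
```

Only the surviving cohorts would then go through the costly similarity computation; in this example c2 is removed by a cheap set intersection alone.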
  • 42. Outline • 0 Introduction • 1 Health-care data model • 2 Cohort representation • 3 Cohort exploration • 4 Experiments
  • 43. Performance of cohort representation • Fig. 8 Execution time of cohort representation with 10, 50, 100, and 200 trajectory families, respectively from left to right (x-axis: cohort size; y-axis: execution time in ms; series: Agir, Rambam). • Execution time of cohort representation with stratified sampling on AGIR with sampling ratios of 0.2, 0.4, 0.6, and 0.8, respectively from left to right (x-axis: cohort size; y-axis: execution time in ms; series: age, life, gender, random). • Paper excerpt visible on the slide: [...] For Rambam, the maximum execution times are smaller and do not exceed 100 ms, 150 ms, and 610 ms, respectively. The reason is that the former has longer trajectories and requires more time to aggregate events and build representations. In the case of 200 trajectory families, a cohort may end up with all 200 medoids, which results in a large number of trajectory comparisons. Although cohorts of size 5000 or smaller can be executed within attention-preserving latency, larger cohorts may exceed this latency threshold. [...] While trajectory families and stratified sampling improve execution time, we still need to verify how much loss they entail (defined in Equations 9 and 10, respectively). Figure 10-left shows the precision loss when varying the number of trajectory families from 5 to 200. We observe that in both datasets, the loss decreases when increasing the number of trajectory families. For instance, while having only 5 trajectory families leads to an 83% and 81% loss, 200 families lead to an 11% [...]
  • 44. Quality of representations • Fitness verifies if trajectories of cohort members have a footprint in the representation. • Replayability verifies if all actions in the trajectories of cohort members are observed in the representation. • Specificity verifies how specific the representation is to the cohort members. • Fig. 14 Quality of cohort representation [...] Rambam (right): panels plot fitness, replayability, and specificity against cohort size (50, 100, 200), with three series for a parameter set to 0.2, 0.5, and 0.8; further panels plot fitness and replayability against the number of trajectory families for Agir and Rambam.
  • 45. Performance of cohort exploration • Fig. 16 Execution time of cohort exploration as a function of the contrast cohort difference μ (top), the number of exploration options ω (middle), and the similarity threshold θ, on Agir (left) and Rambam (right); the μ panels compare runs without and with event sets (y-axis: execution time in ms). • Paper excerpt visible on the slide (clipped at the margin): [...] the number of possible cohorts grows by 3 orders of magnitude between μ = 1 and μ = 2, but the execution time is one order of magnitude worse. [...]
  • 46. Conclusion • A data-driven framework for medical cohort representation and exploration. • Cohort representation builds a concise representation of a cohort by pruning insignificant events. • Cohort exploration relies on finding contrast cohorts as exploration candidates. • For efficient computation of cohort representation, we employ “trajectory families” and “stratified sampling”; for cohort exploration, we employ “event sets”. • We plan to deploy a distributed infrastructure where different components of representation and exploration can be performed in parallel.
  • 47. Thank you! Cohort Representation and Exploration, May 29, 2019 @ SLIDE