3. • Overview of IRIS from Ayasdi
• A tool for looking at large datasets and trying to find meaning
• Walking through an example of an Ayasdi analysis
Outline
4. • We are gathering more data all the time
What IRIS is for…
5. …and while data are often collected to address specific questions, the data
may also hold additional insights
CD
+Stim, Ab
Baseline
“There isn’t a single story happening in your complex data” – Anthony Bak, Ayasdi
6. • IRIS combines topological math with a highly flexible and intuitive interface to
analyze large datasets
• Creates different shapes that can be explored
• Ayasdi can be used on different kinds of high complexity datasets
• Transcriptome profiling
• Clinical data
• Flow cytometry data
• Financial data
• Text
• Etc.
That’s where we think IRIS from Ayasdi will help
7. • Concept is: data has shape based on how elements in the datasets are mathematically
related to each other
• For example, how are samples alike?
• IRIS takes the data, performs a mathematical transformation, and uses the output to
group samples together and draw a picture
• This is done iteratively with different mathematical transformations to give multiple
different views of the data’s shapes
• The shapes highlight possibly interesting parts of the dataset
• In our case, disease or patient subsets
How does IRIS work?
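The steps above (transform the data, group samples, draw a picture) can be sketched in a few lines. This is a minimal Mapper-style illustration, not IRIS's actual implementation: the function names, the overlapping-interval cover, the distance-threshold grouping, and the parameters `n_intervals`, `overlap`, and `eps` are all assumptions made for the example.

```python
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two samples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(points, members, eps):
    """Group sample indices: samples closer than eps end up in the same group."""
    clusters = []
    unseen = set(members)
    while unseen:
        seed = unseen.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            cur = frontier.pop()
            near = {m for m in unseen if dist(points[cur], points[m]) <= eps}
            unseen -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)
    return clusters

def mapper_graph(points, lens, n_intervals=4, overlap=0.5, eps=1.5):
    """Return (nodes, edges); each node is a set of sample indices."""
    values = [lens(p) for p in points]          # the "mathematical transformation"
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0
    nodes = []
    for i in range(n_intervals):
        # Overlapping slice of the transformed values
        start = lo + i * width - overlap * width
        stop = lo + (i + 1) * width + overlap * width
        members = [k for k, v in enumerate(values) if start <= v <= stop]
        nodes.extend(single_linkage(points, members, eps))
    # A line is drawn between two nodes that share at least one sample
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges

# Made-up toy data: six samples, transformed by taking the first coordinate
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
nodes, edges = mapper_graph(points, lens=lambda p: p[0],
                            n_intervals=3, overlap=0.25, eps=1.5)
```

Swapping in a different `lens` function gives a different shape from the same data, which is the iterative part of the workflow.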
9. The problem of having a liberal arts education…
Platonic ideal of chair
10. What an IRIS analysis looks like
3 different shapes made from the same data
11. Explaining the parts
Dots represent groups of samples that are similar to each other
Connecting lines represent at least one shared member between groups
Features like this arm on the shape can be examined in further detail
Coloring (red = high to blue = low) can be based on the initial math or on annotations (e.g., gender, disease), gene expression, etc.
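As a rough sketch of how such coloring could work: each dot holds several samples, so one value per dot is needed before a color can be assigned. The example below assumes the value is the mean over the dot's members and uses a simple two-color ramp. Both choices are assumptions for illustration; the slides do not say how IRIS aggregates or which color scale it uses.

```python
def node_colors(nodes, values):
    """Mean annotation value per node; nodes are sets of sample indices."""
    return [sum(values[i] for i in members) / len(members) for members in nodes]

def to_red_blue(x, lo, hi):
    """Map x in [lo, hi] to an (r, g, b) tuple: blue at lo, red at hi."""
    t = 0.0 if hi == lo else (x - lo) / (hi - lo)
    return (round(255 * t), 0, round(255 * (1 - t)))

# Made-up example: two overlapping nodes, one numeric annotation per sample
nodes = [{0, 1}, {1, 2}]
values = [10.0, 20.0, 30.0]
means = node_colors(nodes, values)
colors = [to_red_blue(m, min(values), max(values)) for m in means]
```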
• Groups and shapes are analyzed and interpreted
• We try to understand what underlies the shapes and forms that arise
• Link back to biology, patients, effect
• Learn new insights
• Create hypotheses, test on the fly
• Iterate
• Next several slides will be an example of an IRIS analysis and insights
How does an IRIS analysis proceed?
13. • Institute for Health Metrics and Evaluation (IHME)
• Performed survey of smoking prevalence worldwide, from 1980-2012
• 187 countries
• Dataset contains smoking frequency broken down by age, gender, year
• 518 columns, 187 rows
• Some reasons to look at this data:
• Practice, and the IRIS workflow is pretty much the same for any dataset
• Using non-gene expression data
• Smoking is a risk factor for RA, diabetes, etc.
Example analysis: Smoking prevalence
14. These were derived from the IHME data
Thinking like an analyst: what do different parts of shapes mean?
There’s a lot to potentially explore
15. Start with this basic shape:
What are these two groups?
Upper arm
Lower arm
Certain mathematical transformations often create this antibody shape in large datasets
16. First step: define groups and do numerical and categorical comparison to
rest of shape
Lower arm categorical table
Column Name | Value | Percent in Group 1 | Percent in Both Group 1 and Group 2 | Count in Group 1 | Count in Both Group 1 and Group 2 | p-value
ISOsubregion | 35 | 0.27 | 0.06 | 6 | 11 | 4.23E-04
Developing | Yes | 1.00 | 0.73 | 22 | 137 | 6.48E-04
ISOsubregion | 14 | 0.27 | 0.09 | 6 | 17 | 0.006991494
Annualized Rate of Change (%) Male and Female 1980 to 2012 | -0.5 | 0.18 | 0.04 | 4 | 8 | 0.007475094
Annualized Rate of Change (%) Male and Female 1980 to 2012 | -0.7 | 0.18 | 0.05 | 4 | 10 | 0.019024382
ISOregion | 2 | 0.45 | 0.27 | 10 | 50 | 0.035708684
Bangladesh
Burkina Faso
Burundi
Cambodia
Djibouti
Federated States of Micronesia
Ghana
Guinea-Bissau
Indonesia
Jamaica
Laos
Malawi
Maldives
Myanmar
Namibia
Paraguay
Philippines
Rwanda
Somalia
Sri Lanka
Thailand
Zimbabwe
Southeastern Asia
Eastern Africa
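The slides do not say which test produces the p-values in the categorical table. One common choice for this kind of "is this value over-represented in the group?" question is an upper-tail hypergeometric test, sketched below; the function name and the choice of test are assumptions. Applied to the "Developing = Yes" row (22 of the 22 lower arm countries, 137 of 187 countries overall), it gives roughly 6.5E-04, which is in line with the table.

```python
from math import comb

def enrichment_p(count_in_group, group_size, count_overall, total):
    """Upper-tail hypergeometric p-value.

    Chance of seeing at least count_in_group hits when group_size samples
    are drawn at random from total samples, count_overall of which are hits.
    """
    top = min(group_size, count_overall)
    hits = sum(
        comb(count_overall, k) * comb(total - count_overall, group_size - k)
        for k in range(count_in_group, top + 1)
    )
    return hits / comb(total, group_size)

# "Developing = Yes" row: 22 of 22 in the group, 137 of 187 overall
p_developing = enrichment_p(22, 22, 137, 187)
```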
18. Now looking at numerical annotations
Column Name KS Statistic KS p-value T-test p-value Group 1 Mean - Group 2 Mean KS Sign
Smoking Prevalence (%) Age 80+ 1997 0.62 4.83578E-07 3.79979E-05 6.960909091 +
Smoking Prevalence (%) Age 80+ 2000 0.62 4.83578E-07 2.55956E-05 7.112424242 +
Smoking Prevalence (%) Age 80+ 1999 0.62 6.72238E-07 2.9015E-05 7.072121212 +
Smoking Prevalence (%) Age 80+ 2001 0.62 6.72238E-07 2.5208E-05 7.133030303 +
Smoking Prevalence (%) Age 80+ 2002 0.62 6.72238E-07 2.38392E-05 7.140909091 +
Smoking Prevalence (%) Age 80+ 1996 0.61 9.31143E-07 4.89306E-05 6.880909091 +
Smoking Prevalence (%) Age 80+ 1998 0.61 9.31143E-07 3.31192E-05 7.008787879 +
Smoking Prevalence (%) Age 80+ 2003 0.61 9.31143E-07 2.36669E-05 7.144242424 +
Smoking Prevalence (%) Age 80+ 1995 0.58 3.66511E-06 5.92711E-05 6.813030303 +
Smoking Prevalence (%) Age 80+ 2004 0.58 4.98014E-06 2.33953E-05 7.080606061 +
Smoking Prevalence (%) Age 75 2004 0.57 5.51162E-06 1.50199E-05 7.676363636 +
Smoking Prevalence (%) Age 75 2008 0.57 5.51162E-06 2.02097E-05 7.436666667 +
Smoking Prevalence (%) Age 75 2009 0.57 5.51162E-06 2.04579E-05 7.365151515 +
Smoking Prevalence (%) Age 75 2011 0.57 6.09737E-06 2.0317E-05 7.224545455 +
Smoking Prevalence (%) Age 75 2012 0.57 6.09737E-06 1.89945E-05 7.184242424 +
Smoking Prevalence (%) Age 80+ 2005 0.57 6.09737E-06 2.25215E-05 7.026363636 +
Smoking Prevalence (%) Age 75 2003 0.57 7.45331E-06 1.28236E-05 7.777878788 +
Smoking Prevalence (%) Age 75 2005 0.57 7.45331E-06 1.61689E-05 7.576666667 +
Smoking Prevalence (%) Age 75 2006 0.57 7.45331E-06 1.84185E-05 7.536969697 +
Smoking Prevalence (%) Age 75 2007 0.57 7.45331E-06 1.94395E-05 7.496666667 +
Smoking Prevalence (%) Age 75 2010 0.57 7.45331E-06 2.08264E-05 7.294848485 +
Smoking Prevalence (%) Age 80+ 2012 0.57 7.45331E-06 3.11246E-05 6.652121212 +
Smoking Prevalence (%) Age 80+ 1994 0.56 8.23553E-06 6.50367E-05 6.795151515 +
Smoking Prevalence (%) Age 80+ 2007 0.56 8.23553E-06 2.66895E-05 6.890909091 +
Smoking Prevalence (%) Age 75 2002 0.56 1.00428E-05 1.19239E-05 7.858484848 +
Smoking Prevalence (%) Age 80+ 2011 0.56 1.00428E-05 3.17879E-05 6.670606061 +
Smoking Prevalence (%) Age 80+ 2006 0.56 1.10835E-05 2.3874E-05 6.958181818 +
Smoking Prevalence (%) Age 80+ 2010 0.55 1.22271E-05 3.14422E-05 6.696666667 +
Ranking by one of the built-in statistics quickly shows that the top data columns largely reflect smoking prevalence among the elderly
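For readers unfamiliar with the KS statistic used for the ranking: it is the largest gap between the empirical cumulative distributions of the two groups, so 0 means the groups look identical and 1 means they do not overlap at all. A minimal sketch follows. The sign convention ("+" when Group 1 sits at higher values) is an assumption about what the "KS Sign" column means, and the p-value calculation is left out.

```python
def ks_two_sample(group1, group2):
    """Return (statistic, sign) for a two-sample KS comparison.

    statistic: largest gap between the two empirical CDFs.
    sign: "+" if group1 sits at higher values at that gap, else "-".
    """
    n1, n2 = len(group1), len(group2)
    best = 0.0  # signed gap: ECDF(group2) - ECDF(group1)
    for x in sorted(set(group1) | set(group2)):
        cdf1 = sum(v <= x for v in group1) / n1
        cdf2 = sum(v <= x for v in group2) / n2
        gap = cdf2 - cdf1
        if abs(gap) > abs(best):
            best = gap
    return abs(best), "+" if best >= 0 else "-"

# Made-up values: group 1 is entirely above group 2
stat, sign = ks_two_sample([4, 5, 6], [1, 2, 3])
```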
19. Pick a few years for the 80+ smoking prevalence to graph boxplots
Okay, so confirming insights: we’re looking at a subset of countries that have a high rate of smoking in the elderly.
Note that the upper arm group has a substantially lower rate
20. Other countries have high rates in the elderly, and within the lower arm group some have relatively low rates
So we’ve found a subpopulation
But that’s not the whole story
Country | Lower arm group | Smoking Prevalence (%) Age 80+ 2000 | Country | Lower arm group | Smoking Prevalence (%) Age 80+ 2000
Pakistan | no | 34 | Laos | yes | 29.4
Tonga | no | 25.2 | Myanmar | yes | 26.4
Kiribati | no | 24.4 | Namibia | yes | 23.3
Nepal | no | 23.8 | Bangladesh | yes | 21.8
Lebanon | no | 22.2 | Cambodia | yes | 20
Timor-Leste | no | 18.8 | Indonesia | yes | 18.1
Denmark | no | 17.1 | Federated States of Micronesia | yes | 17.6
Tunisia | no | 16.4 | Philippines | yes | 15.8
Jordan | no | 16.2 | Paraguay | yes | 14.5
Lesotho | no | 15.9 | Malawi | yes | 14.4
South Korea | no | 15.9 | Djibouti | yes | 14.3
Malaysia | no | 15.8 | Zimbabwe | yes | 13.7
Dominican Republic | no | 15 | Thailand | yes | 13
Vanuatu | no | 14.5 | Maldives | yes | 12.5
Palestine | no | 14.2 | Sri Lanka | yes | 11.2
Vietnam | no | 13.9 | Burkina Faso | yes | 11
Cyprus | no | 13.7 | Burundi | yes | 9.7
Samoa | no | 13.6 | Rwanda | yes | 8.7
Albania | no | 13.4 | Somalia | yes | 8.5
Mongolia | no | 13.1 | Ghana | yes | 7.9
South Africa | no | 13.1 | Jamaica | yes | 7.6
China | no | 13 | Guinea-Bissau | yes | 7.5
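The spread inside the lower arm group can be checked directly from the table. The sketch below takes the 22 lower arm values as listed on the slide and computes the numbers a boxplot would show. The quartile method is Python's default and is an assumption; a plotting tool may interpolate slightly differently.

```python
from statistics import median, quantiles

# Smoking Prevalence (%) Age 80+ 2000, lower arm group, as listed on the slide
lower_arm = {
    "Laos": 29.4, "Myanmar": 26.4, "Namibia": 23.3, "Bangladesh": 21.8,
    "Cambodia": 20, "Indonesia": 18.1, "Federated States of Micronesia": 17.6,
    "Philippines": 15.8, "Paraguay": 14.5, "Malawi": 14.4, "Djibouti": 14.3,
    "Zimbabwe": 13.7, "Thailand": 13, "Maldives": 12.5, "Sri Lanka": 11.2,
    "Burkina Faso": 11, "Burundi": 9.7, "Rwanda": 8.7, "Somalia": 8.5,
    "Ghana": 7.9, "Jamaica": 7.6, "Guinea-Bissau": 7.5,
}

values = sorted(lower_arm.values())
q1, _, q3 = quantiles(values, n=4)  # quartile cut points
summary = {
    "min": values[0],
    "q1": q1,
    "median": median(values),
    "q3": q3,
    "max": values[-1],
}
```

The group ranges from 7.5% to 29.4% with a median of about 14%, while Pakistan, outside the group, is at 34%. High elderly prevalence alone therefore does not define the group.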
21. • Many directions to go here
• In IRIS
• Persistence of group
• Co-occurrence with other annotations beyond “developing”
• Outside of IRIS
• Once you know a subgroup exists, statistical analyses
• Visualization techniques such as heatmaps
What are the characteristics that define that subpopulation?
22. Persistence (or not) of subgroup integrity across shapes and analyses
From this we can go back to the mathematical transformations used to make each set of shapes and find clues to what is driving this group to stay together in some shapes but not others
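One way to put a number on "stays together" is to ask, for each shape, what share of the group lands in a single connected piece. The sketch below is a hypothetical measure made up for illustration, not an IRIS feature. The shape names, sample IDs, and the `group_cohesion` function are all assumptions.

```python
def group_cohesion(group, components):
    """Largest share of the group found in any single connected piece of a shape."""
    group = set(group)
    return max(len(group & set(c)) for c in components) / len(group)

# Made-up example: each shape is a list of connected pieces (sets of sample IDs)
shapes = {
    "shape_a": [{1, 2, 3, 4}, {5, 6}],
    "shape_b": [{1, 2}, {3, 5}, {4, 6}],
}
group = {1, 2, 3, 4}
cohesion = {name: group_cohesion(group, pieces) for name, pieces in shapes.items()}
```

Shapes where the score drops are the ones whose transformations are worth a closer look.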
23. Overlay of different kinds of information
Comparison of developing country status suggests two groups we could compare to look for additional insights
Annualized rate of change between 1980-1996 is another annotation we could look into more
Developing = no
Developing = yes
Population
Ann rate of change 1980-96
24. Comparing the two developing world enriched groups
• Found differences in older-age smoking prevalence: lower arm group has higher rate
• We already knew that
• Also found differences in 10-year-old smoking prevalence: lower arm group has lower rate
• We didn’t know that…
25. 10 year old smoking prevalence
Panels: 1980, 1990, 2000, 2010
Smoking in kids is consistently low in the lower arm group. This suggests a direction for public health intervention in these countries; need to confirm the pattern and, if it holds, look at the transition from non-smoking to smoking and when that happens
26. Looking more closely at Annualized rate of change
Panels: Ann rate of change 1980-96, Ann rate of change 1996-2006, Ann rate of change 2006-2012, Ann rate of change 1980-2012
Suggestion that the lower arm group had a relatively smaller decrease in overall smoking rates in the 80s and 90s, but the rate of decrease began to pick up in the 2000s, relative to other countries
From a public health standpoint, now go back and ask what kinds of smoking cessation interventions were put in place in the 2000s
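For reference, an annualized rate of change turns a start value and an end value into an average percent change per year. The slides do not state which formula the IHME dataset uses. The sketch below uses the log-ratio form, which is one common definition and is an assumption here, as are the function name and the example numbers.

```python
from math import exp, log

def annualized_rate_of_change(start_value, end_value, start_year, end_year):
    """Average percent change per year, using the log-ratio definition."""
    return 100 * log(end_value / start_value) / (end_year - start_year)

# Made-up example: prevalence falling from 20% in 1980 at a steady 0.5% per year
prevalence_1980 = 20.0
prevalence_2012 = prevalence_1980 * exp(-0.005 * 32)
rate = annualized_rate_of_change(prevalence_1980, prevalence_2012, 1980, 2012)
```

A negative value means prevalence fell over the period. Computing the rate separately for 1980-96, 1996-2006, and 2006-2012 is what lets the slide compare early and late trends.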
Editor's Notes
Flow and other immunological data, genomic and transcriptomic data, medical and clinical data, personal monitoring data
RNAseq experiment; primary goal to identify
This made me think of the analogy of platonic ideals and real-world reflections. Big data is like the ideal, full of all kinds of meaning. Each different take comes from the same ideal and gives its own perspective on the underlying structure
IHME is over on 5th Avenue
I’ll be using the very technical terms, “Lower arm” and “Upper arm.”
Here are some initial potential insights. Equatorial countries, cluster in SE Asia, some other in Africa, developing countries.
Blue arrows denote developed, as opposed to developing, nations. Make the point that ahead of time, is it likely someone would have selected this group as being different from other developing nations with high smoking rates in the elderly?
Analogous to finding disease subsets: find patterns that you might not have automatically assumed were there.