First: a question for you. Next slide: 4 datasets of x-y coordinates. The question is: are they drawn from the same distribution?
All summary statistics are the same => the datasets are drawn from the same distribution, right?
It's only when we plot the data that we see that the 4 datasets are in fact vastly different. We can see this in a fraction of a second, instantaneously. Known as Anscombe's quartet. Many of you are probably acquainted with it.
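To check the claim that the summary statistics agree, here is a minimal sketch in plain Python. It uses the standard published values of Anscombe's quartet; the function name `summarize` and the dictionary layout are illustrative choices, not part of the talk.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet.
# Datasets I-III share the same x values.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0] * 7 + [19.0] + [8.0] * 3,
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics shown on the slide for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # least-squares regression of y on x
    return {
        "n": len(x),
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),  # sample variance (n - 1 denominator)
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in quartet.items():
    stats = summarize(x, y)
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The four printed rows agree to about two decimals, which is exactly why the numbers alone cannot separate the datasets; a scatter plot of each pair does.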
Why do I show you this? It represents my epiphany.
2009: the genomics world was abuzz: 1000 Genomes (1kG) data became available for analysis. 8 institutes at the forefront of genomics research set out to identify a specific type of variation in the human genome (within the 1kG project).
1 year later: results, but very little overlap (even though same input data) => overlap ranged from 60% down to 1%.
Did the institutes make errors? No, but they used different assumptions about the data and different parameters.
Needed data visualization to find out what was going on; automated algorithms couldn't.
Visual analytics (Wikipedia): analytical reasoning facilitated by interactive visual interfaces.
Used in terrorism informatics, network security, ...
Integrating core human strengths in data analysis:
- pattern detection
- intuition
- prediction
- context
Next slide: example of pattern detection. Will show you a flash of blue dots. Is there a red one?
Pre-attentive vision => 50 milliseconds is enough; initiating an eye movement already takes 200 milliseconds. Converts a cognitive task into a perceptual one.
This talk: illustrate 2 strengths of visual analytics, where visualization adds real value to automated analysis.
Heading towards a data infarction: increasing distance between the domain expert and the output of automated analysis (e.g. bioinformatics) <= they use different languages + algorithms are too opaque and/or advanced to directly relate output to input => the expert needs to trust the information, but this is blind trust => black box.
Example: data filtering when no gold standard is available. Given an input dataset: how to find the optimal combination of filters to get the maximum number of true positives while minimizing the false positives and false negatives? This was the problem faced by the genomics community in 2009.
different combinations of filters and their parameters => different elements that pass all thresholds
state of the art : run different combinations of filter settings => take the intersection
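A minimal sketch of that intersection approach, assuming each candidate carries two quality metrics. The candidate names, the metrics `depth` and `quality`, and all thresholds are hypothetical values made up for illustration; they are not taken from the 1kG analysis.

```python
# Hypothetical candidates with two quality metrics (illustrative values only).
candidates = {
    "v1": {"depth": 35, "quality": 60},
    "v2": {"depth": 12, "quality": 55},
    "v3": {"depth": 40, "quality": 25},
    "v4": {"depth": 25, "quality": 45},
    "v5": {"depth": 32, "quality": 52},
    "v6": {"depth": 8, "quality": 70},
}

# Hypothetical filter settings, standing in for different groups' choices.
filter_settings = {
    "A": {"min_depth": 10, "min_quality": 50},
    "B": {"min_depth": 30, "min_quality": 20},
    "C": {"min_depth": 20, "min_quality": 40},
}


def passing(candidates, min_depth, min_quality):
    """Return the names of candidates that pass every threshold."""
    return {
        name
        for name, metrics in candidates.items()
        if metrics["depth"] >= min_depth and metrics["quality"] >= min_quality
    }


# One result set per combination of filter settings.
passed = {
    label: passing(candidates, **params)
    for label, params in filter_settings.items()
}

# The intersection keeps only what every setting agrees on.
consensus = set.intersection(*passed.values())
union = set.union(*passed.values())
overlap = len(consensus) / len(union)

for label in sorted(passed):
    print(label, sorted(passed[label]))
print("consensus:", sorted(consensus), "overlap: {:.0%}".format(overlap))
```

In this toy example each setting passes three candidates, yet only two survive the intersection. Whether the dropped ones were true positives cannot be read off the consensus set alone.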
But this is what we should have found => visualization of data streams can shed light into that black box.
Second strength of visual analytics. Jim Gray (Microsoft), a couple of years ago: article on the different paradigms of doing scientific research through the ages.
Thousands of years ago: first paradigm, concerned with describing natural phenomena.
Hundreds of years ago: second paradigm. Kepler & Newton: theoretical approach: define laws, generalizations.
Last couple of decades: third paradigm, modeling and simulation ("computational biology").
Now: fourth paradigm, big data. Key difference: data first, hypothesis later.
Given 2 interaction networks (a gene network vs. the network of functions in the Linux operating system): which is which? How do these differ? Can calculate connectivity, average vertex degree, global and local complexity, ... Where should we start to look? What are the hypotheses to test?
Martin Krzywinski: if we constrain nodes to 3 axes, based only on whether a node is a source and/or a target of links => start to see patterns. Why is the proportion of nodes on the green axis much bigger in one network?
If we normalize these axes to 100%: some additional patterns become clear => look for things that we can investigate. Left: small number of nodes on the yellow axis linked to many nodes on green. Right: the other way around => what is special about this small set of nodes?
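The axis assignment and normalization can be sketched as follows. The edge list is a hypothetical toy network, the axis names are placeholders (the notes do not say which colour maps to which role), and ordering nodes along an axis by degree is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical directed edges (source, target); illustrative only.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("e", "d")]


def assign_axes(edges):
    """Place each node on one of 3 axes by its role in the links."""
    sources = {s for s, _ in edges}
    targets = {t for _, t in edges}
    axes = {"source_only": set(), "target_only": set(), "both": set()}
    for node in sources | targets:
        if node in sources and node in targets:
            axes["both"].add(node)
        elif node in sources:
            axes["source_only"].add(node)
        else:
            axes["target_only"].add(node)
    return axes


def normalized_positions(edges, axes):
    """Position nodes on [0, 1] along their axis, so every axis has equal length.

    Assumption: nodes are ordered by degree (ties broken by name).
    """
    degree = defaultdict(int)
    for s, t in edges:
        degree[s] += 1
        degree[t] += 1
    positions = {}
    for nodes in axes.values():
        ranked = sorted(nodes, key=lambda n: (degree[n], n))
        for i, node in enumerate(ranked):
            positions[node] = i / (len(ranked) - 1) if len(ranked) > 1 else 0.0
    return positions


axes = assign_axes(edges)
total = sum(len(nodes) for nodes in axes.values())
shares = {axis: len(nodes) / total for axis, nodes in axes.items()}
positions = normalized_positions(edges, axes)

print(shares)     # proportion of nodes per axis, before normalization
print(positions)  # position of each node along its normalized axis
```

Comparing `shares` between two networks corresponds to the first question in the notes (why one axis holds a larger proportion of nodes); the normalized positions are what make link patterns between axes comparable regardless of axis size.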
Have illustrated only 2 use cases of visual analytics, and hopefully whetted your appetite.
Call to action: put the human back in the loop. It's by combining human and algorithm strengths that we can tackle the onslaught of data:
- effectively
- efficiently
What we need to do is detect the expected, discover the unexpected.
Humanizing Data Analysis
Jan Aerts
Bioinformatics, ESAT/SCD, University of Leuven
Future Health Department, iMinds
       I              II             III             IV
   x      y       x      y       x      y       x      y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

n = 11
mean x = 9.0        mean y = 7.5
variance x = 11.0   variance y = 4.12
correlation x & y = 0.816
regression line: y = 3 + 0.5x