Humanizing data analysis
Jan Aerts
Bioinformatics, ESAT/SCD, University of Leuven
Future Health Department, iMinds
Humanizing Data Analysis

My SuperMinds talk at the iMinds 2013 event in Mechelen

  • First, a question for you. The next slide shows four datasets of x-y coordinates. The question is: are they drawn from the same distribution?
  • All summary statistics are the same, so the four datasets are drawn from the same distribution, right?
  • It is only when we draw the data that we see that the four datasets are in fact vastly different. We can see this in a fraction of a second. This is known as Anscombe's quartet, which many of you are probably acquainted with.
  • Why do I show you this? It represents my epiphany in 2009. The genomics world was abuzz: the 1000 Genomes (1kG) data had become available for analysis. Eight institutes at the forefront of genomics research set out to identify a specific type of variation in the human genome within the 1kG project. One year later there were results, but with very little overlap, even though everyone started from the same input data: the overlap ranged from 60% down to 1%. Did the institutes make errors? No, but they used different assumptions about the data and different parameters. We needed data visualization to find out what was going on; automated algorithms couldn't.
  • Wikipedia describes visual analytics as analytical reasoning facilitated by interactive visual interfaces. It is used in terrorism informatics, network security, ... It integrates core human strengths into data analysis: pattern detection, intuition, prediction and context. The next slide is an example of pattern detection: I will show you a flash of blue dots. Is there a red one?
  • This is pre-attentive vision: 50 milliseconds is enough, whereas just initiating an eye movement already takes 200 milliseconds. Visualization converts a cognitive task into a perceptive one. In this talk I will illustrate two strengths of visual analytics where visualization adds real value to automated analysis.
  • We are heading towards a data infarction: the distance between the domain expert and the output of automated analysis (e.g. bioinformatics) keeps increasing. The two sides use different languages, and the algorithms are too opaque and/or advanced to relate the output directly to the input. The expert needs to trust the information, but this is blind trust: a black box.
  • Example: data filtering when no gold standard is available. Given an input dataset, how do we find the optimal combination of filters that yields the maximum number of true positives while minimizing the false positives and false negatives? This was the problem faced by the genomics community in 2009.
  • Different combinations of filters and their parameters let different elements pass all thresholds.
  • The state of the art is to run different combinations of filter settings and take the intersection.
  • But this is what we should have found. Visualization of the data streams can shed light into that black box.
  • The second strength of visual analytics. An article by Jim Gray (Microsoft) from a couple of years ago describes the different paradigms of doing scientific research through the ages.
  • Thousands of years ago: the first paradigm, concerned with describing natural phenomena.
  • Hundreds of years ago: the second paradigm. With Kepler and Newton came the theoretical approach: defining laws and generalizations.
  • The last couple of decades: the third paradigm, modeling and simulation ("computational biology").
  • Now: big data. The key difference: data first, hypothesis later.
  • Given two interaction networks (a gene network and the network of functions in the Linux operating system), which is which? How do they differ? We can calculate connectivity, average vertex degree, global and local complexity, ... But where should we start to look? What are the hypotheses to test?
  • Martin Krzywinski: if we constrain the nodes to three axes, based only on whether a node is a source and/or a target of links, we start to see patterns. Why is the proportion of nodes on the green axis much bigger in one network?
  • If we normalize these axes to 100%, some additional patterns become clear, and we can look for things to investigate. Left: a small number of nodes on the yellow axis is linked to many nodes on the green axis. Right: the other way around. What is special about this small set of nodes?
  • I have illustrated only two use cases of visual analytics, and hopefully whetted your appetite. My call to action: put the human back in the loop. It is by combining human and algorithm strengths that we can tackle the onslaught of data effectively and efficiently. What we need to do is detect the expected and discover the unexpected.
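
The filtering problem from the black-box notes can be made concrete with a small sketch. Everything in it is an assumption for illustration: the candidate names, the three score fields (`depth`, `quality`, `support`) and the threshold values are made up and do not come from the 1kG analysis. It shows how three filter settings (A, B, C) let different elements through, how the intersection shrinks the result, and how recording the first rejecting filter per element gives a little visibility into the pipeline.

```python
# Hypothetical candidates, each with three made-up scores.
candidates = {
    "v1": {"depth": 30, "quality": 50, "support": 8},
    "v2": {"depth": 25, "quality": 35, "support": 9},
    "v3": {"depth": 15, "quality": 60, "support": 7},
    "v4": {"depth": 40, "quality": 45, "support": 3},
    "v5": {"depth": 10, "quality": 20, "support": 2},
}

# Three hypothetical combinations of filter thresholds (minimum values).
settings = {
    "A": {"depth": 20, "quality": 40, "support": 5},
    "B": {"depth": 20, "quality": 30, "support": 5},
    "C": {"depth": 10, "quality": 40, "support": 2},
}


def first_rejection(scores, thresholds):
    """Return the name of the first filter that rejects, or None if all pass."""
    for name, minimum in thresholds.items():
        if scores[name] < minimum:
            return name
    return None


def passing(candidates, thresholds):
    """Return the set of candidates that pass every filter."""
    return {
        label
        for label, scores in candidates.items()
        if first_rejection(scores, thresholds) is None
    }


outputs = {label: passing(candidates, th) for label, th in settings.items()}
consensus = set.intersection(*outputs.values())

for label, result in outputs.items():
    print(label, sorted(result))
print("intersection:", sorted(consensus))

# Where does each candidate drop out under setting A?
for label, scores in candidates.items():
    print(label, "rejected by", first_rejection(scores, settings["A"]))
```

Taking only the intersection keeps the elements every setting agrees on, but it says nothing about why the others were dropped. The per-filter rejection record is the kind of intermediate data stream that a visualization could expose.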
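
The three-axis layout described in the network notes can be sketched in a few lines. This is only an illustration under stated assumptions, not Martin Krzywinski's implementation: the edge list is made up, the axis names `source`, `target` and `both` are chosen here, and ordering nodes along an axis by total degree is one possible choice that the notes do not specify.

```python
from collections import Counter

# Hypothetical directed edge list (source, target); not data from the talk.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "e")]


def assign_axes(edges):
    """Place each node on one of three axes by its role in the links."""
    out_deg = Counter(s for s, _ in edges)
    in_deg = Counter(t for _, t in edges)
    axes = {"source": [], "target": [], "both": []}
    for node in sorted(set(out_deg) | set(in_deg)):
        if out_deg[node] and in_deg[node]:
            axes["both"].append(node)
        elif out_deg[node]:
            axes["source"].append(node)
        else:
            axes["target"].append(node)
    return axes, out_deg + in_deg


def normalized_positions(axes, degree):
    """Rank nodes on each axis by total degree and scale every axis to (0, 1]."""
    positions = {}
    for nodes in axes.values():
        ranked = sorted(nodes, key=lambda n: (degree[n], n))
        for rank, node in enumerate(ranked, start=1):
            positions[node] = rank / len(ranked)
    return positions


axes, degree = assign_axes(edges)
total = sum(len(nodes) for nodes in axes.values())
proportions = {axis: len(nodes) / total for axis, nodes in axes.items()}
positions = normalized_positions(axes, degree)

print(axes)
print(proportions)
print(positions)
```

Comparing `proportions` between two networks corresponds to the first question in the notes (why is one axis much fuller in one network?). Scaling each axis to the same length corresponds to the normalization step, after which a short axis linked to many nodes on another axis stands out.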

    1. Humanizing data analysis. Jan Aerts. Bioinformatics, ESAT/SCD, University of Leuven; Future Health Department, iMinds
    2. Four datasets of x-y coordinates:

              I              II             III             IV
          x      y       x      y       x      y       x      y
        10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
         8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
        13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
         9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
        11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
        14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
         6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
         4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
        12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
         7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
         5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.80

        n = 11
        mean x = 9.0, variance x = 11.0
        mean y = 7.5, variance y = 4.12
        correlation x & y = 0.816
        regression line: y = 3 + 0.5x
    3. visual analytics
    4. cognitive task => perceptive task
    5. Opening the black box
    6. Diagram: input; filter 1, filter 2, filter 3; output A, output B, output C
    7. A B C
    8. A B C
    9. A B C
    10. Generating hypotheses
    11. wallpaperweb.org
    12. put the human back in the loop!
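
The claim on slide 2, that the four datasets share the same summary statistics, can be checked with a short sketch. The values below are transcribed from the slide as shown above; the function name `summarize` and the dictionary layout are choices made here for illustration. The statistics are computed with sample variance (n - 1 in the denominator) and an ordinary least-squares line, which is assumed to be what the slide reports.

```python
from statistics import mean, variance

# x values shared by datasets I, II and III, as transcribed from slide 2.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.80]),
}


def summarize(xs, ys):
    """Return the summary statistics listed on the slide for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "n": len(xs),
        "mean_x": mx,
        "var_x": variance(xs),
        "mean_y": my,
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The printed numbers agree to about two decimal places across all four datasets, which is exactly why the statistics alone cannot tell them apart and a plot can.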
