Detecting novel associations in large data sets

Paper presentation:
D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

Published in: Technology, Education
  1. CS-GN-TEAM: internal presentation
     Detecting novel associations in large data sets
     Michele Filannino + You
     Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
     Manchester, 05/03/2012
  2. where we are
     presentation | my research | taster project
     05/03/2012, Michele Filannino
  3. Introduction
  4. novel association
     ■ two variables, X and Y, are associated if there is a relationship between them
       ● functional
       ● non-functional
     ■ novel: unknown
  5. example
             f0      f1      f2       f3       f4      f5
     s0    4.00   -0.76    5.00    12.00     8.22    1.83
     s1    9.00    0.41   10.00    23.00    27.12    4.30
     s2    3.00    0.14    4.00     0.00     0.56   -0.43
     s3   10.00   -0.54   11.00   100.00    94.02    6.24
     s4    5.00   -0.96    6.00    45.00    39.25    3.56
     s5    2.00    0.91    3.00   123.00   125.73    2.97
     s6    7.00    0.66    8.00     4.00     9.26    2.56
     s7    8.00    0.99    9.00    -2.00     6.90    2.37
     s8    1.00    0.84    2.00    36.00    37.68    1.58
     s9    6.00   -0.28    7.00     0.00    -1.96    0.71
     Data set 10x6
  6. example: the 10x6 data set of slide 5, shown again
  7. scatter plot: f0 vs. f2
     f2(x) = f0(x) + 1
  8. example: the 10x6 data set of slide 5, shown again
  9. scatter plot: f0 vs. f1
     no relation
  10. correlation coefficients
              Pearson   Mutual Information   MI norm.
      f0-f5      0.63                 2.45       0.74
      f0-f1     -0.17                 1.57       0.47
      f0-f2      1.00                 3.32       1.00
      f2-f3     -0.08                 3.12       0.94
      f0-f3     -0.08                 3.12       0.94
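The Pearson column and two rows of the mutual information column can be checked by hand. The sketch below is Python 3 and makes one loud assumption: it treats every distinct value as its own category when computing mutual information. That choice reproduces the f0-f2 and f0-f3 rows (3.32 and 3.12 bits, normalised by log2(10)), but the slides do not say how MI was actually computed, and the f0-f1 and f0-f5 rows imply some other discretisation. The function names are illustrative only.

```python
import math
from collections import Counter

# Columns f0, f2 and f3 of the 10x6 example data set (slide 5).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def entropy(values):
    """Shannon entropy in bits; each distinct value is one category (assumption)."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


for name, a, b in (("f0-f2", f0, f2), ("f0-f3", f0, f3)):
    mi = mutual_information(a, b)
    # Normalise by log2(n), the largest value MI can take with n samples.
    print(name, round(pearson(a, b), 2), round(mi, 2), round(mi / math.log2(len(a)), 2))
```

f0-f3 is the interesting row: Pearson is close to zero while the categorical MI is high, simply because almost every value is unique. That is the bias towards high-arity features listed on the next slide.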
  11. pros. & cons.
      ■ Pearson’s coeff.
        ✔ closed interval result
        ✖ only linear relations
        ✖ feature independency
      ■ Mutual Information
        ✔ non linear relations
        ✖ only categorical data
        ✖ biased towards higher arity features
  12. the new measure
  13. motivations
      ■ generality:
        ● capture a wide range of interesting associations, not limited to specific function types
      ■ equitability:
        ● give similar scores to equally noisy relationships of different types
  14. definition of MIC
      ■ Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins
      ■ We obtain a pair of partitions called an x-by-y grid
      D = (F0, F1)
      F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
      F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
  15. x-by-y grid
      [figure: a 2-by-4 grid]
  16. definition of MIC
      ■ given a grid G, we can calculate D|G, the frequency distribution induced by the points in D on the cells of G
        ● different grids G result in different distributions D|G
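To make D|G concrete, the sketch below (Python 3) places the ten (f0, f1) pairs of the example data set on the grid given by the F0 and F1 partitions of slide 14, then computes the mutual information of the induced distribution. The cut points are an assumption: they are midpoints between neighbouring bins, and any cut between the same two values yields the same grid. The normalisation by log2 min(x, y) follows the cited paper.

```python
import math
from bisect import bisect_right
from collections import Counter

# Columns f0 and f1 of the 10x6 example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

# Cut points between the bins of slide 14 (midpoints: an assumption).
x_cuts = [3.5, 6.5]                    # F0: 3 columns
y_cuts = [-0.65, -0.07, 0.275, 0.535]  # F1: 5 rows


def cell(x, y):
    """Return the (column, row) index of the grid cell containing (x, y)."""
    return bisect_right(x_cuts, x), bisect_right(y_cuts, y)


# D|G: fraction of the points of D that fall into each cell of G.
counts = Counter(cell(x, y) for x, y in zip(f0, f1))
n = len(f0)
dist = {c: k / n for c, k in counts.items()}


def mutual_information(dist):
    """Mutual information (bits) of a joint distribution over grid cells."""
    px, py = Counter(), Counter()
    for (i, j), p in dist.items():
        px[i] += p
        py[j] += p
    return sum(p * math.log2(p / (px[i] * py[j])) for (i, j), p in dist.items())


mi = mutual_information(dist)
normalised = mi / math.log2(min(len(x_cuts) + 1, len(y_cuts) + 1))
print(sorted(counts.items()))
print(mi, normalised)
```

Moving a single cut point changes `counts`, and with it the mutual information, which is why the definition goes on to maximise over grids.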
  17. maximal MI over all grids
      [formula; annotations: number of columns, number of rows]
  18. characteristic matrix
      [formula; annotations: infinite matrix!, normalisation factor (derived from the MI definition)]
  19. Maximal Information Coeff.
      [formula; annotation: max grid size]
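The formulas on slides 17 to 19 were images and did not survive in this transcript. For reference, the corresponding definitions in Reshef et al. (2011) can be written as follows, with D, G and D|G as on slides 14 and 16 and n = |D|. The symbols I^* and M(D) are the paper's notation, not recovered from the slides.

```latex
% Maximal mutual information over all grids with x columns and y rows
I^{*}(D, x, y) = \max_{G \,:\, x \text{ columns},\ y \text{ rows}} I(D|_{G})

% Characteristic matrix: one entry per grid resolution (x, y),
% normalised so that every entry lies in [0, 1]
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}

% Maximal Information Coefficient: largest entry among grids of size below B(n)
\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```

The denominator is the normalisation factor annotated on slide 18, and B(n) is the maximum grid size annotated on slide 19.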
  20. matrix computation
      ■ space of grids grows exponentially
        ● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1
      ■ approximation of MIC
        ● heuristic dynamic programming
  21. MIC summary
      ✔ closed interval result
      ✔ non linear relations
      ✔ all types of data
      ✖ B(n) is crucial
        ✖ too high: non-zero scores even for random data
        ✖ too low: we are searching only for simple patterns
      ✖ still univariate
  22. B(n) behaviour [figure]
  23. B(n) behaviour [figure]
  24. how to use it
  25. python
          # Python 2 syntax, as on the slide
          import xstats.MINE as MINE

          x = [40,50,None,70,80,90,100,110,120,130,140,150,
               160,170,180,190,200,210,220,230,240,250,260]
          y = [-0.07,-0.23,-0.1,0.03,-0.04,None,-0.28,-0.44,
               -0.09,0.12,0.06,-0.04,0.31,0.59,0.34,-0.28,-0.09,
               -0.44,0.31,0.03,0.57,0,0.01]

          print "x y", MINE.analyze_pair(x, y)
      https://github.com/ajmazurie/xstats.MINE
  26. python: result
          {MCN: 2.5849625999999999,
           MAS: 0.040419996,
           pearson: 0.31553724,
           MIC: 0.38196000000000002,
           MEV: 0.27117000000000002,
           non_linearity: 0.28239626000000001}
  27. correlation coefficients
              Pearson   Mutual Information   MI norm.    MIC
      f0-f5      0.63                 2.45       0.74   0.24
      f0-f1     -0.17                 1.57       0.47   0.24
      f0-f2      1.00                 3.32       1.00   1.00
      f2-f3     -0.08                 3.12       0.94   0.24
      f0-f3     -0.08                 3.12       0.94   0.24
      [the slide's "graph" column held plots that are not preserved here]
  28. MIC summary
      ✔ closed interval result
      ✔ non linear relations
      ✔ all types of data
      ✖ B(n) is crucial
        ✖ n is too low!
      ✖ still univariate
  29. python
          # Python 2 syntax, as on the slide
          import xstats.MINE as MINE
          import math

          x = [n*0.01 for n in range(1,2000)]
          y = [math.sin(n) for n in x]

          result = MINE.analyze_pair(x, y)
          print "MIC:", result['MIC']
          print "Pearson:", result['pearson']
      output:
          MIC: 0.99999
          Pearson: -0.16366038
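The sine experiment can be approximated without xstats.MINE. The sketch below is Python 3 and is not the paper's heuristic dynamic programming: it swaps in a much cruder search that tries only equal-frequency grids, so it can only underestimate MIC. It assumes there are no tied values, that B(n) = n^0.6, and that grids satisfy xy < B(n); all function names are hypothetical.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Bin index (0..k-1) per value so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def grid_mi(xb, yb):
    """Mutual information (bits) of the distribution induced on the grid."""
    n = len(xb)
    joint, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mic_equal_frequency(x, y, alpha=0.6):
    """Crude lower bound on MIC: best normalised MI over equal-frequency grids."""
    budget = len(x) ** alpha  # B(n), the maximum grid size
    best = 0.0
    for cols in range(2, int(budget // 2) + 1):
        xb = equal_frequency_bins(x, cols)
        rows = 2
        while cols * rows < budget:
            yb = equal_frequency_bins(y, rows)
            best = max(best, grid_mi(xb, yb) / math.log2(min(cols, rows)))
            rows += 1
    return best


def pearson(xs, ys):
    """Pearson's correlation coefficient, for contrast."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs) *
                           sum((b - my) ** 2 for b in ys))


x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]
print("MIC (equal-frequency lower bound):", mic_equal_frequency(x, y))
print("Pearson:", pearson(x, y))
```

Even this restricted search scores the sine wave highly, while Pearson stays close to zero, which is the contrast the slide is making.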
  30. conclusion
  31. relationship types [figure; source: paper]
  32. relationship types [figure; source: paper]
  33. real application [figure; source: paper]
  34. suggestions
      ■ use MIC only when you have lots of samples
        ● samples > 2000
      ■ use B(n) = n^0.6
      ■ don’t use it for all the possible pairs of features
        ● it is slower than Pearson’s correlation coefficient or Mutual Information
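A quick calculation shows why the sample-size advice matters. The sketch below (Python 3) evaluates B(n) = n^0.6 for the 10-row example data set and for the suggested 2000 samples; the helper name `max_grid_size` is illustrative only.

```python
def max_grid_size(n, alpha=0.6):
    """B(n) = n**alpha: upper limit on the number of cells (x * y) of a grid."""
    return n ** alpha


for n in (10, 2000):
    print(n, round(max_grid_size(n), 1))
```

With 10 samples the limit is about 4 cells, so hardly any grid can be tried, which fits the "n is too low!" remark on slide 28. With 2000 samples the limit is roughly 95 cells.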
  35. Thank you.
  36. references
      ■ D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
      ■ D. N. Reshef et al., “Supporting Online Material for Detecting Novel Associations in Large Data Sets.”
