Detecting novel associations in large data sets


Paper presentation:
D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.


  1. 1. CS-GN-TEAM: internal presentation detecting novel associations in large data sets Michele Filannino + You Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011. Manchester, 05/03/2012
  2. 2. presentation my research taster project where we are 05/03/2012, Michele Filannino 2 / 36
  3. 3. Introduction
  4. 4. novel association
      ■ two variables, X and Y, are associated if there is a relationship between them
        ● functional
        ● non-functional
      ■ novel: previously unknown
  5. 5. example: data set 10x6 (samples s0-s9, features f0-f5)
              f0      f1      f2      f3      f4      f5
      s0    4.00   -0.76    5.00   12.00    8.22    1.83
      s1    9.00    0.41   10.00   23.00   27.12    4.30
      s2    3.00    0.14    4.00    0.00    0.56   -0.43
      s3   10.00   -0.54   11.00  100.00   94.02    6.24
      s4    5.00   -0.96    6.00   45.00   39.25    3.56
      s5    2.00    0.91    3.00  123.00  125.73    2.97
      s6    7.00    0.66    8.00    4.00    9.26    2.56
      s7    8.00    0.99    9.00   -2.00    6.90    2.37
      s8    1.00    0.84    2.00   36.00   37.68    1.58
      s9    6.00   -0.28    7.00    0.00   -1.96    0.71
  6. 6. example: data set 10x6 (same table as slide 5)
  7. 7. scatter plot: f0 vs. f2
      ■ f2(x) = f0(x) + 1
  8. 8. example: data set 10x6 (same table as slide 5)
  9. 9. scatter plot: f0 vs. f1
      ■ no relation
  10. 10. correlation coefficients
               Pearson   Mutual Infor.   MI norm.
      f0-f5       0.63            2.45       0.74
      f0-f1      -0.17            1.57       0.47
      f0-f2       1.00            3.32       1.00
      f2-f3      -0.08            3.12       0.94
      f0-f3      -0.08            3.12       0.94
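The Pearson column can be checked directly against the example data set. The block below is a minimal sketch, not the presenter's code: the helper name `pearson` is an assumption introduced here, and only the columns f0 to f3 from the 10x6 table are used.

```python
from math import sqrt

# Columns f0-f3 of the 10x6 example data set (samples s0-s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


for name, other in (("f0-f1", f1), ("f0-f2", f2), ("f0-f3", f3)):
    print(name, round(pearson(f0, other), 2))
```

Rounded to two decimals, the three pairs give -0.17, 1.0 and -0.08, matching the Pearson column of the table.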
  11. 11. pros & cons
      ■ Pearson’s coeff.
        ✔ closed interval result
        ✖ only linear relations
        ✖ feature independency
      ■ Mutual Information
        ✔ non-linear relations
        ✖ only categorical data
        ✖ biased towards higher arity features
  12. 12. the new measure
  13. 13. motivations
      ■ generality:
        ● capture a wide range of interesting associations, not limited to specific function types
      ■ equitability:
        ● give similar scores to equally noisy relationships of different types
  14. 14. definition of MIC
      ■ Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins
      ■ We obtain a pair of partitions called an x-by-y grid
      D = (F0, F1)
      F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
      F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
  15. 15. x-by-y grid
      [figure: 2-by-4 grid]
  16. 16. definition of MIC
      ■ given a grid G, we can compute D|G, the frequency distribution induced by the points in D on the cells of G
        ● different grids G result in different distributions D|G
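The distribution D|G and its mutual information can be sketched in a few lines. This is an illustrative sketch only: the function names `grid_distribution` and `mutual_information` are assumptions, and the numeric cut points are midpoints chosen here to reproduce the F0 and F1 partitions shown on the definition slide.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def grid_distribution(points, x_cuts, y_cuts):
    """Return D|G: the fraction of points in each (column, row) cell."""
    counts = Counter(
        (bisect_right(x_cuts, x), bisect_right(y_cuts, y)) for x, y in points
    )
    n = len(points)
    return {cell: c / n for cell, c in counts.items()}


def mutual_information(dist):
    """Mutual information (in bits) of a joint distribution over grid cells."""
    p_col, p_row = Counter(), Counter()
    for (col, row), p in dist.items():
        p_col[col] += p
        p_row[row] += p
    return sum(p * log2(p / (p_col[c] * p_row[r])) for (c, r), p in dist.items())


# f0 and f1 from the example data set, paired sample by sample.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
points = list(zip(f0, f1))

# Assumed cut points: midpoints between the bins listed for F0 and F1.
x_cuts = [3.5, 6.5]
y_cuts = [-0.65, 0.0, 0.3, 0.5]

dist = grid_distribution(points, x_cuts, y_cuts)
print(sorted(dist.items()))
print("MI on this grid (bits):", mutual_information(dist))
```

Choosing different cut points changes `dist` and therefore the mutual information, which is why MIC maximises over grids rather than fixing one.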
  17. 17. maximal MI over all grids
      $I^*(D, x, y) = \max_G I(D|_G)$, where the maximum is taken over all grids $G$ with $x$ columns and $y$ rows
      ■ $x$: number of columns
      ■ $y$: number of rows
  18. 18. characteristic matrix
      $M(D)_{x,y} = \frac{I^*(D, x, y)}{\log \min\{x, y\}}$, where $I^*(D, x, y)$ is the maximal MI over all x-by-y grids
      ■ infinite matrix!
      ■ $\log \min\{x, y\}$: normalisation factor (derived from the MI definition)
  19. 19. Maximal Information Coefficient (MIC)
      $\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}$, where $M(D)$ is the characteristic matrix and $n$ is the number of samples
      ■ $B(n)$: max grid size
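To make the definition concrete, the sketch below computes MIC by exhaustive search over grids, following the paper's definition: the maximum, over grid shapes with x * y < B(n), of the best mutual information on an x-by-y grid divided by log2(min(x, y)). It is not the heuristic dynamic programme used in practice and is only feasible for tiny samples. The names `binnings` and `brute_force_mic` are assumptions, and B(n) = n ** 0.6 is taken as the default.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(cols, rows):
    """MI in bits between two lists of bin indices (one entry per point)."""
    n = len(cols)
    joint, pc, pr = Counter(zip(cols, rows)), Counter(cols), Counter(rows)
    return sum(k / n * log2(k * n / (pc[c] * pr[r])) for (c, r), k in joint.items())


def binnings(values, bins):
    """All ways to split the sorted distinct values into `bins` contiguous bins."""
    distinct = sorted(set(values))
    result = []
    for cuts in combinations(distinct[1:], bins - 1):
        # A value lands in bin i when exactly i cut points are <= it.
        result.append([sum(v >= cut for cut in cuts) for v in values])
    return result


def brute_force_mic(xs, ys, alpha=0.6):
    """Exhaustive MIC for tiny samples, with B(n) = n ** alpha."""
    budget = len(xs) ** alpha
    best = 0.0
    for x in range(2, int(budget) + 1):
        for y in range(2, int(budget) + 1):
            if x * y >= budget:
                break  # larger y only increases x * y
            x_bins, y_bins = binnings(xs, x), binnings(ys, y)
            i_star = max(
                (mutual_information(c, r) for c in x_bins for r in y_bins),
                default=0.0,
            )
            best = max(best, i_star / log2(min(x, y)))
    return best


xs = list(range(20))
print(brute_force_mic(xs, [2 * v + 1 for v in xs]))  # noiseless linear relation
```

Under this strict reading, a sample of 10 points has B(10) below 4, so no 2-by-2 grid is admissible and the function returns 0.0; actual implementations may handle such small samples differently.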
  20. 20. matrix computation
      ■ space of grids grows exponentially
        ● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1
      ■ approximation of MIC
        ● heuristic dynamic programming
  21. 21. MIC summary
      ✔ closed interval result
      ✔ non-linear relations
      ✔ all types of data
      ✖ B(n) is crucial
        ✖ too high: non-zero scores even for random data
        ✖ too low: only simple patterns are searched for
      ✖ still univariate
  22. 22. B(n) behaviour [figure]
  23. 23. B(n) behaviour [figure]
  24. 24. how to use it
  25. 25. python
      import xstats.MINE as MINE
      # paired observations, exactly as given on the slide (including the None entries)
      x = [40, 50, None, 70, 80, 90, 100, 110, 120, 130, 140, 150,
           160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260]
      y = [-0.07, -0.23, -0.1, 0.03, -0.04, None, -0.28, -0.44,
           -0.09, 0.12, 0.06, -0.04, 0.31, 0.59, 0.34, -0.28, -0.09,
           -0.44, 0.31, 0.03, 0.57, 0, 0.01]
      # analyze_pair returns a dict of MINE statistics (see the next slide)
      print("x y %s" % (MINE.analyze_pair(x, y),))
      https://github.com/ajmazurie/xstats.MINE
  26. 26. python: result
      {'MCN': 2.5849625999999999,
       'MAS': 0.040419996,
       'pearson': 0.31553724,
       'MIC': 0.38196000000000002,
       'MEV': 0.27117000000000002,
       'non_linearity': 0.28239626000000001}
  27. 27. correlation coefficients
               Pearson   Mutual Informat.   MI norm.    MIC
      f0-f5       0.63               2.45       0.74   0.24
      f0-f1      -0.17               1.57       0.47   0.24
      f0-f2       1.00               3.32       1.00   1.00
      f2-f3      -0.08               3.12       0.94   0.24
      f0-f3      -0.08               3.12       0.94   0.24
      [the "graph" column of the slide showed a plot for each pair]
  28. 28. MIC summary
      ✔ closed interval result
      ✔ non-linear relations
      ✔ all types of data
      ✖ B(n) is crucial
      ✖ n is too low!
      ✖ still univariate
  29. 29. python
      import math
      import xstats.MINE as MINE
      # sample sin(x) at x = 0.01, 0.02, ..., 19.99
      x = [n * 0.01 for n in range(1, 2000)]
      y = [math.sin(v) for v in x]
      result = MINE.analyze_pair(x, y)
      print("MIC: %s" % result['MIC'])
      print("Pearson: %s" % result['pearson'])
      # output reported on the slide:
      # MIC: 0.99999
      # Pearson: -0.16366038
  30. 30. conclusion
  31. 31. relationship types [figure; source: paper]
  32. 32. relationship types [figure; source: paper]
  33. 33. real application [figure; source: paper]
  34. 34. suggestions
      ■ use MIC only when you have lots of samples
        ● samples > 2000
      ■ use B(n) = n^0.6
      ■ don’t use it for all the possible pairs of features
        ● it is slower than Pearson’s correlation coefficient or Mutual Information
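The effect of the sample size on the grid budget can be illustrated with a short sketch. The helper names `max_grid_size` and `admissible_shapes` are assumptions introduced here; the code evaluates B(n) = n^0.6 and lists the grid shapes (x, y) with at least two bins per axis that satisfy x * y < B(n).

```python
def max_grid_size(n, alpha=0.6):
    """B(n) = n ** alpha, the upper bound on the number of grid cells x * y."""
    return n ** alpha


def admissible_shapes(n, alpha=0.6):
    """All (x, y) with x, y >= 2 and x * y < B(n)."""
    b = max_grid_size(n, alpha)
    limit = int(b) + 1
    return [(x, y) for x in range(2, limit) for y in range(2, limit) if x * y < b]


for n in (10, 100, 2000):
    print(n, round(max_grid_size(n), 1), len(admissible_shapes(n)))
```

With n = 10 the budget is just under 4, so not even a 2-by-2 grid qualifies, whereas n = 2000 allows grids of up to roughly 95 cells.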
  35. 35. Thank you.
  36. 36. references
      ■ D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
      ■ D. N. Reshef et al., “Supporting Online Material for Detecting Novel Associations in Large Data Sets.”
