The Artful Business of Data Mining: Computational Statistics with Open Source Tools

This talk goes over concepts of data mining and data analysis using open source tools, mainly Python and R, along with interesting libraries and the tools I have used and currently use at Engine Yard.

Published in: Technology

  1. The Artful Business of Data Mining: Computational Statistics with Open Source Tools (Wednesday 20 March 13)
  2. David Coallier, @davidcoallier
  3. Data Scientist at Engine Yard (.com)
  4. Find Data
  5. Clean Data
  6. Analyse Data?
  7. Analyse Data
  8. Question Data
  9. Report Findings
  10. Data Scientist
  11. Data Janitor
  12. Actual Tasks
  13. “If your model is elegant, it’s probably wrong”
  14. “The Times they are a-Changing” (Bob Dylan)
  15. Python & R
  16. SciPy: http://www.scipy.org
  17. scipy.stats
  18. scipy.stats: Descriptive Statistics
  19. from scipy.stats import describe
      s = [1, 2, 1, 3, 4, 5]
      # Summary statistics: count, min/max, mean, variance, skewness, kurtosis
      print(describe(s))
  20. scipy.stats: Probability Distributions
  21. Example: Poisson Distribution
  22. $f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
  23. from scipy.stats import poisson
      # Poisson probabilities for each k, with rate λ = 2
      p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
  24. print(p.mean())
      print(p.sum())
      ...
  25. NumPy: http://www.numpy.org/
  26. NumPy: Linear Algebra
  27. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  28. import numpy as np
      x = np.array([[1, 0], [0, 1]])
      # eig returns the eigenvalues first, then the eigenvectors
      val, vec = np.linalg.eig(x)
      np.linalg.eigvals(x)
  29. >>> np.linalg.eig(x)
      (array([ 1.,  1.]),
       array([[ 1.,  0.],
              [ 0.,  1.]]))
  30. Matplotlib: Python Plotting
  31. statsmodels: Advanced Statistics Modeling
  32. NLTK: Natural Language Toolkit
  33. scikit-learn: Machine Learning
  34. from sklearn import tree
      X = [[0, 0], [1, 1]]
      Y = [0, 1]
      clf = tree.DecisionTreeClassifier()
      clf = clf.fit(X, Y)
      clf.predict([[2., 2.]])  # array([1])
  35. PyBrain: ... Machine Learning
  36. PyMC: Bayesian Inference
  37. Pattern: Web Mining for Python
  38. NetworkX: Study Networks
  39. MILK: MOAR machine LEARNING!
  40. Pandas: easy-to-use data structures
  41. from pandas import DataFrame
      x = DataFrame([
          {"age": 26},
          {"age": 19},
          {"age": 21},
          {"age": 18}
      ])
      # Keep rows with age over 20, then count and average them
      print(x[x["age"] > 20].count())
      print(x[x["age"] > 20].mean())
  42. R
  43. RStudio: The IDE
  44. lubridate and zoo: Dealing with Dates...
  45. yy/mm/dd, mm/dd/yy, YYYY-mm-dd HH:MM:ss TZ, yy-mm-dd, 1363784094.513425, yy/mm, different timezone
  46. reshape2: Reshape your Data
  47. ggplot2: Visualise your Data
  48. RCurl, RJSONIO: Find more Data
  49. Hmisc: Miscellaneous useful functions
  50. forecast: Can you guess?
  51. garch and rugarch
  52. quantmod: Statistical Financial Trading
  53. xts: Extensible Time Series
  54. igraph: Study Networks
  55. maptools: Read & View Maps
  56. map("state", region = c(row.names(USArrests)), col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)], fill = TRUE)
  57. Storage
  58. Oppose “big” Data
  59. “Learn how to sample”
  60. Experiments
  61. What Do You Want to Answer?
  62. Understand Your Audience
  63. Scientific Reporting
  64. Busy-ness: Time is money
  65. Public Visualisation
  66. Best Visualisation, Bad Data
  67. Best Forecasting models... Bad Visualisation
  70. Seanchaí
  72. Feelit
  76. “Don’t be scared of bar charts.”
  77. Mathematical, Statistics, Engineering, Business, Economics, Curiosity
  78. davidcoallier.github.com, @davidcoallier on Twitter
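
Slide 45 lists the mix of date layouts that tends to turn up in a single data set. The slides reach for lubridate and zoo in R; as a rough Python counterpart, the sketch below tries a fixed list of layouts and normalises everything to UTC. The function name `parse_timestamp`, the order of `FORMATS`, and the choice to treat zone-less values as UTC are all assumptions made for illustration, not something the talk specifies. It also skips the partial `yy/mm` layout.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order (hypothetical list for illustration).
# The order matters: an ambiguous string such as "01/02/03" is read by the
# first layout that accepts it.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",  # YYYY-mm-dd HH:MM:ss TZ, with a numeric offset
    "%y/%m/%d",              # yy/mm/dd
    "%m/%d/%y",              # mm/dd/yy
    "%y-%m-%d",              # yy-mm-dd
]


def parse_timestamp(raw):
    """Return a timezone-aware UTC datetime, or raise ValueError."""
    text = raw.strip()

    # Epoch seconds such as 1363784094.513425 parse as a plain float.
    try:
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass

    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: values without a zone are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    raise ValueError("unrecognised date format: %r" % raw)
```

Calling `parse_timestamp("03/20/13")` and `parse_timestamp("13-03-20")` should both land on 20 March 2013. A real pipeline would want to record which layout matched, since silently guessing between `yy/mm/dd` and `mm/dd/yy` is exactly the kind of cleaning error the "Data Janitor" slides warn about.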
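
Slides 58 and 59 argue for sampling rather than processing everything. The talk does not say which sampling method it has in mind; one common option when the data arrives as a stream of unknown length is reservoir sampling, sketched here with only the standard library. The function name `reservoir_sample` and its `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1) by picking
            # a slot in [0, index] and replacing only if it falls inside
            # the reservoir.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

Because it makes a single pass and holds only `k` items in memory, something like `reservoir_sample(open_log_lines, 1000)` (where `open_log_lines` is any iterable you supply) can run over a file far larger than RAM.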