Data Science, what even?!

Presented an abridged version of my "What is data science" talk at #websummit 2013.

This talk covers the required skill set as defined by Drew Conway's well-known Venn diagram, and outlines the Data Scientific Method introduced by Dr. Patil. It has two main parts; the second covers some of the packages and technologies we use, minus the storage part.

Published in: Technology, Education

Transcript

  • 1. Data Science?! what even...
  • 2. David Coallier @davidcoallier
  • 3. Data Scientist Engine Yard
  • 4. And I cook.. A lot.
  • 5. (n-1) items
  • 6. Adapting.
  • 7. Feedback.
  • 8. Indifference.
  • 9. Young mathematically inclined minds
  • 10. Young mathematically inclined minds We knew everything.
  • 11. First Bad Assumption.
  • 12. So we asked “experts”.
  • 13. Wrong Ingredients
  • 14. Bad Data
  • 15. Tasted like sh*t
  • 16. From Our Results We had questions.
  • 17. Found Expertise Not Online.
  • 18. Data Scientific Method
  • 19. Find a Question Your Hypothesis
  • 20. Current Data What do you have?
  • 21. Features & Tests Try it.
  • 22. Analyse Results Won’t be pretty.
  • 23. Conversation Framed. By. Data.
  • 24. But....
  • 25. Good Discussions Imply good data scientists
  • 26. Hacking Skills
  • 27. Hacking Skills Maths & Stats
  • 28. Hacking Skills Expertise Maths & Stats
  • 29. Hacking Skills Machine Learning Danger Zone!!! Expertise Research Maths & Stats
  • 30. Hacking Skills Data Science Expertise Maths & Stats
  • 31. Hacking Skills Danger Zone!!! Machine Learning Data Science Maths & Stats Expertise Research
  • 32. Business Don’t need an MBA
  • 33. In other words.
  • 34. 1. Hacking 2. Maths & Stats 3. Expertise
  • 35. Apply Method Data Scientific
  • 36. 1. Question 2. Current Data 3. Features/Tests 4. Analyse 5. Converse
  • 37. Find a Question Let’s imagine Github
  • 38. Upgrade Repos Affect users as little as possible
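  • 39. import csv
        # csv has no read(); open the file and pass it to csv.reader
        with open('repo1.csv', newline='') as f:
            content = list(csv.reader(f))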
  • 40. $f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
  • 41. Converse Present Findings
  • 42. Iterate Commits aren’t key.
  • 43. KPIs are key Indicators from experience
  • 44. Questions Super Important.
  • 45. Just test it..
  • 46. We are Human. Emotional Connection
  • 47. What next? Second Hypothesis.
  • 48. Focus on Data Relevant to your KPIs.
  • 49. Data gives you the what Humans give you the why
  • 50. Turn Information
  • 51. Into Actionable Insight
  • 52. Create Discussions Introspection Engines
  • 53. Seeing, Feeling it The brain sees.
  • 54. Not regressions
  • 55. Not p-values
  • 56. Not slopes
  • 57. Not F-statistics
  • 58. Not coefficients
  • 59. Question Data Not Visualisations.
  • 60. Toolbox What do we use?
  • 61. R Modeling, Testing, Prototyping
  • 62. RStudio The IDE
  • 63. lubridate and zoo Dealing with Dates...
  • 64. yy/mm/dd, mm/dd/yy, YYYY-mm-dd HH:MM:ss TZ, yy-mm-dd, 1363784094.513425, yy/mm, different timezone
  • 65. reshape2 Reshape your Data
  • 66. ggplot2 Visualise your Data
  • 67. RCurl, RJSONIO Find more Data
  • 68. Hmisc Miscellaneous useful functions
  • 69. forecast Can you guess?
  • 70. garch Generalized Autoregressive Conditional Heteroskedasticity
  • 71. quantmod Statistical Financial Trading
  • 72. library(quantmod)
        getSymbols('AAPL')
        barChart(AAPL)
        addMACD()
  • 73. xts Extensible Time Series
  • 74. igraph Study Networks
  • 75. maptools Read & View Maps
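  • 76. library(maps)  # map() comes from the maps package
        # scale rates to colour indices 1..16 so every value gets a colour
        map('state', region = row.names(USArrests),
            col = cm.colors(16)[floor(USArrests$Rape / max(USArrests$Rape) * 15) + 1],
            fill = TRUE)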
  • 77. Python Scientific Computing
  • 78. SciPy http://www.scipy.org
  • 79. scipy.stats
  • 80. scipy.stats Descriptive Statistics
  • 81. from scipy.stats import describe
        s = [1, 2, 1, 3, 4, 5]
        print(describe(s))
  • 82. scipy.stats Probability Distributions
  • 83. Example Poisson Distribution
  • 84. $f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
  • 85. from scipy.stats import poisson
        p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
  • 86. print(p.mean())
        print(p.sum())
        ...
  • 87. NumPy http://www.numpy.org/
  • 88. NumPy Linear Algebra
  • 89. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  • 90. import numpy as np
        x = np.array([[1, 0], [0, 1]])
        val, vec = np.linalg.eig(x)  # eig returns (eigenvalues, eigenvectors)
        np.linalg.eigvals(x)
  • 91. >>> np.linalg.eig(x)
        (array([ 1.,  1.]),
         array([[ 1.,  0.],
                [ 0.,  1.]]))
  • 92. Matplotlib Python Plotting
  • 93. statsmodels Advanced Statistics Modeling
  • 94. NLTK Natural Language Tool Kit
  • 95. scikit-learn Machine Learning
  • 96. from sklearn import tree
        X = [[0, 0], [1, 1]]
        Y = [0, 1]
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(X, Y)
        clf.predict([[2., 2.]])
        >>> array([1])
  • 97. PyBrain ... Machine Learning
  • 98. PyMC Bayesian Inference
  • 99. Pattern Web Mining for Python
  • 100. NetworkX Study Networks
  • 101. MILK: Machine Learning
  • 102. Pandas easy-to-use data structures
  • 103. import pandas as pd
         x = pd.DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
         print(x[x['age'] > 20].count())
         print(x[x['age'] > 20].mean())
  • 104. Python vs R? Different Purposes
  • 105. Dogfooding Data Scientific Method
  • 106. Original Question What is Data Science?
  • 107. Back to you For questioning
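
Slides 40 and 84 give the Poisson probability mass function. As a minimal sketch (not code from the talk), the formula can be evaluated directly with the Python standard library. The function name `poisson_pmf` is illustrative, and the rate of 2 is borrowed from the `poisson.pmf(..., 2)` call on slide 85.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the expected count is lam."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 2.0  # assumed rate, matching the second argument on slide 85
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```

Summing `poisson_pmf(k, lam)` over all k gives 1, which is a quick sanity check that the formula is a proper distribution.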

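
Slide 64 lists the date formats that make timestamps painful, and slide 63 names lubridate and zoo as the R tools for them. A rough Python counterpart, offered only as a sketch, might normalise a few of those formats to UTC. `parse_timestamp` and `FORMATS` are hypothetical names; the code assumes naive values are already in UTC and that yy/mm/dd is tried before mm/dd/yy.

```python
from datetime import datetime, timezone

# Candidate formats, tried in order. The order is an assumption: a string
# such as "03/04/13" fits both yy/mm/dd and mm/dd/yy, so the data source
# has to tell you which one is meant.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%y/%m/%d", "%m/%d/%y", "%y-%m-%d"]


def parse_timestamp(raw: str) -> datetime:
    """Return a UTC datetime for a Unix epoch string or one of FORMATS."""
    try:
        # Plain numbers are treated as seconds since the Unix epoch.
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            # Naive values are assumed to already be in UTC.
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")


for raw in ["13/03/20", "03/20/13", "13-03-20", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw).isoformat())
```

The ordering of `FORMATS` is the design decision that matters: ambiguous strings silently resolve to whichever format comes first, which is exactly the trap the slide is pointing at.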