Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  


  1. Science Data Acquisition Machines ToolBox Conclusion. The Hitch-Hacker’s Guide to Data Science ... or what I wish I’d known when I was younger. Jaroslav Vážný, Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz, 3 April 2014. Jaroslav Vážný: Practical approach
  2. 1 Science, 2 Data Acquisition, 3 Machines, 4 ToolBox, 5 Conclusion
  3. What is Science? "The whole of science is nothing more than a refinement of everyday thinking." (Albert Einstein)
  4. More than Science: Mistakes/Feedback; No pain no gain; Pain == gain?; Everything is hard until someone makes it easy
  5. MOOC == new era? https://www.khanacademy.org/ https://www.coursera.org/ https://www.udacity.com/ https://www.edx.org/
  6. Reproducibility: http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://nbviewer.ipython.org/ http://pdos.csail.mit.edu/scigen/ ;-)
  7. We are all humans
  8. We are all humans/animals
  9. We are all humans/animals/idiots
  10. Probability: Test your intuition! Roll a die. You got 6 five times. What is P(6)? Monty Hall problem. Show examples in IPython!
  11. Bayes’s theorem. Suppose the probability (for anyone) of having AIDS is P(AIDS) = 0.001, so P(no AIDS) = 0.999. Consider an AIDS test whose result is + or -: P(+|AIDS) = 0.98, P(-|AIDS) = 0.02, P(+|no AIDS) = 0.03, P(-|no AIDS) = 0.97.
  12. Bayes’s theorem solution

      $$P(\mathrm{AIDS} \mid +) = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \text{no AIDS})\,P(\text{no AIDS})} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999} \approx 0.032$$

      Your viewpoint: my degree of belief that I have AIDS is 3.2%. Your doctor’s viewpoint: 3.2% of people like this will have AIDS.
  13. We are all humans/animals/idiots/liars
  14. Data Avalanche? Large Synoptic Survey Telescope: 20 TB per night; 60 PB for the raw data (after 10 years); 15 PB for the catalog database; the total data volume after processing will be several hundred PB. CERN: 1 PB per day.
  15. Sloan Digital Sky Survey. Why is it important? Lots of data (>10^6 objects); perfect documentation; tools to access the data. Where can I learn it? http://www.sdss3.org/
  16. Virtual Observatory. Why is it important? Uniform access to astronomy data; based on Web standards; many tools with VO support (Topcat, Aladin, Tapsh). Where can I learn it? http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work
  17. What is: Machine Learning (Data astrology); Data Mining; Artificial Intelligence
  18. Supervised Machine Learning (diagram: Supervised Learning Model). Training: text, documents, images, etc. → feature vectors, together with labels → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → expected label.
  19. Overfit/underfit
  20. Unsupervised Machine Learning (diagram: Unsupervised Learning Model). Training: text, documents, images, etc. → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → likelihood, cluster ID, or better representation.
  21. Star spectrum
  22. Example of feature extraction
  23. Example: Decision Tree

          ug <= 0.663668
          |   gr <= -0.191208: 1 (7.0)
          |   gr > -0.191208: 3 (104.0/5.0)
          ug > 0.663668
          |   ri <= 0.285854: 1 (88.0/5.0)
          |   ri > 0.285854
          |   |   ri <= 0.314657
          |   |   |   gr <= 0.692108: 2 (6.0)
          |   |   |   gr > 0.692108: 1 (3.0)
          |   |   ri > 0.314657: 2 (90.0/2.0)
  24. Example: Support Vector Machine
  25. Data exploration: http://ipython.org/ http://scikit-learn.org/stable/ http://pandas.pydata.org/
  26. Development: https://github.com/ Tests. Funny hat. https://www.python.org/
  27. References: http://ipython.org/ http://www.greenteapress.com/thinkstats/ http://www.greenteapress.com/thinkpython/ http://scikit-learn.org/stable/ http://pandas.pydata.org/ http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://www.galaxyzoo.org/ http://www.planethunters.org/ http://www.sdss3.org/
  28. Discussion
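
Slide 10 asks to test intuition on the Monty Hall problem and to show examples in IPython. The slides do not include the code, so the following is only a minimal simulation sketch. The function names `monty_hall_trial` and `win_rate`, the trial count, and the seed are illustrative assumptions, not taken from the talk.

```python
import random


def monty_hall_trial(switch, rng):
    """Play one game with three doors; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # door hiding the car
    pick = rng.choice(doors)   # player's first choice
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the single door that is still closed and not yet picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the winning probability by repeated simulation."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    wins = sum(monty_hall_trial(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

The estimates should land close to 1/3 for staying and 2/3 for switching, which is the usual counter-intuitive result.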

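
The Bayes computation on slides 11 and 12 can be checked numerically. This is a small sketch using only the probabilities stated on the slides; the function name `posterior` and its parameter names are hypothetical.

```python
def posterior(prior, p_pos_given_cond, p_pos_given_no_cond):
    """Bayes's theorem for a positive test: P(condition | +)."""
    # Total probability of a positive result (the denominator on slide 12).
    evidence = p_pos_given_cond * prior + p_pos_given_no_cond * (1 - prior)
    return p_pos_given_cond * prior / evidence


# Values from slide 11: P(AIDS) = 0.001, P(+|AIDS) = 0.98, P(+|no AIDS) = 0.03.
p = posterior(prior=0.001, p_pos_given_cond=0.98, p_pos_given_no_cond=0.03)
print(f"P(AIDS | +) = {p:.3f}")  # prints "P(AIDS | +) = 0.032"
```

The result is small because the prior is small: most positive results come from the 3% false-positive rate applied to the 99.9% of people without the disease.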
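
The decision tree on slide 23 can be read as nested threshold tests. The sketch below transcribes it by hand into a plain function, with thresholds copied from the slide. The function name `classify` is hypothetical. The slide does not say what `ug`, `gr`, and `ri` are or what the class labels 1, 2, 3 stand for; treating the inputs as three numeric features is an assumption.

```python
def classify(ug, gr, ri):
    """Return the class label (1, 2, or 3) that the slide-23 tree assigns."""
    if ug <= 0.663668:
        # Left branch: decided by gr alone.
        return 1 if gr <= -0.191208 else 3
    # Right branch (ug > 0.663668): decided mostly by ri.
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        # Narrow ri band where gr breaks the tie.
        return 2 if gr <= 0.692108 else 1
    return 2


# Hypothetical feature values, chosen only to exercise different leaves.
print(classify(ug=0.5, gr=0.0, ri=0.0))  # left branch, gr above threshold
print(classify(ug=1.0, gr=0.5, ri=0.4))  # right branch, ri above 0.314657
```

Writing the tree out this way shows how few comparisons a trained tree needs per object, which is part of why trees are easy to inspect.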