Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Transcript

  • 1. Science Data Acquisition Machines ToolBox Conclusion The Hitch-Hacker’s Guide to Data Science ... or what I wish I’d known when I was younger Jaroslav Vážný Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz 3 April 2014 Jaroslav Vážný Practical approach
  • 2. Contents: (1) Science, (2) Data Acquisition, (3) Machines, (4) ToolBox, (5) Conclusion
  • 3. What is Science? “The whole of science is nothing more than a refinement of everyday thinking.” (Albert Einstein)
  • 4. More than Science: Mistakes/Feedback. No pain no gain. Pain == gain? Everything is hard until someone makes it easy.
  • 5. MOOC == new era? https://www.khanacademy.org/ https://www.coursera.org/ https://www.udacity.com/ https://www.edx.org/
  • 6. Reproducibility http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://nbviewer.ipython.org/ http://pdos.csail.mit.edu/scigen/ ;-)
  • 7. We are all humans
  • 8. We are all humans/animals
  • 9. We are all humans/animals/idiots
  • 10. Probability: Test your intuition! Roll a die 5 times and get a 6 every time. What is P(6)? Monty Hall problem. Show examples in IPython!
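The slide asks for IPython examples but the transcript does not include them. The following is a minimal sketch of one way to test intuition about the Monty Hall problem by simulation. The function names, the number of trials, and the seed are illustrative assumptions, not taken from the talk.

```python
import random


def play_monty_hall(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is still closed.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=42):
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)  # seeded (assumption) so runs are repeatable
    wins = sum(play_monty_hall(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

The estimates should land close to 1/3 for staying and 2/3 for switching, which is the result most people find counterintuitive.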
  • 11. Bayes’s theorem. Suppose the probability (for anyone) to have AIDS is P(AIDS) = 0.001, so P(no AIDS) = 0.999. Consider an AIDS test whose result is + or -: P(+|AIDS) = 0.98, P(-|AIDS) = 0.02, P(+|no AIDS) = 0.03, P(-|no AIDS) = 0.97.
  • 12. Bayes’s theorem solution: $P(\mathrm{AIDS} \mid +) = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \mathrm{no\ AIDS})\,P(\mathrm{no\ AIDS})} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999} \approx 0.032$. Your viewpoint: my degree of belief that I have AIDS is 3.2%. Your doctor’s viewpoint: 3.2% of people like this will have AIDS.
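The arithmetic on the two Bayes slides can be checked in a few lines. This is a small sketch using only the numbers given on the slides; the function and parameter names are illustrative assumptions.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Bayes's theorem: probability of disease given a positive test."""
    # Total probability of a positive result (true positives + false positives).
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / evidence


# Numbers from the slides: P(AIDS) = 0.001, P(+|AIDS) = 0.98, P(+|no AIDS) = 0.03.
p = posterior(prior=0.001, p_pos_given_disease=0.98, p_pos_given_healthy=0.03)
print(f"P(AIDS|+) = {p:.3f}")  # prints P(AIDS|+) = 0.032
```

The result is small because false positives from the large healthy population (0.03 × 0.999) far outnumber true positives (0.98 × 0.001).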
  • 13. We are all humans/animals/idiots/liars
  • 14. Data Avalanche? Large Synoptic Survey Telescope: 20 TB per night; 60 PB for the raw data (after 10 years); 15 PB for the catalog database; the total data volume after processing will be several hundred PB. CERN: 1 PB per day.
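As a rough consistency check of the LSST figures on this slide, the sketch below relates the nightly rate to the ten-year raw total. The decimal conversion (1 PB = 1000 TB) and the variable names are assumptions; the slide itself does not state how many nights per year are observed.

```python
TB_PER_NIGHT = 20   # from the slide
RAW_TOTAL_PB = 60   # from the slide, after 10 years
YEARS = 10          # from the slide
TB_PER_PB = 1000    # assumption: decimal units

# Nights per year implied by the slide's own numbers.
nights_per_year = RAW_TOTAL_PB * TB_PER_PB / (TB_PER_NIGHT * YEARS)
print(nights_per_year)  # 300.0

# Upper bound if data were taken on every single night.
every_night_pb = TB_PER_NIGHT * 365 * YEARS / TB_PER_PB
print(every_night_pb)  # 73.0
```

Under these assumptions the 60 PB figure corresponds to about 300 observing nights per year, slightly below the 73 PB that observing every night would give.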
  • 15. Sloan Digital Sky Survey. Why is it important? Lots of data (>10^6 objects). Perfect documentation. Tools to access the data. Where can I learn it? http://www.sdss3.org/
  • 16. Virtual Observatory. Why is it important? Uniform access to astronomy data. Based on Web standards. Many tools with VO support (Topcat, Aladin, Tapsh). Where can I learn it? http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work
  • 17. What is Machine Learning? (Data astrology) Data Mining, Artificial Intelligence
  • 18. Supervised Machine Learning (diagram: Supervised Learning Model). Training: text, documents, images, etc. plus labels → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → expected label.
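The supervised workflow in the diagram (labelled feature vectors in, predictive model out, then a label for a new feature vector) can be sketched without any external library. The nearest-centroid classifier below is a stand-in chosen for brevity, not the algorithm used in the talk, and the toy data and function names are made up for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Training step: average the feature vectors that share a label."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model, vector):
    """Prediction step: return the label whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], vector))


# Toy training set (hypothetical): two features per object, two classes.
train_features = [[1.0, 1.2], [0.8, 1.0], [4.0, 4.2], [4.3, 3.9]]
train_labels = ["A", "A", "B", "B"]

model = fit_centroids(train_features, train_labels)
print(predict(model, [0.9, 1.1]))
print(predict(model, [4.1, 4.0]))
```

With scikit-learn, which the talk lists in its toolbox, the same two steps map onto an estimator's `fit` and `predict` methods.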
  • 19. Overfit/underfit
  • 20. Unsupervised Machine Learning (diagram: Unsupervised Learning Model). Training: text, documents, images, etc. → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → likelihood, cluster ID, or better representation.
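For the unsupervised case there are no labels, and the model returns something like a cluster ID. The following is a minimal k-means sketch in plain Python. K-means is an illustrative choice rather than the method shown in the talk, and the toy points, starting centroids, and function names are assumptions.

```python
import math


def nearest(centroids, point):
    """Return the index (cluster ID) of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(centroids, point)].append(point)
        centroids = [
            # Keep the old centroid if a cluster ends up empty.
            [sum(column) / len(cluster) for column in zip(*cluster)] if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Toy data (hypothetical): unlabelled two-dimensional feature vectors.
points = [[1.0, 1.2], [0.8, 1.0], [4.0, 4.2], [4.3, 3.9]]
centroids = kmeans(points, centroids=[points[0], points[2]])

print(nearest(centroids, [0.9, 1.0]))  # cluster ID for a new feature vector
print(nearest(centroids, [4.2, 4.0]))
```

The starting centroids are fixed here to keep the run deterministic; practical implementations usually pick them at random and repeat the fit several times.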
  • 21. Star spectrum
  • 22. Example of feature extraction
  • 23. Example: Decision Tree
        ug <= 0.663668
        |   gr <= -0.191208: 1 (7.0)
        |   gr > -0.191208: 3 (104.0/5.0)
        ug > 0.663668
        |   ri <= 0.285854: 1 (88.0/5.0)
        |   ri > 0.285854
        |   |   ri <= 0.314657
        |   |   |   gr <= 0.692108: 2 (6.0)
        |   |   |   gr > 0.692108: 1 (3.0)
        |   |   ri > 0.314657: 2 (90.0/2.0)
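The tree on this slide can be read as nested threshold tests. The sketch below transcribes it into a Python function using the thresholds and class numbers exactly as printed. The function name is made up, and the meaning of `ug`, `gr`, `ri` (possibly colour indices) and of classes 1 to 3 is not stated in the transcript, so they are treated as opaque inputs and outputs here.

```python
def classify(ug, gr, ri):
    """Return the class (1, 2 or 3) the slide's decision tree assigns."""
    if ug <= 0.663668:
        # Left branch: decided by gr alone.
        return 1 if gr <= -0.191208 else 3
    # Right branch (ug > 0.663668): decided by ri, then gr in a narrow band.
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        return 2 if gr <= 0.692108 else 1
    return 2


# Example inputs (hypothetical values, chosen to hit different leaves).
print(classify(ug=0.5, gr=0.0, ri=0.0))   # left branch, gr above threshold
print(classify(ug=1.0, gr=0.5, ri=0.4))   # right branch, large ri
```

Each leaf in the printed tree also carries counts such as (104.0/5.0); these are kept in the transcript above but play no role in prediction.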
  • 24. Example: Support Vector Machine
  • 25. Data exploration http://ipython.org/ http://scikit-learn.org/stable/ http://pandas.pydata.org/
  • 26. Development https://github.com/ Tests Funny hat https://www.python.org/
  • 27. References http://ipython.org/ http://www.greenteapress.com/thinkstats/ http://www.greenteapress.com/thinkpython/ http://scikit-learn.org/stable/ http://pandas.pydata.org/ http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://www.galaxyzoo.org/ http://www.planethunters.org/ http://www.sdss3.org/
  • 28. Discussion
