Data Scientist 101 BI Dutch

Slides for my 30 minute 'keynote' during the June 2013 BI Dutch session


  1. Data Scientist 101: How to become a Super Cruncher
  2. “All truths are easy to understand once they are discovered; the point is to discover them.”
  3. The 4 “soft” C's of a Data Scientist
  4. ...and the 5 R's of 21st Century Literacy ⇨ Reading ⇨ wRiting ⇨ aRithmetic ⇨ pRobability ⇨ R. Source: Joe Blitzstein, Harvard
  5. "data scientists should take a page from social scientists, who have a long history of asking where the data they're working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation." Kate Crawford, Microsoft Research/MIT
  6. Wrong prediction due to extensive media attention & coverage
  7. Data Science: whetting your appetite
  8. The Data Science Venn Diagram. Source: Drew Conway, NYU http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  9. Another way to look at things...
  10. The nerdy approach... Source: Hilary Mason, bit.ly
  11. Data Scientists have more fun. Source: 'How to Engage and Retain Analytical Talent' by Elizabeth Craig, Jeanne G. Harris and Henry Egan, January 2010
  12. How Do I Become A Data Scientist?
      ⇨ Learn about matrix factorizations
      ⇨ Learn about distributed computing
      ⇨ Learn about statistical analysis
      ⇨ Learn about optimization
      ⇨ Learn about machine learning
      ⇨ Learn about information retrieval
      ⇨ Learn about signal detection and estimation
      ⇨ Master algorithms and data structures
      ⇨ Practice
      ⇨ Study Engineering
      Source: http://www.quora.com/Career-Advice/How-do-I-become-a-data-scientist
  13. 6 levels of expertise needed. Data Science*: Data wrangling, Statistics, Data mining, Visualization, Communication, Domain & Business Expertise (* a bit of programming skill doesn't hurt either)
  14. Programming Skills? C C++ PAL Smalltalk VB.Net C# SQL LotusScript VBScript JavaScript HTML Delphi (Java) Python R Perl Me “Them” Prolog Octave Ruby SQL Pascal
  15. SQL Still Matters! ⇨ Big Data SQL ⇨ HBase & Hive ⇨ Amazon Redshift ⇨ Cloudera Impala ⇨ Hortonworks Stinger ⇨ ... Source: KDNuggets.com
  16. How about Technology?
  17. New analytics -> new infrastructure
  18. The Analytics Landscape
  19. Why you need (some) Statistics
  20. Correlation != Causation
  21. Learning Statistics ⇨ Coursera.org ⇨ Statistics One ⇨ Passion Driven Statistics ⇨ Statistics: Making sense of Data
  22. “Essentially, all models are wrong... but some are useful” George E.P. Box
  23. Learning Data Mining ⇨ Coursera.org ⇨ Machine Learning ⇨ Neural Networks for Machine Learning ⇨ Kaggle.com ⇨ Kaggle In Class
  24. Visualization
  25. Visualization is... The conversion of any abstract data into a graphical format so the characteristics and relationships of the data can be explored and analyzed. ⇨ Humans have the ability to analyze large amounts of information that is presented visually ⇨ This is good for certain types of pattern and trend analysis ⇨ It’s often easy to detect outliers and unusual patterns. Useful for exploration, explanation, discovery, but not for automated system actions.
  26. How many 5's? 3435261241134352612203498723566 9623466620398652034095823450238 4560289567109238401645089630489 5769782364196873484
  27. Again: how many 5's? 3435261241134352612203498723566 9623466620398652034095823450238 4560289567109238401645089630489 5769782364196873484
  28. Learning Visualization ⇨ Stephen Few classes ($$) ⇨ Alberto Cairo ⇨ Introduction to Data Journalism
  29. Want to get your feet wet? Tableau Public: http://www.tableausoftware.com/public/ SAS Visual Analytics: http://www.sas.com/software/visual-analytics
  30. Where to go from here?
      ⇨ Read 'Competing on Analytics'
      ⇨ Move on to 'Data Analysis Using SQL and Excel'
      ⇨ Then buy 'Handbook of Statistical Analysis & Data Mining Applications'
      ⇨ Statistics for business:
         ⇨ http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm
      ⇨ Data Mining:
         ⇨ www.rapid-i.com (RapidMiner)
         ⇨ http://www.thearling.com
         ⇨ http://www.autonlab.org/tutorials/
      ⇨ For free text books, search www.scribd.com
      ⇨ Enter http://www.coursera.org
  31. More Resources to Get You Started
      Books:
      ⇨ Data Mining Techniques: For Marketing, Sales and Customer Support, Michael J.A. Berry and Gordon Linoff
      ⇨ Data Preparation for Data Mining, Dorian Pyle
      ⇨ Data Mining Algorithms, Eibe Frank, Ian Witten, Jim Gray
      ⇨ An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
      ⇨ Information Retrieval, C.J. van Rijsbergen
      ⇨ The Visual Display of Quantitative Information, Edward R. Tufte
      Journals, Newsletters, Web Sites:
      ⇨ SIGKDD Explorations, Newsletter of the ACM SIG on Knowledge Discovery and Data Mining
      ⇨ IEEE Transactions on Pattern Analysis and Machine Intelligence
      ⇨ SAS Knowledge Exchange: www.sas.com/knowledge-exchange/business-analytics
      ⇨ KDNuggets data mining resources: www.kdnuggets.com
      ⇨ FlowingData, visualization resources: http://flowingdata.com/
      ⇨ Infoaesthetics, visual design resources: http://infosthetics.com/
      ⇨ VisualComplexity, visualization resources: www.visualcomplexity.com/vc/index.cfm
      ⇨ Recommendation systems resources: http://www.deitel.com/ResourceCenters/Web20/RecommenderSystems/tabid/1229/Default.aspx
      ⇨ The Impoverished Social Scientist's Guide to Free Statistical Software and Resources: http://maltman.hmdc.harvard.edu/socsci.shtml
  32. Free Stuff So You Can Work Cheaply
      ⇨ WEKA http://www.cs.waikato.ac.nz/ml/weka/
      ⇨ IND decision tree software http://opensource.arc.nasa.gov/software/ind/
      ⇨ Clustering http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
      ⇨ Parallel Sets http://eagereyes.org/parallel-sets#download
      ⇨ RapidMiner http://rapid-i.com/content/blogcategory/38/69/
      ⇨ KNIME http://www.knime.org/
      ⇨ Orange http://www.ailab.si/Orange/
      ⇨ R statistics software http://www.r-project.org/
      ⇨ ARC statistics software http://www.stat.umn.edu/arc/software.html
      ⇨ Octave numerical and matrix computation http://www.gnu.org/software/octave/
      ⇨ Processing http://www.processing.org/
      ⇨ Circos http://mkweb.bcgsc.ca/circos/
      ⇨ Treemap http://www.cs.umd.edu/hcil/treemap/
      ⇨ Many Eyes http://manyeyes.alphaworks.ibm.com/manyeyes/
      ⇨ Dutch Students: SAS & SPSS Academic Licenses (e.g. SurfSpot.nl)
  33. Jos van Dongen. In BI since 1991. Principal Consultant @ SAS. Author/Speaker/Analyst. Web: www.sas.com; Email: jos.vandongen<at>sas.com; Phone: +31-(0)6-10172008; Skype: tholis.jos; LinkedIn: jvdongen; Twitter: josvandongen; Delicious: jvdongen
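
Slides 26 and 27 ask the audience to count the 5's in a block of digits by eye. As a cross-check, the same count can be done mechanically. The following is a minimal sketch in Python, not part of the talk: the digit block is copied from the slide transcript, and the names `DIGITS` and `count_digit` are illustrative assumptions.

```python
from collections import Counter

# Digit block copied from slides 26 and 27 of the transcript.
# Line breaks are only for readability; non-digit characters are ignored.
DIGITS = """
3435261241134352612203498723566
9623466620398652034095823450238
4560289567109238401645089630489
5769782364196873484
"""


def count_digit(text: str, digit: str) -> int:
    """Return how often `digit` occurs in `text`, ignoring non-digit characters."""
    counts = Counter(ch for ch in text if ch.isdigit())
    # Counter returns 0 for keys it has never seen, so no KeyError here.
    return counts[digit]


if __name__ == "__main__":
    print(f"Number of 5's: {count_digit(DIGITS, '5')}")
```

A program gets the answer without any visual help. A human reader scanning the raw block cannot do so nearly as quickly, which appears to be the case these two slides make for visual encoding.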
