Data Scientist 101 BI Dutch

Data Scientist 101:
How to become a Super Cruncher

“All truths are easy to understand once they are
discovered; the point is to discover them.”

The 4 “soft” C's of a Data Scientist

...and the 5 R's of 21st Century Literacy
⇨Reading
⇨wRiting
⇨aRithmetic
⇨pRobability
⇨R
Source: Joe Blitzstein, Harvard

"data scientists should take a page
from social scientists, who have a
long history of asking where the
data they're working with comes
from, what methods were used to
gather and analyze it, and what
cognitive biases they might bring to
its interpretation."
Kate Crawford, Microsoft Research/MIT

Wrong prediction due to extensive media attention & coverage

Data Science: whetting your appetite

The Data Science Venn Diagram
Source: Drew Conway, NYU
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Another way to look at things...

The nerdy approach...
Source: Hilary Mason, bit.ly

Data Scientists have more fun
Source: How to Engage and Retain Analytical Talent
By Elizabeth Craig, Jeanne G. Harris and Henry Egan
January 2010

How Do I Become A Data Scientist?
⇨ Learn about matrix factorizations
⇨ Learn about distributed computing
⇨ Learn about statistical analysis
⇨ Learn about optimization
⇨ Learn about machine learning
⇨ Learn about information retrieval
⇨ Learn about signal detection and estimation
⇨ Master algorithms and data structures
⇨ Practice
⇨ Study Engineering
Source: http://www.quora.com/Career-Advice/How-do-I-become-a-data-scientist

6 levels of expertise needed
⇨ Data wrangling
⇨ Statistics
⇨ Data mining
⇨ Visualization
⇨ Communication
⇨ Domain & Business Expertise
Data Science*
* a bit of programming skills doesn't hurt either

Programming Skills?
C
C++
PAL
Smalltalk
VB.Net
C#
SQL
LotusScript
VBScript
JavaScript
HTML
Delphi
(Java)
Python
R
Perl
Me “Them”
Prolog Octave
Ruby
SQL
Pascal

SQL Still Matters!
⇨ Big Data SQL
⇨ Hbase & Hive
⇨ Amazon Redshift
⇨ Cloudera Impala
⇨ HortonWorks Stinger
⇨ ...
Source: KDNuggets.com
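
To make the point concrete, here is a minimal sketch of the kind of aggregation question SQL answers in a single statement. It uses Python's built-in sqlite3 module with an in-memory database; the orders table, its columns and the sample rows are hypothetical and only serve as illustration. The same GROUP BY pattern carries over, with dialect differences, to the SQL-on-big-data engines listed above.

```python
import sqlite3

# In-memory database; the table name, columns and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 120.0),
        ("bob", 35.5),
        ("alice", 80.0),
        ("carol", 15.0),
        ("bob", 64.5),
    ],
)

# One declarative statement: number of orders and total revenue per customer,
# largest total first.
rows = conn.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()

for customer, n_orders, total in rows:
    print(f"{customer:<6} {n_orders:>2} {total:>8.2f}")

conn.close()
```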

New analytics->new infrastructure

Why you need (some) Statistics

Learning Statistics
⇨ Coursera.org
⇨ Statistics One
⇨ Passion Driven Statistics
⇨ Statistics: Making sense of Data

Essentially,
all models are wrong...
...but some are useful
George E.P. Box

Learning Data Mining
⇨ Coursera.org
⇨ Machine Learning
⇨ Neural Networks for Machine Learning
⇨ Kaggle.com
⇨ Kaggle In Class

Visualization is...
The conversion of any abstract data into a graphical format so the characteristics and
relationships of the data can be explored and analyzed.
⇨ Humans have the ability to analyze large amounts of information that is
presented visually
⇨ This is good for certain types of pattern and trend analysis
⇨ It’s often easy to detect outliers and unusual patterns
Useful for exploration, explanation, discovery, but not for automated system actions.
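
As a rough illustration of the outlier point, the sketch below draws one text bar per value, so no plotting library is needed. The function name text_bars and the daily_orders figures are hypothetical. In the raw list the odd value is easy to miss; as bars it stands out at a glance.

```python
def text_bars(values, width=40):
    """Return one line per value: the number followed by a bar of '#'
    characters scaled so the largest value gets `width` characters."""
    top = max(values)
    return [f"{v:>6.1f} " + "#" * round(width * v / top) for v in values]


# Hypothetical daily order counts; one day is clearly out of line.
daily_orders = [21.0, 19.5, 22.0, 20.5, 95.0, 21.5, 20.0]

for line in text_bars(daily_orders):
    print(line)
```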

How many 5's?
3435261241134352612203498723566
9623466620398652034095823450238
4560289567109238401645089630489
5769782364196873484

Again: how many 5's?
3435261241134352612203498723566
9623466620398652034095823450238
4560289567109238401645089630489
5769782364196873484
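
For anyone who would rather let the machine do the counting, here is a small sketch. It counts the 5's in the digit block above and also marks each 5 with brackets, a plain-text stand-in for the visual highlighting that makes them pop out. The helper names count_digit and highlight are made up for this example.

```python
# The four rows of digits from the slide, joined into one string.
DIGITS = (
    "3435261241134352612203498723566"
    "9623466620398652034095823450238"
    "4560289567109238401645089630489"
    "5769782364196873484"
)


def count_digit(digits: str, target: str) -> int:
    """Count how often the single character `target` occurs in `digits`."""
    return digits.count(target)


def highlight(digits: str, target: str) -> str:
    """Wrap every occurrence of `target` in brackets so it stands out."""
    return "".join(f"[{ch}]" if ch == target else ch for ch in digits)


print(count_digit(DIGITS, "5"))    # 10
print(highlight(DIGITS[:31], "5"))  # first row with the 5's marked
```

Highlighting does for the eye what count_digit does for the program: it turns a slow serial scan into something that is visible at once.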

Learning Visualization
⇨ Stephen Few classes ($$)
⇨ Alberto Cairo
⇨ Introduction to Data Journalism

Want to get your feet wet?
Tableau Public
http://www.tableausoftware.com/public/
SAS Visual Analytics
http://www.sas.com/software/visual-analytics

Where to go from here?
⇨ Read 'Competing on Analytics'
⇨ Move on to 'Data Analysis Using SQL and Excel'
⇨ Then buy 'Handbook of Statistical Analysis & Data Mining Applications'
⇨ Statistics for business:
⇨ http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm
⇨ Data Mining:
⇨ www.rapid-i.com (RapidMiner)
⇨ http://www.thearling.com
⇨ http://www.autonlab.org/tutorials/
⇨ For free textbooks, search www.scribd.com
⇨ Enter http://www.coursera.org

More Resources to Get You Started
Books:
⇨ Data Mining Techniques: For Marketing, Sales and Customer Support, Michael J. Berry and Gordon Linoff
⇨ Data Preparation for Data Mining, Dorian Pyle
⇨ Data Mining Algorithms, Eibe Frank, Ian Witten, Jim Gray
⇨ An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
⇨ Information Retrieval, C.J. van Rijsbergen
⇨ The Visual Display of Quantitative Information, Edward R. Tufte
Journals, Newsletters, Web Sites:
⇨ SIGKDD Explorations, Newsletter of the ACM SIG on Knowledge Discovery and Data Mining
⇨ IEEE Transactions on Pattern Analysis and Machine Intelligence
⇨ SAS Knowledge Exchange: www.sas.com/knowledge-exchange/business-analytics
⇨ KDNuggets data mining resources: www.kdnuggets.com
⇨ FlowingData, visualization resources: http://flowingdata.com/
⇨ Infoaesthetics, visual design resources: http://infosthetics.com/
⇨ Visual Complexity, visualization resources: www.visualcomplexity.com/vc/index.cfm
⇨ Recommendation systems resources: http://www.deitel.com/ResourceCenters/Web20/RecommenderSystems/tabid/1229/Default.aspx
⇨ The Impoverished Social Scientist's Guide to Free Statistical Software and Resources: http://maltman.hmdc.harvard.edu/socsci.shtml

Free Stuff So You Can Work Cheaply
⇨ WEKA http://www.cs.waikato.ac.nz/ml/weka/
⇨ IND decision tree software http://opensource.arc.nasa.gov/software/ind/
⇨ Clustering http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
⇨ Parallel Sets http://eagereyes.org/parallel-sets#download
⇨ RapidMiner http://rapid-i.com/content/blogcategory/38/69/
⇨ Knime http://www.knime.org/
⇨ Orange http://www.ailab.si/Orange/
⇨ R statistics software http://www.r-project.org/
⇨ ARC statistics software http://www.stat.umn.edu/arc/software.html
⇨ Octave numerical and matrix computation http://www.gnu.org/software/octave/
⇨ Processing http://www.processing.org/
⇨ Circos http://mkweb.bcgsc.ca/circos/
⇨ Treemap http://www.cs.umd.edu/hcil/treemap/
⇨ Many Eyes http://manyeyes.alphaworks.ibm.com/manyeyes/
⇨ Dutch Students: SAS & SPSS Academic Licenses (e.g. SurfSpot.nl)

Web: www.sas.com
Email: jos.vandongen<at>sas.com
Phone: +31-(0)6-10172008
Skype: tholis.jos
LinkedIn: jvdongen
Twitter: josvandongen
Delicious: jvdongen
Jos van Dongen
In BI since 1991
Principal Consultant @ SAS
Author/Speaker/Analyst
