DECODING DATA SCIENCE
Matt Fornito
Director of Analytics
OpsVision Solutions: Big Data/Cloud Consulting Firm
@MattFornito
BIG DATA
“Big Data is the simple yet seemingly
revolutionary belief that data are
valuable… I believe that ‘big’ actually
means important.”
-Sean Patrick Murphy
BIG DATA
➤ There is a persistent assumption that every organization has ‘big data’
and needs solutions to manage it
➤ More realistically, data counts as ‘big’ when it strains both
(1) storage and (2) memory.
➤ Storage: Big Data is not easily stored on a single hard drive (or a single
computer with multiple hard drives).
➤ Memory: Big Data cannot be processed in the memory of a typical machine,
e.g. 100,000,000 rows by 100 variables is 10 billion values, or roughly 80 GB
at 8 bytes per value, which for the most part cannot be analyzed with
4-32 GB of memory.
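A rough check of the memory example above can be sketched as follows. It assumes a dense table in which every value is stored as an 8-byte float; the slide does not say how values are stored, and the helper name `table_size_gb` is only illustrative.

```python
# Rough memory estimate for a dense numeric table.
# Assumption (not from the slide): every value is an 8-byte (64-bit) float.
BYTES_PER_VALUE = 8


def table_size_gb(rows: int, columns: int, bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Return the in-memory size of a dense rows x columns table in GB (10^9 bytes)."""
    return rows * columns * bytes_per_value / 1e9


size_gb = table_size_gb(rows=100_000_000, columns=100)
print(f"Estimated size: {size_gb:.0f} GB")  # prints "Estimated size: 80 GB"

# Compare against the 4-32 GB range mentioned on the slide.
for ram_gb in (4, 32):
    print(f"Fits in {ram_gb} GB of RAM: {size_gb <= ram_gb}")
```

Narrower data types or sparse storage would shrink the estimate, so this is an upper-end sketch rather than a rule.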
BIG DATA
➤ Hadoop & AWS
➤ Small/Mid-size Organizations: AWS is often cheaper than dedicated
infrastructure
➤ Large Organizations: Can afford dedicated servers, or can choose cloud
computing for highly scalable solutions
“As big data and statistics engage with
one another, it is critical to remember
that the two fields are united by one
common goal: to draw reliable
conclusions from available data.”
-Kaiser Fung
DIVE INTO DATA SCIENCE
“A data scientist is a person who is
better at blah blah blah”
-Josh Wills
WHAT IS DATA SCIENCE
➤ DATA SCIENCE is the use of data to solve problems
➤ Bonus points for novel, interesting, necessary, and complex
problems
➤ A DATA SCIENTIST is a professional who uses the scientific method
to liberate and create meaning from raw data
DATA SCIENCE MARKET
Data Scientist is the #1 job of 2016 according to both Forbes and Glassdoor
DATA SCIENCE IS EASY!
FALSE
T-MODEL TO SUCCESS
Breadth of Knowledge
Depth of Expertise
DATA SCIENCE SKILLS SUMMARY
Programming
Data Cleaning
Feature Engineering
Statistics
Machine Learning
Optimization
Visualizations
Communication
Creativity, Curiosity, & Problem Solving
PROGRAMMING
TOO MANY OPTIONS?
R VS. PYTHON
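# R
lm(y ~ x1 + x2 + x3, data = mydata)
# Python (scikit-learn; assumes mydata is a pandas DataFrame)
from sklearn import linear_model
linear_model.LinearRegression().fit(mydata[["x1", "x2", "x3"]], mydata["y"])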
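Both calls on the slide fit an ordinary least squares model of y on three predictors. To show what such a fit computes regardless of language, here is a dependency-free sketch that solves the normal equations directly. The function name `fit_linear_regression` and the data are made up for illustration, and real libraries use more numerically stable methods than this.

```python
# Ordinary least squares via the normal equations (X^T X) b = X^T y,
# solved with Gauss-Jordan elimination. Illustrative sketch only.

def fit_linear_regression(rows, y):
    """Return [intercept, b1, b2, ...] for the model y ~ x1 + x2 + ..."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Made-up data generated from y = 1 + 2*x1 + 3*x2 - 1*x3 with no noise.
mydata = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 2, 3), (2, 1, 4)]
y = [1 + 2 * x1 + 3 * x2 - x3 for x1, x2, x3 in mydata]

coefficients = fit_linear_regression(mydata, y)
print([round(c, 6) for c in coefficients])  # close to [1, 2, 3, -1]
```

Because the data are noise-free, the recovered coefficients match the generating equation up to floating-point error.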
DATA CLEANING
DATA CLEANING/WRANGLING
Approximately 80% of time and costs are related to cleaning up data and
other quality issues
➤ Invalid
➤ Missing
➤ Duplicated
➤ Corrupted
➤ Inconsistent, e.g. a data frame column that holds both ‘CO’ and ‘Colorado’
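The issue types listed above can be handled with a few explicit rules. The sketch below is one possible approach on hypothetical records; the field layout, the `STATE_CODES` mapping, and the rule that negative amounts are invalid are all assumptions made for illustration.

```python
# Hypothetical raw records: (customer_id, state, amount).
raw_records = [
    (1, "CO", 120.0),
    (2, "Colorado", 75.5),
    (2, "Colorado", 75.5),   # duplicated row
    (3, "co ", None),        # missing amount, inconsistent casing/whitespace
    (4, "Colorado", -20.0),  # invalid: negative amount (assumed rule)
]

# Assumed mapping from spelled-out state names to two-letter codes.
STATE_CODES = {"colorado": "CO"}


def normalize_state(value: str) -> str:
    """Map variants such as 'Colorado', 'co ' and 'CO' to one canonical code."""
    cleaned = value.strip().lower()
    return STATE_CODES.get(cleaned, cleaned.upper())


def clean(records):
    """Drop missing, invalid, and duplicated rows; standardize state values."""
    seen = set()
    result = []
    for customer_id, state, amount in records:
        if amount is None or amount < 0:  # missing or invalid
            continue
        row = (customer_id, normalize_state(state), amount)
        if row in seen:                   # duplicated
            continue
        seen.add(row)
        result.append(row)
    return result


cleaned_records = clean(raw_records)
print(cleaned_records)  # [(1, 'CO', 120.0), (2, 'CO', 75.5)]
```

Dropping rows is the simplest policy; imputing missing values or flagging invalid ones for review are common alternatives.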
“If I had only one hour to save the
world, I would spend fifty-five
minutes defining the problem, and
only five minutes finding the solution.”
-Albert Einstein (attributed)
FEATURE ENGINEERING
FEATURE ENGINEERING: TRANSFORMATIONS
FEATURE ENGINEERING: PARSING & NEW FEATURES
Date of Sale   Day   Month   Year   Day of Week   Days Since Sale
03/25/2014     25    3       2014   Tuesday       782
09/22/2015     22    9       2015   Tuesday       236
04/05/2016     5     4       2016   Tuesday       40
05/12/2016     12    5       2016   Thursday      3
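The parsing step on this slide can be sketched with the standard library alone. The reference date of 2016-05-15 is an assumption inferred from the "Days Since Sale" values; the slide does not state it. The function name `date_features` is illustrative.

```python
from datetime import date, datetime

# Assumption: "Days Since Sale" is measured from 2016-05-15, the date
# implied by the values on the slide.
REFERENCE_DATE = date(2016, 5, 15)

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def date_features(date_of_sale: str, reference: date = REFERENCE_DATE) -> dict:
    """Parse an MM/DD/YYYY string into day, month, year, weekday, and age in days."""
    sale = datetime.strptime(date_of_sale, "%m/%d/%Y").date()
    return {
        "day": sale.day,
        "month": sale.month,
        "year": sale.year,
        "day_of_week": DAY_NAMES[sale.weekday()],  # weekday(): Monday == 0
        "days_since_sale": (reference - sale).days,
    }


for raw in ("03/25/2014", "09/22/2015", "04/05/2016", "05/12/2016"):
    print(raw, date_features(raw))
```

One text column becomes five candidate features, each of which a model can use directly.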
STATISTICS
STATISTICS
➤ Summary Statistics
➤ Probability/Combinatorics
➤ Distributions (e.g. Binomial, Uniform, Poisson, etc.)
➤ Linear Algebra
➤ Hypothesis Testing
➤ Calculus
➤ Graph Theory
➤ Bayesian Analysis
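As a small taste of the first three topics above, the following sketch computes summary statistics and a binomial probability with Python's standard library. The sample values are made up for illustration.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

print("mean:", statistics.mean(sample))                  # mean: 5
print("median:", statistics.median(sample))              # median: 4.5
print("population std dev:", statistics.pstdev(sample))  # population std dev: 2.0


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), using the combinatorial count C(n, k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 3 heads in 5 fair coin flips: C(5, 3) / 2^5 = 10/32.
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```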
MACHINE LEARNING
MACHINE LEARNING
➤ Machine Learning is the process of letting ‘machines’ do the heavy
lifting
➤ More formally: the field of study that gives computers
the ability to learn without being explicitly programmed.
➤ Two Paths: Supervised Learning and Unsupervised Learning
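The difference between the two paths can be sketched on toy data: a supervised method learns from labeled examples, while an unsupervised method looks for structure without labels. The data, the function names, and the choice of 1-nearest-neighbor and a two-cluster k-means are illustrative assumptions, not methods named on the slide.

```python
# Supervised learning: labeled examples are available, so we predict a label.
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]


def predict_nearest_neighbor(x: float) -> str:
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label


# Unsupervised learning: no labels, so we look for structure (here, 2 clusters).
def two_means(values, iterations: int = 10):
    """Tiny 1-D k-means with k=2; assumes at least two distinct values."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


unlabeled = [1.0, 1.5, 8.0, 9.0]
print(predict_nearest_neighbor(2.0))  # small
print(two_means(unlabeled))           # (1.25, 8.5)
```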
DEEP LEARNING
➤ Deep Learning is a branch of Machine Learning, usually more advanced,
that uses multiple processing layers, each composed of multiple data
transformations.
➤ It is often applied to image, audio, video, and text data.
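The idea of stacked processing layers can be sketched as a forward pass through two small layers, each a linear transformation followed by a nonlinearity. The weights below are hand-picked and hypothetical (in practice they are learned from data), and tanh is just one possible choice of nonlinearity.

```python
import math


def dense_layer(inputs, weights, biases):
    """One processing layer: a linear transformation followed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked parameters; real networks learn these from data.
layer1 = ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1])  # 2 inputs -> 3 units
layer2 = ([[0.7, -0.5, 0.2]], [0.05])                                 # 3 units -> 1 output


def forward(inputs):
    """Pass the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in (layer1, layer2):
        activations = dense_layer(activations, weights, biases)
    return activations


output = forward([1.0, 2.0])
print(output)  # a single value between -1 and 1
```

Adding more layers to the loop is what makes a network "deep"; training the weights is a separate step not shown here.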
STATISTICS & MACHINE LEARNING
[Figure: Linear Regression and Recurrent Neural Network compared on parsimony; Predictive Power vs. Interpretability]
OPTIMIZATION
VISUALIZATIONS
VISUALIZATIONS
Tell a story…
COMMUNICATION & STORYTELLING
CREATIVITY, CURIOSITY, & PROBLEM SOLVING
“How do I do X in R/Python?”
-Everyone
TOP DOWN COGNITIVE FRAMEWORK
➤ Approach problem solving holistically
➤ Parse the whole into meaningful chunks
➤ Solve piece-by-piece
➤ Roll back up
BREAK INTO THE FIELD
REQUIRED SKILLS
➤ Strong statistics/probability/distributions/etc. background
➤ [Ideal] experience with Machine Learning
➤ Python and/or R
➤ SQL
➤ [Ideal] AWS and/or Hadoop
➤ Problem Solving skills & Asking the right questions
➤ Ability to explain what was done and why to audiences at all levels
PROGRAMMING SCHOOLS
ONLINE COURSES
DATA SCIENCE BOOTCAMPS
OPEN SOURCE DATA SCIENCE MASTERS
METACADEMY
MEASURING SUCCESS
KEEPING AN EYE ON RECRUITER BEHAVIOR
➤ Using eye-tracking software, researchers found recruiters spend only
6 seconds reviewing a resume.
➤ 80% of that time is spent looking at Education, Current/Previous Company
& Current/Previous Title
➤ Takeaway: Getting a job at a renowned company OR with a data scientist
title opens up a lot of doors.
JOB ROLES
DATA ARCHITECT
DATA ENGINEER
DATA SCIENTIST
ARCHITECTING & FLOW MODELS
FLOW MODEL
[Figure: flow model with stages Data Architecture, Data Engineering, Data Science Automation]
ARCHITECTING & ENGINEERING
➤ Ingestion
➤ Warehousing/Storage
➤ Cleaning & Optimization
DATA SCIENCE IT
➤ Stat Software
➤ Exploration: Visualizations, Cleaning
➤ Modeling
➤ Automation: Visualizations, Communication
BUILDING A TEAM
CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
KEY FEATURES WHEN HIRING
➤ Cultural Fit
➤ Math/Statistics/Machine Learning knowledge
➤ Programming skills (HackerRank challenges/take-home
assessments)
➤ One-day on site/Day-in-life
➤ Continuous Learning Assessment (e.g. What do you enjoy
about Data Science?)
➤ Problem Solving (situational interview questions or past
performance assessment)
“The impact of a data science team is
dependent upon its ability to
influence the adoption of its
recommendations.”
-Elena Grewal & Riley Newman
FINDING THE UNICORN
ALTERNATIVE APPROACH
THANK YOU
Matt Fornito
Director of Analytics
OpsVision Solutions: Big Data/Cloud Consulting Firm
@MattFornito
BigDataUnicorn.com
