You Don't Have to Be a Data Scientist to Do Data Science
@carmenmardiros (not a data scientist)
“Sexiest job of the 21st century”
Why do I, a mere analyst, care?
The appeal of Data Science (for me as an analyst)
My own and others’ in my analyses as the complexity
of data and business ecosystem increases.
Speed up the analysis cycle from exploration to
hypothesis to experimentation.
Add value in
As the business and technology landscape changes.
Operationalise analysis outcomes as data products.
“It’s just not for me...”
“I don’t have a degree in statistics or programming.”
No confidence to attend the
Worried I would not understand
Worried I’d be spotted as a fraud.
(3m into my data science foray)
Understood much of the content
Mentally thought questions
I knew more than I thought I did.
Predictive Analytics Summit 2013 Predictive Analytics Summit 2016
Doing data science requires a
PhD/going back to school.
Can’t do data science until you
can write an algorithm.
Bottom-up is the only way.
Doing data science requires
enthusiasm and confidence in
Can and should do data science
once we’ve conceptually
understood how and why the
Provide value, learn as you go.
Digital Analytics is changing fast
Essential as we move towards prescriptive analytics
We will be key to bridging the gap between PhDs,
machines and management.
May even use it ourselves for our day-to-day work.
MS Office for Machine Learning coming soon at a
cloud near you.
Number of observations: 100
Sample is representative (to the best of our knowledge)
Observed mean: 17.54 months
Draw 100 random samples with replacement
Calculate for each one the mean:
[17.61, 16.21, 17.13, 14.08, 19.58 … ] # 100
Plot all means, the 2.5 and 97.5
percentiles and original observed mean.
Bootstrap is extremely versatile:
● Fewer assumptions than parametric
● Can be used on any statistic.
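A minimal sketch of the bootstrap steps above, using only the Python standard library. The 100 observations are synthetic stand-ins (the talk's data is not available), the function names are hypothetical, and the plot is replaced by printing the 2.5 and 97.5 percentiles. In practice more than 100 resamples are commonly used.

```python
import random
import statistics

def bootstrap_means(observations, n_resamples=100, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(observations)
    return [
        statistics.mean(rng.choices(observations, k=n))
        for _ in range(n_resamples)
    ]

def percentile(values, pct):
    """Simple percentile: pick the sorted value at the nearest index."""
    ordered = sorted(values)
    index = round(pct / 100 * (len(ordered) - 1))
    return ordered[index]

# Hypothetical data: 100 made-up durations in months (not the talk's data).
data_rng = random.Random(42)
sample = [data_rng.expovariate(1 / 17.5) for _ in range(100)]

means = bootstrap_means(sample)
low, high = percentile(means, 2.5), percentile(means, 97.5)
print(f"observed mean: {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval: {low:.2f} to {high:.2f}")
```

Swapping `statistics.mean` for `statistics.median` (or any other statistic) is the only change needed to bootstrap a different quantity.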
Simulations & Sensitivity Analysis
Given existing distribution of order values and a
given range of possible conversion rates, how
much £££ would we make if we doubled the
traffic to our website?
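One way to approach this question is a small Monte Carlo simulation. The sketch below is an assumption-laden illustration, not the talk's method: the order values, the conversion-rate range, the session count and the function name `simulate_revenue` are all made up, and the conversion rate is assumed to be uniformly distributed within the range.

```python
import random

def simulate_revenue(order_values, rate_low, rate_high, sessions,
                     n_sims=500, seed=0):
    """Return the simulated total revenue for n_sims possible futures."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        # Pick one plausible conversion rate (assumed uniform in the range).
        rate = rng.uniform(rate_low, rate_high)
        # Each session converts independently with probability `rate`.
        n_orders = sum(rng.random() < rate for _ in range(sessions))
        # Order values are resampled from the observed distribution.
        totals.append(sum(rng.choices(order_values, k=n_orders)))
    return totals

# Hypothetical inputs, for illustration only.
observed_order_values = [12.5, 20.0, 35.0, 18.0, 60.0, 25.0, 15.0, 42.0]
current_sessions = 2000

totals = simulate_revenue(
    observed_order_values, rate_low=0.01, rate_high=0.03,
    sessions=2 * current_sessions,  # "what if we doubled the traffic?"
)
ordered = sorted(totals)
print(f"median revenue: £{ordered[len(ordered) // 2]:.0f}")
print(f"95% of simulations fall between £{ordered[12]:.0f} "
      f"and £{ordered[487]:.0f}")
```

The output is a range of outcomes rather than a single number, which is the point: the spread shows how much the answer depends on the uncertain conversion rate.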
(or how to open up black boxes):
Given a predictive model, randomly generate
new data points for each input based on
observed distributions, create predictions using
the model and interpret distribution of
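The sensitivity-analysis idea above can be sketched as follows. `model_predict` is a hypothetical stand-in for any trained model, and the input names and observed values are invented. Drawing each input independently is a simplifying assumption that ignores correlations between inputs.

```python
import random
import statistics

def model_predict(sessions, avg_order_value):
    """Stand-in for any trained black-box model (hypothetical formula)."""
    return 0.02 * sessions * avg_order_value

def simulate_predictions(observed_inputs, predict, n_points=1000, seed=0):
    """Draw each input independently from its observed values, then predict."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        point = {name: rng.choice(values)
                 for name, values in observed_inputs.items()}
        rows.append((point, predict(**point)))
    return rows

# Hypothetical observed values for each model input.
observed = {
    "sessions": [800, 950, 1000, 1200, 1500],
    "avg_order_value": [18.0, 22.0, 25.0, 31.0],
}

rows = simulate_predictions(observed, model_predict)
predictions = [pred for _, pred in rows]
print(f"median prediction: {statistics.median(predictions):.1f}")

# Sensitivity: how does the typical prediction shift as one input changes?
for level in observed["sessions"]:
    subset = [pred for point, pred in rows if point["sessions"] == level]
    print(level, round(statistics.mean(subset), 1))
```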
Iteration 1: Train fold | Train fold | Train fold | Train fold | Test fold
Iteration 2: Train fold | Train fold | Train fold | Test fold  | Train fold
Iteration 3: Train fold | Train fold | Test fold  | Train fold | Train fold
Iteration 4: Train fold | Test fold  | Train fold | Train fold | Train fold
Iteration 5: Test fold  | Train fold | Train fold | Train fold | Train fold
Assesses how well a predictive model generalises to unseen data.
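The train/test rotation shown above is 5-fold cross-validation. Here is a minimal sketch of it in plain Python. The data is synthetic and the "model" is a deliberately trivial baseline (predict the training mean); `fit_mean`, `predict_mean` and the other names are hypothetical, chosen only to keep the example self-contained.

```python
import random
import statistics

def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean absolute error measured on each held-out fold."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the four training folds only.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Score on the fold the model has never seen.
        errors.append(statistics.mean(
            abs(predict(model, xs[i]) - ys[i]) for i in test_idx
        ))
    return errors

def fit_mean(train_x, train_y):
    """Trivial baseline model: remember the mean of the training targets."""
    return statistics.mean(train_y)

def predict_mean(model, x):
    """Predict the remembered mean regardless of the input."""
    return model

# Synthetic data, for illustration only.
rng = random.Random(1)
xs = list(range(50))
ys = [2 * x + rng.gauss(0, 5) for x in xs]

errors = cross_validate(xs, ys, fit_mean, predict_mean)
print("error per fold:", [round(e, 1) for e in errors])
print("average error:", round(statistics.mean(errors), 1))
```

The spread of the five per-fold errors is as informative as their average: a wide spread suggests the model's performance depends heavily on which rows it happened to see.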
Acknowledges and mitigates effects of variance and
noise in the data.
You already do this when you use confidence
intervals. Quantify uncertainty more often.
Leverages randomness and probability to give you
glimpses into possible future outcomes.
Embrace randomness. It's your ally into prescriptive analytics.
#3 Feature Engineering
#3 Calculated Metrics or
Back on familiar territory.
Feature Engineering Examples
views per user
by content type
# politics content views, # business content views
# short/long-form content views
% politics content views in total content viewed
adjusted for uncertainty of small samples
Result: fat user-level table of attributes and
behaviour for analysis and modelling.
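A small sketch of how such a user-level table might be built from a raw pageview log. The log, the column names and the function `user_features` are hypothetical. The talk does not say which small-sample adjustment was used; the version here assumes a simple shrinkage towards a prior share, which is only one possible choice.

```python
from collections import Counter, defaultdict

# Hypothetical pageview log: (user_id, content_type).
views = [
    ("u1", "politics"), ("u1", "politics"), ("u1", "business"),
    ("u2", "business"),
    ("u3", "politics"), ("u3", "business"),
    ("u3", "business"), ("u3", "business"),
]

def user_features(views, prior_share=0.5, prior_weight=4):
    """Build one row of features per user from raw pageviews."""
    counts = defaultdict(Counter)
    for user_id, content_type in views:
        counts[user_id][content_type] += 1

    table = {}
    for user_id, by_type in counts.items():
        total = sum(by_type.values())
        politics = by_type["politics"]
        table[user_id] = {
            "views": total,
            "politics_views": politics,
            "business_views": by_type["business"],
            "pct_politics": politics / total,
            # Assumed adjustment: pull the share towards `prior_share`,
            # strongly for users with few views, weakly for heavy users.
            "pct_politics_adjusted":
                (politics + prior_share * prior_weight)
                / (total + prior_weight),
        }
    return table

features = user_features(views)
for user_id, row in sorted(features.items()):
    print(user_id, row)
```

For example, user `u2` has a single business view: the raw politics share is 0%, while the adjusted share stays close to the prior because one view is weak evidence.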
Feature Engineering Examples
(for time series
# new marketing campaigns (first date with sessions)
# new brands launched (first date with pageviews)
# voucher codes at peak redeem-rate (date with
# AB tests started (date with first events tracked)
# VIPs active on each date, etc
Result: fat date-level table of leading KPIs and
activities (model the ecosystem).
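The "first date with ..." pattern in these examples can be sketched as below for the marketing campaign case. The session log, campaign names and the helper `new_items_per_date` are hypothetical; the same helper would apply to brands, AB tests or any other item whose launch date is inferred from its first tracked activity.

```python
from collections import Counter
from datetime import date

# Hypothetical session log: (session_date, campaign).
sessions = [
    (date(2016, 3, 1), "spring_sale"),
    (date(2016, 3, 1), "newsletter"),
    (date(2016, 3, 2), "spring_sale"),
    (date(2016, 3, 3), "retargeting"),
    (date(2016, 3, 3), "newsletter"),
]

def new_items_per_date(events):
    """Count how many items appear for the first time on each date."""
    first_seen = {}
    for day, item in events:
        if item not in first_seen or day < first_seen[item]:
            first_seen[item] = day
    return Counter(first_seen.values())

new_campaigns = new_items_per_date(sessions)

# One row per date, including dates on which nothing new was launched.
all_dates = sorted({day for day, _ in sessions})
date_table = [
    {"date": d, "new_campaigns": new_campaigns.get(d, 0)}
    for d in all_dates
]
for row in date_table:
    print(row["date"].isoformat(), row["new_campaigns"])
```

Further columns (new brands, AB tests started, active VIPs) would be added to each row in the same way to widen the date-level table.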
New ways of
Seasoned data scientists: Feature engineering often
yields higher rewards than pushing the latest
You likely already do this, likely in Excel.
It’s painful and limiting.
Your analytical creativity needs better tools.
SQL: The single most valuable tool in our toolkit.
We become self-sufficient analysts.
Learn Python https://try.jupyter.org/ -- start learning Python for
data science right now (no setup!).
Understand how algorithms work using spreadsheets.
Top-down approach. No programming required.
Learn SQL https://learncodethehardway.org/sql/