14. Take Away
There is no set path to becoming a data scientist
Focus on:
Developing a scientific mindset
Strengthening your “metaskills”
Exploring many disciplines
15. Should you listen to me?
I am not speaking as an authority
I am here to share what I have learned and to help move
people forward in data science
So:
- Don’t take what I say at face value
- Test for yourself
- Challenge what you hear
- Come up with new and better ideas
19. What is Data Science?
Wikipedia says that data science is an:
“...interdisciplinary field about processes and systems to extract knowledge or
insights from data in various forms, either structured or unstructured, which is a
continuation of some of the data analysis fields such as statistics, data mining, and
predictive analytics…”
“...Data science employs techniques and theories drawn from many fields within
the broad areas of mathematics, statistics, information science, and computer
science, including signal processing, probability models, machine learning,
statistical learning, data mining, database, data engineering, pattern recognition
and learning, visualization, predictive analytics, uncertainty modeling, data
warehousing, data compression, computer programming, artificial intelligence, and
high performance computing.”
20. What is a Data Scientist?
“Data scientists use their data and analytical ability to
find and interpret rich data sources;
manage large amounts of data despite hardware, software, and bandwidth
constraints;
merge data sources;
ensure consistency of datasets;
create visualizations to aid in understanding data;
build mathematical models using the data; and
present and communicate the data insights/findings.”
23. The purpose of a scientific discipline
Do the following descriptions make sense?
- Astronomy is the field of science that uses telescopes
- Chemistry is about mixing chemicals and torturing undergrads
- Statistics uses maths
Nope.
- Astronomy is the study of celestial objects and processes that allows
us to understand the universe
- Chemistry examines the composition, structure, properties and
change of matter to help us understand the physical world
- Statistics allows us to use data more effectively by studying the
collection, analysis, interpretation, and organization of data….
Methods are invented to serve the field,
not as a purpose in themselves.
24. Is data science just statistics “rebranded”?
"Data scientist is just a sexed up word for statistician."
- Nate Silver
“Statistical Modeling: The Two Cultures” - Leo Breiman
“50 Years of Data Science” - David Donoho
Summary: data science is just an expanded form of statistics
But see:
“What ‘50 years of data science’ leaves out” - Sean Owen, Cloudera
What is the purpose of data science?
25. Data Science is about decisions
We democratize data access to empower all employees to make data-informed
decisions, give everybody the ability to use experiments to correctly measure the
impact of their decisions, and turn insights on user preferences into data
products that improve the experience of using Airbnb
- Scaling Knowledge at Airbnb
That is more than statistics:
- Need to understand business processes
- Requires data engineering approaches to provide the
environment
- Requires software engineering to create platforms to measure
the impact and develop the data products
26. Data science is the scientific discipline focused on determining
how data can drive better decisions across a wide set of
domains
Scientific discipline - not just data analysis, but science
“...determining how data...”
- methodologies, statistics, computer science
“...can drive better decisions…”
- domain knowledge, science, engineering, social sciences...
27. How does a focus on decisions change our approach?
1) Takes the focus away from specific methodologies (we do deep learning
too!) to using the appropriate methodologies to achieve a larger
overarching goal - better decisions
a) Side effect is we get to use a larger array of disciplines
i) Systems theory
ii) Psychology
2) Focus on making good data products that change decisions
a) Focusing on data products takes us away from “scripts” and towards an
engineered approach to data product manufacturing
28. Data science is not rebranded statistics.
Data science is a multidisciplinary
discipline that seeks to understand how
data can be used to improve decision
making.
Statistics is just a part of the approach.
30. What is a data product?
[Diagram. Labels: World, Experience, Decision, Desired Outcome, Other Outcome (several); learning: data, information, knowledge, wisdom; data product]
31. Data products are the mechanism
by which data science creates impact
32. Desid Labs Approach
Scientific Method: framework for finding value in data
Data is a raw resource. Converting data to a data product requires
experimentation, exploration and learning. This is the domain of science.
Agile Development: process for creation in the face of uncertainty
Agile processes allow software teams to meet changing requirements,
but stay on track and create effective products.
Engineered Products: practices for ensuring high quality products
It is one thing to make an R script to analyse a dataset. It is another to
have a resilient, auditable, scalable data product.
“Data science - more than just R scripts”
- unofficial Desid Labs motto
33. Levels of data products
Reporting
Dashboards
Prediction
AI (Autonomous)
Intelligent Decision-making Support Systems
34. Other dimensions
Complexity of UI
Complexity, size, and speed
of data, information, and knowledge (3V’s)
This branches into the field of AI and decision making
Start with Herbert Simon
35. Learning from the other doctors (MD, not PhD)
Clinical Decision Rules (Dr. Ian Stiell)
1) Derivation
2) Validation
a) Cross-validation (should be standard practice!)
b) Prospective validation - this is the real experiment
3) Implementation
4) Studying barriers to adoption
These steps help determine the validity of
your data product
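As a rough illustration of step 2a, here is a minimal sketch of k-fold cross-validation using only the Python standard library. The threshold rule, the function names, and the data are all made up for illustration; they are not from Dr. Stiell's work or from any particular library. Note that this only covers cross-validation; prospective validation (2b) needs new data collected after the rule is derived.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the held-out accuracy for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        # Fit only on the training rows, never on the held-out fold
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(1 for i in held_out if predict(model, xs[i]) == ys[i])
        scores.append(hits / len(held_out))
    return scores

# Hypothetical toy rule: flag a case when its score exceeds the midpoint
# between the mean score of each class in the training data.
def fit_threshold(xs, ys):
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(positives) + mean(negatives)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

fold_scores = cross_validate(xs, ys, fit_threshold, predict_threshold, k=4)
print(fold_scores, mean(fold_scores))
```

The spread of the per-fold scores is often as informative as their mean: a rule that does well on some folds and poorly on others is probably not ready for prospective validation.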
36. More than just R scripts
“It’s one thing to create an excellent fraud detection model in R, and quite another
to build:
● Fault-tolerant ingest of live data at scale that could represent fraudulent
actions
● Real-time computation of features based on the data stream
● Serialization, versioning and management of a fraud detection model
● Real-time prediction of fraud based on computed features at scale
● Learning over all historical data
● Incremental update of the production model in near-real-time
● Monitoring, testing, productionization of all of the above”
- Sean Owen, Cloudera
These are the sorts of things to think about when it
comes to implementing your data product
40. Learn by doing
1) Figure out where you are in the spectrum
2) Determine what experience you need to expand in either
direction
3) Find projects that will give you that experience
a) Online competitions
b) Hackathons
c) Freelance work
d) Your own projects
e) Data journalism
f) Data for Good (!)
41. Post production
Treat your data product as a hypothesis about the world
● Collect prospective data on its use
● Perform cohort analyses on people who make decisions based
on the data
● Consider A/B testing
● Consider canary testing
● Set a point where you will analyze the data (X people, X
amount of time)
● Answer the question - did it make a difference?
● Did it make the right difference?
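One possible way to answer "did it make a difference?" for an A/B test is a two-proportion z-test. The sketch below assumes a binary outcome (for example, "a good decision was made") and a sample size fixed in advance, as the list above suggests. The function name and the counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Group A is the control, group B uses the data product.
    Returns (z statistic, p-value).
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Standard normal CDF written with the error function
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Made-up counts: 100 of 1000 good outcomes without the product,
# 150 of 1000 with it.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

A small p-value only says the difference is unlikely to be chance. Whether it is the right difference still needs the cohort analyses and domain judgment listed above.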
43. “...science and technology have been unable to
keep pace with the second-order effects caused
by their first-order victories.”
- Gerald Weinberg
44. How do we know that our data products are having
the desired effect?
Data is cleaned, features determined, model created (AUC: 0.88!), implementation
tested, UI designed, UX tested, integrated into production system, monitored.
Everything is done
Pat on the back - walk away
Next month’s headline:
45. What happened?
- An algorithm is only as good as its data
- An algorithm learns from the data - data is a
representation of the real world, including its flaws
- The real world is complex and there can be non-linear
effects
46. Obviously Data for Evil (Commission)
Predatory advertising
Surveillance of dissidents, activists
Identity theft
Social Engineering
47. Gray areas
Weblining
Databases in elections to determine wedge issues
Surveillance for security reasons
Targeted advertising
48. Data for Good … right? (Omission)
Model to determine who will respond best to social assistance
What if the data is from an area with strong historical racism?
(Don’t use variables/features that could impart racial bias)
Automatic tagging of photos
What are the consequences of the algorithm being wrong?
(Need to balance sensitivity and specificity)
Apps to help first responders (geolocation)
Will providing a service to some people limit access based on
arbitrary technology choices?
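To make the sensitivity/specificity balance from the photo-tagging example concrete, here is a minimal sketch that computes both at several decision thresholds. The scores, labels, and function name are invented for illustration.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Classify score >= threshold as positive, then return
    (sensitivity, specificity) against the true labels."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn)  # share of true positives that are caught
    specificity = tn / (tn + fp)  # share of true negatives left alone
    return sensitivity, specificity

# Made-up model scores and true labels
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, sensitivity_specificity(labels, scores, threshold))
```

Lowering the threshold catches more true positives at the cost of more false alarms, and raising it does the opposite. Which side to favour depends on the consequences of each kind of error.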
49. How Big Data Enables Economic Harm to
Consumers, Especially to Low-Income and
Other Vulnerable Sectors of the Population
50. Algorithms aren’t biased - but data is
Historical data encompasses our societal biases
Algorithms learn from that data and inherit these biases
https://www.fordfoundation.org/ideas/equals-change-blog/posts/can-computers-be-racist-big-data-inequality-and-discrimination/
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899
https://www.propublica.org/article/when-big-data-becomes-bad-data
https://theconversation.com/big-data-algorithms-can-discriminate-and-its-not-clear-what-to-do-about-it-45849
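One simple first check for an inherited bias is to compare how often a model gives a favourable prediction to each group. The sketch below is only an assumption about how such a check might look; the group labels, predictions, and function name are made up. Equal rates do not prove a model is fair, and unequal rates do not prove it is biased, but a large gap is a reason to look closer.

```python
from collections import defaultdict

def positive_rate_by_group(groups, predictions):
    """Return the share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        positives[group] += prediction
    return {group: positives[group] / totals[group] for group in totals}

# Made-up group membership and model outputs (1 = favourable decision)
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = positive_rate_by_group(groups, predictions)
print(rates)
```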
51. So what do we do?
Possibilities:
● Strengthen User Control of Personal Data
● Enforce Structural Changes in Market to Increase Competition
● Directly Regulate Big Data Platforms to Prohibit Harmful Practices
● Invest in the technical capacity of public interest lawyers, and develop a
greater cohort of public interest technologists
● Press for “algorithmic transparency”
● Explore effective regulation of personal data
● Ethical code of conduct for data science
These are strategic suggestions -
they suggest the what, but not the how
52. We need a solution that keeps pace with the tech
1) Systematic scientific process should be applied
Equivalent of peer review
2) Agile development and testing
Ensure models are implemented correctly
3) Systems modeling
Understand the second-order effects of the system
4) Monitoring
Validation of our model in the world
53. Conclusions
Data science is about decisions.
The creation of data products involves many
disciplines
Determine where you are at, then expand your
skills
Approach data science with care and thought - it
is as easy to hurt as to help
54. If you are interested in specifics about
methodologies, sign up for the Desid Labs
newsletter:
desidlabs.com