Making an impact
with data science
Jordan Engbers, PhD
Chief Scientist, Desid Labs Inc.
CTO, Systolik Inc.
Outline
Who am I?
What is data science?
Making data products
Where do you go from X?
Are you doing good?
The Goal
To have a discussion around how to
create meaningful impact with data
science
Who am I?
How did I get here?
2004: Bioinformatics
Multidisciplinary Program
- Computer Science
- Biomedical Science
2008: Neuroscience
Just starting … no bitterness yet
2013: Clinical Data Science
Big Data, Data Science, Data Analytics, Predictive Analytics
- Data management for clinical researchers
- International clinical trials
- Software development
- Data science with clinical registries and administrative health data (THIN)
2015: Desid Labs Inc.
Data Science consulting company offering end-to-end data science services
Science-as-a-Service
desidlabs.com
2015/16: Systolik Inc.
Taking Apps to Heart
Cardiovascular Information Systems
Focus on Analytics within Cardiovascular Care
systolik.com
my random walk
music
bioinformatics
neuroscience
clinical data science
entrepreneur
web programming
Take Away
There is no set path to becoming a data scientist
Focus on:
Developing a scientific mindset
Strengthening your “metaskills”
Exploring many disciplines
Should you listen to me?
I am not speaking as an authority
I am here to share what I have learned and to help move
people forward in data science
So:
- Don’t take what I say at face value
- Test for yourself
- Challenge what you hear
- Come up with new and better ideas
What is
Data Science?
http://higheredublog.com/data-science-as-a-masters-a-brief-overview/
science
http://www.kdnuggets.com/2015/02/history-data-science-infographic.html
What is Data Science?
Wikipedia says that:
“...interdisciplinary field about processes and systems to extract knowledge or
insights from data in various forms, either structured or unstructured, which is a
continuation of some of the data analysis fields such as statistics, data mining, and
predictive analytics…”
“...Data science employs techniques and theories drawn from many fields within
the broad areas of mathematics, statistics, information science, and computer
science, including signal processing, probability models, machine learning,
statistical learning, data mining, database, data engineering, pattern recognition
and learning, visualization, predictive analytics, uncertainty modeling, data
warehousing, data compression, computer programming, artificial intelligence, and
high performance computing.”
What is a Data Scientist?
“Data scientists use their data and analytical ability to
find and interpret rich data sources;
manage large amounts of data despite hardware, software, and bandwidth
constraints;
merge data sources;
ensure consistency of datasets;
create visualizations to aid in understanding data;
build mathematical models using the data; and
present and communicate the data insights/findings.”
Is data science just a set of
methodologies?
The purpose of a scientific discipline
Do the following descriptions make sense?
- Astronomy is the field of science that uses telescopes
- Chemistry is about mixing chemicals and torturing undergrads
- Statistics uses maths
Nope.
- Astronomy is the study of celestial objects and processes that allows
us to understand the universe
- Chemistry examines the composition, structure, properties and
change of matter to help us understand the physical world
- Statistics allows us to use data more effectively by studying the
collection, analysis, interpretation, and organization of data….
Methods are invented to serve the field,
not as a purpose in themselves.
Is data science just statistics “rebranded”?
"Data scientist is just a sexed up word for statistician."
- Nate Silver
“Statistical modelling - two cultures” - Leo Breiman
“50 Years of Data Science” - David Donoho
Summary: data science is just an expanded form of statistics
But see:
“What ‘50 years of data science’ leaves out” - Sean Owen, Cloudera
What is the purpose of data science?
Data Science is about decisions
We democratize data access to empower all employees to make data-informed
decisions, give everybody the ability to use experiments to correctly measure the
impact of their decisions, and turn insights on user preferences into data
products that improve the experience of using Airbnb
- Scaling Knowledge at Airbnb
That is more than statistics:
- Need to understand business processes
- Requires data engineering approaches to provide the
environment
- Requires software engineering to create platforms to measure
the impact and develop the data products
Data science is the scientific discipline focused on determining
how data can drive better decisions across a wide set of
domains
Scientific discipline - not just data analysis, but science
“...determining how data...”
- methodologies, statistics, computer science
“...can drive better decisions…”
- domain knowledge, science, engineering, social sciences...
How does a focus on decisions change our approach?
1) Takes the focus away from specific methodologies (we do deep learning
too!) to using the appropriate methodologies to achieve a larger
overarching goal - better decisions
a) Side effect is we get to use a larger array of disciplines
i) Systems theory
ii) Psychology
2) Focus on making good data products that change decisions
a) Focusing on data products takes us away from “scripts” and towards an
engineered approach to data product manufacturing
Data science is not rebranded statistics.
Data science is a multidisciplinary
discipline that seeks to understand how
data can be used to improve decision
making.
Statistics is just a part of the approach.
Making Data Products
What is a data product?
[Diagram. Labels: World, Experience, Decision, Desired Outcome, Other Outcome (×6); learning: data, information, knowledge, wisdom; data product]
Data products are the mechanism
by which data science creates impact
Scientific Method
Framework for finding
value in data
Data is a raw resource.
Converting data to a data
product requires
experimentation,
exploration and learning.
This is the domain of
science.
Agile Development
Process for creation in
the face of uncertainty
Agile processes allow
software teams to meet
changing requirements,
but stay on track and
create effective products.
Engineered Products
Practices for ensuring
high quality products
It is one thing to make an
R script to analyse a
dataset. It is another to
have a resilient,
auditable, scalable data
product.
Desid Labs Approach
“Data science - more than just R scripts”
- unofficial Desid Labs motto
Levels of data products
Reporting
Dashboards
Prediction
AI (Autonomous)
Intelligent Decision-making Support Systems
Other dimensions
Complexity of UI
Complexity, size, and speed
of data, information, and knowledge (3V’s)
This branches into the field of AI and decision making
Start with Herbert Simon
Learning from the other doctors (MD, not PhD)
Clinical Decision Rules (Dr. Ian Stiell)
1) Derivation
2) Validation
a) Cross-validation (should be standard practice!)
b) Prospective validation - this is the real experiment
3) Implementation
4) Studying barriers to adoption
These steps help determine the validity of
your data product
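As a rough illustration of step 2a, here is a minimal k-fold cross-validation sketch in Python. Everything in it is an assumption made for the example: the synthetic data, the toy `fit_threshold` "derivation" rule, and the function names. None of it comes from the talk or from Dr. Stiell's work. The only point it shows is that the rule is derived on one part of the data and scored on a part it has never seen.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy derivation step: cut at the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, xs, ys):
    """Fraction of cases where 'x > threshold' agrees with the label."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)


def cross_validate(xs, ys, k=5):
    """Derive the rule on k-1 folds, score it on the held-out fold, repeat."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            accuracy(threshold, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return scores


# Hypothetical synthetic data: one noisy feature, two balanced classes.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 if y else 0.0, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
print("per-fold accuracy:", [round(s, 2) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 2))
```

Cross-validation only reuses data you already have. Prospective validation (step 2b) still needs new data collected after the rule is fixed.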
More than just R scripts
“It’s one thing to create an excellent fraud detection model in R, and quite another
to build:
● Fault-tolerant ingest of live data at scale that could represent fraudulent
actions
● Real-time computation of features based on the data stream
● Serialization, versioning and management of a fraud detection model
● Real-time prediction of fraud based on computed features at scale
● Learning over all historical data
● Incremental update of the production model in near-real-time
● Monitoring, testing, productionization of all of the above”
- Sean Owen, Cloudera
These are the sorts of things to think about when it
comes to implementing your data product
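To make one item on that list concrete, "serialization, versioning and management" of a model, here is a small Python sketch. It is an assumption-laden toy, not how Cloudera or Desid Labs do it: the `ModelRegistry` class, the in-memory storage, and the example parameters are all hypothetical. The idea it shows is that a model version can be identified by a hash of its serialized content, so the same parameters always map to the same version id and any change produces a new one.

```python
import hashlib
import json


def serialize_model(params, metadata):
    """Serialize parameters and metadata to JSON and derive a content-based version id."""
    payload = json.dumps({"params": params, "metadata": metadata}, sort_keys=True)
    version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return version_id, payload


class ModelRegistry:
    """Hypothetical in-memory registry; a real one would persist to durable storage."""

    def __init__(self):
        self._versions = {}  # version id -> serialized payload
        self._history = []   # version ids in registration order

    def register(self, params, metadata):
        version_id, payload = serialize_model(params, metadata)
        if version_id not in self._versions:
            self._versions[version_id] = payload
            self._history.append(version_id)
        return version_id

    def load(self, version_id):
        return json.loads(self._versions[version_id])

    def latest(self):
        return self._history[-1]


registry = ModelRegistry()
v1 = registry.register({"threshold": 0.42}, {"trained_on": "2016-01"})
v2 = registry.register({"threshold": 0.47}, {"trained_on": "2016-02"})
print("versions:", v1, v2, "latest:", registry.latest())
print("rollback target:", registry.load(v1)["params"])
```

Being able to name and reload an exact earlier version is what makes audits and rollbacks possible once the model is in production.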
Where do you go from X?
Coursera
Activities
Intelligence Gathering
- What is the question?
- Where is the data?
- What is the data?
Data Preparation
- Get the data
- Store the data
- Transform the data
- Load the data
Modeling
- Feature engineering
- Preprocessing
- Machine learning algorithms
- Validation (Phase I - Cross Validation)
Design
- Visualization
- Reducing feature set
- Creating a plan for integrating
Production
- Movement to production stack
- Versioning and management
- Monitoring, testing, deployment
Kaggle
Hackathon
Research & Open Data
Data Science Job
Skills & Knowledge
Domain
Knowledge
Data munging
Distributed computing
Storage
Sampling
Digital signal processing
Handling missing data
Filtering
Databases
Machine Learning
Algorithmic Complexity
GPU optimization
Programming
Statistics
Probabilities
Web development
Psychology
UI/UX
Software engineering
Devops
Testing
Debugging
Enterprise languages
Cloud computing
Learn by doing
1) Figure out where you are in the spectrum
2) Determine what experience you need to expand in either
direction
3) Find projects that will give you that experience
a) Online competitions
b) Hackathons
c) Freelance work
d) Your own projects
e) Data journalism
f) Data for Good (!)
Post production
Treat your data product as a hypothesis about the world
● Collect prospective data on its use
● Perform cohort analyses on people who make decisions based
on the data
● Consider A/B testing
● Consider canary testing
● Set a point where you will analyze the data (X people, X
amount of time)
● Answer the question - did it make a difference?
● Did it make the right difference?
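One way to approach "did it make a difference?" for an A/B test is a two-proportion z-test. The Python sketch below is a minimal version under stated assumptions: the outcome is binary (a "good decision" happened or not), users were randomly assigned, and the counts are invented for illustration. The function name is hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare the success rates of groups A and B.

    Returns (difference in rates, z statistic, two-sided p-value),
    using the pooled standard error under the null of equal rates.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Invented counts: A = without the data product, B = with it.
diff, z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"difference={diff:.3f} z={z:.2f} p={p_value:.3f}")
```

Fixing the analysis point in advance (X people, X amount of time) matters here: checking the p-value repeatedly and stopping when it looks good inflates the false-positive rate. A significant difference also does not answer the last question on the slide, whether it was the right difference.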
Are you doing good?
“...science and technology have been unable to
keep pace with the second-order effects caused
by their first-order victories.”
- Gerald Weinberg
How do we know that our data products are having
the desired effect?
Data is cleaned, features determined, model created (AUC: 0.88!), implementation
tested, UI designed, UX tested, integrated into production system, monitored.
Everything is done
Pat on the back - walk away
Next month’s headline:
What happened?
- An algorithm is only as good as its data
- An algorithm learns from the data - data is a
representation of the real world including its flaws
- The real world is complex and there can be non-linear
effects
Obviously Data for Evil (Commission)
Predatory advertising
Surveillance of dissidents, activists
Identity theft
Social Engineering
Gray areas
Weblining
Databases in elections to determine wedge issues
Surveillance for security reasons
Targeted advertising
Data for Good … right? (Omission)
Model to determine who will respond best to social assistance
What if the data is from an area with strong historical racism?
(Don’t use variables/features that could impart racial bias)
Automatic tagging of photos
What are the consequences of the algorithm being wrong?
(Need to balance sensitivity and specificity)
Apps to help first responders (geolocation)
Will providing a service to some people limit access based on
arbitrary technology choices?
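The sensitivity/specificity balance mentioned for photo tagging can be made concrete with a short Python sketch. The scores and labels below are invented, and the function name is an assumption for this example. It shows that moving the decision threshold trades one kind of error for the other.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Treat score >= threshold as a positive call; return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented model scores and true labels (1 = the tag really applies).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.35, 0.55, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Which threshold is right depends on the consequences of each error, which is a question about the decision, not about the model.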
How Big Data Enables Economic Harm to
Consumers, Especially to Low-Income and
Other Vulnerable Sectors of the Population
Algorithms aren’t biased - but data is
Historical data encompasses our societal biases
Algorithms learn from that data and inherit these biases
https://www.fordfoundation.org/ideas/equals-change-blog/posts/can-computers-be-racist-big-data-inequality-and-discrimination/
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899
https://www.propublica.org/article/when-big-data-becomes-bad-data
https://theconversation.com/big-data-algorithms-can-discriminate-and-its-not-clear-what-to-do-about-it-45849
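A first, very basic check for inherited bias is to compare how often a model makes a favourable decision for each group. The Python sketch below does only that. The group labels "A" and "B", the counts, and the function names are hypothetical. A gap in rates is a prompt to investigate the data and the decision, not proof of discrimination on its own, and equal rates do not prove its absence.

```python
from collections import defaultdict


def positive_rate_by_group(records):
    """records: iterable of (group, decision) pairs, decision being 0 or 1.

    Returns the share of favourable (1) decisions per group.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [favourable, total]
    for group, decision in records:
        counts[group][0] += int(decision)
        counts[group][1] += 1
    return {group: fav / total for group, (fav, total) in counts.items()}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means identical rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0
    return min(rates.values()) / highest


# Invented model decisions for two hypothetical groups.
records = (
    [("A", 1)] * 60 + [("A", 0)] * 40 +
    [("B", 1)] * 30 + [("B", 0)] * 70
)
rates = positive_rate_by_group(records)
print("favourable rate per group:", rates)
print("ratio of lowest to highest rate:", rate_ratio(rates))
```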
So what do we do?
Possibilities:
● Strengthen User Control of Personal Data
● Enforce Structural Changes in Market to Increase Competition
● Directly Regulate Big Data Platforms to Prohibit Harmful Practices
● Investing in the technical capacity of public interest lawyers, and developing a
greater cohort of public interest technologists
● Pressing for “algorithmic transparency.”
● Exploring effective regulation of personal data
● Ethical code of conduct for data science
These are strategic suggestions -
they suggest the what, but not the how
We need a solution that keeps pace with the tech
1) Systematic scientific process should be applied
Equivalent of peer review
2) Agile development and testing
Ensure models are implemented correctly
3) Systems modeling
Understand the second-order effects of the system
4) Monitoring
Validation of our model in the world
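Point 4, monitoring, can start very simply. The Python sketch below tracks accuracy over the most recent outcomes and flags the model for review when it falls below what was seen at validation time. The class name, the window size, and the tolerance are assumptions chosen for illustration. A real system would also need to handle outcomes that arrive late or never arrive.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Hypothetical monitor: compares recent accuracy with a validation baseline."""

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline      # accuracy measured during validation
        self.tolerance = tolerance    # how much degradation is acceptable
        self.outcomes = deque(maxlen=window)  # True where prediction matched reality

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """Flag only once the window is full, to avoid alarms on a handful of cases."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = RollingAccuracyMonitor(baseline=0.9, tolerance=0.05, window=10)
for _ in range(10):
    monitor.record(predicted=1, actual=1)   # model agrees with the world
print("after 10 correct:", monitor.accuracy(), monitor.needs_review())

for _ in range(3):
    monitor.record(predicted=1, actual=0)   # the world has started to disagree
print("after 3 misses:", monitor.accuracy(), monitor.needs_review())
```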
Conclusions
Data science is about decisions.
The creation of data products involves many
disciplines
Determine where you are at, then expand your
skills
Approach data science with care and thought - it
is easier to hurt than to help
If you are interested in specifics about
methodologies, sign up for the Desid Labs
newsletter:
desidlabs.com
Questions?
@jengbers
@desidlabs
jordan@desidlabs.com
