SlideShare a Scribd company logo
Designing and Scoping a
Data Science Project
Data Science for Beginners, Session 1
About these Sessions
Session Format
Session:
• One topic
• Learn 4-6 concepts related to that topic
• Try apps or code related to that topic
Before each session:
• Install required tools (see the ‘tool installs’ instructions sheet)
• Do background reading
Session Topics
People
• Designing a data science project
• Communicating results
Tools
• Python basics
• Enterprise data tools
Getting Data
• Acquiring data
• Cleaning and exploring data
Special data types
• Handling text data
• Handling geospatial data
• Handling big data
Learning from data
• Predicting values from data
• Learning relationships from data
• Learning classes from data
Sessions Timeline
1. Scoping a data science project
2. Python basics
3. Acquiring data
4. Communicating results
5. Cleaning and exploring data
6. Predicting values from data
7. Handling text data
8. Handling geospatial data
9. Learning relationships from data
10. Enterprise data tools
11. Learning classes from data
12. Handling big data
Session 1: your 5-7 things
• What is data science?
• Data science is a process
• What’s a data scientist?
• Data science competitions
• Writing a problem statement
What is Data Science?
Defining Data Science
“A data scientist… excels at analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.”
“The analysis of data using the scientific method”
“A data scientist is an individual, organization or application that performs statistical
analysis, data mining and retrieval processes on a large amount of data to identify
trends, figures and other relevant information.”
Understanding through Data
Data Science is a Process
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Ask an interesting question
Write hypotheses that can be explored
● Do people have more phones than toilets?
● How is Ebola spreading?
● Is using wood fires sustainable in rural Tanzania?
● Can we feed 9 billion people?
Make them simple, actionable, incremental
Get the data
Data files (CSV, Excel, Json, Xml...)
● Databases (sqlite, mysql, oracle, postgresql...)
● APIs
● Report tables (tables on websites, in pdf reports...)
● Text (reports and other documents…)
● Maps and GIS data (openstreetmap, shapefiles, NASA earth images...)
● Images (satellite images, drone footage, pictures, videos…)
Most data is small, but…
Reformat the data
Explore the data
Model the Data
Communicate results
What’s a Data
Scientist?
The Data Science Venn Diagram
How do you become a data scientist?
Learning and Practice
● Kaggle - online datascience competitions
● Driven Data - social good datascience competitions
● Innocentive - some datascience challenges
● CrowdAnalytix - business datascience competitions
Should you become a data
scientist?
● Not necessarily. There are lots of data science
students desperate for good problems to work on.
● You might want to become someone who can
work with data scientists
● Which means learning how to specify data
problems well
Problem examples:
Data Science
Competitions
Who Does What
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize
your results
Problem Owner
Competitor
?
DrivenData
Kaggle
DataKind
Example project: Pump It Up
Tanzania wells:
“Your goal is to predict the
operating condition of a
waterpoint for each record in the
dataset”
Example project: Cervical cancer
DrivenData competition guidelines
Impact: “… clear win for the organisation in terms of effective planning, resources
saved or people served… good story around how they generate social impact…”
Challenge: “… challenging enough for a rich competition…”
Feasibility: “….the right kind of data to answer the question at hand… does it
have enough signal to be useful?...”
Privacy: “… can answer this question while protecting the privacy of individuals in
the dataset and the operational privacy of an organisation…”
Writing a Problem
Statement
Design your project
Context: who needs this work, and what are they doing it for?
Needs: what are you trying to fix
Vision: what do you expect your final result to look like?
Outcome: how do you get your results to the people who need them? What
happens next?
Design your questions
Is the question concrete enough?
Can you translate the question into an experiment?
Is it actionable?
What actions will be taken given the answer?
What data is needed to do the analysis?
Data Science Ethics
Data Risk and Ethics
You’re responsible for your data outputs
Could your outputs increase risk to anyone?
How will you respect privacy and security?
Data Risk
Risk: “The probability of something happening multiplied by the resulting cost or
benefit if it does”
Risk of: physical, legal, reputational, privacy harm
Likelihood (e.g. low, medium, high)
Risk to: data subjects, collectors, processors, releasers, users
PII: Personally Identifiable Information
“Personally identifiable information (PII) is any data that could potentially
identify a specific individual. Any information that can be used to distinguish one
person from another and can be used for de-anonymizing anonymous data can be
considered PII.”
PII Red Flags
Names, addresses, phone numbers
Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
Members of small populations
Untranslated text
Codes (e.g. “41”)
Slang terms
Exercises
3-minute exercise: Ask interesting questions
Either your own questions:
Questions that data might help with
Stories you want to tell with data
Datasets you’d like to explore
Or pick an existing question:
● Competition questions: Kaggle, DrivenData
● A data science project that interested you
3-minute exercise: Get the data
Pick one of your questions
List the ideal data you need to answer it
List the data that’s (probably) available
Think about what you’ll do if the data you need isn’t available
What compromises could you make
Where would you look for more data
Are there proxies (other datasets that tell you something about your question)
3-min exercise: design your communications
List the types of people you’d want to show your results to
How do you want them to change the world? Can they take actions, can they
change opinions etc
Describe the types of outputs that might be persuasive to them - visuals, text,
numbers, stories, art… be as wild with this as you want
Things to do before next week
See file Tool Install Instructions
• Make friends with the terminal window
• Install iPython
• Install Git

More Related Content

What's hot

Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
Sanura Hettiarachchi
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
heba_ahmad
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
TJ Stalcup
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
Arc & Codementor
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
Loïc Lejoly
 
Datawarehouse
DatawarehouseDatawarehouse
Datawarehouse
N.Jagadish Kumar
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceVignesh Prajapati
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
Ankur Tripathi
 
Big data and data science
Big data and data scienceBig data and data science
Big data and data science
Song Xue
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
Gabriel Moreira
 
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Dr.Sotarat Thammaboosadee CIMP-Data Governance
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
Natalino Busa
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsNatalino Busa
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
Melissa Hornbostel
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
Srinath Perera
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
Conor B. Murphy
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Edureka!
 
Startup Data Science
Startup Data ScienceStartup Data Science
Startup Data Science
Misha Lisovich
 

What's hot (19)

Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
 
Datawarehouse
DatawarehouseDatawarehouse
Datawarehouse
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
 
Big data and data science
Big data and data scienceBig data and data science
Big data and data science
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9Introduction to Data Science by Datalent Team @Data Science Clinic #9
Introduction to Data Science by Datalent Team @Data Science Clinic #9
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 
Startup Data Science
Startup Data ScienceStartup Data Science
Startup Data Science
 

Viewers also liked

Global team
Global teamGlobal team
Global team
Ayush Malhotra
 
CrowdANALTIX Data Competition Visualizing Deals
CrowdANALTIX Data Competition Visualizing DealsCrowdANALTIX Data Competition Visualizing Deals
CrowdANALTIX Data Competition Visualizing Deals
Sawinder Pal Kaur
 
Kaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data AnalyticsKaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data Analytics
Jeffrey Funk Business Models
 
Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive Model
DKALab
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
jayroy
 
PROG_UntoldStory ISV eBook_0706c FINAL
PROG_UntoldStory ISV eBook_0706c FINALPROG_UntoldStory ISV eBook_0706c FINAL
PROG_UntoldStory ISV eBook_0706c FINALSolarWinds MSP
 
Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An Overview
MachinePulse
 
India Startup Report
India Startup ReportIndia Startup Report
India Startup Report
World Startup Report
 
This Isn't 'Big Data.' It's Just Bad Data.
This Isn't 'Big Data.' It's Just Bad Data.This Isn't 'Big Data.' It's Just Bad Data.
This Isn't 'Big Data.' It's Just Bad Data.
Peter Orszag
 
IQ Crash Course - Big Data Analytics
IQ Crash Course - Big Data AnalyticsIQ Crash Course - Big Data Analytics
IQ Crash Course - Big Data Analytics
InterQuest Group
 
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
J. Skyler Fernandes
 
Startup Ideas and Validation
Startup Ideas and ValidationStartup Ideas and Validation
Startup Ideas and Validation
Yevgeniy Brikman
 
List of Software Development Model and Methods
List of Software Development Model and MethodsList of Software Development Model and Methods
List of Software Development Model and Methods
Riant Soft
 

Viewers also liked (14)

Global team
Global teamGlobal team
Global team
 
CrowdANALTIX Data Competition Visualizing Deals
CrowdANALTIX Data Competition Visualizing DealsCrowdANALTIX Data Competition Visualizing Deals
CrowdANALTIX Data Competition Visualizing Deals
 
Kaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data AnalyticsKaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data Analytics
 
Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive Model
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
 
PROG_UntoldStory ISV eBook_0706c FINAL
PROG_UntoldStory ISV eBook_0706c FINALPROG_UntoldStory ISV eBook_0706c FINAL
PROG_UntoldStory ISV eBook_0706c FINAL
 
Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An Overview
 
India Startup Report
India Startup ReportIndia Startup Report
India Startup Report
 
This Isn't 'Big Data.' It's Just Bad Data.
This Isn't 'Big Data.' It's Just Bad Data.This Isn't 'Big Data.' It's Just Bad Data.
This Isn't 'Big Data.' It's Just Bad Data.
 
IQ Crash Course - Big Data Analytics
IQ Crash Course - Big Data AnalyticsIQ Crash Course - Big Data Analytics
IQ Crash Course - Big Data Analytics
 
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
The Best Startup Investor Pitch Deck & How to Present to Angels & Venture Cap...
 
Startup Ideas and Validation
Startup Ideas and ValidationStartup Ideas and Validation
Startup Ideas and Validation
 
List of Software Development Model and Methods
List of Software Development Model and MethodsList of Software Development Model and Methods
List of Software Development Model and Methods
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to Session 01 designing and scoping a data science project

Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
Thinkful
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
Thinkful
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
Thinkful
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
Thinkful
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
Thinkful
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
TJ Stalcup
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sd
Thinkful
 
intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...
jybufgofasfbkpoovh
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
C. Tobin Magle
 
Getting started in ds (july 17) atlanta
Getting started in ds (july 17)   atlantaGetting started in ds (july 17)   atlanta
Getting started in ds (july 17) atlanta
Thinkful
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
DrSudheerHanumanthak
 
Data sci sd-11.6.17
Data sci sd-11.6.17Data sci sd-11.6.17
Data sci sd-11.6.17
Thinkful
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
Aseel Addawood
 
Data fluency for the 21st century
Data fluency for the 21st centuryData fluency for the 21st century
Data fluency for the 21st century
MartinFrigaard
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Thinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCThinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DC
TJ Stalcup
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
Albert Anthony Gavino, MBA
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
D92-198gstindspdx
D92-198gstindspdxD92-198gstindspdx
D92-198gstindspdx
Thinkful
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
 

Similar to Session 01 designing and scoping a data science project (20)

Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
Getstarteddssd12717sd
Getstarteddssd12717sdGetstarteddssd12717sd
Getstarteddssd12717sd
 
intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...intro to data science Clustering and visualization of data science subfields ...
intro to data science Clustering and visualization of data science subfields ...
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
 
Getting started in ds (july 17) atlanta
Getting started in ds (july 17)   atlantaGetting started in ds (july 17)   atlanta
Getting started in ds (july 17) atlanta
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Data sci sd-11.6.17
Data sci sd-11.6.17Data sci sd-11.6.17
Data sci sd-11.6.17
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
 
Data fluency for the 21st century
Data fluency for the 21st centuryData fluency for the 21st century
Data fluency for the 21st century
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Thinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCThinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DC
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
D92-198gstindspdx
D92-198gstindspdxD92-198gstindspdx
D92-198gstindspdx
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 

More from bodaceacat

CansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for MisinformationCansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for Misinformation
bodaceacat
 
2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master
bodaceacat
 
Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019
bodaceacat
 
Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019
bodaceacat
 
Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019
bodaceacat
 
Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018
bodaceacat
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
bodaceacat
 
Session 08 geospatial data
Session 08 geospatial dataSession 08 geospatial data
Session 08 geospatial data
bodaceacat
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
bodaceacat
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
bodaceacat
 
Session 05 cleaning and exploring
Session 05 cleaning and exploringSession 05 cleaning and exploring
Session 05 cleaning and exploring
bodaceacat
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
bodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
bodaceacat
 
Session 02 python basics
Session 02 python basicsSession 02 python basics
Session 02 python basics
bodaceacat
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
bodaceacat
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
bodaceacat
 
Ardrone represent
Ardrone representArdrone represent
Ardrone representbodaceacat
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection managerbodaceacat
 
Un Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian InnovationUn Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian Innovation
bodaceacat
 
Blue light services
Blue light servicesBlue light services
Blue light services
bodaceacat
 

More from bodaceacat (20)

CansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for MisinformationCansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for Misinformation
 
2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master
 
Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019
 
Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019
 
Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019
 
Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
 
Session 08 geospatial data
Session 08 geospatial dataSession 08 geospatial data
Session 08 geospatial data
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
 
Session 05 cleaning and exploring
Session 05 cleaning and exploringSession 05 cleaning and exploring
Session 05 cleaning and exploring
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 02 python basics
Session 02 python basicsSession 02 python basics
Session 02 python basics
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Ardrone represent
Ardrone representArdrone represent
Ardrone represent
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection manager
 
Un Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian InnovationUn Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian Innovation
 
Blue light services
Blue light servicesBlue light services
Blue light services
 

Recently uploaded

Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 

Recently uploaded (20)

Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 

Session 01 designing and scoping a data science project

  • 1. Designing and Scoping a Data Science Project Data Science for Beginners, Session 1
  • 3. Session Format Session: • One topic • Learn 4-6 concepts related to that topic • Try apps or code related to that topic Before each session: • Install required tools (see the ‘tool installs’ instructions sheet) • Do background reading
  • 4. Session Topics People • Designing a data science project • Communicating results Tools • Python basics • Enterprise data tools Getting Data • Acquiring data • Cleaning and exploring data Special data types • Handling text data • Handling geospatial data • Handling big data Learning from data • Predicting values from data • Learning relationships from data • Learning classes from data
  • 5. Sessions Timeline 1. Scoping a data science project 2. Python basics 3. Acquiring data 4. Communicating results 5. Cleaning and exploring data 6. Predicting values from data 7. Handling text data 8. Handling geospatial data 9. Learning relationships from data 10. Enterprise data tools 11. Learning classes from data 12. Handling big data
  • 6. Session 1: your 5-7 things • What is data science? • Data science is a process • What’s a data scientist? • Data science competitions • Writing a problem statement
  • 7. What is Data Science?
  • 8. Defining Data Science “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.” “The analysis of data using the scientific method” “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”
  • 10. Data Science is a Process • Ask an interesting question • Get the data • Explore the data • Model the data • Communicate and visualize your results
  • 11. Ask an interesting question Write hypotheses that can be explored ● Do people have more phones than toilets? ● How is Ebola spreading? ● Is using wood fires sustainable in rural Tanzania? ● Can we feed 9 billion people? Make them simple, actionable, incremental
  • 12. Get the data Data files (CSV, Excel, Json, Xml...) ● Databases (sqlite, mysql, oracle, postgresql...) ● APIs ● Report tables (tables on websites, in pdf reports...) ● Text (reports and other documents…) ● Maps and GIS data (openstreetmap, shapefiles, NASA earth images...) ● Images (satellite images, drone footage, pictures, videos…)
  • 13. Most data is small, but…
  • 19. The Data Science Venn Diagram
  • 20. How do you become a data scientist? Learning and Practice ● Kaggle - online datascience competitions ● Driven Data - social good datascience competitions ● Innocentive - some datascience challenges ● CrowdAnalytix - business datascience competitions
  • 21. Should you become a data scientist? ● Not necessarily. There are lots of data science students desperate for good problems to work on. ● You might want to become someone who can work with data scientists ● Which means learning how to specify data problems well
  • 23. Who Does What • Ask an interesting question • Get the data • Explore the data • Model the data • Communicate and visualize your results Problem Owner Competitor ?
  • 27. Example project: Pump It Up Tanzania wells: “Your goal is to predict the operating condition of a waterpoint for each record in the dataset”
  • 29. DrivenData competition guidelines Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…” Challenge: “… challenging enough for a rich competition…” Feasibility: “….the right kind of data to answer the question at hand… does it have enough signal to be useful?...” Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”
  • 31. Design your project Context: who needs this work, and what are they doing it for? Needs: what are you trying to fix Vision: what do you expect your final result to look like? Outcome: how do you get your results to the people who need them? What happens next?
  • 32. Design your questions Is the question concrete enough? Can you translate the question into an experiment? Is it actionable? What actions will be taken given the answer? What data is needed to do the analysis?
  • 34. Data Risk and Ethics You’re responsible for your data outputs Could your outputs increase risk to anyone? How will you respect privacy and security?
  • 35. Data Risk Risk: “The probability of something happening multiplied by the resulting cost or benefit if it does” Risk of: physical, legal, reputational, privacy harm Likelihood (e.g. low, medium, high) Risk to: data subjects, collectors, processors, releasers, users
  • 36. PII: Personally Identifiable Information “Personally identifiable information (PII) is any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”
  • 37. PII Red Flags Names, addresses, phone numbers Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier) Members of small populations Untranslated text Codes (e.g. “41”) Slang terms
  • 39. 3-minute exercise: Ask interesting questions Either your own questions: Questions that data might help with Stories you want to tell with data Datasets you’d like to explore Or pick an existing question: ● Competition questions: Kaggle, DrivenData ● A data science project that interested you
  • 40. 3-minute exercise: Get the data Pick one of your questions List the ideal data you need to answer it List the data that’s (probably) available Think about what you’ll do if the data you need isn’t available What compromises could you make Where would you look for more data Are there proxies (other datasets that tell you something about your question)
  • 41. 3-min exercise: design your communications List the types of people you’d want to show your results to How do you want them to change the world? Can they take actions, can they change opinions etc Describe the types of outputs that might be persuasive to them - visuals, text, numbers, stories, art… be as wild with this as you want
  • 42. Things to do before next week See file Tool Install Instructions • Make friends with the terminal window • Install iPython • Install Git