Designing and Scoping a Data Science Project
Data Science for Beginners, Session 1
About these Sessions
Session Format
Session:
• One topic
• Learn 4-6 concepts related to that topic
• Try apps or code related to that topic
Before each session:
• Install required tools (see the ‘tool installs’ instructions sheet)
• Do background reading
Session Topics
People
• Designing a data science project
• Communicating results
Tools
• Python basics
• Enterprise data tools
Getting Data
• Acquiring data
• Cleaning and exploring data
Special data types
• Handling text data
• Handling geospatial data
• Handling big data
Learning from data
• Predicting values from data
• Learning relationships from data
• Learning classes from data
Sessions Timeline
1. Scoping a data science project
2. Python basics
3. Acquiring data
4. Communicating results
5. Cleaning and exploring data
6. Predicting values from data
7. Handling text data
8. Handling geospatial data
9. Learning relationships from data
10. Enterprise data tools
11. Learning classes from data
12. Handling big data
Session 1: your 5-7 things
• What is data science?
• Data science is a process
• What’s a data scientist?
• Data science competitions
• Writing a problem statement
What is Data Science?
Defining Data Science
“A data scientist… excels at analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.”
“The analysis of data using the scientific method”
“A data scientist is an individual, organization or application that performs statistical
analysis, data mining and retrieval processes on a large amount of data to identify
trends, figures and other relevant information.”
Understanding through Data
Data Science is a Process
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Ask an interesting question
Write hypotheses that can be explored
● Do people have more phones than toilets?
● How is Ebola spreading?
● Is using wood fires sustainable in rural Tanzania?
● Can we feed 9 billion people?
Make them simple, actionable, incremental
Get the data
● Data files (CSV, Excel, JSON, XML...)
● Databases (SQLite, MySQL, Oracle, PostgreSQL...)
● APIs
● Report tables (tables on websites, in PDF reports...)
● Text (reports and other documents…)
● Maps and GIS data (OpenStreetMap, shapefiles, NASA earth images...)
● Images (satellite images, drone footage, pictures, videos…)
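As a rough illustration of the first two source types above, here is a minimal sketch that reads the same small table from CSV text, JSON text, and a SQLite database using only the Python standard library. The sample data, the `points` table, and the function names are all made up for this example; real sources will need their own paths, credentials, and cleaning.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for real files; all values are made up.
CSV_TEXT = "district,waterpoints\nA,12\nB,7\n"
JSON_TEXT = '[{"district": "A", "waterpoints": 12}, {"district": "B", "waterpoints": 7}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def rows_from_sqlite(rows):
    """Load rows into an in-memory SQLite table, then query them back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (district TEXT, waterpoints INTEGER)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?)",
        [(r["district"], int(r["waterpoints"])) for r in rows],
    )
    cursor = conn.execute(
        "SELECT district, waterpoints FROM points ORDER BY district"
    )
    result = [{"district": d, "waterpoints": w} for d, w in cursor]
    conn.close()
    return result


csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)
db_rows = rows_from_sqlite(csv_rows)
```

Note that the CSV reader returns every value as a string, while JSON and SQLite keep the numbers as integers. Differences like this are one reason a reformatting step usually follows data acquisition.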
Most data is small, but…
Reformat the data
Explore the data
Model the Data
Communicate results
What’s a Data Scientist?
The Data Science Venn Diagram
How do you become a data scientist?
Learning and Practice
● Kaggle - online data science competitions
● DrivenData - social good data science competitions
● InnoCentive - some data science challenges
● CrowdAnalytix - business data science competitions
Should you become a data scientist?
● Not necessarily. There are lots of data science students desperate for good problems to work on.
● You might want to become someone who can work with data scientists
● Which means learning how to specify data problems well
Problem examples:
Data Science Competitions
Who Does What
• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results
Problem Owner
Competitor
?
DrivenData
Kaggle
DataKind
Example project: Pump It Up
Tanzania wells:
“Your goal is to predict the operating condition of a waterpoint for each record in the dataset”
Example project: Cervical cancer
DrivenData competition guidelines
Impact: “… clear win for the organisation in terms of effective planning, resources
saved or people served… good story around how they generate social impact…”
Challenge: “… challenging enough for a rich competition…”
Feasibility: “… the right kind of data to answer the question at hand… does it
have enough signal to be useful?…”
Privacy: “… can answer this question while protecting the privacy of individuals in
the dataset and the operational privacy of an organisation…”
Writing a Problem Statement
Design your project
Context: who needs this work, and what are they doing it for?
Needs: what are you trying to fix?
Vision: what do you expect your final result to look like?
Outcome: how do you get your results to the people who need them? What
happens next?
Design your questions
Is the question concrete enough?
Can you translate the question into an experiment?
Is it actionable?
What actions will be taken given the answer?
What data is needed to do the analysis?
Data Science Ethics
Data Risk and Ethics
You’re responsible for your data outputs
Could your outputs increase risk to anyone?
How will you respect privacy and security?
Data Risk
Risk: “The probability of something happening multiplied by the resulting cost or
benefit if it does”
Risk of: physical, legal, reputational, privacy harm
Likelihood (e.g. low, medium, high)
Risk to: data subjects, collectors, processors, releasers, users
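The definition above (probability multiplied by cost) can be worked through as simple arithmetic. The sketch below is only an illustration: the mapping from likelihood labels to probabilities, the cost scale, and the three example risks are all invented for this example and are not from the slides.

```python
# Hypothetical mapping from likelihood labels to rough probabilities.
LIKELIHOOD = {"low": 0.1, "medium": 0.5, "high": 0.9}


def risk_score(likelihood, cost):
    """Risk = probability of the event multiplied by its cost if it happens."""
    return LIKELIHOOD[likelihood] * cost


# (description, likelihood label, cost on a made-up 1-10 scale)
risks = [
    ("privacy harm to data subjects", "medium", 8),
    ("reputational harm to releasers", "low", 5),
    ("legal harm to collectors", "high", 3),
]

# Rank risks from highest to lowest score to decide what to address first.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)

for description, likelihood, cost in ranked:
    print(f"{description}: {risk_score(likelihood, cost):.1f}")
```

In this made-up example a medium-likelihood, high-cost harm ranks above a high-likelihood, low-cost one, which is the point of multiplying the two rather than looking at likelihood alone.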
PII: Personally Identifiable Information
“Personally identifiable information (PII) is any data that could potentially
identify a specific individual. Any information that can be used to distinguish one
person from another and can be used for de-anonymizing anonymous data can be
considered PII.”
PII Red Flags
Names, addresses, phone numbers
Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
Members of small populations
Untranslated text
Codes (e.g. “41”)
Slang terms
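A first pass over some of these red flags can be automated by scanning column names. The sketch below is a rough aid under stated assumptions, not a PII detector: the keyword lists and column names are hypothetical, it only covers names, addresses, phone numbers and coordinates, and it cannot spot small populations, codes, or free text, which still need a human review.

```python
import re

# Hypothetical keyword lists; extend them for the datasets you actually handle.
RED_FLAG_WORDS = {
    "names": {"name", "surname"},
    "addresses": {"address", "street"},
    "phone numbers": {"phone", "mobile", "tel"},
    "locations": {"lat", "latitude", "lon", "long", "longitude", "gps"},
}


def flag_columns(columns):
    """Return {column: [red-flag categories]} for columns whose names match."""
    flagged = {}
    for column in columns:
        # Split "home_lat" or "Phone-Number" into lowercase word tokens.
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        reasons = sorted(
            category
            for category, words in RED_FLAG_WORDS.items()
            if tokens & words
        )
        if reasons:
            flagged[column] = reasons
    return flagged


# Made-up column names for illustration.
columns = [
    "respondent_name",
    "home_lat",
    "home_long",
    "phone_number",
    "water_source",
    "village_code",
]
flags = flag_columns(columns)
```

Here `village_code` is not flagged even though codes appear in the red-flag list above, which shows why a name-based scan is only a starting point.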
Exercises
3-minute exercise: Ask interesting questions
Either your own questions:
Questions that data might help with
Stories you want to tell with data
Datasets you’d like to explore
Or pick an existing question:
● Competition questions: Kaggle, DrivenData
● A data science project that interested you
3-minute exercise: Get the data
Pick one of your questions
List the ideal data you need to answer it
List the data that’s (probably) available
Think about what you’ll do if the data you need isn’t available
What compromises could you make?
Where would you look for more data?
Are there proxies (other datasets that tell you something about your question)?
3-minute exercise: Design your communications
List the types of people you’d want to show your results to
How do you want them to change the world? Can they take actions, change opinions, etc.?
Describe the types of outputs that might be persuasive to them - visuals, text,
numbers, stories, art… be as wild with this as you want
Things to do before next week
See file Tool Install Instructions
• Make friends with the terminal window
• Install IPython
• Install Git

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 

Session 01 designing and scoping a data science project

  • 1. Designing and Scoping a Data Science Project Data Science for Beginners, Session 1
  • 3. Session Format Session: • One topic • Learn 4-6 concepts related to that topic • Try apps or code related to that topic Before each session: • Install required tools (see the ‘tool installs’ instructions sheet) • Do background reading
  • 4. Session Topics People • Designing a data science project • Communicating results Tools • Python basics • Enterprise data tools Getting Data • Acquiring data • Cleaning and exploring data Special data types • Handling text data • Handling geospatial data • Handling big data Learning from data • Predicting values from data • Learning relationships from data • Learning classes from data
  • 5. Sessions Timeline 1. Scoping a data science project 2. Python basics 3. Acquiring data 4. Communicating results 5. Cleaning and exploring data 6. Predicting values from data 7. Handling text data 8. Handling geospatial data 9. Learning relationships from data 10. Enterprise data tools 11. Learning classes from data 12. Handling big data
  • 6. Session 1: your 5-7 things • What is data science? • Data science is a process • What’s a data scientist? • Data science competitions • Writing a problem statement
  • 7. What is Data Science?
  • 8. Defining Data Science “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.” “The analysis of data using the scientific method” “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”
  • 10. Data Science is a Process • Ask an interesting question • Get the data • Explore the data • Model the data • Communicate and visualize your results
  • 11. Ask an interesting question Write hypotheses that can be explored ● Do people have more phones than toilets? ● How is Ebola spreading? ● Is using wood fires sustainable in rural Tanzania? ● Can we feed 9 billion people? Make them simple, actionable, incremental
  • 12. Get the data ● Data files (CSV, Excel, JSON, XML...) ● Databases (SQLite, MySQL, Oracle, PostgreSQL...) ● APIs ● Report tables (tables on websites, in PDF reports...) ● Text (reports and other documents…) ● Maps and GIS data (OpenStreetMap, shapefiles, NASA earth images...) ● Images (satellite images, drone footage, pictures, videos…)
  • 13. Most data is small, but…
  • 19. The Data Science Venn Diagram
  • 20. How do you become a data scientist? Learning and Practice ● Kaggle - online data science competitions ● DrivenData - social good data science competitions ● InnoCentive - some data science challenges ● CrowdAnalytix - business data science competitions
  • 21. Should you become a data scientist? ● Not necessarily. There are lots of data science students desperate for good problems to work on. ● You might want to become someone who can work with data scientists, ● which means learning how to specify data problems well.
  • 23. Who Does What • Ask an interesting question • Get the data • Explore the data • Model the data • Communicate and visualize your results Problem Owner Competitor ?
  • 27. Example project: Pump It Up Tanzania wells: “Your goal is to predict the operating condition of a waterpoint for each record in the dataset”
  • 29. DrivenData competition guidelines Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…” Challenge: “… challenging enough for a rich competition…” Feasibility: “… the right kind of data to answer the question at hand… does it have enough signal to be useful? …” Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”
  • 31. Design your project Context: who needs this work, and what are they doing it for? Needs: what are you trying to fix? Vision: what do you expect your final result to look like? Outcome: how do you get your results to the people who need them? What happens next?
  • 32. Design your questions Is the question concrete enough? Can you translate the question into an experiment? Is it actionable? What actions will be taken given the answer? What data is needed to do the analysis?
  • 34. Data Risk and Ethics You’re responsible for your data outputs. Could your outputs increase risk to anyone? How will you respect privacy and security?
  • 35. Data Risk Risk: “The probability of something happening multiplied by the resulting cost or benefit if it does” Risk of: physical, legal, reputational, privacy harm Likelihood (e.g. low, medium, high) Risk to: data subjects, collectors, processors, releasers, users
  • 36. PII: Personally Identifiable Information “Personally identifiable information (PII) is any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”
  • 37. PII Red Flags Names, addresses, phone numbers Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier) Members of small populations Untranslated text Codes (e.g. “41”) Slang terms
  • 39. 3-minute exercise: Ask interesting questions Either your own questions: Questions that data might help with Stories you want to tell with data Datasets you’d like to explore Or pick an existing question: ● Competition questions: Kaggle, DrivenData ● A data science project that interested you
  • 40. 3-minute exercise: Get the data Pick one of your questions. List the ideal data you need to answer it. List the data that’s (probably) available. Think about what you’ll do if the data you need isn’t available: What compromises could you make? Where would you look for more data? Are there proxies (other datasets that tell you something about your question)?
  • 41. 3-minute exercise: Design your communications List the types of people you’d want to show your results to. How do you want them to change the world? Can they take actions, can they change opinions, etc.? Describe the types of outputs that might be persuasive to them: visuals, text, numbers, stories, art… be as wild with this as you want.
  • 42. Things to do before next week See file Tool Install Instructions • Make friends with the terminal window • Install IPython • Install Git
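
The risk definition on slide 35 (probability multiplied by the resulting cost) can be turned into a simple ranking exercise. The sketch below is only an illustration: the numeric values assigned to "low", "medium" and "high", the cost scale, and the example risks are all assumptions made up for this example, not figures from the slides.

```python
from dataclasses import dataclass

# Assumed mapping from the slide's likelihood labels to probabilities.
# These numbers are illustrative only; pick values that suit your project.
LIKELIHOOD = {"low": 0.1, "medium": 0.5, "high": 0.9}


@dataclass
class DataRisk:
    harm: str        # risk of: physical, legal, reputational, privacy
    risk_to: str     # risk to: data subjects, collectors, processors, ...
    likelihood: str  # "low", "medium" or "high"
    cost: float      # assumed cost on an arbitrary 0-10 scale

    def score(self) -> float:
        # Slide 35: risk = probability of something happening * resulting cost
        return LIKELIHOOD[self.likelihood] * self.cost


def rank_risks(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score(), reverse=True)


# Hypothetical risks for a dataset that is about to be released.
risks = [
    DataRisk("privacy", "data subjects", "high", 8),
    DataRisk("reputational", "releasers", "medium", 6),
    DataRisk("legal", "collectors", "low", 10),
]

for r in rank_risks(risks):
    print(f"{r.harm:<13} {r.risk_to:<14} {r.likelihood:<7} {r.score():.1f}")
```

Ranking this way makes it easier to decide which risks to deal with first, but the scores are only as good as the likelihood and cost estimates that go into them.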