Data Science Project Lifecycle
and Data Scientist Skill Set
Jason Geng @Data Application Lab
Miya Du @Data Science Association
Business Requirement
Data Acquisition
Data Preparation
Hypothesis & Modeling
Evaluation & Interpretation
Deployment
Operations
Optimization
Business Requirements
• Data scientists need to work with business people and with those who have expertise in the data, so that they understand both the data and the business
• Specify the business requirements
• For instance, with healthcare data:
Understand the business:
Goal: Predict Readmission Rate
Understand the data:
Database (healthcare): Readmissions Database
e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Understanding the business and the data then feeds into: Modeling
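As a rough illustration of why a field such as DISCWT matters before modeling, the sketch below scales a handful of sample discharges up to a national estimate by weighting each record. Only the DISCWT name comes from the slide; the `readmitted` field, the function name, and all values are hypothetical.

```python
# Hypothetical sample of discharge records. Only the DISCWT name comes from
# the slide; the assumption here is that each sampled discharge stands for
# DISCWT discharges in the national estimate.
records = [
    {"DISCWT": 2.0, "readmitted": 1},
    {"DISCWT": 1.5, "readmitted": 0},
    {"DISCWT": 2.5, "readmitted": 0},
    {"DISCWT": 2.0, "readmitted": 1},
]


def weighted_readmission_rate(rows):
    """Return (estimated national discharges, weighted readmission rate)."""
    total_weight = sum(r["DISCWT"] for r in rows)
    readmit_weight = sum(r["DISCWT"] * r["readmitted"] for r in rows)
    return total_weight, readmit_weight / total_weight


national_discharges, rate = weighted_readmission_rate(records)
print(national_discharges, rate)
```

Ignoring the weight and simply counting rows would describe the sample rather than the nation, which is the kind of detail that only surfaces when the data is understood together with the people who know it.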
Data Collection
• Data from the product line
• Purchased third-party data
• Social media (Facebook, LinkedIn)
• Web crawling
• Open source data (Opendata, U.S. Census Data)
Challenge: Data Storage, Data Management
Sources: Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data, Product Line
Formats: XML, CSV, LOG, SQL, …
Business Intelligence, Data Science, App
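One way to picture the storage and management challenge is that every source arrives in its own format. The sketch below, using only the Python standard library, parses three made-up snippets (CSV, XML, and a log line) into one common record shape. The field names, sample values, and log layout are assumptions for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical snippets standing in for CSV, XML and LOG sources.
CSV_TEXT = "user_id,amount\n1,9.50\n2,3.00\n"
XML_TEXT = '<orders><order user_id="3" amount="7.25"/></orders>'
LOG_TEXT = "2017-02-16 INFO purchase user_id=4 amount=1.75\n"

# Assumed log layout: key=value pairs for user_id and amount.
LOG_PATTERN = re.compile(r"user_id=(\d+) amount=([\d.]+)")


def from_csv(text):
    """Parse CSV rows into {user_id, amount} records."""
    return [
        {"user_id": int(row["user_id"]), "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_xml(text):
    """Parse <order> elements into the same record shape."""
    return [
        {"user_id": int(el.get("user_id")), "amount": float(el.get("amount"))}
        for el in ET.fromstring(text).iter("order")
    ]


def from_log(text):
    """Pull user_id and amount out of log lines that match the pattern."""
    return [
        {"user_id": int(m.group(1)), "amount": float(m.group(2))}
        for m in LOG_PATTERN.finditer(text)
    ]


records = from_csv(CSV_TEXT) + from_xml(XML_TEXT) + from_log(LOG_TEXT)
print(records)
```

Each new source needs its own small parser, but everything downstream only has to deal with one record shape.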
Data preparation (data wrangling)
• Cleaning data (semantic errors, missing entries, or inconsistent formatting)
• Challenge: data integration
• Takes up about 80% of the time in the project workflow
Data Source A, Data Source B, Data Source B → ETL → Data Warehouse
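The integration step can be sketched as a tiny extract-transform-load pass. In this example two hypothetical sources disagree on field names, id types, and currency units; each gets its own transform into a shared schema, and the result is loaded into an in-memory SQLite table that stands in for the warehouse. All table, column, and function names are assumptions.

```python
import sqlite3

# Hypothetical extracts from two sources with different schemas.
source_a = [
    {"cust_id": 1, "revenue_usd": 120.0},
    {"cust_id": 2, "revenue_usd": 80.0},
]
source_b = [
    {"customer": "2", "revenue_cents": 1500},
    {"customer": "3", "revenue_cents": 4000},
]


def transform_a(row):
    """Source A already matches the target schema."""
    return (row["cust_id"], row["revenue_usd"], "A")


def transform_b(row):
    """Source B stores ids as text and money in cents."""
    return (int(row["customer"]), row["revenue_cents"] / 100, "B")


def load(conn, rows):
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue "
        "(customer_id INTEGER, revenue_usd REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
rows = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
load(warehouse, rows)

# Once loaded, revenue per customer can be queried across both sources.
totals = warehouse.execute(
    "SELECT customer_id, SUM(revenue_usd) FROM revenue "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(totals)
```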
Feature engineering
Select or create features
Research feature relevance
Experiment and validate
Change the feature set
Go back to the feature selection step
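The loop above can be sketched as greedy forward selection: try each unused feature, validate on held-out data, keep the one that helps most, and stop when nothing improves. The nearest-centroid classifier and the toy data here are stand-ins chosen to keep the example short; the slides do not prescribe a particular model or selection strategy.

```python
# Toy data: each item is (feature vector, label). Only feature 1 separates
# the classes; features 0 and 2 are uninformative. All values are made up.
train = [
    ((5.0, 1.0, 0.0), 0), ((1.0, 2.0, 9.0), 0),
    ((1.0, 8.0, 0.0), 1), ((5.0, 9.0, 9.0), 1),
]
valid = [
    ((4.0, 1.5, 8.0), 0), ((2.0, 8.5, 1.0), 1),
    ((3.0, 2.5, 4.0), 0), ((3.0, 7.5, 5.0), 1),
]


def centroid(vectors, cols):
    """Mean of the chosen columns over a list of feature vectors."""
    return [sum(v[c] for v in vectors) / len(vectors) for c in cols]


def accuracy(train_set, valid_set, cols):
    """Validation accuracy of a nearest-centroid classifier on cols."""
    labels = sorted({y for _, y in train_set})
    cents = {
        lab: centroid([x for x, y in train_set if y == lab], cols)
        for lab in labels
    }

    def predict(x):
        return min(
            labels,
            key=lambda lab: sum((x[c] - m) ** 2 for c, m in zip(cols, cents[lab])),
        )

    return sum(predict(x) == y for x, y in valid_set) / len(valid_set)


def forward_select(train_set, valid_set, n_features):
    """Add one feature at a time while validation accuracy improves."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [c for c in range(n_features) if c not in selected]
        scored = [(accuracy(train_set, valid_set, selected + [c]), c) for c in candidates]
        score, col = max(scored, key=lambda t: t[0])
        if score <= best:
            break  # no candidate improves on the current feature set
        selected.append(col)
        best = score
    return selected, best


selected, best = forward_select(train, valid, n_features=3)
print(selected, best)
```

In a real project the "change the feature set" step also covers creating new features, not only picking among existing ones, so this covers just the selection half of the loop.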
Modeling
Reference Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Deploy to product line
Data Science Skill Tree
Machine Learning
Data Collection
Communication & Storytelling
Data Wrangling
Product Development & Feedback Analysis
Data Visualization
Statistics
Domain Knowledge & Business Mindset
Required Knowledge
Skillsets: Knowledge
Domain Knowledge and Business Mindset
  Programming: R, Python, NLP, Java, Distributed System
  Industry: Various Concentrations (Finance, E-Commerce, Geo, Biology, Medicine)
Data Collection & Wrangling
  Database: Database Systems and Management
  Big Data: Big Data Processing and Analytics
Statistics: Modeling, Inference and Optimization
Machine learning: Data Mining and Machine Learning
Data Visualization: Data Visualization and Exploratory Analytics
Communication and Storytelling: Professional Speaking and Writing
Program Comparison
University Name: Northwestern, CMU, Johns Hopkins, Columbia University, Stanford, Berkeley, UW, USC
Domain Knowledge & Business Mindset
  Programming ✓ ✓ ✓ ✓ ✓ ✓ ✓
  Industry ✓ ✓ ✓ ✓ ✓
Data Collection & Wrangling
  Database ✓ ✓ ✓ ✓
  Big Data ✓ ✓ ✓ ✓ ✓ ✓
Statistics ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Machine learning ✓ ✓ ✓ ✓ ✓ ✓ ✓
Data Visualization ✓ ✓ ✓ ✓
Communication and Storytelling ✓ ✓ ✓
Thank you!
https://www.DataAppLab.com
Feb 2017
PPT: Xiaolu Zhao @ Feb 16, 2017


Editor's Notes

  • #4 Add healthcare readmission Niu ying
  • #5 A coefficient computed so that all the data can be compared side by side across the nationwide healthcare data
  • #6 Data source + add picture => bring challenge