Jason Geng's presentation at Dallas Data Science Conference 2017 (www.dsassn.org/dallas)
Research on the data science project lifecycle, skill sets, and the gaps between industry and the curricula provided by universities.
3. Business Requirements
Data scientists need to work with business people and with those who have expertise in understanding the data and understanding the business
Specify the business requirements
For instance, with healthcare data:
4. Understand the data: e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’
Understand the business
Goal: Predict Readmission Rate
Database (healthcare): Readmissions Database
Modeling
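As an illustration of why understanding a field such as DISCWT matters, here is a minimal sketch of a weighted readmission rate: each sampled discharge counts as many times as its weight says it represents nationally. The `Discharge` class, the `readmitted` flag, and the sample values are all hypothetical and are not taken from the HCUP schema; only the idea of a discharge-level weight comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Discharge:
    """One discharge record; field names are illustrative, not the HCUP schema."""
    discwt: float      # discharge-level weight (DISCWT on the slide)
    readmitted: bool   # hypothetical flag: was the patient readmitted?


def weighted_readmission_rate(records):
    """Return sum(weight * readmitted) / sum(weight), or None if there is no weight."""
    total_weight = sum(r.discwt for r in records)
    if total_weight == 0:
        return None
    readmitted_weight = sum(r.discwt for r in records if r.readmitted)
    return readmitted_weight / total_weight


# Made-up sample: three sampled discharges standing in for 2 + 1 + 1 = 4 national ones.
sample = [
    Discharge(discwt=2.0, readmitted=True),
    Discharge(discwt=1.0, readmitted=False),
    Discharge(discwt=1.0, readmitted=False),
]
rate = weighted_readmission_rate(sample)
print(rate)  # 0.5
```

Ignoring the weights would give 1 readmission out of 3 records; with the weights the estimate is 2 out of 4, which is the kind of difference that a misread data dictionary can hide.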
5. Data Collection
Data from product line
Purchase third party data
Social media (Facebook, LinkedIn)
Web crawling
Open source data (Opendata, U.S. Census Data)
Challenges: data storage, data management
6. Data sources: Legacy data, OLTP, Web Log, Web Crawler, Open Source, Third Party Data, Social Media Data, Product Line
Formats: XML, CSV, LOG, SQL, …
Consumers: Business Intelligence, Data Science App
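The mix of formats on this slide can be sketched as a small normalisation step that turns each input into the same list-of-dicts shape before anything downstream sees it. This is only a sketch using the Python standard library; the field names (`user`, `amount`), the `<record>` element, and the log line pattern are assumptions made up for the example, not formats described in the talk.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET


def from_csv(text):
    """Each CSV row becomes a dict keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_xml(text):
    """Each <record> element becomes a dict of its child tags and their text."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in rec} for rec in root.iter("record")]


# Hypothetical log layout: free text followed by 'user=... amount=...'.
LOG_LINE = re.compile(r"user=(?P<user>\S+) amount=(?P<amount>\S+)")


def from_log(text):
    """Each line matching the assumed pattern becomes a dict; other lines are skipped."""
    return [m.groupdict() for m in map(LOG_LINE.search, text.splitlines()) if m]


csv_text = "user,amount\nalice,10\n"
xml_text = "<records><record><user>bob</user><amount>20</amount></record></records>"
log_text = "2017-01-01 12:00:00 user=carol amount=30\nmalformed line\n"

records = from_csv(csv_text) + from_xml(xml_text) + from_log(log_text)
print(records)
```

All values stay as strings here; type conversion and validation belong to the data preparation step.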
7. Data Preparation (Data Wrangling)
Cleaning data (semantic errors, missing entries, or inconsistent formatting)
Challenge: data integration
80% of the time in the project workflow
Data Source A, Data Source B, Data Source B → ETL → Data Warehouse
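A minimal sketch of the cleaning and integration step might look like the following. It handles only two of the problems named above (missing entries and inconsistent formatting) and merges two sources by a normalised name. The record layout, the `clean` and `integrate` helpers, and the "keep the first copy" rule are assumptions chosen for illustration, not a method from the talk.

```python
def clean(record):
    """Normalise one raw record, or return None if a required field is missing."""
    name = (record.get("name") or "").strip().title()
    raw_age = (record.get("age") or "").strip()
    if not name or not raw_age.isdigit():
        return None  # missing entry or non-numeric age: drop the record
    return {"name": name, "age": int(raw_age)}


def integrate(*sources):
    """Clean every record and merge the sources, keeping the first copy of each name."""
    merged = {}
    for source in sources:
        for raw in source:
            row = clean(raw)
            if row is not None and row["name"] not in merged:
                merged[row["name"]] = row
    return list(merged.values())


# Made-up inputs with stray whitespace, mixed case, a blank age, and a duplicate.
source_a = [{"name": "  alice SMITH ", "age": "34"}, {"name": "Bob Jones", "age": ""}]
source_b = [{"name": "Alice Smith", "age": "34"}, {"name": "carol lee", "age": " 41"}]

warehouse = integrate(source_a, source_b)
print(warehouse)
```

Real integration usually needs a more reliable key than a name, and semantic errors (for example an implausible age) need domain rules that this sketch does not attempt.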
11. Data Science Skill Tree
Machine Learning
Data Collection
Communication & Storytelling
Data Wrangling
Product Development & Feedback Analysis
Data Visualization
Statistics
Domain Knowledge & Business Mindset
13. Required Knowledge
Skillsets: Knowledge
Domain Knowledge and Business Mindset
  Programming: R, Python, NLP, Java, Distributed System
  Industry: Various Concentrations (Finance, E-Commerce, Geo, Biology, Medicine)
Data Collection & Wrangling
  Database: Database Systems and Management
  Big Data: Big Data Processing and Analytics
Statistics: Modeling, Inference and Optimization
Machine learning: Data Mining and Machine Learning
Data Visualization: Data Visualization and Exploratory Analytics
Communication and Storytelling: Professional Speaking and Writing
14. Program Comparison
University Name: Northwestern, CMU, Johns Hopkins, Columbia University, Stanford, Berkeley, UW, USC
Domain Knowledge & Business Mindset
  Programming ✓ ✓ ✓ ✓ ✓ ✓ ✓
  Industry ✓ ✓ ✓ ✓ ✓
Data Collection & Wrangling
  Database ✓ ✓ ✓ ✓
  Big Data ✓ ✓ ✓ ✓ ✓ ✓
Statistics ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Machine learning ✓ ✓ ✓ ✓ ✓ ✓ ✓
Data Visualization ✓ ✓ ✓ ✓
Communication and Storytelling ✓ ✓ ✓
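One way to read the comparison is as a coverage share per skill: the number of check marks in a row divided by the eight programs listed. The sketch below does that arithmetic, assuming the check marks are complete as transcribed here; it says nothing about which university covers which skill, since only the per-row counts are used.

```python
PROGRAMS = 8  # Northwestern, CMU, Johns Hopkins, Columbia, Stanford, Berkeley, UW, USC

# Check marks per row, counted from the slide as transcribed (an assumption).
checks = {
    "Programming": 7,
    "Industry": 5,
    "Database": 4,
    "Big Data": 6,
    "Statistics": 8,
    "Machine learning": 7,
    "Data Visualization": 4,
    "Communication and Storytelling": 3,
}

# Share of programs that cover each skill.
coverage = {skill: n / PROGRAMS for skill, n in checks.items()}
least_covered = min(coverage, key=coverage.get)

# Print skills from least to most covered.
for skill, share in sorted(coverage.items(), key=lambda kv: kv[1]):
    print(f"{skill}: {share:.0%}")
```

Under these counts, Statistics is covered by every program and Communication and Storytelling by the fewest.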