How to Approach Data Science Problems from Start to End
Polong Lin
Data Scientist
IBM Analytics, Emerging Technologies
@polonglin
@bigdatau
Taiwan Data Science Conference (台灣資料科學年會)
• Free online courses
• Data Science & Data Engineering
• A community initiative led by IBM
• Certificates and Badges
• > 450,000 users
What	is	Big	Data	University	(BDU)?
“5-5-5 Rule”
Course
• Lessons 1 to 5: 5 videos each
• Lab Exercises
• Final Exam
• Certificate/Badge
Learn	hands-on. Exercises	in	the	cloud.
DataScientistWorkbench.com
Data Science Methodology
1. Business Understanding
2. Analytic Approach
3. Data Requirements
4. Data Collection
5. Data Understanding
6. Data Preparation
7. Modelling
8. Evaluation
9. Deployment
10. Feedback
Prediction, Interpretation, Justification, Testing
“Polong	will	fly	from	San	Francisco	to	New	York
for	a	meeting	at	3:00pm	on	Friday,	July	22.”
Can	Polong	anticipate	whether	his	flight	will	be	delayed?
Flight	delays
San	Francisco New	York
• Every	project	begins	with	business	understanding.
• What	is	the	project	objective?
• What	are	we	trying	to	do	– what	is	our	goal?
1. Formulate	a	clear	question
2. Define	problem	and	solution	requirements
Flight delays: Create a solution that can help users predict whether a flight on a given day will be delayed or not delayed.
1.	Business	understanding
Using the departure and arrival airports, date, carrier, etc., we could predict flight [DELAY] or [NO-DELAY] using logistic regression.
• Identify	suitable	statistical/machine	learning	technique(s)
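For reference, the standard form of logistic regression can be written as below. The symbols are illustrative and not from the slides: x_1 … x_p stand for encoded inputs such as airport, date and carrier, and the 0.5 cut-off for calling a flight [DELAY] is an assumed default, not a stated choice.

```latex
P(\text{DELAY} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},
\qquad
\text{predict [DELAY] if } P(\text{DELAY} \mid x) \ge 0.5
```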
• Linear regression
• Logistic regression
• Clustering
• Decision Trees
• Principal component analysis
• Text analysis
• SVM/SVR
• Neural networks
• Dimension Reduction
2.	Analytic	approach
3. Data Requirements
• What data is required?
• What format?
4. Data Collection
• Collect the data
5. Data Understanding
• What does the data look like?
• What are initial insights?
• Can we visualize the data?
• Are we missing anything?
• Flight	data
• Open	data	available
• All	domestic	US	flights	per	year
• CSV	format
• Which	airports	are	busiest?
• Which	flights	are	most	delayed?
• Which	airports	are	best/worst?
Flight	Data
We will look only at data from 2007 (seven million flights).
http://stat-computing.org/dataexpo/2009/the-data.html
Departure	Delay	(min)
Which	airports	are	busiest?
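A minimal sketch of how this question could be answered from the CSV file, using only the Python standard library. It assumes "busiest" means the most departures and that the file has a column named `Origin`; the sample rows are made up for illustration and are not taken from the 2007 data.

```python
import csv
import io
from collections import Counter

# Made-up sample in an assumed layout (column names are assumptions).
SAMPLE_CSV = """Origin,Dest,DepDelay
SFO,JFK,20
SFO,LAX,0
ORD,SFO,35
ATL,ORD,5
ATL,JFK,-3
ATL,SFO,50
"""

def busiest_airports(csv_file, top_n=3):
    """Count departures per origin airport and return the top_n busiest."""
    counts = Counter(row["Origin"] for row in csv.DictReader(csv_file))
    return counts.most_common(top_n)

print(busiest_airports(io.StringIO(SAMPLE_CSV), top_n=2))
```

For the full file, `open(path, newline="")` could be passed in place of the `io.StringIO` object; with seven million rows a dataframe library would usually be more convenient.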
Which flights are most likely to be delayed?
Data	Preparation	typically	includes:
• Data	cleaning
• Merging	data
• Transforming	data
• Feature	engineering
• Text	analysis
6.	Data	preparation
Flights	are	classified	as	“delayed”	if	>15	min	late.
• Delayed? [True	or	False]
Does	time	of	day	for	departure	predict	delays?
• Hour
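The two derived features above could be computed roughly as follows. This is a sketch under assumptions: the delay is taken to be the departure delay in a column named `DepDelay`, the scheduled departure time is assumed to be an hhmm integer in a column named `CRSDepTime`, and rows with a missing delay are simply dropped.

```python
def prepare_flight(row, threshold_min=15):
    """Return row plus 'Delayed' and 'Hour', or None if the delay is missing."""
    if row["DepDelay"] in ("", "NA"):
        return None  # assumption: missing delays are excluded, not imputed
    delayed = float(row["DepDelay"]) > threshold_min  # ">15 min late"
    hour = int(row["CRSDepTime"]) // 100  # hhmm -> hour, e.g. 1435 -> 14
    return {**row, "Delayed": delayed, "Hour": hour}

# Made-up records for illustration only.
raw = [
    {"DepDelay": "22", "CRSDepTime": "1435"},
    {"DepDelay": "15", "CRSDepTime": "605"},
    {"DepDelay": "NA", "CRSDepTime": "2359"},
]
prepared = [p for p in (prepare_flight(r) for r in raw) if p is not None]
print([(p["Hour"], p["Delayed"]) for p in prepared])
```

A flight exactly 15 minutes late counts as not delayed here, because the rule on the slide is strictly greater than 15.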
Which	day	of	the	week and	time	of	departure	is	worst?
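One way to explore this question is to compute the share of delayed flights for each (day of week, hour) slot. The sketch below assumes each flight is a dict with the hypothetical keys `DayOfWeek`, `Hour` and `Delayed`; the records are invented for illustration.

```python
from collections import defaultdict

def delay_rate_by_slot(flights):
    """Return {(day_of_week, hour): fraction of flights delayed}."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for f in flights:
        slot = (f["DayOfWeek"], f["Hour"])
        totals[slot] += 1
        delayed[slot] += int(f["Delayed"])
    return {slot: delayed[slot] / totals[slot] for slot in totals}

# Made-up flights: day 5 at 18:00 and day 2 at 06:00.
flights = [
    {"DayOfWeek": 5, "Hour": 18, "Delayed": True},
    {"DayOfWeek": 5, "Hour": 18, "Delayed": True},
    {"DayOfWeek": 5, "Hour": 18, "Delayed": False},
    {"DayOfWeek": 2, "Hour": 6, "Delayed": False},
    {"DayOfWeek": 2, "Hour": 6, "Delayed": True},
]
rates = delay_rate_by_slot(flights)
worst = max(rates, key=rates.get)  # slot with the highest delay rate
print(worst, rates[worst])
```

On real data, slots with very few flights would give noisy rates, so a minimum flight count per slot may be worth adding.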
Modeling is:
• A highly iterative process
• A stage in which multiple models may be used and tested
Modeling
Using	inputs:
• Year
• Month
• Day	of	Month
• Hour of	departure
• Distance
• Destination airport
Predict:
Delay (True/False)
Logistic	Regression
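To make the idea concrete, here is a small logistic regression fitted by batch gradient descent in plain Python. It is only a sketch: it uses two of the inputs listed above (hour of departure and distance, both rescaled to roughly 0 to 1), the six training rows are invented, and the learning rate and epoch count are arbitrary choices. It is not the model shown in the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """Probability of delay; w[0] is the bias, w[1:] the feature weights."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def train_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Invented data: [hour / 24, distance / 3000] -> delayed (1) or not (0).
X = [[6 / 24, 0.2], [8 / 24, 0.9], [9 / 24, 0.5],
     [17 / 24, 0.3], [19 / 24, 0.8], [21 / 24, 0.4]]
y = [0, 0, 0, 1, 1, 1]

w = train_logistic(X, y)
p_morning = predict_proba(w, [7 / 24, 0.5])
p_evening = predict_proba(w, [19 / 24, 0.5])
print("DELAY" if p_evening >= 0.5 else "NO-DELAY")
```

Categorical inputs such as the destination airport would first need to be encoded as numbers (for example one indicator column per airport) before they can be used this way.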
How accurately does our model predict delays?
• Does	the	model	performance	meet	our	business	goals?
• Do	we	need	to	refine	our	model?
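The questions above are usually answered with a few summary metrics computed on flights the model has not seen. The sketch below computes accuracy, precision and recall from true and predicted labels; the label lists are made up, and which metric matters most depends on the business goal.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for boolean delay labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # delayed, predicted delayed
    fp = sum(1 for t, p in pairs if not t and p)      # on time, predicted delayed
    fn = sum(1 for t, p in pairs if t and not p)      # delayed, predicted on time
    tn = sum(1 for t, p in pairs if not t and not p)  # on time, predicted on time
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels for eight held-out flights.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, False, True, False, False, True, False, False]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Because most flights are not delayed, accuracy alone can look good for a model that never predicts a delay, which is why precision and recall are reported alongside it.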
Model	evaluation
• Once finalized, the model is deployed into a production environment.
  • May be in a limited / test environment until model is proven
• Involves additional groups, skills, and technologies
  • Solution owner
  • Marketing
  • Application developers and designers
  • IT administration
• Feedback to assess model performance
  • Gathering and analysis of feedback for assessment of the model’s performance and impact
  • Iterative process for model refinement and redeployment
  • Accelerate through automated processes
Deployment	and	feedback
Creating	a	prototype
Case-study	&	Demo:	Food
Can	we	use	ingredients	to	predict	what	cuisine	a	recipe	belongs	to?
What	cuisine	is	this?
4 minute BLT Beast
What	cuisine	is	this?
Ingredients:
Rice
Seaweed
Wasabi
Soy	sauce
http://allrecipes.com/recipe/189477/california-roll-sushi/
How	are	we	able	to	tell	what	kind	
of	cuisine	some	food	dish	is,
even	if	we’ve	never	seen	it	before?
Schellack at English	Wikipedia
https://www.flickr.com/photos/10559879@N00/4004745542
A. Based	on	the	ingredients	alone,	can	we	predict	
what	cuisine a	food	dish	belongs	to?
B. Which cuisines	are similar	to	each	other	based	
on	their	ingredients?
1. Business Understanding
Research
Japanese, American, British, Indian, Chinese, French, Italian, Vietnamese, Canadian
Food	and	ingredients
ALL CUISINES
Rice?
  NO: NON-ASIAN FOOD
  YES: ASIAN FOOD
    Wasabi?
      NO: NOT JAPANESE
      YES: JAPANESE
A. Based	on	the	ingredients	alone,	
can	we	predict	what	cuisine a	food	dish	belongs	to?
2.	Analytic
Approach
Decision	trees
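As a sketch of how a decision tree turns ingredients into a prediction, the Rice?/Wasabi? tree from the earlier slide can be written as nested rules and walked from the root to a leaf. The tree here is copied by hand from that diagram, not learned from data, and the dict layout (`question`, `yes`, `no`) is an illustrative choice.

```python
# Hand-copied tree: each node asks whether an ingredient is present.
TREE = {
    "question": "rice",
    "no": "NON-ASIAN FOOD",
    "yes": {
        "question": "wasabi",
        "no": "NOT JAPANESE",
        "yes": "JAPANESE",
    },
}

def classify(ingredients, node=TREE):
    """Walk the tree until a leaf (a plain string label) is reached."""
    have = {i.lower() for i in ingredients}
    while isinstance(node, dict):
        node = node["yes"] if node["question"] in have else node["no"]
    return node

print(classify(["Rice", "Seaweed", "Wasabi", "Soy sauce"]))
print(classify(["Bacon", "Lettuce", "Tomato"]))
```

A tree-learning algorithm would choose the questions and their order automatically from labelled recipes, typically by picking at each node the ingredient that best separates the cuisines.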
B. Which	cuisines	are similar	to	each	other	based	on	their	ingredients?
2. Analytic Approach
K-means clustering: group similar cuisines together into k clusters.
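A minimal k-means sketch in plain Python, assuming each cuisine is described by the fraction of its recipes that use a few chosen ingredients. The numbers are invented for illustration, and starting the centroids at the first k points is a simplification that keeps the run deterministic; practical implementations usually use random or k-means++ starts.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Invented usage fractions for [rice, soy sauce, butter, olive oil].
CUISINES = {
    "Japanese":   [0.8, 0.7, 0.1, 0.0],
    "French":     [0.1, 0.0, 0.8, 0.3],
    "Chinese":    [0.7, 0.8, 0.0, 0.0],
    "Italian":    [0.1, 0.0, 0.4, 0.9],
    "Vietnamese": [0.8, 0.4, 0.0, 0.0],
}
labels = dict(zip(CUISINES, kmeans(list(CUISINES.values()), k=2)))
print(labels)
```

The value of k has to be chosen by the analyst; trying several values and comparing the resulting groups is a common starting point.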
www.allrecipes.com
www.epicurious.com
www.menupan.com
Web	Scrape
Data Collection
Data	scraped	by	Yong-Yeol Ahn
http://yongyeol.com/
Data Understanding
Polong Lin(林伯龍)/how to approach data science problems from start to end
