Xavier Prudent
enough?
  Depends on what you want to achieve (train a neural network, A/B testing…)
  Back to statistics (mathematical conditions) + rule of thumb
Many formats can be easily read with R
  Text, CSV, Excel, Protocol Buffers, JSON, XML, HTML, SQL…
Many sources are already available
  Kaggle
  Web mining
  Open data
  Government agencies
	
library(XML)  # provides htmlTreeParse()
# Parse the page into an R tree structure
Web.page <- htmlTreeParse("http://lapresse.ca")
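To illustrate how little code a common format needs, here is a minimal sketch using only base R. The inline CSV text and its numbers are made up for the demo; in practice the argument would be a file path or a URL.

```r
# Hypothetical inline CSV (made-up values), standing in for a real file.
csv_text <- "city,year,snow_cm
Montreal,2015,210
Montreal,2016,185
Quebec,2016,300"

# read.csv() accepts a `text` argument, so no file is needed for the demo.
snow <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(snow)    # column types as inferred by R
nrow(snow)   # number of records read
```

Checking the inferred column types right after reading is a cheap first sanity check.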
More details in coming lectures
Montréal Big Data Meetup 22nd March 2017 – Xavier Prudent – www.xavierprudent.com
Mistake 10
Not Having (enough) Data
Where does your data come from?
Stand up and get out, talk with people, read the documentation
library(ggplot2)  # provides the diamonds dataset
library(tabplot)  # provides tableplot()
tableplot(diamonds)
Do not underestimate naive tests…
Mistake 9
Do not check data quality
R package: datacheck
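As an illustration of what "naive tests" can look like, here is a small base-R sketch. The data frame, its column names and the plausible age range are hypothetical, chosen only to show three cheap checks: missing values, duplicated identifiers and out-of-range values.

```r
# Hypothetical toy data with deliberate problems.
df <- data.frame(
  id  = c(1, 2, 2, 4, 5),
  age = c(34, NA, 27, -3, 51)
)

# Missing values per column.
n_missing <- colSums(is.na(df))

# Identifiers that appear more than once.
n_dup_ids <- sum(duplicated(df$id))

# Values outside an assumed plausible range (0 to 120 years).
n_bad_age <- sum(df$age < 0 | df$age > 120, na.rm = TRUE)

print(n_missing)
print(c(duplicated_ids = n_dup_ids, bad_ages = n_bad_age))
```

Checks like these take minutes to write and often catch problems before any modelling starts.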
What kind of data? What do I want to know?
Geographic? Time series? Correlation?

Which visualization?
Histogram, box plot, mosaic, heat map, hexbinning, scatter plot, line chart, 3D

Many R packages available:
  ggplot2
  leaflet
  plotly
  corrplot
	
	
Mistake 8
Do not look at your data
Look at your data
More details in coming lectures
Choose the right color
Color blindness, printing, meaning…

Level of interactivity?
Set deadlines
Make to-do lists
Use a project management tool (Asana)
Plan & monitor your time
Mistake 7
Not having a plan
Have a plan and focus on it
Do not forget the big picture by getting lost in technical tools
What is the question you want to answer?
R package caret:
Evaluate models, choose among them, estimate performance (regression & classification)

Statistical tests:
Goodness of fit, R², Hosmer-Lemeshow test (MKmisc), Wald test, k-fold cross-validation

Retrain "often"
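A minimal sketch of k-fold cross-validation in base R, to show what "estimate performance" means in practice. The built-in mtcars dataset and the formula mpg ~ wt + hp are stand-ins chosen for the demo, not part of the talk; caret automates the same loop.

```r
set.seed(42)                      # reproducible fold assignment
k     <- 5
data  <- mtcars                   # built-in dataset, used only as a stand-in
folds <- sample(rep(1:k, length.out = nrow(data)))

rmse <- numeric(k)
for (i in 1:k) {
  train <- data[folds != i, ]     # fit on k-1 folds
  test  <- data[folds == i, ]     # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

cv_rmse <- mean(rmse)             # average out-of-sample error
print(cv_rmse)
```

The error is always measured on rows the model did not see during fitting, which is the point of the slide: evaluation matters as much as training.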
Observe > Clean > Understand > Train > Predict
Mistake 6
Focus on training
Question: How much snow will fall on Montréal during the next 5 years?
Data: Snowfall and temperature of the last 80 years
?	
Mistake 5
Keep it complex
Do not jump straight to the fashionable, complicated method
Keep your method as simple as possible (focus on the question)

Know the limits of this method

Compare the methods (caret, ROC)

Boosted decision tree coupled to a neural network
Linear regression
Complexity comes at a price (speed, error-prone, expertise, amount of data)

Can you afford it?
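One way to compare a simple and a more complex method on equal footing is to score both on the same cross-validation folds. The sketch below uses base R only; the mtcars dataset, the two formulas and the helper name cv_rmse are illustrative assumptions, with a degree-6 polynomial standing in for "the complicated method".

```r
set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Hypothetical helper: mean out-of-fold RMSE for a given model formula.
cv_rmse <- function(formula) {
  errs <- sapply(1:k, function(i) {
    fit  <- lm(formula, data = mtcars[folds != i, ])
    test <- mtcars[folds == i, ]
    sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  })
  mean(errs)
}

simple_rmse  <- cv_rmse(mpg ~ wt)            # straight line
complex_rmse <- cv_rmse(mpg ~ poly(wt, 6))   # degree-6 polynomial

print(c(simple = simple_rmse, complex = complex_rmse))
```

If the complex model does not clearly beat the simple one out of sample, its extra cost is hard to justify.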
R standard function: p.adjust (Bonferroni, Benjamini-Hochberg)
You are a miserable shooter: your probability of a hit is 1%
You fire 10,000 laser shots and hit on the 10,001st
Does that make you a shooting genius?
Mistake 4
Do not correct for multiple tests
Multiplication of sensors and data-gathering protocols → era of Big Data
The more data you analyze, the more weird cases will pop up regularly
Are they significant?
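A short sketch of the effect, assuming 1,000 tests in which nothing real is going on (p-values drawn uniformly at random), plus the shooter arithmetic from the slide. The number of tests, the seed and the 0.05 threshold are choices made for the demo.

```r
set.seed(123)
n_tests <- 1000

# Under the null hypothesis, p-values are uniform on [0, 1].
p_raw <- runif(n_tests)

# "Significant" results at the 5% level, before and after correction.
n_raw  <- sum(p_raw < 0.05)
n_bonf <- sum(p.adjust(p_raw, method = "bonferroni") < 0.05)
n_bh   <- sum(p.adjust(p_raw, method = "BH") < 0.05)
print(c(raw = n_raw, bonferroni = n_bonf, BH = n_bh))

# The shooter: chance of at least one hit in 10,001 shots at 1% each.
p_at_least_one <- 1 - (1 - 0.01)^10001
print(p_at_least_one)
```

Without correction, roughly 5% of the tests look significant by chance alone; a hit after that many shots is expected, not a sign of genius.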
Document your work
R Markdown, Shiny

  Create HTML, PDF, Word, slides, web pages, CV, journal, book
  Automatically include & update the results of your analysis

  More interactive? Dashboards, interactive maps…
http://rmarkdown.rstudio.com/gallery.html
Mistake 3
Do not communicate or document
for others as well as for yourself
More details in coming lectures
R Markdown
Cafés, meetups, colleagues, board game or jogging club

Publish online (blog)

Ask for an external view of your work
Mistake 2
Stay alone
Do not stay alone, do not work alone
Data Science Association: code of conduct
http://www.datascienceassn.org/code-of-conduct.html
Mistake 1
Ethics is a useless luxury
	
  What are you doing? For whom?
  What is the impact of your work?
    - Company, society, yourself
    - Short term, long term
  What type of data are you analyzing?
    - Law & regulation
    - Privacy
  Do you have any conflict of interest?
Tendency to focus on the technique, on the challenge
"Yes, but" answers?
CAST
Xavier Prudent      XAVIER PRUDENT
Organizer           MICHAEL ALBO
The Audience        ALL OF YOU

Technical Support   OVH
Design-Photography  CHRISTINE NAULLEAU

Special thanks to George Lucas and to the audience for their attention
Questions? Comments?
Feel free to contact me:
Xavier Prudent, prudentxavier@gmail.com

10 Mistakes to avoid in data science