What are common mistakes
in Data Science projects?
(and how to avoid them?)
Artur Suchwałko, Ph.D., QuantUp
AI & Big Data 2018, March 10, 2018, Lviv, Ukraine
Real-world Data Science projects
• Kaggle competitions and real Data Science projects are two quite
different disciplines
• Once a data frame is prepared, the rest is easy
• What is done incorrectly and can be corrected?
• Analysis of a business problem
• Data
• Process
• Methods, models
• Hardware, software
• People
(Everything based on practical experience: 20 years, 100 projects, 3,000
hours of workshops.
For the majority of topics I could add quotes from talks.)
Analysis of a business problem
No. We don’t want to build a model of
production and storage in our factory
Problem:
• We’d just like to optimize cutting a log (a trunk of a dead tree) into
planks
• Let’s do it in the simplest way. Why should we waste time and
money?
• Others can do it. Why are you making it complicated?!?
Solution:
• Build the production and storage model
• Otherwise you will optimize log cutting for a different sawmill
• or for something completely different
Solving the wrong analytical problem
Problem:
• Stating the wrong problem and solving it can decrease the predictive
ability of a model
• Similarly, so can removing so-called false predictors (leaks from the future)
• But pure predictive power is never the goal. Usually the business
wants actionability and real value
Solution:
• Focus on what influences your business
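One possible way to start looking for false predictors is to check how well each feature, taken alone, separates the target. A feature that is nearly perfect on its own is a prompt to ask when its value actually becomes known. The sketch below is only an illustration: the column names, the toy rows, and the 0.95 threshold are made up, and the check only makes sense for low-cardinality categorical features (an ID column would be flagged trivially).

```python
# Hypothetical illustration: flag features that alone predict the target
# almost perfectly, which often signals a leak from the future.

def single_feature_accuracy(values, targets):
    """In-sample accuracy of the rule 'predict the majority target per feature value'."""
    counts = {}
    for value, target in zip(values, targets):
        per_value = counts.setdefault(value, {})
        per_value[target] = per_value.get(target, 0) + 1
    correct = sum(max(per_value.values()) for per_value in counts.values())
    return correct / len(targets)

def suspicious_features(rows, target_name, threshold=0.95):
    """Return names of features whose single-feature accuracy reaches the threshold."""
    targets = [row[target_name] for row in rows]
    names = [name for name in rows[0] if name != target_name]
    flagged = []
    for name in names:
        accuracy = single_feature_accuracy([row[name] for row in rows], targets)
        if accuracy >= threshold:
            flagged.append(name)
    return flagged

# Toy data (made up): 'collection_call_made' is only known after a default happens.
rows = [
    {"age_band": "young", "collection_call_made": 1, "default": 1},
    {"age_band": "young", "collection_call_made": 0, "default": 0},
    {"age_band": "old", "collection_call_made": 1, "default": 1},
    {"age_band": "old", "collection_call_made": 0, "default": 0},
    {"age_band": "young", "collection_call_made": 0, "default": 0},
    {"age_band": "old", "collection_call_made": 0, "default": 0},
]
flagged = suspicious_features(rows, "default")
print(flagged)  # ['collection_call_made']
```

A flagged feature is not proof of a leak. It is a question for whoever knows the process that generates the data.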
Data
Preparation of a development sample is not
very important
Problem:
• Let’s take a sample and model!
• Preparation of the development sample decides whether the model will fit
the reality we model or not
• The data, and thus the sample, are generated (or influenced) by a
process that must be well known and understood
Solution:
• Think it over really carefully.
We have Big Data. We need to implement
Big Data solutions
Problem:
• If you can email your data or fit it on a USB drive, you don’t
have Big Data!
• Many Data Science tasks on millions of records can be completed
on (powerful) laptops
• Decisions are either data-driven or not. It’s not about the volume of data but
about the way decisions are made
Solution:
• Be (more than) sure that we need Big Data technologies for storage
and processing
• During the PoC / prototype stage, don’t use Big Data tools
• Important: Not valid for some problems
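A rough back-of-the-envelope check can help decide whether a dataset needs Big Data tooling at all. The sketch below assumes a purely numeric table stored at 8 bytes per value and a headroom factor of 3 for working copies. Both numbers are assumptions for illustration, not measurements.

```python
# Hypothetical sizing helper: does a numeric table plausibly fit in laptop RAM?

def estimated_size_gb(n_rows, n_numeric_columns, bytes_per_value=8):
    """Rough in-memory size of a purely numeric table, in gigabytes (10**9 bytes)."""
    return n_rows * n_numeric_columns * bytes_per_value / 1e9

def fits_in_memory(n_rows, n_numeric_columns, ram_gb, headroom=3.0):
    """True if the table fits while leaving `headroom` times its size for working copies."""
    return estimated_size_gb(n_rows, n_numeric_columns) * headroom <= ram_gb

size = estimated_size_gb(10_000_000, 50)
print(size)                                        # 4.0
print(fits_in_memory(10_000_000, 50, ram_gb=16))   # True
```

Under these assumptions, ten million records with fifty numeric columns take about 4 GB, which is within reach of a 16 GB laptop.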
Use social media data
Problem:
• It’s a tremendous effort if you don’t use an off-the-shelf solution
• Usually the business value is small
Solution:
• Be sure that the effort will be rewarded
Process
Let’s build a model in one week
Problem:
• It’s possible (in theory)
• If you don’t analyze the process thoughtfully and don’t detect false
predictors then the model will not work in production
• We will be really happy to see how well it performs on our
development sample
Solution:
• Take enough time
• Be sure that the process is correct
There is too little time to complete the task /
model
Problem:
• Data problems
• Getting stuck in preprocessing
• The implementation takes too long
• Too little experience
Solution:
• Prepare a full product as soon as possible, e.g.:
• an end-to-end version with every component in place, e.g. a scoring application with a
simple / dummy model
• complete code for building the model, but using simpler methods
• improve it in the next iterations
• Use CRISP-DM or a checklist as a memory aid
• Usually you can start implementation from the first product version
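The "full product with a dummy model" idea can be sketched as follows. The class and function names are hypothetical. The point is only that the scoring interface exists from day one and the model behind it can be swapped later.

```python
# Hypothetical end-to-end skeleton: a scoring application with a dummy model.
from collections import Counter

class MajorityClassModel:
    """Dummy model: always predicts the most frequent class seen in training."""

    def fit(self, records, labels):
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, records):
        return [self.prediction_ for _ in records]

def score_batch(model, records):
    """The scoring application: same interface whatever model sits behind it."""
    predictions = model.predict(records)
    return [{"id": record["id"], "score": prediction}
            for record, prediction in zip(records, predictions)]

# Toy data (made up).
train_records = [{"id": 1}, {"id": 2}, {"id": 3}]
train_labels = ["no", "no", "yes"]

model = MajorityClassModel().fit(train_records, train_labels)
scored = score_batch(model, [{"id": 10}, {"id": 11}])
print(scored)  # [{'id': 10, 'score': 'no'}, {'id': 11, 'score': 'no'}]
```

Any later model that offers the same `fit` / `predict` pair can replace the dummy without touching `score_batch`, so implementation work can start from the first version.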
The way you prepare the result (a model, a data
product) doesn’t matter
Problem:
• I want a model. It must work. I don’t care how you’ll build it. Just
build it!
• The process is crucial
• If it is wrong then the analysis is not fully reproducible
• We take on technical debt
• and sooner or later we will be forced to pay it back
Solution:
• Build models in a fully reproducible way
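One small ingredient of reproducibility is that the same configuration always gives the same result. The sketch below is a minimal illustration with hypothetical names: the "experiment" is a stand-in for real model building, the seed lives in the config, and a hash of the result makes two runs easy to compare.

```python
# Hypothetical reproducibility check: same config in, same fingerprint out.
import hashlib
import json
import random

def run_experiment(config):
    """Stand-in for model building; all randomness comes from the seed in the config."""
    rng = random.Random(config["seed"])  # local generator, no hidden global state
    data = [rng.gauss(0, 1) for _ in range(config["n_samples"])]
    return {"config": config, "mean": sum(data) / len(data)}

def fingerprint(result):
    """Stable hash of a result, so two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

config = {"seed": 42, "n_samples": 1000}
first = fingerprint(run_experiment(config))
second = fingerprint(run_experiment(config))
print(first == second)  # True
```

In a real project the same discipline would also cover data versions, code versions, and library versions. This sketch covers only the random seed.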
Implementation – I’m sure it’ll work out
somehow
Problem:
• Implementations without planned tests usually fail
• What is really painful is that it takes time to realize they have failed (a
model works and generates risk)
Solution:
• Plan both the implementation and the tests
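Planned tests for a deployed model can start very small. The sketch below uses a made-up scoring function (the formula and field names are purely illustrative) and checks three things: scores stay in a valid range, they move in the expected direction, and missing inputs fail loudly instead of silently.

```python
# Hypothetical scoring function with a few planned checks.
import math

def score(record):
    """Illustrative risk score in (0, 1) from a debt-to-income ratio."""
    income = record.get("income")
    debt = record.get("debt")
    if income is None or debt is None or income <= 0:
        raise ValueError("income and debt are required; income must be positive")
    return 1 / (1 + math.exp(-(debt / income - 1)))

def check_scoring():
    # Scores must be valid probabilities.
    assert 0 <= score({"income": 1000, "debt": 0}) <= 1
    # More debt at the same income must give a higher risk score.
    assert score({"income": 1000, "debt": 500}) < score({"income": 1000, "debt": 2000})
    # Missing inputs must fail loudly rather than produce a silent score.
    try:
        score({"income": 1000})
    except ValueError:
        pass
    else:
        raise AssertionError("a record without debt should be rejected")
    return True

print(check_scoring())  # True
```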
Methods & models
AI. We desperately need AI!
Problem:
• We don’t need it
• Predictive modeling is not AI!
• It happens that full control over a model is more important than
predictive power
Solution:
• Let’s think what we’d like to achieve and how to do this
• Data-driven decision making is more important
A model just learns everything it is exposed
to
Problem:
• You need to promise self-learning to sell a service or software
• But it will not learn automatically unless it is fed suitable data
• In many situations you don’t have the data needed to design a feedback loop
Solution:
• Analyze the process that generates the data for the development sample
• Put aside an untouched sample
• The model will be trained on a sample and refined on an ongoing
basis
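One way to keep a sample truly untouched is to decide membership from a hash of the record ID, so the same records stay out of development no matter how often the data is reloaded or reshuffled. This is a sketch under assumptions: records have stable IDs, and the 20% fraction is arbitrary.

```python
# Hypothetical stable hold-out assignment based on a hash of the record ID.
import hashlib

def is_holdout(record_id, holdout_fraction=0.2):
    """Deterministically assign a record to the untouched sample."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < holdout_fraction

ids = list(range(10_000))
holdout = [i for i in ids if is_holdout(i)]
development = [i for i in ids if not is_holdout(i)]

print(len(holdout) + len(development) == len(ids))  # True
```

Because the assignment depends only on the ID, a record never drifts from the untouched sample into the development sample between refits.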
Start modeling with Deep Learning!
Problem:
• But everybody uses it…
• No!!!
• Many problems are too simple for DL
• In particular, problems where the data comes as a data frame (tabular data)
Solution:
• Random Forest, xgboost
If we have 3000 classes then let’s build a
BIG classifier
Problem:
• For example, when we’d like to recommend bank products
• A random classifier for such a problem has an error of 2999/3000 ≈ 99.97% (not 50%)
• Usually the dataset is too small
Solution:
• It’s good to use a simpler method (usually)
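The arithmetic behind the 3000-class example can be written out as below. It assumes uniformly random guessing over balanced classes. The 30,000-example dataset is a made-up figure to show how few examples per class remain.

```python
# Baseline arithmetic for a many-class problem (uniform guessing, balanced classes assumed).

def random_guess_error(n_classes):
    """Expected error rate of guessing a class uniformly at random."""
    return 1 - 1 / n_classes

def examples_per_class(n_examples, n_classes):
    """Average number of training examples available per class."""
    return n_examples / n_classes

print(round(random_guess_error(3000) * 100, 2))  # 99.97
print(random_guess_error(2))                      # 0.5
print(examples_per_class(30_000, 3000))           # 10.0
```

With only about ten examples per class in this made-up case, a big classifier has very little to learn from, which is why a simpler method is usually the better start.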
Hardware & software
You can do calculations using a laptop
Problem:
• Sometimes yes, you can
• But usually you cannot
• Usually it doesn’t make any sense – a human’s time is more expensive
than a machine’s time
Solution:
• It is good to invest some money in hardware
• or use AWS from Amazon (or something similar)
Commercial software is excellent
Problem:
• Users often say it is excellent until they buy it
• The problems appear later
Solution:
• Test it under conditions similar to those in which it will be used
• Think seriously about using open source
Free software is excellent (and it’s free!)
Problem:
• It’s free – in terms of purchase cost
• It’s not simply excellent – the cost is the necessity of having qualified people
on board and of developing software
• Inconvenient problems do happen
Solution:
• Use it as it should be used
• i.e. write clear and clean code, use additional tools, e.g. a VCS
• Make sure the team has the skills it needs
People
All companies have Data Science teams.
Let’s build one for us!
Problem:
• It’s possible to build a team. It will take a lot of time and lots of
money.
• If the results are wasted, the people will leave
• They need to have fun working on projects
• If I need a plank then do I really need to buy a sawmill?
Solution:
• Be sure that:
• we know how to use their results
• it will give value to the business
• PoC can be outsourced. The first data science project can be
outsourced.
A student or a newcomer is enough to deliver
profit from deep analytics to the business
Problem:
• If someone can cut with a scalpel, will we call him a surgeon?
• Why is someone who can (technically) build a model from a data
frame called a Data Scientist?
• Data Scientist is a profession – experience matters!
• People without experience usually don’t deliver any business value to
a company. Even after spending a year working with data (!)
Solution:
• Hire experienced people, especially at the beginning of a DS journey
• let them teach the newcomers
• But what if you don’t have experienced people?
• Invest time, effort, and money in your team. Let a more business-oriented
analyst lead the team
The team will learn everything from online
courses
Problem:
• I’ll give each of you $20 (OK, even $50), and you’ll learn everything online
• It’s true. The team will learn some things
• But not the most important ones
• A good hands-on training cannot be substituted
Solution:
• Learning by doing (and applying)
• Control and stimulate learning
• Buy knowledge
Summary
• To avoid mistakes it is good to ask ourselves these questions (and
answer them), e.g.:
• What business problem are we solving?
• What business value can we get from the results?
• What could be lost in translation from business into analytics?
• Do we have adequate and representative data?
• What process generates the data? What influences it?
• What is the model-building process?
• What analytical tools should be used? Could we apply simpler
approaches?
• How do we control all the risk?
• It is good to do it repeatedly
• It’s best to involve someone experienced
• It’s beneficial to educate the receivers of the results
Contact
• During the conference!
• After the conference: artur [at] quantup [dot] eu

Artur Suchwalko “What are common mistakes in Data Science projects and how to avoid them?"

  • 1. What are common mistakes in Data Science projects? (and how to avoid them?) Artur Suchwałko, Ph.D., QuantUp AI & Big Data 2018, March 10, 2018, Lviv, Ukraine
  • 3. Real-world Data Science projects • Kaggle competitions and real Data Science projects are two quite different disciplines • When a data frame is prepared then it’s easy • What is done not correctly and can be corrected? • Analysis of a business problem • Data • Process • Methods, models • Hardware, sofware • People (Everything based on practical experience: 20 years, 100 projects, 3,000 hours of workshops. For the majority of topics I could add quotes from talks.)
  • 4. Analysis of a business problem
  • 5. No. We don’t want to build a model of production and storage in our factory Problem: • We’d like just to optimize cutting a log (a trunk of a dead tree) into planks • Let’s do it in the simplest way. Why should we waste time and money? • The others can do it. Why do you make it complicated?!? Solution: • To build the production and storage model • Otherwise you will optimize log cutting in a different sawmill • or something completely different
  • 6. Solution of a wrong analytical problem Problem: • Stating of a wrong problem and solving it can decrease predictive ability of a model • Similarly, removing so called false predictors (leaks from future) • But we never want to have pure predictive power. Usually business wants actionability and real value Solution: • Focus on what influences your busines
  • 8. Preparation of a development sample is not very important Problem: • Let’s take a sample and model! • Preparation of the development sample decides if the model will fit the reality we model or not • The data and thus the sample is generated (or influenced) by a process that must be well known and understoo Solution: • Think it over really carefully.
  • 9. We have Big Data. We need to implement Big Data solutions Problem: • If you can email your data or fit it in a pendrive it means you don’t have Big Data! • Many Data Science tasks for millions of records can be completed using (powerful) laptops • Decisions are data-driven or not. It’s not about data magnitude but about way the decisions are taken Solution: • Be (more than) sure that we need Big Data technologies for storing and processing • During PoC / prototype stage don’t use Big Data tools • Important: Not valid for some problems
  • 10. Use social media data Problem: • It’s a tremendous effort if you don’t use an off-the-shelf solution • Usually business value is not big Solution: • Be sure that the effort will be rewarded
  • 12. Let’s build a model in one week Problem: • It’s possible (in theory) • If you don’t analyze the process thoughtfully and don’t detect false predictors then the model will not work in production • We will be really happy to see how well it performs on our development sample Solution: • Take enough time • Be sure that the process is correct
  • 13. There is too short time to complete the task / model Problem: • Data problems • Stucked in preprocessing • The implementation takes too long • Too short experience Solution: • Prepare a full product as soon as possible, e.g.: • cutting out all the functionalities, e.g. a scoring application with a simple / dummy model • a full code for building the model but using simpler methods • improve it in the next iterations • Using CRISP-DM / checklist to support your memory • Usually you can start implementation from the first product version
  • 14. The way you prepare the result (a model, a data product) doesn’t matter Problem: • I want a model. It must work. I don’t care how you’ll build it. Just build it! • The process is crucial • If it is wrong then the analysis is not fully reproducible • We take on technical debt • and sooner or later we will be forced to pay it back Solution: • Build models in a fully reproducible way
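What "fully reproducible" means in practice depends on the project, but a minimal sketch, assuming that the inputs to a run are the data, the parameters and a random seed, is to route all randomness through the seed and store a manifest next to the result. `build_model` below is a hypothetical stand-in for a real training run, and the manifest fields are illustrative.

```python
import hashlib
import json
import random


def data_fingerprint(rows):
    """SHA-256 of the rows in a canonical JSON form."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_model(rows, params, seed):
    """Stand-in for a training run: returns a result and a manifest."""
    rng = random.Random(seed)  # all randomness flows from the seed
    sample = rng.sample(rows, params["sample_size"])
    result = sum(r["y"] for r in sample) / len(sample)
    manifest = {
        "seed": seed,
        "params": params,
        "data_sha256": data_fingerprint(rows),
    }
    return result, manifest


# Made-up data; two runs with the same inputs must agree exactly.
rows = [{"x": i, "y": i % 2} for i in range(20)]
params = {"sample_size": 10}

run1 = build_model(rows, params, seed=123)
run2 = build_model(rows, params, seed=123)
print(run1 == run2)
```

Keeping the code itself under version control, and recording library versions, would be the natural next fields to add to such a manifest.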
  • 15. Implementation – I’m sure it’ll work out somehow Problem: • Implementations without planned tests usually fail • What is really painful is that it takes time to realize that they failed (the model runs and generates risk) Solution: • Plan both implementation and tests
  • 17. AI. We desperately need AI! Problem: • We don’t need it • Predictive modeling is not AI! • Sometimes full control over a model is more important than predictive power Solution: • Let’s think about what we’d like to achieve and how to do it • Data-driven decision making is more important
  • 18. A model just learns everything it is exposed to Problem: • You need to promise self-learning to sell a service / software • But it will not learn automatically if it is not fed suitable data • In many situations you don’t have such data to design a feedback loop Solution: • Analyze the process that generates the data for the development sample • Set aside a “not touched” sample • The model will be trained on a sample and refined on an ongoing basis
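Setting aside a "not touched" sample can be sketched as below, assuming the records are independent so that a seeded random split is appropriate (for time-dependent data a split by date would usually be the safer choice). The function name, the 20% fraction and the seed are illustrative assumptions.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=42):
    """Return (development, holdout); the holdout is never used for training."""
    shuffled = list(records)                 # copy, leave the input untouched
    random.Random(seed).shuffle(shuffled)    # seeded, so the split is repeatable
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


records = list(range(100))  # stand-in for real record identifiers
development, holdout = split_holdout(records)
print(len(development), len(holdout))
```

The point of fixing the seed is that the same records stay in the holdout across reruns, so the sample really remains untouched while the model is being refined.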
  • 19. Start modeling with Deep Learning! Problem: • But everybody uses it… • No!!! • Many problems are too simple for DL • In particular, problems where the data fit in a data frame Solution: • Random Forest, xgboost
  • 20. If we have 3000 classes then let’s build a BIG classifier Problem: • For example when we’d like to recommend bank products • A random classifier for such a problem has an error of 2999/3000 ≈ 99.97% (not 50%) • Usually the dataset is too small Solution: • It’s good to use a simpler method (usually)
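The 2999/3000 figure is the expected error of guessing uniformly at random among K classes, which is 1 - 1/K. A short check of the arithmetic (the function name is just for illustration):

```python
from fractions import Fraction


def random_guess_error(n_classes):
    """Expected error of guessing uniformly at random among n_classes."""
    return 1 - Fraction(1, n_classes)


for k in (2, 3000):
    err = random_guess_error(k)
    print(f"{k} classes: error = {err} = {float(err):.2%}")
# 2 classes: error = 1/2 = 50.00%
# 3000 classes: error = 2999/3000 = 99.97%
```

So a model with, say, 95% error on 3000 classes is already far better than chance, which is why the 50% intuition from binary problems does not carry over.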
  • 22. You can do calculations using a laptop Problem: • Sometimes yes, you can • But usually you cannot • Usually it doesn’t make any sense – human time is more expensive than machine time Solution: • It is good to invest some money in hardware • or use AWS from Amazon (or something similar)
  • 23. Commercial software is excellent Problem: • Users often say that it is excellent until they have bought it • The problems appear later Solution: • Test it in conditions similar to those in which it will be used • Think seriously about using open source
  • 24. Free software is excellent (and it’s free!) Problem: • It’s free – in terms of the purchase cost • It’s not just excellent – the cost is the necessity to have qualified people on board and to develop software • Inconvenient problems do happen Solution: • Use it as it should be used • i.e. write clear and clean code, use additional tools, e.g. VCS • Make sure the team has the skills needed
  • 26. All companies have Data Science teams. Let’s build one for us! Problem: • It’s possible to build a team. It will take a lot of time and lots of money. • If the results are wasted then the people will leave • They need to have fun working on projects • If I need a plank, do I really need to buy a sawmill? Solution: • Be sure that: • we know how to use their results • it will give value to the business • PoC can be outsourced. The first data science project can be outsourced.
  • 27. A student or a freshman is enough to deliver profits from deep analytics to the business Problem: • If someone can cut with a scalpel, will we call him a surgeon? • Why is someone who can (technically) build a model from a data frame called a Data Scientist? • Data Scientist is a profession – experience matters! • People without experience usually don’t give any business value to a company. Even after spending a year working with data (!) Solution: • Hire experienced people, especially at the beginning of a DS journey • let them teach the freshmen • But what if you don’t have experienced people? • Invest time, effort, and money in your team. Let a more business-oriented analyst control the team
  • 28. The team will learn everything from online courses Problem: • I’ll give each of you $20 (ok, even $50) and you learn everything online • It’s true. The team will learn some things • But not the most important ones • Good hands-on training cannot be substituted Solution: • Learning by doing (and applying) • Control and stimulate learning • Buy knowledge
  • 30. Summary • To avoid mistakes it is good to ask ourselves these questions (and answer them), e.g.: • What business problem are we solving? • What business value can we get from the results? • What could be lost in translation from business into analytics? • Do we have adequate and representative data? • What process generates them? What are they influenced by? • What is the model-building process? • What analytical tools should be used? Could we apply simpler approaches? • How do we control all the risks? • It is good to do it repeatedly • It’s best to involve someone experienced • It’s beneficial to educate the receivers of the results
  • 32. Contact • During the conference! • After the conference: artur [at] quantup [dot] eu