‘ANATOMY OF A DATA
SCIENCE PROJECT’
ADAM SROKA, SENIOR DATA SCIENTIST
WELCOME TO THE DECEMBER SCOTLAND DATA SCIENCE & TECHNOLOGY MEETUP
RICARDO ANTUNES, DATA SCIENTIST
INCREMENTAL GROUP
MBN ACADEMY ARE THE OFFICIAL PARTNERS OF THE
DATA LAB MSC. PLACEMENT PROGRAMME 2018/19.
IF YOU ARE PART OF THE SCOTTISH BUSINESS
COMMUNITY AND INTERESTED IN BECOMING A
POTENTIAL HOST ORGANISATION, PLEASE EMAIL ROB
AT ACADEMY@MBNSOLUTIONS.COM
Anatomy of a Data Science
Project
Adam Sroka, Senior Data Scientist
adam.sroka@incrementalgroup.co.uk
@adzsroka
Why we exist
Digital technology
is changing
everything
Sustainable
success comes
from incremental
improvements
Our mission is to
enable
government and
industry to digitally
transform, one
step at a time
THE DATA SCIENCE
HIERARCHY OF NEEDS
COMPLEXITY
FROM SIMPLICITY
What makes a good project?
A few points to consider before you start
Ask yourself…
1. Will this be easy to deploy & use?
2. Will it be considerably better than existing solutions?
3. Can parts be automated, reproduced, and reused?
4. Is it easy to understand, explain, and test?
5. Is it technically interesting?
Data
Making your raw materials go further
Quality
• This is always the first step
• Build a reusable set of tools for measuring quality
• DataExplorer is a great start
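One way to read "a reusable set of tools for measuring quality" is a small profiling function that you can point at any new dataset. The sketch below is an assumption about what such a tool might look like, not something from the talk: it uses only the Python standard library, and the names `profile_rows`, `profile_csv`, `MISSING`, and the sample data are all hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed markers for "missing"; adjust to match your own data sources.
MISSING = {"", "na", "n/a", "nan", "null", "none"}


def profile_rows(rows):
    """Return per-column quality metrics for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        values = [(row.get(col) or "").strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING]
        numeric = 0
        for v in present:
            try:
                float(v)
                numeric += 1
            except ValueError:
                pass
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "numeric_share": numeric / len(present) if present else 0.0,
            "most_common": Counter(present).most_common(1)[0][0] if present else None,
        }
    return report


def profile_csv(text):
    """Profile CSV text that starts with a header row."""
    return profile_rows(list(csv.DictReader(io.StringIO(text))))


# Hypothetical sample with one blank, one "NA", and one non-numeric age.
SAMPLE = """id,age,city
1,34,Glasgow
2,,Edinburgh
3,forty,Glasgow
4,29,NA
"""

if __name__ == "__main__":
    for column, metrics in profile_csv(SAMPLE).items():
        print(column, metrics)
```

Because the function takes plain rows rather than a specific file, the same check can be reused across projects and run again whenever the data is refreshed.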
Build pipelines
Models
Mastering the tools of the trade
“Never use a long word
where a short one will do.”
George Orwell
Complexity
• It’s easy to get excited about the new big thing
• It sometimes seems like expressing intelligence has taken priority over delivering value
• Marginal gains at the cost of understanding and interoperability aren’t gains at all
What works?
• What are other people doing?
• Are there similar problems to yours on Kaggle?
• Do you have any biases?
• Take a quick first pass with everything and review
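A "quick first pass with everything" could be as simple as scoring several cheap candidates on the same holdout and ranking them before investing in anything complex. The following is a minimal sketch under that assumption; the three baselines, the `first_pass` helper, and the example series are illustrative choices, not the speaker's method.

```python
from statistics import mean, median


def mean_baseline(train):
    """Always predict the training mean."""
    m = mean(train)
    return lambda history: m


def median_baseline(train):
    """Always predict the training median."""
    m = median(train)
    return lambda history: m


def last_value_baseline(train):
    """Predict whatever value was seen most recently."""
    return lambda history: history[-1]


def mean_absolute_error(model, train, test):
    """Walk forward through the holdout, predicting one step at a time."""
    history = list(train)
    errors = []
    for actual in test:
        errors.append(abs(model(history) - actual))
        history.append(actual)
    return mean(errors)


def first_pass(series, holdout=3):
    """Score every candidate on the same holdout and rank by error."""
    train, test = series[:-holdout], series[-holdout:]
    candidates = {
        "mean": mean_baseline,
        "median": median_baseline,
        "last_value": last_value_baseline,
    }
    scores = {
        name: mean_absolute_error(build(train), train, test)
        for name, build in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for name, error in first_pass([10, 12, 11, 13, 12, 14, 15, 16]):
        print(f"{name}: {error:.2f}")
```

The ranking is the thing to review: if a trivial baseline is already competitive, that is useful evidence before reaching for a more complex model.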
Tools
A few things to make your life easier
Templates
• Figure out a template that works for you – then stick to it
• It makes moving between and sharing projects tolerable
• https://drivendata.github.io/cookiecutter-data-science/
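If you want to see what "sticking to a template" can mean in practice, here is a rough standard-library sketch that stamps out the same folder layout for every project. The folder names in `SKELETON` and the `create_project` helper are hypothetical and only illustrate the idea; they are not the layout defined by the linked Cookiecutter Data Science template.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names; swap in whatever template you settle on.
SKELETON = (
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "models",
    "reports",
)


def create_project(root, name, folders=SKELETON):
    """Create the same folder layout for every new project and return its path."""
    project = Path(root) / name
    for folder in folders:
        (project / folder).mkdir(parents=True, exist_ok=True)
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n", encoding="utf-8")
    return project


if __name__ == "__main__":
    # Build a throwaway project in a temporary directory and list its contents.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(tmp, "churn-model")
        for path in sorted(p.relative_to(project).as_posix() for p in project.rglob("*")):
            print(path)
```

Running the helper twice on the same project is harmless, since existing folders and an existing README are left alone.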
Build Tools
• For longer lasting projects, strongly consider build automation
• This will manage rebuilding what’s needed when you make a change
• Tools like Azure Pipelines, AWS CodeBuild, Luigi, or even Makefiles
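The core idea behind "rebuilding what's needed when you make a change" can be sketched in a few lines: rebuild a target only if it is missing or older than one of its inputs, which is the timestamp rule Make applies. The code below is an illustration of that rule using the Python standard library, not a replacement for the tools listed; `is_stale`, `build`, and the `concatenate` step are hypothetical names.

```python
import os
import tempfile


def is_stale(target, sources):
    """True if the target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def build(target, sources, action):
    """Run the action only when the target is stale; report whether it ran."""
    if is_stale(target, sources):
        action(target, sources)
        return True
    return False


def concatenate(target, sources):
    """Hypothetical build step: join the source files into the target."""
    with open(target, "w", encoding="utf-8") as out:
        for src in sources:
            with open(src, encoding="utf-8") as handle:
                out.write(handle.read())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.txt")
        clean = os.path.join(tmp, "clean.txt")
        with open(raw, "w", encoding="utf-8") as handle:
            handle.write("some data\n")
        print("first run rebuilt:", build(clean, [raw], concatenate))
        print("second run rebuilt:", build(clean, [raw], concatenate))
```

Dedicated build tools add dependency graphs, parallelism, and remote execution on top of this check, which is why they are worth adopting once a project outlives a single notebook.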
Containers
• Package your entire workspace into easily manageable containers
• Makes reproducibility and sharing simple
• Many cloud platforms allow automatic distribution of containers to clusters
Resources
reddit.com/user/adzsroka/m/datascience
datatau.com
kagglenoobs.herokuapp.com
dataelixir.com
getrevue.co/profile/datamachina/
Thanks!
Adam Sroka, Senior Data Scientist
adam.sroka@incrementalgroup.co.uk
@adzsroka


Editor's Notes

  1. Ease of adoption – awful interfaces
  2. Is this an improvement https://medium.com/@EvanSinar/7-data-visualization-types-you-should-be-using-more-and-how-to-start-4015b5d4adf2
  3. https://aeon.co/essays/is-technology-making-the-world-indecipherable
  4. Reusability https://www.freeimageslive.co.uk/free_stock_image/building-concept-jpg
  5. https://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/ https://digital.nhs.uk/ https://registry.opendata.aws/ https://data.europa.eu/euodp/en/data/ https://data.gov.uk/ https://www.data.gov/
  6. https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html
  7. https://icons8.com/icon/2297/gis