Data versioning in machine learning projects. How is data science different from software engineering? Is there a methodology mismatch? How to use dvc.org to version data and experiments.
DVC: O'Reilly Artificial Intelligence Conference 2019 - New York - Dmitry Petrov
ML model and dataset versioning is an essential first step toward establishing a good process. The speaker explores open-source tools for versioning ML models and datasets, from traditional Git and Git-LFS to the ML-specific tool Data Version Control (DVC.org).
DVC - Git-like Data Version Control for Machine Learning projects - Francesco Casalegno
DVC is an open-source tool for versioning datasets, artifacts, and models in Machine Learning projects.
This powerful tool gives you an intuitive Git-like interface to seamlessly:
1. track dataset version updates
2. build reproducible and shareable machine learning pipelines (e.g. model training)
3. compare model performance scores
4. integrate your data and model versioning with Git
5. deploy the desired version of your trained models
Using a version control system (VCS) such as Git is an established software engineering practice, but it is challenging for machine learning (ML) projects. Artifacts produced by ML pipelines, such as datasets, pre-processed data, and trained models, are often large. Once generated, they have to be stored on disk, since reproducing them over and over is expensive. Unfortunately, traditional VCSs have restrictions on handling such large artifacts, and forgoing version control makes results hard to reproduce.
DVC (Data Version Control) not only version-controls large artifacts but also keeps track of the commands that are run to produce them. It detects changes made to the input data and knows which steps in the pipeline have to be rerun to keep the final result up to date. By adopting DVC, the machine learning community can take a big step toward reproducible research.
StarWest 2019 - End to end testing: Stupid or Legit? - mabl
Automating end-to-end tests is a tricky business. Over the years, many leading practitioners have advised against doing ANY end-to-end test automation. As a result, we often see a test automation triangle that recommends a 70/20/10% split between unit, integration, and end-to-end tests, to balance the cost of troubleshooting test failures against the length of feedback cycles at each level.
But it's 2019… many of those risks don't really exist anymore, and complete end-to-end tests are the only thing that brings together all components of your app while focusing on the true end-user functionality and experience. The test automation triangle is changing shape.
Learning outcomes:
- Key pain points of end-to-end test automation and the modern technologies that have eliminated those pains
- The unique benefits of end-to-end test automation
- Tips to bring home for getting started with end-to-end test automation
The quality of data-powered applications depends not only on code but also on the collected data and the models trained on that data. This renders traditional quality assurance inadequate. We will look into our toolbox for more holistic tactics that bridge the gap between code quality assurance and data quality assurance.
Start with version control and experiments management in machine learning - Mikhail Rozhkov
How do you manage the complexity and reproducibility of machine learning projects? What are the requirements and tools, and how do you apply them in your company and projects? Let's start with data and model version control! A review of Data Version Control (DVC), MLflow, and other tools.
Video and slides synchronized; mp3 and slide download available at http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms, and analyses, how they set up and keep schemas in sync between Hive, Presto, Redshift, and Spark, and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
The lines between Development and Operations people have gotten blurry, and many skills need to be held by both sides.
In this talk we'll cover the considerations that go into creating development and production environments, touching on Continuous Integration, Continuous Deployment, and the buzzword "DevOps", as well as some real implementations in the industry.
And of course, we can't leave out the real enabler of the whole deal, "The Cloud", which gives us a tool set that makes life much easier when implementing all of these practices.
Details:
• DevOps and Business Intelligence?
• CI/CD Pipelines: What are they?
• Database Deployments: State based vs Migration based
• Snowflake features for CI/CD
• Azure DevOps: Build and Release Pipelines
• Putting it all together: End to End solution
• Demo
Data Science in Production: Technologies That Drive Adoption of Data Science ... - Nir Yungster
Critical to a data science team’s ability to drive impact is its effectiveness in incorporating its solutions into new or existing products. When collaborating with other engineering teams, and especially when solutions must operate at scale, technological choices can be critical factors in determining what type of outcome you'll have. We walk through strategies and specific technologies - Airflow, Docker, Kubernetes - that can help promote successful collaboration between data science and engineering.
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl... - Athens Big Data
Title: MLOps Workshop: The Full ML Lifecycle - How to Use ML in Production
Speakers: Spyros Cavadias (https://www.linkedin.com/in/spyros-cavadias/), Konstantinos Pittas (https://www.linkedin.com/in/konstantinos-pittas-83310270/), Thanos Gkinakos (https://www.linkedin.com/in/thanos-gkinakos-03582a128/)
Date: Saturday, December 17, 2022
Event: https://www.meetup.com/athens-big-data/events/289927468/
Docs-as-Code: Evolving the API Documentation Experience - Pronovix
We are a software engineering team creating API docs. Docs are authored using Instructional Design principles to narrate use-cases and practical API implementations. This talk shares why & how we've applied software development practices to evolve our document tooling, creation, & delivery methods.
Our APIs describe asynchronous protocols used for embedded software (firmware) components in a digital 2-way radio communications system. The API is protocol data unit (PDU) based and its definition is described in a proprietary format; consequently, well-known API formats, such as Swagger/OpenAPI, or tools, such as doxygen, are not used.
Our product training and technical writing teams are very experienced in Instructional Design methods, but these teams have only written documentation for an end-user audience. Understanding software development processes is equally important as understanding two-way radio networks in order to successfully integrate with the APIs. This is the rationale for having a software engineering team develop the skillsets to write API documentation for a developer audience.
With a solid foundation of API documentation in place, regular examination of engineering efficiency and developer experience is appropriate. Repeated actions can be replaced by automation. Content can be modular and re-usable. Formats can be streamlined for easier consumption. Docs can be made portable and lightweight for faster delivery.
Delivery Pipelines as a First Class Citizen @deliverAgile2019 - ciberkleid
In this talk, we will cover important elements for successful CI and CD. We will discuss how these elements make CI and CD much simpler, and hence more attainable. We will cover some best practices / recommendations to include in your application pipelines. We will look at a sample implementation of a pipeline leveraging modern tools. Finally, we will discuss some forthcoming ideas for making it even easier to declaratively enable CI and CD for applications.
OSMC 2023 | What's new with Grafana Labs's Open Source Observability stack by... - NETWAYS
Open source is at the heart of what we do at Grafana Labs, and there is so much happening! The intent of this talk is to update everyone on the latest developments in Grafana, Pyroscope, Faro, Loki, Mimir, Tempo, and more. Everyone has at least heard of Grafana, but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Besides covering what is new, we will also quickly introduce each project during this talk.
Presentation given on the 15th July 2021 at the Airflow Summit 2021
Conference website: https://airflowsummit.org/sessions/2021/clearing-airflow-obstructions/
Recording: https://www.crowdcast.io/e/airflowsummit2021/40
DataOps requires a cultural shift that brings the principles of lean manufacturing and DevOps to data analytics. It breaks down silos between developers, data scientists, and operators, resulting in rapid cycle times and low error rates.
At Spotify in 2013, the concept of DataOps did not exist but the Swedish company needed a way to align the people, processes, and technologies of the data organization to accelerate the development of high-quality analytics. The result was a Swedish-style DataOps, influenced by Scandinavian culture and agile principles, that enabled the company to become a true data-driven leader.
Presented by Anisha Swain & Riya, Associate Software Engineer, Red Hat, as part of the PyCloud mini conference on 30th May.
This talk highlights the use of the Pbench tool to solve the hectic task of collecting benchmark data, and explains how to make the best use of resources while running applications at scale. It will benefit people who are looking for a benchmarking and performance analysis solution with better consistency.
Data science calls for rapid experimentation and building intuitions from the data. Yet data science also underpins crucial decisions and operational logic. Writing production-ready, robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Last Conference 2017: Big Data in a Production Environment: Lessons Learnt - Mark Grebler
Presentation at the 2017 LAST (Lean, Agile, Systems Thinking) Conference.
A presentation about the challenges involved in building a production Big Data system used directly by customers.
Scaling Ride-Hailing with Machine Learning on MLflow - Databricks
GOJEK, the Southeast Asian super-app, has seen explosive growth in both users and data over the past three years. Today the technology startup uses big-data-powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products: from selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production-grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long-running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped solve many of the problems we faced while scaling machine learning at GOJEK.
4. Overview
1. Data science workflow - an abstraction
2. What is important?
3. Some tools and best practices
a. Setting up a workspace
b. Managing environments
c. Structuring experimental work
d. Refining the pipeline
e. Annotating data
f. Approaches to deployment
g. Documentation best practices
5. Let’s imagine ...
You’re about to start a new project.
● What should you think about before starting?
● How do you organize your workflow?
● Where should your data and code go?
● How do you prioritize experimentation, documentation, clean code, and reproducibility, and still have reasonable timelines?
6. The workflow, abstracted
[Diagram: the basic pipeline (Data Access → Data Processing / Feature Creation → Modeling → Predictions & Reporting) and the process-refinement loop (Exploratory Analysis → Experiments → Production)]
7. What’s important?
Criteria for a good workflow
● Reproducible
● Easy
○ to add / remove features
○ to switch pipeline parts
○ to deploy
○ to come back to 6 months later
[Diagram: the basic pipeline and process-refinement loop, repeated from slide 6]
8. What trade-offs are there?
● Fast to develop / iterate
● Clean code
● Fast execution
● Well-documented
● Scalable
9. Setting up a workspace
● Standardized directory structure
● Promotes good development practice
○ Separate exploration, pipeline, and reporting
● Low overhead to get going
10. Setting up a workspace
cookiecutter-data-science
● http://drivendata.github.io/cookiecutter-data-science
● Based on the cookiecutter package
○ https://cookiecutter.readthedocs.io/en/latest/readme.html
● Standardized directory structure that works well out of the box
● README.md that documents structure
● Standard .gitignore file is set up
● Structure is set up to be pip install-able
● Set-up for Sphinx
● Set-up for tox (standardize testing)
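The resulting layout can also be sketched by hand; here is a minimal skeleton in the spirit of what cookiecutter-data-science generates (the project name and this subset of directories are illustrative, not the full template):

```shell
# Hand-made workspace skeleton (the real template generates a richer
# structure automatically, plus README, .gitignore, Sphinx, and tox config).
mkdir -p my_project/data/raw         # original, immutable data dumps
mkdir -p my_project/data/processed   # final datasets for modeling
mkdir -p my_project/notebooks        # exploratory notebooks
mkdir -p my_project/src              # pip-installable project code
mkdir -p my_project/reports          # generated analysis and figures
touch my_project/README.md           # documents the structure
```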
14. Managing Environments
Build from the environment up
● Package managers
○ virtualenv or conda -> requirements.txt
○ docker or vagrant -> dockerfile / vagrant file
● Keep secrets secret
○ .env and .conf files
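As a concrete sketch using virtualenv (file names follow the conventions above; the secret is obviously made up):

```shell
# Pin the environment so others can rebuild it exactly
python3 -m venv .venv                    # isolated interpreter per project
.venv/bin/pip freeze > requirements.txt  # record exact package versions

# Keep secrets out of version control
echo "DB_PASSWORD=changeme" > .env       # hypothetical secret
echo ".env" >> .gitignore                # never commit it
```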
15. Organizing experimental work
Notebooks are for exploration
● Number notebooks for ordering
● Add dates to the top of notebooks to help track when changes happen
● For collaboration, add author initials
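For example, a naming scheme along these lines (file names purely illustrative) keeps notebooks sorted in working order and attributable:

```shell
# <order>-<author initials>-<date>-<topic>.ipynb
touch 01-dk-2019-05-01-data-exploration.ipynb
touch 02-dk-2019-05-07-feature-engineering.ipynb
touch 03-mr-2019-05-12-baseline-model.ipynb
ls *.ipynb   # alphabetical order is also chronological order
```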
16. Exploration to Experiments to Production
● Natural refinement of the code
○ Notebooks
○ Functions
○ Classes
○ Packages
○ Python scripts
● Other considerations
○ Memory management
■ Data streaming
■ Training in batches
○ Moving between platforms
○ Managing metrics and reporting
[Diagram: the Exploratory Analysis → Experiments → Production refinement loop, repeated from slide 6]
19. Refining the pipeline
Version control and reproducibility
● What about git?
● How about git-LFS?
● What about DVC - Data Version Control?
20. DVC
Data science process is a DAG
● DVC keeps track of code, dependencies, and outputs, allowing any step to be reproduced if there are upstream changes
Separate code storage from data and models
● DVC integrates with Git
○ Git stores the code and the .dvc files (which store the graph)
○ DVC remotes store the data and the models
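Schematically, each node of the DAG is described by a small .dvc stage file that lives in Git, while the heavy inputs and outputs live in a DVC remote. A sketch of such a file (layout in the spirit of DVC 0.x stage files; all file names are hypothetical):

```yaml
# sample.dvc - one step of the pipeline DAG
cmd: python cmd.py input.data output.data metrics.json
deps:
- path: cmd.py        # rerun this step if the code changes...
- path: input.data    # ...or if the input data changes
outs:
- path: output.data   # cached by DVC, referenced from Git by hash
metrics:
- path: metrics.json  # small enough to compare across experiments
```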
22. Working with DVC
Initialize
dvc init
Configure
dvc remote
Add files to be tracked by DVC
dvc add
Store/retrieve data
dvc push / dvc pull
23. Working with DVC
Define steps in the DAG
dvc run -f sample.dvc -d cmd.py -d input.data \
    -M metrics.json -o output.data \
    python cmd.py input.data output.data metrics.json
-f: dvc file to store information in
-d: any dependencies
-M: metric files
-o: output files
27. Getting to production
So you have a model …
and the metrics look good …
Now what?
● Human review of results
● Figuring out how to use it in production
29. Approaches to deployment
One-time: run a model once; store results in a table
● Simple structure
● Predictions can become dated
● Requires manual updates
Batch: run new predictions regularly; store results in a table
● Loose integration with production
● Little engineering effort required
● Predictions are somewhat up-to-date
API: run predictions in real-time
● Realtime predictions
● More engineering and reliability testing required
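For the batch approach, the scheduling layer can be as simple as a cron entry; a sketch (script path, schedule, and output location are all hypothetical):

```shell
# Write a crontab fragment that refreshes predictions nightly at 02:00
echo '0 2 * * * /opt/project/predict.py --out /var/data/predictions.csv' > batch-predict.cron
# Activating it would be: crontab batch-predict.cron
```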
30. Documentation
Layers of documentation
● Code comments
● Daily notes / Working notes
● Checkpoints/Summaries
● Code books / how to run
● Project Summary
● Index
31. How to put this into practice?
● Checklists
● Routines
● Sticky-note method