CONFIDENTIAL. Copyright © 1
Dagster - DataOps and MLOps for Machine Learning Engineers
8+ years swimming in data @
A Researcher, Engineer and Blogger
Agenda
01
02
03
04
05
06
Motivation
Dagster's philosophy
Dagster 101
Dagster DataOps
Dagster MLOps
Q&A
Motivation
Typical Machine Learning pipeline
Data Preparation → Model training → Serving model
Why do we need orchestration?
1. Directed Acyclic Graphs (DAGs)
2. Scheduling and Workflow Management
3. Error Handling and Retry Mechanisms
4. Monitoring and Logging
Source: link
Orchestration frameworks
Difficulties in answering important questions
• Is this data up-to-date?
• When upstream data is updated, which downstream data is affected?
• How can we manage data versions over time?
• How does the model's performance change over time?
(Chart: effort split across ModelOps, DataOps, and DevOps: 90% / 10% / 10%)
Dagster's philosophy
Dagster's philosophy: Assets
Reports
Tables
ML Models
Ideas: transition from Imperative to Declarative
• Front end: say goodbye to spaghetti code and complex DOM manipulation with ReactJS.
• DevOps: infrastructure as code (IaC) with Terraform.
• Cluster orchestration: managing containerized applications at scale has never been easier with K8s.
• Data: more accurate and efficient analytics by declaring data assets instead of imperative jobs/ops.
Dagster 101
• An open-source library used to build ETL and Machine Learning systems (first released in 2018).
• 100+ contributors, 10K commits, 5K stars.
• Used by many innovative organizations.
From Job/Op
def upstream_asset1():
    return 1

def upstream_asset2():
    return 2

def combine_asset(upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    print(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine

result = combine_asset(upstream_asset1(), upstream_asset2())
To assets
from dagster import asset

@asset
def upstream_asset1():
    return 1

@asset
def upstream_asset2():
    return 2

@asset
def combine_asset(context, upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    context.log.info(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine
Asset key
dagster dev -f <file_name.py>
from dagster import asset

@asset
def upstream_asset1():
    return 1

@asset
def upstream_asset2():
    return 2

@asset
def combine_asset(context, upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    context.log.info(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine
Upstream asset key
Dagster DataOps
Modularity:
• Designed with a modular architecture → easily organize complex data pipelines.
• Provides a clear separation between data processing logic, data management, and infrastructure management.
Flexibility:
• Supports a wide range of data sources, including databases, application programming interfaces (APIs), and file systems.
• Provides integrations with popular data processing frameworks (Apache Airflow, Apache Spark) → easy integration into existing data pipelines.
Debugging and testing:
• Provides tools to debug and test data pipelines → easily identify and fix errors.
• A powerful UI allows data pipeline visualization and progress tracking.
Supportive community:
• Dagster has a community of active users and contributors who continuously add new features and improve the framework.
Visualization and debugging
Dagster comes with Dagit, a graphical user interface that allows ML engineers to visualize pipelines, monitor execution progress, and debug issues using detailed logs and error messages.
Detailed logs and error messages
1st: Organize complex data pipeline
• Where does the data come from?
• How is the data computed?
• Is this data up-to-date?
• When upstream data is updated, which downstream data is affected?
2nd: Easy integration into existing tech stacks

Just install:

pip install dagster dagit

And materialize your assets:

from dagster import materialize

if __name__ == "__main__":
    result = materialize(assets=[my_first_asset])

Extensibility and integration: Dagster has a rich ecosystem of libraries and plugins that support various tools and platforms related to machine learning, data processing, and infrastructure. This extensibility allows ML engineers to integrate Dagster with existing tools and systems.
3rd: Asset change detection
If the latest version of combine_asset was created before the latest version of upstream_asset1 or upstream_asset2, then combine_asset may be stale. Dagster flags the difference with the "upstream changed" indicator.
4th: IOManager: reduce data I/O complexity
Write once, use everywhere!
CSVIOManager - handle_output() & load_input()
Dagster MLOps
Benefits of building machine learning pipelines in Dagster
• Dagster makes iterating on machine learning models and testing them easy; it is designed to be used during the development process.
• Dagster has a lightweight execution model, meaning you get the benefits of an orchestrator, like re-executing from the middle of a pipeline and parallelizing steps, while you're experimenting.
• Dagster models data assets, not just tasks, so it understands upstream and downstream data dependencies.
• Dagster is a one-stop shop for both the data transformations and the models that depend on them.
Typical Machine Learning pipeline
Data Preparation → Model training → Serving model
Organize complex data pipeline (Modeling Pipeline)
Pipeline abstraction: Dagster enables ML engineers to define complex workflows as modular pipelines composed of individual units called assets. This modularity aids in code readability, maintainability, and reusability.
Organize complex data pipeline (Data preparation)
Organize complex data pipeline (Model training)
5th: Debug and test data pipelines

my_assets.py:

from dagster import asset

@asset
def my_first_asset(context):
    context.log.info("This is my first asset")
    return 1

test_my_assets.py:

from dagster import materialize, build_op_context

def test_my_first_asset():
    result = materialize(assets=[my_first_asset])
    assert result.success
    context = build_op_context()
    assert my_first_asset(context) == 1
Testing and development: Dagster supports local development and testing by enabling execution of individual assets or entire pipelines independent of the production environment, fostering faster iteration and experimentation.
Tracking model history
Viewing previous versions of a machine learning model can be useful for understanding the evaluation history or referencing a model that was used for inference. Using Dagster will enable you to understand:
• What data was used to train the model
• When the model was refreshed
• Which code version and ML model version were used to generate the predictions
Monitoring potential model drift and data drift over time
Monitoring and observability: Dagster makes it easier to monitor and track model performance metrics with built-in logging and error handling, enabling ML engineers to detect issues and ensure the reliability of their machine learning workflows.
Dagster’s architecture
Scalability and portability: With Dagster, ML engineers can define pipelines that scale across different execution environments, such as cloud-based infrastructure, containerization platforms like Docker, and orchestration tools like Kubernetes.
6th: Transitioning Data Pipelines from Development to Production
Configuration management: With Dagster, ML engineers can manage configurations more efficiently and consistently across various environments, simplifying pipeline and model parameterization.
Dagster features to take away
1. Organize complex data pipelines
2. Easy integration into existing tech stacks
3. Asset change detection
4. IOManager: reduce data I/O complexity
5. Debug and test data pipelines
6. Transitioning data pipelines from development to production
Dagster Pros & Cons
Pros:
• Data Pipeline Orchestration
• Modularity and Reusability
• Data Quality and Validation checks
• Monitoring and Observability
• Community Support
Cons:
• Learning Curve
• Not appropriate for stream processing
Q&A
References
Introducing Software-Defined Assets
Dagster vs. Airflow
Building machine learning pipelines with Dagster
Managing machine learning models with Dagster
Open Source deployment architecture
