CONFIDENTIAL. Copyright © 1
Dagster - DataOps and MLOps for Machine Learning Engineers
8+ years swimming in data @
A Researcher, Engineer and Blogger
Agenda
01
02
03
04
05
06
Motivation
Dagster's philosophy
Dagster 101
Dagster DataOps
Dagster MLOps
Q&A
Motivation
Typical Machine Learning pipeline
Data Preparation → Model training → Serving model
Why do we need orchestration?
1. Directed Acyclic Graphs (DAGs)
2. Scheduling and Workflow Management
3. Error Handling and Retry Mechanisms
4. Monitoring and Logging
Source: link
Orchestration frameworks
Difficulties in answering important questions
• Is this data up-to-date?
• When upstream data is updated, which downstream data is affected?
• How can we manage data versions over time?
• How does the model's performance change over time?
(Chart: effort split across ModelOps, DataOps, and DevOps: 90% / 10% / 10%)
Dagster's philosophy
Dagster's philosophy: Assets
Reports
Tables
ML Models
Ideas: transition from Imperative to Declarative
• Front end: say goodbye to spaghetti code and complex DOM manipulation with ReactJS.
• DevOps: infrastructure as code (IaC) with Terraform.
• Cluster orchestration: managing containerized applications at scale has never been easier with K8s.
• Data: more accurate and efficient analytics by declaring data assets instead of imperative jobs/ops.
Dagster 101
• An open-source library used to build ETL and Machine Learning systems (first released in 2018).
• 100+ contributors, 10K commits, 5K stars.
• Used by many innovative organizations.
From Job/Op
def upstream_asset1():
    return 1

def upstream_asset2():
    return 2

def combine_asset(upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    print(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine

result = combine_asset(upstream_asset1(), upstream_asset2())
To assets
from dagster import asset

@asset
def upstream_asset1():
    return 1

@asset
def upstream_asset2():
    return 2

@asset
def combine_asset(context, upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    context.log.info(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine
Asset key
dagster dev -f <file_name.py>
from dagster import asset

@asset
def upstream_asset1():
    return 1

@asset
def upstream_asset2():
    return 2

@asset
def combine_asset(context, upstream_asset1, upstream_asset2):
    combine = upstream_asset1 + upstream_asset2
    context.log.info(f"{upstream_asset1} + {upstream_asset2} = {combine}")
    return combine
Upstream asset key
Dagster DataOps
Modularity:
• Designed with a modular architecture → easily organize complex data pipelines.
• Provides a clear separation between data processing logic, data management, and infrastructure management.
Flexibility:
• Supports a wide range of data sources, including databases, application programming interfaces (APIs), and file systems.
• Provides integrations with popular data processing frameworks (Apache Airflow, Apache Spark) → easy integration into existing data pipelines.
Debugging and testing:
• Provides tools to debug and test data pipelines → easily identify and fix errors.
• A powerful UI allows data pipeline visualization and progress tracking.
Supportive community:
• Dagster has a community of active users and contributors who continuously add new features and improve the framework.
Visualization and debugging
Dagster comes with Dagit, a graphical user interface that allows ML engineers to visualize pipelines, monitor execution progress, and debug issues using detailed logs and error messages.
Detailed logs and error messages
1st: Organize complex data pipeline
• Where does the data come from?
• How is the data computed?
• Is this data up-to-date?
• When upstream data is updated, which downstream data is affected?
2nd: Easy integration into existing tech stacks

Just install:

pip install dagster dagit

And materialize your assets:

from dagster import materialize

if __name__ == "__main__":
    result = materialize(assets=[my_first_asset])

Extensibility and integration: Dagster has a rich ecosystem of libraries and plugins that support various tools and platforms related to machine learning, data processing, and infrastructure. This extensibility allows ML engineers to integrate Dagster with existing tools and systems.
3rd: Asset change detection
If the latest version of combine_asset was created before the latest version of upstream_asset1 or upstream_asset2, then combine_asset may be stale. Dagster flags the difference with the "upstream changed" indicator.
4th: IOManager: reduce data I/O complexity
Write once, use everywhere!
CSVIOManager - handle_output() & load_input()
Dagster MLOps
Benefits of building machine learning pipelines in Dagster
• Dagster makes iterating on machine learning models and testing them easy; it is designed to be used during the development process.
• Dagster has a lightweight execution model, meaning you get the benefits of an orchestrator, like re-executing from the middle of a pipeline and parallelizing steps, while you're experimenting.
• Dagster models data assets, not just tasks, so it understands upstream and downstream data dependencies.
• Dagster is a one-stop shop for both the data transformations and the models that depend on them.
Typical Machine Learning pipeline
Data Preparation → Model training → Serving model
Organize complex data pipeline (Modeling Pipeline)
Pipeline abstraction: Dagster enables ML engineers to define complex workflows as modular pipelines composed of individual units called assets. This modularity aids in code readability, maintainability, and reusability.
Organize complex data pipeline (Data preparation)
Organize complex data pipeline (Model training)
5th: Debug and test data pipelines

my_assets.py:

from dagster import asset

@asset
def my_first_asset(context):
    context.log.info("This is my first asset")
    return 1

test_my_assets.py:

from dagster import materialize, build_op_context

def test_my_first_asset():
    result = materialize(assets=[my_first_asset])
    assert result.success
    context = build_op_context()
    assert my_first_asset(context) == 1
Testing and development: Dagster supports local development and testing by enabling execution of individual assets or entire pipelines independent of the production environment, fostering faster iteration and experimentation.
Tracking model history
Viewing previous versions of a machine learning model can be useful for understanding the evaluation history or referencing a model that was used for inference. Using Dagster will enable you to understand:
• What data was used to train the model
• When the model was refreshed
• Which code version and ML model version were used to generate the predictions
Monitoring potential model drift and data drift over time
Monitoring and observability: Dagster makes it easier to monitor and track model performance metrics with built-in logging and error handling, enabling ML engineers to detect issues and ensure the reliability of their machine learning workflows.
Dagster’s architecture
Scalability and portability: With Dagster, ML engineers can define pipelines that scale across different execution environments, such as cloud-based infrastructure, containerization platforms like Docker, and orchestration tools like Kubernetes.
6th: Transitioning Data Pipelines from Development to Production
Configuration management: With Dagster, ML engineers can manage configurations more efficiently and consistently across various environments, simplifying pipeline and model parameterization.
Dagster features to take away
1. Organize complex data pipelines
2. Easy integration into existing tech stacks
3. Asset change detection
4. IOManager: reduce data I/O complexity
5. Debug and test data pipelines
6. Transitioning data pipelines from development to production
Dagster Pros & Cons
Pros:
• Data Pipeline Orchestration
• Modularity and Reusability
• Data Quality and Validation checks
• Monitoring and Observability
• Community Support
Cons:
• Learning Curve
• Not appropriate for stream processing
Q&A
References
Introducing Software-Defined Assets
Dagster vs. Airflow
Building machine learning pipelines with Dagster
Managing machine learning models with Dagster
Open Source deployment architecture
