Data Engineering and the Data Science Lifecycle

Confidential and Proprietary to Daugherty Business Solutions
May 1, 2019

Data Science Divided
Data Science Solution
Data Science Model
Data Engineering

Data Scientists are not Data Engineers
https://www.oreilly.com/ideas/why-a-data-scientist-is-not-a-data-engineer

NoSQL
What is a data pipeline?
[Figure: a simple pipeline compared with a more complicated one; the data nodes in the diagram are labeled CSV and Avro.]
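As a rough illustration of the "simple" case, a pipeline can be sketched as three small stages that extract rows from CSV, transform them, and load the result. The column names (`customer`, `amount`), the sample data, and the function names below are assumptions made up for this sketch; they do not come from the slides.

```python
import csv
import io


def extract(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Drop rows with a missing amount and convert amounts to floats."""
    cleaned = []
    for row in rows:
        if row["amount"].strip() == "":
            continue  # skip incomplete records
        cleaned.append({"customer": row["customer"], "amount": float(row["amount"])})
    return cleaned


def load(rows):
    """Aggregate amounts per customer (a stand-in for writing to a data store)."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# Hypothetical sample input, invented for illustration.
SAMPLE = "customer,amount\nalice,10.5\nbob,\nalice,4.5\nbob,3\n"

totals = load(transform(extract(SAMPLE)))
print(totals)
```

A more complicated pipeline would typically chain many such stages across several sources and formats, which is where the reliability concerns start to matter.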

Creating Reliable Pipelines
It’s not enough to do it once.
Reproducible
Performant
Robust
Flexible
Monitored
Governed
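Two of these properties, robust and monitored, can be sketched with a small wrapper that retries a failing step and logs every attempt. This is only one possible approach under simple assumptions; `run_with_retries` and `flaky_fetch` are hypothetical names invented here, and a real pipeline would more likely rely on its orchestration tool for retries and alerting.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(step, max_attempts=3, delay_seconds=0.0):
    """Run a pipeline step, retrying on failure and logging each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            log.info("step %s succeeded on attempt %d", step.__name__, attempt)
            return result
        except Exception as exc:
            log.warning("step %s failed on attempt %d: %s", step.__name__, attempt, exc)
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            time.sleep(delay_seconds)


# Hypothetical flaky source: fails twice, then returns data.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return [1, 2, 3]


rows = run_with_retries(flaky_fetch)
```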

Architecting Distributed Systems

• Containers simplify the process of deployment, making it reliable and repeatable
• Streaming – because yesterday’s data might be too old.

Shaping Data Sources

• Storage Mechanisms
• Serialization Framework
• Compression Mechanisms
Architecting Data Storage
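The three choices listed above can be illustrated together in a few lines. The sketch below assumes JSON Lines as the serialization format, gzip as the compression mechanism, and an in-memory byte string as a stand-in for the storage layer; these are illustrative picks from the Python standard library, not recommendations from the slides, and the record fields are invented.

```python
import gzip
import json


def serialize(records):
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def write_records(records):
    """Serialize, then gzip-compress the bytes before they go to storage."""
    return gzip.compress(serialize(records))


def read_records(blob):
    """Reverse the steps: decompress, then parse each line back into a dict."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical records, invented for illustration.
records = [{"id": i, "status": "ok"} for i in range(1000)]

raw_size = len(serialize(records))
blob = write_records(records)
print(raw_size, len(blob))
```

Swapping in a different format or codec changes only `serialize` and the compress/decompress calls, which is one reason to keep these decisions separate.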

Data Science Lifecycle:
Collaborating with Data Scientists

Exercise: Initial problem statement
We are looking to create a system that generates a stream of events and processes those events.
We will create a machine learning algorithm to make predictions based on these events.
We will monitor the effectiveness of these predictions.
Finally, we will detect model drift and retrain our machine learning algorithm to adjust to the new data.
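The problem statement can be made concrete with a toy version of the whole loop. Everything in this sketch is an assumption chosen for brevity: the events are a deterministic synthetic sequence, the "model" is a single learned threshold, monitoring is per-batch accuracy, and drift is simulated by changing the true threshold from 3.0 to 7.0 partway through the stream. It is not the architecture used in the exercise.

```python
from itertools import chain, islice


def event_stream(n, threshold):
    """Yield n synthetic (value, label) events; label is 1 when value > threshold."""
    for i in range(n):
        value = ((i * 37) % 100) / 10.0  # cycles through 0.0, 0.1, ..., 9.9
        yield value, int(value > threshold)


def fit_threshold(events):
    """Fit a one-parameter model: midpoint between the classes' closest values."""
    positives = [v for v, y in events if y == 1]
    negatives = [v for v, y in events if y == 0]
    return (max(negatives) + min(positives)) / 2


def batches(stream, size):
    """Group a stream into consecutive lists of at most `size` events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def monitor_and_retrain(stream, model, batch_size=100, min_accuracy=0.8):
    """Score each batch; retrain on the batch when accuracy drops too low."""
    retrain_count = 0
    accuracies = []
    for batch in batches(stream, batch_size):
        correct = sum(int(v > model) == y for v, y in batch)
        accuracy = correct / len(batch)
        accuracies.append(accuracy)
        if accuracy < min_accuracy:  # treat a sharp drop as model drift
            model = fit_threshold(batch)
            retrain_count += 1
    return model, retrain_count, accuracies


initial_model = fit_threshold(list(event_stream(100, threshold=3.0)))
# Reality changes after 300 events: the true threshold moves from 3.0 to 7.0.
stream = chain(event_stream(300, threshold=3.0), event_stream(300, threshold=7.0))
final_model, retrain_count, accuracies = monitor_and_retrain(stream, initial_model)
print(final_model, retrain_count, accuracies)
```

In this toy run the accuracy drops in the first batch after the change, which triggers a single retraining; a production system would need labels that arrive with some delay and a less naive drift test.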

Data Acquisition
Sources: Internal Static Data, API/Interactive Exchange, Streaming Data, External data vendor
Robust
Reliable
Governed
Performant

Data Preparation
Every block of stone has a statue inside it, and it is the task of the sculptor to discover it.

Exercise Architecture

Collaborating with Data Scientists
Hypothesis and Modeling
• Data scientists use their understanding of the data to make a guess at what the underlying phenomenon is.
• They create a model that offers insight into the inner workings of the phenomenon.
Evaluation and Interpretation
• Data scientists train their models using training data. Some models can be verified using testing data.
• They interpret the results of the model against reality. Then they can determine whether it is appropriate for use.
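The train, test, and interpret steps can be sketched with a small least-squares line fit. The data set, the 15/5 split, and the acceptance tolerance of 1.0 are all hypothetical values picked for this illustration; real projects would usually use a modeling library and a more careful validation scheme.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(model, points):
    """Average absolute gap between predictions and observed values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


# Hypothetical data: y = 2x + 1 with a small alternating disturbance.
data = [(float(i), 2.0 * i + 1.0 + (0.5 if i % 2 == 0 else -0.5)) for i in range(20)]

train, test = data[:15], data[15:]  # hold out the last five points for testing
model = fit_line(train)
test_error = mean_absolute_error(model, test)

# Interpretation step: accept the model only if held-out error is tolerable.
acceptable = test_error <= 1.0
print(model, test_error, acceptable)
```

The key point is that the error is measured on data the model never saw during training, so it gives a fairer picture of how the model would behave in use.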

Deployment

Exercise: Reality changes

Operations and Monitoring

Optimization
Retrain
Remodel

Retraining

Conclusions
Data scientists are not data engineers.
A data scientist should be supported by two to five data engineers.
Data engineers are able to create reliable, repeatable, governed data pipelines.
