Everyone wants to be a data scientist. Data modeling is the hottest thing since Tickle Me Elmo. But data scientists don’t work alone. They rely on data engineers to help with data acquisition and data shaping before their models can be developed. They rely on data engineers to deploy their models into production. Once a model is in production, the data engineer’s job isn’t done. The model must be monitored to make sure that it retains its predictive power. And when the model slips, the data engineer and the data scientist need to work together to correct it through retraining or remodeling.
Data Engineering and the Data Science Lifecycle
1. Confidential and Proprietary to Daugherty Business Solutions
May 1, 2019
Data Science Divided
[Diagram: a data science solution is composed of the data science model plus data engineering]
Data Scientists are not Data Engineers
https://www.oreilly.com/ideas/why-a-data-scientist-is-not-a-data-engineer
What is a data pipeline?
[Diagram: a simple pipeline moving CSV files, contrasted with a more complicated pipeline that adds NoSQL storage and the Avro binary format]
Creating Reliable Pipelines
It’s not enough to do it once.
Reproducible
Performant
Robust
Flexible
Monitored
Governed
Architecting Distributed Systems
• Containers simplify the process of deployment, making it reliable and repeatable
• Streaming – because yesterday’s data might be too old
Architecting Data Storage
• Storage Mechanisms
• Serialization Framework
• Compression Mechanisms
Data Science Lifecycle: Collaborating with Data Scientists
Exercise: Initial problem statement
We are looking to create a system that generates a stream of events and processes those events. We will create a machine learning algorithm to make predictions based on these events. We will monitor the effectiveness of these predictions. Finally, we will detect model drift and retrain our machine learning algorithm to adjust to the new conditions.
Data Acquisition
• Internal static data
• API/interactive exchange
• Streaming data
• External data vendor
Qualities: Robust, Reliable, Governed, Performant
Data Preparation
“Every block of stone has a statue inside it, and it is the task of the sculptor to discover it.” – Michelangelo
Collaborating with Data Scientists

Hypothesis and Modeling
• Data scientists use their understanding of the data to make a guess at what the underlying phenomenon is.
• They create a model that offers insight into the inner workings of the phenomenon.

Evaluation and Interpretation
• Data scientists train their models using training data. Some models can be verified using testing data.
• They interpret the results of the model against reality. Then they can determine if it is appropriate for use.
Conclusions
• Data scientists are not data engineers.
• A data scientist should be supported by two to five data engineers.
• Data engineers are able to create reliable, repeatable, governed data pipelines.
Data science solutions are more than just modeling. To successfully deliver a data science solution, you need to be able to get the data to the model in the right form in order to train it. After the model is trained, you need to integrate it into your data science pipeline using good data management and software management processes. In other words, you need data engineering to make it work.
Most data scientists are not skilled in software development and data management practices. Their skill set skews toward advanced statistics and machine learning algorithms. These skills are necessary to create a data science solution, but on their own they aren’t sufficient.
While there is some overlap (data scientists who can do data engineering, and data engineers who can do data science), the overlap isn’t particularly deep. A moderately complicated data pipeline may be beyond the skill set of even those crossover data scientists.
An example of a simple pipeline would be processing text files stored in HDFS/S3 with Spark. An example of a moderately complicated data pipeline would start optimizing your storage with a correctly used NoSQL database and a binary format like Avro. More complicated pipelines could include streaming data processing. The additional complexity can turn your data science project into data project science.
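A minimal sketch of the "simple" end of that spectrum, using plain Python in place of Spark (the file contents and column names here are hypothetical; in the real pipeline the text would live in HDFS/S3 and be read by a Spark job):

```python
import csv
import io

# Hypothetical raw CSV extract, inlined for illustration.
raw = """id,amount,region
1,10.50,east
2,,west
3,7.25,east
"""

def load_rows(text):
    """Parse CSV text into dicts, converting amount to float where present."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["amount"] = float(row["amount"]) if row["amount"] else None
        rows.append(row)
    return rows

def east_total(rows):
    """One 'transform' stage: total the known amounts for the east region."""
    return sum(r["amount"] for r in rows
               if r["region"] == "east" and r["amount"] is not None)

rows = load_rows(raw)
print(east_total(rows))  # 17.75
```

Even this toy version shows the shape of a pipeline: ingest, parse/convert, transform, emit. The complexity the notes describe comes from doing the same thing at scale, across formats, reliably.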
Data engineers build data science pipelines that are:
Reproducible – across environments using templated solutions to solve common problems
Performant – getting the data into the right place at the right time
Robust – handles peaks and valleys in data volume
Flexible – can handle different formats without erroring
Monitored – communicates error conditions effectively
Governed – uses good data governance practices especially around data lifecycle
It’s not enough to do it once
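Two of those qualities, robust and monitored, can be sketched in a few lines. This is an illustrative stand-in, not a production framework; `flaky_extract` and the retry counts are hypothetical:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_retries(step, *, attempts=3, delay=0.0):
    """Run one pipeline step, retrying transient failures (robust) and
    logging every error condition (monitored)."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("step failed (attempt %d/%d): %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)

# Hypothetical flaky extract step: fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient source error")
    return "extracted"

print(run_with_retries(flaky_extract))  # extracted
```

In a real pipeline the logging would feed an alerting system, and the governance and reproducibility qualities would come from templated, version-controlled job definitions rather than code like this.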
Data engineers need to be able to understand how to build distributed systems. If they are using Hadoop or other Big Data technologies, they need to understand how the different ecosystem components can be merged together in order to create a data science solution. If they are using Cloud solutions, they need to be able to understand how the different cloud components can be put together in order to assemble a solution. It is especially important that they are able to understand the cost implications for different solution architectures.
In some cases the solution for a distributed architecture may rely on technologies like Docker and Kubernetes in order to simplify deployment and make it reliable and repeatable. In other cases, the data engineer may have to handle streaming data from IoT devices using technologies like Kafka and NiFi.
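The core loop of a streaming consumer can be illustrated with an in-process queue standing in for a Kafka topic (the sensor readings and alert threshold below are hypothetical; real code would poll a Kafka client instead):

```python
import queue

# Stand-in for a Kafka topic: an in-process queue of IoT sensor readings.
topic = queue.Queue()
for reading in [21.5, 22.0, 98.6, 21.8, None]:  # None marks end-of-stream
    topic.put(reading)

def consume(topic, threshold=90.0):
    """Consume events until the end-of-stream marker, flagging anomalies.
    With a real broker, this loop would poll a Kafka consumer instead."""
    alerts = []
    while True:
        event = topic.get()
        if event is None:
            break
        if event > threshold:
            alerts.append(event)
    return alerts

print(consume(topic))  # [98.6]
```

The essential point survives the simplification: streaming code processes events as they arrive, one at a time, rather than waiting for a complete batch.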
Data engineers need to shape the data in order to transform it from data into information. In some cases this will happen programmatically using languages like Java, Python, Scala, or R. The data may be residing in SQL databases or in different forms of NoSQL databases. The kinds of data shaping activities that a data engineer might engage in are:
Profiling
Filtering
Sorting
Projection
Type conversion
Data imputation
Feature Abstraction
Segmentation
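Several of these shaping activities can be shown in one short pass over some hypothetical records (the field names and imputation rule are illustrative, not prescribed by the notes):

```python
# Hypothetical raw records, as a data engineer might receive them.
records = [
    {"id": "3", "score": "82", "name": "carol"},
    {"id": "1", "score": "",   "name": "alice"},
    {"id": "2", "score": "91", "name": "bob"},
]

# Type conversion: strings to numbers, with None for missing values.
typed = [{"id": int(r["id"]),
          "score": int(r["score"]) if r["score"] else None,
          "name": r["name"]} for r in records]

# Data imputation: fill missing scores with the mean of the known scores.
known = [r["score"] for r in typed if r["score"] is not None]
mean = sum(known) / len(known)
imputed = [{**r, "score": r["score"] if r["score"] is not None else mean}
           for r in typed]

# Filtering and sorting: keep passing scores, order by id.
passing = sorted((r for r in imputed if r["score"] >= 60),
                 key=lambda r: r["id"])

# Projection: keep only the columns the model needs.
features = [{"id": r["id"], "score": r["score"]} for r in passing]
print(features)
```

Each comment maps to one of the activities in the list above; in practice the same operations run over millions of rows in SQL, Spark, or a NoSQL query layer rather than a Python list.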
Architecting data storage means that we need to understand different storage mechanisms, different serialization frameworks, and different compression mechanisms.
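The trade-offs can be felt even with stdlib tools. This sketch compares one dataset under two serialization frameworks (JSON and pickle) and one compression mechanism (gzip); the event records are hypothetical, and real pipelines would more likely use Avro or Parquet:

```python
import gzip
import json
import pickle

# Hypothetical batch of event records.
events = [{"sensor": i % 4, "reading": 20.0 + i * 0.1} for i in range(1000)]

as_json = json.dumps(events).encode("utf-8")   # text serialization
as_pickle = pickle.dumps(events)               # binary serialization
as_json_gz = gzip.compress(as_json)            # compressed text

print(len(as_json), len(as_pickle), len(as_json_gz))
```

For repetitive records like these, compression shrinks the text format dramatically; binary formats like Avro go further by storing the schema once instead of repeating field names in every record.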
Data engineers collaborate with data scientists in acquiring data and preparing it for use in data science models. Once the model is complete, the data engineers can make sure that it is ready for production workloads and ready for deployment. After the model is in production, data engineers need to monitor its effectiveness. When the model’s performance starts to degrade, the data engineers collaborate with the data scientist to retrain or remodel it in order to restore its effectiveness. Understanding the kinds of inputs and outputs that come from that process enables the data engineer to assist in the development and deployment of the data science model.
Acquire external data using a repeatable process, wrapping external data with data governance processes.
Acquire internal static data using a repeatable process, wrapping internal data with data governance processes.
Acquire streaming data using a repeatable process. Store the data in such a way that data scientists can use it.
Governance concerns include:
Staleness
Contractual details
Approvals
Compliance
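One way to make those governance concerns concrete is to wrap each acquired dataset in a metadata record. This is a hypothetical sketch; the field names, policy values, and `DatasetGovernance` class are illustrative, not part of any described system:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetGovernance:
    """Hypothetical governance record for an acquired dataset, covering
    staleness, contractual details, approvals, and compliance."""
    source: str
    acquired_on: date
    max_age_days: int                 # staleness policy
    contract_id: str                  # contractual details
    approved_by: str                  # approvals
    compliance_tags: list = field(default_factory=list)

    def is_stale(self, today):
        """True when the data has outlived its staleness policy."""
        return (today - self.acquired_on).days > self.max_age_days

record = DatasetGovernance(
    source="external-vendor-feed",
    acquired_on=date(2019, 4, 1),
    max_age_days=30,
    contract_id="C-1234",
    approved_by="data-governance-board",
    compliance_tags=["pii-free"],
)
print(record.is_stale(date(2019, 5, 15)))  # True: 44 days old
```

Attaching a record like this at acquisition time is what makes the "wrapping with data governance processes" above repeatable rather than ad hoc.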
Preparation of data for the model is an area where data engineers need to collaborate with data scientists in order to make sure the data is fit for modeling. Activities that may happen are:
Scaling
Feature Abstraction
Data Cleaning
Data Imputation
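Scaling, the first activity above, is simple enough to show in full. This is a standard min-max scaler sketched in plain Python (the sample values are hypothetical):

```python
def min_max_scale(values):
    """Min-max scaling to [0, 1], a common preparation step before modeling."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, 20.0, 15.0, 40.0]
print(min_max_scale(raw))  # [0.0, 0.333..., 0.166..., 1.0]
```

Data scientists typically choose which preparation steps a model needs; the data engineer's job is to apply them consistently to both training data and production data, so the model never sees inputs shaped differently from what it was trained on.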
Core components:
Observed data: X, Y, Result
Messaging platform: Kafka production and consumption
Database
Machine learning model
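For the exercise, the observed data can be simulated with a small event generator. The hidden rule below (`x + y > 1.0`) is a hypothetical stand-in for the real phenomenon the model would have to learn; in the full system these events would flow through Kafka into the database:

```python
import random

def generate_events(n, seed=42):
    """Hypothetical event generator: each event has two features (x, y)
    and a result derived from a simple hidden rule."""
    rng = random.Random(seed)          # seeded for reproducibility
    events = []
    for _ in range(n):
        x, y = rng.uniform(0, 1), rng.uniform(0, 1)
        result = 1 if x + y > 1.0 else 0   # the phenomenon to be learned
        events.append({"x": x, "y": y, "result": result})
    return events

events = generate_events(5)
print(events[0])
```

Seeding the generator is a small example of the reproducibility principle from earlier: the same pipeline run should produce the same data.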
The data scientist generally takes the lead when it comes to the creation and curation of the data science model.
The output from the model creation step may not be ready for production. The model may not be ready to scale or able to yield the desired performance. Data engineers need to work with the data scientists to convert the model into something that is production ready. Finally, the data engineer can integrate it into the data pipeline.
In our exercise, we’re changing the inputs into the pipeline. In reality, this may be changing customer tastes or an environmental shift that makes our model less useful.
In this example, you can see that the performance of the model has slipped. Looking at accuracy and recall alone, it isn’t immediately apparent that the performance has changed significantly. Precision, however, really tells the story. As a data engineer, you need to understand the outputs of the model in order to make sure that you are able to monitor its effectiveness.
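The metrics themselves are simple to compute from confusion-matrix counts. The before/after counts below are hypothetical, chosen to illustrate the pattern described above: after drift, accuracy and recall barely move while precision drops sharply:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts before and after the input distribution shifts.
before = metrics(tp=90, fp=10, fn=10, tn=890)
after = metrics(tp=90, fp=60, fn=10, tn=840)

print(before)  # accuracy 0.98, precision 0.90, recall 0.90
print(after)   # accuracy 0.93, precision 0.60, recall 0.90
```

Recall is unchanged and accuracy dips only five points, but precision collapses from 0.90 to 0.60: the model is now raising many more false alarms. This is why the monitoring has to track the metric that matters for the business, not just overall accuracy.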
If the model’s general parameters just need a bit of adjustment, you may be able to get away with just retraining the model. If something has seriously changed in the underlying environment, you may have to go back to the beginning and identify the features that now govern the desired behavior.
With some retraining, our model is back on track.
In conclusion, data scientists are not data engineers. Their skill set may overlap with a data engineer’s, but their focus should be on preparing, creating, evaluating, and explaining models that produce business value. Data engineers complement data scientists. We recommend that a data scientist be supported by two to five data engineers so that they can spend their time on the work that brings the most value. Data engineers create the data pipelines that are needed to realize that business value.