Confidential and Proprietary to Daugherty Business Solutions
May 1, 2019
Data Engineering and the Data Science Lifecycle
Data Science Divided
Data Science Solution
Data Science Model
Data Engineering
Data Scientists are not Data Engineers
https://www.oreilly.com/ideas/why-a-data-scientist-is-not-a-data-engineer
What is a data pipeline?
[Figure: a simple pipeline and a more complicated pipeline; labels include CSV, Avro, and NoSQL]
Creating Reliable Pipelines
It’s not enough to do it once.
Reproducible
Performant
Robust
Flexible
Monitored
Governed
Architecting Distributed Systems
• Containers simplify deployment, making it reliable and repeatable.
• Streaming, because yesterday’s data might be too old.
Shaping Data Sources
Architecting Data Storage
• Storage Mechanisms
• Serialization Framework
• Compression Mechanisms
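The serialization and compression choices on this slide can be illustrated with a small sketch. It is not taken from the deck: the record fields (`event_id`, `value`) are hypothetical, and CSV, JSON Lines, and gzip from the Python standard library stand in for whatever serialization framework and compression codec a real pipeline would use (the deck mentions Avro, which is not used here).

```python
import csv
import gzip
import io
import json

# Hypothetical sample records; the field names are illustrative only.
records = [{"event_id": i, "value": i * 1.5} for i in range(1000)]


def to_csv_bytes(rows):
    """Serialize rows as CSV text encoded to UTF-8 bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["event_id", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def to_json_lines_bytes(rows):
    """Serialize rows as newline-delimited JSON encoded to UTF-8 bytes."""
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


def from_json_lines_bytes(payload):
    """Inverse of to_json_lines_bytes."""
    return [json.loads(line) for line in payload.decode("utf-8").splitlines()]


csv_raw = to_csv_bytes(records)
json_raw = to_json_lines_bytes(records)
json_gz = gzip.compress(json_raw)  # compression applied on top of serialization

print("csv bytes:      ", len(csv_raw))
print("json bytes:     ", len(json_raw))
print("json+gzip bytes:", len(json_gz))

# A storage format is only useful if the data survives the round trip.
restored = from_json_lines_bytes(gzip.decompress(json_gz))
```

Comparing the three sizes on representative data is one simple way to weigh a format and codec before committing a pipeline to them.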
Data Science Lifecycle: Collaborating with Data Scientists
Exercise: Initial problem statement
We are looking to create a system that generates a stream of events and processes those events.
We will create a machine learning algorithm to make predictions based on these events.
We will monitor the effectiveness of these predictions.
Finally, we will detect model drift and retrain our machine learning algorithm to adjust for the new model.
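A minimal sketch of the first half of the exercise, generating a stream of events and processing it, might look like the following. Everything here is an assumption for illustration: the `Event` shape, the synthetic `reading` values, and the fixed-threshold `predict` function, which only stands in for the machine learning model the exercise calls for.

```python
import random
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Event:
    """Hypothetical event shape; the deck does not define one."""
    event_id: int
    reading: float


def generate_events(count: int, seed: int = 0) -> Iterator[Event]:
    """Yield a reproducible stream of synthetic events."""
    rng = random.Random(seed)
    for event_id in range(count):
        yield Event(event_id=event_id, reading=rng.gauss(10.0, 2.0))


def predict(event: Event, threshold: float = 10.0) -> bool:
    """Placeholder for the machine learning model: a fixed threshold rule."""
    return event.reading > threshold


def process(events: Iterator[Event]) -> dict:
    """Consume the stream one event at a time and tally predictions."""
    counts = {"total": 0, "positive": 0}
    for event in events:
        counts["total"] += 1
        if predict(event):
            counts["positive"] += 1
    return counts


summary = process(generate_events(1000))
print(summary)
```

Seeding the generator keeps the stream reproducible, so the same run can be replayed when the model or the processing logic changes.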
Data Acquisition
• Internal Static Data
• API/Interactive Exchange
• Streaming Data
• External data vendor
Robust, Reliable, Governed, Performant
Data Preparation
Every block of stone has a statue inside it, and it is the task of the sculptor to discover it.
Exercise Architecture
Collaborating with Data Scientists
Hypothesis and Modeling
• Data scientists use their understanding of the data to make a guess at what the underlying phenomenon is.
• They create a model that offers insight into the inner workings of the phenomenon.
Evaluation and Interpretation
• Data scientists train their models using training data. Some models can be verified using testing data.
• They interpret the results of the model against reality. Then they can determine whether it is appropriate for use.
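The train-then-verify step described above can be sketched with a toy example. The data is synthetic, the 75/25 split is an arbitrary choice, and the "model" is a deliberately simple threshold placed midway between the two class means; none of this comes from the deck, and a real project would use a proper modeling library.

```python
import random
from statistics import mean

rng = random.Random(42)

# Synthetic labelled data: class 1 readings are centred higher than class 0.
data = [(rng.gauss(8.0, 1.0), 0) for _ in range(200)]
data += [(rng.gauss(12.0, 1.0), 1) for _ in range(200)]
rng.shuffle(data)

# Hold out 25% of the rows for testing.
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]


def fit_threshold(rows):
    """'Train' by placing the threshold midway between the two class means."""
    mean0 = mean(x for x, label in rows if label == 0)
    mean1 = mean(x for x, label in rows if label == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum((x > threshold) == bool(label) for x, label in rows)
    return hits / len(rows)


threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
print(f"threshold={threshold:.2f} train={train_accuracy:.3f} test={test_accuracy:.3f}")
```

The test rows play no part in fitting the threshold, so the test accuracy is the number to compare against reality when deciding whether the model is appropriate for use.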
Deployment
Exercise: Reality changes
Operations and Monitoring
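One way to monitor the effectiveness of predictions, and to notice when reality has changed, is to track accuracy over a sliding window of recent outcomes. The sketch below is an assumption rather than the deck's method: the `AccuracyMonitor` class, the window size, and the 0.8 accuracy floor are all hypothetical, and it presumes the true outcome of each event eventually becomes known.

```python
from collections import deque


class AccuracyMonitor:
    """Track prediction accuracy over a sliding window of recent outcomes."""

    def __init__(self, window_size: int = 50, min_accuracy: float = 0.8):
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, predicted: bool, actual: bool) -> None:
        """Store whether the latest prediction matched what actually happened."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Accuracy over the current window (1.0 while the window is empty)."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # Only raise the flag once the window is full, to avoid noisy early alerts.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor(window_size=50, min_accuracy=0.8)

# Phase 1: the model agrees with reality on every event.
for _ in range(50):
    monitor.record(predicted=True, actual=True)
flag_before = monitor.needs_retraining()

# Phase 2: reality changes and the model is now wrong every time.
for _ in range(20):
    monitor.record(predicted=True, actual=False)
flag_after = monitor.needs_retraining()

print(flag_before, flag_after)  # prints: False True
```

After the second phase the window holds 30 correct and 20 incorrect outcomes, so accuracy drops to 0.6 and the monitor signals that retraining is due.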
Optimization
Retrain Remodel
Retraining
Conclusions
Data scientists are not data engineers.
A data scientist should be supported by two to five data engineers.
Data engineers are able to create reliable, repeatable, governed data pipelines.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

Data Engineering and the Data Science Lifecycle

  • 1. Confidential and Proprietary to Daugherty Business Solutions May 1, 2019 Data Engineering and the Data Science Lifecycle
  • 2. Data Science Divided: Data Science Solution, Data Science Model, Data Engineering
  • 3. Data Scientists are not Data Engineers https://www.oreilly.com/ideas/why-a-data-scientist-is-not-a-data-engineer
  • 4. Data Scientists are not Data Engineers https://www.oreilly.com/ideas/why-a-data-scientist-is-not-a-data-engineer
  • 5. What is a data pipeline? Simple; More complicated (diagram labels: CSV, NoSQL, Avro)
  • 6. Creating Reliable Pipelines: It’s not enough to do it once. Reproducible, Performant, Robust, Flexible, Monitored, Governed
  • 7. Architecting Distributed Systems
  • 8. Architecting Distributed Systems • Containers simplify the process of deployment, making it reliable and repeatable • Streaming – because yesterday’s data might be too old.
  • 9. Shaping Data Sources
  • 10. Architecting Data Storage • Storage Mechanisms • Serialization Framework • Compression Mechanisms
  • 11. Data Science Lifecycle: Collaborating with Data Scientists
  • 12. Exercise: Initial problem statement. We are looking to create a system that generates a stream of events and processes those events. We will create a machine learning algorithm to make predictions based on these events. We will monitor the effectiveness of these predictions. Finally, we will detect model drift and retrain our machine learning algorithm to adjust for the new model.
  • 13. Data Acquisition: External data vendor, Internal Static Data, API/Interactive Exchange, Streaming Data. Robust, Reliable, Governed, Performant
  • 14. Data Preparation: Every block of stone has a statue inside it, and it is the task of the sculptor to discover it.
  • 15. Exercise Architecture
  • 16. Collaborating with Data Scientists. Hypothesis and Modeling • Data scientists use their understanding of the data to make a guess at what the underlying phenomenon is. • They create a model that offers insight into the inner workings of the phenomenon. Evaluation and Interpretation • Data scientists train their models using training data. Some models are able to be verified using testing data. • They interpret the results of the model against reality. Then they can determine if it is appropriate for use.
  • 17. Deployment
  • 18. Exercise: Reality changes
  • 19. Operations and Monitoring
  • 20. Optimization: Retrain, Remodel
  • 21. Retraining
  • 22. Conclusions: Data scientists are not data engineers. A data scientist should be supported by two to five data engineers. Data engineers are able to create reliable, repeatable, governed data pipelines.

Editor's Notes

  1. Data science solutions are more than just modeling. To successfully deliver a data science solution, you need to be able to get the data to the model in the right form in order to train the model. After the model is trained, you need to integrate it into your data science pipeline using good data management and software management processes. In other words, you need data engineering to make it work.
  2. Most data scientists are not skilled in software development and data management practices. Their skill set skews towards advanced statistical algorithms and machine learning algorithms. These skills are necessary to create a data science solution, but they aren’t on their own sufficient.
  3. While there is some overlap between data scientists who can do data engineering and data engineers who can do data science, the overlap isn’t particularly deep. A moderately complicated data pipeline may be beyond the skill set of even those crossover data scientists.
  4. An example of a simple pipeline would be processing text files stored in HDFS/S3 with Spark. An example of a moderately complicated data pipeline is to start optimizing your storage with a correctly used NoSQL database that uses a binary format like Avro. More complicated pipelines could include streaming data processing. The additional complexity can turn your data science project into data project science.
  5. Data engineers build data science pipelines that are: reproducible (across environments, using templated solutions to solve common problems); performant (getting the data into the right place at the right time); robust (handling peaks and valleys in volume and data); flexible (handling different formats without erroring); monitored (communicating error conditions effectively); and governed (using good data governance practices, especially around the data lifecycle). It’s not enough to do it once.
  6. Data engineers need to be able to understand how to build distributed systems. If they are using Hadoop or other Big Data technologies, they need to understand how the different ecosystem components can be merged together in order to create a data science solution. If they are using Cloud solutions, they need to be able to understand how the different cloud components can be put together in order to assemble a solution. It is especially important that they are able to understand the cost implications for different solution architectures.
  7. In some cases the solution for a distributed architecture may rely on technologies like Docker and Kubernetes in order to simplify deployment and make it reliable and repeatable. In other cases, the data engineer may have to handle streaming data from IoT devices using technologies like Kafka and NiFi.
  8. Data engineers need to shape the data in order to transform it from data into information. In some cases this will happen programmatically using languages like Java, Python, Scala, or R. The data may be residing in SQL databases or in different forms of NoSQL databases. The kinds of data shaping activities that a data engineer might engage in are: profiling, filtering, sorting, projection, type conversion, data imputation, feature abstraction, and segmentation.
  9. Architecting data storage means that we need to understand different storage mechanisms, different serialization frameworks, and different compression mechanisms.
  10. Data engineers collaborate with data scientists in acquiring data and preparing it for use in data science models. Once the model is complete, the data engineers can make sure that it is ready for production workloads and ready for deployment. After the model is in production, data engineers need to monitor its effectiveness. When the model performance starts to degrade, the data engineers collaborate with the data scientist to retrain or remodel it in order to restore its effectiveness. Understanding the kinds of inputs and outputs that come from that process enables the data engineer to assist in the development and deployment of the data science model.
  11. Acquire external data using a repeatable process, wrapping external data with data governance processes. Acquire internal static data with a repeatable process, wrapping internal data with data governance processes. Acquire streaming data with a repeatable process. Store the data in such a way that data scientists can use it. Considerations: stale data, contractual details, approvals, compliance.
  12. Preparation of data for the model is an area where data engineers need to collaborate with data scientists in order to make sure the data is fit for modeling. Activities that may happen are: scaling, feature abstraction, data cleaning, and data imputation.
  13. Core components: observed data (X, Y, Result); messaging platform (Kafka), with production and consumption; database; machine learning model.
  14. The data scientist generally takes the lead when it comes to the creation and curation of the data science model.
  15. The output from the model creation step may not be ready for production. The model may not be ready for scaling or able to yield the desired performance. Data engineers need to work with the data scientists to convert the model into something that is production ready. Finally, the data engineer can integrate it into the data pipeline.
  16. In our exercise, we’re changing the inputs into the pipeline. In reality, this may be changing customer tastes or an environmental shift that makes our model less useful.
  17. In this example, you can see that the performance of the model has slipped. Looking only at accuracy and recall, it isn’t immediately apparent that the performance has changed significantly. However, precision really tells the story. As a data engineer, you need to understand the outputs of the model in order to make sure that you are able to monitor the effectiveness of the model.
  18. If the model’s general parameters just need a bit of adjustment, you may be able to get away with just retraining the model. If something has seriously changed in the underlying environment, you may have to go back to the beginning and identify the features that now would govern the desired behavior.
  19. With some retraining, our model is back on track.
  20. In conclusion, data scientists are not data engineers. Their skill set may overlap with a data engineer, but their focus should be on preparing, creating, evaluating, and explaining models that produce business value. Data engineers complement data scientists. We recommend that a data scientist be supported with two to five data engineers in order to let them spend their time optimally focusing on the things that they do that bring value. Data engineers create the data pipelines that are needed in order to realize the business value.
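
The monitoring point in notes 17 and 18 can be made concrete with a small sketch. It is not taken from the exercise itself: the confusion-matrix counts, the `tolerance` value, and the helper names (`make_labels`, `classification_metrics`, `degraded_metrics`) are all assumptions chosen for illustration. It shows one way a data engineer might compare a baseline window of predictions against a later window and flag which metrics have slipped. With an imbalanced event stream, extra false positives barely move accuracy and leave recall untouched, while precision drops sharply.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def make_labels(tp, fp, tn, fn):
    """Build hypothetical (y_true, y_pred) lists from confusion-matrix counts."""
    y_true = [1] * tp + [0] * fp + [0] * tn + [1] * fn
    y_pred = [1] * tp + [1] * fp + [0] * tn + [0] * fn
    return y_true, y_pred


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + fp + tn + fn
    return Metrics(
        accuracy=(tp + tn) / total if total else 0.0,
        precision=tp / (tp + fp) if (tp + fp) else 0.0,
        recall=tp / (tp + fn) if (tp + fn) else 0.0,
    )


def degraded_metrics(baseline, current, tolerance=0.10):
    """Return the names of metrics that fell more than `tolerance` below baseline.

    The 0.10 default is an arbitrary illustrative threshold, not a recommendation.
    """
    names = ("accuracy", "precision", "recall")
    return [n for n in names if getattr(baseline, n) - getattr(current, n) > tolerance]


if __name__ == "__main__":
    # Hypothetical windows of 100 events each, with 10 true positives' worth of signal.
    baseline = classification_metrics(*make_labels(tp=9, fp=1, tn=89, fn=1))
    drifted = classification_metrics(*make_labels(tp=9, fp=9, tn=81, fn=1))
    print("baseline:", baseline)
    print("drifted: ", drifted)
    print("degraded:", degraded_metrics(baseline, drifted))
```

In this made-up data only precision crosses the threshold, which is the situation note 17 describes. Whether a flagged metric calls for retraining or for remodeling is still the judgment described in note 18; the check only tells you that it is time to have that conversation with the data scientist.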