APRIL, 2023
DataOps
The Future of Data Management - Embracing Agility,
Collaboration, and Automation
Agenda
Introductions
DevOps to DataOps
CI/CD for Data Products
Orchestration, Testing and Monitoring
Questions
About Us
Jeewan Singh, Senior Principal, Data Analytics
Tomy Rhymond, Principal and Cloud Lead, Technology Enablement
So… what is DevOps, really?

What: DevOps is a cultural movement to
• Improve collaboration
• Automate operations (aka the "plumbing")
• Increase the rate of deployment
• Improve quality and security

How:
• Source control
• CI/CD
• Infrastructure automation (IaC)
• Automated test and validation
• Design for scalability
• Use the cloud

Why: Spend more time on valuable work… and have more fun!
Continuous Deployment of Databases: Part 1

Data and analytics professionals face unique challenges for automation:

State
• Application code is stateless; a database contains valuable business data.
• Structure and data must change without loss.
• Hand-crafting release scripts is error-prone.

Rolling back
• Application servers are easy to swap in/out; database servers are very difficult to swap in/out (even in a cluster), though databases or tables can sometimes be swapped.
• Applications are easy to roll back from source control; databases must be explicitly backed up and restored.

Down time
• Restores are very time-consuming, and the database is unavailable during the restore.

Testing
• Application code is easy to cover with unit tests; unit testing for databases is challenging.
• Unit testing requires test data generation and management, which gets complicated quickly.

Other
• Application configuration changes are deployed via CI/CD; most often only DBAs touch the database (control).
• Production databases don't match source control (drift).
• Database change management is difficult.
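One way past the hand-crafted-release-script problem is to keep every database change as a versioned script in source control and let an automated runner apply each one exactly once per environment. Below is a minimal sketch of that pattern in Python; the `migrations/` directory, the `schema_migrations` tracking table, and SQLite as the target are illustrative assumptions, not tooling prescribed by this deck.

```python
# Minimal sketch of a migration runner: versioned SQL files live in source
# control and are applied exactly once, in order, so CI/CD (not hand-crafted
# release scripts) drives database change. Names and paths are illustrative.
import sqlite3
from pathlib import Path

MIGRATIONS_DIR = Path("migrations")  # e.g. 001_create_orders.sql, 002_add_index.sql

def apply_pending_migrations(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for script in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if script.stem in applied:
            continue  # already applied on this database: no re-runs, no drift
        conn.executescript(script.read_text())
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (script.stem,))
        conn.commit()
        print(f"applied {script.stem}")

if __name__ == "__main__":
    apply_pending_migrations(sqlite3.connect("warehouse.db"))
```

Dedicated migration tools (Flyway, Liquibase, dbt, and others) implement the same idea with more safety around rollbacks and drift detection.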
These roadblocks add friction, prevent automation, and slow adoption of DataOps best practices:
• Fragile column mappings
• Embedded credentials
• Hard-coded connections
• Black-box SaaS
• GUI-only tools
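As a small illustration of removing two of these roadblocks (embedded credentials and hard-coded connections), pipeline code can read its connection settings from the environment so the same code is promoted unchanged from dev to prod. The variable names and the PostgreSQL-style DSN below are hypothetical.

```python
# Sketch: no embedded credentials, no hard-coded connections. The pipeline
# reads connection settings injected by the CI/CD runner or a secrets manager.
import os

def warehouse_dsn() -> str:
    host = os.environ["WAREHOUSE_HOST"]
    database = os.environ["WAREHOUSE_DB"]
    user = os.environ["WAREHOUSE_USER"]
    password = os.environ["WAREHOUSE_PASSWORD"]  # never committed to the repo
    return f"postgresql://{user}:{password}@{host}/{database}"
```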
5 Critical Mindset Changes

Traditional Mindset
• Business requirements are static: "Our job is to meet the agreed business requirements."
• Single developer, individual ownership: "Someone will email me if it breaks."
• UAT testing approach: "We will run some tests before we launch."
• Everything manual: "No time to build the automation yet."
• Demos at end of project: "Creating demos takes time."

DevOps Mindset
• Business requirements are fluid: "We aren't doing it right if we assume requirements are static."
• Multiple developers, team ownership: "Someone else may have to fix this if it breaks."
• Continuous testing approach: "We wrote the tests before we started developing."
• Mostly automated: "No time to waste on manual stuff."
• Demos daily or weekly: "Continual feedback is critical to success."
DataOps is a collaborative and automated approach to managing the entire lifecycle of data, from its creation to its deletion, in a way that ensures that data is trustworthy, accurate, and readily available to the right people at the right time.

People • Process • Technology
DataOps Collaboration
• Chief Data Officer
• Product Owner / Architect
• Data Engineer
• Data Analysts
• Data Scientist
• Operations / Administration
DataOps is an approach to data analytics and data-driven decision making that follows the agile methodology of continuous improvement.

Source Data → Data Ingestion → Data Engineering → Data Analytics → Business Users

DataOps wraps this flow with CI/CD, orchestration, testing, and monitoring.
DataOps practices are an investment whose dividends increase with time and experience:
• Increased speed of delivery from improved processes
• End-to-end efficient data from automated pipelines with feedback loops
• Improved productivity and collaboration from empowered developers
• Better business outcomes from happier customers
• Secure and compliant data from automated data quality checks, masking, tokenization, and more
• Reduced mean time to resolution (MTTR) from a shift-left quality approach
• Increased data reliability and resiliency
• Developer empowerment from a DevOps culture that promotes collaboration, ownership, and accountability
DataOps Principles

Analytics is code.
Differences can be spotted easily, and everything is committed to the code repo.

Orchestrate.
When everything is automated, we never have to choose between delivering new features and performing manual maintenance.

Make it reproducible.
The code runs the same way every time. There is no state to manage and there are no "two ways" to run it that might produce different results.

Disposable environments.
There's no such thing as data loss: at any time, the production environment can be recycled and a new environment can be spun up automatically.
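A minimal sketch of what "reproducible" can mean in practice: the transformation below rebuilds its output entirely from the raw zone on every run, so re-running it, or running it in a freshly spun-up environment, yields the same result. The table and column names (`raw_orders`, `curated_daily_sales`) are made up for illustration, and SQLite stands in for the warehouse.

```python
# Sketch of a reproducible transformation: no incremental state, the output
# table is rebuilt from raw data every run, so a recycled or brand-new
# environment produces the same result from the same code and raw data.
import sqlite3

def rebuild_daily_sales(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        DROP TABLE IF EXISTS curated_daily_sales;
        CREATE TABLE curated_daily_sales AS
        SELECT order_date, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY order_date;
        """
    )
    conn.commit()
```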
DataOps Maturity Model
CI/CD for Data Products

Source: Stefana Muller, Dev Leaders Compare Continuous Delivery vs. Continuous Deployment vs. Continuous Integration
What do we mean when we say "CI/CD"?

CI/CD Definitions

Continuous Integration (CI) is a software engineering practice in which developers integrate code into a shared repository several times a day in order to obtain rapid feedback on the feasibility of that code. CI enables automated build and testing so that teams can rapidly work on a single project together.

Continuous Deployment (also CD) is the process by which qualified changes in software code or architecture are deployed to production as soon as they are ready and without human intervention.

Continuous Delivery (CD) is a software engineering practice in which teams develop, build, test, and release software in short cycles. It depends on automation at every stage so that cycles can be both quick and reliable.
Developing with CI/CD

[Diagram: commits on a dev branch go through a pull request into the main branch; passing commits (✔) are merged, failing ones (❌) are not. The main branch rebuilds a "beta" copy of the DW, refreshed daily/hourly, and auto-publishes to the production DW.]

1. Continuous Integration (CI) testing: automatic, with every commit!
2. Continuous Delivery (CD): new changes automatically delivered in beta!
3. Continuous Deployment (also CD): new features and fixes delivered to customers automatically!

CI/CD Getting Started Checklist
1) Store all your files in source control.
2) Create a full deployment script.
3) Create a text file pointing to your deployment script.
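To make item 2 of the checklist concrete, here is one hedged sketch of a "full deployment script" that a CI job could call for both the beta rebuild and the production publish. The `models/` directory layout, the `TARGET_DB` environment variable, and SQLite as the target are assumptions for illustration, not a prescribed tool.

```python
# deploy.py -- minimal sketch of a full deployment script driven by CI/CD.
# Every model lives in source control; the target database is passed in by
# the pipeline, so beta and production use the exact same script.
import os
import sqlite3
from pathlib import Path

def deploy(target_db: str) -> None:
    conn = sqlite3.connect(target_db)
    for sql_file in sorted(Path("models").glob("*.sql")):
        conn.executescript(sql_file.read_text())
        print(f"deployed {sql_file.name} to {target_db}")
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # CI passes a different target for the beta rebuild vs. the production publish.
    deploy(os.environ.get("TARGET_DB", "beta_warehouse.db"))
```

CI would run this against the beta copy on every merge (e.g. `TARGET_DB=beta_warehouse.db python deploy.py`), and continuous deployment would point the same script at production.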
Orchestration, Testing
and Monitoring
DataOps Compared to DevOps

DevOps: Develop → Build → Test → Deploy → Run, with CI spanning develop through test and CD spanning deploy and run.
DataOps: Sandbox → Develop → Orchestrate → Test → Deploy → Orchestrate → Monitor, with the same CI/CD split applied across the data lifecycle.
Modern Cloud Data Reference Architecture

[Architecture diagram, summarized]

Cross-cutting capabilities:
• Data pipeline orchestration and monitoring
• Security: authorization & authentication
• Continuous integration, continuous deployment (CI/CD)
• Data governance and access, spanning a data access layer, governance layer, and management layer: centralized policies, data quality monitoring, data lineage & metadata, data catalog, consistent controls, security policy enforcement, data tokenization & masking

Data source layer (source systems shown in the diagram):
• RX technology: Kroll, Reflex RX, Fillware, Compliance Cube, AssysteRx, PharmaClick RX, Applied Robotics, Ubik
• POS technology: SoloChain MSA, Maple CMSV2, PharmaClick POS, Reflex POS, Tulip MagicBox, Proxim POS, Cyberlog, ICN, PTS (db)
• Loyalty: Guardian Rewards, Uniprix Rewards, Proxim Rewards
• E-commerce: Website / Facebook, Email (Dialogue), Mobile Apps, UniSante, ProxiSante
• Patient support program: Newsletter, LMS, NPS / Survey, Program Participation
• Wholesale distribution: Vistex, JDA, MBA, Anzio
• External unstructured data: IQVIA, Nielsen, Health Canada, First Data Bank
• Data warehouses: GCP E-commerce, RelayHealth Hub, SAP, BeWell, Diem
• Other sources: IQ, DataSmart, UniBi, SIR, DLD

Data ingestion:
• Batch ingestion: cloud-based ETL, event-driven functions f(x), REST APIs
• Streaming ingestion: real-time ingestion, IoT devices

Data lake: raw zone, processed zone, curated zone

Central data storage:
• Data warehouse with transformation & business rules
• Patient data hub: facts, dimensions, aggregates, views; merge & match, deduplication, enrichment

Machine learning (predictions & recommendations): feature generation, model development, model deployment, model monitoring

Consumption layer:
• Operational reports: warehouse & specialty, store sales & growth, kiosk reports
• Analytical dashboards: manufacturer insights, patient insights, pharmacy insights
• External data portal: Nielsen data, external kiosk, SharePoint
• Sandbox environment: ad-hoc data analysis, raw data analysis, merging / curating data sets
• API apps: LifeLabs apps, loyalty program apps, etc.

Consumers: end users (manufacturers, management team, internal analytics teams, external users connecting over VPN), patients / customers, the general pharmacy operations team, the specialty pharmacy operations team, and data governance SMEs.
Orchestrate, Test and Monitor

Orchestrate
• Both infrastructure-as-code and data pipeline code in a single pipeline
• Tools: Cloud Composer (GCP), Airflow, Azure Data Factory (Azure), dbt, DataOps.live, Informatica, Matillion, Stitch, AWS Data Pipeline

Test
• At the end of the pipeline run
• Tools: dbt, DataOps.live, Google Dataform, Boomi, Informatica, Matillion, Great Expectations, tSQLt

Monitor
• Cloud resources: GCP Cloud Monitoring, CloudWatch, Azure Monitor, Datadog
• Data pipelines: the respective tools, plus native cloud monitoring dashboards
• Data quality: ETL tools, manual tools on top of data platforms
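As an illustration of orchestration tying ingestion, transformation, and testing into one scheduled pipeline, here is a minimal Airflow DAG sketch (Airflow 2.x syntax; parameter names vary slightly between versions). The task bodies, DAG id, and schedule are placeholders, not part of this reference architecture.

```python
# Minimal Airflow 2.x DAG sketch: ingest -> transform -> quality checks,
# scheduled daily by the orchestrator. Task bodies are stubs for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    print("pull source data into the raw zone")


def transform():
    print("run in-warehouse transformations")


def run_quality_checks():
    print("validate row counts, nulls, duplicates; fail the run on errors")


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2023, 4, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    test_task = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)

    ingest_task >> transform_task >> test_task
```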
From ETL to ELTP

ETL: Extract → Transform → Load
ELTP: Extract → Load → Transform → Publish

Benefits of ELT over ETL:
• Non-destructive updates
• Improved stability and recoverability

The "Publish" step signals that data is available and ready for downstream subscribers. It may involve shipping a copy of the data into the data lake, replicating to multiple Redshift clusters, populating BI models, or similar actions.
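A hedged sketch of what the "Publish" step could look like in code: once the in-warehouse transform has finished, write a small manifest that downstream subscribers (BI refreshes, data lake copies, replicas) watch for before acting. The manifest location and fields are assumptions, not a standard.

```python
# Sketch of an ELTP "Publish" step: write a manifest announcing that a
# dataset is ready for downstream subscribers. Paths and fields are assumed.
import json
from datetime import datetime, timezone
from pathlib import Path

def publish(dataset: str, row_count: int, manifest_dir: Path = Path("published")) -> Path:
    manifest_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "dataset": dataset,
        "row_count": row_count,
        "published_at": datetime.now(timezone.utc).isoformat(),
    }
    path = manifest_dir / f"{dataset}.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```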
At the core of DataOps is your organization's information architecture:
• How well do you know your data?
• Do you trust your data?
• Are you able to quickly detect errors?
• Can you make changes incrementally without "breaking" your entire data pipeline?

Critical areas that can transform your data pipeline:
• Data curation services
• Metadata management
• Data governance
• Master data management
• Self-service interaction
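"Quickly detect errors" usually means a few explicit, automated quality rules running right after each load so a bad batch fails the pipeline instead of reaching users. A minimal sketch follows; the table, columns, and rules are illustrative, and tools such as Great Expectations or dbt tests express the same kind of checks declaratively.

```python
# Minimal sketch of automated error detection: explicit data quality rules
# run after each load and fail the run (and the CI job) on bad data.
import sqlite3

def check_curated_daily_sales(conn: sqlite3.Connection) -> None:
    nulls = conn.execute(
        "SELECT COUNT(*) FROM curated_daily_sales WHERE order_date IS NULL"
    ).fetchone()[0]
    negatives = conn.execute(
        "SELECT COUNT(*) FROM curated_daily_sales WHERE total_amount < 0"
    ).fetchone()[0]
    if nulls or negatives:
        raise ValueError(
            f"quality check failed: {nulls} null dates, {negatives} negative totals"
        )
```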
Thank You.
Questions?