Big Data Warehouse &
Agile Analytic Operations:
Pharma Case Study with
Amazon Redshift and S3
@ODSC
Boston
Data
Festival
Chris Bergh
Gil Benghiat
cbergh@datakitchen.io
gil@datakitchen.io
Boston | September 23, 2016
For slides send email to
gil@datakitchen.io
#ODSC will also have them available
@DataKitchen_io
@benghiat
@ChrisBergh
#ODSC
#BostonDataFest
Agenda
• Background
• Who We Are
• Pharmaceuticals Industry Product Launches
• 7 Shocking Steps to Agile Analytic Operations
• Teams, Timing, Tools, Etc.
• Example Story Implementations
• Lessons Learned and Results
Background
DataKitchen Leadership
Chris Bergh
(Head Chef)
Gil Benghiat
(VP Product)
Eric Estabrooks
(VP Cloud and
Data Services)
Depth and Breadth of Experience:
Software Development, Executive Leadership, Hands On
Deep Analytic Experience:
Spent past decade solving the Agile Analytics pain
Commercial Data, Marketing and Analytics:
Supported sales worth $10s of billions to 1,000s of users
Unique Approach To Agile Analytics:
Focused on the Analysts and works well with corporate IT
DataKitchen enables
Data Warehouse and Analytic teams
to deliver value quickly and with high quality.
Offerings:
• Product to implement Agile Analytic Operations
• Service to make your analyst team awesome
• Strategic Consulting for Data Strategy and Agile Analytics
• 10 Years
• $2.6 billion
Source: http://www.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf
Pharmaceutical Product Commercialization
• The first six to twelve months
of a product launch
are key.
• How fast a product grows
during a launch determines
the overall lifetime revenue
• Because the patent expires
Source: quintiles.com
Data Comes from a Variety of Sources
• Analytic data comes at
varying timings and
sources
• Syndicated Data
• Sales Data
• Master Data
• Excel
Analytic Team Has Multiple Deliverables
• The Analytic Team
supports:
• Ongoing, production
reports and deliverables:
‘Weekly Launch Tracker’
• Ad Hoc answers to
business leader questions
• Resource Allocation and
predictive models
• For Sales and Marketing
Ad Hoc
Prediction
and
Optimization
The Analytic Team’s Goals
Allow fast changes
to support investigative analytics and
high quality production deliverables
You need both process and tools
Agile
Process
Agile
Analytic
Operations
• Technical Environment
• 7 Steps
• agilemanifesto.org
• 4 values
• 12 principles
• Start with Scrum
• Learn and evolve to
what works best in
your environment
Seven “Shocking” Steps to
Agile Analytic Operations
1. Add Tests
2. Do Branching & Merging
3. Use Multiple Environments
4. Reuse & Containerize
5. Parameterize your Process
6. Use Simple Storage
7. Support Three Workflows
❶ Add tests
Types
1. Error – stop the line
2. Warning – investigate later
3. Info – list of changes
Examples
1. Input file row count way below
a critical threshold
2. Input file row count a little
below a threshold
3. These customers changed
territories
And keep adding them with each feature developed!
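The three test types above can be sketched in Python. This is a minimal illustration, not the DataKitchen product's API: the `Severity` and `CheckResult` names, and the two row-count thresholds, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # stop the line
    WARNING = "warning"  # investigate later
    INFO = "info"        # list of changes


@dataclass
class CheckResult:
    name: str
    severity: Severity
    passed: bool
    detail: str


def check_row_count(row_count: int, critical_min: int, expected_min: int) -> CheckResult:
    """Classify an input file's row count against two hypothetical thresholds."""
    if row_count < critical_min:
        # Way below the critical threshold: error, stop the line.
        return CheckResult("row_count", Severity.ERROR, False,
                           f"{row_count} rows is below critical minimum {critical_min}")
    if row_count < expected_min:
        # A little below the expected threshold: warning, investigate later.
        return CheckResult("row_count", Severity.WARNING, False,
                           f"{row_count} rows is below expected minimum {expected_min}")
    return CheckResult("row_count", Severity.INFO, True, f"{row_count} rows")


def territory_changes(previous: dict, current: dict) -> CheckResult:
    """Info-level test: list customers whose territory changed between two drops."""
    changed = sorted(c for c in current if c in previous and previous[c] != current[c])
    return CheckResult("territory_changes", Severity.INFO, True, f"changed: {changed}")
```

A pipeline would halt on any failed `ERROR` result and carry `WARNING` and `INFO` results forward for review.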
❶ Add tests throughout your whole process
Are inputs free
from issues?
Are your business
logic assumptions
still true?
Are your outputs
consistent?
And Save Test Results!
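One simple way to save test results is to append one JSON record per test to a file. This is a sketch only: the field names (`run_at`, `stage`, `test`, `passed`) and the idea of a local JSON-lines file are assumptions, not something the slides specify.

```python
import json
from datetime import datetime, timezone


def result_record(stage: str, name: str, passed: bool, run_at: datetime = None) -> str:
    """Serialize one test result as a JSON line.

    stage is a hypothetical label such as 'input', 'business_logic' or 'output'.
    """
    run_at = run_at or datetime.now(timezone.utc)
    return json.dumps(
        {"run_at": run_at.isoformat(), "stage": stage, "test": name, "passed": passed},
        sort_keys=True,
    )


def append_results(path: str, records: list) -> None:
    """Append records to a JSON-lines file so history accumulates across runs."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(record + "\n")
```

Keeping the history makes it possible to compare each data drop against earlier ones.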
At the end of the day, Analytic work is all just code
Access:
Python Code
Transform:
SQL Code, ETL
Code
Model:
R Code
Visualize:
Tableau
Workbook XML
Report:
Excel File
❷ It’s Just Code: Branch & Merge
Source Code Control
Branching & Merging enables people to safely work on their own tasks
❸ Use Multiple Environments
Analytic Environment
Your Analytic Work Requires Coordinating Tools and Hardware
❸ Use Multiple Environments
Provide an Analytic Environment for each branch
• Analysts need a controlled environment for their experiments
• Engineers need a place to develop outside of production
• Update Production only after all tests are run!
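The last rule can be expressed as a small gate. This is a minimal sketch, assuming each test result is a dict with the hypothetical keys `severity`, `ran` and `passed`; the slides do not prescribe a format.

```python
def can_promote(results: list) -> bool:
    """Return True only if every test ran and no error-level test failed."""
    if not results:
        return False  # no test evidence means no promotion
    for result in results:
        if not result["ran"]:
            return False  # a skipped test blocks the update
        if result["severity"] == "error" and not result["passed"]:
            return False  # an error-level failure stops the line
    return True
```

Warnings do not block promotion in this sketch; a stricter team could choose to block on them as well.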
❹ Modularize & Containerize
Containerize
1. Manage the environment for each
component (e.g. Docker, AMI)
2. Practice Environment Version Control
Modularize
1. Do not create one ‘monolith’ of code
2. Reuse the code and results
❺ Parameterize Your Process
• Parameters and named sets of
parameters will increase your
velocity. With them, you can
vary
• Inputs [you can make a time
machine]
• Outputs
• Steps in the workflow
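Named sets of parameters might look like the sketch below. The set names, keys and values are all hypothetical. The `as_of` parameter illustrates the "time machine": the same process run against the data as it stood on an earlier date.

```python
from datetime import date

# Hypothetical named parameter sets; every name and value here is illustrative.
PARAMETER_SETS = {
    "production": {
        "as_of": None,                      # None means "use today's data"
        "output_schema": "prod",
        "steps": ["load", "transform", "publish"],
    },
    "dev_time_machine": {
        "as_of": date(2016, 6, 30),         # rerun the process as of a past date
        "output_schema": "dev_analyst",
        "steps": ["load", "transform"],     # skip publishing during development
    },
}


def resolve(name: str, today: date, **overrides) -> dict:
    """Return a copy of a named parameter set with overrides applied."""
    params = dict(PARAMETER_SETS[name])
    params.update(overrides)
    if params["as_of"] is None:
        params["as_of"] = today
    return params
```

For example, `resolve("dev_time_machine", date.today(), output_schema="dev_f3")` varies the output location without touching the workflow code.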
❻ Use Simple Storage
• Data Lake
• Keep copies of all your raw data in simple, cheap storage
(s3, HDFS)
• Data Restore: Be able to back up and restore your
databases easily
• “My Own Database”: Data Marts On Demand
• Create parameterized variations of your process that
allow you to assemble data for experimentation,
development, and production
[Diagram: DM, DW, DM, each fed by a Transform step]
❻ The Data Lake Pattern
Data
Sources
Data Lake:
Raw
Format
Relevant Data In
Separate Analytic
Environment
Data Supporting
Each Need
Development In
Data Science Team
Development In
Business Analytics
Team
Production
Analytics
❼ Support three workflows
Small Team
Promote directly to production
Feature Branch
Merge back to production branch
Data Governance
3rd party verification before
production merge
Review
Test
Approve
❼ This is the workflow we use
Sprint 1 Sprint 2
f1 f2
f3
main / master / trunk
f5
Teams, Timing, Tools, etc.
Teams
Data
Analyst
Data
Engineer
Data
Supplier
Business
Customers
Deliverables
• Data Analyst: Insights via Charts, Graphs, Dashboards, Models
• Data Engineer: Organized, quality checked data set
• Data Supplier: Data Extracts
Business Customers
Super Power Mindset
Data
Analyst
Data
Engineer
Data
Supplier
Business
Customers
Timing
Data
Analyst
Data
Engineer
Data
Supplier
Monthly Weekly Daily
Business
Customers
Process Tools
• Data Analyst: Jira, Confluence
• Data Engineer: Jira, Confluence
• Data Supplier: ?
Business Customers
Separate Jira Projects
Common Confluence Wiki
Technical Tools
• Data Analyst: Tableau Desktop & Online, Alteryx, Excel
• Data Engineer: Amazon Web Services (AWS: S3, EC2, Redshift), DataKitchen (data workflow & tests), Git
• Data Supplier: RDBMS, MDM, Salesforce, Excel, sFTP, etc.
Business Customers
The Redshift Data Lake Pattern
Data Sources -> Data Lake (data in raw format) -> SQL (data transforms) -> DM / DW / DM (data ready for each need)
Get the data into S3
• Use an EC2 machine to
run a program
• Get data locally (e.g.
with sFTP)
• Use “aws s3” to move
data to S3
S3 is the data lake
• Keep all your raw data
• Put dates and sources
in the path & keep a
full history
• If costs get high,
consider Glacier
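The path convention and the upload step can be sketched together. This is a minimal illustration: the `raw/<source>/<yyyy>/<mm>/<dd>/` layout and the bucket name are assumptions, and the function only builds the `aws s3 cp` argument list rather than running it.

```python
from datetime import date


def raw_key(source: str, drop_date: date, filename: str) -> str:
    """Build an S3 key that records the source and date of a data drop.

    The 'raw/<source>/<yyyy>/<mm>/<dd>/' layout is a hypothetical convention.
    """
    return f"raw/{source}/{drop_date:%Y/%m/%d}/{filename}"


def upload_command(local_path: str, bucket: str, key: str) -> list:
    """Return the 'aws s3 cp' argument list for copying a local file to S3."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]


# Example: a file fetched locally (e.g. with sFTP) on the EC2 machine.
key = raw_key("syndicated", date(2016, 9, 23), "sales.csv")
command = upload_command("/tmp/sales.csv", "example-data-lake", key)
```

Because every drop lands under its own dated path, nothing is overwritten and the full history stays available. The list can be passed to `subprocess.run(command, check=True)` on a machine with the AWS CLI configured.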
Transform your data into target schemas
• Redshift “copy”
statement moves data
from s3 to a redshift
table
• Do your transforms in
SQL
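A load step might generate its COPY statement as below. This is a sketch under stated assumptions: the table name, S3 URI and IAM role ARN are placeholders, the file is assumed to be CSV with one header row, and the slides do not show which authorization method the team used.

```python
def copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads a CSV file from S3.

    Values are interpolated directly, so pass only trusted, internally
    generated names here, never user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Hypothetical names throughout: schema, table, bucket and role are placeholders.
sql = copy_statement(
    "raw.syndicated_sales",
    "s3://example-data-lake/raw/syndicated/2016/09/23/sales.csv",
    "arn:aws:iam::123456789012:role/example-redshift-load",
)
```

Once the raw table is loaded, the transforms into QA and target-schema tables are ordinary SQL run inside Redshift.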
Transform your data into target schemas
• Analysts hit Redshift
with their favorite
tools.
• Scale to petabytes
• Types of tables in Redshift
• Raw
• QA
• Target Schema (e.g. Star)
Environments for BI
• Environments can be
separate redshift
clusters or “schemas”
in the same redshift
cluster
• Test/preview major
changes to data
warehouse
• Experiments
• Feature work
• Prevent warehouse values
from changing during
development
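When environments are schemas in the same cluster, each branch can map to its own schema. The naming rule below (`prod` for the production branch, `dev_<branch>` otherwise) is an assumption for illustration, not the convention from the talk.

```python
import re


def schema_for_branch(branch: str, production_branch: str = "master") -> str:
    """Map a source-control branch to a Redshift schema name (hypothetical rule)."""
    if branch == production_branch:
        return "prod"
    # Keep only characters that are safe in an unquoted identifier.
    slug = re.sub(r"[^a-z0-9_]+", "_", branch.lower()).strip("_")
    return f"dev_{slug}"


def qualified(table: str, branch: str) -> str:
    """Return the schema-qualified table name for the given branch."""
    return f"{schema_for_branch(branch)}.{table}"
```

With this, the same SQL can run against `prod.fact_sales` in production and `dev_f3.fact_sales` on feature branch `f3`, so development never changes warehouse values.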
Example Story Implementations
Example 1
There is a new business
question that requires
a large new data set.
Data Supplier Team
1. New data set needed: Monthly Market Surveys
2. Design the survey
3. Run the market survey
4. Collect and clean the data
5. Deliver the data set via sFTP
This often takes incompressible calendar time
Data Engineering Team
1. Make a “scrappy star” in a data mart
2. Send questions to Supplier and keep Analysts in the loop
3. Add data tests – enables speed with quality
4. Share star with Data Analyst team and get feedback
5. Iterate (several Agile sprints)
6. Release “solid star”
Data Analytic Team
1. Make “scrappy” dashboards
2. Provide feedback to Data Engineering team
3. Show early dashboards to users
4. Have active build / design sessions
• Make as many changes live as possible
5. Publish production dashboards 70% there
• Update changes via Tableau Online
Example 2
There is a new business
question that requires
a new excel file.
Data Analytic Team
• New data set arrives from Sales Ops
• Use Alteryx to blend new data with solid star
• Publish with Tableau Online
• Gauge adoption
• Eventually
• Provide Alteryx script as requirements to Data Engineering team
Example 3
The number of customers
drops from 1000 to 900
and some reps in the field
wonder what happened.
Data Engineering Team
ISSUE: 1000 -> 900 customers
1. Investigate the root cause of the issue and why it was not detected
at data assembly time.
2. Root cause: Issue with production of supplier file.
3. Not detected: Test fails for 800 or fewer customers.
4. Change the test to fail with a 5% variation from the last data drop.
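The change in step 4 can be sketched as follows. The function names are hypothetical. The fixed-floor check mirrors the old test (fail at 800 or fewer customers), and the relative check implements the 5% variation rule.

```python
def fixed_floor_check(current: int, floor: int = 800) -> bool:
    """Old test: passes unless the count is at or below a fixed floor."""
    return current > floor


def variation_check(previous: int, current: int, tolerance: float = 0.05) -> bool:
    """New test: passes only if the count is within tolerance of the last data drop."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    change = abs(current - previous) / previous
    return change <= tolerance


# The drop from 1000 to 900 customers is a 10% change.
old_result = fixed_floor_check(900)       # passes, so the issue went undetected
new_result = variation_check(1000, 900)   # fails, so the issue is caught at assembly time
```

Comparing against the previous drop means the threshold follows the data as the customer base grows or shrinks.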
Lessons Learned and Results
Lessons Learned
• Culture Change: directionally correct, 70% right the first time
• Process Duality: Requires focus on both Agile Processes and Analytic
Operations
• Focus: Know Your Customers and Make Them a Hero
• Speed Trumps Errors: Find, admit and fix errors quickly,
retrospectives
Results
• Reduced time to insight
• Improved analytic quality
• Lowered the marginal cost to ask the next business question
• Improved analytic team satisfaction and morale
• Perceived by industry as a very successful launch
• Team promotions!
More Related Content

What's hot

QuerySurge for DevOps
QuerySurge for DevOpsQuerySurge for DevOps
QuerySurge for DevOps
RTTS
 
How to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest GroupHow to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest Group
Qualitest
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
DataKitchen
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data Testing
QA InfoTech
 
Implementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing ProjectImplementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing Project
RTTS
 
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
RTTS
 
Completing the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = SuccessCompleting the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = Success
RTTS
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
RTTS
 
Query Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programmingQuery Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programming
RTTS
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
RTTS
 
An introduction to QuerySurge webinar
An introduction to QuerySurge webinarAn introduction to QuerySurge webinar
An introduction to QuerySurge webinar
RTTS
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015
DataKitchen
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
RTTS
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
RTTS
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
DataKitchen
 
QuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solutionQuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solution
RTTS
 
A data driven etl test framework sqlsat madison
A data driven etl test framework sqlsat madisonA data driven etl test framework sqlsat madison
A data driven etl test framework sqlsat madisonTerry Bunio
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
Knoldus Inc.
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
RTTS
 

What's hot (20)

QuerySurge for DevOps
QuerySurge for DevOpsQuerySurge for DevOps
QuerySurge for DevOps
 
How to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest GroupHow to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest Group
 
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & RedshiftIntroduction to Big Data Technologies:  Hadoop/EMR/Map Reduce & Redshift
Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data Testing
 
Implementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing ProjectImplementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing Project
 
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
 
Completing the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = SuccessCompleting the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = Success
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
 
Query Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programmingQuery Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programming
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
An introduction to QuerySurge webinar
An introduction to QuerySurge webinarAn introduction to QuerySurge webinar
An introduction to QuerySurge webinar
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
QuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solutionQuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solution
 
A data driven etl test framework sqlsat madison
A data driven etl test framework sqlsat madisonA data driven etl test framework sqlsat madison
A data driven etl test framework sqlsat madison
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
 

Viewers also liked

FlyData Koichi Fujikawa / FlyData Founder #SVFT
FlyData  Koichi Fujikawa / FlyData Founder #SVFTFlyData  Koichi Fujikawa / FlyData Founder #SVFT
FlyData Koichi Fujikawa / FlyData Founder #SVFT
#SVFT Skyland Ventures Fest Tokyo
 
Datalicious - Smart Data Driven Marketing
Datalicious - Smart Data Driven MarketingDatalicious - Smart Data Driven Marketing
Datalicious - Smart Data Driven Marketing
Datalicious
 
Student innovation teams transition start up
Student innovation teams transition start upStudent innovation teams transition start up
Student innovation teams transition start upmylesdanson
 
EDUCAUSE ECAR session jisc presentation 2015
EDUCAUSE ECAR session jisc presentation 2015EDUCAUSE ECAR session jisc presentation 2015
EDUCAUSE ECAR session jisc presentation 2015
mylesdanson
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
DataKitchen
 
The AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case StudyThe AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case Study
Mark Gschwind
 
Business intelligence implementation case study
Business intelligence implementation case studyBusiness intelligence implementation case study
Business intelligence implementation case study
Jennie Chen, CTP
 
Bi presentation Designing and Implementing Business Intelligence Systems
Bi presentation   Designing and Implementing Business Intelligence SystemsBi presentation   Designing and Implementing Business Intelligence Systems
Bi presentation Designing and Implementing Business Intelligence Systems
Vispi Munshi
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
Mark Ginnebaugh
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
Randy L. Archambault
 
Case Studies Power Point
Case Studies Power PointCase Studies Power Point
Case Studies Power Pointguest3762ea6
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business Intelligence
Almog Ramrajkar
 
Business intelligence ppt
Business intelligence pptBusiness intelligence ppt
Business intelligence ppt
sujithkylm007
 
MBA case study presentation template
MBA case study presentation templateMBA case study presentation template
MBA case study presentation template
gorvis
 

Viewers also liked (14)

FlyData Koichi Fujikawa / FlyData Founder #SVFT
FlyData  Koichi Fujikawa / FlyData Founder #SVFTFlyData  Koichi Fujikawa / FlyData Founder #SVFT
FlyData Koichi Fujikawa / FlyData Founder #SVFT
 
Datalicious - Smart Data Driven Marketing
Datalicious - Smart Data Driven MarketingDatalicious - Smart Data Driven Marketing
Datalicious - Smart Data Driven Marketing
 
Student innovation teams transition start up
Student innovation teams transition start upStudent innovation teams transition start up
Student innovation teams transition start up
 
EDUCAUSE ECAR session jisc presentation 2015
EDUCAUSE ECAR session jisc presentation 2015EDUCAUSE ECAR session jisc presentation 2015
EDUCAUSE ECAR session jisc presentation 2015
 
Open Data Science Conference Agile Data
Open Data Science Conference Agile DataOpen Data Science Conference Agile Data
Open Data Science Conference Agile Data
 
The AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case StudyThe AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case Study
 
Business intelligence implementation case study
Business intelligence implementation case studyBusiness intelligence implementation case study
Business intelligence implementation case study
 
Bi presentation Designing and Implementing Business Intelligence Systems
Bi presentation   Designing and Implementing Business Intelligence SystemsBi presentation   Designing and Implementing Business Intelligence Systems
Bi presentation Designing and Implementing Business Intelligence Systems
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Case Studies Power Point
Case Studies Power PointCase Studies Power Point
Case Studies Power Point
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business Intelligence
 
Business intelligence ppt
Business intelligence pptBusiness intelligence ppt
Business intelligence ppt
 
MBA case study presentation template
MBA case study presentation templateMBA case study presentation template
MBA case study presentation template
 

Similar to Bdf16 big-data-warehouse-case-study-data kitchen

R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
Revolution Analytics
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Dmitry Anoshin
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
Kellyn Pot'Vin-Gorman
 
Developer Night - Opticon18
Developer Night - Opticon18Developer Night - Opticon18
Developer Night - Opticon18
Optimizely
 
rough-work.pptx
rough-work.pptxrough-work.pptx
rough-work.pptx
sharpan
 
Data DevOps: An Overview
Data DevOps: An OverviewData DevOps: An Overview
Data DevOps: An Overview
Scott W. Ambler
 
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
DataKitchen
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Denodo
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Building a Marketing Data Warehouse from Scratch - SMX Advanced 202
Christopher Gutknecht
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
iguazio
 
Fri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataopsFri benghiat gil-odsc-data-kitchen-data science to dataops
Fri benghiat gil-odsc-data-kitchen-data science to dataops
DataKitchen
 
ODSC data science to DataOps
ODSC data science to DataOpsODSC data science to DataOps
ODSC data science to DataOps
Christopher Bergh
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
DataWorks Summit
 
seven steps to dataops @ dataops.rocks conference Oct 2019
seven steps to dataops @ dataops.rocks conference Oct 2019seven steps to dataops @ dataops.rocks conference Oct 2019
seven steps to dataops @ dataops.rocks conference Oct 2019
DataKitchen
 
Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic
IntelAPAC
 
Developing and Implementing a QA Plan During Your Legacy Data to S1000D
Developing and Implementing a QA Plan During Your Legacy Data to S1000DDeveloping and Implementing a QA Plan During Your Legacy Data to S1000D
Developing and Implementing a QA Plan During Your Legacy Data to S1000D
dclsocialmedia
 
SOA Suite 11g Project Experience - FDUG Meeting - November 14 2013
SOA Suite 11g Project Experience - FDUG Meeting - November 14 2013SOA Suite 11g Project Experience - FDUG Meeting - November 14 2013
SOA Suite 11g Project Experience - FDUG Meeting - November 14 2013
jtreague
 

Similar to Bdf16 big-data-warehouse-case-study-data kitchen (20)

R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
Developer Night - Opticon18
Developer Night - Opticon18Developer Night - Opticon18
Developer Night - Opticon18
 
rough-work.pptx
rough-work.pptxrough-work.pptx
rough-work.pptx
 
Data DevOps: An Overview
Data DevOps: An OverviewData DevOps: An Overview
Data DevOps: An Overview
 
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
Strata+hadoop data kitchen-seven-steps-to-high-velocity-data-analytics-with d...
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Bdf16 big-data-warehouse-case-study-data kitchen

  • 1. Big Data Warehouse & Agile Analytic Operations: Pharma Case Study with Amazon Redshift and S3 @ODSC Boston Data Festival Chris Bergh Gil Benghiat cbergh@datakitchen.io gil@datakitchen.io Boston | September 23, 2016
  • 2. For slides send email to gil@datakitchen.io #ODSC will also have them available @DataKitchen_io @benghiat @ChrisBergh #ODSC #BostonDataFest
  • 3. Agenda • Background • Who We Are • Pharmaceuticals Industry Product Launches • 7 Shocking Steps to Agile Analytic Operations • Teams, Timing, Tools, Etc. • Example Story Implementations • Lessons Learned and Results
  • 5. DataKitchen Leadership Chris Bergh (Head Chef) Gil Benghiat (VP Product) Eric Estabrooks (VP Cloud and Data Services) Depth and Breadth of Experience: Software Development, Executive Leadership, Hands On Deep Analytic Experience: Spent past decade solving the Agile Analytics pain Commercial Data, Marketing and Analytics: Supported sales worth $10s of billions to 1,000s of users Unique Approach To Agile Analytics: Focused on the Analysts and works well with corporate IT
  • 6. DataKitchen enables Data Warehouse and Analytic teams to deliver value quickly and with high quality. Offerings: • Product to implement Agile Analytic Operations • Service to make your analyst team awesome • Strategic Consulting for Data Strategy and Agile Analytics
  • 7. • 10 Years • $2.6 billion Source: http://www.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf
  • 8. Pharmaceutical Product Commercialization • The six to twelve months of a product launch are key. • How fast a product grows during a launch determines the overall lifetime revenue • Because the patent expires Source: quintiles.com
  • 9. Data Comes from a Variety of Sources • Analytic data comes at varying timings and sources • Syndicated Data • Sales Data • Master Data • Excel
  • 10. Analytic Team Has Multiple Deliverables • The Analytic Team supports: • Ongoing, production reports and deliverables: ‘Weekly Launch Tracker’ • Ad Hoc answers to business leader questions • Resource Allocation and predictive models • For Sales and Marketing Ad Hoc Prediction and Optimization
  • 11. The Analytic Team’s Goals Allow fast changes to support investigative analytics and high quality production deliverables
  • 12. You need both process and tools Agile Process Agile Analytic Operations • Technical Environment • 7 Steps • agilemanifesto.org • 4 values • 12 principles • Start with Scrum • Learn and evolve to what works best in your environment
  • 13. Seven “Shocking” Steps to Agile Analytic Operations
  • 14. 1. Add Tests 2. Do Branching & Merging 3. Use Multiple Environments 4. Reuse & Containerize 5. Parameterize your Process 6. Use Simple Storage 7. Support Three Workflows Seven “Shocking” Steps To Agile Analytic Operations
  • 15. ❶ Add tests Types 1. Error – stop the line 2. Warning – investigate later 3. Info – list of changes Examples 1. Input file row count way below a critical threshold 2. Input file row count a little below a threshold 3. These customers changed territories And keep adding them with each feature developed!
  • 16. ❶ Add tests throughout your whole process Are inputs free from issues? Are your business logic assumptions still true? Are your outputs consistent? And Save Test Results!
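The three test types on the slides above (error, warning, info) can be sketched in Python. This is a minimal illustration, not the DataKitchen implementation: the names `Severity`, `CheckResult`, `check_row_count`, and `run_checks`, and the two thresholds, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"        # list of changes
    WARNING = "warning"  # investigate later
    ERROR = "error"      # stop the line


@dataclass
class CheckResult:
    name: str
    severity: Severity
    message: str


def check_row_count(row_count: int, warn_below: int, error_below: int) -> CheckResult:
    """Classify an input file's row count against two assumed thresholds."""
    if row_count < error_below:
        return CheckResult("row_count", Severity.ERROR,
                           f"{row_count} rows is below the critical threshold {error_below}")
    if row_count < warn_below:
        return CheckResult("row_count", Severity.WARNING,
                           f"{row_count} rows is below the threshold {warn_below}")
    return CheckResult("row_count", Severity.INFO, f"{row_count} rows")


def run_checks(results: list[CheckResult]) -> list[CheckResult]:
    """Keep every result, then stop the line if any check is an error."""
    saved = list(results)  # stand-in for writing test results to durable storage
    errors = [r for r in saved if r.severity is Severity.ERROR]
    if errors:
        raise RuntimeError("; ".join(r.message for r in errors))
    return saved
```

Results are collected before the error is raised so that a failed run still leaves a record, in the spirit of "And Save Test Results!".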
  • 17. At the end of the day, Analytic work is all just code Access: Python Code Transform: SQL Code, ETL Code Model: R Code Visualize: Tableau Workbook XML Report: Excel File ❷ It’s Just Code: Branch & Merge Source Code Control
  • 18. ❷ It’s Just Code: Branch & Merge Source Code Control Branching & Merging enables people to safely work on their own tasks
  • 19. Access: Python Code Transform: SQL Code, ETL Code Model: R Code Visualize: Tableau Workbook XML Report: Excel File ❸ Use Multiple Environments Analytic Environment Your Analytic Work Requires Coordinating Tools and Hardware
  • 20. ❸ Use Multiple Environments Provide an Analytic Environment for each branch • Analysts need a controlled environment for their experiments • Engineers need a place to develop outside of production • Update Production only after all tests are run!
  • 21. ❹ Modularize & Containerize Containerize 1. Manage the environment for each component (e.g. Docker, AMI) 2. Practice Environment Version Control Modularize 1. Do not create one ‘monolith’ of code 2. Reuse the code and results
  • 22. ❺ Parameterize Your Process • Parameters and named sets of parameters will increase your velocity. With them, you can vary • Inputs [you can make a time machine] • Outputs • Steps in the workflow
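One way to picture "named sets of parameters" is a small lookup table of run configurations. The sketch below is illustrative only: the set names, the keys (`as_of`, `output_schema`, `steps`), and the `resolve` helper are assumptions, not something shown in the talk.

```python
from datetime import date

# Hypothetical named parameter sets. Each one varies inputs, outputs,
# and the steps of the workflow.
PARAMETER_SETS = {
    "production": {
        "as_of": None,  # None means "use today's data"
        "output_schema": "prod",
        "steps": ["load", "transform", "publish"],
    },
    "dev": {
        "as_of": None,
        "output_schema": "dev",
        "steps": ["load", "transform"],
    },
    # The "time machine": rerun the process against an earlier data drop.
    "time_machine": {
        "as_of": date(2016, 6, 30),
        "output_schema": "dev",
        "steps": ["load", "transform"],
    },
}


def resolve(name: str, today: date, **overrides) -> dict:
    """Return a copy of a named parameter set with overrides applied."""
    params = {**PARAMETER_SETS[name], **overrides}
    if params["as_of"] is None:
        params["as_of"] = today
    return params
```

For example, `resolve("time_machine", today=date(2016, 9, 23))` keeps the earlier `as_of` date, while `resolve("production", today=date(2016, 9, 23))` falls back to the current date.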
  • 23. ❻ Use Simple Storage • Data Lake • Keep copies of all your raw data in simple, cheap storage (S3, HDFS) • Data Restore: Be able to back up and restore your databases easily • “My Own Database”: Data Marts On Demand • Create parameterized variations of your process that allow you to assemble data for experimentation, development, and production (Diagram: Transform steps feeding DM, DW, DM)
  • 24. ❻ The Data Lake Pattern Data Sources Data Lake: Raw Format Relevant Data In Separate Analytic Environment Data Supporting Each Need Development In Data Science Team Development In Business Analytics Team Production Analytics
  • 25. ❼ Support three workflows Small Team Promote directly to production Feature Branch Merge back to production branch Data Governance 3rd party verification before production merge Review Test Approve
  • 26. ❼ This is the workflow we use Sprint 1 Sprint 2 f1 f2 f3 main / master / trunk f5
  • 33. Technical Tools • Data Analyst: Tableau Desktop & Online, Alteryx, Excel • Data Engineer: Amazon Web Services (AWS): S3, EC2, Redshift, DataKitchen (data workflow & tests), GIT • Data Supplier: RDBMS, MDM, Salesforce, Excel, sFTP, etc. • Business Customers
  • 34. The Redshift Data Lake Pattern Data Data Data Data Lake SQL SQL SQL DM Data DW DM Data Sources Data in raw format Data transforms Data ready for each need
  • 35. Get the data into S3 • Use an EC2 machine to run a program • Get data locally (e.g. with sFTP) • Use “aws s3” to move data to S3
  • 36. S3 is the data lake • Keep all your raw data • Put dates and sources in the path & keep a full history • If costs get high, consider Glacier
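Putting dates and sources in the path can be as simple as a key-building function. The layout below (`raw/<source>/<YYYY>/<MM>/<DD>/<file>`) and the bucket name in the usage note are assumptions for illustration; the slides do not specify a path convention.

```python
from datetime import date


def raw_data_key(source: str, drop_date: date, filename: str) -> str:
    """Build an S3 key that records where the file came from and when it arrived."""
    return f"raw/{source}/{drop_date.strftime('%Y/%m/%d')}/{filename}"


def upload_command(local_path: str, bucket: str, key: str) -> list[str]:
    """Return the `aws s3 cp` argument list that would copy a local file to S3."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]
```

Because every drop gets its own dated key, nothing is overwritten and the full history stays available. With a hypothetical bucket, `upload_command("accounts.csv", "example-data-lake", raw_data_key("sales_ops", date(2016, 9, 23), "accounts.csv"))` yields an argument list that could be passed to `subprocess.run`.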
  • 37. Transform your data into target schemas • Redshift “copy” statement moves data from s3 to a redshift table • Do your transforms in SQL
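As a rough sketch of the load step, the helper below assembles a Redshift `COPY` statement for one file in S3. It assumes the file is CSV with a single header row and that the cluster is authorized through an IAM role; the table, bucket, key, and role ARN are placeholders, and `build_copy_statement` is a hypothetical name.

```python
def build_copy_statement(table: str, bucket: str, key: str, iam_role: str) -> str:
    """Assemble a Redshift COPY statement for a CSV file with a header row.

    The arguments are interpolated directly into SQL, so they must come from
    trusted configuration, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{key}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )
```

The resulting string can be executed through any PostgreSQL-compatible client connected to the cluster. Loading into a raw table first and then transforming in SQL matches the Raw, QA, and Target Schema table types described on the next slide.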
  • 38. Transform your data into target schemas • Analysts hit Redshift with their favorite tools. • Scale to petabytes • Types of tables in Redshift • Raw • QA • Target Schema (e.g. Star)
  • 39. Environments for BI • Environments can be separate Redshift clusters or “schemas” in the same Redshift cluster • Test/preview major changes to data warehouse • Experiments • Feature work • Prevent warehouse values from changing during development
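If environments are schemas in a single cluster, each source-control branch can be mapped to its own schema name. The naming convention here (`prod` for the production branch, `dev_<branch>` otherwise) is purely an assumption to make the idea concrete.

```python
import re


def schema_for_branch(branch: str, production_branch: str = "master") -> str:
    """Map a source-control branch to a schema name (hypothetical convention)."""
    if branch == production_branch:
        return "prod"
    # Keep only lowercase letters and digits so the result is a plain identifier.
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    return f"dev_{slug}"
```

Feature work then runs against a schema such as `dev_feature_f1_market_survey`, so production values do not change during development.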
  • 41. Example 1 There is a new business question that requires a large new data set.
  • 42. Data Supplier Team 1. New data set needed: Monthly Market Surveys 2. Design the survey 3. Run the market survey 4. Collect and clean the data 5. Deliver the data set via sFTP This often takes uncompressible calendar time
  • 43. Data Engineering Team 1. Make a “scrappy star” in a data mart 2. Send questions to Supplier and keep Analysts in the loop 3. Add data tests – enables speed with quality 4. Share star with Data Analyst team and get feedback 5. Iterate (several Agile sprints) 6. Release “solid star”
  • 44. Data Analytic Team 1. Make “scrappy” dashboards 2. Provide feedback to Data Engineering team 3. Show early dashboards to users 4. Have active build / design sessions • Make as many changes live as possible 5. Publish production dashboards 70% there • Update changes via Tableau Online
  • 45. Example 2 There is a new business question that requires a new excel file.
  • 46. Data Analytic Team • New data set arrives from Sales Ops • Use Alteryx to blend new data with solid star • Publish with Tableau Online • Gauge adoption • Eventually • Provide Alteryx script as requirements to Data Engineering team
  • 47. Example 3 The number of customers drops from 1000 to 900 and some reps in the field wonder what happened.
  • 48. Data Engineering Team ISSUE: 1000 -> 900 customers 1. Investigate the root cause of the issue and why it was not detected at data assembly time. 2. Root cause: Issue with production of supplier file. 3. Not detected: Test fails for 800 or fewer customers. 4. Change the test to fail with a 5% variation from the last data drop.
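The change of test described above can be sketched as two small functions: the original absolute threshold, and a relative check against the previous data drop. The function names are hypothetical; the 800-customer threshold and the 5% tolerance come from the slide.

```python
def old_customer_test(current: int, fail_at_or_below: int = 800) -> bool:
    """Original test: passes unless the count is at or below a fixed threshold."""
    return current > fail_at_or_below


def customer_variation_test(current: int, previous: int,
                            max_variation: float = 0.05) -> bool:
    """Revised test: passes only if the count moved by at most 5% since the last drop."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    variation = abs(current - previous) / previous
    return variation <= max_variation
```

With 900 customers after a previous drop of 1000, the old test passes (900 is above 800) while the revised test fails, because the change is 10%. That would have caught the issue at data assembly time.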
  • 50. Lessons Learned • Culture Change: directionally correct, 70% right the first time • Process Duality: Requires focus on both Agile Processes and Analytic Operations • Focus: Know Your Customers and Make Them a Hero • Speed Trumps Errors: Find, admit and fix errors quickly, retrospectives
  • 51. Results • Reduced time to insight • Improved analytic quality • Lowered the marginal cost to ask the next business question • Improved analytic team satisfaction and morale • Perceived by industry as very successful launch • Team promotions!