Bdf16 big-data-warehouse-case-study-data kitchen

Big Data Warehouse &
Agile Analytic Operations:
Pharma Case Study with
Amazon Redshift and S3
@ODSC
Boston
Data
Festival
Chris Bergh
Gil Benghiat
cbergh@datakitchen.io
gil@datakitchen.io
Boston | September 23, 2016

For slides send email to
gil@datakitchen.io
#ODSC will also have them available
@DataKitchen_io
@benghiat
@ChrisBergh
#ODSC
#BostonDataFest

Agenda
• Background
• Who We Are
• Pharmaceuticals Industry Product Launches
• 7 Shocking Steps to Agile Analytic Operations
• Teams, Timing, Tools, Etc.
• Example Story Implementations
• Lessons Learned and Results

DataKitchen Leadership
Chris Bergh
(Head Chef)
Gil Benghiat
(VP Product)
Eric Estabrooks
(VP Cloud and
Data Services)
Depth and Breadth of Experience:
Software Development, Executive Leadership, Hands On
Deep Analytic Experience:
Spent past decade solving the Agile Analytics pain
Commercial Data, Marketing and Analytics:
Supported sales worth $10s of billions to 1,000s of users
Unique Approach To Agile Analytics:
Focused on the Analysts and works well with corporate IT

DataKitchen enables
Data Warehouse and Analytic teams
to deliver value quickly and with high quality.
Offerings:
• Product to implement Agile Analytic Operations
• Service to make your analyst team awesome
• Strategic Consulting for Data Strategy and Agile Analytics

•10 Years
•$2.6 billion
Source: http://www.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf

Pharmaceutical Product Commercialization
• The first six to twelve months
of a product launch
are key.
• How fast a product grows
during a launch determines
the overall lifetime revenue
• Because the patent expires
Source: quintiles.com

Data Comes from a Variety of Sources
• Analytic data arrives at
varying times and from
varying sources
• Syndicated Data
• Sales Data
• Master Data
• Excel

Analytic Team Has Multiple Deliverables
• The Analytic Team
supports:
• Ongoing, production
reports and deliverables:
‘Weekly Launch Tracker’
• Ad Hoc answers to
business leader questions
• Resource Allocation and
predictive models
• For Sales and Marketing
Ad Hoc
Prediction
and
Optimization

The Analytic Team’s Goals
Allow fast changes
to support investigative analytics and
high quality production deliverables

You need both process and tools
Agile Process
• agilemanifesto.org
• 4 values
• 12 principles
• Start with Scrum
• Learn and evolve to
what works best in
your environment
Agile Analytic Operations
• Technical Environment
• 7 Steps

Seven “Shocking” Steps to
Agile Analytic Operations

1. Add Tests
2. Do Branching & Merging
3. Use Multiple Environments
4. Reuse & Containerize
5. Parameterize your Process
6. Use Simple Storage
7. Support Three Workflows

❶ Add tests
Types
1. Error – stop the line
2. Warning – investigate later
3. Info – list of changes
Examples
1. Input file row count way below
a critical threshold
2. Input file row count a little
below a threshold
3. These customers changed
territories
And keep adding them with each feature developed!
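A minimal Python sketch of the three test types, using the row-count and territory examples above. The thresholds, function names, and the `CheckResult` structure are illustrative assumptions, not the DataKitchen product's API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # stop the line
    WARNING = "warning"  # investigate later
    INFO = "info"        # list of changes


@dataclass
class CheckResult:
    name: str
    severity: Severity
    message: str


def check_row_count(row_count, critical_min, warning_min):
    """Classify an input file's row count against two hypothetical thresholds."""
    if row_count < critical_min:
        return CheckResult("row_count", Severity.ERROR,
                           f"{row_count} rows is below critical minimum {critical_min}")
    if row_count < warning_min:
        return CheckResult("row_count", Severity.WARNING,
                           f"{row_count} rows is below expected minimum {warning_min}")
    return CheckResult("row_count", Severity.INFO, f"{row_count} rows received")


def territory_changes(previous, current):
    """Info-level report: customers whose territory differs between two drops."""
    changed = sorted(c for c in current if c in previous and previous[c] != current[c])
    return CheckResult("territory_changes", Severity.INFO,
                       f"customers with changed territories: {changed}")


def should_stop(results):
    """Only an ERROR halts the pipeline; warnings and info are just reported."""
    return any(r.severity is Severity.ERROR for r in results)
```

Each new feature can add another check function like these, so the test suite grows with the pipeline.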

❶ Add tests throughout your whole process
Are inputs free
from issues?
Are your business
logic assumptions
still true?
Are your outputs
consistent?
And Save Test Results!
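One possible way to run tests at each stage (inputs, business logic, outputs) and save the results, sketched in Python. The stage names, the JSON-lines file format, and both function names are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def run_stage_tests(stage, tests, data):
    """Run each (name, predicate) pair for one pipeline stage and collect results."""
    return [
        {"stage": stage, "test": name, "passed": bool(predicate(data))}
        for name, predicate in tests
    ]


def save_results(results, path):
    """Append results as JSON lines so the history accumulates across runs."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as handle:
        for result in results:
            handle.write(json.dumps({**result, "run_at": stamp}) + "\n")


if __name__ == "__main__":
    rows = [{"customer": "c1", "sales": 10}, {"customer": "c2", "sales": 5}]
    input_tests = [
        ("not_empty", lambda d: len(d) > 0),
        ("no_negative_sales", lambda d: all(r["sales"] >= 0 for r in d)),
    ]
    save_results(run_stage_tests("input", input_tests, rows), "test_results.jsonl")
```

Appending rather than overwriting keeps a record that can be compared from one data drop to the next.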

At the end of the day, Analytic work is all just code
Access:
Python Code
Transform:
SQL Code, ETL
Code
Model:
R Code
Visualize:
Tableau
Workbook XML
Report:
Excel File
❷ It’s Just Code: Branch & Merge
Source Code Control

❷ It’s Just Code: Branch & Merge
Source Code Control
Branching & Merging enables people to safely work on their own tasks

Access:
Python Code
Transform:
SQL Code, ETL
Code
Model:
R Code
Visualize:
Tableau
Workbook XML
Report:
Excel File
❸ Use Multiple Environments
Analytic Environment
Your Analytic Work Requires Coordinating Tools and Hardware

❸ Use Multiple Environments
Provide an Analytic Environment for each branch
• Analysts need a controlled environment for their experiments
• Engineers need a place to develop outside of production
• Update Production only after all tests are run!

❹ Modularize & Containerize
Containerize
1. Manage the environment for each
component (e.g. Docker, AMI)
2. Practice Environment Version Control
Modularize
1. Do not create one ‘monolith’ of code
2. Reuse the code and results

❺ Parameterize Your Process
• Parameters and named sets of
parameters will increase your
velocity. With them, you can
vary
• Inputs [you can make a time
machine]
• Outputs
• Steps in the workflow
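A rough Python sketch of named parameter sets that vary inputs, outputs, and workflow steps. Every name and value here (`PARAMETER_SETS`, `dev_time_machine`, the schema names, the step names) is hypothetical.

```python
# Hypothetical named parameter sets; none of these names come from the slides.
PARAMETER_SETS = {
    "production": {
        "input_date": "latest",
        "output_schema": "prod",
        "steps": ["load", "transform", "test", "publish"],
    },
    "dev_time_machine": {
        "input_date": "2016-06-30",  # replay an older data drop
        "output_schema": "dev_sandbox",
        "steps": ["load", "transform", "test"],
    },
}


def resolve_parameters(set_name, **overrides):
    """Return a copy of a named parameter set with optional per-run overrides."""
    params = dict(PARAMETER_SETS[set_name])
    params.update(overrides)
    return params


def plan_run(params):
    """Describe what a run would do; a real pipeline would execute each step."""
    return [
        f"{step}: input_date={params['input_date']} -> schema={params['output_schema']}"
        for step in params["steps"]
    ]
```

Pointing `input_date` at a past drop is the "time machine": the same workflow reruns against older inputs without code changes.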

❻ Use Simple Storage
• Data Lake
• Keep copies of all your raw data in simple, cheap storage
(s3, HDFS)
• Data Restore: Be able to back up and restore your
databases easily
• “My Own Database”: Data Marts On Demand
• Create parameterized variations of your process that
allow you to assemble data for experimentation,
development, and production
[Diagram: Transform steps producing DM, DW, DM]

❻ The Data Lake Pattern
Data
Sources
Data Lake:
Raw
Format
Relevant Data In
Separate Analytic
Environment
Data Supporting
Each Need
Development In
Data Science Team
Development In
Business Analytics
Team
Production
Analytics

❼ Support three workflows
Small Team
Promote directly to production
Feature Branch
Merge back to production branch
Data Governance
3rd party verification before
production merge
Review
Test
Approve

❼ This is the workflow we use
Sprint 1 Sprint 2
f1 f2
f3
main / master / trunk
f5

Teams
Data
Analyst
Data
Engineer
Data
Supplier
Business
Customers

Deliverables
Data
Analyst
Data
Engineer
Data
Supplier
Insights via
Charts, Graphs,
Dashboards,
Models
Organized, quality
checked data set
Data Extracts
Business
Customers

Super Power Mindset
Data
Analyst
Data
Engineer
Data
Supplier
Business
Customers

Timing
Data
Analyst
Data
Engineer
Data
Supplier
Monthly Weekly Daily
Business
Customers

Process Tools
Data
Analyst
Data
Engineer
Data
Supplier
Jira
Confluence
Jira
Confluence
?
Business
Customers
Separate Jira Projects
Common Confluence Wiki

Technical Tools
Data
Analyst
Data
Engineer
Data
Supplier
Tableau Desktop
& Online, Alteryx,
Excel
Amazon Web
Services (AWS):
S3, EC2, Redshift,
DataKitchen (data
workflow &
tests), GIT
RDBMS, MDM,
Salesforce, Excel,
sFTP, etc.
Business
Customers

The Redshift Data Lake Pattern
[Diagram: Data Sources -> Data Lake (data in raw format) -> SQL data transforms -> DW and DMs (data ready for each need)]

Get the data into S3
• Use an EC2 machine to
run a program
• Get data locally (e.g.
with sFTP)
• Use “aws s3” to move
data to S3
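The last step can be sketched in Python as a thin wrapper around the AWS CLI's `aws s3 cp` command. The function names, bucket, and paths are placeholders, and the sketch assumes the AWS CLI is installed and has credentials on the EC2 machine.

```python
import subprocess


def build_s3_copy_command(local_path, bucket, key):
    """Build the argument list for `aws s3 cp <local> s3://<bucket>/<key>`."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]


def upload_to_s3(local_path, bucket, key):
    """Run the copy; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(build_s3_copy_command(local_path, bucket, key), check=True)


if __name__ == "__main__":
    # Placeholder values: a file already fetched locally (e.g. with sFTP).
    print(build_s3_copy_command("/data/in/sales.csv", "example-lake", "raw/sales.csv"))
```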

S3 is the data lake
• Keep all your raw data
• Put dates and sources
in the path & keep a
full history
• If costs get high,
consider Glacier
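One way to put dates and sources in the path, sketched in Python. The `raw/<source>/<yyyy>/<mm>/<dd>/` layout and the function names are assumptions, not a layout prescribed by the talk.

```python
from datetime import date


def lake_key(source, drop_date, filename):
    """Build an S3 key that records the source and the delivery date."""
    return f"raw/{source}/{drop_date:%Y/%m/%d}/{filename}"


def latest_key(keys):
    """Zero-padded dates sort lexicographically, so max() finds the newest drop."""
    return max(keys)


if __name__ == "__main__":
    print(lake_key("syndicated", date(2016, 9, 23), "sales.csv"))
```

Because every drop gets its own key, nothing is overwritten and the full history stays available for reruns.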

Transform your data into target schemas
• Redshift “copy”
statement moves data
from s3 to a redshift
table
• Do your transforms in
SQL
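A hedged Python sketch that generates a Redshift COPY statement and holds one SQL transform. The table names, columns, S3 URI, and IAM role are hypothetical, and the CSV options are one reasonable choice rather than the settings used in the case study.

```python
def copy_statement(table, s3_uri, iam_role):
    """Return a Redshift COPY statement that loads CSV files from S3 into a table.

    Values are interpolated directly, so pass only trusted identifiers.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# A transform expressed in SQL: raw table -> target (star schema) fact table.
# Table and column names are hypothetical.
TRANSFORM_SQL = """
INSERT INTO target.fact_sales (customer_id, product_id, week_ending, units)
SELECT customer_id, product_id, week_ending, SUM(units)
FROM raw.sales
GROUP BY customer_id, product_id, week_ending;
""".strip()


if __name__ == "__main__":
    print(copy_statement(
        "raw.sales",
        "s3://example-lake/raw/syndicated/2016/09/23/",
        "arn:aws:iam::123456789012:role/example-redshift-copy",
    ))
    print(TRANSFORM_SQL)
```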

Transform your data into target schemas
• Analysts hit Redshift
with their favorite
tools.
• Scale to petabytes
• Types of tables in Redshift
• Raw
• QA
• Target Schema (e.g. Star)

Environments for BI
• Environments can be
separate redshift
clusters or “schemas”
in the same redshift
cluster
• Test/preview major
changes to data
warehouse
• Experiments
• Feature work
• Prevent warehouse values
from changing during
development
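For the schema-per-environment option, a small Python sketch that derives a schema name from a Git branch and emits the setup SQL. The `env_` prefix and the naming rule are assumptions; separate clusters would need a different mechanism.

```python
import re


def schema_for_branch(branch):
    """Map a branch name to a safe schema name, e.g. 'feature/f1' -> 'env_feature_f1'."""
    cleaned = re.sub(r"[^a-z0-9_]", "_", branch.lower())
    return f"env_{cleaned}"


def environment_setup_sql(branch):
    """SQL that creates the branch's schema and points the session at it."""
    schema = schema_for_branch(branch)
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema};",
        f"SET search_path TO {schema};",
    ]
```

With the search path set, the same transform SQL can run unchanged against whichever environment the session points to.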

Example 1
There is a new business
question that requires
a large new data set.

Data Supplier Team
1. New data set needed: Monthly Market Surveys
2. Design the survey
3. Run the market survey
4. Collect and clean the data
5. Deliver the data set via sFTP
This often takes uncompressible calendar time

Data Engineering Team
1. Make a “scrappy star” in a data mart
2. Send questions to Supplier and keep Analysts in the loop
3. Add data tests – enables speed with quality
4. Share star with Data Analyst team and get feedback
5. Iterate (several Agile sprints)
6. Release “solid star”

Data Analytic Team
1. Make “scrappy” dashboards
2. Provide feedback to Data Engineering team
3. Show early dashboards to users
4. Have active build / design sessions
• Make as many changes live as possible
5. Publish production dashboards when they are 70% there
• Update changes via Tableau Online

Example 2
There is a new business
question that requires
a new Excel file.

Data Analytic Team
• New data set arrives from Sales Ops
• Use Alteryx to blend new data with solid star
• Publish with Tableau Online
• Gauge adoption
• Eventually, provide the Alteryx script as requirements to the Data Engineering team

Example 3
The number of customers
drops from 1000 to 900
and some reps in the field
wonder what happened.

Data Engineering Team
ISSUE: 1000 -> 900 customers
1. Investigate the root cause of the issue and why it was not detected
at data assembly time.
2. Root cause: Issue with production of supplier file.
3. Not detected: Test fails for 800 or fewer customers.
4. Change the test to fail with a 5% variation from the last data drop.
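The change in step 4 can be sketched in Python as follows. The function names are illustrative, and treating a change of exactly 5% as passing is an assumption, since the slide does not say which side of the boundary fails.

```python
def old_test(customer_count, floor=800):
    """Original rule: fail only at 800 or fewer customers."""
    return customer_count > floor


def variation_test(customer_count, previous_count, tolerance=0.05):
    """Revised rule: fail when the count moves more than 5% from the last drop."""
    if previous_count == 0:
        return customer_count == 0
    change = abs(customer_count - previous_count) / previous_count
    return change <= tolerance


if __name__ == "__main__":
    # The drop from 1000 to 900 is a 10% change.
    print(old_test(900))              # passes, so the issue went undetected
    print(variation_test(900, 1000))  # fails, so the issue is caught at assembly time
```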

Lessons Learned
• Culture Change: directionally correct, 70% right the first time
• Process Duality: Requires focus on both Agile Processes and Analytic
Operations
• Focus: Know Your Customers and Make Them a Hero
• Speed Trumps Errors: Find, admit and fix errors quickly,
retrospectives

Results
• Reduced time to insight
• Improved analytic quality
• Lowered the marginal cost to ask the next business question
• Improved analytic team satisfaction and morale
• Perceived by industry as a very successful launch
• Team promotions!
