Analytic Processes at Precima
Overview
 Analytics pipeline design considerations
 The old way
 The current way
 Looking to the future
Pipeline Design Considerations
 The product pipeline should be easy to test, debug and monitor, with clear solutions for
replaying, rerunning and interrupting tasks or dataflows in a production-ready pipeline.
 Several teams are involved in the product pipeline (e.g., security, development and
support); however, there is a clear chain of responsibility and a protocol for
when things go wrong.
 Pipeline designs are reviewed against business/stakeholder use cases, and our
pipelines are designed to be highly configurable and scalable.
2 Years Ago – Legacy Stack in a Data Center
cron
SAS Enterprise Guide and Scripting
Shell Scripting and Crontab Scheduling
Current Analytics Stack in AWS
Amazon S3
Control-M
Luigi and Control-M
Control-M Scheduler
 Coordinate dependencies between disparate
servers and platforms
 Central dashboard of execution status
 We have gone from a handful of servers to
hundreds
 Understand what runs when and for how long
 Comparison of jobs to historical runtimes
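Since the stack pairs Control-M with Luigi, a minimal Luigi sketch (task names, paths and dates are hypothetical) illustrates how dependencies between jobs are declared so the scheduler can rerun only what is missing:

```python
import datetime

import luigi


class ExtractSales(luigi.Task):
    """Hypothetical upstream task that lands a raw daily extract."""
    run_date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/sales_{self.run_date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("store_id,sales\n")  # placeholder for the real extract logic


class BuildSalesReport(luigi.Task):
    """Declares its dependency; Luigi skips any task whose output already exists."""
    run_date = luigi.DateParameter()

    def requires(self):
        return ExtractSales(run_date=self.run_date)

    def output(self):
        return luigi.LocalTarget(f"data/reports/sales_{self.run_date}.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(f"rows: {sum(1 for _ in src) - 1}\n")


if __name__ == "__main__":
    luigi.build([BuildSalesReport(run_date=datetime.date.today())], local_scheduler=True)
```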
Spot Fleet
 Run hundreds of
independent jobs
concurrently
 Each job gets its own server
 Compute cost is about
$0.10/hr
 Shared storage
 Servers automatically
shutdown when jobs
complete
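As a rough sketch of this pattern rather than our exact setup, the following boto3 call launches a single one-time spot instance for one job; the AMI ID, instance type and job script are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The user-data script runs the job and then shuts the machine down; with a
# one-time spot request the instance is not needed again afterwards.
user_data = """#!/bin/bash
/opt/jobs/run_job.sh s3://example-bucket/configs/job-42.json
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI baked with the job's dependencies
    InstanceType="m5.large",           # placeholder; roughly the ~$0.10/hr spot price class
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={            # ask for spot capacity instead of on-demand
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
    UserData=user_data,
)
print("launched", response["Instances"][0]["InstanceId"])
```

For launching hundreds of such jobs at once, the EC2 request_spot_fleet API manages the whole group of instances from a single request; the snippet above only shows the one-job/one-server idea.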
Redshift
Pros
 Flexibility over Data Center
 Very quick to onboard new clients
 Can provide very fast query times
over large datasets
Cons
 Concurrency issues – the leader node can become a bottleneck
 Inconsistent job runtimes based
on overall workloads
 Need to scale for largest expected
workload
 Storage coupled to compute
 Not quick to scale
 AWS Only
Future Precima Analytics Stack
Amazon S3
Control-M
Databricks and Snowflake
 Databricks for Data Pipelines and Data Science
 Snowflake for high performance data warehouse queries
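As a minimal sketch of the warehouse side, a query issued through the snowflake-connector-python package; the account, credentials and table schema are placeholders:

```python
import snowflake.connector

# Placeholder credentials/identifiers; in practice these come from a secrets manager.
conn = snowflake.connector.connect(
    account="example_account",
    user="analytics_svc",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RETAIL",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # High-performance warehouse query against a large (hypothetical) fact table.
    cur.execute(
        "SELECT store_id, SUM(sales_amount) AS total_sales "
        "FROM fact_sales GROUP BY store_id ORDER BY total_sales DESC LIMIT 10"
    )
    for store_id, total_sales in cur.fetchall():
        print(store_id, total_sales)
finally:
    conn.close()
```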
Benefits
 Decouple compute from storage
 Jobs don’t interfere with each other
 Virtually unlimited compute scaling
 Virtually unlimited low cost storage
 Spot pricing for nodes
 Time Travel features allow for repeatable, fast dry runs on live or nearly live data (see the sketch after this list)
 Notebook interface including Python, SQL, Scala, R and Markdown for comments
 Multi-cloud support
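The Time Travel bullet could refer to either Snowflake's or Delta Lake's feature; as one concrete illustration, a Delta Lake read pinned to an earlier version of a hypothetical table from a Databricks notebook:

```python
from pyspark.sql import SparkSession

# On Databricks the `spark` session already exists; this line only matters when
# running the sketch elsewhere with the Delta Lake libraries installed.
spark = SparkSession.builder.getOrCreate()

table_path = "s3://example-bucket/delta/transactions"  # hypothetical Delta table location

# Read the table as of a fixed version so a dry run is repeatable even while
# new data keeps landing in the live table.
snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 42)          # or .option("timestampAsOf", "2019-01-01")
    .load(table_path)
)

print(snapshot.count())
```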
Vision for the Future
 ETL with Databricks Spark jobs built using object-oriented Python (see the first sketch after this list)
 Take advantage of inheritance and configuration
 Quickly map new data feeds to our standard data model for our Precima products
 Built-in validation and conversion for data fields
 DRY – don't repeat yourself
 Data Science pipeline using Databricks notebook workflows
 Notebook workflows allow a user to run one notebook from within another, so users can
chain notebooks that represent key ETL steps, Spark analysis steps, or ad hoc
exploration. However, this approach lacks the ability to build more complex data pipelines.
 Airflow provides tight integration with Databricks, and Luigi also provides an interface for
running Apache Spark jobs (see the orchestration sketch after this list).
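A minimal sketch of the object-oriented ETL idea: a base loader holds the configuration-driven rename, cast and validation logic, and mapping a new client feed is just a subclass with its own column map. All class, column and path names are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks


class FeedLoader:
    """Base loader: config-driven rename, cast and validation shared by all feeds."""

    # Subclasses override: raw column name -> (standard name, target type)
    column_map = {}

    def __init__(self, source_path: str):
        self.source_path = source_path

    def read(self) -> DataFrame:
        return spark.read.option("header", "true").csv(self.source_path)

    def standardize(self, df: DataFrame) -> DataFrame:
        for raw, (std, dtype) in self.column_map.items():
            df = df.withColumn(std, F.col(raw).cast(dtype))
        return df.select(*[std for std, _ in self.column_map.values()])

    def validate(self, df: DataFrame) -> DataFrame:
        # Shared, built-in validation: drop rows missing any standard field.
        return df.dropna(subset=[std for std, _ in self.column_map.values()])

    def run(self) -> DataFrame:
        return self.validate(self.standardize(self.read()))


class ClientASalesFeed(FeedLoader):
    """Mapping a new client feed is just configuration, keeping the pipeline DRY."""
    column_map = {
        "STORE_NBR": ("store_id", "int"),
        "TXN_DT": ("transaction_date", "date"),
        "SALES_AMT": ("sales_amount", "double"),
    }


standardized = ClientASalesFeed("s3://example-bucket/raw/client_a/sales/").run()
```

And a sketch of the orchestration option mentioned above, using Airflow's DatabricksSubmitRunOperator; the import path shown is the one used by recent Airflow releases with the Databricks provider installed, and the notebook path, cluster spec and connection ID are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="precima_feed_pipeline",            # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    standardize_feed = DatabricksSubmitRunOperator(
        task_id="standardize_client_a_feed",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "5.2.x-scala2.11",  # placeholder runtime version
            "node_type_id": "m5.large",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Shared/etl/standardize_client_a"},
    )
```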
Pipeline Design Considerations
 The product pipeline should be easy to test, debug and monitor, with clear solutions for
replaying, rerunning and interrupting tasks or dataflows in a production-ready pipeline.
 Workflow management frameworks have helped us achieve most of the desired features
of a data pipeline.
 Several teams are involved in the product pipeline (e.g., security, development and
support); however, there is a clear chain of responsibility and a protocol for
when things go wrong.
 Pipeline designs are reviewed against business/stakeholder use cases, and our
pipelines are designed to be highly configurable and scalable.
 The move to AWS unlocked our ability to scale
 Moving toward options that decouple storage from compute in order to scale efficiently
 Have made good progress on embracing configuration
 Moving toward fully configurable
Appendix: Qualities of Ideal Data Pipelines
The desired qualities of a data pipeline include:
 Idempotent with state handling
 Scalable and resilient
 Replaceable or programmable
 Testable and traceable
 Documented and automated
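As a small illustration of the first two points, a sketch of an idempotent task with explicit state handling: a completion marker and an atomic rename make replays and interrupted runs safe. Paths and helper names are hypothetical.

```python
import shutil
from pathlib import Path


def run_idempotent(job_id: str, input_path: Path, output_dir: Path) -> Path:
    """Rerunning with the same job_id after a crash or replay is safe."""
    output = output_dir / f"{job_id}.csv"
    marker = output_dir / f"{job_id}._SUCCESS"

    if marker.exists():                # state handling: work already done, skip it
        return output

    output_dir.mkdir(parents=True, exist_ok=True)
    tmp = output_dir / f"{job_id}.csv.tmp"
    shutil.copyfile(input_path, tmp)   # stand-in for the real transformation
    tmp.replace(output)                # atomic rename: readers never see partial output

    marker.write_text("done")          # only written once the output is fully in place
    return output
```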
