Analytic Processes at Precima
Overview
 Analytics pipeline design considerations
 The old way
 The current way
 Looking to the future
Pipeline Design Considerations
 The product pipeline should be easy to test, debug and monitor, with clear solutions for
replaying, rerunning and interrupting tasks or dataflows in a production-ready pipeline.
 Several teams are involved in the product pipeline (e.g., security, development and
support); however, there is a clear chain of responsibility and a protocol for
when things go wrong.
 Pipeline designs are reviewed against business/stakeholder use cases, and our
pipelines are designed to be highly configurable and scalable.
2 Years Ago – Legacy Stack in a Data Center
cron
SAS Enterprise Guide and Scripting
Shell Scripting and Crontab Scheduling
Current Analytics Stack in AWS
Amazon S3
Control-M
Luigi and Control-M
Control-M Scheduler
 Coordinate dependencies between disparate
servers and platforms
 Central dashboard of execution status
 We have gone from a handful of servers to
hundreds
 Understand what runs when and for how long
 Comparison of jobs to historical runtimes
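Since the stack pairs Control-M with Luigi, a minimal Luigi sketch (task names, paths and dates are hypothetical) illustrates how dependencies between jobs are declared so the scheduler can rerun only what is missing:

```python
import datetime

import luigi


class ExtractSales(luigi.Task):
    """Hypothetical upstream task that lands a raw daily extract."""
    run_date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/sales_{self.run_date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("store_id,sales\n")  # placeholder for the real extract logic


class BuildSalesReport(luigi.Task):
    """Declares its dependency; Luigi skips any task whose output already exists."""
    run_date = luigi.DateParameter()

    def requires(self):
        return ExtractSales(run_date=self.run_date)

    def output(self):
        return luigi.LocalTarget(f"data/reports/sales_{self.run_date}.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(f"rows: {sum(1 for _ in src) - 1}\n")


if __name__ == "__main__":
    luigi.build([BuildSalesReport(run_date=datetime.date.today())], local_scheduler=True)
```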
Spot Fleet
 Run hundreds of
independent jobs
concurrently
 Each job gets its own server
 Compute cost is about
$0.10/hr
 Shared storage
 Servers automatically
shutdown when jobs
complete
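As a rough sketch of this pattern rather than our exact setup, the following boto3 call launches a single one-time spot instance for one job; the AMI ID, instance type and job script are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The user-data script runs the job and then shuts the machine down; with a
# one-time spot request the instance is not needed again afterwards.
user_data = """#!/bin/bash
/opt/jobs/run_job.sh s3://example-bucket/configs/job-42.json
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI baked with the job's dependencies
    InstanceType="m5.large",           # placeholder; roughly the ~$0.10/hr spot price class
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={            # ask for spot capacity instead of on-demand
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
    UserData=user_data,
)
print("launched", response["Instances"][0]["InstanceId"])
```

For launching hundreds of such jobs at once, the EC2 request_spot_fleet API manages the whole group of instances from a single request; the snippet above only shows the one-job/one-server idea.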
Redshift
Pros
 Flexibility over Data Center
 Very quick to onboard new clients
 Can provide very fast query times
over large datasets
Cons
 Concurrency issues – the leader node can become a bottleneck
 Inconsistent job runtimes based
on overall workloads
 Need to scale for largest expected
workload
 Storage coupled to compute
 Not quick to scale
 AWS Only
Future Precima Analytics Stack
Amazon S3
Control-M
Databricks and Snowflake
 Databricks for Data Pipelines and Data Science
 Snowflake for high performance data warehouse queries
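As a minimal sketch of the warehouse side, a query issued through the snowflake-connector-python package; the account, credentials and table schema are placeholders:

```python
import snowflake.connector

# Placeholder credentials/identifiers; in practice these come from a secrets manager.
conn = snowflake.connector.connect(
    account="example_account",
    user="analytics_svc",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RETAIL",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # High-performance warehouse query against a large (hypothetical) fact table.
    cur.execute(
        "SELECT store_id, SUM(sales_amount) AS total_sales "
        "FROM fact_sales GROUP BY store_id ORDER BY total_sales DESC LIMIT 10"
    )
    for store_id, total_sales in cur.fetchall():
        print(store_id, total_sales)
finally:
    conn.close()
```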
Benefits
 Decouple compute from storage
 Jobs don’t interfere with each other
 Virtually unlimited compute scaling
 Virtually unlimited low cost storage
 Spot pricing for nodes
 Time Travel features allow for repeatable, fast dry runs on live or nearly live data (see the sketch after this list)
 Notebook interface including Python, SQL, Scala, R and Markdown for comments
 Multi-cloud support
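The Time Travel bullet could refer to either Snowflake's or Delta Lake's feature; as one concrete illustration, a Delta Lake read pinned to an earlier version of a hypothetical table from a Databricks notebook:

```python
from pyspark.sql import SparkSession

# On Databricks the `spark` session already exists; this line only matters when
# running the sketch elsewhere with the Delta Lake libraries installed.
spark = SparkSession.builder.getOrCreate()

table_path = "s3://example-bucket/delta/transactions"  # hypothetical Delta table location

# Read the table as of a fixed version so a dry run is repeatable even while
# new data keeps landing in the live table.
snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 42)          # or .option("timestampAsOf", "2019-01-01")
    .load(table_path)
)

print(snapshot.count())
```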
Vision for the Future
 ETL with Databricks Spark jobs built using object-oriented Python (see the first sketch after this list)
 Take advantage of inheritance and configuration
 Quickly map new data feeds to our standard data model for our Precima products
 Built-in validation and conversion for data fields
 DRY – don't repeat yourself
 Data Science pipeline using Databricks notebook workflows
 Notebook workflows allow a user to run one notebook from within another, so users can
chain notebooks that represent key ETL steps, Spark analysis steps, or ad hoc
exploration. However, this approach lacks the ability to build more complex data pipelines.
 Airflow provides tight integration with Databricks, and Luigi also provides an interface for
running Apache Spark jobs (see the orchestration sketch after this list).
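A minimal sketch of the object-oriented ETL idea: a base loader holds the configuration-driven rename, cast and validation logic, and mapping a new client feed is just a subclass with its own column map. All class, column and path names are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks


class FeedLoader:
    """Base loader: config-driven rename, cast and validation shared by all feeds."""

    # Subclasses override: raw column name -> (standard name, target type)
    column_map = {}

    def __init__(self, source_path: str):
        self.source_path = source_path

    def read(self) -> DataFrame:
        return spark.read.option("header", "true").csv(self.source_path)

    def standardize(self, df: DataFrame) -> DataFrame:
        for raw, (std, dtype) in self.column_map.items():
            df = df.withColumn(std, F.col(raw).cast(dtype))
        return df.select(*[std for std, _ in self.column_map.values()])

    def validate(self, df: DataFrame) -> DataFrame:
        # Shared, built-in validation: drop rows missing any standard field.
        return df.dropna(subset=[std for std, _ in self.column_map.values()])

    def run(self) -> DataFrame:
        return self.validate(self.standardize(self.read()))


class ClientASalesFeed(FeedLoader):
    """Mapping a new client feed is just configuration, keeping the pipeline DRY."""
    column_map = {
        "STORE_NBR": ("store_id", "int"),
        "TXN_DT": ("transaction_date", "date"),
        "SALES_AMT": ("sales_amount", "double"),
    }


standardized = ClientASalesFeed("s3://example-bucket/raw/client_a/sales/").run()
```

And a sketch of the orchestration option mentioned above, using Airflow's DatabricksSubmitRunOperator; the import path shown is the one used by recent Airflow releases with the Databricks provider installed, and the notebook path, cluster spec and connection ID are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="precima_feed_pipeline",            # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    standardize_feed = DatabricksSubmitRunOperator(
        task_id="standardize_client_a_feed",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "5.2.x-scala2.11",  # placeholder runtime version
            "node_type_id": "m5.large",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Shared/etl/standardize_client_a"},
    )
```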
Pipeline Design Considerations
 The product pipeline should be easy to test, debug and monitor, with clear solutions for
replaying, rerunning and interrupting tasks or dataflows in a production-ready pipeline.
 Workflow management frameworks have helped us achieve most of the desired features
of a data pipeline.
 Several teams are involved in the product pipeline (e.g., security, development and
support); however, there is a clear chain of responsibility and a protocol for
when things go wrong.
 Pipeline designs are reviewed against business/stakeholder use cases, and our
pipelines are designed to be highly configurable and scalable.
 The move to AWS unlocked our ability to scale
 Moving toward options that decouple storage from compute in order to scale efficiently
 Have made good progress on embracing configuration
 Moving toward fully configurable
Appendix: Qualities of Ideal Data Pipelines
The desired qualities of a data pipeline include:
 Idempotent with state handling
 Scalable and resilient
 Replaceable or programmable
 Testable and traceable
 Documented and automated
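As a small illustration of the first two points, a sketch of an idempotent task with explicit state handling: a completion marker and an atomic rename make replays and interrupted runs safe. Paths and helper names are hypothetical.

```python
import shutil
from pathlib import Path


def run_idempotent(job_id: str, input_path: Path, output_dir: Path) -> Path:
    """Rerunning with the same job_id after a crash or replay is safe."""
    output = output_dir / f"{job_id}.csv"
    marker = output_dir / f"{job_id}._SUCCESS"

    if marker.exists():                # state handling: work already done, skip it
        return output

    output_dir.mkdir(parents=True, exist_ok=True)
    tmp = output_dir / f"{job_id}.csv.tmp"
    shutil.copyfile(input_path, tmp)   # stand-in for the real transformation
    tmp.replace(output)                # atomic rename: readers never see partial output

    marker.write_text("done")          # only written once the output is fully in place
    return output
```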
