
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019

Greenplum Summit 2019
Sridhar Paladugu
Pivotal ML Engineer MADlib Flow Project Lead

Published in: Software

  1. Operationalizing AI at scale using MADlib Flow. Sridhar Paladugu, Pivotal ML Engineer, MADlib Flow Project Lead. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.
  2. Agenda
     ● Introductions
     ● Data Science Process
     ● Model Operationalization
     ● Introduction to MADlib Flow
     ● Case Study: AI for transaction fraud
     ● Live demo!
     ● Q&A
  3. Data Science Process (cycle diagram labels): Model Evaluation, Operationalization, Model Building, Feature Engineering, Data Review, User Feedback, Problem Definition, Setup
  4. Model Operationalization: Model operationalization is the process of deploying data science models to production for ongoing use by other software. (Shown on the data science process cycle diagram.)
  5. Where Most AI & ML Projects Fail: Model operationalization is where the majority of artificial intelligence initiatives fail. (Shown on the data science process cycle diagram.)
  6. Common Challenges With Operationalizing Models (shown on the data science process cycle diagram)
     ● Handling production data
     ● Engineering for scale and performance
     ● Model transportation
     ● Managing and orchestrating deployed models
     ● Data scientists are not developers or platform experts
  7. Patterns For Operationalizing Models
     ● Batch training, batch inference (~40% of today's use cases). Example: Tax Return Fraud. Score a database of tax returns on a nightly basis to flag likely fraudulent returns for audit. PostgreSQL/Greenplum with MADlib supports this pattern.
     ● Event-driven training, event-driven inference (<5% of today's use cases). Example: Online Advertising. Maximize click-through rate by algorithmically selecting and testing advertisement placement in real time. Highly specialized, with a low number of enterprise use cases.
     ● Batch training, event-driven inference (~55% of today's use cases, growing). Example: Real Time Transaction Fraud. Train an ML model on historical data to classify, in real time, whether new credit/debit transactions are likely to be fraudulent. PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern.
  8. Existing Approaches To Model Operationalization Have Failed
     Diagram labels: Data science, Engineering, Production, Model persistence
     Approaches:
     1. Rewrite the code
     2. Universal markup model language (PMML and PFA)
     Most models never make it to production.
  9. MADlib Flow: Model deployment, orchestration and management
  10. Containerized Deployment Of Models
      Containerized deployment of Apache MADlib machine learning workflows for low-latency, event-driven inference and scale. A single command deploys a MADlib-trained model from GPDB/Postgres to Docker, PCF or Kubernetes:
      $ madlibflow --deploy --target kubernetes --type model
      Key benefits of MADlib Flow
      ● Easy to deploy & lightweight
      ● Highly scalable REST and streaming
      ● End-to-end SQL workflow
      ● Low-latency inference/predictions
      ● Feature transformations
  11. AI For The PostgreSQL Community
      Standardized end-to-end data science in SQL with the Greenplum/Postgres stack. Artificial Intelligence: Closed Loop Machine Learning
      ● Experimentation: initial code development and testing, model experimentation on samples
      ● Modeling at Scale: heavy compute tasks such as model training across big data
      ● Deployment: production deployment of models to feed downstream applications and reports
  12. Model Deployment With MADlib Flow
      User Operations
      1. ML Training: train the ML model in Postgres or Greenplum using Apache MADlib
      2. madlibflow --deploy: set configs in .yml and deploy the model from Greenplum to Docker, PCF or Kubernetes
      Automated Backend Operations
      3. Docker pull: pull Docker containers with optimized Postgres and MADlib
      4. Pull Model: extract the model and feature table schema layout from the Greenplum database
      5. Load Model: load the model and feature table schema into optimized Postgres
      6. Deploy: deploy the Docker container to the target environment
  13. MADlib Flow Components
  14. Demo: Deploy A Model
      High-level steps
      ● Connect to Greenplum
      ● Load data
      ● Build and train a model
      ● Deploy and test model on Greenplum
      ● Deploy model using MADlib Flow to GKE
      ● Test
  15. Case Study: Credit Card Transaction Fraud Model
      Diagram labels: Transactions Topic, Greenplum, Scored Transactions Topic, PCF [PKS]
  16. Event Scoring with Greenplum and Containerized Postgres
      Architecture diagram labels: MADlib REST (Scoring); Scoring Decision (JSON); New Transaction Event (JSON); Read Features; Read Features from DB (scheduled); Cache Features; Join Event & Feature Data; Bootstrap MADlib Model; Cache Manager; Update DB w/ Scoring Decision; MADlib Flow Returns Scoring Decision to GPDB; Load New Transaction Event; REST API; Update DB w/ New Event; Feature Engine; MADlib Flow (Orchestrator); Pivotal Cloud Cache; Scoring Decision
  17. Cache Loader and Feature Engine Services
      Automated deployment of scalable, low-latency, end-to-end ML pipelines ("Data Science Ops.")
      ● Components: MADlibFlow, Greenplum Database, Cache Loader, Cache (Gemfire, PCC, Redis, etc.) behind a Cache Abstraction, Feature Engine, MADlib REST
      ● No code conversion: engineer features and populate the cache in SQL
        SELECT create_accounts();
        SELECT create_merchants();
      ● Cached tables
        Accounts: credit_card_number, num_transactions_30days, max_transactions_30days
        Merchants: merchant_id, num_fraud_cases, avg_transaction_amount_30days
      ● Join data from the incoming message with cached data
        SELECT mch.*, acct.*, log(msg.transaction_amt + 1) AS log_transaction_amt
        FROM message msg
        JOIN merchants mch ON msg.merchant_id = mch.merchant_id
        JOIN accounts acct ON msg.credit_card_number = acct.credit_card_number;
      ● Credit/Debit Card Transaction (Input) Message
        { "transaction_ts": , "credit_card_number": , "transaction_amt": , "merchant_id": }
      ● Approved Credit/Debit Card Transaction (Output) Message
        { "transaction_ts": , "transaction_amt": , "credit_card_number": , "num_transactions_30days": , "max_transactions_30days": , "merchant_id": , "num_fraud_cases": , "avg_transaction_amount_30days": , "fraud_risk_score": 0.92, "approved": True }
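The join on slide 17 can be sketched outside the database to show what the feature engine does with one incoming message. This is only an illustration under stated assumptions: the dictionaries below are hypothetical stand-ins for the cached Accounts and Merchants tables, the sample values are made up, and `build_feature_row` is not a MADlib Flow function. Base-10 `math.log10` is used because PostgreSQL's single-argument `log()` is a base-10 logarithm.

```python
import math

# Hypothetical in-memory stand-ins for the cached Accounts and Merchants tables.
accounts_cache = {
    "4111-0000-0000-0001": {
        "num_transactions_30days": 12,
        "max_transactions_30days": 40,
    },
}
merchants_cache = {
    "m-42": {
        "num_fraud_cases": 3,
        "avg_transaction_amount_30days": 57.5,
    },
}


def build_feature_row(message, accounts, merchants):
    """Join one incoming message with cached features (inner-join semantics)."""
    acct = accounts.get(message["credit_card_number"])
    mch = merchants.get(message["merchant_id"])
    if acct is None or mch is None:
        # An inner JOIN produces no row when either key has no match.
        return None
    row = {"merchant_id": message["merchant_id"]}
    row.update(mch)
    row["credit_card_number"] = message["credit_card_number"]
    row.update(acct)
    # Mirrors log(msg.transaction_amt + 1) in the slide's SQL (base 10).
    row["log_transaction_amt"] = math.log10(message["transaction_amt"] + 1)
    return row
```

Calling `build_feature_row` with a message whose `merchant_id` or `credit_card_number` is not in the cache returns `None`, which matches the inner-join behaviour of the SQL: an event with no cached features yields no feature row.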
  18. Event Scoring with Greenplum and Containerized Postgres: Demo Application Components
      ● Kafka Broker + Topic Feeder
      ● Greenplum
      ● Kafka Stream (2) + Redis Cache
      ● Kubernetes
      ● MADlib Flow
      ● Angular Dashboard + Redis Client
      ● Jupyter Notebook
  19. Event Scoring with Greenplum and Containerized Postgres
      1. Credit fraud model building on Greenplum
      2. Credit fraud model deployment with MADlib Flow
         a. Flow contains
            i. Model
            ii. Feature engine
            iii. Features cache (refreshable via REST)
      3. Kafka stream processing for event scoring
         a. One Kafka producer
         b. One Kafka streams consumer
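The producer and consumer arrangement in step 3 can be pictured with a small single-process sketch. Everything here is an assumption made for illustration: standard-library queues stand in for the Transactions and Scored Transactions Kafka topics, `score` is a placeholder logistic function with made-up coefficients rather than a MADlib model, and in the deck's architecture the scoring call would go to the deployed MADlib Flow REST endpoint instead.

```python
import json
import math
import queue


def score(event):
    """Placeholder scorer: a logistic curve over the transaction amount.

    The coefficients are invented for this sketch, not taken from a model.
    """
    z = 0.01 * event["transaction_amt"] - 5.0
    return 1.0 / (1.0 + math.exp(-z))


def produce(topic, events):
    """Producer: publish each transaction event as a JSON string."""
    for event in events:
        topic.put(json.dumps(event))


def consume(in_topic, out_topic, threshold=0.5):
    """Consumer: read events, attach a risk score and decision, republish."""
    while not in_topic.empty():
        event = json.loads(in_topic.get())
        event["fraud_risk_score"] = score(event)
        event["approved"] = event["fraud_risk_score"] < threshold
        out_topic.put(json.dumps(event))


def run_pipeline(events, threshold=0.5):
    """Wire one producer and one consumer together and return scored events."""
    transactions_topic = queue.Queue()         # stand-in for the input topic
    scored_transactions_topic = queue.Queue()  # stand-in for the output topic
    produce(transactions_topic, events)
    consume(transactions_topic, scored_transactions_topic, threshold)
    count = scored_transactions_topic.qsize()
    return [json.loads(scored_transactions_topic.get()) for _ in range(count)]
```

The scored events carry the same `fraud_risk_score` and `approved` fields as the output message on slide 17; a real deployment would replace the queues with Kafka clients and the placeholder scorer with a request to the scoring service.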
  20. Event Scoring with Greenplum and Containerized Postgres: Show me the Demo!
  21. Roadmap
      ● Container and service endpoint security
      ● Support for Python model deployments via PL/Python
      ● Support for R model deployments via PL/R
      ● Support for deep learning modules (TensorFlow, PyTorch)
      ● A comprehensive model management UI
        ○ Model versioning and updates
        ○ Champion / challenger testing
      ● Target GA is late Spring
  22. #ScaleMatters
