
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Flink Throughout an Operationalized Streaming ML Lifecycle"

Operationalizing Machine Learning models is never easy. Our team at Comcast has been challenged with operationalizing predictive ML models to improve customer care experiences. Using Apache Flink we have been able to apply real-time streaming to all aspects of the Machine Learning lifecycle. This includes data feature exploration and preparation by data scientists, deploying live models to serve near-real-time predictions, and validating results for model retraining and iteration. We will share best practices and lessons learned from Flink’s role in our operationalized lifecycle including:
• Executing as the “Prediction Pipeline” – a model container environment for near-real-time streaming and batch predictions
• Preparing streaming features and data sets for model training, as input for production model predictions, and for a continually-updated customer context
• Using connected streams and savepoints for “Live in the Dark”, multi-variant testing, and validation scenarios
• Incorporating Flink’s Queryable State as an approach to the online “Feature Store” – a data catalog for reuse by multiple models and use cases
• Enabling versioned models, versioned feature sets, and versioned data through DevOps approaches.


  1. EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE. Dave Torok, Senior Principal Architect; Sameer Wadkar, Senior Principal Architect. 10 April 2018.
  2. INTRODUCTION AND BACKGROUND
     • Customer Experience team
     • 27 million customers (high-speed data, video, voice, home security, mobile)
     • Ingesting about 2 billion events per month, including some high-volume machine-generated events
     • Typical streaming data architecture: data ETL landing in a time-series data lake
     • Grew from a few dozen to 150+ data sources / feeds in about a year
     Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
  3. BUSINESS PROBLEM
     • Increase positive customer experiences
     • Resolve potential issues correctly and quickly
     • Predict and diagnose service trouble across multiple knowledge domains
     • Reduce costs through earlier resolution and by reducing avoidable technician visits
  4. TECHNICAL PROBLEM
     • Multiple programming and data science environments
     • Widespread and discordant data sources
     • The "data plane" problem: combining data at rest and data in motion
     • ML versioning: data, code, features, models
  5. SOLUTION MOTIVATION
     • Self-service platform
     • Align data scientists and production
     • Models treated as code
     • High-throughput stream platform
  6. MACHINE LEARNING LIFECYCLE
     Use case definition → feature exploration/engineering → model training → model evaluation → model artifact delivery (POJO/Docker) → model selection → model operationalization → model performance monitoring on live data (A/B and multivariate testing) → push model to production → retrain model on newer data
  7. EXAMPLE NEAR-REAL-TIME PREDICTION USE CASE
     • Customer runs a "speed test"
     • The event triggers a prediction flow
     • Enrich with network health and other indicators
     • Execute the ML model
     • Predict whether it is a WiFi, modem, or network issue
     Flow: event → detect (slow speed?) → enrich (additional context services) → gather data (network diagnostic services) → predict (ML model, run prediction) → act / notify → engage customer
  8. ML PIPELINE ARCHITECTURE PRINCIPLES
     • Metadata driven: feature/model definition, versioning, feature assembly, model deployment, and model monitoring are all metadata driven
     • Automation: orchestrated deployment for new features and models
     • Rapid onboarding: portal for model and feature management as well as model deployment
     • Data consistency: the feature store enforces a consistent data pipeline, ensuring that the data used for training is functionally identical to the data used for predictions
     • Monitoring and metrics: ability to execute and monitor multiple models in production, enabling real-time, metrics-driven model selection
     • Iterative/consistent model development: multiple versions of a model can be developed iteratively while consuming a consistent dataset (the feature store), enabling A/B and multivariate testing
  9. ML PIPELINE: ROLES AND WORKFLOW
     Phases: inception → exploration → model development → candidate model selection → model operationalization → model evaluation → go live
     • Business user (inception): define the use case
     • Data scientist (exploration, model development): explore features, create and publish new features, create and validate models, iterate after model review
     • Data scientist (candidate model selection, model evaluation): model selection; evaluate live model performance
     • ML operations (model operationalization, go live): define online feature assembly, define the pipeline to collect outcomes, model deployment and monitoring, go live with selected models
     • Ongoing: monitor live models, collect new data, and retrain; iterate
  10. WHY APACHE FLINK?
     • Utilized as the orchestration and ETL engine
     • First-class streaming model
     • Performance
     • Rich stateful semantics
     • Team experience
     • Open source and a growing community
     Apache®, Apache Flink®, and the squirrel logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
  11. THE "DATA PLANE" PROBLEM
     • A streaming compute pipeline must query data sets at rest: AWS S3 and HDFS behind a file abstraction, databases, and enterprise services
     • Streaming state (sums, averages, time buckets) and stream data feed the model alongside those queries
  12. ML MODEL EXECUTION
     1. Model execution trigger: the payload contains only the model name and account number
     2. Feature assembly: model metadata informs which features are needed for the model
     3. Pull the required features from the online feature store by account number
     4. Pass the full set of assembled features for model execution
     5. The model returns a prediction
     A minimal sketch of this flow follows below.
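The five steps above can be read as the following minimal sketch, in Java since the pipeline is Flink-based. The ModelMetadata and OnlineFeatureStore interfaces and all names are hypothetical stand-ins; only the flow itself (the trigger payload carries just a model name and account number, metadata resolves the feature list, and the store is read by account key) comes from the slide.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureAssembly {

  /** Hypothetical metadata lookup: model name -> names of required features. */
  interface ModelMetadata {
    List<String> requiredFeatures(String modelName);
  }

  /** Hypothetical online feature store keyed by account number. */
  interface OnlineFeatureStore {
    Object get(String accountNumber, String featureName);
  }

  private final ModelMetadata metadata;
  private final OnlineFeatureStore store;

  public FeatureAssembly(ModelMetadata metadata, OnlineFeatureStore store) {
    this.metadata = metadata;
    this.store = store;
  }

  /**
   * Steps 2 and 3: the metadata says which features the model needs, and the
   * store is read by account number. The returned map is the full feature set
   * handed to model execution (step 4), which produces the prediction (step 5).
   */
  public Map<String, Object> assemble(String modelName, String accountNumber) {
    Map<String, Object> features = new HashMap<>();
    for (String featureName : metadata.requiredFeatures(modelName)) {
      features.put(featureName, store.get(accountNumber, featureName));
    }
    return features;
  }
}
```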
  13. SOLUTION
     • The requesting application sends a trigger event to a REST service listener; the inputs to the REST service are the model name and the account number
     • The model is selected based on rules (on-demand or streaming)
     • The model prediction request is initiated asynchronously by pushing it to a queue/topic
  14. SOLUTION: ASSEMBLE FEATURES FOR A GIVEN MODEL
     Happy path for model execution: all features are current.
     • Feature assembly listens for requests and assembles features based on account number as model input, using the Feature Store API over the online feature store and the model/feature metadata
     • Are all features current? Yes: proceed to model execution
     • The prediction sink collects predictions and outcomes into a prediction/outcome store to create datasets for model refinement
     • Predictions are pushed to the customer context, which stores current values of features for interactive query access
  15. SOLUTION (CONT.): ASSEMBLE FEATURES FOR A GIVEN MODEL
     Exception path: some or all features are not current.
     • Are all features current? No: feature assembly invokes the feature creation pipeline
     • Refreshed features are written to the online feature store, and the flow returns to the happy path
     • Features are also appended to the history feature store, an append store (e.g. S3, HDFS, Redshift), for use by data scientists for model training
  16. SOLUTION: DIGGING DEEPER
     • Model execution requests arrive with model metadata on a connected stream; request features are keyed by request id into a global window with one pane per request id
     • A custom trigger fires on the arrival of each feature (onElement) and periodically checks whether the model TTL has expired (onEventTime)
     • The apply function executes the model or expires the request, emitting results via side outputs
     • A custom evictor evicts the pane once the model has executed or the request has expired
     A sketch of such a trigger follows below.
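Below is a hedged sketch of a custom trigger along the lines this slide describes: each arriving feature fires an evaluation of the pane, and an event-time timer expires the request once the model TTL passes. This is illustrative rather than the production code; the custom evictor and side outputs live in the surrounding window function, and the TTL value and element type are assumptions.

```java
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Illustrative trigger for a GlobalWindow keyed by request id.
public class FeatureArrivalTrigger extends Trigger<Object, GlobalWindow> {
  private final long ttlMillis;

  public FeatureArrivalTrigger(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  @Override
  public TriggerResult onElement(Object feature, long timestamp,
                                 GlobalWindow window, TriggerContext ctx) throws Exception {
    // Arm a TTL timer for this request's pane. Each feature arrival fires an
    // evaluation; the window function decides whether all required features
    // are present and the model can execute.
    ctx.registerEventTimeTimer(timestamp + ttlMillis);
    return TriggerResult.FIRE;
  }

  @Override
  public TriggerResult onEventTime(long time, GlobalWindow window,
                                   TriggerContext ctx) throws Exception {
    // Model TTL expired: fire one last time (the expiry path can be routed to
    // a side output downstream) and purge the pane.
    return TriggerResult.FIRE_AND_PURGE;
  }

  @Override
  public TriggerResult onProcessingTime(long time, GlobalWindow window,
                                        TriggerContext ctx) throws Exception {
    return TriggerResult.CONTINUE;
  }

  @Override
  public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
    // A full implementation would track and delete the registered TTL timer here.
  }
}
```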
  17. FEATURE STORE
     Two types of feature stores:
     • Online feature store: current values by key (key/value store); the feature creation pipeline overwrites it, and it serves the prediction phase
     • History feature store: features appended as they are collected (e.g. HDFS, S3); serves the model training phase
     Multiple online feature stores based on SLAs: a feature can be stored in multiple online feature stores to support model-specific SLAs.
     Types of online feature store:
     • PostgreSQL (AWS RDS, Aurora DB) for low-volume, on-demand model execution requests
     • HBase or DynamoDB for high-volume feature ingest
     • Flink Queryable State for high-volume ingest with high-velocity model execution requests (a sketch follows below)
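A hedged sketch of the Queryable State option, assuming a keyed stream of (account id, feature value) tuples maintained by a flat map; the state name, store name, and types are illustrative. The setQueryable call is the Flink hook that exposes this keyed state to an external QueryableStateClient, which is how feature assembly could read current values without a separate database.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps the latest (accountId, featureValue) pair per key and exposes it to
// external lookups under the name "online-feature-store".
public class LatestFeatureKeeper
    extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {

  private transient ValueState<Tuple2<String, Double>> latest;

  @Override
  public void open(Configuration conf) {
    ValueStateDescriptor<Tuple2<String, Double>> desc =
        new ValueStateDescriptor<>("latest-feature",
            Types.TUPLE(Types.STRING, Types.DOUBLE));
    // This one call makes the keyed state queryable from outside the job
    // via Flink's QueryableStateClient.
    desc.setQueryable("online-feature-store");
    latest = getRuntimeContext().getState(desc);
  }

  @Override
  public void flatMap(Tuple2<String, Double> update,
                      Collector<Tuple2<String, Double>> out) throws Exception {
    latest.update(update); // overwrite semantics: the online store keeps current values
    out.collect(update);
  }
}
```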
  18. FEATURE CREATION PIPELINES
     • Flink as a real-time data stream consumer
     • Custom flows for aggregation features over raw data; on-demand feature requests go through an external REST API and push to the feature store
     • The same data flows serve prediction (streaming) and training (batch):
       • Produced features update the online feature store (prediction phase)
       • Produced features are appended to S3 or HDFS for use by data scientists (training phase)
     A sketch of this dual write follows below.
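A hedged sketch of that dual write, with one produced-feature stream and one sink per phase. The record format and S3 path are illustrative; print() stands in for the key/value store sink, and StreamingFileSink (added in Flink 1.6, after this talk) stands in for whatever append mechanism the pipeline actually uses.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class FeatureDualWriteJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000); // StreamingFileSink finalizes files on checkpoints

    // Stand-in for a real feature-creation flow (Kafka source, aggregation, ...).
    DataStream<String> producedFeatures =
        env.fromElements("acct42|signalErrors24h|2150");

    // Prediction phase: overwrite the current value in the online feature store.
    producedFeatures.print(); // stand-in for a key/value store sink

    // Training phase: append the same records to S3/HDFS for data scientists.
    producedFeatures.addSink(
        StreamingFileSink
            .forRowFormat(new Path("s3://feature-history/signal-errors"), // illustrative path
                          new SimpleStringEncoder<String>("UTF-8"))
            .build());

    env.execute("feature-creation-dual-write");
  }
}
```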
  19. STREAMING FEATURE EXAMPLE
     • Kafka error stream (~150 events/second)
     • Detect accounts with a signal-error count > 2000 in the trailing 24 hours
     Solution: an Avro deserializer with key = account, a "24-hour rolling" hash structure as state, and a filter function applying the signal threshold.
     Flink features used: Kafka source, keyed stream, value state, sliding window, filter function. A sketch follows below.
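A hedged sketch of this job. The talk's implementation keeps a "24-hour rolling" hash structure in value state; for brevity this sketch uses a built-in sliding window instead, and the Kafka topic, brokers, and plain string payload (standing in for the Avro deserializer) are assumptions. Only the 24-hour trailing window and the 2000-event threshold come from the slide.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class SignalErrorFeatureJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092"); // illustrative
    props.setProperty("group.id", "signal-error-feature");

    // Each record is assumed to carry the account id as its payload; the
    // talk uses an Avro deserializer keyed by account here.
    DataStream<String> accountIds = env.addSource(
        new FlinkKafkaConsumer<>("signal-errors", new SimpleStringSchema(), props));

    accountIds
        .map(acct -> Tuple2.of(acct, 1L))                 // one count per error event
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(0)                                         // keyed stream by account
        .window(SlidingProcessingTimeWindows.of(Time.hours(24), Time.minutes(5)))
        .sum(1)                                           // trailing 24-hour count
        .filter(t -> t.f1 > 2000L)                        // signal threshold from the slide
        .print();                                         // stand-in for the feature store sink

    env.execute("signal-error-rolling-24h-feature");
  }
}
```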
  20. ON-DEMAND FEATURE EXAMPLE
     Premise Health Test (PHT):
     • Diagnostic telemetry information for each device for a given customer
     • Expensive, so only requested on demand
     • Models using such a feature extract sub-elements using scripting capabilities (model metadata and feature engineering)
     • Model metadata contains a TTL attribute for such features, indicating their tolerance for stale data
     Solution: make an on-demand request for PHT telemetry data if it is stale or absent for a given account.
     Flink features used: async operator. A sketch follows below.
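A hedged sketch of the async operator usage. The PhtClient, its endpoint, and the string payloads are hypothetical stand-ins for the real PHT REST integration; what the slide supplies is the idea that only stale-or-absent accounts reach this operator and that the expensive call must not block the streaming pipeline.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class PhtAsyncEnrichment {

  /** Hypothetical async client stub; a real one would call the PHT REST endpoint. */
  public static class PhtClient {
    PhtClient(String baseUrl) {}
    CompletableFuture<String> runHealthTest(String accountId) {
      return CompletableFuture.supplyAsync(() -> "{\"devices\":[]}");
    }
  }

  /** Fetches PHT telemetry without blocking the operator thread. */
  public static class PhtLookup extends RichAsyncFunction<String, String> {
    private transient PhtClient client;

    @Override
    public void open(Configuration conf) {
      client = new PhtClient("https://pht.example.internal"); // illustrative URL
    }

    @Override
    public void asyncInvoke(String accountId, ResultFuture<String> out) {
      // The future completes on the client's thread pool; Flink's async
      // operator correlates the result back to the in-flight record.
      client.runHealthTest(accountId)
            .thenAccept(json -> out.complete(Collections.singleton(json)));
    }
  }

  /** Wire-up: only accounts whose PHT feature is stale or absent arrive here. */
  public static DataStream<String> enrich(DataStream<String> staleOrAbsentAccounts) {
    // Unordered results, 30-second timeout, at most 100 in-flight requests.
    return AsyncDataStream.unorderedWait(
        staleOrAbsentAccounts, new PhtLookup(), 30, TimeUnit.SECONDS, 100);
  }
}
```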
  21. ML PREDICTION COMPONENT
     Same code base, multiple deployment models:
     • REST service (low-velocity, on-demand model invocations): an H2O.ai model container (POJO), a Python-based service running specialized ML models, or any stateless REST service
     • Flink map operator (high-velocity, streaming model invocations): an H2O.ai model container (POJO) wrapped in a Flink map operator; possibly support native calls via Flink map operators running specialized models (e.g. TensorFlow GPU-based predictions)
     A sketch of the map-operator wrapping follows below.
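A hedged sketch of wrapping an H2O-exported POJO in a Flink map operator. EasyPredictModelWrapper, RowData, and MultinomialModelPrediction are h2o-genmodel classes; the model class name and the three-way speed-test labels are illustrative assumptions.

```java
import hex.genmodel.GenModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.MultinomialModelPrediction;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Wraps an H2O-exported POJO so each Flink task holds its own local model.
public class H2oPredictMap extends RichMapFunction<RowData, String> {
  private final String modelClassName; // e.g. a hypothetical "SpeedTestIssueModel"
  private transient EasyPredictModelWrapper model;

  public H2oPredictMap(String modelClassName) {
    this.modelClassName = modelClassName;
  }

  @Override
  public void open(Configuration conf) throws Exception {
    // The exported POJO class is compiled into the job jar, so the model is
    // instantiated locally with no remote call on the prediction path.
    GenModel raw = (GenModel) Class.forName(modelClassName)
        .getDeclaredConstructor().newInstance();
    model = new EasyPredictModelWrapper(raw);
  }

  @Override
  public String map(RowData features) throws Exception {
    MultinomialModelPrediction p = model.predictMultinomial(features);
    return p.label; // e.g. "wifi", "modem", or "network" for the speed-test use case
  }
}
```

The same wrapped POJO can also sit behind a stateless REST endpoint, which is how the "same code base, multiple deployment models" point above plays out in practice.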
  22. VERSIONING AND DEVOPS
     Everything is versioned:
     • Feature/model metadata
     • Feature data and model execution environments
     • Training and validation datasets
     • Feature creation pipelines
     Versioning allows provenance, auditability, and repeatability of every prediction.
  23. FEATURES OF THE ML PIPELINE
     Cloud agnostic:
     • Integrates with the AWS cloud but is not dependent on it
     • The framework should work in a non-AWS distributed environment with configuration (not code) changes
     Traceability, repeatability, and auditability:
     • Models can be traced back to business use cases
     • Full traceability from raw data through feature engineering to predictions
     • "Everything versioned" enables repeatability
     CI/CD support:
     • Code, metadata (hyperparameters), and data (training/validation data) are versioned; deployable artifacts integrate with the CI/CD pipeline
  24. FEATURES OF THE ML PIPELINE (CONT.)
     Multi-deployment options:
     • Supports throughput vs. latency tradeoffs: process in stream, in batch, or on demand
     • Allows multiple versions of the same or different models to be compared with one another on live data: A/B testing, multivariate testing, and live-but-dark deployments
     • Supports integration of outcomes with predictions to measure production performance and support continuous model retraining
     Pluggable (data and compute) architecture:
     • Decoupled architecture based on message-driven inter-component communication; failure of an isolated component does not fail the entire platform
     • Asynchronous behavior
     • Microservices-based design that supports independent deployment of components
  25. NEXT STEPS AND FUTURE WORK
     • Generating "Flink native" feature flows: evaluating Uber's "AthenaX" project and similar approaches
     • A UI portal for model/feature and metadata management, containerization support for the model execution phase, a workbench for data scientists, and continuous model monitoring
     • Queryable State
     • Automating the retraining process
     • Support for multiple/pluggable feature stores (SLA driven)
  26. SUMMARY AND LESSONS LEARNED
     Flink is helping achieve our business goals:
     • Near-real-time streaming context
     • Container for the ML prediction pipeline
     • Stateful feature generation
     • Multiple solutions to the "data plane" problem
     • Natural asynchronous support
     • Rich windowing semantics support various aspects of our ML pipeline (training/prediction/ETL)
     • Connected streams simplify pushing metadata updates (reduced querying load with better performance)
     • Queryable State is a natural fit for the high-velocity, high-volume data pushed to the online feature store
  27. THANK YOU!
