
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions with Nabeel Sarwar


Our team at Comcast is challenged with operationalizing predictive ML models to improve customer experience. Our goal is to eliminate bottlenecks in the process from model inception to deployment and monitoring.

Traditionally, CI/CD manages code and infrastructure artifacts such as container definitions. We want to extend it to support granular traceability, enabling tracking of ML models from use case, to feature/attribute selection, development of versioned datasets, model training code, model evaluation artifacts, model prediction deployment containers, and the sinks to which predictions/outcomes are persisted. Our framework stack enables us to track models from use case to deployment, manage and evaluate multiple models simultaneously in live-yet-dark mode, and continuously monitor production models against real-world outcomes using configurable policies.

The technologies/components that drive this vision are:

1. FeatureStore – Enables data scientists to reuse versioned features and review feature metrics by model. Self-service capabilities allow all teams to onboard their event data into the feature store.

2. ModelRepository – Manages metadata about models, including pre-processing parameters (e.g. scaling parameters for features), mappings to the features needed to execute the model, model discovery mechanisms, etc.

3. Spark on Alluxio – Alluxio provides the universal data plane on top of various under-stores (e.g. S3, HDFS, RDBMS). Apache Spark, with its Data Sources API, provides a unified query language which data scientists use to consume features and create versioned training/validation/test datasets, integrated into the full model pipeline using Ground-Context, discussed next (a sketch of this dataset-building step follows the list).

4. Ground-Context – This open-source, vendor-neutral data context service enables full traceability across use cases, models, features, model-to-feature mappings, versioned datasets, the model training codebase, model deployment containers and prediction/outcome sinks. It connects the FeatureStore, the container repository and Git to tie data, code and run-time artifacts into the CI/CD pipeline.
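As a rough illustration of item 3 above, here is a minimal PySpark sketch of building a versioned training dataset from the history feature store through Alluxio. The alluxio:// paths, column names and version labels are hypothetical placeholders, not the actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("training-dataset-builder").getOrCreate()

# Alluxio exposes the under-stores (S3, HDFS, ...) behind one namespace,
# so the same path works regardless of where features physically live.
features = spark.read.parquet("alluxio://alluxio-master:19998/features/speed_test/v3/")

training = (features
            .filter(features.event_date >= "2018-01-01")
            .select("account_number", "download_mbps", "upload_mbps", "label"))

# Persist the dataset under an explicit version so Ground-Context can link
# it to the model training run that consumes it.
training.write.mode("overwrite").parquet(
    "alluxio://alluxio-master:19998/datasets/wifi_issue_model/v1/train/")
```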



1. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions. Nabeel Sarwar, Machine Learning Engineer. June 2nd, 2018.
2. INTRODUCTION AND BACKGROUND: Customer Experience team • 27 million customers (high speed data, video, voice, home security, mobile) • ingesting about 2 billion events / month • high volume of machine-generated events • data science pipeline grew from a few dozen to 150+ data sources / feeds in about a year. Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
3. COMCAST APPLIED AI: Media & Video Analytics • Machine Learning & Data Science • Content Discovery • Speech & NLP. Applied across Video, High Speed Internet, Home Security / Automation, Customer Service, Universal Parks, and Media Properties.
4. BUSINESS PROBLEM: Increase positive customer experiences • resolve potential issues correctly, quickly, and even better proactively • predict and diagnose service trouble across multiple knowledge domains • reduce costs through earlier resolution and by reducing avoidable technician visits.
5. AI FOR CUSTOMER SERVICE: Proactive • Predictive • Interactive.
6. XFINITY VIRTUAL ASSISTANT: numbered app screenshots showing the flow: My Account main screen -> XFINITY Assistant -> type a question -> disambiguate.
7. VIRTUAL ASSISTANT – STEP BY STEP: Devices, applications, and platforms are instrumented to provide telemetry • natural language input and feedback • interactive (conversational) actions • proactive (automatic) actions. Diagram: predictive AI/ML links customer intents, domain models and NLP to an action catalog (schedule truck roll, self-heal, notifications, agent contact) through a decision engine (choose best / explore) informed by context, root cause and predictions.
8. TECHNICAL PROBLEM: Multiple programming and data science environments • widespread and discordant data sources • the "data plane" problem: combining data at rest and data in motion • consistent feature engineering • ML versioning: data, code, features, models.
9. EXAMPLE NEAR REAL TIME PREDICTION USE CASE: Customer runs a "speed test" • the event triggers a prediction flow • enrich with network health and other indicators • execute the ML model • predict whether it is a WiFi, modem, or network issue. Flow: Event -> Detect (slow speed?) -> Enrich (gather data from network diagnostic and additional context services) -> Predict (run the ML model) -> Act / Notify (engage customer).
10. SPACE CORRELATION EXAMPLE: The ML algorithm needs to learn that there is no need to send 3 repair trucks • logs from which training datasets are sourced show correlation between unsuccessful truck dispatches and concentrated cable failures • geo-location is available in the customer context • the algorithm can cluster customers based on geo-location. Diagram: customers along a cable (green = works, yellow = has problems); a concentration of problems marks a likely failure in the cable itself.
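A minimal sketch of the clustering idea on this slide, using Spark ML's KMeans over customer geo-coordinates. The input table, column names and cluster count are illustrative assumptions, not the production algorithm.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer-context table with geo-location columns.
customers = spark.read.parquet("s3a://context/customers/")

# Assemble lat/lon into the vector column Spark ML expects.
points = VectorAssembler(
    inputCols=["latitude", "longitude"], outputCol="features").transform(customers)

# Cluster customers by location: a cluster dominated by trouble reports
# suggests one shared cable failure, so one truck roll instead of three.
model = KMeans(k=50, seed=42, featuresCol="features").fit(points)
clustered = model.transform(points)  # adds a cluster-id "prediction" column
```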
11. CHALLENGE – STANDARDIZATION OF FEATURES: Two main challenges: feature assembly (enrichment) during prediction time • discovering correlations when we have 25 million customers each using 10 products. We need a standardization of features, actions and rewards. Feature Store: a curated data store to drive model training and model prediction.
12. ML PIPELINE – ROLES & WORKFLOW: The Business User defines the use case (Inception). The Data Scientist explores features and creates and publishes new features (Exploration), creates and validates models (Model Development), and reviews and selects models (Candidate Model Selection). ML Operations takes selected models live (Model Operationalization, Go Live): define online feature assembly, define the pipeline to collect outcomes, deploy and monitor models. Live model performance is then evaluated (Model Evaluation), live models are monitored, new data is collected, models are retrained, and the cycle iterates.
13. SOLUTION MOTIVATION: Self-service platform • align data scientists and production • models treated as code • high-throughput stream platform.
14. WHY METADATA DRIVEN? Inspired by Ground Context (Berkeley's RISE Lab) • Application context: parameters, callbacks, "meaty" metadata • Behavior context: data sets and code • Change context: version history • Track any change end-to-end -> the entire pipeline is versioned • Metadata drives what code is run and how.
15. AN OVERVIEW OF SPARK FLOWS: Raw data stream -> feature creation pipeline (on demand or continuous) -> versioned historical raw store, historical feature store, and online feature store (disk or memory) -> model -> prediction -> customer experience elements, analysis & business value.
16. FEATURE STORE: Two types of feature stores: the Online Feature Store holds current values by key (a key/value store) and the History Feature Store appends features as they are collected (e.g. Hadoop File System, AWS S3). The online store is used in the prediction phase for enrichment; it must support fast ingest and query since it holds the current data for a given account or account & device combination. The history store builds up feature history; data scientists use it to create their training datasets. (Versioned) raw data is maintained separately. The feature creation pipeline overwrites the online store (prediction phase) and appends to the history store (model training phase).
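A hedged sketch of the two write paths: the history store appends immutable rows while the online store overwrites the current value per key. The Parquet path, batch source and use of Redis as the key/value side are assumptions for illustration.

```python
import json

import redis  # stand-in key/value client for the online store
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
new_features = spark.read.json("/tmp/incoming_feature_batch.json")  # hypothetical batch

# History store: append-only, so every observed value is kept and training
# sets can be rebuilt as of any point in time.
new_features.write.mode("append").parquet("s3a://feature-store/history/speed_test/")

# Online store: keyed by account, last write wins, so prediction-time
# lookups always see the current value.
r = redis.Redis(host="online-feature-store", port=6379)
for row in new_features.toLocalIterator():
    r.set(f"speed_test:{row['account_number']}", json.dumps(row.asDict(recursive=True)))
```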
17. USING THE ONLINE FEATURE STORE: 1. The model execution trigger payload contains only the model name and account number. 2. Model metadata informs which features are needed for the model. 3. Feature assembly pulls the required features from the online feature store by account number. 4. The full set of assembled features is passed to model execution. 5. The model returns a prediction.
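The five steps above, sketched in plain Python. The metadata service URL, Redis key scheme and model-serving endpoint are hypothetical stand-ins for the real services.

```python
import json

import redis
import requests

ONLINE_STORE = redis.Redis(host="online-feature-store", port=6379)

# Step 2: model metadata tells us which features this model needs.
# Hypothetical metadata-service response, keyed by model name.
model_meta = requests.get(
    "http://model-repository/models/wifi_issue_model/v1").json()

def assemble_and_predict(model_name: str, account_number: str) -> dict:
    # Step 3: pull each required feature for the account from the online store.
    features = {}
    for feature_name in model_meta["features"]:
        raw = ONLINE_STORE.get(f"{feature_name}:{account_number}")
        features[feature_name] = json.loads(raw) if raw else None

    # Steps 4-5: pass the assembled feature set to the model's REST endpoint.
    resp = requests.post(
        f"http://model-serving/{model_name}/predict", json=features)
    return resp.json()  # the prediction

# Step 1: the trigger payload carries only the model name and account number.
prediction = assemble_and_predict("wifi_issue_model", "1234567890")
```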
18. FEATURE CREATION PIPELINE: Two types: continuous aggregations on streaming data, and on-demand features (requested through an external REST API, with feature assembly driven by feature and model metadata, and a feature writer persisting to the online and history feature stores). Aggregation feature examples: number of customer calls in the past 30 days (key = account number); number of signal errors > 2000 in a 24-hour tumbling window (key = account number + device id). On-demand feature example: diagnostic telemetry information for each device for a given customer; expensive to collect, so only requested on demand, with the model metadata specifying a TTL for such a feature.
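A sketch of the continuous-aggregation path using Spark Structured Streaming, following the slide's 24-hour tumbling window keyed by account number + device id. The Kafka topic, event schema, watermark and console sink are assumptions standing in for the real feature writer.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("account_number", StringType()),
    StructField("device_id", StringType()),
    StructField("signal_error", IntegerType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "device-telemetry")   # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# "Number of signal errors > 2000 in a 24-hour tumbling window",
# keyed by account number + device id.
error_counts = (events
    .withWatermark("event_ts", "1 hour")
    .groupBy(F.window("event_ts", "24 hours"), "account_number", "device_id")
    .agg(F.sum(F.when(F.col("signal_error") > 2000, 1).otherwise(0))
           .alias("signal_error_count")))

query = (error_counts.writeStream
         .outputMode("update")   # push updated counts to the feature writer
         .format("console")      # stand-in sink for the online/history stores
         .start())
```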
19. FEATURE METADATA: How do I identify a feature? Key: namespace, name & version. How do I identify a specific instance of a feature? Online feature store key definitions: JSON path & script references (code & method) in GitHub. How do I write to the history store(s)? Historical feature store key definitions: JSON path to extract identifiers, connection parameters to the history store, script & JSON path to extract partitions. What are the update timestamps for each feature value (event vs. ingestion time)? Update-TS extractors: a combination of JSON paths and script references to extract the timestamp from feature payloads.
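What a feature-metadata record along these lines might look like. Every field name here is an illustrative guess at the shape the slide describes, not the actual schema.

```python
feature_metadata = {
    # Key: how do I identify a feature?
    "namespace": "network_health",
    "name": "speed_test",
    "version": 3,
    # How do I identify a specific instance? JSON paths plus script
    # references (code & method) in GitHub.
    "online_key": {
        "json_paths": ["$.header.account_number"],
        "script_ref": {"repo": "github.com/example/feature-scripts",
                       "file": "speed_test_key.py", "method": "build_key"},
    },
    # How do I write to the history store(s)?
    "history_store": {
        "connection": "s3a://feature-store/history/speed_test/",
        "partition_json_path": "$.header.event_date",
    },
    # What is the update timestamp for each value? Event vs. ingestion time.
    "update_ts_extractor": {"json_path": "$.header.timestamp",
                            "semantics": "event_time"},
}
```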
20. EXAMPLE FEATURE VALUE: Header: timestamp, internal customer identifier. Payload: JSON payload (e.g. speed test data).
21. INGEST FEATURE VALUE: The same header/payload value flows through the feature ingestion pipeline, which consults the feature metadata (and its referenced scripts repository) and writes to the online feature store and the history feature store.
22. MODEL METADATA: How do I identify a model? Key: use case, name & version. Why consistent feature engineering (scripts)? Per-feature definition: pre-feature-engineering hooks, attribute-level feature engineering hooks, post-feature-engineering hooks, TTL. Environment parameters define the environment for model execution. Model deployment definitions cover how the model is deployed, including autoscaling definitions.
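A companion sketch of a model-metadata record, shaped by the questions on this slide; again, every field name is hypothetical.

```python
model_metadata = {
    # Key: how do I identify a model?
    "usecase": "customer_experience", "name": "wifi_issue_model", "version": 1,
    # Per-feature definition: engineering hooks keep training and
    # prediction consistent; TTL bounds prediction-time staleness.
    "features": {
        "network_health/speed_test": {
            "pre_hooks": ["normalize_units.py#to_mbps"],
            "attribute_hooks": {"download_mbps": "scale.py#standardize"},
            "post_hooks": ["assemble_vector.py#to_row"],
            "ttl_seconds": 3600,
        },
    },
    # Environment parameters for model execution.
    "environment": {"runtime": "python3", "requires_gpu": False},
    # How is the model deployed? Autoscaling definitions live here too.
    "deployment": {"mode": "docker-rest", "min_replicas": 2, "max_replicas": 10},
}
```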
23. CONSISTENT DATA: Place data on the same plane: S3 (or form the data plane via Alluxio) • storage parameters driven by metadata • consistent persistence and reads • metadata-driven operators • the historical store holds both raw data and engineered features. Version the data: feature creation keeps metadata paths.
24. CONSISTENT FEATURE ENGINEERING – MODEL METADATA: Feature engineering must be consistent between training and the prediction phase. It is metadata driven, using configured scripts just like feature metadata: define the features used and a TTL per feature (prediction phase). The scripts referenced in model metadata are defined by the data scientist; they are used both to create training/testing/validation datasets from raw features (applied on each record used for training) and at prediction time to perform real-time feature engineering on the assembled record.
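The point of the slide is that one script serves both phases. A minimal sketch, assuming a hypothetical scaling function referenced from model metadata:

```python
# feature_engineering/scale.py -- a hypothetical script referenced from model
# metadata, so the exact same code runs in both phases.
def standardize(record: dict, mean: float, std: float) -> dict:
    record["download_mbps"] = (record["download_mbps"] - mean) / std
    return record

# Training phase: applied per record while building the training dataset.
history_records = [{"download_mbps": 110.0}, {"download_mbps": 40.0}]  # toy rows
train_rows = [standardize(r, mean=95.0, std=20.0) for r in history_records]

# Prediction phase: the identical call on the freshly assembled record, so
# online features are engineered exactly as the training data was.
assembled_record = {"download_mbps": 87.5}
scored_input = standardize(assembled_record, mean=95.0, std=20.0)
```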
25. CONSISTENT FEATURE ENGINEERING: SQL as a unifying language: replace as many operations as possible with their SQL equivalents • no need to translate code • no need for a DSL. Spark as a unifying engine: many tools for deeper feature engineering • redeploy the same code through streaming / web app • fewer frameworks. Applicable at every phase: post-ingest, in-flight or at-rest, pre-model; standards fit both stream and batch.
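One way to read "SQL as a unifying language": the same expression string runs unchanged over a batch DataFrame during training and a streaming DataFrame at prediction time. Paths, topic and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One SQL expression, written once, with no translation layer or DSL.
derived = "download_mbps / upload_mbps AS speed_ratio"

# Batch (at rest): building the training set from the history store.
batch = spark.read.parquet("s3a://feature-store/history/speed_test/")
train_df = batch.selectExpr("account_number", derived)

# Streaming (in flight): the identical expression applied pre-model.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "speed-test-events").load())
# ...after parsing the Kafka value into columns, the same selectExpr applies:
# stream_df = parsed_stream.selectExpr("account_number", derived)
```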
26. MODEL DEPLOYMENT: Model as code: H2O AI POJOs • Spark ML models • simple Python scripts (regression models) • specialized Python scripts (math libraries needing specialized hardware such as GPU support). One model, multiple deployment modes: deploy as Docker containers with REST endpoints (easy to test, and usable directly if the request already has all the features available) • deploy as map operators within a streaming framework • deploy as Lambda/SageMaker Spark functions in AWS • SparkLauncher • Databricks Jobs API.
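A hedged sketch of the first deployment mode: wrapping a model behind a REST endpoint suitable for a Docker container. Flask, the pickle path and the scikit-learn-style predict call are assumptions, not the actual serving stack.

```python
# serve.py -- minimal REST wrapper for "model as code" (illustrative only).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("/models/wifi_issue_model_v1.pkl", "rb") as f:  # hypothetical artifact
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Easy to test: usable directly if the request already has all features.
    # Real code would order features per model metadata, not dict order.
    features = request.get_json()
    label = model.predict([list(features.values())])[0]
    return jsonify({"prediction": str(label)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # EXPOSE 8080 in the Dockerfile
```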
27. PREDICTION PHASE: The requesting application sends a payload of model name + account number; a listener assembles the features for the given model from the online feature store, driven by model/feature metadata through the feature store API. Are all features current? If not, the feature creation pipeline refreshes them before assembly. Model execution then runs with the customer context, and predictions/outcomes are written to a prediction/outcome store, an append store (e.g. S3, HDFS, Redshift) for use by data scientists in model training.
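The "are all features current?" branch, sketched as a TTL check against the model metadata before falling back to the on-demand pipeline. The freshness field and toy inputs are hypothetical.

```python
import time

def stale_features(features: dict, model_meta: dict) -> list:
    """Return names of features whose stored value is older than the TTL
    the model metadata allows, or missing entirely."""
    now = time.time()
    stale = []
    for name, value in features.items():
        ttl = model_meta["features"][name]["ttl_seconds"]
        if value is None or now - value["updated_at"] > ttl:
            stale.append(name)
    return stale

# Toy inputs: one assembled feature that is two hours old with a one-hour TTL.
model_meta = {"features": {"speed_test": {"ttl_seconds": 3600}}}
assembled = {"speed_test": {"updated_at": time.time() - 7200,
                            "download_mbps": 40.0}}

if stale_features(assembled, model_meta):
    # In the real flow this would trigger the on-demand feature pipeline
    # (e.g. a device telemetry pull) and re-assemble before model execution.
    print("refresh needed for:", stale_features(assembled, model_meta))
```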
28. FEATURES OF THE ML PIPELINE: AWS agnostic: integrates with the AWS cloud but does not depend on it; the framework should work in a non-AWS distributed environment with configuration (not code) changes. Traceability, repeatability & auditability: a model can be traced back to business use cases, with full traceability from raw data through feature engineering to predictions; "everything versioned" enables repeatability. CI/CD support: code, metadata (hyper-parameters) and data (training/validation data) are versioned, with deployable artifacts that integrate with the CI/CD pipeline.
29. NEXT STEPS AND FUTURE WORK: A UI portal for model/feature and metadata management, containerization support for the model execution phase, a workbench for data scientists, and continuous model monitoring. Knowledge sharing: promote reusability (users search for features by model) • search features by their importance in models • real-time model evaluation by comparing predictions with outcomes • determining first-class tools. Automating the retraining process. Support for multiple/pluggable feature stores (SLA driven).
30. SUMMARY: Metadata driven: feature/model definition, versioning, feature assembly, model deployment, and model monitoring are all metadata driven. Automation: orchestrated deployment for new features and models. Rapid onboarding: a portal for model and feature management as well as model deployment. Data consistency: the feature store enforces a consistent data pipeline, ensuring that the data used for training is functionally identical to the data used for predictions. Monitoring and metrics: the ability to execute and monitor multiple models in production enables real-time, metrics-driven model selection. Iterative/consistent model development: multiple versions of a model can be developed iteratively while consuming from a consistent dataset (the feature store), enabling A/B and multivariate testing.
31. THANK YOU!
