A machine learning and data science pipeline for real companies

A Data Science Pipeline for Real Companies
Comcast’s Approach to Multi-datacenter, Cloud and On-premise Machine Learning

“Comcast brings together the best in media and technology. We drive innovation to create the world’s best entertainment and online experiences.”
• High Speed Internet
• Video
• Home Automation
• Digital Voice
• Xfinity Mobile
• Content
$84b (2017)
29m Customers

Use Cases: It’s All About The Customer
• Predictive network analysis
• Customer premise self-healing
• Comcast network self-healing
• Trouble-ticket prioritization
• Customer self-help (voice and text flows)
• Customer Retention

Our starting point!
[Diagram labels: Internal Data Centers, Cloud Based Infrastructure, E1, E2, E3, Predictions, Next Big Thing]
• Where’s the data?
• How do I access it?
• What tools do I have?
• Where can I find information about data?

Our Challenges
Security
Diversity of Skills
Discoverability

Guiding Principles
• FAST: Provide frameworks and capabilities that allow for rapid deployment.
• SIMPLE & TRANSPARENT: Develop capabilities to promote self-service and ease of access to data.
• CONSISTENT & SECURE: Provide a universal security framework to govern all data under the Big Data Domain.
• FULLY AUTOMATED: Provide a robust operational model allowing for playback, data quality, and self-healing.
Product Vision
Gather, organize, make sense of Comcast data, and make it universally accessible to empower, enable, and transform Comcast into an insight-driven organization.

Approach
• Avoid religious wars where possible
• Whatever framework makes sense for the business problem at hand
• Focus on federated access to curated data
• Focus on Common APIs for Ingest, Egress and Machine Learning
• Focus on metadata and discoverability for:
– Enterprise Data
– Enterprise Features
– Trained Models
• Enterprise Portal
• Containerized scoring endpoints that accommodate multiple frameworks and models

Focus on a Common Approach to Features and Models
Shameless Lyft: Uber’s Michelangelo becomes Comcast’s Da Vinci
• Focus on Art AND Science (and a smattering of creativity)
• Common APIs usable from multiple frameworks using Python or Scala
• Metadata is Key
– About data
– About features
– About trained models

[Architecture diagram labels: Atlas; Apache; Ingest API; Egress API (Federation Layer); Feature Store; Model Store; On Premise; Cloud; Portal; <Your Favorite Framework Here>; tools such as Presto and Alluxio; Scala; Python; Open ML API; Training; Deployment; Container (x3); Client]

Open ML API
• Reads and writes features and feature metadata
• Reads and writes model metadata
• Integrates with a common portal searchable by any user
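
As a rough illustration of these three responsibilities, here is a minimal in-memory sketch. The class `OpenMLCatalog`, the `Entry` fields, and the example entries are all hypothetical and are not the actual Open ML API; a real deployment would presumably sit on a metadata service rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Metadata for one feature or model (field names are illustrative)."""
    name: str
    kind: str            # "feature" or "model"
    owner: str
    description: str
    tags: set = field(default_factory=set)


class OpenMLCatalog:
    """Hypothetical stand-in for a shared, searchable metadata catalog."""

    def __init__(self):
        self._entries = {}

    def write(self, entry: Entry) -> None:
        # Key on (kind, name) so a feature and a model may share a name.
        self._entries[(entry.kind, entry.name)] = entry

    def read(self, kind: str, name: str) -> Entry:
        return self._entries[(kind, name)]

    def search(self, text: str) -> list:
        # Case-insensitive match on name, description, or tags: the kind of
        # lookup a portal search box might issue.
        needle = text.lower()
        return sorted(
            (e for e in self._entries.values()
             if needle in e.name.lower()
             or needle in e.description.lower()
             or any(needle in t.lower() for t in e.tags)),
            key=lambda e: (e.kind, e.name),
        )


catalog = OpenMLCatalog()
catalog.write(Entry("modem_resets_7d", "feature", "network-team",
                    "Count of modem resets over seven days", {"network"}))
catalog.write(Entry("ticket_priority", "model", "care-team",
                    "Ranks trouble tickets by urgency", {"care"}))
hits = catalog.search("ticket")
```

Keeping features and models in one catalog is a design choice made here so that a single search covers both.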

The Data Science Pipeline
DX Alpha

Goals
• Develop a system to manage features and models running in Spark
– Based on Uber’s Michelangelo
• Make it easier to build and deploy data transformations and ML models
• Enhance sharing of code across data science teams
• Support a variety of data science toolkits

Data Science Pipeline Components
• Feature Store
– Standardized approach to define data transformations
– A feature is a single attribute or column in a data frame
– A feature table is a set of features combined with metadata
– The transformation definition is separated from the “context” in which it is applied

• Model store
– Standardized approach for defining models
– A model is defined by train, predict, and evaluate functions, a hyperparameter set, and associated metadata
– The definitions of the train, predict, and evaluate functions are separated from their application
– A model may be associated with one or more trained instances, prediction data frames, and evaluation metrics

• Job Scheduler/Runner
– Handle streaming, scheduled, and one-time jobs
– Support interdependencies between jobs

• File system
– Store executable objects such as jar files and notebooks
– Store data frames
– Store trained models and other runtime artifacts
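
The separation between a definition and the context in which it is applied can be sketched in plain Python. Everything below is an assumption made for illustration: rows are plain dictionaries rather than Spark DataFrames, and `FeatureDef`, `ModelDef`, and the toy mean predictor are invented names, not part of the system described here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class FeatureDef:
    """A named transformation, defined without reference to any data set."""
    name: str
    fn: Callable[[dict], float]

    def apply(self, rows: list) -> list:
        # The "context" (which rows to use) is supplied only at call time.
        return [{**row, self.name: self.fn(row)} for row in rows]


@dataclass
class ModelDef:
    """Train/predict/evaluate functions plus hyperparameters and metadata."""
    name: str
    train: Callable
    predict: Callable
    evaluate: Callable
    hyperparameters: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)


# Toy model: predict the (optionally shrunk) mean of the training labels.
def train(rows, label, shrink=1.0):
    return shrink * sum(r[label] for r in rows) / len(rows)


def predict(trained, rows):
    return [trained for _ in rows]


def evaluate(predictions, rows, label):
    # Mean absolute error between predictions and labels.
    return sum(abs(p - r[label]) for p, r in zip(predictions, rows)) / len(rows)


errors_per_hour = FeatureDef("errors_per_hour",
                             lambda r: r["errors"] / r["hours"])
baseline = ModelDef("baseline_mean", train, predict, evaluate,
                    hyperparameters={"shrink": 1.0},
                    metadata={"owner": "example-team"})

rows = errors_per_hour.apply([
    {"errors": 10, "hours": 5, "calls": 2},
    {"errors": 4, "hours": 1, "calls": 4},
])
trained = baseline.train(rows, "calls", **baseline.hyperparameters)
mae = baseline.evaluate(baseline.predict(trained, rows), rows, "calls")
```

The same `baseline` definition can be trained repeatedly on different data, which is what allows one model to be associated with several trained instances and evaluation results.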

Development Approach
• Build on top of Databricks and Spark
• Start with a “thin slice” proof of concept
– Demonstrate a basic end-to-end run from data exploration to model evaluation
• Iterate to improve usability and tooling

What do we need to know about Feature Tables?
• Descriptive Information
– What data transformation does it perform?
– Who owns this feature table?
– Description of Input/output
• Build/deployment information
– Where’s the code? What’s the current version?
– What artifacts have been deployed to the production environment?
• Run information
– What jobs are running or have been run?
– What’s the status of these jobs?
– What data sets or streams are being produced and how do I access them?
– Are there performance metrics or summary statistics available?
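
One way to make these questions concrete is to treat each group as a section of a metadata record. The sketch below is only an assumption about what such a record could look like; the field names, the example repository URL, and the output path are placeholders rather than an actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class JobRun:
    job_id: str
    status: str                 # e.g. "RUNNING", "SUCCEEDED", "FAILED"
    output_path: str            # where the produced data set can be read
    summary_stats: dict = field(default_factory=dict)


@dataclass
class FeatureTableRecord:
    # Descriptive information
    name: str
    description: str
    owner: str
    inputs: list
    outputs: list
    # Build/deployment information
    code_location: str
    version: str
    deployed_artifacts: list = field(default_factory=list)
    # Run information
    runs: list = field(default_factory=list)

    def latest_status(self):
        """Status of the most recently recorded run, or None if never run."""
        return self.runs[-1].status if self.runs else None


record = FeatureTableRecord(
    name="device_health_daily",
    description="Daily aggregates of device error counts",
    owner="example-team",
    inputs=["device_events"],
    outputs=["errors_per_day"],
    code_location="https://example.com/repo/device_health",  # placeholder
    version="0.1.0",
)
record.runs.append(JobRun("run-001", "SUCCEEDED", "/tmp/example/run-001",
                          {"row_count": 1200}))

# asdict() recurses into nested dataclasses, so the record serializes
# directly to JSON for storage in a document store.
as_json = json.dumps(asdict(record), sort_keys=True)
```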

What do we need to know about Models?
• Descriptive Information
– What does it do? Classification? Regression?
– Who owns this model?
– Is it supervised or unsupervised? What type of labels are required?
– What features does it use?
• Build/deployment information
– Where’s the code? What’s the current version?
– What artifacts have been deployed to the production environment?
• Run information
– What training / prediction / evaluation jobs are running or have run?
– What’s the status of these jobs?
– What data sets or streams are being produced and how do I access them?
– How well is the model performing? What criteria are being used to assess this?
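
The last question, which criteria are used to assess a model, can be recorded as data next to the metrics themselves. This is a minimal sketch under assumed names (`Criterion`, `assess`) with made-up metric values; it is not how any particular model here is judged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def assess(metrics: dict, criteria: list) -> dict:
    """Map each criterion's metric name to pass/fail; missing metrics fail."""
    return {
        c.metric: c.metric in metrics and c.passes(metrics[c.metric])
        for c in criteria
    }


criteria = [
    Criterion("auc", 0.80),
    Criterion("log_loss", 0.50, higher_is_better=False),
]
report = assess({"auc": 0.86, "log_loss": 0.61}, criteria)
```

Storing the criteria with the model metadata means the answer to "how well is it performing?" always comes with the standard it was measured against.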

What actions do we need to perform?
• Data exploration and development
• Packaging, versioning, and deployment of ML code
• Job scheduling and monitoring
• Storage/discovery/retrieval of job results
– Data frames
– Metrics
– Trained models
• Discovery of and interaction with Features and Models

What Technologies Already do this?
• Data exploration and development
– Databricks notebooks, local IDE
• Packaging, versioning, and deployment of code and metadata
– GitHub / Jenkins / Mortar
– Document store for metadata (MongoDB, Cassandra, etc.)
• Job scheduling and monitoring
– Airflow, Databricks Jobs API
• Storage of job results
– DBFS / S3, need to define standard file structure
• Discovery of and interaction with Features and Models
– Finding – Elasticsearch or existing Thin Slice API
– Reading/processing data frame artifacts – Spark, Databricks notebooks
– Retrieving/viewing performance metrics – ???
– Monitoring model performance over time – ???
– Algebraic composition of features and models – ???
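
The need for a standard file structure for job results could be met with a single path-building helper that every job calls. The layout used below (root / entity / name / version / run ID / artifact kind) is purely an assumption for illustration; no layout is specified in this deck.

```python
ENTITIES = {"features", "models"}
ARTIFACT_KINDS = {"dataframe", "metrics", "model"}


def artifact_path(root: str, entity: str, name: str, version: str,
                  run_id: str, kind: str) -> str:
    """Build the storage path for one job artifact.

    Plain string joining is used (rather than pathlib) so that URI-style
    roots keep their scheme separator intact.
    """
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity: {entity}")
    if kind not in ARTIFACT_KINDS:
        raise ValueError(f"unknown artifact kind: {kind}")
    return "/".join([root.rstrip("/"), entity, name, version, run_id, kind])


path = artifact_path("/mnt/ml", "models", "churn", "v3", "run-42", "metrics")
```

With one function owning the layout, discovery tools can locate data frames, metrics, and trained models without each team documenting its own conventions.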

Open questions
• How do we abstract file system details and other constants?
• How do we standardize ETL from other systems within Comcast?
• How do we support human-labeling of data sets?
• What other tools (H2O, R, etc.) do we need to support?
• Are there other ways we need to interact with features and models?
• How do we integrate AutoML?
• Other technologies that may be useful? Databricks Delta? Amazon SageMaker?

Architecture V2: Pipelines
• Pipeline Segments (same as Spark ML)
– Transformers
– Estimators
• Pipeline: linear sequence of Pipeline Segments (same as Spark ML)
– Transformation pipelines contain only Transformers
– Estimation pipelines contain one or more Transformers and end with an Estimator
[Diagram: a Transformation Pipeline and an Estimation Pipeline, drawn as chains of nodes labeled D, T, and E]
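
The two pipeline shapes can be checked mechanically. The sketch below uses plain Python classes that borrow the `transform` and `fit` method names from Spark ML but are not Spark classes; `classify_pipeline`, `Shift`, and `FitCentering` are hypothetical names, and rows are simple lists of numbers.

```python
class Transformer:
    """Maps rows to rows."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from rows and returns a fitted Transformer."""
    def fit(self, rows):
        raise NotImplementedError


def classify_pipeline(segments):
    """Return "transformation" or "estimation", or raise ValueError."""
    if not segments:
        raise ValueError("a pipeline needs at least one segment")
    *head, last = segments
    if not all(isinstance(s, Transformer) for s in head):
        raise ValueError("only the final segment may be an Estimator")
    if isinstance(last, Transformer):
        return "transformation"
    if not isinstance(last, Estimator):
        raise ValueError("segments must be Transformers or Estimators")
    if not head:
        raise ValueError("an Estimator must follow at least one Transformer")
    return "estimation"


class Shift(Transformer):
    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [r + self.offset for r in rows]


class FitCentering(Estimator):
    def fit(self, rows):
        # The fitted result is itself a Transformer that centers the data.
        return Shift(-sum(rows) / len(rows))


kinds = (classify_pipeline([Shift(1), Shift(2)]),
         classify_pipeline([Shift(1), FitCentering()]))
centered = FitCentering().fit([1, 2, 3]).transform([1, 2, 3])
```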

Architecture V2: Pipelines
• A pipeline is just a function
• It does not produce anything until supplied a specific DataFrame as input
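
The "just a function" view can be shown with ordinary function composition. This is a minimal sketch in which lists stand in for DataFrames; `pipeline`, `drop_nulls`, and `double` are illustrative names only.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into a single function of the input data."""
    def run(rows):
        return reduce(lambda data, step: step(data), steps, rows)
    return run


calls = []  # records which steps have actually executed


def drop_nulls(rows):
    calls.append("drop_nulls")
    return [r for r in rows if r is not None]


def double(rows):
    calls.append("double")
    return [2 * r for r in rows]


clean_and_double = pipeline(drop_nulls, double)
calls_before_input = list(calls)      # building the pipeline ran nothing
result = clean_and_double([1, None, 3])
```

Because construction and execution are separate, the same pipeline object can be stored, shared, and later applied to any compatible input.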

Architecture V2: Workflows
• A Workflow is a directed (acyclic?) graph of Pipelines
– DataSources load data (from disk, streams, etc.) into a DataFrame
– Connectors merge the output of multiple DataSources into a single DataFrame
– Pipelines process the DataFrames
– DataSinks receive the output of the last pipeline
• Workflow rules
– Workflows must end in a single Pipeline node
– An Estimator Pipeline may only appear as the last node in a Workflow
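
These rules lend themselves to an automated check. The sketch below assumes the graph is acyclic (the slide leaves that open), assumes DataSinks are attached outside the graph, and uses invented node-kind strings; `validate_workflow` is a hypothetical helper built on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter, CycleError

PIPELINE_KINDS = {"transformation_pipeline", "estimation_pipeline"}


def validate_workflow(kinds: dict, edges: list) -> list:
    """Return node names in a valid execution order, or raise ValueError.

    kinds maps node name -> "source", "connector",
    "transformation_pipeline", or "estimation_pipeline".
    edges is a list of (upstream, downstream) pairs between known nodes.
    """
    predecessors = {name: set() for name in kinds}
    has_successor = set()
    for upstream, downstream in edges:
        predecessors[downstream].add(upstream)
        has_successor.add(upstream)

    try:
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("workflow contains a cycle") from exc

    # Terminal nodes are those with no outgoing edge.
    terminals = [n for n in kinds if n not in has_successor]
    if len(terminals) != 1 or kinds[terminals[0]] not in PIPELINE_KINDS:
        raise ValueError("workflow must end in a single Pipeline node")
    for name, kind in kinds.items():
        if kind == "estimation_pipeline" and name != terminals[0]:
            raise ValueError("an estimation pipeline may only be the last node")
    return order


kinds = {
    "events": "source",
    "accounts": "source",
    "join": "connector",
    "features": "transformation_pipeline",
    "train": "estimation_pipeline",
}
edges = [("events", "join"), ("accounts", "join"),
         ("join", "features"), ("features", "train")]
order = validate_workflow(kinds, edges)
```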

Architecture V2: Workflows
[Diagram: two example graphs, a Transformation Workflow and a Training Workflow, with nodes labeled D, C, PT, and PE]

Architecture V2: Data Sources
• Potential sources of data
– HTTP request
– Persistent store (Avro, Parquet, EDW, …)
– Kafka topic
– Others?
• Connectors could handle complex logic such as combining HTTP data with other sources before feeding into a Workflow
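
A Connector that combines an HTTP request payload with stored data might look like the sketch below. It is an assumption-laden illustration: `ListSource` stands in for any real source (Kafka, Parquet, an HTTP handler), rows are dictionaries, and the merge is a simple inner join on one key.

```python
class ListSource:
    """A DataSource backed by an in-memory list of row dictionaries."""

    def __init__(self, rows):
        self._rows = rows

    def load(self):
        return list(self._rows)


class JoinConnector:
    """Merges rows from several sources on a shared key (inner join)."""

    def __init__(self, key, *sources):
        self.key = key
        self.sources = sources

    def load(self):
        merged = None
        for source in self.sources:
            by_key = {row[self.key]: row for row in source.load()}
            if merged is None:
                merged = by_key
            else:
                # Keep only keys present in every source seen so far.
                merged = {k: {**merged[k], **by_key[k]}
                          for k in merged if k in by_key}
        return list(merged.values()) if merged else []


request = ListSource([{"account": "a1", "channel": "chat"}])
history = ListSource([{"account": "a1", "tickets": 3},
                      {"account": "a2", "tickets": 0}])
combined = JoinConnector("account", request, history).load()
```

Giving the connector the same `load()` method as a source means a Workflow can consume either one without caring which it received.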

Architecture V2
• Components
– Pipeline Segment Store
• Code catalog of available transforms, estimators, and pipelines
• Searchable by description, tags, and maybe by schema?
– “Find me a feature of type x that is tagged y”
– Workflow Store
• Stores DAGs (maybe Neo4j or other graph DB?)
• Integrates with Databricks to run DAGs as jobs
• Periodic graph analysis to optimize Workflows
– Data Source Store?
• Separate system or subset of Workflow Store?
