A Data Science Pipeline for Real Companies
Comcast’s Approach to Multi-datacenter, Cloud and On-premise Machine Learning
“Comcast brings together
the best in media and
technology. We drive
innovation to create the
world's best entertainment
and online experiences.”
High Speed Internet
Video
Home Automation
Digital Voice
Xfinity Mobile
Content
$84b (2017)
29m Customers
• Predictive network analysis
• Customer premise self-healing
• Comcast network self-healing
• Trouble-ticket prioritization
• Customer self-help (voice and text flows)
• Customer Retention
Use Cases: It’s All About The Customer
Our starting point!
Internal Data Centers
Cloud Based Infrastructure
E1
E2
E3
Predictions
Next Big Thing
Where’s the data?
How do I access it?
What tools do I have?
?
Where can I find information about data?
Our Challenges
Security
Diversity of Skills
Discoverability
FAST: Provide frameworks and capabilities that allow for rapid deployment.
SIMPLE & TRANSPARENT: Develop capabilities to promote self-service and ease of access to data.
CONSISTENT & SECURE: Provide a universal security framework to govern all data under the Big Data Domain.
FULLY AUTOMATED: Provide a robust operational model allowing for playback, data quality, and self-healing.
Guiding Principles
Gather, organize, make sense of Comcast data, and make it universally accessible to empower, enable, and transform Comcast into an insight-driven organization.
Product Vision
• Avoid religious wars where possible
• Whatever framework makes sense for the business problem at hand
• Focus on federated access to curated data
• Focus on Common APIs for Ingest, Egress and Machine Learning
• Focus on metadata and discoverability for:
• Enterprise Data
• Enterprise Features
• Trained Models
• Enterprise Portal
• Containerized scoring endpoints that accommodate multiple frameworks and models
Approach
Shameless Lyft: Uber’s Michelangelo becomes Comcast’s Da Vinci
• Focus on Art AND Science (and a smattering of creativity)
• Common APIs usable from multiple frameworks using Python or Scala
• Metadata is Key
• About data
• About features
• About trained models
Focus on a Common Approach to Features and Models
[Architecture diagram. Labels: Apache Atlas; Ingest API; Egress API (Federation Layer); Feature Store; Model Store; On Premise; Cloud; Portal; <Your Favorite Framework Here>; tools such as Presto and Alluxio; Scala; Python; Open ML API; Training; Deployment; Container (x3); Client]
Open ML API
• Reads and writes features and feature metadata
• Reads and writes model metadata
• Integrates with a common portal searchable by any user
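As a rough illustration of the kind of surface these three bullets describe, the sketch below models an in-memory metadata client. The class name `OpenMLClient`, its methods, and the record fields are all hypothetical; the slides do not show the actual Open ML API.

```python
from dataclasses import dataclass, field


@dataclass
class OpenMLClient:
    """Hypothetical in-memory stand-in for a metadata API (not the real Open ML API)."""

    features: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)

    def write_feature(self, name, values, metadata):
        # Store the feature values together with their metadata.
        self.features[name] = {"values": list(values), "metadata": dict(metadata)}

    def read_feature(self, name):
        return self.features[name]

    def write_model_metadata(self, name, metadata):
        self.models[name] = dict(metadata)

    def read_model_metadata(self, name):
        return self.models[name]

    def search(self, term):
        # Portal-style search over names and metadata values in both stores.
        term = term.lower()
        hits = []
        for kind, store in (("feature", self.features), ("model", self.models)):
            for name, record in store.items():
                meta = record["metadata"] if kind == "feature" else record
                text = " ".join([name, *map(str, meta.values())]).lower()
                if term in text:
                    hits.append((kind, name))
        return hits


client = OpenMLClient()
client.write_feature("outage_count_7d", [0, 2, 1], {"owner": "network-team"})
client.write_model_metadata("ticket_priority", {"owner": "care-team", "task": "classification"})
```

Calling `client.search("care-team")` returns the matching model entry, which is the sort of lookup a common portal could expose to any user.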
DEMO
The Data Science Pipeline
DX Alpha
Goals
• Develop a system to manage features and models running in Spark
– Based on Uber’s Michelangelo
• Make it easier to build and deploy data transformations and ML models
• Enhance sharing of code across data science teams
• Support a variety of data science toolkits
Data Science Pipeline Components
• Feature Store
– Standardized approach to define data transformations
– A feature is a single attribute or column in a data frame
– A feature table is a set of features combined with metadata
– The transformation definition is separated from the “context” in which it is applied
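One way to picture the separation between a transformation definition and its context is the minimal sketch below. It uses plain lists of dicts in place of Spark DataFrames, and the `Feature` and `FeatureTable` classes, their fields, and the sample columns are assumptions made for illustration, not the Feature Store's actual types.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Feature:
    """One attribute (column), defined by a row-level transformation."""
    name: str
    transform: Callable[[dict], object]


@dataclass
class FeatureTable:
    """A set of features plus metadata; holds no data of its own."""
    name: str
    features: list
    metadata: dict = field(default_factory=dict)

    def apply(self, rows):
        # The "context" (which rows to use) is supplied only at call time.
        return [{f.name: f.transform(row) for f in self.features} for row in rows]


usage = FeatureTable(
    name="usage_features",
    features=[
        Feature("total_gb", lambda r: r["down_gb"] + r["up_gb"]),
        Feature("heavy_user", lambda r: r["down_gb"] + r["up_gb"] > 100),
    ],
    metadata={"owner": "example-team", "description": "Illustrative usage features"},
)

result = usage.apply([{"down_gb": 90, "up_gb": 20}, {"down_gb": 5, "up_gb": 1}])
```

The same `usage` definition can be applied to training data, a backfill, or a live batch without change, because nothing about the input is baked into it.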
Data Science Pipeline Components
• Model store
– Standardized approach for defining models
– A model is defined by train, predict, and evaluate functions, a hyperparameter set, and associated metadata
– The definitions of the train, predict, and evaluate functions are separated from their application
– A model may be associated with one or more trained instances, prediction data frames, and evaluation metrics
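The model definition described above could be sketched as follows. The `ModelDefinition` class and the toy threshold classifier are hypothetical and exist only to show train, predict, and evaluate functions bundled with hyperparameters and metadata, separate from the data they are applied to.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ModelDefinition:
    """Train, predict, and evaluate functions plus hyperparameters and metadata."""
    name: str
    train: Callable
    predict: Callable
    evaluate: Callable
    hyperparameters: dict
    metadata: dict = field(default_factory=dict)
    trained_instances: list = field(default_factory=list)

    def fit(self, xs, ys):
        # Applying the definition to data yields one trained instance.
        instance = self.train(xs, ys, **self.hyperparameters)
        self.trained_instances.append(instance)
        return instance


def train_threshold(xs, ys, margin=0.0):
    # Toy "training": place the threshold midway between the class means.
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 + margin}


def predict_threshold(instance, xs):
    return [1 if x > instance["threshold"] else 0 for x in xs]


def evaluate_accuracy(predictions, ys):
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)


model = ModelDefinition(
    name="toy_threshold",
    train=train_threshold,
    predict=predict_threshold,
    evaluate=evaluate_accuracy,
    hyperparameters={"margin": 0.0},
    metadata={"task": "classification"},
)

xs, ys = [1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]
instance = model.fit(xs, ys)
accuracy = model.evaluate(model.predict(instance, xs), ys)
```

Each call to `fit` appends another trained instance, mirroring the one-to-many relationship between a model and its trained instances.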
Data Science Pipeline Components
• Job Scheduler/Runner
– Handle streaming, scheduled, and one-time jobs
– Support interdependencies between jobs
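Job interdependencies are commonly handled by ordering jobs topologically. The sketch below uses Python's standard `graphlib` module; the job names are made up, and a real scheduler would also need to handle streaming and recurring jobs, which this does not attempt.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "build_features": {"ingest"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model"},
    "ingest": set(),
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
```

If the dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, which gives an early signal that the job graph cannot be scheduled.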
Data Science Pipeline Components
• File system
– Store executable objects such as jar files and notebooks
– Store data frames
– Store trained models and other runtime artifacts
Development Approach
• Build on top of Databricks and Spark
• Start with a “thin slice” proof of concept
– Demonstrate basic end to end run from data exploration to model evaluation
• Iterate to improve usability and tooling
What do we need to know about Feature Tables?
– Descriptive Information
• What data transformation does it perform?
• Who owns this feature table?
• Description of Input/output
– Build/deployment information
• Where’s the code? What’s the current version?
• What artifacts have been deployed to the production environment?
– Run information
• What jobs are running or have been run?
• What’s the status of these jobs?
• What data sets or streams are being produced and how do I access them?
• Are there performance metrics or summary statistics available?
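The three groups of questions above map naturally onto a metadata record. The sketch below is one possible shape, with every field name, path, and value invented for illustration; it serializes to JSON so it could sit in a document store.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class JobRun:
    job_id: str
    status: str          # for example "running", "succeeded", "failed"
    output_location: str


@dataclass
class FeatureTableMetadata:
    # Descriptive information
    name: str
    description: str
    owner: str
    inputs: list
    outputs: list
    # Build/deployment information
    code_location: str
    version: str
    deployed_artifacts: list = field(default_factory=list)
    # Run information
    runs: list = field(default_factory=list)


record = FeatureTableMetadata(
    name="usage_features",
    description="Aggregates per-account usage",
    owner="example-team",
    inputs=["raw_usage"],
    outputs=["total_gb"],
    code_location="git://example/usage_features",  # placeholder location
    version="0.1.0",
    runs=[JobRun("job-1", "succeeded", "/example/path/job-1")],
)

# asdict() recurses into nested dataclasses, so the result is JSON-ready.
document = json.dumps(asdict(record), sort_keys=True)
```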
What do we need to know about Models?
• Descriptive Information
– What does it do? Classification? Regression?
– Who owns this model?
– Is it supervised or unsupervised? What type of labels are required?
– What features does it use?
• Build/deployment information
– Where’s the code? What’s the current version?
– What artifacts have been deployed to the production environment?
• Run information
– What training / prediction / evaluation jobs are running or have run?
– What’s the status of these jobs?
– What data sets or streams are being produced and how do I access them?
– How well is the model performing? What criteria are being used to assess this?
What actions do we need to perform?
• Data exploration and development
• Packaging, versioning, and deployment of ML code
• Job scheduling and monitoring
• Storage/discovery/retrieval of job results
– Data frames
– Metrics
– Trained models
• Discovery of and interaction with Features and Models
What Technologies Already do this?
• Data exploration and development
– Databricks notebooks, local IDE
• Packaging, versioning, and deployment of code and metadata
– GitHub / Jenkins / Mortar
– Document store for metadata (MongoDB, Cassandra, etc.)
• Job scheduling and monitoring
– Airflow, Databricks Jobs API
• Storage of job results
– DBFS / S3, need to define standard file structure
• Discovery of and interaction with Features and Models
– Finding – Elasticsearch or existing Thin Slice API
– Reading/processing data frame artifacts – Spark, Databricks notebooks
– Retrieving/viewing performance metrics - ???
– Monitoring model performance over time - ???
– Algebraic composition of features and models - ???
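The note about needing a standard file structure for job results could be addressed with a single path-building helper. The layout below (`<root>/<kind>/<name>/<version>/<run_id>/<filename>`) is purely an assumption for illustration; the slides do not define one.

```python
from pathlib import PurePosixPath


def artifact_path(root, kind, name, version, run_id, filename):
    """Build a path as <root>/<kind>/<name>/<version>/<run_id>/<filename>."""
    allowed = {"features", "models", "metrics"}
    if kind not in allowed:
        raise ValueError(f"kind must be one of {sorted(allowed)}")
    return str(PurePosixPath(root, kind, name, version, run_id, filename))


path = artifact_path("/pipeline", "models", "ticket_priority", "v3", "run-42", "model.bin")
```

Routing every read and write through one helper like this keeps storage details (DBFS versus S3, root prefixes) out of individual jobs.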
Open questions
• How do we abstract file system details and other constants?
• How do we standardize ETL from other systems within Comcast?
• How do we support human-labeling of data sets?
• What other tools (H2O, R, etc.) do we need to support?
• Are there other ways we need to interact with features and models?
• How do we integrate AutoML?
• Other technologies that may be useful? Databricks Delta? Amazon Sagemaker?
Architecture V2: Pipelines
• Pipeline Segments (same as Spark ML)
– Transformers
– Estimators
• Pipeline: linear sequence of Pipeline Segments (same as Spark ML)
– Transformation pipelines contain only Transformers
– Estimation pipelines contain one or more Transformers and end with an Estimator
[Diagram: a Transformation Pipeline and an Estimation Pipeline, drawn as chains of nodes labeled D, T, and E (apparently DataFrame, Transformer, and Estimator)]
Architecture V2: Pipelines
• A pipeline is just a function
• It does not produce anything until supplied a specific DataFrame as input
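The "pipeline is just a function" idea can be sketched with ordinary function composition. Lists of dicts stand in for DataFrames here, and `make_pipeline` and the two transformers are hypothetical names, not part of Spark ML or the system described.

```python
from functools import reduce


def make_pipeline(*transformers):
    """Compose transformers left to right into a single function."""
    def pipeline(rows):
        return reduce(lambda data, t: t(data), transformers, rows)
    return pipeline


def drop_missing(rows):
    return [r for r in rows if r["latency_ms"] is not None]


def add_slow_flag(rows):
    return [{**r, "slow": r["latency_ms"] > 100} for r in rows]


# Defining the pipeline computes nothing yet.
clean_and_flag = make_pipeline(drop_missing, add_slow_flag)

# Output appears only when input data is supplied.
output = clean_and_flag([
    {"latency_ms": 250},
    {"latency_ms": None},
    {"latency_ms": 40},
])
```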
Architecture V2: Workflows
• A Workflow is a directed (acyclic?) graph of Pipelines
– DataSources load data (from disk, streams, etc.) into a DataFrame
– Connectors merge the output of multiple DataSources into a single DataFrame
– Pipelines process the DataFrames
– DataSinks receive the output of the last pipeline
• Workflow rules
– Workflows must end in a single Pipeline node
– An Estimator Pipeline may only appear as the last node in a Workflow
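The two workflow rules lend themselves to a simple structural check. The sketch below assumes a workflow is given as a mapping of node names to kinds plus a list of edges; it leaves DataSinks out of the graph and does not check for cycles. All names are illustrative.

```python
def validate_workflow(kinds, edges):
    """Return a list of rule violations for a workflow graph.

    kinds maps node name -> "source", "connector", "transformation", or "estimation".
    edges is a list of (upstream, downstream) pairs.
    """
    has_successor = {upstream for upstream, _ in edges}
    terminals = [n for n in kinds if n not in has_successor]
    problems = []
    pipeline_kinds = {"transformation", "estimation"}
    if len(terminals) != 1 or kinds[terminals[0]] not in pipeline_kinds:
        problems.append("workflow must end in a single pipeline node")
    for node, kind in kinds.items():
        if kind == "estimation" and node in has_successor:
            problems.append(f"estimation pipeline {node!r} is not the last node")
    return problems


kinds = {"d1": "source", "d2": "source", "c": "connector",
         "pt": "transformation", "pe": "estimation"}
good_edges = [("d1", "c"), ("d2", "c"), ("c", "pt"), ("pt", "pe")]
bad_edges = [("d1", "c"), ("d2", "c"), ("c", "pe"), ("pe", "pt")]
```

With `good_edges` the function returns an empty list; with `bad_edges` it reports that the estimation pipeline is not the last node.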
Architecture V2: Workflows
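[Diagram: a Transformation Workflow and a Training Workflow, drawn with nodes labeled D, C, T, PT, and PE (apparently DataSource, Connector, Transformer, Transformation Pipeline, and Estimation Pipeline)]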
Architecture V2: Data Sources
• Potential sources of data
– HTTP request
– Persistent store (avro, parquet, EDW, …)
– Kafka topic
– Others?
• Connectors could handle complex logic such as combining HTTP data with other sources before feeding into a Workflow
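A connector that combines HTTP request data with a persistent store might look like the sketch below. It assumes both sources have already been read into lists of dicts and share a key column; the function name, the inner-join behavior, and the sample fields are illustrative choices, not something the slides specify.

```python
def join_connector(request_rows, stored_rows, key):
    """Merge two sources into one list of rows by matching on a shared key."""
    stored_by_key = {row[key]: row for row in stored_rows}
    merged = []
    for row in request_rows:
        match = stored_by_key.get(row[key])
        if match is not None:  # inner join: unmatched requests are dropped
            merged.append({**match, **row})
    return merged


http_rows = [{"account": "a1", "symptom": "no signal"}, {"account": "zz", "symptom": "slow"}]
store_rows = [{"account": "a1", "modem": "m-100"}]
combined = join_connector(http_rows, store_rows, "account")
```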
Architecture V2
• Components
– Pipeline Segment Store
• Code catalog of available transforms, estimators, and pipelines
• Searchable by description, tags, and maybe by schema?
– “Find me a feature of type x that is tagged y”
– Workflow Store
• Stores DAGs (maybe Neo4j or other graph DB?)
• Integrates with Databricks to run DAGs as jobs
• Periodic graph analysis to optimize Workflows
– Data Source Store?
• Separate system or subset of Workflow Store?
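The "find me a feature of type x that is tagged y" query for the Pipeline Segment Store can be sketched as a filter over catalog entries. The `CatalogEntry` fields and the sample entries are assumptions; a production catalog would more likely sit behind a search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    kind: str            # "transformer", "estimator", or "pipeline"
    output_type: str     # simplified stand-in for a schema
    tags: frozenset
    description: str = ""


def find(catalog, output_type=None, tag=None, text=None):
    """Return names of entries that match every filter that was supplied."""
    results = []
    for entry in catalog:
        if output_type is not None and entry.output_type != output_type:
            continue
        if tag is not None and tag not in entry.tags:
            continue
        if text is not None and text.lower() not in entry.description.lower():
            continue
        results.append(entry.name)
    return results


catalog = [
    CatalogEntry("outage_count_7d", "transformer", "int", frozenset({"network"}), "Outages in the last 7 days"),
    CatalogEntry("avg_latency", "transformer", "float", frozenset({"network"}), "Mean latency"),
    CatalogEntry("tenure_months", "transformer", "int", frozenset({"customer"}), "Months as a customer"),
]
```

For example, `find(catalog, output_type="int", tag="network")` narrows the catalog to the single integer-valued feature tagged `network`.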

More Related Content

What's hot

Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Chris Bingham
 
Introduction to cloud computing
Introduction to cloud computingIntroduction to cloud computing
Introduction to cloud computing
Jithin Parakka
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Amazon Web Services
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
Opsta
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
India Quotient
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
Animesh Chaturvedi
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
Amazon Web Services
 
Google Cloud Platform (GCP)
Google Cloud Platform (GCP)Google Cloud Platform (GCP)
Google Cloud Platform (GCP)
Chetan Sharma
 
ITSM(IT Service Management)
ITSM(IT Service Management)ITSM(IT Service Management)
ITSM(IT Service Management)
Atlassian 대한민국
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
Amazon Web Services
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
Amazon Web Services
 
Cloud computing
Cloud computingCloud computing
Cloud computing
DebrajKarmakar
 
Building the business case for AWS
Building the business case for AWSBuilding the business case for AWS
Building the business case for AWS
Amazon Web Services
 
Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
k_tauhid
 
Cloud and dynamic infrastructure
Cloud and dynamic infrastructureCloud and dynamic infrastructure
Cloud and dynamic infrastructure
gaurav jain
 
CI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntCI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. Hunt
Databricks
 
Micro Focus Corporate Overview
Micro Focus Corporate OverviewMicro Focus Corporate Overview
Micro Focus Corporate Overview
Micro Focus
 
Cloud-migration-essentials.pdf
Cloud-migration-essentials.pdfCloud-migration-essentials.pdf
Cloud-migration-essentials.pdf
ALI ANWAR, OCP®
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
Amazon Web Services
 

What's hot (20)

Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
 
Introduction to cloud computing
Introduction to cloud computingIntroduction to cloud computing
Introduction to cloud computing
 
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
Building a Data Lake for Your Enterprise, ft. Sysco (STG309) - AWS re:Invent ...
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Google Cloud Platform (GCP)
Google Cloud Platform (GCP)Google Cloud Platform (GCP)
Google Cloud Platform (GCP)
 
ITSM(IT Service Management)
ITSM(IT Service Management)ITSM(IT Service Management)
ITSM(IT Service Management)
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
Airbus Goes Serverless with AWS to Improve Fleet Operations (MFG315) - AWS re...
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Building the business case for AWS
Building the business case for AWSBuilding the business case for AWS
Building the business case for AWS
 
Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
 
Cloud and dynamic infrastructure
Cloud and dynamic infrastructureCloud and dynamic infrastructure
Cloud and dynamic infrastructure
 
CI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. HuntCI/DC in MLOps by J.B. Hunt
CI/DC in MLOps by J.B. Hunt
 
Micro Focus Corporate Overview
Micro Focus Corporate OverviewMicro Focus Corporate Overview
Micro Focus Corporate Overview
 
Cloud-migration-essentials.pdf
Cloud-migration-essentials.pdfCloud-migration-essentials.pdf
Cloud-migration-essentials.pdf
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 

Similar to A machine learning and data science pipeline for real companies

An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
Jesus Rodriguez
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
IncQuery_presentation_Incose_EMEA_WSEC.pptx
IncQuery_presentation_Incose_EMEA_WSEC.pptxIncQuery_presentation_Incose_EMEA_WSEC.pptx
IncQuery_presentation_Incose_EMEA_WSEC.pptx
IncQuery Labs
 
Next-Generation Completeness and Consistency Management in the Digital Threa...
Next-Generation Completeness and Consistency Management in the Digital Threa...Next-Generation Completeness and Consistency Management in the Digital Threa...
Next-Generation Completeness and Consistency Management in the Digital Threa...
Ákos Horváth
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
Databricks
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...Elizabeth Steiner
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
MapR Technologies
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
DataScienceConferenc1
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
IncQuery Suite demo for INCOSE 2022IW
IncQuery Suite demo for INCOSE 2022IWIncQuery Suite demo for INCOSE 2022IW
IncQuery Suite demo for INCOSE 2022IW
IncQuery Labs
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Databricks
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
Databricks
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
Codecamp Romania
 

Similar to A machine learning and data science pipeline for real companies (20)

An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
IncQuery_presentation_Incose_EMEA_WSEC.pptx
IncQuery_presentation_Incose_EMEA_WSEC.pptxIncQuery_presentation_Incose_EMEA_WSEC.pptx
IncQuery_presentation_Incose_EMEA_WSEC.pptx
 
Next-Generation Completeness and Consistency Management in the Digital Threa...
Next-Generation Completeness and Consistency Management in the Digital Threa...Next-Generation Completeness and Consistency Management in the Digital Threa...
Next-Generation Completeness and Consistency Management in the Digital Threa...
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...
Knowledge-Based Analysis and Design (KBAD): An Approach to Rapid Systems Engi...
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
IncQuery Suite demo for INCOSE 2022IW
IncQuery Suite demo for INCOSE 2022IWIncQuery Suite demo for INCOSE 2022IW
IncQuery Suite demo for INCOSE 2022IW
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 




A machine learning and data science pipeline for real companies

  • 1. A Data Science Pipeline for Real Companies: Comcast’s Approach to Multi-datacenter, Cloud and On-premise Machine Learning
  • 2. “Comcast brings together the best in media and technology. We drive innovation to create the world's best entertainment and online experiences.” High Speed Internet, Video, Home Automation, Digital Voice, Xfinity Mobile, Content. $84b (2017). 29m Customers.
  • 5. Use Cases: It’s All About The Customer • Predictive network analysis • Customer premise self-healing • Comcast network self-healing • Trouble-ticket prioritization • Customer self-help (voice and text flows) • Customer Retention
  • 6. Our starting point! [Diagram: Internal Data Centers; Cloud Based Infrastructure; E1, E2, E3; Predictions; Next Big Thing] Where’s the data? How do I access it? What tools do I have? Where can I find information about data?
  • 11. Our Challenges: Security; Diversity of Skills; Discoverability
  • 12. Guiding Principles: FAST: Provide frameworks and capabilities that allow for rapid deployment. SIMPLE & TRANSPARENT: Develop capabilities to promote self-service and ease of access to data. CONSISTENT & SECURE: Provide a universal security framework to govern all data under the Big Data Domain. FULLY AUTOMATED: Provide a robust operational model allowing for playback, data quality, and self-healing. Product Vision: Gather, organize, make sense of Comcast data, and make it universally accessible to empower, enable, and transform Comcast into an insight-driven organization.
  • 13. Approach • Avoid religious wars where possible • Whatever framework makes sense for the business problem at hand • Focus on federated access to curated data • Focus on Common APIs for Ingest, Egress and Machine Learning • Focus on metadata and discoverability for: • Enterprise Data • Enterprise Features • Trained Models • Enterprise Portal • Containerized scoring endpoints that accommodate multiple frameworks and models
  • 14. Focus on a Common Approach to Features and Models. Shameless Lyft: Uber’s Michelangelo becomes Comcast’s Da Vinci • Focus on Art AND Science (and a smattering of creativity) • Common APIs usable from multiple frameworks using Python or Scala • Metadata is Key – About data – About features – About trained models
  • 15. [Architecture diagram: Ingest API; Egress API (Federation Layer); Feature Store; Model Store; Apache Atlas; Portal; On Premise and Cloud; <Your Favorite Framework Here>; tools such as Presto and Alluxio; Scala and Python; Open ML API; Training; Deployment; Containers; Client]
  • 16. Open ML API • Reads and writes features and feature metadata • Reads and writes model metadata • Integrates with a common portal searchable by any user
  • 18. The Data Science Pipeline DX Alpha
  • 19. Goals • Develop a system to manage features and models running in Spark – Based on Uber’s Michelangelo • Make it easier to build and deploy data transformations and ML models • Enhance sharing of code across data science teams • Support a variety of data science toolkits
  • 20. Data Science Pipeline Components • Feature Store – Standardized approach to define data transformations – A feature is a single attribute or column in a data frame – A feature table is a set of features combined with metadata – The transformation definition is separated from the “context” in which it is applied
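The separation between a transformation definition and the context it is applied in can be sketched in a few lines. This is a minimal illustration only: plain Python lists of dicts stand in for Spark data frames, and the names `Feature`, `FeatureTable`, `modem_usage`, and the column names are hypothetical, not part of the system described in the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class Feature:
    """A single derived column: a name plus a transformation definition."""
    name: str
    transform: Callable[[Row], float]  # definition only; no data is bound here
    description: str = ""


@dataclass
class FeatureTable:
    """A set of features combined with metadata (hypothetical fields)."""
    name: str
    owner: str
    features: List[Feature] = field(default_factory=list)

    def apply(self, rows: List[Row]) -> List[Row]:
        # The "context" is whatever data set the caller passes in.
        return [{f.name: f.transform(row) for f in self.features} for row in rows]


usage_table = FeatureTable(
    name="modem_usage",
    owner="network-analytics",
    features=[
        Feature("total_gb", lambda r: r["down_gb"] + r["up_gb"], "Total traffic in GB"),
        Feature("up_ratio", lambda r: r["up_gb"] / (r["down_gb"] + r["up_gb"]), "Upstream share"),
    ],
)

# The same definitions are reused in two different contexts.
training_features = usage_table.apply([{"down_gb": 8.0, "up_gb": 2.0}])
scoring_features = usage_table.apply([{"down_gb": 3.0, "up_gb": 1.0}])
print(training_features)
print(scoring_features)
```

Because the table holds only definitions, the same `usage_table` can be applied to training data and to scoring data without change.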
  • 21. Data Science Pipeline Components • Model store – Standardized approach for defining models – A model is defined by train, predict, and evaluate functions, a hyperparameter set, and associated metadata – The definitions of the train, predict, and evaluate functions are separated from their application – A model may be associated with one or more trained instances, prediction data frames, and evaluation metrics
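One way to picture a model definition of this shape is a record that bundles the three functions with hyperparameters and metadata, kept apart from any trained instance. The sketch below is an assumption-laden illustration: `ModelDefinition`, `TrainedInstance`, `fit`, and the toy mean-baseline model are all hypothetical names, and plain lists replace data frames.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence


@dataclass
class ModelDefinition:
    """Train/predict/evaluate functions plus hyperparameters and metadata."""
    name: str
    train: Callable[[Sequence[float], Sequence[float], Dict[str, Any]], Any]
    predict: Callable[[Any, Sequence[float]], List[float]]
    evaluate: Callable[[Sequence[float], Sequence[float]], Dict[str, float]]
    hyperparameters: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class TrainedInstance:
    """One application of a definition to concrete data."""
    definition: ModelDefinition
    state: Any
    metrics: Dict[str, float] = field(default_factory=dict)


def train_mean(xs, ys, hp):
    # Toy model: the mean label, optionally shrunk toward zero.
    return (sum(ys) / len(ys)) * (1.0 - hp.get("shrinkage", 0.0))


def predict_mean(state, xs):
    return [state for _ in xs]


def evaluate_mae(y_true, y_pred):
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {"mae": sum(errors) / len(errors)}


baseline = ModelDefinition(
    name="mean_baseline",
    train=train_mean,
    predict=predict_mean,
    evaluate=evaluate_mae,
    hyperparameters={"shrinkage": 0.0},
    metadata={"owner": "data-science", "task": "regression"},
)


def fit(definition: ModelDefinition, xs, ys) -> TrainedInstance:
    """Apply a definition to data, producing a trained instance with metrics."""
    state = definition.train(xs, ys, definition.hyperparameters)
    predictions = definition.predict(state, xs)
    return TrainedInstance(definition, state, definition.evaluate(ys, predictions))


instance = fit(baseline, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(instance.state, instance.metrics)
```

A single definition can be passed to `fit` repeatedly, which matches the idea that one model may have several trained instances and metric sets.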
  • 22. Data Science Pipeline Components • Job Scheduler/Runner – Handle streaming, scheduled, and one-time jobs – Support interdependencies between jobs
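Interdependencies between jobs amount to a dependency graph that the runner must order before execution. As a rough sketch, the standard-library `graphlib.TopologicalSorter` (Python 3.9+) can compute such an order; the job names below are invented for illustration and nothing here reflects the scheduler the talk actually used.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
dependencies: Dict[str, Set[str]] = {
    "ingest_events": set(),
    "build_features": {"ingest_events"},
    "train_model": {"build_features"},
    "score_customers": {"build_features", "train_model"},
}


def run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order, or raise CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)

# A circular dependency is rejected rather than scheduled.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```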
  • 23. Data Science Pipeline Components • File system – Store executable objects such as jar files and notebooks – Store data frames – Store trained models and other runtime artifacts
  • 24. Development Approach • Build on top of Databricks and Spark • Start with a “thin slice” proof of concept – Demonstrate basic end to end run from data exploration to model evaluation • Iterate to improve usability and tooling
  • 25. What do we need to know about Feature Tables? • Descriptive Information – What data transformation does it perform? – Who owns this feature table? – Description of input/output • Build/deployment information – Where’s the code? What’s the current version? – What artifacts have been deployed to the production environment? • Run information – What jobs are running or have been run? – What’s the status of these jobs? – What data sets or streams are being produced and how do I access them? – Are there performance metrics or summary statistics available?
  • 26. What do we need to know about Models? • Descriptive Information – What does it do? Classification? Regression? – Who owns this model? – Is it supervised or unsupervised? What type of labels are required? – What features does it use? • Build/deployment information – Where’s the code? What’s the current version? – What artifacts have been deployed to the production environment? • Run information – What training / prediction / evaluation jobs are running or have run? – What’s the status of these jobs? – What data sets or streams are being produced and how do I access them? – How well is the model performing? What criteria are being used to assess this?
  • 27. What actions do we need to perform? • Data exploration and development • Packaging, versioning, and deployment of ML code • Job scheduling and monitoring • Storage/discovery/retrieval of job results – Data frames – Metrics – Trained models • Discovery of and interaction with Features and Models
  • 28. What Technologies Already Do This? • Data exploration and development – Databricks notebooks, local IDE • Packaging, versioning, and deployment of code and metadata – GitHub / Jenkins / Mortar – Document store for metadata (MongoDB, Cassandra, etc.) • Job scheduling and monitoring – Airflow, Databricks Jobs API • Storage of job results – DBFS / S3, need to define standard file structure • Discovery of and interaction with Features and Models – Finding – Elasticsearch or existing Thin Slice API – Reading/processing data frame artifacts – Spark, Databricks notebooks – Retrieving/viewing performance metrics - ??? – Monitoring model performance over time - ??? – Algebraic composition of features and models - ???
  • 29. Open questions • How do we abstract file system details and other constants? • How do we standardize ETL from other systems within Comcast? • How do we support human-labeling of data sets? • What other tools (H2O, R, etc.) do we need to support? • Are there other ways we need to interact with features and models? • How do we integrate AutoML? • Other technologies that may be useful? Databricks Delta? Amazon SageMaker?
  • 30. Architecture V2: Pipelines • Pipeline Segments (same as Spark ML) – Transformers – Estimators • Pipeline: linear sequence of Pipeline Segments (same as Spark ML) – Transformation pipelines contain only Transformers – Estimation pipelines contain one or more Transformers and end with an Estimator [Diagram: a Transformation Pipeline and an Estimation Pipeline drawn as chains of nodes labeled T, E, and D]
  • 31. Architecture V2: Pipelines • A pipeline is just a function • It does not produce anything until it is supplied with a specific DataFrame as input
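The idea that a pipeline is just a function can be shown with ordinary function composition. This sketch assumes a list of dicts in place of a Spark DataFrame, and `pipeline`, `drop_missing`, and `add_seconds` are hypothetical helpers, not Spark ML classes.

```python
from functools import reduce
from typing import Callable, Dict, List, Optional

Frame = List[Dict[str, Optional[float]]]
Transformer = Callable[[Frame], Frame]


def pipeline(*segments: Transformer) -> Transformer:
    """Compose segments left to right into one Frame -> Frame function."""
    def run(frame: Frame) -> Frame:
        return reduce(lambda acc, segment: segment(acc), segments, frame)
    return run


def drop_missing(frame: Frame) -> Frame:
    # Keep only rows that have a latency measurement.
    return [row for row in frame if row.get("latency_ms") is not None]


def add_seconds(frame: Frame) -> Frame:
    # Derive a new column from an existing one.
    return [{**row, "latency_s": row["latency_ms"] / 1000.0} for row in frame]


# Building the pipeline touches no data; it only yields a function.
clean = pipeline(drop_missing, add_seconds)

# Output appears only once a concrete frame is supplied.
result = clean([{"latency_ms": 250.0}, {"latency_ms": None}])
print(result)
```

Since `clean` is a plain callable, the same pipeline can be handed different frames (batch, sample, or test data) without being rebuilt.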
  • 32. Architecture V2: Workflows • A Workflow is a directed (acyclic?) graph of Pipelines – DataSources load data (from disk, streams, etc.) into a DataFrame – Connectors merge the output of multiple DataSources into a single DataFrame – Pipelines process the DataFrames – DataSinks receive the output of the last pipeline • Workflow rules – Workflows must end in a single Pipeline node – An Estimator Pipeline may only appear as the last node in a Workflow
  • 33. Architecture V2: Workflows [Diagram: a Transformation Workflow and a Training Workflow drawn as graphs of nodes labeled D, C, T, PT, and PE]
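The two workflow rules (end in a single Pipeline node; an Estimator Pipeline only as the last node) lend themselves to a mechanical check. The following is a sketch under assumptions: the node kinds, the `validate_workflow` function, and the edge representation are invented here, and the graph is assumed to be acyclic as the slides tentatively suggest.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

PIPELINE_KINDS = {"transform_pipeline", "estimator_pipeline"}


def validate_workflow(kinds: Dict[str, str], edges: Dict[str, Set[str]]) -> List[str]:
    """Return rule violations; an empty list means the workflow passes.

    kinds maps node name -> kind; edges maps node name -> downstream nodes.
    Every node named in edges is assumed to appear in kinds.
    """
    problems: List[str] = []

    # Build a predecessor map, the input shape TopologicalSorter expects.
    predecessors: Dict[str, Set[str]] = {node: set() for node in kinds}
    for source, targets in edges.items():
        for target in targets:
            predecessors[target].add(source)
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        problems.append("workflow contains a cycle")

    # Rule: the workflow must end in a single Pipeline node.
    terminals = [node for node in kinds if not edges.get(node)]
    if len(terminals) != 1 or kinds[terminals[0]] not in PIPELINE_KINDS:
        problems.append("workflow must end in a single Pipeline node")

    # Rule: an Estimator Pipeline may only be the last node.
    for node, kind in kinds.items():
        if kind == "estimator_pipeline" and edges.get(node):
            problems.append(f"estimator pipeline {node!r} is not the last node")
    return problems


kinds = {"src_a": "source", "src_b": "source", "join": "connector",
         "prep": "transform_pipeline", "fit": "estimator_pipeline"}
training_edges = {"src_a": {"join"}, "src_b": {"join"}, "join": {"prep"}, "prep": {"fit"}}
valid_problems = validate_workflow(kinds, training_edges)

# Moving the estimator ahead of the transformation pipeline breaks the second rule.
bad_edges = {"src_a": {"join"}, "src_b": {"join"}, "join": {"fit"}, "fit": {"prep"}}
bad_problems = validate_workflow(kinds, bad_edges)
print(valid_problems, bad_problems)
```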
  • 34. Architecture V2: Data Sources • Potential sources of data – HTTP request – Persistent store (Avro, Parquet, EDW, …) – Kafka topic – Others? • Connectors could handle complex logic such as combining HTTP data with other sources before feeding into a Workflow
  • 35. Architecture V2 • Components – Pipeline Segment Store • Code catalog of available transforms, estimators, and pipelines • Searchable by description, tags, and maybe by schema? – “Find me a feature of type x that is tagged y” – Workflow Store • Stores DAGs (maybe Neo4j or other graph DB?) • Integrates with Databricks to run DAGs as jobs • Periodic graph analysis to optimize Workflows – Data Source Store? • Separate system or subset of Workflow Store?
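A query such as “Find me a feature of type x that is tagged y” could be served by a simple filter over catalog entries. The sketch below uses an in-memory list purely for illustration; `SegmentEntry`, `find_segments`, and the sample entries are hypothetical, and a real store would more likely sit behind a search index.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class SegmentEntry:
    """One catalog record for a transform, estimator, or pipeline."""
    name: str
    kind: str          # "transformer", "estimator", or "pipeline"
    output_type: str   # type of the value the segment produces
    tags: FrozenSet[str]
    description: str = ""


def find_segments(catalog: Iterable[SegmentEntry],
                  output_type: Optional[str] = None,
                  tag: Optional[str] = None,
                  text: Optional[str] = None) -> List[SegmentEntry]:
    """Return entries matching every filter that was supplied."""
    hits: List[SegmentEntry] = []
    for entry in catalog:
        if output_type is not None and entry.output_type != output_type:
            continue
        if tag is not None and tag not in entry.tags:
            continue
        if text is not None and text.lower() not in entry.description.lower():
            continue
        hits.append(entry)
    return hits


catalog = [
    SegmentEntry("outage_minutes", "transformer", "double",
                 frozenset({"network", "reliability"}),
                 "Minutes of outage per device per day"),
    SegmentEntry("ticket_count", "transformer", "int",
                 frozenset({"care"}),
                 "Trouble tickets opened in the last 30 days"),
    SegmentEntry("churn_classifier", "estimator", "double",
                 frozenset({"retention"}),
                 "Predicts probability of customer churn"),
]

# "Find me a feature of type double that is tagged network."
matches = find_segments(catalog, output_type="double", tag="network")
print([entry.name for entry in matches])
```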

Editor's Notes

  1. We are a technology, entertainment and media company focused on delivering the best customer experience.