Bighead
Airbnb’s End-to-End Machine Learning Infrastructure
Andrew Hoh and Krishna Puttaswamy
ML Infra @ Airbnb
In 2016
● Only a few major models in production
● Models took on average 8 to 12 weeks to build
● Everything was built in Aerosolve, Spark, and Scala
● No support for TensorFlow, PyTorch, scikit-learn, or other popular ML packages
● Significant discrepancies between offline and online data
ML Infra was formed with the charter to:
● Enable more users to build ML products
● Reduce time and effort
● Enable easier model evaluation
Q4 2016: Formation of our ML Infra team
Before ML Infrastructure
ML has had a massive impact on Airbnb’s product
● Search Ranking
● Smart Pricing
● Fraud Detection
After ML Infrastructure
But there were many other areas with high potential for ML that had yet to realize it.
● Paid Growth - Hosts
● Classifying listings
● Room Type Categorizations
● Experience Ranking + Personalization
● Host Availability
● Business Travel Classifier
● Make Listing a Space Easier
● Customer Service Ticket Routing
● … And many more
Vision
Airbnb routinely ships ML-powered features throughout the
product.
Mission
Equip Airbnb with shared technology to build
production-ready ML applications with no incidental
complexity.
(Technology = tools, platforms, knowledge, shared feature data, etc.)
Value of ML Infrastructure
Machine Learning Infrastructure can:
● Remove incidental complexity by providing generic, reusable solutions
● Simplify the workflow by providing tooling, libraries, and environments that make ML development more efficient
And at the same time:
● Establish a standardized platform that enables cross-company sharing of feature data and model components
● “Make it easy to do the right thing” (ex: consistent training/streaming/scoring logic)
Bighead: Motivations
Learnings:
● No consistency between ML workflows
● New teams struggle to begin using ML
● Airbnb has a wide variety of ML use cases
● Existing ML workflows are slow, fragmented, and brittle
● Incidental complexity vs. intrinsic complexity
● Build and forget - ML treated as a linear process
Q1 2017: Figuring out what to build
Architecture
● Consistent environment across the stack with Docker
● Consistent data transformation (a short sketch follows this slide)
○ Multi-row aggregation happens in the warehouse; single-row transformation is part of the model
○ Model transformation code is the same online and offline
● Common workflow across different ML frameworks
○ Supports scikit-learn, TF, PyTorch, etc.
● Modular components
○ Easy to customize parts
○ Easy to share data/pipelines
Key Design Decisions
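As a rough illustration of the "consistent data transformation" decision above, here is a minimal sketch of keeping the single-row transformation inside the model object so the exact same Python code runs offline during training and online at scoring time. The names and toy features are assumptions, not Bighead code.

```python
# Conceptual sketch: the single-row transformation ships with the model,
# so offline training and online scoring share the same code path.
# Feature names and the toy logic are illustrative assumptions.
import math


def transform_row(raw):
    """Single-row feature transformation owned by the model."""
    return [math.log1p(raw["num_reviews"]), raw["price"] / 100.0]


class TransformingModel:
    def __init__(self, estimator):
        self.estimator = estimator  # e.g. a scikit-learn or XGBoost classifier

    def fit(self, raw_rows, labels):
        # Offline: the warehouse has already done multi-row aggregation;
        # only the per-row transformation is applied here.
        self.estimator.fit([transform_row(r) for r in raw_rows], labels)
        return self

    def score(self, raw_row):
        # Online: the serving service calls this with a single row.
        return self.estimator.predict_proba([transform_row(raw_row)])[0][1]
```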
Bighead Architecture
(Architecture diagram.) Offline and online clients; Bighead Service, Bighead UI, and a deployment app for ML models; Deep Thought Service for online inference; ML Automator (with Airflow) and an ML DAG generator for offline work; Redspot; BigQueue Service and workers; Docker Image Service; the Zipline feature repository, Zipline features git repo, and Zipline data (K/V store); user ML models in a git repo, built on the Bighead library.
Components
● Data Management: Zipline
● Training: Redspot / BigQueue
● Core ML Library: ML Pipeline
● Productionisation: Deep Thought (online) / ML Automator (offline)
● Model Management: Bighead service
Zipline (ML Data Management Framework)
Zipline - Why
● Defining features (especially windowed ones) with Hive was complicated and error-prone
● Backfilling training sets (on inefficient Hive queries) was a major bottleneck
● No feature sharing
● Inconsistent offline and online datasets
● The warehouse is built as of end-of-day and lacked point-in-time features
● ML data pipelines lacked data quality checks and monitoring
● Ownership of pipelines was in disarray
For information on Zipline, please watch the recording of our other
Spark Summit session:
Zipline: Airbnb’s Machine Learning Data
Management Platform
Redspot (Hosted Jupyter Notebook Service)
Bighead Architecture (architecture diagram repeated; see the component summary above)
● Started with JupyterHub (an open-source project), which manages multiple Jupyter notebook servers (our prototyping environment)
● But users were installing packages locally and then creating virtualenvs for other parts of our infra
○ The environment was very fragile
● Users wanted to run JupyterHub on larger instances or instances with GPUs
● Wanting to share notebooks with teammates was common too
● Files/content needed to be resilient to node failures
Redspot - Why
Containerized environments
● Every user’s environment is containerized via Docker
○ Allows customizing the notebook environment without affecting other users
■ e.g. installing system/Python packages
○ Easier to restore state, which helps with reproducibility
● Supports using custom Docker images
○ Base images tailored to the user’s needs
■ e.g. GPU access, pre-installed ML packages
Redspot
Remote Instance Spawner
● For bigger jobs and total isolation, Redspot allows launching a dedicated instance
● Hardware resources are not shared with other users
● Idle instances are automatically terminated periodically
Redspot deployment (diagram): JupyterHub runs on redspot-standalone with a local Docker daemon hosting per-user containers (x91), backed by MySQL and EFS; remote instances (x32) run redspot-singleuser and redspot-singleuser-gpu images via their own Docker daemons, up to X1e.32xlarge (128 vCPUs, 3.9 TB); data backends include S3, Hive, Presto, and Spark; users connect through JupyterHub.
● Native Dockerfiles enforce strict single inheritance.
○ Prevents composition of base images
○ Can lead to copy/pasting Dockerfile snippets around
● A git repo of Dockerfiles for each stage, plus a YAML file expressing:
○ Pre-build/post-build commands
○ Build-time/runtime dependencies (mounted directories, Docker runtime)
● Image builder:
○ A build-flow tool for chaining stages to produce a single image
○ Builds independent images in parallel
Docker Image Repo/Service
ubuntu16.04-py3.6-cuda9-cudnn7:
  base: ubuntu14.04-py3.6
  description: "A base Ubuntu 16.04 image with python 3.6, CUDA 9, CUDNN 7"
  stages:
    - cuda/9.0
    - cudnn/7
  args:
    python_version: '3.6'
    cuda_version: '9.0'
    cudnn_version: '7.0'
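The slides don’t show the image builder itself; the following is only a sketch of what “chaining stages to produce a single image” could look like. The per-stage file layout (one Dockerfile snippet per stage directory) and helper names are assumptions made for illustration, not Airbnb’s actual tool.

```python
# Hypothetical sketch of a stage-chaining image builder. The file layout
# ("<stage>/Dockerfile.partial") and config keys mirror the YAML example
# above but are assumptions.
from pathlib import Path
from string import Template


def render_stage(repo_root, stage, args):
    """Load one stage's Dockerfile snippet and substitute build args."""
    snippet = (repo_root / stage / "Dockerfile.partial").read_text()
    return Template(snippet).safe_substitute(args)


def build_dockerfile(repo_root, image_cfg):
    """Chain the base image and each stage snippet into a single Dockerfile."""
    root = Path(repo_root)
    parts = ["FROM " + image_cfg["base"]]
    for stage in image_cfg.get("stages", []):
        parts.append(render_stage(root, stage, image_cfg.get("args", {})))
    return "\n".join(parts)

# Example (assuming a repo of stage snippets exists on disk):
# dockerfile = build_dockerfile("docker-images", {
#     "base": "ubuntu14.04-py3.6",
#     "stages": ["cuda/9.0", "cudnn/7"],
#     "args": {"cuda_version": "9.0", "cudnn_version": "7.0"},
# })
```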
● A multi-tenant notebook environment
● Makes it easy to iterate on and prototype ML models, and to share work
○ Integrated with the rest of our infra, so one can deploy a notebook to prod
● Improved upon open-source JupyterHub
○ Containerized; users can bring a custom Docker env
○ Remote notebook spawner for dedicated instances (P3 and X1 machines on AWS)
○ Notebooks persisted in EFS and shared with teams
○ Reverting to a prior checkpoint
● Supports 200+ weekly active users
Redspot Summary
Bighead Library
Bighead Library - Why
● Transformations (NLP, images) are often re-written by different users
● No clear abstraction for data transformation in the model
○ Every user can do data processing in a different way, leading to confusion
○ Users can easily write inefficient code
○ Loss of feature metadata during transformation (can’t plot feature importance)
○ Special handling needed for CPU/GPU
○ No visualization of transformations
● Visualizing and understanding input data is key
○ But there are few good libraries to do so
● A library of transformations; holds more than 100 different transformations, including automated preprocessing for common input formats (NLP, images, etc.)
● A Pipeline abstraction to build a DAG of transformations over input data (a usage sketch follows this slide)
○ Propagates feature metadata so we can plot feature importance at the end and connect it to feature names
○ Pipelines for data processing are reusable in other pipelines
○ Feature-parallel and data-parallel transformations
○ CPU and GPU support
○ Supports scikit-learn APIs
● Wrappers for model frameworks (XGBoost, TF, etc.) so they can be easily serialized/deserialized (robust to minor version changes)
● Provides training data visualization to help identify data issues
Bighead Library
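Since the library exposes scikit-learn-style APIs, plain scikit-learn gives a rough feel for the pipeline abstraction described above: transformations composed ahead of a model, then serialized as one object. The column names and estimator choice below are illustrative, not Bighead’s actual API.

```python
# Illustrative only: plain scikit-learn standing in for Bighead's
# scikit-style ML Pipeline. Column names are made up.
import pickle

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# DAG-like composition: per-column transformations feeding a model.
features = ColumnTransformer([
    ("description_tfidf", TfidfVectorizer(max_features=1000), "description"),
    ("numeric_scaled", StandardScaler(), ["price", "num_reviews"]),
])

pipeline = Pipeline([
    ("features", features),
    ("model", GradientBoostingClassifier()),
])

# pipeline.fit(train_df, labels)
# scores = pipeline.predict_proba(serve_df)[:, 1]
# Serializing the whole pipeline keeps transformations and model together,
# which is what makes offline/online scoring consistent:
# blob = pickle.dumps(pipeline)
```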
Bighead Library: ML Pipeline
(Screenshot slides.) ML Pipeline serial transformation visualization; ML parallel visualization; feature importance - metadata preserved through transformations; easy to serialize/deserialize; training data visualization.
Productionisation: Deep Thought (Online Inference Service)
Bighead Architecture (architecture diagram repeated; see the component summary above)
● Performant, scalable execution of model inference in production is hard
○ Engineers shouldn’t build one-off solutions for every model
○ Data scientists should be able to launch new models in production with minimal eng involvement
● Debugging differences between online inference and training is difficult
○ We should support the exact serialized version of the model the data scientist built
○ We should be able to run the same Python transformations data scientists write for training
○ We should be able to easily load data computed in the warehouse or via streaming into online scoring
Deep Thought - Why
● Deep Thought is a shared service for online inference
○ Supports all frameworks integrated in ML Pipeline
○ Deployment is completely config-driven, so data scientists don’t have to involve engineers to launch new models
○ Engineers can then connect to a REST API from other services to get scores (illustrated below)
○ Support for loading data from K/V stores
○ Standardized logging, alerting, and dashboarding for monitoring and offline analysis of model performance
○ Isolation to enable multi-tenancy
○ Scalable and reliable: 100+ models. Highest-QPS service at Airbnb. Median response time: 4ms. p95: 13ms.
Deep Thought - How
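The deck doesn’t specify the REST interface. As a hedged illustration of “other services call a REST API to get scores”, a client might look roughly like this; the host, route, and payload shape are assumptions, not the real Deep Thought API.

```python
# Hypothetical client call to an online-inference service such as Deep
# Thought. Host, route, and payload schema are illustrative assumptions.
import requests


def get_score(model_name, features, timeout_s=0.1):
    """Request a score for a single entity from the inference service."""
    resp = requests.post(
        "https://deep-thought.example.com/v1/models/%s/score" % model_name,
        json={"features": features},
        timeout=timeout_s,  # with a ~13ms p95, a tight client timeout is viable
    )
    resp.raise_for_status()
    return resp.json()["score"]

# Example: get_score("listing_quality", {"listing_id": 123, "num_reviews": 42})
```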
Productionisation: ML Automator (Offline Training and Inference Service)
● Tools and services to automate common (mundane) tasks
○ Periodic training, evaluation, and offline scoring of models
■ Get a specified amount of resources (GPU/memory) for these tasks
○ Uploading scores to K/V stores
○ Dashboards on scores, alerts on score changes
○ Orchestration of these tasks (via Airflow in our case)
○ Scoring on large tables is tricky to scale
ML Automator - Why
● Automate such tasks via configuration (a sketch of config-driven DAG generation follows this slide)
○ We generate Airflow DAGs automatically to train, score, evaluate, upload scores, etc. with appropriate resources
● Built custom tools to train/score on Spark for large datasets
○ Tools to get training data to the training machine quickly
○ A tool to generate a virtualenv (equivalent to a specified Docker image) and run executors in it, since (our version of) YARN doesn’t run executors in a Docker image
ML Automator
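The config schema isn’t shown in the deck. As a hedged sketch of “Airflow DAGs generated automatically from configuration”, a generator might look roughly like this; the config keys, the placeholder task callables, and the Airflow 2.x import path are assumptions.

```python
# Minimal sketch of config-driven Airflow DAG generation in the spirit of
# ML Automator. Config keys and the placeholder callables are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train(**context): ...
def score(**context): ...
def upload_scores(**context): ...

STEPS = {"train": train, "score": score, "upload_scores": upload_scores}


def build_dag(config):
    """Turn a small model config into a linear train -> score -> upload DAG."""
    dag = DAG(
        dag_id="ml_automator_" + config["model_name"],
        schedule_interval=config.get("schedule", "@daily"),
        start_date=datetime(2018, 1, 1),
        catchup=False,
    )
    previous = None
    for step in config["steps"]:
        task = PythonOperator(task_id=step, python_callable=STEPS[step], dag=dag)
        if previous is not None:
            previous >> task
        previous = task
    return dag


dag = build_dag({
    "model_name": "listing_quality",
    "steps": ["train", "score", "upload_scores"],
})
```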
Bighead Service
● Model management is hard
○ Need a single source of truth to track model history
● Model reproducibility is hard
○ Models are trained on developer laptops and put into production
○ Model artifacts aren’t tagged with the model code’s git sha
● Model monitoring is done in many places
○ Evaluation metrics are stored in IPython notebooks
○ Online scores and model features are monitored separately
○ Online and offline scoring may not use a consistent version
Bighead Service - Why
Bighead Service Overview
Bighead’s model management service
● Contains prototype and production models
● Can serve models “raw” or trained
● The source of truth for which trained models are in production
● Stores model health data
Bighead Service Internals
We decompose models into two components:
● Model Version - raw model code + a Docker image
● Model Artifact - parameters learned via training
(Diagram) A trained model consists of: Model Version (code + Docker image) + Model Artifact; this pairing is what goes to production.
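A tiny sketch of that decomposition as data structures; the class and field names are assumptions made for illustration, not Bighead’s actual schema.

```python
# Illustrative data model for the Model Version / Model Artifact split.
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    code_git_sha: str    # raw model code pinned to a commit
    docker_image: str    # environment the model code always runs in


@dataclass(frozen=True)
class ModelArtifact:
    version_name: str
    parameters_uri: str  # parameters learned via training, e.g. a blob-store path
    trained_at: str


# A deployable trained model is the pairing of a ModelVersion with a
# ModelArtifact produced by training that version.
```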
ML models have diverse dependency sets (TensorFlow, XGBoost, etc.). We allow users to provide a Docker image within which their model code always runs.
ML models don’t run in isolation, however, so we’ve built a lightweight API to interact with the “dockerized model”.
(Diagram) Other ML Infra services talk to the model (user code) inside its Docker container through the Model API.
Dockerized Models
Our built-in UI provides:
● Deployment - review changes, deploy, and roll back trained models
● Model Health - metrics, visualizations, alerting, a central dashboard
● Experimentation - the ability to set up model experiments, e.g. split traffic between two or more models
Bighead: UI
Bighead Architecture (diagram detail, zoomed in on model management): Deployment App for ML Models, ML Models git repo, Redspot, Bighead Service, User ML Model, Bighead library, Bighead UI.
Bighead Service/UI
● Bighead’s central model management service
● The “Deployboard” for trained ML models - i.e. the sole source of truth about what model is deployed
● End-to-end platform to build and deploy ML models to production
● Built on open-source technology
○ Spark, Jupyter, Kubernetes, scikit-learn, TF, XGBoost, etc.
● But we had to fill various gaps in the path to productionisation
○ Generic online and offline inference service (that supports different frameworks)
○ Feature generation and management framework
○ Model data transformation library
○ Model and data visualization libraries
○ Docker image customization service
○ Multi-tenant training environment
Bighead Summary
Plan to Open Source Soon
If you want to collaborate, come talk to andrew.hoh@airbnb.com