Big Data ML Platform at Pinterest
Yongsheng Wu
Pinterest: pinterest.com/yswu
LinkedIn: linkedin.com/in/yongshengwu
Twitter: @yswu
06/17/2019
Pinterest :
The World’s Catalog of Ideas
Mission
Help people discover and do
what they love.
Scale@Pinterest
Service Scale
• 300M+ MAUs
• 120B+ Pins
• 3B+ Boards
Big Data Scale
• 300+ PB on S3
• 6000+ Hive/Hadoop nodes
• 400+ Presto nodes
• 1000+ Spark nodes
Mission & Vision
Principles
Current Status
Key Technologies
Future Plan
Mission
Provide a highly scalable, reliable, secure, performant, efficient and
delightful-to-use big data and machine learning platform to enable rapid
product innovation and help make Pinterest a thriving business.
Vision
A big data and machine learning platform at scale enables every single
engineer at Pinterest to derive trustworthy, actionable insights and
apply ML to solve complex problems with ease and confidence.
Principles
● Put engineers first - make the platform delightful-to-use for all
engineers at Pinterest
● Keep it simple, get it right - build a simple yet sufficient
platform
● Enable speed and quality - enable all engineers at Pinterest to
move fast with scalable, reliable, secure, performant and efficient
solutions made easy by the platform
● Build with reusability and for reusability - embrace open
source technology, build with lego blocks and provide lego blocks to
all engineers at Pinterest
Current Status
Big Data Platform
[Diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
Feature Platform
[Diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
Feature Platform - Today
Pinterest’s data graph: Pin/Image/Board/User...
[Diagram: xJoin of the signals attached to a pin]
Signals shown in the diagram:
● pin’s text, texts, text languages, text scores, text tokens
● image info, video info, visual signal
● link language, link country, link perf, link scores, landing page
● SEO signal, safe search, spam
● catvec_v0, catvec_v1, topicvec_v4, country vecs
● annotation_v2, annotation_v3, annotation_v4, annot_embedding v3
Galaxy: next-gen feature platform
[Diagram: developers contribute code modules, each with a spec and metadata; Galaxy is drawn between the ML Platform and BDP]
Metadata-driven framework & dev API
* incremental dataflow execution engine
* signal data store (“column”-partitioned) and metadata repo (registry, stats)
* dependency management
* governance: enforcement & tracking
Signal Access & Serving
* retrieval API, serving, acl, ...
* offline consumers (ML model training)
* online consumers (ML model serving)
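The dependency management and incremental execution items above can be illustrated with a small sketch. This is not Galaxy's API: the registry layout, the function names, and the dependency edges between signals are all assumptions made for illustration; only the signal names echo the data graph slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical signal registry: signal name -> signals it depends on.
# The edges are assumed for illustration, not taken from the talk.
SIGNAL_DEPS = {
    "pin_text": [],
    "image_info": [],
    "text_language": ["pin_text"],
    "text_tokens": ["pin_text"],
    "annotation_v4": ["text_tokens", "text_language"],
}


def execution_order(deps):
    """Return signals ordered so that every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(changed, deps):
    """Return the changed signal plus everything that depends on it,
    directly or transitively. Only these need recomputing."""
    dirty = {changed}
    # Walking in dependency order guarantees a signal's inputs are
    # classified before the signal itself is checked.
    for name in execution_order(deps):
        if any(dep in dirty for dep in deps[name]):
            dirty.add(name)
    return dirty
```

With this registry, a change to `pin_text` marks the text and annotation signals for recomputation while `image_info` is left alone, which is the basic idea behind incremental execution.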
ML Platform
[Diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
Response prediction ML
[Diagram labels: Serving; Training; Profiles; Users, Pins, Boards; Logs; events; content]
Visual ML
Response Prediction Use Cases at Pinterest
● Discovery
○ Home Feed: time-ordered following feed to ML based recommendation feed
○ Related Pins, Search: heuristic to ML ranking
● Ads
○ gCTR, CPI, CVR
● Growth
○ Notifications, NUX topics
● Content
○ Content comprehension
● Shopping
○ CTR prediction
● Protect
○ Spam & Porn, ATO
● … ...
Response prediction ML at Pinterest
Surfaces
● 2014: Home feed ranking; Ads ranking
● 2015: Related Pins ranking
● 2016: Search ranking; Notifications ranking
● 2017: Spam detection
● 2018: NUX topics; Ads retrieval
Scale
● From < 10 serving hosts and training on a laptop to 2500+ serving hosts and training on clusters
[Diagram: ML Code surrounded by Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analytics Tools, Machine Resource Management, Serving Infrastructure, Monitoring & Alerting]
Source: Hidden Technical Debt in Machine Learning Systems, David Sculley et al., Google, NIPS 2015
Much more complex in practice
[Diagram: Related Pins, Ads and Home Feed each run their own stack]
● Stack 1: Learner 1, Feature Extraction 1, Parameter Autotuning, Serving & Logging Automation
● Stack 2: Learner 2, Feature Extraction 2, Data Monitoring, Serving & Logging Automation
● Stack 3: Learner 3, Feature Extraction 3, Data Monitoring, Serving & Logging Automation
● Distributed Training appears twice in the diagram
● Similar components, no sharing!
● Incomplete stacks
Unified ML Platform
[Diagram: one shared stack (Learner, Feature Extraction, Parameter Autotuning, Data Monitoring, Distributed Training, Serving & Logging Automation) used by Related Pins, Ads and Home Feed, plus new use cases: Search, NUX Topic Picker, Notifications]
● Client teams focus on business problems, not infra problems.
● Platform team specializes in infra problems.
● Quick to build new ML applications.
Unified Big Data ML Platform
● Speed & quality
● Single Use Case
○ 0 -> 1 made fast, easy and robust - create an ML model to solve a complex problem
○ 1 -> N made automated - the ML model is continuously trained, improved, and deployed
● Many Use Cases on the Platform
○ N -> N² - most ML models are trained and served by the platform
Key Technologies
Scorpion Training & Catwalk
● Scorpion Training: abstracts the user from the specific trainer package used (TensorFlow, XGBoost; future: other packages)
● Scorpion Training runs on Catwalk
● Catwalk: enables running training jobs on a distributed cluster
● Mesos: cluster resource management (CPUs, RAM, GPUs)
● Kubernetes: to replace Mesos in 2018
[Diagram labels: Mesos Master; Mesos Agents; TFMesos; TFMesosServer (Param Server, Worker); Update gradients; Chronos/Aurora; PinBall; Caffe GPU, SciPy, MXNet, Keras, Caffe, TensorFlow, Torch; Legend]
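The idea of abstracting the user from the trainer package can be sketched as a common interface plus a registry of backends. Everything here is hypothetical: `Trainer`, `TRAINERS`, `run_training` and the config key are illustrative names, and the toy backend stands in for a real package such as TensorFlow or XGBoost. It is not Scorpion's actual interface.

```python
from abc import ABC, abstractmethod


class Trainer(ABC):
    """Hypothetical common interface that hides the trainer package."""

    @abstractmethod
    def train(self, rows, labels):
        """Fit on the data and return a callable model: row -> prediction."""


class MeanBaselineTrainer(Trainer):
    """Toy backend: always predicts the mean training label.
    A real backend would wrap a package such as TensorFlow or XGBoost."""

    def train(self, rows, labels):
        mean = sum(labels) / len(labels)
        return lambda row: mean


# Registry of available backends, keyed by the name used in job configs.
TRAINERS = {"mean_baseline": MeanBaselineTrainer}


def run_training(config, rows, labels):
    """Pick the backend named in the config and train with it."""
    trainer = TRAINERS[config["trainer"]]()
    return trainer.train(rows, labels)
```

In this sketch, switching packages means changing `config["trainer"]`; the calling code stays the same.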
Scorpion Serving
Linchpin - Easy Feature Definition
Declarative language for using common
feature extraction logic.
● Single implementation for both serving & training.
● Heavily optimized.
● Interest Match and Annotation Match reuse a generic "Match" implementation.
pin <- source(TAG="pin", OUTPUTS="p", TYPE="PinJoinRawData")
user <- source(TAG="user", OUTPUTS="u", TYPE="UserJoinRawData")
cat_match <- match(INPUTS=[user.u.categoryVec, pin.p.categoryVec],
MATCH_TYPE="COSINE_SIM")
topic_match <- match(INPUTS=[user.u.topicVec, pin.p.topicVec], ...)
features <- union(INPUTS=[cat_match, topic_match, ...])
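To make the `match(..., MATCH_TYPE="COSINE_SIM")` step concrete, here is a minimal Python sketch of what such a match could compute. It is not Linchpin's implementation: the sparse-dict vector representation, the `MATCHERS` table and `extract_features` are assumptions. Calling one function from both training and serving mirrors the "single implementation" point.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


# Generic "match": one implementation, selected by match type.
MATCHERS = {"COSINE_SIM": cosine_sim}


def match(inputs, match_type):
    left, right = inputs
    return MATCHERS[match_type](left, right)


def extract_features(user, pin):
    """Same code path for offline training and online serving."""
    return {
        "cat_match": match([user["categoryVec"], pin["categoryVec"]], "COSINE_SIM"),
        "topic_match": match([user["topicVec"], pin["topicVec"]], "COSINE_SIM"),
    }
```

Because every match feature goes through `match`, adding a new match type only requires one more entry in `MATCHERS`.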
Muse
[Diagram labels: Corpus; Fresh corpus; Indexing pipeline; streaming pipeline; index builder; index; fresh index; Fresh index dispatcher; Perdoc data dispatcher; Searchable doc; Root; Query understanding; Planner; Mixer; Merger; Leaf (x3); Cache; Reranker; models; model training pipeline; Feature log]
Pixie: Graph walks
● The greatest asset of Pinterest is our pin-to-board graph
○ It captures relationships between pins (how objects are organized into collections)
○ Can be used to capture multiple different interactions: pins to boards, clicks by user,...
● We use Pixie for candidate generation: how to quickly go from 2B pins to 1k pins so that ML models can then score each pin separately
● Represent the user as a (set of) pin(s) Q and do a random walk from Q:
○ Bias the walk towards fresh pins, pins in the user’s local language, pins that males/females like
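The walk described above can be sketched as a biased random walk with restarts on the pin-to-board graph, ranking pins by visit count. This is a toy illustration, not Pixie itself: the function name, the restart probability, the step count and the `bias` callback (which must return a positive weight) are all assumptions.

```python
import random
from collections import Counter


def pixie_walk(pin_to_boards, board_to_pins, query_pins, steps=1000,
               restart_prob=0.5, bias=None, top_k=3, seed=0):
    """Return up to top_k pins most visited by a random walk from query_pins."""
    rng = random.Random(seed)
    bias = bias or (lambda pin: 1.0)  # unbiased walk by default
    query_pins = list(query_pins)
    visits = Counter()
    current = rng.choice(query_pins)
    for _ in range(steps):
        # Restart keeps the walk close to the query set Q.
        if rng.random() < restart_prob:
            current = rng.choice(query_pins)
        # One step: pin -> one of its boards -> a pin on that board.
        board = rng.choice(pin_to_boards[current])
        candidates = board_to_pins[board]
        weights = [bias(p) for p in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        if current not in query_pins:
            visits[current] += 1
    return [pin for pin, _ in visits.most_common(top_k)]
```

A `bias` such as `lambda pin: 2.0 if pin in fresh_pins else 1.0` (with `fresh_pins` a hypothetical set) would steer the walk towards fresh pins; the resulting short list is what a downstream ML model would then score.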
Pixie Architecture Diagram
Future Plan
Big Data Platform
● [Product Enablement] Streaming engines
○ Spark Structured Streaming
○ Flink
○ … ...
● [Scalability] Spinner - next gen workflow engine
● [Performance] Hive on Tez
● [Efficiency] Hadoop auto-scaling
● [Future Proofing] Spark on Kubernetes
● [Future Proofing] Hadoop 3.0
Galaxy: next-gen feature platform
[Diagram: developers contribute code modules, each with a spec and metadata; Galaxy is drawn between the ML Platform and BDP]
Metadata-driven framework & dev API
* incremental dataflow execution engine
* signal data store (“column”-partitioned) and metadata repo (registry, stats)
* dependency management
* governance: enforcement & tracking
Signal Access & Serving
* retrieval API, serving, acl, ...
* offline consumers (ML model training)
* online consumers (ML model serving)
ML Platform
[Diagram labels: Developer Frontend; Linchpin DSL; Feature Extraction; Feature Analysis; Real-time Feature Sources; Counting Service; Learner; Parameter Autotuning; Model Eval & Comparison; Data Monitoring; Incremental & Real-Time Training Automation; Scorpion Training; off-the-shelf solutions: Tensorflow ...; Model Version Management; Model Deploy; Model Runtime Validation; Scorpion Serving; Model Serving; Logging]
Team key: ML Serving Systems, ML Training Platform
Key Learnings
● Unified big data ML platform greatly accelerates
product innovations
● Data lineage, quality and democracy are vital to
organization scalability
● Speed, quality & delightful-to-use