Pinterest - Big Data Machine Learning Platform at Pinterest

This was presented by Yongsheng Wu, head of big data and ML platform at Pinterest, at the Alluxio Bay Area meetup.

Yongsheng shares Pinterest's journey to build a fast and scalable big data and ML platform on AWS that handles Pinterest's request volume and data complexity at scale. In this talk, he covers the requirements of the platform, the challenges encountered, the technologies chosen, and the tradeoffs that were made.

  1. 1. Big Data ML Platform at Pinterest. Yongsheng Wu. Pinterest: pinterest.com/yswu; LinkedIn: linkedin.com/in/yongshengwu; Twitter: @yswu. 06/17/2019
  2. 2. Pinterest : The World’s Catalog of Ideas
  3. 3. Mission: Help people discover and do what they love.
  4. 4. Scale@Pinterest. Service Scale: • 300M+ MAUs • 120B+ Pins • 3B+ Boards. Big Data Scale: • 300+ PB on S3 • 6000+ Hive/Hadoop nodes • 400+ Presto nodes • 1000+ Spark nodes
  5. 5. Mission & Vision Principles Current Status Key Technologies Future Plan
  6. 6. Mission: Provide a highly scalable, reliable, secure, performant, efficient and delightful-to-use big data and machine learning platform to enable rapid product innovation and help make Pinterest a thriving business. Vision: A big data and machine learning platform at scale enables every single engineer at Pinterest to derive trustworthy, actionable insights and apply ML to solve complex problems with ease and confidence.
  7. 7. Mission & Vision Principles Current Status Key Technologies Future Plan
  8. 8. Principles ● Put engineers first - make the platform delightful-to-use for all engineers at Pinterest ● Keep it simple, get it right - build a simple yet sufficient platform ● Enable speed and quality - enable all engineers at Pinterest to move fast with scalable, reliable, secure, performant and efficient solutions made easy by the platform ● Build with reusability and for reusability - embrace open source technology, build with lego blocks and provide lego blocks to all engineers at Pinterest
  9. 9. Mission & Vision Principles Current Status Key Technologies Future Plan
  10. 10. Big Data Platform [diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
  11. 11. Big Data Platform
  12. 12. Feature Platform [diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
  13. 13. Feature Platform - Today. Pinterest’s data graph: Pin/Image/Board/User... [diagram: xJoin of a pin’s signals: text, image info, video info, texts, text languages, text scores, SEO signal, link language, link country, link perf, link scores, safe search, spam, visual signal, catvec_v0 (pin’s catvec_v0), catvec_v1 (pin’s catvec_v1), topicvec_v4 (pin’s topicvec_v4), country vecs, text tokens, landing page, annot_embedding v3, annotation_v2, annotation_v3, annotation_v4]
  14. 14. Galaxy: next-gen feature platform * incremental dataflow execution engine * signal data store (“column”-partitioned) and metadata repo (registry, stats) * dependency management * governance: enforcement & tracking [diagram labels: developers, each with a code module, spec, and metadata; Metadata-driven framework & dev API; Signal Access & Serving (retrieval API, serving, acl, ...); offline consumers (ML model training); online consumers (ML model serving); ML Platform; BDP]
  15. 15. ML Platform [diagram: platform stack of Big Data Platform, Feature Platform, ML Platform]
  16. 16. Response prediction ML [diagram labels: Serving; Training; Profiles (Users, Pins, Boards); Logs; events; content]
  17. 17. Visual ML
  18. 18. Response Prediction Use Cases at Pinterest ● Discovery ○ Home Feed: from a time-ordered following feed to an ML-based recommendation feed ○ Related Pins, Search: from heuristic to ML ranking ● Ads ○ gCTR, CPI, CVR ● Growth ○ Notifications, NUX topics ● Content ○ Content comprehension ● Shopping ○ CTR prediction ● Protect ○ Spam & Porn, ATO ● …
  19. 19. Response prediction ML at Pinterest. Surfaces: 2014: Home feed ranking; Ads ranking. 2015: Related Pins ranking. 2016: Search ranking; Notifications ranking. 2017: Spam detection. 2018: NUX topics; Ads retrieval. Scale: from < 10 serving hosts and training on a laptop to 2500+ serving hosts and training on clusters.
  20. 20. Hidden Technical Debt in Machine Learning Systems, David Sculley et al., Google, NIPS 2015 [diagram: ML Code surrounded by Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analytics Tools, Machine Resource Management, Serving Infrastructure, Monitoring & Alerting]
  21. 21. Much more complex in practice ● Similar components, no sharing! ● Incomplete stacks [diagram: separate stacks for Related Pins, Ads, and Home Feed. Stack 1: Learner 1, Parameter Autotuning, Serving & Logging, Automation, Feature Extraction 1. Stack 2: Learner 2, Data Monitoring, Serving & Logging, Automation, Feature Extraction 2. Stack 3: Learner 3, Data Monitoring, Serving & Logging, Automation, Feature Extraction 3. Distributed Training appears in only two of the stacks]
  22. 22. Unified ML Platform ● Client teams focus on business problems, not infra problems. ● Platform team specializes in infra problems. ● Quick to build new ML applications. [diagram: one shared stack (Learner, Parameter Autotuning, Serving & Logging, Automation, Feature Extraction, Data Monitoring, Distributed Training) serving Related Pins, Ads, and Home Feed, plus new use cases: Search, NUX Topic Picker, Notifications]
  23. 23. Unified Big Data ML Platform ● Speed & quality ● Single Use Case ○ 0 -> 1 made fast, easy and robust - create an ML model to solve a complex problem ○ 1 -> N made automated - such an ML model is continuously trained, improved, and deployed ● Many Use Cases on the Platform ○ N -> N² - most ML models are trained and served by the platform
  24. 24. Mission & Vision Principles Current Status Key Technologies Future Plan
  25. 25. Scorpion Training & Catwalk ● Scorpion Training: abstracts the user from the specific trainer package used (TensorFlow, XGBoost; future: other packages) ● Catwalk: enables running training jobs on a distributed cluster ● Mesos: cluster resource management (CPUs, RAM, GPUs) ● Kubernetes: to replace Mesos in 2018 [diagram: layers connected by "runs on" arrows]
  26. 26. Catwalk [diagram labels: Mesos Master; Mesos Agents; GPU; frameworks: Caffe, SciPy, MXNet, Keras, TensorFlow, Torch; TFMesos with a TFMesosServer Param Server and TFMesosServer Workers (update gradients); schedulers: Chronos/Aurora, PinBall; Legend]
  27. 27. Scorpion Serving
  28. 28. Linchpin - Easy Feature Definition. Declarative language for using common feature extraction logic. ● Single implementation for both serving & training. ● Heavily optimized. ● Interest Match and Annotation Match reuse the generic "Match" implementation. Example:
        pin <- source(TAG="pin", OUTPUTS="p", TYPE="PinJoinRawData")
        user <- source(TAG="user", OUTPUTS="u", TYPE="UserJoinRawData")
        cat_match <- match(INPUTS=[user.u.categoryVec, pin.p.categoryVec], MATCH_TYPE="COSINE_SIM")
        topic_match <- match(INPUTS=[user.u.topicVec, pin.p.topicVec], ...)
        features <- union(INPUTS=[cat_match, topic_match, ...])
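The `match(..., MATCH_TYPE="COSINE_SIM")` step in the Linchpin example on slide 28 can be illustrated in plain Python. This is only a minimal sketch of what a cosine-similarity match feature computes, not Linchpin's implementation: the function names `cosine_sim`, `match`, and `extract_features`, and the dictionary layout of the user and pin records, are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical registry: one generic "match" entry point, many match types.
MATCH_FUNCTIONS: Dict[str, Callable[[Vector, Vector], float]] = {
    "COSINE_SIM": cosine_sim,
}


def match(user_vec: Vector, pin_vec: Vector, match_type: str = "COSINE_SIM") -> float:
    """Generic match feature: dispatch on the declared match type."""
    return MATCH_FUNCTIONS[match_type](user_vec, pin_vec)


def extract_features(user: Dict[str, Vector], pin: Dict[str, Vector]) -> Dict[str, float]:
    """Union of match features, mirroring cat_match and topic_match in the slide."""
    return {
        "cat_match": match(user["categoryVec"], pin["categoryVec"]),
        "topic_match": match(user["topicVec"], pin["topicVec"]),
    }


if __name__ == "__main__":
    user = {"categoryVec": [1.0, 0.0], "topicVec": [1.0, 1.0]}
    pin = {"categoryVec": [1.0, 0.0], "topicVec": [1.0, 0.0]}
    print(extract_features(user, pin))
```

Calling the same `extract_features` function from both the training pipeline and the serving path is one way to get the "single implementation for both serving & training" property the slide describes.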
  29. 29. [diagram, marked Confidential; labels: Corpus; Root; Query understanding; Leaf (x3); Searchable doc; index builder; index; Indexing pipeline; model training pipeline; models; Cache; Mixer; Cache; Reranker; Feature log; Merger; corpus; Fresh corpus; streaming pipeline; index builder; fresh index; Fresh index dispatcher; Perdoc data dispatcher; Searchable doc; Planner; Muse]
  30. 30. Pixie: Graph walks ● The greatest asset of Pinterest is our pin-to-board graph ○ It captures relationships between pins (how objects are organized into collections) ○ Can be used to capture multiple different interactions: pins to boards, clicks by user,... ● We use Pixie for candidate generation: how to quickly go from 2B pins to 1k pins so that ML models can then score each pin separately ● Represent a user as a set of pins Q and do a random walk from Q: ○ Bias the walk towards fresh pins, pins in the local user’s language, pins that males/females like
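The random walk described on the Pixie slide can be sketched as follows. This is an illustrative toy under stated assumptions, not Pixie's actual algorithm or API: the function name `pixie_walk`, the restart probability, the adjacency-dict graph representation, and the `pin_weight` bias hook are all hypothetical. The walk alternates pin -> board -> pin on the pin-to-board graph, restarts at the query pins, counts visits, and returns the most-visited pins as candidates.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Optional


def pixie_walk(
    pin_to_boards: Dict[str, List[str]],
    board_to_pins: Dict[str, List[str]],
    query_pins: List[str],
    steps: int = 10_000,
    restart_prob: float = 0.5,
    pin_weight: Optional[Callable[[str], float]] = None,
    top_k: int = 10,
    seed: int = 0,
) -> List[str]:
    """Biased random walk with restart; returns the top_k most-visited non-query pins."""
    rng = random.Random(seed)
    weight = pin_weight or (lambda pin: 1.0)  # uniform walk when no bias is given
    # Only query pins that sit on at least one board can start a walk.
    starts = [p for p in query_pins if pin_to_boards.get(p)]
    if not starts:
        return []

    visits: Counter = Counter()
    current = rng.choice(starts)
    for _ in range(steps):
        # Restart keeps the walk close to the query set Q.
        if rng.random() < restart_prob or not pin_to_boards.get(current):
            current = rng.choice(starts)
        board = rng.choice(pin_to_boards[current])
        candidates = board_to_pins[board]
        weights = [weight(p) for p in candidates]
        if sum(weights) <= 0:
            current = rng.choice(starts)
            continue
        # Bias: pins with larger weight (e.g. fresh, right language) are picked more often.
        current = rng.choices(candidates, weights=weights, k=1)[0]
        visits[current] += 1

    for q in query_pins:
        visits.pop(q, None)  # the query pins themselves are not candidates
    return [pin for pin, _ in visits.most_common(top_k)]


if __name__ == "__main__":
    pin_to_boards = {"a": ["b1"], "b": ["b1", "b2"], "c": ["b2"], "d": ["b3"], "e": ["b3"]}
    board_to_pins = {"b1": ["a", "b"], "b2": ["b", "c"], "b3": ["d", "e"]}
    print(pixie_walk(pin_to_boards, board_to_pins, ["a"], steps=2000))
```

In this toy graph, pins `d` and `e` share no board with the query pin, so the walk never reaches them. The candidates returned by such a walk would then be scored individually by the ranking models, as the slide describes.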
  31. 31. Pixie Architecture Diagram
  32. 32. Mission & Vision Principles Current Status Key Technologies Future Plan
  33. 33. Big Data Platform ● [Product Enablement] Streaming engines ○ Spark Structured Streaming ○ Flink ○ … ● [Scalability] Spinner - next gen workflow engine ● [Performance] Hive on Tez ● [Efficiency] Hadoop auto-scaling ● [Future Proofing] Spark on Kubernetes ● [Future Proofing] Hadoop 3.0
  34. 34. Galaxy: next-gen feature platform * incremental dataflow execution engine * signal data store (“column”-partitioned) and metadata repo (registry, stats) * dependency management * governance: enforcement & tracking [diagram labels: developers, each with a code module, spec, and metadata; Metadata-driven framework & dev API; Signal Access & Serving (retrieval API, serving, acl, ...); offline consumers (ML model training); online consumers (ML model serving); ML Platform; BDP]
  35. 35. ML Platform [diagram labels: Developer Frontend; Learner; Model Eval & Comparison; Data Monitoring; Feature Analysis; Parameter Autotuning; Model Serving; Logging; off-the-shelf solutions: TensorFlow ...; Scorpion Serving; Scorpion Training; Incremental & Real-Time Training; Automation; Model Deploy; Linchpin DSL; Model Version Management; Feature Extraction; Real-time Feature Sources; Counting Service; Model Runtime Validation. Team key: ML Serving Systems; ML Training Platform]
  36. 36. Mission & Vision Principles Current Status Key Technologies Future Plan
  37. 37. Key Learnings ● Unified big data ML platform greatly accelerates product innovations ● Data lineage, quality and democracy are vital to organization scalability ● Speed, quality & delightful-to-use
