
Big Data Platform at Pinterest

Presented by Mao Ye at the AWS Loft meetup on Thursday, September 17th, 2015

Published in: Data & Analytics

  1. Confidential • Mao Ye • Big Data Platform at Pinterest
  2. Data Architecture • Design Choices for Hadoop Platform • Pinball for Workflow Management
  3. Data Architecture
  4. Data at Pinterest • 60 Billion Pins • 1 Billion boards • 100M MAU • 60 PB of data on S3 • 3 PB processed every day • 2000 node Hadoop cluster • 250 engineers
  5. Pinterest Data Architecture App
  6. Pinterest Data Architecture App events Kafka Secor Singer
  7. Pinterest Data Architecture App events Kafka Secor Singer
  8. Pinterest Data Architecture App events Kafka Secor Skyline Pinball Redshift Pinalytics Features Qubole (Hadoop) Singer
  9. Design Choices for Hadoop Platform
  10. Hadoop Platform Requirements: • Isolated multi-tenancy • Elasticity • Support multiple clusters. Also listed: • Ephemeral clusters • Access control layer • Shared data store • Easy deployment
  11. Decoupling compute & storage Hadoop Cluster 1 Transient HDFS Hadoop Cluster 2 Transient HDFS S3 Persistent Store
  12. Centralized Hive Metastore: Hive Metastore • Pig, Cascading, Hive • HDFS/S3 • Data, Metadata
  13. Multi-layered Packaging Mapreduce Jobs Hadoop Jars/Libs Job/User level Configs Software Packages/Libs Configs (OS/Hadoop) Misc Sys Admin OS Bootstrap Script Core SW Runtime Staging (on S3) Automated Configuration (Masterless Puppet) Baked AMI
  14. Executor Abstraction Layer Hive Metastore HDFS/S3 Qubole Managed Hadoop EMR Executor Pinball Dev Server
  15. Why Qubole? • Hadoop & Spark as managed services • Tight integration with Hive • Graceful cluster scaling • API for simplified executor abstraction • Advanced support for spot instances • Baked AMI customization
  16. Pinball for Workflow Management
  17. Scale of Processing ● Scale: o 60 Billion Pins o Hundreds of workflows o Thousands of jobs o 500+ jobs in a workflow o 3 petabytes processed daily ● Support: o Hadoop, Cascading, Hive, Spark … (diagram labels: job, workflow)
  18. Why Pinball? ● Requirements o Simple abstractions o Extensible in future o Reliable stateless computing o Easy to debug o Scales horizontally o Can be upgraded without aborting workflows o Rich features like auto-retries, per-job emails, overrun policies… ● Options o Apache Oozie, Azkaban, Luigi
  19. Pinball Design: Master, Worker, Scheduler, Command Line Clients, UI
  20. Workflow Model ● Workflow o A directed graph of nodes called jobs ● Edge o Run-after dependence ● Node o Job is a node
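
The workflow model on this slide (jobs as nodes, run-after dependencies as edges) can be illustrated with a small sketch. This is not Pinball's API: the job names and the `run_order` helper are hypothetical, and the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must run after.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(deps):
    """Return one execution order that respects every run-after edge."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
```

A cycle in the dependencies makes `static_order()` raise `graphlib.CycleError`, which is one way to reject a graph that cannot be scheduled.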
  21. Job State ● Job state is captured in a token ● Tokens are named hierarchically. Master, Job Token: version: 123 name: /workflow/w1/job owner: worker_0 expiration: 1234567 data: JobTemplate(....)
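
A minimal sketch of a token record with the five fields shown on this slide, plus a prefix lookup that relies on the hierarchical naming. The `Token` class and the `tokens_under` helper are hypothetical illustrations, not Pinball's actual classes, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Illustrative token; field names follow the slide, types are assumed."""
    name: str
    version: int
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Optional[str] = None


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Return tokens whose hierarchical name lies below the given prefix."""
    # Normalize to a trailing slash so /workflow/w1 does not match /workflow/w10.
    normalized = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(normalized)]


tokens = [
    Token(name="/workflow/w1/job", version=123, owner="worker_0",
          expiration=1234567, data="JobTemplate(....)"),
    Token(name="/workflow/w10/job", version=1),
]
found = tokens_under(tokens, "/workflow/w1")
```

Hierarchical names make "all tokens of one workflow" a simple prefix query, which is the property the sketch exercises.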
  22. Job State Machine: RUNNABLE, RUNNING, WAITING
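
The slide names three states but the transcript does not preserve the arrows between them. The sketch below assumes a simple cycle (WAITING to RUNNABLE once dependencies are met, RUNNABLE to RUNNING when a worker picks the job up, RUNNING back to WAITING when the run ends); that cycle and the `advance` helper are assumptions for illustration only.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the slide transcript lists only the state names.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING},
}


def advance(current, target):
    """Return the target state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the allowed transitions in one table makes it easy to adjust the sketch if the real state machine differs from this assumption.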
  23. Master Worker Interaction ● Master keeps the state ● Workers claim and execute tasks ● Horizontally scalable. Worker, Master, Persistent Store: 1: request 2: update 3: ack
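
One way workers could claim tasks is a lease on the token's `owner` and `expiration` fields. The sketch below assumes that a token is claimable when it has no owner or its lease has expired, and that each successful claim bumps `version`; these semantics and the `try_claim` helper are assumptions, not a description of Pinball's protocol.

```python
def try_claim(token, worker, now, lease_seconds=60):
    """Claim an unowned or expired token in place; return True on success."""
    if token["owner"] is not None and token["expiration"] > now:
        return False  # another worker holds a live lease
    token["owner"] = worker
    token["expiration"] = now + lease_seconds
    token["version"] += 1
    return True


# Hypothetical token stored as a plain dict for brevity.
token = {"name": "/workflow/w1/job", "version": 123,
         "owner": None, "expiration": 0}
first = try_claim(token, "worker_0", now=1000)
second = try_claim(token, "worker_1", now=1010)   # lease still live
third = try_claim(token, "worker_1", now=2000)    # lease expired
```

With an expiring lease, a crashed worker's task becomes claimable again without any cleanup step, which fits the slide's point that workers scale horizontally while the master keeps the state.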
  24. Master ● Entire state is kept in memory ● Each state update is synchronously persisted before master replies to client ● Master runs on a single thread – no concurrency issues
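
The persist-before-reply rule can be sketched as follows. The `Master` class, the JSON-lines log file, and the `recover` helper are all hypothetical; the transcript does not say which persistent store Pinball uses, so a local append-only file stands in for it here.

```python
import json
import os


class Master:
    """Single-threaded sketch: state lives in memory, backed by a log file."""

    def __init__(self, log_path):
        self._log_path = log_path
        self._state = {}

    def update(self, name, data):
        record = {"name": name, "data": data}
        # Persist synchronously first, so an acknowledged update survives a crash.
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._state[name] = data
        return "ack"

    def get(self, name):
        return self._state.get(name)


def recover(log_path):
    """Rebuild the in-memory state by replaying the log from the start."""
    master = Master(log_path)
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                master._state[record["name"]] = record["data"]
    return master
```

Because the sketch handles one update at a time on one thread, it needs no locks, mirroring the slide's "no concurrency issues" point.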
  25. Worker
  26. Open Source • Git repo: https://github.com/pinterest/pinball • Mailing list: https://groups.google.com/forum/#!forum/pinball-users
  27. Thank You
