
50 Billion pins and counting: Using Hadoop to build data driven Products

Krishna Gade
Pinterest

Published in: Technology


  1. Confidential. Using Hadoop to build data driven Products: 50 Billion pins and counting. Krishna Gade
  2. What is Pinterest? A visual bookmarking tool: discover an inspiring idea, save it to a board, go do it.
  3. Who am I? Krishna Gade: Data Engineering at Pinterest; Search and Data platforms at Twitter and Bing; follow @krishnagade.
  4. Pinterest is a data product
  5. Why do we care about data? How is Hadoop helping us to harness the power of the data? What are some of the tools we built on top of the Hadoop platform?
  6. Why do we care about data?
  7. 3.375
  8. 5’10”
  9. < uncertainty
  10. > odds of making the best decisions
  11. It is a capital mistake to theorize before one has data. - Sherlock Holmes
  12. How is Hadoop helping us to harness the power of the data?
  13. Data at Pinterest: 50 billion Pins; 1 billion boards; 40 PB of data on S3; 3 PB processed every day; 2000-node Hadoop cluster; 200 engineers.
  14. Pinterest Data Architecture (diagram): App
  15. Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Singer
  16. Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Singer
  17. Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Skyline, Pinball, Redshift, Pinalytics, Features, Qubole (Hadoop), Singer
  18. Hadoop Platform Requirements: ephemeral clusters; access control layer; shared data store; easy deployment; isolated multi-tenancy; elasticity; support for multiple clusters.
  19. Design Choices
  20. Decoupling compute & storage (diagram): Hadoop Cluster 1 (transient HDFS); Hadoop Cluster 2 (transient HDFS); S3 persistent store.
  21. Centralized Hive Metastore (diagram): Pig, Cascading, Hive; Hive Metastore (metadata); HDFS/S3 (data).
  22. Multi-layered Packaging. Runtime Staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs. Automated Configuration (masterless Puppet): software packages/libs, configs (OS/Hadoop), misc sys admin. Baked AMI: OS, bootstrap script, core SW.
  23. Executor Abstraction Layer (diagram): Hive Metastore, HDFS/S3, Qubole Managed Hadoop, EMR, Executor, Pinball, Dev Server
  24. Why Qubole? API for simplified executor abstraction; advanced support for spot instances; baked AMI customization; Hadoop & Spark as managed services; tight integration with Hive; graceful cluster scaling.
  25. Scale of Processing (diagram labels: job, workflow). Scale: 50 billion Pins; hundreds of workflows; thousands of jobs; 500+ jobs in a workflow; 3 petabytes processed daily. Support: Hadoop, Cascading, Hive, Spark …
  26. Pinball
  27. Why Pinball? Requirements: simple abstractions; extensible in future; reliable stateless computing; easy to debug; scales horizontally; can be upgraded without aborting workflows; rich features like auto-retries, per-job emails, overrun policies… Options: Apache Oozie, Azkaban, Luigi.
  28. Pinball Design
  29. Workflow Model. Workflow: a directed graph of nodes called jobs. Node: a job. Edge: a run-after dependency.
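A minimal sketch of the workflow model on slide 29, assuming a workflow is given as a mapping from each job to the jobs it must run after. The job names and the `run_order` helper are hypothetical and are not part of Pinball; the sketch only shows that run-after edges in a directed graph determine a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it must run after.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
}

def run_order(deps):
    """Return one execution order that honors every run-after edge."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(workflow)
print(order)
```

If the edges ever formed a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is one reason to model a workflow as an acyclic graph.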
  30. Job State: job state is captured in a token; tokens are named hierarchically. Example job token held by the master: version: 123; name: /workflow/w1/job; owner: worker_0; expiration: 1234567; data: JobTemplate(....)
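The token on slide 30 could be modeled roughly as follows. The field names come from the slide; the field types, the `tokens_under` helper, and the sample job names are assumptions made for illustration and do not reflect Pinball's actual classes.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    # Field names follow the slide; the types are assumptions.
    version: int
    name: str
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Optional[str] = None

def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Select every token whose hierarchical name starts with prefix."""
    return [t for t in tokens if t.name.startswith(prefix)]

tokens = [
    Token(123, "/workflow/w1/job_a", "worker_0", 1234567, "JobTemplate(...)"),
    Token(7, "/workflow/w1/job_b"),
    Token(1, "/workflow/w2/job_a"),
]
w1 = tokens_under(tokens, "/workflow/w1/")
print([t.name for t in w1])
```

Hierarchical names make "all jobs of workflow w1" a simple prefix query, which is presumably what the naming scheme is meant to enable.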
  31. Job State Machine
  32. Master Worker Interaction: the master keeps the state; workers claim and execute tasks; horizontally scalable. Diagram: Worker, Master, Persistent Store; 1: request, 2: update, 3: ack.
  33. Master: entire state is kept in memory; each state update is synchronously persisted before the master replies to the client; the master runs on a single thread, so there are no concurrency issues.
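The master behavior described on slides 32 and 33 can be sketched as below. This is an illustration under stated assumptions, not Pinball's implementation: the `Master` class, its method names, the lease-based reading of `owner` and `expiration`, and the use of a plain list as the persistent store are all hypothetical.

```python
import time

class Master:
    """Single-threaded token master: state in memory, every update persisted first."""

    def __init__(self, store):
        self.tokens = {}    # entire state kept in memory: name -> token dict
        self.store = store  # stand-in for the persistent store (assumption)

    def _persist(self, token):
        # Synchronous write: finishes before the caller gets a reply.
        self.store.append(dict(token))

    def add(self, name, data):
        token = {"name": name, "version": 1, "owner": None,
                 "expiration": 0.0, "data": data}
        self._persist(token)
        self.tokens[name] = token

    def claim(self, worker, lease_sec, now=None):
        """Give the worker the first unowned or expired token, or None."""
        now = time.time() if now is None else now
        for token in self.tokens.values():
            if token["owner"] is None or token["expiration"] <= now:
                updated = dict(token, owner=worker,
                               expiration=now + lease_sec,
                               version=token["version"] + 1)
                self._persist(updated)               # 2: update
                self.tokens[token["name"]] = updated
                return dict(updated)                 # 3: ack
        return None

store = []
master = Master(store)
master.add("/workflow/w1/job_a", "JobTemplate(...)")
first = master.claim("worker_0", lease_sec=10, now=100.0)
blocked = master.claim("worker_1", lease_sec=10, now=105.0)
retry = master.claim("worker_1", lease_sec=10, now=111.0)
print(first["owner"], blocked, retry["owner"])
```

Because only one thread mutates `tokens`, no locking is needed, and a worker that dies simply lets its lease expire so another worker can claim the token.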
  34. Worker
  35. Open Source. Git repo: https://github.com/pinterest/pinball Mailing list: https://groups.google.com/forum/#!forum/pinball-users
  36. Data Driven Products
  37. Guided Search
  38. Related Pins
  39. What are some of the tools we built on top of the Hadoop platform?
  40. Pinalytics: Scalable Data Analytics Engine
  41. Architecture, main components: Backend (Thrift services and HBase databases); Webapp (rich UI components); Reporter (generates formatted data); Metrics (customized optimizations).
  42. User Interface. Visualizations: Highcharts; time series updated automatically daily. Customizability: dashboards; built-in or user-defined reports.
  43. User Interface. Pinomaly: anomalous metric tracking; email alerts. Reporting: formatted dashboards; PDF printing; duplicated weekly. Metric Manipulation: Metric Composer; global operations (segmentation, rollup/aggregation, etc.).
  44. Data Model: Date, seg1, seg2, ... => value. Store the value for every possible segmentation; on-the-fly aggregation. E.g. 2015-01-01, US, Male => 1; 2015-01-01, US, Female => 2; 2015-01-01, UK, Male => 3; 2015-01-01, UK, Female => 4; 2015-01-01, UK, * => 7; 2015-01-01, *, Male => 4.
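The rollup rows in the slide 44 example (`UK, * => 7` and `*, Male => 4`) can be reproduced with a short sketch. It assumes the stored values are additive counts, so a wildcard row is the sum of the rows it covers; the `with_rollups` helper is hypothetical and not part of Pinalytics.

```python
from collections import defaultdict
from itertools import product

def with_rollups(rows):
    """Expand (date, segments, value) rows into every '*' wildcard combination."""
    totals = defaultdict(int)
    for date, segs, value in rows:
        # Each segment is either kept as-is or replaced by the wildcard "*".
        for choice in product(*[(seg, "*") for seg in segs]):
            totals[(date,) + choice] += value
    return dict(totals)

rows = [
    ("2015-01-01", ("US", "Male"), 1),
    ("2015-01-01", ("US", "Female"), 2),
    ("2015-01-01", ("UK", "Male"), 3),
    ("2015-01-01", ("UK", "Female"), 4),
]
table = with_rollups(rows)
print(table[("2015-01-01", "UK", "*")], table[("2015-01-01", "*", "Male")])
```

With k segment columns each input row fans out into 2^k stored rows, which is the storage cost paid for cheap reads at any level of segmentation.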
  45. Backend Architecture. Components: Webapp Server; Pinalytics Thrift Service; HBase Region Servers 1 to N hosting Regions 1 to M of the metric table, each region with a coprocessor (CP). Flow: 1. request; 2. readMetrics(); 3. Scan & Aggregate; 4. Region aggregation; 5. metrics.
  46. HBase. Horizontal Scalability: no app-level sharding. Flexibility in Aggregation: FuzzyRowFilter; Coprocessor. Tables: report metadata; reports.
  47. Fuzzy Row Filter. Composite row key: METRIC|TIME|SEG1|SEG2|... Filters rows given a row key and a fuzzy row mask; 0: the byte must match, 1: the byte need not match. E.g. MAU of male users on 2015-01-01: Start row: MAU|2015-01-01| End row: MAU|2015-01-01|| Row key: MAU|2015-01-01|--|M- Fuzzy filter: 000|0000000000|11|00
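The matching rule on slide 47 can be imitated in plain Python. This sketch is not the HBase `FuzzyRowFilter` API: it works on strings rather than byte arrays, requires equal lengths, and treats every mask character other than `1` (including the `|` separators) as "must match". The sample row keys are made up.

```python
def fuzzy_match(row_key: str, pattern: str, mask: str) -> bool:
    """True when row_key equals pattern at every position whose mask is not '1'."""
    if len(row_key) != len(pattern) or len(pattern) != len(mask):
        return False
    return all(m == "1" or r == p for r, p, m in zip(row_key, pattern, mask))

# Pattern and mask copied from the slide: any country, gender fixed to "M-".
PATTERN = "MAU|2015-01-01|--|M-"
MASK = "000|0000000000|11|00"

rows = [
    "MAU|2015-01-01|US|M-",
    "MAU|2015-01-01|UK|M-",
    "MAU|2015-01-01|US|F-",
    "MAU|2015-01-02|US|M-",
    "WAU|2015-01-01|US|M-",
]
matches = [r for r in rows if fuzzy_match(r, PATTERN, MASK)]
print(matches)
```

The start and end rows on the slide bound the scan to one metric and date, so the fuzzy check only has to skip over the segment positions that the query leaves open.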
  48. HBase Coprocessor: region-local aggregation with coprocessor; final aggregation at the Thrift service; reduces network I/O; low latency.
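The two-stage aggregation on slide 48 can be sketched without HBase. The lists below stand in for regions, `region_aggregate` plays the role of the coprocessor, and `read_metric` plays the role of the Thrift service; all of these names and the sample rows are hypothetical.

```python
def region_aggregate(rows, wanted):
    """Runs next to the data: sum matching values within one region."""
    return sum(value for key, value in rows if wanted(key))

def read_metric(regions, wanted):
    """Runs in the service: combine one partial sum per region."""
    partials = [region_aggregate(rows, wanted) for rows in regions]
    return sum(partials)

regions = [
    [("MAU|2015-01-01|US|M-", 1), ("MAU|2015-01-01|US|F-", 2)],
    [("MAU|2015-01-01|UK|M-", 3), ("MAU|2015-01-01|UK|F-", 4)],
]
male_total = read_metric(regions, lambda key: key.endswith("|M-"))
print(male_total)
```

Only one number per region crosses the network instead of every matching row, which is where the reduced network I/O comes from.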
  49. Reporter. Flexible Python client library for generating reports: arbitrary metrics and segments. Easy-to-access data: data is automatically copied to S3; a Hive external table is generated.
  50. Reporter Example: WAU, WARC and MAU segmented by gender and country

      class DemoWAUReport(PinalyticsWideReport):
          _METRIC_NAMES = ['wau', 'warc', 'mau']
          _SEGKEY_NAMES = ['gender', 'country']
          _QUERY_TEMPLATE = """
              SELECT dt, gender, country, wau, warc, mau
              FROM activity_metrics
              WHERE dt>='2015-01-01';"""

      Sample query output: ['2015-01-01', 'male', 'US', 102, 53, 110]
  51. Core Metrics: pre-compute a lot of core metrics (activity, event counts, retention, signups); standard segmentation (gender, country, app; spam-filtering).
  52. Outcomes
  53. Internal Tools Matter: solving problems inside of our company. 400 unique users; 800 page views per day; 1500 custom charts created and updated daily.
  54. Thank You
