2.
What is Pinterest?
An online pinboard where you “curate” and “discover” things you love, then go do them in real life.
3.
Discovery - “Follow” Model
[Diagram: a Follower follows a Followee, forming the follow interest graph]
• Follower indicates interest in Followee’s content
• Following feed - content collated from followees
4.
“Following” Feed @Pinterest
[Diagram: a new pin is fanned out to Follower 1, Follower 2, Follower 3, ...]
Challenges @scale
• 100s of millions of pins/repins per month
• High fanout - billions of writes per day (High throughput)
• Billions of requests per month (Low latency and high availability)
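The write volume above comes from writing each new pin once per follower. A minimal sketch of that fanout-on-write step, with hypothetical FollowStore/FeedStore interfaces standing in for Pinterest's actual services:

```java
// Fanout-on-write sketch (names are illustrative, not Pinterest's actual code):
// when a user pins, the pin is written once into every follower's feed.
import java.util.List;

public class FanoutSketch {
    interface FollowStore {                       // assumed interface
        List<Long> getFollowerIds(long userId);   // who follows this pinner?
    }
    interface FeedStore {                         // assumed interface
        void append(long followerId, long pinId, long creationTs);
    }

    private final FollowStore followStore;
    private final FeedStore feedStore;

    public FanoutSketch(FollowStore followStore, FeedStore feedStore) {
        this.followStore = followStore;
        this.feedStore = feedStore;
    }

    // One new pin turns into N feed writes, where N = number of followers.
    // This is why hundreds of millions of pins/month become billions of writes/day.
    public void fanoutNewPin(long pinnerId, long pinId, long creationTs) {
        for (long followerId : followStore.getFollowerIds(pinnerId)) {
            feedStore.append(followerId, pinId, creationTs);
        }
    }
}
```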
5.
“Following” Feed on HBase
             CreationTs=100,PinId=8   CreationTs=99,PinId=6   ...
UserId=678   <Empty>                  <Empty>                 ...
• Pins in following feed reverse chronologically sorted
• HBase Schema - choose wide over tall
- Exploit lexicographic sorting of column qualifiers for ordering
- Atomic transactions per user (inconsistencies get noticed at scale)
- Row level bloom filters to eliminate unnecessary seeks
- Prefix compression (FAST_DIFF)
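A sketch of how one feed entry maps onto this wide schema. The helper below assumes the column qualifier is a reversed creation timestamp followed by the pin id, so that lexicographic qualifier order yields newest-first; the real qualifier encoding, family name, and class names are not spelled out in the deck.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Wide-row layout sketch: one row per user, one column per feed entry.
// The (Long.MAX_VALUE - creationTs) qualifier prefix is an assumption; any
// encoding that makes lexicographic qualifier order equal reverse-chronological
// pin order works the same way.
public class FeedRowSketch {
    static final byte[] FAMILY = Bytes.toBytes("f");   // illustrative family name

    static byte[] rowKey(long userId) {
        return Bytes.toBytes(userId);                  // row key = user id
    }

    // Qualifiers sort lexicographically within a row, so putting the reversed
    // timestamp first makes a forward scan return the newest pins first.
    static byte[] qualifier(long creationTs, long pinId) {
        return Bytes.add(Bytes.toBytes(Long.MAX_VALUE - creationTs), Bytes.toBytes(pinId));
    }

    // A feed entry is just the qualifier; the cell value can stay empty,
    // matching the <Empty> values in the schema above.
    static Put feedEntry(long userId, long creationTs, long pinId) {
        Put put = new Put(rowKey(userId));
        put.add(FAMILY, qualifier(creationTs, pinId), new byte[0]);  // 0.94-era Put API
        return put;
    }
}
```

The row-level bloom filters and FAST_DIFF prefix compression mentioned above are column-family settings chosen at table-creation time (via HColumnDescriptor), not anything in the per-write path.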
6.
[Diagram: the Frontend enqueues async tasks (Follow, Unfollow, New Pin) onto a Message Bus; Workers dequeue the tasks and write to the Follow Store and Pin Store in HBase through a Thrift + Finagle layer]
Write Path
• Follow => put
• Unfollow => delete
• New Pin => multi put
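A sketch of those three operations against the feed table using the 0.94-era HTable client, reusing the FeedRowSketch helpers from the earlier sketch. The table name and the exact follow/unfollow semantics (how many of the followee's pins get backfilled or removed) are assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Write-path sketch: Follow => put, Unfollow => delete, New Pin => multi put.
public class WritePathSketch {
    private final HTable feedTable;

    public WritePathSketch() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        this.feedTable = new HTable(conf, "following_feed");  // assumed table name
    }

    // Follow => put: insert one of the followee's pins into the new follower's
    // feed row (a real follow would backfill several such entries).
    public void follow(long followerId, long pinId, long creationTs) throws IOException {
        feedTable.put(FeedRowSketch.feedEntry(followerId, creationTs, pinId));
    }

    // Unfollow => delete: remove a followee entry from the follower's row
    // (a real unfollow would delete all of that followee's entries).
    public void unfollow(long followerId, long pinId, long creationTs) throws IOException {
        Delete delete = new Delete(FeedRowSketch.rowKey(followerId));
        delete.deleteColumns(FeedRowSketch.FAMILY, FeedRowSketch.qualifier(creationTs, pinId));
        feedTable.delete(delete);
    }

    // New Pin => multi put: fan the pin out to every follower in one batched call.
    public void newPin(List<Long> followerIds, long pinId, long creationTs) throws IOException {
        List<Put> puts = new ArrayList<Put>(followerIds.size());
        for (long followerId : followerIds) {
            puts.add(FeedRowSketch.feedEntry(followerId, creationTs, pinId));
        }
        feedTable.put(puts);  // multi put
    }
}
```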
7.
Optimizing Writes
• Increase per region memstore size
- 512M memstore -> 40M HFile
- Fewer HFiles and hence less frequent compactions
• GC tuning
- More frequent but smaller pauses
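For reference, a hedged sketch of where those knobs live. The memstore flush size is a standard HBase property (normally set in hbase-site.xml on the region servers; it is shown programmatically here only to keep the example self-contained), and the GC comment describes a common CMS-era way to trade pause length for frequency; the deck does not list Pinterest's exact values or flags.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative region-server tuning for the write optimizations above.
public class WriteTuningSketch {
    public static Configuration regionServerTuning() {
        Configuration conf = HBaseConfiguration.create();

        // Larger per-region memstore: flush at 512 MB instead of the default,
        // so each flush produces a bigger HFile (the deck quotes ~40 MB after
        // encoding/compression) and compactions run less often.
        conf.setLong("hbase.hregion.memstore.flush.size", 512L * 1024 * 1024);

        return conf;
    }

    // GC side (hbase-env.sh, not Java code): "more frequent but smaller pauses"
    // is typically achieved with CMS plus a smaller young generation, e.g.
    //   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmn<small> -XX:CMSInitiatingOccupancyFraction=<n>
    // The exact flags used here are not stated in the deck.
}
```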
12.
Single Points of Failure
[Diagram: the Frontend issues dual writes through the Message Queue to Cluster 1 and Cluster 2, with cross-cluster replication between the clusters and a ZK quorum; reads are served from the clusters]
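A minimal sketch of the dual-write leg: the same put is applied to two independent clusters so that no single HBase cluster is a point of failure. The ZooKeeper quorum addresses and table name are placeholders, and a production worker would go through the message queue with retries rather than writing synchronously like this.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Dual-write sketch: every feed write goes to both clusters,
// each reached via its own ZooKeeper quorum.
public class DualWriteSketch {
    private final HTable primary;
    private final HTable secondary;

    public DualWriteSketch() throws IOException {
        Configuration c1 = HBaseConfiguration.create();
        c1.set("hbase.zookeeper.quorum", "zk-cluster1.example.com");   // placeholder
        Configuration c2 = HBaseConfiguration.create();
        c2.set("hbase.zookeeper.quorum", "zk-cluster2.example.com");   // placeholder

        this.primary = new HTable(c1, "following_feed");               // assumed table name
        this.secondary = new HTable(c2, "following_feed");
    }

    public void write(Put put) throws IOException {
        primary.put(put);     // cluster 1
        secondary.put(put);   // cluster 2 (cross-cluster replication as backstop)
    }
}
```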
13.
Single Points of Failure (contd)
[Diagram: HDFS NameNode storage options - ephemeral disk vs. EBS]
• No concept of HA shared storage on EC2
• Keep it simple
- HA namenode + QJM - hell, no!
- Operate two clusters, each in its own AZ
14.
Am I Better Off?
Redis vs HBase
• Sharding, load balancing and fault tolerance
• Longer feeds
• Resolve data inconsistencies
• Savings in $$
Cluster configuration
• hi1.4xlarge - SSD backed for performance parity
• HBase - 0.94.3 and 0.94.7
• HDFS - CDH 4.2.0
15.
And many more...
• Rich pins
• Duplicate pin notifications
• Pinterest analytics
• Recommendations - “People who pinned this also pinned”
More to come...