Apache HBase @Pinterest
Scaling our “Feed” storage
Varun Sharma
Software Engineer
June 13, 2013
What is Pinterest?
An online pinboard where you “curate” and “discover” things you love and go do them in real life
Discovery - “Follow” Model
[Diagram: “Follower” follows “Followee” in the follow interest graph]
• Follower indicates interest in Followee’s content
• Following feed - content collated from followees
“Following” Feed @Pinterest
[Diagram: a new pin fans out to Follower 1, Follower 2, Follower 3, ...]
Challenges @scale
• 100s of millions of pins/repins per month
• High fanout - billions of writes per day (High throughput)
• Billions of requests per month (Low latency and high availability)
“Following” Feed on HBase
Row key: UserId=678 | Columns: CreationTs=100,PinId=8 → <Empty> | CreationTs=99,PinId=6 → <Empty> | ...
• Pins in following feed reverse chronologically sorted
• HBase schema - choose wide over tall (see the sketch below)
- Exploit lexicographic sorting within columns for ordering
- Atomic transactions per user (inconsistencies get noticed at scale)
- Row level bloom filters to eliminate unnecessary seeks
- Prefix compression (FAST_DIFF)
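A minimal sketch (not from the deck) of the wide-row schema with the 0.94-era Java client. The table name "following_feed", the family "p", and the inverted-timestamp qualifier encoding are assumptions; the slides only show "CreationTs,PinId" qualifiers with empty values, and inverting the timestamp is one common way to make lexicographic column order come out reverse chronological.

```java
// Wide-row feed schema sketch: one row per user, one column per feed entry.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedSchemaSketch {
  static final byte[] FAMILY = Bytes.toBytes("p");   // assumed family name

  // Qualifier = inverted creation ts + pin id, so newer pins sort first.
  static byte[] qualifier(long creationTs, long pinId) {
    return Bytes.add(Bytes.toBytes(Long.MAX_VALUE - creationTs), Bytes.toBytes(pinId));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable feed = new HTable(conf, "following_feed");   // assumed table name
    long userId = 678, pinId = 8, creationTs = 100;
    Put put = new Put(Bytes.toBytes(userId));            // row key = user id
    put.add(FAMILY, qualifier(creationTs, pinId), new byte[0]);  // <Empty> value
    feed.put(put);   // all of a user's feed entries live in one atomic row
    feed.close();
  }
}
```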
Write Path
[Diagram: Frontend → async task enqueue → Message Bus → task dequeue → Workers → Thrift + Finagle layer → HBase (Follow Store, Pin Store); tasks: Follow, Unfollow, New Pin]
• Follow => put
• Unfollow => delete
• New Pin => multi put
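A minimal sketch of the three write operations, assuming hypothetical table names ("follow_store", "following_feed"), the family "p", and the qualifier helper from the schema sketch above; this is not Pinterest's actual worker code.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedWriters {
  static final byte[] F = Bytes.toBytes("p");   // assumed family name

  // Follow => single put into the follow store
  static void follow(HTable followStore, long follower, long followee) throws Exception {
    Put put = new Put(Bytes.toBytes(follower));
    put.add(F, Bytes.toBytes(followee), new byte[0]);
    followStore.put(put);
  }

  // Unfollow => delete from the follow store
  static void unfollow(HTable followStore, long follower, long followee) throws Exception {
    Delete del = new Delete(Bytes.toBytes(follower));
    del.deleteColumns(F, Bytes.toBytes(followee));
    followStore.delete(del);
  }

  // New pin => multi put: fan the pin out into every follower's feed row
  static void fanOutNewPin(HTable feed, List<Long> followers, byte[] pinQualifier) throws Exception {
    List<Put> puts = new ArrayList<Put>(followers.size());
    for (long followerId : followers) {
      Put put = new Put(Bytes.toBytes(followerId));
      put.add(F, pinQualifier, new byte[0]);
      puts.add(put);
    }
    feed.put(puts);   // batched puts, grouped per region server by the client
  }
}
```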
Optimizing Writes
• Increase per region memstore size
- 512M memstore -> 40M HFile
- Fewer HFiles and hence less frequent compactions
• GC tuning
- More frequent but smaller pauses
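A hedged sketch of the write-side tuning described above: raise the per-region memstore flush size to ~512 MB so each flush produces fewer, larger HFiles. The table/family names are illustrative assumptions, and the GC flags shown in the comment are generic CMS examples, not the exact flags used by Pinterest.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class WriteTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Cluster-wide default (hbase-site.xml equivalent): 512 MB memstore flush size.
    conf.setLong("hbase.hregion.memstore.flush.size", 512L * 1024 * 1024);

    // Or set it per table when creating the feed table (names assumed).
    HTableDescriptor feed = new HTableDescriptor("following_feed");
    feed.setMemStoreFlushSize(512L * 1024 * 1024);
    feed.addFamily(new HColumnDescriptor("p"));

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(feed);
    admin.close();

    // GC tuning lives in hbase-env.sh, not here. Illustrative flags only: CMS
    // with a small young gen trades throughput for more frequent, shorter pauses:
    //   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmn256m -XX:CMSInitiatingOccupancyFraction=70
  }
}
```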
Read Path
[Diagram: Frontend → Thrift + Finagle layer → HBase; first retrieve PinId(s) from the following feed, then retrieve pin metadata from the Pin Metadata Store]
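A minimal sketch of the read path, assuming the tables and qualifier layout from the earlier schema sketch: fetch the newest N columns of the user's feed row, then batch-get the pin metadata for those pin ids.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedReader {
  static final byte[] F = Bytes.toBytes("p");   // assumed family name

  static Result[] readFeed(HTable feed, HTable pinStore, long userId, int n) throws Exception {
    // 1. Retrieve PinId(s): one Get on the user's feed row.
    Result row = feed.get(new Get(Bytes.toBytes(userId)).addFamily(F));

    // Columns come back in lexicographic (i.e. reverse chronological, given the
    // assumed inverted-timestamp qualifiers) order, so the first n are the newest.
    List<Get> metadataGets = new ArrayList<Get>();
    KeyValue[] kvs = row.raw();
    for (int i = 0; i < kvs.length && i < n; i++) {
      byte[] qualifier = kvs[i].getQualifier();
      // Assumed layout: last 8 bytes of the qualifier are the pin id.
      long pinId = Bytes.toLong(qualifier, qualifier.length - Bytes.SIZEOF_LONG);
      metadataGets.add(new Get(Bytes.toBytes(pinId)));
    }

    // 2. Retrieve pin metadata: one batched multi-get against the pin store.
    return pinStore.get(metadataGets);
  }
}
```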
Optimizing Reads
Schema
• Prefix compression - FAST_DIFF - 4x size reduction
• Reduced block size - 16K
Cache
• More block cache (hot set/temporal locality)
• High cache hit rates
Other standard optimizations
• Short circuit local reads
• HBase checksums
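A hedged sketch of the read-side settings from this slide, applied to an assumed "p" column family with the 0.94-era API (the bloom-filter enum moved packages in later releases): FAST_DIFF encoding, 16K blocks, row-level blooms, block cache on. The short-circuit-read and checksum keys in the comment belong in the site configs, and their exact names vary slightly by version.

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class ReadTuning {
  static HColumnDescriptor feedFamily() {
    HColumnDescriptor p = new HColumnDescriptor("p");
    p.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // prefix compression, ~4x smaller
    p.setBlocksize(16 * 1024);                           // 16K blocks: cheaper point reads
    p.setBloomFilterType(StoreFile.BloomType.ROW);       // row blooms skip useless HFile seeks
    p.setBlockCacheEnabled(true);                        // keep the hot set in the block cache
    return p;
    // hbase-site.xml / hdfs-site.xml (key names vary slightly by version):
    //   dfs.client.read.shortcircuit = true        (short circuit local reads)
    //   hbase.regionserver.checksum.verify = true  (HBase-level checksums)
  }
}
```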
“Scale” Challenges
“Follow Unfollow” race
[Diagram: user A follows B at t1, then unfollows B at t2, with t1 < t2 (user order); the messages M(A→B follow) and M(A→B unfollow) are processed at t1’ and t2’ (msg queue order)]
• Lack of total ordering inside message queue
• Resolution - client side timestamps
• Example - use t1 and t2 as cell timestamps (see the sketch below)
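A minimal sketch of the client-timestamp resolution, assuming the follow edge lives at (row = follower, qualifier = followee) in a hypothetical "follow_store" table: the worker stamps each cell with the user-side time (t1/t2), so even if the unfollow message happens to be processed before the follow, the delete at t2 still masks the put at t1.

```java
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FollowUnfollowRace {
  static final byte[] F = Bytes.toBytes("p");   // assumed family name

  static void applyFollow(HTable followStore, long a, long b, long t1) throws Exception {
    Put put = new Put(Bytes.toBytes(a));
    put.add(F, Bytes.toBytes(b), t1, new byte[0]);   // cell timestamp = t1 (client time)
    followStore.put(put);
  }

  static void applyUnfollow(HTable followStore, long a, long b, long t2) throws Exception {
    Delete del = new Delete(Bytes.toBytes(a));
    del.deleteColumns(F, Bytes.toBytes(b), t2);      // masks every version with ts <= t2
    followStore.delete(del);
  }
}
```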
Unbounded dataset growth
[Diagram, before: user1’s feed row keeps accumulating columns - user1,pin1  user1,pin2  user1,pin3 ... user1,pin4  user1,pin5  user1,pin6 ...]
• MapReduce / realtime trimming unfeasible
• Coprocessors - trim during compactions (see the sketch below)
[Diagram, after a trimming compaction: the same row holds far fewer columns - user1,pin1  user1,pin2  user1,pin4 ...]
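A sketch (not Pinterest's actual coprocessor) of trimming a feed row during compaction with a RegionObserver, written against the 0.94-era API where the compaction scanner yields KeyValue lists; later HBase versions change these signatures. The MAX_PINS_PER_USER cap is an illustrative assumption, as is the reverse-chronological qualifier encoding from the earlier schema sketch.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedTrimObserver extends BaseRegionObserver {
  private static final int MAX_PINS_PER_USER = 2000;   // assumed trim threshold

  // Overrides BaseRegionObserver.preCompact (0.94-era signature; later versions
  // add ScanType/CompactionRequest parameters). Cells dropped by the wrapper are
  // simply not rewritten into the new HFile, so old feed entries vanish as a
  // side effect of compaction.
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
                                    Store store, final InternalScanner scanner) {
    return new InternalScanner() {
      private byte[] currentRow = null;
      private int cellsInRow = 0;

      private void trim(List<KeyValue> results) {
        Iterator<KeyValue> it = results.iterator();
        while (it.hasNext()) {
          KeyValue kv = it.next();
          if (currentRow == null || !Bytes.equals(currentRow, kv.getRow())) {
            currentRow = kv.getRow();   // new user row: reset the per-row counter
            cellsInRow = 0;
          }
          if (++cellsInRow > MAX_PINS_PER_USER) {
            it.remove();                // drop columns past the cap (the oldest ones,
          }                             // given reverse-chronological qualifiers)
        }
      }

      public boolean next(List<KeyValue> results) throws IOException {
        boolean more = scanner.next(results);
        trim(results);
        return more;
      }

      public boolean next(List<KeyValue> results, int limit) throws IOException {
        boolean more = scanner.next(results, limit);
        trim(results);
        return more;
      }

      // 0.94-era metric overloads (removed in later versions); delegate and trim.
      public boolean next(List<KeyValue> results, String metric) throws IOException {
        return next(results);
      }

      public boolean next(List<KeyValue> results, int limit, String metric) throws IOException {
        return next(results, limit);
      }

      public void close() throws IOException {
        scanner.close();
      }
    };
  }
}
```

Such an observer is typically attached either cluster-wide via hbase.coprocessor.region.classes or per table; the deck does not say which deployment Pinterest used.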
MTTR
[Diagram: failed status checks → 30s ZooKeeper session timeout (HBase Master) and 20s stale node timeout (HDFS NN)]
• MTTR < 2 minutes consistently (see the config sketch below)
HBase
• ZK session timeout 30 sec
HDFS
• Tight timeouts
- socket timeout < 5 sec
- connect timeout - 1 sec x 3
• Stale node timeout - 20 sec
• Stale nodes avoided during
- WAL reads
- Lease recovery
- Writing splits
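A hedged sketch of the MTTR-related timeouts from this slide, expressed as the equivalent hbase-site.xml / hdfs-site.xml keys set on a Configuration. Key names and defaults vary a bit across Hadoop/HBase versions; the values are the ones on the slide, and the mapping of the "avoid stale nodes" flags onto WAL reads, lease recovery, and split writes is an interpretation, not a quote.

```java
import org.apache.hadoop.conf.Configuration;

public class MttrTimeouts {
  static Configuration tighten(Configuration conf) {
    // HBase: detect a dead region server via a 30 sec ZK session timeout.
    conf.setInt("zookeeper.session.timeout", 30 * 1000);

    // HDFS client: tight socket and connect timeouts so reads fail over quickly.
    conf.setInt("dfs.socket.timeout", 5 * 1000);                  // a.k.a. dfs.client.socket-timeout
    conf.setInt("ipc.client.connect.timeout", 1 * 1000);          // 1 sec per attempt...
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3); // ...x 3 attempts

    // HDFS NN: mark a datanode stale after 20 sec and steer reads/writes away
    // from it (WAL reads, lease recovery, writing splits).
    conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000);
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    return conf;
  }
}
```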
Single Points of Failure
[Diagram: Frontend → Message Queue → dual writes to Cluster 1 and Cluster 2, plus cross cluster replication and a ZK quorum; reads are served from one cluster]
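A hedged sketch of the dual-write idea on the async write path: a worker applies the same mutation to two independent clusters. The quorum addresses are placeholders, the "following_feed" table name is assumed, and the slide does not specify how ZooKeeper is shared between the clusters or how failed writes are retried; retrying off the message queue is one plausible policy, not a stated one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DualWriter {
  private final HTable cluster1Feed;
  private final HTable cluster2Feed;

  public DualWriter() throws Exception {
    Configuration c1 = HBaseConfiguration.create();
    c1.set("hbase.zookeeper.quorum", "zk-cluster1.example.com");   // placeholder
    Configuration c2 = HBaseConfiguration.create();
    c2.set("hbase.zookeeper.quorum", "zk-cluster2.example.com");   // placeholder
    cluster1Feed = new HTable(c1, "following_feed");
    cluster2Feed = new HTable(c2, "following_feed");
  }

  // Apply the same put to both clusters; if either write fails, the enqueued
  // task can be retried, since the mutation is idempotent under a fixed timestamp.
  public void write(Put put) throws Exception {
    cluster1Feed.put(put);
    cluster2Feed.put(put);
  }
}
```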
Single Points of Failure (contd)
[Diagram: HDFS NN storage options on EC2 - ephemeral disk vs. EBS]
• No concept of HA shared storage on EC2
• Keep it simple
- HA namenode + QJM - hell, no!
- Operate two clusters, each in its own AZ
Am I Better Off?
Redis vs HBase
• Sharding, load balancing and fault tolerance
• Longer feeds
• Resolve data inconsistencies
• Savings in $$
Cluster configuration
• hi1.4xlarge - SSD backed for performance parity
• HBase - 0.94.3 and 0.94.7
• HDFS - CDH 4.2.0
And many more...
• Rich pins
• Duplicate pin notifications
• Pinterest analytics
• Recommendations - “People who pinned this also pinned”
More to come...