Voldemort : Prototype to Production

Discusses how Voldemort 'grew up' at LinkedIn.

Presentation Transcript

  • Voldemort : Prototype to Production
    A Journey to 1M Operations/Sec
  • Voldemort Intro
    ●  Amazon Dynamo-style NoSQL k-v store
       ○  get(k)
       ○  put(k, v)
       ○  getall(k1, k2, ...)
       ○  delete(k)
    ●  Tunable Consistency
    ●  Highly Available
    ●  Automatic Partitioning
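    For readers new to the API, here is a minimal sketch of the four operations listed above
    using the open-source native Java client. The bootstrap URL and store name are placeholders,
    and exact signatures can vary between Voldemort releases.

      import java.util.Arrays;

      import voldemort.client.ClientConfig;
      import voldemort.client.SocketStoreClientFactory;
      import voldemort.client.StoreClient;
      import voldemort.client.StoreClientFactory;
      import voldemort.versioning.Versioned;

      public class ClientSketch {
          public static void main(String[] args) {
              // Bootstrap against any node in the cluster; URL and store name are placeholders.
              StoreClientFactory factory = new SocketStoreClientFactory(
                      new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
              StoreClient<String, String> client = factory.getStoreClient("test-store");

              client.put("k1", "v1");                      // put(k, v)
              Versioned<String> value = client.get("k1");  // get(k) returns a versioned value
              System.out.println(value.getValue());
              client.getAll(Arrays.asList("k1", "k2"));    // getall(k1, k2, ...)
              client.delete("k1");                         // delete(k)
          }
      }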
  • Voldemort Intro
    ●  Pluggable Storage
       ○  BDB-JE - Primary OLTP store
       ○  Read Only - Reliable serving layer for Hadoop datasets
       ○  MySQL - Good ol' MySQL without native replication
       ○  InMemory - Backed by Java ConcurrentHashMap
    ●  Clients
       ○  Native Java Client
       ○  REST Coordinator Service
    ●  Open source
    ●  More at project-voldemort.com
  • Agenda
    ○  High Level Overview
    ○  Usage At LinkedIn
    ○  Storage Layer
    ○  Cluster Expansion
  • Architecture
    [Diagram: a Native Java Client and the Coordinator Service issue get()/put()/getall() requests to Servers 1-4, each running a client service and a BDB store; a routing table maps each key to its partition and replica servers, e.g. "k1" -> p1 -> s1, s2.]
  • Voldemort Intro: Consistent Hashing
    ▪  Consistent Hashing Idea
    ▪  Divide key space into partitions
       –  Partitions: A, B, C, ..., H
       –  hash(key) mod #partitions = pkey
    ▪  Randomly map partitions to servers
    ▪  Locate servers from keys
       –  K1 => A => S1
       –  K2 => C => S3
    [Diagram: a ring of partitions A-H mapped to servers S1-S4, with keys K1 and K2 landing on partitions A and C.]
  • Voldemort Intro: Consistent Hashing with Replication
    ▪  Replication factor (RF) - how many replicas to have
    ▪  Replica selection
       –  Find the primary partition
       –  Walk the ring to create a preference list
          ▪  Find RF-1 additional servers
          ▪  Skip servers already in the list
    ▪  Examples, RF = 3:
       –  K1: S1, S2, S3
       –  K2: S3, S1, S4
    [Diagram: the same ring, annotated with K1 -> [S1, S2, S3] and K2 -> [S3, S1, S4].]
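    A minimal sketch of the replica selection described above: hash the key to its primary
    partition, then walk the ring until RF distinct servers are collected. Arrays.hashCode stands
    in for the real hash function, and the partitionToServer array is an illustrative layout.

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      public final class PreferenceList {

          // partitionToServer[i] = id of the server that owns partition i on the ring.
          public static List<Integer> replicas(byte[] key, int[] partitionToServer, int rf) {
              int numPartitions = partitionToServer.length;
              // hash(key) mod #partitions = primary partition (pkey)
              int primary = Math.floorMod(Arrays.hashCode(key), numPartitions);

              List<Integer> servers = new ArrayList<>();
              for (int i = 0; i < numPartitions && servers.size() < rf; i++) {
                  int server = partitionToServer[(primary + i) % numPartitions];
                  if (!servers.contains(server)) {  // skip servers already in the list
                      servers.add(server);
                  }
              }
              return servers;  // e.g. K1 -> [S1, S2, S3] for RF = 3
          }
      }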
  • Voldemort Intro: Zone Aware Replication
    ▪  Servers divided into zones
    ▪  Zone = Data Center
    ▪  Per-zone replication factor
    ▪  Local zone vs. remote zones
       –  Local zone (LZ) is where the client is
    ▪  Two-zone example:
       –  LZ = 1
       –  Zone 1: S1, S3; RF = 2
       –  Zone 2: S2, S4; RF = 1
       –  Preference lists:
          ▪  K1: Z1: S1, S3; Z2: S2
          ▪  K2: Z1: S3, S1; Z2: S4
    [Diagram: the ring annotated with K1 -> [ Z1 [S1, S3], Z2 [S2] ] and K2 -> [ Z1 [S3, S1], Z2 [S4] ].]
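    A sketch of the zone-aware variant: the same ring walk, but servers are collected per zone
    until each zone's replication factor is satisfied. The zone and server id encodings here are
    illustrative, not the actual cluster metadata format.

      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      public final class ZonedPreferenceList {

          public static Map<Integer, List<Integer>> replicas(int primaryPartition,
                                                             int[] partitionToServer,
                                                             int[] serverToZone,
                                                             Map<Integer, Integer> zoneRf) {
              int numPartitions = partitionToServer.length;
              Map<Integer, List<Integer>> byZone = new LinkedHashMap<>();
              for (int i = 0; i < numPartitions; i++) {
                  int server = partitionToServer[(primaryPartition + i) % numPartitions];
                  int zone = serverToZone[server];
                  List<Integer> zoneList = byZone.computeIfAbsent(zone, z -> new ArrayList<>());
                  // Take the server only if its zone still needs replicas and it is not already chosen.
                  if (zoneList.size() < zoneRf.getOrDefault(zone, 0) && !zoneList.contains(server)) {
                      zoneList.add(server);
                  }
              }
              return byZone;  // e.g. K1 -> { Z1: [S1, S3], Z2: [S2] }
          }
      }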
  • Voldemort @ LinkedIn
    ●  385 Stores (238 R-O, 147 R-W)
    ●  3 Zones
    ●  14 Clusters
    ●  ~200 TB
    ●  ~750 Servers
  • Voldemort @ LinkedIn
    ●  ~1M storage ops/sec (22% R-O, 78% R-W)
  • Voldemort @ LinkedIn
    ●  17% of all LinkedIn services embed a direct client
    ●  Fast (95th percentile < 20 ms) for almost all clients
  • Voldemort @ LinkedIn
    ●  Front Facing
       ○  Search (Recruiter + Site)
       ○  People You May Know
       ○  inShare
       ○  Media thumbnails
       ○  Notifications
       ○  Endorsements
       ○  Skills
       ○  Frequency capping Ads
       ○  Custom Segments
       ○  Who Viewed Your Profile
       ○  People You Want to Hire
    ●  Internal Services
       ○  Email cache
       ○  Email delivery stack
       ○  Recommendation Services
       ○  Personalization Services
       ○  Mobile Auth
    ●  Not exhaustive!
  • Growth Since 2011
  • Storage Layer
    ●  Berkeley DB Java Edition
       ○  Embedded
       ○  100% Java
       ○  ACID compliant
       ○  Log structured
    ●  Voldemort uses
       ○  Vanilla k-v APIs
       ○  Cursors for scans
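    For context, a minimal sketch of the "vanilla k-v APIs" and cursor scans the slide refers to,
    using BDB-JE (com.sleepycat.je) directly; the environment directory and database name are
    placeholders.

      import java.io.File;

      import com.sleepycat.je.Cursor;
      import com.sleepycat.je.Database;
      import com.sleepycat.je.DatabaseConfig;
      import com.sleepycat.je.DatabaseEntry;
      import com.sleepycat.je.Environment;
      import com.sleepycat.je.EnvironmentConfig;
      import com.sleepycat.je.LockMode;
      import com.sleepycat.je.OperationStatus;

      public class BdbSketch {
          public static void main(String[] args) {
              EnvironmentConfig envConfig = new EnvironmentConfig();
              envConfig.setAllowCreate(true);
              Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

              DatabaseConfig dbConfig = new DatabaseConfig();
              dbConfig.setAllowCreate(true);
              Database db = env.openDatabase(null, "test-store", dbConfig);

              // Vanilla k-v API, as Voldemort's BDB engine uses it.
              db.put(null, new DatabaseEntry("k1".getBytes()), new DatabaseEntry("v1".getBytes()));

              // Cursor scan, the primitive behind retention and rebalance jobs.
              Cursor cursor = db.openCursor(null, null);
              DatabaseEntry key = new DatabaseEntry();
              DatabaseEntry value = new DatabaseEntry();
              while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                  System.out.println(new String(key.getData()) + " -> " + new String(value.getData()));
              }
              cursor.close();

              db.close();
              env.close();
          }
      }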
  • Storage Layer Rewrite: Where We Wanted To Be
    ●  Predictable online performance
    ●  Scan jobs
       ○  Non-intrusive, fast
    ●  Elastic
       ○  Recover failed nodes in minutes
       ○  Add hardware overnight
  • Storage Layer Rewrite: Where We Really Were
    1.  GC Issues
        a.  Unpredictable GC churn
        b.  Scan jobs cause full GCs
    2.  Slow Scans (even on SSDs)
        a.  Daily retention job / slop pusher
        b.  Not partition aware
    3.  Memory Management
        a.  Zero control over a single store's share
    4.  Managing Multiple Versions
        a.  Lock contention
        b.  Additional bdb-delete() cost during put()
    5.  Weaker Durability on Crash
        a.  Dirty writes in heap
  • Storage Layer Rewrite: BDB-JE
    [Diagram: the BDB cache sits on the JVM heap, holding the index and leaf nodes of the B+Tree, backed by disk; server threads plus the BDB checkpointer and cleaner all operate against it.]
  • Storage Layer Rewrite: Multi-Tenant Example
    [Diagram: one BDB cache on the JVM heap holds the B+Trees of stores A, B, C, and D; server threads and per-store cleaner and checkpointer threads all compete for it.]
  • Storage Layer Rewrite: Road To Recovery
    ●  Move data off heap
       ○  Only the index sits on heap
    ●  Cache control to reduce scan impact
    ●  Partition Aware Storage
       ○  Range scans to the rescue
    ●  Dynamic Cache Partitioning
       ○  Control how much heap goes to a single store
    ●  SSD Aware Optimizations
       ○  Checkpointing
       ○  Cache policy
    ●  Manage versions directly
       ○  Treat BDB as a plain k-v store
  • Storage Layer Rewrite: Moving Data Off Heap
    ●  Much improved GC
       ○  Memory churn
       ○  Promotions
    ●  SSD-aware hit-the-disk design
    ●  Strong durability on crash
       ○  No runaway heap of dirty writes
    [Diagram: on put(k, v), only the index is touched on the JVM heap; the old and new leaf records live on SSD / in the page cache.]
  • Storage Layer Rewrite: Reducing Scan Impact
    ●  Massive cache pollution
       ○  Throttling not an option
    ●  Exercise cursor-level control
    ●  Sustained rates up to 30-40K/sec
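    A sketch of what "cursor-level control" can look like: BDB-JE's CacheMode lets a scan evict
    what it touches so it does not pollute the cache serving online traffic. The choice of
    EVICT_BIN below is an assumption; the slide does not say which mode Voldemort settled on.

      import com.sleepycat.je.CacheMode;
      import com.sleepycat.je.Cursor;
      import com.sleepycat.je.Database;
      import com.sleepycat.je.DatabaseEntry;
      import com.sleepycat.je.LockMode;
      import com.sleepycat.je.OperationStatus;

      public final class ScanWithCacheControl {

          public static void scan(Database db) {
              Cursor cursor = db.openCursor(null, null);
              try {
                  // Evict leaf nodes and their parent BINs as the cursor moves past them;
                  // EVICT_LN is a milder alternative that keeps the BINs cached.
                  cursor.setCacheMode(CacheMode.EVICT_BIN);

                  DatabaseEntry key = new DatabaseEntry();
                  DatabaseEntry value = new DatabaseEntry();
                  while (cursor.getNext(key, value, LockMode.READ_UNCOMMITTED) == OperationStatus.SUCCESS) {
                      // process(key, value): retention check, slop push, fetch for rebalance, ...
                  }
              } finally {
                  cursor.close();
              }
          }
      }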
  • Storage Layer Rewrite: Managing Versions Directly
    ●  No more extra delete()
    ●  No more separate duplicate tree
       ○  Much improved locking performance
    ●  More compact storage
    [Diagram: before, versions V1 and V2 hang off a BIN -> DIN -> DBIN duplicate subtree; after, V1 and V2 are packed into a single record under the BIN.]
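    An illustrative sketch of packing every version of a key into one BDB record instead of a
    duplicate (DIN/DBIN) subtree. The length-prefixed layout is hypothetical; the real Voldemort
    on-disk format differs.

      import java.nio.ByteBuffer;
      import java.util.ArrayList;
      import java.util.List;

      public final class VersionPacking {

          // Pack serialized versions as: count, then (length, bytes) per version.
          public static byte[] pack(List<byte[]> versions) {
              int size = 4;
              for (byte[] v : versions) size += 4 + v.length;
              ByteBuffer buf = ByteBuffer.allocate(size);
              buf.putInt(versions.size());
              for (byte[] v : versions) {
                  buf.putInt(v.length);
                  buf.put(v);
              }
              return buf.array();
          }

          public static List<byte[]> unpack(byte[] packed) {
              ByteBuffer buf = ByteBuffer.wrap(packed);
              int count = buf.getInt();
              List<byte[]> versions = new ArrayList<>(count);
              for (int i = 0; i < count; i++) {
                  byte[] v = new byte[buf.getInt()];
                  buf.get(v);
                  versions.add(v);
              }
              return versions;
          }
      }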
  • Storage Layer Rewrite: SSD Aware Optimizations
    ●  Checkpoints on SSD
       ○  Age-old recovery time vs. performance tradeoff
    ●  Predictability
       ○  Level-based policy
    ●  Streaming writes
       ○  Turn off checkpointer
    ●  BDB5 support
       ○  Much better compaction
       ○  Much less index metadata
    [Chart: checkpointer interval vs. recovery time]
  • Storage Layer Rewrite: Partition Aware Storage
    [Diagram: before, the B+Tree is keyed by "Key" alone, so one partition's keys (k1, k3, k5, ...) are scattered across subtrees; after, keys are prefixed with the partition id, so each partition (P0, P1) becomes a contiguous subtree under the root.]
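    A sketch of partition-aware keys: prefixing the stored key with its partition id keeps each
    partition contiguous in the B+Tree, so restore and rebalance can range-scan only the
    partitions they need. The 2-byte prefix and the getSearchKeyRange positioning are
    illustrative assumptions, not the exact Voldemort encoding.

      import java.nio.ByteBuffer;

      import com.sleepycat.je.Cursor;
      import com.sleepycat.je.Database;
      import com.sleepycat.je.DatabaseEntry;
      import com.sleepycat.je.LockMode;
      import com.sleepycat.je.OperationStatus;

      public final class PartitionAwareKeys {

          // Storage key = 2-byte partition id prefix + original key bytes.
          public static byte[] storageKey(short partitionId, byte[] key) {
              return ByteBuffer.allocate(2 + key.length).putShort(partitionId).put(key).array();
          }

          // Range-scan a single partition instead of walking the whole database.
          public static void scanPartition(Database db, short partitionId) {
              Cursor cursor = db.openCursor(null, null);
              try {
                  DatabaseEntry key =
                          new DatabaseEntry(ByteBuffer.allocate(2).putShort(partitionId).array());
                  DatabaseEntry value = new DatabaseEntry();
                  // Position at the first key >= the partition prefix, then read until the prefix changes.
                  OperationStatus status = cursor.getSearchKeyRange(key, value, LockMode.DEFAULT);
                  while (status == OperationStatus.SUCCESS
                          && ByteBuffer.wrap(key.getData()).getShort() == partitionId) {
                      // process(key, value)
                      status = cursor.getNext(key, value, LockMode.DEFAULT);
                  }
              } finally {
                  cursor.close();
              }
          }
      }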
  • Storage Layer Rewrite: Speed Up
    ●  Restore: 1 day -> 1 hour
    ●  Rebalancing: ~week -> hours
    [Chart: percentage of partitions scanned]
  • Storage Layer Rewrite: Dynamic Cache Partitioning
    ●  Control share of heap per store
       ○  Dynamically add/reduce memory
       ○  Currently isolating a bursty store
    ●  Improve capacity model
       ○  More production validation?
       ○  Auto-tuning mechanisms?
    ●  Isolate at the JVM level?
       ○  Rethink deployment model
  • Storage Layer Rewrite: Wins In Production
    [Charts: GC reined in; storage latency way down]
  • Cluster Expansion Rewrite
    ●  Basis of the scale-out philosophy
    ●  Cluster expansion
       ○  Add servers to an existing cluster
    ●  Zero-downtime operation
    ●  Transparent to clients
       ○  Functionality
       ○  Mostly performance too
  • Cluster Expansion Rewrite: Types Of Clusters
    ●  Zoned Read-Write
       ○  Zone = Data Center
    ●  Non-Zoned
       ○  Read-Write
       ○  Read-Only (Hadoop BuildAndPush)
  • Expansion Example
    [Diagram: Zone 1 and Zone 2 before and after expansion; new servers (marked "New") are added to each zone, and primary (P1-P4) and secondary (S1-S4) partition replicas are redistributed onto them.]
  • Cluster Expansion Rewrite: Expansion In Action 1 - Change Cluster Topology
    [Diagram: the Rebalance Controller pushes the new cluster topology, which now includes the new servers in Zone 1 and Zone 2.]
  • Cluster Expansion Rewrite: Expansion In Action 2 - Setup Proxy Bridges
    1.  Change cluster topology
    2.  Set up proxy bridges
    [Diagram: proxy bridges connect the new servers back to the existing servers in each zone.]
  • Cluster Expansion Rewrite: Expansion In Action 3 - Client Picks Up New Topology
    1.  Change cluster topology
    2.  Proxy requests based on old topology
    3.  Client picks up the change
    [Diagram: the client now routes against the new topology.]
  • Cluster Expansion Rewrite: Expansion In Action 4 - Move Partitions
    1.  Change cluster topology
    2.  Proxy requests based on old topology
    3.  Client picks up the change
    4.  Move partitions
    [Diagram: partitions are moved with local moves inside each zone and cross-DC moves between zones, coordinated by the Rebalance Controller.]
  • Cluster Expansion Rewrite: Expansion In Action
    1.  Change cluster topology
    2.  Client picks up the change
    3.  Proxy requests based on old topology
    4.  Move partitions
    [Diagram: the full flow, with local and cross-DC moves driven by the Rebalance Controller and proxy bridges covering in-flight requests.]
  • Cluster Expansion Rewrite: Problems
    ●  One ring spanning data centers
       ○  Cross-datacenter data moves/proxies
    ●  Not safely abortable
       ○  Additional cleanup/consolidation
    ●  Cannot add new data centers
    ●  Opaque planner code
       ○  No special treatment of zones
    ●  Lack of tools
       ○  Skew analysis
       ○  Repartitioning/balancing utility
  • Redesign: Zone N-ary Philosophy
    ●  Given a partition P whose mapping has changed, the data move and proxy bridge are set up between the old nth replica of P in zone Z (the donor) and the new nth replica of P in the same zone Z (the stealer).
    [Diagram: donor and stealer servers within each zone, connected by a data move and a proxy bridge; no server in another zone is involved.]
  • Cluster Expansion Rewrite: Redesign Advantages
    ●  Simple, yet powerful
    ●  Feasible alternative to breaking the ring
       ○  Expensive to rewrite all of DR
    ●  No more cross-datacenter moves
    ●  Aligns the proxy bridge mechanism with the planner logic
    ●  Principally applied
       ○  Abortable rebalances
       ○  Zone expansion
  • Cluster Expansion Rewrite: Abortable Rebalance
    ●  Plans go wrong
    ●  Introducing proxy puts
       ○  Safely roll back to the old topology
    ●  Avoid data loss and ad hoc repairs
    ●  Double write load during rebalance
    [Sequence diagram: a put(k, v) arrives at the stealer; the stealer proxy-gets k from the donor to obtain v_old, does a local put of v_old and then of v, acknowledges success, and proxy-puts (k, v) back to the donor.]
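    A minimal sketch of the proxy-put flow shown on this slide: during a rebalance the stealer
    first seeds itself with the donor's current value, applies the new write locally, and then
    writes it back to the donor, so aborting and falling back to the old topology loses nothing.
    The Store interface and class names here are hypothetical.

      interface Store {
          byte[] get(byte[] key);
          void put(byte[] key, byte[] value);
      }

      final class StealerNode {
          private final Store localStore;  // storage engine on the stealer
          private final Store donorProxy;  // remote client to the donor node

          StealerNode(Store localStore, Store donorProxy) {
              this.localStore = localStore;
              this.donorProxy = donorProxy;
          }

          void put(byte[] key, byte[] value) {
              // 1. proxy-get the current value from the donor and seed it locally,
              //    so the stealer's version history is complete.
              byte[] oldValue = donorProxy.get(key);
              if (oldValue != null) {
                  localStore.put(key, oldValue);
              }
              // 2. apply the new write locally and acknowledge the client.
              localStore.put(key, value);
              // 3. proxy-put the new value back to the donor (the double write load);
              //    if the rebalance is aborted, the donor still holds every write.
              donorProxy.put(key, value);
          }
      }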
  • Cluster Expansion Rewrite: Zone Expansion
    ●  Builds upon the Zone N-ary idea
    ●  Fetch data from an existing zone
    ●  No proxy bridges
       ○  No donors in the same zone
    ●  Cannot read from the new zone until complete
  • Cluster Expansion Rewrite: New Rebalance Utilities
    ●  PartitionAnalysis
       ○  Determine the skewness of a cluster
    ●  Repartitioner
       ○  Improve partition balance
       ○  Greedy-random swapping
    ●  RebalancePlanner
       ○  Incorporates the Zone N-ary logic
       ○  Operational insights: storage overhead, probability the client will pick up new metadata
    ●  RebalanceController
       ○  Cleaner reimplementation based on the new planner/scheduler
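    An illustrative sketch of the kind of check a PartitionAnalysis-style utility performs:
    compare each server's partition count against a perfectly balanced assignment. The real
    tool's metrics are richer; this is only a stand-in.

      import java.util.List;
      import java.util.Map;

      public final class PartitionSkew {

          // Returns the maximum relative deviation from the ideal partitions-per-server count;
          // 0.0 means the cluster is perfectly balanced.
          public static double skew(Map<Integer, List<Integer>> partitionsByServer, int totalPartitions) {
              double ideal = (double) totalPartitions / partitionsByServer.size();
              double maxDeviation = 0.0;
              for (List<Integer> partitions : partitionsByServer.values()) {
                  maxDeviation = Math.max(maxDeviation, Math.abs(partitions.size() - ideal) / ideal);
              }
              return maxDeviation;
          }
      }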
  • Cluster Expansion Rewrite: Wins In Production
    ●  7 zoned RW clusters expanded into a new zone
       ○  Hiccups resolved overnight
       ○  Abortability is handy
    ●  Small details -> big difference
       ○  Proxy pause period
       ○  Accurate progress reporting
       ○  Proxy get/getall optimization