
Hadoop and Voldemort @ LinkedIn


Bhupesh Bansal, LinkedIn

  1. Hadoop & Voldemort @ LinkedIn, Bhupesh Bansal, January 20, 2010
  2. The plan <ul><li>What is Project Voldemort ? </li></ul><ul><ul><li>Motivation </li></ul></ul><ul><ul><li>New features </li></ul></ul><ul><ul><li>In production </li></ul></ul><ul><li>Hadoop & LinkedIn </li></ul><ul><ul><li>LinkedIn Hadoop ecosystem </li></ul></ul><ul><ul><li>Hadoop & Voldemort </li></ul></ul><ul><li>References </li></ul><ul><li>Q&A </li></ul>
  3. Introduction <ul><li>Project Voldemort is a distributed, scalable, highly available key/value storage system. </li></ul><ul><ul><li>Inspired by Amazon’s Dynamo paper and memcached </li></ul></ul><ul><ul><li>Online storage solution which scales horizontally </li></ul></ul><ul><ul><ul><li>High throughput, low latency </li></ul></ul></ul><ul><li>What does it do for you? </li></ul><ul><ul><li>Provides a simple key/value API for clients. </li></ul></ul><ul><ul><li>Data partitioning and replication </li></ul></ul><ul><ul><li>Provides consistency guarantees even in the presence of failures </li></ul></ul><ul><ul><li>Scales well (amount of data, number of clients) </li></ul></ul>
  4. Motivation I: Big Data Proprietary & Confidential 01/21/10
  5. Motivation II: Data Driven Features
  6. Motivation III
  7. Motivation IV
  8. Why Is This Hard? <ul><li>Failures in a distributed system are much more complicated </li></ul><ul><ul><li>A can talk to B does not imply B can talk to A </li></ul></ul><ul><li>Nodes will fail and come back to life with stale data </li></ul><ul><li>I/O has high request latency variance </li></ul><ul><ul><li>Is the node down, or just slow? </li></ul></ul><ul><li>Intermittent failures are common </li></ul><ul><ul><li>Should I try this node now? </li></ul></ul><ul><li>There are fundamental trade-offs between availability and consistency </li></ul><ul><ul><li>CAP theorem </li></ul></ul><ul><li>Users must be isolated from these problems </li></ul>
  9. Some problems we have worked on lately <ul><li>Performance improvements </li></ul><ul><ul><li>Push computation to data (server-side views) </li></ul></ul><ul><ul><li>Data compression </li></ul></ul><ul><ul><li>Failure detection </li></ul></ul><ul><li>Reliable Testing </li></ul><ul><ul><li>Testing is hard; distributed systems make it harder </li></ul></ul><ul><li>Voldemort Rebalancing </li></ul><ul><ul><li>Dynamically add nodes to a running cluster </li></ul></ul><ul><li>Administration </li></ul><ul><ul><li>Often ignored, but very important </li></ul></ul>
  10. Server side views <ul><li>Motivation: push computation to data </li></ul><ul><li>Create custom views on server side for specific transformations. </li></ul><ul><ul><li>Avoids network transfer of big blobs </li></ul></ul><ul><ul><li>Avoids serialization CPU and I/O cost </li></ul></ul><ul><li>Examples </li></ul><ul><ul><ul><li>Range query over denormalized list stored as value. </li></ul></ul></ul><ul><ul><ul><li>Append() operation for list values. </li></ul></ul></ul><ul><ul><ul><li>Filters/aggregates on specific fields etc. </li></ul></ul></ul>
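The view examples above can be sketched in a few lines. This is an illustrative sketch only, not Voldemort's actual view API; the function names are invented. The point is that the transformation runs next to the data, so only the (small) result crosses the network instead of the whole stored blob.

```python
# Hypothetical server-side view functions. Each takes the deserialized
# value (here, a list stored under a single key) and returns only the
# transformed result to the client.

def range_view(value_list, start, end):
    """Server-side range query over a denormalized list stored as one value."""
    return [item for item in value_list if start <= item < end]

def append_view(value_list, new_item):
    """Server-side append, avoiding a full read-modify-write round trip."""
    return value_list + [new_item]
```

A client asking for `range_view(..., 4, 10)` over a million-element list would receive only the matching slice, not the full value.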
  11. Failure Detection <ul><li>Need to maintain an up-to-date availability status for each server. </li></ul><ul><li>Detect failures earlier </li></ul><ul><ul><li>Avoid waiting for failed nodes while serving. </li></ul></ul><ul><li>Reduce false positives </li></ul><ul><ul><li>Maintains proper load balancing </li></ul></ul><ul><li>Allowance for intermittent or temporary failures. </li></ul><ul><li>Re-establish node availability asynchronously. </li></ul><ul><li>Contributed by Kirk True </li></ul>
  12. EC2 based testing <ul><li>Testing “in the cloud” </li></ul><ul><ul><ul><li>Distributed systems have to be tested on multi-node clusters </li></ul></ul></ul><ul><ul><ul><li>Distributed systems have complex failure scenarios </li></ul></ul></ul><ul><ul><ul><li>A storage system, above all, must be stable </li></ul></ul></ul><ul><ul><ul><li>Automated testing allows rapid iteration while maintaining confidence in systems’ correctness and stability </li></ul></ul></ul><ul><li>EC2-based testing framework </li></ul><ul><ul><li>Tests are invoked programmatically </li></ul></ul><ul><ul><li>Adaptable to other cloud hosting providers </li></ul></ul><ul><ul><li>Will run on a regular basis </li></ul></ul><ul><ul><li>Contributed by Kirk True </li></ul></ul>
  13. Coming this Jan (finally): Rebalancing <ul><li>Voldemort Rebalancing </li></ul><ul><ul><li>Capability to add/delete nodes and move data around in an online Voldemort cluster. </li></ul></ul><ul><li>Features </li></ul><ul><ul><li>No downtime </li></ul></ul><ul><ul><li>Transparent to the client </li></ul></ul><ul><ul><li>Maintains data consistency guarantees </li></ul></ul><ul><ul><li>Push-button user interface </li></ul></ul>
  14. Administration <ul><li>Monitoring </li></ul><ul><ul><li>View statistics (how many queries are made? How long are they taking?) </li></ul></ul><ul><ul><li>Perform diagnostic operations </li></ul></ul><ul><li>Administrative functionality </li></ul><ul><ul><li>Functionality which is needed, but shouldn’t be performed by regular store clients, for example: </li></ul></ul><ul><ul><ul><li>Ability to update and retrieve cluster/store metadata </li></ul></ul></ul><ul><ul><ul><li>Efficient streaming of keys and values. </li></ul></ul></ul><ul><ul><ul><li>Delete entries in bulk </li></ul></ul></ul><ul><ul><ul><li>Truncate an entire store </li></ul></ul></ul><ul><ul><ul><li>Restore a node’s data from replicas </li></ul></ul></ul>
  15. Present day <ul><li>In Production use </li></ul><ul><ul><li>At LinkedIn </li></ul></ul><ul><ul><ul><li>Multiple clusters </li></ul></ul></ul><ul><ul><ul><li>Variety of customers </li></ul></ul></ul><ul><ul><li>Outside of LinkedIn </li></ul></ul><ul><ul><ul><li>Gilt Group, KaChing, others </li></ul></ul></ul><ul><li>Active developer community, inside and outside LinkedIn </li></ul><ul><li>Monthly release cycle </li></ul><ul><ul><li>Continuous testing environment. </li></ul></ul><ul><ul><li>Daily performance tests. </li></ul></ul>
  16. Performance <ul><li>LinkedIn cluster: web event tracking logs and online lookups </li></ul><ul><ul><li>6 nodes, 400 GB of data, 12 clients </li></ul></ul><ul><ul><li>Mixed load (67% GET, 33% PUT) </li></ul></ul><ul><li>Throughput </li></ul><ul><ul><li>1433 QPS (per node) </li></ul></ul><ul><ul><li>4299 QPS (cluster) </li></ul></ul><ul><li>Latency </li></ul><ul><ul><li>GET </li></ul></ul><ul><ul><ul><li>50th percentile: 0.05 ms </li></ul></ul></ul><ul><ul><ul><li>95th percentile: 36.07 ms </li></ul></ul></ul><ul><ul><ul><li>99th percentile: 60.65 ms </li></ul></ul></ul><ul><ul><li>PUT </li></ul></ul><ul><ul><ul><li>50th percentile: 0.09 ms </li></ul></ul></ul><ul><ul><ul><li>95th percentile: 0.41 ms </li></ul></ul></ul><ul><ul><ul><li>99th percentile: 1.22 ms </li></ul></ul></ul>
  17. Hadoop @ LinkedIn
  18. Batch Computing at LinkedIn <ul><li>Some questions we want to answer </li></ul><ul><ul><li>What do we use Hadoop for? </li></ul></ul><ul><ul><li>How do we store data? </li></ul></ul><ul><ul><li>How do we manage workflows? </li></ul></ul><ul><ul><li>How do we do ETL? </li></ul></ul><ul><ul><li>How do we prototype ideas? </li></ul></ul>
  19. What do we use Hadoop for?
  20. How do we store data? <ul><li>Compact, compressed, binary data (something like Avro) </li></ul><ul><ul><li>Type can be any combination of int, double, float, String, Map, List, Date, etc. </li></ul></ul><ul><ul><li>Example member definition: </li></ul></ul><ul><ul><li>{ </li></ul></ul><ul><li> 'member_id': 'int32', </li></ul><ul><li> 'first_name': 'string', </li></ul><ul><li> 'last_name': 'string', </li></ul><ul><li> 'age': 'int32', </li></ul><ul><li>… </li></ul><ul><li>} </li></ul><ul><li>Data is stored in Hadoop as sequence files, serialized with this format </li></ul><ul><li>The schema of the data is saved in the sequence files as metadata </li></ul><ul><ul><li>The schema is read dynamically by Java/Pig jobs on the fly. </li></ul></ul>
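A minimal sketch of what a self-describing binary record could look like, mirroring the member schema above. This is not LinkedIn's actual serializer; the field encodings (4-byte big-endian int32, length-prefixed UTF-8 strings) and function names are assumptions chosen for illustration. The key idea is that the schema travels with the data, so readers decode dynamically without compiled schema classes.

```python
import struct

# Schema taken from the member definition above.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

def serialize(record, schema):
    """Pack fields in schema order: int32 as 4-byte big-endian,
    strings as length-prefixed UTF-8 bytes."""
    out = bytearray()
    for field, ftype in schema.items():
        value = record[field]
        if ftype == "int32":
            out += struct.pack(">i", value)
        elif ftype == "string":
            data = value.encode("utf-8")
            out += struct.pack(">i", len(data)) + data
    return bytes(out)

def deserialize(blob, schema):
    """Walk the blob using the schema (read from file metadata in practice)."""
    record, offset = {}, 0
    for field, ftype in schema.items():
        if ftype == "int32":
            (record[field],) = struct.unpack_from(">i", blob, offset)
            offset += 4
        elif ftype == "string":
            (length,) = struct.unpack_from(">i", blob, offset)
            offset += 4
            record[field] = blob[offset:offset + length].decode("utf-8")
            offset += length
    return record
```

A sequence-file reader would pull the schema from the file's metadata and call `deserialize` on each value.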
  21. How do we manage workflows? <ul><li>We wrote a workflow management tool (code name: Azkaban ) </li></ul><ul><ul><li>Dependency management (Hadoop, ETL, Java, Unix jobs) </li></ul></ul><ul><ul><ul><li>Maintains a directed acyclic graph of dependencies </li></ul></ul></ul><ul><ul><ul><li>All dependencies must complete before the job itself can run </li></ul></ul></ul><ul><ul><ul><li>If a dependency fails, the job itself fails </li></ul></ul></ul><ul><ul><li>Scheduling </li></ul></ul><ul><ul><ul><li>Workflows can be scheduled to run on a repeating schedule. </li></ul></ul></ul><ul><ul><li>Configuration system (simple properties files) </li></ul></ul><ul><ul><li>GUI for visualizing and controlling jobs </li></ul></ul><ul><ul><ul><li>Historical logging and job success data retained for each run </li></ul></ul></ul><ul><ul><li>Alerting on failures </li></ul></ul><ul><li>Will be open sourced soon ( APL ) !! </li></ul>
  22. Introducing Azkaban
  23. How do we do ETL? Getting data in <ul><li>Two kinds of data </li></ul><ul><li>From databases (user data, news, jobs, etc.) </li></ul><ul><ul><li>Need a way to get data reliably and periodically </li></ul></ul><ul><ul><li>Need tests to verify data </li></ul></ul><ul><ul><li>Support for incremental replication </li></ul></ul><ul><ul><li>Our solution: Transmogrify, a driver program which accepts an inputReader and an outputWriter </li></ul></ul><ul><ul><ul><li>InputReader: JDBCReader, CSV Reader </li></ul></ul></ul><ul><ul><ul><li>Output Writer: JDBCWriter, HDFS writers </li></ul></ul></ul><ul><li>From web logs (page views, search, clicks, etc.) </li></ul><ul><ul><ul><li>Weblog files are rsynced and loaded into HDFS </li></ul></ul></ul><ul><ul><ul><li>Hadoop jobs for data cleaning and transformation. </li></ul></ul></ul>
  24. ETL II: Getting data out <ul><li>Batch jobs generate output in the 100s of GBs </li></ul><ul><li>How do we finally serve this data to users? Some constraints: </li></ul><ul><ul><li>Want to show data to users ASAP </li></ul></ul><ul><ul><li>Should not impact online serving </li></ul></ul><ul><ul><li>Should have quick rollback capabilities. </li></ul></ul><ul><ul><li>Should be horizontally scalable </li></ul></ul><ul><ul><li>Should be fault tolerant (replication) </li></ul></ul><ul><ul><li>High throughput, low latency </li></ul></ul>
  25. ETL II: Getting Data Out: Existing Solutions <ul><li>JDBC upload to Oracle </li></ul><ul><ul><li>Very slow (3 days to push 20 GB of data) </li></ul></ul><ul><ul><li>Long-running transactions cause issues. </li></ul></ul><ul><li>“Load Data” statement in MySQL </li></ul><ul><ul><li>Tried many tuning settings </li></ul></ul><ul><ul><li>Was still very slow </li></ul></ul><ul><ul><li>Almost unresponsive for online serving </li></ul></ul><ul><li>Memcached </li></ul><ul><ul><li>Need to have everything in memory </li></ul></ul><ul><ul><li>No support for batch inserts </li></ul></ul><ul><ul><li>Need to repush if a server dies. </li></ul></ul><ul><li>Oracle SQL*Loader </li></ul><ul><ul><li>The whole process took 2-3 days </li></ul></ul><ul><ul><li>20-24 hours of actual data loading time </li></ul></ul><ul><ul><li>Needed someone to babysit the process </li></ul></ul><ul><ul><li>What if we want to do this daily? </li></ul></ul><ul><li>HBase (didn’t try) </li></ul>
  26. ETL II: Getting Data Out: Our Solution <ul><li>Index build runs 100% in Hadoop </li></ul><ul><li>MapReduce job outputs Voldemort stores to HDFS </li></ul><ul><li>Job control initiates a fetch request to the Voldemort nodes. </li></ul><ul><li>Voldemort nodes copy data from Hadoop in parallel </li></ul><ul><li>Atomic swap to make the data live </li></ul><ul><li>Heavily optimized storage engine for read-only data </li></ul><ul><li>I/O throttling on the transfer to protect the live servers </li></ul><ul><li>Our solution </li></ul><ul><ul><li>Wrote a special read-only storage engine for Voldemort </li></ul></ul><ul><ul><li>Data is built on Hadoop and copied to Voldemort </li></ul></ul>
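The atomic-swap step above is the piece that makes the pipeline safe to run daily: readers see either the old store version or the new one, never a partial copy, and rollback is just pointing back at the previous version. A minimal sketch of one common way to implement such a swap (a symlink flip via rename on POSIX); the paths and function name are illustrative, not Voldemort's actual layout:

```python
import os

def swap_in(store_dir, new_version_dir):
    """Point store_dir/current at new_version_dir atomically."""
    tmp_link = os.path.join(store_dir, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_version_dir, tmp_link)
    # rename over the old link is atomic on POSIX, so readers observe
    # either the old version or the new one, never a half-swapped state
    os.replace(tmp_link, os.path.join(store_dir, "current"))
```

Keeping the previous version's directory around (as the slides note) makes rollback a second, equally cheap swap.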
  27. Voldemort Read-only store: version I <ul><li>Simple file-based storage engine </li></ul><ul><ul><li>Two files: a key file and a value file </li></ul></ul><ul><ul><li>The key file holds sorted MD5 key hashes, each with the offset of the corresponding value in the value file. </li></ul></ul><ul><ul><li>The value file stores each value as size, data </li></ul></ul><ul><li>Advantages </li></ul><ul><ul><li>Index is built on Hadoop, no load on production servers </li></ul></ul><ul><ul><li>Files are copied in parallel to the Voldemort cluster </li></ul></ul><ul><ul><li>Supports rollback by keeping multiple copies </li></ul></ul>
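The version I layout can be sketched in memory: a sorted index of (MD5 hash, offset) pairs standing in for the key file, a length-prefixed blob standing in for the value file, and binary search for lookups. This is an illustrative sketch under those assumptions, not the actual file format; function names are invented.

```python
import bisect
import hashlib
import struct

def build_store(items):
    """Build (index, value_blob) from a dict, as a Hadoop reducer might.
    index: sorted list of (md5_digest, offset); value_blob: size-prefixed values."""
    value_blob = bytearray()
    index = []
    for key, value in items.items():
        digest = hashlib.md5(key.encode("utf-8")).digest()
        offset = len(value_blob)
        data = value.encode("utf-8")
        value_blob += struct.pack(">i", len(data)) + data
        index.append((digest, offset))
    index.sort()  # sorted by MD5 hash, which is what enables binary search
    return index, bytes(value_blob)

def lookup(index, value_blob, key):
    """Binary-search the key index, then read size,data at the offset."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    pos = bisect.bisect_left(index, (digest, 0))
    if pos == len(index) or index[pos][0] != digest:
        return None
    offset = index[pos][1]
    (length,) = struct.unpack_from(">i", value_blob, offset)
    return value_blob[offset + 4:offset + 4 + length].decode("utf-8")
```

Because the index is fully sorted at build time on Hadoop, the serving node does no write-path work at all.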
  28. Voldemort Read-only store: version II <ul><li>The version I read-only stores have a few issues </li></ul><ul><ul><li>Only one reducer per node </li></ul></ul><ul><ul><li>Binary search can potentially take 32 steps. </li></ul></ul><ul><li>Version II format </li></ul><ul><ul><li>Make multiple key-file/value-file pairs (multiple reducers per node) </li></ul></ul><ul><ul><li>Mmap all key files. </li></ul></ul><ul><ul><li>Use interpolation binary search </li></ul></ul><ul><ul><ul><li>Keys are MD5 hashes and very uniformly distributed </li></ul></ul></ul><ul><ul><ul><li>While searching, do a predicted binary search </li></ul></ul></ul><ul><ul><ul><li>Much faster performance </li></ul></ul></ul><ul><li>Future work </li></ul><ul><ul><li>Group values by frequency, so that frequent values are in the operating system cache </li></ul></ul><ul><ul><li>Transfer only the delta </li></ul></ul>
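A sketch of the interpolation ("predicted") search idea above, over sorted keys treated as integers. Because MD5 hashes are near-uniformly distributed, the predicted probe position lands close to the target, typically cutting the step count well below the ~32 of plain binary search on a large key file. This is a generic textbook version, not Voldemort's implementation.

```python
def interpolation_search(keys, target):
    """Search a sorted list of ints; return the index of target or -1.
    Probes where the target *should* be under a uniform distribution,
    then narrows like binary search."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            mid = lo
        else:
            # predict position by linear interpolation between endpoints
            mid = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

On uniform data the expected cost is O(log log n) probes versus O(log n) for binary search, which matters when every probe can be a page fault into an mmap'd key file.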
  29. Performance <ul><li>Three performance numbers matter here </li></ul><ul><ul><li>Hadoop time to build read-only store indexes </li></ul></ul><ul><ul><ul><li>10 mins to build 20 GB of data </li></ul></ul></ul><ul><ul><li>File transfer time </li></ul></ul><ul><ul><ul><li>Limited only by network and disk throughput </li></ul></ul></ul><ul><ul><li>Online serving performance </li></ul></ul><ul><ul><ul><li>Read-only store is fast ( 1 ms – 5 ms ) </li></ul></ul></ul><ul><ul><ul><li>Operating system page cache helps a lot </li></ul></ul></ul><ul><ul><ul><li>Very predictable performance based on </li></ul></ul></ul><ul><ul><ul><ul><li>Data size, disk speeds, RAM, cache hit ratios </li></ul></ul></ul></ul>
  30. Batch Computing at LinkedIn
  31. Infrastructure At LinkedIn <ul><li>Last year: </li></ul><ul><ul><li>QA lab: 20 machines, cheap Dell Linux servers </li></ul></ul><ul><ul><li>Production: 20 heavier machines </li></ul></ul><ul><li>The “QA” cluster is for dev, analysis, and reporting uses </li></ul><ul><ul><li>Pig and Hadoop Streaming for prototyping ideas </li></ul></ul><ul><ul><li>Ad hoc jobs compete with scheduled jobs </li></ul></ul><ul><ul><li>Tried different Hadoop schedulers </li></ul></ul><ul><li>Production is for jobs that produce user-facing data </li></ul><ul><li>Hired Allen Wittenauer as our Hadoop architect in Sep 2009 </li></ul><ul><ul><li>100 Hadoop machines </li></ul></ul>
  32. References <ul><li>Amazon Dynamo paper </li></ul><ul><li>NoSQL presentations (2009) </li></ul><ul><li>Voldemort presentation by Jay Kreps </li></ul>
  33. The End
  34. Core Concepts
  35. Core Concepts - I <ul><li>ACID </li></ul><ul><ul><li>Great for a single centralized server. </li></ul></ul><ul><li>CAP Theorem </li></ul><ul><ul><li>Consistency (strict), Availability, Partition tolerance </li></ul></ul><ul><ul><li>Impossible to achieve all three at the same time in a distributed system </li></ul></ul><ul><ul><li>Can choose 2 out of 3 </li></ul></ul><ul><ul><li>Dynamo chooses high availability and partition tolerance </li></ul></ul><ul><ul><ul><li>by relaxing strict consistency to eventual consistency </li></ul></ul></ul><ul><li>Consistency Models </li></ul><ul><ul><li>Strict consistency </li></ul></ul><ul><ul><ul><li>2-phase commits </li></ul></ul></ul><ul><ul><ul><li>PAXOS: a distributed algorithm to ensure quorum for consistency </li></ul></ul></ul><ul><ul><li>Eventual consistency </li></ul></ul><ul><ul><ul><li>Different nodes can have different views of a value </li></ul></ul></ul><ul><ul><ul><li>In steady state, the system will return the last written value. </li></ul></ul></ul><ul><ul><ul><li>But can offer much stronger guarantees in practice. </li></ul></ul></ul>
  36. Core Concept - II <ul><li>Consistent Hashing </li></ul><ul><li>Key space is partitioned </li></ul><ul><ul><li>Many small partitions </li></ul></ul><ul><li>Partitions never change </li></ul><ul><ul><li>Partition ownership can change </li></ul></ul><ul><li>Replication </li></ul><ul><ul><li>Each partition is stored on ‘N’ nodes </li></ul></ul><ul><li>Node Failures </li></ul><ul><ul><li>Transient (short term) </li></ul></ul><ul><ul><li>Long term </li></ul></ul><ul><ul><ul><li>Needs faster bootstrapping </li></ul></ul></ul>
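The fixed-partition scheme above can be sketched as follows: keys hash into a fixed number of small partitions, and only the partition-to-node ownership map changes when the cluster changes. The partition count, node count, and round-robin layout below are illustrative assumptions, not LinkedIn's configuration.

```python
import hashlib

NUM_PARTITIONS = 16
# partition id -> owning node; here simply assigned round-robin over 4 nodes
OWNERSHIP = {p: p % 4 for p in range(NUM_PARTITIONS)}

def partition_for(key):
    """Hash a key into one of the fixed partitions (these never change)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def replica_nodes(key, n=2):
    """Walk the partition ring from the key's partition, collecting the
    first n distinct owning nodes (the replication factor N)."""
    nodes = []
    p = partition_for(key)
    while len(nodes) < n:
        owner = OWNERSHIP[p]
        if owner not in nodes:
            nodes.append(owner)
        p = (p + 1) % NUM_PARTITIONS
    return nodes
```

Rebalancing then amounts to rewriting entries of `OWNERSHIP` and moving those partitions' data, without rehashing any keys.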
  37. Core Concept - III <ul><li>N - The replication factor </li></ul><ul><li>R - The number of blocking reads </li></ul><ul><li>W - The number of blocking writes </li></ul><ul><li>If R+W > N </li></ul><ul><ul><li>then we have a quorum-like algorithm </li></ul></ul><ul><ul><li>Guarantees that we will read the latest write OR fail </li></ul></ul><ul><li>R, W, N can be tuned for different use cases </li></ul><ul><ul><li>W = 1, Highly available writes </li></ul></ul><ul><ul><li>R = 1, Read intensive workloads </li></ul></ul><ul><ul><li>Knobs to tune performance, durability and availability </li></ul></ul>
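The R + W > N rule is a pigeonhole argument: any set of W replicas that acknowledged the write must intersect any set of R replicas consulted on a read. A small sketch of the arithmetic (function names are mine, for illustration):

```python
def is_quorum(n, r, w):
    """True when any R-node read set must intersect any W-node write set."""
    return r + w > n

def guaranteed_overlap(n, r, w):
    """Minimum number of replicas common to a worst-case read set and
    write set: |R| + |W| - |N| by inclusion-exclusion, floored at 0."""
    return max(0, r + w - n)
```

With N=3, R=2, W=2 the overlap is at least 1, so every read touches at least one replica holding the latest write; with R=1, W=1 the overlap can be 0, trading that guarantee for lower latency and higher availability.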
  38. Core Concepts - IV <ul><li>Vector clocks [Lamport] provide a way to order events in a distributed system. </li></ul><ul><li>A vector clock is a tuple {t1 , t2 , ..., tn } of counters. </li></ul><ul><li>Each value update has a master node </li></ul><ul><ul><li>When data is written with master node i, it increments ti. </li></ul></ul><ul><ul><li>All the replicas will receive the same version </li></ul></ul><ul><ul><li>Helps resolve conflicts between writes on multiple replicas </li></ul></ul><ul><li>If you get network partitions </li></ul><ul><ul><li>You can have a case where two vector clocks are not comparable. </li></ul></ul><ul><ul><li>In this case Voldemort returns both values to clients for conflict resolution </li></ul></ul>
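The comparison rule above can be made concrete: a clock is a per-node counter map, and clock A "happened before" clock B iff every counter in A is less than or equal to the corresponding counter in B, with at least one strictly less. When neither dominates, the writes are concurrent and (as the slide says) both values go back to the client. A minimal sketch:

```python
def increment(clock, node):
    """Return a new clock with the given master node's counter bumped."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two clocks
    (dicts mapping node -> counter; missing entries count as 0)."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # not comparable: both values returned to the client
```

The "concurrent" branch is exactly the network-partition case on the slide: two writes mastered on different nodes produce incomparable clocks.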
  39. Implementation
  40. Voldemort Design
  41. Client API <ul><li>Data is organized into “stores”, i.e. tables </li></ul><ul><li>Key-value only </li></ul><ul><ul><li>But values can be arbitrarily rich or complex </li></ul></ul><ul><ul><ul><li>Maps, lists, nested combinations … </li></ul></ul></ul><ul><li>Four operations </li></ul><ul><ul><li>PUT (Key K, Value V) </li></ul></ul><ul><ul><li>GET (Key K) </li></ul></ul><ul><ul><li>MULTI-GET (Iterator<Key> K), </li></ul></ul><ul><ul><li>DELETE (Key K) / (Key K , Version ver) </li></ul></ul><ul><ul><li>No Range Scans </li></ul></ul>
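The four operations above fit in a few lines against an in-memory map. This is only a shape sketch of the client API (method names follow the slide; the real client also threads versions through GET and PUT for conflict resolution, omitted here):

```python
class Store:
    """Toy in-memory stand-in for one Voldemort "store" (i.e. table)."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_get(self, keys):
        """Batch lookup; absent keys are simply omitted from the result."""
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key):
        self._data.pop(key, None)
```

Note what is absent, matching the slide: no range scans, and no secondary indexes; everything is addressed by key.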
  42. Versioning & Conflict Resolution <ul><li>Eventual consistency allows multiple versions of a value </li></ul><ul><ul><li>Need a way to tell which value is latest </li></ul></ul><ul><ul><li>Need a way to say values are not comparable </li></ul></ul><ul><li>Solutions </li></ul><ul><ul><li>Timestamps </li></ul></ul><ul><ul><li>Vector clocks </li></ul></ul><ul><ul><ul><li>Provide a partial ordering of writes. </li></ul></ul></ul><ul><ul><ul><li>No locking or blocking necessary </li></ul></ul></ul>
  43. Serialization <ul><li>Really important </li></ul><ul><ul><li>A few considerations </li></ul></ul><ul><ul><ul><li>Schema free? </li></ul></ul></ul><ul><ul><ul><li>Backward/forward compatible? </li></ul></ul></ul><ul><ul><ul><li>Real-life data structures </li></ul></ul></ul><ul><ul><ul><li>Bytes <=> objects <=> strings? </li></ul></ul></ul><ul><ul><ul><li>Size (No XML) </li></ul></ul></ul><ul><li>Many ways to do it -- we allow anything </li></ul><ul><ul><li>Compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization </li></ul></ul>
  44. Routing <ul><li>The routing layer hides a lot of complexity </li></ul><ul><ul><li>Hashing scheme </li></ul></ul><ul><ul><li>Replication (N, R, W) </li></ul></ul><ul><ul><li>Failures </li></ul></ul><ul><ul><li>Read repair (online repair mechanism) </li></ul></ul><ul><ul><li>Hinted handoff (long-term recovery mechanism) </li></ul></ul><ul><li>Easy to add domain-specific strategies </li></ul><ul><ul><li>E.g. only do synchronous operations on nodes in the local data center </li></ul></ul><ul><li>Client side / server side / hybrid </li></ul>
  45. Voldemort Physical Deployment
  46. Routing With Failures <ul><li>Failure Detection </li></ul><ul><ul><li>Requirements </li></ul></ul><ul><ul><ul><li>Needs to be very fast </li></ul></ul></ul><ul><ul><ul><li>View of server state may be inconsistent </li></ul></ul></ul><ul><ul><ul><ul><ul><li>A can talk to B but C cannot </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>A can talk to C; B can talk to A but not to C </li></ul></ul></ul></ul></ul><ul><ul><li>Currently done by the routing layer (request timeouts) </li></ul></ul><ul><ul><ul><li>Periodically retries failed nodes. </li></ul></ul></ul><ul><ul><ul><li>All requests must have hard SLAs </li></ul></ul></ul><ul><ul><li>Other possible solutions </li></ul></ul><ul><ul><ul><li>Central server </li></ul></ul></ul><ul><ul><ul><li>Gossip protocol </li></ul></ul></ul><ul><ul><ul><li>Need to look into this more. </li></ul></ul></ul>
  47. Repair Mechanism <ul><li>Read Repair </li></ul><ul><ul><li>Online repair mechanism </li></ul></ul><ul><ul><ul><li>Routing client receives values from multiple nodes </li></ul></ul></ul><ul><ul><ul><li>Notify a node if you see an old value </li></ul></ul></ul><ul><ul><ul><li>Only works for keys which are read after failures </li></ul></ul></ul><ul><li>Hinted Handoff </li></ul><ul><ul><li>If a write fails, write it to any random node </li></ul></ul><ul><ul><li>Just mark the write as a special write </li></ul></ul><ul><ul><li>Each node periodically tries to get rid of all special entries </li></ul></ul><ul><li>Bootstrapping mechanism (we don’t have it yet) </li></ul><ul><ul><li>If a node was down for a long time </li></ul></ul><ul><ul><ul><li>Hinted handoff can generate a ton of traffic </li></ul></ul></ul><ul><ul><ul><li>Need a better way to bootstrap and clear the hinted handoff tables </li></ul></ul></ul>
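Read repair, as described above, can be sketched as: the routing client gathers versioned values from the replicas, returns the newest, and writes it back to any replica holding a stale copy. This sketch uses plain integer versions for brevity (Voldemort uses vector clocks, where incomparable versions would instead both be returned); names are illustrative.

```python
def read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (version, value).
    Returns the newest value and repairs stale replicas in place."""
    responses = [(replica[key], replica) for replica in replicas if key in replica]
    if not responses:
        return None
    latest, _ = max(responses, key=lambda pair: pair[0][0])
    for (version, _value), replica in responses:
        if version < latest[0]:
            replica[key] = latest  # notify/overwrite the stale replica
    return latest[1]
```

As the slide notes, this only heals keys that actually get read after a failure; hinted handoff and bootstrapping cover the rest.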
  48. Network Layer <ul><li>Network is the major bottleneck in many uses </li></ul><ul><li>Client performance turns out to be harder than server performance (the client must wait!) </li></ul><ul><ul><li>Lots of issues with socket buffer sizes and socket pools </li></ul></ul><ul><li>The server is also a client </li></ul><ul><li>Two implementations </li></ul><ul><ul><li>HTTP + servlet container </li></ul></ul><ul><ul><li>Simple socket protocol + custom server </li></ul></ul><ul><li>The HTTP server is great, but the HTTP client is 5-10X slower </li></ul><ul><li>The socket protocol is what we use in production </li></ul><ul><li>Recently added a non-blocking version of the server </li></ul>
  49. Persistence <ul><li>Single-machine key-value storage is a commodity </li></ul><ul><li>Plugins are better than tying yourself to a single strategy </li></ul><ul><ul><li>Different use cases </li></ul></ul><ul><ul><ul><li>Optimize reads </li></ul></ul></ul><ul><ul><ul><li>Optimize writes </li></ul></ul></ul><ul><ul><ul><li>Large vs. small values </li></ul></ul></ul><ul><ul><li>SSDs may completely change this layer </li></ul></ul><ul><li>A couple of different options </li></ul><ul><ul><li>BDB, MySQL and mmap’d file implementations </li></ul></ul><ul><ul><li>Berkeley DB is the most popular </li></ul></ul><ul><ul><li>In-memory plugin for testing </li></ul></ul><ul><li>Btrees are still the best all-purpose structure </li></ul><ul><li>No flush on write is a huge, huge win </li></ul>