
Big Data DC - NoSQL at LucidMedia



Describes the process of selecting a NoSQL product for use as part of LucidMedia's ad serving platform. Details pros/cons of several products and tips for general use.

Published in: Technology


  1. NoSQL at LucidMedia
     Nick Kleinschmidt (@kleinsch)
  2. Overview
     • Who is LucidMedia?
     • What is NoSQL?
     • Major NoSQL Products
     • Performance Results
     • Pro Tips
     • Questions
  3. LucidMedia
     • Online Display Advertising Network
     • Over 1.5B impressions/day
     • Based in Reston, VA
     • Hiring engineers!
  4. Real Time Bidding
  5. The Use Case
     • Server-side user database (cookie store)
     • Hundreds of millions of users
     • Fast access - 5-10ms
     • Cloud hardware
  6. What is NoSQL?
     • Data storage tools created in reaction to common web scaling problems with relational databases
     • Widely differing purposes and feature sets
  7. Problem: Scaling Writes
     • Relational databases scale vertically - all records must be on the same machine
     • Solution - distribute data across machines, scaling horizontally
     • This solves the scaling problem, but makes joins, grouping, transactions difficult
  8. Problem: High Latency
     • Database is usually biggest contributor to server-side application latency
     • memcached pioneered low latency key-value store
     • Solution - compromise functionality for speed
     • Usually sacrifice transactions, advanced query types
  9. Problem: Inflexible Schemas
     • Relational databases require schema to be defined ahead of time
     • Flexible schemas give developers more options - handle upgrades in code instead of writing SQL
     • Storing custom formats can save lots of space for records with sparse fields
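The "handle upgrades in code" point on slide 9 can be illustrated with a minimal sketch. It assumes records are stored as JSON strings with an optional `version` field; the field names (`segment`, `segments`, `user_id`) and the two-version layout are hypothetical and are not taken from LucidMedia's actual data model.

```python
import json

CURRENT_VERSION = 2  # hypothetical current record layout


def upgrade(record):
    """Bring a stored record up to the current layout, one version at a time."""
    record = dict(record)  # copy so the caller's dict is not mutated
    version = record.get("version", 1)  # records without a version are treated as v1
    if version == 1:
        # Hypothetical change: v1 stored one "segment" string, v2 stores a "segments" list.
        segment = record.pop("segment", None)
        record["segments"] = [segment] if segment else []
        version = 2
    record["version"] = version
    return record


def load(raw):
    """Decode a stored value and upgrade it on read."""
    return upgrade(json.loads(raw))


old_raw = json.dumps({"user_id": "abc", "segment": "sports"})
new_raw = json.dumps({"version": 2, "user_id": "def", "segments": ["autos", "travel"]})

old_record = load(old_raw)
new_record = load(new_raw)
```

Old records are rewritten lazily as they are read, so no bulk migration is needed; the trade-off is that the upgrade path has to stay in the code for as long as old records may exist.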
  10. General NoSQL Features
      • Storage Format / Operations
      • Memory / Disk Utilization
      • Atomic Operations
      • Auto-Sharding - Partitioning data across servers, scales reads and writes
      • Replication - Copying data between servers, scales reads
  11. Types of Products
      • Key-Value: memcached, Redis, BerkeleyDB, HBase, Cassandra, Amazon SimpleDB
      • Document: MongoDB, CouchDB
      • Graph: FlockDB, Neo4j
  12. Evaluation - LucidMedia
      • Query latency is priority #1
      • Disk access is suspect, since we’re in the cloud
      • Transactions not necessary - it’s OK to be briefly inconsistent or even lose a few updates
      • Replication and auto-sharding are nice, but also can be done manually
  13. Products Evaluated
      • memcached
          Type: Key-Value
          Data Storage: LRU cache mapping string to binary data
          Complex Operations: Check and set (CAS)
          Storage Profile: All in memory
          Scalability Features: None
          License: Open Source (BSD)
          Used By: Facebook, Twitter, YouTube
      • MongoDB
          Type: Document Store
          Data Storage: BSON objects (binary format similar to JSON)
          Complex Operations: Indexing on multiple fields, MapReduce, atomic operations (single object)
          Storage Profile: Disk and memory
          Scalability Features: Auto-Sharding, Replication
          License: Commercial, AGPL
          Used By: FourSquare, ShutterFly
      • Cassandra
          Type: Key-Value (Column Store)
          Data Storage: Column family store - similar to BigTable, multiple data types for columns
          Complex Operations: Tunable consistency, atomic operations (row level)
          Storage Profile: Disk and memory
          Scalability Features: Clustered, Replication
          License: Open Source (Apache)
          Used By: Facebook Inbox Search, Digg, Twitter
      • Redis
          Type: Key-Value
          Data Storage: Simple key-value, supports list, set, sorted set, hash
          Complex Operations: Many atomic operations (single key)
          Storage Profile: All in memory, saved to disk for persistence
          Scalability Features: Replication, Cluster (unreleased) will provide auto-sharding
          License: Open Source (BSD)
          Used By: GitHub, Digg, LucidMedia
  14. Findings
      • memcached
          Pros: Fast, widely used, great for caching
          Cons: We need more than a cache, MemcacheDB didn’t seem widely used at the time
          Using It? Yes (for other things)
      • MongoDB
          Pros: Great data model and feature set, strong commercial support
          Cons: Early versions had performance issues
          Using It? No
      • Cassandra
          Pros: Great for storing and searching huge amounts of data
          Cons: Not optimized for our problem, so performance didn’t fit our needs
          Using It? No
      • Redis
          Pros: Lightning fast, very active development, useful feature set
          Cons: No auto-sharding (yet), memory footprint (per key) is a little high
          Using It? Yes
  15. Performance - GET
      [Chart: throughput (reqs/sec) vs. concurrency (threads) for MySQL (InnoDB), Memcached, and Redis]
  16. Performance - SET
      [Chart: throughput (reqs/sec) vs. concurrency (threads) for MySQL (InnoDB), Memcached, and Redis]
  17. Performance Testing
      • Use real application data
      • Approximate real conditions - run against your web servers, not a simple test program
      • Averages hide important details - use percentiles to measure latency
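Slide 17's point that averages hide important details can be shown with a small sketch. The latency samples below are made up for illustration (98 fast requests and 2 slow ones) and are not measurements from the talk; `percentile` is a plain nearest-rank implementation, not a function from any particular benchmarking tool.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [2.0] * 98 + [250.0] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

With this data the mean sits inside a 5-10ms budget while the 99th percentile is 250ms, which is the kind of detail an average alone would not reveal.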
  18. Drivers
      • Huge performance difference between drivers for the same language
      • Use asynchronous driver when possible to parallelize requests
      [Chart: latency (ms) vs. concurrency (threads) for the Whalin and SpyMemcached drivers]
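The advice on slide 18 to parallelize requests with an asynchronous driver can be sketched as follows. This is not the Whalin or SpyMemcached API: `fetch` is a hypothetical stand-in that simulates a network round trip with `asyncio.sleep`, and `FAKE_STORE` is an in-memory dict used only for illustration.

```python
import asyncio

# Hypothetical in-memory stand-in for a remote key-value store.
FAKE_STORE = {"user:1": "segments=a,b", "user:2": "segments=c", "user:3": "segments=d"}


async def fetch(key, delay_s=0.01):
    """Simulate one lookup with a fixed round-trip delay; returns None for missing keys."""
    await asyncio.sleep(delay_s)
    return FAKE_STORE.get(key)


async def fetch_sequential(keys):
    """Issue lookups one after another: total time is roughly len(keys) * delay."""
    return [await fetch(key) for key in keys]


async def fetch_parallel(keys):
    """Issue all lookups at once: total time is roughly one delay."""
    return list(await asyncio.gather(*(fetch(key) for key in keys)))


keys = ["user:1", "user:2", "user:3", "user:404"]
sequential = asyncio.run(fetch_sequential(keys))
parallel = asyncio.run(fetch_parallel(keys))
```

Both functions return the same values in the same order; the difference is that the parallel version overlaps the waits, which is where an asynchronous driver would be expected to help when one page view needs several lookups.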
  19. Sharding
      • Split into a large number of shards initially, since you’re going to reshard eventually
      • Automate shard management processes
      • Measure performance and utilization metrics in production to predict scaling needs
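One way to read slide 19's advice to start with a large number of shards is to hash keys onto many logical shards and keep a separate map from logical shard to server. The sketch below assumes that approach; the shard count, server names, and helper functions are hypothetical, and the talk does not say this is how LucidMedia implemented it.

```python
import zlib

NUM_SHARDS = 1024  # hypothetical: many more logical shards than servers


def shard_for(key):
    """Stable key -> logical shard mapping (CRC32, so it is identical across processes)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def build_shard_map(servers):
    """Assign logical shards to servers round-robin."""
    return {shard: servers[shard % len(servers)] for shard in range(NUM_SHARDS)}


def server_for(key, shard_map):
    """Look up which server currently owns the key's logical shard."""
    return shard_map[shard_for(key)]


def rebalance(shard_map, new_server, shards_to_move):
    """Return a new map with the given logical shards reassigned to new_server."""
    updated = dict(shard_map)
    for shard in shards_to_move:
        updated[shard] = new_server
    return updated


old_map = build_shard_map(["redis-a", "redis-b"])
# Move every third logical shard to a new server; all other shards stay where they are.
new_map = rebalance(old_map, "redis-c", range(0, NUM_SHARDS, 3))
```

Because the key-to-shard function never changes, adding a server only means copying the moved shards' data and updating the map; keys in untouched shards keep resolving to the same server. The data copy itself is not shown here.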
  20. Questions?
      • @kleinsch