Big Data DC - NoSQL at LucidMedia
 

Describes the process of selecting a NoSQL product for use as part of LucidMedia's ad serving platform. Details pros/cons of several products and tips for general use.

Presentation Transcript

  • NoSQL at LucidMedia
      • Nick Kleinschmidt
      • @kleinsch
      • nkleinsch@lucidmedia.com
  • Overview
      • Who is LucidMedia?
      • What is NoSQL?
      • Major NoSQL Products
      • Performance Results
      • Pro Tips
      • Questions
  • LucidMedia
      • Online Display Advertising Network
      • Over 1.5B impressions/day
      • Based in Reston, VA
      • Hiring engineers!
  • Real Time Bidding
  • The Use Case
      • Server-side user database (cookie store)
      • Hundreds of millions of users
      • Fast access - 5-10ms
      • Cloud hardware
  • What is NoSQL?
      • Data storage tools created in reaction to common web scaling problems with relational databases
      • Widely differing purposes and feature sets
  • Problem: Scaling Writes
      • Relational databases scale vertically - all records must be on the same machine
      • Solution - distribute data across machines, scaling horizontally
      • This solves the scaling problem, but makes joins, grouping, and transactions difficult
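To make "distribute data across machines" concrete, here is a minimal sketch of hash-based key routing. It is an illustration only, not LucidMedia's implementation: the `ShardedStore` class, the `shard_for` helper, and the use of in-process dictionaries in place of real servers are all assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    instead to keep routing consistent across processes.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; each dict stands in for one machine (assumption)."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def set(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
store.set("user:1", {"segments": ["sports"]})
print(store.get("user:1"))
```

Each key lives on exactly one shard, so a single-key read or write touches one machine. A join or a transaction spanning several keys would have to coordinate across shards, which is the difficulty the slide points out.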
  • Problem: High Latency
      • Database is usually biggest contributor to server-side application latency
      • memcached pioneered low latency key-value store
      • Solution - compromise functionality for speed
      • Usually sacrifice transactions, advanced query types
  • Problem: Inflexible Schemas
      • Relational databases require schema to be defined ahead of time
      • Flexible schema gives developers more options - handle upgrades in code instead of writing SQL
      • Storing custom formats can save lots of space for records with sparse fields
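One way to read "handle upgrades in code" is to store a version number with each record and upgrade old records lazily when they are read. The sketch below assumes JSON-encoded records and invents the field names (`v`, `segment`, `segments`) purely for illustration; the talk does not describe the actual record format.

```python
import json

CURRENT_VERSION = 2


def upgrade(record: dict) -> dict:
    """Bring a record of any known version up to CURRENT_VERSION."""
    record = dict(record)            # do not mutate the caller's dict
    version = record.get("v", 1)     # records without "v" are treated as version 1
    if version == 1:
        # Hypothetical change: v1 held one "segment" string, v2 holds a list.
        segment = record.pop("segment", None)
        record["segments"] = [segment] if segment is not None else []
        version = 2
    record["v"] = version
    return record


def load(raw: str) -> dict:
    """Decode a stored record and upgrade it on read."""
    return upgrade(json.loads(raw))


print(load('{"id": "u1", "segment": "sports"}'))
```

Old and new records can coexist in the store, and no bulk migration is needed. Omitting absent fields entirely, instead of storing NULL columns, is also where the space saving for sparse records comes from.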
  • General NoSQL Features
      • Storage Format / Operations
      • Memory / Disk Utilization
      • Atomic Operations
      • Auto-Sharding - Partitioning data across servers, scales reads and writes
      • Replication - Copying data between servers, scales reads
  • Types of Products
      • Key-Value: memcached, Redis, BerkeleyDB, HBase, Cassandra, Amazon SimpleDB
      • Document: MongoDB, CouchDB
      • Graph: FlockDB, Neo4j
  • Evaluation - LucidMedia
      • Query latency is priority #1
      • Disk access is suspect, since we’re in the cloud
      • Transactions not necessary - it’s OK to be briefly inconsistent or even lose a few updates
      • Replication and auto-sharding are nice, but also can be done manually
  • Products Evaluated
      • memcached
          • Type: Key-Value
          • Data Storage: LRU cache mapping string to binary data
          • Complex Operations: Check and set (CAS)
          • Storage Profile: All in memory
          • Scalability Features: None
          • License: Open Source (BSD)
          • Used By: Facebook, Twitter, YouTube
      • MongoDB
          • Type: Document Store
          • Data Storage: BSON objects (binary format similar to JSON)
          • Complex Operations: Indexing on multiple fields, MapReduce, atomic operations (single object)
          • Storage Profile: Disk and memory
          • Scalability Features: Auto-Sharding, Replication
          • License: Commercial, AGPL
          • Used By: FourSquare, bit.ly, ShutterFly
      • Cassandra
          • Type: Key-Value (Column Store)
          • Data Storage: Column family store - similar to BigTable, multiple data types for columns
          • Complex Operations: Tunable consistency, atomic operations (row level)
          • Storage Profile: Disk and memory
          • Scalability Features: Clustered, Replication
          • License: Open Source (Apache)
          • Used By: Facebook Inbox Search, Digg, Twitter
      • Redis
          • Type: Key-Value
          • Data Storage: Simple key-value, supports list, set, sorted set, hash
          • Complex Operations: Many atomic operations (single key)
          • Storage Profile: All in memory, saved to disk for persistence
          • Scalability Features: Replication; Cluster (unreleased) will provide auto-sharding
          • License: Open Source (BSD)
          • Used By: GitHub, Digg, LucidMedia
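The "check and set (CAS)" operation listed for memcached can be illustrated with a toy in-memory store. This is a sketch of the idea only: `CasStore`, `gets`, `cas`, and `increment` are hypothetical names written for this example and are not the API of any memcached client library.

```python
class CasStore:
    """Toy store where every value carries a version token (assumption)."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def gets(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        return self._data.get(key, (None, 0))

    def cas(self, key, value, version) -> bool:
        """Write only if nobody has changed the key since `version` was read."""
        _, current = self._data.get(key, (None, 0))
        if current != version:
            return False
        self._data[key] = (value, current + 1)
        return True


def increment(store: CasStore, key: str, retries: int = 10) -> int:
    """Read-modify-write loop that retries when another writer wins the race."""
    for _ in range(retries):
        value, version = store.gets(key)
        new_value = (value or 0) + 1
        if store.cas(key, new_value, version):
            return new_value
    raise RuntimeError("too much contention on key: " + key)


store = CasStore()
print(increment(store, "impressions:u1"))
```

A writer that holds a stale version token fails instead of silently overwriting a newer value, which gives single-key safety without full transactions.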
  • Findings
      • memcached
          • Pros: Fast, widely used, great for caching
          • Cons: We need more than a cache, MemcacheDB didn’t seem widely used at the time
          • Using It? Yes (for other things)
      • MongoDB
          • Pros: Great data model and feature set, strong commercial support
          • Cons: Early versions had performance issues
          • Using It? No
      • Cassandra
          • Pros: Great for storing and searching huge amounts of data
          • Cons: Not optimized for our problem, so performance didn’t fit our needs
          • Using It? No
      • Redis
          • Pros: Lightning fast, very active development, useful feature set
          • Cons: No auto-sharding (yet), memory footprint (per key) is a little high
          • Using It? Yes
  • Performance - GET
      • [Chart: throughput (reqs/sec) vs. concurrency (threads) for MySQL (InnoDB), Memcached, and Redis]
      • http://www.ruturaj.net/myisam-innodb
  • Performance - SET
      • [Chart: throughput (reqs/sec) vs. concurrency (threads) for MySQL (InnoDB), Memcached, and Redis]
      • http://www.ruturaj.net/myisam-innodb
  • Performance Testing
      • Use real application data
      • Approximate real conditions - run against your web servers, not a simple test program
      • Averages hide important details - use percentiles to measure latency
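A small sketch of why averages hide detail. The latency samples below are made up for illustration (95 fast requests and 5 slow ones) and are not measurements from the talk; `percentile` is a helper written for this example using the nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 95 requests at 2 ms, 5 requests at 200 ms.
latencies_ms = [2] * 95 + [200] * 5

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)                       # about 11.9 ms
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

With this data the mean is about 11.9 ms, which matches no real request: the median is 2 ms while the 99th percentile is 200 ms. Against a 5-10ms budget like the one in the use case, the tail is the number that matters.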
  • Drivers
      • Huge performance difference between drivers for the same language
      • Use asynchronous driver when possible to parallelize requests
      • [Chart: latency (ms) vs. concurrency (threads) for the Whalin and SpyMemcached drivers]
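The benefit of an asynchronous driver can be sketched with Python's `asyncio`. This is a stand-in rather than a real driver: `fake_get` simulates a 50 ms network round trip with a sleep, and the key names are invented for the example. The drivers compared on the slide are separate libraries that are not shown here.

```python
import asyncio
import time


async def fake_get(key: str, delay: float = 0.05) -> str:
    """Stand-in for a network GET; the sleep simulates round-trip latency."""
    await asyncio.sleep(delay)
    return f"value-for-{key}"


async def fetch_sequential(keys):
    """Issue one request at a time, as a blocking driver would."""
    return [await fake_get(k) for k in keys]


async def fetch_parallel(keys):
    """Issue all requests at once and wait for them together."""
    return list(await asyncio.gather(*(fake_get(k) for k in keys)))


keys = ["user:1", "user:2", "user:3", "user:4"]

start = time.perf_counter()
sequential = asyncio.run(fetch_sequential(keys))
sequential_s = time.perf_counter() - start

start = time.perf_counter()
parallel = asyncio.run(fetch_parallel(keys))
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.3f}s, parallel: {parallel_s:.3f}s")
```

Both calls return the same values in the same order. The sequential version pays the simulated latency once per key, while the parallel version pays it roughly once in total.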
  • Sharding
      • Split into a large number of shards initially, since you’re going to reshard eventually
      • Automate shard management processes
      • Measure performance and utilization metrics in production to predict scaling needs
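One common way to act on "split into a large number of shards initially" is to hash keys into a fixed set of logical shards and keep a separate shard-to-server map. The sketch below assumes 1024 logical shards and invented server names; the talk does not state the shard count or the mapping scheme LucidMedia used.

```python
import hashlib

NUM_SHARDS = 1024  # assumption: fixed up front, far larger than the server count


def shard_of(key: str) -> int:
    """Key -> logical shard. This mapping never changes after launch."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def initial_assignment(servers):
    """Spread logical shards over the servers round-robin."""
    return {shard: servers[shard % len(servers)] for shard in range(NUM_SHARDS)}


def move_shard(assignment, shard, new_server):
    """Return a new map with one logical shard relocated to another server."""
    updated = dict(assignment)
    updated[shard] = new_server
    return updated


def server_for(key, assignment):
    return assignment[shard_of(key)]


before = initial_assignment(["redis-a", "redis-b"])        # hypothetical names
key = "user:42"
after = move_shard(before, shard_of(key), "redis-c")
print(server_for(key, before), "->", server_for(key, after))
```

Because keys map to logical shards rather than directly to servers, adding capacity means copying whole shards and updating the map. No key is rehashed, and the move is a step that can be scripted.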
  • Questions?
      • @kleinsch
      • nkleinsch@lucidmedia.com