drop_acid_pycon_2009

1,598 views

Published on

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,598
On SlideShare
0
From Embeds
0
Number of Embeds
128
Actions
Shares
0
Downloads
25
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

drop_acid_pycon_2009

  1. 1. Drop ACID and think about data Bob Ippolito Mochi Media, Inc. PyCon 2009 - Chicago (actually Rosemont) March 28, 2009 Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 1 / 35
  2. 2. Author: Bob Ippolito Date: March 2009 Venue: PyCon 2009 Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 2 / 35
  3. 3. Bob’s Perspective Startup with lots of data: Cofounded Mochi Media in 2005 MochiBot analytics platform (for Flash) MochiAds ad serving platform (for Flash games) Other cool services for game developers Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 2 / 35
  4. 4. Mochi Ad Sales Hard Sell: Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 3 / 35
  5. 5. What’s ACID? A promise ring your DBMS wears: Atomicity: all or nothing Consistency: no explosions Isolation: no fights Durability: no lying Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 4 / 35
  6. 6. ACID Trips Scalability and reliability: Downtime is unacceptable Reliable is >= 2 nodes Scalable is ... more Networks make it hard Networks make it hard Networks make it hard Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 5 / 35
  7. 7. What can I have? CAP theorem says pick two: Consistency Availability Partition tolerance Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 6 / 35
  8. 8. Turn up the BASE Write smarter applications: Basically Available Soft state Eventually consistent Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 7 / 35
  9. 9. BASE jumping Everyone else is doing it: Google Amazon eBay Yahoo! Facebook ... Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 8 / 35
  10. 10. BigTable Google: Paxos (Chubby) Single-master Distributed tablets via GFS Row/Column db hybrid Compression (BMDiff, Zippy) Versioned (Row, Column, Timestamp) Bloom filters Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 9 / 35
  11. 11. BigTable Pros Pros: Compression = Awesome Clients are probably simple Integrates with map/reduce Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 10 / 35
  12. 12. BigTable Cons Cons: Proprietary to Google Single-master Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 11 / 35
  13. 13. BigTable Diagram Single-master: Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 12 / 35
  14. 14. Dynamo Amazon: Key/Value store Consistent hashing Vector clocks Read repair Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 13 / 35
  15. 15. Dynamo Pros Pros: No master Highly available for write Knobs to make it fast to read “Simple” (lots of half-baked clones!) Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 14 / 35
  16. 16. Dynamo Cons Cons: Proprietary to Amazon Clients need to be smart No compression Not suitable for column-like workloads Just a Key/Value store Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 15 / 35
  17. 17. Dynamo Diagram Smart client: Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 16 / 35
  18. 18. Cassandra Facebook -> Apache: Open source! No master like Dynamo Storage model more like BigTable Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 17 / 35
  19. 19. Cassandra Pros Pros: OPEN SOURCE Incrementally scalable Minimal administration Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 18 / 35
  20. 20. Cassandra Cons Cons: Not polished No compression yet Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 19 / 35
  21. 21. Cassandra Diagram Soul Calibur: Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 20 / 35
  22. 22. Distributed Musings New Hotness: Distributed databases are the new web framework ... except none of them are awesome yet I don’t think we need another half-baked Dynamo clone Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 21 / 35
  23. 23. Key-Value Stores Simple and Fast: Similar to a Python dict Keys usually bytes, probably limited Values usually bytes, often have fewer limits Extremely fast, simple Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 22 / 35
  24. 24. Memcached Key/Value store as cache: No persistence RAM only Throws data away (on purpose) Lightning fast “Everyone” uses it Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 23 / 35
  25. 25. Caching Immutable Data If only data never changed: Immutable is easy, do that Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 24 / 35
  26. 26. Caching Mutable Data Invalidation sucks: Mutable is hard Failed transactions? Concurrent writers? Dependent cache keys? You will get it wrong and it will be hard to debug Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 25 / 35
  27. 27. Tokyo Cabinet/Tyrant Not your mom’s BerkeleyDB: Disk persistent Very performant Actively developed Similar replication strategy to MySQL Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 26 / 35
  28. 28. Redis Still very new: Not just a Key/Value store Matching on key spaces Values can be bytes, lists or sets Requires full store in RAM Might be a nice cache server? Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 27 / 35
  29. 29. Document Databases Schema-free: Very easy to use Document Versioning Great for storing documents Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 28 / 35
  30. 30. CouchDB Document DB Poster Child: Apache project Asynchronous replication JSON based Views materialized on demand (not indexes) Neat admin UI Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 29 / 35
  31. 31. MongoDB C++’s revenge: Fast JSON and BSON (binary JSON-ish) Asynchronous replication with auto-sharding “soon” Index support Nested documents Advanced queries Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 30 / 35
  32. 32. Column Databases Data Warehousing: Sequential reads are awesome Columns compress better than rows Doesn’t waste I/O on uninteresting columns Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 31 / 35
  33. 33. MonetDB Research project: Tried really hard to get it to work Crashes a lot and corrupts your data Do not waste your time Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 32 / 35
  34. 34. LucidDB Sounds interesting: Java/C++ open source data warehouse No clustering No experience yet Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 33 / 35
  35. 35. Vertica We paid for it: Commercial (based on C-Store) Clustered Would still prefer open source Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 34 / 35
  36. 36. Bitmap Indexes Sequential Scans can be fast: 1-N bits per row of data Can apply logical operations across indexes Can be compressed (BBC, WAH) FastBit is a good implementation Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 35 / 35
  37. 37. Bitmap Index Uses Big Queries: PostgreSQL 8.1+ in-memory for some queries Almost a requirement for column stores FastBit is a great implementation (WAH) Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 36 / 35
  38. 38. Bloom Filters are Neat But our Princess is in another castle: Probabilistic data structure False positives at a known error Constant space I won’t bore you with the math Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 37 / 35
  39. 39. Bloom Filter Diagram Actually Relevant: Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 38 / 35
  40. 40. Bloom Filter Uses Find stuff, maybe: Approximate counting of a large set (e.g. unique IPs from logs) Knowing that data is definitely NOT stored somewhere, e.g. remote cache Several variants (Counting Bloom Filter, Scalable Bloom Filter, ...) Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 39 / 35
  41. 41. Questions? Open Space: Open Space TODAY @ 5pm, Lambert. See Jonathan Ellis Bob Ippolito (PyCon 2009) Drop ACID and think about data March 28, 2009 40 / 35

×