
JasperWorld 2012: Reinventing Data Management by Max Schireson


Published in: Technology

  1. Max Schireson, President, 10gen
  2. My background
     • Oracle from July 1994 to June 2003
     • MarkLogic from July 2003 to Feb 2011
     • 10gen (makers of MongoDB) since Feb 2011
  3. In this talk
     • Why is everyone and their brother inventing a new database nowadays?
        • Meanwhile, lots of great analytics are happening in Hadoop with no database at all
     • Why do they all look so different from each other and from what we’re used to?
  4. Since the dawn of the RDBMS (1970 vs 2012)
     • Main memory: Intel 1103, 1K bits (1970) vs 4GB of RAM for $25.99 (2012)
     • Mass storage: IBM 3330 Model 1, 100MB (1970) vs a 3TB SuperSpeed USB drive for $129 (2012)
     • Microprocessor: nearly there in 1970 (the 4004 was in development: 4 bits, 92,000 instructions per second) vs the Westmere EX in 2012 (10 cores, 30MB L3 cache, runs at 2.4GHz)
  5. More recent changes (a decade ago vs now)
     • Faster: buy a bigger server vs buy more servers
     • Faster storage: a SAN with more spindles vs SSD
     • More reliable storage: a more expensive SAN vs more copies on local storage
     • Deployed in: your data center vs the cloud, private or public
     • Large data set: millions of rows vs billions to trillions of rows
     • Development: waterfall vs iterative
     • Tasks: simple transactions vs complex analytics
  6. Assumptions behind today’s DBMS
     • Relational data model
     • Third normal form
     • ACID
     • Multi-statement transactions
     • SQL
     • RAM is small and disks are slow
     • Runs on one fast computer
  7. Yesterday’s assumptions in today’s world
     • Scaleout is hard
        • Or impossible, if you believe the CAP theorem
     • Custom solutions proliferate
     • Too slow? Just add a cache
     • ORM tools everywhere
     • Only the database is scale-up
  8. Challenging some assumptions
     • Do you need a database at all?
     • How does it handle transactions and consistency?
     • How does it scale out?
     • How should it model data?
     • How do you query it?
  9. My opinions
     • Different use cases will produce different answers
     • Existing RDBMS solutions will continue to solve a broad set of problems well, but many applications will work better on top of alternative technologies
     • Many new technologies will find niches, but only one or two will become mainstream
  10. Do you need a database at all?
      • Can you better solve your problem with a batch-processing framework?
      • Can you better solve your problem with an in-memory object store/cache?
  11. Is scaleout mission impossible?
      • What about the CAP theorem?
         • It says that when a distributed system is partitioned, you cannot accept updates everywhere and still guarantee consistency
         • Duh
      • So, either allow inconsistency or limit where updates can be applied
  12. Two choices for consistency
      • Eventual consistency
         • Allow updates even when the system is partitioned
         • Resolve conflicts later
         • Examples: CouchDB, Cassandra
      • Immediate consistency
         • Limit the application of updates to a single master node for a given slice of data
         • Another node can take over after a failure is detected
         • Avoids the possibility of conflicts
         • Example: MongoDB
  13. Transactions
      • Do they exist?
      • At what level of granularity?
      • MongoDB example
         • Transactions are document-level
         • Those short transactions are atomic, consistent, isolated, and durable
  14. Simple transactions
      ( { _id: 700, voters: { $ne: 'max' } },
        { $inc: { votes: 1 }, $push: { voters: 'max' } } )
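The update on this slide reads: atomically, if 'max' has not already voted on document 700, increment its vote count and record the voter. A minimal pure-Python sketch of those document-level semantics (hypothetical names; this is not MongoDB's implementation, which does the check-and-modify inside the server):

```python
import threading

# In-memory "collection" of documents, keyed by _id
docs = {700: {"_id": 700, "votes": 0, "voters": []}}
_lock = threading.Lock()

def vote(doc_id, voter):
    # The whole test-and-modify runs under one lock, so no other
    # writer can interleave between the condition and the update --
    # the analogue of a single atomic document-level operation
    with _lock:
        doc = docs[doc_id]
        if voter in doc["voters"]:   # the { voters: { $ne: voter } } condition
            return False             # condition fails -> no update applied
        doc["votes"] += 1            # $inc: { votes: 1 }
        doc["voters"].append(voter)  # $push: { voters: voter }
        return True

vote(700, "max")   # first vote is counted
vote(700, "max")   # duplicate vote is rejected by the condition
print(docs[700]["votes"])  # 1
```

The point of the slide is that this entire check-increment-push is one short transaction on one document: no multi-statement transaction machinery is needed.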
  15. Scaleout architecture
      • How do you distribute data among many servers?
      • Choices
         • Hashes (Dynamo style) vs ranges (BigTable style)
            • Tradeoff: set-and-forget vs optimizability
         • Physical vs logical segments
            • Very important with secondary indexes
            • Tradeoff: ease of cluster rebalancing vs performance optimization
      • MongoDB: BigTable-style range partitioning with logical segmentation
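The hash-vs-range choice above can be sketched as two partitioning functions. The shard count and range boundaries here are made up for illustration; the tradeoff is that hashing spreads load evenly with no tuning, while ranges keep nearby keys together so range scans stay local, at the cost of rebalancing as data skews:

```python
import hashlib

SHARDS = 4

def hash_shard(key):
    # Dynamo-style: hash the key, then take it modulo the shard count.
    # Load spreads evenly, but a range scan touches every shard.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % SHARDS

# Hypothetical, hand-chosen range boundaries for this sketch
RANGES = [(0, 250), (250, 500), (500, 750), (750, 1000)]

def range_shard(key):
    # BigTable-style: contiguous key ranges map to shards, so a scan
    # over [250, 500) touches exactly one shard.
    for shard, (lo, hi) in enumerate(RANGES):
        if lo <= key < hi:
            return shard
    raise KeyError(key)

print(range_shard(260))                          # all of 250-499 lives on shard 1
print({hash_shard(k) for k in range(250, 500)})  # the same keys, scattered everywhere
```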
  16. Scaleout – no free lunch
      • With a large cluster:
         • No known solution to the general case of fast distributed joins
            • Some subcases can be handled
         • No known solution to fast distributed transactions
  17. Why mess with the data model?
      • Relational minus joins and multi-statement transactions is much less useful
      • What about partial solutions to joins and multi-statement transactions?
         • Hard to implement
         • Complex for developers to understand the performance implications
      • Therefore alternatives are worth considering for distributed systems
      • Common alternatives
         • Key-value
         • Document
         • Graph
         • Column-family
      • MongoDB example: JSON-based, document-oriented
  18. Change one assumption
      • First normal form: no repeating groups
      • Why?
      • What if that is not a requirement?
         • You need many fewer joins
         • Transactions are often simplified
         • Data locality is often increased
      • But at a cost
         • Much theory is now moot
         • Implementation complexity
      • From a different initial assumption, different rules apply
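The "change one assumption" argument can be made concrete with a toy order record. Under first normal form the repeating group (the order's line items) must live in a separate table and be recombined with a join; drop that assumption and the items embed directly in one document. The schema below is hypothetical, chosen just to show the difference:

```python
# Relational shape (first normal form): repeating groups split out
# into a second table, recombined by a join on order_id
orders = [{"order_id": 1, "customer": "max"}]
order_items = [
    {"order_id": 1, "sku": "A1", "qty": 2},
    {"order_id": 1, "sku": "B7", "qty": 1},
]

def items_for(order_id):
    # the join: scan the child table for matching rows
    return [i for i in order_items if i["order_id"] == order_id]

# Document shape: the repeating group is embedded, so a single read
# returns the whole order -- fewer joins, better data locality, and
# an update to the order is naturally one document-level transaction
order_doc = {
    "_id": 1,
    "customer": "max",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

print(len(items_for(1)), len(order_doc["items"]))  # 2 2
```

The cost mentioned on the slide shows up here too: the embedded form duplicates nothing in this example, but relational theory about normalization no longer applies to it, and the storage engine must handle variable-sized nested documents.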
  19. Querying a database
      • By primary key only
      • Ad-hoc queries
         • SQL or otherwise, but language details are a minor choice
      • Via map-reduce
      • OLTP and BI together
         • E.g., SAP HANA
      • MongoDB example: ad-hoc queries (based on JSON) and map-reduce
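The map-reduce option above can be sketched without any database: a map function emits (key, value) pairs per document, pairs are grouped by key, and a reduce function folds each group. The ballot-counting workload is invented for illustration:

```python
from collections import defaultdict

# A pile of documents to query via map-reduce
ballots = [
    {"candidate": "ada"}, {"candidate": "max"},
    {"candidate": "ada"}, {"candidate": "ada"},
]

def map_fn(doc):
    yield doc["candidate"], 1    # emit one (key, value) pair per ballot

def reduce_fn(key, values):
    return sum(values)           # fold all values emitted for one key

# Shuffle phase: group emitted values by key
groups = defaultdict(list)
for doc in ballots:
    for key, value in map_fn(doc):
        groups[key].append(value)

# Reduce phase: one result per key
result = {key: reduce_fn(key, vals) for key, vals in groups.items()}
print(result)  # {'ada': 3, 'max': 1}
```

Because map runs independently per document and reduce runs independently per key, both phases parallelize naturally across a cluster, which is why map-reduce fits the scaleout architectures discussed earlier.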
  21. Max Schireson, President, 10gen