JasperWorld 2012: Reinventing Data Management by Max Schireson


Published in: Technology

  • 1. Max Schireson, President,
  • 2. My background
      • Oracle from July 1994 to June 2003
      • MarkLogic from July 2003 to Feb 2011
      • 10gen (makers of MongoDB) since Feb 2011
  • 3. In this talk
      • Why is everyone and their brother inventing a new database nowadays?
          • Meanwhile, lots of great analytics are happening in Hadoop with no database at all
      • Why do they all look so different from each other and from what we’re used to?
  • 4. Since the dawn of the RDBMS (1970 vs 2012)
      • Main memory: Intel 1103, 1k bits (1970); 4GB of RAM costs $25.99 (2012)
      • Mass storage: IBM 3330 Model 1, 100 MB (1970); 3TB Superspeed USB for $129 (2012)
      • Microprocessor: nearly, with the 4004 being developed, 4 bits and 92,000 instructions per second (1970); Westmere EX has 10 cores, 30MB L3 cache, runs at 2.4GHz (2012)
  • 5. More recent changes (a decade ago vs now)
      • Faster: buy a bigger server (a decade ago); buy more servers (now)
      • Faster storage: a SAN with more spindles (a decade ago); SSD (now)
      • More reliable storage: more expensive SAN (a decade ago); more copies of local storage (now)
      • Deployed in: your data center (a decade ago); the cloud, private or public (now)
      • Large data set: millions of rows (a decade ago); billions to trillions of rows (now)
      • Development: waterfall (a decade ago); iterative (now)
      • Tasks: simple transactions (a decade ago); complex analytics (now)
  • 6. Assumptions behind today’s DBMS
      • Relational data model
      • Third normal form
      • ACID
      • Multi-statement transactions
      • SQL
      • RAM is small and disks are slow
      • Runs on one fast computer
  • 7. Yesterday’s assumptions in today’s world
      • Scaleout is hard
          • Or impossible if you believe the CAP theorem
      • Custom solutions proliferate
      • Too slow? Just add a cache
      • ORM tools everywhere
      • Only the database is scale-up
  • 8. Challenging some assumptions
      • Do you need a database at all?
      • How does it handle transactions and consistency?
      • How does it scale out?
      • How should it model data?
      • How do you query it?
  • 9. My opinions
      • Different use cases will produce different answers
      • Existing RDBMS solutions will continue to solve a broad set of problems well, but many applications will work better on top of alternative technologies
      • Many new technologies will find niches but only one or two will become mainstream
  • 10. Do you need a database at all?
      • Can you better solve your problem with a batch processing framework?
      • Can you better solve your problem with an in-memory object store/cache?
  • 11. Is Scaleout Mission Impossible?
      • What about the CAP Theorem?
          • It says if a distributed system is partitioned, you can’t both update everywhere and have consistency
          • Duh
      • So, either allow inconsistency or limit where updates can be applied
  • 12. Two choices for consistency
      • Eventual consistency
          • Allow updates when a system has been partitioned
          • Resolve conflicts later
          • Example: CouchDB, Cassandra
      • Immediate consistency
          • Limit the application of updates to a single master node for a given slice of data
          • Another node can take over after a failure is detected
          • Avoids the possibility of conflicts
          • Example: MongoDB
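The two choices on this slide can be sketched with a small in-memory model. This is an illustration only, not the behavior of CouchDB, Cassandra, or MongoDB: the `Slice` class, the `resolveLastWriteWins` function, and the last-write-wins rule are all assumptions made for the example.

```javascript
// Hypothetical in-memory model; not the API of any real database.

// Eventual consistency: both sides of a partition accept writes, and
// conflicting versions are reconciled later. Last-write-wins by timestamp
// is assumed here purely for illustration.
function resolveLastWriteWins(a, b) {
  return a.ts >= b.ts ? a : b;
}

// Immediate consistency: only the current master for a slice of data
// accepts writes, so conflicting versions never arise.
class Slice {
  constructor(nodes) {
    this.nodes = nodes;      // node names, in failover order
    this.master = nodes[0];
    this.data = new Map();
  }

  write(node, key, value) {
    if (node !== this.master) {
      throw new Error(`${node} is not the master for this slice`);
    }
    this.data.set(key, value);
  }

  // Called once a failure of the current master has been detected.
  failover() {
    const i = this.nodes.indexOf(this.master);
    this.master = this.nodes[(i + 1) % this.nodes.length];
  }
}
```

In this sketch a write sent to a non-master node is rejected rather than merged later, which is the price paid for avoiding conflicts.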
  • 13. Transactions
      • Do they exist?
      • At what level of granularity?
      • MongoDB example
          • Transactions are document-level
          • Those short transactions are atomic, consistent, isolated and durable
  • 14. Simple transactions
      • ({ _id: 700, voters: { $ne: 'max' } }, { $inc: { votes: 1 }, $push: { voters: 'max' } })
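The slide shows only a filter document and an update document. A rough way to read the pair: the update applies only if document 700 does not already list 'max' as a voter, in which case the vote count is incremented and 'max' is recorded. The sketch below mimics that logic against a plain array; `voteOnce` and the sample data are hypothetical and say nothing about how MongoDB implements it.

```javascript
// Hypothetical helper mimicking the filter-plus-update pair on the slide
// against a plain in-memory array; it is not MongoDB's implementation.
function voteOnce(posts, postId, voter) {
  // Filter: { _id: postId, voters: { $ne: voter } }
  const post = posts.find(p => p._id === postId && !p.voters.includes(voter));
  if (!post) return false;      // no match: unknown post, or voter already voted
  post.votes += 1;              // $inc: { votes: 1 }
  post.voters.push(voter);      // $push: { voters: voter }
  return true;
}

const posts = [{ _id: 700, votes: 0, voters: [] }];
voteOnce(posts, 700, "max");    // true: first vote is counted
voteOnce(posts, 700, "max");    // false: second vote is ignored
```

Because the check and the change touch a single document, the whole operation fits within the document-level transaction described on the previous slide.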
  • 15. Scaleout architecture
      • How do you distribute data among many servers?
      • Choices
          • Hashes (Dynamo style) vs ranges (BigTable style)
              • Tradeoff: set-and-forget vs optimizability
          • Physical vs logical segments
              • Very important with secondary indexes
              • Tradeoff: cluster rebalancing ease vs performance optimization
      • MongoDB: BigTable-style range partitioning with logical segmentation
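The hash-versus-range choice can be made concrete with a minimal sketch. Both functions below are illustrative assumptions (`hashShard`, `rangeShard`, the toy string hash, and the boundary list are not taken from Dynamo, BigTable, or MongoDB).

```javascript
// Hash partitioning (Dynamo style): the shard is derived from a hash of the key.
function hashShard(key, shardCount) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0;   // toy 32-bit string hash
  }
  return h % shardCount;
}

// Range partitioning (BigTable style): each shard owns a contiguous key range.
// `upperBounds` holds sorted, exclusive upper bounds; the last shard takes
// every key at or above the final bound.
function rangeShard(key, upperBounds) {
  const i = upperBounds.findIndex(bound => key < bound);
  return i === -1 ? upperBounds.length : i;
}

rangeShard("kiwi", ["g", "n", "t"]);   // 1: "kiwi" falls in ["g", "n")
```

Hashing needs no boundary tuning, which is the set-and-forget side of the tradeoff. Ranges keep neighboring keys on the same shard and let the boundaries be adjusted, which is the optimizability side.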
  • 16. Scaleout: no free lunch
      • With a large cluster:
          • No known solution to the general case of fast distributed joins
              • Some subcases can be handled
          • No known solution to fast distributed transactions
  • 17. Why mess with the data model?
      • Relational minus joins and multi-statement transactions is much less useful
      • What about partial solutions to joins and multi-statement transactions?
          • Hard to implement
          • Complex for developers to understand performance implications
      • Therefore alternatives are worth considering for distributed systems
      • Common alternatives
          • Key-value
          • Document
          • Graph
          • Column-family
      • MongoDB example: JSON-based, document-oriented
  • 18. Change one assumption
      • First normal form: no repeating groups
      • Why?
      • What if that is not a requirement?
          • You need many fewer joins
          • Transactions are often simplified
          • Data locality is often increased
      • But at a cost
          • Much theory is now moot
          • Implementation complexity
      • From a different initial assumption, different rules apply
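A small sketch of what dropping the "no repeating groups" rule buys. The data and the names `postRows`, `tagRows`, `postWithTags`, and `postDocs` are invented for illustration.

```javascript
// First normal form: no repeating groups, so tags live in their own table
// and must be joined back to the post.
const postRows = [{ id: 1, title: "Hello" }];
const tagRows = [
  { postId: 1, tag: "intro" },
  { postId: 1, tag: "news" },
];

function postWithTags(postId) {
  const post = postRows.find(p => p.id === postId);
  const tags = tagRows.filter(t => t.postId === postId).map(t => t.tag);
  return { ...post, tags };
}

// Document form: the repeating group is stored inside the record, so a
// single lookup returns the same result with no join.
const postDocs = [{ id: 1, title: "Hello", tags: ["intro", "news"] }];
const doc = postDocs.find(d => d.id === 1);
```

Here `postWithTags(1)` and `doc` hold the same data. The document form gets it from one record, which is where the fewer joins and better data locality come from.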
  • 19. Querying a database
      • By primary key only
      • Ad-hoc queries
          • SQL or otherwise, but language details are a minor choice
      • Via map-reduce
      • OLTP and BI together
          • E.g., SAP HANA
      • MongoDB example: ad-hoc queries (based on JSON) and map-reduce
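To make two of these query styles concrete, here is a minimal in-memory sketch of an ad-hoc query expressed as a JSON-like object and of a map-reduce aggregation. The `find` and `mapReduce` functions and the sample `orders` are assumptions for illustration; they borrow the general idea only and are not MongoDB's API.

```javascript
const orders = [
  { customer: "ann", total: 30 },
  { customer: "bob", total: 5 },
  { customer: "ann", total: 12 },
];

// Ad-hoc query: the filter is a plain object of field -> required value.
function find(docs, query) {
  return docs.filter(d =>
    Object.entries(query).every(([field, value]) => d[field] === value)
  );
}

// Map-reduce: `map` emits one [key, value] pair per document, and `reduce`
// folds all values that share a key into a single result.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  for (const d of docs) {
    const [key, value] = map(d);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

const annOrders = find(orders, { customer: "ann" });
const totals = mapReduce(
  orders,
  d => [d.customer, d.total],
  (key, values) => values.reduce((a, b) => a + b, 0)
);
```

With this data, `annOrders` holds the two orders placed by "ann" and `totals` maps each customer to the sum of their order totals.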
  • 21. Max Schireson, President,