JasperWorld 2012: Reinventing Data Management by Max Schireson

Presentation Transcript

  • Max Schireson, President, 10gen (max@10gen.com, @mschireson, maxschireson.com)
  • My background
    • Oracle from July 1994 to June 2003
    • MarkLogic from July 2003 to Feb 2011
    • 10gen (makers of MongoDB) since Feb 2011
  • In this talk
    • Why is everyone and their brother inventing a new database nowadays?
    • Meanwhile, lots of great analytics are happening in Hadoop with no database at all
    • Why do they all look so different from each other and from what we’re used to?
  • Since the dawn of the RDBMS: 1970 vs. 2012
    • Main memory: in 1970, the Intel 1103 held 1K bits; in 2012, 4GB of RAM costs $25.99
    • Mass storage: in 1970, the IBM 3330 Model 1 held 100 MB; in 2012, a 3TB SuperSpeed USB drive costs $129
    • Microprocessor: in 1970, nearly there (the 4004 was being developed: 4 bits, 92,000 instructions per second); in 2012, the Westmere EX has 10 cores, 30MB of L3 cache, and runs at 2.4GHz
  • More recent changes: a decade ago vs. now
    • Faster: buy a bigger server vs. buy more servers
    • Faster storage: a SAN with more spindles vs. SSD
    • More reliable storage: a more expensive SAN vs. more copies on local storage
    • Deployed in: your data center vs. the cloud, private or public
    • Large data set: millions of rows vs. billions to trillions of rows
    • Development: waterfall vs. iterative
    • Tasks: simple transactions vs. complex analytics
  • Assumptions behind today’s DBMS
    • Relational data model
    • Third normal form
    • ACID
    • Multi-statement transactions
    • SQL
    • RAM is small and disks are slow
    • Runs on one fast computer
  • Yesterday’s assumptions in today’s world
    • Scaleout is hard
      • Or impossible, if you believe the CAP theorem
    • Custom solutions proliferate
    • Too slow? Just add a cache
    • ORM tools everywhere
    • Only the database is scale-up
  • Challenging some assumptions
    • Do you need a database at all?
    • How does it handle transactions and consistency?
    • How does it scale out?
    • How should it model data?
    • How do you query it?
  • My opinions
    • Different use cases will produce different answers
    • Existing RDBMS solutions will continue to solve a broad set of problems well, but many applications will work better on top of alternative technologies
    • Many new technologies will find niches, but only one or two will become mainstream
  • Do you need a database at all?
    • Can you better solve your problem with a batch processing framework?
    • Can you better solve your problem with an in-memory object store/cache?
  • Is scaleout Mission Impossible?
    • What about the CAP theorem?
      • It says that if a distributed system is partitioned, you can’t accept updates everywhere and still have consistency
      • Duh
    • So, either allow inconsistency or limit where updates can be applied
  • Two choices for consistency
    • Eventual consistency
      • Allow updates even when the system has been partitioned
      • Resolve conflicts later
      • Examples: CouchDB, Cassandra
    • Immediate consistency
      • Limit the application of updates to a single master node for a given slice of data
      • Another node can take over after a failure is detected
      • Avoids the possibility of conflicts
      • Example: MongoDB
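    A minimal sketch of the immediate-consistency model, using current MongoDB shell syntax; the votes collection and the write-concern settings are illustrative assumptions, not from the slides:

      // All writes for a given slice of data go to that replica set's primary,
      // so there is a single place where updates are ordered.
      db.votes.insertOne(
        { _id: 700, votes: 0, voters: [] },
        { writeConcern: { w: "majority" } }  // wait for a majority of replicas to acknowledge
      )
      // If the primary fails, a secondary is elected to take over; conflicting
      // concurrent masters are avoided rather than reconciled after the fact.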
  • Transactions
    • Do they exist?
    • At what level of granularity?
    • MongoDB example
      • Transactions are document-level
      • Those short transactions are atomic, consistent, isolated, and durable
  • Simple transactions example
      db.docs.update(
        { _id: 700, voters: { $ne: 'max' } },
        { $inc: { votes: 1 }, $push: { voters: 'max' } }
      )
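    The single statement above behaves like a small transaction: the document matches only if 'max' is not already in the voters array, and the vote increment and the push of 'max' are then applied to that one document atomically, or not at all.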
  • Scaleout architecture
    • How do you distribute data among many servers?
    • Choices
      • Hashes (Dynamo style) vs. ranges (BigTable style)
        • Tradeoff: set-and-forget vs. optimizability
      • Physical vs. logical segments
        • Very important with secondary indexes
        • Tradeoff: ease of cluster rebalancing vs. performance optimization
    • MongoDB: BigTable-style range partitioning with logical segmentation
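    A minimal sketch of range partitioning with MongoDB's shell helpers; the database name, collection name, and shard key are illustrative assumptions:

      // Range-partition the docs collection by _id across the shards of a cluster.
      sh.enableSharding("votes")                    // allow this database to be sharded
      sh.shardCollection("votes.docs", { _id: 1 })  // BigTable-style range partitioning
      // Chunks (logical segments of the key range) are split and migrated by the
      // balancer, rather than moving fixed physical partitions between servers.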
  • Scaleout – no free lunch
    • With a large cluster:
      • No known solution to the general case of fast distributed joins
        • Some subcases can be handled
      • No known solution to fast distributed transactions
  • Why mess with the data model?
    • Relational minus joins and multi-statement transactions is much less useful
    • What about partial solutions to joins and multi-statement transactions?
      • Hard to implement
      • Complex for developers to understand the performance implications
    • Therefore alternatives are worth considering for distributed systems
    • Common alternatives
      • Key-value
      • Document
      • Graph
      • Column-family
    • MongoDB example: JSON-based, document-oriented
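    A minimal sketch of the document model; the collection and field names are illustrative assumptions, not from the slides:

      // A single JSON document holds nested values and arrays that a relational
      // schema would spread across several tables.
      db.articles.insert({
        _id: 700,
        title: "Reinventing Data Management",
        author: { name: "Max Schireson", twitter: "@mschireson" },
        tags: ["mongodb", "nosql", "scaleout"],   // a repeating group, stored inline
        votes: 1,
        voters: ["max"]
      })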
  • Change one assumption
    • First normal form: no repeating groups
    • Why?
    • What if that is not a requirement?
      • You need many fewer joins
      • Transactions are often simplified
      • Data locality is often increased
    • But at a cost
      • Much theory is now moot
      • Implementation complexity
    • From a different initial assumption, different rules apply
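    Continuing the illustrative articles document above: because the repeating groups (tags, voters) live inside the document, related data is read together from one place instead of being reassembled with joins.

      // One read returns an article together with its embedded tags and voters;
      // a first-normal-form schema would need joins across articles, article_tags,
      // and article_voters tables to produce the same result.
      db.articles.find({ tags: "mongodb" })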
  • Querying a database
    • By primary key only
    • Ad-hoc queries
      • SQL or otherwise, but language details are a minor choice
    • Via map-reduce
    • OLTP and BI together
      • E.g., SAP HANA
    • MongoDB example: ad-hoc queries (based on JSON) and map-reduce
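    A minimal sketch of both query styles against the illustrative articles collection; the collection, field, and output names are assumptions, not from the slides:

      // Ad-hoc JSON query: articles with more than 10 votes.
      db.articles.find({ votes: { $gt: 10 } })

      // Map-reduce: total votes per tag, written to a results collection.
      db.articles.mapReduce(
        function () { var v = this.votes; this.tags.forEach(function (t) { emit(t, v); }); },
        function (key, values) { var total = 0; values.forEach(function (v) { total += v; }); return total; },
        { out: "votes_by_tag" }
      )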
  • Max Schireson, President, 10gen (max@10gen.com, @mschireson, maxschireson.com)