Database Revolution Opening Webcast, 01-18-12



Robin Bloor and Mark Madsen offer their theories on where the rapidly changing database market stands today: What's new? What's standard? What is the trajectory of this evolving market? Each analyst will present for 10-15 minutes, then engage in a dialogue with the moderator and attendees.

The webcast audio and video archive can be found at



  1. Fit For Purpose: The New Database Revolution (Mark Madsen & Robin Bloor)
  2. Introduction
     Significant and revolutionary changes are taking place in database technology.
     In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project.
     This is the first webinar in a series of webinars and research activities that will comprise the project.
     All research will be made available through our website:
  3. Sponsors of This Research
  4. General Webinar Structure
     What & why
     History of Database Part 1: How we got to the RDBMS
     History of Database Part 2: Relational and post-relational
     Food for thought: issues, problems, assumptions, challenges
     Current conclusions: insofar as we have any
  5. Change? Why?
     Increased data volumes
     Significant hardware changes
     Database product innovation
     New workloads, different data structures
     Established database concepts are being challenged
     Market forces can drive change
  6. Data Volumes: Moore's Law Cubed
     Moore's Law suggests that CPU power increases 10-fold every 6 years (and other technologies have stayed in step to some degree).
     Large database volumes have grown 1,000-fold every 6 years:
       In 1992, measured in megabytes
       In 1998, measured in gigabytes
       In 2004, measured in terabytes
       In 2010, measured in petabytes
     Exabytes by 2016?
  7. Hardware Changes
     Moore's Law now proceeds by adding cores rather than by increasing clock speed.
     Computer grids using commodity servers are now relatively inexpensive.
     Parallelism is now on the rise and will eventually become the normal mode of processing.
     Memory is about 1 million times faster than disk, and random reads have become very expensive in respect of latency.
     SSDs are augmenting, and may eventually replace, spinning disk.
  8. Majority of Data Becomes Historical Data Over Time
     [Chart: over time the active share of transactional data shrinks (from roughly 30% to 10%) while the static, historical share grows (from roughly 70% to 90%, or even all of it when no longer active), with rising cost and pain for application performance. Image courtesy: RainStor]
  9. Market Forces
     A new set of products appears.
     They include some fundamental innovations.
     A few are sufficiently popular to last.
     Fashion and marketing drive greater adoption.
     Product defects begin to be addressed.
     They eventually challenge the dominant products.
  10. Section 1: History Part 1 - Pre-relational and Relational
     What we had in prior technology regimes
     Where we came from
     What we traded away and why
  11. The Dawn of Database
     Schema defines the logical structure of data:
       The schema enables extensive reuse
       Logical structure vs physical structure
     ACID properties:
       Atomicity: transactions must be atomic
       Consistency: a transaction ensures consistency
       Isolation: a transaction runs in isolation
       Durability: a completed transaction causes permanent change to data
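The ACID properties above can be seen in any transactional engine. A minimal sketch using Python's built-in sqlite3 module (the table, names, and overdraft rule are invented for illustration, not from the deck):

```python
import sqlite3

# In-memory database; a real deployment would use a durable file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Atomicity: both updates commit together, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 "
                     "WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 "
                     "WHERE name = 'bob'")
        (bal,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if bal < 0:
            raise ValueError("overdraft")  # triggers the rollback
except ValueError:
    pass

# The failed transfer left no partial update behind.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

The rollback on error is what atomicity means in practice: the half-finished transfer is invisible to every other reader.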
  12. Database Performance Bottlenecks
     CPU saturation
     Memory saturation
     Disk I/O channel saturation
     Locking
     Network saturation
     Parallelism: inefficient load balancing
  13. The Joys of SQL?
     SQL is a declarative query language targeted at data organized in two-dimensional tables.
     It enables set operations on those tables via Select, Project and Join operations, which can be qualified (Order By, etc.).
     It imposes some limitations on the logical model of data.
     It can create a barrier between the user and the data...
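The three set operations named above can be shown in one toy query; a sketch using sqlite3, with a schema invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         total REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 40.0);
""")

# Project (choose columns), Join (combine tables), Select (filter rows),
# qualified with Order By -- the operations named on the slide.
rows = conn.execute("""
    SELECT c.name, o.total                     -- projection
    FROM customers c
    JOIN orders o ON o.customer_id = c.id      -- join
    WHERE o.total > 50.0                       -- selection
    ORDER BY o.total DESC                      -- qualification
""").fetchall()
print(rows)  # [('Acme', 250.0), ('Acme', 75.0)]
```

Note that the query says only *what* is wanted; how the tables are scanned and joined is left to the engine, which is exactly the declarative bargain the slide describes.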
  14. The Ordering of Data
     "A data set is an unordered collection of unique, non-duplicated items."
     Data is naturally ordered by time if by nothing else:
       Events are ordered by time
       Changes to entities are ordered by time
     Having an inherent physical order to data can save many processing cycles in some areas of application.
     This is particularly the case for time-series applications.
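The cycle-saving point can be sketched in a few lines: when events are stored in time order, a time-range query is just a binary search plus a sequential read, with no sort and no full scan (data invented for illustration):

```python
from bisect import bisect_left, bisect_right

# Events stored in arrival order -- the "inherent physical order" by time.
timestamps = [1, 3, 3, 7, 9, 12, 15, 18, 21, 30]

def events_between(lo, hi):
    """Timestamps with lo <= t <= hi: O(log n) search + sequential slice."""
    return timestamps[bisect_left(timestamps, lo):bisect_right(timestamps, hi)]

print(events_between(3, 12))  # [3, 3, 7, 9, 12]
```

A store that treats the data as an unordered set must instead filter or sort the whole collection for the same answer, which is the trade the slide is pointing at.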
  15. The RDBMS Optimizer
     "The database can know how to access data better and faster than any programmer..."
       It wasn't true
       It became true
       It isn't always true
     It only optimizes for persistent data.
  16. Section 2: History Part 2 - Relational and Post-relational
     Where we are today: OldSQL, NewSQL and NoSQL
     The finalizing of the distributed web architecture
     Rediscovery of the past, when we had purpose-built data stores of different types, with a twist
     Revisiting of old arguments
     Challenging old assumptions
  17. Database Product Innovation
  18. Column Stores and Query-biased Workloads
     Column-store databases are still RDBMSs.
     Most SQL queries do not require all columns of a table, so partitioning data by columns (vertically) will usually be better than partitioning by rows (horizontally), and data compression can be more efficient.
     Column-store databases scale up [somewhat] better than traditional RDBMSs, depending on workload, queries, etc.
     Column store <> column family.
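Why column layout compresses better can be demonstrated with run-length encoding: values within a single column are homogeneous and repetitive, so runs are long, whereas row-wise the same values interleave and the runs vanish. A toy sketch (data invented for illustration):

```python
from itertools import groupby

# Three columns of a toy table, stored column-wise.
region = ["EU", "EU", "EU", "US", "US", "US", "US", "US"]
year   = [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011]
amount = [5, 9, 3, 7, 2, 8, 4, 6]

def rle(values):
    """Run-length encode a sequence: [(value, run_length), ...]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

print(rle(region))  # [('EU', 3), ('US', 5)] -- 8 values in 2 runs
print(rle(year))    # [(2010, 4), (2011, 4)]

# Row-wise, the interleaved values give no useful runs at all:
rows = [x for triple in zip(region, year, amount) for x in triple]
print(len(rle(rows)))  # 24 -- one run per value, no compression
```

Real column stores layer dictionary, delta, and bit-packed encodings on the same principle, but the run-length case is the simplest illustration of it.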
  19. New Lamps for Old
     Google, Yahoo!, Facebook and others had data management problems that established products did not cater for: big data, unusual data structures, new workloads.
     They had money to invest and some smart engineers.
     They built their own solutions: BigTable, MapReduce, Cassandra, etc.
     In doing so, they provoked a database revolution.
     In other words, the internet happened and some people noticed.
  20. A Random Selection of Databases
     Sybase IQ, ASE; Teradata, Aster Data; Oracle, RAC; Microsoft SQL Server, PDW; IBM DB2s, Netezza; Paraccel; Kognitio; EMC/Greenplum; Oracle Exadata; SAP HANA; Infobright; MySQL; MarkLogic; Tokyo Cabinet; EnterpriseDB; LucidDB; Vectorwise; MonetDB; Exasol; Illuminate; Vertica; InfiniDB; 1010 Data; SAND; Endeca; Xtreme Data; IMS; Hive; Algebraix; Intersystems Caché; Streambase; SQLStream; Coral8; Ingres; Postgres; Cassandra; CouchDB; Mongo; Hbase; Redis; RainStor; Scalaris... and a few hundred more.
  21. Section 3: Database Discussion Topics
     The core post-relational changes in assumptions
     Key aspects of the code-database mismatch
     Reclassifying pre-relational as NoSQL
     Complex data, emergent structure, types and schemas
     Cloud and databases, uh-oh?
  22. Changing Assumptions
     One single scalable piece of reliable hardware
     You really need a schema all the time
     A handful of discrete types are all anybody will ever need, and when they need more they can code UDTs and UDFs in C++
     SQL is the optimal way to write and retrieve data
     ACID always applies
     Data integrity is a key component of a database
  23. NoSQL, New Concepts
     Maybe SQL is an unacceptable constraint.
     Maybe SQL is unnecessary for some fit-for-purpose databases, or perhaps just unimportant.
     Maybe the impedance mismatch can be avoided.
     Maybe a formal schema is a constraint.
     Maybe ACID properties can be compromised.
  24. The "Impedance Mismatch"
     The RDBMS stores data organized according to table structures.
     The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them.
     The data does not simply map to the structure it has within the database.
     Consequently, a mapping activity is necessary to get and put data.
     Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools.
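The mapping activity described above is what every ORM automates; a hand-rolled sketch of it, flattening a small object graph into flat rows and reassembling it (the classes and "tables" are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class LineItem:
    sku: str
    qty: int

@dataclass
class Order:
    order_id: int
    items: list = field(default_factory=list)

def to_rows(order):
    """Flatten the object graph into two flat 'tables' of tuples."""
    order_row = (order.order_id,)
    item_rows = [(order.order_id, i.sku, i.qty) for i in order.items]
    return order_row, item_rows

def from_rows(order_row, item_rows):
    """Reassemble the hierarchy from flat rows -- the reverse mapping."""
    order = Order(order_id=order_row[0])
    order.items = [LineItem(sku, qty) for oid, sku, qty in item_rows
                   if oid == order.order_id]
    return order

o = Order(1, [LineItem("A-1", 2), LineItem("B-7", 1)])
assert from_rows(*to_rows(o)) == o  # the round trip preserves the structure
```

The friction the slide names lives in these two functions: the containment that is implicit in the object graph has to be re-encoded as a foreign key and then reconstructed on every read.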
  25. NoSQL Directions: Technology Types
     Some NoSQL DBs do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability).
     Some NoSQL DBs deploy a distributed scale-out architecture with data redundancy.
     XML DBMSs using XQuery are NoSQL DBs.
     Some document stores are NoSQL DBs (OrientDB, Terrastore, etc.).
     Object databases are NoSQL DBs (Gemstone, Objectivity, ObjectStore, etc.).
     Key-value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.).
     Graph DBMSs (DEX, OrientDB, etc.) are NoSQL DBs.
     Large data pools (BigTable, Hbase, Mnesia, etc.) are NoSQL DBs.
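A minimal sketch of the key-value idea from the list above: the store treats every value as opaque, so the schema lives entirely in application code. This is a toy in-process stand-in, not any of the products named:

```python
import json

class KVStore:
    """Toy schema-less store: opaque bytes keyed by string."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is serialized and opaque to the store itself.
        self._data[key] = json.dumps(value).encode()

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

store = KVStore()
# Two "records" in the same logical collection with different shapes --
# nothing in the store enforces (or even knows) a schema:
store.put("user:1", {"name": "Ada", "tags": ["math"]})
store.put("user:2", {"name": "Lin", "email": "lin@example.com"})
print(store.get("user:2")["email"])  # lin@example.com
```

The flexibility and the cost are the same fact: any structural expectations have to be enforced by every program that reads the data.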
  26. The Cloud, Uh-oh
     Negative implications for shared-everything databases that have scalability needs.
     There are architectural implications and possible incompatibilities for shared-nothing databases too.
     Not at scale and at scale (concurrency, ingest volumes and frequencies, etc.) are different.
     How does the database permit dynamic provisioning, elasticity (+/-), etc.?
  27. The New Database Problems for IT
     ...are probably like old problems for people who went through the Unix client-server era.
     Best of breed, no standards for anything, "polyglot persistence" = silos on steroids, data integration challenges, shifting data movement architectures.
  28. Recognize Tradeoffs
     Read consistency vs programmatic correction
     Schema vs a program to interpret each data structure
     Standard access interface vs an API for each type of store
     Data integrity enforcement vs programmatic control
     Query performance for arbitrary queries vs planned access paths
     Space efficiency vs simplicity / latency
     Network transfer performance vs simplicity / latency
     For the primary goals of:
       Horizontal scale
       Looser coupling
       Flexibility for developers building and changing applications
  29. Information Management Through Human History
     New technology development creates...
     new methods to cope, which create...
     new information scale and availability, which creates...
  30. Big Data
  31. Big Data?
     Unstructured data isn't really unstructured. The problem is that this data is unmodeled.
  32. The Holy Grail of Databases Under Current Market Hype
     The other problem is that we're talking mostly about computation over data when we talk about "big data" and analytics, another potential mismatch.
  33. Conclusion
     Wherein all is revealed, or ignorance exposed
     Best of breed is back, baby
     Workload types and characteristics
     The importance of understanding workload in order to select technology
     Pragmatism, babies and bathwater
  34. Solving the Problem Depends on the Diagnosis
  35. Types of Workloads
     Write-biased:
       OLTP
       OLTP, batch
       OLTP, lite
       Object persistence
       Data ingest, batch
       Data ingest, real-time
     Read-biased:
       Query
       Query, simple retrieval
       Query, complex
       Query: hierarchical / object / network
       Analytic
     Mixed?
  36. The real challenge is that few systems are all one workload.
     Who said you have to write everything to one place, and read everything from the same place?
     SOA offers a partial way out, and is how many apps work.
  37. You must understand your workload: throughput and response time requirements aren't enough.
       100 simple queries accessing month-to-date data
       90 simple queries accessing month-to-date data and 10 complex queries using two years of history
       Hazard calculation for the entire customer master
     Performance problems are rarely due to a single factor.
  38. Key Query Workload Elements
     These characteristics help determine the suitability of technologies to improve query performance:
     1. Retrieval: how much data comes back?
     2. Selectivity: how much data is filtered?
     3. Repetition: how often for the same query?
     4. Concurrency: how many queries at once?
     5. Data volume: how much data is being queried?
     6. Query complexity: how many joins, aggregations, columns, filters, subselects, etc.?
     7. Computational complexity: how much computation is performed over the data?
  39. Characteristics of BI Workloads
     Workload                   | Selectivity | Retrieval       | Repetition  | Complexity
     Reporting / BI             | Moderate    | Low             | Moderate    | Moderate
     Dashboards / scorecards    | Moderate    | Low             | High        | Low
     Ad-hoc query and analysis  | Low to high | Moderate to low | Low         | Low to moderate
     Analytics (batch)          | Low         | High            | Low to high | Low*
     Analytics (inline)         | High        | Low             | High        | Low*
     Operational / embedded BI  | High        | Low             | High        | Low
     * Low for retrieving the data, high if doing analytics in SQL
  40. Choosing Hardware Architectures
     Compute and data sizes are key requirements.
     [Chart: computations (MF, GF, TF, PF) vs data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB). Regions from low to high: PC, shared everything or shared disk, shared nothing, MR and related.]
  41. Choosing Hardware Architectures
     Today's reality, and true for a while in most businesses: the bulk of the market resides at the low end of both compute and data volume.
     [Same chart as the previous slide, highlighting the low-compute, low-volume region.]
  42. Choosing Hardware Architectures
     ...but analytics pushes many things into the MPP zone.
     [Same chart, showing workloads moving toward the shared-nothing / MPP region.]
  43. Evaluating DB Technology
     1. Define the key problems: response time, throughput, scalability?
     2. Examine the workloads and their requirements
     3. Match those to suitable technologies
     4. Look for vendors using those technologies
     5. Evaluate on real data with real workloads
     Copyright Third Nature, Inc.
  44. Thank You for Your Attention
  45. Back-Up Slides
  46. NoSQL Directions
     Some NDBMS do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability).
     Some NDBMS deploy a distributed scale-out architecture with data redundancy.
     XML DBMS using XQuery are NDBMS.
     Some document stores are NDBMS (OrientDB, Terrastore, etc.).
     Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.).
     Key-value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.).
     Graph DBMS (DEX, OrientDB, etc.) are NDBMS.
     Large data pools (BigTable, Hbase, Mnesia, etc.) are NDBMS.
  47. The SQL Barrier
     SQL has:
       DDL (for data definition)
       DML (for Select, Project and Join)
       But it has no MML (math) or TML (time)
     Usually result sets are brought to the client for further analytical manipulation, but this creates problems.
     Alternatively, doing all analytical manipulation in the database creates problems.
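The "result sets brought to the client" pattern above can be sketched directly: SQL fetches the ordered rows, and a time-series calculation that standard SQL of the era lacked (here a moving average) runs client-side (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (t INTEGER, price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?)",
                 [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0), (5, 15.0)])

# DML gets the ordered result set out of the database...
prices = [p for (p,) in conn.execute("SELECT price FROM ticks ORDER BY t")]

# ...and the math/time manipulation happens on the client.
def moving_average(xs, window):
    return [sum(xs[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(xs))]

print(moving_average(prices, 3))  # [11.0, 12.0, 13.0]
```

The problems the slide alludes to follow from this split: every row crosses the wire before any math starts, and the analytical logic lives outside the database's optimizer and security perimeter.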
  48. Discussion Topics
     If not covered in history through today:
       The core post-relational change in assumptions
       NoSQL core drivers, persistence in cloud, finalizing of web architecture, SOAizing
       A NoSQL classification list (types and projects/products)
       Key aspects of the OR mismatch
     Complex data and emergent structure
     Database technology types
     A giant list of databases
     Cloud and databases, uh-oh?