One Size Doesn't Fit All: The New Database Revolution



Slides from a webcast for the database revolution research report (the report will be available on the report download site).

Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out!

Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.

Webcast video and audio will be available on the report download site as well.



  1. One Size Doesn’t Fit All: The New Database Revolution (Mark Madsen & Robin Bloor)
  2. Your Host
  3. Analysts and Host: Bloor, Madsen
  4. Introduction: Significant and revolutionary changes are taking place in database technology. In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project. This is the final webinar in a series of webinars and research activities that have comprised part of the project. All published research will be made available through our web site.
  5. Sponsors of This Research
  6. General Webinar Structure: Market Changes, Database Changes (Some of the Findings); Workloads, Characteristics, Parameters; A General Discussion of Performance; How to Select a Database
  7. Market Changes, Database Changes
  8. Database Performance Bottlenecks: CPU saturation; memory saturation; disk I/O channel saturation; locking; network saturation; parallelism – inefficient load balancing
  9. Big Data = Scale Out
  10. Cloud Hardware Architecture • It’s a scale-out model: uniform virtual node building blocks. • This is the future of software deployments, albeit with increasing node sizes, so paying attention to early adopters today will pay off. • This implies that an MPP database architecture will be needed for scale.
  11. Multiple Database Roles: Now there are more...
  12. The Origin of Big Data
  13. Let’s Stop Using the Term NoSQL: As the graph indicates, it’s just not helpful. In fact it’s downright confusing.
  14. NoSQL Directions: Some NDBMS do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability). Some NDBMS deploy a distributed scale-out architecture with data redundancy. XML DBMS using XQuery are NDBMS. Some document stores are NDBMS (OrientDB, Terrastore, etc.). Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.). Key-value stores = schema-less stores (Cassandra, MongoDB, BerkeleyDB, etc.). Graph DBMS (DEX, OrientDB, etc.) are NDBMS. Large data pools (BigTable, HBase, Mnesia, etc.) are NDBMS.
  15. The Joys of SQL? SQL: very good for set manipulation. Works for OLTP and many query environments. Not good for nested data structures (documents, web pages, etc.). Not good for ordered data sets. Not good for data graphs (networks of values).
  16. The “Impedance Mismatch”: The RDBMS stores data organized according to table structures. The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them. The data does not simply map to the structure it has within the database. Consequently a mapping activity is necessary to get and put data. Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools.
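The mapping activity described on the slide above can be sketched in a few lines of Python; the Order class, the row shapes, and all names here are invented purely for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical object the OO programmer works with: a nested structure
# with behavior, which has no direct single-table representation.
@dataclass
class Order:
    order_id: int
    customer: str
    lines: list = field(default_factory=list)  # nested detail rows

    def total(self):
        return sum(qty * price for (qty, price) in self.lines)

# Flat rows as an RDBMS returns them: one table for orders, one for lines.
order_rows = [(1, "acme")]
line_rows = [(1, 2, 10.0), (1, 1, 5.0)]  # (order_id, qty, price)

# The "mapping activity": reassemble objects from the flat, joined tables.
def rows_to_objects(order_rows, line_rows):
    orders = {oid: Order(oid, cust) for (oid, cust) in order_rows}
    for (oid, qty, price) in line_rows:
        orders[oid].lines.append((qty, price))
    return list(orders.values())

orders = rows_to_objects(order_rows, line_rows)
print(orders[0].total())  # 25.0
```

This hand-written glue is exactly what ORM layers automate, and it is where the hierarchies, types, and language-binding friction the slide mentions tends to surface.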
  17. The SQL Barrier: SQL has DDL (for data definition) and DML (for Select, Project and Join), but it has no MML (math) or TML (time). Usually result sets are brought to the client for further analytical manipulation, but this creates problems. Alternatively, doing all analytical manipulation in the database creates problems.
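A small sketch of the client-side pattern the slide describes, using Python’s standard sqlite3 module and an invented readings table: SQL does the set work, while an ordered, time-style calculation (a moving average) is done in the client after the result set comes back. (Newer SQL dialects add window functions that narrow this gap.)

```python
import sqlite3

# A toy in-memory table; table and column names are invented for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(1, 10.0), (2, 12.0), (3, 11.0), (4, 15.0)])

# DML handles selection and ordering...
rows = conn.execute("SELECT value FROM readings ORDER BY day").fetchall()

# ...but the time-aware manipulation (a moving average over an ordered
# series) happens client-side, outside SQL.
values = [v for (v,) in rows]
window = 2
moving_avg = [sum(values[i:i + window]) / window
              for i in range(len(values) - window + 1)]
print(moving_avg)  # [11.0, 11.5, 13.0]
```

Shipping the whole result set to the client works at small scale; at large scale it is precisely the problem the slide flags.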
  18. Hadoop/MapReduce: Hadoop is a parallel processing environment. Map/Reduce is a parallel processing framework. HBase turns Hadoop into a database of a kind. Hive adds an SQL capability. Pig adds analytics.
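The Map/Reduce pattern can be sketched in plain Python; this is a single-process illustration of the map/shuffle/reduce flow, not Hadoop’s actual API:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "mapper" turns its input split into (key, value) pairs.
def mapper(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group all values by key (the framework does this in Hadoop).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each "reducer" collapses one key's values to one result.
def reducer(key, values):
    return (key, sum(values))

splits = ["big data big", "data scale"]  # made-up input splits
pairs = chain.from_iterable(mapper(s) for s in splits)
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'scale': 1}
```

The parallelism comes from running many mappers and reducers on different nodes at once; the shuffle between them is what the framework, not the programmer, provides.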
  19. Market Forces: A new set of products appears. They include some fundamental innovations. A few are sufficiently popular to last. Fashion and marketing drive greater adoption. Product defects begin to be addressed. They eventually challenge the dominant products.
  20. Market forces affecting database choice: Performance: trouble doing what you already do today ▪ poor response times ▪ not meeting data availability requirements. Scalability: doing more of what you do today ▪ adding users, processing more data. Capability: doing something new with your data ▪ data mining, recommendations, real‐time. Cost or complexity: working more efficiently ▪ consolidating / rehosting to simplify and reduce cost. What’s desired is possible but limited by the cost of growing and supporting the existing environment.
  21. Relational has a good conceptual model, but a prematurely standardized implementation. The relational database is the franchise technology for storing and retrieving data, but… 1. Global, static schema model. 2. No rich typing system. 3. No concept of ordering, creating challenges with e.g. time series. 4. Many are not a good fit for network parallel computing, aka cloud. 5. Limited API in atomic SQL statement syntax & simple result set return. 6. Poor developer support.
  22. Big data? Unstructured data isn’t really unstructured. The problem is that this data is unmodeled. The real challenge is complexity.
  23. Text, Objects and Data Don’t Always Fit Together. So this is what they meant by “impedance mismatch”.
  24. Many new choices, one way to look at them
  25. What About Analytics? Machine learning; visualization; statistics; GIS; advanced analytic methods; information theory & IR; numerical methods; rules engines & constraint programming; text mining & text analytics.
  26. The holy grail of databases under current market hype: A key problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, a potential mismatch for both relational and NoSQL.
  27. Technologies are not perfect replacements for one another. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.
  28. Scalability and performance are not the same thing
  29. Performance measures. Throughput: the number of tasks completed in a given time period. A measure of how much work is or can be done by a system in a set amount of time, e.g. TPM or data loaded per hour. It’s easy to increase throughput without improving response time.
  30. Performance measures. Response time: the speed of a single task. Response time is usually the measure of an individual’s experience using a system. Response time = time interval / throughput.
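The distinction between the two measures can be made concrete with a toy calculation; every number below is made up for illustration:

```python
# Two hypothetical systems processing the same batch of tasks.
tasks = 1000
task_time_ms = 100           # each task takes 100 ms of work

# System A: one worker, tasks run one after another.
a_response_ms = task_time_ms                          # 100 ms per task
a_throughput = tasks / (tasks * task_time_ms / 1000)  # 10 tasks/sec

# System B: ten workers in parallel; a single task is no faster.
workers = 10
b_response_ms = task_time_ms                          # still 100 ms
b_throughput = workers * 1000 / task_time_ms          # 100 tasks/sec

# Throughput rose 10x while response time did not move: it is easy to
# increase throughput without improving response time.
print(a_throughput, b_throughput)  # 10.0 100.0
```

This is why sizing a database on throughput numbers alone says nothing about what an individual user will experience.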
  31. Scalability vs throughput vs response time. Scalability = consistent performance for a task over an increase in a scale factor.
  32. Scale: Data Volume. The different ways people count make establishing rules of thumb for sizing hard. How do you measure it? ▪ Row counts ▪ Transaction counts ▪ Data size ▪ Raw data vs loaded data ▪ Schema objects. People still have trouble scaling for databases as large as a single PC hard drive.
  33. Scale: Concurrency (active and passive)
  34. Scalability relationships: As concurrency increases, response time (usually) gets worse. This can be addressed somewhat via workload management tools. When a system hits a bottleneck, response time and throughput will often get worse, not just level off.
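The bottleneck effect can be illustrated with the textbook M/M/1 queueing formula, response_time = service_time / (1 - utilization); this model and its numbers are assumptions brought in for illustration, not something from the slides. Response time does not degrade linearly: near saturation it blows up, which is the “worse, not just level off” behavior.

```python
# Illustrative only: a single-server queue with a made-up service time.
service_time = 0.05   # seconds of work per request

responses = {}
for utilization in (0.10, 0.50, 0.80, 0.90, 0.99):
    # M/M/1: average response time grows as 1 / (1 - utilization).
    responses[utilization] = service_time / (1 - utilization)
    print(f"{utilization:.0%} busy -> {responses[utilization] * 1000:7.1f} ms")
```

Going from 50% to 90% busy roughly quintuples response time in this model, while going from 90% to 99% multiplies it tenfold again: the shape of the curve, not the exact numbers, is the point.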
  35. Scale: Computational Complexity
  36. A key point worth remembering: performance over size ≠ performance over complexity. Analytics performance is about the intersection of both. Database performance for BI is mostly related to size and query complexity. Size, computational complexity and concurrency are the three dimensions that constrain a product’s performance. Workloads fall somewhere along all three.
  37. Solving Your Problem Depends on the Diagnosis
  38. Three General Workloads. Online Transaction Processing ▪ read, write, update ▪ user concurrency is the common performance limiter ▪ low data, compute complexity. Business Intelligence / Data Warehousing ▪ assumed to be read‐only, but really read-heavy and write-heavy, usually separated in time ▪ data size is the common performance limiter ▪ high data complexity, low compute complexity. Analytics ▪ read, write ▪ data size and complexity of algorithm are the limiters ▪ moderate data, high compute complexity.
  39. Types of workloads. Write‐biased: ▪ OLTP ▪ OLTP, batch ▪ OLTP, lite ▪ Object persistence ▪ Data ingest, batch ▪ Data ingest, real‐time. Read‐biased: ▪ Query ▪ Query, simple retrieval ▪ Query, complex ▪ Query‐hierarchical / object / network ▪ Analytic. Mixed: inline analytic execution, operational BI.
  40. Technology choice depends on workload & need. Optimizing for: ▪ response time? ▪ throughput? ▪ both? Concerned about rapid growth in data? Unpredictable spikes in use? Extremely low latency (in or out) requirements? Bulk loads or incremental inserts and/or updates?
  41. Important workload parameters to know (built up incrementally across slides 41–46): • Read‐intensive vs. write‐intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long data latency • Predictable vs. unpredictable data access patterns • Simple vs. complex data types
  47. You must understand your workload mix: throughput and response time requirements aren’t enough. ▪ 100 simple queries accessing month‐to‐date data ▪ 90 simple queries accessing month‐to‐date data and 10 complex queries using two years of history ▪ Hazard calculation for the entire customer master ▪ Performance problems are rarely due to a single factor.
  48. Selectivity and number of columns queried. Row store or column store, indexed or not? Chart from “The Mimicking Octopus: Towards a one-size-fits-all Database Architecture”, Alekh Jindal.
  49. Characteristics of query workloads
      Workload                   | Selectivity | Retrieval       | Repetition  | Complexity
      Reporting / BI             | Moderate    | Low             | Moderate    | Moderate
      Dashboards / scorecards    | Moderate    | Low             | High        | Low
      Ad‐hoc query and analysis  | Low to high | Moderate to low | Low         | Low to moderate
      Analytics (batch)          | Low         | High            | Low to high | Low*
      Analytics (inline)         | High        | Low             | High        | Low*
      Operational / embedded BI  | High        | Low             | High        | Low
      * Low for retrieving the data, high if doing analytics in SQL
  50. Characteristics of read‐write workloads
      Workload           | Selectivity     | Retrieval        | Repetition | Complexity
      Online OLTP        | High            | Low              | High       | Low
      Batch OLTP         | Moderate to low | Moderate to high | High       | Moderate to high
      Object persistence | High            | Low              | High       | Low
      Bulk ingest        | Low (write)     | n/a              | High       | Low
      Realtime ingest    | High (write)    | n/a              | High       | Low
      With ingest workloads we’re dealing with write-only, so selectivity and retrieval don’t apply in the same way; instead it’s write volume.
  51. Workload parameters and DB types at data scale. (The slide showed a matrix mapping DB types — standard RDBMS, parallel RDBMS, NoSQL (kv, dht, obj), Hadoop, streaming database — against the workload parameters: write‐biased, read‐biased, updateable data, eventual consistency ok?, unpredictable query path, compute intensive.) You see the problem: it’s an intersection of multiple parameters.
  52. Problem: Architecture Can Define Options
  53. A general rule for the read‐write axes: As workloads increase in both intensity and complexity, we move into a realm of specialized databases adapted to specific workloads. (Chart: write intensity on one axis and read intensity on the other, with OldSQL near the origin, NoSQL toward high write intensity, and NewSQL toward high read intensity.)
  54. In general… Relational row store databases for conventionally tooled low to mid‐scale OLTP. Relational databases for ACID requirements. Parallel databases (row or column) for unpredictable or variable query workloads. Specialized databases for complex data query workloads. NoSQL (KVS, DHT) for high-scale OLTP. NoSQL (KVS, DHT) for low-latency read‐mostly data access. Parallel databases (row or column) for analytic workloads over tabular data. NoSQL / Hadoop for batch analytic workloads over large data volumes.
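As an illustration only, the rules of thumb above can be encoded as a simple lookup; the workload and trait keys below are invented for this sketch and are not an exhaustive decision procedure:

```python
# Rules of thumb from the slide, encoded as (workload, trait) -> category.
RULES = {
    ("oltp", "low_to_mid_scale"):   "relational row store",
    ("oltp", "acid"):               "relational database",
    ("oltp", "high_scale"):         "NoSQL (KVS, DHT)",
    ("query", "unpredictable"):     "parallel database (row or column)",
    ("query", "complex_data"):      "specialized database",
    ("read_mostly", "low_latency"): "NoSQL (KVS, DHT)",
    ("analytics", "tabular"):       "parallel database (row or column)",
    ("analytics", "batch_large"):   "NoSQL / Hadoop",
}

def suggest(workload, trait):
    # Fall through to "evaluate" when no rule of thumb applies.
    return RULES.get((workload, trait), "no single answer: evaluate")

print(suggest("oltp", "high_scale"))        # NoSQL (KVS, DHT)
print(suggest("analytics", "batch_large"))  # NoSQL / Hadoop
```

Real selection is an intersection of many parameters at once, which is exactly why the slides that follow walk through a fuller questionnaire rather than a lookup table.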
  55. How To Select A Database
  56. How To Select A Database - (1) 1. What are the data management requirements and policies (if any) in respect of: - Data security (including regulatory requirements)? - Data cleansing? - Data governance? - Deployment of solutions in the cloud? - If a deployment environment is mandated, what are its technical characteristics and limitations? (Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures.) 2. What kind of data will be stored and used? - Is it structured or unstructured? - Is it likely to be one big table or many tables?
  57. How To Select A Database - (2) 3. What are the data volumes expected to be? - What is the expected daily ingest rate? - What will the data retention/archiving policy be? - How big do we expect the database to grow? (Estimate a range.) 4. What are the applications that will use the database? - Estimate by user numbers and transaction numbers. - Roughly classify transactions as OLTP, short query, long query, long query with analytics. - What are the expectations in respect of growth of usage (per user) and growth of user population? 5. What are the expected service levels? - Classify according to availability service levels. - Classify according to response time service levels. - Classify on throughput where appropriate.
  58. How To Select A Database - (3) 6. What is the budget for this project and what does that cover? 7. What is the outline project plan? - Timescales. - Delivery of benefits. - When are costs incurred? 8. Who will make up the project team? - Internal staff. - External consultants. - Vendor consultants. 9. What is the policy in respect of external support, possibly including vendor consultancy for the early stages of the project?
  59. How To Select A Database - (4) 10. What are the business benefits? - Which ones can be quantified financially? - Which ones can only be guessed at (financially)? - Are there opportunity costs?
  60. A random selection of databases: Sybase IQ, ASE; Teradata, Aster Data; Oracle, RAC; Microsoft SQLServer, PDW; IBM DB2s, Netezza; Paraccel; Kognitio; EMC/Greenplum; Oracle Exadata; SAP HANA; Infobright; MySQL; MarkLogic; Tokyo Cabinet; EnterpriseDB; LucidDB; Vectorwise; MonetDB; Exasol; Illuminate; Vertica; InfiniDB; 1010 Data; SAND; Endeca; Xtreme Data; IMS; Hive; Algebraix; Intersystems Caché; Streambase; SQLStream; Coral8; Ingres; Postgres; Cassandra; CouchDB; Mongo; HBase; Redis; RainStor; Scalaris. And a few hundred more…
  61. Product Selection: Preliminary investigation. Short-list (usually arrived at by elimination); be sure to set the goals and control the process. Evaluation by technical analysis and modeling. Evaluation by proof of concept. Do not be afraid to change your mind. Negotiation.
  62. Conclusion: Wherein all is revealed, or ignorance exposed.
  63. Thank You For Your Attention