Fit For Purpose: The New Database Revolution Findings Webcast

Slides from the Live Webcast on Apr. 25, 2012

Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget?

Robin Bloor, Ph.D., Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc., will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice on how to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.

For more information visit: http://www.databaserevolution.com

Watch this and the entire series at: http://www.youtube.com/playlist?list=PLE1A2D56295866394

Published in: Technology, Business
Fit For Purpose: The New Database Revolution Findings Webcast

  1. 1. One Size Doesn’t Fit All: The database revolution. April 25, 2012. Mark R. Madsen, http://ThirdNature.net; Robin Bloor, http://Bloorgroup.com. Wednesday, April 25, 12
  2. 2. Your Host: Eric.kavanagh@bloorgroup.com
  3. 3. Analysts: Host, Bloor, Madsen
  4. 4. Introduction: Significant and revolutionary changes are taking place in database technology. In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project. This is the final webinar in a series of webinars and research activities that have comprised part of the project. All published research will be made available through our web site: Databaserevolution.com
  5. 5. Sponsors of This Research
  6. 6. General Webinar Structure: Market Changes, Database Changes (Some Of The Findings); Let’s Talk About Performance; How to Select A Database
  7. 7. Market Changes, Database Changes
  8. 8. Database Performance Bottlenecks: CPU saturation; memory saturation; disk I/O channel saturation; locking; network saturation; parallelism: inefficient load balancing
  9. 9. Multiple Database Roles: Transactional Systems; BI and Analytics Systems. [Diagram: apps and BI apps over structured and unstructured data, operational data stores, staging area, data warehouse, data marts, OLAP cubes, personal data stores and content stores, each held in a file or DBMS.] Now there are more...
  10. 10. The Origin of Big Data: Corporate Databases + Unstructured Data + Personal Data + Supply Chain & Cust. Data + Web Data + Social Network Data + Embedded Systems Data
  12. 12. Big Data = Scale Out: The query is decomposed into a sub-query for each node. The columnar database scales up and out by adding more servers. Data is compressed and partitioned on disk by column and by range. [Diagram: a query against a database table split into Sub Query 1 and Sub Query 2 across servers, each with CPUs, common memory, cache and data.]
  13. 13. Let’s Stop Using the Term NoSQL: As the graph indicates, it’s just not helpful. In fact it’s downright confusing. [Graph: oldsql, newsql and nosql regions plotted by data volume against data structure: Single Table, Star Schema, Snow Flake Schema, TNF Schema, OLAP, Nested Data, Graph Data, Complex Data.]
  15. 15. NoSQL Directions: Some NDBMS do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability). Some NDBMS deploy a distributed scale-out architecture with data redundancy. XML DBMS using XQuery are NDBMS. Some document stores are NDBMS (OrientDB, Terrastore, etc.). Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.). Key value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.). Graph DBMS (DEX, OrientDB, etc.) are NDBMS. Large data pools (BigTable, Hbase, Mnesia, etc.) are NDBMS.
  16. 16. The Joys of SQL? SQL: very good for set manipulation. Works for OLTP and many query environments. Not good for nested data structures (documents, web pages, etc.). Not good for ordered data sets. Not good for data graphs (networks of values).
  18. 18. The “Impedance Mismatch”: The RDBMS stores data organized according to table structures. The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them. The data does not simply map to the structure it has within the database. Consequently a mapping activity is necessary to get and put data. Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools.
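
The mapping activity described on this slide can be sketched in a few lines. This is only an illustration, not something from the webcast: the `Order` class, the `orders` and `order_items` tables and the helper names are all hypothetical, and SQLite stands in for any RDBMS.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    """Hypothetical application object with a nested list of (sku, qty) items."""
    order_id: int
    customer: str
    items: list = field(default_factory=list)


def save_order(conn, order):
    """The 'put' half of the mapping: flatten one nested object into two tables."""
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(order.order_id, sku, qty) for sku, qty in order.items],
    )


def load_order(conn, order_id):
    """The 'get' half: rebuild the object from two separate result sets."""
    customer = conn.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()[0]
    items = conn.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY sku", (order_id,)
    ).fetchall()
    return Order(order_id, customer, items)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

original = Order(1, "Marge Inovera", [("A-100", 2), ("B-200", 1)])
save_order(conn, original)
restored = load_order(conn, 1)
```

One object becomes three rows in two tables, and reading it back takes two queries plus reassembly code. That glue is the mismatch the slide refers to.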
  20. 20. The SQL Barrier: SQL has DDL (for data definition) and DML (for Select, Project and Join), but it has no MML (Math) or TML (Time). Usually result sets are brought to the client for further analytical manipulation, but this creates problems. Alternatively, doing all analytical manipulation in the database creates problems.
  22. 22. Hadoop/MapReduce: Hadoop is a parallel processing environment. Map/Reduce is a parallel processing framework. Hbase turns Hadoop into a database of a kind. Hive adds an SQL capability. Pig adds analytics. [Diagram: Map, Partition, Combine, Reduce stages; a scheduler; nodes running mapping processes over HDFS and reducing processes, each with backup/recovery.]
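
The four stages named in the diagram (Map, Partition, Combine, Reduce) can be illustrated with a single-process word count. This is a toy sketch of the programming model only, not Hadoop's API; every function name here is hypothetical, and each input split stands in for one mapper node.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_reducers):
    """Partition: choose a reducer for a key with a deterministic toy hash."""
    return sum(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combine: pre-aggregate on the mapper side to cut data sent to reducers."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_phase(key, values):
    """Reduce: fold all partial counts for one key into a total."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split plays the role of one mapper node
        mapped = [pair for line in split for pair in map_phase(line)]
        buckets = defaultdict(list)
        for key, value in mapped:
            buckets[partition(key, num_reducers)].append((key, value))
        for reducer, pairs in buckets.items():
            for key, value in combine(pairs):
                reducer_inputs[reducer][key].append(value)
    result = {}
    for inputs in reducer_inputs:  # each entry plays the role of one reducer node
        for key, values in inputs.items():
            word, total = reduce_phase(key, values)
            result[word] = total
    return result


splits = [["big data scale out", "scale out"], ["big data"]]
counts = run_job(splits)
```

Each key is routed to exactly one reducer, which is what lets the reduce step run in parallel without coordination.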
  24. 24. Market Forces: A new set of products appears. They include some fundamental innovations. A few are sufficiently popular to last. Fashion and marketing drive greater adoption. Product defects begin to be addressed. They eventually challenge the dominant products.
  25. 25. Let’s Talk About Performance
  26. 26. Performance and Scalability
  27. 27. Scalability and performance are not the same thing
  28. 28. Performance measures. Throughput: the number of tasks completed in a given time period. A measure of how much work is or can be done by a system in a set amount of time, e.g. TPM or data loaded per hour. It’s easy to increase throughput without improving response time.
  29. 29. Performance measures. Response time: the speed of a single task. Response time is usually the measure of an individual’s experience using a system. Response time = time interval / throughput
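
A small worked example may help separate the two measures. The numbers below are invented for illustration, and the helper names are hypothetical: adding workers multiplies throughput while each individual task takes exactly as long as before.

```python
def throughput(completed_tasks, interval_seconds):
    """Tasks completed per second over a measurement interval."""
    return completed_tasks / interval_seconds


def mean_response_time(task_seconds):
    """Average time a single task takes, as one user would experience it."""
    return sum(task_seconds) / len(task_seconds)


# Assumed scenario: every task takes 2 seconds of work.
# One worker finishes 10 tasks back to back in 20 seconds.
one_worker = throughput(10, 20.0)
rt_one = mean_response_time([2.0] * 10)

# Four workers in parallel finish 40 such tasks in the same 20 seconds.
four_workers = throughput(40, 20.0)
rt_four = mean_response_time([2.0] * 40)
```

Throughput rises from 0.5 to 2.0 tasks per second, yet the response time stays at 2 seconds in both cases.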
  30. 30. Scalability vs throughput vs response time: Scalability = consistent performance for a task over an increase in a scale factor.
  31. 31. Three possible scale factors: computations, number of users, amount of data.
  32. 32. Scale: Data Volume. The different ways people count make establishing rules of thumb for sizing hard. How do you measure it? ▪ Row counts ▪ Transaction counts ▪ Data size ▪ Raw data vs loaded data ▪ Schema objects. People still have trouble scaling for databases as large as a single PC hard drive.
  33. 33. Scale: Concurrency (active and passive)
  34. 34. Scalability relationships: As concurrency increases, response time (usually) gets worse. This can be addressed somewhat via workload management tools. When a system hits a bottleneck, response time and throughput will often get worse, not just level off.
  35. 35. “Linear Scalability”: This is the part of the chart most vendors show. If you’re lucky they leave the bottom axis on so you know where their system flatlines.
  36. 36. Scale: Computational Complexity
  37. 37. A key point worth remembering: Performance over size <> performance over complexity. Analytics performance is about the intersection of both. Database performance for BI is mostly related to size and query complexity.
  38. 38. SOME TECHNOLOGY STUFF
  39. 39. Large Memories and Large Databases: Not as fast as you expect because of how databases were designed (optimized for small memories and disk access). For example: sequential scans and cache serialization. [Diagram: 512GB DB buffer cache; 1B rows, 100/block = 640GB table; LRU overwrites older blocks while part of the table is still unread.]
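
The sequential-scan problem in the diagram can be reproduced with a toy LRU cache. This is a sketch under simplifying assumptions: a pure LRU policy, one block standing in for 1 GB (so 512 cache blocks and a 640-block table), and a hypothetical `scan_hit_ratio` helper. Real buffer managers often use scan-resistant variants.

```python
from collections import OrderedDict


def scan_hit_ratio(table_blocks, cache_blocks, passes=3):
    """Repeatedly scan a table front to back through an LRU cache; return hit ratio."""
    cache = OrderedDict()
    hits = reads = 0
    for _ in range(passes):
        for block in range(table_blocks):
            reads += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            else:
                if len(cache) >= cache_blocks:
                    cache.popitem(last=False)  # evict the least recently used block
                cache[block] = True
    return hits / reads


# Scaled-down stand-ins for the slide's 512GB cache and 640GB table.
oversized = scan_hit_ratio(table_blocks=640, cache_blocks=512)
fits = scan_hit_ratio(table_blocks=512, cache_blocks=512)
```

With the table only 25% larger than the cache, every repeated scan misses on every block, because each block is evicted shortly before it is needed again. When the table fits, only the first pass misses.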
  40. 40. In-Memory Databases Today: 1. Maybe not as fast as you think. Depends entirely on the database (e.g. VectorWise). 2. Applied mainly to shared-everything systems. 3. Very large memories are more applicable to shared-nothing than shared-memory systems. 4. Still an expensive way to get performance. Shared-everything is box-limited (e.g. 2 TB max); shared-nothing is limited by node scaling (e.g. 16 nodes, 512GB per node = 8TB).
  41. 41. Hardware changes enable new software models: The extra CPU allows us to do things in software that we avoided in the past because of scarce resources. Compression techniques and columnar database architectures that once consumed too many resources are now possible.
  42. 42. Improving Query Performance: Columnar Databases
        ID  Name            Salary    Position
        1   Marge Inovera   $150,000  Statistician
        2   Anita Bath      $120,000  Sewer inspector
        3   Ivan Awfulitch  $166,000  Dermatologist
        4   Nadia Geddit    $36,000   DBA
      In a row-store model these rows would be stored in sequential order as shown here, packed into a block. In a column store they would be divided into columns and stored in different blocks.
  43. 43. Inserting data into a columnar database: Each column is stored in its own set of blocks, written to disk separately. Extra work for writes over row store, update complexity, delete complexity.
  44. 44. Reading from a columnar database: SELECT * FROM emp WHERE ID = 1 takes 4 reads, extract & stitch.
  45. 45. Column elimination and I/O: SELECT AVG(salary) FROM emp takes 1 read.
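
The read counts on the last two slides follow directly from the layout. The sketch below models one block per column, which is an assumption made to match the slides' four-row example; the function names are hypothetical.

```python
ROWS = [
    (1, "Marge Inovera", 150000, "Statistician"),
    (2, "Anita Bath", 120000, "Sewer inspector"),
    (3, "Ivan Awfulitch", 166000, "Dermatologist"),
    (4, "Nadia Geddit", 36000, "DBA"),
]
COLUMNS = ("id", "name", "salary", "position")

# Column store: each column lives in its own block (assumed: one block per column).
column_blocks = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMNS)}


def select_star_by_id(emp_id):
    """SELECT * FROM emp WHERE ID = ?  Returns (row, blocks_read)."""
    blocks_read = 1  # read the id block to locate the row position
    position = column_blocks["id"].index(emp_id)
    values = [emp_id]
    for name in COLUMNS[1:]:
        values.append(column_blocks[name][position])  # extract from each column block
        blocks_read += 1
    return tuple(values), blocks_read  # stitched back into a row


def avg_salary():
    """SELECT AVG(salary) FROM emp  Returns (average, blocks_read)."""
    salaries = column_blocks["salary"]  # the other three blocks are never touched
    return sum(salaries) / len(salaries), 1


row, reads_for_row = select_star_by_id(1)
average, reads_for_avg = avg_salary()
```

Fetching a whole row touches all four column blocks, while the aggregate touches one. In a row store the balance is reversed: one block holds the full row, but the aggregate must read every column of every row.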
  46. 46. How do we scale performance for queries? Make the CPU faster: faster CPUs mean quicker response time and increased throughput. Add CPUs: more CPUs mean more throughput. Parallelize query execution: parallel query execution improves response time but it consumes more resources, reducing concurrency and possibly throughput.
  47. 47. Early query performance scaling: table partitioning. Table partitioning distributes rows across table partitions by range, hash or round robin when you insert or load the data. [Diagram: a partitioning function fn routing rows to Q1, Q2, Q3 and Q4 Sales Table partitions.]
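
The three distribution methods named on this slide can be sketched as small routing functions. The sales rows and function names are made up for illustration, and the quarterly range scheme mirrors the Q1 to Q4 diagram.

```python
import itertools
from datetime import date


def range_partition(sale_date):
    """Range: route by calendar quarter, 0..3 for Q1..Q4."""
    return (sale_date.month - 1) // 3


def hash_partition(key, num_partitions):
    """Hash: route by a function of the key (a plain modulo for integer keys)."""
    return key % num_partitions


def round_robin(num_partitions):
    """Round robin: route rows in arrival order, ignoring their content."""
    return itertools.cycle(range(num_partitions))


# Hypothetical (sale_id, sale_date) rows.
sales = [
    (101, date(2012, 1, 15)),
    (102, date(2012, 4, 25)),
    (103, date(2012, 8, 1)),
    (104, date(2012, 12, 31)),
    (105, date(2012, 2, 1)),
]

by_range = [range_partition(d) for _, d in sales]
by_hash = [hash_partition(sale_id, 4) for sale_id, _ in sales]
rr = round_robin(4)
by_round_robin = [next(rr) for _ in sales]
```

Range partitioning lets a query on one quarter skip the other partitions; hash and round robin spread rows evenly but give no such pruning on date.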
  48. 48. Scale-up vs. Scale-out Parallelism: Uniprocessor environments required chip upgrades. SMP servers can grow to a point, then it’s a forklift upgrade to a bigger box. MPP servers grow by adding more nodes. (a) Scaling up with a larger server (b) Scaling out with many small servers. Copyright Third Nature, Inc.
  49. 49. Sharding, aka Partitioning at the Node Level: Sharding is basically horizontal partitioning applied across multiple database servers. Each node holds a (hopefully) self-consistent portion of the database. Good as long as queried data lives on a single node. [Diagram: query redirect; one large database = several smaller databases.]
  50. 50. Sharding, Databases and Queries: What happens when you need to scan a full table or join tables across nodes? Multiple queries and stitching at the application level. Sharding works well for fixed access paths, uniform query plans, and data sets that can be isolated. Mainly this describes an OLTP-style workload.
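
The contrast between the two sharding slides can be shown with in-memory dictionaries standing in for database servers. This is a minimal sketch; the modulo routing rule, the customer data and all names are assumptions for illustration.

```python
NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one server


def shard_for(customer_id):
    """Query redirect: pick the single shard that owns this key."""
    return customer_id % NUM_SHARDS


def put(customer_id, balance):
    shards[shard_for(customer_id)][customer_id] = balance


def get(customer_id):
    """Keyed lookup: returns (balance, shards_touched)."""
    return shards[shard_for(customer_id)][customer_id], 1


def total_balance():
    """Full scan: one query per shard, stitched together by the application."""
    partials = [sum(shard.values()) for shard in shards]
    return sum(partials), len(shards)


for customer_id, balance in [(1, 100), (2, 250), (3, 75), (4, 300), (5, 25)]:
    put(customer_id, balance)

lookup = get(4)
scan = total_balance()
```

The keyed lookup touches one shard no matter how many shards exist, which is why the approach suits OLTP-style access. The aggregate touches all of them, and the work of combining partial results lands in application code.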
  51. 51. Cloud Hardware Architecture: It’s a scale-out model. Uniform virtual node building blocks. This is the future of software deployments, albeit with increasing node sizes, so paying attention to early adopters today will pay off. This implies that an MPP database architecture will be needed for scale.
  52. 52. MPP Database Architecture: Leader node(s) used by some. Worker nodes. High speed interconnect. Some use separate loader nodes. Some databases are symmetric (all nodes are the same). Some allow mixed worker node sizes. Some are leaderless. Some problems with leaders, loaders, e.g. less automated management of the environment, treating bottlenecks.
  53. 53. Key to MPP: data distribution. Single logical view of a table. Table data is evenly spread across all nodes. The good: scalability to petabyte range, much faster filtering and selection on scans. The bad: data skew (values, not rowcounts), aggregate function bottlenecks, concurrency challenges, complex multi-table joins with unlike distributions.
  54. 54. MPP challenges mostly hinge on data distribution: Imagine fact & dim tables spread across all nodes. You need to get dim data to each node to join with fact rows stored there. Cross-node joins result in data shipping. This is where inter-node latency, data skew, node skew can bog down query performance. The real test of an MPP database is not how fast it can scan data. That’s easy. Test joins in a PoC. [Diagram: Node 1 and Node 2, each holding part of the fact table and part of the dim table.]
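
A rough way to see why distribution keys matter for joins is to count how many fact rows sit on a different node from their matching dimension row. The tables, keys and two-node layout below are invented for illustration, and a plain modulo stands in for the database's distribution hash.

```python
NODES = 2


def node_of(key):
    """Assumed distribution function: integer key modulo the node count."""
    return key % NODES


# Hypothetical fact rows: (sale_id, product_id). The dim table is distributed on product_id.
FACT = [(1, 11), (2, 11), (3, 10), (4, 12), (5, 12), (6, 13)]
SALE_ID, PRODUCT_ID = 0, 1


def rows_shipped(fact_distribution_column):
    """Count fact rows stored on a different node from their matching dim row."""
    shipped = 0
    for row in FACT:
        fact_node = node_of(row[fact_distribution_column])
        dim_node = node_of(row[PRODUCT_ID])
        if fact_node != dim_node:
            shipped += 1  # this join pair needs data moved across the interconnect
    return shipped


unlike_distribution = rows_shipped(SALE_ID)   # fact spread by sale_id
colocated = rows_shipped(PRODUCT_ID)          # fact spread by the join key
```

Distributing both tables on the join key makes every join local; distributing the fact table on an unrelated key forces most pairs across the network. A table can only be distributed one way, so joins to a second dimension usually pay the shipping cost.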
  55. 55. MATCHING PROBLEMS TO TECHNOLOGIES
  56. 56. Solving the Problem Depends on the Diagnosis
  57. 57. Three General Workloads. Online Transaction Processing: ▪ Read, write, update ▪ User concurrency is the common performance limiter ▪ Low data, compute complexity. Business Intelligence / Data warehousing: ▪ Assumed to be read-only, but really read heavy, write heavy, usually separated in time ▪ Data size is the common performance limiter ▪ High data complexity, low compute complexity. Analytics: ▪ Read, write ▪ Data size and complexity of algorithm are the limiters ▪ Moderate data, high compute complexity.
  58. 58. Three General Workloads. But… BI is not read-only. OLTP is not write-only. Analytics is not purely computation.
  59. 59. Types of workloads. Write-biased: ▪ OLTP ▪ OLTP, batch ▪ OLTP, lite ▪ Object persistence ▪ Data ingest, batch ▪ Data ingest, real-time. Read-biased: ▪ Query ▪ Query, simple retrieval ▪ Query, complex ▪ Query-hierarchical / object / network ▪ Analytic. Mixed: inline analytic execution, operational BI.
  60. 60. What you need depends on workload & need. Optimizing for: ▪ Response time? ▪ Throughput? ▪ Both? Concerned about rapid growth in data? Unpredictable spikes in use? Bulk loads or incremental inserts and/or updates?
  61. 61. Important workload parameters to know: • Read-intensive vs. write-intensive
  62. 62. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data
  63. 63. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency
  64. 64. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long access latency
  65. 65. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long data latency • Predictable vs. unpredictable data access patterns
  66. 66. Important workload parameters to know: • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long data latency • Predictable vs. unpredictable data access patterns • Simple vs. complex data types
  67. 67. You must understand your workload mix; throughput and response time requirements aren’t enough. ▪ 100 simple queries accessing month-to-date data ▪ 90 simple queries accessing month-to-date data and 10 complex queries using two years of history ▪ Hazard calculation for the entire customer master ▪ Performance problems are rarely due to a single factor.
  68. 68. Two useful concepts to characterize queries. Selectivity: the restrictiveness of a query when accessing data. A highly selective query filters out most rows. Low selectivity queries read most of the rows. High: SELECT SUM(salary) FROM emp WHERE ID = 1. Low: SELECT SUM(salary) FROM emp.
  69. 69. Two useful concepts to characterize queries. Retrieval: the restrictiveness of a query when returning data. High retrieval brings back most of the rows. Low retrieval brings back relatively few rows. High: SELECT name, salary FROM emp. Low: SELECT SUM(salary) FROM emp.
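
The two concepts can be made concrete by measuring, for each of the slides' example queries, what fraction of the table the predicate lets through and what fraction comes back to the client. The `profile` helper and its two fractions are illustrative definitions chosen for this sketch, not standard metrics, and SQLite stands in for any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT, salary INTEGER, position TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        (1, "Marge Inovera", 150000, "Statistician"),
        (2, "Anita Bath", 120000, "Sewer inspector"),
        (3, "Ivan Awfulitch", 166000, "Dermatologist"),
        (4, "Nadia Geddit", 36000, "DBA"),
    ],
)


def profile(select_clause, where="1 = 1"):
    """Return (fraction of rows passing the filter, fraction of rows returned)."""
    total = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
    passed = conn.execute(f"SELECT COUNT(*) FROM emp WHERE {where}").fetchone()[0]
    returned = len(conn.execute(f"{select_clause} FROM emp WHERE {where}").fetchall())
    return passed / total, returned / total


high_selectivity = profile("SELECT SUM(salary)", "id = 1")
low_selectivity_low_retrieval = profile("SELECT SUM(salary)")
high_retrieval = profile("SELECT name, salary")
```

A small first number means high selectivity; a large second number means high retrieval. The unfiltered `SUM` reads every row but returns only one, which is the typical analytic shape.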
  70. 70. Selectivity and number of columns queried: Row store or column store, indexed or not? Chart from “The Mimicking Octopus: Towards a one-size-fits-all Database Architecture”, Alekh Jindal
  71. 71. Characteristics of query workloads
        Workload                    Selectivity   Retrieval        Repetition   Complexity
        Reporting / BI              Moderate      Low              Moderate     Moderate
        Dashboards / scorecards     Moderate      Low              High         Low
        Ad-hoc query and analysis   Low to high   Moderate to low  Low          Low to moderate
        Analytics (batch)           Low           High             Low to High  Low*
        Analytics (inline)          High          Low              High         Low*
        Operational / embedded BI   High          Low              High         Low
      * Low for retrieving the data, high if doing analytics in SQL
  72. 72. Characteristics of read-write workloads
        Workload            Selectivity      Retrieval         Repetition  Complexity
        Online OLTP         High             Low               High        Low
        Batch OLTP          Moderate to low  Moderate to high  High        Moderate to high
        Object persistence  High             Low               High        Low
        Bulk ingest         Low (write)      n/a               High        Low
        Realtime ingest     High (write)     n/a               High        Low
      With ingest workloads we’re dealing with write-only, so selectivity and retrieval don’t apply in the same way; instead it’s write volume.
  73. 73. Workload parameters and DB types at data scale. [Table; cell values not recoverable. Columns (workload parameters): write-biased, read-biased, updateable data, eventual consistency ok?, unpredictable query path, compute intensive. Rows: Standard RDBMS, Parallel RDBMS, NoSQL (kv, dht, obj), Hadoop*, Streaming database.] You see the problem: it’s an intersection of multiple parameters, and this chart only includes the first tier of parameters. Plus, workload factors can completely invert these general rules of thumb.
  74. 74. Workload parameters and DB types at data scale. [Table; cell values not recoverable. Columns (workload parameters): complex queries, selective queries, low latency queries, high concurrency, high ingest rate. Rows: Standard RDBMS, Parallel RDBMS, NoSQL (kv, dht, obj), Hadoop, Streaming database.] You have to look at the combination of workload factors: data scale, concurrency, latency & response time, then chart the parameters.
  75. 75. Problem: Architecture Can Define Options
  76. 76. A general rule for the read-write axes: As workloads increase in both intensity and complexity, we move into a realm of specialized databases adapted to specific workloads. [Chart: OldSQL, NewSQL and NoSQL plotted on read intensity vs. write intensity axes.]
  77. 77. In general… ▪ Relational row store databases for conventionally tooled low to mid-scale OLTP ▪ Relational databases for ACID requirements ▪ Parallel databases (row or column) for unpredictable or variable query workloads ▪ Specialized databases for complex data query workloads ▪ NoSQL (KVS, DHT) for high scale OLTP ▪ NoSQL (KVS, DHT) for low latency read-mostly data access ▪ Parallel databases (row or column) for analytic workloads over tabular data ▪ NoSQL / Hadoop for batch analytic workloads over large data volumes
  78. 78. How To Select A Database
  80. 80. How To Select A Database - (1) 1. What are the data management requirements and policies (if any) in respect of: - Data security (including regulatory requirements)? - Data cleansing? - Data governance? - Deployment of solutions in the cloud? - If a deployment environment is mandated, what are its technical characteristics and limitations? Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures. 2. What kind of data will be stored and used? - Is it structured or unstructured? - Is it likely to be one big table or many tables?
  81. 81. How To Select A Database - (2) 3. What are the data volumes expected to be? - What is the expected daily ingest rate? - What will the data retention/archiving policy be? - How big do we expect the database to grow to? (estimate a range). 4. What are the applications that will use the database? - Estimate by user numbers and transaction numbers - Roughly classify transactions as OLTP, short query, long query, long query with analytics. - What are the expectations in respect of growth of usage (per user) and growth of user population? 5. What are the expected service levels? - Classify according to availability service levels - Classify according to response time service levels - Classify on throughput where appropriate
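
Question 3 asks for a size range, not a single figure. A back-of-envelope estimate can be sketched as below. The formula (daily ingest × retention × an overhead factor for indexes and loaded-versus-raw expansion) and all the numbers are assumptions for illustration, not figures from the research.

```python
def estimate_size_gb(daily_ingest_gb, retention_days, overhead_factor=1.0):
    """Steady-state size: raw daily ingest kept for the retention window, with overhead."""
    return daily_ingest_gb * retention_days * overhead_factor


def size_range_gb(daily_low_gb, daily_high_gb, retention_days, overhead_factor=1.0):
    """Return a (low, high) range, since a point estimate hides the uncertainty."""
    return (
        estimate_size_gb(daily_low_gb, retention_days, overhead_factor),
        estimate_size_gb(daily_high_gb, retention_days, overhead_factor),
    )


# Hypothetical inputs: 5 to 8 GB of raw data per day, kept for one year,
# with loaded data assumed to take 1.5x the raw size.
low_gb, high_gb = size_range_gb(5, 8, retention_days=365, overhead_factor=1.5)
```

Under these assumptions the database settles somewhere between roughly 2.7 TB and 4.4 TB. Compression, common in columnar stores, would push the overhead factor below 1.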
  82. 82. How To Select A Database - (3) 6. What is the budget for this project and what does that cover? 7. What is the outline project plan? - Timescales - Delivery of benefits - When are costs incurred? 8. Who will make up the project team? - Internal staff - External consultants - Vendor consultants 9. What is the policy in respect of external support, possibly including vendor consultancy for the early stages of the project?
  83. 83. How To Select A Database - (4) 10. What are the business benefits? - Which ones can be quantified financially? - Which ones can only be guessed at (financially)? - Are there opportunity costs?
  84. 84. A random selection of databases: Sybase IQ, ASE; EnterpriseDB; Algebraix; Teradata, Aster Data; LucidDB; Intersystems Caché; Oracle, RAC; Vectorwise; Streambase; Microsoft SQLServer, PDW; MonetDB; SQLStream; IBM DB2s, Netezza; Exasol; Coral8; Paraccel; Illuminate; Ingres; Kognitio; Vertica; Postgres; EMC/Greenplum; InfiniDB; Cassandra; Oracle Exadata; 1010 Data; CouchDB; SAP HANA; SAND; Mongo; Infobright; Endeca; Hbase; MySQL; Xtreme Data; Redis; MarkLogic; IMS; RainStor; Tokyo Cabinet; Hive; Scalaris. And a few hundred more…
  85. 85. Product selection options. The Subtraction Model: ▪ Start with a full set, remove what’s bad, evaluate the remainder ▪ Conventional analyst model ▪ Works best with a stable market. The Addition Model: ▪ Start with an empty set, add what’s good, evaluate the results ▪ The designer model ▪ Works best in an emerging or changing market.
  86. 86. Product Selection: Preliminary investigation. Short-list (usually arrived at by elimination). Be sure to set the goals and control the process. Evaluation by technical analysis and modeling. Evaluation by proof of concept. Do not be afraid to change your mind. Negotiation.
  87. 87. Conclusion: Wherein all is revealed, or ignorance exposed
  89. 89. Thank You For Your Attention