a final version of the slide deck
  • This whole presentation is Copyright 2009, All Rights Reserved, and all that good stuff.
  • The first time I visited Oracle it had fewer than 50 employees. That’s how long I’ve been doing this. My official bio is at http://www.monash.com/curtbio.html. http://www.dbms2.com has most of my research on database and analytics. Exception: http://www.texttechnologies.com contains research on text analytics and search. Both are regarded as premier references in their fields (e.g., academic citations, news links, Wikipedia links, etc.). Our actual business model is built around user, vendor, and occasionally investor consulting.
  • Slides 3-26 outline what you need to know about the sector to conduct any kind of selection process for analytic/data warehouse database management systems (DBMS). The main thing I hope you learn from this part of the presentation is which categorizations (of products and/or users) are useful, and which just cause confusion. Slides 27-39 have tips for the process itself. They’re meant as reference take-aways. We’ll discuss them selectively as time permits. Our main focus is on terabyte-plus databases, but many of these observations apply to smaller ones as well. Checklists of various sorts are on Slide 10 (vendors), Slide 11 (features), Slide 28 (specific decisions in the selection process), Slides 29-31 (use-case metrics and characteristics), Slide 35 (shortlist building), and Slide 38 (proof-of-concept tips). If you want me to discuss specific vendors at any length, please let me know up front. To do so I might have to rush through other parts of the session.
  • Some of the best computer scientists in the world are still sorting all this stuff out.
  • Disk speed dominates everything. The problem is this: disks simply don’t spin very fast. If they did, they’d fly off the spindle or something. The very first disk drives, introduced in 1956 by IBM, rotated 1,200 times per minute. Today’s top-end drives only spin 15,000 times per minute. That’s a 12.5-fold increase in 52 years. Most other metrics of computer performance increase roughly that much every 7 years or so. That’s just Moore’s Law. A two-year doubling, which turns out to be more factual than other statements of the law, works out to an 8-fold increase in 6 years, or roughly an 11-fold increase in 7. There’s just a huge, huge difference.
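The arithmetic in this note can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 2-year doubling period is the note's own simplification of Moore's Law.

```python
import math


def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Compound growth over `years` if the metric doubles every period."""
    return 2.0 ** (years / doubling_period_years)


def implied_doubling_period(total_factor: float, years: float) -> float:
    """Doubling period (in years) implied by an observed overall growth factor."""
    return years * math.log(2.0) / math.log(total_factor)


# Moore's Law style growth with a two-year doubling.
moore_6_years = growth_factor(6)   # 2**3
moore_7_years = growth_factor(7)   # 2**3.5

# Disk rotation speed: 1,200 RPM in 1956 versus 15,000 RPM 52 years later.
disk_rpm_factor = 15_000 / 1_200
disk_doubling_years = implied_doubling_period(disk_rpm_factor, 52)

print(f"6 years of 2-year doublings: {moore_6_years:.1f}x")
print(f"7 years of 2-year doublings: {moore_7_years:.1f}x")
print(f"Disk RPM growth over 52 years: {disk_rpm_factor:.1f}x")
print(f"Implied disk-speed doubling period: {disk_doubling_years:.1f} years")
```

In other words, rotation speed has doubled roughly once every 14 years, against roughly every 2 years for the metrics Moore's Law describes.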
  • Another reason disk access is everything is this: RAM is 1,000+ times faster than disk. Why my numbers are so vague, for those who care: It’s actually hard to get a single firm number for the difference between disk and RAM access times. Disk access times are well-known. They’re advertised a lot, for one thing. But RAM access times are harder. A big part of the problem is that they depend heavily on architecture; access isn’t access isn’t access. There are multiple levels of cache, for example. Another problem is that RAM isn’t RAM isn’t RAM. Anyhow, listed access times tend to be in the 5 to 7.5 nanosecond range, so that’s what I’m going with. One thing we can compute is a very hard lower bound on disk random seek times. If a seek is random, then the average time is at least the time it takes the disk to spin physically halfway around. And we know exactly what that is; at 15,000 RPM it’s 2 milliseconds. There’s just no way random disk seeks will get any faster than that, except to the extent disk rotation resumes its creeping slow progress. “Tiering” basically means “use of Level 2 (i.e., on-processor) cache.”
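A rough sketch of the numbers behind the disk-speed barrier, using only figures quoted in the deck (15,000 RPM, 2.5-3 ms seeks, 5-7.5 ns RAM access). The function names are illustrative, and the ratio ignores caching and everything else that makes real access times vary.

```python
def half_rotation_ms(rpm: float) -> float:
    """Average rotational delay: the time for half a revolution, in milliseconds."""
    ms_per_rotation = 60_000.0 / rpm
    return ms_per_rotation / 2.0


def disk_to_ram_ratio(disk_seek_ms: float, ram_access_ns: float) -> float:
    """How many RAM accesses fit into one random disk seek."""
    return (disk_seek_ms * 1_000_000.0) / ram_access_ns


rotation_limit_ms = half_rotation_ms(15_000)

# Best case for disk: fastest quoted seek versus slowest quoted RAM access.
ratio_low = disk_to_ram_ratio(2.5, 7.5)
# Worst case for disk: slowest quoted seek versus fastest quoted RAM access.
ratio_high = disk_to_ram_ratio(3.0, 5.0)

print(f"Half-rotation limit at 15,000 RPM: {rotation_limit_ms} ms")
print(f"Disk:RAM access ratio: {ratio_low:,.0f}:1 to {ratio_high:,.0f}:1")
```

With these inputs the ratio comes out in the hundreds of thousands, which is the order of magnitude the "1,000,000:1" slide title rounds up to.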
  • I’ve been watching the DBMS industry – especially the relational vendors – work on performance for over 25 years now. And I’m in awe at what they’ve accomplished. It’s some of the finest engineering in the software industry. Much of that work for the past decade has been in the area of OLAP. And improving OLAP performance basically means decreasing OLAP I/O. Perhaps the most basic thing they try to do is minimize the amount of data returned. Since the end result is what the end result is, this means optimizing the amount returned at intermediate stages of a query execution process. That’s what cost-based optimizers are all about. Baked into the architecture of disk-centric DBMS is something even more basic; they try to minimize index accesses. Naively, if you’re selecting from 2^30 (i.e., about a billion) records, there might be 30 steps as you walk through the binary tree. By dividing indices into large pages, this is reduced – at the cost of a whole lot of sorting within the block at each step. Layered on are ever more special indexing structures. For example, if it seems clear that a certain join will be done frequently, an index can be built that essentially bakes in that join’s results. Of course, this also reduces the amount of data returned in the intermediate step, admittedly at the cost of index size. Anyhow, it’s a very important technique. And that’s not the only kind of precalculation. Preaggregation is at the heart of disk-centric MOLAP architectures. Materialized views bring MOLAP benefits to conventional relational processing. These are all more or less logical techniques, although some of the optimizer stuff is on the boundary between logical and physical. There also are approaches that are more purely physical. Most basically, much as in the index situation, data is returned in pages. It turns out to be cheaper to always be wasteful and send a whole block of sequential data back than it is to send back only what is actually needed.
Beyond that, efforts are made to understand what data will be requested together, and cluster it so that sequential reads can take the place of truly random I/O. And that leads to the most powerful solution of all – do everything in RAM!! If you always initialized by reading in the whole database, in principle you’re done with ALL your disk I/O for the day! Oh, there may be reasons to write things, such as the results to queries, but basically you’ve made your disk speed problems totally go away. There’s a price of course, mainly and most obviously in the RAM you need to buy, and probably the CPU driving that RAM. But by investing in one area, you’re making a big related problem go away – if, of course, you can afford all that silicon.
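To make the index-page point concrete, here is a minimal sketch that counts how many levels an index needs to address 2^30 records. The fanouts of 256 and 1,024 keys per page are assumptions chosen for illustration, not figures from any particular DBMS.

```python
def index_levels(num_records: int, fanout: int) -> int:
    """Levels needed so that fanout**levels can address num_records entries."""
    levels = 0
    capacity = 1
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


records = 2 ** 30  # about a billion, as in the note

binary_levels = index_levels(records, 2)       # one key comparison per node
paged_levels_256 = index_levels(records, 256)  # assumed 256 keys per page
paged_levels_1024 = index_levels(records, 1024)  # assumed 1,024 keys per page

print(f"Binary tree: {binary_levels} levels")
print(f"256-way pages: {paged_levels_256} levels")
print(f"1,024-way pages: {paged_levels_1024} levels")
```

Each level is potentially one random disk access, so going from 30 levels to 3 or 4 is the difference that matters; the extra searching within each page happens in RAM, where it is comparatively cheap.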
  • This is the model for appliances. It’s also the model for software-only configurations that compete with appliances. Think IBM BCUs = Balanced Configuration Units, or various Oracle reference configurations. The pendulum shifts back and forth as to whether there are tight “recommended configurations” for non-appliance offerings. Row-based vendors are generally pickier about their hardware configurations than columnar ones.
  • Kickfire is the only custom-chip-based vendor of note. Netezza’s FPGAs and PowerPC processors aren’t, technically, custom. But they’re definitely unusual. Oracle and DATAllegro (pre-Microsoft) like Infiniband. Other vendors like 10-gigabit Ethernet. Others just use lots and lots of 1-gigabit switches. Teradata, long proprietary, is now going in a couple of different networking directions.
  • This slide is included at this point mainly for the golly-gee-whiz factor. But it’s also a reasonable long-list of vendors to start from, especially if you’re in the terabyte range.
  • Performance, in almost all dimensions, is closely related to architecture. So is support for alternate datatypes. Most of the others are related more to product maturity and/or vendor emphasis.
  • Columnar isn’t columnar isn’t columnar; each product is different. The same goes for row-based. Still, this categorization is the point from which to start.
  • Oracle (pre-Exadata) and SQL Server (until Madison ships) are single products meant to serve both OLTP and analytics. Any of the main versions of DB2 is something like that too. Sybase, however, separated its OLTP and analytic product lines in the mid-1990s.
  • I’ve alluded to the query acceleration parts already. But other kinds of analytic processing can be enhanced as well. Mature, general-purpose DBMS are commonly leaders in this area.
  • Even when you can make this stuff work at all, it’s hard. That’s a big reason why “disruptive” new analytic DBMS vendors have sprung up.
  • Specialized products aren’t always best. Sometimes you should just stick with your enterprise general-purpose DBMS standard.
  • If you need to manage much more than 10 terabytes of user data, most of your attractive alternatives are in this category.
  • The advantage of hash distribution is that if your join happens to involve the hash key, a lot of the work is already done for you. The disadvantage can be a bit of skew. The advantage usually wins out. Almost every vendor (Kognitio is an exception) encourages hash distribution. Oracle Exadata is an exception too, for different reasons.
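A minimal sketch of hash distribution, assuming a toy four-node cluster and made-up `customers` and `orders` tables. It is not any vendor's algorithm; it only shows why a join on the hash key needs no data movement between nodes, and how skew can be measured.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Map a distribution key to a node using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(rows: list, key_field: str, num_nodes: int) -> list:
    """Place each row on the node chosen by hashing its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(str(row[key_field]), num_nodes)].append(row)
    return nodes


def skew(nodes: list) -> float:
    """Largest node size divided by the average node size (1.0 = perfectly even)."""
    sizes = [len(node) for node in nodes]
    return max(sizes) / (sum(sizes) / len(sizes))


NUM_NODES = 4  # hypothetical cluster size
customers = [{"customer_id": i, "name": f"c{i}"} for i in range(1000)]
orders = [{"order_id": i, "customer_id": i % 1000} for i in range(5000)]

# Both tables are distributed on the join key, customer_id.
customer_nodes = distribute(customers, "customer_id", NUM_NODES)
order_nodes = distribute(orders, "customer_id", NUM_NODES)

print("orders per node:", [len(node) for node in order_nodes])
print(f"skew: {skew(order_nodes):.3f}")
```

Because both tables hash on `customer_id`, every order sits on the same node as its customer, so each node can join its own slice locally. A key with a few very common values would push the skew figure well above 1.0.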
  • Fixed configurations – including but not limited to appliances – are more important in row-based MPP than in columnar MPP systems. Oracle Exadata, Teradata, and Netezza are the most visible examples, but another one is IBM’s BCUs.
  • Sybase IQ is the granddaddy, but it’s not MPP. SAND is another old one, but it’s focused more on archiving now. Vertica is a quite successful recent start-up, with >10X the known customers of ParAccel (published or NDA). InfoBright and Kickfire are MySQL storage engines. Kickfire is also an appliance. Exasol is very memory-centric. So is ParAccel’s TPC-H submission. So is SAP BI Accelerator, but unlike the others it’s not really a DBMS. MonetDB is open source.
  • The big benefit of columnar is at the I/O bottleneck – you don’t have to bring back the whole row. But it also tends to make compression easier. Naïve columnar implementations are terrible at update/load. Any serious commercial product has done engineering work to get around that. For example, Vertica – which is probably the most open about its approach -- pretty much federates queries between disk and what almost amounts to a separate in-memory DBMS.
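The I/O and compression points can be illustrated with a back-of-envelope sketch. The table layout, column widths, and row count below are invented for the example, and run-length encoding stands in for whatever compression a real product uses.

```python
def bytes_scanned_row_store(num_rows: int, column_widths: dict) -> int:
    """A row store reads every column of every row it scans."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_column_store(num_rows: int, column_widths: dict, needed: list) -> int:
    """A column store reads only the columns the query touches."""
    return num_rows * sum(column_widths[name] for name in needed)


def run_length_encode(values: list) -> list:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


# Hypothetical table: column name -> width in bytes.
widths = {"order_id": 8, "customer_id": 8, "state": 2, "amount": 8, "comment": 100}
rows = 1_000_000

row_bytes = bytes_scanned_row_store(rows, widths)
col_bytes = bytes_scanned_column_store(rows, widths, ["state", "amount"])
encoded = run_length_encode(["CA", "CA", "CA", "NY", "NY", "TX"])

print(f"row store scans {row_bytes:,} bytes")
print(f"column store scans {col_bytes:,} bytes")
print("run-length encoded state column:", encoded)
```

Values within a single column tend to resemble each other, especially once sorted, which is why simple schemes like run-length encoding work better on columns than on whole rows.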
  • I.e., respectively: OLTP system and data warehouse integrated; separate EDW (Enterprise Data Warehouse); customer-facing (directly or indirectly) data mart that hence requires OLTP-like uptime; 100+ terabytes or so; great speed on terabyte-scale data sets at low per-terabyte TCO.
  • The bad news: There is no such thing as the One Right Checklist that matches enterprises to the single product best for them, or even to a short-short list of products. The good news: You can have an effective, efficient selection process even so.
  • Some data warehouses have workloads so light it almost doesn’t matter what software you use. In others, the workload is a performance strain, irrespective of database size. In some, the database size itself causes considerable strain. In some, the toughest issues are mission-critical/OLTP-like. And in some, more than one of these challenges is present.
  • Many products have done a good job for many users each on sub-terabyte data warehouses. Get up into the terabyte range, and the herd thins out. Get over 10 TB, and most of the competitors -- old and new alike -- have pretty dismal reference lists. Concurrency is a challenge for newer products. In Release 1, they typically do a good job for only a handful of users at a time – literally. The first pass of tweaks should get them into the low double digits. Significantly higher user counts require multiple releases. The whole point of an analytic DBMS is that it’s optimized for something other than transactional updates. So getting data in is likely to be performant in some scenarios, slow in others. Figuring out what kinds of latency you can or can’t tolerate is important. Then test that in POCs.
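One way to test concurrency in a POC is a small harness that replays queries at several user counts and records latency. The sketch below is an assumption-laden outline: `run_query` is a hypothetical stand-in that merely sleeps, and would need to be replaced by a call through your actual DBMS client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id: int) -> int:
    """Hypothetical stand-in for a real query; replace with your DBMS client call."""
    time.sleep(0.01)
    return query_id


def timed_query(query_id: int) -> float:
    """Run one query and return its elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    run_query(query_id)
    return time.perf_counter() - start


def measure(concurrent_users: int, queries_per_user: int) -> dict:
    """Issue queries from a pool of simulated users and summarize latency."""
    total = concurrent_users * queries_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = sorted(pool.map(timed_query, range(total)))
    return {
        "users": concurrent_users,
        "queries": total,
        "median_s": latencies[len(latencies) // 2],
        "worst_s": latencies[-1],
    }


# User counts taken from the deck's "5, 15, 50, or 500?" breakpoints.
results = [measure(users, queries_per_user=4) for users in (5, 15, 50)]
for result in results:
    print(result)
```

Against a real system, the interesting output is how median and worst-case latency change as the user count moves from a handful into the double digits.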
  • How advantageous is it for you to stick with your enterprise DBMS vendor? Contracts, training, porting, and internal politics all come into play. Some buyers love the idea of appliances. Some hate it. Some are more ambivalent. One of the first results of your research process should be to figure out which group you’re in. One of the key results of the metrics analysis is figuring out whether or not you need an MPP offering. Of course, in some cases it makes perfect sense to test MPP and non-MPP DBMS against each other. And how do you feel about data warehouse SaaS?
  • Here starts the explicit how-to.
  • Try not to get locked into marketing categories that were created for vendor convenience.
  • Databases grow naturally, as more transactions are added over time. Cheaper data warehousing also encourages the retention of more detail, and the addition of new data sources. All three factors boost database size. Users can be either humans or other systems. (Both, in fact, are included in the definition of “user” on the Oracle price list.) Cheap data warehousing also leads to a desire for lower latency, often without clear consideration of the benefits of same.
  • In figuring out BI application requirements, it’s important to be forward-looking. But remember two things: People tend to overestimate the need for repetitive reports. And it’s hard to judge the need for ad hoc queries until you know what results earlier ones provide.
  • If you already have a formal data mining activity, this analysis can be carried out reasonably effectively. But if you’re just starting out, you need some consulting or peer review to help you scope the need.
  • Nobody ever overestimates their need for storage. But people do sometimes overestimate their need for data immediacy.
  • Most of the many alternatives will be ruled out quickly, which is good for the sanity of everybody concerned.
  • If you’re overly demanding about proof, you may not get the best system. If you’re not demanding enough, you may not get a system that does the job at all.
  • Examples of third-party tool categories: business intelligence, ETL, and specific analytic applications. In some cases specific brands are must-haves; in other cases you just need to support good, cost-effective ones.
  • If vendors have a lot of control over the POC, what do you think the outcome is apt to be?
  • Just as integration is a huge part of actual data warehousing, getting data is a huge part of POCs. Obviously, it’s important to pick the right list of queries. But check whether your needs suggest that you test other analytic functionality as well. When you do the POC, look for unrealistically favorable scenarios – overly trained personnel (e.g., vendor employees), lightly-loaded machines, simple workloads, etc.
  • Three main rules for data warehouse POCs: (1) Get some data loaded SOON. That’s a POC-within-a-POC, meant to turn up glitches – mainly ones inside your own enterprise that have nothing to do with the product being tested – that could keep you from getting the real test data loaded later. (2) Test with deliberate malfunctions. Pull out cables, boards, drives, and plugs. Load dirty data. (3) Don’t let the vendor dictate the location, ground rules, or content of the POC.
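For the "load dirty data" rule, a small generator can produce a test file with a known set of bad rows, so you can check what the DBMS rejects, what it silently accepts, and how it reports errors. Everything here is hypothetical: the column names, the three kinds of corruption, and the 5% dirty fraction are illustrative choices only.

```python
import random


def make_clean_rows(n: int, seed: int = 0) -> list:
    """Generate synthetic order rows with plausible values."""
    rng = random.Random(seed)
    return [
        {
            "order_id": i,
            "customer_id": rng.randint(1, 1000),
            "amount": f"{rng.uniform(1, 500):.2f}",
        }
        for i in range(n)
    ]


def corrupt(rows: list, fraction: float, seed: int = 1) -> tuple:
    """Return a dirty copy of rows plus the sorted indexes that were damaged."""
    rng = random.Random(seed)
    dirty = [dict(row) for row in rows]
    num_bad = round(len(dirty) * fraction)
    # Row 0 is left clean so it can serve as the source of duplicate keys.
    bad_indexes = rng.sample(range(1, len(dirty)), num_bad)
    for i in bad_indexes:
        kind = rng.choice(["null_key", "bad_number", "duplicate_id"])
        if kind == "null_key":
            dirty[i]["customer_id"] = None
        elif kind == "bad_number":
            dirty[i]["amount"] = "N/A"
        else:
            dirty[i]["order_id"] = dirty[0]["order_id"]
    return dirty, sorted(bad_indexes)


clean = make_clean_rows(1000)
dirty, bad = corrupt(clean, fraction=0.05)
print(f"{len(bad)} of {len(dirty)} rows deliberately damaged")
```

Keeping the list of damaged indexes lets you compare what the load process flagged against what you know was wrong.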

Presentation Transcript

    • How to Select an Analytic DBMS
      • Overview, checklists, and tips
    • by
    • Curt A. Monash, Ph.D.
    • President, Monash Research
    • Editor, DBMS2
    • contact @monash.com
    • http://www.monash.com
    • http://www.DBMS2.com
  • Curt Monash
    • Analyst since 1981, own firm since 1987
      • Covered DBMS since the pre-relational days
      • Also analytics, search, etc.
    • Publicly available research
      • Blogs, including DBMS2 ( www.DBMS2.com -- the source for most of this talk)
      • Feed at www.monash.com/blogs.html
      • White papers and more at www.monash.com
    • User and vendor consulting
  • Our agenda
    • Why are there such things as specialized analytic DBMS?
    • What are the major analytic DBMS product alternatives?
    • What are the most relevant differentiations among analytic DBMS users?
    • What’s the best process for selecting an analytic DBMS?
  • Why are there specialized analytic DBMS?
    • General-purpose database managers are optimized for updating short rows …
    • … not for analytic query performance
    • 10-100X price/performance differences are not uncommon
    • At issue is the interplay between storage, processors, and RAM
  • Moore’s Law, Kryder’s Law, and a huge exception
    • Growth factors:
      • Transistors/chip: >100,000 since 1971
      • Disk density: >100,000,000 since 1956
      • Disk speed: 12.5 since 1956
    • The disk speed barrier dominates everything!
    05/21/10 DRAFT!! THIRD TEST!!
  • The “1,000,000:1” disk-speed barrier
    • RAM access times ~5-7.5 nanoseconds
      • CPU clock speed <1 nanosecond
      • Interprocessor communication can be ~1,000X slower than on-chip
    • Disk seek times ~2.5-3 milliseconds
      • Limit = ½ rotation
      • i.e., 1/30,000 minutes
      • i.e., 1/500 seconds = 2 ms
    • Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG
  • Software strategies to optimize analytic I/O
    • Minimize data returned
      • Classic query optimization
    • Minimize index accesses
      • Page size
    • Precalculate results
      • Materialized views
      • OLAP cubes
    • Return data sequentially
    • Store data in columns
    • Stash data in RAM
  • Hardware strategies to optimize analytic I/O
    • Lots of RAM
    • Parallel disk access!!!
    • Lots of networking
    • Tuned MPP (Massively Parallel Processing) is the key
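A back-of-envelope sketch of why parallel disk access matters. The 100 MB/s per-disk sequential throughput and the 200-disk configuration are assumptions for illustration, and the formula assumes perfectly even data distribution with no network or CPU bottleneck.

```python
def scan_seconds(data_tb: float, num_disks: int, mb_per_sec_per_disk: float) -> float:
    """Time to read data_tb terabytes once, spread evenly across num_disks."""
    total_mb = data_tb * 1_000_000.0
    return total_mb / (num_disks * mb_per_sec_per_disk)


ASSUMED_MB_PER_SEC = 100.0  # assumed per-disk sequential throughput

single_disk_s = scan_seconds(10, 1, ASSUMED_MB_PER_SEC)
parallel_s = scan_seconds(10, 200, ASSUMED_MB_PER_SEC)

print(f"10 TB full scan, 1 disk: {single_disk_s / 3600:.1f} hours")
print(f"10 TB full scan, 200 disks: {parallel_s / 60:.1f} minutes")
```

Real systems fall short of this linear ideal, which is where the "tuned" in tuned MPP comes in.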
  • Specialty hardware strategies
    • Custom or unusual chips (rare)
    • Custom or unusual interconnects
    • Fixed configurations of common parts
      • Appliances or recommended configurations
    • And there’s also SaaS
  • 18 contenders (and there are more)
    • Aster Data
    • Dataupia
    • Exasol
    • Greenplum
    • HP Neoview
    • IBM DB2 BCUs
    • Infobright/MySQL
    • Kickfire/MySQL
    • Kognitio
    • Microsoft Madison
    • Netezza
    • Oracle Exadata
    • Oracle w/o Exadata
    • ParAccel
    • SQL Server w/o Madison
    • Sybase IQ
    • Teradata
    • Vertica
  • General areas of feature differentiation
    • Query performance
    • Update/load performance
    • Compatibilities
    • Advanced analytics
    • Alternate datatypes
    • Manageability and availability
    • Encryption and security
  • Major analytic DBMS product groupings
    • Architecture is a hot subject
    • Traditional OLTP
    • Row-based MPP
    • Columnar
    • (Not covered tonight) MOLAP/array-based
  • Traditional OLTP examples
    • Oracle (especially pre-Exadata)
    • IBM DB2 (especially mainframe)
    • Microsoft SQL Server (pre-Madison)
  • Analytic optimizations for OLTP DBMS
    • Two major kinds of precalculation
      • Star indexes
      • Materialized views
    • Other specialized indexes
    • Query optimization tools
    • OLAP extensions
    • SQL 2003
    • Other embedded analytics
  • Drawbacks
    • Complexity and people cost
    • Hardware cost
    • Software cost
    • Absolute performance
  • Legitimate use scenarios
    • When TCO isn’t an issue
      • Undemanding performance (and therefore administration too)
    • When specialized features matter
      • OLTP-like
      • Integrated MOLAP
      • Edge-case analytics
    • Rigid enterprise standards
    • Small enterprise/true single-instance
  • Row-based MPP examples
    • Teradata
    • DB2 (open systems version)
    • Netezza
    • Oracle Exadata (sort of)
    • DATAllegro/Microsoft Madison
    • Greenplum
    • Aster Data
    • Kognitio
    • HP Neoview
  • Typical design choices in row-based MPP
    • “Random” (hashed or round-robin) data distribution among nodes
    • Large block sizes
      • Suitable for scans rather than random accesses
    • Limited indexing alternatives
      • Or little optimization for using the full boat
    • Carefully balanced hardware
    • High-end networking
  • Tradeoffs among row MPP alternatives
    • Enterprise standards
    • Vendor size
    • Hardware lock-in
    • Total system price
    • Features
  • Columnar DBMS examples
    • Sybase IQ
    • SAND
    • Vertica
    • ParAccel
    • InfoBright
    • Kickfire
    • Exasol
    • MonetDB
    • SAP BI Accelerator (sort of)
  • Columnar pros and cons
    • Bulk retrieval is faster
    • Pinpoint I/O is slower
    • Compression is easier
    • Memory-centric processing is easier
    • MPP is not quite as crucial
  • Segmentation – a first cut
    • One database to rule them all
    • One analytic database to rule them all
    • Frontline analytic database
    • Very, very big analytic database
    • Big analytic database handled very cost-effectively
  • Basics of systematic segmentation
    • Use cases
    • Metrics
    • Platform preferences
  • Use cases – a first cut
    • Light reporting
    • Diverse EDW
    • Big Data
    • Operational analytics
  • Metrics – a first cut
    • Total raw/user data
      • Below 1-2 TB, references abound
      • 10 TB is another major breakpoint
    • Total concurrent users
      • 5, 15, 50, or 500?
    • Data freshness
      • Hours
      • Minutes
      • Seconds
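These metrics can be captured in a simple record for each use case before talking to vendors. The sketch below is one hypothetical way to do that; the exact cut-offs (2 TB and 10 TB for size, 60 and 3,600 seconds for freshness) are assumptions that follow the slide's rough breakpoints rather than precise rules.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    user_data_tb: float
    concurrent_users: int
    freshness_seconds: float  # how stale query results may be


def size_band(user_data_tb: float) -> str:
    """Bucket by the slide's breakpoints; 2 TB is an assumed reading of '1-2 TB'."""
    if user_data_tb < 2:
        return "under 2 TB"
    if user_data_tb <= 10:
        return "2-10 TB"
    return "over 10 TB"


def freshness_band(seconds: float) -> str:
    """Bucket the latency requirement into the slide's three categories."""
    if seconds < 60:
        return "seconds"
    if seconds < 3600:
        return "minutes"
    return "hours"


def summarize(workload: Workload) -> dict:
    """Reduce a workload to the three first-cut metrics."""
    return {
        "size": size_band(workload.user_data_tb),
        "users": workload.concurrent_users,
        "freshness": freshness_band(workload.freshness_seconds),
    }


example = summarize(Workload(user_data_tb=25, concurrent_users=40, freshness_seconds=900))
print(example)
```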
  • Basic platform issues
    • Enterprise standards
    • Appliance-friendliness
    • Need for MPP?
    • Cloud/SaaS
  • The selection process in a nutshell
    • Figure out what you’re trying to buy
    • Make a shortlist
    • Do free POCs*
    • Evaluate and decide
    • *The only part that’s even slightly specific to the analytic DBMS category
  • Figure out what you’re trying to buy
    • Inventory your use cases
      • Current
      • Known future
      • Wish-list/dream-list future
    • Set constraints
      • People and platforms
      • Money
    • Establish target SLAs
      • Must-haves
      • Nice-to-haves
  • Use-case checklist -- generalities
    • Database growth
      • As time goes by …
      • More detail
      • New data sources
    • Users (human)
    • Users/usage (automated)
    • Freshness (data and query results)
  • Use-case checklist – traditional BI
    • Reports
      • Today
      • Future
    • Dashboards and alerts
      • Today
      • Future
      • Latency
    • Ad-hoc
      • Users
      • Now that we have great response time …
  • Use-case checklist – data mining
    • How much do you think it would improve results to
      • Run more models?
      • Model on more data?
      • Add more variables?
      • Increase model complexity?
    • Which of those can the DBMS help with anyway?
    • What about scoring?
      • Real-time
      • Other latency issues
  • SLA realism
    • What kind of turnaround truly matters?
      • Customer or customer-facing users
      • Executive users
      • Analyst users
    • How bad is downtime?
      • Customer or customer-facing users
      • Executive users
      • Analyst users
  • Short list constraints
    • Cash cost
      • But purchases are heavily negotiated
    • Deployment effort
      • Appliances can be good
    • Platform politics
      • Appliances can be bad
      • You might as well consider incumbent(s)
  • Filling out the shortlist
    • Who matches your requirements in theory?
    • What kinds of evidence do you require?
      • References?
        • How many?
        • How relevant?
      • A careful POC?
      • Analyst recommendations?
      • General “buzz”?
  • A checklist for shortlists
    • What’s your tolerance for specialized hardware?
    • What’s your tolerance for set-up effort?
    • What’s your tolerance for ongoing administration?
    • What are your insert and update requirements?
    • At what volumes will you run fairly simple queries?
    • What are your complex queries like?
    • For which third-party tools do you need support?
    • and, most important,
    • Are you madly in love with your current DBMS?
  • Proof-of-Concept basics
    • The better you match your use cases, the more reliable the POC is
    • Most of the effort is in the set-up
    • You might as well do POCs for several vendors – at (almost) the same time!
    • Where is the POC being held?
  • The three big POC challenges
    • Getting data
      • Real?
        • Politics
        • Privacy
      • Synthetic?
      • Hybrid?
    • Picking queries
      • And more?
    • Realistic simulation(s)
      • Workload
      • Platform
      • Talent
  • POC tips
    • Don’t underestimate requirements
    • Don’t overestimate requirements
    • Get SOME data ASAP
    • Don’t leave the vendor in control
    • Test what you’ll be buying
    • Use the baseball bat
  • Evaluate and decide
    • It all comes down to
    • Cost
    • Speed
    • Risk
    • and in some cases
    • Time to value
    • Upside
  • Further information
    • Curt A. Monash, Ph.D.
    • President, Monash Research
    • Editor, DBMS2
    • contact @monash.com
    • http://www.monash.com
    • http://www.DBMS2.com