Iod 2010 1971_lohman_final
Upcoming SlideShare
Loading in...5
×
 

Iod 2010 1971_lohman_final

on

  • 1,214 views

 

Statistics

Views

Total Views
1,214
Slideshare-icon Views on SlideShare
1,213
Embed Views
1

Actions

Likes
0
Downloads
5
Comments
0

1 Embed 1

http://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Iod 2010 1971_lohman_final Iod 2010 1971_lohman_final Presentation Transcript

    • Technology Preview: Not your father’s database system Session Number 1971 Guy Lohman, PhD, IBM Keshava Murthy, IBM
    • © Copyright IBM Corporation 2010. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.  WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.  IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE. IBM, the IBM logo, ibm.com, Informix, DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml Other company, product, or service names may be trademarks or service marks of others. Disclaimer
    • Disclaimers This talk is about proof of concept of a technology developed at IBM research labs. This is NOT a talk on IBM Informix product or planned release. Please read the previous disclaimer page carefully.
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • Third Generation of Database Technology (is here) According to IDC’s Article (Carl Olofson) – Feb. 2010 1 st Generation: - Vendor proprietary databases of IMS, IDMS, Datacom 2 nd Generation: - RDBMS for Open Systems, dependent on disk layout, limitations in scalability and disk I/O IDC Prediction that within 5 years: Most data warehouses will be stored in a columnar fashion Most OLTP database will either be augmented by an in-memory database (IMDB) or reside entirely in memory Most large-scale database servers will achieve horizontal scalability through clustering
    • Why Optimize Smart Analytics? Today, performance of Business Intelligence (BI) queries is too unpredictable When an analyst submits a query, s/he doesn’t know whether to: Response time depends upon “performance layer” of indexes & materializations Depends critically on predicting the workload But BI is inherently ad hoc! Goal of IBM Smart Analytics Optimizer: Predictably Fast (i.e., Interactive ) Ad Hoc Querying Any query should run in about the same time Permit an Analyst to interact with the data
    • What Is the IBM Smart Analytics Optimizer? IBM Smart Analytics Optimizer for Informix Prototype is: Network-attached accelerator to Informix Exploits: Large main memories Commodity multi-core processors Extreme compression Speeds up typical Data Warehouse / Business Intelligence SQL queries by 10x to 100x Without requiring tuning of indexes, materialized views, etc.
    • What IBM Smart Analytics Optimizer is Designed For OLAP-style SQL queries: Relational star schema (large fact table joined to multiple dimensions ) Large subset of data warehouse accessed, reduced significantly by... Aggregations (SUM, AVG, COUNT) and optional grouping (GROUP BY) Looking for trends or exceptions EXAMPLE SQL : SELECT P.Manufacturer, S.Type, SUM(Revenue) FROM Fact_Sales F INNER JOIN Dim_Product P ON F.FKP = P.PK INNER JOIN Dim_Store S ON F.FKS = S.PK LEFT OUTER JOIN Dim_Time T ON F.FKT = T.PK WHERE P.Type = “JEANS“ AND S.Size > 50000 AND T.Year = 2007 GROUP BY P.Manufacturer, S.Type
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • IBM Smart Analytics Optimizer Configuration TCP/IP Informix Prototype: Routes SQL queries to accelerator User need not change SQL or apps . Can always run query in IDS, e.g., if too short an est. execution time Uses functions, datatypes unsupported in ISAO at this point. Bulk Loader SQL Queries (from apps) IBM Smart Analytics Optimizer Compressed DB partition Query Processor Data Warehouse Informix SQL (via DRDA) Query Router IBM Smart Analytics Optimizer: Multiple blades in blade center Connects to IDS via TCP/IP & DRDA Analyzes, compresses, and loads Copy of (portion of) warehouse Partitioned among nodes Processes routed SQL query and returns answer to Informix Results
    • IBM Smart Analytics Optimizer Architecture Informix . Accelerator Services DRDA Informix Database SQL Applications DS 5020 SAN GPFS GPFS GPFS GPFS GPFS GPFS 8 Core 48 GB Coord . 8 Core 48 GB Coord . 8 Core 48 GB Worker 8 Core 48 GB Worker 8 Core 48 GB Worker 8 Core 48 GB Worker Blade Center - S Flash Flash Flash Flash Flash Flash FC Module FC ... Ethernet Module IBM Smart Analytics Optimizer
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • IBM Partner testing their customer warehouse. ISAO Accelerates Most the Longest-Running Informix Queries Average Speed-up = 116x 1 hour
    • Initial Profiling of a data mart at a government agency.  Performance expectation goals were up to 20X OLAP-style Queries  Tests were done on a 4 node Bladeserver S with Linux  Microstrategy Report was used, which generates 667 SQL statements  537 are SELECT statements. We can accelerate 432 through Informix Prototype & ISAO!  Datamart for this report has 250 Tables and 30 GB Data size  Informix Panther and ISAO run this report in 67 seconds.  7 seconds in ISAO and 60 seconds in Informix (TEMP table processing, etc)  Without ISAO, total runtime on Informix Panther on the same HW is 40 Minutes!  The same report today runs on XPS & SUN HW (Sparc M9000) and takes 90 mins.  Performance gain for the customer would be around 90x !!!
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • Getting Started: Loading Data into ISAO DBA Defines Mart (Data to Accelerate) A mart is a logical collection of tables which are related to each other. For example, all tables of a single star schema would belong to the same mart . The administrator uses a rich client interface in Data Studio to define the tables that belong to a mart , together with information about their relationships. Informix Automatically Copies Mart Informix creates definitions for these marts in its own catalog as a VIEW. The mart’ s data is read from the Informix tables and transferred to ISAO. ISAO Automatically Transforms & Loads Mart Data into a highly compressed, scan-optimized format that is kept locally (in memory) on ISAO Define Transform Informix
    • Informix Client Application Step 1. Submit SQL Database protocol: SQLI or DRDA Network protocol: TCP/IP or SHM Step 2. Return results/describe/error Database protocol: SQLI or DRDA Network protocol: TCP/IP or SHM Informix Warehouse -- Now
    • Informix ISAO Appliance 2. Datamart Definition IBM Optim Data Studio 1. Identify the datamart to offload. 4. Create the metadata 5. Issue Off-load Datamart command Datamart and Data off-loading from Informix to ISAO 3. Return the SQL representation 6. Off-load the data 9. Return ACK 7. Distribute the data among blades 8. Compress the data
    • Informix Client Application Step 1. Submit SQL Database protocol: SQLI or DRDA Network protocol: TCP/IP Step 5. Return results/describe/error Database protocol: SQLI or DRDA Network protocol: TCP/IP ISAO Appliance Step 3 offload SQL. DRDA over tcp/ip Step 4 Results DRDA over tcp/ip Query Processing – with Informix and ISAO 2. IDS query matching and redirection technology Local Execution
    • ISAO Supported Query Types IBM Smart Analytics Optimizer can process queries that contain: Most built-in SQL functions Most SQL data types Only one query block (SELECT... FROM... WHERE... ) at a time Informix routes to ISAO each query block of a query Queries including subquery predicates cannot be routed Only equi-joins (ON FACT.FK = DIM.PK) Referencing columns with compatible types SQL allows conversion without explicit cast Expressions such as A.YEAR = YEAR(B.TIMESTAMP) not supported Only Inner joins (explicit INNER JOIN or implicit join syntax), or <Fact table> LEFT OUTER JOIN <Dimension table>
    • Supported Queries in Percent per Workload 21,000 Queries from Various Workloads
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? The Query Engine Technology Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • What’s the Big Deal? What’s so Disruptive? ISAO rides the wave of hardware technology trends: Multi-core processors Large main memories Fast interconnects Increasing latency gap between DRAM and disk ISAO disrupts at least 3 major tenets that have been held sacrosanct for over 4 decades!
    • Disruption 1 of 3 Tenet #1: Data warehouses are too big for memory Consequence of Tenet #1: Disk I/O concerns dominate DBMS… Costs Performance Administration efforts Disruption #1: Huge, cheap main memories (RAM) and flash memories / SSDs Consequences of Disruption #1: Portions of warehouse can fit, if partitioned among multiple machines Compression helps! Must be more conscious of cache & memory bandwidth utilization No preferred access path for reads Read only columns referenced (like column stores) = 800 lbs.
    • Disruption 2 of 3 Tenet #2: Need many Indexes & MQTs for scalable OLAP performance Consequences of Tenet #2: Need an optimizer to choose among access paths Need a  wizard to design “performance layer” (expensive!) Must anticipate queries Large time to update performance layer when new data added Disruption #2: Massive parallelism achieves database scan in seconds! Arbitrarily partition database among nodes (32–64 GB RAM / node) On each node: Exploit multi-core architectures (1 user or DB cell / core) Eliminate major portions of scan by partition & column group elimination Consequences of Disruption #2: Only need to define 1 AQT in Informix to satisfy many queries on the accelerator Always scan tables!! Accelerator automatically does equivalent of partition elimination If literal is not in dictionary of that partition Accelerator itself doesn’t need Performance layer (indexes or materialized views)! Optimizer! Simpler! (no need for 4-star wizard) Lower TCO! Consistent response times
    • Disruption 3 of 3 Tenet #3: Main-memory DBMSs are the same as a big buffer pool Disruption #3: Clever engineering can save lots more! Examples of Disruption #3: Specialized order-preserving & fixed-length compression within partitions permits: Faster access Performing most operations on encoded values (saves decoding cost) Simultaneous application of predicate conjuncts (1 compare!) Cache-conscious algorithms make max. use of L2 cache and large registers Exploit multi-core processors Hash-based grouping avoids sorting
    • IBM Smart Analytics Optimizer (ISAO) – Agenda The Third Generation of Database Technology Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Informix Warehouse with IBM Smart Analytics Optimizer Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • Compression: Frequency Partitioning Top 64 traded goods – 6 bit code Rest Prod Origin Trade Info (volume, product, origin country) Histogram on Origin Histogram on Product Origin Product China USA GER, FRA, … Rest Table partitioned into Cells Column Partitions Vol Field lengths vary between cells Higher Frequencies  Shorter Codes (Approximate Huffman) Field lengths fixed within cells Cell 4 Cell 1 Cell 2 Cell 3 Cell 5 Cell 6 Common Values Rare values Number of Occurences
    • Query Processing Cell is also the unit of processing, each cell… Assigned to one core Has its own hash table in cache (so no shared object that needs latching!) Main operator: SCAN over compressed, main-memory table Do selections, GROUP BY, and aggregation as part of this SCAN Only need de-compress for aggregation Response time  (database size) / (# cores x # nodes) Embarrassing Parallelism – little data exchange across nodes Compressed and Partitioned Data Query Executor core + $ (HT) Dictionaries core + $ (HT) core + $ (HT) Cell 1 Cell 2 Cell 3
    • Fast Hash-based Grouping Encoding makes grouping simple! Coded values assigned densely (by construction) Hence, in principle , grouping is simple: aggTable[group] += aggValue Challenges: Fitting hash table in L2 cache Avoiding all branches in hash table lookup IBM Smart Analytics Optimizer adaptively uses one of 2 techniques, depending on # of distinct groups Use dictionary code as a perfect hash (i.e. collision-free) , OR aggTable[groupCode] += aggValue No branches, no hash function computation Works great if groupCode is dense – i.e., single column, or multiple column with little correlation Use usual linear probing Involves branches, random access, …
    • Joins Basic idea: Re-write Join as multiple scans : Over each dimension , to form: A list of qualifying primary keys (PKs), decoded A hash-map of primary key  auxiliary columns (those used later in query for GROUP BY, etc.) Over fact table: First convert PKs to foreign keys (FKs) in fact table column Apply as (very big) IN-list predicates (a semi-join), one per dimension Look up into hash-maps to pick up other columns Complete Grouping and Aggregation Snowflakes: apply repeatedly, outside in (1a) Dimension A scan(A) σ a IN-list of A.k 1 Hash-map: A.k 1  A.g (1b) Dimension B scan(B) σ b IN-list of B.k 2 , B.k 3 Hash-map: B.k 2 , B.k 3  B.h (2) Fact scan(F) σ f GROUP BY, Aggregation σ : F.fk 1 IN … σ : F.fk 2 , F.fk 3 IN … Look up values of g, h
    • IBM Smart Analytics Optimizer (ISAO) – Agenda Why and What is the IBM Smart Analytics Optimizer? ISAO Market – Business Intelligence ISAO Architecture It’s All About Performance! From the User’s Perspective What’s the Big Deal? Behind the Curtain – The Query Engine Technology ISAO vs. Column Stores Conclusions
    • Banks and Tuplets in IBM Smart Analytics Optimizer A bank is a vertical partition of a table, containing a subset of its columns Assignment of columns to banks is cell-specific, since column’s length Varies from cell to cell, but Is fixed within a cell Banks contain Concatenations of the fixed-length column codes Padded to the nearest fraction of a word length (8 / 16 / 32 / 64 bits). We call these word-sized units tuplets . ISAO’s bank-major layouts are a hybrid of row-major and column-major Bank β 1 (32 bits) Bank β 2 (32 bits) Bank β 3 (16 bits) Cell Block A 1 D 1 G 1 A 2 D 2 G 2 A 4 D 4 G 4 B 1 E 1 F 1 B 2 E 2 F 2 B 4 E 4 F 4 C 1 H 1 C 3 H 3 C 4 H 4 A 3 D 3 G 3 B 3 E 3 F 3 C 2 H 2
    • Register Stores Facilitate SIMD Parallelism Access only the banks referenced in the query (like a column store): SELECT SUM ( T.G ) FROM T WHERE T.A > 5 GROUP BY T.D Pack multiple rows from the same bank into the 128-bit register Enables yet another layer of parallelism: SIMD (Single-Instruction, Multiple-Data)! 32 bits 32 bits 32 bits 32 bits 128 bits Vector Operation A 1 D 1 G 1 A 2 D 2 G 2 A 4 D 4 G 4 Bank β 1 (32 bits) A 3 D 3 G 3 B 1 E 1 F 1 B 2 E 2 F 2 B 4 E 4 F 4 C 1 H 1 C 3 H 3 C 4 H 4 Bank β 2 (32 bits) Bank β 3 (16 bits) Cell Block B 3 E 3 F 3 C 2 H 2 Result 1 Result 2 Result 3 Result 4 Operand Operand Operand Operand
    • Simultaneous Evaluation of Equality Predicates State==‘CA’ && Quarter == ‘Q4’ State==01001 && Quarter==1110 Translate value query to Code query Row Mask Selection result … … … … 01001 0 1110 0 == & CPU operates on 128-bit units Lots of fields fit in 128 bits These fields are at fixed offsets Apply predicates to all columns simultaneously ! State Quarter 11111 0 1111 0
    • IBM Smart Analytics Optimizer vs. a Column Store Evaluation matches with Hardware Scan does sequential memory access Almost no branches Simultaneous predicate evaluation SIMD predicate evaluation Evaluation doesn’t match w/ Hardware: Column materialization involves random accesses Code logic involves many branches Predicate evaluation has to be done serially Evaluation Matches Hardware? Insert requires: Single update to each bank, 1 / bank  One I/O to one cell block Insert requires: Separate updates to every column  Multiple random I/Os, 1/column Updating Can skip blocks based upon predicates To answer query: Do multi-bank scan Like having an index on every column To answer query: Determine list(s) of matching records Intersect these lists on RID Query Processing Multiple columns / word  less padding overhead Every column padded to word boundary  more padding/column  worse compression Compression IBM Smart Analytics Optimizer Column Store Aspect
    • Refereed Publications in Top 3 Professional Conferences VLDB 2008: “Main-Memory Scan Sharing for Multi-core CPUs”, Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter Haas, Guy Lohman VLDB 2008: “Row-Wise Parallel Predicate Evaluation”, Ryan Johnson, Vijayshankar Raman, Richard Sidle, Garret Swart VLDB 2006 : “How to wring a table Dry: Entropy Compression of Relations and Querying Compressed Relations”, Vijayshankar Raman, Garret Swart SIGMOD 2007 : “How to barter bits for chronons: compression and bandwidth trade offs for database scans”, Allison L. Holloway, Vijayshankar Raman, Garret Swart, David J. DeWitt ICDE 2008 : “Constant-time Query Processing”, Vijayshankar Raman, Garret Swart, Lin Qiao, Frederick Reiss, Vijay Dialani, Donald Kossmann, Inderpal Narang, Richard Sidle VLDB = International Conference on Very Large Data Bases SIGMOD = ACM SIGMOD International Conference on Management of Data ICDE = IEEE International Conference on Data Engineering
    • Conclusions Radical changes are happening in hardware Large, cheap memories Multi-core processors promise cheap, massive CPU parallelism IBM Smart Analytics Optimizer exploits these trends: Special-purpose accelerator (BI only, snapshot semantics, no transactions) Main-memory DBMS Massive parallelism of commodity multi-core hardware (blade center format) Query processing on compressed values! Cache-conscious algorithms IBM Smart Analytics Optimizer speeds up your problem queries the most! IBM Smart Analytics Optimizer is an appliance that is transparent to the user Minimal set-up Applications need not change Tuning not needed! Lower TCO
    • Disclaimers This talk is about proof of concept of a technology developed at IBM research labs. This is NOT a talk on IBM Informix product or planned release. Please read the previous disclaimer page carefully.
    • Information and Analytics Communities On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more Find a community that interests you at… ibm.com /software/data/community Information Champions Recognizing individuals who have made the most outstanding contributions to Information Management communities ibm.com /software/data/champion
    • Thank You! Your Feedback is Important to Us Access your personal session survey list and complete via SmartSite Your smart phone or web browser at: iodsmartsite.com Any SmartSite kiosk onsite Each completed session survey increases your chance to win an Apple iPod Touch with daily drawing sponsored by Alliance Tech Session number 1971
    • Thank You Merci Grazie Gracias Obrigado Danke Japanese English French Russian German Italian Spanish Brazilian Portuguese Arabic Traditional Chinese Simplified Chinese Hindi Tamil Thai Korean