Robert Hartevelt, IBM - PureData System For Analytics - BI Symposium 2012
  • The benefit of the Pure Systems approach can be summarized in four basic tenets: SPEED, SIMPLICITY, SCALABILITY, and SMARTS, which we like to refer to as the 4 S's. Speed: PureData for Analytics is 10-100X faster than traditional systems, like Oracle. When analytic queries take seconds instead of hours, customers get the opportunity to completely rethink their business processes and, in some cases, even launch entirely new businesses. Simplicity: PureData for Analytics is unlike anything that DBAs and IT teams have experienced in the past. Whereas Oracle and Teradata data warehouses require armies of specialists to manage, PureData for Analytics offers performance out of the box, without requiring any tuning, indexing, aggregations, etc. Scalability: a single appliance scales to more than a petabyte of user data capacity, not just acting as a repository for information, but allowing complex analytics to be conducted at scale, on all the enterprise data. Smarts: by embedding analytics deep into the data warehouse, PureData for Analytics powers high-performance advanced analytics 100s or even 1000s of times faster than was possible before.
  • About 80% of the components in the TwinFin are manufactured by IBM. A fully populated rack is a TwinFin 12. The Host: active/passive failover configuration. Disk drives: housed at the top of the rack. S-Blades: IBM technology married to the Netezza FPGA cores.
  • Netezza adds the FPGA to the IBM BladeCenter blade using IBM’s “side-car” technology, which matches a set of FPGA cores to a corresponding set of processor cores. The result is called the Database Accelerator card.
  • Looking at the S-Blade components side by side, you can easily see where Netezza came up with the TwinFin name for the product. A twin-fin surfboard is a board with two fins, and this style of board rides much faster than its single-fin counterparts. Two fins are better than one. The standard blade is to the left, and the Netezza FPGA “side-car” is shown to the right. This allows the FPGA cores to have visibility to all the data and processing on the corresponding blade. A SAS Expander Module is included in both halves of the S-Blade to allow the processors and FPGAs to have access to the disk drives holding the warehouse data. There is always an even balance of S-Blade components: 1 CPU core : 1 FPGA core : 2 GB RAM : 1 disk drive. The 8 CPU cores and 8 FPGA cores on each S-Blade are mapped to 8 individual disk drives. This is the nature of the MPP shared-nothing architecture that ensures linear scalability.
  • The architecture is a combination of SMP and MPP, so we call it AMPP: Asymmetric Massively Parallel Processing architecture. SQL statements come into the host over the network from external applications, e.g., from the BI application shown, through an open API like ODBC. The front end of the Netezza appliance is a lightweight host with about 300 GB of storage. None of the data lives on the host. The host receives SQL statements, compiles the SQL and builds a compiled query plan. The host broadcasts the query instructions to all of the data “nodes”. Connecting the major components of the Netezza system is a 10 Gbit internal Ethernet network. Storage is implemented with an MPP architecture: many (100s or 1000s) nodes process data queries in parallel. Each data processing node is attached to a single disk drive that is divided into three partitions. Each node processes its slice of the table data for the given query. Each node transmits its results back to the host, which accumulates the results from all nodes into the final result set to be returned to the calling application or user. This is the classic divide-and-conquer strategy. In and of itself it is not unique in the industry; it is common to most MPP architectures. One difference is that the back-end disk storage is not directly accessible by the SMP host.
  • A key component of Netezza’s performance is the way in which its streaming architecture processes data. The Netezza architecture uniquely uses the FPGA as a turbocharger: a huge performance accelerator that not only allows the system to keep up with the data stream, but actually accelerates the data stream through compression before processing it at line rates, ensuring no bottlenecks in the I/O path. You can think of the way that data streaming works in the Netezza as similar to an assembly line. The Netezza assembly line has various stages in the FPGA and CPU cores. Each of these stages, along with the disk and network, operates concurrently, processing different chunks of the data stream at any given point in time. The concurrency within each data stream further increases performance relative to other architectures. Compressed data is streamed from disk onto the assembly line at the fastest rate that the physics of the disk allows. The data could also be cached, in which case it is served right from memory instead of disk. The first stage in the assembly line, the Compress engine within the FPGA core, picks up the data block and uncompresses it at wire speed, instantly transforming each block on disk into 4-8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse: the disk. The uncompressed block is then passed on to the Project engine, which filters out columns based on parameters specified in the SELECT clause of the SQL query being processed. The assembly line then moves the data block to the Restrict engine, which strips off rows that are not necessary to process the query, based on restrictions specified in the WHERE clause. The Visibility engine also feeds additional parameters to the Restrict engine, to filter out rows that should not be “seen” by a query, e.g., rows belonging to a transaction that is not yet committed. The Visibility engine is critical in maintaining ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds in the Netezza. The processor core picks up the uncompressed, filtered data block and performs fundamental database operations such as sorts, joins and aggregations on it. It also applies complex algorithms that are embedded in the snippet code for advanced analytics processing. Finally, it assembles all the intermediate results from the entire data stream and produces a result for the snippet. The result is then sent over the network fabric to other S-Blades or the host, as directed by the snippet code.
  • We do not have indexes. They are not an option; they simply do not exist. There is no disk administration or system administration. On day 2, the customer has a pool of performant disk ready. Upgrades are performed by Netezza as a standard maintenance tech support call. Does Oracle help you go from 9i to 10g? Instead of spending time and effort on tedious DBA tasks, use the time for higher BUSINESS VALUE tasks: bring on new applications and groups; quickly build out new data marts; provide more functionality to your end users.
  • As data volumes grow, Oracle complexity increases. As new indexes are created in Oracle, you break existing reports. All of this (indexes, partitioning) is an attempt to outguess the user’s data access. Netezza is database 101. This is as complicated as it gets.
  • Let's focus on the High-Performance series, as this is the model that the majority of our existing systems equate to and that will form the majority of Data Warehouse/Data Mart implementations. You can see that the systems start as small as 3 S-Blades (24 each of disks, FPGAs, and processors). You can then grow this system within the frame up to 12 S-Blades, 96 processors, and 125 TB of user table space. Beyond this you can add additional frames, right up to 10 frames with a processing power of 896 Snippet Processing Units and 1.25 PB of storage. The Entry Level Development and Test system enables customers to have smaller Dev/Test environments while maintaining the processor/FPGA/disk ratios.
  • We make some very bold claims: that we can deliver 10-100X performance that can transform businesses; that our appliance is the “true appliance”, very simple to deploy and manage, requiring no tuning; that TwinFin can scale to a petabyte-plus of user data while delivering the 10-100X performance, and help you keep up with growing data; and that our true appliance is Smart and will serve as the foundation for better, more intelligent decision making in the enterprise. These are not claims we make in a vacuum. We will back these claims up: on your data, at your site, with our appliance, at no risk to you. Take TwinFin for a Test Drive and find out for yourself what the TwinFin can do for you!
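
The AMPP flow and the streaming "assembly line" described in the notes above can be sketched in a few lines of Python. This is only an illustrative toy, not Netezza code: the table contents, the `comment` and `committed` columns, and the function names `load_slices`, `run_snippet` and `run_query` are all assumptions made for the example, `zlib` stands in for the FPGA Compress engine, and the nodes run one after another here rather than in parallel.

```python
import json
import zlib
from collections import Counter

# Hypothetical toy rows; "comment" is a column the query never needs.
ROWS = [
    {"district": "North", "productgrp": "A", "nrx": 10, "month": 20091201, "committed": True, "comment": "a"},
    {"district": "North", "productgrp": "A", "nrx": 5, "month": 20091201, "committed": True, "comment": "b"},
    {"district": "South", "productgrp": "B", "nrx": 7, "month": 20091201, "committed": True, "comment": "c"},
    {"district": "South", "productgrp": "B", "nrx": 3, "month": 20091101, "committed": True, "comment": "d"},
    {"district": "North", "productgrp": "B", "nrx": 4, "month": 20091201, "committed": False, "comment": "e"},
    {"district": "South", "productgrp": "A", "nrx": 8, "month": 20091201, "committed": True, "comment": "f"},
]

# Columns the query touches (SELECT, WHERE, and the visibility flag).
NEEDED = ("district", "productgrp", "nrx", "month", "committed")


def load_slices(rows, n_slices):
    """Spread rows round-robin over the data slices and store each slice compressed."""
    slices = [[] for _ in range(n_slices)]
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return [zlib.compress(json.dumps(s).encode()) for s in slices]


def run_snippet(blob, month):
    """One node's work on its own slice, following the stage order in the notes."""
    rows = json.loads(zlib.decompress(blob))              # Uncompress
    rows = [{k: r[k] for k in NEEDED} for r in rows]      # Project: drop unused columns
    rows = [r for r in rows
            if r["month"] == month and r["committed"]]   # Restrict + Visibility
    partial = Counter()                                   # CPU stage: aggregate
    for r in rows:
        partial[(r["district"], r["productgrp"])] += r["nrx"]
    return partial


def run_query(blobs, month):
    """Host side: send the same snippet to every slice and merge the partial results."""
    total = Counter()
    for blob in blobs:
        total.update(run_snippet(blob, month))
    return dict(total)
```

Calling `run_query(load_slices(ROWS, 3), 20091201)` returns the summed `nrx` per (district, product group). The answer does not depend on how many slices the rows were spread over, which is the point of the shared-nothing design.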
  • Transcript

    • 1. IBM PureData System for Analytics (Powered by Netezza technology). Robert Hartevelt, 27 November 2012. © 2009 IBM Corporation
    • 2. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable; Seeing = believing: POC; Q&A. Timings shown on the slide: 5 minutes, 25 minutes.
    • 3. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza.
    • 4. Why workload-optimized systems for analytics? Days for a single query; constant tuning; specialized resources required; months to deploy. “Nearly 70% of data warehouses experience performance-constrained issues of various types.” - Gartner 2010 Magic Quadrant
    • 5. Traditional data warehouses are just too complex. They are based on databases optimized for transaction processing, NOT to meet the demands of advanced analytics on big data. Too complex an infrastructure; too inefficient at analytics; too complicated to deploy; too many people needed to maintain; too much tuning required; too costly to operate; too long to get answers.
    • 6. Netezza’s revolutionary approach: The Appliance. Simpler, faster, more accessible analytics. “This is what Netezza has done in the data warehousing market: It has totally changed the way we think about data warehousing.” - Philip Howard, Bloor Research
    • 7. Appliance: dedicated device; optimized for purpose; complete solution; fast installation; very easy operation; standard interfaces; low cost.
    • 8. PureData System for Analytics Transforms the User Experience. Purpose-built analytics engine; integrated database, server and storage; standard interfaces; low total cost of ownership. Speed: 10-100x faster than traditional systems*. Simplicity: minimal administration and tuning. Scalability: peta-scale user data capacity. Smart: high-performance advanced analytics. * Based on IBM customers’ reported results. “Traditional custom systems” refers to systems that are not professionally pre-built, pre-tested, and optimized. Individual results may vary.
    • 9. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza. So what? Purpose-built appliance.
    • 10. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart.
    • 11. The IBM Netezza Architecture. Optimized Hardware + Software: purpose-built for high-performance analytics; requires no tuning. Streaming Data: hardware-based query acceleration; blistering fast results. True MPP: all processors fully utilized; maximum speed and efficiency. Deep Analytics: complex analytics executed in-database; deeper insights.
    • 12. The IBM Netezza S-Blade™
    • 13. S-Blade™ Components. IBM BladeCenter Server: Intel Quad-Core, DRAM, SAS Expander Module. Netezza DB Accelerator: Dual-Core FPGA, SAS Expander Module.
    • 14. The IBM Netezza AMPP™ Architecture. [Diagram labels: Applications (Advanced Analytics, BI, ETL, Loader); Hosts; Network Fabric; S-Blades™ (each with FPGA, CPU, Memory); Disk Enclosures; IBM Netezza Appliance.]
    • 15. Our accelerator. Query: select DISTRICT, PRODUCTGRP, sum(NRX) from MTHLY_RX_TERR_DATA where MONTH = 20091201 and MARKET = 509123 and SPECIALTY = GASTRO. FPGA Core: Uncompress (slice of table MTHLY_RX_TERR_DATA, compressed); Project (select DISTRICT, PRODUCTGRP, sum(NRX)); Restrict, Visibility (where MONTH = 20091201 and MARKET = 509123 and SPECIALTY = GASTRO). CPU Core: Complex ∑, Joins, Aggs, etc. (sum(NRX)).
    • 16. Massively Parallel Processing Architecture: “Divide and conquer”. MPP, the “Shared Nothing” concept, divides the work into smaller tasks: a big task is sliced vertically into a series of smaller tasks; the smaller tasks run independently; the work is automatically balanced among the tasks to minimize the time to complete; each task is assigned the same amount of physical resources; communication between tasks takes place only at the beginning and end of the task. Benefits: a large task completes in a short elapsed time; maximizes use of resources. Points of attention: complexity of administration and management; communication bottlenecks.
    • 17. SMART. Bring functions close to the data: analytical functions; Informatica ETL; SPSS functions; Revolution R; ESRI geospatial functions.
    • 18. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart. So what? Purpose-built; out-of-the-box performance.
    • 19. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable.
    • 20. Appliance Simplicity. No indexes and tuning. No storage administration: no dbspace/tablespace sizing and configuration; no redo/physical/logical log sizing and configuration; no page/block sizing and configuration for tables; no extent sizing and configuration for tables; no temp space allocation and monitoring; no RAID level decisions for dbspaces; no logical volume creations of files; no integration of OS kernel recommendations; no maintenance of OS recommended patch levels; no JAD sessions to configure host/network/storage. No software installation.
    • 21. Traditional Complexity ... Netezza Simplicity
      Netezza:
        CREATE TABLE MRDWDDM.RDWF_DDM_ROOMS_SOLD (
          ID_PROPERTY numeric(5, 0) NOT NULL,
          ID_DATE_STAY integer NOT NULL,
          CD_ROOM_POOL CHAR(4) NOT NULL,
          CD_RATE_PGM CHAR(4) NOT NULL,
          CD_RATE_TYPE CHAR(1) NOT NULL,
          CD_MARKET_SEGMENT CHAR(2) NOT NULL,
          ID_CONFO_NUM_ORIG integer NOT NULL,
          ID_CONFO_NUM_CUR integer NOT NULL,
          ID_DATE_CREATE integer NOT NULL,
          ID_DATE_ARRIVAL integer NOT NULL,
          ID_DATE_DEPART integer NOT NULL,
          QY_ROOMS integer NOT NULL,
          CU_REV_PROJ_NET_LOCAL numeric(21, 3) NOT NULL,
          CU_REV_PROJ_NET_USD numeric(21, 3) NOT NULL,
          QY_DAYS_STAY_CUR smallint NOT NULL,
          CD_BOOK_SOURCE CHAR(1) NOT NULL)
        distribute on random;
      No indexes. No physical tuning/admin. Stripe data randomly, or by columns.
      ORACLE indexes:
        CREATE INDEX "MRDWDDM"."RDWF_DDM_ROOMS_SOLD_IDX1" ON "RDWF_DDM_ROOMS_SOLD"
          ("ID_PROPERTY", "ID_DATE_STAY", "CD_ROOM_POOL", "CD_RATE_PGM", "CD_RATE_TYPE", "CD_MARKET_SEGMENT")
          PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(FREELISTS 10)
          TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING PARALLEL (DEGREE 4 INSTANCES 1)
          LOCAL(
            PARTITION "PART1" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
            PARTITION "PART2" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
            PARTITION "PART3" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
            PARTITION "PART4" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
            PARTITION "PART5" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
            PARTITION "PART6" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING ) ;
      ORACLE bitmap index:
        CREATE BITMAP INDEX "CRDBO"."SNAPSHOT_MONTH_IDX13" ON "SNAPSHOT_OPPTY_MONTH_HIST" ("SNAPSHOT_YEAR")
          PCTFREE 10 INITRANS 2 MAXTRANS 255
          STORAGE(INITIAL 4194304 NEXT 4194304 MINEXTENTS 2 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
          TABLESPACE "SFA_DATAMART_INDEX" NOLOGGING ;
      ORACLE table clusters:
        CREATE CLUSTER "MRDW"."CT_INTRMDRY_CAL" ("ID_YEAR_CAL" NUMBER(4, 0), "ID_MONTH_CAL" NUMBER(2, 0), "ID_PROPERTY" NUMBER(5, 0))
          SIZE 16384 PCTFREE 10 PCTUSED 90 INITRANS 3 MAXTRANS 255
          STORAGE(INITIAL 83886080 NEXT 41943040 MINEXTENTS 1 MAXEXTENTS 1017 PCTINCREASE 0 FREELISTS 4 FREELIST GROUPS 1 BUFFER_POOL RECYCLE)
          TABLESPACE "TSS_FACT" ;
    • 22. PureData System for Analytics (powered by Netezza) Models (scale marker on slide: 1 ... 10)
      Model: N1001-002, N1001-005, N1001-010, N1001-020, N1001-040, N1001-100
      Active S-Blades: 4, 7, 14, 28, 56, 140
      Capacity (TB): 8, 16, 32, 64, 128, 320
      Compression (TB): 32, 64, 128, 256, 512, 1280
      Capacity = user data space. Compression = effective user data space.
    • 23. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable. So what? Purpose-built; meet BI demands out of the box; simple gives new opportunities.
    • 24. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable; Seeing = believing: POC.
    • 26. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable; Seeing = believing: POC. So what? Purpose-built; meet BI demands out of the box; simple gives new opportunities; seeing = believing.
    • 27. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable; Seeing = believing: POC; Q&A.
    • 28. Questions
    • 29. PureData for Analytics (powered by Netezza Technology). Introduction to Netezza; Speed & Smart; Simplicity & Scalable; Seeing = believing: POC; Q&A. So what? Purpose-built; meet BI demands out of the box; simple gives new opportunities; seeing = believing. You can find us at the IBM booth.
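
As a rough check on the model table on slide 22, the relationship between "Capacity" and "Compression" (effective user data space) can be worked out directly. The sketch below copies the two rows of numbers from the slide. The helper names `implied_ratios` and `effective_capacity` are assumptions made for illustration, and the ratio is simply what the slide's figures imply, not a guaranteed compression rate.

```python
# Values copied from slide 22, in slide order (smallest to largest model).
CAPACITY_TB = [8, 16, 32, 64, 128, 320]        # "Capacity": user data space
EFFECTIVE_TB = [32, 64, 128, 256, 512, 1280]   # "Compression": effective user data space


def implied_ratios(capacity, effective):
    """Compression ratio implied by each pair of figures on the slide."""
    return [e / c for c, e in zip(capacity, effective)]


def effective_capacity(capacity_tb, ratio):
    """Effective user data space for a given raw user capacity and assumed ratio."""
    return capacity_tb * ratio


RATIOS = implied_ratios(CAPACITY_TB, EFFECTIVE_TB)
```

Every entry of `RATIOS` comes out at 4.0, so the table assumes 4x compression across all models. That is the low end of the "4-8 blocks in memory" range quoted in the speaker notes for the Compress engine.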
