High Performance Analytics: SAS and Greenplum (SUNZ 2012)


Published in: Technology, Business

  1. 1. Greenplum Becomes the Foundation of EMC's Data Computing Division
     EMC ACQUIRES GREENPLUM
     "Greenplum, with expertise in the massively parallel arena, will give the storage giant a boost in big-data computing." – InformationWeek
     "For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant…." – Gartner
     Data Computing Division. © Copyright 2011 EMC Corporation. All rights reserved.
  2. 2. Big Data will revolutionise Data Warehousing and analytics. New Realities… New Demands!
     •  Do it faster
        –  Ingest more data
        –  Ingest it faster
        –  Keep it unsummarised, keep it for longer
     •  Be more responsive
        –  Unpredictable queries, rapidly evolving bespoke analytics
        –  New tools: Hadoop, MapReduce, Hive, HBase, R
     •  Manage new data types
        –  Manage and allow queries across structured, semi-structured and unstructured data
     •  Do it at a lower cost
  3. 3. Why Greenplum?
     •  EMC Greenplum is a shared-nothing, massively parallel processing (MPP) data warehouse system
     •  The core principle of data computing is to move the processing dramatically closer to the data and to the people
     Pillars: Fast Data Loading; Extreme Performance & Elastic Scalability; Unified Data Access
  4. 4. Structured Analytics (SQL, BI tools, analytical tools) and Unstructured Analytics (Hadoop, MapReduce)
     •  Clients: standard Business Intelligence and analytical tools. Clients see a single database.
     •  Master Server (primary server, plus hot failover): query planning & dispatch. Queries are distributed across all available resources.
     •  Segment Servers: query processing & data storage.
     •  Network Interconnect: shared-nothing, massively parallel processing means no bottlenecks and linear scalability.
     •  Data Sources (loading, streaming, etc.): external files, URLs, Hadoop (HDFS), WebServices (including from other DBs), O/S pipes (including from other DBs). Data loading also takes advantage of the MPP architecture.
     •  Greenplum handles structured, semi-structured and unstructured data.
  5. 5. Why is MPP different?
     MPP
     •  Queries shipped to each node simultaneously
     •  Execute in parallel on each segment instance
     •  Multiple pipelines of data
     •  Highly scalable topology
     •  Locks and buffers not shared
     Traditional
     •  Single database buffer used by all user operations
     •  More locks, which means a more complex lock management system
     •  Single pipe to data
     •  Limited scalability
     Greenplum is a scale-out architecture on standard commodity hardware.
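The "queries shipped to each node simultaneously" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not Greenplum's implementation: the `SEGMENTS` lists, `segment_scan` and `master_query` are hypothetical names, threads stand in for separate segment servers, and the values are the customer IDs from the Order table on the next slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of rows held by one segment.
SEGMENTS = [
    [12, 111, 42],
    [64, 32, 12],
    [34, 213, 15],
    [102, 82, 55],
]


def segment_scan(rows):
    """Work done locally on one segment: scan its own rows and pre-aggregate."""
    return sum(rows), len(rows)


def master_query(segments):
    """Send the same scan to every segment at once, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_scan, segments))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count, total / count


if __name__ == "__main__":
    print(master_query(SEGMENTS))
```

Each segment touches only its own rows, so no lock or buffer is shared between them; the master only merges small partial results.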
  6. 6. Partitioning: The Key to Parallelism
     Strategy: spread data evenly across as many nodes (and disks) as possible.
     Table "Order", loaded through the Greenplum Database High Speed Loader (slide dated 20/02/12):

     Order #   Order Date    Customer ID
     43        Oct 20 2005   12
     64        Oct 20 2005   111
     45        Oct 20 2005   42
     46        Oct 20 2005   64
     77        Oct 20 2005   32
     48        Oct 20 2005   12
     50        Oct 20 2005   34
     56        Oct 20 2005   213
     63        Oct 20 2005   15
     44        Oct 20 2005   102
     53        Oct 20 2005   82
     55        Oct 20 2005   55
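A minimal sketch of spreading the slide's Order rows across segments by hashing a distribution key. The assumptions are loud ones: the key is taken to be the order number, four segments are used, and a plain modulo stands in for the database's real hash function, which the slides do not describe. `distribute` is a hypothetical helper.

```python
from collections import defaultdict

# Rows from the slide's Order table: (order_no, order_date, customer_id).
ORDERS = [
    (43, "Oct 20 2005", 12), (64, "Oct 20 2005", 111), (45, "Oct 20 2005", 42),
    (46, "Oct 20 2005", 64), (77, "Oct 20 2005", 32), (48, "Oct 20 2005", 12),
    (50, "Oct 20 2005", 34), (56, "Oct 20 2005", 213), (63, "Oct 20 2005", 15),
    (44, "Oct 20 2005", 102), (53, "Oct 20 2005", 82), (55, "Oct 20 2005", 55),
]


def distribute(rows, num_segments, key_index=0):
    """Assign each row to a segment by hashing its distribution key.

    Assumption: key % num_segments is an illustrative stand-in for a real hash.
    """
    segments = defaultdict(list)
    for row in rows:
        segments[row[key_index] % num_segments].append(row)
    return dict(segments)


if __name__ == "__main__":
    for seg, rows in sorted(distribute(ORDERS, 4).items()):
        print(f"segment {seg}: orders {[r[0] for r in rows]}")
```

With only twelve rows the split is not perfectly even; the point is that every row's location follows from its key alone, so loading and scanning need no central lookup.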
  7. 7. Greenplum Database: Powerful Data Loading Capabilities
     •  Industry-leading performance:
        –  >10TB per hour per rack
     •  Innovative, parallel-everything architecture:
        –  Scatter-Gather Streaming™ provides true linear scaling
        –  Support for both large-batch and continuous real-time loading strategies
        –  Enables complex data transformations "in-flight"
        –  Transparent interfaces to loading via support files, applications and services
  8. 8. Traditional Loading vs Greenplum DB Parallel Loading
     [Diagram. Labels: ETL servers, conventional loading, interconnect, segment nodes.]
  9. 9. Advanced pipeline process for fast operation
     [Diagram. A client sort request passes through the Master Server to the Segment Servers; values shown: 9 6 10 2 11 5 4 3 12 1 7 8.]
  10. 10. Advanced pipeline process for fast operation
     [Diagram. Client, Master Server and Segment Servers; values shown: 1 3 5 2 6 8 4 7 10 9 11 12.]
  11. 11. Greenplum Database: Extreme Performance
     •  Optimized for BI and analytics
        –  Rich ecosystem of partners
     •  Provides automatic parallelization
        –  Just load and query like any database
        –  Tables are automatically distributed across nodes
        –  No need for manual partitioning or tuning
     •  Extremely scalable MPP shared-nothing architecture
        –  All nodes can scan and process in parallel
        –  Linear scalability by adding nodes
  12. 12. Platform Independence Delivers Choice and Flexibility
     Data Computing Appliance
     •  Optimized price/performance
     •  Minimum time-to-value
     •  Ideal for production environments
     Software-Only
     •  On your x86 hardware
     •  Flexibility for any workload
     Virtualized Infrastructure
     •  Pool resources
     •  Elastic scalability
  13. 13. Greenplum Polymorphic Data Storage
     [Diagram. Table 'Customer' partitioned by month from Jan '09 to Nov '09, with partitions stored as column-oriented with archival compression, column-oriented with fast compression, or row-oriented with fast compression.]
     •  Greenplum Database's engine provides a flexible storage model
        –  Four table types: heap, row-oriented, column-oriented, external
        –  Block compression: Gzip (levels 1-9), QuickLZ
     •  Storage types can be mixed within a database, and even within a table
        –  Fully configurable via table DDL and partitioning syntax
        –  You may also choose to index some partitions and not others
     •  Gives customers the choice of processing model for any table or partition
        –  Supports ILM scenarios, such as denser packing of older partitions
        –  Tables/partitions of different storage types can be joined together without restriction
        –  Highly tuned: for example, columnar storage does efficient pre-projection and parallel execution
  14. 14. Unified Data Access Across The Enterprise
     •  Workload Management
        –  Connection management controls how many users can be connected and assigns them to a queue
        –  User-based resource queues allow control of the total number or cost of queries allowed at any point in time
     •  Dynamic Query Prioritization
        –  Patent-pending technique of dynamically balancing resources across running queries
        –  Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
  15. 15. Greenplum Performance Monitor
     Highly interactive web-based performance monitoring. Real-time and historic views of:
     •  Resource utilization
     •  Queries and query internals
  16. 16. Key Technical Requirements for HPA
     •  Technical Values
        –  Performance: massively parallel architecture
        –  Load speeds: 10TB/hr
        –  Integration with SAS
        –  In-database analytics using Java, PL/R, etc.
        –  Integration with many more BI and analytical tools
        –  Integration with Hadoop for unstructured data analysis
     •  Financial Value
        –  Lower total cost of ownership
        –  Best price/performance ratio in the industry for an EDW/analytical appliance
     •  Operational Values
        –  No index maintenance
        –  Backup and recovery solution
        –  Most robust disaster recovery solution in the industry
        –  Best technical and customer support organization backing
  17. 17. A Few SAS Generalisations
     •  Large sequential reads and writes
     •  Reading and writing of data is done via the OS's file cache
     •  I/O throughput rate is restricted by how fast the OS's file cache can process the data
     •  A lot of temporary files can be created
  18. 18. An MPP SQL query – just for fun
     •  44TB and the query planner executes a sequential scan. There are 1,218 million rows of data and 1000 columns. 5 concurrent users are running the same query on a monthly data set.
     •  As a baseline: a single node on a typical high-end server with a single controller can read about 1.5GB per second into the database. At that rate one scan of our 44TB takes about 8.1 hours on a single node, so a DBMS deployed on a single node needs about 40.7 hours to serve all five concurrent scans.
  19. 19. An MPP SQL query – just for fun
     •  If we deploy over 8 nodes on a Greenplum cluster, the aggregate I/O bandwidth increases linearly to 12GB/sec. Our query will complete in 61 minutes.
     •  If we compress the rows then we can read more data with each I/O. Compression varies, but 2.5X is a reasonable estimate. So our effective scan rate improves by a factor of 2.5 and our query completes in 24.4 minutes.
  20. 20. An MPP SQL query – just for fun
     •  Partitioning allows us to split the data on each segment by a known value (by month in our example) and, if possible, read only the partitions selected. We scan only 1/84th (7 x 12 months) of the table. Our query completes in 17.4 seconds.
     •  Column-based compression is more effective than row-based compression. 10X column-based compression is a conservative estimate… 10X is 4 times better than the 2.5X row compression already built into our example. So now our table scan completes in 4.35 seconds.
  21. 21. An MPP SQL query – just for fun
     •  Columnar projection lets us perform I/O on only the columns we are interested in. Let's assume 500 of the 1000 columns in our example. By reading only 50% of the data we reduce our I/O by 50%, and our table scan completes in 2.175 seconds. If 5 people were executing the same query concurrently, and each person was configured to have an equal share of the system resources, then each person's query would complete in 10.9 seconds.
  22. 22. An MPP SQL query – just for fun
     •  Note that queries that touch two months touch twice as much data and would complete in 4.35 seconds, four months in 8.7 seconds, and so on: it is scalable and robust.
     •  Also note that joins are implemented using a shared-nothing approach, meaning that they scale up as well.
     •  We can apply indexes if necessary to further improve query performance.
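The arithmetic in the walkthrough above can be replayed as a short calculation. This is a sketch under assumptions: 1 TB is taken as 1000 GB (which is what reproduces the slides' 61 minutes), concurrent queries are modelled as an equal split of bandwidth, and `scan_seconds` is a hypothetical helper, not a Greenplum function. The slides round some results down slightly (for example 17.4 s where the formula gives about 17.46 s).

```python
TABLE_GB = 44_000        # 44 TB, assuming 1 TB = 1000 GB
NODE_GB_PER_SEC = 1.5    # per-node read rate quoted on the slide


def scan_seconds(nodes=1, compression=1.0, partition_fraction=1.0,
                 column_fraction=1.0, concurrent=1):
    """Estimated elapsed time of a full sequential scan, in seconds.

    Bandwidth scales linearly with nodes; compression, partition pruning and
    column projection shrink the bytes read; concurrent queries share I/O.
    """
    bandwidth = NODE_GB_PER_SEC * nodes
    physical_gb = TABLE_GB * partition_fraction * column_fraction / compression
    return concurrent * physical_gb / bandwidth


if __name__ == "__main__":
    steps = [
        ("1 node", dict()),
        ("1 node, 5 users", dict(concurrent=5)),
        ("8 nodes", dict(nodes=8)),
        ("+ 2.5X row compression", dict(nodes=8, compression=2.5)),
        ("+ 1 of 84 partitions", dict(nodes=8, compression=2.5,
                                      partition_fraction=1 / 84)),
        ("columnar 10X instead", dict(nodes=8, compression=10,
                                      partition_fraction=1 / 84)),
        ("+ 500 of 1000 columns", dict(nodes=8, compression=10,
                                       partition_fraction=1 / 84,
                                       column_fraction=0.5)),
        ("+ 5 concurrent users", dict(nodes=8, compression=10,
                                      partition_fraction=1 / 84,
                                      column_fraction=0.5, concurrent=5)),
    ]
    for label, kwargs in steps:
        print(f"{label:26s} {scan_seconds(**kwargs):12.2f} s")
```

Doubling `partition_fraction` to 2/84 doubles the result, which matches the two-month case on the last slide of the walkthrough.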
  23. 23. An MPP SQL query – Summary
  24. 24. Multiple options for SAS & GP Deployments
     •  SAS Access, Greenplum database
     •  SAS Grid
     •  SAS In-Database
     •  SAS In-Memory
  25. 25. Multiple options for SAS & GP Deployments: SAS Access, Greenplum database
     •  Provides integration capability to Greenplum
     •  Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release
     •  Products: SAS Access for Greenplum
     •  libname myGP ODBC server=gplum04 db=customers port=5432 user=gpusr1 password=gppwd1;
  26. 26. Multiple options for SAS & GP Deployments: SAS In-Database (In-Database Scoring, In-Database Analytics)
     •  SAS Enterprise Miner models execute within the Greenplum database.
     •  Automatically translates and publishes the model as a scoring function inside the database.
     •  High-performance model scoring with faster time to results
     •  Products: SAS Scoring Accelerator
     Note: for Greenplum, this will only be available in the next release of 9.3, slated for the end of this year.
  27. 27. Multiple options for SAS & GP Deployments: SAS In-Database (In-Database Scoring, In-Database Analytics)
     •  Execution of key SAS analytical, data discovery and data summarization tasks in the database.
     •  Reduces the time needed to build, execute and deploy powerful predictive models.
     •  Improves data governance on predictive analytics projects and produces faster, better results.
     •  Products: SAS Analytics Accelerator
     Note: this is currently on the roadmap for Greenplum and will be available with future SAS versions.
  28. 28. Multiple options for SAS & GP Deployments: SAS Grid
     •  SAS running on a cluster of servers for better performance
     •  This can provide some acceleration on the Base SAS procs with Greenplum as the database, as it allows the database to make use of parallel processing
     •  Products: SAS Access for Greenplum
  29. 29. Multiple options for SAS & GP Deployments: SAS In-Memory
     •  This is a complete big data stack offering fast loading, robust data management and complex analytics in a purpose-built environment.
     •  Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
     •  Products: GP & SAS HPA
     Note: Available in Q4 2011
  30. 30. SAS / Greenplum Product Overview (SAS High Performance Computing)
     SAS Access
     •  Provides integration capability to a number of databases
     •  Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release
     •  Products: SAS Access for Greenplum
     SAS Grid
     •  Utilized to run SAS on a grid of commodity servers instead of a large UNIX server or mainframe
     •  Limited impact to SAS jobs and users, but simplified operations. Generally uses more CPUs for improved performance
     •  Products: SAS Access for Greenplum, SAS Grid
     SAS In-Database
     •  Allows certain models to be pushed into the database for execution. Requires SAS Enterprise Miner in order to be utilized
     •  Will lead to significant (20x or more) improvement in performance versus non-database deployments
     •  Products: SAS Access for Greenplum, SAS Grid, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum
     SAS In-Memory (HPA)
     •  New functionality from SAS that requires a dedicated database appliance
     •  Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
     •  Products: SAS Access for Greenplum, SAS Grid, SAS High Performance Analytics
  31. 31. In-Database Roadmap for Greenplum

     SAS Product                    Capability                                                        Greenplum Status
     Base SAS®                      Descriptive Statistics / Query and Reporting – SQL Pushdown      Available in 2011 Q4 (9.3 M)
     SAS/Access® Interface          Database Specific Integration and Connectivity                   Available
                                    Support for SAS Format Function                                   Available in 2011 Q4 (9.3 M)
     SAS® Data Integration Studio   Data Extraction, Load and Transformation                          Available
     SAS® Scoring Accelerator*      Production Batch Scoring / Real Time Scoring                      Available in 2011 Q4 (9.3 M)
  32. 32. What is SAS High Performance Analytics for GP?
     •  It's software (GP DB, SAS HPA)
     •  It combines parallel execution with in-memory processing
     •  It allows large volumes of data to be handled quickly
     •  It includes a select set of procedures from the following SAS products: Base SAS, SAS/STAT, SAS/ETS, SAS/OR and SAS Enterprise Miner.
  33. 33. Why are GP & SAS a good match?
     •  Greenplum & SAS already work well together via SAS/Access and the Scoring Accelerator
     •  GP & SAS represent an end-to-end analytics infrastructure, including rapid data loads, powerful ETL, and parallel data computing for reports and analytics
     •  Greenplum delivers extreme performance via the MPP architecture, which is optimized for faster query execution and unmatched data loading
     •  Rapidly deployable and designed for massive growth
     •  SAS & GP are working to develop advanced solutions with deeper connectivity. This solution will represent the state of the art in high-performance, scalable, advanced analytics
  34. 34. Some Greenplum Big Data References
     •  The Greenplum Database supports up to 2^48 (2 to the power of 48) rows per table. One Greenplum customer, Fox Interactive Media, has a trillion-row fact table and is adding a further 3TB per day in a true mixed-workload environment supporting production reporting, ad-hoc data mining, and operational data services.
     •  Another client, an online eCommerce company, had approximately 21TB in their Greenplum instance with 10 nodes at the last site visit. They load between 10 and 30 million rows a day, but the issue is frequency and complexity rather than size. There are 2,000 Informatica workflows per day and complex hourly loads (up to 300 Greenplum loads per batch, with 9,000 Greenplum loads every day).
     •  They have 5,000 tables, 350,000 columns, 4,000 views, 1,600 indexes, and relational and dimensional models, heavily relational/3NF as they had a legacy Teradata DW that Greenplum replaced. There are hourly metadata/schema/table changes in response to the hourly data loads.
     •  This client is averaging around a million SQL statements per day. They have heavy spikes during peak hours and maintain a Cognos reporting SLA of 100k queries per hour. They have over 1000 Cognos users and 50% of the workload is Cognos; these are mostly small statements. 25% is financial reporting and 10% is CRM. The remaining 15% is ad-hoc work by power users and analysts, with many significantly large queries of 25-50 slices (and up to 100 slices). They have dependent views to 4 levels of nesting: view (great-grandchild) -> view (grandchild) -> view (child) -> view -> table.
  35. 35. Some SAS & Greenplum Customers
     •  RWS, in Singapore, used MS SQL Server as their reporting environment. Their reporting and ETL processes were very slow and the DWH environment was limited in terms of scalability. They were looking for an in-database platform that can work with SAS. We won a competitive PoC last quarter and the system is currently being implemented. They will be using GP & SAS as an EDW to store and analyze customer trends.
     •  AIS, a telco in Thailand, migrated a Teradata DWH as well as 2 Oracle DWHs onto a single Greenplum cluster, demonstrating the schema independence of the database. The system has expanded to 70 TB across 32 servers. AIS uses SAS as their analytical platform.
     •  The Australian Tax Office uses Greenplum as an investigatory tool in their Compliance and Audit Logging Unit. They are an extremely happy reference customer, citing Greenplum's ability to pull in data from multiple sources and quickly analyse the data without needing to create complex data models or even indices.
     •  Inland Revenue Service was running on an Oracle DWH and had problems with analytical report processing time. We won this deal in Q3 and it is currently in the implementation phase.
     •  Samsung Life Insurance had a 50TB Sybase DWH that they had spent 8 years building. They ran out of performance but were able to migrate the entire environment to Greenplum in 3 months. They had approx. 400,000 reports across 4 tools (SAS, Webfocus, MSTR, OLAP); only about 100 required tuning.
  36. 36. Greenplum Customers: Government
     •  Pacific Northwest National Labs (Dept. of Energy) does cyberanalytics.
     •  USAspending.gov traces the outlays of the US Federal Government.
     •  The Federal Reserve Bank of Kansas City does economic analysis, mostly related to the housing market.
     •  Recently, the Internal Revenue Service purchased a DCA to do work related to fraudulent tax returns.
     •  ATO uses GP as an investigatory tool in their Compliance and Audit Logging Unit.
  37. 37. SAS AND EMC GREENPLUM INTEGRATED ARCHITECTURE
     [Diagram. Roles: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin (the Data Science Team). Layers: SAS Business Intelligence; Greenplum Chorus - Analytic Productivity Layer; SAS Analytics; Data Access & Query Layer; Greenplum Database and Greenplum Hadoop; Private/Hybrid Cloud Infrastructure or Appliance; SAS Information Management.]
  38. 38. High Performance Analytics: 'The power to know fast'
  39. 39. Questions?