Elastic Platform for Business Analytics
A good summary of how Sybase IQ responds quickly and flexibly to the world of business analytics and to today's evolving technology.


Elastic Platform for Business Analytics: Document Transcript

  • WINTER CORPORATION WHITE PAPER. SAP Sybase IQ 15.4: An Elastic Platform for Business Analytics. The Large Scale Data Management Experts. SPONSORED RESEARCH PROGRAM
  • WINTER CORPORATION. SAP Sybase IQ 15.4: An Elastic Platform for Business Analytics. April 2012. 245 First Street, Suite 1800, Cambridge MA 02145. 617-695-1800. Visit us at www.wintercorp.com. ©2012 Winter Corporation, Cambridge, MA. All rights reserved.
  • SAP Sybase IQ 15.4: An Elastic Platform for Business Analytics. A Winter Corporation White Paper.

    Executive Summary

    Executives around the world are intensely focusing on business analytics. They see an analytical approach to business decisions, an approach based on more abundant data and mathematical analysis of that data, as the cornerstone of new strategies for profitable operation, profitable growth, new product development and customer engagement.

    The opportunity to benefit from business analytics is especially large right now, in part because businesses have access to “big data”: enormous, previously unavailable volumes of data on the actions, interests and sentiment of customers; on the movement of products, components and raw materials through the supply chain and the distribution chain; and on many other aspects of the operation of businesses and their market environment. Perhaps surprisingly, the challenge of “big data” is not only the data volume. It is also that much of the new data is less structured and less regular than the tabular corporate data that has been the focus of data warehousing in the past. The new big data comes from new or greatly expanded sources: social media, rapidly proliferating smart mobile devices, vehicles, and a dizzying array of new sensors and intelligent products.

    Even beyond the challenges of big data, there are other obstacles to success with business analytics: data analysis can be a cumbersome, slow, frustrating and expensive process. First you have to find the data you want. Then you have to get it loaded into a repository where it is accessible. Then you need to cleanse it, organize it and integrate it with other data of interest. Then you have to conduct the analysis. Every step is bedeviled by practical difficulties, not least the difficulty of getting help from people with the right skills.

    New open source technology has emerged and is being deployed for “big data”; new vocabulary includes terms such as “Hadoop clusters” and “MapReduce.” This technology brings new benefits for certain types of information and analysis. However, it also creates one more data silo in a world in which there are already too many silos. The complete analytical process thus gets enhanced in some areas but also becomes more fragmented: to get analytic results and business solutions, stakeholders must contend with a yet more complex environment with net new skill requirements.

    The new, highly analytical business strategies place a particular emphasis on prediction. Knowing what happened yesterday isn’t enough: you need to predict which of the business actions in front of you is likely to produce the best result. As well as judgment, you need facts, data and analysis to back that decision. And you must take into account the new data sources: the customer sentiment expressed on social media; the customer behavior evident from new data sources and devices; the subtle patterns that can be seen in purchase behavior, web browsing and many other sources; and the supply chain realities now visible as parts, components, goods and materials move around the world and are affected by weather, catastrophes and human events.

    Often, to the decision maker, the unfortunate reality is that predictions of which profitable customers are at risk may indeed be extremely valuable, but getting such predictions before it’s too late is easier said than done. For many enterprises, then, the key to the analytic opportunity is finding a way to make the entire analytic process work smoothly, conveniently, responsively and cost effectively, whether the analysis focuses on the tabular data most frequently used for the past 25 years; on newer data sources, such as sentiment expressed in social media; or both.

    In response to this challenge, SAP has introduced a new version of its flagship analytic DBMS product, SAP Sybase IQ 15.4, as a platform and an integrated environment to support and facilitate the customer’s entire analytic process.

    Copyright © 2012, WINTER CORPORATION, Cambridge, MA. All rights reserved.
  • In addition to a greatly enhanced DBMS engine for data warehousing, Sybase IQ 15.4 features significant new capabilities for business analytics and big data. Highlights are:

      • A new analytic services layer that supports the use of MapReduce and many other analytic functions on data within Sybase IQ itself;
      • Parallel interaction between Sybase IQ and Hadoop;
      • Support of R, the open source language for statistical analysis;
      • Support of new third party SQL-callable functions for data mining and predictive analytics;
      • An expanded ecosystem for the support of third-party applications for information lifecycle management, business intelligence and data integration, predictive analytics and system/data administration.

    At the core of Sybase IQ 15.4 is the most mature column store DBMS for data warehousing on the market, with sophisticated capabilities for data compression, query processing and query optimization: an engine with a long record of exceptional query performance and efficiency. While column storage and column-oriented data compression have been “hot trends” for the last few years, Sybase IQ was built from day one with these capabilities: its users have been benefitting from them for more than a decade. And they contribute significantly to the efficiency of Sybase IQ for analytics.

    In addition to the remarkably efficient storage and query processing technology at its core, Sybase IQ 15.4 features PlexQ™ technology, a distinctive, elastic design that supports highly parallel query processing and data loading along with independent scaling for data growth and workload growth.

    WinterCorp, an independent expert in analytic data management and big data, has been invited by Sybase Inc, an SAP company, to review its new product, SAP Sybase IQ 15.4. To conduct its review, WinterCorp reviewed product designs and documentation, and engaged in technical discussions of the product architecture with key employees at SAP/Sybase and with independent parties. This White Paper, sponsored by Sybase Inc, an SAP company, presents WinterCorp’s views and findings from that review.
  • Table of Contents

    Executive Summary
    1 Introduction
    2 Architecture of SAP Sybase IQ 15.4
      2.1 A Platform for Business Analytics
    3 The SAP Sybase IQ 15.4 Core Data Management Infrastructure
      3.1 Data Load Performance and Scalability
      3.2 Column-Store Storage Efficiency, Indexing, and Compression
      3.3 Query Processing Performance and Scalability
      3.4 Very Large Database (VLDB) Management and Backup
      3.5 In-Database Analytics
      3.6 Text Search and Analysis
    4 The Application Services Layer
      4.1 “MPP Enabled” User Defined Functions (UDF)
      4.2 Protected Java UDFs
      4.3 In-Database MapReduce
      4.4 Simulation for In-Database Development
      4.5 Hadoop Interface
      4.6 Geospatial/Geometric Data & Query Support
      4.7 Free Express Edition
    5 The Ecosystem Layer
      5.1 SAP BusinessObjects Portfolio Support
      5.2 “R” Language Support
      5.3 MapReduce-Enabled Data Mining
      5.4 Social Network Analysis Modules
      5.5 Sybase PowerDesigner 16 Architecture Recommender
      5.6 In-Database PMML
    6 Conclusions
  • 1 Introduction

    This paper examines the architecture and capabilities of SAP Sybase IQ 15.4 with a particular focus on demanding new requirements for business analytics and big data.

    Business Analytics. People who have been involved with data warehousing for the last decade or more, especially those with a technical background in the field, are often puzzled by the new wave of executive interest in “business analytics.” A common question is, “Aren’t we doing that already?” Surely, the reason all that data has been modeled, cleansed, integrated and stored in data warehouses for the last ten or twenty years is so that it can be analyzed!

    Certainly there has been analysis going on with data warehouse data. But, from the perspective of the business manager or business end user, data warehousing and business intelligence in practice has too often meant little more than ‘routine-ized’ reporting; extraction to other applications and systems; and the occasional ad hoc query. Sure, business intelligence tools have steadily improved; data may be delivered on nicer looking, more functional electronic reports and dashboards; data access may be more interactive; and data may even be available on mobile devices. All of these advances add some value.

    But most end users will still tell you the same thing: most of what they have been doing with the data warehouse has been “looking in the rear view mirror.” Often, business users learn what has happened from the data warehouse. They learn which products have been selling; which customers have been buying; which suppliers have consistently delivered on time… these insights are treasured when good information was not previously available as a basis for decision making.

    The problem is that the practice of business management has moved on from that point. Looking in the rear view mirror is no longer enough.

    Increasingly, operating and strategic decisions must be based on forward looking analysis with a mathematically sound foundation. The analytical approaches to business exemplified in Competing on Analytics [1] and a series of subsequent books, and in the best selling popular book and recent hit movie, Moneyball [2], have influenced business culture. These accounts and many others have shown how business performance can undergo radical improvement when the decision making process looks forward with analytics. At the heart of this revolutionary analysis is better prediction: whether of the performance of a baseball player, of a product, of a service, or of the behavior of a customer.

    Methodology & Sponsorship. This WinterCorp Executive Report describes two trends, business analytics and “big data,” and the approach to them adopted in SAP Sybase IQ. In developing this report, WinterCorp drew on its own independent research and experience; interviewed SAP Sybase IQ employees; and reviewed SAP Sybase IQ product materials. In its capacity as the sponsor of this report, Sybase Inc, an SAP company, was provided an opportunity to comment on the paper with respect to facts. WinterCorp has final editorial control over the content of this publication and is solely responsible for any opinions expressed.

    [1] Competing on Analytics: The New Science of Winning, Thomas Davenport and Jeanne Harris, Harvard Business School Press, 2007 (www.tomdavenportbooks.com)
    [2] Moneyball: The Art of Winning an Unfair Game, Michael Lewis, W.W. Norton & Company, 2003
  • And, while you may feel that your data warehouse already has the capabilities to support these analytics, there is more to the story.

    Big Data. As predictive analytics have been gaining ever more significance in business circles, another trend, big data, has made a profound impact on business and data strategies.

    “Big data” is a broad phenomenon encompassing the rise of social media; the seemingly sudden proliferation of machine generated data; the worldwide spread of mobile intelligent devices, including smart phones and tablets; the widespread use of GPS data, which attaches a location to many events in daily life; and rapid decreases in cost associated with capturing, delivering and storing a wide range of previously costly varieties of data, including voice, image, video, etc.

    Taking all of these phenomena together, we are witnessing an enormous explosion of data which is many times larger and faster growing than what we have seen in data warehouses over the last decade. While the transactional information about customers, products, stores and the like is still uniquely valuable, and plays a central role in understanding any business, there is now new and unprecedented information available that can provide business, engineering, scientific and medical insights never before available.

    To provide one example, a useful technique in customer retention is to observe when a profitable customer’s activity with a credit card begins to decline and then react quickly to retain the customer before the account is cancelled. When this technique works it is much more efficient than acquiring a new customer that is equally or more profitable.

    But what if you could know earlier, before the usage declined, that the customer was at risk? Perhaps the retention rate would become yet higher and the retention cost lower, particularly if you could discover the reason that the customer relationship was threatened. If you knew the reason, then your actions to deal with it could be yet more efficiently directed at the root cause.

    But how could you know earlier? One possibility is social media. If you are engaged with your customers on social media, they may tell you what they are thinking: that they like the service or the incentives or the prices offered by a competitor; that they don’t like your call center or your fees. Or, if they have opted into your social media program, they may let you see what they are saying to others about your product or service.

    The enormous flood of data pouring out of social media is one of many examples of big data. Data is also pouring out of a growing tide of products that we use every day, and to the extent that we opt in, manufacturers can gain precious knowledge about how, when, and where we use products, and what problems we have with them. This is clearly the case today with smartphones and tablets. Vehicles are becoming more intelligent and more connected and will increasingly provide similar capabilities (more expensive commercial vehicles, such as helicopters, already provide telemetry data that is used to optimize safety and maintenance). The trend will spread to many other products that we use every day, in every case generating yet more “big data” for analysis.

    New Tools and Technologies. The concurrent rise of predictive analytics and big data has generated interest in new tools and technologies for several reasons.

    First, much of the big data does not fit closely with the relational database model. Much of the significance of the data is not revealed by fitting it into a tabular structure. Social media data has textual, image, audio, video and other components that must be analyzed primarily by specialized or procedural functions; SQL solves a relatively small part of the problem here. Embedded in the data is a social graph which is most readily analyzed outside of SQL.
  • In general, a significant element of the new, more predictive analysis, especially of the newly varied and highly voluminous “big data,” is best attacked with tools other than SQL. In connection with this, interest has grown in MapReduce, a parallel data analysis framework, and Hadoop, an open source engine for running MapReduce jobs.

    Some data analysis jobs can be readily performed in a Hadoop cluster. Others may require the services of a data warehouse, such as SAP Sybase IQ. Yet others may best be handled with a combination of the two.

    Regardless of where the data is stored, interest has also grown rapidly in other analysis tools, such as the open source statistical analysis language, R. In general, the new business analytics will use SQL and the data warehouse, but will also create a strong demand for other tools.

    Data Strategies. As enterprises grapple with this rapidly changing world of big data, they need a data infrastructure that will enable them to implement analytic business strategies. Especially with regulatory and governance requirements enforcing longer periods of data retention, enterprises need a convenient, flexible, cost effective process for solving analytic data problems from beginning to end.

    Sybase seeks to address that customer need for a comprehensive approach to business analytics through its new capabilities in SAP Sybase IQ 15.4.
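The MapReduce model mentioned above can be illustrated with a small single-process sketch. This is not Hadoop and not the Sybase IQ implementation; the sample posts and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to show the three steps a MapReduce job goes through.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical sample input: short social media posts (illustrative data only).
POSTS = [
    "love the new card rewards",
    "the call center wait is too long",
    "love the mobile app",
]

def map_phase(post: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one post."""
    for word in post.lower().split():
        yield word, 1

def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)

def word_count(posts: Iterable[str]) -> dict[str, int]:
    # A real framework runs the map calls in parallel across nodes;
    # here they run sequentially to keep the sketch self-contained.
    mapped = (pair for post in posts for pair in map_phase(post))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(POSTS)
```

Because each map call sees only one post and each reduce call sees only one key, both steps can be spread across many machines, which is the property that makes the model attractive for large, loosely structured data.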
  • 2 Architecture of SAP Sybase IQ 15.4

    Software relational database engines have been commercially available since the 1970s. Most of these products were originally conceived as row storage engines for online transaction processing. A notable exception is SAP Sybase IQ. Conceived from its earliest days as a column-storage, analytical DBMS, Sybase IQ was in many ways ahead of its time. It was the first commercial column storage engine; the first to put a major emphasis on data compression; and one of the earliest to place a strong emphasis on complex queries and analytics, rather than on online transaction processing.

    Sybase IQ has come into widespread use, with thousands of customer installations, and thus has developed into a reliable, highly usable, comprehensive product for data warehousing and business intelligence. But, with Sybase IQ 15.4, that distinctive engine architecture has been expanded into something more: a platform for large scale business analytics.

    This section discusses the new capabilities of Sybase IQ 15.4 and describes how they support and enable analytics for the data warehouse and for the newer phenomenon of big data.

    2.1 A Platform for Business Analytics

    With the introduction of Sybase IQ 15.4, SAP has expanded its IQ product line from data warehouse engine to business analytics platform, as shown in Figure 1.

    The core data management infrastructure, represented by the innermost layer in Figure 1, is a high performance column storage analytic database engine. In recent releases, the core data management infrastructure has been enhanced with SAP’s patented PlexQ™ technology, which SAP characterizes as a massively parallel shared everything architecture. The combination of the relatively new PlexQ™ technology and Sybase IQ’s previously developed grid structure results in an elastic architecture, one in which capacity is readily added or removed. The underlying database engine, a distinctive design with sophisticated column storage, compression and indexing techniques, has long established advantages in query performance. In Sybase IQ 15.4, the core data management infrastructure is further enhanced with new capabilities for large object compression and high performance bulk inserts via the industry standard ODBC and JDBC interfaces. The core infrastructure has several other noteworthy features, highlights of which are discussed in Section 3.

    Figure 1: Sybase IQ 15.4 as a Platform for Business Analytics
  • The Application Services Layer, shown in Figure 1, is a greatly expanded set of services designed specifically for the development and support of analytic applications. It also provides facilities for users and partners to develop and use their own analytic functions that Sybase IQ will run in parallel against the database. This layer provides major new services, including an implementation of native MapReduce that runs in parallel against the database and also provides connectivity with Hadoop. The Application Services Layer is described further in Section 4.

    The Ecosystem Layer, represented by the outermost layer in Figure 1, is an environment in which SAP and its partners can provide and support analytic applications and tools, as well as the business intelligence tools that have long been available with Sybase IQ. Key elements of this layer that are new in Sybase IQ 15.4 include:

      • expanded support for all major Business Intelligence and Data Integration tools, including optimizations for SAP BusinessObjects products;
      • the R language, an open source language for statistical analysis;
      • a library of MapReduce enabled data mining functions that will run in parallel against data in Sybase IQ;
      • a set of social network analysis modules; and
      • packaged applications for analytics and data lifecycle management.

    The Ecosystem Layer is yet another significant enhancement to the analytic capabilities of Sybase IQ and a third major element of SAP’s initiative to make Sybase IQ a major platform for business analytics. The Ecosystem Layer is described in Section 5.

    While Sybase IQ has long enjoyed a respected presence in data warehousing, increasing its customer base over the last few years from about 2,000 to over 4,500 installations, Sybase IQ 15.4 is clearly something new and different from what Sybase has offered before. As well as significant continuing enhancement to its core DBMS engine for data warehousing, SAP is now offering an array of capabilities for business analytics with Sybase IQ.
  • 3 The SAP Sybase IQ 15.4 Core Data Management Infrastructure

    The core infrastructure of SAP Sybase IQ has been enhanced significantly over the last three releases with the implementation of elastic PlexQ™ grids for highly parallel processing. The elastic PlexQ™ grid preserves the advantages of the earlier Sybase IQ architecture, a sophisticated form of shared data clustering, while adding scale out processing for queries, loads and other large data warehouse operations.

    In prior releases, Sybase IQ could run queries and loads in parallel across a single node. In Sybase IQ 15.4, with PlexQ™, the system can run an individual query or load in parallel across multiple nodes. This ability to scale out for individual queries and loads enables Sybase IQ 15.4 to handle significantly larger scale data warehouses and analyses.

    In addition, as in prior releases, Sybase IQ 15.4 can spread the work of multiple users across the nodes of the grid. Also, grid nodes can be grouped and assigned to specific workloads or user populations, making it possible to dedicate a chosen set of nodes to a particular purpose. New nodes can be added to the cluster as the workload grows, providing an elastic character to the system. Figure 2 below provides an overview of the core data management infrastructure:

    Figure 2: SAP Sybase IQ Core Infrastructure with PlexQ™ Technology. Source: Adapted from a diagram by SAP Inc.

    Sybase IQ runs on Red Hat and SUSE Linux 64/32 bit systems, Windows 64/32 bit, AIX 64 bit, Sun Solaris 64 bit, and HP-UX 64 bit systems, allowing customers to independently optimize storage, caching, processors, memory, threading, and load distribution.

    3.1 Data Load Performance and Scalability

    Sybase IQ data load performance and scalability depend primarily on seven factors:

    1. PlexQ™ technology, which makes it possible to spread the work of a load job across multiple nodes of an elastic PlexQ™ grid.
    2. Highly efficient bulk inserts via ODBC and JDBC, a new feature in Sybase IQ 15.4. Many third party tools and applications that load via industry standard interfaces will therefore load large data volumes much more rapidly. In some practical cases, for example when third party ETL tools are used, speedups of 100 times have been measured. [3]
    3. Fast, flexible load processing built into the engine at the most fundamental level.
    4. Versioning to minimize contention between data-load and query processing.
    5. Automated, flexible remote loads.
    6. “Near-real-time” “trickle-feed” loads.
    7. Sybase’s ETL (extract, transform, load) utility.

  • Fast Load Processing. Sybase IQ provides specific features for speeding column-store data loading. In the batch case, a column-store approach allows loads to be in “flat schema” (or “semi-normalized”) format; that is, users can avoid the added space and complexity of storing the data as multiple tables. Sybase IQ’s architecture allows parallelism in loading, including parallel feeds from distributed clients (the “grid”) to multiple servers and parallelism by using multiple processors for parallel storage of individual tables and columns in the target data-warehouse database. Sybase IQ loads only those columns that have changed in a given row (or, of course, in the entire data store). This typically allows Sybase IQ to create loads a fraction of the size of the comparable row-store relational approach.

    Versioning. As the changed columns are loaded, they do not replace old columns. Rather, new versions are created and old ones maintained while needed by ongoing queries. Within a new column version, only changed pages create new storage. Thus, Sybase IQ querying is not interrupted during data loading, data loading is not blocked by ongoing querying, and additional storage for versioning is minimized.

    Automated, Flexible Remote Loads. Sybase IQ allows scale-out loading across its grid architecture. Data can be pulled from the clients, or “pushed” by the client to the server via ODBC. The utility also enables data loading from SAP Sybase ASE, Microsoft SQL Server and Oracle data stores.

    Near-Real-Time Loads. Sybase IQ supports “micro-bursts” of “microbatched” incremental data loads (i.e., not the constant stream of updates of an OLTP database, but column changes accumulated over a minute or two, loaded at once). For example, Replication Server, Real Time Loading Edition 15.5, allows delivery of changed data to the Sybase IQ data store within minutes of a data change elsewhere. This keeps the data current in “near real time.” Combined with versioning, it keeps the data current without interrupting ongoing queries.

    Sybase InfoPrimer ETL Tool. This coordinates data loading, including data cleansing as necessary. It takes advantage of the features described in 1-4, and operates multi-threaded, for a high degree of concurrency and/or parallelism. InfoPrimer ETL combines loading and indexing (a chunk of data and its indexes are treated as a single object item) for additional ETL speedup. An SAP utility automates data loading from SAP Sybase ASE, Microsoft SQL Server and Oracle data stores. Sybase IQ also supports the SAP BusinessObjects Data Services ETL tool and other third-party ETL tools such as those of Informatica, Syncsort, and DataStage. Note also that Sybase IQ supports “Extract Load Transform” schemes, in which database functionality or stored procedures are used to speed some forms of data transformation, as well as “change data capture” via Replication Server.

    [3] Note that bulk inserts were efficiently implemented in prior releases for the native application interface.
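The versioning behavior described above (new column versions on load, old versions kept for running queries, new storage only for changed pages) can be sketched as a simple copy-on-write structure. This is an illustrative model under stated assumptions, not Sybase IQ's actual storage format: the names `ColumnVersion`, `VersionedColumn`, `snapshot` and `load` are hypothetical, and cleanup of versions that no query needs any longer is not modeled.

```python
from dataclasses import dataclass

Page = tuple[int, ...]  # one immutable page of column values


@dataclass(frozen=True)
class ColumnVersion:
    """An immutable snapshot of a column: an ordered tuple of pages."""
    version_id: int
    pages: tuple[Page, ...]


class VersionedColumn:
    """Hypothetical copy-on-write column: loads never modify old versions."""

    def __init__(self, pages):
        self._versions = [ColumnVersion(0, tuple(tuple(p) for p in pages))]

    def snapshot(self) -> ColumnVersion:
        """A query pins the latest version when it starts and keeps reading it."""
        return self._versions[-1]

    def load(self, changed: dict[int, Page]) -> ColumnVersion:
        """Create a new version; unchanged pages are shared, not copied."""
        latest = self._versions[-1]
        pages = list(latest.pages)            # copies references only
        for page_no, new_page in changed.items():
            pages[page_no] = tuple(new_page)  # new storage for changed pages
        new_version = ColumnVersion(latest.version_id + 1, tuple(pages))
        self._versions.append(new_version)
        return new_version


column = VersionedColumn([[10, 20], [30, 40], [50, 60]])
query_view = column.snapshot()       # a long-running query starts here
loaded = column.load({1: (31, 41)})  # a load changes page 1 only
```

In this sketch the running query continues to see page 1 as `(30, 40)` while new queries see `(31, 41)`, and pages 0 and 2 are the same objects in both versions, which mirrors the claim that additional storage for versioning is limited to the changed pages.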
  • SAP Sybase IQ 15.4: An Elastic Platform for Business Analytics 13 A WINTER CORPOR ATION WHITE PAPER3. 2 CO LUM N -STO RE STO R AGE E FFICIE N C Y, INDE XING , AND COM PRESSIO NA key differentiator for Sybase IQ is its ability to store data in the minimum amount of space ondisk or in main memory, which has a dramatic positive effect on performance and scalability.Relational data stores in row format, by and large, already minimize duplication of records (rows).However, relational row stores duplicate columns within a row even when there is no data in thecolumn, and store the same value in a column multiple times. Sybase IQ’s columnar-data-storeapproach does not store non-existent column data, and stores each distinct value only once (Figure4). For example, where a relational row store may store the “Married” value (or any other value) inthe customer-marital-status field in every row, the columnar approach stores a pointer to onecentral instance of each value in the field. Figure 3: SAP Sybase IQ’s Columnar Data Storage Source: SAP Inc.Many queries in BI, complex or otherwise, analyze data using only a few fields in a record, or onlya few columns in a row. For queries involving analysis of many rows, this means that a row-basedquery engine will retrieve much more data than necessary, slowing performance, while a column-based query engine like Sybase IQ will retrieve only those efficiently-stored columns applicableto the analysis. Add Sybase IQ’s ability to partition data according to columns and thus avoid someindexing performance overhead (discussed in VLDB Management, below), and the more thatSybase IQ scales, the greater the frequency and size of its performance advantage.Note that other queries may favor a row-based approach—for example, those that access a smallnumber of rows and a large number of columns. The design philosophy of Sybase IQ argues thatsuch queries typically comprise a modest fraction of the workload in an analytic database. 
Therefore the gains from a column-store approach will dominate the performance tradeoffs. While Sybase IQ was alone in advancing this argument ten and more years ago, several products have since incorporated some column storage features or capabilities in response. However, few products have been designed with a column storage approach from the ground up, and Sybase IQ remains the most mature of these.

To improve storage efficiency, Sybase IQ's column-store architecture adds data compression, leveraging its storage of a single data type per column per data page. Aside from the standard methods of compressing individual word strings, Sybase IQ offers bit-mapped indexing where appropriate. In bit-mapped indexing, low-cardinality column data values are represented as bit strings, and query operations can be carried out as bit operations, for a performance speedup of two orders of magnitude. In fact, Sybase IQ provides compression not only of the data, but also of its indexes (Figure 4).

In Sybase IQ 15.4, data compression is enhanced further for large data objects, providing a critical new capability for unstructured data. The enhanced data compression applies to variable length and fixed length character and binary large objects (VARCHAR, VARBINARY, CHAR and BINARY). In early use of these features, data has compressed from 3 times to 16 times more than with prior releases of Sybase IQ. This enhanced compression means fewer disk I/O operations to read and write the same data, thus enhancing performance. Large objects are especially prevalent in the new "big data" arena, where unstructured and semi-structured data accounts for most of the increased volume.

Figure 4: SAP Sybase IQ Data Compression (Source: SAP Inc.)

Many relational databases "retrofit" compression into their database engine by decompressing the data before processing it. Sybase IQ was designed for query processing without decompression, so that all operations use the compressed data, and the only time data is decompressed is when processing is finished and the data is being sent to an end user to read in a report. Also, Sybase IQ performs "perfect prefetch of pages," because it knows from its bitmaps exactly which pages have to be fetched in sequence. The result is an increase in the amount of data that can be stored in main memory, allowing in-memory-database-like performance plus scalability beyond an in-memory database.

Sybase IQ's indexing schemes complement its columnar storage and compression approaches. In particular, Sybase IQ offers a wide range of indexing schemes that allow columns with different characteristics to be stored in less space (Figure 5).
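The idea of storing each distinct column value once and keeping only a pointer per row, and of answering a query from the compact form, can be sketched in a few lines of Python. This is an illustrative sketch of generic dictionary encoding, not Sybase IQ's actual storage format; the function names and sample data are hypothetical.

```python
from collections import Counter

def dictionary_encode(values):
    """Store each distinct value once; rows keep small integer codes."""
    dictionary = []   # distinct values, in first-seen order
    code_of = {}      # value -> position in the dictionary
    codes = []        # one integer code per row
    for v in values:
        if v not in code_of:
            code_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_of[v])
    return dictionary, codes

def count_by_value(dictionary, codes):
    """Aggregate on the codes; decode only the final, small result."""
    counts = Counter(codes)
    return {dictionary[code]: n for code, n in counts.items()}

# Hypothetical customer-marital-status column.
marital_status = ["Married", "Single", "Married", "Married", "Divorced", "Single"]
dictionary, codes = dictionary_encode(marital_status)
summary = count_by_value(dictionary, codes)
```

In this sketch the string "Married" is held once in `dictionary`, while the rows hold only integers, and the count is computed on the integers before any value is decoded.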
Figure 5: Forms of Indexing Supported by SAP Sybase IQ

Index Name | Type of Data Useful For | Type of Query Operation Useful For | Data Type
Fast Project (Default) | All columns with < 16M unique values | Projections with scalar aggregates | All
High Group | High cardinality columns | Large joins, GROUP BYs | All except BITs/CHARs > 255
High Non Group | High cardinality columns | Range searches | Mainly for integers and CHARs < 255
Low Fast | Columns with < 1000 unique values requiring fast lookup | Projections, joins, scalar aggregates | All except BITs/CHARs > 255
Date, Time, DateTime | Columns with DATE, TIME, DATETIME data types | Queries with dates, times, timestamps ranges/compares | DATE, DATETIME, TIME only
Compare | Two columns with identical data types (for comparison operations) | <, >, = compares | Mainly for integers and CHARs
Word | Data types involving strings and words | Dictionary lookup | CHARs, VARCHARs only
Text | Data types involving strings and words | Complex text terms/phrase searches including boolean, proximity, and scoring | CHARs, VARCHARs only

This broad range of indexing techniques is partly baked in (that is, data loading will automatically index data in a compressed form for storage efficiency), but also allows the customer further flexibility to create additional indexes to deliver performance for the customer's unique querying patterns. Because indexes are highly compressed, users can create a multitude of them up front in anticipation of future ad hoc queries. An "index advisor" built into the query optimizer assists the user by suggesting indexes that will improve query performance. Sybase IQ's column store architecture aggressively encourages usage of indexes (in many cases multiple indexes per column) on which predicates are applied to obtain a speedup. Figure 6 shows how Sybase IQ's data-storage approach can minimize I/Os.

Note also that SAP Sybase IQ can fetch data in large page sizes (typically 64K), which can reduce disk I/O significantly.

Figure 6: SAP Sybase IQ Query I/O Reduction (Source: SAP Inc.)

EXAMPLE:
    select sum(sales)
    from customers
    where state = 'NY'
    and class = 'A'

Sybase IQ will use the LF (Low Fast) indexes to filter rows and then apply the result to the HNG (High Non Group) index to compute the sum. A minimal amount of data is read to resolve the query.
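The Figure 6 example can be approximated with a small bitmap sketch in Python. It assumes one bitmap per distinct value in each filter column (Python integers serve as bit strings) and combines the two predicates with a single AND before touching the sales column. This is a simplified illustration of the general technique, not the internal layout of Sybase IQ's LF or HNG indexes; the data and helper names are hypothetical.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_of(bitmap):
    """Yield the row numbers whose bits are set in the bitmap."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1

# Hypothetical customers table, stored column by column.
state = ["NY", "CA", "NY", "NY", "TX"]
klass = ["A", "A", "B", "A", "A"]
sales = [100, 250, 75, 40, 300]

state_idx = build_bitmaps(state)
class_idx = build_bitmaps(klass)

# where state = 'NY' and class = 'A'  ->  one bitwise AND
matching = state_idx.get("NY", 0) & class_idx.get("A", 0)

# select sum(sales): only the qualifying rows of one column are read
total = sum(sales[r] for r in rows_of(matching))
```

Only the two small bitmaps and the qualifying entries of `sales` are read; no other column of the table is touched.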
3.3 QUERY PROCESSING PERFORMANCE AND SCALABILITY

The Sybase IQ query-processing engine is built to take advantage of all of Sybase IQ's storage, Shared Everything PlexQ™ architecture, and versioning capabilities. The cost-based optimizer can load-balance a query across processors and systems, while constantly updating its "sense" of the relative load on each processor/system. The optimizer also factors in the size of the compressed/indexed data and its presence or absence in main memory, ensuring quicker data access and processing. Sybase IQ can dynamically adjust its query execution plan based on concurrent workload, after having started the execution of the query. Sybase IQ 15.4 rebalances query resources (threads, processors, and cache) every several seconds, to maintain query performance for both long-running/larger and short-time-period/smaller queries. Note that the intelligence of the cost-based optimizer allows users to flexibly deploy heterogeneous small-scale servers if needed, each with its own SLA (service level agreement).

Once the query is optimized, the engine carries out pipelining of operations within queries as well as parallelism within and across queries. That is, a query that involves an initial load and sort followed by a join might begin the join operation for one column value immediately, without waiting for all data to have been sorted. When one processor is finished sorting a column value, it might move to sort the next, passing the value to the "join" processor. Multiple pipelines may operate in parallel for different sets of data within a query. In the case of joins, in particular, Sybase IQ provides two levels of parallelism, in which parts of data to be joined may be "grouped" initially for separate, parallel processing, and then the groups may be joined together in a second step.

In the case of column data that uses bit-mapped storage and indexing, the engine takes an additional step. It combines bitmaps (that is, performs bit operations) early, in order to reduce the number of times that the engine needs to actually "touch" a data item. In this case, Sybase IQ never needs to do a table scan.

3.4 VERY LARGE DATABASE (VLDB) MANAGEMENT AND BACKUP

The larger Sybase IQ implementations typically manage hundreds of terabytes of data; a few Sybase IQ systems manage petabytes of data, according to SAP.

Moreover, Sybase IQ allows administrators to bind tables, indexes, and columns to particular storage structures, thus placing less fresh groups of data on more price-performant storage (offline disk or nearline tape) without significant diminution of performance. Logical "groups" can be moved (e.g., from disk to tape) with simple commands, as when "aging" data becomes ready for archiving. Sybase PowerDesigner (also part of the Sybase Workspace IDE) enables creation of programs that generate reports based on the data's logical "age." To complement logical "data age" partitioning, Sybase IQ supports physical range partitioning of columns/tables based on the values in a "date created" or "date last modified" field. Older data can be marked "read-only," avoiding the need for further backup (see Figure 7).

Adding and removing data can have significantly less impact on performance (and hence on the need for retuning) than in row-based systems. Specifically, if a field needs to be added to or removed from the data, the change does not require reallocation of each row in storage or immediate redefinition of all affected row indexes, and does not "lock up" rows during the addition/removal process. Moreover, efficient data and column representation means quicker field addition or removal.
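The range partitioning on a "date created" field described above can be pictured with a short Python sketch that routes rows to storage tiers by date range. The boundaries, tier names and row layout are hypothetical, chosen only for illustration; they do not reflect Sybase IQ's partitioning syntax or its storage structures.

```python
import bisect
from datetime import date

# Hypothetical partition boundaries and storage tiers, oldest first.
BOUNDARIES = [date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)]
TIERS = ["archive", "nearline", "standard", "fast"]

def tier_for(created):
    """Return the tier whose date range contains `created`."""
    # bisect_right makes each boundary the inclusive start of the newer range.
    return TIERS[bisect.bisect_right(BOUNDARIES, created)]

def partition_rows(rows):
    """Group rows by tier according to their date_created field."""
    parts = {tier: [] for tier in TIERS}
    for row in rows:
        parts[tier_for(row["date_created"])].append(row)
    return parts

rows = [
    {"id": 1, "date_created": date(2009, 6, 1)},
    {"id": 2, "date_created": date(2010, 1, 1)},
    {"id": 3, "date_created": date(2011, 5, 5)},
    {"id": 4, "date_created": date(2012, 3, 1)},
]
parts = partition_rows(rows)
```

Because every row in the oldest range lands in one partition, that partition as a whole could be moved to cheaper storage or marked read-only without touching newer data.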
Figure 7: Data Partitioning Allows the Placement of Older Data on Lower Cost Storage (Source: SAP Inc.)

In general, Sybase IQ emphasizes ease of tool use by administrators. They can perform most needed operations via the Sybase Central GUI (graphical user interface), and SAP anticipates releasing a Web version of administrative tools with the same functionality within the next 12 months. Areas that administrators may tune include data modeling and ETL. At the same time, Sybase IQ automates job load balancing within a node, as well as ETL-based data-load balancing.

Sybase IQ supports active-passive disaster recovery, with manual failover of a single failed node. Sybase IQ's Virtual Backup integrates with the storage subsystem to create and periodically resynchronize shadow data-device copies online, with delayed logged writing of updates to the shadow. Effectively, this means that during normal processing, backup overhead is minimal, and "virtual restore" involves only roll-forward of changes not yet applied to the shadow, often a matter of seconds.

Note that Sybase IQ reduces the amount of pre-aggregation/materialization and index creation work required of the typical data-warehouse administrator. Sybase IQ's columnar approach effectively aggregates data according to columns and values in a column; index compression is carried out during data loading; indexes can be created automatically "on the fly" by the query engine, and can be based on usage patterns rather than pre-defined by the administrator.

Security schemes involve both data communications (e.g., RSA, FIPS 140-2, Kerberos) and data storage. Data storage encryption is applied to the entire database and to particular columns (using Sybase IQ's AES 128-bit encryption or an optional FIPS 140-2 certified version of the encryption).

3.5 IN-DATABASE ANALYTICS

Using stored procedures or user-defined functions compiled and optimized within the database engine's process is a time-honored way to improve performance of key query types. SAP extends the notion to encompass not only built-in math functions and SQL OLAP operators but also SAS/SPSS-type complex operations such as clustering, simulations, and classifications. Sybase IQ also opens this capability (e.g., via C++ plug-ins) to third parties such as Fuzzy Logix and Visual Numerics.

Sybase IQ 15.4 introduced a major expansion of the User Defined Function (UDF) and other analytic capabilities. This is described in Section 4 on the Application Services Layer.
3.6 TEXT SEARCH AND ANALYSIS

Sybase IQ allows full (semi-structured) text data search in combination with traditional relational (structured) data analysis. For example, users can find all instances of a word or phrase in a set of text fields stored in Sybase IQ's data store, without having to scan table rows or having to know which column the word or phrase is stored in. Specialized text indexes that store positional information for terms in the indexed column(s) speed up complex text search and analysis. Moreover, Sybase IQ's in-database capabilities (outlined earlier) include plug-ins for third-party C++ Text Analytics/Mining libraries.
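To show why positional information helps with phrase search, here is a minimal Python sketch of a positional inverted index. It is a generic textbook structure offered as an assumption about how such indexes work in principle, not a description of Sybase IQ's text index format; the documents and function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for every document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids in which the phrase's terms appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term i of the phrase must sit at position start + i.
            if all(start + i in index[t].get(doc_id, ())
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical text fields keyed by row id.
docs = {
    1: "late delivery complaint",
    2: "delivery was late",
    3: "complaint about late delivery",
}
index = build_positional_index(docs)
matches = phrase_search(index, "late delivery")
```

The phrase is resolved entirely from the index: row 2 contains both words but not adjacent and in order, so it is excluded without the text itself being rescanned.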
4 The Application Services Layer

The Application Services Layer, represented by the middle layer in the Sybase IQ 15.4 analytic platform architecture, is a greatly expanded set of services designed specifically for the development and operation of analytic applications. This layer provides several new services, including an implementation of MapReduce that runs in parallel against the database. The Application Services Layer also provides facilities for users and partners to develop and use their own analytic functions that Sybase IQ 15.4 will perform in parallel against the database.

Figure 8: The SAP Sybase IQ 15.4 Application Services Layer (Source: SAP Inc.)

Additional key elements of the Application Services Layer include protected "out of process" Java UDFs, spatial/geometric data and query support, and simulation for in-database application development and testing.

4.1 "MPP ENABLED" USER DEFINED FUNCTIONS (UDF)

Several of the advanced capabilities of the Application Services Layer are possible because of the new forms of user defined functions (UDF) supported in Sybase IQ 15.4. SAP characterizes these new UDFs as "MPP Enabled," meaning that Sybase IQ will run them in highly parallel fashion, including spreading the work of a single function call across multiple nodes of the PlexQ™ grid. These functions are written in C or C++ (some types may also be written in Java) and are callable from SQL. Because such functions are enabled for execution in parallel across multiple nodes, they are key enablers for business analytics and big data.

UDFs are a convenient mechanism for the advanced users or database professionals in an enterprise to codify certain calculations or analytical techniques specific to a business, and then make them available for use throughout the enterprise. Though the industry term for this capability is "user defined function," and while Sybase IQ customers will certainly write them, a substantial library of such functions is provided by SAP and its partners. UDFs also provide a mechanism whereby a software vendor or data service provider can develop proprietary techniques and make them available for use by customers without necessarily disclosing the algorithm or its implementation.
Four classes of UDFs are supported:
• Scalar functions operate on individual data items, returning a single value;
• Aggregate functions operate on sets of values, returning a single value; several aggregation operations are built into the SQL language (for example SUM, COUNT and AVG), but aggregate UDFs provide an opportunity for users to create their own aggregation functions, which may incorporate techniques specific to an industry, company or analytical discipline;
• Table functions produce bulk data (that is, a table) as output and may be written in C/C++ and/or Java;
• Table parameterized functions both accept bulk data as input and return bulk data as output and may be written in C/C++.

Taken together, and considered in light of their enablement for highly parallel execution, these UDFs provide a potent new analytical capability for SAP customers and partners.

4.2 PROTECTED JAVA UDFs

Prior to Sybase IQ 15.4, customers were able to write UDFs in C and C++. Such functions had to be tested and certified before they could be run as part of a production system. They ran in the Sybase IQ kernel.

UDFs can now also be written in Java. In addition, they are run in a "protected" mode. This means that they are executed in a separate process that runs on a database server (that is, it runs on a node of the PlexQ™ grid). This prevents an error in the UDF from interfering with the operation of the core infrastructure of Sybase IQ 15.4 or of any other UDF or user process. The result is more reliable and consistently available data and analytical services.

4.3 IN-DATABASE MapReduce

"MapReduce is a software framework for distributed processing of large data sets on compute clusters," as described on the website of the Apache Foundation (http://hadoop.apache.org/mapreduce/).

In the MapReduce framework, data analysis tasks can be broken into functional pieces, called mappers and reducers, each of which performs a portion of the analysis, reading an incoming set of (key, value) pairs and writing an outgoing set of (key, value) pairs. When mappers and reducers are run in the correct sequence, the complete analysis task is accomplished.

The MapReduce framework is especially interesting when a large volume of data is to be processed because it is designed (and MapReduce functions are written) so that many copies of each mapper and each reducer can be run at the same time in a parallel architecture. Thus, if there is a terabyte of data to be analyzed and one runs 100 copies of a mapper, then each mapper needs to analyze only 10 GB of data (assuming that there is a readily available way to partition the data into 100 roughly equal parts). This concept of scalable, highly parallel analysis is similar to the concept of parallel query processing used in a data warehouse, though there are important differences between the two.

Prior to the development of MapReduce, procedural programs to analyze data, written outside of the context of a parallel database system, had to deal with all the complexity of parallel programming. So, data could be analyzed serially (a very slow process with large data volumes), or the programmer could get involved in the very complex and error prone process of specifying manually how the data was to be:
• partitioned;
• fed to many separate copies of the analysis process; and,
• analyzed;
and, then, how the many separate results were to be:
• recombined; and,
• delivered.

In a complex analysis there are many successive stages of parallel analysis, with the data passing between them in complex patterns, and the difficulty of the programming task escalates rapidly. The MapReduce framework relieves the programmer of explicitly dealing with the parallel aspect of the analysis, freeing him or her to concentrate on the data and the analytical logic.

The MapReduce framework has been popularized in connection with Hadoop, an open software system that implements MapReduce on compute clusters (typically, clusters of low cost servers and low cost storage). Hundreds, if not thousands, of companies are now using or experimenting with Hadoop clusters, in part so that they can have an environment for storing large amounts of data (the so-called "big data") and analyzing it with MapReduce and other tools.

While a Hadoop cluster provides a repository for the storage and analysis of big data, it has different advantages and limitations than a data warehouse. WinterCorp believes that most enterprises will have an analytical environment in which at least one data warehouse and at least one Hadoop cluster will be present. Section 5.x provides more information about Hadoop clusters and describes the facilities in Sybase IQ 15.4 for interfacing to them and interacting with data stored in them.

Meanwhile MapReduce as a programming framework has come to be widely viewed as a standard method for interfacing the procedural program (written in Java, Python or some other popular language for data analysis) to a large volume of data in storage. This is because programs and functions written using MapReduce can be executed in highly parallel architectures that speed up the large scale analysis.
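The mapper/reducer division of labor described above can be illustrated with a small, single-process Python sketch. It counts words, a conventional teaching example, and runs the phases serially; a real framework would run many mapper and reducer copies at once on separate slices. All names here are hypothetical, and nothing in it uses the Hadoop or Sybase IQ MapReduce APIs.

```python
from collections import defaultdict
from itertools import chain

def mapper(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1

def reducer(key, values):
    """Combine all values seen for one key into a single (key, value) pair."""
    return key, sum(values)

def partition(records, n_parts):
    """Split the input into n_parts roughly equal slices, one per mapper copy."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_reduce(records, n_mappers=3):
    # Map phase: each slice could be handled by a separate mapper copy.
    mapped = chain.from_iterable(
        mapper(rec) for part in partition(records, n_mappers) for rec in part
    )
    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reducer(k, vs) for k, vs in groups.items())

records = ["big data", "big analytics", "data data"]
result = map_reduce(records)

# The sizing arithmetic from the text: 1 TB (taken as 1,000 GB) over 100 mappers.
per_mapper_gb = 1_000 / 100
```

The analyst writes only `mapper` and `reducer`; the partitioning, grouping and recombination steps in `map_reduce` stand in for the work the framework takes off the programmer's hands.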
In Sybase IQ 15.4, a facility is provided for running C++ applications that use the MapReduce framework and run within the Sybase IQ PlexQ™ elastic grid. They can run against data stored in the Sybase IQ database or against externally stored data. The data can be structured or unstructured, as Sybase IQ 15.4 is capable of storing either. And the mappers and reducers are stackable.

Note carefully that, in the Sybase IQ 15.4 context, such programs need not have anything to do with Hadoop. The data that they analyze can be data previously stored in Sybase IQ, which could also be analyzed with SQL queries or any other tool that works with Sybase IQ. But, because of the popularity of MapReduce, this facility is likely to be valuable, because:
• Many libraries of analytical functions will be implemented for other environments using MapReduce; such libraries can then be ported to Sybase IQ 15.4;
• Many programmers, data scientists and other data specialists will gain familiarity with MapReduce and may prefer to program using it; and,
• Sybase IQ customers may want to build their own libraries of functions that can be used both on data in Sybase IQ and on data in other environments such as Hadoop; these customers will therefore be able to use MapReduce for this purpose.

As described in Section 5.3, at least one Sybase IQ partner has already leveraged this facility to provide data mining functions to Sybase IQ customers using MapReduce.
In addition, using MapReduce on Sybase IQ data will typically be simpler than accomplishing the same task on data in a Hadoop cluster. This is because the data to be analyzed can be selected and partitioned using SQL; the results returned by the analysis can be stored back in Sybase IQ using SQL; and, the definition of the data to be analyzed can be maintained in SQL. Each of these simplifies some aspect of the analytical process. Also, data that is stored in Sybase IQ is managed by Sybase IQ. It benefits from all of the services provided in the Sybase IQ environment for other data. For example, it can be incorporated in a routine backup schedule; it can be made recoverable; it can be secured via access controls; and, so on.

4.4 SIMULATION FOR IN-DATABASE DEVELOPMENT

Analyzing data within a UDF, rather than transferring the data to an external system for analysis, has important advantages for a user of Sybase IQ 15.4. First, it takes time and system resources to transfer the data elsewhere. Second, the moment the transfer begins, the data starts growing stale. If the analysis is delayed for some reason it becomes even more stale. And, the larger the volume of data to be analyzed, the higher the overhead of first moving it elsewhere. Third, Sybase IQ 15.4 is capable of running UDFs in parallel across multiple nodes. If the data is transferred to another system for analysis, and if that system is not able to analyze data with an equal or greater degree of parallelism, then there will be yet more delay. Fourth, if the results of the analysis are substantial and are to be retained for later use, it will be efficient to write them in parallel back within Sybase IQ, rather than having to transfer them from another system.

These reasons, and others, provide incentives to analyze data in place within UDFs in Sybase IQ. But there are some issues to address.
As UDFs are being developed, they may contain errors. A UDF under test could have unintended, and undesirable, effects on the production environment if run there.

In some environments, the production data is extremely sensitive and it is not practical to have a copy of it in a separate test environment.

To address such issues, Sybase IQ 15.4 provides facilities for testing of UDFs and applications that are intended to perform in-database analysis. These facilitate the process of creating realistic simulated data in a large scale test database. As a result, the development of in-database analytics is far more streamlined and UDFs can be more completely tested before they are used with production data.

4.5 HADOOP INTERFACE

Many enterprises will create, or have already created, an analytic environment in which there are multiple data repositories, some on data warehouse platforms and others in Hadoop clusters. In this situation, which WinterCorp believes will become nearly universal within the next several years, it will be common to have analytic processes which leverage data from multiple sources.

In response to this emerging requirement, Sybase IQ 15.4 has four mechanisms in its Application Services framework for connection between Sybase IQ and Hadoop. These are:
a. Client Side Federation. The Quest TOAD data query tool (certified with Sybase IQ and Hadoop) can retrieve data from each source and bring it together at the client; this can be a good solution when the volumes of data returned are not very large;
b. Analysis in the Sybase IQ Environment that Includes Data Extracted from Hadoop. Hadoop data is loaded (ETL) into Sybase IQ via Apache SQOOP, an open source tool for bulk data transfers between Hadoop and relational databases; SQOOP stands for "SQL-to-Hadoop"; this is a particularly attractive solution when the data extracted from Hadoop is to be joined or aggregated with data that resides in Sybase IQ; performing the work in the Sybase IQ environment brings all the benefits of a mature relational database environment, including security, compliance, backup/recovery, query optimization, and defined and controlled data semantics; this is a bulk data transfer which can be highly parallel on both the Sybase IQ side and the Hadoop side;
c. Incorporation of Hadoop Data into Sybase IQ Queries. When the data access is to be more frequent, and when the data volumes to be transferred are not very large, data can be retrieved from Hadoop using Sybase IQ table functions; these retrievals, while not instant because Hadoop is fundamentally a batch environment, can nonetheless be incorporated into SQL queries as they are executing;
d. Coordinate Hadoop Job(s) with Sybase IQ Query. In this case, a Hadoop MapReduce job runs separately from a query but is designed to feed data to it; the query and the Hadoop job interface by means of parameterized table functions in the Sybase IQ query; though similar to case (c) above, the emphasis here is on coordinating and integrating analysis that is occurring in two jobs in two separate environments.

With these four capabilities, Sybase IQ customers can deal effectively with a range of situations in which a Hadoop repository is to be used in conjunction with a Sybase IQ data warehouse to meet an analytic requirement.

4.6 GEOSPATIAL/GEOMETRIC DATA & QUERY SUPPORT

Trends have increased the prevalence and the significance of location data and geometric data in the analytical environment.

First, GPS enabled devices such as smartphones and tablets are in widespread use and proliferating rapidly. There are hundreds of millions of such devices in use today and projections are that there will soon be billions.
Such devices frequently communicate their location via the internet and such location data ends up in many commercially valuable databases.

Second, many other types of GPS-enabled electronic devices are being created and they also communicate their location with increasing frequency. Examples include vehicles, medical devices, surveillance devices used for traffic analysis, weather devices and others too numerous to mention here.

Data on the location of devices, once too expensive or impractical to obtain, now shows up in more databases every day. Analysis of the location aspects of this data is central to the timely understanding and management of public safety, public health, the commercial supply chain, customer purchase patterns and many commercial resources.

A similar trend exists with respect to geometric data significant to the design and manufacture of products; the maintenance of buildings, highways and bridges; the management of energy use; and, so on.

In both cases, it is important for the data to be defined, captured and stored in as standardized, easily specified and easily used a fashion as possible. It is also essential for the database system to facilitate the specification of queries that exploit geographic and geometric data. And, it is essential for the database system to perform such queries efficiently, especially when the data volumes involved are large.

Sybase presently addresses these requirements through SQL Anywhere, the embedded row store DBMS inside Sybase IQ. SQL Anywhere is a very efficient small footprint DBMS that serves as a catalog store for Sybase IQ's engine. Users can also spawn a separate instance of SQL Anywhere from Sybase IQ to store geospatial data. Sybase IQ then provides facilities for federated query of the geospatial/geometric data stored in SQL Anywhere and the main analytical column data store in Sybase IQ 15.4, to enable high performance geospatial analysis.

4.7 FREE EXPRESS EDITION

As with any software platform, it is important for developers to be encouraged to develop tools and applications for Sybase IQ. The robust and rapidly growing community of millions of Sybase IQ end users, using the product at over 4,500 installations worldwide, is certainly an incentive to developers.

But it is still important to remove obstacles from the path of any developer interested in providing new capabilities to those users.

To this end, SAP is providing a free Express Edition of Sybase IQ 15.4. Anyone who is developing for Sybase IQ (utilizing the rich Application Services described in this section), or thinking about doing so, can download the product from http://www.sybase.com/iqexpressedition.
  • SAP Sybase IQ 15.4: An Elastic Platform for Business Analytics 25 A WINTER CORPOR ATION WHITE PAPER5 The Ecosystem Layer The Ecosystem Layer, represented by the outermost layer in the Sybase IQ 15.4 analytic platform architecture, is an environment in which SAP and its partners can provide and support analytic applications and tools, as well as the business intelligence tools that have long been available with Sybase IQ. Figure 9: The SAP Sybase IQ 15.4 Application Enablement Layer Source: SAP Inc. Key elements of the Ecosystem Layer include support for SAP BusinessObjects and the “R” statistical language; more efficient and scalable data mining functions written to exploit the new in-database MapReduce; social network analysis modules; support for PMML, for mathematical modeling; new capabilities for PowerDesigner and system administration and monitoring; applications for big data information lifecycle management;. Highlights are described in the following sections 5.1 SAP Business O bjects Portfolio support As part of SAP, Sybase IQ is now well integrated with SAP’s market leading tools for Business Intelligence and Data Integration from its BusinessObjects portfolio. The SAP BusinessObjects BI Platform is not only certified with every new version of Sybase IQ, including Sybase IQ 15.4, but it is also being optimized to support Sybase IQ focused optimized SQL generation. Similarly, SAP BusinessObjects Data Services is being certified and optimized to load and transform data into Sybase IQ in a very efficient manner. 5. 2 “R” L ANGUAGE SUPPO RT As described at www.r-project.org, R is a language and environment for statistical computing and graphics. …R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc) and graphical techniques, and is highly extensible. 
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
Sybase IQ 15.4 provides support for the R language in two ways.

First, R applications can fetch data sets stored in IQ for analysis in the R environment through RJDBC.

Second, calls to models written in R can be embedded in a Sybase IQ table UDF written in C++. Then SQL queries submitted to Sybase IQ can call the UDF, thereby causing the model to be invoked in an R server process.

5.3 MAPREDUCE-ENABLED DATA MINING

Since Version 15.1, the Fuzzy Logix library of data mining and analytic functions has been available with Sybase IQ.

With Sybase IQ 15.4, this library of over 250 functions has been:
• Re-implemented using Sybase IQ's in-database MapReduce API; and,
• Extended with additional new functions.

Most significantly, by using Sybase IQ's in-database MapReduce API, the new implementation leverages the Sybase IQ table and table parameterized functions (thus using bulk data input and bulk data output to gain efficiency) and exploits the elastic PlexQ™ grid to execute the functions with much higher parallelism.

5.4 SOCIAL NETWORK ANALYSIS MODULES

KXEN's InfiniteInsight social network analysis and predictive analytic toolset has been certified with Sybase IQ 15.4 to run on data stored in the database. With Sybase IQ 15.4, KXEN does its scoring directly in the database, and reports that it realizes large performance benefits both from the column storage model and the in-database analytic support.

5.5 SYBASE POWERDESIGNER 16 ARCHITECTURE RECOMMENDER

Sybase PowerDesigner is a widely used application and database design product that has long been available and integrated with Sybase IQ.

PowerDesigner 16 and Sybase IQ 15.4 are now jointly enhanced and integrated to provide a new capability of recommending the architecture for a Sybase IQ solution.
The user provides PowerDesigner with:
• Database design
• Expected data volumes & growth
• Expected workload
• Performance requirements
• Hardware preferences (e.g., Intel or Power)

PowerDesigner will then generate an estimate of the configuration required and a bill of materials based on Sybase IQ reference architectures developed in cooperation with system partners. Where pre-built appliance-like configurations are available, these can be generated.

The user can then vary input assumptions and examine the sensitivity of the configuration to variations.

In WinterCorp's opinion, such estimated configurations should be used only as a starting point in certain capacity planning situations. Particularly in larger and more complex deployments, users would be well advised to seek independent confirmation and measurement. However, a fast path to an initial estimate is often extremely useful in capacity planning and this tool can provide that, along with an indication of sensitivity to various planning assumptions.
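The idea of varying one planning assumption at a time and watching the estimate move can be sketched in a few lines of Python. This is only an illustration of one-at-a-time sensitivity analysis: the function `estimate_cores`, its inputs and every constant in it are hypothetical placeholders invented for this sketch, and it does not represent PowerDesigner's sizing model or any Sybase IQ reference architecture.

```python
import math


def estimate_cores(data_tb, queries_per_hour, complexity):
    """Toy sizing rule (hypothetical, NOT PowerDesigner's model).

    Assumes work grows linearly with data volume and query rate and
    quadratically with a made-up 'complexity' factor. The divisor 1000
    is an arbitrary placeholder.
    """
    work = data_tb * complexity ** 2 * queries_per_hour
    return max(1, math.ceil(work / 1000))


def sensitivity(baseline, bump=0.25):
    """Increase each input by `bump` in turn, holding the others fixed,
    and report the change in the estimate relative to the baseline."""
    base = estimate_cores(**baseline)
    deltas = {}
    for name in baseline:
        varied = dict(baseline)
        varied[name] = baseline[name] * (1 + bump)
        deltas[name] = estimate_cores(**varied) - base
    return base, deltas


# Placeholder planning assumptions, chosen only to make the arithmetic easy.
baseline = {"data_tb": 10.0, "queries_per_hour": 200.0, "complexity": 2.0}
base_cores, deltas = sensitivity(baseline)

print("baseline estimate:", base_cores, "cores")
for name, delta in deltas.items():
    print(f"+25% {name}: {delta:+d} cores")
```

Under these made-up assumptions, the estimate reacts more strongly to the complexity input than to the other two, which is the kind of signal a planner would want to follow up with real measurement.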
An area where particular caution is advised is in regard to large databases with complex query requirements. Where query complexity is high and data volumes are large, modest changes to the query workload can produce surprisingly large variations in capacity requirements. In these cases, a certain amount of realistic testing, along with larger allowances for unexpected capacity demands, is in order.

However, with this tool, a history of configuration changes can be initiated, estimated, tracked, and maintained that can make sizing and deployments much more "factory like."

5.6 IN-DATABASE PMML

From http://www.dmg.org/pmml-v3-0.html:

The Predictive Model Markup Language (PMML) is an XML-based language which provides a way for applications to define statistical and data mining models and to share models between PMML compliant applications. PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications.

PMML models can be developed in a variety of data mining and statistical workbench environments available from other parties. However, when PMML models are actually used in production to score large volumes of data, they must run in a highly parallel environment.

In Sybase IQ 15.4, users can run PMML models with a plug-in developed by Zementis (http://www.zementis.com/in-DB-plugin.htm). With the plug-in, the PMML model can be run directly against data in Sybase IQ. The Zementis plug-in is a Sybase IQ UDF, leveraging the new Java API available in Sybase IQ 15.4.

Besides the various ecosystem modules outlined above, Sybase IQ supports a substantial variety of packaged analytical applications through its OEM partnerships covering various functional areas. A few examples include Ericsson's OSS product ENIQ, BMMSoft EDMT, and Solix EDMS.
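To make the PMML description in Section 5.6 more concrete, the following Python sketch parses a very small PMML regression model and applies it to a few rows. It is an illustration only: the model, field names and coefficients are invented, the scorer handles nothing beyond an intercept and numeric predictors, and it runs in plain Python outside any database. It is not the Zementis plug-in and says nothing about how that plug-in is implemented.

```python
import xml.etree.ElementTree as ET

# A toy PMML document. The fields and coefficients are invented for
# illustration; only the general element layout follows PMML.
PMML_DOC = """
<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0">
  <Header description="toy model, invented for illustration"/>
  <DataDictionary numberOfFields="3">
    <DataField name="calls" optype="continuous" dataType="double"/>
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toy">
    <MiningSchema>
      <MiningField name="calls"/>
      <MiningField name="tenure"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="calls" coefficient="0.25"/>
      <NumericPredictor name="tenure" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-3_0"}


def load_linear_model(pmml_text):
    """Extract the intercept and numeric coefficients from the model.

    Deliberately minimal: ignores exponents, categorical predictors,
    transformations and every other PMML model type.
    """
    root = ET.fromstring(pmml_text.strip())
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, row):
    """Apply the linear model to one row given as a dict of field values."""
    intercept, coefficients = model
    return intercept + sum(c * row[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
rows = [{"calls": 4.0, "tenure": 2.0}, {"calls": 0.0, "tenure": 3.0}]
scores = [score(model, row) for row in rows]
print(scores)
```

Because the model is data rather than code, the same XML could in principle be handed to any PMML-compliant scorer, which is the portability argument the PMML description makes.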
6 Conclusions

Over the course of its last five rapid releases in three years, from 15.0 through the present 15.4, SAP Sybase IQ has been transformed into a platform for large-scale data analytics and big data. It has significantly advanced in:

• Scalability, with the development of its elastic PlexQ™ grid that adds highly parallel execution of large queries and loads; previously, such operations could run in parallel over a single node of the grid; now they can run in parallel over multiple nodes; this is a major architectural advance, highly significant for larger data and workload requirements;
• In-database analytics, with a major generalization and extension of the user defined function (UDF) facility in Sybase IQ; with these new capabilities, UDFs can be written in Java as well as C++; they can read and write bulk data in the form of tables and files; they can be run in a protected mode, increasing system reliability and data availability; and, they can be executed in parallel over multiple nodes of the grid;
• In-database MapReduce, enabling end users and partners to run MapReduce routines and libraries against data in place and in a highly parallel fashion in Sybase IQ, and opening Sybase IQ up to a large range of analytic tools and applications from many vendors and sources;
• Interface to Hadoop, enabling the many customers who are investing, or will invest, in an open source data repository in a Hadoop cluster to leverage that investment in combination with data and analysis in Sybase IQ;
• Other analytic application services leveraging in-database MapReduce and new, more powerful UDFs; these include an expanded, more efficient and more highly parallel version of the Fuzzy Logix data mining and analytics library; a simulator for testing analytic applications; and, other features;
• Partner Ecosystem: other analytical, management and business intelligence tools and functions available from partners, certified with Sybase IQ and providing analytical solutions and capabilities to customers; these include support for the SAP BusinessObjects tool set and the R statistical language; a PMML plug-in for data mining from Zementis; social network analysis from KXEN; query and administration tools from Quest TOAD; and, other capabilities.

These advances are evidence of a significant reorientation of the product direction and a significant enhancement of the product line to focus on the major drivers of change in business today.

Organizations everywhere are grappling with the implications of a much larger volume and variety of data and a much increased focus on business strategies driven by fuller analysis of that data. Mobility (tablets, smartphones, other devices), social media and machine-generated data are all changing our data environments.

Sybase IQ now claims more than 4,500 installations across the globe, following a rapid growth in revenue and a large expansion of the development organization.

In addition to the recent advances in releases 15.0 through 15.4 described here, Sybase IQ retains its established advantages in column storage, indexing and compression. These features, present since the earliest versions of Sybase IQ, work in combination to confer benefits that are unique to Sybase IQ. While other products offer column storage and compression, no other product has the sophistication of Sybase IQ in integrating these features with advanced indexing and query optimization. The result is that Sybase IQ is particularly efficient in reducing the amount of data that must be read to satisfy queries.
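The general reason column storage reduces the data read by a query can be shown with a small, self-contained Python sketch. It lays out the same invented three-column table once row by row and once column by column, then counts the bytes a `SUM(amount)` style query has to touch in each layout. The table, column names and fixed-width encoding are assumptions made for this sketch; it ignores compression, indexing and query optimization, and it is not a description of Sybase IQ's actual storage format.

```python
import struct

# Invented table: (id int32, region int32, amount float64), 1000 rows.
rows = [(i, i % 5, i * 1.5) for i in range(1000)]
n = len(rows)

# Row layout: each 16-byte record holds all three columns together.
ROW_FMT = "<iid"
row_store = b"".join(struct.pack(ROW_FMT, *r) for r in rows)

# Column layout: each column is stored contiguously on its own.
column_store = {
    "id": struct.pack(f"<{n}i", *(r[0] for r in rows)),
    "region": struct.pack(f"<{n}i", *(r[1] for r in rows)),
    "amount": struct.pack(f"<{n}d", *(r[2] for r in rows)),
}

# Query: total of the 'amount' column.
# Row layout: every full record must be read to reach its amount field.
row_total = sum(rec[2] for rec in struct.iter_unpack(ROW_FMT, row_store))
row_bytes_read = len(row_store)

# Column layout: only the 'amount' column is read.
amount_bytes = column_store["amount"]
col_total = sum(v for (v,) in struct.iter_unpack("<d", amount_bytes))
col_bytes_read = len(amount_bytes)

print("row layout bytes read:   ", row_bytes_read)
print("column layout bytes read:", col_bytes_read)
print("same answer:", row_total == col_total)
```

In this toy case the column layout reads half the bytes for the same answer; the saving grows with the number of columns the query does not reference.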
These fundamental strengths are now combined with increased parallelism and other features to deliver product benefits in a wider range of applications, now including those that use advanced analytic methods such as MapReduce and those that involve interaction with big data in Hadoop clusters.
WinterCorp is an independent consulting firm expert in the architecture and scalability of big data and analytic database solutions. Since our founding in 1992, we have architected solutions to some of the largest scale and most demanding big data and data warehouse requirements, worldwide.

We help technology users define their requirements; architect their solutions; select their platforms; and, engineer their implementations to optimize business value.

We create and conduct benchmarks, proofs-of-concept, pilot programs and system engineering studies that help our clients manage technical risk, control cost and reach business goals.

Our seminars and structured workshops help client teams establish a shared foundation of knowledge and move forward to meet their challenges in big data and analytic database scalability, performance and availability.

We're expert with SQL, MapReduce and Hadoop; with structured data, unstructured data, and semi-structured data; and with the products, tools and technologies of data analytics in all its major forms.

With our in-depth knowledge and experience, we deliver unmatched insight into the issues that impede scalability and into the technologies and practice that enable business success.

245 First Street, Suite 1800, Cambridge MA 02145, 617-695-1800. Visit us at www.wintercorp.com. ©2012 Winter Corporation, Cambridge, MA. All rights reserved.