Building an analytical platform


Business intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data. Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment. Here we look at a quick and simple alternative.

Published in: Technology

  1. Building an Analytical Platform
  2. I was recently asked to build an analytical platform for a project. But what is an analytical platform? The client, a retailer, described it as a database where it could store data and as a front end where it could do statistical work. This work would range from simple means and standard deviations through to more complex predictive analytics that could be used, for example, to analyze the past performance of a customer to assess the likelihood that the customer will exhibit a future behavior. Or it might involve using models to classify customers into groups and ultimately to bring the two processes together into an area known as decision models.

     The customer had also come up with an innovative way to resource the work: it offered work placements to master's degree students studying statistics at the local university and arranged for them to work with the customer insight team to describe and develop the advanced models. All the customer needed was a platform to work with.

     From a systems architecture and development perspective, we could describe the requirements in three relatively simple statements:

     1. Build a database with a very simple data model that could be easily loaded, that was capable of supporting high-performance queries, and that did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
     2. Create a Web-based interface that would allow users to securely log on, to write statistical programs that could use the database as a source of data, and to output reports and graphics as well as to populate other tables (for example, target lists) as a result of statistical models.
     3. Provide a way to automate the running of the statistical data models, once developed, so that they can be run without engaging the statistical development resources.

     Of course, time was of the essence and costs had to be as low as possible, but we've come to expect that.

     Step 1: The database

     Our chosen solution for the database was an SAP® Sybase® IQ database, a technology our client was already familiar with. SAP Sybase IQ is a column-store database. This means that instead of storing all the data in its rows, as many other databases do, the data is organized on disk by the columns. For example, if a column holds country names, a row-based database would have the text of each country (for example, "United Kingdom") stored many times. In a column-store database the text is stored only once and given a unique ID. This is repeated for each column, and therefore the "row" of data consists of a list of IDs linked to the data held for each column.

     This approach suits reporting and analytical databases. It also reduces the storage used. In our example, "United Kingdom" would occupy 14 bytes, while the ID might occupy only 1 byte, reducing the storage for that one value in that one column by a ratio of 14:1, and this [...] the data. Furthermore, because there is less data on the disk, the time taken to read the data from disk and to process it is reduced, which massively speeds up the queries too. Finally, each column is already indexed, which again helps the overall query speed.

     An incidental but equally useful consequence of using a column-store database such as SAP Sybase IQ is that there is no advantage in creating a star schema as a data model. Instead, holding all the data in one large, wide table is [...]. Storing each column with a key means that the underlying storage of data is a star schema. Creating a star schema in a column-store database rather than a large single table would mean incurring unnecessary additional join and processing overhead.

     As a result of choosing SAP Sybase IQ's column-store database, we are able to have a data model that consists of a number of simple single-table data sets (one table for each [...]) that is quick to load and to query.

     It should be noted that this type of database is not suited to online transaction processing (OLTP) applications because of the cost of doing small inserts and updates. However, this is not relevant here.

     The solution can be deployed only on a Linux platform. We use Linux for three reasons. First, RStudio Server Edition is not yet available for Microsoft Windows. Second, precompiled packages for all elements of the solution on Linux reduce [...]. Third, Linux environments are normally cheaper than Windows environments due to the cost of the operating system license. We chose CentOS because it is a Red Hat derivative that is free.

     One additional advantage of this solution for some organizations is the ability to deploy it in the cloud. Since the solution [...], and since all querying is done via a Web interface, it is possible to use any colocation or cloud-based hosting provider.
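The column-store idea described under Step 1 can be illustrated with a short Python sketch of dictionary encoding. This is a simplified model for illustration only, not SAP Sybase IQ's actual storage format; the function names, the sample country values, and the assumption of exactly one byte per ID are all hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Store each distinct value once and replace every cell with its ID."""
    dictionary: List[str] = []         # position in this list is the ID
    ids_by_value: Dict[str, int] = {}  # lookup used only while encoding
    encoded: List[int] = []
    for value in values:
        if value not in ids_by_value:
            ids_by_value[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids_by_value[value])
    return dictionary, encoded


def dictionary_decode(dictionary: List[str], encoded: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the ID list."""
    return [dictionary[i] for i in encoded]


# Hypothetical sample column of country names
countries = ["United Kingdom", "France", "United Kingdom", "United Kingdom", "France"]
dictionary, encoded = dictionary_encode(countries)

# Rough size comparison, assuming UTF-8 text and 1 byte per ID
raw_bytes = sum(len(v.encode("utf-8")) for v in countries)
encoded_bytes = sum(len(v.encode("utf-8")) for v in dictionary) + len(encoded)

print("dictionary:", dictionary)
print("encoded column:", encoded)
print("raw bytes:", raw_bytes, "encoded bytes:", encoded_bytes)
```

The more often a value repeats in a column, the closer the saving for that value gets to the 14:1 ratio quoted above, because the one-off cost of holding the text in the dictionary is spread over more rows.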
  3. Colocation or cloud deployment [...] systems management overhead, and [...] access for both data delivery and data access. The system requires SSH access for management; FTP, SFTP, or SCP for file delivery; and [...] port open. The RStudio server uses the server login accounts for security but can also be tied to existing LDAP infrastructure.

     Step 2: Statistical tools and Web interface

     There are a number of statistical tools in the market. Most are very expensive, prohibitively so in this case, and the associated skills are hard to come by and expensive. However, since 1993 an open-source programming language called R for statistical computing and graphics has been under development. It is now widely used among statisticians for developing statistical software and data analysis, is used by many universities, and is predicted to become the most widely used statistical package by 2015. The R project provides a command line and graphical interface as well as a large open-source library of useful routines (http://cran.r-project.org), and it is available as packaged software for most platforms including Linux.

     In addition, a second open-source project called RStudio (http://[...]) provides a single integrated development environment for R and can be deployed on a local machine or as a Web-based service using the server's security model. In this case, we implemented the server edition in order to make the entire environment Web based.

     So in two simple steps (download and install R, followed by download and install RStudio) we can install a full Web-based statistical environment. Note that additional packages may be required depending on your environment, but these are well documented on the source Web sites and in general download automatically if you are using a tool such as yum.

     The next step was to get access to the data held in our SAP Sybase IQ server. This also proved to be very straightforward. There is an SAP Sybase white paper that describes the process, which can be simply stated as:

     • Install the R JDBC package
     • Set up a JDBC connection
     • Establish your connection
     • Query the table

     We now have an R object that contains data sourced from SAP Sybase IQ that we can work with. And what is amazing is that it took me less than half a day to build the platform from scratch.

     [Figure: Analytical platform server (CentOS), showing RStudio Server Edition, R, the R/JDBC connection to SAP Sybase IQ, (S)FTP/SCP file delivery, the ETL engine (read file, write to database), and any network-connected computer with a browser accessing RStudio Server Edition. ©2012 Data Management & Warehousing]

     At this point data has to be loaded and the statisticians can get to work. Obviously this is more time consuming than the build, and over the days and weeks the analysts created their models and produced the results.

     For this exercise we used our in-house extract, transform, and load (ETL) tool to create a repeatable data extraction and load process, but it would have been possible to use any of a wide range of tools that are available for this process.

     Step 3: Automatically running the statistical models

     Eventually a number of models for analyzing the data had been created and we were ready to move into a production environment. We automated the load of the data into the agreed single-table structure and wanted to also run the statistical models on the data.
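As a rough illustration of automating model runs in Step 3 from the command line, the Python sketch below shows how a scheduler or ETL callout might invoke an R script through `Rscript`. The script name, its date argument, and the helper function names are hypothetical, and the in-house ETL tool and its callout mechanism are not shown in the source, so this is one possible arrangement rather than the author's implementation.

```python
import subprocess
from typing import List, Sequence


def build_rscript_command(script_path: str, script_args: Sequence[str] = ()) -> List[str]:
    """Build the argument list for running an R script non-interactively."""
    # --vanilla starts R without restoring a saved workspace or reading profile files
    return ["Rscript", "--vanilla", script_path, *script_args]


def run_model(script_path: str, script_args: Sequence[str] = ()) -> str:
    """Run the R script and return its standard output.

    Raises subprocess.CalledProcessError if R exits with a non-zero status,
    so a calling job can treat a failed model run as a failed step.
    """
    completed = subprocess.run(
        build_rscript_command(script_path, script_args),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical script name and argument, shown without executing R
    print(build_rscript_command("score_customers.R", ["2012-06-30"]))
```

Keeping the command construction separate from the call that executes it means the callout can be checked on a machine where R is not installed.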
  4. SAP Sybase IQ has the functionality to run user-defined functions (UDFs) written in C++. These C++ programs "talk" to a process known as Rserve, which in turn executes the R program and returns the results to SAP Sybase IQ. This allows R functions to be embedded directly into SAP Sybase IQ SQL commands. While setting this up requires a little more programming experience, it does mean that all processing can be done within SAP Sybase IQ.

     Conversely, it is possible to run R from the command line and call the program that in turn uses the RJDBC connection to read and write data to the database.

     Having a choice of methods is very helpful as it means that it can be integrated with the ETL environment in the most appropriate way. If the ETL tool [...], the user-defined function (UDF) route is the most attractive. However, if the ETL tool supports callouts (as ours does), then running R programs from a command-line callout is quicker than developing the UDF.

     Conclusions

     Business intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.

     Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment.

     By combining the high performance and low footprint of SAP Sybase IQ with the open-source R and RStudio statistical packages, it is possible to quickly deploy an analytical platform in the cloud for which there are readily available skills.

     This infrastructure can be used both for rapid prototyping on analytical models and for running completed models on new data sets to deliver greater insight into the data.

     About the author

     David Walker has been involved with business intelligence and data warehousing for over [...] Data Management & Warehousing (http://[...]) in 1995.

     David and his team have worked around the world on projects designed to deliver [...] converting data into information and by [...] exploit that information.

     David's project work has given him experience in a wide variety of industries including [...] manufacturing, transportation, and public sector, as well as a broad and deep knowledge of business intelligence and data warehousing technologies.
  5. ©2012 SAP AG. All rights reserved.

     SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork, SAP HANA, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.

     Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects Software Ltd. Business Objects is an SAP company.

     Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other Sybase products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Sybase Inc. Sybase is an SAP company.

     Crossgate, m@gic EDDY, B2B 360°, and B2B 360° Services are registered trademarks of Crossgate AG in Germany and other countries. Crossgate is an SAP company.

     All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves [...]

     These materials are subject to change without notice. These materials [...] for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.