Building a Big Data Analytics Platform- Impetus White Paper


For Impetus’ White Papers archive, visit-

In this paper, Impetus focuses on why organizations need to design an Enterprise Data Warehouse (EDW) to support the business analytics derived from Big Data.

Published in: Technology, Business

  1. Building a Big Data Analytics Platform - Going beyond the Traditional Enterprise Data Warehouse

     WHITE PAPER

     Abstract

     In this white paper, Impetus Technologies focuses on the need for building a Big Data analytics platform for better business insights.

     It also looks at why organizations need to design an Enterprise Data Warehouse (EDW) to support the business analytics derived from Big Data.

     Additionally, it discusses the options and challenges of building a successful EDW architecture to meet the new Big Data business requirements. It explains why such an architecture may include extreme integration with semi-structured and unstructured data sources, which could be very large in size or could be streaming data, accessed through Hadoop as well as massively parallel databases.

     Impetus Technologies,
  2. Table of Contents

     Introduction .......... 3
     Limitations of traditional EDWs .......... 4
     The key features of a Big Data Analytics platform .......... 5
     Options available for building the Big Data platform .......... 6
     Using Open Source to build Big Data solutions .......... 7
     Opting for a Hybrid solution .......... 8
     Harnessing existing investments in building a Big Data Analytics platform .......... 10
     Summary .......... 12
  3. Introduction

     In the post-recession world, organizations are under pressure to maximize profits and reduce expenditure. Business owners need to find the right target users, figure out the distribution channels, and successfully sell their offerings, while keeping all the stakeholders happy.

     Moreover, every time the business comes up with new products or campaigns, or wishes to evaluate its existing business performance, it has to deal with the following questions: What kind of products are my customers interested in? Where should I open my new store next year? What is the most effective distribution channel?

     Traditionally, businesses have used Enterprise Data Warehouse (EDW) solutions to provide analytics and gain deeper insights that address their business requirements and expansion plans.

     An EDW can play a pivotal role in an enterprise IT strategy. A comprehensive EDW plan provides companies the following benefits:

     • Enables disciplined data integration within a large enterprise
     • Generates output and facilitates effective representations of all business processes

     It is important to examine how the traditional EDW works. Traditional data sources include an operational database, old archived data, flat or XML files, and ERP systems. The data is extracted, cleaned and transformed into the desired format, and then loaded into the data warehouse storage system. This data can be further divided into marts. Once the data is available in the central EDW, query or reporting tools are used for analytics. For deeper or forecast-based analysis, data mining tools are used.

     The question, however, is whether such data warehouses are ready to deal with Big Data and, more importantly, what is Big Data?

     The term Big Data is used to describe data sets that cannot be managed or processed by traditionally used software tools within an agreed elapsed time. Big Data sizes are constantly increasing, and can range from a few terabytes to many petabytes. Overall, the volume is expected to reach around 35 zettabytes by the year 2020.
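The traditional EDW flow described above (extract, clean, transform, load, then query) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample sales file, the `sales` table, and the use of an in-memory SQLite database as a stand-in for the warehouse storage system. It is not taken from any specific EDW product.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system (a flat file).
RAW_SALES = """order_id,region,amount
1, north ,120.50
2,south,
3,North,80.00
4,south,45.25
"""


def extract(text):
    """Extract: read raw rows from the flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows that have no amount."""
    return [row for row in rows if row["amount"].strip()]


def transform(rows):
    """Transform: normalize region names and convert types."""
    return [
        (int(row["order_id"]), row["region"].strip().lower(), float(row["amount"]))
        for row in rows
    ]


def load(conn, records):
    """Load: write the transformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def region_totals(conn):
    """Reporting query: total sales per region, a stand-in for a data mart view."""
    cursor = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(cursor.fetchall())


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(clean(extract(RAW_SALES))))
    print(region_totals(warehouse))
```

Each stage is a separate function so that the order of the steps stays visible; a real pipeline would add validation, logging and incremental loads.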
  4. Traditional Enterprise Data Warehouses have fallen short of expectations when it comes to handling Big Data, on account of the following challenges:

     • Inability to handle large data sizes
     • Storing and managing the Big Data
     • Gaining insights from this data
     • Costs involved in dealing with Big Data

     Limitations of traditional EDWs

     Let us examine the limitations of traditional EDWs.

     Traditionally, Enterprise Data Warehouses focused only on transactional or archived data. However, in the last few years, the need to capture additional data for deeper insights has come up. This includes real-time data, which may be low-latency operational data or customer behavior data that captures sub-transactional processes. At the same time, additional data sources such as devices and sensors have also emerged.

     Social Media also provides valuable information on product preferences and user sentiments. This information, drawn from the large volumes of unstructured data generated by Web applications, is extremely useful for generating business intelligence.
  5. It is clear that traditional EDWs cannot gain meaningful insights from Big Data. This is largely because traditional EDWs were not designed to handle terabytes and petabytes of data. Most of these systems were designed in the 1990s using the database technologies of that time.

     Another difference is that in place of Extract-Transform-Load (ETL), Big Data warehouses need ELTL, which is Extract-Load-Transform-Load. The new system needs a staging area where data is uploaded before the cleansing and transformation operations.

     Traditional relational database solutions are not suitable for a majority of these data sets. The data is too unstructured and/or too voluminous for a traditional RDBMS to handle, and it cannot be analyzed effectively with SQL or similar technologies alone. In fact, a fixed database schema does not allow complex unstructured formats to be defined and managed in these data warehouses. Moreover, the cost of handling these new data sets with traditional technologies is very high.

     Clearly, existing EDW environments, which were designed decades ago, lack the ability to capture and process the new forms of data within reasonable processing times. Moreover, these traditional EDWs have limited capabilities when it comes to analyzing user behavioral data.

     Cost is another important factor. Currently, organizations are spending hundreds of thousands of dollars per terabyte per year for producing and replacing data in their existing environments. Additionally, the models in use tend to require specialized hardware, which in turn results in a high cost per terabyte, making large-scale deployments expensive. It is also hard to predict the infrastructure workload for managing this Big Data.

     The key features of a Big Data Analytics platform

     To manage the Big Data trend, a new breed of Open Source and proprietary Big Data technologies that leverage commodity hardware has come up. A Big Data Analytics platform helps capture and analyze these new data sets.

     The ideal Big Data Analytics platform needs to match up to these key characteristics:

     • It should have the ability to scale easily to support large data, which will typically be in terabytes or petabytes.
     • The system should ideally be distributed across processors without regard to their geographic location.
     • It should enable quick responses to highly complex queries, as well as support a wide variety of data types.
     • It should be able to incorporate machine learning, provide recommendations, and execute analytics on real-time incoming data such as logs, as well as provide domain-specific canned reports.
     • It should be able to handle data from heterogeneous data sources, while providing a high rate for loading and analysis, as well as the ability to handle failover.
  6. Options available for building the Big Data platform

     It is important to understand that for building a Big Data analytics platform, any single vendor technology may not be sufficient. The platform should have certain capabilities to address specific sets of requirements.

     There are two different approaches that are being used to address Big Data analytics.

     The first one is using Massively Parallel Processing (MPP) and Columnar Databases. This solution can help address scaling, distribution, load management, response time and failover management issues. Additionally, it may also have some domain-specific capabilities to provide a ready-made solution.
  7. The second option is using MapReduce implementations. This programming model was introduced by Google, which used it for tasks such as building its Web search index, and it is now readily available in the Open Source Apache project called Hadoop.

     Companies therefore have the option to choose between Open Source solutions and commercial options. However, they can also build a hybrid solution, which has a mix of different capabilities that handle the Big Data paradigm.

     The commercial tools of today have strong analytical proficiencies as well as sophisticated reporting and OLAP cube capabilities. There are a large number of vendors in the market who are offering solutions for the main components of the EDWs, which are ETL, query tools and BI.

     Some of the commercial options for MPP are Greenplum, Teradata, etc. Informatica is an example of a commercial ETL tool. A few commercial solutions for BI and Analytics are Pentaho, Business Objects and MicroStrategy, among others. It is possible to build a Big Data warehouse solution using these commercial products together.

     Using Open Source to build Big Data solutions

     Every organization, big or small, is now focused on cutting IT expenditure. Despite this, business analytics remains a major business driver for these companies. If the commercial solutions are scaled to really huge volumes and deeper BI, the result can be exorbitant licensing costs.

     This is clearly not a viable proposition. Companies can instead choose from the numerous Open Source implementations that are available. Lower costs, extensibility, and integration are some of the benefits that organizations realize from Open Source solutions. The good news is that the community is continuously making efforts to enhance these features and add new functionalities to these solutions.

     Some of the Open Source solution stacks in the analytics world are Jaspersoft and Pentaho Reporting, while the Open Source ETL tools include CloverETL and Talend. Pentaho also provides commercial extensions of its solution. Apache Hadoop provides an Open Source implementation of the MapReduce framework, and Apache Cassandra provides distributed storage that can be used alongside it. These products solve huge data storage issues and provide ETL and analytics support.
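To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch in Python. It only illustrates the three phases (map, shuffle, reduce) on a word-count task; the function names are hypothetical and it does not use or represent the Hadoop API, which distributes the same phases across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(map_reduce(["big data", "Big insights"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across nodes, which is what allows MapReduce implementations to scale out on commodity hardware.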
  8. Opting for a Hybrid solution

     In this scenario, it is possible to use an Open Source solution for ETL or BI and a commercial solution for Analytics, or vice versa. Hadoop and MPP solutions, for instance, can work together as ETL pipes along with a commercial Analytics tool. Alternatively, MPP and columnar databases can be chosen along with MapReduce to provide another effective hybrid solution.

     When there are larger volumes of data to be analyzed, organizations are better off using Open Source solutions. Hadoop is one of the best available Open Source solutions that can help them handle their Big Data in a cost-effective manner. It also makes sense to use parallel processing or other fast mechanisms while importing from the source system or exporting to the destination system.

     Incidentally, ‘real time’ is a myth in Big Data. The data warehouse system has to be carefully designed so that the real-time data it handles is bounded, either by size or by time. It is possible to re-use some of the existing EDW investments in building a Big Data platform.
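One way to picture limiting real-time data "by size or by time" is a bounded window that evicts old events. The Python sketch below is a hypothetical illustration: the class name, its parameters, and the assumption that events arrive with non-decreasing timestamps are all choices made for this example, not part of any product described in this paper.

```python
from collections import deque


class BoundedWindow:
    """Keep only recent events, limited by count and by age.

    Assumes events are added with non-decreasing timestamps (in seconds).
    """

    def __init__(self, max_events, max_age_seconds):
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self._events = deque()  # holds (timestamp, payload) pairs, oldest first

    def add(self, timestamp, payload):
        """Add one event, then evict whatever falls outside the limits."""
        self._events.append((timestamp, payload))
        self._evict(timestamp)

    def _evict(self, now):
        # Limit by size: drop the oldest events beyond the count limit.
        while len(self._events) > self.max_events:
            self._events.popleft()
        # Limit by time: drop events older than the allowed age.
        while self._events and now - self._events[0][0] > self.max_age_seconds:
            self._events.popleft()

    def snapshot(self):
        """Return the payloads currently inside the window, oldest first."""
        return [payload for _, payload in self._events]


if __name__ == "__main__":
    window = BoundedWindow(max_events=3, max_age_seconds=10)
    for ts, name in [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (13, "e")]:
        window.add(ts, name)
    print(window.snapshot())
```

Analytics that run on such a window operate on a predictable amount of data, while the full history can still be loaded into the Big Data store in batches.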
  9. The Impetus solution

     Based on its project experiences, Impetus Technologies has built a Big Data Analytics platform for its clients that can help them roll out their Big Data Analytics initiatives. The platform is called iLaDaP, which is short for Impetus Large Data Analytics Platform.

     The core of the iLaDaP platform is built using SOA, and incorporates all the key characteristics of an ideal Big Data Analytics platform discussed earlier. iLaDaP is designed to derive intelligence and operate on huge datasets collected from numerous data sources in multiple data formats. It is powered by Hadoop and can therefore scale linearly up to thousands of nodes using commodity hardware. This spells a significant cost advantage in the long run. iLaDaP also comes with a set of pre-canned and customized reports.
  10. Recognizing that it is important for businesses to track down and take advantage of an opportunity as it happens, Impetus’ platform enables them to react to events as they occur. iLaDaP is also capable of collecting data from a range of disparate sources. This unstructured data can be transformed and utilized for strategic business decisions.

      iLaDaP can be seamlessly integrated with current platforms, without the need for major changes. The core iLaDaP platform is built using Open Source technologies, and its components can be replaced with commercial technologies in accordance with requirements.

      Harnessing existing investments in building a Big Data Analytics platform

      It is possible to reuse investments made in the traditional data warehouse to build a Big Data Analytics platform. Most of the hardware can be reused, since Big Data solutions can run on commodity-grade hardware. Therefore, an existing RDBMS-based infrastructure can be reutilized. The existing code logic and algorithms can also be used after minor modifications that enable them to run in a stateless architectural environment. In this scenario, tools like MATLAB can be integrated with Hadoop-like technologies.
  11. Another way of utilizing the data warehouse investments is by extending or enhancing their capacity by plugging them together with a Big Data warehouse solution. Hadoop, for example, is a cost-effective option for storing archival data, performing deeper analytics, and providing summarized reporting data to an existing data warehouse. This strategy can also help in reusing the reporting tools. Similarly, ETL tools can be modified to use the Big Data warehouse as a sink. Tools like Talend or Informatica provide connectors for using Hadoop and commercial MPPs as data sinks.

      The development and testing strategy can also be reused. Most of the new Big Data warehouse solutions support SQL, Java or scripting languages, and allow the re-use of existing development and testing investments.

      Organizations can deploy iLaDaP on-premise, as well as in a Cloud-supported deployment setup.
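The idea of a Big Data store "providing summarized reporting data to an existing data warehouse" can be sketched as follows. This is an assumption-laden illustration: the sample events, the `daily_views` table, and the in-memory SQLite database standing in for the existing warehouse are all hypothetical, and in practice the summarization step would run as a job on the Big Data platform.

```python
import sqlite3
from collections import Counter

# Hypothetical detailed events, as they might be archived in the Big Data store.
RAW_EVENTS = [
    ("2013-05-01", "product_a"),
    ("2013-05-01", "product_a"),
    ("2013-05-01", "product_b"),
    ("2013-05-02", "product_a"),
]


def summarize(events):
    """Collapse detailed events into sorted (day, product, views) rows."""
    counts = Counter(events)
    return sorted((day, product, views) for (day, product), views in counts.items())


def load_summary(conn, rows):
    """Load the summary rows into a reporting table in the existing warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_views "
        "(day TEXT, product TEXT, views INTEGER, PRIMARY KEY (day, product))"
    )
    # INSERT OR REPLACE lets the same day be re-summarized without duplicates.
    conn.executemany("INSERT OR REPLACE INTO daily_views VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    load_summary(warehouse, summarize(RAW_EVENTS))
    for row in warehouse.execute("SELECT * FROM daily_views ORDER BY day, product"):
        print(row)
```

Only the small summary table reaches the warehouse, so existing reporting tools can keep querying it with SQL while the detailed data stays in the cheaper store.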
  12. Summary

      In conclusion, traditional Enterprise Data Warehouses do not have the ability to keep up with the growing demands of Big Data. The need of the hour is to effectively strategize and build a Big Data analytics platform to manage, store and derive insights from this digital data.

      Also, no single vendor technology will be sufficient. It is recommended that organizations go for a hybrid solution, made up of commercial and Open Source options, to build their Big Data analytics platform.

      When there is a large volume of data to be analyzed, it is suggested that an Open Source solution be used, and Hadoop is among the best options. The success of a Big Data platform depends heavily on the tools that are chosen. Therefore, the most appropriate tools must be selected from the available options. Companies can additionally re-use existing EDW investments for their Big Data analytics platform.

      About Impetus

      Impetus Technologies is a leading provider of Big Data solutions for the Fortune 500®. We help customers effectively manage the “3-Vs” of Big Data and create new business insights across their enterprises.

      Website: | Email:

      © 2013 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.

      May 2013