Reducing the Total Cost of Ownership of Big Data- Impetus White Paper
For Impetus’ White Papers archive, visit: http://www.impetus.com/whitepaper

The paper discusses the challenges that relate to the cost of Big Data solutions and looks at the technology options available to overcome these problems.

Reducing the Total Cost of Ownership of Big Data

WHITE PAPER

Abstract

In this white paper, Impetus shares best practices and strategies that will enable businesses to lower the total cost of ownership of Big Data solutions. The paper discusses challenges related to the cost of Big Data solutions, and looks at the technological options available to address Big Data concerns.

Impetus Technologies Inc.
www.impetus.com
Contents

Introduction
Using Commodity Hardware for Big Data
Using Open Source and Cloud Computing
The Cost Components of a Big Data Warehouse
Lowering the Total Cost of Ownership
Reducing the Cost of Storage
What Technologies, Where?
Big Data Scenarios in OLAP
Analytics with Hadoop
Choosing the Right Technologies
Opting for Faster MapReduce/Hadoop
NoSQL Database Solutions
New Era Relational Databases
Impetus Solutions and Recommendations
Conclusion
Introduction

As the power of Big Data solutions continues to grow, so too does the cost of collecting, managing, and storing data. According to IDC/EMC estimates, the total value of the computers, networks, and storage facilities driving the digital universe now stands at a whopping USD 6 trillion. Furthermore, that figure is expected to grow significantly over the next few years. In fact, some estimate that the size of the digital universe doubles every 18 months.

Yet, how much of that information is actually useful? An overload of information can increase the cost of storage, reduce productivity, and essentially ensure that much of the collected data goes to waste. Despite access to this rich pool of data, many businesses continue to extract information of little value. It is estimated that businesses spend an extra USD 650 billion to gather and store data that they never put to use.

Clearly, much more can be done to unearth business intelligence and actionable insights from Big Data. The question is, what is the best way to do that both intelligently and cost-effectively? In this white paper, Impetus examines some of the pros and cons of several Big Data solutions on the market, and offers practical advice based on years of experience.

Using Commodity Hardware for Big Data

There are many advantages to using commodity hardware. In addition to being readily available and accessible, the biggest advantage of commodity hardware is that businesses can build systems themselves, opening up many avenues for innovation.

The cost of building reliable storage from commodity hardware is about USD 1 per gigabyte, which is a great deal and a very good start. Keep in mind, however, that this figure only covers the cost of storage and does not include other costs associated with managing, monitoring, and hosting data.
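As a back-of-the-envelope illustration of why the USD 1 per gigabyte figure is only a starting point, the sketch below factors in replication overhead. The replication factor of 3 mirrors HDFS's default; both numbers are illustrative assumptions, not vendor quotes.

```python
# Illustrative only: raw commodity storage assumed at ~USD 1/GB, and an
# HDFS-style default replication factor of 3, so each usable gigabyte
# consumes three gigabytes of raw disk.
RAW_COST_PER_GB = 1.00  # USD, assumed raw cost of commodity storage
REPLICATION_FACTOR = 3  # HDFS default block replication

def usable_storage_cost(usable_gb: float,
                        raw_cost: float = RAW_COST_PER_GB,
                        replicas: int = REPLICATION_FACTOR) -> float:
    """Hardware cost of storing `usable_gb` of data once replication is
    accounted for. Storage only: excludes managing, monitoring, hosting."""
    return usable_gb * raw_cost * replicas

# 10 TB of usable data needs roughly 30 TB of raw disk:
print(usable_storage_cost(10_000))  # 30000.0
```

Swapping in a site's actual disk price and replication policy turns this into a quick first-order comparison against cloud storage pricing.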
Using Open Source and Cloud Computing

Using free, open source software to store, manage, and analyze Big Data comes with a number of benefits. By now, everyone has heard of Hadoop and its ability to tackle large volumes of data while still providing significant savings.

Using cloud computing for Big Data also has its advantages. Cloud computing allows users to rent resources to take care of data and analytics; providers such as Amazon Web Services and Microsoft, with its Windows Azure platform, offer portfolios from which you can select an offering appropriate for your needs and requirements.

The downside to using cloud computing, however, is its storage capabilities. While storage is available over the cloud, it can be very costly.

The Cost Components of a Big Data Warehouse

Many businesses today are turning to Big Data warehouses as a means of storage. Before making this decision, it is important to understand the costs these storage facilities can generate.

Entry Cost

The first expense is the entry cost: the cost incurred to identify the right Big Data solution.

Cost of Migrating Data

Once a Big Data solution has been chosen, the next expense will be the cost of moving data to the new system. Data migration can be especially expensive for businesses requiring ETL processes, which may call for the purchase of specialized tools that can also be quite expensive.

Other Costs

A number of other factors can potentially inflate the cost of Big Data solutions. For example, all solutions require a tool that will enable the system to be easily handled for scalability and under failure conditions. Thus,
performance analytics and data management may represent additional major expenses in a Big Data plan.

Ongoing maintenance is also essential, and accounts for another cost. As the volume of data increases and changes are made, Big Data warehouses will always require monitoring and tuning.

Taken together, these factors (performance analytics, data management, and data maintenance) can dramatically increase the cost of a Big Data solution.

Lowering the Total Cost of Ownership

Based on years of experience in the field, Impetus has identified a number of best practices to help businesses reduce the total cost of ownership of Big Data solutions. This section discusses potential cost savings in hardware and software, with these two main suggestions in mind:

For hardware, Impetus suggests looking at the cost savings available in storage and computation.

For software, Impetus suggests a number of solutions that will enable the processing of more data, more quickly, and for less money.

Reducing the Cost of Storage

Impetus advises businesses to compress data in order to cut storage costs. Compressed data requires less storage space, and less storage space means less spending.

Some of the solutions available on the market claim they can compress data to 1/40th of its previous size. When looking at these solutions, however, be careful to ensure that the read throughput of the data is not compromised when it is decompressed.

Additionally, with Big Data analytics, businesses may opt to focus on a specific subset of data, rather than all of the data accumulated over time. Another option is to look into systems designed to store data and information based on principles very similar to information lifecycle management (ILM).
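The compression trade-off described above (smaller footprint versus read throughput) can be measured directly. The sketch below uses Python's standard-library codecs as stand-ins for the commercial compression solutions mentioned; the achievable ratio will, of course, vary with the data.

```python
import bz2
import gzip
import time

def compression_report(data: bytes) -> dict:
    """Compare codecs on the same payload: the ratio drives storage
    savings, while decompression time bounds read throughput."""
    report = {}
    for name, codec in (("gzip", gzip), ("bz2", bz2)):
        start = time.perf_counter()
        packed = codec.compress(data)
        mid = time.perf_counter()
        codec.decompress(packed)
        end = time.perf_counter()
        report[name] = {
            "ratio": len(data) / len(packed),
            "compress_s": mid - start,
            "decompress_s": end - mid,
        }
    return report

# Repetitive data, such as machine-generated logs, compresses dramatically:
sample = b"2013-05-01 INFO request served in 12ms\n" * 10_000
for name, stats in compression_report(sample).items():
    print(name, "ratio:", round(stats["ratio"], 1))
```

Running this over a representative sample of your own data is a cheap way to sanity-check a vendor's compression claims before committing to a solution.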
With all this talk about Big Data, it is easy to forget about small data. Often, it is easier to gain business insight using smaller sets of data. Thus, Impetus does not recommend using Big Data solutions for the storage and retrieval of small amounts of data, as the relative latency of queries will be higher.

What Technologies, Where?

One key to reducing the total cost of ownership is to understand the available technologies and how they can be used.

With the advent of Big Data, many commercial and specialized hardware offerings and appliances have come to the market. These solutions offer rich features such as fault tolerance, easy capacity scaling, and specialized management tools. The commodity hardware available today can also be harnessed for Big Data use cases by leveraging open source stacks and solutions.

Latency is also a critical factor, but the systems with the lowest latency are also likely to be the most expensive. There is, of course, a niche market that focuses on latency as a business problem.

For cloud-based Big Data solutions, the first question is whether moving to the cloud is the only solution, given the data storage requirements. Moving to a cloud-based solution can be quite expensive, especially if the data is not already on the cloud. Businesses will also need to upload all of the data needed for processing, which adds significantly to the cost.

With this understanding of the technologies available to tackle Big Data, we can now discuss how they can be used. These technologies can be broadly divided into two categories: online analytical processing (OLAP) and online transaction processing (OLTP).

Big Data Scenarios in OLTP

When generating or working with large sets of data in an OLTP scenario, cost-effective NoSQL solutions are ideal. When working with a typical data warehouse that requires analytical processing, however, Impetus recommends using MapReduce or MPP-based systems.
Big Data Scenarios in OLAP

Big Data online analytical processing (OLAP) can be divided into three different scenarios:

Big Input, Small Output. This is the most common scenario, and is often used to draw conclusions and to prepare graphs or charts, or in cases where the top n elements in a data set need to be identified.

Small Input, Big Output. This scenario occurs when the input data set is small and the resulting output is big, and typically arises in predictive analysis, where n outcomes are possible. It is also applicable in scenarios where correlation-coefficient matrices must be populated from a given set of inputs. These inputs may be small, but the results might turn out to be very large.

Big Input, Big Output. The third scenario occurs in ETL processes. Here, the magnitude of the output data is similar to that of the input data.

In the real world, whenever businesses summarize or concentrate data with respect to parameters such as data volume, latency, or cost, there is a decrease in the volume of data. In such a scenario, small data solutions such as MPP data stores, traditional relational databases, and newer NoSQL databases offering the lowest latency are recommended. Note, however, that when moving from a small data solution to a Big Data solution, the latency of these systems will increase while the corresponding cost per gigabyte will decrease.

It is well known that Hadoop systems are cost effective. That said, in the case of small data solutions, where latency is the key factor, opting for customized and tailored solutions that enable quicker data retrieval will provide the best results. The primary downside of these solutions is that the cost of deployment will increase the storage cost per GB.

Massively parallel processing (MPP), on the other hand, offers a number of significant benefits.
MPP data store solutions provide relational stores while simultaneously accommodating larger sizes of data.

Oftentimes it is best to deploy a combination of these systems to best address business needs.
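The "Big Input, Small Output" scenario above (for example, finding the top n elements in a data set) shows why output size, not input size, should drive tool choice. A minimal single-machine sketch using a bounded heap illustrates the idea before any cluster is involved:

```python
import heapq
import random

def top_n(records, n, key=lambda r: r):
    """Stream through an arbitrarily large input while keeping only the
    current top-n candidates, so memory use stays O(n)."""
    return heapq.nlargest(n, records, key=key)

# Big input: a million synthetic (user_id, spend) events.
# Small output: just the five biggest spenders.
random.seed(7)
events = ((uid, random.uniform(0, 1000)) for uid in range(1_000_000))
top5 = top_n(events, 5, key=lambda event: event[1])
print(len(top5))  # 5
```

The same keep-only-the-top-n idea is what a MapReduce combiner applies per node before the final reduce, which is why this scenario parallelizes so cheaply.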
Analytics with Hadoop

Indirect Analytics Over Hadoop

In this approach, Hadoop is used to clean and transform the data into a structured form, and then to load the structured data into RDBMS databases. This approach provides the end user with the parallel-processing flexibility of Hadoop and an SQL interface at the summarized data level. This solution is relatively inexpensive when compared with other options.

Direct Analytics Over Hadoop

Applying analytics directly over a Hadoop system, without moving the data to any RDBMS databases, can be an effective way to analyze data in the Hadoop Distributed File System (HDFS).

This approach enables both batch and asynchronous analytics of data in the Hadoop system. It is a very cost-effective approach because it does not require the management of data sources other than the existing Hadoop systems. It also allows the flexibility to scale to any level with summarized data.

Analytics Over Hadoop with an MPP Data Warehouse

Today, a number of options available on the market allow for the integration of MPP-based data warehouses and Hadoop. These options are worth considering for large volumes of data.

The primary disadvantage of these approaches, however, is the potential cost involved. Most MPP-based data warehouses are expensive. Some also require high-end servers for deployment, which only adds to the expense.

Choosing the Right Technologies

To choose the right technology stack, businesses need to look at three factors that determine how well it will support their business use cases:

Cost. The first factor is the cost per terabyte for storage. The next consideration is the cost related to business continuity and vendor lock-in. Also, understand how the current system is likely to change with strategic decisions, and whether these changes would require a different vendor.

Latency. The next factors to consider are the latency requirements. Does any use case take the throughput of the system into account? For a system for smaller data, where system response times are critical, MPP-based or relational database systems would be a better choice.

Dollar-per-terabyte. For businesses driven by the dollar-per-terabyte factor, Impetus advises an MPP-based solution. This option provides a middle ground between the Hadoop and NoSQL-based solutions, and can allow storage of large amounts of data without compromising speed.

For businesses with varying requirements, whose data and related strategies also change frequently, Impetus does not recommend working with a vendor lock-in model.

Opting for Faster MapReduce/Hadoop

For business requirements driven by cost or business continuity, opt for Hadoop. Hadoop will enable storage of all of the data, but has a relatively high degree of latency. A few vendors offer faster Hadoop implementations or other parallel processing frameworks. These solutions usually extend standard Hadoop APIs and offer enhanced system performance, as well as better support for the production environment.

NoSQL Database Solutions

OLTP scenarios mean that faster reads and writes are required. The vendors in this market offer a variety of solutions with different underlying implementations, each suited to a different business use case:

HBase and Cassandra are recommended for banking and financial businesses. For random, real-time read/write access to Big ‘table-like’ Data, use HBase. For faster writes, look to Cassandra.

MongoDB and CouchDB are recommended when the primary requirement is the querying of transactional data and defining indexes.
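To illustrate why index support matters for that transactional-query requirement, here is a toy, in-memory sketch (not a real MongoDB or CouchDB client) of how a secondary index turns a full scan over documents into a direct lookup; the document fields are made up for the example.

```python
from collections import defaultdict

# Hypothetical transactional documents, document-store style.
transactions = [
    {"id": 1, "account": "A", "amount": 120},
    {"id": 2, "account": "B", "amount": 75},
    {"id": 3, "account": "A", "amount": 40},
]

def build_index(docs, field):
    """Secondary index: field value -> list of document ids, so queries
    on that field become a dictionary lookup instead of a full scan."""
    index = defaultdict(list)
    for doc in docs:
        index[doc[field]].append(doc["id"])
    return index

by_account = build_index(transactions, "account")
print(by_account["A"])  # [1, 3]
```

Real document databases maintain such indexes incrementally on write, which is exactly the read/write trade-off that separates the OLTP-oriented NoSQL offerings above.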
There are also other databases, such as graph databases like Neo4j, that make Big-Data-heavy social media analytics problems simpler.

New Era Relational Databases

The latest relational databases (RDBMSs) have been specifically designed with these OLTP scenarios in mind, and have taken major steps toward addressing latency issues. Many businesses have been using SQL successfully for years, and most business users still consider SQL to be the best tool to query structured data.

Other solutions include emerging sets of technologies and new versions of existing RDBMS engines that are all very adept at handling large volumes of structured data.

Therefore, for handling large volumes of structured data, look to new era RDBMS solutions such as MySQL Cluster, GridSQL, or later versions of Microsoft SQL Server.

Impetus Solutions and Recommendations

One way to reduce the cost of data migration is to use MapReduce for ETL, rather than costly ETL tools.

Management and provisioning tools are available with commercial Big Data solutions for easy management of systems. Impetus offers Ankush, a vendor-neutral tool for cluster management, which can be used to automatically provision multiple Hadoop clusters.

For ongoing maintenance, the Impetus mantra for success is, “automate, automate, automate!” Any task that needs to be carried out more than once should be automated. This also holds true for monitoring and tuning.

When dealing with changing capacity, continue to add hardware or look for alternative methods to speed things up. Using graphics processing units for general-purpose computing can also help.

Impetus also recommends RainStor or similar solutions that help compress data and reduce the cost of the hardware required for data storage.
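The MapReduce-for-ETL suggestion above can be sketched in Hadoop Streaming style, where the mapper and reducer are plain functions over lines of text. The input format ("date,product,amount") and the cleaning rules are hypothetical, assumed for illustration:

```python
def mapper(lines):
    """Cleaning step: parse raw 'date,product,amount' rows, silently
    dropping malformed records instead of failing the whole job."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # malformed row: wrong column count
        _, product, amount = parts
        try:
            yield product, float(amount)
        except ValueError:
            continue  # malformed row: non-numeric amount

def reducer(pairs):
    """Aggregation step: sum amounts per product. On a real cluster the
    shuffle/sort phase would deliver these pairs grouped by key."""
    totals = {}
    for product, amount in pairs:
        totals[product] = totals.get(product, 0.0) + amount
    return totals

raw = ["2013-01-02,widget,10.5", "corrupt line", "2013-01-03,widget,4.5"]
print(reducer(mapper(raw)))  # {'widget': 15.0}
```

The same two functions can be wired up as Hadoop Streaming jobs, replacing a separately licensed ETL tool for simple cleanse-and-aggregate migrations.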
Finally, look to faster, tailored MapReduce solutions that will allow completion of more tasks in less time.

Conclusion

In summary, best practices and robust strategies can help lower the total cost of ownership of your Big Data solutions, and transform Big Data challenges into Big Data opportunities.

At Impetus, we have used these methods, paired with the Hadoop ecosystem, to successfully tackle Big Data problems.

About Impetus

Impetus Technologies is a leading provider of Big Data solutions for the Fortune 500®. We help customers effectively manage the “3-Vs” of Big Data and create new business insights across their enterprises.

Website: www.bigdata.impetus.com | Email: bigdata@impetus.com

© 2013 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.

May 2013