Transcript of "Dealing with Big Data: Planning for and Surviving the Petabyte Age"
Cognizant 20-20 Insights | June 2011

Making Sense of Big Data in the Petabyte Age

Executive Summary

The concept of “big data”¹ is gaining attention across industries and the globe, thanks to the growth in social media (Twitter, Facebook, blogs, etc.) and the explosion of rich content from other information sources (activity logs from the Web, proximity and wireless sources, etc.). The desire to create actionable insights from the ever-increasing volumes of unstructured and structured data sets is forcing enterprises to rethink their approaches to big data, particularly as traditional approaches have proved difficult, if even possible, to apply to structured data sets.

While data volume proliferates, the knowledge it creates has not kept pace. For example, the sheer complexity of how to store and index large data stores, as well as the information models required to access them, has made it difficult for organizations to convert captured data into insight.

The media appears obsessed with how today’s leading companies are dealing with big data, a phenomenon known as living in the “Petabyte Age.” However, coverage often focuses on the technology aspects of big data, leaving such concerns as usability largely untouched.

For years, the accelerating data deluge has also received significant attention from industry pundits and researchers. What is new is the threshold that has been crossed as the onslaught continues to accelerate in terms of volume, complexity and formats.²

A traditional approach to handling big data is to replace SQL with tools like MapReduce.³ However, coping with the sheer volume of data contained in a routinely analyzed data set does not solve the more pressing issue: people have difficulty focusing on massive amounts of tables, files, Web sites and data marts, all of which are candidates for analysis. It is no longer just about data warehouses.

Usability is a factor that will overshadow the technical characteristics of big data analysis for at least the next five years. This paper focuses specifically on the roadmap organizations must create and follow to survive the Petabyte Age.

Big Data = Big Challenges

The Petabyte Age is creating a multitude of challenges for organizations. The accelerating deluge of data is problematic to all, for within the massive array of data sources (including data warehouses, data marts, portals, Web sites, social media, files and more) is the information required to make the smartest strategic business decisions. Many enterprises are facing the dilemma that the systems and processes devised specifically to integrate all this information lack the required responsiveness to place the information into a neatly organized warehouse in
time to be used at the current speed of business. The heightened use of Excel and other desktop tools to integrate needed information in the most expedient way possible only adds complexity to the problem enterprises face.

There are a number of factors at the heart of big data that make analysis difficult. For example, the timing of data, the complexity of data, the complexity of the synthesized enterprise warehouse and the identification of the most appropriate data are equal, if not larger, challenges than dealing with the large data sets themselves.

The increased complexity of the data available for generating insights is a direct consequence of the following:

• The highly communicative and integrated global economy. Enterprises across industries are increasingly seeking more granular insight into the market forces that ultimately shape their success and failure. Data generated by the “always-on economy” is the impetus, in many cases, for the keen interest in implementing so-called insight facilitation appliances on mobile devices (smartphones, iPads, Android and other tablets) throughout the enterprise.

• The enlightened consumer. Given the explosion in social media and smart devices, many consumers have more information at their fingertips than ever before (often more so than businesses) and are becoming increasingly sophisticated in how they gather and apply such information.

• The global regulatory climate. New regulatory mandates (covering financial transactions, corporate health, food-borne illnesses, energy usage and availability and other issues) require businesses to store increasingly greater volumes of information for much longer time frames.

As a result of these factors, enterprises worldwide have been rapidly increasing the amount of data housed for analysis to compete in the Petabyte Age. Many have responded by embracing new technologies to help manage the sheer volume of data. However, these new toys also introduce data usability issues that will not be solved by new technology but rather will require some rethinking of the business consequences of big data. Among these challenges:

• Big data is not only tabular; it also includes documents, e-mails, pictures, videos, sound bites, social media extracts, logs and other forms of information that are difficult to fit into the nicely organized world of traditional database tables (rows and columns).

• Companies that tackle big data as a technology-only initiative will solve only a single dimension of the big data mandate.

• There are sheer volumetric issues, such as billions of rows of data, that need to be solved. While tried-and-true technologies (partitioning) and newer technologies (MapReduce, etc.) permit organizations to segment data into more manageable chunks, such an approach does not deal with the issue that rarely used information is clogging the pathway to necessary information. Traditional lifecycle management technologies will alleviate many of the volumetric issues, but they will do little to solve the non-technical issues associated with volumetrics.

Figure 1: Tabulating the Information Lifecycle

Types of Information Retained the Longest
  Development Records     4%
  Email                   4%
  Financial Records       5%
  Database Archive        6%
  Government Records     11%
  Organization Records   18%
  Customer Records       19%
  Source Files           25%

How Much Information Is Retained
  >1 PB       4%
  >500 TB     7%
  >100 TB     7%
  >25 TB      4%
  >10 TB     14%
  >5 TB      18%
  >1 TB      25%
  <1 TB      21%

Source: The 100 Year Archive Task Force, SNIA Data Management Forum
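The partitioning and MapReduce approaches mentioned in the volumetric bullet above share one idea: segment a large data set into chunks, process each chunk independently, then combine the partial results. The following is a minimal sketch of that idea in plain Python. It imitates the map and reduce phases on a single machine and does not use any real MapReduce framework; the sample rows, the chunk size and the function names are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows (the partitions)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Map phase: total bytes per source within a single chunk."""
    totals = Counter()
    for source, size_bytes in chunk:
        totals[source] += size_bytes
    return totals


def reduce_totals(left, right):
    """Reduce phase: merge two partial results into one."""
    return left + right


# Hypothetical sample rows: (source name, size in bytes).
rows = [
    ("web_log", 120),
    ("email", 40),
    ("web_log", 80),
    ("social", 60),
    ("email", 10),
]

# Each chunk could be handled by a separate worker; here they run in sequence.
partials = [map_chunk(chunk) for chunk in chunked(rows, size=2)]
totals = reduce(reduce_totals, partials, Counter())
print(dict(totals))
```

Note what the sketch does not do: it makes the volume tractable, but it treats every row as equally worth processing. That is the paper's point about rarely used information clogging the pathway to necessary information.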
• As a result of mergers and acquisitions, global sourcing, data types and other issues, the sheer number of tables and objects available for access has mushroomed. This increase in the volume of objects has made access schemes for big data overly complex, and it has made finding needed data akin to finding a needle in a haystack.

• The information lifecycle management⁴ considerations of data have not received the attention they deserve. Information lifecycle management should not be limited to partitioning schemes and the physical location of data. (Much attention is being given to cloud and virtualized storage, which presumes a process has been devised for establishing that always-on information made available in the cloud is worthy of this heightened availability.) Information lifecycle management is the process that traditionally stratifies the physical layout for technical performance. In the Petabyte Age, where the amount of information available for analysis is increasing at an accelerating rate, the information lifecycle management process should also ensure a heightened focus on information that matters. This stratification should categorize information into the following groups:

  > Information directly related to the creation, extraction or capture of value.
  > Supporting information that could be referred to when devising a strategy to create, extract or capture value.
  > Information required for the operations of the enterprise but not necessarily related to the creation, extraction or capture of value.
  > Information required for regulatory activities but not necessarily related to the creation, extraction or capture of value.
  > Historical supporting information.
  > Historical information that was once aligned with value, regulatory or other purposes but is now kept because it might be useful at some future date.

• Much of the complexity of information made available for deriving insight is a complex weave of multiple versions of the truth, data organized for different purposes, data of different vintages and similar data obtained from different sources. This is a phenomenon that many organizational stakeholders describe as a “black box of information” whose lineage they have no insight into. This delays the use of insight gained from such information, because the information must be validated before it is used for anything out of the ordinary.

• Much of the data available for analysis results from the conventional wisdom that winning organizations are “data pack rats” and that information that arrives on their doorsteps tends to stay as a permanent artifact of the business. Interestingly, according to the 100-Year Archive Task Force,⁵ source files are the most frequently identified source of data retained as part of the “100-Year Archive.”

• A sizable amount of operational information is not housed in official systems administered by enterprise IT groups but instead is stored on desktops throughout the organization. Many of these local data stores were created with good intentions; people responsible for business functions had no other way of gaining control over the information they needed. As a result, these desktop-based systems have created a different form of big data: a weave of Microsoft Access, Excel and other desktop tool-based sources that are just as critical in running enterprises. These individually small to medium-size sources collectively add up to a sizable volume of information. One large enterprise was found to have close to 1,000 operational metrics managed in desktop applications (e.g., Excel), which is not an uncommon situation. These sources never make it to the production environment and downstream data warehouses.

• Much of the big data housed in organizations results from regulatory requirements that necessitate storing large amounts of historical data. This large volume of historical data hides the data required for insight. While it is important to retain this information, housing it with the same priority as information used for deriving insight is both expensive and unnecessary.

Figure 2: Many Sources of Data. [Diagram labels: Partner Data; Portal; Data Warehouse; Taxonomy; Dimensions; Other Informal Information Network; Spreadsheets and Local Data; Operational System; Personal; Organization; Keys.]

The Case for Horizontal Partitioning

Horizontal partitioning is the process of segmenting data in a way that prioritizes information required for value extraction, origination and capture.⁶ This partitioning should result in a scheme that tiers information along the traditional dimensions of a business information model. Such a model sharpens the focus on information that supports the extraction, origination and capture of value.

Figure 3: Converting Big Data Into Value. [Diagram tracing data through inference, knowledge, insight and action to extracted, originated and captured value.]

The Roadmap to Managing Big Data

Big data will be solved through a combination of enhancements to the people, process and technology strengths of an enterprise.

People-based initiatives that will impact big data programs employed at companies include the following:

• Managing the information lifecycle employed at the organization. For good reason (glaring privacy and security concerns, among them), organizations have placed significant focus on information governance. The mandate for determining which data deserves focus should be part of the overall governance charter.

• Ensuring a sufficient skill set to tackle the issues introduced as a consequence of big data.

• Developing a series of metrics to manage the effectiveness of the big data program. This includes:

  > Average, minimum and maximum time required to turn data into insight.
  > Average, minimum and maximum time required to integrate new information sources.
  > Average, minimum and maximum time required to integrate existing information sources.
  > Time required for the management process.
  > Percentage of people associated with the program participating in the management process.
  > Value achieved from the program.
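The time-based program metrics listed above (average, minimum and maximum time to turn data into insight or to integrate a source) and the participation percentage can be derived from simple duration records. The following is a minimal sketch, assuming durations are recorded in days; the metric names, the sample durations and the head counts are hypothetical and only illustrate the calculation.

```python
from statistics import mean


def summarize(durations_days):
    """Return the average, minimum and maximum of a list of durations."""
    if not durations_days:
        raise ValueError("at least one observation is required")
    return {
        "average": mean(durations_days),
        "minimum": min(durations_days),
        "maximum": max(durations_days),
    }


def participation_pct(participants, associated):
    """Percentage of people associated with the program who take part in managing it."""
    if associated <= 0:
        raise ValueError("associated head count must be positive")
    return 100.0 * participants / associated


# Hypothetical observations, in days, for each time-based metric.
observations = {
    "data_to_insight": [12, 30, 7, 21],
    "integrate_new_source": [45, 60, 90],
    "integrate_existing_source": [5, 9, 4],
}

report = {name: summarize(days) for name, days in observations.items()}
for name, stats in report.items():
    print(name, stats)

print("participation %:", participation_pct(participants=18, associated=24))
```

Tracking the minimum and maximum alongside the average, as the list suggests, exposes outliers (for example, one source that took 90 days to integrate) that an average alone would hide.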
According to McKinsey,⁷ the activities of people steering big data will include:

• Ensuring that big data is more accessible and timely.

• Measuring the value achieved by unlocking big data.

• Embracing experimentation through data to expose variability and raise performance.

• Enabling the customization of populations through segmentation.

• Facilitating the use of automated algorithms to replace and support human decision-making, thereby improving decisions, minimizing risks and unearthing valuable insights that would otherwise remain hidden.

• Facilitating innovation programs that use new business models, products and services.

Process-based initiatives that will impact big data programs are best enabled as augmentations of a company’s governance activities. These augmentations include:

• Ensuring sufficient focus on information that will drive value within the organization. These processes are best employed as linkages between corporate strategy and information lifecycle management programs.

  > It is important to note that information lifecycle management is defined in many organizations as a program to manage hierarchies and business continuity. For the purposes of this paper, this definition is extended to include promotion and demotion of data items as aligned with the business information model (how information is used to support enterprise strategies and tactics) of the organization.
  > The process used to govern big data and its information lifecycle should continually refine and prioritize the benefits, constraints, priorities and risks associated with the timely publication and use of relevant, focused, actionable and trustworthy information published under big data initiatives.

• Ensuring the metrics that drive proper adhesion and use of big data are developed. This should cover the following topics:

  > Governing big data.
  > Big data lifecycle.
  > Big data use and adoption.
  > Big data publication metrics.

Technology-based initiatives that will impact big data programs employed at companies include:

• Ensuring that the specialized skills required to administer and use the fruits of the big data initiative are present. Some of these include the databases, the ontologies used to navigate big data and the MapReduce concepts that augment or fully replace SQL access techniques.

• Ensuring that tools introduced to navigate big data are usable by the intended audience without upsetting self-service paradigms that have slowly gained traction during the past several years.

• Ensuring that the architecture and supporting network, technology and software infrastructures are capable of supporting big data.

It is safe to state that if history is any predictor of the future, the sheer volume of data that organizations will need to deal with will outstrip our collective imaginations for how much data will be available for generating insights. Only eight years ago, a 300 to 400 terabyte data warehouse was considered an outlier. Today, multi-petabyte warehouses are easily found. Failure to take action to manage the usability of information pouring into the enterprise is (and will be) a competitive disadvantage (see Figure 4).

Figure 4: Big Data Getting Bigger. [Chart: size of the largest data warehouse in the Winter Top Ten Survey, in TB (0 to 1,000), 1998 to 2012, actual and projected, CAGR = 173%, shown against the Moore’s Law growth rate. Callouts: eBay, 6.5 PB (2009); Google, 1 PB of new data every 3 days (2009). Source: Winter Corp.]

Recommendations

Big data is a reality in most enterprises. However, companies that tackle big data as merely a technology imperative will solve a less important dimension of their big data challenges. Big data is much more than an extension of the technologies used in partitioning strategies employed at enterprises.

Companies have proved that they are pack rats. They need to house large amounts of history for advanced analytics, and regulatory pressures influence them to just store everything. The reduced cost of storage has allowed companies to turn their data warehousing environments into data dumps, which has both added complexity to the models (making it difficult for knowledge workers to navigate the data needed for analysis) and added an analytic plaque, which makes finding the required information for analysis more akin to finding a needle in a haystack.

In many organizations, the information lifecycle initiatives that are mandated too often focus on the optimal partitioning and archiving of the enterprise (i.e., vertical partitioning). Largely a technology focus, this thread of the information lifecycle overlooks data that is no longer aligned with the strategies, tactics and intentions of the organization. The scope and breadth of information housed by enterprises in the Petabyte Age mandates that data be stratified according to its usefulness in organizational value creation (i.e., horizontal partitioning). In today’s organizations, the only cross-functional bodies with the ability to perform this horizontal partitioning are virtual organizations employed to govern the enterprise information asset.

It was only a few years ago that a data warehouse requiring a terabyte of storage was the exception. As we embrace the Petabyte Age, companies are entering an era where they will need to be capable of handling and analyzing much larger populations of information. Regardless of the processes put in place, ever-increasing volumes of structured and unstructured data will only proliferate, challenging companies to quickly and effectively convert raw data into insight in ways that stand the test of time.

Figure 5: Storage Definitions

  Terabyte    Will fit 200,000 photos or MP3 songs on a single 1 terabyte hard drive.
  Petabyte    Will fit on 16 Backblaze storage pods racked in two data center cabinets.
  Exabyte     Will fit in 2,000 cabinets and fill a four story data center that takes up a city block.
  Zettabyte   Will fit in 1,000 data centers or about 20% of Manhattan, New York.
  Yottabyte   Will fit in the states of Delaware and Rhode Island with a million data centers.

Footnotes

1 Big data refers to data sets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualizing.

2 “The Toxic Terabyte” (IBM Research Labs, July 2006) provides a thorough analysis of how companies had to get their houses in order to deal with a terabyte of data. If this authoritative work were rewritten today, it would be called “The Problematic Petabyte” and in five years, most probably, “The Exhaustive Exabyte.”

3 MapReduce is a Google-inspired framework specifically devised for processing large amounts of data optimized across a grid of computing power.

4 Information lifecycle management is a process used to improve the usefulness of data by moving lesser used data into segments. Information lifecycle management is most commonly interested in moving data from always-needed partitions to rarely needed partitions and, finally, into archives.

5 SNIA Data Management Forum’s 100 Year Archive Task Force, 2007.

6 Horizontal partitioning is a term created by the author. It describes applying the generally accepted technique of gaining performance by segmenting data into partitions (vertical partitioning) to the segmenting of groups of data by their likelihood of achieving organizational value.

7 “Big Data, The Next Frontier for Innovation, Competition and Productivity,” McKinsey & Company, May 2011.
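Footnote 6 describes horizontal partitioning as segmenting groups of data by their likelihood of achieving organizational value, and the paper lists six stratification groups for doing so. The following is a minimal sketch of how such a stratification might be expressed. The tier names map to those six groups, but the flag names, the precedence order used to resolve overlaps and the sample data assets are assumptions made for illustration and are not part of the original paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The six stratification groups, in the order the paper lists them."""
    VALUE = 1                    # directly tied to creating, extracting or capturing value
    SUPPORTING = 2               # consulted when devising value strategies
    OPERATIONAL = 3              # needed to run the enterprise
    REGULATORY = 4               # needed for regulatory activities
    HISTORICAL_SUPPORTING = 5    # historical supporting information
    HISTORICAL_JUST_IN_CASE = 6  # kept because it might be useful someday


@dataclass
class DataAsset:
    """Hypothetical descriptor; every flag below is an assumed attribute."""
    name: str
    drives_value: bool = False
    supports_strategy: bool = False
    runs_operations: bool = False
    regulatory: bool = False
    historical: bool = False


def classify(asset: DataAsset) -> Tier:
    """Assign a tier; the precedence order here is an assumption."""
    if asset.drives_value:
        return Tier.VALUE
    if asset.supports_strategy and not asset.historical:
        return Tier.SUPPORTING
    if asset.runs_operations:
        return Tier.OPERATIONAL
    if asset.regulatory:
        return Tier.REGULATORY
    if asset.supports_strategy:
        return Tier.HISTORICAL_SUPPORTING
    return Tier.HISTORICAL_JUST_IN_CASE


# Hypothetical assets used only to exercise the classifier.
assets = [
    DataAsset("trade_archive_2003", regulatory=True, historical=True),
    DataAsset("customer_orders", drives_value=True),
    DataAsset("old_campaign_results", supports_strategy=True, historical=True),
    DataAsset("legacy_extract_unknown", historical=True),
    DataAsset("market_research", supports_strategy=True),
    DataAsset("payroll_runs", runs_operations=True),
]

# Highest-priority tiers first, mirroring promotion and demotion of data items.
ranked = sorted(assets, key=classify)
for asset in ranked:
    print(f"{classify(asset).name:<24} {asset.name}")
```

In practice the flags would be set by the governance body the paper describes rather than by code, and an asset would be promoted or demoted by updating its flags as business priorities change.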