Making sense of big data

Typical BI takes data that an enterprise collects from transactional and other relational database management systems, scrubs it for accuracy and consistency, and puts it into a form the BI system can query. Such systems are vital for accurate analyses of transactional information, but they don’t work well for messy questions, they have been too expensive, and they haven’t scaled efficiently. In contrast, Big Data techniques allow you to sift through data to look for patterns at a lower cost and in less time.

Published in: Economy & Finance, Technology

Transcript of "Making sense of big data"

  1. 1. Technology Forecast: Making sense of Big Data. A quarterly journal, 2010, Issue 3. In this issue: 04 Tapping into the power of Big Data; 22 Building a bridge to the rest of your data; 36 Revising the CIO’s data playbook
  2. 2. Contents. Features: 04 Tapping into the power of Big Data: Treating it differently from your core enterprise data is essential. 22 Building a bridge to the rest of your data: How companies are using open-source cluster-computing techniques to analyze their data. 36 Revising the CIO’s data playbook: Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.
  3. 3. Interviews: 14 The data scalability challenge: John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years. 18 Creating a cost-effective Big Data strategy: Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies. 34 Hadoop’s foray into the enterprise: Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach. 46 New approaches to customer data analysis: Razorfish’s Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data. Departments: 02 Message from the editor; 50 Acknowledgments; 54 Subtext
  4. 4. Message from the editor Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned “sabermetrician” (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams. James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James “doesn’t just understand information. He has shown people a different way of interpreting that information.” Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does. James challenged these assumptions. He asked critical questions that didn’t have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days’ rest does a reliever need? James’s answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can’t a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don’t use the best relievers to their maximum potential. The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn’t stop with just one set of statistics. 
The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It’s a continual process of investigation, one that’s focused on surfacing the best questions rather than assuming those questions have already been asked. Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren’t making best use of. And not all of the data is the same. Some of it has value, and some, not so much. The problem with this data has been twofold: (1) it’s difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive. 02 PricewaterhouseCoopers Technology Forecast
  5. 5. Addressing these problems effectively doesn’t require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They’ve demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to handle different data differently. Enterprises shouldn’t treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they’re generating. With this approach, they can do what Bill James does and find better questions to ask. In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article, “Tapping into the power of Big Data,” on page 04, we begin with a consideration of exploratory analytics: methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack. The article, “Building a bridge to the rest of your data,” on page 22 highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn’t have the means to analyze before. The buzz around Big Data and “cloud storage” (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article, “Revising the CIO’s data playbook,” on page 36 emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of “gray data,” or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn’t yet have established value. Gray data analysis requires a different skill set: people who are more exploratory by nature. As always, in this issue we’ve included interviews with knowledgeable executives who have insights on the overall topic of interest: • John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years. • Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision. • Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop’s adoption at search engine, social media, and financial services companies. • Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods. Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. We welcome your feedback and your ideas for future research and analysis topics to cover. Tom DeGarmo, Principal, Technology Leader, thomas.p.degarmo@us.pwc.com
  6. 6. Tapping into the power of Big Data: Treating it differently from your core enterprise data is essential. By Galen Gruman
  7. 7. Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC. “In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence,” observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. “The challenge becomes what do you do with it all?” Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based framework for distributed storage and processing whose file system is modeled on the Google File System and which is developed under the Apache Software Foundation. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems. This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration, along with the ability to generate value from data that hasn’t been scrubbed or fully modeled into relational tables. Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney’s still-nascent but illustrative effort.
  8. 8. Bringing Big Data under control: Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. Like everyone else’s, Disney’s Big Data is huge, more unstructured than structured, and growing much faster than transactional data. The Disney Technology Shared Services Group, which is responsible for Disney’s core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units. Disney’s data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney’s Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks. The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs. Albers assumes that Big Data analysis is destined to become essential. “The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what’s in there, and figuring out how we deal with it,” he says. 
The team stumbled upon an inexpensive way to improve the business while pursuing more IT cost-effectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division’s cost curve so IT expenses would rise more slowly than the business usage of IT (the opposite had been true), he turned to an approach that many companies use to make data centers more efficient: virtualization. Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to –3 percent, while increasing its annual processing growth from 17 percent to 45 percent. While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney’s many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems. During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions. Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn’t been able to mine before. 
It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers.
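As a rough illustration of the map-and-reduce pattern described in this slide, the following Python sketch counts successful page requests per path from a few Web log lines. The log layout, the sample data, and every function name are assumptions made for this example; it runs in a single process and is neither Hadoop's API nor Disney's code. Each chunk is mapped independently, which is the property that would let a real cluster spread the work across many machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines in a simplified "ip method path status" layout.
SAMPLE_LOG = [
    "10.0.0.1 GET /parks 200",
    "10.0.0.2 GET /movies 200",
    "10.0.0.1 GET /parks 200",
    "10.0.0.3 GET /store 404",
    "10.0.0.2 GET /parks 200",
]


def split_into_chunks(lines, chunk_size):
    """Divide the input so each chunk could be handed to a different worker."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(lines):
    """Map step: emit a (path, 1) pair for every successful request."""
    pairs = []
    for line in lines:
        _ip, _method, path, status = line.split()
        if status == "200":
            pairs.append((path, 1))
    return pairs


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_hits(lines, chunk_size=2):
    """Run the map, shuffle, and reduce steps in sequence."""
    chunks = split_into_chunks(lines, chunk_size)
    # Chunks do not depend on one another, so this loop is the part
    # a cluster would run in parallel.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(chain.from_iterable(mapped)))


if __name__ == "__main__":
    print(count_hits(SAMPLE_LOG))
```

Changing the question (for example, counting failed requests instead) only means swapping the map function, which is one way to read the claim that low-cost, fast results let analysts ask more questions.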
  9. 9. Figure 1: Disney’s Hadoop cluster and central logging service. Diagram labels: site visitors; internal business partners; affiliated businesses; usage data (1); central logging service (2); core IT and business unit systems; D-Cloud data cluster; Hadoop (3); metadata repository; interface to cluster (MapReduce/Hive/Pig); improved experience (4). Caption: Disney’s new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more responsive and personalized user experience. Source: Disney, 2010
  10. 10. Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000. “Before, I would have needed to figure on spending $3 million to $5 million for such an initiative,” Albers says. “Now I can do this without charging to the bottom line.” Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that “the risk is lower due to all the other costs being lower.” Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid. Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior. How Big Data analysis is different: What should other enterprises anticipate from Hadoop-style analytics? It is a type of exploratory BI they haven’t done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses that are tested before moving on to validation and consolidation. These methods could be used to answer questions such as, “What indicators might there be that predate a surge in Web traffic?” or “What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?” or “What’s the value of an influencer on Web traffic through his or her social network?” See the sidebar “Opportunities for Big Data insights” for more examples of the kinds of questions that can be asked of Big Data. Disney and others explore their data without a lot of preconceptions. They know the results won’t be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense. Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects (such as sales and purchasing records, product development costs, and new employee hire records), diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don’t work well for messy questions, they’ve been too expensive for questions you’re not sure there’s any value in asking, and they haven’t been able to scale to analyze large data sets efficiently. (See Figure 2.) Sidebar: Opportunities for Big Data insights. Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows: • Customer churn, based on analysis of call center, help desk, and Web site traffic patterns • Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites • Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data • Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil
  11. 11. In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn’t afford to consider in the past. “Part of the analytics role is to challenge assumptions,” Estes says. BI systems aren’t designed to do that; instead, they’re designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes. Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That’s different from the “single source of truth” approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. “We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang,” Albers says. Other companies have also tapped into the excitement brewing over Big Data technologies. Several Web-oriented companies that have always dealt with huge amounts of data (such as Yahoo, Twitter, and Google) were early adopters. Now, more traditional companies (such as TransUnion, a credit rating service) are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized. Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues. Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn’t have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools. Figure 2: Where Big Data fits in. Diagram labels: large data sets; small data sets; non-relational data; relational data; Big Data (via Hadoop/MapReduce); little analytical value; less scalability; traditional BI. Source: PricewaterhouseCoopers, 2010. Figure 3: The data architecture landscape in 2010. Diagram labels: high processing power; low processing power; centralized compute architecture; distributed compute architecture; enterprises facing scaling and capacity/cost problems; Google, Amazon, Facebook, Twitter, etc. (all use non-relational data stores for reasons of scale); most enterprises; cloud users with low compute requirements. Source: PricewaterhouseCoopers, 2010
  12. 12. The ways different enterprises approach Big Data It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools. “At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately,” says John Parkinson, TransUnion’s acting CTO. “We want to do accurate but approximate matching and categorization in very large low-structure data sets.” Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. “MapReduce also, at least in its theoretical formulation, is very amenable to highly parallelized execution,” which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes. However, Parkinson thinks Hadoop and MapReduce are too immature. “MapReduce really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. As for Hadoop, they have done a good job, but it’s like a lot of open-source software—80 percent done. There were limits in the code that broke the stack well before what we thought was a good theoretical limit.” Parkinson echoes many IT executives who are skeptical of open-source software in general. “If I have a bunch of engineers, I don’t want them spending their day being the technology support environment for what should be a product in our architecture,” he says. That’s a legitimate point of view, especially considering the data volumes TransUnion manages—8 petabytes from 83,000 sources in 4,000 formats and growing— and its focus on mission-critical capabilities for this data. Credit scoring must run successfully and deliver top-notch credit scores several times a day. 
It’s an operational system that many depend on for critical business decisions that happen millions of times a day. (For more on TransUnion, see the interview with Parkinson on page 14.) Disney’s system is purely intended for exploratory efforts or at most for reporting that eventually may feed up to product strategy or Web site design decisions. If it breaks or needs a little retooling, there’s no crisis. But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.” Data architect Estes also sees responsiveness in open-source development that’s laudable. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.” Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.” Albers’ points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later, and are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps. 
Asking new business questions Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations’ top line. “There’s a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously,” says Tom Urquhart, chief architect at PricewaterhouseCoopers. Think of it as a “Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured,” he says. PricewaterhouseCoopers Technology Forecast
  13. 13. The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn’t been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don’t discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts. These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.) Pattern analysis mashup services There’s another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data–based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line—based on their previous online and offline behavior with Disney. 
O’Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O’Reilly’s research director.
Figure 4: Information loss in the data consolidation process. Source: PricewaterhouseCoopers, 2010. (Diagram labels: all collected data; pre-consolidated data (never collected); consolidation; summary departmental data; summary enterprise data; information loss; exploration; less information loss; insight; greater insight.)
  14. 14. Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers. Exploiting the power of human analysis Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren’t designed for, acting as an enabler and an amplifier of human analysis. Ad hoc exploration at a bargain Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better. Medical data is an example of the potential for ad hoc analysis. “A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around,” says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities. Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration. Analyzing data that wasn’t designed for BI Big Data also lets you work with “gray data,” or data from multiple sources that isn’t formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy—and thus cannot be examined by BI systems. One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. 
From there, you do further research using a mix of information resources whose accuracy and completeness may be more established. People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data.
Figure 5: Gray versus black data. Source: PricewaterhouseCoopers, 2010.
Black data (e.g., financial system data): classified; provenanced; cleaned; actual; reviewed; confirming; more trustworthy; managed by IT.
Gray data (e.g., Wikipedia): raw; data and context commingled; noisy; hypothetical; unchecked; indicative; less trustworthy; managed by business unit.
Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency—for example, trends in a neighborhood’s housing values or in local merchants’ sales patterns— to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located. The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found. All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis.
  15. 15. Why the time is ripe for Big Data Conclusion The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What’s changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately. PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don’t limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today. This potential can be realized only if you pull together and analyze all that data. Right now, there’s simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale. Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can’t accomplish. 
Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools that use familiar data-analysis tools— similar to those for SQL databases and Excel spreadsheets—to explore Big Data sources, thus broadening the ability to explore to a wider set of knowledge workers. Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries). Big Data analysis does not replace other systems. Rather, it supplements the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that these information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance—while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution. As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today’s information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors—collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information. We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. 
They’ll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible—and a lot of “noise” certain to mask what’s important. Insight-oriented analytics in this sea of information— where interactions cause untold ripples and eddies in the flow and delivery of business value—will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights.
  16. 16. The data scalability challenge John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years. Interview conducted by Vinod Baya and Alan Morrison John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion’s considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today—challenges he says that more companies will face in the near future. PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce? JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. PwC: What class of problem would that be? JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t. 
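As a rough illustration of the class of problem Parkinson describes, the sketch below filters rows that approximately match a pattern in a map step, then groups the survivors in a reduce step. It is a single-process toy under stated assumptions, not TransUnion's method and not Hadoop code: the pattern, the 0.9 similarity threshold, the sample rows (which borrow the address example Parkinson gives later in the interview), and the function names are all invented for illustration, and Python's standard difflib stands in for whatever matching algorithm a real system would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher

PATTERN = "joe smith 13 main street"  # hypothetical pattern to look for
THRESHOLD = 0.9                       # hypothetical similarity cutoff


def map_phase(chunk):
    """Emit (category, row) for every row that approximately matches PATTERN."""
    for row in chunk:
        score = SequenceMatcher(None, PATTERN, row.lower()).ratio()
        if score >= THRESHOLD:
            category = "exact" if score == 1.0 else "approximate"
            yield category, row


def reduce_phase(pairs):
    """Group the emitted rows by category."""
    grouped = defaultdict(list)
    for category, row in pairs:
        grouped[category].append(row)
    return dict(grouped)


def run(chunks):
    # Each chunk is independent of the others, which is what would let a
    # real cluster run the map step on many machines in parallel.
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return reduce_phase(pairs)


chunks = [
    ["Joe Smith 13 Main Street", "Ann Lee 9 Oak Avenue"],
    ["Joe Smith 31 Main Street", "Joe Smyth 13 Main Street"],
]
result = run(chunks)
print(result)
```

Because the map step looks at one row at a time and keeps no shared state, the chunks could in principle be handed to separate workers; that independence is what "amenable to highly parallelized execution" refers to.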
The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts. From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.
  17. 17. We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit. Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be awhile before it stabilizes to the point that I want to bet revenue on it. PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that’s not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to? JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won’t pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don’t always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don’t mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we’ve discovered that it doesn’t work. PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge? 
JP: My dilemma is that things that are known to work usually don’t scale to what we need—for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren’t mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn’t work, or goes away, I can fit something else into its place relatively easily. That’s why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn’t matter if I have to unplug one in 18 months or two years because something better came along. I can’t do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative.
  18. 18. PwC: What are you using in place of something like Hadoop? JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem. PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text? JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That’s the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it’s generally hooked together around a well-understood set of identifiers. But the cost of this data is essentially free—we don’t pay for it. It’s also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build. At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day. One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms.
PwC: Of the three kinds of data, which is the most challenging? JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going—where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing. PwC: It’s more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct? JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that. PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years? JP: Yes, I believe so. PwC: What are some of the other problems you think will become more widespread? JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast
  19. 19. as they go for reads and writes and you can’t go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I’m not done until I’m done. Additionally, I have some bugs and new vulnerabilities to deal with. Today, we don’t have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won’t go faster than that. The average IT department doesn’t worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue. We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don’t have a computational problem, but if I can’t shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can’t move the data further. PwC: Cloudera [a vendor offering a Hadoop distribution] would say bring the computation to the data. JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store. 
What we have discovered—because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world—is that once you go beyond a certain size, the parallelization management tools break. That’s why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren’t telling me how.
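The "Store It and Move It" trade-off Parkinson describes can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the 1 GB/s sustained throughput is an assumed figure, not one from the interview, the sample record is invented, and the compression ratio of such a repetitive sample is far better than real credit data would achieve. It uses Python's standard zlib module to show the compress, move, decompress cycle and how moving time scales with the number of bytes.

```python
import zlib

TB = 10**12  # decimal terabyte, in bytes


def transfer_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to move size_bytes at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


# Moving 100 terabytes at an assumed sustained 1 GB/s.
raw_hours = transfer_hours(100 * TB, 10**9)

# Spend compute cycles to compress, so there are fewer bits to move.
sample = b"JOE SMITH|13 MAIN STREET|ACCOUNT OPEN|" * 1000  # invented record
packed = zlib.compress(sample, 6)
ratio = len(packed) / len(sample)

# Decompress on the far side before computing on the data.
restored = zlib.decompress(packed)

# If the whole data set compressed as well as this sample (it would not),
# the moving time would shrink by the same ratio.
compressed_hours = raw_hours * ratio

print(f"uncompressed move: {raw_hours:.1f} hours")
print(f"sample compression ratio: {ratio:.3f}")
```

At the assumed throughput the uncompressed move takes a little under 28 hours, which is why the cost of the CPU time spent compressing has to be weighed against the time saved on the wire and the space saved on disk.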
  20. 20. Creating a cost-effective Big Data strategy Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies. Interview conducted by Galen Gruman and Alan Morrison Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management team includes Scott Thompson, vice president of architecture, and Matt Estes, principal data architect. The Technology Shared Services Group, located in Seattle, has a heritage dating back to the late 1990s, when Disney acquired Starwave and Infoseek. The group supports all the Disney businesses ($38 billion in annual revenue), managing the company’s portfolio of Web properties. These include properties for the studio, store, and park; ESPN; ABC; and a number of local television stations in major cities. In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s Web data analysis footprint without incurring additional cost by implementing a Hadoop cluster. Albers and team freed up budget for this cluster by virtualizing servers and eliminating other redundancies. PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective? BA: We try and understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you’re selling to millions, you’re trying to understand the different audiences and how they connect. 
One of the things I’ve been telling my folks from a data perspective is that you don’t send terabytes one way to be mated with a spreadsheet on the other side, right? We’re thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We’re trying to stay ahead of where all the businesses are. In that respect, the questions I’m asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility?
  21. 21. We hope to do in other areas what we’ve done with content distribution networks [CDNs]. We’ve had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of Lost, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we’re going to turn it back, and I’m going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically. PwC: What are the other main strengths of the Technology Shared Services Group at Disney? BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one—shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It’s all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared—all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites. Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That’s the way we split up the world, if that makes sense. PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business? BA: It’s more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? 
Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it’s a constant struggle to answer the question, “Do we have everything?” We’re headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value. It may have only a temporal level of importance, and so we’re trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business.
  22. 22. PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster? BA: Guys like me never get called when it’s all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren’t being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you’re going to use only 5 percent of? CPU utilization isn’t the only measure, but it’s the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization on five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period. Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn’t have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service. We call this our D-Cloud effort. Another step in this effort was moving to a REST [REpresentational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. 
We can then use the Hadoop and MapReduce paradigm to go after that data. PwC: How does the central logging service fit into your overall strategy? ST: As we looked at it, we said, it’s not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we’re working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what’s going on, and make better decisions on the basis of what you learned. PwC: How do you evolve so that the data strategy is really served well, so that it’s more of a data-driven approach in some ways? ME: On one side, you have a very transactional OLTP [online transactional processing] kind of world, RDBMSs [relational database management systems], and major vendors that we’re using there. On the other side of it, you have traditional analytical warehousing. And where we’ve slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There’s a freedom that’s derived from blending these two kinds of data. Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances. Then the key will be putting an expert system in place. That will give us the ability to really understand what’s going on in the actual operational environment. We’re starting to move again toward lower utilization trajectories. 
We need to scale the infrastructure back and get that utilization level up to the threshold. PricewaterhouseCoopers Technology Forecast
  23. 23. PwC: This kind of information doesn’t go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps.
ST: We think storing the unstructured data in its raw format is what’s coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a place that someone can dig around in, but you keep the data in its raw format and pull out only what you need.
BA: The wonderful thing about where we’re headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we’re compliant from a legal perspective with licensing and so forth. After that’s taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, I’m going to go spend how much for Teradata?
We’re using the basic premise of the cloud, and we’re using those techniques of standardizing the interface to virtualize and drive cost out. I’m taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing.
ME: Refining some of this reinvestment in new capabilities doesn’t have to be put in the category of traditional “$5 million projects” companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000.
BA: It’s then a matter of how you’re redeploying an investment in resources that you’ve already made as a corporation. It’s a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off.
I can try it, and I don’t have to get great big governance-based permission to do it, because it’s not a bet of half the staff and all of this stuff. It’s, OK, let’s get something on the ground, let’s work with the business unit, let’s pilot it, let’s go somewhere where we know we have a need, let’s validate it against this need, and let’s make sure that it’s working. It’s not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast.
  24. 24. Building a bridge to the rest of your data
How companies are using open-source cluster-computing techniques to analyze their data.
By Alan Morrison
  25. 25. As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data, as if projects such as Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn’t exist. In a May 2008 blog, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops[1] of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. “Have the supercomputer folks been bypassed and don’t even know it?” Turner wondered.[2]
Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google’s research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google’s wake. Many of them were Web companies that had data processing scalability challenges similar to Google’s.
Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google’s parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes.
By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O’Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means.
Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.
  26. 26. What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.
Hadoop clusters
Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray and others, the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.[3]
Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.[4]
Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include:
• Price/performance over peak performance: The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo’s Hadoop clusters have won Gray’s terabyte sort benchmarking test.[5]
• Software tolerance for hardware failures: When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O’Reilly Media, says, “If you are going to have 40 or 100 machines, you don’t expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time.”
• High compute power per query: The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem.
• Modularity and extensibility: Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture.
Hadoop isn’t intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) and relational data systems. They don’t work well with transactional data or records that require frequent updating. “Hadoop will process the data set and output a new data set, as opposed to changing the data set in place,” says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop.
A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah’s words, “You move your processing to where your data lives.” Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks (the kind used in most PCs and servers) and Gigabit Ethernet for most network interconnections. (See Figure 1.)
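The two ideas above, running each task on a node that already stores the data and handing work to another node when one fails, can be illustrated with a small sketch. This is not Hadoop's scheduler; the function name `assign_tasks`, the block and node names, and the least-loaded tie-break are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_tasks(block_locations, live_nodes):
    """Assign each block's task to a live node that already stores that block.

    block_locations: dict mapping a block name to the nodes holding a copy.
    live_nodes: set of node names that are currently healthy.
    """
    load = defaultdict(int)  # number of tasks already given to each node
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        # Only nodes that hold the block locally AND are alive are candidates.
        candidates = [node for node in holders if node in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live copy of block {block}")
        # Illustrative policy: pick the least-loaded candidate, then by name.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: three blocks, each stored on two of three nodes.
locations = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n2", "n3"]}
healthy_plan = assign_tasks(locations, {"n1", "n2", "n3"})
# If n1 fails, rerunning the assignment moves its work to surviving copies.
recovery_plan = assign_tasks(locations, {"n2", "n3"})
```

Because every block has a copy on more than one node, losing a single node in this toy cluster never leaves a task without local data to read.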
  27. 27. [Figure 1: Hadoop cluster layout and characteristics. Diagram labels: Client; one 1000Mbps switch; two 100Mbps switches; two racks of TaskTracker/DataNode machines; JobTracker; NameNode.]
Typical node setup: 2 quad-core Intel Nehalem processors; 24GB of RAM; 12 1TB SATA disks (non-RAID); 1 Gigabit Ethernet card. Cost per node: $5,000. Effective file space per node: 20TB.
Claimed benefits: linear scaling at $250 per user TB (versus $5,000–$100,000 for alternatives); compute placed near the data and fewer writes limit networking and storage costs; modularity and extensibility.
Source: IBM, 2008, and Cloudera, 2010
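The $250 per TB figure in Figure 1 follows from the two node numbers given there: $5,000 per node divided by 20TB of effective file space per node. A minimal sketch of that arithmetic follows; the function names and the 100TB target are illustrative assumptions, and the model ignores switches, racks, power, and staff.

```python
import math


def cost_per_effective_tb(cost_per_node_usd, effective_tb_per_node):
    """Hardware cost per terabyte of effective file space."""
    return cost_per_node_usd / effective_tb_per_node


def cluster_cost(target_tb, cost_per_node_usd=5000, effective_tb_per_node=20):
    """Whole nodes needed for target_tb, and their total hardware cost."""
    nodes = math.ceil(target_tb / effective_tb_per_node)
    return nodes, nodes * cost_per_node_usd


print(cost_per_effective_tb(5000, 20))  # Figure 1 inputs
print(cluster_cost(100))                # hypothetical 100TB target
```

Under these assumptions cost grows in steps of one node, which is what "linear scaling" amounts to in practice.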
  28. 28. The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, “The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative.”[6]
The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, Disney’s Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multiterabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, “Tapping into the power of Big Data,” on page 04.)
These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon’s or Cloudera’s distribution on the Amazon Elastic Compute Cloud (EC2) platform.
The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon’s EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press.
Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.[7]
“Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs],” says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) “I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That’s extraordinarily powerful.”
The Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they’re a “data operating system.” This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. “It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it,” Awadallah says.
HDFS, like all Hadoop tools, is Java based. An HDFS contains two kinds of nodes:
• A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs
• Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode
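To make the division of labor between the two node types concrete, here is a minimal sketch of how a file might be cut into 64MB blocks and how NameNode-style metadata could record which blocks belong to which file. The function names and the `name#index` block IDs are hypothetical; real HDFS metadata holds far more than this.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the 64MB block size mentioned above


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of the given size."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def build_metadata(files):
    """Map each file name to a list of block IDs (hypothetical ID format)."""
    return {
        name: [f"{name}#{index}" for index, _ in enumerate(split_into_blocks(size))]
        for name, size in files.items()
    }


MB = 1024 * 1024
# A 200MB file needs three full blocks plus one partial block.
print(split_into_blocks(200 * MB))
print(build_metadata({"events.log": 130 * MB}))
```

In this sketch only the small metadata dictionary would live on the NameNode, while the blocks themselves would be spread across DataNodes.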
  29. 29. HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage.
HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. “HDFS was never designed for structured data and therefore it’s not optimal to perform queries on structured data,” says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.[8]
Equally important are fault tolerance within the same disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method.
Some developers are structuring data in ways that are suitable for HDFS; they’re just doing it differently from the way relational data would be structured. Nathan Marz, a lead engineer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. “A lot of people think that Hadoop is meant for unstructured data, like log files,” Marz says. “While Hadoop is great for log files, it’s also fantastic for strongly typed, structured data.” For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.[9] (See the discussion of Thrift later in this article.)
Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce.
[Figure 2: The Hadoop Distributed File System, or HDFS. Diagram labels: Client; NameNode (metadata) listing files and their block numbers; four DataNodes, each holding copies of several numbered blocks. Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008]
[Figure 3: Hadoop ecosystem overview. Diagram labels: Input data (less-structured information such as log files, messages, images); Input applications (Cascading, Thrift, Zookeeper, Pig); Core Hadoop data processing (jobs, 64MB blocks, Map, Reduce, results); Output applications (mashups, RDBMS apps, BI systems). Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010]
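The block-copying method described in the text, three copies of each block with two in the same rack and the third in another rack, can be sketched as a placement function. This follows the article's description only and is not taken from Hadoop's code; the function name, rack names, and the "first available node" choice are assumptions.

```python
def place_replicas(racks, local_rack):
    """Pick three nodes for one block: two in local_rack, one in another rack.

    racks: dict mapping a rack name to the list of DataNode names in it.
    """
    local_nodes = racks[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("local rack needs at least two nodes")
    # Any other non-empty rack can hold the third copy.
    remote_racks = [
        nodes for name, nodes in sorted(racks.items())
        if name != local_rack and nodes
    ]
    if not remote_racks:
        raise ValueError("a second rack is required for the third copy")
    return [local_nodes[0], local_nodes[1], remote_racks[0][0]]


# Hypothetical two-rack cluster.
racks = {"rack1": ["d1", "d2", "d3"], "rack2": ["d4", "d5"]}
print(place_replicas(racks, "rack1"))
```

Keeping two copies in one rack limits traffic between racks, while the third copy protects the block if a whole rack becomes unreachable.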
  30. 30. MapReduce
MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, “it hides the details of parallelization” and the other nuts and bolts of HDFS.[10]
MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn’t mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. “I avoid using MapReduce directly at all cost,” Marz says. “I actually do almost all my MapReduce work with a library called Cascading.”
The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as Hypertext Markup Language (HTML).
MapReduce’s main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems and less adaptation of the programs is required. The following sections examine some of these tools.
[Figure 4: MapReduce phases. Diagram labels: data stores 1 through n supply input key-value pairs to Map steps, which emit values for keys 1, 2, and 3; a barrier aggregates intermediate values by output key; Reduce steps produce the final values for each key. Source: Google, 2004, and Cloudera, 2009]
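The map, barrier, and reduce phases can be imitated in a few lines of single-machine Python. This sketch shows the shape of the computation only, using URLs as input keys and page text as values as the article describes; it does not use Hadoop's API, and the function names and sample pages are assumptions.

```python
from collections import defaultdict


def map_phase(url, page_text):
    """Map: turn one (URL, content) input pair into (word, 1) pairs."""
    for word in page_text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Barrier: group all intermediate values by their output key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(pages):
    intermediate = []
    for url, text in pages.items():  # on a cluster, these maps run in parallel
        intermediate.extend(map_phase(url, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "Web pages" keyed by URL.
pages = {
    "http://a.example/": "big data big",
    "http://b.example/": "data cluster",
}
print(run_job(pages))
```

On a real cluster the framework, not the programmer, decides where each map and reduce call runs; that is the detail the Google authors say MapReduce hides.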
  31. 31. Cascading
Wensel, who created Cascading, calls it an alternative API to MapReduce, a single library of operations that developers can tap. It’s another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, “you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster.”
Wensel wanted to obviate the need for “thinking in MapReduce.” When using Cascading, developers don’t think in key-value pair terms; they think in terms of fields and lists of values called “tuples.” A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through “pipe” assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.)
Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz’s terms, “compile to MapReduce.” In this way, Cascading smoothes the bumpy MapReduce terrain so more developers, including those who work mainly in scripting languages, can build flows. (See Figure 6.)
[Figure 5: A Cascading assembly. Legend: [f1, f2, ...] = tuples with field names; So = source; Si = sink; P = pipe; pipes are grouped into Map and Reduce steps. Source: Concurrent, 2010]
[Figure 6: Cascading assembly and flow. Diagram labels: Client (assembly, flow); Cluster (jobs); pipe assembly; Hadoop MR (translation to MapReduce); MapReduce jobs. Source: Concurrent, 2010]
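The idea of thinking in tuples and pipes rather than key-value pairs can be shown with a small Python analogy. Cascading itself is a Java library and this is not its API; the helper names `each`, `keep`, `count_by`, and `run_flow`, and the log-line format, are all hypothetical.

```python
def each(function):
    """Pipe that applies a function to every tuple."""
    return lambda tuples: (function(t) for t in tuples)


def keep(predicate):
    """Pipe that filters tuples."""
    return lambda tuples: (t for t in tuples if predicate(t))


def count_by(field):
    """Pipe that aggregates: one output tuple per distinct field value."""
    def pipe(tuples):
        counts = {}
        for t in tuples:
            counts[t[field]] = counts.get(t[field], 0) + 1
        return ({field: value, "count": n} for value, n in counts.items())
    return pipe


def run_flow(source, pipes):
    """Push tuples from the source through each pipe; the list is the sink."""
    stream = iter(source)
    for pipe in pipes:
        stream = pipe(stream)
    return list(stream)


def parse(t):
    # Turn a raw line into a tuple with named fields.
    method, path, status = t["line"].split()
    return {"method": method, "path": path, "status": status}


source = [{"line": "GET /a 200"}, {"line": "GET /b 404"}, {"line": "GET /a 200"}]
assembly = [each(parse), keep(lambda t: t["status"] == "200"), count_by("path")]
print(run_flow(source, assembly))
```

The assembly reads as a list of named operations on fields; in Cascading, a planner would translate such an assembly into one or more MapReduce jobs.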
  32. 32. Some useful tools for MapReduce-style analytics programming
Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don’t seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.[11] A few of the more prominent tools relevant to analytics, and used by developers we’ve interviewed, are listed in the sections that follow.
Clojure
Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that’s rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced “closure.” Clojure combines a LISP library with Java libraries. Clojure’s mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for “getting the right view into unstructured data from heterogeneous sources,” says Bradford Cross, FlightCaster co-founder.
“Getting the right view into unstructured data from heterogeneous sources can be quite tricky.” (Bradford Cross of FlightCaster)
LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he’s done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code “uses a huge amount of memory-resident data,” such as lists of proper nouns, text categories, common last names, and nationalities. With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.
This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls “dynamic development.” Any code entered in the console interface, he points out, is automatically compiled on the fly.
Thrift
Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can “define all the necessary data structures and interfaces for a complex service in a single short file.”
A more important aspect of Thrift, according to BackType’s Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called NoSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.
Marz uses Thrift’s serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.
BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
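The role of required and optional fields can be sketched without Thrift itself. The example below is plain Python, not Thrift's interface definition language or generated code; the schema layout, the field names `user_id`, `text`, and `language`, and the `validate` function are assumptions chosen only to show why optional fields let old records survive a schema change.

```python
# Hypothetical schemas: version 2 adds an optional "language" field.
SCHEMA_V1 = {"required": {"user_id": int, "text": str}, "optional": {}}
SCHEMA_V2 = {
    "required": {"user_id": int, "text": str},
    "optional": {"language": str},
}


def validate(record, schema):
    """Check a record against a schema; raise if it does not conform."""
    for name, expected_type in schema["required"].items():
        if name not in record:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(record[name], expected_type):
            raise TypeError(f"field {name} must be {expected_type.__name__}")
    for name, expected_type in schema["optional"].items():
        # Optional fields may be absent, but must be well typed if present.
        if name in record and not isinstance(record[name], expected_type):
            raise TypeError(f"field {name} must be {expected_type.__name__}")
    return True


old_message = {"user_id": 42, "text": "hello"}  # written under version 1
new_message = {"user_id": 43, "text": "bonjour", "language": "fr"}

# Both records pass the newer schema, because the added field is optional.
print(validate(old_message, SCHEMA_V2), validate(new_message, SCHEMA_V2))
```

Required fields enforce type on every record, while optional fields are the place where a schema can grow without invalidating data already stored.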
  33. 33. Marz’s use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data.
[Figure 7: An example of a social graph modeled using Thrift schema. Diagram labels: person nodes Alice, Bob, and Charlie, each linked to gender and age values; Apache Thrift; Language: C++. Source: Nathan Marz, 2010]
BackType doesn’t just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.[12]
Open-source, non-relational data stores
Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following:
• Multidimensional map store: Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google’s Bigtable.
• Key-value store: Each record consists of a key, or unique identifier, mapped to one or more values.
• Graph store: Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both.
• Document store: Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents.
Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available.
Table 1: Example open-source, non-relational data stores
Map: HBase, Hypertable, Cassandra
Key-value: Tokyo Cabinet/Tyrant, Project Voldemort, Redis
Document: MongoDB, CouchDB, Xindice
Graph: Resource Description Framework (RDF), Neo4j, InfoGrid
Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
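The multidimensional map store is the least familiar of the four types, so a toy in-memory version may help. It follows only the one-sentence definition in the sidebar (row name, column name, and time stamp map to a value); it is not how HBase, Hypertable, or Cassandra are implemented, and the class and method names are assumptions.

```python
class MapStore:
    """Toy store mapping (row, column, time stamp) to a value."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [ts for ts in versions if ts <= timestamp]
        return versions[max(eligible)] if eligible else None


store = MapStore()
store.put("alice", "age", timestamp=1, value=24)
store.put("alice", "age", timestamp=5, value=25)
print(store.get("alice", "age"))               # newest version
print(store.get("alice", "age", timestamp=3))  # value as of time 3
```

A key-value store is the special case with a single key and no column or time stamp dimension, which is part of why both types are simple to partition across many machines.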
  34. 34. Other related technologies and vendors
A comprehensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they’ve been mentioned elsewhere in this issue:
• Pig: A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets “directly from the console” than is possible using MapReduce, according to author Tom White.
• Hive: Hive is designed as “mainly an ETL [extract, transform, and load] system” for use at Facebook, according to Chris Wensel.
• Zookeeper: Zookeeper provides an interface for creating distributed applications, according to Apache.
Big Data covers many vendor niches, and some vendors’ products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar “Selected Big Data tool vendors.”)
Conclusion
Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop’s popularity include:
• Open, dynamic development: The Hadoop/MapReduce environment offers cost-effective distributed computing to a community of open-source programmers who’ve grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as Clojure. The openness and interaction can lead to faster development cycles.
• Cost-effective scalability: Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president of infrastructure at the Disney Technology Shared Services Group, says, “We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.”
• Fault tolerance: Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used.
• Suitability for less-structured data: Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera’s Awadallah calls “complex” data. Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don’t have an RDBMS mentality. They have an NLP mentality, and they’re focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web.
The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn’t have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them.
Selected Big Data tool vendors

Amazon
Amazon provides a Hadoop framework on its Elastic Compute Cloud (EC2) and S3 storage service it calls Elastic MapReduce.

Appistry
Appistry’s CloudIQ Storage platform offers a substitute for HDFS, one designed to eliminate the single point of failure of the NameNode.

Cloudera
Cloudera takes a Red Hat approach to Hadoop, offering its own distribution on EC2/S3 with management tools, training, support, and professional services.

Cloudscale
Cloudscale’s first product, Cloudcel, marries an Excel-based interface to a back end that’s a massively parallel stream processing engine. The product is designed to process stored, historical, or real-time data.

Concurrent
Concurrent developed Cascading, for which it offers licensing, training, and support.

Drawn to Scale
Drawn to Scale offers an analytical and transactional database product on Hadoop and HBase, with occasional consulting.

IBM
IBM introduced a distribution of Hadoop called BigInsights in May 2010. The company’s jStart team offers briefings and workshops on Hadoop pilots. IBM BigSheets acts as an aggregation, analysis, and visualization point for large amounts of Web data.

Microsoft
Microsoft Pivot uses the company’s Deep Zoom technology to provide visual data browsing capabilities for XML files. Azure Table services is in some ways comparable to Bigtable or HBase. (See the interview with Mark Taylor and Ray Velez of Razorfish on page 46.)

ParaScale
ParaScale offers software for enterprises to set up their own public or private cloud storage environments with parallel processing and large-scale data handling capability.

1 FLOPS stands for “floating point operations per second.” Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second.
2 Brough Turner, “Google Surpasses Supercomputer Community, Unnoticed?” May 20, 2008, http://blogs.broughturner.com/communications/2008/05/google-surpasses-supercomputercommunity-unnoticed.html (accessed April 8, 2010).
3 See, for example, Tim Kientzle, “Beowulf: Linux clustering,” Dr. Dobb’s Journal, November 1, 1998, Factiva Document dobb000020010916dub100045 (accessed April 9, 2010).
4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, “Web Search for a Planet: The Google Cluster Architecture,” Google Research Publications, http://research.google.com/archive/googlecluster.html (accessed April 10, 2010).
5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/ (accessed April 9, 2010).
6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2009), 4.
7 See Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” The New York Times Open Blog, November 1, 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/ (accessed June 4, 2010) and Bill Snyder, “Cloud Computing: Not Just Pie in the Sky,” CIO, March 5, 2008, Factiva Document CIO0000020080402e4350000 (accessed March 28, 2010).
8 See “HadoopDB” at http://db.cs.yale.edu/hadoopdb/hadoopdb.html (accessed April 11, 2010).
9 Nathan Marz, “Thrift + Graphs = Strong, flexible schemas on Hadoop,” http://nathanmarz.com/blog/schemas-on-hadoop/ (accessed April 11, 2010).
10 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, December 2004, http://labs.google.com/papers/mapreduce.html (accessed April 22, 2010).
11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http://www.uspto.gov. For an example of the participation by Google and IBM in Hadoop’s development, see “Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges,” Google press release, October 8, 2007, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010).
12 See the Apache site at http://apache.org/ for descriptions of many tools that take advantage of MapReduce and/or HDFS that are not profiled in this article.
Hadoop’s foray into the enterprise

Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach.

Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya

Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that offers products and services around Hadoop, an open-source technology that allows efficient mining of large, complex data sets. In this interview, Awadallah provides an overview of Hadoop’s capabilities and how Cloudera customers are using them.

PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of “mine first, govern later” fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it’s best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.

PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today.
Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today, Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.

PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I’m going to be loading in, these are the types of columns I’m going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

With Hadoop, it’s the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [JavaScript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on
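The load-first, describe-later approach Awadallah outlines can be sketched in a few lines. This is plain Python rather than Hive or Pig syntax, and every name in it (`RAW_LINES`, `SCHEMA_V1`, `SCHEMA_V2`, `parse`, the `user`/`page`/`ms`/`referrer` fields) is invented for illustration. The raw records are kept exactly as they arrived, in mixed JSON and tab-delimited form, and a schema is applied only when the data is read.

```python
import json

# Raw records stored as-is; no table or column types were declared up front.
RAW_LINES = [
    '{"user": "ann", "page": "/news", "ms": 120}',
    "bob\t/mail\t95",
    '{"user": "cy", "page": "/news", "ms": 80, "referrer": "search"}',
]

# Hypothetical schemas, declared only at query time.
SCHEMA_V1 = [("user", str), ("page", str), ("ms", int)]
SCHEMA_V2 = SCHEMA_V1 + [("referrer", str)]  # later addition, no reload needed


def parse(line, schema):
    """Apply a schema while reading: JSON by key, flat lines by position."""
    if line.lstrip().startswith("{"):
        obj = json.loads(line)
        values = [obj.get(name) for name, _ in schema]
    else:
        fields = line.split("\t")
        values = fields + [None] * (len(schema) - len(fields))
    return {
        name: (cast(value) if value is not None else None)
        for (name, cast), value in zip(schema, values)
    }


def total_ms_by_page(lines, schema):
    """A small query over the raw lines, parsed on the fly."""
    totals = {}
    for row in (parse(line, schema) for line in lines):
        totals[row["page"]] = totals.get(row["page"], 0) + row["ms"]
    return totals


totals = total_ms_by_page(RAW_LINES, SCHEMA_V1)
rows_v2 = [parse(line, SCHEMA_V2) for line in RAW_LINES]
```

Switching from `SCHEMA_V1` to `SCHEMA_V2` changes what a query sees without touching the stored lines; records that lack the new field simply yield `None` for it. In an RDBMS, the equivalent change would typically start with altering the table definition before any new data could be loaded.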
