CLOUD COMPUTING: BIG DATA IS THE FUTURE OF IT
Winter 2009 | Ping Li | ping@accel.com

Cloud computing has been generating considerable hype these days. Every participant in the datacenter and IT ecosystem has been rolling out “cloud” initiatives and strategies, from hardware vendors and ISVs to SaaS providers and Web 2.0 companies; start-ups and incumbents are equally active.

Cloud computing promises to transform IT infrastructure and deliver scalability, flexibility, and efficiency, as well as new services and applications that were previously unthinkable. Despite all of this activity, cloud computing remains as amorphous today as its name suggests. However, one critical trend shines through the cloud – Big Data. Indeed, it’s the core driver in cloud computing and will define the future of IT.

BIG DATA – THE PERFECT STORM

Cloud computing has been driven fundamentally by the need to process an exploding quantity of data. Data is no longer measured in gigabytes but in exabytes as we are “Approaching the Zettabyte Era.”[1] Moreover, data types – structured, semi-structured, or unstructured – continue to proliferate at an alarming rate as more information is digitized, from family pictures to historical documents to genome mapping to financial transactions to utility metering. The list is truly unbounded. But today, data is not only being generated by users and applications. It is increasingly “machine-generated,” and such data is growing exponentially and leading the charge in the Big Data world. In a recent article, The Economist called this phenomenon the “Data Deluge” (http://www.economist.com/opinion/displaystory.cfm?story_id=15579717).

One can argue that Web 2.0 companies have been pushing the upper bounds of large-scale data processing more than anyone. That being said, this data explosion is not sparing any vertical industry – financial, health care, biotech, advertising, energy, telecom, etc. All are grappling with this perfect storm. Below are just a few stats:

• Two years ago, Google was already processing more than 400PB of data/month in just one application
• The New York Times is processing an 11-million-story archive dating back to 1851
• eBay processes more than 50TB/day in its data warehouse
• CERN is processing 2GB/second for its most recent particle accelerator
• Facebook crunches 15TB/day into a 2.5PB data warehouse

Without question, data represents the competitive advantage of any enterprise, and every organization is now encumbered with the task of storing, managing, analyzing, and extracting value from this exponential data growth – as inexpensively as possible.

Previous computing platform transitions had technology dislocations similar to cloud computing but along different dimensions. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The following shift to web applications/web services during the last decade was enabled by the open networking of applications and services through the internet buildout. While cloud computing will leverage these prior waves of technology – computing and networking – it will also embrace deep innovations in storage/data management to tackle big data.

Along these lines, many of the early uses of cloud computing have been focused less on “computing” and more on “storage.” For example, a significant portion of the initial applications on AWS were primarily leveraging just S3, with the applications themselves executing behind the firewall. Popular storage applications, like Jungle Disk and SmugMug, were early AWS customers. This explosion of data has driven enterprises (and consumers for that matter) to find cheap, on-demand storage in unlimited quantities – which cloud storage promises to deliver. Until now, massive tape archives in the middle of nowhere (like Iron Mountain) have been the only means to achieve that cheap storage. However, enterprises today need more; they need quick-access data retrieval for multiple reasons, from compliance to business analytics. It is simply no longer sufficient to have “cold” data; rather, it needs to be online and resilient (and cheap, of course); hence, the accelerating shift towards storing every piece of data in memory or on disks (Data Domain smartly rode this trend).

The need to balance data availability/usability and cost effectiveness has prompted significant innovation in both on-premise and hosted cloud storage. Cloud storage systems (Caringo, EMC Atmos, and ParaScale, to name just a few) and flash-based storage systems (Fusion-io, Nimble Storage, Pliant, etc.) are just some current examples. Furthermore, hierarchical storage management (HSM, which has always sounded great but has been implemented only rarely) will become an important element in storage workflows. Enterprises will require a seamless capability to move data across different tiers of storage (both on-premise and into the cloud) based on policy and data type to minimize retrieval costs. As cloud computing matures, true cloud applications will be (re)written to leverage hierarchical and cloud-like storage tiers to retrieve data dynamically from different storage layers.
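
As a rough illustration of what moving data across tiers “based on policy and data type” might look like, here is a minimal Python sketch. The tier names, the `DataObject` fields, and the age thresholds are all hypothetical, chosen only to make the idea concrete; they are not drawn from any particular HSM product.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered from fastest/most expensive to slowest/cheapest.
TIERS = ("memory", "local_disk", "cloud_object_store", "tape_archive")


@dataclass
class DataObject:
    name: str
    data_type: str                 # e.g. "transaction", "log", "backup"
    days_since_access: int
    compliance_hold: bool = False  # must remain quickly retrievable


def choose_tier(obj: DataObject) -> str:
    """Pick a storage tier using illustrative (assumed) policy rules."""
    # Hot transactional data stays in the fastest tier.
    if obj.data_type == "transaction" and obj.days_since_access <= 1:
        return "memory"
    # Recently used data, or data under a compliance hold, stays online on disk.
    if obj.compliance_hold or obj.days_since_access <= 30:
        return "local_disk"
    # Cooler data moves to cheap, still-online cloud storage.
    if obj.days_since_access <= 365:
        return "cloud_object_store"
    # Everything else is archived.
    return "tape_archive"
```

Under these assumed rules, a year-old backup with no compliance hold would land in the archive tier, while the same object under a hold would stay on disk so that it remains quickly retrievable.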

[1] Source: “Approaching the Zettabyte Era.” Cisco, 16 June 2008. <http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481374_ns827_Networking_Solutions_White_Paper.html>

A NEW CLOUD STACK

In order for cloud computing to become a mainstream approach, a new “cloud” stack (like mainframe and OSI) will likely emerge. Just like prior computing platform transitions (client/server, web services, etc.), core platform capabilities, such as security, access control, application management, virtualization, systems management, provisioning, availability, etc., will be prerequisites before IT organizations are able to adopt the cloud completely.

Clearly, this stack will exist in a different representation than prior platform layers to embrace a cloud environment. Simply replicating the current computing stack but allowing it to reside off-premise will not achieve the scale, capabilities, and economies of cloud computing. In particular, this new cloud framework needs the ability to process data in increasingly greater orders of magnitude – and do it at a fraction of the cost – by leveraging commodity, multi-threaded servers for storage and computing. In many ways, this cloud stack has been implemented already, albeit in a primitive form, at large-scale internet datacenters.

The challenge of processing terabytes of data daily at Google, Facebook, and Amazon drove them to adopt a new data architecture, which is essentially Martian to traditional enterprise datacenter architects. No longer are ACID and relational databases back-ending transactional applications. Internet datacenters quickly encountered the scaling limitations of SQL databases as the volume of data exploded. Instead, high-performance, scalable/distributed non-SQL data stores are being developed internally and implemented at scale. Bigtable and Cassandra are among the many variants, and this “non-database database” trend has proliferated to the point of having its own conference: NoSQL. Database caching layers (e.g., NorthScale’s Memcached) are also being implemented to further drive application performance, and caching is now accepted as a “standard” tier in datacenters.

Managing non-transactional data has become even more daunting. From log files to click stream data to web indexing, internet datacenters are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. Hadoop is an open source data management framework that has become widely deployed for massively parallel computation and distributed file systems in a cloud environment. Hadoop has allowed the largest web properties (Yahoo!, LinkedIn, Facebook, etc.) to store and analyze any data in near real-time at a fraction of the cost that traditional data management and data warehouse approaches could even contemplate. Although the framework has roots in internet datacenters, Hadoop is quickly penetrating broader enterprise use cases. The diverse set of participants at Hadoop World NYC, hosted by Cloudera, clearly points to this trend.

SECURING THE CLOUD

Given this data-intensive nature, any widely adopted cloud computing platform will inevitably have to account for richer security requirements. The security challenges will be focused less on point network and data level security, although high-bandwidth encryption solutions and sophisticated key management will be needed to match the massively parallel computational cloud environments. In this case, the primary security challenges will stem from “control.” User authentication will become increasingly challenging as applications are federated outside the firewall because of SaaS adoption. In addition, managing and reconciling user identities across individual user directories for each SaaS/cloud application will present further security issues. Much as web applications in the 90s created an SSO layer, cloud computing is essentially abstracting a web services interface for infrastructure IT, and it will demand a similar unified “authentication/entitlement layer.”

In addition to federated user authentication, cloud computing will also require “data” authentication and security. Imperva’s database firewall is an example of an increasingly important cloud security product. As applications reside in different public and private clouds, it will be critical for the cloud applications to be able to “talk” to each other. This will drive the need for ensuring data authentication and policy control for the volumes of data flowing between cloud applications. Moreover, given the multi-tenancy paradigm of cloud environments, policy granularity will be paramount to ensure security and compliance. Data integration across cloud platforms will be more of an obstacle than application integration, as applications have become more open/standard. Standard “data” APIs will emerge as part of the new cloud stack to allow disparate environments to talk to each other and avoid vendor lock-in. Data migration challenges are perhaps the greatest factor today in locking users to a particular cloud platform.

Over time, these APIs and layers will harden and will become tailored, depending on use case and workload for particular applications. The adoption of these new frameworks will ultimately make cloud computing “safe” and broaden its penetration into enterprises of all sizes.

WHAT’S BREWING IN A CLOUD?

Despite constant comparisons to grid and utility computing, cloud computing has the potential to address a much broader set of applications and use cases beyond the limited HPC environments served traditionally by grid computing. This breadth of cloud computing is engendered by a new set of underlying technology forces. Virtualization technologies, high-powered commodity servers, low-cost/high-bandwidth connectivity, concurrent/multi-threaded programming models, and open source software stacks are all technology building blocks that can deliver the high performance and scalability of grid/utility computing, but importantly – and concurrently – with underlying commodity resources.

These technology drivers enable applications and users to be abstracted cleanly from particular IT infrastructure resources (computing, storage, networking, etc.) in new and powerful ways: location agnosticism and multi-tenancy are two critical elements among others. Unlike traditional HPC grid environments, which were designed for a specific application in a single company, cloud computing enables disparate applications and entities to harness a shared pool of resources. In addition, applications can be “broken up” in the cloud where computing resources may reside on the client while the data is accessed portably from multiple cloud locations (as an example).

Many different definitions of cloud computing have surfaced. Rather than posit yet another, consider several characteristics present in any cloud instance: (i) self-provisioning (either by user, developer, or IT); (ii) elasticity (on-demand allocation of any computing, storage, and networking resources); (iii) multi-“anything” (multi-user, multi-application, multi-session, etc.); and (iv) portability (applications are abstracted from physical infrastructure and can be migrated easily). These capabilities allow enterprises to shift IT resources from capex to opex – a usage-based model that is particularly appealing during recent economic constraints.

These cloud prerequisites will yield a powerful set of use cases beyond grid computing that are unique to cloud platforms. Cloud computing will reach its full potential in the future when a whole new set of applications (never possible before) is created that is purpose-built for the cloud. For example, one can envision powerful collaboration applications emerging that enable internal enterprise users and external users to cooperate seamlessly in ways that would previously have been impossible with users and data isolated on disparate enterprise islands. It’s likely these innovative applications will require new programming models and potentially languages yet to be hardened.

STILL IN THE EARLY DAYS

Despite the high energy surrounding cloud computing and early cloud offering successes, such as Amazon Web Services, cloud computing for enterprise services is definitely still in its formative stages. In contrast, however, consumers have already adopted cloud computing technologies. One could argue that web companies like Google, Yahoo!, Facebook, and Salesforce are examples of consumers leveraging cloud computing. These Web 2.0/SaaS offerings clearly exhibit the core cloud characteristics outlined above, and in turn are delivering new, value-added services previously considered unthinkable. Interestingly, this time the consumers, via their use of Web 2.0 services, have been teaching enterprises, typically the early technology adopters, the effectiveness of cloud computing.

Today, the enterprise use of cloud computing represents opposite ends of the spectrum: (i) Web 2.0 start-ups seeking to launch applications quickly and cheaply, and (ii) compute-intensive enterprises that need batch processing for bursty, large-scale applications. Although these users are driving the early adoption of cloud technology, it’s unlikely these limited use cases will establish cloud computing as a pervasive platform. Cloud computing instead will need to penetrate mainstream IT infrastructure slowly and offer a broader set of enterprise applications.

It is important to note here that these Web 2.0 start-ups represent a powerful trend in the role of developers in driving cloud computing adoption. Many early users of cloud computing are examples of developers launching applications without requiring the involvement of IT (in the case of a Web 2.0 start-up, they don’t have an IT department). Increasingly, empowering developers and line-of-business owners to innovate and deploy new applications without the shackles of IT will be a motivating driver for cloud adoption. No longer do users need to have IT’s blessing and time to get their job done. This developer-centric nature was a primary motivator of VMware’s strategic acquisition of SpringSource. In addition to inheriting significant Java technology, VMware now has a distinct opportunity to transition SpringSource’s dominant Java developer mindshare to develop onto VMware’s private cloud platform. Amazon Web Services has experienced tremendous success from its developer-centric platform APIs. Unlike traditional hosting providers that cater to IT/operations, Amazon went after developers first and has only recently begun to add the functionality that will appeal to broader enterprise IT.

Within enterprises, there are early signs of developers (QA environments, batch processing, and developer prototyping) and line-of-business/departmental users leveraging cloud computing. It is not uncommon for new platform technologies to start at the “fringes” of IT before mainstream adoption takes place. Unlike typical three-tier “traditional” enterprise datacenters, the internet datacenters of Facebook, Google, etc. were not encumbered by legacy enterprise stacks, applications, and IT rules, which in turn enabled them to be built from the ground up with cloud stacks to elastically handle large-scale consumer transactions for multiple applications. Therefore, and unsurprisingly, Amazon’s internet datacenters were easily adapted to become the first and leading “public computing” cloud provider. It will certainly take significant time/effort for enterprise IT infrastructure gatekeepers to evolve their current architectures to embrace a new cloud platform. Luckily, enterprises can reap the technology innovation from internet datacenters (much of which is open source) to accelerate this transition.

MORE THAN ONE FLAVOR

There have been analogies drawn between cloud computing and public utilities (electric, gas, etc.) where the value is all about economies of scale. According to this hypothesis, the world will only have a few cloud providers that reach maximum efficient scale. It is quite unlikely that this will happen. Multiple cloud models will emerge depending on the user, the workload, and the application. For example, certain developers will prefer to interface with a cloud provider at a higher level of abstraction, such as Google App Engine, as opposed to a more bare-metal API, such as Rackspace. Alternatively, an application may choose to run on MSFT Azure to leverage SQL/MSFT services or on Salesforce’s Force.com for CRM integration and distribution advantages. Today, one can break cloud platforms into roughly two camps: developer-centric (Amazon, MSFT) and IT-centric (EMC, VMware).
Cloud platforms will remain distinct and diverse as long as they continue to deliver unique value-add for their particular use cases and users.

To drive this cloud diversity point further, the concept of a “cloud within a cloud” is also emerging where distinct services, such as data warehousing, can be built atop a more generic cloud platform to provide a higher layer cloud service.

In addition, “private clouds” behind the firewall present yet another flavor of cloud computing as enterprises leverage the benefits of cloud frameworks while maintaining security/control as well as the compliance of their internal datacenters. Lastly, hybrid clouds that bridge private and public clouds on a permanent or temporary basis (the latter also known as “cloud bursting”) will come to fruition for certain applications or as a migration path for enterprises. Several start-ups (Cirtas, CloudSwitch, and Zetta among them) are building products that make the cloud “safe” for enterprises. Innovation will abound to solve the specific issues in all of these various cloud environments.

LOOKING AHEAD

To further parse all this, I hosted a cloud computing panel with an esteemed group of technology thought leaders at Accel’s 15th Stanford Technology Symposium. Needless to say, these panelists had plenty of deep insights, opinions, and predictions about cloud computing.

The panel brought together technologists who view cloud computing through distinctly different lenses: private cloud innovators, public cloud providers, cloud enabling technology solutions, and cloud infrastructure applications. In wrapping up the panel session, I asked each speaker to conjure up a single prediction for cloud computing in the next few years. Here’s what the experts said:

Jonathan Bryce, CTO/Founder, Mosso (Rackspace): “…I think cloud computing is going to be a mindshift; it’s going to take a while. But I think an economy like this is actually a huge opportunity for entrepreneurs…I think this is a time when resources are scarce – that’s when great businesses end up getting built. And I think part of what’s going to enable some of those businesses is cloud computing, and being able to get started with a lower varied entry, lower price point, all of those kind of things…”

Mike Olson, CEO/Co-founder, Cloudera: “I think that a lot of what’s been said around here about data is really right on. I predict that in the next 10 years, computer science as computer science isn’t really going to be the place that smart young guys are going to find tremendously rewarding careers. I think that the application of these new compute systems to large data in the sciences will advance human kind substantially. I think that science will be done maybe not even in the lab on the wet bench anymore, but with data, with computer systems looking at vast amounts of data….”

Raghu Ramakrishnan, Chief Scientist for Audience and Research Fellow, Yahoo! Research: “So a lot of the companies that are out there today – Yahoo!, Facebook, Google – they’re all exposing data APIs. Imagine what’s going to happen once large clouds are routinely available to build their own application and you start aggregating your own data, and you have the opportunity to fuse that with all the data that’s out there. Someone’s going to figure out the next big thing, by taking 2 + 2 and coming up with 20.”

Mike Schroepfer, VP Engineering, Facebook: “…one of the things that is going to happen is that people are going to figure out that we need a more blended workload between the cloud and the client. We’ve been operating kind of in the cycle of reincarnation and computer science, moved toward most of the computing happening in the cloud, and my browser effectively being its own terminal. You know, in the last 2 or 3 years, the speed and capability of browsers has been outpacing that of most chips. You’re seeing 2x to 4x improvements in core performance on the engines and VMs in those browsers year on year, which is way outpacing the speed of chip design…So I believe that there will be a couple of people who will figure out ways to blend computation and storage on the client, more gracefully with that on the server, but still provide you with all of the benefits of basically access to my data anywhere I need, and the kind of reliability of the cloud.”

Jayshree Ullal, President and CEO, Arista Networks: “Well, there’s a technology impact but I actually think it’s going to really make CIOs rethink their jobs. Today, you can have a server administrator, an application administrator, a network administrator, and they’re all silos… but you need your general practitioner. And that’s really missing right now in the cloud. So if I had to make a prediction, less on the technology, more on the operational side, I would say for the deployment of this, it’s got to be a generalized IT person, whether that’s the CIO or somebody he or she appoints…”

Rich Wolski, Professor of Computer Science, University of California, Santa Barbara and CTO/Founder, Eucalyptus Systems: “…there’s another revolution coming that’s going to intersect the cloud revolution and that has to do with data simulation…pretty much everything you own is going to be trying to send you data. And you’re going to need, personally, a great deal of storage and compute capacity to be able to deal with that. I think the cloud is going to make that revolution that much quicker to come to us.”

These predictions depict cloud computing as still in its formative phases, but also as a platform that will emerge as a fundamental breakthrough in datacenter and IT infrastructure in the years to come. Despite the current macro headwinds, deep innovation and market opportunities in cloud computing will persist. Once this economic storm passes, I’m convinced the sun will shine through, and cloud computing is sure to have many silver linings.

Ping Li is a partner at Accel Partners in Palo Alto and focuses primarily on Information Technology infrastructure and digital media platforms.