
The Age of Exabytes: Tools & Approaches for Managing Big Data


This ReadWriteWeb report is sponsored by HP

Published in: Technology

  2. 2. Contents
Introduction: The Rise and Scope of Big Data
Innovations in Storage
    Storage: At the Chip Level
    Storage: At the Data Center Level
    Storage: Virtualization and the Cloud
    Storage: Big Data, New Databases
Speed: Big Data, Real-Time
The Demand for Big Data Analytics
Accessing the Data
    Via the API
    Over the Network
Use Cases
    Distributed Computing with CouchDB at CERN
    Real-Time Retail Analytics
    Millions of Farmvilles Mean Petabytes of Data Daily: How Zynga Handles Social Gaming Big Data
    The Big Data Marketplace
    Bigger Data and a Better Response: Earthquake Detection & Crisis Response
Conclusion
This premium report has been brought to you courtesy of HP Networking. As you explore networking solutions for your enterprise, don’t miss HP’s helpful resource located at the end of this report, HP’s FlexFabric, which explores the next-generation, highly scalable data center fabric architecture.
ReadWriteWeb | The Age of Exabytes
  4. 4. Introduction: The Rise and Scope of Big Data
To the byte, the basic unit of computer storage, we have rapidly added new prefixes as advances in computer technology have expanded storage capacity. From kilobytes (1000 bytes), we’ve moved on to megabytes (1000 KB), gigabytes (1000 MB), and terabytes (1000 GB), and then to “big data”: petabytes (1000 TB), exabytes (1000 PB), zettabytes (1000 EB), and the as yet unfathomable yottabyte (1000 ZB). This year, estimates put the amount of information in existence at 1.27 zettabytes. One page of typed text, by comparison, is roughly 2 kilobytes of data, while all the books catalogued in the U.S. Library of Congress total around 15 terabytes. Dwarfing that is the approximately 1 petabyte of data processed per hour by Google.[1] These numbers, while almost mind-boggling, are nonetheless growing at an exponential rate. Eight years ago, there were only around 5 exabytes of data online.[2] Just two years ago, that amount of data passed over the Internet in a single month. And recent estimates put monthly Internet data flow at around 21 exabytes.[3]
Certainly, some industries, such as science and finance, have long had to wrestle with storing and processing massive amounts of data. But even there, the need for more speed and more storage has grown. Walmart, for example, must handle more than 1 million customer transactions per hour. Decoding the human genome required the computing power to analyze 3 billion base pairs, something that took 10 years the first time it was done, in 2003, but can now be achieved in one week.[4] Clearly, to meet these sorts of needs, computing power and storage have improved substantially, a marker of Moore’s law, which holds that the processing power and storage capacity of computer chips double, or their prices halve, roughly every 18 months. And the technology has in turn facilitated this explosion of data.
But that’s only part of the picture.
  5. 5. The data being generated today isn’t just “big”; it’s different, and much of it is unstructured. Older collections of data are now being digitized, as in Project Gutenberg’s efforts to digitize and archive the world’s literary works. And many more people than ever before have access to technology tools. The UN estimates that there are 5 billion mobile phone subscriptions worldwide (although many people have more than one, so mobile phones have not quite so completely saturated the world market of 6.8 billion people).[5] Billions of people use the Internet, and with the rise of digital literacy and of social networking, more and more people are creating and uploading more and more data. There are 500 million registered Facebook users, for example, sharing 3.5 billion pieces of content weekly and uploading 2.5 billion photos every month, and Facebook in turn serves photos at a rate of about 1.2 million per second.[6] With the increase in mobile device use in particular, human data creation has soared. Add to that the input from radio-frequency identification (RFID) and wireless sensors (the 35-some-odd billion devices connected to the Internet, a source of information that is predicted to outpace the generation of data by humans) and clearly data gathering has become ubiquitous.[7]
This explosion of data, in both its size and form, causes a multitude of challenges for both people and machines. No longer is data something accessed by a small number of people. No longer is the data that’s created simply transactional information. And no longer is the data predictable: not how or when it is written, nor by whom or by what it will be read. Furthermore, much of this data is unstructured, meaning that it does not clearly fall into a schema or database. How can this data move across networks? How can it be processed?
The size of the data, along with its complexity, demands new tools for storage, processing, networking, analysis and visualization. This report will survey some of the developments underway to address the challenges of computing in the exabyte era.
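The decimal prefixes listed in the introduction can be made concrete with a little arithmetic. The following is a minimal Python sketch, not part of the original report; the helper names `to_bytes` and `humanize` are hypothetical, and it assumes the powers-of-1000 definitions the report uses.

```python
# Decimal (power-of-1000) prefixes, in the order the report introduces them.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def to_bytes(value, prefix):
    """Convert a quantity such as (15, 'tera') into a number of bytes."""
    return value * 1000 ** (PREFIXES.index(prefix) + 1)


def humanize(n_bytes):
    """Express a byte count using the largest prefix that keeps the value >= 1."""
    value = float(n_bytes)
    unit = "bytes"
    for prefix in PREFIXES:
        if value < 1000:
            break
        value /= 1000
        unit = prefix + "bytes"
    return f"{value:.1f} {unit}"


# Figures quoted in the report: 1.27 zettabytes of information in existence,
# and about 15 terabytes for the books catalogued by the Library of Congress.
all_information = to_bytes(1.27, "zetta")
library_of_congress = to_bytes(15, "tera")
print(humanize(library_of_congress))
print(f"Ratio: about {all_information / library_of_congress:,.0f} to 1")
```

Under these assumptions the estimated total comes out at tens of millions of times the size of the Library of Congress book collection, which gives a sense of the scale the report goes on to describe.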
  6. 6. Innovations in Storage
STORAGE: AT THE CHIP LEVEL
Gordon Moore, the co-founder of Intel, predicted in a research paper in 1965 that “the number of transistors incorporated in a chip will approximately double every 24 months.” Moore’s Law, as it’s known, is generally accepted by a computer industry that has seen steady growth in the processing power and storage capacity of computer chips. Many analysts, however, predict that data is now being created at a pace that will exceed Moore’s Law. This poses a challenge to chip-makers, who are researching new storage and storage reduction technologies. After all, there are physical limits to the miniaturization of transistors, a point that some predict could be reached by 2020. So while Moore’s Law has driven the computer industry for over 40 years, if storage capacity and processing power are to continue growing, innovations must occur not just in dimensions and scaling but in alternate computing mechanisms and logic devices.
Hewlett Packard, for example, has reported advances in the design of a new class of diminutive switches that would be capable of replacing transistors and would help shrink computer chips closer to the atomic scale.[8] The devices, known as memristors, or memory resistors, are modeled along the lines of biological systems. They are purportedly simpler than today’s semiconducting transistors, can store information even in the absence of an electrical current, and can be used for both data processing and storage applications.[9] Researchers also say they have devised a new method for storing and retrieving information from a vast three-dimensional array of memristors, something that could allow designers to stack switches beyond the limitations of two-dimensional scaling.
A different approach is being taken by researchers at IBM, Intel, and others, who are investigating a type of storage called “phase-change memory” (PCM). PCM offers high performance along with low power consumption, combining the best attributes of NOR, NAND and RAM (fast read and write speeds, non-volatility, bit-alterability and good scalability, for example) within a single chip. Unlike flash memory, PCM allows stored information to be switched from one to zero or zero to one without a separate erase step. And unlike RAM, PCM does not require a constant energy supply.[10]
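The doubling periods quoted for Moore’s Law in this report (24 months in this section, 18 months in the introduction) compound very differently over decades. The short Python sketch below is an illustration only; the function name `growth_factor` is hypothetical, and it assumes smooth, uninterrupted doubling, which real chip generations only approximate.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


# Compare the two doubling periods the report mentions over a 40-year span.
for months in (24, 18):
    factor = growth_factor(40, months)
    print(f"Doubling every {months} months for 40 years: about {factor:,.0f}x")
```

With a 24-month period, 40 years gives 20 doublings, or roughly a million-fold increase; the 18-month figure yields a far larger number, which is why the choice of period matters when comparing chip growth against data growth.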
  7. 7. Earlier this year, researchers at the Tyndall National Institute in Cork, Ireland announced they had created the world’s first junction-less transistor. Current transistors are based on junctions, which are formed by placing two pieces of silicon with different polarities side-by-side. Controlling the junction allows the current in the device to be switched on and off. The new transistor technology uses a control gate around a silicon nanowire that can tighten around the wire to the point of closing down the passage of electrons without the use of junctions or doping.[11] As researchers pursue different solutions to the question of building computer chips with better processing and storage capabilities, they must address not just performance, but cost and power consumption.
STORAGE: AT THE DATA CENTER LEVEL
The impact of Moore’s Law is not felt only at the chip level, of course. The increase in computing power at lower cost has, in part, spurred this data explosion, which in turn has demanded the building of more computers, more servers, more data centers. So at the other end of the spectrum from the innovations in storage at the chip level are the massive data centers that house thousands of chips on thousands of servers. While computing power has increased and the cost of chips has fallen, the cost of building and powering data centers has increased dramatically. An analysis of Facebook’s spending posits that the company will spend about $50 million this year on data centers, a figure that has more than doubled since similar estimates for 2009.[12] No longer is the bulk of the expense of those facilities merely a question of large and powerful equipment. (In fact, those figures from Facebook do not include equipment.) Rather, it is this equipment’s skyrocketing demand for electricity for both powering and cooling.
According to some calculations, for every watt of server power used at a well-managed data center, an additional watt is consumed by the chillers, air handlers, and so on. In many cases the overhead is much higher.[13] According to Greenpeace, at current growth rates data centers and telecommunication networks will consume about 1,963 billion kilowatt-hours of electricity in 2020, more than triple their current consumption and more than the current electricity consumption of France, Germany, Canada and Brazil combined.[14]
Energy consumption is prompting the search for more efficient ways of powering and cooling. Data centers are being located in areas near alternative sources of energy, such as Google’s recently announced center in Finland that will be cooled by sea water. Other facilities are experimenting with using waste heat to warm nearby offices. Some researchers are investigating ways that data centers can use energy from that heat to drive cooling mechanisms, for example.[15] And others are building new and different containers for the servers so that they are less capital-intensive and can be powered and cooled more efficiently.
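The one-extra-watt-per-server-watt rule of thumb above can be turned into a quick estimate. The Python sketch below is illustrative only: the function names are hypothetical, and the ratio it computes is what the industry commonly calls power usage effectiveness (PUE), a term the report itself does not use.

```python
HOURS_PER_YEAR = 24 * 365


def facility_watts(server_watts, overhead_per_server_watt=1.0):
    """Total draw when each server watt needs extra watts for cooling and air handling."""
    return server_watts * (1.0 + overhead_per_server_watt)


def pue(server_watts, overhead_per_server_watt=1.0):
    """Ratio of total facility power to server power (assumed PUE-style metric)."""
    return facility_watts(server_watts, overhead_per_server_watt) / server_watts


def annual_kwh(watts):
    """Energy used in a year by a constant load, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000.0


# A hypothetical 1 MW server load at the "well-managed" one-for-one overhead.
servers = 1_000_000.0
total = facility_watts(servers)
print(f"Ratio of facility to server power: {pue(servers):.1f}")
print(f"Annual energy: {annual_kwh(total):,.0f} kWh")
```

Raising `overhead_per_server_watt` above 1.0 models the less efficient facilities the report alludes to, where cooling consumes more than the servers themselves.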
  8. 8. STORAGE: VIRTUALIZATION AND THE CLOUD
One of the factors that has contributed to the explosion of data is the increasing adoption of virtualization. Virtualization allows companies to take advantage of greater storage and processing capabilities without having to run their own physical machines. Virtualization, and the cloud computing built on it, has created many opportunities for businesses to leverage elastic computing to do things that would otherwise not be possible because of the costs of building and maintaining their own hardware infrastructure. Although it’s common practice for companies to move to dedicated data centers once they reach a certain size, many companies are running quite sizable businesses on public clouds. Playfish, for example, one of the largest social gaming companies, runs its operations on Amazon Web Services.[16] Cloud computing increases the speed with which new companies and new processes can be set up, as new servers can be launched and scaled with ease. As cloud computing allows scaling to happen horizontally and not just vertically, it has, along with other developments in distributed computing, provided new ways of thinking about how data can be stored and processed.
STORAGE: BIG DATA, NEW DATABASES
It’s no surprise that as data has grown, databases have had to adapt. One example of the innovation occurring in recent years is the number of new databases that break from the relational database management system (RDBMS) model. The latter has a long history, dating back to the 1970s. In a relational database, data is stored in the form of tables, as are the relationships among the data. This system has worked well for transactional and structured data. But as the amount of information, the kind of information, and the number of users accessing the information have grown, the relational database has faced some challenges. With new data come new storage demands.
And the traditional RDBMS is not optimized for the kind of environment that big data and cloud computing have created, one that’s elastic and distributed. Traditional RDBMS software, such as MySQL, can handle huge amounts of data but often requires extensive knowledge to manage. MySQL in particular is well known to many developers and has remained the data storage choice for many people. But a growing number of “NoSQL” (“Not Only SQL”) alternatives have been developed in the last year or so. These databases are designed to be Web-scale. They can be characterized as non-relational, distributed and horizontally scalable. Many of them are open source. Examples of NoSQL databases include CouchDB, MongoDB, Membase, and Redis. Perhaps because the acronym contains “No,” some of these new technologies have met with skepticism from those who do not want to abandon the relational database. Often, though, it’s not a choice of one or the other: many businesses operate with a combination, keeping some data in an RDBMS and other data, better suited to it, in a NoSQL datastore.
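The contrast between the two models described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any of the named products work internally: the relational side uses the standard library’s `sqlite3` module as a stand-in for an RDBMS, and the “document store” is simply a dictionary of JSON strings with hypothetical `put` and `get` helpers. The table, keys and field names are invented for the example.

```python
import json
import sqlite3

# Relational side: every row must fit the table's fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, level INTEGER)")
conn.execute("INSERT INTO players (name, level) VALUES (?, ?)", ("ada", 7))
row = conn.execute(
    "SELECT name, level FROM players WHERE name = ?", ("ada",)
).fetchone()

# Schema-less side: each value is a self-describing document, so two records
# can carry entirely different fields without a schema change.
store = {}


def put(key, document):
    """Serialize a document to JSON and store it under a key."""
    store[key] = json.dumps(document)


def get(key):
    """Fetch a document by key and deserialize it."""
    return json.loads(store[key])


put("player:ada", {"name": "ada", "level": 7, "gifts": ["tractor"]})
put("player:bob", {"name": "bob", "farm": {"plots": 12}})

print(row)
print(get("player:bob"))
```

The trade-off the report points to shows up even here: the table supports ad hoc queries across all rows, while the key-value layout is trivially easy to split across many machines by key.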
  10. 10. Speed: Big Data, Real-Time
Storing exabytes of data is only part of the challenge: the demand isn’t merely to warehouse big data, but to process and analyze it. Furthermore, the demands for read and write access are often real-time. As with storage, big data requires better processing power, something accomplished at the level of the processor and up through the system. With the advent of networking, one of the ways in which computational power is increased is by distributed computing. That is, processing is not necessarily done on a single powerful mainframe computer, but is instead distributed to a number of computers, or nodes, in a cluster. With distributed computing, a problem is divided into many tasks, each of which is solved by one computer. According to one report, for example, an ordinary Google search query involves between 700 and 1,000 servers, all so that a response can come in under half a second.[17]
To perform tasks like this, Google has built MapReduce. MapReduce is a framework for processing huge datasets by applying a large number of computer nodes to certain kinds of distributable problems. In this way, computational processing can occur on structured or unstructured data. The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. The terms “map” and “reduce” refer to the steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. In other words, during the map step, a master node takes the input, chops it up into small sub-problems, and distributes those to worker nodes. In the reduce step, the master node takes all the answers to the sub-problems and combines them to get the answer to the problem it was originally trying to solve.
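The map and reduce steps described above can be sketched with the classic word-count example. The Python below is a single-process illustration of the pattern only, not Google’s MapReduce or its API: the function names are hypothetical, and the “worker nodes” are simulated by iterating over chunks of input in one program.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different worker node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "big clusters"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across as many machines as are available, which is the property that makes the approach suit very large datasets.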
Some have posited that MapReduce is inefficient, but a large server farm like those operated by Google can purportedly use MapReduce to sort a petabyte of data in only a few hours. And the MapReduce framework has been incredibly influential on the development of other new tools to handle big data.
Another important tool recently developed to handle large amounts of data is Hadoop. Derived from MapReduce, Hadoop is an open source project that, like MapReduce, handles large files across multiple machines. Hadoop consists of two key services: MapReduce and a data-storage system called the Hadoop Distributed File System (HDFS). A key feature of Hadoop is that for effective scheduling
  11. 11. of work, every filesystem should provide location awareness: the name of the rack where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack or switch, so as to reduce backbone traffic. The filesystem also uses this information when replicating data, keeping different copies of the data on different racks with the goal of reducing the impact of a rack power outage or switch failure. Even if these occur, the data may still be readable.
To illustrate: Hadoop was recently used to calculate the 2,000,000,000,000,000th digit of pi, more than doubling the record of the previous longest calculation. Using a cluster of 1000 computers at Yahoo, it took 23 days to calculate, something that would have taken over 500 years on a standard PC. Rather than calculating each digit, Hadoop allowed computers to work with a formula that turned a complex equation for pi into a small set of mathematical steps. And then, in the end, the formula returned just one specific piece of pi, that record-breaking digit (which is, incidentally, “0”).[18]
But Hadoop and MapReduce are batch processes, and as such can have high latency. At the scale of big data, speed is assessed in terms of performance: the speed with which a system answers a query. But just as important is the idea of “speed to insight,” that is, the amount of time it takes for analysts to glean insights from these massive data sets.
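The idea of computing one specific digit of pi without computing all the digits before it can be illustrated with the Bailey-Borwein-Plouffe (BBP) formula. This is a swapped-in, related technique shown as a sketch: the report does not name the formula used in the Yahoo calculation, and the Python below works in base 16, uses ordinary floating point, and is only suitable for small positions. The function names are hypothetical.

```python
def _bbp_series(j, n):
    """Fractional part of the sum over k of 16**(n - k) / (8k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps the numbers small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by a factor of 16 each step; stop once negligible.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n):
    """Hexadecimal digit of pi at position n after the point (n = 0 is the first)."""
    x = (4 * _bbp_series(1, n) - 2 * _bbp_series(4, n)
         - _bbp_series(5, n) - _bbp_series(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]


# Each position is computed independently of the others.
print("".join(pi_hex_digit(n) for n in range(8)))
```

Because each position depends only on its own index, the work for distant positions can be split into independent pieces, which is what makes this style of calculation a natural fit for a cluster.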
  12. 12. The Demand for Big Data Analytics
“Success” in big data isn’t simply a matter of building and implementing better storage or processing tools. Success involves being able to gain insights from the big data, and to gain them quickly. But the scale of the data does make search, analysis, and visualization challenging, even more so with the demands of real-time.
Analytics have often accompanied data warehousing in sectors like finance, retail, and research. But just as big data creates challenges for databases and processing, it also poses new problems for analytics. Traditional databases struggle with the complexity and poor performance that result from trying to express complex analytics in SQL. So until recently, many advanced analytics were handled outside the database. In other words, analytics procedures and models were run on statistical analysis platforms, and so optimizations to the database wouldn’t necessarily speed up the analysis. Furthermore, data needed to be copied and moved from the data warehouse to a statistical platform. Between the constraints of disk speed and network bandwidth, moving big data out of a warehouse can be slow, a delay compounded by the time it takes a statistical platform to process the data.
These challenges have been so severe that in many cases the depth of the analysis is compromised. This occurs when big data is reduced (via sampling, for example) to smaller subsets for computation, meaning that critical insights may be overlooked. Furthermore, developers have been forced to spend a significant amount of time modifying complex analytics to fit the limitations of traditional databases. Arguably, traditional business intelligence applications are not designed to handle the amount or the complexity of the data, nor are they necessarily built to handle real-time reporting. As a result, the quality of the analytics suffers.
Rather than being built from reports on past events, analytics should be based on real-time data. And rather than arriving in periodic reports created by statisticians, this information needs to be open to constant, on-demand analysis. Big data analysis is changing in part due to in-database analytics, as database vendors like Aster Data begin to add analytics to their feature lists. These vendors now support a range of analytic queries that can be written in or converted to SQL, as well as those written in C/C++, Java, Python, Perl, R, and other languages, inside their databases.
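The difference between analyzing data outside the database and analyzing it in place can be sketched on a toy scale. The Python below is only an illustration of the principle: it uses the standard library’s `sqlite3` module rather than any vendor’s product, and the table, column and function names are invented for the example.

```python
import sqlite3
from collections import defaultdict


def build_db():
    """Create a tiny in-memory sales table to stand in for a data warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("a", 10.0), ("a", 20.0), ("b", 5.0)],
    )
    return conn


def average_outside_database(conn):
    """Copy every row out of the database, then aggregate in application code."""
    amounts = defaultdict(list)
    for store, amount in conn.execute("SELECT store, amount FROM sales"):
        amounts[store].append(amount)
    return {store: sum(values) / len(values) for store, values in amounts.items()}


def average_in_database(conn):
    """Let the database aggregate; only one small row per store leaves it."""
    rows = conn.execute("SELECT store, AVG(amount) FROM sales GROUP BY store")
    return dict(rows.fetchall())


conn = build_db()
print(average_outside_database(conn))
print(average_in_database(conn))
```

Both functions return the same answer, but the first moves every row across the boundary between database and application, which is the copying cost the report describes; at warehouse scale that movement, not the arithmetic, tends to dominate.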
  13. 13. In addition to demands to deliver complex analyses on big data, there is also increased interest in visualizations. And again, as with database and analytic technologies, many of the existing tools have not been designed to handle such massive quantities of data. Efforts like Caltech’s Large Data Visualization Initiative are seeking to develop multiresolution visualization and modeling technologies.[19] The ability to perform analytics on big data in near-real-time will become increasingly important for organizations, and the market opportunities are substantial for companies and data scientists who can provide these services.
  14. 14. Accessing the Data
VIA THE API
Moving large volumes of data around can be difficult for all the reasons explained above. The requirements for moving data have necessitated development on a couple of levels: in terms of networking and in terms of the API. APIs aren’t necessarily designed to solve a company’s big data problem. Nonetheless, they can be used in a number of ways to give developers access to all or part of a company’s data. And as companies generate and store more data, and as data becomes a more important commodity, having an API becomes more and more important. An API allows companies to open this information not simply to internal analyses and processes, but to third-party developers as well. Having an API has become “BizDev 2.0”. In other words, in a Web-oriented world, it’s the way business development is done. APIs facilitate business-to-business relations by opening data and systems to business partners. And having an API makes new queries possible (if not easier), enhancing information discovery for companies.
OVER THE NETWORK
The amount of data being generated taxes network capabilities, even with the best broadband infrastructure. With a T1 (1.544Mbps) Internet connection, it would take approximately 82 days to upload one terabyte of data. Even at 10Mbps, it would take almost two full weeks to do so.[20] But it isn’t just the size of the data that makes portability a problem. It’s also the rapidly increasing number of machines that are connecting to the Internet. In August 2010, wireless analyst Chetan Sharma reported on figures for the U.S. wireless data market, noting that mobile phone subscription penetration had crossed 95% at the end of the second quarter of 2010. Excluding those aged 5 and under, this means that mobile penetration in the U.S. is now past 100%.[21] But the increase in new mobile phone subscriptions is only part of the picture.
Outpacing these new human subscriptions for the same quarter were those of “connected devices.” Even as the U.S. nears full penetration of mobile devices, an array of other devices and everyday objects are coming online via sensors and RFID chips: the “Internet of Things.” The pressures from more devices coming online are leading governments and organizations to rethink how Internet bandwidth, wireless spectrum and Internet addresses are allocated and managed.
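The upload times quoted earlier in this section for a terabyte of data can be checked with simple arithmetic. The Python sketch below is an illustration only; the function name `transfer_days` and its `efficiency` parameter are hypothetical, and it assumes a decimal terabyte (10^12 bytes) sent at the full nominal line rate unless told otherwise.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(n_bytes, megabits_per_second, efficiency=1.0):
    """Days needed to move n_bytes over a link, given the usable fraction of its rate."""
    bits = n_bytes * 8
    usable_bits_per_second = megabits_per_second * 1_000_000 * efficiency
    return bits / usable_bits_per_second / SECONDS_PER_DAY


ONE_TERABYTE = 10 ** 12

for label, rate in (("T1 (1.544 Mbps)", 1.544), ("10 Mbps", 10.0)):
    ideal = transfer_days(ONE_TERABYTE, rate)
    print(f"{label}: {ideal:.1f} days at full line rate")
```

At full line rate the sketch gives roughly 60 days for the T1 link and a little over 9 days at 10 Mbps, somewhat less than the figures the report cites. The report does not state its assumptions; one plausible reading is that its estimates allow for protocol overhead and less-than-ideal throughput, which can be modeled here by passing an `efficiency` below 1.0.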
  16. 16. Use Cases
DISTRIBUTED COMPUTING WITH COUCHDB AT CERN
Scientific research has long had to wrestle with capturing, storing, managing and analyzing massive amounts of data, but the rise of big data has taxed even the systems designed to study the intricacies of genomes, weather patterns, outer space, and so on. One such facility is CERN, the European Organization for Nuclear Research. Situated on the Franco-Swiss border, CERN is the world’s largest particle physics laboratory and the site of the Large Hadron Collider (LHC), a global scientific project that researches particle collisions using the world’s largest and most powerful particle accelerator. The LHC produces an enormous amount of data: around 15 petabytes a year. When the LHC was in its planning stages, CERN’s IT department quickly realized that this amount of data was more than a single data center, and perhaps even the Geneva power grid, could handle. Instead of one large data warehouse facility, they opted for a grid computing solution, distributing the collider data to a dozen or so data centers. CERN’s grid consists of 100,000 processors at 140 scientific institutions in 33 countries.[22]
One of the LHC experiments is the Compact Muon Solenoid. In order to manage the roughly 10 petabytes of data it collects, CERN announced that it plans to deploy the NoSQL database CouchDB.[23] This particular experiment requires a database solution that not only can handle large amounts of data, often without metadata, but can distribute the data quickly in an environment in which incoming database connections are frequently impossible. CouchDB is specifically designed for distributed environments, and one of its key benefits is its replication and syncing features. Furthermore, the researchers have pointed to the speed with which they can prototype tools using CouchDB.
REAL-TIME RETAIL ANALYTICS
Big data is poised to deliver tremendous insights about consumers’ spending patterns.
Retailers have long tracked when people spend and what they buy. After all, past shopping behavior is the best way to predict future purchases. But marketing efforts, as the term “mass marketing” implies, have been imprecise. Now, an incredible amount of information can be gathered about consumers’ shopping habits: how they browse online, where they shop, when they shop, what brands they buy and how often. And rather than relying on general demographic information gleaned after the fact (knowing, for example, that a certain coupon worked well with women in their 40s), companies can drill down into an individual consumer’s profile and serve that consumer specifically targeted offers in real-time. For example, as Akamai’s network has grown to encompass more than 450 brands and multi-channel Internet retailers, it has run into challenges delivering the right ad at the right time to the right
  17. 17. audience. Akamai must deal with up to 75 million daily events, and as its core business value relies on being able to data-mine that information for advertisers, it needs to be able to analyze data quickly. With the growing number of users, profiles and transactions increasing the number of models that must be run for these records, Akamai found that daily reporting was being delayed by up to 20 hours. Akamai recently moved its database to Aster Data to take advantage of the company’s nCluster in order to reduce analytics time.[24]
MILLIONS OF FARMVILLES MEAN PETABYTES OF DATA DAILY: HOW ZYNGA HANDLES SOCIAL GAMING BIG DATA
One part of social networking that has seen a meteoric rise is social gaming. Some 65 million people play Zynga’s online games every day. According to Zynga CTO Cadir Lee, 10% of the world’s population has played a Zynga game. That’s millions of Web browsers open to millions of farms and millions of frontiers. Players take turns; they tend crops; they send gifts. They buy millions of objects and upgrades. Zynga says its technology supports 3 billion neighbor connections throughout its games. And all told, it moves around 1 petabyte of data daily, using a combination of its own data centers and a hybrid public/private cloud.
It’s a mind-boggling amount of data. And it’s a new kind of data: more than simply transactional data, and accessed in many ways by many millions of users. This has required not simply massive server resources (the company says it adds as many as 1,000 new servers every week to accommodate traffic), but also the development of a new sort of database management system. Zynga has been a major contributor to the open source Membase project, which takes some of the concepts of Memcached (low cost, high performance, schema-less caching) in order to develop a database that works with similar speed, flexibility and simplicity. Zynga needs to be able to serve up all this data not only to its millions of users.
It also has to be able to undertake analytics on the gameplay in order to, for example, design engaging and viral games and to ascertain the points at which players are willing to purchase virtual goods.
  18. 18. THE BIG DATA MARKETPLACE
The amount of data being produced by science, governments and social networks has given rise to a number of companies that are specifically geared towards the storage, sale, and analysis of data. For example, Infochimps, a startup based in Austin, Texas, describes itself as a marketplace for data: “A site to find, sell, or share any dataset in the world.” Infochimps makes a variety of datasets available, including massive data scraped from Twitter. (A recent scrape contains data about 35 million users, 500 million tweets, and 1 billion relationships between users.) Some of the datasets are available for free, and some for a price. Infochimps also makes some of the data available via an API, in lieu of sending an entire dataset.[25] Factual is another startup that is offering access to massive datasets, in this case geolocation data, alongside an API and other tools for building geolocation applications.[26]
BIGGER DATA AND A BETTER RESPONSE: EARTHQUAKE DETECTION & CRISIS RESPONSE
Although big data is often touted for its scientific and commercial implications, it is also becoming an important tool for humanitarian purposes, as responses to recent natural disasters have demonstrated. Open data advocates and developers have formed groups like CrisisCommons and projects like OpenStreetMap in order to build tools that serve the public good. The World Bank, for example, has made a substantial amount of its data open, and has encouraged people to build tools that help make sense of the information in order to better respond to natural disasters and other crises.
  20. 20. Conclusion

We marvel at the fact that today our smartphones have far more RAM than our first personal computers did. But with these phones, PCs, and other connected devices, we are generating almost unfathomable amounts of data, and generating a demand, in turn, for ever more storage. The average person is uploading over 15 times more data to the Internet today than they did just three years ago.[27] And the information uploaded by humans is dwarfed by the Internet of Things, the networking of everyday objects.

The explosion in data is creating challenges and prompting innovation in computer storage and processing, in terms of software, hardware and data center architecture. The desire to be able to glean insights from all this data is also set to be a boon for analysts and statisticians. And it’s creating many opportunities for new companies that can deliver technology products and services to help solve some of the challenges associated with big data.

And there are plenty of challenges. Moore’s Law has so far proven accurate: processing power has increased and costs of manufacturing computer chips have gone down. But the cost of powering the machines has soared. And when you are handling data on an exabyte scale, the energy costs to power and cool machines, particularly those in massive data centers, are substantial. Beyond power consumption, the amount of data being generated also taxes network infrastructure. The Internet struggles to maintain speeds and bandwidth even as broadband and wireless continue their penetration into new areas.

We have only begun to develop the tools to manage and analyze all this data. As the majority of this data is unstructured, it has often remained beyond the scope of analysis. As the data is classified, questions of interoperability are raised: how can we structure and classify this information so it is usable within companies and across industries?
  21. 21. But some people are cautious about the race to create and network all this data and to make it available and useful, particularly when it comes to personal information. How will organizations ensure that data is kept private and secure? What sorts of controls will people have over the data they create, and over the data their personal objects create?

As we continue generating almost inconceivable amounts of information, it is clear that the data explosion will bring about challenges for businesses and for IT departments. Big data will be a problem that all organizations will need to address, whether “big” is on the scale of terabytes or exabytes of data. As companies increasingly look for solutions to their big data problems, this will in turn create opportunities for others to develop technologies and practices to best store, manage and analyze big data.
  22. 22. HP FlexFabric Virtualize network connections and capacity—From the edge to the core An HP Converged Infrastructure innovation primer
  23. 23. Table of contents
Data center networking dynamics
Introducing HP FlexFabric
HP FlexFabric benefits
The key attributes of HP FlexFabric
The FlexFabric evolution path
Deliver “networking as a service” to the Converged Infrastructure
  24. 24. Data center networking dynamics

The fundamental nature of data center computing is rapidly changing. The traditional model of separately provisioned and maintained server, storage, and network resources is constraining data center agility and pushing budget envelopes to the limit. IT organizations recognize that these static pools of isolated resources are being underutilized, a problem that can be exacerbated when dedicated infrastructure or computer systems are used to support different classes of data center workloads. One response has been for IT organizations to adopt virtualization and blade technologies, which enable a more flexible and highly utilized infrastructure. These new, more scalable technologies can be dynamically provisioned to meet continuously evolving business requirements. At the same time, these technologies apply new pressures to the multiple networks in the data center, further worsening spend issues. They also increase the burden on the IT teams that support them:
    • A proliferation of virtual machines is driving much more frequent changes to network configurations.
    • Data center network processes must be coordinated through multiple IT teams and are too time-consuming.
    • Increases in server utilization require more network bandwidth per server.
    • Traditional hierarchical network designs can neither scale nor provide the performance, low latency, availability, and quality of service demanded by a virtualized data center.
    • Blade technology is further escalating the number of connections to be managed and increasing bandwidth density.

Network teams are faced with a race to build out data center network capacity and to effectively provision connectivity at an increasing speed. To keep pace, IT organizations need a network architecture that is more coherent, flexible, and agile. But they don’t want to give up the stability, high availability, and security offered by the proven compute and storage networks currently installed in their data centers.

HP is creating a new balance by combining some of the best, new, standards-based technologies with a streamlined, modular architecture that fully optimizes virtualized resources, while meeting business requirements for low total cost of ownership, faster time-to-service, and critical requirements for reliability, IT governance, and compliance.

Introducing HP FlexFabric

HP FlexFabric is the next-generation, highly scalable data center fabric architecture of an HP Converged Infrastructure. With FlexFabric, you can provision your network resources efficiently and securely to accelerate deployment of virtualized workloads. With highly-scalable platforms and advanced networking and management technologies, FlexFabric network designs are simpler, flatter, and easier to manage and grow over time. This open architecture uses industry standards to simplify server and storage network connections while providing seamless interoperability with existing core data center networks. FlexFabric combines intelligence at the server edge with a focus on centrally-managed connection policy management to enable virtualization-aware networking and security, predictable performance, and rapid, business-driven provisioning of data center resources.
  25. 25. [Figure: HP FlexFabric overview. HP FlexFabric brings together a highly-scalable, high performance, secure network infrastructure with comprehensive management and policy-driven connectivity provisioning integrated into a data center converged infrastructure. Diagram panels: Converged Infrastructure/Matrix Operating Environment; VM Edge Access; Intelligent Server Access; High performance Layer 2/Layer 3 Interconnect; Highly-available data center Backbone; FlexFabric Management; FlexFabric Security.]

FlexFabric can enable your IT organization to build a wire-once data center that responds to application and workload mobility, and provides resource elasticity. You can move your network connections with your workloads as you migrate them across or between data centers. Also, the fabric can stretch and reclaim pools of resources to meet rapidly changing needs. High-performance threat management tools unify physical and virtual security into a common, extensible framework. Dynamic provisioning capabilities fully exploit virtualized connections to achieve new levels of data center efficiency and accelerate time-to-service. The FlexFabric management and provisioning tools help align the fabric with governance policies and service-level agreements (SLAs), while reducing the cost of operations.

HP FlexFabric benefits
    • Improved business agility, faster time-to-service and higher resource utilization by dynamically and securely scaling capacity and provisioning connections to meet virtualized application demands “on the fly”
    • Breakthrough cost reductions by converging and consolidating server, storage, and network connectivity onto a common fabric with a flatter topology and fewer switches
    • Predictable performance and low latency to support some of the most demanding application workloads
    • Modular, scalable, industry standards-based platforms and multi-site, multi-vendor management tools to connect and manage thousands of server and storage devices using industry-standard building blocks
    • Investment protection for existing Layer 3 core systems with seamless compatibility and support for open standards
    • Flexibility to manage and administer server, storage, and network resources in any organizational model, from completely separate to fully integrated, while consistently enforcing governance, security and SLA policies
    • Removal of costly and time-consuming change management processes, while reducing the number of error-prone or conflicting configuration steps
    • Support for a wide range of data center deployment models

FlexFabric delivers true “networking-as-a-service” to the various consumers of connectivity within the data center and accelerates deployment of applications and services. It provides a unified connectivity infrastructure across servers, storage, and networking that dynamically adapts to the demands of the heavily virtualized and more flexible data center architectures of tomorrow, while meeting increasing pressures for price/performance and time-to-service.
  26. 26. The key attributes of HP FlexFabric

By radically simplifying and flattening network designs and using emerging data center networking standards, HP FlexFabric creates a more robust, flexible, and efficient data center network infrastructure. Rather than relying on a traditional hierarchical networking architecture, FlexFabric offers a flatter data center topology with edge intelligence, designed to complement the intelligent virtualized network interfaces offered by the latest HP data center servers and storage systems. This flat fabric interconnect is more fungible and provides superior network performance and quality of service.

To manage the FlexFabric network, you can design and centrally manage fully-virtualized network connections and resources that allow for dynamic provisioning from the edge to the core and support for application mobility, enabling connections to move with workloads as they migrate across the fabric. This allows resources to be created, moved, and scaled from centralized connection pools “on the fly,” putting to work an integrated resource and provisioning management toolset.

To secure the FlexFabric network, a virtualization-integrated security framework provides business continuity with a unified, high performance physical/virtual server network security architecture. This framework enables seamless threat management and leverages a global threat intelligence network to block bad traffic in virtual and physical environments.

FlexFabric is designed to support a much wider set of data center architectures, workloads, and requirements than is otherwise possible with traditional data center networking approaches. It supports specialized back office, cloud, web, or high-performance computing models. Instead of locking organizations into a proprietary end-to-end solution, FlexFabric gives them the flexibility to incrementally deploy a heterogeneous data center network that meets their workload needs and protects existing investments.

Predictable performance supports diverse workloads

A highly scalable, flat network domain enables HP FlexFabric to deliver flexible provisioning, ultra-low latency, high performance, and fast workload mobility. The architecture provides breakthrough cost structures by removing networking layers and complexity, and applying new technologies including higher speed Ethernet links, active load balancing, and link aggregation within the server edge and advanced multi-switch virtualization and management in the interconnect. Multiple server edge and interconnect switches can be virtualized and managed as single logical devices with improved utilization, high availability, scalability, and flexibility to handle virtualized workloads with very high throughput. Capacity can be dynamically scaled or divided.

FlexFabric networks are designed to meet the security, resiliency, and reliability requirements expected in today’s data center.

Open and standards-based for investment protection

FlexFabric is designed to interoperate with existing third-party Layer 3 core switches to protect existing investments and enable smooth network migration. This standards-based approach removes the risk of vendor lock-in and lets your organization incrementally deploy a FlexFabric network without disruptive forklift upgrades. You can mix and match existing operational processes with new approaches using industry-leading HP products to coordinate IT teams. Finally, this approach helps your organization manage the high purchase, support, and operations costs associated with proprietary environments.

Pragmatic deployment of new technologies

HP FlexFabric utilizes the latest emerging industry standards, including higher speed Ethernet links, Virtual Ethernet Port Aggregator (VEPA), Fibre Channel over Ethernet (FCoE), and Converged Enhanced Ethernet (CEE). The CEE standard enables Ethernet to deliver a “lossless” transport technology with congestion management and flow control features needed in storage environments. Leveraging FCoE today, FlexFabric server edge platforms allow for sensible storage-server I/O consolidation with assured compatibility with existing Fibre Channel Storage Area Networks (FC-SANs). This allows users to reduce cost and complexity without jeopardizing business continuity. HP is championing many of these and other emerging standards in the IEEE and other organizations, to give users a data center fabric that protects their technology investments instead of proprietary approaches that can cause organizational disruption and wholesale equipment replacement.

Data center-integrated management and provisioning for business agility

With management and provisioning integrated down to the component level, including networking and virtual I/O, HP is revolutionizing data center provisioning and operation. Comprehensive network resource management tools allow users to administer networks across multiple sites and against a combination of HP and multi-vendor platforms
  27. 27. from a single pane of glass. Integrated FlexFabric provisioning capabilities reduce time to service and the chance of costly errors while accelerating IT alignment with business demands and goals.

FlexFabric enables administrators to centrally define connection and network policies that can be dynamically matched to workloads and provisioned “on the fly” from pools of available resources. The FlexFabric model allows a “design once, replicate many” approach to provisioning that is optimized for workload mobility, streamlines network provisioning, and reduces the number of error-prone or possibly conflicting configuration steps that make change management time-consuming and costly.

FlexFabric removes a major barrier to automation and orchestration: the “all-or-nothing” proposition organizations face with other data center management frameworks. Designed to support a wide range of IT organizational models, FlexFabric offers interfaces designed specifically for each operator type found in IT teams. Network administrators can provision resources in advance and make them available to server and storage teams to utilize instantly when needed, saving time and speeding service.

FlexFabric management integrates seamlessly across the entire spectrum of HP data center management systems to streamline the activities of your data center IT teams without requiring extensive overhauls of organizational structure and processes. This powerful system can automate and coordinate network services with application deployment, and free up data center administrators from repetitive operational activities that drain IT budgets.

FlexFabric provides open interfaces for third-party functionality that integrates application delivery and virtualization engines. Finally, FlexFabric management is fully integrated with industry-leading IT orchestration and management systems from HP, giving your IT staff unprecedented control that spans networks, servers, applications, and even physical plant attributes.

The FlexFabric evolution path

Deliver “networking as a service” to the Converged Infrastructure

FlexFabric is more than just an aspirational model of the ideal data center network. Users can deploy networks today that deliver on the FlexFabric value proposition, aggressively or incrementally, in keeping with overall technology and business objectives. This evolutionary and flexible approach to data center deployment across the infrastructure puts real user needs, investment protection, and business continuity at the top of the list of principles guiding our vision for a Converged Infrastructure network.

Today: A network foundation for FlexFabric agility

First introduced in 2006, Virtual Connect technology is a key enabler of an integrated, data center-aligned network, and delivers against foundation HP FlexFabric principles by providing some of the simplest, most flexible ways in the world to provide high-performance, secure server connectivity. With reduced complexity, improved agility, and reduced cost, Virtual Connect radically simplifies network infrastructure and provisioning without disrupting “upstream” network operations.

HP Virtual Connect virtualizes server edge I/O, enabling server administrators to provision Local Area Network (LAN) and Storage Area Network (SAN) resources in advance, and then enable them when needed. Virtual Connect enables server administrators to move workloads and virtual machines, or add, move, or replace servers transparently to LANs and SANs in minutes without having to engage LAN and SAN administrators.

Attacking head-on the expensive proliferation of Ethernet connections caused by increased network capacity requirements for virtual machines, HP Virtual Connect FlexFabric modules and adapters can reduce sprawl at the edge by 95%. Virtual Connect FlexFabric modules provide up to four physical connections for each network port, with the unique ability to fine-tune bandwidth to adapt to virtual server workload demands on the fly.

The system administrator can now define the hardware personalities of these connections as FlexNICs to support only Ethernet traffic or as FlexHBAs that combine Ethernet and Fibre Channel or iSCSI protocol support. Each connection has 100 percent hardware-level performance and provides the I/O connections needed to take full advantage of multi-core processors and to support more virtual machines per physical server. Each server can support many more connections (up to 40) with less investment in expensive network equipment on the server, in the enclosure and in the corporate network.

The bandwidth of each connection can be fine-tuned and adapted in 100 Mb increments up to 10 Gb as workload demands change. The server comes with 10 Gb capability built into it, ready for today’s investments in 10 Gb networks and converged fabric technologies like Fibre Channel over Ethernet. Virtual Connect FlexFabric modules allow users to take advantage of edge convergence by providing Fibre Channel over Ethernet (FCoE)
  28. 28. downlinks to the blades while maintaining standard and proven Ethernet LAN, Fibre Channel SAN, and iSCSI external connections with their associated IT practices. This allows system administrators to simplify enclosure infrastructure and lower costs by combining Ethernet, Fibre Channel, and iSCSI protocols over one wire and managing them from a single management application and interface. For any virtual server environment, Virtual Connect FlexFabric modules and adapters are simply some of the most affordable, flexible, and power-efficient solutions available from any blade portfolio.

For organizations preferring a traditional server edge implementation, network management and design methodology, HP offers scalable blade-based switching. For users looking to achieve high levels of server connectivity consolidation and top-of-rack switch platforms that deliver high performance, advanced multi-switch virtualization, and flexible connectivity, options like FCoE that provide cost-effective storage-server I/O consolidation and 1 Gb to 10 Gb migration are available. With the 6120 series of blade switches or the A5820 series of fixed and semi-modular top-of-rack switches, users have multiple ways to incrementally deploy a FlexFabric server edge that are in keeping with traditional network designs.

Complementing the FlexFabric Server Edge offering, HP offers a complete portfolio of enterprise-class interconnect and backbone platforms that deliver aggregation, core switching, and enterprise routing functionality. These platforms are built on cutting-edge technology and provide industry-leading performance, lower power consumption, and lower TCO with a unified switch operating system that lets users build simpler, flatter networks with comprehensive management. Complete feature functionality and mission-critical high availability mean that users can deploy a wide variety of designs to accommodate existing Layer 3 core investments or to radically simplify the network in collapsed aggregation/core designs. Advanced multi-switch virtualization technologies allow users to build cost-effective, large Layer 2 aggregation layers ideally suited for large-scale virtualization installations. With a continued commitment to open standards-based interoperability, users can easily integrate proven third-party data center applications and technologies, and avoid vendor lock-in. These HP data center networking products include the A-series of switches and routers, such as A6600/A8800 enterprise routers and the industry’s highest performance A12500 series switches.

HP provides powerful tools for managing large-scale FlexFabric networks both in advanced Virtual Connect-based and traditional network server edge deployments. With HP Virtual Connect Enterprise Manager, users can manage the setup and migration of up to 16,000 Virtual Connect-based servers from a single pane of glass. As the foundation for comprehensive network resource management across the entire enterprise network, Intelligent Management Center (IMC) lets users manage an entire multi-site, multi-vendor network, edge to core, from a single management console.

Securing the FlexFabric is a set of tools that brings threat management for both virtual and physical networking together into a single, enterprise-class architecture. The HP TippingPoint Secure Virtualization Framework lets users leverage highly scalable appliance-based Intrusion Prevention Systems (IPS) to comprehensively secure VM-to-VM as well as inter-server and inter-network traffic from a common IPS infrastructure. Combined with a wide range of security subscription services that leverage a global threat intelligence network to block bad traffic in virtual and physical environments, users can provide continuity as they scale out server virtualization deployments.

Tomorrow: A new model for deploying networking as a service

With a vision toward provisioning of network connectivity and resources completely synchronized in an end-to-end data center orchestration layer, HP has developed the Data Center Connection Manager (DCM) appliance as a proof-point for how networking can be enabled to accelerate deployment of virtualized server workloads. HP Data Center Connection Manager begins to implement the HP FlexFabric dynamic provisioning vision. DCM allows network architects to preconfigure server connection policies that are enforced at the network edge through common RADIUS and DHCP standards. Virtual and physical server interfaces are individually associated or subscribed to connection profiles from a pool of resources by the server administrator at build time, allowing rapid, secure provisioning and workload mobility without the repetitive manual tasks and turnaround time associated with provisioning today. These policies can drive events directly to the HP BSA Network Automation software product suite, enabling deep levels of dynamic automation to provision firewalls or application delivery controllers in response to server provisioning, de-provisioning or configuration changes. These capabilities give network administrators the power to deploy, manage, and evolve server connectivity flexibly, quickly, and in line with business policy and demands.
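The provisioning pattern described for DCM (network architects define connection profiles in advance, and server administrators later subscribe individual interfaces to them) can be pictured with a small data-structure sketch. This is only an illustrative model, not DCM's actual interface or data model: `ConnectionProfile`, `ProfilePool`, the `vlan_id` and `protocols` fields, and the example interface identifier are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectionProfile:
    """Hypothetical policy a network architect defines ahead of time."""
    name: str
    vlan_id: int
    protocols: tuple


@dataclass
class ProfilePool:
    """Hypothetical pool mapping server interfaces to predefined profiles."""
    profiles: dict = field(default_factory=dict)       # profile name -> ConnectionProfile
    subscriptions: dict = field(default_factory=dict)  # interface id -> profile name

    def define(self, profile):
        # Done once, in advance, by the network team.
        self.profiles[profile.name] = profile

    def subscribe(self, interface_id, profile_name):
        # Done at build time by the server administrator.
        if profile_name not in self.profiles:
            raise KeyError(f"unknown profile: {profile_name}")
        self.subscriptions[interface_id] = profile_name
        return self.profiles[profile_name]

    def lookup(self, interface_id):
        # The policy is keyed by the interface, not by a switch port, so it
        # can be applied wherever the workload attaches to the network edge.
        name = self.subscriptions.get(interface_id)
        return self.profiles.get(name) if name is not None else None


pool = ProfilePool()
pool.define(ConnectionProfile("web-tier", vlan_id=110, protocols=("ethernet",)))
assigned = pool.subscribe("00:11:22:33:44:55", "web-tier")
```

Keying the policy on the interface rather than on a physical port is what allows the same profile to follow a workload when it moves, which is the mobility benefit the text describes.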
  29. 29. Beyond: The evolution to a fully-converged, synchronized FlexFabric network

HP is committed to serving the diverse needs of modern data centers without imposing a specific operating model, proprietary architecture, or network fabric. With advances in next generation high-speed connectivity including 10GBASE-T (10 Gbps over copper) and 40 Gb/100 Gb fiber, FlexFabric can evolve to allow your organization to build single, large Layer 2 domains with thousands of direct, low-cost 10 Gbps Ethernet-connected servers, in virtual or non-virtual, rack mount or blade environments, all with equal ultra-low latency paths. The fabric will support Converged Enhanced Ethernet (CEE) either from the server edge or through the aggregation layers, offer full support for Fibre Channel over Ethernet (FCoE), and be capable of providing active load balancing across converged and traditional Ethernet-only connections.

To drive next generation security and forwarding capabilities, FlexFabric uses emerging industry standards to build and support virtual switches and virtual I/O adapters. HP has co-authored the IEEE Virtual Ethernet Port Aggregator (VEPA) proposal, which aims to provide multi-vendor, standardized discovery, configuration, and forwarding for virtual switching. FlexFabric plans to be capable of managing VEPA and other virtual I/O components from day one. This standards-based approach gives your IT organization a choice of virtualization vendors and approaches.

Most importantly, FlexFabric allows the rest of the data center infrastructure to exploit the benefits of server, storage, and network virtualization going forward. The nature of I/O buses and adapters is expected to change dramatically in the next five years; as the portion of server deployments whose I/O is completely virtualized increases, the nature of server I/O itself can evolve. No vendor is better positioned for this new world, from a skill set and intellectual property perspective, than HP, because HP is the only company with deep intellectual property in servers, blade servers, networking, storage, and virtualized I/O.

Ultimately, our goal is to allow IT to deploy new systems into a converged infrastructure that can automatically discover capacity, add it to resource pools, and put it to work to support the needs of business applications. As IT takes advantage of application convergence and uses cloud computing, HP can be a comprehensive partner to help you drive down maintenance costs, change economics, and enable your data center network and IT staff to help your organization thrive and respond to business demands.

Your next step

To learn more about the HP vision of Converged Infrastructure and how the HP FlexFabric plays a key role in it, visit

© Copyright 2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. 4AA0-7725ENN, Created June 2010, Rev. 1
  30. 30. If you liked this report, check out our other reports:

The ReadWriteWeb Guide to Online Community Management (edited by Marshall Kirkpatrick, May 2009)
Our first premium report for businesses comes in two parts: a 75 page collection of case studies, advice and discussion concerning the most important issues in online community; and a companion online aggregator that delivers the most-discussed articles each day written by experts on community management from around the Web.

The Real-Time Web and its Future (edited by Marshall Kirkpatrick)
Real-time Web technologies and applications have the potential to change everything, at a real-time pace. If you are a CTO, work in development or marketing, or are planning your next website or mobile application upgrade, you need to know about the real-time Web.

Augmented Reality for Marketers and Developers: Analysis of the Leaders, the Challenges and the Future (written by Chris Cameron)
AR offers a new paradigm for high impact, high value customer experience. Decrease your AR development time to market by learning from the first wave of early adopters of this new technology. In this ReadWriteWeb Premium Report we profile successful companies and their campaigns as well as development lessons learned.