Hadoop – It's Not Just Internal Storage (V14)

John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (e.g., GPFS) can be used effectively in a Hadoop storage environment. Many Hadoop situations absolutely require direct attached storage; however, there are also many situations where shared external storage makes sense. This presentation details how, why, and where, and promotes an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential to selecting either internal or external shared storage in a Hadoop environment.

  • Comment: Excellent slides for understanding Hadoop! There is an interesting question: IBM prices BigInsights by storage capacity, but the other Hadoop players price their software by node, which seems reasonable. Why does IBM choose a storage-based pricing metric?
Total views: 2,280; Downloads: 158; Comments: 1; Likes: 2; Embeds: 0

Notes
  • Traditional applications work on a model where data is loaded into memory from wherever it is stored onto the computer where the application runs. As Google processed ever-increasing amounts of internet data, its engineers quickly realized that this centralized approach to computation was not sustainable. So they moved to a model that scales out both processing and storage, creating a system where data is processed on the machine where it is stored. The processing technology became MapReduce, and the storage model is known as the Google File System (GFS), of which today's HDFS is a direct descendant.

    Hadoop is a top-level Apache project being built and used by a global community of contributors. Yahoo has been the largest contributor to the project, and it uses Hadoop extensively across its businesses. One of its employees, Doug Cutting, reviewed key papers from Google and concluded that the technologies they described could solve the scalability problems of Nutch, an open source web search technology. So Cutting led an effort to develop Hadoop (which, incidentally, he named after his son's stuffed elephant).

    Hadoop is particularly well suited to batch-oriented, read-intensive applications. Key features include the ability to distribute and manage data across a large number of nodes and disks. By using the MapReduce programming model with the Hadoop framework, programmers can create applications that automatically take advantage of parallel processing. A single commodity box consisting of, say, a single CPU and disk forms a node in Hadoop. Such boxes can be combined into clusters, and new nodes can be added to a cluster without an administrator or programmer changing the format of the data, how the data is loaded, or how the jobs (program logic) are written.

    The following overview of Hadoop was extracted from the Hadoop wiki at http://wiki.apache.org/hadoop/ : Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
  • http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ http://www.hadoopwizard.com/which-big-data-company-has-the-worlds-biggest-hadoop-cluster/
  • http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

    A Hadoop "stack" is made up of a number of components. They include: the Hadoop Distributed File System (HDFS), the default storage layer in any given Hadoop cluster; the Name Node, which provides the client information on where in the cluster particular data is stored and whether any nodes fail; the Secondary Node, a backup to the Name Node that periodically replicates and stores data from the Name Node should it fail; the Job Tracker, the node in a Hadoop cluster that initiates and coordinates MapReduce jobs, i.e. the processing of the data; and Slave Nodes, the grunts of any Hadoop cluster, which store data and take direction to process it from the Job Tracker.

    In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop.

    Following is a guide to Hadoop's components:
    Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
    MapReduce: a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
    Hive: a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc. (see the JDBC sketch after this list).
    Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
    HBase: a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
    Flume: a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
    Oozie: a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig, and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
    Whirr: a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.
    Avro: a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
    Mahout: a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.
    Sqoop: a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
    BigTop: an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
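    To make the Hive point concrete, here is a minimal sketch of querying Hive from Java over JDBC. The host, port, table, and column names are hypothetical, and the HiveServer2 JDBC driver is assumed to be on the classpath; Hive compiles the SQL into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // "docs" is a hypothetical table; Hive turns this SQL
             // into MapReduce work on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS n FROM docs GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + " " + rs.getLong("n"));
            }
        }
    }
}
```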
  • Internal storage example: 5 cents/GB = $50 per TB, so 10 TB = $500 and 50 TB = $2,500. With 3 Hadoop copies, that is $7,500 of storage inside the Hadoop servers, plus roughly $2,500 for the Hadoop server itself, for about $10,000 total. External storage example: 26 cents/GB = $260 per TB, so 10 TB = $2,600 and 50 TB = $13,000 of storage in an external array, plus the cost of the SAN or NAS network, plus the cost of rebalancing the Hadoop cluster for equivalent performance. (The arithmetic is sketched below.)
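    A minimal sketch of that arithmetic, using the note's illustrative 2013-era prices (the dollar figures are the slide's assumptions, not vendor quotes):

```java
public class HadoopStorageCost {
    public static void main(String[] args) {
        double usableTB = 50;          // target usable capacity
        double dasPerTB = 50;          // 5 cents/GB JBOD inside the server
        double externalPerTB = 260;    // 26 cents/GB enterprise array
        int replicas = 3;              // typical Hadoop replication

        double dasStorage = usableTB * dasPerTB * replicas;   // $7,500
        double dasTotal = dasStorage + 2500;                  // + Hadoop server
        double externalStorage = usableTB * externalPerTB;    // $13,000

        System.out.printf("DAS total:      $%,.0f%n", dasTotal);
        // The external total additionally needs SAN/NAS networking and
        // cluster-rebalancing costs, which are site-specific.
        System.out.printf("External array: $%,.0f + network costs%n",
                externalStorage);
    }
}
```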
  • There are two aspects of Hadoop that are important to understand. MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data. This file system spans all the nodes in a cluster. Effectively, HDFS links together the data that resides on many local nodes, making the data part of one big file system. Furthermore, HDFS assumes nodes will fail, so it replicates a given chunk of data across multiple nodes to achieve reliability. The degree of replication can be customized by the Hadoop administrator or programmer; the default, however, is to replicate every chunk of data across 3 nodes: 2 on the same rack, and 1 on a different rack. You can use other file systems with Hadoop (e.g., GPFS), but HDFS is quite common. The key to understanding Hadoop lies in the MapReduce programming model. This is essentially a representation of the divide-and-conquer processing model, where your input is split into many small pieces (the map step), and the Hadoop nodes process these pieces in parallel. Once these pieces are processed, the results are distilled (in the reduce step) down to a single answer. A sketch of configuring the replication degree follows.
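    Since the note says the replication degree is configurable, here is a minimal sketch using Hadoop's Java client API. The file path is hypothetical, and the per-file override is shown only as an illustration of the knob the note describes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for newly created files; normally set
        // cluster-wide in hdfs-site.xml (3 is the stock HDFS default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Override replication for one existing file, e.g. keep only
        // 2 copies when the underlying array already provides RAID
        // protection. "/data/example.log" is a hypothetical path.
        fs.setReplication(new Path("/data/example.log"), (short) 2);
        fs.close();
    }
}
```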
  • http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
  • http://www.forbes.com/sites/johnwebster/2013/02/27/hadoop-appliances-the-lengthening-list/
  • The filer is present just for completeness. It has no part in the story we are telling about GPFS or GPFS-FPO as a replacement for HDFS. There is no need to talk to it, and in your personal copies of the deck, you may remove it if you wish.
  • Policy-based ingest – move data into and out of the FPO pool using GPFS policies, as sketched below.
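    As a rough illustration only (the pool name, fileset name, and threshold here are hypothetical, and the exact rule syntax should be checked against the GPFS documentation for your release), a placement/migration policy applied with mmapplypolicy might look like:

```
/* Hypothetical example: land new ingest files in the FPO pool. */
RULE 'ingest'  SET POOL 'fpopool' FOR FILESET ('landing')

/* Hypothetical example: age colder files out of the FPO pool. */
RULE 'age-out' MIGRATE FROM POOL 'fpopool' TO POOL 'system'
               WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
```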
  • Summary: When you break it down even further, IBM has constructed a portfolio of software and solutions with the breadth and depth to meet the needs of all organizations today, combined with unique synergies across this portfolio that enable organizations to start with their most pressing needs, knowing that they will be able to leverage their skills and investment in future projects to reduce risk, lower costs, and achieve faster time to value in meeting the needs of the business. There are multiple "entry points", driven by your most pressing needs, that help you start moving down the path of an information-led transformation. (Note: describe this slide from the bottom up.)

    When you think about an information-led transformation, you need to ensure that your infrastructure and systems are optimized to handle the various workloads demanded of them. Especially today, when you are faced with a glut of new information, you need to ensure that relevant information is available, that it is secure, and that you are able to retrieve it in a timely manner, not only for analytical, operational, and transactional systems but also for regulatory compliance. That is why IBM Software Group and our Systems & Technology Group are working together to provide optimized solutions focused on delivering greater business value to our customers, faster, for increased return on investment. From the new IBM Smart Analytics System, to the new DB2 pureScale for continuous availability, unlimited capacity, and application transparency, to the deep integration of System z, IBM has unparalleled expertise in designing and implementing workload-optimized systems and services.

    On top of that infrastructure, there is also the need to ensure that you can bring all of those sources of information together to create a single, trusted view of information from across your business, regardless of whether that information is structured or unstructured, and then manage it over time. From data warehousing, Master Data Management, information integration, and agile ECM and integrated data management, IBM's InfoSphere portfolio ensures that organizations will be able to leverage their information over time to drive innovation across their business.

    And armed with this single view of your business, you can then look to optimize business processes and drive greater performance across your organization. Decision makers will have the right information, at the right time, in the right context to make better, more informed decisions, and even anticipate new opportunities or counter potential threats more effectively. The Business Analytics and Optimization platform supports an information-led transformation in that it focuses on establishing well-constructed processes and empowering individuals throughout the organization with pervasive, predictive, real-time analytics. With Cognos and the newly acquired SPSS portfolios, organizations can now be more proactive and predictive in innovating their business.
  • The IBM Big Data Platform extends the traditional warehouse in two ways. Big Data in Motion is streaming data, such as securities data (like stock tickers) or sensor data (like temperature readings, heart rates, or the revolutions per second of a piece of machinery). This data can stream at a very high transfer rate and vary greatly in its structure. Our product offering for Big Data in Motion is InfoSphere Streams, which is capable of performing analytics on streaming data in real time. Big Data at Rest is a set of data in static storage, for instance a large set of log files from a web site's click-stream analysis, or pools of raw text from service engagements with customers. Our product offering for Big Data at Rest is InfoSphere BigInsights, which is capable of performing analytics on this large set of varied data. Streams and BigInsights interface with each other and can use existing data warehouses as data sources for their analytics; or the data warehouse can pull data from Streams and BigInsights.

    Transcript: Big Data is all about internet scale. In fact, these two ideas end up being nearly synonymous. When everybody thinks of Big Data, they think of internet data; they think of all the external data sources that can be pulled together. And we are in fact combining our capability in data warehousing with this internet-scale capability around Hadoop MapReduce.
  • IBM offers two basic models – the GSS24 and GSS26 – with 4 or 6 JBODs, respectively. These two basic configurations are scalable into larger storage solutions by using them as building blocks.
  • IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data. BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer- and user-friendly solution for complex, large-scale analytics. InfoSphere BigInsights allows enterprises of all sizes to cost-effectively manage and analyze the massive volume, variety, and velocity of data that consumers and businesses create every day.

    InfoSphere Streams: Part of IBM's platform for big data, IBM InfoSphere Streams allows you to capture and act on all of your business data, all of the time, just in time. InfoSphere Streams radically extends the state of the art in big data processing; it is a high-performance computing platform that allows user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Users are able to: continuously analyze massive volumes of data at rates up to petabytes per day; perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that is relevant to your business; leverage sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes; adapt to rapidly changing data forms and types; seamlessly deploy applications on any size computer cluster; meet current reaction-time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules; quickly develop new applications that can be mapped to a variety of hardware configurations and adapted with shifting priorities; and gain security and information confidentiality for shared information. Learn more about how InfoSphere Streams aligns with any industry.
  • http://www.forbes.com/sites/johnwebster/2013/02/27/hadoop-appliances-the-lengthening-list/
  • Thank you!
  • Link to enter your email address and then download a free copy of this book: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=500016891&S_CPM=is_bdebook1_biginsightsfp Direct URL to load the book (3.5 MB Acrobat Reader file): http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
  • InfoSphere BigInsights features Apache Hadoop as a core component. There are two releases of InfoSphere BigInsights: Basic and Enterprise. The Basic Edition is a free offering. It has open source components as well as IBM value-add (maintenance console, DB2 integration, integrated installation), and you can purchase support for it. This is an excellent choice for companies who want to get a Hadoop environment up and running or conduct a POC, and it lays the foundation to turn that POC into a pilot or full enterprise deployment. The Enterprise Edition adds significant value on the same base platform, so you can grow into it. There are two main value-adds: this offering hardens Hadoop, providing enterprise-quality stability, and it provides an analytics layer. Specifically, it includes a rock-solid file system alternative to what's included in open source Hadoop, text analytics, analytics visualization, security, an integrated web console, workflow and scheduling, indexing, and documentation.

    Transcript: What we're doing is putting together a comprehensive solution around Big Data, and if you're going to the sessions here, you'll hear more about this in some of the breakouts, and certainly in the expo demonstration capability. We're looking at, and starting to deliver, scenarios that combine non-traditional Big Data types of information with traditional data, putting those together in creative ways to allow you to navigate and mine and understand the patterns across this very large corpus of information. And of course, as required, we're applying real-time stream processing into that flow, because we see many of these scenarios demanding real-time analytics.

Transcript

  • 1. © 2013 IBM Corporation. Hadoop – It's Not Just Internal Storage. John Sing, Executive Consultant, IBM Systems and Technology Group. Session 1185A, Tuesday, June 11, 2013
  • 2. John Sing: 31 years of experience with IBM in high-end servers, storage, and software.
    – 2009 - Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics, HA/DR/BC
    – 2002 - 2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity, HA/DR/BC, IBM Storage
    – 1998 - 2001: IBM Storage Subsystems Group: Enterprise Storage Server Marketing Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror)
    – 1994 - 1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage
    – 1989 - 1994: IBM USA Systems Center Specialist for High-End S/390 processors
    – 1982 - 1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA)
    Contact: singj@us.ibm.com
    Daily IT research blog: http://www.delicious.com/atsf_arizona
    Slideshare: http://www.slideshare.net/johnsing1
    LinkedIn: http://www.linkedin.com/in/johnsing
  • 3. Agenda:
    – Understanding today's Hadoop environments: Hadoop architecture, usage cases, deployments; Hadoop design, performance, and cost considerations
    – Differing Hadoop perspectives: Applications/Business Line vs. Operations; understanding implications of direct attached storage (DAS) vs. shared storage
    – Intelligently choosing Hadoop storage solutions: usage cases where Direct Attached Storage makes sense; intelligent usage cases where shared storage makes sense; future evolution of storage, Hadoop, and the cross-section of the two
    – IBM Hadoop, Storage, Big Data hardware and software components, tools, offerings
  • 4. Understanding today's Hadoop environments (Hadoop – It's Not Just Internal Storage)
  • 5. What is Hadoop? Instead of the traditional IT computation model, which brings the data to the function/program on an application server, loads data into memory on that server, and processes it there (which unfortunately doesn't scale for internet-scale Big Data problems), Apache Hadoop is an open source framework for data-intensive applications. Inspired by Google technologies (MapReduce, GFS), it is well suited to batch-oriented, read-intensive applications; Yahoo! adopted these technologies and open sourced them into the Apache Hadoop project. Hadoop has become a pervasive enabler of internet-scale applications, working with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner. CPU + disks of commodity storage = a Hadoop "node"; Hadoop nodes today run mission-critical production in massive clusters of tens of thousands of servers. New nodes can be added to the cluster as needed without changing data formats, how data is loaded, or how jobs are written. Tutorials: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
  • 6. The World of Hadoop: worldwide usage. eBay, LinkedIn, Yahoo!, Facebook, New York Times, and many, many more. http://www.datanami.com/datanami/2012-04-26/six_super-scale_hadoop_deployments.html One source for Hadoop users (but not the only one!): http://wiki.apache.org/hadoop/PoweredBy
  • 7. Hadoop is today a well-developed ecosystem:
    – Hadoop: overall name of the software stack
    – HDFS: Hadoop Distributed File System
    – MapReduce: software compute framework (Map = queries; Reduce = aggregates answers)
    – Hive: Hadoop-based data warehouse
    – Pig: Hadoop-based language
    – HBase: non-relational database for fast lookups
    – Flume: populate Hadoop with data
    – Oozie: workflow processing system
    – Whirr: libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.
    – Avro: data serialization
    – Mahout: data mining
    – Sqoop: connectivity to non-Hadoop data stores
    – BigTop: packaging / interop of all Hadoop components
    http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/ http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
  • 8. Hadoop vendor ecosystem today. http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://www.forbes.com/special-report/2013/industry-atlas.html
  • 9. Why understand the Hadoop stack and environment? Hadoop is being used for much more than just internet-scale Big Data analytics; it is increasingly being used by enterprises for inexpensive data storage.
    – As an industry we're strongly exploiting a much wider variety of data types. With tools like Hadoop, it has become affordable to ingest, analyze, and have available an internet-scale "Big Landing Zone" Hadoop cluster for storing data that was previously not viable to keep online. The Hadoop cluster can then also run internet-scale analytics on this data. A significant driver: the move to Hadoop to reduce traditional database licensing costs.
    – Storage industry dynamics: today, JBOD storage in a server chassis might be as low as 4-6 cents per raw GB. At these prices, adding 50 TB usable to a Hadoop cluster might only cost $10K in total, including the server. Even at the typical Hadoop 3X copies, this is still less initial cost than enterprise storage at 26 cents/GB. This does not include all factors, but these dynamics clearly affect the decision. And then, there's flash storage coming.
    – You must understand the full depth of the Hadoop environment and storage industry dynamics in order to decide if, when, and where Hadoop internal storage or shared storage is appropriate.
  • 10. Why Hadoop was created for Big Data. Traditional approach: move data to the program. Big Data approach: move the function/program to the data.
    – Traditional approach: the application server and database server are separate; data can be on multiple servers; the analysis program can run on multiple application servers; the network is still in the middle, and data has to go through the network.
    – Big Data approach: the analysis program runs where the data is, on the data node; only the analysis program has to go through the network; the analysis program needs to be MapReduce-aware; highly scalable to 1000s of nodes, petabytes and more.
    Thank you to: Pascal Vezolle/France/IBM and Francois Gibello/France/IBM for the use of this slide
  • 11. Example of Hadoop in action. Example: how many hours does Clint Eastwood appear in all the movies he has done? All movies need to be parsed to find Clint's face. Traditional approach (move data to program): all movies are uploaded to the application server, through the network. Big Data approach (move program to data): the analysis program and a copy of Clint's picture are downloaded to the data nodes, through the network. Thank you to: Pascal Vezolle/France/IBM and Francois Gibello/France/IBM for the use of this slide
  • 12. Hadoop principles: storage, HDFS, and MapReduce.
    – Hadoop Distributed File System (HDFS): where Hadoop stores the data. The HDFS file system spans all the nodes in a cluster, with locality awareness.
    – Hadoop data storage and computation model: data is stored in a distributed file system spanning many inexpensive computers; send the function/program to the data nodes, i.e. distribute the application to the compute resources where the data is stored; scalable to thousands of nodes and petabytes of data.
    – MapReduce application flow: 1. Map phase (break the job into small parts); 2. Shuffle (transfer interim output for final processing); 3. Reduce phase (boil all output down to a single result set). Map tasks are distributed to the cluster; data is loaded, spread, and resident in the Hadoop cluster; a single result set is returned.
    – Performance = tuning the MapReduce workflow, network, application, servers, and storage.
    The slide's WordCount code fragment:

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text val, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(val.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> val,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : val) {
                    sum += v.get();
                . . .

    http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/ http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
  • 13. Big Data Hadoop system architecture.
    – Management nodes (for Hadoop and the cluster), Namenode nodes, JobTracker nodes, and data nodes with local disks, connected by a 1-10 Gb Ethernet or InfiniBand network.
    – IO performance = type and number of disks. Reference architecture: from 12-24 disks, ~1.5 GB/s, >35 TB, 12-16 CPUs per data node. Scaling granularity: the data node, scaling both IO and CPU.
    – Hadoop Distributed File System (HDFS): HDFS stores data across multiple data nodes, and the Namenode knows where the data is. HDFS assumes data nodes and disks will fail, so it achieves reliability by replicating data across multiple data nodes (typically 3 or more). The HDFS file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. The HDFS Name Node is a single point of failure.
    – Locality awareness. Note: any other location for data adds network latency.
  • 14. Differing Hadoop storage perspectives (Hadoop – It's Not Just Internal Storage)
  • 15. Understanding the Hadoop rationale for Direct Attached Storage (latency). The primary Hadoop design goal is affordability at internet scale:
    – Data is loaded into the Hadoop cluster with data locality, spreading data across data nodes and achieving the lowest disk latency through direct attached storage.
    – Send programs to data (not the other way around); data in general does not move within the Hadoop cluster.
    – Key performance components: disk latency, network interconnect, utilization, bandwidth.
    – Based on low capital expenditure, low-cost commodity components. Goal: lowest capital cost at scale (adapters, switches, and number of ports).
    Hadoop application and performance tuning. Fallacy: "all Hadoop jobs are IO-bound". Truth: there are many, many Hadoop workflow and tuning variables and widely varying workloads; the CPU/storage ratio differs for different workloads. Network latency is a major performance impact on a Hadoop cluster; adding external-storage-layer network latency causes major retuning of the network. Hadoop application team to Operations: "Until you've read the Hadoop book, please don't waste my time."
    http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/ http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
  • 16. Yet there are valid operational issues with Hadoop from an enterprise shared-storage management and cost standpoint: Servers under-utilized? Another storage silo? Amount of physical storage required per usable GB/TB? Reliability as the Hadoop application goes into mission-critical production? Hadoop-specific storage management, migration, backup, recovery? Hadoop-specific skill set? Ability to understand what data is used where? Audit, security, legacy application integration? Can Hadoop storage (and servers) be shared dynamically, in a pool with other data center resources? Ultimately, it becomes a matter of perspective, type of infrastructure, and associated priority. Let's explore this further.
  • 17. Today: two different types of IT (internet scale workloads; transactional IT). Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
  • 18. Today's two major IT workload types (transactional IT; internet scale workloads). Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
  • 19. How to build these two different clouds (transactional IT; internet scale workloads). Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
  • 20. Hadoop storage choices based on perspective: this is where a Hadoop external shared storage infrastructure may often be found, and this is where a Hadoop DAS-focused infrastructure may often be found.
  • 21. Differing valid perspectives on Hadoop storage issues. Clearly, different perspectives!
    – Hadoop Applications / Business Line team: very specific reasons why Direct Attached Storage is used. Performance and throughput (lowest latency); low-cost commodity components (JBOD at 4-6 cents/GB today; even at 3x copies, still very inexpensive); many Hadoop workflow and software components to tune (Map and Reduce workflow; memory allocation and usage; algorithms and tuning at all levels; what the tasks are doing); Hadoop overall cluster configuration (server and DAS storage configuration; 3X copies for performance reasons; squeezing out all latency; network topology, speeds, utilization; compression; type of data); etc.
    – Operations team: very specific reasons why shared storage is desired. CAPEX/OPEX cost (fixed server/storage ratio? low server utilization = excess cost?); reliability; backup; disaster recovery; another silo of storage; managing data within the Hadoop cluster and between Hadoop and other existing storage.
  • 22. Bottom line on Direct Attached Storage (DAS) vs. shared storage for Hadoop.
    – Avoid "brute force" one-for-one direct replacement of Hadoop direct attached storage with external shared storage. This is too blunt an instrument: it doesn't intelligently consider Hadoop design characteristics, performance requirements, overall Hadoop cluster tuning, workload variations, or the customer's environment.
    – Instead, take an intelligent, blended Hadoop storage approach, with full awareness of the Hadoop stack, the customer environment, and multiple perspectives: identify cases where Direct Attached Storage (DAS) makes sense (many Hadoop cases where DAS is the correct primary storage choice, for issues of very large scale, performance and throughput, and minimizing network and adapter costs), and identify cases where shared storage makes sense (while maintaining the Hadoop benefits of DAS latency, cost, and scale; specific intelligent implementations are effective if designed properly with full Hadoop stack awareness).
    – Without an intelligent, in-depth, Hadoop-aware approach, you likely will not meet Hadoop performance or cost objectives. Replacing DAS one-for-one with external shared storage today isn't cost-effective at true internet scale; SAN switch and port costs today cannot affordably reach thousands of data nodes. Use an intelligent approach; otherwise SAN/NAS will introduce a significant percentage increase in disk IO latency, requiring rebalancing of the entire Hadoop cluster and more expensive networking.
    http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
  • 23. Intelligently choosing Hadoop storage solutions (Hadoop – It's Not Just Internal Storage)
  • 24. Intelligently using Hadoop shared storage: goals.
    – Perform mixed workloads on a shared storage infrastructure: some storage for Hadoop, other storage for other things, all on the same storage devices.
    – Trade off a reduced number of Hadoop copies by exploiting higher storage reliability, saving on total Hadoop physical storage space.
    – Exploit external storage placement/migration/storage management strategy and capabilities.
    – Exploit configurable storage recovery policies and backup/restore.
    – Exploit your existing storage infrastructure in a balanced, cost-effective way.
    – Reduce the need for Hadoop storage allocation skills and manual management of Hadoop data.
    – Exploit existing shared storage infrastructure tooling and performance monitors.
    – Add audit, security, and legacy integration opportunities leveraged out of the existing infrastructure, avoiding a silo'd Hadoop storage environment.
    – Decouple servers from storage: enable using smaller servers (less power, cooling); enable better use of resources on differing workloads with differing server/storage ratios; dynamically allocate servers and storage to work on differing and changing analytics workloads.
  • 25. Intelligent usage cases for shared external storage in Hadoop, where external shared storage supplements and is appropriate for Hadoop:
    – Stage 1: intelligently direct-attach larger external storage arrays, or external file systems, for Hadoop primary storage while still preserving Direct Attached Storage data locality and the ability for internet scale, while using external storage to bring desired function or reduce the number of Hadoop copies. Examples: N series Open Solution for Hadoop; GPFS File Placement Optimizer.
    – Stage 2: augment Hadoop DAS primary storage with a second storage layer (external file system, NAS, or SAN) as a data protection or archival layer, intelligently allocating, importing, and exporting data appropriately.
    – Stage 3: directly replace primary node-based DAS with external shared storage (file system, NAS, or SAN). Appropriate for certain clusters and certain Hadoop environments where network rebalancing, adapter/network costs, and scale are in line with shared storage benefits. Example: IBM GPFS Storage Server.
    Hadoop stages originally published by John Webster, Evaluator Group, http://www.evaluatorgroup.com/about/principals/ http://searchstorage.techtarget.com/video/Alternatives-to-DAS-in-Hadoop-storage http://searchstorage.techtarget.com/answer/Can-shared-storage-be-used-with-Hadoop-architecture http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster
  • 26. IBM Big Data Networked Storage Solution for Hadoop (Stage 1). Example: IBM DCS3700 with Hadoop replication count = 2; still direct attached data locality. http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
  • 27. IBM Big Data Networked Storage Solution for Hadoop (Stage 1). Hadoop storage building blocks on IBM storage; Hadoop replication count = 2; improved Hadoop Namenode protection. http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
  • 28. Another option: a Hadoop environment using IBM GPFS-FPO (File Placement Optimizer), Stage 1. GPFS File Placement Optimizer is used instead of HDFS; it still places disks local to each server and aggregates the local disk space into a single redundant shared GPFS file system. It is designed for MapReduce workloads. Unlike HDFS, GPFS-FPO is POSIX compliant, so data maintenance is easy (see the sketch after this slide). It is intended as a drop-in replacement for open source HDFS (the IBM BigInsights product may be required).
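    Because GPFS-FPO is POSIX compliant, data in the cluster can be read with ordinary file APIs rather than an HDFS-specific client. A minimal sketch (the mount point and file name are hypothetical):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PosixReadExample {
    public static void main(String[] args) throws Exception {
        // A GPFS-FPO file system is mounted like any other POSIX file
        // system; "/gpfs/fpo/landing/events.log" is a hypothetical path.
        Path p = Paths.get("/gpfs/fpo/landing/events.log");

        // Standard Java I/O works directly; no HDFS client is required.
        for (String line : Files.readAllLines(p)) {
            System.out.println(line);
        }
    }
}
```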
  • 29. GPFS File Placement Optimizer shared storage advantages in a Hadoop environment (Stage 1). The slide is a feature comparison of GPFS 3.5 vs. HDFS across these categories:
    – Performance: Terasort (large reads); HBase (small writes); metadata intensive
    – Enterprise readiness: POSIX compliance; metadata replication; distributed name node
    – Protection & recovery: snapshot; asynchronous replication; backup
    – Security & integrity: access control lists
    – Ease of use: policy-based ingest
  • 30. Augment Hadoop storage with external storage (Stage 2). The Hadoop cluster (management nodes, Namenode nodes, JobTracker nodes, data nodes with HDFS) sits alongside a compute cluster (management nodes, job submission nodes, batch scheduler nodes) and external storage. Possibilities: allocate one of the Hadoop copies externally; move data back and forth between Hadoop and external storage.
  • 31. Another option: augment Hadoop with IBM General Parallel File System in a "Stage 2" configuration. Add a GPFS cluster (GPFS Storage Servers) alongside GPFS-FPO in the POSIX GPFS world; all nodes can write and read data. Integration with an existing or new external GPFS cluster; policy-based file movement in and out of the GPFS File Placement Optimizer pool; seamlessly integrate tape as part of the same namespace.
  • 32. Replace Hadoop DAS with an intelligent external Hadoop storage implementation (Stage 3). Example: GPFS Storage Server. Compute nodes plus Namenode and JobTracker nodes run against GPFS Storage Servers, with per-node GPFS paths (/gpfs/node1/dsk1 through /gpfs/node3/dskX) replacing HDFS.
  • 33. IBM Big Data Networked Storage Solution for Hadoop (Stage 3). Hadoop storage building blocks on other IBM storage (IBM NAS filer, NAS, SAN); Hadoop replication count = 2; improved Hadoop Namenode protection. http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
  • 34. Future evolution: Hadoop, storage, and the intersection of the two.
    – The continued evolution of Big Data workloads, Hadoop, and storage are all fast-moving targets. Already in mid-2013, we're seeing HDFS 2.0 offering HA, snapshots, and better resiliency (http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum), and we are seeing a huge adoption rate of Hadoop as inexpensive, deep storage.
    – More importantly, very soon flash storage costs will start to affect Hadoop reference architectures. By 2015, SSD costs will reach a point (15 cents/GB) that will shape future, yet-to-be-determined Hadoop deployments, and will start to move the Hadoop bottleneck from storage to the network interconnect. Whoever best solves that future network interconnect issue will be the next big Hadoop winner.
    – Today's intelligent Hadoop usage cases will continue to evolve quickly. Watch this space!
  • 35. IBM Hadoop Storage components, tools, offerings (Hadoop – It's Not Just Internal Storage)
  • 36. Big Data application stack. User interface (visualization) layer: reports, dashboards, mashups, search, ad hoc reporting, spreadsheets. Analytic process (analytics) layer: real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc.; this is the location of competitive advantage for analytics applications (IBM Big Data software). Infrastructure (cloud) layer: virtualization, central end-to-end management, control, and deployment of software, servers, and storage in a geographically dispersed environment; users, security authorization, OS software, servers, storage.
  • 37. IBM Big Data Analytics Solutions. Analytics on data in motion: IBM InfoSphere Streams for streaming data. Analytics on data at rest: IBM InfoSphere BigInsights for internet-scale data sets. Analytics on structured data: the traditional data warehouse. All fed by traditional/relational and non-traditional/non-relational data sources.
  • 38. Big Data infrastructure layer (the same stack as slide 36, highlighting the infrastructure layer): user interface (visualization) layer; analytic process (analytics) layer; infrastructure (cloud) layer with virtualization and central end-to-end management, control, and deployment of software, servers, and storage in a geographically dispersed environment; users, security authorization, OS software, servers, storage.
  • 39. IBM Direct Attached Storage solutions for Hadoop. Rack-level features: up to 20 System x3630 M4 nodes; up to 6 System x3550 M4 management nodes; up to 960 TB storage; up to 240 Intel Sandy Bridge cores; up to 3,840 GB memory; up to two 10 Gb Ethernet (IBM G8264-T) switches; scalable to multi-rack configurations. Available enterprise and performance features: redundant storage; redundant networking; high-performance cores; increased memory; high-performance networking. Reference architecture of high-volume x86 systems; integrated solution; PureData System for Hadoop. Each system has local storage.
  • 40. IBM external Big Data storage: GPFS Storage Server scalable building-block approach (GPFS software RAID; JBOD disk enclosures; x3650 M4 servers). The storage solution includes data servers, disk (2 TB or 3 TB NL-SAS, SSD), software, and InfiniBand / Ethernet, with no storage controllers.
    – GSS 24 "Light and Fast": 2 x3650 servers + 4 JBODs, 20U rack, 10 GB/sec
    – GSS 26 "Workhorse": 2 x3650 servers + 6 JBOD enclosures, 28U, 12 GB/sec
    – High-density option: 6 x3650 servers + 18 JBODs, two 42U standard racks, 36 GB/sec
  • 41. High volume and availability: mainframe and open storage for distributed systems; IBM shared storage infrastructure solutions. Storage management software: Tivoli Storage Productivity Center, Tivoli Storage FlashCopy Manager, Tivoli Storage Manager, Tivoli Key Lifecycle Manager, Virtual Storage Center. Optimized system storage: XIV, SONAS, DS8000, N series, Storwize V7000 Unified, Storwize V7000, DS3500/DCS3700, integrated solutions. Integrated innovation: storage virtualization software and SVC, Real-time Compression, deduplication, Easy Tier, IBM Active Cloud Engine, Linear Tape File System (LTFS). Data protection and retention: tape library TS3310, tape virtualization TS7740, tape automation TS3500, LTO 3, 4, and 5 tape drives, ProtecTIER TS7610/20/50.
  • 42. IBM solutions for a Big Data world: IBM Netezza; InfoSphere Streams; GPFS Storage Server; "unified" storage (Storwize V7000 Unified); "file" storage (Scale Out NAS, SONAS); "block" storage with 3 TB and 4 TB disks (Storwize V7000, XIV Gen3, DS8800) and solid state drives (SSD) (Storwize V7000, XIV Gen3, DS8800); IBM tape systems (TS3500, 2.7 exabytes).
  • 43. Learning points.
    – There are many, indeed most, cases where traditional Hadoop Direct Attached Storage is appropriate. However, there are many intelligent usage cases where Hadoop external shared storage, intelligently implemented, brings significant value.
    – Stage 1: intelligently direct-attach larger external storage arrays, or external file systems, for Hadoop primary storage while still preserving Direct Attached Storage data locality and the ability for internet scale, while using external storage to bring desired function or reduce the number of Hadoop copies.
    – Stage 2: augment Hadoop DAS primary storage with a second storage layer (external file system, NAS, or SAN) as a data protection or archival layer, intelligently allocating, importing, and exporting data appropriately.
    – Stage 3: directly replace primary node-based DAS with external shared storage (file system, NAS, or SAN), appropriate for certain clusters and certain Hadoop environments where network rebalancing, adapter/network costs, and scale are in line with shared storage benefits.
    – Most importantly, the Hadoop and storage topic is fast moving and constantly evolving. Soon, adoption of flash as Hadoop primary storage will significantly change Hadoop dynamics and will move the Hadoop bottleneck from storage to the network.
  • 44. Thank you!
  • 45. Trademarks and disclaimers. © IBM Corporation 2011. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark, of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both, and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind. The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources, and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function, or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good-faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system, or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography. Photographs shown may be engineering prototypes. Changes may be incorporated in production models. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml. ZSP03490-USEN-00
  • 46. Appendix
  • 47. Recommend you download and read this very informative IBM book, "Understanding Big Data" (published April 2012; free download; well worth reading to understand the components of Big Data and how to exploit them).
    – Part I: The Big Deal About Big Data. Chapter 1: What is Big Data? Hint: You're a Part of It Every Day. Chapter 2: Why Big Data Is Important. Chapter 3: Why IBM for Big Data.
    – Part II: Big Data: From the Technology Perspective. Chapter 4: All About Hadoop: The Big Data Lingo Chapter. Chapter 5: IBM InfoSphere BigInsights: Analytics for "At Rest" Big Data. Chapter 6: IBM InfoSphere Streams: Analytics for "In Motion" Big Data.
    Download your free copy here: http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
  • 48. IBM InfoSphere BigInsights = the IBM Hadoop distribution, built on core Hadoop. BigInsights Basic Edition: free download with web support; limited to <= 10 TB of data (optional 24x7 paid support, fixed-term license). BigInsights Enterprise Edition: enterprise-grade features with tiered, terabyte-based pricing: easy installation and programming, analytics tooling/visualization, administration tooling, development tooling, high availability, flexible storage, recoverability, security. Professional services offerings: QuickStart, Bootcamp, Education, Custom Development.