Lab Validation Report

NetApp Open Solution for Hadoop
Open Source Data Analytics with Enterprise-class Storage Services

By Brian Garrett, VP, ESG Lab, and Julie Lockner, Sr. Analyst & VP, Data Management

May 2012

© 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
Contents

Introduction
  Background
  NetApp Open Solution for Hadoop
ESG Lab Validation
  Getting Started
  Performance and Scalability
  Efficiency
  Recoverability
ESG Lab Validation Highlights
Issues to Consider
The Bigger Truth
Appendix

ESG Lab Reports

The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable features and functions of products, show how they can be used to solve real customer problems, and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments. This ESG Lab report was sponsored by NetApp.

All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.
Introduction

This ESG Lab report presents the results of hands-on testing of the NetApp Open Solution for Hadoop, a highly reliable, ready-to-deploy, scalable storage solution for enterprise Hadoop.

Background

Driven by unrelenting data volume growth, the need for real-time data processing and data analytics, and the increasing complexity and variety of data sources, ESG expects broad adoption of MapReduce data processing and analytics frameworks over the next two to five years. These frameworks require new approaches for storing, integrating, and processing "big data." ESG defines big data as any data set that exceeds the boundaries and sizes of traditional IT processing; big data sets can range from tens of terabytes to hundreds of terabytes in size.

Data analytics is a top priority for forward-looking IT organizations. In fact, a recent ESG survey indicates that more than half (54%) of enterprise organizations (i.e., those with 1,000 or more employees) consider data analytics a top-five IT priority, and 38% plan on deploying a new data analytics solution in the next 12-18 months. A growing number of IT organizations are using the open source Apache Hadoop MapReduce framework as the foundation for their big data analytics initiatives. As shown in Figure 1, more than 50% of the organizations polled by ESG are using Hadoop, planning to deploy it in the next 12 months, or interested in doing so.[1]

[Figure 1. Plans to Implement a MapReduce Framework such as Apache Hadoop. "What are your organization's plans to implement a MapReduce framework (e.g., Apache Hadoop) to address data analytics challenges?" (Percent of respondents, N=270): already using, 8%; plan to implement within 12 months, 13%; no plans to implement at this time but interested, 35%; no plans to implement, 33%; don't know, 11%. Source: Enterprise Strategy Group, 2011.]

As with any exciting and emerging technology, big data analytics has its challenges. Management is an issue because the platforms are expensive and require new server and storage purchases, integration with existing data sets and processes, training in new technologies, an analytics toolset, and people with the expertise to deal with it all. When IT managers were asked about their data analytics challenges, 47% named data integration complexity, 34% cited a lack of the skills necessary to properly manage large data sets and derive value from them, 29% said data set sizes limit their ability to perform analytics, and 28% cited difficulty in completing analytics within a reasonable period of time.

[1] Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011.
Looking beyond the high-level organizational challenges associated with a big data analytics initiative, the Hadoop framework adds technology and implementation issues that need to be considered. The common reference architecture for a Hadoop cluster leverages commodity server nodes with internal hard drives; for conventional data centers with mature ITIL processes, this introduces two challenges. First, data protection is, by default, handled in the Hadoop software layer: every time a file is written to the Hadoop Distributed File System (HDFS), two additional copies are written in case of a disk drive or data node failure. This not only impacts data ingest and throughput performance, but also reduces disk capacity utilization. Second, high availability is limited by an existing single point of failure in the Hadoop metadata repository. This single point of failure will eventually be addressed by the Hadoop community but, in the meantime, analytics downtime due to a name node failure is a key concern. As shown in Figure 2, a majority of ESG survey respondents (55%) indicate that three hours or less of data analytics platform downtime would result in a significant revenue loss or other adverse business impact.[2]

[Figure 2. Data Analytics Downtime Tolerance. "Please indicate the amount of downtime your organization's data analytics platforms can tolerate before your organization experiences significant revenue loss or other adverse business impact." (Percent of respondents, N=399): none, 6%; less than 1 hour, 10%; 1 hour to 3 hours, 26%; 4 hours to 10 hours, 18%; 11 hours to 24 hours, 10%; 1 day to 3 days, 21%; more than 3 days, 4%; don't know, 6%. Source: Enterprise Strategy Group, 2011.]

NetApp, in collaboration with leading Hadoop distribution vendors, is working to develop reference architectures, best practices, and solutions that address these challenges while maximizing the speed, efficiency, and availability of open source Hadoop deployments.

[2] Source: ESG Survey, The Convergence of Big Data Processing, Hadoop, and Integrated Infrastructure, December 2011.
NetApp Open Solution for Hadoop

Hadoop is a significant emerging open source technology for solving business problems around large volumes of mostly unstructured data that cannot be analyzed with traditional database tools. The NetApp Open Solution for Hadoop combines the power of the Hadoop framework with flexible storage and the professional support and services of NetApp and its partners to deliver higher Hadoop cluster availability and efficiency. Based on a reference architecture, it focuses on scaling Hadoop from its departmental origins to an enterprise infrastructure with independent compute and storage scaling, faster cluster ingest, and faster job completion under failure conditions.

The NetApp Open Solution for Hadoop extends the value of the open source Hadoop framework with enterprise-class storage and services. As shown in Figure 3, NetApp FAS2040 and E2660 storage replace the traditional internal (DAS) hard drives within a Hadoop cluster. Compute and storage resources are decoupled with SAS-attached NetApp E2660 arrays, and the recoverability of a failed Hadoop name node is improved with an NFS-attached FAS2040. The storage components are completely transparent to the Hadoop distribution and require no modification to the native, underlying Hadoop platform. Note that while the FAS2040 was used for this testing configuration, any other product in the FAS storage family can also be used.

[Figure 3. NetApp Open Solution for Hadoop]

The NetApp Open Solution for Hadoop includes:

- NetApp E2660s with hardware RAID and hot-swappable disks, which increase efficiency, performance, scalability, availability, serviceability, and manageability compared to a traditional Hadoop deployment with internal hard drives and replication at the application layer. With data protected by hardware RAID, higher storage utilization rates can be achieved by reducing the default Hadoop replication count.
- A NetApp FAS2040 with shared NFS-attached capacity, which accelerates recoverability after a primary name node failure compared to a traditional Hadoop deployment with internal hard drives.
- A high speed 10 Gbps Ethernet network and direct-attached 6 Gbps SAS E2660s with network-free hardware RAID, which increase the performance, scalability, and efficiency of the Hadoop infrastructure.
- High capacity E2660 disk arrays and a building-block design that decouples the compute and storage layers, providing near-linear scalability that is ideally suited for big data analytics applications with extreme compute and storage capacity requirements.
- A field-tested solution comprised of an open source Apache Hadoop distribution and enterprise-class NetApp storage, with professional design services and support, which reduces risk and accelerates deployment.
ESG Lab Validation

ESG Lab performed hands-on evaluation and testing of the solution at a NetApp facility in Research Triangle Park, North Carolina. Testing was designed to demonstrate that the NetApp Open Solution for Hadoop can perform and scale linearly as data volumes and load increase, can recover from both single and double drive failures with no disruption to a running Hadoop job, and can quickly recover from a name node failure. The performance and scalability benefits of using network-free hardware RAID and a lower Hadoop replication count were evaluated as well. Testing was performed using open source software, workload generators, and monitoring tools.

Getting Started

A Hadoop cluster with one name node, one secondary name node, one job tracker node, and up to 24 data nodes was used during ESG Lab testing. Rack-mounted servers with quad-core Intel Xeon processors and 48GB of RAM were connected to six NetApp E2660s, with the name node and secondary name node also connected to a single NetApp FAS2040. Each NetApp E2660 was filled with 60 2TB 7,200 RPM NL-SAS drives, for a total raw capacity of 720TB across the six arrays. A building-block approach was used, with each group of four data nodes sharing an E2660 through 6 Gbps SAS connections. A 10 Gbps Ethernet network was used for the cluster interconnect, with 1 Gbps Ethernet carrying the NFS connections to the name and job tracker nodes. Cloudera's Distribution for Hadoop was installed over the Red Hat Enterprise Linux operating system on each of the nodes in the cluster.[3]

[Figure 4. The ESG Lab Test Bed]

[3] Configuration details are listed in the Appendix.
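Per the Appendix, each data node consumes its slice of the shared E2660 as SAS block devices carrying a local XFS file system, mounted where dfs.data.dir expects HDFS block storage (see Table 8). The following is a minimal preparation sketch for one data node, not the solution's installation procedure; the device names are hypothetical:

    # Format the two E2660 LUNs presented to this data node (device names hypothetical)
    mkfs.xfs /dev/sdb
    mkfs.xfs /dev/sdc

    # Mount them at the locations referenced by dfs.data.dir in Table 8
    mkdir -p /disk1 /disk2
    mount -t xfs /dev/sdb /disk1
    mount -t xfs /dev/sdc /disk2
    mkdir -p /disk1/data /disk2/data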
Performance and Scalability

Hadoop uses a shared-nothing programming paradigm and a massively parallel clustered architecture to meet the extreme compute and capacity requirements of big data analytics applications. Aiming to augment the performance and scalability of traditional database architectures, Hadoop brings the compute power to the data: the name node and job tracker handle distribution and orchestration while the data nodes do all of the analytical processing work.

HDFS is the distributed network file system used by the nodes in a Hadoop cluster. Software mirroring is the default data protection scheme within HDFS: for every block of data written into the file system, an additional two copies are written to other nodes, for a total of three copies. This is referred to as a replication count of three, and it is the default for most Hadoop implementations that rely on internal hard drive capacity. This software mirroring increases the processing load on data nodes and the utilization of the shared network between nodes. To put this into perspective, consider what happens when a 2TB data set is loaded into a Hadoop cluster with a default replication count of three: 2TB of application data results in 6TB of raw data being processed and moved over the network.

A NetApp E2660 with hardware RAID reduces the processing and network overhead associated with software mirroring, which increases the performance and scalability of a Hadoop cluster. With up to 15 high capacity, high performance disk drives (2TB, 7.2K RPM NL-SAS) available to each data node, the performance of a Hadoop cluster is magnified compared to a traditional cluster with internal SATA drives. A right-sized building-block approach provides near-linear scalability as compute and storage capacity are added to a cluster.

ESG Lab Testing

ESG Lab performed a series of tests to measure the performance and scalability of a 24-data-node NetApp Open Solution for Hadoop cluster (27 nodes in total: 24 data nodes, one name node, one secondary name node, and one job tracker node). The TeraGen utility, included in the open source Hadoop distribution, was used to simulate the loading of a large analytic data set. Testing was performed with cluster sizes of 8, 16, and 24 data nodes and a Hadoop replication count of two. Testing began with the creation of a 1TB data set on an 8-data-node cluster; the test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The results are presented in Figure 5 and Table 1.

[Figure 5. Data Loading Performance Analysis]
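For orientation, TeraGen ships in the stock Hadoop examples jar, so a load of the shape described above can be launched with a command along the following lines. This is a sketch for a CDH3-era cluster: the jar path and HDFS output path are illustrative, and the map count is the 24-data-node value from Table 9 in the Appendix. TeraGen writes 100-byte rows, so 30 billion rows is roughly 3TB.

    # Generate ~3TB of synthetic data (30,000,000,000 x 100-byte rows).
    # Jar and output paths are illustrative; mapred.map.tasks=168 matches Table 9.
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        -D mapred.map.tasks=168 \
        30000000000 /benchmarks/teragen-3TB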
Table 1. Performance Scalability Test Results: Data Loading with the TeraGen Utility

  Data nodes                        8          16         24
  NetApp E2660 arrays               2          4          6
  NetApp E2660 drives               120        240        360
  Usable capacity (TB)              180        360        720
  Hadoop data set size (TB)         1          2          3
  Job completion time (hh:mm:ss)    00:10:06   00:09:52   00:10:18
  Aggregate throughput (MB/sec)     1,574      3,222      4,630

What the Numbers Mean

- The NetApp Open Solution for Hadoop was designed to scale performance in near-linear fashion as data nodes and E2660 disk arrays are added to the cluster. This modular building-block approach can also be used to provide consistent levels of performance as a data set grows.
- The job completion time for each of the TeraGen runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays were increased linearly. In this example, the solution scaled up to 24 data nodes and six E2660 arrays, with a total of 360 drives and 720TB of usable disk capacity.
- As the number of data nodes and the volume of data generated increased linearly, the completion time remained flat at approximately ten minutes (+/- 3%). This demonstrates the linear performance scalability of the NetApp solution.
- A job completion time of ten minutes for the creation of a 3TB data set indicates that the 24-node NetApp solution sustained a high aggregate throughput rate of 4.630 GB/sec. At that rate, 16.7TB of data can be created per hour.

Performance testing continued with a similar series of tests designed to measure the scalability of the solution when processing long-running data analytics jobs. The open source TeraSort utility included in the Hadoop distribution was used during this phase of testing. Using the data created with TeraGen, TeraSort was tested with cluster sizes of 8, 16, and 24 data nodes, with a map count of seven and a reducer count of five per data node. Testing began with a sort of the 1TB data set on an 8-data-node cluster; the test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The elapsed job run time was recorded after each test, and each test began with a freshly created TeraGen data source. The results are presented in Table 2 and Figure 6.

Table 2. Performance Scalability Test Results: Data Analytics with the TeraSort Utility

  Data nodes                        8          16         24
  Hadoop data set size (TB)         1          2          3
  Job completion time (hh:mm:ss)    00:29:19   00:30:19   00:30:21
  Aggregate throughput (MB/sec)     542        1,049      1,571
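The sort phase uses the companion terasort program from the same examples jar. A matching sketch, with the reduce-task count of 120 taken from the 24-data-node column of Table 9 and the paths again illustrative:

    # Sort the generated data set; elapsed job time is the figure of merit.
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort \
        -D mapred.reduce.tasks=120 \
        /benchmarks/teragen-3TB /benchmarks/terasort-3TB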
[Figure 6. Data Analytics Performance Analysis]

What the Numbers Mean

- The job completion time for each of the TeraSort runs was recorded as the amount of data sorted, the number of data nodes, and the number of E2660 arrays were increased linearly.
- As the number of data nodes and the volume of data increased linearly, job completion time remained flat at approximately 30 minutes (+/- 2%).
- As shown in Figure 6, aggregate analytics throughput scaled linearly as data nodes and E2660 arrays were added to the cluster.

Why This Matters

A growing number of organizations are deploying big data analytics platforms to improve the efficiency and profitability of their businesses. ESG research indicates that data analytics and managing data growth are among the top five IT priorities in more than 50% of organizations. When asked about their data analytics challenges, 29% of respondents said data set sizes limit their ability to perform analytics, and 28% reported difficulty in completing analytics within a reasonable period of time.

The NetApp Open Solution for Hadoop combines the compute scalability of a shared Hadoop cluster with the storage efficiency and scalability of network-free hardware RAID. Because the solution is designed to run with a Hadoop replication setting lower than the default, and because it standardizes on a 10GbE network, there is less chance of a network bottleneck as data volumes grow than in a traditional Hadoop deployment. ESG Lab confirmed that NetApp has created a big data analytics solution with near-linear performance scalability that dwarfs the capabilities of traditional databases and disk arrays: testing with a 24-node cluster and a 3TB data set scaled up to 4.63 GB/sec of aggregate load throughput and 1.57 GB/sec of aggregate analytics throughput.
Efficiency

The NetApp Open Solution for Hadoop improves capacity and performance efficiency compared to a traditional Hadoop deployment. With protection from disk failures provided by NetApp E2660s with hardware RAID, the Hadoop default replication count of three can be reduced to two. NetApp E2660s with network-free hardware RAID 5 (6+1) and a Hadoop replication count of two increase storage capacity utilization by 22% compared to a Hadoop cluster with internal drives and a default replication count of three. (Storing 1TB of data with a replication count of two on RAID 5 (6+1) consumes 2 x 7/6, or roughly 2.33TB, of raw disk, versus 3TB with triple replication: a 22% reduction.) Network-free hardware RAID also increases the performance and scalability of the cluster by reducing the amount of mirrored data flowing over the network.

ESG Lab Testing

The TeraGen tests were repeated with a replication count of three as the size of the cluster was increased from eight to 24 data nodes. The elapsed job times were compared with those collected earlier using the NetApp-enabled replication count of two. The results are summarized in Figure 7 and Table 3.

[Figure 7. Increasing Hadoop Cluster Efficiency with the "NetApp Effect"]

Table 3. Performance Efficiency Test Results: Data Loading with TeraGen

  Data nodes                                       8          16         24
  Hadoop data set size (TB)                        1          2          3
  Replication count 2: completion (hh:mm:ss)       00:10:06   00:09:52   00:10:18
  Replication count 2: throughput (MB/sec)         1,573      3,221      4,629
  Replication count 3: completion (hh:mm:ss)       00:15:32   00:16:11   00:16:44
  Replication count 3: throughput (MB/sec)         1,023      1,964      2,849
What the Numbers Mean

- As the number of data nodes and the volume of data generated increased linearly, job completion time remained flat at approximately ten minutes (+/- 2%) with the NetApp-enabled replication count of two.
- Job completion time increased by 50% or more with the Hadoop default replication count of three, due to the extra processing and network overhead associated with triple mirroring.
- The increase in cluster efficiency (the "NetApp effect") not only reduced job completion times, but also increased aggregate throughput. As shown in Figure 7, the effect was magnified as the size of the cluster and the amount of network traffic grew: the replication-count-two results scale linearly, and the gap between them and the replication-count-three results widens as the cluster grows.
- The NetApp effect resulted in a peak aggregate throughput improvement of 62.5% during the 24-node test (4.629 vs. 2.849 GB/sec).

Why This Matters

Data growth shows no signs of abating. As data accumulates, there is a corresponding burden on IT to maintain acceptable levels of performance, whether that is measured by the speed with which an application responds, the ability to aggregate and deliver data, or the ultimate business value of information. Management teams are recognizing that their growing data stores bring massive, and largely untapped, potential to improve business intelligence. At the same time, they also recognize the challenges that big data poses to existing analytics tools and processes, as well as the impact data growth is having on the bottom line in the form of increased requirements for storage capacity and compute power. It is for these reasons that IT managers are struggling to meet the conflicting goals of keeping up with explosive data growth and lowering the cost of delivering data analytics services.

The default replication count for Hadoop is three, and it is strongly recommended for data protection in Hadoop configurations with internal disk drives. Replication is also needed for cluster self-healing, the term used to describe Hadoop's ability to ensure job completion in the event of task failure by reassigning failed tasks to other nodes in the cluster; this is made possible by the replication of blocks throughout the cluster.

With the NetApp Open Solution for Hadoop, replication is not required for data protection, since data is protected with hardware RAID. As a result, a replication count of two is sufficient for self-healing. Hadoop MapReduce jobs that write data to HDFS, such as data ingest, benefit from the lower replication count: they generally run faster and require less storage space than on a Hadoop cluster with internal disk storage and a replication count of three. During ESG Lab testing with a 24-node cluster, the NetApp effect reduced disk capacity requirements by 22% while increasing aggregate data load performance by 62%. In other words, organizations can manage more data at a lower cost with NetApp.
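Operationally, the lower replication count is an ordinary HDFS setting rather than a code change: dfs.replication in hdfs-site.xml (set to 2 in this solution, per Table 8 in the Appendix) governs files written from then on, and existing files can be adjusted from the command line. A sketch, with an illustrative path:

    # Re-replicate an existing data set down to a count of two.
    # -w waits until re-replication completes before returning.
    hadoop fs -setrep -w 2 /benchmarks/teragen-3TB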
Recoverability

When a name node fails, a Hadoop administrator needs to recover the metadata and restart the Hadoop cluster using a standby, secondary name node. This single point of failure is being addressed by the open source Hadoop community, but a fix was not yet generally available when this report was published.

In a Hadoop cluster with internal storage, when a disk drive fails, the entire data node is "blacklisted" and no longer available to execute tasks. This can result in degraded performance and the need for a Hadoop administrator to take the data node offline, service and replace the failed component, and then redeploy. This process can take several hours to complete.

The NetApp Open Solution for Hadoop increases the availability and recoverability of a Hadoop cluster in three significant ways:

1. Recovery from a name node failure is accelerated dramatically using an NFS-attached FAS2040 instead of internal storage on the primary and secondary name nodes. If and when a name node failure occurs, a quick recovery from the NFS-attached FAS2040 can restore analytics services in minutes instead of hours.
2. NetApp E2660s with hardware RAID provide transparent recovery from hard drive failures. The data node is not blacklisted, and any job tasks that were running continue uninterrupted.
3. The NetApp E2660 management console (SANtricity) provides a centralized GUI for monitoring and managing drive failures. This reduces the complexity associated with manually recovering from drive failures in a Hadoop cluster with internal drives.

ESG Lab Testing

A variety of errors were tested on a 24-data-node Hadoop cluster running a 3TB TeraSort job. As shown in Figure 8, errors were injected to validate that jobs continue to run after E2660 hard drive failures and data node drive failures, and that the cluster can be quickly recovered after a name node failure. A dual drive failure was also tested to simulate and measure job recovery time after an internal hard drive failure in a traditional Hadoop cluster.

[Figure 8. ESG Lab Error Injection Testing]
Disk Drive Failure

To simulate a disk drive failure, a drive was taken offline while a Hadoop TeraSort job was running.[4] The Hadoop job tracker web interface was used to confirm that the job completed successfully, and the NetApp E2660 SANtricity management console was used to identify which drive had failed and to monitor automatic recovery to a hot spare. A SANtricity screenshot taken shortly after the drive failed is shown in Figure 9.

[Figure 9. Transparent Recovery from a Hard Drive Failure with E2660 Hardware RAID]

Another TeraSort job was started and, while it was running, a lab manager physically replaced the failed hard drive. The TeraSort job completed without error, as expected. A further TeraSort job was then started and a dual drive error was introduced to simulate and measure job completion time after a hard drive failure in a traditional Hadoop data node.[5] As shown in Table 4, the TeraSort job took slightly longer (5.7%) to complete during the single drive failure with hardware RAID recovery on the NetApp E2660. The simulated internal drive failure more than doubled the job completion time (a 237.9% delta) as the data node was blacklisted and its tasks were restarted on surviving nodes.

Table 4. Drive Failure Recovery Results

  Test Scenario                      Job Completion Time   Throughput   Completion Time Delta
                                     (hh:mm:ss)            (MB/sec)     (vs. Healthy Cluster)
  Healthy cluster                    00:30:21              1,821        N/A
  NetApp E2660 drive failure         00:32:06              1,486        -5.7%
  Internal data node drive failure   01:12:13              660          -237.9%

[4] Drive failures were introduced when the Hadoop job tracker indicated that the TeraSort job was 80% complete.
[5] In a Hadoop cluster using internal disk drives, a local file system is created on each disk; if a disk fails, that file system fails. A local disk failure was simulated during ESG Lab testing by failing two disk drives in the same RAID 5 volume group. All data on that file system was lost, and all tasks running against it failed. The job tracker detected this and reassigned the failed tasks to other nodes where copies of the lost blocks existed. With the NetApp solution, a single disk failure has very little impact on running tasks, and all data in the local file system using that LUN remains available as the RAID rebuild begins. With direct attached disks, if a single disk fails, a file system fails as described above.
The screenshot in Figure 10 shows the Hadoop job tracker status after the successful completion of the TeraSort job that followed the simulated internal hard drive failure. The non-zero failed/killed counts indicate the number of map and reduce tasks that were restarted on surviving nodes (439 and 5, respectively).

[Figure 10. Job Completion after a Simulated Internal Hard Drive Failure]

The screenshot in Figure 11 summarizes the status of the Hadoop Distributed File System after the data node with the simulated internal hard drive failure was blacklisted. These errors did not occur with the E2660 drive failure, where the Hadoop job ran uninterrupted.

[Figure 11. Hadoop Self-healing in Action: Cluster Summary after a Simulated Internal Drive Failure]
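The cluster summary shown in Figure 11 is also available from the command line, which is handy for scripted health checks; the standard HDFS administration report lists live and dead data nodes along with configured and remaining capacity:

    # Data node liveness and capacity, as summarized in Figure 11
    hadoop dfsadmin -report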
Name Node Failure

The Hadoop name node server was halted while a TeraSort job was running, with the goal of demonstrating how an NFS-attached NetApp FAS2040 can be used to quickly recover when a name node goes offline in a Hadoop cluster. As shown in Figure 12, the job failed as expected after 13 minutes and 23 seconds. After the job failed, the procedure outlined in the NetApp Open Solution for Hadoop Solutions Guide (http://media.netapp.com/documents/tr-3969.pdf) was used to copy the name node metadata to the secondary name node and start the name node daemon on the secondary name node server.

[Figure 12. Job Failure after a Name Node Failure: NetApp FAS2040 Recovery Begins]

Five minutes after the recovery process was started, the Hadoop cluster was up and running. An fsck of the HDFS file system indicated that the cluster was healthy, and a restarted TeraSort job completed without error.
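The solutions guide linked above is the authority for the exact recovery procedure. As a rough sketch of its shape, assuming CDH3-style service scripts and the dfs.name.dir layout from Table 8 (where /mnt/fsimage_bkp is the NFS-mounted FAS2040 copy of the name node metadata):

    # On the secondary name node server. The FAS2040 NFS export already holds a
    # current copy of the metadata (dfs.name.dir lists /mnt/fsimage_bkp).
    mount | grep fsimage_bkp                  # confirm the NFS export is mounted

    # Seed the local metadata directory from the NFS-protected copy
    cp -a /mnt/fsimage_bkp/. /local/hdfs/namedir/

    # Start the name node daemon here (CDH3 package service name), then verify
    /etc/init.d/hadoop-0.20-namenode start
    hadoop fsck /

Repointing clients at the new name node (the fs.default.name address in Table 6) is environment-specific and is covered by the guide.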
Why This Matters

A majority of respondents to a recent ESG survey indicated that three hours or less of data analytics platform downtime would result in significant revenue loss or other adverse business impact. The single point of HDFS failure in the open source Hadoop distribution that was generally available as of this writing can lead to three or more hours of data analytics platform unavailability.

ESG Lab has confirmed that the NetApp Open Solution for Hadoop reduces name node recovery time from hours to minutes (five minutes during ESG Lab testing). NetApp E2660s with hardware RAID dramatically improved recoverability after simulated hard drive failures. The complexity and performance impact of a blacklisted data node was avoided: a 3TB TeraSort analytics job on the NetApp solution completed more than twice as quickly as one run during a simulated internal hard drive failure.

ESG Lab Validation Highlights

- The capacity and performance of the NetApp solution scaled linearly when data nodes and NetApp E2660 storage arrays were added to a Hadoop cluster. ESG Lab tested up to 24 data nodes and six NetApp E2660 arrays with 720TB of usable disk capacity.
- Load performance testing with the TeraGen utility delivered linear performance scalability; a 24-node cluster sustained a high aggregate load throughput rate of 4.630 GB/sec.
- Big data analytics performance testing with the TeraSort utility yielded linear performance scalability as data nodes and E2660 arrays were added.
- Network-free hardware RAID and a lower Hadoop replication count reduced network overhead, which increased the aggregate performance of the cluster. A peak aggregate throughput improvement of 62.5% was recorded during the 24-node test (4.629 vs. 2.849 GB/sec).
- A MapReduce job running during a simulated internal drive failure took more than twice as long (225%) to complete as one running during the failure of a hardware-RAID-protected E2660 drive.
- An NFS-attached NetApp FAS2040 holding the name node metadata was used to recover from a primary name node failure in five minutes, compared to multiple hours in a traditional configuration.

Issues to Consider

- While the results demonstrate that the NetApp Open Solution for Hadoop is ideally suited to the extreme compute and storage performance needs of big data analytics loads and long-running queries, applications with lots of small files, multiple writers, or many users with low response time requirements may be better served by traditional relational databases and storage solutions.
- The single point of failure in the Hadoop distribution used during this ESG Lab Validation is being fixed in the open source community, but the fix was not yet available and therefore was not tested as part of ESG Lab's assessment. Even so, future releases of Hadoop that resolve the name node failure problem are still expected to rely on NFS shared storage as a functional requirement, and NetApp, with its FAS family, is an industry leader in NFS shared storage.
- The test results presented in this report are based on benchmarks deployed in a controlled environment. Due to the many variables in each production data center environment, capacity planning and testing in your own environment are recommended.
- A growing number of best practices, tuning guidelines, and proof points are available for reference when planning, deploying, and tuning the NetApp Open Solution for Hadoop. To learn more, visit: http://www.netapp.com/hadoop.
The Bigger Truth

Whether measured by increased revenues, market share gains, reduced costs, or scientific breakthroughs, data analytics has always played a key role in the ability to harness value from electronically stored information. What has changed recently is that, as more business processes have become automated, information that was once stored in separate online and offline repositories and formats is now readily available for amalgamation and analysis to increase business insight and enhance decision support. Business executives are asking more of their data and expecting faster and more impactful answers. The result is an ever-increasing priority on data analytics activities and, subsequently, more pressure on existing business analyst and IT teams to deliver.

Hadoop is a powerful open source framework for data analytics. It is an emerging and fast-growing solution that is considered one of the most impactful technology innovations since HTML. While ESG research indicates that a small number of organizations are using Hadoop at this time, interest and plans for adoption over the next 12-18 months are high (48%).

For those new to Hadoop, there is a steep learning curve. Very few enterprise applications are built to run on massively parallel clusters, so there is much to learn. The NetApp Open Solution for Hadoop is a tested and proven reference architecture and storage solution that reduces the risk and time associated with Hadoop adoption.

NetApp has embraced the open source Hadoop model and is working with major distributors to support open source Hadoop software running on industry standard servers. Instead of promoting a proprietary clustered file system, NetApp has embraced the open source Hadoop file system (HDFS). Instead of promoting SAN or NAS attached storage for data nodes, NetApp has embraced direct attached storage. Using SAS direct-connected NetApp E2660 arrays with hardware RAID protection, the NetApp solution improves performance, scalability, and availability compared to typical internal hard drive Hadoop deployments. Thanks to an NFS-attached NetApp FAS2040 for shared access to metadata, recovery from a Hadoop name node failure is reduced from hours to minutes.

With nearly 5 GB/sec of aggregate TeraGen load performance on a 24-node cluster, ESG Lab has confirmed that the NetApp solution provides excellent near-linear performance scalability that dwarfs the capabilities of traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the efficiency and performance of the cluster by 62.5% compared to a traditional Hadooop deployment with triple mirroring. The value of transparent RAID recovery was obvious after drive failures were simulated: the performance impact on a long-running sort job was less than 6%, versus a job that took more than twice as long to complete after a simulated internal drive failure that blacklisted a Hadoop data node.

If you're looking to accelerate the delivery of insight to your business with an enterprise-class big data analytics infrastructure, ESG Lab recommends a close look at the NetApp Open Solution for Hadoop: it reduces risk with a storage solution that delivers the reliability, fast deployment, and scalability that open source Hadoop needs in the enterprise.
Appendix

The configuration of the test bed used during the ESG Lab Validation is summarized in Table 5.

Table 5. Configuration Summary

Servers
  HDFS data nodes:            24 servers, each with a quad-core Intel Xeon CPU and 48GB RAM
  HDFS name node:             1 server with a quad-core Intel Xeon CPU and 48GB RAM
  HDFS secondary name node:   1 server with a quad-core Intel Xeon CPU and 48GB RAM
  HDFS job tracker:           1 server with a quad-core Intel Xeon CPU and 48GB RAM

Network
  10 GbE host connect:        one 10GbE LAN connection for all data nodes, the name node, the secondary name node, and the job tracker
  10 GbE switched fabric:     Cisco Nexus 5010, 10 GigE, jumbo frames (MTU=9000)

Storage
  HDFS data node storage:     6 NetApp E2660s, 6Gb SAS host connect, 6+1 RAID 5, 2TB 7.2K RPM NL-SAS drives, 360 drives total, firmware version 47.77.19.99
  HDFS name node storage:     1 NetApp FAS2040, 1GbE NAS host connect, 6 disks, 1TB each, 7.2K RPM, Data ONTAP 8.0.2 (7-Mode)
  OS boot drives:             local 1TB 7.2K RPM SATA drive in each node

Software
  Operating system:           Red Hat Enterprise Linux version 5, update 6 (RHEL 5.6)
  Analytics platform:         Cloudera Hadoop (CDH3u2)
  Local file system:          XFS
  Map/reduce tasks per data node: 7/5

Table 6 lists the differences between the Hadoop core-site.xml defaults and the settings used during ESG Lab testing. Each entry gives the purpose, then (actual; default).

Table 6. Hadoop core-site Settings

  fs.default.name: name of the default file system, specified as a URI (IP address or hostname of the name node along with the port to be used). (actual hdfs://10.61.189.64:8020/; default file:///)
  webinterface.private.actions: enables or disables certain management functions within the Hadoop web user interface, including the ability to kill jobs and modify job priorities. (actual true; default false)
  fs.inmemory.size.mb: memory in MB used for merging map outputs during the reduce phase. (actual 200; default 100)
  io.file.buffer.size: size in bytes of the read/write buffer. (actual 262144; default 4096)
  topology.script.file.name: script used to resolve a slave node's name or IP address to a rack ID; used to invoke Hadoop rack awareness. (actual /etc/hadoop/conf/topology_script; default null, which gives all slaves the rack ID "/default-rack")
  topology.script.number.args: maximum number of arguments sent to the topology script at one time. (actual 1; default 100)
  hadoop.tmp.dir: Hadoop temporary directory storage. (actual /home/hdfs/tmp; default /tmp/hadoop-${user.name})
Table 7 lists the differences between the Linux sysctl.conf defaults and the settings used during ESG Lab testing. Each entry gives the description, then (actual; default).

Table 7. Linux sysctl.conf Settings

  net.ipv4.ip_forward: controls IP packet forwarding. (actual 0; default 0)
  net.ipv4.conf.default.rp_filter: controls source route verification. (actual 1; default 0)
  net.ipv4.conf.default.accept_source_route: do not accept source routing. (actual 0; default 1)
  kernel.sysrq: controls the system request debugging functionality of the kernel. (actual 0; default 1)
  kernel.core_uses_pid: controls whether core dumps append the PID to the core filename; useful for debugging multithreaded applications. (actual 1; default 0)
  kernel.msgmnb: controls the default maximum size of a message queue, in bytes. (actual 65536; default 16384)
  kernel.msgmax: controls the maximum size of a message, in bytes. (actual 65536; default 8192)
  kernel.shmmax: controls the maximum shared segment size, in bytes. (actual 68719476736; default 33554432)
  kernel.shmall: controls the maximum amount of shared memory, in pages. (actual 4294967296; default 2097512)
  net.core.rmem_default: default OS receive buffer size. (actual 262144; default 129024)
  net.core.rmem_max: maximum OS receive buffer size. (actual 16777216; default 131071)
  net.core.wmem_default: default OS send buffer size. (actual 262144; default 129024)
  net.core.wmem_max: maximum OS send buffer size. (actual 16777216; default 131071)
  net.core.somaxconn: maximum number of queued socket connections the kernel will accept at one time; set on the name node, secondary name node, and job tracker. (actual 1000; default 128)
  fs.file-max: total number of file descriptors. (actual 6815744; default 4847448)
  net.ipv4.tcp_timestamps: disables TCP timestamps when set to 0. (actual 0; default 1)
  net.ipv4.tcp_sack: enables selective ACK for TCP. (actual 1; default 1)
  net.ipv4.tcp_window_scaling: enables TCP window scaling. (actual 1; default 1)
  kernel.shmmni: maximum number of shared memory segments. (actual 4096; default 4096)
  kernel.sem: maximum number and size of semaphore sets that can be allocated. (actual 250 32000 100 128; default 250 32000 32 128)
  fs.aio-max-nr: maximum number of concurrent I/O requests. (actual 1048576; default 65536)
  net.ipv4.tcp_rmem: minimum, default, and maximum receive window size. (actual 4096 262144 16777216; default 4096 87380 4194304)
  net.ipv4.tcp_wmem: minimum, default, and maximum transmit window size. (actual 4096 262144 16777216; default 4096 87380 4194304)
  net.ipv4.tcp_syncookies: disables TCP syncookies when set to 0. (actual 0; default 0)
  sunrpc.tcp_slot_table_entries: maximum number of in-flight RPC requests between a client and a server; set on the name node and secondary name node to improve NFS performance. (actual 128; default 16)
  vm.dirty_background_ratio: maximum percentage of active system memory that can be used for dirty pages before they are flushed to storage. (actual 1; default 10)
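Applied literally, the non-default values in Table 7 are ordinary /etc/sysctl.conf entries loaded with sysctl -p; a short excerpt as a sketch (not the complete list):

    # Excerpt of /etc/sysctl.conf reflecting Table 7
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 262144 16777216
    net.ipv4.tcp_wmem = 4096 262144 16777216
    sunrpc.tcp_slot_table_entries = 128
    vm.dirty_background_ratio = 1

    # Apply without a reboot
    sysctl -p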
Table 8 lists the differences between the Hadoop hdfs-site.xml defaults and the settings used during ESG Lab testing. Each entry gives the purpose, then (actual; default).

Table 8. HDFS Site Settings

  dfs.name.dir: path on the local file system where the name node stores the namespace and transaction logs persistently. When this is a comma-delimited list of directories, as in this configuration, the name table is replicated in all of the directories for redundancy. The directory /mnt/fsimage_bkp is a location on NFS-mounted NetApp FAS storage where name node metadata is mirrored and protected, a key feature of NetApp's Hadoop solution. (actual /local/hdfs/namedir,/mnt/fsimage_bkp; default ${hadoop.tmp.dir}/dfs/name)
  dfs.hosts: specifies a file listing the machines authorized to join the Hadoop cluster as data nodes. (actual /etc/hadoop-0.20/conf/dfs_hosts; default null)
  dfs.data.dir: directory paths on the data node local file systems where HDFS data blocks are stored. (actual /disk1/data,/disk2/data; default ${hadoop.tmp.dir}/dfs/data)
  fs.checkpoint.dir: directory path where checkpoint images are stored; used by the secondary name node. (actual /home/hdfs/namesecondary1; default ${hadoop.tmp.dir}/dfs/namesecondary)
  dfs.replication: HDFS block replication count. The Hadoop default is 3; the NetApp Hadoop solution uses 2. (actual 2; default 3)
  dfs.block.size: HDFS data storage block size, in bytes. (actual 134217728, i.e., 128MB; default 67108864)
  dfs.namenode.handler.count: number of server threads for the name node. (actual 128; default 10)
  dfs.datanode.handler.count: number of server threads for the data node. (actual 64; default 3)
  dfs.max-repl-streams: maximum number of replications a data node is allowed to handle at one time. (actual 8; default 2)
  dfs.datanode.max.xcievers: maximum number of files a data node will serve at one time. (actual 4096; default 256)
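For orientation, here are a few of the Table 8 values as they would appear in hdfs-site.xml; this is a hand-assembled excerpt, not the solution's shipped configuration file:

    <!-- hdfs-site.xml excerpt reflecting Table 8 (sketch) -->
    <property>
      <name>dfs.name.dir</name>
      <!-- local copy plus the NFS-mounted FAS2040 copy of name node metadata -->
      <value>/local/hdfs/namedir,/mnt/fsimage_bkp</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <!-- hardware RAID permits 2 instead of the Hadoop default of 3 -->
      <value>2</value>
    </property>
    <property>
      <name>dfs.block.size</name>
      <!-- 128MB blocks -->
      <value>134217728</value>
    </property>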
Table 9 lists the differences between the mapred-site.xml defaults and the settings used during ESG Lab testing. Each entry gives the purpose, then (actual; default).

Table 9. mapred-site Settings

  mapred.job.tracker: job tracker address as a URL (IP address or hostname with port number). (actual 10.61.189.66:9001; default local)
  mapred.local.dir: comma-separated list of local file system paths where temporary MapReduce data is written. (actual /disk1/mapred/local,/disk2/mapred/local; default ${hadoop.tmp.dir}/mapred/local)
  mapred.hosts: specifies the file containing the list of nodes allowed to join the Hadoop cluster as task trackers. (actual /etc/hadoop-0.20/conf/mapred.hosts; default null)
  mapred.system.dir: path in HDFS where the MapReduce framework stores control files. (actual /mapred/system; default ${hadoop.tmp.dir}/mapred/system)
  mapred.reduce.tasks.speculative.execution: enables the job tracker to detect slow-running reduce tasks, run them in parallel on other nodes, use the first available results, and kill the slower tasks. (actual false; default true)
  mapred.map.tasks.speculative.execution: the same behavior for map tasks. (actual false; default true)
  mapred.tasktracker.reduce.tasks.maximum: maximum number of reduce tasks that can run simultaneously on a single task tracker node. (actual 5; default 2)
  mapred.tasktracker.map.tasks.maximum: maximum number of map tasks that can run simultaneously on a single task tracker node. (actual 7; default 2)
  mapred.child.java.opts: Java options passed to the task tracker child processes; in this case, 1GB of heap memory per JVM. (actual -Xmx1024m; default -Xmx200m)
  io.sort.mb: total buffer memory allocated to each merge stream while sorting files on the mapper, in MB. (actual 340; default 100)
  mapred.jobtracker.taskScheduler: job tracker task scheduler to use; in this case, the FairScheduler. (actual org.apache.hadoop.mapred.FairScheduler; default org.apache.hadoop.mapred.JobQueueTaskScheduler)
  io.sort.factor: number of streams to merge at once while sorting files. (actual 100; default 10)
  mapred.output.compress: enables/disables MapReduce output file compression. (actual false; default false)
  mapred.compress.map.output: enables/disables map output compression. (actual false; default false)
  mapred.output.compression.type: sets the output compression type. (actual block; default record)
  mapred.reduce.slowstart.completed.maps: fraction of map tasks that should complete before reducers are scheduled for the MapReduce job. (actual 0.05; default 0.05)
  mapred.reduce.tasks: total number of reduce tasks available for the entire cluster. (actual 40 for 8 data nodes, 80 for 16, 120 for 24; default 1)
  mapred.map.tasks: total number of map tasks available for the entire cluster. (actual 56 for 8 data nodes, 112 for 16, 168 for 24; default 2)
  mapred.reduce.parallel.copies: number of parallel threads used by reduce tasks to fetch outputs from map tasks. (actual 64; default 5)
  mapred.inmem.merge.threshold: number of map outputs in the reduce task tracker's memory at which map data is merged and spilled to disk. (actual 0; default 1000)
  mapred.job.reduce.input.buffer.percent: percent usage of the map outputs buffer at which map output data is merged and spilled to disk. (actual 1; default 0)
  mapred.job.tracker.handler.count: number of job tracker server threads for handling RPCs from the task trackers. (actual 128; default 10)
  tasktracker.http.threads: number of task tracker worker threads that serve intermediate map outputs to reducers. (actual 60; default 40)
  mapred.job.reuse.jvm.num.tasks: maximum number of tasks that can run in a single JVM for a job; a value of -1 means unlimited. (actual -1; default 1)
  mapred.jobtracker.restart.recover: enables job recovery after a restart. (actual true; default false)
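Similarly, a few Table 9 entries rendered as mapred-site.xml properties; again a hand-assembled sketch for orientation:

    <!-- mapred-site.xml excerpt reflecting Table 9 (sketch) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <!-- 7 map slots per data node -->
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <!-- 5 reduce slots per data node -->
      <value>5</value>
    </property>
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>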
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 | Fax: 508.482.0218 | www.enterprisestrategygroup.com