Balancing the compute, storage, and network resources in an Apache Hadoop*cluster enabled the full benefit of the latest I...
1 Evolving Tools to Manage Big DataEnormous data stores have become a factof life for organizations of all types andsizes....
2 Establishing Balance AmongCritical ComponentsWhile Hadoop clusters are typically builtfrom generally available, mainstre...
Intel® Manager for Apache Hadoop softwareDeployment, Configuration, Monitoring, Alerts, and SecurityHDFS 2.0.3Hadoop Distr...
Figure 3. Dashboards to simplify setup and configuration, reducing deployment time.The Intel Distribution provides tuning ...
Intel® EthernetGigabit ServerAdaptersIntel EthernetGigabit ServerAdaptersConventionalHard DrivesConventionalHard DrivesBef...
Intel Xeon ProcessorE5-2600 Product FamilyIntel® Xeon® ProcessorE5-2600 Product FamilyConventionalHard DrivesIntel Solid-S...
In addition to the hardware upgradesalready described, the Intel Distributionprovides an intuitive interface and manyunder...
5 Tuning and OptimizationConsiderationsIn addition to upgrading networkcomponents and considering the useof the Intel Dist...
•	Number of concurrent Map and Reducetasks. On Intel Xeon processors, theoptimal number of Map tasks tendsto be the number...
6 ConclusionThe ability to store and analyze huge amounts of unstructured data promises ongoing opportunities for business...
Learn more by visiting the following
Upcoming SlideShare
Loading in...5

Big data-apache-hadoop-technologies-for-results-whitepaper


Published on

Whitepaper that shows how you can massively increase the performance of Hadoop implementations using the Intel distribution and the full array of platform technologies available in a modern server deployment. 10Gb Ethernet, Solid State Drives and modern Xeon processors.

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Big data-apache-hadoop-technologies-for-results-whitepaper

  1. 1. Balancing the compute, storage, and network resources in an Apache Hadoop*cluster enabled the full benefit of the latest Intel® processors, solid-statedrives, 10 Gigabit Intel® Ethernet Converged Network Adapters, and the Intel®Distribution for Apache Hadoop* software. Building a balanced infrastructurefrom these components enabled reducing the time required for Intel tocomplete a TeraSort benchmark test workload from approximately four hours toapproximately seven minutes, a reduction of roughly 97 percent.1Results suchas this from big data technologies pave the way for low-cost, near-real-time dataanalytics that will help businesses respond almost instantly to changing marketconditions and unlock more value from their assets.Big Data Technologies forNear-Real-Time Results:Balanced Infrastructure that Reduced Workload CompletionTime from Four Hours to Seven Minutes1Improvements in widely available compute, storage, and network components arerapidly improving the ability of commercial, academic, and government organizationsto handle big data effectively. Intel has demonstrated dramatic results from optimized,balanced Apache Hadoop* clusters that include the latest Intel® Xeon® processors, solid-state local storage, and 10 Gigabit Intel® Ethernet Converged Network Adapters.In fact, in tests conducted by Intel, upgrading these components and using the Intel®Distribution for Apache Hadoop* software (Intel® Distribution) reduced the time requiredto sort a terabyte of data using the TeraSort benchmark workload from approximatelyfour hours to approximately seven minutes.1, 2These results represent significantprogress toward near-real-time data analytics at significantly lower cost than waspreviously possible with proprietary hardware and software implementations. Thedramatic cost savings and improved efficiency are vital aspects of enabling the fullpotential of big data technologies.This paper demonstrates how those results were achieved, as guidance for IT decisionmakers and others as they consider where to invest for optimal results from Hadoop*environments. It also introduces some of the benefits available from the pre-optimized,turnkey Intel Distribution for Apache Hadoop software. Using the guidelines presentedhere, organizations can work toward optimal performance within their budgetaryrequirements for specific workloads and circumstances.1WHITE PAPERIntel® Xeon® ProcessorsIntel® Solid-State DrivesIntel® Ethernet Converged Network AdaptersIntel® Distribution for Apache Hadoop* Software
  2. 2. 1 Evolving Tools to Manage Big DataEnormous data stores have become a factof life for organizations of all types andsizes. The ability to manipulate, transform,and drive benefit from that data—which isoften unstructured big data—is quicklybecoming the norm, and the tools andtechniques for doing so are becomingmore commonplace. Frameworks suchas Hadoop are widely used, and ITorganizations are increasingly buildingtheir own computing environments forhandling big data.1.1 Introducing Hadoop*: A RobustFramework for Generating Value fromBig DataHadoop is an open-source softwareframework written in Java* and basedon Google’s MapReduce* and distributedfile system work. It is built to supportdistributed applications by analyzing verylarge bodies of data using clusters ofservers, transforming it into a form that ismore usable by those applications. Hadoopis designed to be deployed on commonlyavailable, general-purpose infrastructure.Tasks that the framework is particularlywell suited for include indexing andsorting large data sets, data mining, loganalytics, and image manipulation. Keyparts of the Hadoop framework includethe following:• Hadoop Distributed File System(HDFS*) achieves fault tolerance andhigh performance by breaking data intoblocks and spreading them across largenumbers of worker nodes.• Hadoop’s MapReduce Engine acceptsjobs from applications and divides thosejobs into tasks that it assigns to variousworker nodes.1.2 The Industry Ecosystem of Big DataTechnologiesA large ecosystem of solutions—ofwhich Hadoop is just one part—has beendesigned to derive maximum value frombig data. Another key component is NoSQL(“Not only SQL”) databases, which existas an alternative (or complement) to themore common, table-based relationaldatabase management systems (RDBMSs).Unlike RDBMSs, NoSQL databases arenot based primarily on tables. Thatcharacteristic makes them somewhat lessefficient than RDBMSs for functionalitythat depends on the relationshipsbetween data elements, but morestreamlined for handling large amountsof data where those relationships are notcentrally important.While structured data can be storedin NoSQL databases, these systemsare especially well suited for handlingunstructured data and in particular toproviding high scalability and performancefor retrieving and appending that data inlarge quantities. As big data technologieshave grown to hold a more prominent rolein the data center, open source solutionssuch as Hadoop have become increasinglyimportant and well-developed, throughwork by a combination of commercialand non-commercial entities. High-profile NoSQL databases in the Hadoopecosystem include Apache Cassandra*,HBase*, and MongoDB*.In addition to Hadoop, other examplesof big data technologies range from thesimple-to-use, open source Disco Project*,for which developers write jobs usingPython* scripting, to the SAP HANA*real-time data platform, an enterprise-scale environment focused on businessintelligence and related usages. All ofthese industry innovations are beingdeveloped and optimized for Intel Xeonprocessor-based platforms. This paperfocuses on Hadoop as an example ofbuilding systems for processing big databecause of its widespread and fast-growing presence in commercial, research,and academic environments.Table of Contents1 Evolving Tools toManage Big Data. . . . . . . . . . . . . . . . 2 1.1 Introducing Hadoop*: A RobustFramework for Generating Valuefrom Big Data. . . . . . . . . . . . . . . . . . 2 1.2 The Industry Ecosystem ofBig Data Technologies. . . . . . . . . . 22 Establishing Balance AmongCritical Components. . . . . . . . . . . . . 3 2.1 Advancements inCompute Resources. . . . . . . . . . . . 3 2.2 Advancements inStorage Technology. . . . . . . . . . . . 4 2.3 Advancements inNetworking. . . . . . . . . . . . . . . . . . . . 4 2.4 An Optimized HadoopDistribution from Intel. . . . . . . . . . 43 The Role of 10GbE andOther Factors in AcceleratingHadoop Workflows. . . . . . . . . . . . . . 5 3.1 Optimizing for theImport Stage. . . . . . . . . . . . . . . . . . . 6 3.2 Optimizing for the ProcessingStage: The Path from FourHours to Seven Minutes. . . . . . . . 6 3.3 Optimizing for theExport Stage. . . . . . . . . . . . . . . . . . . 84 Description of theResearch Environment. . . . . . . . . . . 85 Tuning and OptimizationConsiderations. . . . . . . . . . . . . . . . . . 9 5.1 Networking, OS,and Driver Optimizations. . . . . . . .9 5.2 Hadoop ConfigurationParameters. . . . . . . . . . . . . . . . . . . . 9 5.3 Future Enhancements. . . . . . . . . 106 Conclusion. . . . . . . . . . . . . . . . . . . . . 112Big Data Technologies for Near-Real-Time Results
  3. 3. 2 Establishing Balance AmongCritical ComponentsWhile Hadoop clusters are typically builtfrom generally available, mainstreamcomponents, that fact doesn’t diminish thechallenges associated with choosing andmatching those components for maximumbenefit. The primary consideration isto create a balance in the environmentbetween compute, storage, and networkresources, as depicted in Figure 1.Before turning to specific strategies fordeciding on combinations of componentsfor your cluster, it is useful to consider thestate of generally available technology ineach category, as shown in Table 1. Afterestablishing which types of resourcesare desirable, the discussion will focus onhow a Hadoop cluster takes advantageof those resources, before describingthe role of 10 Gigabit Ethernet (10GbE)networking in delivering the benefits ofeach.ComputeIntel® Xeon® processor E5 FamilyNetwork10 Gigabit Intel® EthernetStorageIntel® Solid-State DrivesName NodeJobTrackerData NodesTaskTrackerFigure 1. Balance among compute, network, and storage resources for optimalApache Hadoop* results.Compute Storage NetworkingGoodEarlier-generation Intel®Xeon® processorsConventional spinninghard drivesIntel® Ethernet GigabitServer AdaptersBetterIntel® Xeon® processorE5 familyTiered storage(conventional plussolid-state drives)10 Gigabit Intel®Ethernet ConvergedNetwork AdaptersBestIntel Xeon processorE5 familyAll solid-state drives10 Gigabit IntelEthernet ConvergedNetwork AdaptersTable 1. Upgrades across the Hadoop* solution stack for a high-performance balanced infrastructure.2.1 Advancements in ComputeResourcesPlatform architecture introduced with theIntel® Xeon® processor E5 family makesbetter use of resources throughout thesolution stack, compared to previous-generation platforms. For example,increased core count, from six cores(12 hardware threads) to eight cores(16 hardware threads) per socket,enhances the ability to handle ahigher degree of parallelism, which isparticularly beneficial for data-intensiveHadoop workloads. Intel® Data DirectI/O Technology (Intel® DDIO) is a newfeature in the Intel Xeon processorE5 family that allows Intel® EthernetControllers and adapters to talk directlywith the processor cache instead of themain memory, helping deliver increasedbandwidth and lower latency that areparticularly beneficial when processinglarge data sets.3Big Data Technologies for Near-Real-Time Results
  4. 4. Intel® Manager for Apache Hadoop softwareDeployment, Configuration, Monitoring, Alerts, and SecurityHDFS 2.0.3Hadoop Distributed File SystemHBase0.94.1ColumnarStoreSqoop1.4.1DataExchangeFlume1.3.0LogCollectorYARN (MRv2)Distributed Processing FrameworkHive 0.9.0SQL QueryOozie 3.3.0WorkflowPig 0.9.2ScriptingMahout 0.7MachineLearningR connectorsStatisticsZookeeper3.4.5CoordinationIntel proprietary Intel enhancements contributedback to open sourceOpen source componentsincluded without change2.2 Advancements in StorageTechnologySolid-state drives (SSDs) represent a majorshift in persistent storage for mainstreamclient and server computers. Eliminatingthe electro-mechanical parts ofconventional hard disk drives (HDDs), suchas spinning disks and read/write heads,is a key factor in providing dramaticallyimproved data-access time and reducedlatency.The Intel® Solid-State Drive (Intel® SSD)520 Series used in the testing describedin this paper is available in a wide rangeof capacities, offering built-in data-protection features and exceptionalperformance improvements overconventional HDDs.3Built with compute-quality Intel 25-nanometer NAND FlashMemory, the Intel SSD 520 Series offersrandom read performance of up to 50,000input/output operations per second(IOPS)4and sequential read performanceof up to 550 megabytes per second (MB/s).5As a transitional step betweenconventional hard drives and SSDs, ithas become increasingly common fororganizations to provision servers withboth types of drives on the same machine.In this scenario, SSDs function as high-speed data cache devices, reducing theneed for reads from and writes to theconventional HDDs, improving overallperformance.2.3 Advancements in NetworkingRegular advances in the wire speedsof networking components have beenongoing for many years, together withcomplementary technologies that addvalue such as increasing throughput,improving cost-effectiveness, andenhancing flexibility. The mainconsideration in the present study is thetransition from Gigabit Ethernet (1GbE)to 10GbE.Intel® Ethernet Controllers andConverged Network Adapters havebeen instrumental in driving down thecost of 10GbE networking. In turn,increasing deployment of virtualizationand bandwidth-hungry applications suchas data analytics and video on demandhas led to more widespread adoption of10GbE, establishing a virtuous cycle wherecost benefits and broadening mainstreamadoption continue to support each other.Intel Ethernet software drivers have beenoptimized for big data implementations;for example, they are engineered tominimize I/O interference to Hadoop dataprocessing.The Intel® Ethernet Converged NetworkAdapter X540 is a low-cost, low-power10GBASE-T solution that providesbackward compatibility with existing1000BASE-T networks using Category6 and Category 6A copper cabling. TheIntel® Ethernet Controller X540 lowersboth initial cost and power requirementsby integrating the MAC and PHY into asingle-chip solution. The Intel® EthernetConverged Network Adapter X520 offersSFP+ connectivity for 10GbE using eithercopper or fiber-optic networking.2.4 An Optimized Hadoop Distributionfrom IntelThe Intel Distribution helps streamline andimprove the implementation of Hadoop onIntel® architecture-based infrastructure.It is the only distribution built from thesilicon up to enable the widest range ofdata analysis on Hadoop, and it is the firstwith hardware-enhanced performance andsecurity capabilities. The solution includesthe Hadoop framework, MapReduce,Hadoop Distributed File System (HDFS),and other related components to supportboth batch processing and near-real-time analytics, including the Hive* datawarehouse infrastructure, Pig* dataflow language, and HBase database.The components included in the IntelDistribution are shown in Figure 2.Figure 2. Components in the Intel® Distribution for Apache Hadoop* Software.4Big Data Technologies for Near-Real-Time Results
  5. 5. Figure 3. Dashboards to simplify setup and configuration, reducing deployment time.The Intel Distribution provides tuning and optimization guidelines based on real-world experience, as well as wizards and otherautomated deployment tools. The Intel® Manager for Apache Hadoop* software provides automated installation on Hadoop clusternodes, as well as real-time management, monitoring, and diagnostic capabilities using powerful, intuitive dashboards, as shownin Figure 3. The Intel Distribution also offers customers extensive training resources, as well as assistance with system design,deployment, customization, and tuning. Enterprise support is also available on a 24/7 basis to meet the needs of organizations andpeople whose success depends on high degrees of cluster uptime.3 The Role of 10GbE and OtherFactors in Accelerating HadoopWorkflowsImprovements in the ability to managebig data continue to bring us closerto achieving real-time analytics inmainstream environments. Using commondata center hardware and software, weare improving on solutions that takehours or even days to derive value outof source data.Improving results in this context requiresconsideration of the entire workflow,including the point where data enters thesystem, the mechanisms for processing it,and the tasks associated with exportingthe processed data back out of thesystem. Accordingly, the Hadoop workflowcan be considered as consisting of threestages:1. Import. The first step of using Hadoopto derive an answer from a largedata set is to get the data from anapplication into HDFS. Data can beimported in either a streamed or abatch modality.2. Processing. Once the data hasbeen imported into HDFS, Hadoopmanipulates that data to extractvalue from it. The MapReduce engineaccepts jobs from applications throughits JobTracker node, which dividesthe work into smaller tasks that itassigns to TaskTracker nodes. Typicaloperations include sorting, searching, oranalytics. TeraSort is a standard Hadoopbenchmark based on a data-sortingworkload.1In our test environment, weuse TeraGen to actually generate thedata set.3. Export. After the operations have beencompleted on the data in the processingstage, the results are made availableto applications.5Big Data Technologies for Near-Real-Time Results
  6. 6. Intel® EthernetGigabit ServerAdaptersIntel EthernetGigabit ServerAdaptersConventionalHard DrivesConventionalHard DrivesBefore AfterIntel® Xeon® Processor5600 Series250Minutes125MinutesIntel® Xeon® ProcessorE5-2600 Product FamilyProcessorUpgrade:~50%TimeReductionApacheHadoop*ApacheHadoopFigure 4. Processor upgrade to Intel® Xeon® processor E5-2600 product family:processing-stage speed improvement of approximately 50 percent.1This model demonstrates the value of10GbE throughout the Hadoop workflow,although the relative demands of thesethree stages compared to one anothervary significantly according to the natureof the work being performed. To give asimple example, a workload that includeslarge-scale data compression might beexpected to have a larger workload burdenfor importing data than for exporting itback out in its compressed form. Similarly,some tasks are more computationallyintensive than others, even though theamounts of data they work on couldbe similar. The tuning and optimizationguidelines and support servicesassociated with the Intel Distributioncan be beneficial in identifying the bestapproach for specific needs.3.1 Optimizing for the Import StageThe import stage within this modelconsists simply of putting data intoHDFS for processing. That process willalways occur at least once, and in somecases—particularly where MapReduce issold as a service—many imports could benecessary. This stage and the Hadoopreplication factor place intense network-performance demands on the system, interms of both networking and storage I/O.10GbE networking is instrumental insupporting these demands as data isimported into the system. Migratingfrom 1GbE to 10GbE produces up to a4x performance improvement in importoperations using conventional HDDs withparallel writes and up to 6x improvementusing SSDs.1The greater improvementwith SSDs can be attributed to fasterwrites into the storage subsystem usingnon-volatile memory.3.2 Optimizing for the ProcessingStage: The Path from Four Hours toSeven MinutesIntel testing used a 1 terabyte (TB)TeraSort workload distributed over 10data nodes with one name node. Togauge the benefit of upgrading variousresources, results were gathered beforeand after migrating the cluster from theIntel® Xeon® processor 5600 series to theIntel® Xeon® processor E5-2600 productfamily, followed by an upgrade fromconventional HDDs to SSDs, and finallyfrom 1GbE to 10GbE.These hardware upgrades resulted ina reduction in processing time fromapproximately four hours to about12 minutes. Implementing the IntelDistribution reduced that processing timefurther to approximately seven minutes.Altogether, this performance increasereflects nearly a 97-percent reduction inthe time required for the processing stage.The first hardware change tested was anupgrade from the Intel® Xeon® processorX5690 to the Intel® Xeon® processorE5-2690. As illustrated in Figure 4,the processor upgrade reduced thetime required to sort the 1 TB data setapproximately in half, from 250 minutesto 125 minutes.66Big Data Technologies for Near-Real-Time Results
  7. 7. Intel Xeon ProcessorE5-2600 Product FamilyIntel® Xeon® ProcessorE5-2600 Product FamilyConventionalHard DrivesIntel Solid-StateDrive520 SeriesBefore After125Minutes23MinutesUpgradeto SSDs:~80%TimeReductionApacheHadoop*ApacheHadoopIntel® EthernetGigabit ServerAdaptersIntel EthernetGigabit ServerAdaptersFigure 5. Storage upgrade to solid-state drives: processing-stage speed improvement ofapproximately 80 percent.7Intel Solid-StateDrive520 SeriesIntel® Solid-StateDrive520 SeriesIntel Xeon ProcessorE5-2600 Product FamilyIntel® Xeon® ProcessorE5-2600 Product FamilyIntel® EthernetGigabit ServerAdapters10 Gigabit IntelEthernet ConvergedNetwork AdaptersBefore After23Minutes12MinutesUpgradeto 10GbE:~50%TimeReductionApacheHadoop*ApacheHadoopFigure 6. Networking upgrade to 10 Gigabit Ethernet: processing-stage speed improvement ofapproximately 50 percent.9Working with this massive data set,the ability to access non-sequentialdata quickly is a key performanceconsideration. Therefore, to reduceany existing storage bottleneck, thenext upgrade tested was to replace theconventional HDDs with SSDs, takingadvantage of their dramatically higherrandom read times. Building on theprevious performance improvement fromthe processor upgrade, shifting fromconventional HDDs to the Intel SSD 520Series reduced the time to complete theworkload by approximately another 80percent—from about 125 minutes toabout 23 minutes, as shown in Figure 5.7For those customers interested incombining conventional HDDs and SSDs inthe same server, Intel offers Intel® CacheAcceleration Software. This tiered storagemodel provides some of the performancebenefits of SSDs at a lower acquisitioncost, but in addition to the performancedifferences compared to an SSD-onlyconfiguration, this approach also sacrificessome of the reliability benefits availablefrom SSDs. Nevertheless, this storagemodel offers customers another option asthey move toward full adoption of SSDs todramatically reduce the time to get fromdata to insight.Testing has also revealed that when fiveSSDs are installed per task node, theHadoop framework is able to run enoughsimultaneous Map tasks to generateparallel I/O to each SSD and utilize theprocessors at almost 100 percent.8This state allows for highly optimizedperformance of Map tasks. Setting io.sort.mb and io.sort.record.percent flagsappropriately avoided intermediate Mapoutput spills and excessive disk readsand writes. Each Map task processing a128 MB block of data is completed in lessthan 10 seconds and generates 128 MB ofoutput. Running 32 Map tasks in parallelenables each task node to generate morethan 5 Gb/s.8Upgrading the cluster hardware from1GbE to 10GbE to build on the upgradesof the processor and storage reducedprocessing time on the test workload byup to an additional 50 percent, from about23 minutes to about 12 minutes, as shownin Figure 6.9Using 10GbE interconnectsand SSDs supported running more than100 concurrent reducer tasks on the10-node test cluster, exhibiting good jobscaling and high resource utilization.5The large scale and distributed natureof Hadoop workloads makes network I/Oa vital aspect of overall performance atevery stage in the workflow, and 10GbEis a cost-effective, scalable solution thathelps reduce wait times for data. Havinga high bandwidth 10GbE network notonly allows for data to be imported andexported from the cluster quickly, butit also improves the shuffling phase ofthe TeraSort workload to be accelerated.Using 10GbE links between Map andReduce nodes allows the Reduce nodesto fetch data rapidly, helping improveoverall job-execution time and clusterperformance.7Big Data Technologies for Near-Real-Time Results
  8. 8. In addition to the hardware upgradesalready described, the Intel Distributionprovides an intuitive interface and manyunder-the-hood optimizations, includingadvanced data compression, dynamicreplica selection for HDFS, and MapReducespeedup. This additional engineeringhelps deliver performance gains and,together with trusted, enterprise-gradesupport from Intel, helps customers morerapidly deploy and successfully maintaina Hadoop environment. Implementing theIntel Distribution on top of the hardwareupgrades already made reduced workload-completion time requirements by up toapproximately another 40 percent—fromabout 12 minutes to about 7 minutes, asshown in Figure 7.103.3 Optimizing for the Export StageSimilar to the import stage, 10GbEdramatically benefits performance ofextracting data from the system afterprocessing, particularly in conjunctionwith upgrading from conventional HDDsto SSDs. Initial testing with conventionalHDDs revealed a significant bottleneck inthe export stage of the Hadoop workflowdue to random disk seeks. Replacingthe local storage with SSDs eliminatedthis shortcoming, enabling resultsin keeping with those we saw in theProcessor and BaseSystem ComparisonStorageComparisonNetwork AdapterComparisonSoftwareComparisonBaselineComponentsSuperMicro SYS-1026T-URF 1Uservers with two Intel® Xeon®processors X5690 @ 3.47 GHz,48 GB RAM700 GB 7200 RPM SATA harddrivesIntel® Ethernet Server AdapterI350-T2(Gigabit Ethernet)Apache Hadoop* 1.0.3UpgradedComponentsDell PowerEdge* R720 2Uservers with two Intel® Xeon®processors E5-2690 @ 2.90 GHz,128 GB RAMIntel® Solid-State Drive 520SeriesIntel® Ethernet ConvergedNetwork Adapter X520-DA2(10 Gigabit Ethernet)Intel® Distribution for ApacheHadoop* software 2.1.1Table 2. Worker-node components compared in the test environment.import stage. Using SSDs, we again sawup to approximately a 6x performanceimprovement from 10GbE relative to1GbE, dramatically improving the timerequirements for the overall operationacross the workflow. As described above,using SSDs and 10GbE also enhances thebenefit from high levels of processorresources, emphasizing the benefit ofbalancing resources in the Hadoop cluster.4 Description of theResearch EnvironmentThe Hadoop test bed used to generate theresults reported in this paper included onehead node (name node, job tracker), 10workers (data nodes, task trackers), and aCisco Nexus* 5020 10 Gigabit switch. Thevarious baseline and upgraded worker-node components that are compared inthe testing are detailed in Table 2.2Intel Xeon ProcessorE5-2600 Product FamilyIntel Solid-StateDrive520 SeriesIntel® Solid-StateDrive520 SeriesIntel® Xeon® ProcessorE5-2600 Family10 Gigabit IntelEthernet ConvergedNetwork AdaptersBefore After12Minutes7MinutesImplementIntel®Distributionfor ApacheHadoop*Software:~40%TimeReductionApacheHadoop*Intel®Distribution forApache Hadoop*Software10 Gigabit Intel®Ethernet ConvergedNetwork AdaptersFigure 7. Implementation of Intel® Distribution for Apache Hadoop* Software: processing-stagespeed improvement of approximately 40 percent.108Big Data Technologies for Near-Real-Time Results
  9. 9. 5 Tuning and OptimizationConsiderationsIn addition to upgrading networkcomponents and considering the useof the Intel Distribution, organizationsworking to maximize the value theyget from big data technologies mustconsider configurations and settings inthe networking stack and the Hadoopsoftware environment itself. While thepotential scope of configurations andsettings to be considered is daunting,best practices suggest that engineersshould pay particular attention to theconsiderations listed in this section foroptimal value.5.1 Networking, OS, and DriverOptimizationsIn the OS and networking stack, thenumber of open files, open networkconnections, and running processes atany time should be adjusted according tothe needs of the specific workload. TheIntel 10GbE Linux* drivers should alsobe optimized by adjusting the number ofRSS queues (two was the optimal numberin the testing described in this paper),and the number of context switchesshould be reduced by adjusting interruptthrottling. In the OS and the networkingand TCP/IP stack, the following settings,optimizations, and practices are ofparticular note:• Increase the limit of open files inthe OS. Hadoop inherently opens largenumbers of files, so increasing thelimit of concurrent open files reducesjob failures; Intel testing found 32K tobe sufficient. Increasing the limit ofconcurrent processes also helps reducejob failures.• Increase the number of outstandingconnections and outstandingSYN requests. Hadoop’s HDFS andMapReduce engine open large numbersof short-lived TCP/IP connections. InIntel setup configurations, this settingis increased to 3,240,000, reducing thewait time for HDFS and MapReducecommunications.• If resource sharing beyond the Hadoopworkload is not required, considerincreasing TCP/IP maximum windowsize and scaling up to 16 MB. Wherepossible, this approach helps to maximizethe value of the 10GbE investment.• If enough system memory is available,increase the TCP/IP send and receivebuffer sizes. This change may improvenetwork throughput. In the Intel testcluster, the maximum values were set to16 MB.• Disable TCP/IP selective ACK whenbandwidth is readily available. Whenselective ACKs are enabled, the responseto client requests can be delayed,degrading performance in terms ofjob execution and completion time;disabling them helps improve overallserver response time and Hadoop jobperformance.• Use JBOD for storage. Hadoop hasbuilt-in load balancing and uses efficientround-robin among the available HDFSand MapReduce JBOD Disks. Using RAIDwith SSDs and fast storage limits storagethroughput and overall job performance.Instead, use Disks in JBOD mode forHDFS and MapReduce, as both includebuilt-in functionality for load balancingacross multiple JBOD disks.5.2 Hadoop Configuration ParametersWithin the more than 200 configurationparameters available within the Hadoopstack, starting with the followingsubset of tuning considerations can helpengineering teams apply their effortsefficiently:• Memory configurations for Java virtualmachine (JVM) tasks. Each Map andReduce task runs in an independent JVMinstance. One can specify the amountof memory each Map and Reduce taskcan allocate, using the configurationparameters and for Map and Reduce tasks,respectively. The Intel test cluster setthe Map task heap to 512 MB and theReduce task heap to 1.5 GB.- Memory requirements for Maptasks will depend how much outputis being generated from each Map.A 128 MB block size with sortapplications requires about 200MB to store intermediate recordswithout any spill. This memory canbe managed using configurationparameters that follow the examplesio.sort.mb=200mb, io.sort.record.percent=.15, and io.sort.spill.percent=1.0.- Memory use for Reduce tasks canalso be tuned.Intel testing suggests that thedefault settings for most parametersare optimal, although mapred.job.reduce.input.buffer.percent can bechanged to 0.7 so the reducer doesnot need to empty out the memorybefore it starts the final merge.9Big Data Technologies for Near-Real-Time Results
  10. 10. • Number of concurrent Map and Reducetasks. On Intel Xeon processors, theoptimal number of Map tasks tendsto be the number of logical cores, andthe number of reducers should be setequal to the number of physical cores.These settings can be configured usingthe and mapred.tasktracker.reduce.tasks.maximum parameters.• NameNode and DataNodes requesthandler count and thread count. Ifenough memory and compute resourcesare available on the NameNode, thenumber of thread handles in theNameNode can be increased to 100or more to enable support for largernumbers of concurrent requests.Increasing the handler count forDataNodes can also be beneficial,particularly when SSDs or fast storageare being used.• Reducing network latency for IPCcommunications between nodes. Setipc.server.tcpnodelay and ipc.client.tcpnodelay to ‘true’.• Heartbeat frequency betweenjob and task trackers. The defaultheartbeat frequency is 3 seconds.For smaller jobs where each Map taskcompletes more quickly, this settingcould delay task-completion notificationsand scheduling of new tasks. Settingmapreduce.tasktracker.outofband.heartbeat to send immediateindications for job completions canimprove job performance. Tuningheartbeat frequency using mapreduce.tasktracker.outofband.heartbeat.damper parameters can also providesome benefit.• Speculative task executions. Hadoopmay schedule the same tasks onmultiple nodes as a safeguard againstnode failure or delayed execution.This practice is useful when it takesadvantage of unused resources, but notwhen the cluster is running at capacity.Therefore, disabling speculative taskexecutions is sometimes advisable,particularly in the presence of high-performance processing, storage, andnetwork resources.• Intermediate Map output compression.Enabling compression for intermediateMap output can help improveperformance on a cluster that is boundby storage or network performance.Note that, when using 10GbE and SSDs,compressing the output may providelittle if any improvement.• HDFS block size. The optimal size ofHDFS blocks is workload-specific, butlarger block sizes often do not generatebetter performance, because they maycause an extra load on memory and tendto cause intermediate spills in the Mapphase. At the same time, smaller blocksizes create extra overhead for smallerand more parallel tasks. In Intel testing,the 128 MB block size gave the bestoverall performance for the TeraSortbenchmark.5.3 Future EnhancementsAs described in this paper, Intel has beenable to produce enormous performanceadvantages using the latest Intel Xeonprocessors, SSDs, and 10 Gigabit IntelEthernet Converged Network Adapters, aswell as the Intel® Distribution for ApacheHadoop* software. Ongoing advanceswith all of these components areexpected to produce further performancebenefits in big data implementations.A wide variety of software enhancementsare under consideration for Hadoop andother big data technologies. One exampleis the possibility of accelerating thetransport layer by replacing HTTP-basedinter-node communication with other,more optimized options, improving overallthroughput without adding physicalresources. This and other softwareenhancements are an ongoing area ofresearch at Intel.Virtualized, pre-built Hadoop clustersfor business analytics and other big datausages represent another promising areaof research and development, with thepotential for a number of benefits, such asthe following:• Reducing the complexity ofimplementing Hadoop environments• Encapsulating best practices forconfiguring virtualized resources,without manual tuning• Enabling multi-purpose environmentsbuilt on big data technologies10Big Data Technologies for Near-Real-Time Results
  11. 11. 6 ConclusionThe ability to store and analyze huge amounts of unstructured data promises ongoing opportunities for businesses, academicinstitutions, and government organizations. Intel has shown through this research that significant performance gains are possiblefrom Hadoop through a balanced infrastructure based on well-selected hardware components and the use of the Intel® Distribution forApache Hadoop* software.The results featured in this paper are part of a large and growing body of research being conducted at Intel and elsewhere in theindustry to identify best practices for building and operating Hadoop clusters and other big data solutions, as well as for developingand tuning software to run optimally in those environments. That progress continues to help guide the computing industry towardsimplified, low-cost implementations to drive the future of pervasive real-time analytics.11Big Data Technologies for Near-Real-Time Results
  12. 12. Learn more by visiting the following Data Technologies for Near-Real-Time Results12 1 TeraSort Benchmarks conducted by Intel in December 2012. Custom settings: mapred.reduce.tasks=100 and mapred.job.reuse.jvm.num.tasks=-1.For more information: 2 Cluster configuration: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Cisco Nexus* 5020 10 Gigabit switch. Baseline worker node: SuperMicro SYS-1026T-URF 1U servers with two Intel®Xeon®processors X5690 @ 3.47 GHz, 48 GB RAM, 700 GB 7200 RPM SATA hard drives, Intel®Ethernet Server Adapter I350-T2, Apache Hadoop* 1.0.3, Red HatEnterprise Linux* 6.3, Oracle Java* 1.7.0_05. Upgraded processor and base system in worker node: Dell PowerEdge* R720 2U servers with two Intel®Xeon®processors E5-2690 @ 2.90 GHz, 128 GB RAM. Upgraded storage in worker node: Intel®Solid-State Drive 520 Series. Upgraded network adapter in worker node: Intel®Ethernet Converged Network Adapter X520-DA2. Upgraded software in worker node: Intel®Distribution for Apache Hadoop* software 2.1.1. 3 The Intel®Solid-State Drive 520 Series is currently not validated for data center usage. 4 Solid-state drive performance varies by capacity. 5 Performance measured using Iometer* with Queue Depth 32. Tests conducted in December 2012. 6 Baseline worker node: SuperMicro SYS-1026T-URF 1U servers with two Intel®Xeon®processors X5690 @ 3.47 GHz, 48 GB RAM, 700 GB 7200 RPM SATA hard drives, Intel®Ethernet Server Adapter I350-T2, Apache Hadoop* 1.0.3, Red HatEnterprise Linux* 6.3, Oracle Java* 1.7.0_05. Upgraded processor and base system in worker node: Dell PowerEdge* R720 2U servers with two Intel®Xeon®processors E5-2690 @ 2.90 GHz, 128 GB RAM, 700 GB 7200 RPM SATA hard drives. 7 Baseline storage: 700 GB 7200 RPM SATA hard drives, upgraded storage: Intel®Solid-State Drive 520 Series. 8 Source: Intel internal testing, December 2012. 9 Baseline network adapter: Intel®Ethernet Server Adapter I350-T2, upgraded network adapter: Intel®Ethernet Converged Network Adapter X520-DA2. 10 Upgraded software in worker node: Intel®Distribution for Apache Hadoop* software 2.1.1. Software and workloads used in performance tests may have been optimized for performance only on Intel®microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems,components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,including the performance of that product when combined with other products. For more information go to Results are based on Intel internal testing, using third party benchmark test data and software. Intel does not control or audit the design or implementation of third party benchmark data, software or Web sites referenced in this document. Intelencourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems availablefor purchase. Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and otheroptimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL®PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANYINTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, ANDINTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATINGTO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OFTHE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructionsmarked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes tothem. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects orerrors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales officeor your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, orother Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel’s Web Site Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others. Printed in USA 0313/ME/MESH/PDF Please Recycle 328340-001US