Hw09 Optimizing Hadoop Deployments
 

    Presentation Transcript

    • Optimizing Hadoop* Workloads Nurcan Coskun Intel Software & Solutions Group October 2, 2009 Acknowledgements to Jason Dai, Intel SSG, for many of the test results and optimization techniques Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2009, Intel Corporation.
    • Legal Disclaimers Disclaimers & Legal Notices THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR NONINFRINGEMENT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site http://www.intel.com/. (Slide 2)
    • Why Optimize Hadoop Deployments? Handle more data, at lower cost, in less time, with less power. (Slide 3)
    • Where to Optimize? Hardware and software. (Slide 4)
    • Hadoop Servers •Masters: JobTracker, NameNode, Secondary NameNode – Deploy additional RAM and secondary power supplies – Ensure highest performance and reliability •Slaves: DataNodes, TaskTrackers – The Hadoop framework handles slave failures well – Data blocks are replicated and distributed – Workloads may be bound by I/O, memory, or processor resources – System-level hardware should be adjusted on a case-by-case basis (Slide 5)
    • Server Platform •Dual-socket servers are optimal for Hadoop deployments •Dual-socket servers are more efficient than large-scale multi-processor platforms from a per-node, cost-benefit perspective •Dual-socket servers offset their added per-node hardware cost relative to entry-level servers through lower load-balancing and parallelization overheads •Choosing hardware based on the most current platform technologies available helps to ensure optimal intra-server throughput and efficiency (Slide 6)
    • Processor Choice Matters: faster, handles more data, more energy efficient. (Slide 7)
    • Processor Choice Impacts Speed Data Source: Intel internal measurements using Hadoop 0.19.1 as of September 20, 2009. Hardware configurations are on slide 22. (Slide 8)
    • Processor Choice Impacts Throughput •Throughput = number of tasks completed per minute when the cluster is at 100% utilization •The Intel Xeon processor 5500 series provides up to 86% more throughput than the 5400 series. Data Source: Intel internal measurements using Hadoop 0.19.1 as of September 20, 2009. Hardware configurations are on slide 22. (Slide 9)
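As a rough illustration of the throughput definition above, the sketch below computes tasks per minute and the relative gain between two clusters. The function names are hypothetical, and the task counts and durations are placeholder numbers chosen for the example, not Intel's measurements.

```python
def throughput(tasks_completed: int, minutes: float) -> float:
    """Tasks completed per minute while the cluster is fully utilized."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return tasks_completed / minutes


def relative_gain_percent(new: float, old: float) -> float:
    """Percentage by which `new` throughput exceeds `old` throughput."""
    if old <= 0:
        raise ValueError("old throughput must be positive")
    return (new / old - 1.0) * 100.0


# Placeholder numbers for illustration only (not measured data).
baseline = throughput(tasks_completed=500, minutes=10)
candidate = throughput(tasks_completed=930, minutes=10)
gain = relative_gain_percent(candidate, baseline)
print(f"baseline={baseline} tasks/min, candidate={candidate} tasks/min, gain={gain:.0f}%")
```

The comparison is only meaningful when both clusters are measured at full utilization on the same workload, as the definition requires.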
    • Processor Scaling [Figure: two panels, "Intel® Xeon® Processor 5400 Series (Harpertown) Cluster" and "Intel® Xeon® Processor 5500 Series (Nehalem) Cluster". Each plots JavaSort completion time (seconds) against number of nodes (1 to 7), with one series per data set size from 1 GB to 250 GB. Lower values are better.] •Hadoop workloads scale well on Intel processors •The Intel® Xeon® processor 5500 series can handle larger data sizes than the 5400 series. Data Source: Intel internal measurements using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 21. (Slide 10)
    • Turn on Intel® Hyper-threading Technology •Intel® Hyper-threading Technology increases performance for threaded applications, delivering greater throughput and responsiveness •Up to 25% better performance [Figure: "SMT effect in 8 node cluster", Intel® Xeon® Processor 5500 Series (Nehalem). JavaSort completion time (seconds) against data set size (1 GB to 10 GB), SMT on versus SMT off. Lower values are better.] Data Source: Intel internal measurements using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 21. (Slide 11)
    • Memory •Sufficient memory capacity is critical for efficient operation of servers in a Hadoop cluster; it supports high throughput by allowing a large number of map/reduce tasks to run simultaneously •Typical Hadoop applications require approximately 1-2 GB of RAM per processor core, which corresponds to 8-16 GB for a dual-socket server using quad-core processors •Error Correcting Code (ECC) memory is highly recommended to detect and correct errors introduced during storage and transmission of data (Slide 12)
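The per-core sizing rule on this slide is simple arithmetic; a minimal sketch of it follows. The function name and signature are hypothetical, and the 1-2 GB per core default is taken directly from the slide.

```python
from typing import Tuple


def recommended_ram_gb(sockets: int, cores_per_socket: int,
                       gb_per_core: Tuple[float, float] = (1.0, 2.0)) -> Tuple[float, float]:
    """Return the (low, high) RAM range in GB for one server.

    Applies the slide's rule of thumb of roughly 1-2 GB of RAM per core.
    """
    if sockets <= 0 or cores_per_socket <= 0:
        raise ValueError("sockets and cores_per_socket must be positive")
    cores = sockets * cores_per_socket
    low, high = gb_per_core
    return cores * low, cores * high


# Dual-socket server with quad-core processors, as in the slide's example.
print(recommended_ram_gb(sockets=2, cores_per_socket=4))  # (8.0, 16.0)
```

Enabling Hyper-threading raises the number of hardware threads but not the core count, so whether to size memory per core or per thread is a judgment call the slide does not address.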
    • Selecting Server Motherboard •Select server motherboards which are optimized for high density computing environments. – They should use high efficiency voltage regulators – They need to be optimized for airflow – They should use certified power supplies •Optimized server motherboards will use less power, need less cooling, and save money (Slide 13)
    • Hard Disk and SSD •Large number of hard drives per server (4-6) •Hadoop orchestrates data provisioning and redundancy across individual nodes (using RAID 0 is not needed) •SSDs are faster and require very little power; SSD usage will also eliminate the cooling cost created by hard disk drives •Use SSDs: – To store mission critical smaller data sets – To store map/reduce intermediate results – To replace HDDs, reducing power consumption, increasing throughput, and improving performance (Slide 14)
    • Use Intel® X25-E SATA SSD’s [Figure: "10 Node Intel® Xeon® L5520 (Nehalem) Cluster". JavaSort completion time (seconds) against data set size (1 GB, 10 GB, 50 GB, 80 GB, 100 GB), HDD versus SSD. Lower values are better.] Data Source: Intel internal measurements using Hadoop 0.19.0 as of September 20, 2009. Hardware configurations are on slide 23. (Slide 15)
    • System Software •Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations included for energy and threading efficiency – For example, energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux •Optimize Linux* file system configurations – noatime attribute – Open file descriptor limit •Use the latest Java (for example Sun Java* 6u14) – Use 64-bit optimized JVM builds (Slide 16)
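The two file system settings named on this slide can be checked on a node with a short script. The sketch below is one way to do it, not a procedure from the presentation: it parses `/proc/mounts`-style text for the `noatime` option and reads the open file descriptor limit through Python's standard `resource` module. The helper names, the sample mount table, and the `/data/1` mount point are hypothetical.

```python
import resource


def options_for_mount(mounts_text: str, mount_point: str) -> list:
    """Return the mount options for `mount_point` from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field order: device, mount point, file system type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return []


def has_noatime(mounts_text: str, mount_point: str) -> bool:
    """True if the file system at `mount_point` is mounted with noatime."""
    return "noatime" in options_for_mount(mounts_text, mount_point)


def open_file_limit() -> tuple:
    """Soft and hard limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Hypothetical mount table; on a real node, read the text from /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /data/1 ext3 rw,noatime 0 0
"""

print("/data/1 noatime:", has_noatime(SAMPLE_MOUNTS, "/data/1"))
print("/ noatime:", has_noatime(SAMPLE_MOUNTS, "/"))
print("open file limit (soft, hard):", open_file_limit())
```

The script only reports the current state. Changing the mount options or raising the descriptor limit is done in the system configuration and depends on the distribution.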
    • Hadoop Configuration Tuning •The number of NameNode and JobTracker threads (10 -> 64) •The number of DataNode server threads (3 -> 8) •The number of work threads on the HTTP server that runs on each TaskTracker (40-50) •HDFS replication factor (3) •Default HDFS block size (64 MB -> 128 MB) •Maximum number of map/reduce tasks per node – (cores_per_node)/2 -> 2*(cores_per_node) •The number of input streams (files) to be merged at once in map/reduce tasks (example: 100) •JVM settings •The total size of result and metadata buffers associated with a map task (100 MB -> 200 MB) (Slide 17)
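The slide lists tuning targets without naming configuration properties. The sketch below maps them to the property names used by Hadoop 0.19/0.20-era releases and renders a `<configuration>` document. The mapping is an assumption on my part rather than something stated in the presentation, so verify each name against the default configuration files of your release. The sketch also reads the map/reduce task bound as a range from (cores_per_node)/2 to 2*(cores_per_node) and applies the same value to map and reduce slots; both are interpretations. JVM settings are left out because the slide gives no values.

```python
import xml.etree.ElementTree as ET


def tuned_properties(cores_per_node: int, max_tasks: int) -> dict:
    """Build tuned property values following the slide's suggestions.

    Property names are assumed from Hadoop 0.19/0.20 defaults files.
    """
    low, high = cores_per_node / 2, 2 * cores_per_node
    if not low <= max_tasks <= high:
        raise ValueError(f"max_tasks should be between {low} and {high}")
    return {
        "dfs.namenode.handler.count": 64,          # NameNode threads (10 -> 64)
        "mapred.job.tracker.handler.count": 64,    # JobTracker threads (10 -> 64)
        "dfs.datanode.handler.count": 8,           # DataNode threads (3 -> 8)
        "tasktracker.http.threads": 50,            # TaskTracker HTTP threads (40-50)
        "dfs.replication": 3,                      # HDFS replication factor
        "dfs.block.size": 128 * 1024 * 1024,       # 128 MB block size, in bytes
        "mapred.tasktracker.map.tasks.maximum": max_tasks,
        "mapred.tasktracker.reduce.tasks.maximum": max_tasks,
        "io.sort.factor": 100,                     # streams merged at once
        "io.sort.mb": 200,                         # map-side sort buffer, in MB
    }


def render_configuration(props: dict) -> str:
    """Render properties as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example: a dual-socket quad-core node (8 cores) with 8 task slots per type.
print(render_configuration(tuned_properties(cores_per_node=8, max_tasks=8)))
```

Depending on the release, these properties belong in a single site file or are split across HDFS and MapReduce site files; the sketch emits one document and leaves that split to the reader.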
    • System-stack Example •Two-way Intel® Xeon® processor 5500 series •Intel® X25-E SATA SSDs •Four to six 7200 RPM SATA drives •12-24 GB DDR3 ECC RAM •Intel® Server Board S5500WB •80 PLUS* Gold Certified power supplies •Linux* based on kernel 2.6.30 or later •Sun Java* 6u14 or later •Hadoop* (0.18.3 or 0.20.0) (Slide 18)
    • Summary Hardware selection: •Intel® Xeon® 5500 (“Nehalem”) improves Hadoop workload performance •Choosing an optimized server board such as Intel® SB5500WB (“WillowBrook”) can reduce power consumption •Use Intel® X25-E SATA SSDs to improve performance Software & configurations: •Use the latest Linux kernel •Turn on Intel® Hyper-threading •Optimize the Hadoop configuration •Tuning may differ between workload types (Slide 19)
    • References: 1. http://www.intel.com/p/en_US/products/server/processor 2. http://www.intel.com/it/pdf/server-rightsizing.pdf 3. http://www.80plus.org/ 4. https://opencirrus.org/content/agenda-open-cirrus-summit-palo-alto-june-8-9-2009 (Slide 20)
    • Cluster Configurations Information (Slides: “Processor Scaling” and “Turn on Intel® Hyper-threading”) Hardware configuration (Slide 21):
        Item | Endeavor | Atlantis
        Node count | 1-10 nodes | 1-10 nodes
        Platform | Intel SR1600UR, Intel S5520UR main board, 1U chassis | Intel SR1560SF system, Intel S5400SF main board, 1U chassis
        CPU/Stepping | Intel® Xeon® X5560 C1 step (Nehalem EP); 2.8GHz / 6.4 QPI 1333 95 W; 1MB L2 cache, 8M L3 cache | Intel® Xeon® X5482; C0 step (Harpertown); 3.2 GHz / 12 MB L2 cache
        RAM | 24 GB total/node; 6*4GB 1333MHz Reg ECC DDR3 | 16 GB (FBDIMM 8x2-GB 667MHz)
        Chipset | Tylersburg | Seaburg
        BIOS Version | Rev 26, 08 Apr 2008 | Rev 22.1, 7 Nov 2007
        Interconnects | Gigabit Ethernet, QDR InfiniBand | Gigabit Ethernet, DDR InfiniBand
        Hard drive specs | Seagate Cheetah NS 400 GB SAS HDD 10kRPM, Model: ST3400755SS | Seagate Barracuda ES 250 GB SATA HDD, Model: ST3250620NS
        Using onboard Intel Entry Level RAID controller
    • Cluster Configurations Information (Slides: “Processor Choice Impacts Speed” and “Processor Choice Impacts Throughput”) (Slide 22)
        Intel® Xeon® X5460-based server
        – Processor: Dual-socket quad-core Intel® Xeon® X5460 3.16GHz Processor
        – Memory: 16GB (DDR2 FBDIMM ECC 667MHz) RAM
        – Storage: 1 X 300GB 15K RPM SAS disk for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
        – Network: 1 Gigabit Ethernet NIC
        – BIOS: BIOS version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and adjacent cache-line prefetch disabled
        Intel® Xeon® X5570-based server
        – Processor: Dual-socket quad-core Intel® Xeon® X5570 2.93GHz Processor
        – Memory: 16GB (DDR3 ECC 1333MHz) RAM
        – Storage: 1 X 1TB 7200RPM SATA for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
        – Network: 1 Gigabit Ethernet NIC
        – BIOS: BIOS version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled
        (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on the Xeon X5460 server according to our benchmarking.)
    • Cluster Configurations Information (Slides: “Use Intel® X25-E SATA SSD’s”) (Slide 23)
        Slaves:
        – Intel® Xeon® L5520 Processor (Nehalem) @ 2.27 GHz CPUs, 5.8 GB/sec QPI, 24 GB RAM
        – Server Board: Intel® SB5500WB (Willowbrook)
        – 1x 1 TB SATA HDD boot disk, holds ${HOME} dirs: /
        – 2x 1 TB SATA HDD scratch/experiment disks
        – 2x 64 GB Intel® X25-E SATA SLC SSD scratch/experiment disks
        – OS: Ubuntu* 9.04, 2.6.28-4 kernel (to enable power saving with preserved performance)
        Master:
        – Intel® Xeon® Processor 2.93 GHz CPUs, 6.4 GB/sec QPI, 16 GB RAM
        – Server Board: Intel® SB5500WB (Willowbrook)
        – Hard Disks: 1x 500 GB SATA OS boot disk (/dev/sda1), holds installed software and ${HOME} dirs; 2x 500 GB SATA scratch disks; 2x 64 GB Intel® X25-E SATA SLC SSDs
        – OS: RedHat* Enterprise Linux 5.3 Server x64t