Intel - Nurcan Coskun - Hadoop World 2010

Optimizing Hadoop* Workloads

Nurcan Coskun, Ph.D.
Intel Software & Solutions Group

Learn more @ http://www.cloudera.com/hadoop/


Transcript

  • 1. Optimizing Hadoop* Workloads. Nurcan Coskun, Ph.D., Intel Software & Solutions Group. October 12, 2010. Acknowledgements to Jason Dai, Intel SSG, for test results and optimization techniques. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2010, Intel Corporation.
  • 2. Legal Disclaimers Disclaimers & Legal Notices THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR NONINFRINGEMENT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site http://www.intel.com/.
  • 3. Why Optimize Hadoop Deployments? Handle more data, at lower cost, in less time, with less power.
  • 4. Workload traits drive optimization approach
  • 5. Where to Optimize? • Hardware: equipment, settings • Hadoop / HDFS: version, settings • Software: OS, JVM, settings
  • 6. Server Considerations • 2-socket systems with Intel® Xeon® Processor 5600 series: sweet spot for performance, efficiency, cost • 12-24 GB DDR3: CPU-intensive workloads or HBase may require more • 4-6 1 TB SATA HDDs, 7200 RPM: pure I/O workloads may require more • 1-2 Gigabit Ethernet ports: channel bonding for increased throughput • Energy-efficient components: Gold-certified power supplies, efficient fans, low power
  • 7. Processor Choice Matters • Faster • Handles more data • More energy efficient
  • 8. Processor Choice Impacts Speed • Last year: up to 36% faster • This year: up to 29% faster Data Source: Intel internal measurements. Hadoop 0.19.1 results as of September 20, 2009 and Hadoop 0.20.2 results as of August 8, 2010. Hardware configurations are on slide 21.
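Backup slide 21 reports the raw job running times behind the 2010 comparison, so the percentages can be rechecked. The sketch below is illustrative only: the function names are made up for this example, and reading "faster" as a reduction in running time is an assumption, since the deck does not define the term.

```python
def time_reduction_pct(old_seconds: float, new_seconds: float) -> float:
    """Percentage by which the running time shrinks going from old to new."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


def speedup_pct(old_seconds: float, new_seconds: float) -> float:
    """Percentage by which the work rate rises (old time divided by new time)."""
    return (old_seconds / new_seconds - 1.0) * 100.0


# Single-job running times in seconds, as reported on slide 21
# (Xeon 5500 series vs. Xeon 5600 series, August 2010 measurements).
RUNNING_TIMES = {
    "WordCount": (407.0, 289.0),
    "TeraSort": (2541.0, 2182.0),
}

if __name__ == "__main__":
    for job, (old, new) in RUNNING_TIMES.items():
        print(f"{job}: {time_reduction_pct(old, new):.1f}% less time, "
              f"{speedup_pct(old, new):.1f}% higher rate")
```

Under that reading, WordCount's running time drops by about 29%, which lines up with the "up to 29% faster" figure. The same pair of times corresponds to roughly a 41% higher work rate, so it is worth stating which of the two measures a quoted percentage uses.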
  • 9. Processor Choice Impacts Throughput • Throughput = number of tasks completed per minute when the cluster is at 100% utilization • The Intel Xeon processor 5600 series provides up to 30% more throughput than the 5500 series¹ ¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 22.
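The throughput definition on this slide can be written out as a small calculation. This is a sketch with illustrative function names; the two task rates are the WordCount figures reported on backup slide 22.

```python
def throughput(tasks_completed: int, elapsed_minutes: float) -> float:
    """Tasks completed per minute while the cluster is fully utilized."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return tasks_completed / elapsed_minutes


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Percentage increase of improved over baseline."""
    return (improved / baseline - 1.0) * 100.0


# WordCount tasks completed per minute, from slide 22.
XEON_5500_TASKS_PER_MIN = 71.58
XEON_5600_TASKS_PER_MIN = 93.22

if __name__ == "__main__":
    gain = relative_gain_pct(XEON_5500_TASKS_PER_MIN, XEON_5600_TASKS_PER_MIN)
    print(f"Throughput gain: {gain:.1f}%")
```

With the slide 22 rates, the gain comes out at roughly 30%, consistent with the claim on this slide.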
  • 10. Turn on Intel® Hyper-Threading Technology • Increases performance for threaded applications, delivering greater throughput and responsiveness • Up to 28% better performance¹ [Chart: job running time (lower values are better), SMT off vs. SMT on, for TeraSort and WordCount] ¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 23.
  • 11. Memory & Storage Memory • Equip 1 to 3 GB of RAM per CPU core • ECC memory is highly recommended¹ to detect and correct errors introduced during storage and transmission of data. Hard drives • Run in AHCI mode with NCQ enabled to improve performance under multiple simultaneous reads and writes • Enable the hard drive's write cache ¹ See the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200705.mbox/%3C465C3065.09050501@dragonflymc.com%3E
  • 12. Networking • 1-2 x 1 Gigabit Ethernet per node • Ensure multiple RX/TX queue support for multi-core processors • Enable channel bonding to relieve network-bound workloads if needed – E.g., improves the sort workload by 30% in job running time¹ [Charts: Sort without channel bonding vs. Sort with channel bonding. Each shows Map/Reduce tasks (bootstrap, map, shuffle, sort, reduce, idle), CPU utilization (idle, wait I/O, system, user), disk utilization, and network utilization over time. Annotations: 100% network utilization without channel bonding; I/O improves substantially.] ¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 24.
  • 13. Hadoop Disk Drives • Doubling disk drives: >2x speedup
  • 14. OS • Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations included for energy and threading efficiency – For example, energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux • Optimize Linux* configurations – Linux open file descriptor limit, set in /etc/security/limits.conf • The default of 1024 is too low for Hadoop daemons; try increasing it to approximately 64,000 – In kernel 2.6.28, the epoll file descriptor limit, set in /etc/sysctl.conf • The default of 128 is too low for Hadoop daemons; try increasing it to approximately 4096
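The two limits on this slide translate into one-line entries in the named files. The sketch below only renders those entries as strings and reads the current process limit; it does not modify any system file. Two things are assumptions not stated on the slide: the account name `hadoop`, and the sysctl key `fs.epoll.max_user_instances`, which is a guess at which epoll limit the slide refers to.

```python
import resource  # Unix-only standard library module
from typing import List


def limits_conf_lines(user: str, nofile: int) -> List[str]:
    """Lines for /etc/security/limits.conf raising the open-file limit."""
    return [
        f"{user} soft nofile {nofile}",
        f"{user} hard nofile {nofile}",
    ]


def sysctl_conf_line(max_instances: int) -> str:
    """Line for /etc/sysctl.conf raising the per-user epoll instance limit.

    Assumption: the slide's "epoll file descriptor limit" is this key.
    """
    return f"fs.epoll.max_user_instances = {max_instances}"


def current_open_file_limit() -> int:
    """Soft open-file limit of the current process (what `ulimit -n` shows)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # "hadoop" is a hypothetical account name for the Hadoop daemons.
    print("\n".join(limits_conf_lines("hadoop", 64000)))
    print(sysctl_conf_line(4096))
    print("current soft limit:", current_open_file_limit())
```

Comparing the printed soft limit against the target before starting the daemons is a quick way to confirm that a change to limits.conf has taken effect for a new login session.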
  • 15. JVM (set in hadoop-env.sh) • Prefer the Sun HotSpot Java Runtime Environment • Prefer the 64-bit JVM, version 1.6 update 14 or later • “-server” option – Recommended for Hadoop framework processes (e.g., JobTracker, NameNode) in production deployments • GC-related options for framework processes – E.g., the concurrent mark-sweep collector with eight parallel GC threads: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC • Also set the parameter java.net.preferIPv4Stack to true.
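The JVM flags on this slide end up as a single options string in hadoop-env.sh. The following sketch assembles that string and the corresponding export line. The slide does not name which variable to set; `HADOOP_NAMENODE_OPTS` is used here as an assumed example of a per-daemon options variable, and the helper names are made up for illustration.

```python
from typing import List


def framework_jvm_opts(gc_threads: int = 8) -> List[str]:
    """JVM flags listed on the slide for Hadoop framework processes."""
    return [
        "-server",
        f"-XX:ParallelGCThreads={gc_threads}",
        "-XX:+UseConcMarkSweepGC",
        "-Djava.net.preferIPv4Stack=true",
    ]


def hadoop_env_export(variable: str, opts: List[str]) -> str:
    """Shell line that prepends opts to whatever the variable already holds."""
    joined = " ".join(opts)
    return f'export {variable}="{joined} ${variable}"'


if __name__ == "__main__":
    # Assumption: the NameNode options variable is the one being tuned.
    print(hadoop_env_export("HADOOP_NAMENODE_OPTS", framework_jvm_opts()))
```

Prepending to the existing value, instead of overwriting it, keeps any options that the stock hadoop-env.sh already sets for that daemon.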
  • 16. Data Compression: choosing a proper codec for your I/O-intensive workloads • Compress data wherever possible • Reduces storage footprint • Speeds up I/O-bound workloads • Set mapred.output.compress and/or mapred.compress.map.output to true • Consider the LZO format • TeraSort with LZO compression: • 60% faster than uncompressed • 56% faster than zlib Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 25.
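As a rough illustration of the settings named on this slide, the sketch below renders them in Hadoop's `<configuration>` XML layout. The two boolean properties come from the slide. The codec property and the class `com.hadoop.compression.lzo.LzoCodec` are assumptions: they presume the separately distributed hadoop-lzo library is installed, which the deck does not discuss.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def hadoop_properties_xml(properties: Dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


COMPRESSION_PROPERTIES = {
    # From the slide: compress intermediate map output and final job output.
    "mapred.compress.map.output": "true",
    "mapred.output.compress": "true",
    # Assumption: LZO codec class provided by the hadoop-lzo library.
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}

if __name__ == "__main__":
    print(hadoop_properties_xml(COMPRESSION_PROPERTIES))
```

The rendered properties would typically go into mapred-site.xml or be set per job; which of the two fits depends on whether every workload on the cluster benefits from compression.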
  • 17. Hadoop Configuration Tuning 1. Increase DFS block size • dfs.block.size – HDFS file block size; use a larger block size (such as 128 MB or 256 MB) for large file systems. • E.g., increasing the block size from 128 MB to 256 MB reduces TeraSort running time by 7% 2. Supply enough handlers (HDFS) • dfs.datanode.max.xcievers – the maximum number of threads that can be connected to a data node simultaneously; set a larger number (e.g., 2048) rather than the default value of 256.
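One way to see why a larger block size reduces overhead is to count blocks. The sketch below assumes the common default of one map task per HDFS block, which the slide does not state, and uses a hypothetical 1 TB input; the helper name is illustrative.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file (ceiling division)."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    return (file_bytes + block_bytes - 1) // block_bytes


if __name__ == "__main__":
    input_bytes = 1 * TB  # hypothetical input size
    for size_mb in (128, 256):
        blocks = block_count(input_bytes, size_mb * MB)
        # dfs.block.size takes its value in bytes.
        print(f"dfs.block.size={size_mb * MB}: {blocks} blocks")
```

Doubling the block size halves the number of blocks, and under the one-task-per-block assumption it also halves the number of map tasks that have to be scheduled and started.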
  • 18. For More Detail – See Intel’s Recent Paper
  • 19. Summary • Tune and optimize Hadoop case by case • Most Hadoop applications are data-intensive • Tune the I/O-related subsystems of your application first • Processor choice matters: • X5670 (Westmere) shows 20-40% improvement for CPU-intensive workloads over X5570 (Nehalem)¹ • For I/O-intensive workloads, consider scaling the number of HDDs with core count • Performance tuning tips: • Channel bonding can reduce the network bottleneck for I/O-intensive workloads • Using a larger DFS block size decreases task overhead • Enabling HT shows gains of up to 28% for CPU-intensive workloads² • Using LZO can significantly improve TeraSort results ¹ ² Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations (speed test) are on slide 21. Hardware configurations (Hyper-Threading) are on slide 23.
  • 20. Backup
  • 21. Cluster Configurations Information (Slide: “Processor Choice Impacts Speed”) Source: Intel internal measurement as of September 19, 2009 running Hadoop* WordCount and TeraSort. Intel® Xeon® X5460-based server: Processor: dual-socket quad-core Intel® Xeon® X5460 3.16 GHz. Memory: 16 GB (DDR2 FB-DIMM ECC 667 MHz) RAM. Storage: 1 x 300 GB 15K RPM SAS disk for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results. Network: 1 Gigabit Ethernet NIC. BIOS: version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and adjacent cache-line prefetch disabled. Intel® Xeon® X5570-based server: Processor: dual-socket quad-core Intel® Xeon® X5570 2.93 GHz. Memory: 16 GB (DDR3 ECC 1333 MHz) RAM. Storage: 1 x 1 TB 7200 RPM SATA for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results. Network: 1 Gigabit Ethernet NIC. BIOS: version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled. (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on the Xeon X5460 server according to our benchmarking.) Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort. Results: WordCount single job running time was 407 seconds on the Intel® Xeon® processor 5500 series and 289 seconds on the Intel® Xeon® processor 5600 series. TeraSort single job running time was 2,541 seconds on the Intel Xeon processor 5500 series and 2,182 seconds on the Intel Xeon processor 5600 series. Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled. Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel® Xeon® processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 22. Cluster Configurations Information (Slide: “Processor Choice Impacts Throughput”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort. Results: Total completed tasks per minute of WordCount was approximately 71.58 on the Intel® Xeon® processor 5500 series and approximately 93.22 on the Intel® Xeon® processor 5600 series. Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled. Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 23. Cluster Configurations Information (Slide: “Intel® Hyper-threading Technology”) Source: Intel internal measurement as of August 8, 2010 based on the following cluster and server configuration: 6 nodes (1 NameNode/JobTracker, 5 DataNode/TaskTracker), each configured with 2 GbE connectivity. Intel® Xeon® processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor supporting Intel HT Technology and an Intel HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support Intel HT Technology.
  • 24. Cluster Configurations Information (Slide: “Networking”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort. Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled. Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 25. Cluster Configurations Information (Slide: “Data Compression”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* TeraSort. Results: TeraSort single job running time was 1,477 seconds without compression, 1,256 seconds with default (zlib) compression, and 586 seconds with LZO compression. Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 32 DataNode/TaskTracker; each has one-port 1 GbE connectivity to a single GbE switch. Intel Xeon processor 5500 series servers: 2x Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 4 SATA disks per node (all 4 for HDFS and intermediate results, sharing 1 for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30.10 x86_64). Ext3 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Hadoop version 0.20.1.