Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may
be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2010, Intel Corporation.
Optimizing Hadoop* Workloads
Nurcan Coskun, Ph.D.
Intel Software & Solutions Group
October 12, 2010
Acknowledgements to Jason Dai, Intel SSG, for
of the test results and optimization techniques
Legal Disclaimers
Disclaimers & Legal Notices
THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD
NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR
LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE
PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY
EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR
NONINFRINGEMENT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate
performance of Intel products as measured by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance
of systems or components they are considering purchasing. For more information on performance tests and on the
performance of Intel products, visit Intel Performance Benchmark Limitations
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR
IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT
AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED
IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE
FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the
absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The
information here is subject to change without notice. Do not finalize a design with this information. The products described in
this document may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to
obtain the latest specifications and before placing your product order. Copies of documents which have an order number and
are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's
Web Site http://www.intel.com/.
Why Optimize Hadoop Deployments?
• Handle more data
• At lower cost
• In less time
• With less power
Workload traits drive optimization approach
Where to Optimize?
• Hardware: equipment, settings
• Hadoop / HDFS: version, settings
• Software: OS, JVM, settings
Server Considerations
• 2-socket systems with Intel® Xeon® processor 5600 series
– Sweet spot for performance, efficiency, cost
• 12-24 GB DDR3
– CPU-intensive workloads or HBase may require more
• 4-6 1 TB 7200 RPM SATA HDDs
– Pure I/O workloads may require more
• 1-2 x 1 Gigabit Ethernet
– Channel bonding for increased throughput
• Energy-efficient components
– Gold certified power supplies, efficient fans, low power
Processor Choice Matters
Faster
Handles More Data
More Energy Efficient
Processor Choice Impacts Speed
Data Source: Intel internal measurements. Hadoop 0.19.1 results as of September 20, 2009 and Hadoop 0.20.2 results as of August 8, 2010.
Hardware configurations are on slide 21. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
• Last year: up to 36% faster
• This year: up to 29% faster
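The "this year" figure can be checked against the single-job running times listed in the cluster configuration notes for this slide (slide 21). The following is a minimal Python sketch, not Intel's methodology: the helper name `time_reduction` is ours, and it assumes "faster" means the percentage reduction in job running time.

```python
def time_reduction(old_seconds: float, new_seconds: float) -> float:
    """Percentage reduction in job running time when going from old to new."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Single-job running times in seconds: (Xeon 5500 series, Xeon 5600 series),
# taken from the results quoted on slide 21.
RUNNING_TIMES = {
    "WordCount": (407, 289),
    "TeraSort": (2541, 2182),
}

for job, (t_5500, t_5600) in RUNNING_TIMES.items():
    print(f"{job}: {time_reduction(t_5500, t_5600):.0f}% less running time")
```

Under that assumption WordCount comes out at roughly 29%, which matches the "up to 29%" claim, while TeraSort improves by roughly 14%.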
Processor Choice Impacts Throughput
• Throughput = number of tasks completed per minute when the cluster is at 100% utilization
• The Intel Xeon processor 5600 series provides up to 30% more throughput than the 5500 series (1)
1. Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations are on slide 22. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
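The throughput definition and the "up to 30%" figure can be sketched in a few lines of Python. The function names are illustrative; the two tasks-per-minute values are the WordCount results quoted in the cluster configuration notes on slide 22.

```python
def throughput(tasks_completed: int, elapsed_minutes: float) -> float:
    """Tasks completed per minute over a measurement window."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return tasks_completed / elapsed_minutes


def relative_gain(baseline: float, improved: float) -> float:
    """Percentage increase of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# WordCount tasks completed per minute, from slide 22.
XEON_5500_TASKS_PER_MIN = 71.58
XEON_5600_TASKS_PER_MIN = 93.22

gain = relative_gain(XEON_5500_TASKS_PER_MIN, XEON_5600_TASKS_PER_MIN)
print(f"Xeon 5600 vs. 5500: {gain:.1f}% more tasks per minute")
```

With those two measurements the gain works out to about 30.2%.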
Turn on Intel® Hyper-threading Technology
Intel® Hyper-threading Technology
• Increases performance for threaded applications, delivering greater throughput and responsiveness
• Up to 28% better performance (1)
1. Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations are on slide 23. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
[Chart: job running time for Terasort and Wordcount, SMT OFF vs. SMT ON (lower values are better)]
Memory & Storage
Memory
• Equip 1~3 GB of RAM per CPU core
• ECC memory is highly recommended (1), to detect and correct errors introduced during storage and transmission of data
Hard drives
• Run in AHCI mode with NCQ enabled to improve the performance of multiple simultaneous reads and writes
• Enable the hard drive’s write cache
1. See the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200705.mbox/%3C465C3065.09050501@dragonflymc.com%3E
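The 1~3 GB per core guideline translates into a per-node memory range once the core count is known. A small Python sketch, where the function name and defaults are ours and only the 1 and 3 GB bounds come from the slide:

```python
def ram_range_gb(cores_per_node: int,
                 low_gb_per_core: float = 1.0,
                 high_gb_per_core: float = 3.0) -> tuple:
    """Return the (minimum, maximum) RAM in GB suggested for one node."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    return (cores_per_node * low_gb_per_core,
            cores_per_node * high_gb_per_core)


# Core counts of the two server types used in the backup slides.
for label, cores in (("2x X5570 (8 cores)", 8), ("2x X5670 (12 cores)", 12)):
    low, high = ram_range_gb(cores)
    print(f"{label}: {low:.0f}-{high:.0f} GB RAM")
```

The 24 GB fitted to the test servers described in the backup slides falls inside both ranges.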
Networking
• 1-2 x 1 Gigabit Ethernet per node
• Ensure multiple RX/TX queue support for multi-core processors
• Enable channel bonding to resolve network-bound workloads if needed
– E.g., improves sort workload job running time by 30% (1)
[Charts: Map/Reduce tasks (bootstrap, map, shuffle, sort, reduce, idle), CPU utilization (idle, wait I/O, system, user), disk utilization, and network utilization over time, shown for "Sort – no channel bonding" and "Sort – channel bonding". Callouts: "100% network utilization without channel bonding"; "I/O improves substantially".]
1. Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 24. Performance tests
and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Hadoop Disk Drives
Doubling disk drives → >2x speedup
OS
• Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations included for energy and threading efficiency
– For example, energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux
• Optimize Linux* configurations
– Raise the Linux open file descriptor limit using /etc/security/limits.conf
• The default of 1024 is too low for Hadoop daemons; try increasing it to approximately 64,000
– In kernel 2.6.28, raise the epoll file descriptor limit using /etc/sysctl.conf
• The default of 128 is too low for Hadoop daemons; try increasing it to approximately 4096
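As a rough illustration of the two settings above, the Python sketch below checks the open-file limit of the current process and prints candidate lines for the two configuration files. Several details are assumptions rather than something the slide states: the account name `hadoop`, the use of matching soft and hard limits, and the sysctl key `fs.epoll.max_user_instances`, which is our reading of "epoll file descriptor limit" for kernel 2.6.28.

```python
import resource  # Unix-only standard library module

RECOMMENDED_NOFILE = 64000  # "approximately 64,000" from the slide
RECOMMENDED_EPOLL = 4096    # "approximately 4096" from the slide


def limits_conf_lines(user: str, nofile: int = RECOMMENDED_NOFILE) -> list:
    """Candidate /etc/security/limits.conf entries for one account."""
    return [f"{user} soft nofile {nofile}", f"{user} hard nofile {nofile}"]


def sysctl_conf_line(value: int = RECOMMENDED_EPOLL) -> str:
    """Candidate /etc/sysctl.conf entry; the key name is an assumption."""
    return f"fs.epoll.max_user_instances = {value}"


def nofile_is_sufficient(target: int = RECOMMENDED_NOFILE) -> bool:
    """True if this process's soft open-file limit already meets the target."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= target


print("open-file limit sufficient:", nofile_is_sufficient())
for line in limits_conf_lines("hadoop"):  # "hadoop" is a hypothetical account
    print(line)
print(sysctl_conf_line())
```

The script only prints suggestions and changes nothing on the system. Verify the sysctl key against the kernel actually in use before applying it.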
JVM
JVM (set in hadoop-env.sh)
• Prefer the Sun HotSpot Java Runtime Environment
• Prefer the 64-bit JVM, version 1.6 update 14 or later
• “-server” option
– Recommended for Hadoop framework processes (e.g., JobTracker, NameNode) in production deployments
• Specific GC-related options for framework processes
– E.g., the concurrent mark-sweep collector with 8 parallel GC threads: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
• Also set the parameter java.net.preferIPv4Stack to true
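One way to keep these flags consistent across daemons is to generate the hadoop-env.sh lines from a single list. The Python sketch below assembles the options named on the slide. Note that applying them through the `HADOOP_NAMENODE_OPTS` and `HADOOP_JOBTRACKER_OPTS` variables is our assumption about how a Hadoop 0.20 hadoop-env.sh is laid out, not something the slide specifies.

```python
def jvm_opts(gc_threads: int = 8) -> str:
    """Join the JVM flags from the slide into one option string."""
    opts = [
        "-server",
        f"-XX:ParallelGCThreads={gc_threads}",
        "-XX:+UseConcMarkSweepGC",
        "-Djava.net.preferIPv4Stack=true",
    ]
    return " ".join(opts)


def export_line(variable: str, opts: str) -> str:
    """Shell line that prepends opts to any existing value of the variable."""
    return f'export {variable}="{opts} ${variable}"'


# Variable names are assumptions about the hadoop-env.sh in use.
for var in ("HADOOP_NAMENODE_OPTS", "HADOOP_JOBTRACKER_OPTS"):
    print(export_line(var, jvm_opts()))
```

The generated lines would be pasted into hadoop-env.sh by hand. The thread count of 8 is the slide's example and would normally be tuned to the node's core count.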
Data Compression
Choosing a proper codec for your I/O-intensive workloads
• Compress data wherever possible
– Reduces storage footprint
– Speeds up I/O-bound workloads
• Set mapred.output.compress and/or mapred.compress.map.output to true
• Consider LZO format
• Terasort with LZO compression:
– 60% faster than uncompressed
– 56% faster than zlib
Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 25. Performance tests
and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
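The compression properties named on this slide would normally go into a Hadoop XML configuration file. The Python sketch below generates such a fragment. The two boolean properties come from the slide. The codec property `mapred.map.output.compression.codec` and the class name `com.hadoop.compression.lzo.LzoCodec` are assumptions on our part: LZO support ships separately from Hadoop, so the class depends on which LZO library is installed.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties: dict) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


COMPRESSION_PROPERTIES = {
    # From the slide: compress job output and intermediate map output.
    "mapred.output.compress": "true",
    "mapred.compress.map.output": "true",
    # Assumption: property and class names depend on the installed LZO library.
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}

xml_text = build_configuration(COMPRESSION_PROPERTIES)
print(xml_text)
```

Generating the fragment from a dictionary keeps the settings in one reviewable place. The same helper could emit other properties from this deck.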
Hadoop Configuration Tuning
1. Increase DFS block size
• dfs.block.size
– HDFS file block size; use a larger block size (such as 128M or 256M) for large file systems
• E.g., increasing the block size from 128M to 256M reduces Terasort running time by 7%
2. Supply enough handlers (HDFS)
• dfs.datanode.max.xcievers
– The maximum number of threads that can be connected to a data node simultaneously; set a larger number (e.g., 2048) rather than the default value of 256
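A quick calculation shows why a larger block size reduces per-task overhead. The Python sketch below assumes one map task per HDFS block, which is a simplification of the default file splitting, and uses a hypothetical 1 TiB input. It also prints the value in bytes, on the assumption that dfs.block.size is specified that way.

```python
MIB = 1024 * 1024
TIB = 1024 ** 4


def block_size_bytes(mebibytes: int) -> int:
    """Block size in bytes, assuming dfs.block.size is given in bytes."""
    return mebibytes * MIB


def map_task_count(input_bytes: int, block_bytes: int) -> int:
    """Number of blocks (and, by assumption, map tasks), rounded up."""
    return -(-input_bytes // block_bytes)  # integer ceiling division


INPUT_BYTES = 1 * TIB  # hypothetical input size, not from the slides

for size_mib in (128, 256):
    size = block_size_bytes(size_mib)
    tasks = map_task_count(INPUT_BYTES, size)
    print(f"dfs.block.size={size} ({size_mib}M): {tasks} map tasks")
```

Doubling the block size halves the number of tasks for the same input, so the fixed startup and scheduling cost per task is paid half as often.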
For More Detail – See Intel’s Recent Paper
Summary
• Tune and optimize Hadoop case by case
• Most Hadoop applications are data-intensive
• Tune your I/O-related application subsystem first
• Processor choice matters:
– X5670 (Westmere) shows 20-40% improvement for CPU-intensive workloads over X5570 (Nehalem) (1)
– For I/O-intensive workloads, consider scaling HDD with core count
• Performance tuning tips:
– Channel bonding can reduce the network bottleneck for I/O-intensive workloads
– Using a larger DFS block size decreases task overhead
– Enabling HT shows gains of up to 28% for CPU-intensive workloads (2)
– Using LZO can significantly improve TeraSort results
1, 2. Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations (speed test) are on slide 21. Hardware configurations (HyperThreading) are on slide 23. Performance tests and ratings are
measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those
tests. Any difference in system hardware or software design or configuration may affect actual performance.
Backup
Cluster Configurations Information
(Slide: “Processor Choice Impacts Speed”)
Source: Intel internal measurement as of September 19 2009 running Hadoop*, WordCount, and TeraSort
Intel® Xeon® X5460-based server
Processor: Dual-socket quad-core Intel® Xeon® X5460 3.16GHz
Memory: 16GB (DDR2 FB-DIMM ECC 667MHz) RAM
Storage: 1 X 300GB 15K RPM SAS disk for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and adjacent cache-line prefetch disabled
Intel® Xeon® X5570-based server
Processor: Dual-socket quad-core Intel® Xeon® X5570 2.93GHz
Memory: 16GB (DDR3 ECC 1333MHz) RAM
Storage: 1 X 1TB 7200RPM SATA for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled. (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on the Xeon X5460 server according to our benchmarking.)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Results: WordCount single job running time was 407 seconds on the Intel® Xeon® processor 5500 series and 289 seconds on the Intel® Xeon® processor 5600
series. TeraSort single job running time was 2,541 seconds on the Intel Xeon processor 5500 series and 2,182 seconds on the Intel Xeon processor
5600 series.
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a single GbE switch with channel bonding enabled.) Intel
Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3
RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST
(Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel® Xeon® processor X5570 2.93
GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and log files with
isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line
prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system,
mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14, Java SE Runtime Environment, Java HotSpot* 64-bit server virtual
machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
(Slide 21)
Cluster Configurations Information
(Slide: “Processor Choice Impacts Throughput”)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Results: Total completed tasks per minute of WordCount over Intel® Xeon® processor 5500 series
was approximately 71.58, and over Intel® Xeon® processor 5600 series was approximately 93.22.
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a
single GbE switch with channel bonding enabled.) Intel Xeon processor 5600 series servers: HP
ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB
DDR3 RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system
and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo
mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6
Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA
disks per node (All six for HDFS and intermediate results, sharing one for system and log files with
isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled.
Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading
Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file
system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14, Java SE
Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop
[hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
(Slide 22)
Cluster Configurations Information
(Slide: “Intel® Hyper-threading Technology”)
Source: Intel internal measurement as of August 8, 2010 based on the following cluster and server
configuration: 6 nodes (1 NameNode/JobTracker, 5 DataNode/TaskTracker) in each, configured with
2GbE connectivity to each server. Intel® Xeon® processor 5600 series servers: HP ProLiant* z6000
G6 Server 2 x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6
SATA disks per node (All six for HDFS and intermediate results, sharing one for file system and log
files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor
supporting Intel HT Technology and an Intel HT Technology-enabled chipset, BIOS, and operating
system. Performance will vary depending on the specific hardware and software you use.
See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on
which processors support Intel HT Technology.
(Slide 23)
Cluster Configurations Information
(Slide: “Networking”)
(Slide 24)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a single
GbE switch with channel bonding enabled.) Intel Xeon processor 5600 series servers: HP ProLiant*
z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3
RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and
log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading
Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x
Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node
(All six for HDFS and intermediate results, sharing one for system and log files with isolated
partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both
hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology
enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system,
mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14, Java SE Runtime
Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop
[hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
Cluster Configurations Information
(Slide: “Data Compression”)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* TeraSort.
Results: TeraSort single job running time was 1477 seconds without compression, 1256 seconds with
default(zlib) compression, and 586 seconds with LZO compression.
Hardware, cluster configuration, and settings were as follows: (1 NameNode/JobTracker + 32
DataNode/TaskTracker; each has 1 port 1 GbE connectivity to a single GbE switch) Intel Xeon
processor 5500 series servers: 2x Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3
RAM, 4 SATA disks per node (All 4 for HDFS and intermediate results, sharing 1 for system and log
files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading
Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30.10 x86_64). Ext3
file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14, Java
SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Hadoop version 0.20.1.
(Slide 25)

More Related Content

Similar to Intel - Nurcan Coskun - Hadoop World 2010

Hw09 Optimizing Hadoop Deployments
Hw09   Optimizing Hadoop DeploymentsHw09   Optimizing Hadoop Deployments
Hw09 Optimizing Hadoop Deployments
Cloudera, Inc.
 
Hw09 Optimizing Hadoop Deployments
Hw09   Optimizing Hadoop DeploymentsHw09   Optimizing Hadoop Deployments
Hw09 Optimizing Hadoop Deployments
Cloudera, Inc.
 
Intel Mobile Launch Information
Intel Mobile Launch InformationIntel Mobile Launch Information
Intel Mobile Launch Information
Anna Yovka
 
AI & Computer Vision (OpenVINO) - CPBR12
AI & Computer Vision (OpenVINO) - CPBR12AI & Computer Vision (OpenVINO) - CPBR12
AI & Computer Vision (OpenVINO) - CPBR12
Jomar Silva
 
Intel HPC Update
Intel HPC UpdateIntel HPC Update
Intel HPC Update
IBM Danmark
 
Intel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data CenterIntel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data Center
Dr. Wilfred Lin (Ph.D.)
 
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura IntelTDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
tdc-globalcode
 
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
Intel - Nurcan Coskun - Hadoop World 2010

  • 1. Optimizing Hadoop* Workloads. Nurcan Coskun, Ph.D., Intel Software & Solutions Group. October 12, 2010. Acknowledgements to Jason Dai, Intel SSG, for of the test results and optimization techniques. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2010, Intel Corporation.
  • 2. Legal Disclaimers. Disclaimers & Legal Notices: THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR NONINFRINGEMENT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site http://www.intel.com/.
  • 3. Why Optimize Hadoop Deployments? Handle more data, at lower cost, in less time, with less power.
  • 4. Workload traits drive optimization approach.
  • 5. Where to Optimize? Hardware: equipment, settings. Hadoop / HDFS: version, settings. Software: OS, JVM, settings.
  • 6. Server Considerations. Two-socket systems with Intel® Xeon® Processor 5600 series: sweet spot for performance, efficiency, and cost. 12-24 GB DDR3: CPU-intensive workloads or HBase may require more. 4-6 x 1 TB SATA HDD, 7200 RPM: pure I/O workloads may require more. 1-2 x 1 Gigabit Ethernet: channel bonding for increased throughput. Energy-efficient components: Gold certified power supplies, efficient fans, low power
  • 7. Processor Choice Matters: faster, handles more data, more energy efficient.
  • 8. Processor Choice Impacts Speed. Last year: up to 36% faster. This year: up to 29% faster. Data source: Intel internal measurements. Hadoop 0.19.1 results as of September 20, 2009 and Hadoop 0.20.2 results as of August 8, 2010. Hardware configurations are on slide 21.
  • 9. Processor Choice Impacts Throughput. Throughput = number of tasks completed per minute when the cluster is at 100% utilization. The Intel Xeon processor 5600 series provides up to 30% more throughput than the 5500 series. Data source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 22.
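The throughput definition on slide 9 can be written out as a small calculation. This is a minimal sketch, not the measurement harness used for the slides: the function names and the task counts are hypothetical, chosen only to show how a "percent more throughput" figure is derived from two runs on a fully utilized cluster.

```python
def throughput_per_minute(tasks_completed: int, elapsed_seconds: float) -> float:
    """Tasks completed per minute over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tasks_completed * 60.0 / elapsed_seconds


def relative_gain(new: float, baseline: float) -> float:
    """Fractional improvement of `new` over `baseline` (0.30 means 30% more)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical task counts over a 10-minute window, for illustration only.
# These are NOT the measurements behind the slide.
baseline = throughput_per_minute(tasks_completed=1200, elapsed_seconds=600)
candidate = throughput_per_minute(tasks_completed=1560, elapsed_seconds=600)
print(f"baseline:  {baseline:.1f} tasks/min")
print(f"candidate: {candidate:.1f} tasks/min")
print(f"gain:      {relative_gain(candidate, baseline):.0%}")
```

Both runs should use the same window and the same workload mix; otherwise the ratio compares different things.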
  • 10. Turn on Intel® Hyper-threading Technology
      Intel® Hyper-threading Technology increases performance for threaded applications, delivering greater throughput and responsiveness.
      Up to 28% better performance¹
      [Chart: job running time for TeraSort and WordCount, SMT off vs. SMT on; lower values are better]
      ¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 23.
  • 11. Memory & Storage
      Memory
      • Provision 1 to 3 GB of RAM per CPU core
      • ECC memory is highly recommended¹ to detect and correct errors introduced during storage and transmission of data
      Hard drives
      • Run in AHCI mode with NCQ enabled to improve performance under multiple simultaneous reads and writes
      • Enable the hard drive’s write cache
      ¹ See the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200705.mbox/%3C465C3065.09050501@dragonflymc.com%3E
  • 12. Networking
      • 1 to 2 x 1 Gigabit Ethernet per node
      • Ensure multiple RX/TX queue support for multi-core processors
      • Enable channel bonding to relieve network-bound workloads if needed
        – E.g., improves the sort workload by 30% in job running time¹
      [Charts: two runs, “Sort – no channel bonding” and “Sort – channel bonding”, each showing Map/Reduce task phases (bootstrap, map, shuffle, sort, reduce, idle), CPU utilization (idle, wait I/O, system, user), disk utilization, and network utilization over time. Annotations: “100% network utilization without channel bonding”; “I/O improves substantially”.]
      ¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 24.
  • 13. Hadoop Disk Drives: doubling the disk drives gives a >2x speedup
  • 14. OS
      • Use a Linux* distribution based on kernel version 2.6.30 or later, because of the optimizations it includes for energy and threading efficiency
        – For example, energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux
      • Optimize the Linux* configuration
        – Linux open file descriptor limit, set in /etc/security/limits.conf
          • The default of 1024 is too low for Hadoop daemons; try increasing it to approximately 64,000
        – In kernel 2.6.28, the epoll file descriptor limit, set in /etc/sysctl.conf
          • The default of 128 is too low for Hadoop daemons; try increasing it to approximately 4096
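The two limits above can be checked on a node before changing any configuration. The following is a minimal sketch, assuming a Linux host and Python's standard `resource` module. The `/proc` path for the epoll limit (the `fs.epoll.max_user_instances` sysctl) is an assumption not stated on the slide, and it may be absent on kernels other than the one the slide names, so the sketch treats a missing file as "unknown". All function names are hypothetical.

```python
import resource
from pathlib import Path
from typing import Optional

# Targets taken from the slide; treat them as starting points, not hard rules.
RECOMMENDED_NOFILE = 64000
RECOMMENDED_EPOLL_INSTANCES = 4096

# Assumed location of the epoll limit (sysctl fs.epoll.max_user_instances).
EPOLL_LIMIT_PATH = Path("/proc/sys/fs/epoll/max_user_instances")


def open_file_soft_limit() -> int:
    """Soft limit on open file descriptors for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def epoll_instance_limit(path: Path = EPOLL_LIMIT_PATH) -> Optional[int]:
    """Per-user epoll instance limit, or None if the kernel does not expose it."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def is_too_low(current: Optional[int], recommended: int) -> bool:
    """True when a known, finite limit falls below the recommended value."""
    if current is None or current == resource.RLIM_INFINITY:
        return False
    return current < recommended


if __name__ == "__main__":
    nofile = open_file_soft_limit()
    epoll = epoll_instance_limit()
    print("open files:", nofile, "(too low)" if is_too_low(nofile, RECOMMENDED_NOFILE) else "(ok)")
    print("epoll instances:", epoll, "(too low)" if is_too_low(epoll, RECOMMENDED_EPOLL_INSTANCES) else "(ok or unknown)")
```

If a limit turns out to be too low, the corresponding entries would look something like `hadoop - nofile 64000` in /etc/security/limits.conf (where `hadoop` is a hypothetical account name for the user running the daemons) and `fs.epoll.max_user_instances = 4096` in /etc/sysctl.conf.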
  • 15. JVM (set in hadoop-env.sh)
      • Prefer the Sun HotSpot Java Runtime Environment
      • Prefer the 64-bit JVM, version 1.6 update 14 or later
      • “-server” option
        – Recommended for Hadoop framework processes (e.g., JobTracker, NameNode) in production deployments
      • GC-related options for framework processes
        – E.g., the concurrent mark-sweep collector with 8 parallel GC threads: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
      • Also set the parameter java.net.preferIPv4Stack to true
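To show how the options on this slide might be combined, the sketch below assembles them into an export line suitable for hadoop-env.sh. The flags themselves are the ones listed on the slide. The variable names `HADOOP_NAMENODE_OPTS` and `HADOOP_JOBTRACKER_OPTS` are assumed from the stock hadoop-env.sh of this Hadoop generation and should be checked against the installed file. The helper names are hypothetical.

```python
from typing import List


def framework_jvm_flags(gc_threads: int = 8) -> List[str]:
    """JVM flags from the slide for framework daemons (JobTracker, NameNode)."""
    if gc_threads < 1:
        raise ValueError("gc_threads must be at least 1")
    return [
        "-server",
        f"-XX:ParallelGCThreads={gc_threads}",
        "-XX:+UseConcMarkSweepGC",
        "-Djava.net.preferIPv4Stack=true",
    ]


def export_line(variable: str, flags: List[str]) -> str:
    """Render a shell export that prepends the flags to an existing variable."""
    joined = " ".join(flags)
    return f'export {variable}="{joined} ${variable}"'


# Assumed variable names; verify them in your own hadoop-env.sh.
for daemon_variable in ("HADOOP_NAMENODE_OPTS", "HADOOP_JOBTRACKER_OPTS"):
    print(export_line(daemon_variable, framework_jvm_flags()))
```

The thread count of 8 is the slide's example value; a different count may suit nodes with a different number of cores.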
  • 16. Data Compression: choosing a proper codec for your I/O-intensive workloads
      • Compress data wherever possible
        – Reduces storage footprint
        – Speeds up I/O-bound workloads
      • Set mapred.output.compress and/or mapred.compress.map.output to true
      • Consider the LZO format
        – TeraSort with LZO compression:
          • 60% faster than uncompressed
          • 56% faster than zlib
      Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 25.
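The two switches named above are ordinary Hadoop configuration properties. As a rough sketch, the snippet below renders them in the `<configuration>/<property>` XML layout that Hadoop site files use. Placing them in mapred-site.xml is an assumption, and the slide does not say which codec property to set for LZO, so codec selection is left out. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_properties(properties: Dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The two compression switches named on the slide.
compression = {
    "mapred.output.compress": "true",
    "mapred.compress.map.output": "true",
}

print(render_properties(compression))
```

Enabling only `mapred.compress.map.output` compresses intermediate map output without changing the format of the final job output, which can be a lower-risk first step.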
  • 17. Hadoop Configuration Tuning
      1. Increase the DFS block size
        • dfs.block.size: the HDFS file block size; use a larger block size (such as 128M or 256M) for a large file system
        • E.g., increasing the block size from 128M to 256M reduces TeraSort running time by 7%
      2. Supply enough handlers (HDFS)
        • dfs.datanode.max.xcievers: the maximum number of threads that can be connected to a data node simultaneously; set a larger number (e.g., 2048) rather than the default value of 256
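One way to see why a larger block size helps is to count blocks. The sketch below assumes the common default in which each HDFS block of an input file becomes one map task, so fewer blocks mean fewer tasks to schedule and start. The 1 TiB input size is a hypothetical example, not a figure from the deck.

```python
MIB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed for a file (ceiling division)."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# Hypothetical 1 TiB input file.
input_size = 1024 * 1024 * MIB

for block_mib in (128, 256):
    blocks = block_count(input_size, block_mib * MIB)
    print(f"dfs.block.size = {block_mib}M -> {blocks} blocks")
# prints:
# dfs.block.size = 128M -> 8192 blocks
# dfs.block.size = 256M -> 4096 blocks
```

Under that assumption, doubling the block size halves the number of map tasks for the same input, which is the task-overhead reduction the deck refers to.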
  • 18. For More Detail – See Intel’s Recent Paper
  • 19. Summary
      • Tune and optimize Hadoop case by case
      • Most Hadoop applications are data-intensive
      • Tune your I/O-related application subsystems first
      • Processor choice matters:
        – X5670 (Westmere) shows 20-40% improvement for CPU-intensive workloads over X5570 (Nehalem)¹
        – For I/O-intensive workloads, consider scaling the number of HDDs with core count
      • Performance tuning tips:
        – Channel bonding can reduce the network bottleneck for I/O-intensive workloads
        – Using a larger DFS block size decreases task overhead
        – Enabling HT shows gains of up to 28% for CPU-intensive workloads²
        – Using LZO can significantly improve TeraSort results
      ¹,² Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations (speed test) are on slide 21. Hardware configurations (Hyper-Threading) are on slide 23.
  • 20. Backup
  • 21. Cluster Configurations Information (Slide: “Processor Choice Impacts Speed”)
      Source: Intel internal measurement as of September 19, 2009 running Hadoop* WordCount and TeraSort.
      Intel® Xeon® X5460-based server
      • Processor: dual-socket quad-core Intel® Xeon® X5460 3.16 GHz processor
      • Memory: 16 GB (DDR2 FB-DIMM ECC 667 MHz) RAM
      • Storage: 1 x 300 GB 15K RPM SAS disk for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results
      • Network: 1 Gigabit Ethernet NIC
      • BIOS: version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and adjacent cache-line prefetch disabled
      Intel® Xeon® X5570-based server
      • Processor: dual-socket quad-core Intel® Xeon® X5570 2.93 GHz processor
      • Memory: 16 GB (DDR3 ECC 1333 MHz) RAM
      • Storage: 1 x 1 TB 7200 RPM SATA for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results
      • Network: 1 Gigabit Ethernet NIC
      • BIOS: version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled
      (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on the Xeon X5460 server according to our benchmarking.)
      Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
      Results: WordCount single job running time was 407 seconds on the Intel® Xeon® processor 5500 series and 289 seconds on the Intel® Xeon® processor 5600 series. TeraSort single job running time was 2,541 seconds on the Intel Xeon processor 5500 series and 2,182 seconds on the Intel Xeon processor 5600 series.
      Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled.
      Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
      Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel® Xeon® processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
      Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 22. Cluster Configurations Information (Slide: “Processor Choice Impacts Throughput”)
      Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
      Results: Total completed tasks per minute of WordCount was approximately 71.58 on the Intel® Xeon® processor 5500 series and approximately 93.22 on the Intel® Xeon® processor 5600 series.
      Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled.
      Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
      Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
      Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 23. Cluster Configurations Information (Slide: “Intel® Hyper-threading Technology”)
      Source: Intel internal measurement as of August 8, 2010 based on the following cluster and server configuration: 6 nodes (1 NameNode/JobTracker and 5 DataNode/TaskTracker), configured with 2 GbE connectivity to each server.
      Intel® Xeon® processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for file system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled.
      Intel® Hyper-Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor supporting Intel HT Technology and an Intel HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information, including details on which processors support Intel HT Technology.
  • 24. Cluster Configurations Information (Slide: “Networking”)
      Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
      Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker; each has two-port 1 GbE connectivity to a single GbE switch with channel bonding enabled.
      Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
      Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
      Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 25. Cluster Configurations Information (Slide: “Data Compression”)
      Source: Intel internal measurement as of August 8, 2010 running Hadoop* TeraSort.
      Results: TeraSort single job running time was 1477 seconds without compression, 1256 seconds with default (zlib) compression, and 586 seconds with LZO compression.
      Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 32 DataNode/TaskTracker; each has one-port 1 GbE connectivity to a single GbE switch.
      Intel Xeon processor 5500 series servers: 2x Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 4 SATA disks per node (all 4 for HDFS and intermediate results, sharing 1 for system and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
      Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30.10 x86_64). Ext3 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Hadoop 0.20.1 version.
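The running times on this slide can be turned into the "percent faster" figures quoted on the Data Compression slide. The sketch below assumes "faster" means the fractional reduction in job running time, which is one plausible reading; the deck does not spell out the formula. Names are illustrative only.

```python
from typing import Dict


def runtime_reduction(baseline_seconds: float, improved_seconds: float) -> float:
    """Fraction of the baseline running time saved by the improved run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - improved_seconds / baseline_seconds


# TeraSort single job running times (seconds) reported on this slide.
RUNTIMES: Dict[str, float] = {"uncompressed": 1477, "zlib": 1256, "lzo": 586}

lzo_vs_none = runtime_reduction(RUNTIMES["uncompressed"], RUNTIMES["lzo"])
lzo_vs_zlib = runtime_reduction(RUNTIMES["zlib"], RUNTIMES["lzo"])

print(f"LZO vs. uncompressed: {lzo_vs_none:.1%}")
print(f"LZO vs. zlib: {lzo_vs_zlib:.1%}")
```

Under this reading, LZO comes out about 60% faster than uncompressed, matching the slide 16 claim, and about 53% faster than zlib, slightly below the 56% quoted there. The deck does not say how the 56% figure was derived.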