Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters

The Emulex Advanced Development Organization offers an in-depth analysis of how Emulex OneConnect Adapters quadruple the performance over 1GbE networks for Hadoop cluster environments, addressing the 'Big Data' performance needs of cloud providers and users. Traditional 1GbE networks have not kept pace with the growth of Big Data – Emulex offers an ideal solution.

Published in: Technology

Transcript

  • 1. Hadoop Webcast: Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters
  • 2. Agenda
      – Digital Content, today and tomorrow
      – What is Big Data?
      – Information as an Asset
      – A Solution to the Problem
      – The Moving Bottleneck
      – Hadoop on 10GbE
      – Testing Configurations and Objectives
      – Testing Results
      – Comparison Analysis – The Tale of the Tape
      – Q&A
    © 2011 Emulex Corporation
  • 3. Digital Content – Big Data's Singularity
    [Chart: A Decade of Digital Universe Growth: Storage in Exabytes (0–10,000), 2005–2015]
    Sources of growth:
      – Consumer participation
      – Photo and video archiving
      – eCommerce
      – Social media
      – Social networking
      – Mobile applications
      – Search engine indexing
      – Web logs
      – Medical records
      – Financial transactions
      – Scientific research
      – Surveillance
    Source: IDC's Digital Universe Study, sponsored by EMC, June 2011
  • 4. What is Big Data?
    Collections of data exceeding the capabilities of traditional database management tools…
      – with dynamic, incremental data created around the data preceding it
      – scaling with advances in technology
      – from a growing number of sources
    Think Big Bang theory…
      – but in the order of bytes
    Spawning an entire ecosystem of new technologies and services
      – Powerful
      – Dynamic
      – Scalable
  • 5. Tapping into Information as an Asset
    Organizations actively analyze data rather than just store it
      – Increased Velocity
      – Larger Volume
      – Greater Variety
      – Actionable Data
      – Competitive Differentiation
      – Unlocking Value
  • 6. A Solution to the Problem – Hadoop
      – A powerful, fault-tolerant, self-healing open source platform, allowing for distributed computing on commodity clusters
      – Scaling to thousands of compute nodes, and efficiently managing petabytes of data
      – Leverages two key pieces of technology:
          – Hadoop Distributed File System (HDFS)
          – Hadoop MapReduce
      – Capable of being deployed alongside legacy
      – Enabling old and new data to be combined in powerful ways
      – Accessed by data intensive applications
  • 7. Artem Gavrilov, Senior Architect, Advanced Development Organization
  • 8. Agenda
      – Digital Content, today and tomorrow
      – What is Big Data?
      – Information as an Asset
      – A Solution to the Problem
      – The Moving Bottleneck
      – Hadoop on 10GbE
      – Testing Configurations and Objectives
      – Testing Results
      – Comparison Analysis – The Tale of the Tape
      – Q&A
  • 9. The Moving Bottleneck in Hadoop Clusters
    Designed to run on 1GbE performance characteristics
      – Ubiquity
      – Availability
      – Cost
    Today's commodity servers deliver astounding performance gains over their predecessors
    Multi-core, multi-threaded processors, fast DDR and expanded memory space, and faster and larger internal system drives have moved the bottleneck to the legacy 1GbE network
    Performance characteristics available on today's servers:
      – Processor (4 cores, 8 threads): 25.6GB/s max. memory bandwidth
      – PCIe 3.0 bus: 8GT/s bit rate
      – DDR4 memory modules: up to 3,200 MT/s
      – Storage: SSDs capable of 6Gb/s; SATA drives capable of 600MB/s
  • 10. Hadoop Cluster Hardware – Then and Now
      – 4 Processor Generations
      – DDR2 to DDR3 Transition
      – Higher Density Drives & SSDs
      – No Change – 1GbE
  • 11. Hadoop on 10GbE
    Network I/O performance must scale with the increase in…
      – Processing power
      – Memory capacity
      – Storage performance
    Network performance is essential to support larger and faster systems
    Migrating from a 1GbE to a 10GbE network, leveraging Emulex OneConnect adapters, resulted in a massive performance gain
  • 12. Fine Tuning Hadoop
    Hadoop workloads vary greatly
      – No "one size fits all" approach
      – 200+ cluster-wide and job-specific parameters that can be fine tuned
    With the workload variety comes a disparity in the distribution of resource demands, which can be classed as:
    CPU Intensive
      – Machine learning
      – Complex data/text mining
      – Natural language processing
      – Feature extraction
    I/O Intensive
      – Indexing
      – Searching
      – Grouping
      – Decoding/decompressing
      – Data importing/exporting
  • 13. The Setup
    Servers:
      – HP ML350 G6
          • Dual quad-core Xeon 2GHz
          • 16 GB DDR3
          • Broadcom 1GbE BCM5715
          • Emulex OneConnect 10GbE OCe11102 Ethernet Adapter
    Storage:
      – SATA II 500GB 7200rpm Disk Drives, 6 per node
      – HP Smart Array G6 RAID Controller (JBOD - No RAID configured)
    Cluster Configuration:
      – 15 servers with discrete roles
          • 1 NameNode
          • 11 DataNodes
          • 3 Clients
      – 1GbE and 10GbE Switches
    OS and Software:
      – Ubuntu 64 bit
      – Hadoop (Cloudera Distribution)
  • 14. The Setup
    [Diagram: NameNode, DataNode 1 through DataNode 11, Client 1 through Client 3, a 10Gb Switch and a 1Gb Switch]
  • 15. Test Objective
    Measure HDFS throughput ingesting data into a Hadoop cluster
      – Examining multiple client configurations
      – Raising HDFS 'put' operations per client
      – Transferring a constant 5GB file
      – Replication factor set to three
      – Duplicated for 1GbE and 10GbE Networks

    Clients             1               2                 3
    DataNodes           11              11                11
    'Put' Operations    1, 2, 4, 6, 8   1, 2, 4, 6, 8     1, 2, 4, 6, 8
    Total Operations    1, 2, 4, 6, 8   2, 4, 8, 12, 16   3, 6, 12, 18, 24
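The "Total Operations" row is simply clients multiplied by 'put' operations per client. The short Python sketch below reproduces that arithmetic and also estimates the data volume. It assumes that every 'put' writes one 5GB file and that a replication factor of three triples the bytes stored on the DataNodes; the function and constant names are illustrative and are not taken from the Emulex test harness.

```python
# Assumptions (not spelled out as formulas on the slide): each 'put' writes
# one 5GB file, and replication factor 3 triples the bytes stored in HDFS.
FILE_SIZE_GB = 5
REPLICATION_FACTOR = 3
PUTS_PER_CLIENT = [1, 2, 4, 6, 8]


def build_matrix(clients):
    """Return (puts_per_client, total_ops, ingested_gb, stored_gb) rows."""
    rows = []
    for puts in PUTS_PER_CLIENT:
        total_ops = clients * puts
        ingested_gb = total_ops * FILE_SIZE_GB
        stored_gb = ingested_gb * REPLICATION_FACTOR
        rows.append((puts, total_ops, ingested_gb, stored_gb))
    return rows


for clients in (1, 2, 3):
    for puts, total_ops, ingested_gb, stored_gb in build_matrix(clients):
        print(f"{clients} client(s) x {puts} put(s): {total_ops} operations, "
              f"{ingested_gb}GB ingested, {stored_gb}GB stored with replicas")
```

Under these assumptions, 3 clients running 6 'put' operations each give 18 operations and 270GB including replicas, which would be consistent with the 270GB data size quoted on the comparison slides if that figure counts all three copies.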
  • 16. Test Results – Legacy 1GbE
    Data Import – Single Client, Single 'Put' Operation
    [Chart: throughput (MBps, 0–1000) over time (sec, 0–56); series: 1 Operation]
      – A single client running a single operation makes maximal use of the network
      – HDFS efficiently transfers data to DataNodes within the cluster, averaging 108MBps out of the client server
  • 17. Test Results – Legacy 1GbE
    Data Import – Single Client, Multiple 'Put' Operations
    [Chart: throughput (MBps, 0–1000) over time (sec, 0–56); series: 1 Operation, 4 Operations, 8 Operations]
      – When more than one 'put' operation runs on a client, the 1GbE network becomes the bottleneck
      – Increasing the number of operations did not increase client throughput – restricted by the network connection
  • 18. Test Results – Legacy 1GbE
    Data Import – Multiple Clients, Multiple 'Put' Operations
    [Chart: throughput (MBps, 0–1000) over time (sec, 0–56); series: 1 Operation, 4 Operations, 8 Operations]
      – Expected to observe throughput scale with additional clients
      – Combined In and Out traffic averaged 225MBps
  • 19. Test Results – Legacy 1GbE
    Data Import – Multiple Clients, Multiple 'Put' Operations
    [Chart: throughput (MBps, 0–1000) over time (sec, 0–56); series: 1 Operation, 4 Operations, 8 Operations]
      – As network load increases, 1GbE quickly reaches saturation and becomes the system bottleneck
  • 20. Test Results – Emulex OneConnect 10GbE
    Data Import – Single Client, Single 'Put' Operation
    [Chart: throughput (MBps, 0–180) over time (sec, 0–56); series: 1GbE, 10GbE]
      – Immediate performance improvement of 50% compared to 1GbE network
      – Data transfer completed in less than three quarters of the time
  • 21. Test Results – Emulex OneConnect 10GbE
    Data Import – Single Client, Multiple 'Put' Operations
    [Chart: throughput (MBps, 0–1000) over time (sec, 0–80); series: 1 Operation, 4 Operations, 8 Operations]
      – Increased network load is met with increased throughput
      – Achieved transfer rates of 800MBps, nearly 8X the observed throughput of the 1GbE configuration
  • 22. Test Results – Emulex OneConnect 10GbE
    Data Import – Multiple Clients, Multiple 'Put' Operations
    [Chart: throughput (MBps, 0–1800) over time (sec, 0–150); series: 1 Operation, 4 Operations, 8 Operations]
      – Throughput scales with additional clients being brought on-line
      – The 10GbE network does not limit transfer rates as the clients and their operations increase
  • 23. Tale of the Tape – 1GbE vs 10GbE
    Maximum Throughput Achieved
    [Chart: throughput (MBps, 0–1800) over time (sec, 1–401); series: 1G, 10G]

    Clients             3
    DataNodes           11
    'Put' Operations    6
    Total Operations    18
    Data Size           270GB
    1GbE Max MBps       250
    10GbE Max MBps      1,674 (6.7X faster)
  • 24. Tale of the Tape – 1GbE vs 10GbE
    Average Throughput Achieved
    [Chart: throughput (MBps, 0–1000) by number of put operations (1, 2, 4, 8, 12, 18); series: 1G, 10G]
    ~4X throughput enables more efficient real time analysis

    Clients             3
    DataNodes           11
    'Put' Operations    6
    Total Operations    18
    Data Size           270GB
    1GbE Avg MBps       216
    10GbE Avg MBps      831 (3.85X faster)
  • 25. Tale of the Tape – 1GbE vs 10GbE
    Time to Completion (seconds)
    [Chart: time (sec, 0–600) by number of put operations (1, 2, 4, 8, 12, 18); series: 1G, 10G]
    Load times reduced by 75%, improving batch analysis

    Clients             3
    DataNodes           11
    'Put' Operations    6
    Total Operations    18
    Data Size           270GB
    1GbE Completion     453
    10GbE Completion    115 (3.94X faster)
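The "X faster" figures on the three comparison slides can be checked directly from the quoted numbers. The Python sketch below is a minimal illustration: the measurements are copied from the slides, while the helper name `speedup` and the dictionary layout are assumptions made for this example.

```python
# Figures copied from the "Tale of the Tape" slides: (1GbE value, 10GbE value).
RESULTS = {
    "max_throughput_mbps": (250, 1674),
    "avg_throughput_mbps": (216, 831),
    "completion_sec": (453, 115),
}


def speedup(one_gbe, ten_gbe, lower_is_better=False):
    """Ratio by which 10GbE beats 1GbE for a given metric."""
    if lower_is_better:
        return one_gbe / ten_gbe
    return ten_gbe / one_gbe


max_x = speedup(*RESULTS["max_throughput_mbps"])
avg_x = speedup(*RESULTS["avg_throughput_mbps"])
time_x = speedup(*RESULTS["completion_sec"], lower_is_better=True)

# Fraction of load time saved when moving from 1GbE to 10GbE.
one_gbe_sec, ten_gbe_sec = RESULTS["completion_sec"]
time_saved = 1 - ten_gbe_sec / one_gbe_sec

print(f"max throughput: {max_x:.1f}X")
print(f"avg throughput: {avg_x:.2f}X")
print(f"completion:     {time_x:.2f}X, load time reduced by {time_saved:.0%}")
```

Throughput is a higher-is-better metric and completion time is lower-is-better, so the ratio is inverted for the latter; that is why a 3.94X speedup corresponds to a load-time reduction of roughly 75%.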
  • 26. Key Takeaways
    Hadoop runs faster with 10G
      – Up to 8 times faster in some scenarios
    Fine tuning parameters is important for performance
      – Improvements may not be possible without proper configuration
    Future performance gains are possible
      – Hadoop was designed for 1GbE, but small changes will enable the full potential of 10GbE
    Hadoop is better with Emulex OneConnect Ethernet Adapters
      – "It just works" – right out of the box
      – Leverage our expertise to configure your Hadoop installation for maximum performance
  • 27. Questions
  • 28. Questions
    Which 1GbE and 10GbE switches were included in our tests? And would we see better performance with a switch that had lower latency?
    We used several different models of Cisco switches, each with different latency attributes. We found that latency didn't impact throughput performance in a significant way. In one case, when moving to a switch with double the latency performance, we witnessed only a roughly 1% increase in throughput. Within the construct of our tests, we did not find that latency was critical to the performance results.
  • 29. Questions
    Did we find the network being the bottleneck prior to the disk subsystem becoming the bottleneck?
    Yes, and it comes through in our graphs. It's important to note that at the beginning of our tests, we encountered some disk performance bottlenecks due to configuration issues. This shows that it is essential to understand the configuration settings for your Hadoop cluster in order to tap the full potential of your disks. With commodity disks, the standard performance characteristic is 100MBps per disk; typical environments have 6 disks per node, totaling 600MBps in performance potential. In some cases, you don't need disk operations to actually happen, because data is moved from memory to memory, but in most cases data is moved from disk to disk on different machines. In those cases, disk performance is important. However, in our test cases, disk performance was not a bottleneck.
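The disk arithmetic in this answer (6 disks at 100MBps each) can be set against nominal link rates to see which resource limits a single node first. The Python sketch below is a back-of-the-envelope comparison only: the nominal link figures of 125MBps for 1GbE and 1,250MBps for 10GbE are assumptions that ignore protocol overhead and replication traffic, and the function names are illustrative.

```python
def aggregate_disk_mbps(disks_per_node, mbps_per_disk):
    """Combined sequential throughput of all disks in one node."""
    return disks_per_node * mbps_per_disk


def limiting_resource(disk_mbps, network_mbps):
    """Name whichever of the two resources has the lower throughput."""
    return "network" if network_mbps < disk_mbps else "disk"


# Assumed nominal one-direction link rates in MBps (line rate divided by 8).
NOMINAL_LINK_MBPS = {"1GbE": 125, "10GbE": 1250}

disk_mbps = aggregate_disk_mbps(disks_per_node=6, mbps_per_disk=100)
for link_name, link_mbps in NOMINAL_LINK_MBPS.items():
    limit = limiting_resource(disk_mbps, link_mbps)
    print(f"{link_name}: disks {disk_mbps}MBps vs link {link_mbps}MBps "
          f"-> limited by {limit}")
```

On these nominal numbers a 1GbE link sits well below what six commodity disks can deliver, which matches the answer that the network became the bottleneck before the disks did.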
  • 30. Questions
    How many 1GbE NICs were used? Were multiple 1GbE NICs bridged together, or just a single 10GbE NIC?
    Our configuration used a single 1GbE NIC with two ports, which is the typical commodity server configuration. Theoretically, you can install multiple cards and get better performance, but it is a more difficult proposition and would cost more than a single 10GbE NIC, aside from the fact that there likely would not be enough slots on the motherboard to accommodate that many cards.
  • 31. Questions
    What is the maximum throughput of 10GbE?
    10GbE maximum throughput is 1.25GB/s for single direction data transfer. When aggregated with receiving data, 2.5GB/s is the maximum. Hadoop is not designed to accommodate this speed, yet. Hopefully, it will be there soon. It's important to mention that most 10GbE solutions today come with two ports, which means that you can achieve up to 5GB/s performance. Of course, in order to leverage that performance, you have to have a disk sub-system that operates close to that level. We observed that in cases where two 10GbE ports were used, you need 12 high-performance disks. Today, that is not necessary because Hadoop does not use the network efficiently, so even with 6 disks, you will see a significant performance gain.
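The three throughput ceilings quoted in this answer follow from dividing the line rate by eight and then doubling for full duplex and again for a second port. The Python sketch below works through that arithmetic; it is a minimal illustration that ignores Ethernet and TCP/IP framing overhead, and the function name and parameters are assumptions made for this example.

```python
BITS_PER_BYTE = 8


def line_rate_gbytes(gbits_per_port, ports=1, full_duplex=False):
    """Theoretical ceiling in GB/s, ignoring all protocol overhead."""
    directions = 2 if full_duplex else 1
    return gbits_per_port / BITS_PER_BYTE * ports * directions


one_way = line_rate_gbytes(10)
both_ways = line_rate_gbytes(10, full_duplex=True)
dual_port = line_rate_gbytes(10, ports=2, full_duplex=True)

print(f"single port, one direction:  {one_way}GB/s")
print(f"single port, send + receive: {both_ways}GB/s")
print(f"two ports, send + receive:   {dual_port}GB/s")
```

These are upper bounds on the wire only; as the answer notes, the disk subsystem has to keep pace before an application sees anything close to them.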
  • 32. Questions
    Do we have a list of the parameters that need to be tuned within Hadoop in order to maximize the performance of our 10GbE NICs?
    The settings will vary depending on the environment. There isn't a one-size-fits-all approach. Some of these parameters have been published in our white paper, and we will review that paper to ensure that all of those parameters are addressed.
  • 33. Questions
    Are these results comparable to other 10GbE NICs or is this something unique to the Emulex technology portfolio?
    We included multiple cards from our competitors in this research project. Emulex cards did offer performance advantages over our competition of approximately 10%. The important observation was that competitors' cards were more prone to failures: servers stopped responding, system reboots were needed, etc. Emulex cards were far more reliable across the board, which we believe is more important than fractional performance gains.
  • 34. Questions
    If the tests did not saturate the bandwidth of a 1GbE link, is the cause of the performance increase with 10GbE attributable to the "bursty" nature of the transfer itself?
    Hadoop is not optimized for networking, which is why there are some odd observations from time to time. There are times when, even on 1GbE connections, it's possible to not reach 50% of maximum throughput, a by-product of its design. Hadoop was designed to run multiple jobs and operations, and in those instances these performance issues do not manifest themselves.
  • 35. Questions
    Would a round-robin bonding configuration be possible with 10GbE, and would there be a performance gain from that?
    Theoretically, it is possible. Practically, it is unlikely due to the underlying disk system becoming the bottleneck (for the moment). If there are SSDs, or more than 6 disks being used, there is potential for performance improvement.
  • 36. Questions
    Have we run tests with SSDs, higher RPM spindles, or larger spindle configurations?
    Yes, we did, and we encountered some interesting results. While we did see improvements of approximately 40%, we anticipated much better results with SSDs. The biggest issue with SSDs has to do with the way Hadoop interfaces with them: it does not tap into the full potential of the disk. Ultimately, we landed on throughput being the most important factor for performance, not necessarily I/O.
  • 37. Thank You…
