Optimizing Dell PowerEdge Configurations for Hadoop


Hadoop hardware configurations for Dell PowerEdge servers. (Presentation from the 2013 Dell Enterprise Forum.)


  1. Optimizing PowerEdge Configurations for Hadoop
     Michael Pittaro, Principal Architect, Big Data Solutions, Dell
  2. What is Big Data? Big Data is when the data itself is part of the problem.
     • Volume – a large amount of data, growing at large rates
     • Velocity – the speed at which the data must be processed
     • Variety – the range of data types and data structures
  3. Dell | Cloudera Apache Hadoop Solution
     Verticals: Retail, Telco, Media, Web, Finance
  4. Dell | Cloudera Apache Hadoop Solution
     • A proven Big Data platform
       – Cloudera CDH4 Hadoop distribution with Cloudera Manager
       – Validated and supported reference architecture
       – Production deployments across all verticals
     • Dell Crowbar provides deployment and management at scale
       – Integrated with Cloudera Manager
       – Bare metal to deployed cluster in hours
       – Lifecycle management for ongoing operations
     • Dell partner ecosystem
       – Pentaho for data integration, reporting, and visualization
       – Datameer for spreadsheet-style analytics and visualization
       – Clarity and Dell implementation services
  5. The Problem with Big Data Projects
     • Customers want results – performance, predictability, reliability, availability, management, monitoring
     • Customers want value
     • Big Data has many options – servers, networking, software, tools, application code, fast evolution
     • Wide range of applications
  6. A Reference Architecture Fills the Gap
     • Patterns, use cases, and best practices are emerging in Big Data
     • Reference architectures help package this knowledge for reuse
     • Tested server configurations
     • Tested network configurations
     • Base software configuration – Big Data software, OS infrastructure, operational infrastructure
     • Predefined configuration – a recommended starting point
  7. Reference Architecture: Servers
     • PowerEdge R720, R720XD – balanced compute and storage
     • PowerEdge C6105 – scale-out computing, large disk capacity
     • PowerEdge C8000 – scale-out computing, flexible configuration
  8. Reference Architecture: Networking
     • Top of rack: Force10 S60 (1GbE) or Force10 S4810 (10GbE)
     • Cluster aggregation: Force10 S4810
     • Bonded connections, redundant networking
  9. Reference Architecture: Software
     • Hadoop – Cloudera CDH4, Cloudera Manager, and Hadoop tools (Hive, Pig, HBase, Sqoop, Oozie, Hue, Flume, Whirr, ZooKeeper)
     • Infrastructure management – Nagios, Ganglia
     • Configuration management – predefined parameters, role-based configuration
  10. Tying It All Together: Crowbar
      (Architecture diagram: Crowbar layers APIs, user access, and ecosystem partners over Big Data infrastructure and Dell extensions – HDFS, HBase, Hive, Pig, Cloudera, Pentaho, Nagios, Ganglia – over core components and operating systems – deployer, provisioner, network, RAID, BIOS, IPMI, NTP, DNS, logging – down to physical resources such as Force10 switches.)
  11. Hadoop Node Architecture
      (Diagram: an Admin Node runs Crowbar, Nagios, and Ganglia; an Edge Node runs Cloudera Manager and the Hadoop clients; Master and Secondary Name Nodes run the NameNode and JobTracker; each Data Node runs a TaskTracker and DataNode. In the high-availability configuration, active and standby NameNodes are paired with JournalNodes and a JobTracker.)
  12. Hadoop Cluster Scaling
      (Diagram of cluster scaling.)
  13. Learning the Reference Architecture
      • Read it! – read it again; keep it under your pillow
      • Three documents – Reference Architecture, Deployment Guide, User's Guide
      • Deploy it – works on 4 or 5 nodes
      • Available through the Dell sales team
  14. Leveraging the Reference Architecture
      • Start with the base configuration
        – It works, and eliminates mix-and-match problems
        – There are a lot of subtle details hidden behind the configurations
      • Easy changes: processor, memory, disk
        – Will generally not break anything
        – Will affect performance, however
      • Harder changes: Hadoop configuration
        – Mainly, you need to know what you're doing here
        – We have experience and recommendations
      • Hardest changes: optimization for workloads
        – The default configuration is a general-purpose one
        – Specific workloads must be tested and benchmarked
  15. Selecting Processors
      • Assume 1.5 Hadoop tasks per physical core
        – Turn Hyper-Threading on
        – This allows headroom for other processes
      • Configure Hadoop task slots
        – 2/3 map tasks
        – 1/3 reduce tasks
      • Dual-socket 6-core Xeon example:
        – mapred.tasktracker.map.tasks.maximum: 12
        – mapred.tasktracker.reduce.tasks.maximum: 6
      • Faster is better – Hadoop compression uses processor cycles
      • Most Hadoop jobs are I/O bound, not processor bound
      • The map/reduce balance depends on the actual workload; it's hard to optimize further without knowing that workload
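The slot arithmetic on this slide can be sketched as a small helper (an illustration of the rule of thumb, not part of the deck; the function name is ours, while the 1.5-tasks-per-core factor and the 2/3 map, 1/3 reduce split come from the slide):

```python
def task_slots(physical_cores, tasks_per_core=1.5):
    """Size Hadoop task slots from physical core count:
    total = cores * 1.5, split 2/3 map and 1/3 reduce."""
    total = int(physical_cores * tasks_per_core)
    reduce_slots = total // 3          # 1/3 reduce
    map_slots = total - reduce_slots   # 2/3 map
    return map_slots, reduce_slots

# Dual-socket 6-core Xeon: 12 physical cores -> 18 slots
print(task_slots(12))  # (12, 6) -- matches the mapred.* values above
```

The returned pair maps onto mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum for the node.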
  16. Spindle / Core / Storage Depth Optimization
      • Hadoop scales processing and storage together
        – The cluster grows by adding more data nodes
        – The ratio of processor to storage is the main adjustment
      • Generally, aim for a 1 spindle / 1 core ratio
        – I/O is large blocks (64 MB to 256 MB)
        – Primarily sequential read/write, very little random I/O
        – 8 tasks will be reading or writing 8 individual spindles
      • Drive sizes and types
        – NL-SAS or enterprise SATA 6 Gb/s
        – Drive size is mainly a price decision
      • Depth per node
        – Up to 48 TB/node is common; 112 TB/node is possible
        – Consider how much data is 'active'
        – Very deep storage impacts recovery performance
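The spindle/core balance above lends itself to the same kind of back-of-the-envelope check (again a sketch of ours, not from the deck; only the 1 spindle/core target and the 48 TB/node figure are taken from the slide):

```python
def node_profile(cores, spindles, drive_tb):
    """Summarize a data-node design: raw capacity in TB and
    spindle-to-core ratio (the slide's target is roughly 1.0)."""
    return {
        "raw_tb": spindles * drive_tb,
        "spindles_per_core": spindles / cores,
    }

# A 16-core node with 16 x 3 TB drives hits the 1:1 ratio
# and the common ~48 TB/node depth mentioned above.
print(node_profile(16, 16, 3))  # {'raw_tb': 48, 'spindles_per_core': 1.0}
```

A ratio well below 1.0 signals a CPU-heavy node (cores starved for spindles); well above 1.0 signals deep storage, with the recovery-time caveat noted on the slide.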
  17. PowerEdge C8000 Hadoop Scaling – 16-core Xeon
      (Chart comparing one 12-spindle 3 TB configuration against three 6-spindle 3 TB configurations, plotting storage in TB and IOPS against core count.)
  18. Workload Optimization: Hadoop has widely varying workloads
      • Workload optimization requires profiling and benchmarking
      • HBase and pure MapReduce are different
        – I/O patterns are different
        – HBase requires more memory
        – Cloudera RTQ (Impala) is I/O intensive
      • MapReduce usage varies – from I/O intensive to CPU intensive
      • Ingestion and transfer impact the edge (gateway) nodes
      • Heterogeneous cluster versus dedicated clusters?
        – Cloudera has added support for heterogeneous clusters and nodes
        – A dedicated cluster makes sense if the workload is consistent, primarily for 'data' businesses
  19. Reference Architecture Options
      • High availability
        – Networking configuration
        – Master / secondary name node configuration
      • Alternative switches
        – It's possible; contact us for advice
      • Cluster size
        – The reference architecture scales easily to around 720 nodes
        – Beyond that, a network engineer needs to take a closer look
      • Node size
        – Memory recommendations are a starting point
        – Disk / core balance is a never-ending debate
  20. In the Wild – Dell Customer Hadoop Configurations
      • R720xd – dual socket, 12 cores, 24 x 2.5" spindles – most popular platform for Hadoop
      • C8000 – dual socket, 16 cores, 16 x 3.5" spindles – popular for deep/dense Hadoop applications
      • C6100 / C6105 – dual socket, 8/12 cores, 12 x 3.5" spindles – two-node version; C6100 is hardware EOL
      • C2100 – dual socket, 12 cores, 12 x 3.5" spindles – popular; hardware EOL but often repurposed for Hadoop
      • R620 – dual socket, 8 cores, 10 x 2.5" spindles – 1U form factor
      • C6220 – dual socket, 8 cores, 6 x 2.5" spindles – core/spindle ratio is not ideal for Hadoop
  21. SecureWorks: Based on the R720xd Reference Architecture
      SecureWorks helps protect the security of its customers' assets in real time, 24 hours a day, 365 days a year.
      • Challenge: collecting, processing, and analyzing massive amounts of data from customer environments
      • Results:
        – Reduced the cost of data storage to ~21 cents per gigabyte
        – 80% savings over the previous proprietary solution
        – 6 months faster deployment
        – Less than 1 year payback on the entire investment
        – Data doubles every 18 months, magnifying the savings
  22. Further Information
      • Dell Hadoop home page – http://www.dell.com/hadoop
      • Dell Cloudera Apache Hadoop install with Crowbar (video) – http://www.youtube.com/watch?v=ZWPJv_OsjEk
      • Cloudera CDH4 documentation – http://ccp.cloudera.com/display/CDH4DOC/CDH4+Documentation
      • Crowbar homepage and documentation on GitHub – http://github.com/dellcloudedge/crowbar/wiki
      • Open source Crowbar installers – http://crowbar.zehicle.com/
  23. Q&A
  24. Thank you!
  25. Notices & Disclaimers
      Copyright © 2013 by Dell, Inc.
      No part of this document may be reproduced or transmitted in any form without written permission from Dell, Inc.
      This document could include technical inaccuracies or typographical errors. Dell may make improvements or changes in the product(s) or program(s) described herein at any time without notice. Any statements regarding Dell's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
      References in this document to Dell products, programs, or services do not imply that Dell intends to make such products, programs, or services available in all countries in which Dell operates or does business. Any reference to a Dell program product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe Dell's intellectual property rights may be used.
      The information provided in this document is distributed "AS IS" without any warranty, either expressed or implied. Dell EXPRESSLY DISCLAIMS any warranties of merchantability, fitness for a particular purpose, or non-infringement. Dell shall have no responsibility to update this information.
      The provision of the information contained herein is not intended to, and does not, grant any right or license under any Dell patents or copyrights.
      Dell, Inc., 300 Innovative Way, Nashua, NH 03063 USA
