Hadoop Summit 2012 - Validated Network Architecture and Reference Deployment in Enterprise
 

  • Diverse data sources - Subscriber based (Census, Proprietary, Buyers, Manufacturing)
  • Hadoop was designed with failure in mind
  • Talk about the intensity of failure for smaller vs. bigger jobs. Map tasks execute in parallel, so the unit time per map task per node stays the same and all nodes finish their assigned maps at roughly the same time. During a failure, however, the failed node's set of map tasks stays pending until all the other nodes finish their assigned tasks; only then does the JobTracker reassign the leftover maps. Those leftover maps take the same (linear) unit time as the others, but because they no longer run in parallel with the rest of the job, they can roughly double the job completion time. This is the worst case with TeraSort; other workloads may have variable completion times. (A rough numeric sketch follows this list.)
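
A minimal Python sketch of the point above, using assumed cluster sizes, slot counts and per-wave times (illustrative assumptions, not measured values): the failed node's leftover map tasks run as an extra "tail" wave after every other node has finished, so a small single-wave job can roughly double while a large multi-wave job is hurt far less.

    import math

    def map_phase_waves(total_tasks, nodes, slots_per_node, node_failed=False):
        """Number of sequential map 'waves' needed to run all map tasks."""
        slots = nodes * slots_per_node
        waves = math.ceil(total_tasks / slots)
        # Tasks lost to a late failure are rescheduled only after the surviving
        # nodes finish their current assignments: roughly one extra wave.
        return waves + (1 if node_failed else 0)

    wave_time = 60  # assumed seconds per wave of map tasks

    for label, tasks in (("small job", 700), ("large job", 7813)):
        normal = map_phase_waves(tasks, nodes=80, slots_per_node=10) * wave_time
        failed = map_phase_waves(tasks, nodes=80, slots_per_node=10,
                                 node_failed=True) * wave_time
        print(f"{label}: {normal} s -> {failed} s with one late node failure")

With these assumed numbers the small job goes from 60 s to 120 s (doubled), while the large job only goes from 600 s to 660 s, matching the note about failure intensity for small vs. big jobs.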

Hadoop Summit 2012 - Validated Network Architecture and Reference Deployment in Enterprise Presentation Transcript

  • 1. Hadoop Summit 2012 - Validated Network Architecture and Reference Deployment in Enterprise. Nimish Desai – nidesai@cisco.com, Technical Leader, Data Center Group, Cisco Systems Inc.
  • 2. Session Objectives & Takeaways. Goal 1: Provide a reference network architecture for Hadoop in the enterprise. Goal 2: Characterize the Hadoop application on the network. Goal 3: Network validation results with Hadoop workloads.
  • 3. Big Data in Enterprise
  • 4. Validated 96 Node Hadoop Cluster. [Topology diagram: a traditional DC design with Nexus 7000 / Nexus 5548 and 2248TP-E FEX, and a Nexus 7000 / Nexus 3000 based topology; name nodes and data nodes 1–96 on Cisco UCS C200 with a single NIC.]
    Hadoop framework: Apache 0.20.2 on Linux 6.2; slots – 10 maps & 2 reducers per node.
    Compute: UCS C200 M2 – 12 cores, 2 x Intel Xeon X5670 @ 2.93 GHz, 4 x 2 TB disks (7.2K RPM), 1G LOM and 10G Cisco UCS P81E.
    Network: three racks, each with 32 nodes; distribution layer – Nexus 7000 or Nexus 5000; ToR – FEX or Nexus 3000; 2 FEX per rack; each rack with either 32 single- or dual-attached hosts.
  • 5. Data Center Infrastructure. [Diagram: WAN edge layer; Layer 3 / Layer 2 (1GE and 10GE) core on Nexus 7000 with 10GE DCB and FCoE; SAN core on MDS 9500 directors with 4/8 Gb FC; aggregation and services layer with vPC+/FabricPath on Nexus 7000 plus network services; access layer with Nexus 5500, Nexus 2148TP-E / 2232 FEX, Nexus 3000 ToR, B22 FEX, CBS 31xx blade switches, UCS blade and bare-metal rack servers; 1 GbE or 10 GbE server access with 4/8 Gb FC via dual HBA (SAN A / SAN B), or 10 Gb DCB/FCoE server access.]
  • 6. Big Data Application Realm – Web 2.0 & Social/Community Networks. Data lives and dies in Internet-only entities: partially private data, homogeneous data life cycle, mostly unstructured, web-centric and user-driven, a unified workload with few processes and owners, typically non-virtualized. Scaling & integration dynamics: purpose-driven apps, thousands of nodes, hundreds of PB and growing exponentially.
  • 7. Big Data Application Realm – Enterprise. Data lives in a confined zone of the enterprise repository (call center, sales pipeline, ERP modules, document and records management, data services, social media, office apps, video conferencing, collaboration, customer DB, product catalog, VOIP data, executive reports); long lived, regulatory and compliance driven. Heterogeneous data life cycle: many data models; diverse data – structured and unstructured; diverse data sources – subscriber based (Oracle/SAP); diverse workloads from many sources, groups, processes and technologies; virtualized and non-virtualized, mostly SAN/NAS based. Scaling & integration dynamics are different: data warehousing (structured) with diverse repositories plus unstructured data; a few hundred to a thousand nodes, a few PB; integration, policy & security challenges. Each app, group or technology is limited in data generation, consumption, and servicing of confined domains.
  • 8. Data Sources. Enterprise applications: sales, products, process, inventory, finance, payroll, shipping, customer profiles. Machine and external data: machine logs, sensor data, call data records, web clickstream data, satellite feeds, GPS data, tracking data, blogs, emails, pictures, video, authorizations.
  • 9. Big Data Building Blocks into the Enterprise. [Diagram: application trends – event streams, click data, social media, mobility, sensor data, logs – running on virtualized, bare-metal and cloud platforms over Cisco Unified Fabric; traditional storage (SAN and NAS) and traditional databases (RDBMS) alongside "Big Data" NoSQL (real-time capture, read and update operations) and "Big Data" databases (store and analyze).]
  • 10. Hadoop Cluster Design & Reference Network Architecture
  • 11. Characteristics that Affect Hadoop Clusters. Cluster size: number of data nodes. Data model & mapper/reducer ratio: MapReduce functions. Input data size: total starting dataset. Data locality in HDFS: the ability to process data where it is already located. Background activity: number of jobs running, type of jobs, importing and exporting. Characteristics of the data node: I/O, CPU, memory, etc. Networking characteristics: availability, buffering, data node speed (1G vs. 10G), oversubscription, latency. http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/
  • 12. Hadoop Components and Operations – HDFS and the Map phase. Data ingest and replication of unstructured data into the Hadoop Distributed File System: external connectivity plus east–west traffic (replication of data blocks). Map phase: raw data is analyzed and converted to name/value pairs; the workload translates into multiple batches of map tasks; this is mostly an I/O and compute function. Reducers can start the reduce phase ONLY after the entire map set is complete.
  • 13. Hadoop Components and Operations – Shuffle and Reduce phases. Shuffle phase: all name/value pairs are sorted and grouped by their keys, and mappers send the data to reducers – high network activity. Reduce phase: all values associated with a key are processed for results, in three steps – copy (get intermediate results from each data node's local disk), merge (to reduce the number of files), and the reduce method. Output replication phase: the reducer replicates results to multiple nodes – the highest network activity. Network activity is dependent on workload behavior.
  • 14. MapReduce Data Model – ETL & BI Workload Benchmarks. The complexity of the functions used in Map and/or Reduce has a large impact on job completion time and network traffic. Yahoo TeraSort – ETL workload, the most network intensive: input, shuffle and output data sizes are the same (e.g. a 10 TB data set in all phases), and TeraSort has more balanced Map vs. Reduce functions – linear compute and I/O. Shakespeare WordCount – BI workload: the data set size varies across phases, with varying impact on the network (e.g. 1 TB input, 10 MB shuffle, 1 MB output); most of the processing is in the Map functions, with smaller intermediate and even smaller final data. (A minimal WordCount sketch follows below.)
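
A minimal Hadoop-Streaming-style WordCount sketch in Python (illustrative only, not the benchmark code used in this study) showing why a BI-type job shuffles so little data: the map output is already collapsed to (word, count) pairs, so the intermediate data crossing the network is tiny relative to the input.

    from collections import Counter

    def mapper(lines):
        """Emit (word, count) pairs for one input split (combiner-style)."""
        counts = Counter()
        for line in lines:
            counts.update(line.lower().split())
        yield from counts.items()   # this is what would be shuffled over the network

    def reducer(pairs):
        """Sum counts per word; in real Hadoop the pairs arrive grouped by key."""
        totals = Counter()
        for word, n in pairs:
            totals[word] += n
        return totals

    if __name__ == "__main__":
        split = ["to be or not to be", "all the world is a stage"]
        shuffled = list(mapper(split))
        print("input words :", sum(len(l.split()) for l in split))
        print("shuffle recs:", len(shuffled))
        print(reducer(shuffled))

In a TeraSort-style ETL job, by contrast, the map output is the same size as the input, so the full data set crosses the network during the shuffle.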
  • 15. ETL Workload (1 TB Yahoo TeraSort) – network graph of all traffic received on a single node (80 node run). Shortly after the reducers start, map tasks are finishing and data is being shuffled to the reducers. Once the maps completely finish, the network is no longer used because the reducers have all the data they need to finish the job. [Graph legend: the red line is the total amount of traffic received by node hpc064; the other symbols represent individual nodes sending traffic to hpc064; markers show job start, maps start, maps complete, and reducers finish.]
  • 16. ETL Workload (1 TB Yahoo TeraSort) – network activity of all traffic received on a single node (80 node run), with output data replication enabled. If output replication is enabled, the end of the TeraSort must store additional copies: for a 1 TB sort, 2 TB will need to be replicated across the network. Replication of 3 is enabled (1 copy stored locally, 2 stored remotely), so each reduce output is now replicated instead of just stored locally. (Arithmetic sketch below.)
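
A back-of-the-envelope Python check of the slide's figure: with HDFS replication factor 3, each reducer's output block is stored once locally and copied to two remote nodes, so a 1 TB sort output generates roughly 2 TB of replication traffic on the network.

    output_tb = 1.0                          # TeraSort output size
    replication_factor = 3                   # HDFS replication factor
    remote_copies = replication_factor - 1   # one replica stays on the local data node
    print(f"replication traffic ~ {output_tb * remote_copies:.0f} TB")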
  • 17. BI Workload – network graph of all traffic received on a single node (80 node run): WordCount on 200K copies of the complete works of Shakespeare. Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilized throughout the job, but only by a limited amount. [Graph legend: the red line is the total amount of traffic received by node hpc064; the other symbols represent individual nodes sending traffic to hpc064; markers show job start, maps start, maps complete, and reducers finish.]
  • 18. Data Locality in HDFS. Data locality – the ability to process data where it is locally stored. Observations: notice the initial spike in RX traffic before the reducers kick in; it represents data that each map task needs but that is not local, and looking at the spike it is mainly data from only a few nodes. Note: during the Map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data – sometimes a task is scheduled on a node that does not have the data available locally. This is a consideration when choosing the replication factor, since more replicas tend to create a higher probability of data locality. (A toy model below illustrates this.)
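
A toy combinatorial model in Python (an assumption for illustration, not how the JobTracker scheduler actually works): if only K of the N data nodes have a free map slot when a task is scheduled, the task can run data-local only if one of the block's R replicas lives on one of those K nodes. Raising R raises that probability, which is the slide's point about the replication factor.

    from math import comb

    def p_local(n_nodes, free_nodes, replicas):
        """P(at least one replica sits on a node that currently has a free slot)."""
        return 1 - comb(n_nodes - replicas, free_nodes) / comb(n_nodes, free_nodes)

    N, K = 96, 10   # assumed: 96-node cluster, 10 nodes with a free map slot
    for r in (1, 2, 3):
        print(f"replication {r}: P(data-local) ~ {p_local(N, K, r):.2f}")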
  • 19. Map to Reducer Ratio Impact on Job Completion. A 1 TB file with 128 MB blocks == 7,813 map tasks. The job completion time is directly related to the number of reducers; average network buffer usage drops as the number of reducers gets lower, and vice versa (see hidden slides). [Chart: job completion time in seconds vs. number of reducers – 192, 96, 48, 24, 12, 6 – with completion time growing as the reducer count decreases. Worked out below.]
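
The arithmetic behind the slide, in Python (decimal units, which match the slide's 7,813 figure); the per-reducer data volume is a simple assumed proportional model, not a measured result, but it is consistent with completion time growing as the reducer count drops.

    import math

    input_bytes = 1_000_000_000_000     # 1 TB input
    block_bytes = 128_000_000           # 128 MB block size
    print("map tasks:", math.ceil(input_bytes / block_bytes))   # -> 7813

    # Fewer reducers leave more shuffle data per reducer (and fewer parallel
    # fetch/merge/write pipelines), hence the longer completion times in the chart.
    for reducers in (192, 96, 48, 24, 12, 6):
        print(f"{reducers:4d} reducers -> ~{1000 / reducers:6.1f} GB per reducer")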
  • 20. Job Completion Time with 96 Reducers 20
  • 21. Job Completion Time with 48 Reducers 21
  • 22. Job Completion Graph with 24 Reducers 22
  • 23. Network Characteristics. The relative impact of various network characteristics on Hadoop clusters*: availability, buffering, oversubscription, data node speed, latency. (*Not scaled or measured data.)
  • 24. Validated Network Reference Architecture
  • 25. Data Center Access Connectivity. [Diagram: Nexus 7000 core/distribution LAN and MDS 9000 SAN; unified access layer with Nexus 5000, Nexus 1000V, Nexus 4000 and Nexus 2000 FEX; direct-attach 10GE; Cisco UCS blade and rack servers – 1 & 10GE rack mount, 10GE blade with FCoE, 10GE blade with pass-through (IBM/Dell), 1GE rack mount.]
  • 26. Network Reference Architecture – Network Attributes. Nexus LAN and SAN core, optimized for the data centre: architecture, availability, capacity, scale & oversubscription, flexibility; edge/access layer: management & visibility.
  • 27. Scaling the Data Centre Fabric – changing the device paradigm. De-coupling of the Layer 1 and Layer 2 topologies; a simplified management model with plug-and-play provisioning and centralized configuration; line card portability (N2K supported with multiple parent switches – N5K, 6100, N7K); unified access for any server (100M / 1GE / 10GE / FCoE): scalable Ethernet, HPC, unified fabric or virtualization deployments – a virtualized switch.
  • 28. Hadoop Network Topologies – Reference, Unified Fabric & ToR DC Design. Integration with the enterprise architecture is an essential pathway for data flow: consistency, management, risk assurance, enterprise-grade features, and a consistent operational model (NX-OS, CLI, fault behavior and management). Hadoop drives higher east–west bandwidth than traditional transactional networks, and over time the cluster takes on multi-user, multi-workload behavior, so it needs enterprise-centric features (security, SLA, QoS, etc.). Topology options: 1 Gbps attached servers – Nexus 7000/5000 with 2248TP-E, or Nexus 7000 and 3048; NIC teaming, 1 Gbps attached – Nexus 7000/5000 with 2248TP-E, or Nexus 7000 and 3048; 10 Gbps attached servers – Nexus 7000/5000 with 2232PP, or Nexus 7000 and 3064; NIC teaming, 10 Gbps attached – Nexus 7000/5000 with 2232PP, or Nexus 7000 and 3064.
  • 29. Validated Reference Network Topology (the cluster from slide 4). Traditional DC design with Nexus 55xx/2248 and a Nexus 7K–N3K based topology; name nodes and data nodes 1–96 on Cisco UCS C200 with a single NIC. Hadoop framework: Apache 0.20.2 on Linux 6.2; slots – 10 maps & 2 reducers per node. Compute: UCS C200 M2 – 12 cores, 2 x Intel Xeon X5670 @ 2.93 GHz, 4 x 2 TB disks (7.2K RPM), 1G LOM and 10G Cisco UCS P81E. Network: three racks, each with 32 nodes; distribution layer – Nexus 7000 or Nexus 5000; ToR – FEX or Nexus 3000; 2 FEX per rack; each rack with either 32 single- or dual-attached hosts.
  • 30. Network Reference Architecture Characteristics – Network Attributes. Nexus LAN and SAN core, optimized for the data centre: architecture, availability, capacity, scale & oversubscription, flexibility; edge/access layer: management & visibility.
  • 31. High Availability Switching Design – common high availability engineering principles. The core high availability design principles are common across all network systems designs: understand the causes of network outages (component failures, network anomalies); understand the engineering foundations of systems-level availability (device and network-level MTBF, hierarchical and modular design); understand the HW and SW interaction in the system. [Topology: dual-node L3 core, full mesh to a dual-node L2 layer, full mesh to dual-node ToR, with NIC teaming – dual NIC 802.3ad, dual NIC active/standby, or single NIC.] Enhanced vPC allows such a topology and is ideally suited for Big Data applications: in an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port. System high availability is a function of topology and component-level high availability.
  • 32. Availability with Single-Attached Servers (1G or 10G). It is important to evaluate the overall availability of the system: network failures can span many nodes, causing rebalancing and decreased overall resources, and typically multi-TB data transfers occur for a single ToR or FEX failure (a rough estimate follows below). Load sharing, ease of management and a consistent SLA are important to enterprise operation. Failure domain impact on job completion: a 1 TB TeraSort typically takes ~4:20–4:30 minutes, and the failure of a SINGLE node (either a NIC or a server component) results in roughly doubling the job completion time. The key observation is that the failure impact depends on the type of workload being run on the cluster: short-lived interactive vs. short-lived batch vs. long jobs (ETL, normalization, joins). [Topology: 32 single-NIC servers per ToR.]
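
A rough, assumption-based Python estimate of the "multi-TB of data transfer" that follows a ToR/FEX (i.e. whole-rack) failure with single-attached nodes: HDFS must re-replicate every block that lost a replica on the failed nodes. The per-node HDFS usage is an illustrative assumption, not a figure from the validated cluster.

    nodes_per_rack = 32
    used_tb_per_node = 2.0   # assumed HDFS data held per data node (illustrative)

    def rereplication_tb(failed_nodes):
        """TB that must be read from surviving replicas and written to new nodes."""
        return failed_nodes * used_tb_per_node

    print("single node failure :", rereplication_tb(1), "TB re-replicated")
    print("ToR/FEX (rack) fail :", rereplication_tb(nodes_per_rack), "TB re-replicated")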
  • 33. Availability – Single-Attached vs. Dual-Attached Nodes. With dual attachment there is no single point of failure from the network viewpoint and no impact on job completion time. NIC bonding is configured in Linux with the LACP mode of bonding, giving effective load sharing of traffic flows across the two NICs. It is recommended to change the hashing to src-dst-ip-port (on both the network and the Linux NIC bonding) for optimal load sharing.
  • 34. Availability – Network Failure Results, 1 TB TeraSort (ETL). Failures of various components were introduced at 33%, 66% and 99% of reducer completion (96 nodes, 2 FEX per rack). A singly attached NIC server or rack failure has a bigger impact on job completion time than any other failure, and a FEX failure is a rack failure for the 1G topology. Job completion time (seconds) with various failures, 1G single attached vs. 2G dual attached: peer link (Nexus 5000) – 301 vs. 258; FEX* – 1137 vs. 259; rack* – 1137 vs. 1017; one port single attached and one port dual attached – see previous slide. (*Variance in run time with % of reducers completed.)
  • 35. Network Reference Architecture Characteristics – Network Attributes: architecture, availability, capacity, scale & oversubscription, flexibility, management & visibility.
  • 36. Oversubscription Design. Hadoop is a parallel, batch-job-oriented framework; its primary benefit is the reduction in job completion time for work that would otherwise take longer with traditional techniques (e.g. large ETL, log analysis, join-only-Map jobs). Oversubscription typically occurs with 10G server access rather than at 1G. A non-blocking network is NOT needed, but the degree of oversubscription matters for job completion time, for replication of results, and for oversubscription during a rack or FEX failure; also consider static vs. actual oversubscription – how much data a single node can push is often I/O bound and depends on the disk configuration. Uplinks vs. theoretical oversubscription (16 servers): 8 – 2:1, 4 – 4:1, 2 – 8:1, 1 – 16:1; measured results on the next slides. (See the calculation below.)
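
A small Python sketch of the slide's oversubscription column: the ratio of offered server bandwidth to uplink bandwidth. Using 16 servers and equal-speed server and uplink ports is the assumption that reproduces the 2:1 through 16:1 figures.

    def oversubscription(servers, server_gbps, uplinks, uplink_gbps):
        """Access-layer oversubscription = offered server bandwidth / uplink bandwidth."""
        return (servers * server_gbps) / (uplinks * uplink_gbps)

    for uplinks in (8, 4, 2, 1):
        ratio = oversubscription(16, 10, uplinks, 10)
        print(f"{uplinks} uplinks -> {ratio:.0f}:1 theoretical oversubscription")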
  • 37. Network Oversubscriptions  Steady state  Result Replication with 1,2,4, & 8 uplink  Rack Failure with 1, 2, 4 & 8 Uplink 37
  • 38. Data Node Speed Differences – 1G vs. 10G (TCPDUMP of reducer TX). Generally 1G is used largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload. 10G shows reduced spikes and smoother job completion time. Multiple 1G or 10G links can be bonded together to increase not only bandwidth but also resiliency.
  • 39. 1GE vs. 10GE Buffer Usage. Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer. [Chart: 1G vs. 10G buffer (cell) usage over the job, with 1G and 10G map % and reduce % overlaid.] By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, since the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to the limits of I/O and compute capabilities.
  • 40. Network Reference Architecture Characteristics – Network Attributes. Nexus LAN and SAN core, optimized for the data centre: architecture, availability, capacity, scale & oversubscription, flexibility; edge/access layer: management & visibility.
  • 41. Multi-use Cluster Characteristics. Hadoop clusters are generally multi-use, and the effect of background use can affect any single job's completion. A given cluster may be running many different types of jobs, importing into HDFS, etc. Example view of 24-hour cluster use: importing data into HDFS; a large ETL job overlapping with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs and purple lines are BI jobs).
  • 42. 100 Jobs, Each with a 10 GB Data Set – Stable, Node & Rack Failure. Almost all jobs are impacted by a single node failure; with multiple jobs running concurrently, the impact of a node failure is as significant as a rack failure.
  • 43. Network Reference Architecture Characteristics – Network Attributes. Nexus LAN and SAN core, optimized for the data centre: architecture, availability, capacity, scale & oversubscription, flexibility; edge/access layer: management & visibility.
  • 44. Burst Handling and Queue Depth. Several HDFS operations and phases of MapReduce jobs are very bursty in nature, and the extent of the bursts largely depends on the type of job (ETL vs. BI). Bursty phases can include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase. A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb the bursts. Optimal buffering: given a large enough incast, TCP will collapse at some point no matter how large the buffer; this has been well studied by multiple universities, and alternate solutions (changing TCP behavior) have been proposed rather than huge-buffer switches – http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
  • 45. Nexus 2248TP-E Buffer Monitoring. The Nexus 2248TP-E utilizes a 32 MB shared buffer to handle larger traffic bursts; Hadoop, NAS and AVID are examples of bursty applications. You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network). Extensive drop counters: counters for both directions (network to host and host to network) on a per-host-interface basis, with drop reasons including out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop and multicast drop. Buffer occupancy counter: how much buffer is being used – one key indicator of congestion or bursty traffic. Example commands:
    N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
    N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
    fex-110# show platform software qosctrl asic 0 0
  • 46. TeraSort FEX (2248TP-E) Buffer Analysis (10 TB). [Chart: FEX #1 and FEX #2 cell usage over the job, with map % and reduce % overlaid; buffer usage peaks during the shuffle phase and during output replication.] The buffer utilization is highest during the shuffle and output replication phases; optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
  • 47. TeraSort (ETL) N3K Buffer Analysis (10 TB). [Chart: N3048 #1, N3048 #2 and N3064 cell usage over the job, with map % and reduce % overlaid; buffer usage peaks during the shuffle phase and during output replication.] The buffer utilization is highest during the shuffle and output replication phases, and the aggregation switch buffer remained flat as the bursts were absorbed at the top-of-rack layer. Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
  • 48. Network Latency. Generally network latency, while being important, does not represent a significant factor for Hadoop clusters: the N3K topology and the 5K/2K topology showed consistent completion times across 1 TB, 5 TB and 10 TB data sets on the 80-node cluster. Note: there is a difference between network latency and application latency; optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
  • 49. Summary. Extensive validation of the Hadoop workload and the reference architecture makes deployment easy for the enterprise, demystifies the network for Hadoop, and shows how to integrate with enterprise designs through efficient choices of network topology and devices. Key findings: 10G and/or dual-attached servers provide consistent job completion time and better buffer utilization; 10G reduces bursts at the access layer; a single-attached node failure has a considerable impact on job completion time; dual-attached servers are recommended (1G or 10G, with 10G for future proofing); rack failure has the biggest impact on job completion time; a non-blocking network is not required, but the degree of oversubscription does impact job completion time; latency does not matter much for Hadoop workloads.
  • 50. Big Data @ Cisco – www.cisco.com/go/bigdata. 128 node / 1 PB test cluster. Certifications and solutions with UCS C-Series and Nexus 5500 + 22xx: EMC Greenplum MR Solution; Cloudera Hadoop Certified Technology; Cloudera Hadoop Solution Brief; Oracle NoSQL Validated Solution; Oracle NoSQL Solution Brief. Multi-month network and compute analysis testing (in conjunction with Cloudera): Network/Compute Considerations whitepaper; analysis presented at Hadoop World.