Transcript

  • 1. Hadoop - Validated Network Architecture and Reference Deployment in Enterprise. Nimish Desai – nidesai@cisco.com, Technical Leader, Data Center Group, Cisco Systems Inc.
  • 2. Session Objectives & Takeaways. Goal 1: Provide a reference network architecture for Hadoop in the enterprise. Goal 2: Characterize the Hadoop application's behavior on the network. Goal 3: Present network validation results with Hadoop workloads.
  • 3. Big Data in Enterprise
  • 4. Validated 96-Node Hadoop Cluster. Two topologies were validated, each with a Name Node and Data Nodes 1–96 on Cisco UCS C200 servers (single NIC): a traditional DC design (Nexus 55xx with 2248TP-E FEX) and a Nexus 7K/N3K based topology. Hadoop framework: Apache 0.20.2 on Linux 6.2, with 10 map and 2 reduce slots per node. Compute: UCS C200 M2 – 12 cores (2 x Intel Xeon X5670 @ 2.93 GHz), 4 x 2 TB 7.2K RPM disks, 1G LOM and 10G Cisco UCS P81E. Network: three racks of 32 nodes each; distribution layer of Nexus 7000 or Nexus 5000; ToR of FEX (2 per rack) or Nexus 3000; each rack with 32 single- or dual-attached hosts.
  • 5. Data Center Infrastructure. A WAN edge layer sits above a Layer 3 core of Nexus 7000 (10 GE); an aggregation and services layer of Nexus 7000 runs vPC+/FabricPath at Layer 2 with attached network services; and a unified access layer of Nexus 5500, Nexus 2000 FEX (2148TP-E, 2232, B22), Nexus 3000, CBS 31xx blade switches, and UCS provides end-of-row and top-of-rack 1 GE and 10 GE access. The SAN side pairs MDS 9500 directors (SAN A / SAN B) with MDS 9200/9100 SAN edge switches over 4/8 Gb FC and FCoE/DCB. Server attachment options: 1 GbE access with 4/8 Gb FC via dual HBA (SAN A / SAN B), 10 Gb DCB/FCoE access, or 10 GbE access with 4/8 Gb FC via dual HBA.
  • 6. Big Data Application Realm – Web 2.0 & Social/Community Networks. Data lives and dies in Internet-only entities. Data domain: partially private data, UI, and service store. Homogeneous data life cycle: mostly unstructured, web-centric and user-driven, a unified workload with few processes and owners, typically non-virtualized. Scaling and integration dynamics: purpose-driven applications, thousands of nodes, hundreds of PB and growing exponentially.
  • 7. Big Data Application Realm – Enterprise. Data lives in a confined zone of the enterprise repository (call center, sales pipeline, ERP modules, document management, records management, office apps, collaboration, video conferencing, product catalogs, customer databases, VoIP, executive reports) and is long-lived, regulatory- and compliance-driven. Heterogeneous data life cycle: many data models; diverse structured and unstructured data; diverse data sources, e.g. subscriber-based customer databases (Oracle/SAP); diverse workloads from many sources, groups, processes, and technologies; virtualized and non-virtualized, mostly SAN/NAS based. Scaling and integration dynamics are different: data warehousing (structured) with diverse repositories plus unstructured data; a few hundred to a thousand nodes and a few PB; integration, policy, and security challenges. Each application, group, or technology is limited in its data generation, its consumption, and the confined domains it services.
  • 8. Big Data Framework Application Comparison.
    Batch-oriented Big Data (Hadoop): unstructured data – files, logs, web clicks; data format abstracted to the higher-level application programming; schema-less, flexible for later re-use; write once, read many; data never dies; linear scaling; the entire data set is at play for a given query; multi-PB; not suited for ad-hoc analysis.
    Real-time Big Data NoSQL: HBase, Cassandra, Oracle NoSQL; structured and unstructured data; sparse column-family data storage or key-value pairs; not an RDBMS, though with some schema; random read and write; modeled after Google's BigTable; high transaction rates – real-time scaling to millions; more suited for ~1 PB.
    Relational database: structured data, row oriented; optimized for OLTP/OLAP; rigid schema applied to data on insert/update; read and write (insert, update) many times; non-linear scaling; most transactions and queries involve a small subset of the data set; transactional – scaling to thousands of queries; GB to TB sizes.
  • 9. Data Sources. A Big Data enterprise application draws on machine logs, sensor data, call data records, web click-stream data, satellite feeds, GPS data, blogs, emails, pictures, and video, alongside business data such as sales, products, process, inventory, finance, payroll, shipping, tracking, authorization, and customer profiles, feeding a column store for business intelligence.
  • 10. Big Data Building Blocks into the Enterprise. Big Data application inputs – click streams, social media, event data, mobility trends, and sensor logs – land on virtualized, bare-metal, and cloud infrastructure over the Cisco Unified Fabric, alongside traditional storage (SAN and NAS), RDBMS, "Big Data" NoSQL databases (real-time capture, read, and update operations), and "Big Data" Hadoop (capture, store, and analyze).
  • 11. Infinite Use Cases. Web & e-commerce: faster user response, customer behavior and pricing models, ad targeting. Retail: customer churn, integration of brick-and-mortar with .com business models, PoS transaction analysis. Insurance & finance: risk management, user behavior and incentive management, trade surveillance for financials. Network analytics (Splunk): text mining, fault prediction. Security & threat defense.
  • 12. Hadoop Cluster Design & Reference Network Architecture
  • 13. Hadoop Components and Operations – Hadoop Distributed File System. Data is not centrally located; it is stored across all data nodes in the cluster, making HDFS scalable and fault tolerant. Data is divided into multiple large blocks – 64 MB by default, typically 128 MB – and blocks are not related to disk geometry. Data is stored reliably: each block is replicated 3 times. Types of functions: the Name Node (master) manages the cluster; Data Nodes (map and reduce) carry the blocks. (The diagram shows blocks 1–6 spread over data nodes 1–15 behind three ToR FEX/switches; a rough sizing sketch follows below.)
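  As a back-of-envelope sketch (mine, not from the deck) of what the block size and replication factor mean for sizing, using the deck's decimal units; one map task is typically scheduled per block:

      import math

      def hdfs_footprint(input_bytes, block_bytes=128 * 10**6, replication=3):
          """Rough HDFS sizing: block count and raw capacity consumed after replication."""
          blocks = math.ceil(input_bytes / block_bytes)   # roughly one map task per block
          raw_bytes = input_bytes * replication           # 1 local copy + (replication - 1) remote copies
          return blocks, raw_bytes

      # Example: a 1 TB input with 128 MB blocks and the default 3x replication
      blocks, raw = hdfs_footprint(10**12)
      print(blocks, "blocks,", raw / 10**12, "TB of raw HDFS capacity")   # 7813 blocks, 3.0 TB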
  • 14. Hadoop Components and Operations. Name Node: runs a scheduler – the Job Tracker – and manages all data nodes in memory; a Secondary Name Node keeps a snapshot of the HDFS cluster metadata; typically all three JVMs can run on a single node. Data Node: the Task Tracker receives job information from the Job Tracker (Name Node); map and reduce tasks are managed by the Task Tracker, with a configurable ratio of map to reduce tasks per node/CPU/core for various workloads. Data locality: if the data is not available where a map task is assigned, the missing block is copied over the network.
  • 15. Characteristics that Affect Hadoop Clusters. Cluster size: number of data nodes. Data model and mapper/reducer ratio: the MapReduce functions used. Input data size: the total starting dataset. Data locality in HDFS: the ability to process data where it is already located. Background activity: the number and type of jobs running, importing, and exporting. Characteristics of the data node: I/O, CPU, memory, etc. Networking characteristics: availability, buffering, data node speed (1G vs. 10G), oversubscription, latency. See http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/
  • 16. Hadoop Components and Operations. Data ingest and replication: external connectivity brings unstructured data into HDFS, and east-west traffic replicates the data blocks. Map phase: raw data is analyzed and converted into name/value pairs; the workload translates into multiple batches of map tasks; reducers can start the reduce phase ONLY after the entire map set is complete. The map phase is mostly an I/O and compute function. (Diagram: map tasks feed a shuffle phase that groups keys 1–4 for four reducers, which produce the result/output.)
  • 17. Hadoop Components and Operations. Shuffle phase: all name/value pairs are sorted and grouped by their keys, and the mappers send this data to the reducers – high network activity. Reduce phase: all values associated with a key are processed into results in three steps – copy (get intermediate results from each data node's local disk), merge (to reduce the number of files), and the reduce method itself. Output replication phase: the reducer replicates the result to multiple nodes – the highest network activity. Network activity depends on workload behavior. (A minimal word-count sketch of these phases follows below.)
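  To make the map → shuffle → reduce flow on slides 16–17 concrete, here is a minimal single-process word-count sketch (illustrative only; real Hadoop runs these phases in parallel across data nodes and moves the shuffle data over the network):

      from collections import defaultdict

      documents = ["to be or not to be", "all the world is a stage"]

      # Map phase: each input record is turned into (key, value) pairs
      mapped = [(word, 1) for doc in documents for word in doc.split()]

      # Shuffle phase: pairs are sorted/grouped by key (the network-heavy step in Hadoop)
      groups = defaultdict(list)
      for key, value in mapped:
          groups[key].append(value)

      # Reduce phase: the values for each key are combined into the final result
      result = {key: sum(values) for key, values in groups.items()}
      print(result)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1, ...}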
  • 18. MapReduce Data Model – ETL & BI Workload Benchmarks. The complexity of the functions used in map and/or reduce has a large impact on job completion time and network traffic. Yahoo TeraSort – ETL workload, the most network intensive: input, shuffle, and output data sizes are the same (e.g. a 10 TB data set in all phases), and TeraSort has a more balanced split of map vs. reduce functions – linear compute and I/O (timeline: map start, reducers start, map finish, job finish). Shakespeare WordCount – BI workload: the data set size varies by phase, with varying impact on the network (e.g. 1 TB input, 10 MB shuffle, 1 MB output); most of the processing is in the map functions, with smaller intermediate and even smaller final data.
  • 19. ETL Workload (1 TB Yahoo TeraSort) – network graph of all traffic received on a single node (80-node run). Shortly after the reducers start, map tasks are finishing and data is being shuffled to the reducers. Once the maps completely finish, the network is no longer used, as the reducers have all the data they need to finish the job. (In the graph, the red line is the total traffic received by node hpc064; the other symbols each represent a node sending traffic to hpc064. Markers: job start, maps start, reducers start, maps complete, finish.)
  • 20. ETL Workload (1 TB Yahoo TeraSort) – network activity of all traffic received on a single node (80-node run) with output data replication enabled. If output replication is enabled, the end of the TeraSort must store additional copies: for a 1 TB sort, 2 TB will need to be replicated across the network. Replication of 3 is enabled (1 copy stored locally, 2 stored remotely), so each reduce output is now replicated rather than just stored locally.
  • 21. BI Workload – network graph of all traffic received on a single node (80-node run): WordCount on 200K copies of the complete works of Shakespeare. Due to the combination of the length of the map phase and the reduced data set being shuffled, the network is utilized throughout the job, but by a limited amount. (The red line is the total traffic received by hpc064; the other symbols represent nodes sending traffic to hpc064. Markers: job start, maps start, reducers start, maps complete, finish.)
  • 22. Data Locality in HDFS. Data locality is the ability to process data where it is locally stored. Observations: the initial spike in RX traffic occurs before the reducers kick in; it represents data that each map task needs but that is not local, and the spike is mainly data from only a few nodes. Note: during the map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored; this is not perfect and depends on which data nodes hold the data, which is a consideration when choosing the replication factor – more replicas tend to create a higher probability of data locality. Map tasks: the initial spike is for non-local data; sometimes a task is scheduled on a node that does not have the data available locally.
  • 23. Map to Reducer Ratio Impact on Job Completion. A 1 TB file with 128 MB blocks yields 7,813 map tasks. Job completion time is directly related to the number of reducers, and average network buffer usage falls as the number of reducers is lowered, and vice versa (see hidden slides). (The graphs plot job completion time in seconds against 192, 96, 48, 24, 12, and 6 reducers; see the sketch below.)
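  A minimal sketch of the arithmetic behind this slide, assuming the validated cluster's slot counts from slide 4 (96 nodes, 10 map and 2 reduce slots per node) and counting only scheduling "waves", not actual run times:

      import math

      def task_waves(input_bytes, block_bytes, nodes, map_slots, reduce_slots, reducers):
          """How many scheduling waves of map and reduce tasks a job needs."""
          map_tasks = math.ceil(input_bytes / block_bytes)        # one map task per HDFS block
          map_waves = math.ceil(map_tasks / (nodes * map_slots))  # maps run in parallel waves
          reduce_waves = math.ceil(reducers / (nodes * reduce_slots))
          return map_tasks, map_waves, reduce_waves

      # 1 TB TeraSort, 128 MB blocks, 96 nodes, 10 map / 2 reduce slots per node, 96 reducers
      print(task_waves(10**12, 128 * 10**6, 96, 10, 2, 96))
      # -> (7813, 9, 1): 7,813 map tasks in ~9 waves; 96 reducers fit in one wave, so each
      #    reducer handles a larger share of the shuffle data than it would with 192 reducers.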
  • 24. Job Completion Time with 96 Reducers
  • 25. Job Completion Time with 48 Reducers
  • 26. Job Completion Graph with 24 Reducers
  • 27. Network Characteristics. The relative impact of various network characteristics on Hadoop clusters: availability, buffering, oversubscription, data node speed, and latency. (*Not scaled or measured data.)
  • 28. Validated Network Reference Architecture
  • 29. Data Center Access Connectivity. A Nexus 7000 core/distribution layer (LAN) and MDS 9000 (SAN) sit above a unified access layer: Nexus 5000 with Nexus 2000 FEX for 1GE and 10GE rack-mount servers, direct-attach 10GE rack-mount servers, Nexus 1000V for virtual access, Nexus 4000 for blade chassis with pass-through (IBM/Dell), 10GE blade switches with FCoE, and UCS blade and rack compute.
  • 30. Network Reference Architecture. Network attributes: architecture; availability; capacity, scale & oversubscription; flexibility; management & visibility. (Diagram: a Nexus LAN and SAN core optimized for the data centre above an edge/access layer of blade-server racks.)
  • 31. Scaling the Data Centre Fabric – changing the device paradigm. De-coupling of the Layer 1 and Layer 2 topologies. Simplified management model, plug-and-play provisioning, centralized configuration. Line-card portability (N2K supported with multiple parent switches – N5K, 6100, N7K). Unified access for any server (100M → 1GE → 10GE → FCoE): scalable Ethernet, HPC, unified fabric, or virtualization deployments – in effect a virtualized switch.
  • 32. Hadoop Network Topologies – Reference Unified Fabric & ToR DC Design. Integration with the enterprise architecture is the essential pathway for data flow: consistency, management, risk assurance, and enterprise-grade features. Consistent operational model: NX-OS, CLI, fault behavior, and management. Hadoop carries higher east-west bandwidth than traditional transactional networks, and over time it takes on multi-user, multi-workload behavior, so it needs enterprise-centric features such as security, SLA, and QoS. Validated options: 1 Gbps attached server – Nexus 7000/5000 with 2248TP-E, or Nexus 7000 with 3048; NIC teaming, 1 Gbps attached – Nexus 7000/5000 with 2248TP-E, or Nexus 7000 with 3048; 10 Gbps attached server – Nexus 7000/5000 with 2232PP, or Nexus 7000 with 3064; NIC teaming, 10 Gbps attached – Nexus 7000/5000 with 2232PP, or Nexus 7000 with 3064.
  • 33. Validated Reference Network Topology – the same 96-node setup shown on slide 4: a traditional DC design (Nexus 55xx/2248TP-E) and a Nexus 7K/N3K based topology, each with a Name Node and Data Nodes 1–96 on Cisco UCS C200 (single NIC). Hadoop: Apache 0.20.2 on Linux 6.2, 10 map and 2 reduce slots per node. Compute: UCS C200 M2, 12 cores (2 x Intel Xeon X5670 @ 2.93 GHz), 4 x 2 TB 7.2K RPM disks, 1G LOM / 10G Cisco UCS P81E. Network: three racks of 32 nodes, distribution layer of Nexus 7000 or 5000, ToR of FEX (2 per rack) or Nexus 3000, with each rack holding 32 single- or dual-attached hosts.
  • 34. Network Reference Architecture Characteristics. Network attributes: architecture; availability; capacity, scale & oversubscription; flexibility; management & visibility.
  • 35. High Availability Switching Design – Common High Availability Engineering Principles. The core high-availability design principles are common across all network systems designs. Understand the causes of network outages: component failures and network anomalies. Understand the engineering foundations of system-level availability: device- and network-level MTBF, hierarchical and modular design, and the HW/SW interaction in the system. Enhanced vPC allows such a topology and is ideally suited for Big Data applications: with Enhanced vPC (EvPC), any and all server NIC teaming configurations (dual NIC 802.3ad, dual NIC active/standby, single NIC) are supported on any port. System high availability is a function of topology and component-level high availability. (Diagram: dual-node, full-mesh L3 and L2 layers down to the ToR with NIC teaming.)
  • 36. Availability with Single-Attached Servers (1G or 10G). It is important to evaluate the overall availability of the system: network failures can span many nodes, causing rebalancing and decreased overall resources, and typically multi-TB data transfers occur for a single ToR or FEX failure. Load sharing, ease of management, and a consistent SLA are important to enterprise operations. Failure domain impact on job completion: a 1 TB TeraSort typically takes ~4.20–4.30 minutes, and a failure of a SINGLE node (either the NIC or a server component) results in roughly doubling of the job completion time. The key observation is that the failure impact depends on the type of workload being run on the cluster: short-lived interactive vs. short-lived batch vs. long jobs (ETL, normalization, joins). Test setup: single NIC, 32 nodes per ToR.
  • 37. Single Node Failure Job Completion Time. The map tasks execute in parallel, so the unit time for each map task per node remains the same and the nodes more or less complete their maps at roughly the same time. During a failure, however, a set of map tasks remains pending (since the other nodes in the cluster are still completing their own tasks) until ALL nodes finish their assigned tasks. Once all nodes finish their map tasks, the leftover map tasks are reassigned by the name node; the unit time to finish those reassigned map tasks remains the same (linear) as for the other maps – they just do not happen in parallel, which is why job completion time can double. This is the worst-case scenario with TeraSort; other workloads may have variable completion times. (A rough model follows below.)
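  A rough model (mine, following the deck's own reasoning rather than a measured result) of why a late single-node failure can roughly double the map phase: the failed node's tasks are reassigned only after the surviving nodes finish their own share, and for this TeraSort worst case the deck treats the re-run as effectively serial with respect to the original schedule. The per-task time is an assumed placeholder.

      import math

      def map_phase_seconds(tasks_per_node, map_slots, task_time, failed=False):
          """Idealized map-phase duration with and without a late single-node failure."""
          waves = math.ceil(tasks_per_node / map_slots)   # waves each node needs for its share
          base = waves * task_time                        # healthy run: nodes finish together
          if not failed:
              return base
          # Worst case per the deck: the failed node's share is re-run as a serial tail
          rerun = math.ceil(tasks_per_node / map_slots) * task_time
          return base + rerun

      # ~7,813 map tasks over 96 nodes -> ~82 tasks per node; assume 30 s per map task
      print(map_phase_seconds(82, 10, 30))                # 270 s with no failure
      print(map_phase_seconds(82, 10, 30, failed=True))   # 540 s: roughly doubled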
  • 38. 1G Port Traffic & Job Completion Time
  • 39. 1G Port Failure Traffic & Job Completion Time
  • 40. Availability with Dual-Attached Servers – 1G and 10G Server NIC Teaming Topologies. Dual-homing the server's network connection (active-active) reduces replication and data movement during failures and allows optimal load-sharing. Dual-homing the FEX avoids a single point of failure. Enhanced vPC allows such a topology and is ideally suited for Big Data applications: with Enhanced vPC (EvPC), any and all server NIC teaming configurations (dual NIC 802.3ad, dual NIC active/standby, single NIC) are supported on any port. This is supported with the Nexus 5500 only; alternatively, Nexus 3000 vPC allows host-level redundancy with ToR ECMP.
  • 41. Availability – Single-Attached vs. Dual-Attached Node. With dual attachment there is no single point of failure from the network viewpoint and no impact on job completion time. NIC bonding is configured in Linux with the LACP bonding mode. Traffic flows are effectively load-shared across the two NICs. It is recommended to change the hashing to src-dst-ip-port (on both the network and the Linux NIC bonding) for optimal load-sharing (see the sketch below).
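  A toy sketch of why hashing on source/destination IP and port spreads Hadoop's many concurrent flows evenly over the two bonded links (illustrative only; real 802.3ad/port-channel hashing uses platform-specific hash functions, and the addresses and ports below are made up):

      import random

      def pick_link(src_ip, dst_ip, src_port, dst_port, num_links=2):
          """Toy layer3+4 hash: all packets of a flow stay on one link, flows spread across links."""
          return hash((src_ip, dst_ip, src_port, dst_port)) % num_links

      random.seed(1)
      flows = [("10.0.1.%d" % random.randint(1, 96), "10.0.1.50",
                random.randint(32768, 61000), 50010) for _ in range(1000)]
      counts = [0, 0]
      for flow in flows:
          counts[pick_link(*flow)] += 1
      print(counts)   # roughly even, e.g. ~500 flows per NIC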
  • 42. Availability – Network Failure Results, 1 TB TeraSort (ETL). Failures of various components (FEX/ToR, peer link, rack) were introduced at 33%, 66%, and 99% of reducer completion. A singly attached NIC server and a rack failure have a bigger impact on job completion time than any other failure, and a FEX failure is a rack failure for the 1G topology. Job completion time in seconds with various failures (1G single-attached vs. 2G dual-attached): Nexus 5000 peer link – 301 vs. 258; FEX – 1137* vs. 259; rack – 1137* vs. 1017; a single port, single-attached – see previous slide; a single port, dual-attached – see previous slide. (*Variance in run time with the % of reducers completed.) Test setup: 96 nodes across three racks, 2 FEX per rack.
  • 43. Network Reference Architecture Characteristics. Network attributes: architecture; availability; capacity, scale & oversubscription; flexibility; management & visibility.
  • 44. Cluster Scaling – Nexus 7K/5K & FEX (2248TP-E or 2232). 1G based – Nexus 2248TP-E: 48 x 1G host ports and up to 4 uplinks bundled into a single port channel. 10G based – Nexus 2232: 32 x 10G host ports and up to 8 uplinks bundled into a single port channel. The Nexus 2248TP-E and 2232 support both local port channels and vPC for distributed port channels (host attachment via 802.3ad & vPC, or single-attached).
  • 45. Oversubscription Design. Hadoop is a parallel, batch-job-oriented framework, and its primary benefit is the reduction in job completion time for work that would otherwise take longer with traditional techniques, e.g. large ETL, log analysis, or join-only-map jobs. Typically, oversubscription occurs with 10G server access rather than with 1G. A non-blocking network is NOT needed; however, the degree of oversubscription matters for job completion time, for replication of results, and for oversubscription during a rack or FEX failure. Static vs. actual oversubscription: how much data a single node can push is often I/O bound and a function of the number of disks configured. Uplink configurations tested with 16 servers: 8 uplinks – 2:1 theoretical oversubscription; 4 uplinks – 4:1; 2 uplinks – 8:1; 1 uplink – 16:1 (measured results on the next slides; see the sketch below).
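  A minimal sketch of the static oversubscription arithmetic behind that table, assuming 16 servers attached at 10G to a ToR with 10G uplinks (the deck's point is that the measured ratio is usually lower, since a node's disk I/O limits what it can actually push):

      def static_oversubscription(servers, server_gbps, uplinks, uplink_gbps):
          """Worst-case (static) edge oversubscription: host-facing bandwidth vs. uplink bandwidth."""
          return (servers * server_gbps) / (uplinks * uplink_gbps)

      for uplinks in (8, 4, 2, 1):
          ratio = static_oversubscription(16, 10, uplinks, 10)
          print("%d uplink(s): %d:1" % (uplinks, ratio))
      # 8 uplink(s): 2:1, 4 uplink(s): 4:1, 2 uplink(s): 8:1, 1 uplink(s): 16:1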
  • 46. Network Oversubscription. Measured results for steady state, result replication with 1, 2, 4, and 8 uplinks, and rack failure with 1, 2, 4, and 8 uplinks.
  • 47. Data Node Speed Differences, 1G vs. 10G – TCPDUMP of the reducers' TX. Generally 1G is used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload. 10G shows reduced spikes and a smoother job completion time. Multiple 1G or 10G links can be bonded together to increase not only bandwidth but also resiliency.
  • 48. 1GE vs. 10GE Buffer Usage. Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer: the data node has a wider pipe to receive data, lessening the need for buffering in the network, since the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to the limits of the node's I/O and compute capabilities. (Chart: 1G vs. 10G buffer usage in cells over the course of the job, overlaid with 1G and 10G map and reduce completion percentages.)
  • 49. Network Reference Architecture Characteristics. Network attributes: architecture; availability; capacity, scale & oversubscription; flexibility; management & visibility.
  • 50. Multi-use Cluster Characteristics. Hadoop clusters are generally multi-use, and the effect of background use can affect any single job's completion: a given cluster runs many different types of jobs, imports into HDFS, etc. Example view of 24 hours of cluster use: data being imported into HDFS while a large ETL job overlaps with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs, purple lines are BI jobs).
  • 51. 100 Jobs, Each with a 10 GB Data Set – Stable, Node & Rack Failure. Almost all jobs are impacted by a single node failure. With multiple jobs running concurrently, the impact of a node failure is as significant as that of a rack failure.
  • 52. Network Reference Architecture Characteristics. Network attributes: architecture; availability; capacity, scale & oversubscription; flexibility; management & visibility.
  • 53. Burst Handling and Queue Depth. Several HDFS operations and phases of MapReduce jobs are very bursty in nature, and the extent of the bursts largely depends on the type of job (ETL vs. BI). Bursty phases include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase. A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb them. On optimal buffering: given a large enough incast, TCP will collapse at some point no matter how large the buffer; this has been well studied by multiple universities, and alternate solutions (changing TCP behavior) have been proposed rather than huge-buffer switches – see http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
  • 54. Nexus 2248TP-E Buffer Monitoring. The Nexus 2248TP-E utilizes a 32 MB shared buffer to handle larger traffic bursts; Hadoop, NAS, and AVID are examples of bursty applications. You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network). Extensive drop counters are provided per host interface in both directions (network-to-host and host-to-network), broken out by reason: out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop, and multicast drop. A buffer occupancy counter shows how much buffer is being used – one key indicator of congestion or bursty traffic. Example commands: N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx | N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx | fex-110# show platform software qosctrl asic 0 0
  • 55. Buffer Monitoring – sample output:
      switch# attach fex 110
      Attaching to FEX 110 ...
      To exit type exit, to abort type $.
      fex-110# show platform software qosctrl asic 0 0
      number of arguments 4: show asic 0 0
      ----------------------------------------
      QoSCtrl internal info {mod 0x0 asic 0}
      mod 0 asic 0:
      port type: CIF [0], total: 1, used: 1
      port type: BIF [1], total: 1, used: 0
      port type: NIF [2], total: 4, used: 4
      port type: HIF [3], total: 48, used: 48
      bound NIF ports: 2
      N2H cells: 14752
      H2N cells: 50784
      ----Programmed Buffers---------
      Fixed Cells : 14752
      Shared Cells : 50784   <- allocated buffer in terms of cells (512 bytes each)
      ----Free Buffer Statistics-----
      Total Cells : 65374
      Fixed Cells : 14590
      Shared Cells : 50784   <- number of free cells to be monitored
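  A small sketch (my own arithmetic, not from the deck) converting the cell counts above and the queue-limit values on slide 54 into bytes, using the 512-byte cell size noted in the output:

      CELL_BYTES = 512   # per-cell size on the 2248TP-E, as annotated in the show output above

      def cells_to_mb(cells, cell_bytes=CELL_BYTES):
          return cells * cell_bytes / 1024**2

      print("shared pool: %.1f MB" % cells_to_mb(50784))   # ~24.8 MB of the ~32 MB shared buffer
      print("fixed pool:  %.1f MB" % cells_to_mb(14752))   # ~7.2 MB
      # The queue-limit on slide 54 is configured in bytes: 4194304 bytes / 512 = 8192 cells
      print(4194304 // CELL_BYTES, "cells per-port tx queue limit")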
  • 56. TeraSort FEX (2248TP-E) Buffer Analysis (10 TB). (Chart of 2248TP-E buffer cell usage over the course of the run, with peaks during the shuffle phase and during output replication, overlaid with map and reduce progress.) The buffer utilization is highest during the shuffle and output replication phases; optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
  • 57. Buffer Depth Monitoring: Interface. A real-time command displays the status of the shared buffer; XML support will be added in a maintenance release. Counters are displayed as cell counts – a cell is approximately 208 bytes. Command: show hardware internal buffer info pkt-stats [brief|clear|detail]. The output reports buffer usage, free buffer, total buffer space on the platform, and maximum buffer usage since the last clear.
  • 58. TeraSort (ETL) N3K Buffer Analysis (10 TB). (Chart of top-of-rack and aggregation switch buffer usage over the course of the run, overlaid with map and reduce progress.) The buffer utilization is highest during the shuffle and output replication phases; optimized buffer sizes are required to avoid packet loss leading to slower job completion times. The aggregation switch buffer remained flat, as the bursts were absorbed at the top-of-rack layer.
  • 59. Network Latency. Generally, network latency, while important, does not represent a significant factor for Hadoop clusters: the N3K and 5K/2K topologies show consistent job completion times across 1 TB, 5 TB, and 10 TB data set sizes (80-node cluster). Note: there is a difference between network latency and application latency; optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
  • 60. Summary. Extensive validation of the Hadoop workload and a reference architecture make it easy for the enterprise, demystify the network for Hadoop deployments, and integrate with enterprise designs through efficient choices of network topology and devices. Key findings: 10G and/or dual-attached servers provide consistent job completion times and better buffer utilization, and 10G reduces bursts at the access layer; a single-attached node failure has a considerable impact on job completion time, so dual-attached servers (1G or 10G, with 10G for future-proofing) are recommended; a rack failure has the biggest impact on job completion time; a non-blocking network is not required, but the degree of oversubscription does impact job completion time; and latency does not matter much for Hadoop workloads.
  • 61. Big Data @ Cisco. 128-node / 1 PB test cluster. Cisco.com Big Data: www.cisco.com/go/bigdata. Certifications and solutions with UCS C-Series and Nexus 5500 + 22xx: EMC Greenplum MR solution; Cloudera Hadoop certified technology; Cloudera Hadoop solution brief; Oracle NoSQL validated solution; Oracle NoSQL solution brief. Multi-month network and compute analysis testing (in conjunction with Cloudera): network/compute considerations whitepaper; analysis presented at Hadoop World.
  • 62. THANK YOU FOR LISTENING. Nimish Desai – nidesai@cisco.com, Technical Leader, Data Center Group, Cisco Systems Inc.
  • 63. Break! The break takes place in the Community Showcase (Hall 2); sessions will resume at 3:35pm.