Designing Hadoop for the Enterprise Data Center

Strata/Hadoop World 2012 with Jacob Rapp, Cisco & Eric Sammer, Cloudera

  • Speaker notes
    • Diverse data sources - Subscriber based (Census, Proprietary, Buyers, Manufacturing)
    • Generally 1G is being used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload: a reduced traffic spike with 10G and smoother job completion times. Multiple 1G or 10G links can be bonded together to not only increase bandwidth but also increase resiliency.
    • Talk about the intensity of a failure with a smaller job vs. a bigger job. The map tasks are executed in parallel, so the unit time for each map task per node remains the same and the nodes more or less complete their work at roughly the same time. During a failure, however, a set of map tasks remains pending until ALL the nodes finish their assigned tasks (the other nodes in the cluster are still completing their own work). Once all the nodes finish their map tasks, the leftover map tasks are reassigned by the JobTracker; the unit time to finish that set of map tasks remains the same (linear) as the time it took to finish the other maps, they just happen NOT to run in parallel, so the failure can roughly double the job completion time. For example, if every node finishes its share of maps in time T and one node fails near the end, its maps are rerun only after the others complete, so the map phase takes roughly 2T instead of T. This is the worst-case scenario with TeraSort; other workloads may have variable completion times.
  • Transcript

    • 1. Designing Hadoop for the Enterprise Data Center. Jacob Rapp, Cisco; Eric Sammer, Cloudera
    • 2. Agenda. Hadoop Considerations (Traffic Types, Job Patterns, Network Considerations, Compute); Integration (co-exist with current Data Center infrastructure); Multi-tenancy (remove the "Silo clusters").
    • 3. Data in the Enterprise. Data lives in a confined zone of the enterprise repository: long lived, regulatory and compliance driven; heterogeneous data life cycle; many data models; diverse data, structured and unstructured; diverse data sources (subscriber based); diverse workloads from many sources/groups/processes/technologies; virtualized and non-virtualized, with a mostly SAN/NAS base. Scaling and integration dynamics are different: data warehousing (structured) with diverse repositories plus unstructured data; a few hundred to a thousand nodes, a few PB; integration, policy and security challenges. Each app/group/technology is limited in data generation, consumption, and servicing of confined domains. (Diagram: example repositories such as Call Center, Sales Pipeline, ERP Modules A and B, Doc Mgmt A and B, Records Mgmt, Data Service, Social Media, Office Apps, Video Conf, Collab, Product Catalog, Customer DB (Oracle/SAP), VOIP Data, Exec Reports.)
    • 4. Enterprise Data Center Infrastructure. (Diagram of the reference topology: WAN edge layer; Layer 3 / Layer 2 core on Nexus 7000 with 10 GE; aggregation and services layer on Nexus 7000 with vPC+, FabricPath and L2 network services; access layer with Nexus 5500, 2248TP-E / 2232 / B22 FEX, Nexus 3000, CBS 31xx and HP blade switches, UCS and bare-metal servers at end-of-row and top-of-rack; SAN A / SAN B fabrics on MDS 9500 directors and MDS 9200 / 9100 SAN edge, with 10 GE DCB, FCoE and 4/8 Gb FC. Server access options: 1 GbE plus 4/8 Gb FC via dual HBA (SAN A / SAN B), 10 Gb DCB/FCoE, or 10 GbE plus 4/8 Gb FC via dual HBA.)
    • 5. Hadoop Cluster Design & Network Architecture
    • 6. Validated 96 Node Hadoop Cluster. Two topologies were validated: a traditional DC design based on Nexus 55xx/2248, and a Nexus 7K-N3K based topology (distribution layer on Nexus 7000 or Nexus 5000; ToR on FEX or Nexus 3000; 2 FEX per rack). Three racks, each with 32 single- or dual-attached hosts; the name nodes and data nodes (nodes 1-48 and 49-96) are Cisco UCS C200 servers with a single NIC. Hadoop framework: Apache 0.20.2 on Linux 6.2; slots: 10 maps and 2 reducers per node. Compute: UCS C200 M2; cores: 12; processor: 2 x Intel Xeon X5670 @ 2.93 GHz; disk: 4 x 2 TB (7.2K RPM); network: 1G LOM, 10G Cisco UCS P81E.
    • 7. Hadoop Job Patterns and Network Traffic
    • 8. Job Patterns (ingress vs. egress data set ratios): Analyze 1:0.3; Extract Transform Load (ETL) 1:1; Explode 1:2. The time the reducers start is dependent on mapred.reduce.slowstart.completed.maps. It doesn't change the amount of data sent to reducers, but it may change the timing of when that data is sent.
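      The slowstart property above is ordinary job configuration. As a minimal sketch (not from the deck; it assumes the MRv1-era property name quoted on the slide, a hypothetical job name, and input/output paths passed as arguments), a driver can delay reducer launch so shuffle traffic starts later without changing how much data is shuffled:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class SlowstartDriver {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Start reducers only after 80% of map tasks have completed (default is 0.05).
            // This shifts shuffle traffic later in the job but moves the same amount of data.
            conf.set("mapred.reduce.slowstart.completed.maps", "0.80");

            Job job = new Job(conf, "etl-job");                      // hypothetical job name
            job.setJarByClass(SlowstartDriver.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input path (argument)
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path (argument)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

      A higher value trades some map/shuffle overlap for a later, more compressed burst of shuffle traffic.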
    • 9. Traffic Types: small flows/messaging (admin related, heart-beats, keep-alive, delay-sensitive application messaging); small to medium incast (Hadoop shuffle); large flows (HDFS ingest); large incast (Hadoop replication).
    • 10. Map and Reduce Traffic. (Diagram: NameNode, JobTracker and ZooKeeper manage the cluster; Map 1..N feed a many-to-many shuffle into Reducer 1..N, whose output replication lands in HDFS.)
    • 11. Job Patterns have varying impact on network utilization: Analyze, simulated with Shakespeare WordCount; Extract Transform Load (ETL), simulated with Yahoo TeraSort; Explode, simulated with Yahoo TeraSort with output replication.
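      To make the patterns concrete, here is a minimal, hypothetical Analyze-style job in the spirit of the WordCount used above (not the benchmark code; it assumes the Hadoop 0.20 "mapreduce" API, and the class and job names are illustrative). The map output is what crosses the network in the shuffle, and the summed per-word output is far smaller than the input, which is why the Analyze pattern is the least network intensive:

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class AnalyzeWordCount {

          // Map: emit (word, 1) per token; this map output is what crosses the shuffle.
          public static class TokenMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
              }
            }
          }

          // Reduce: sum counts per word; the output is much smaller than the input.
          public static class SumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) sum += v.get();
              result.set(sum);
              context.write(key, result);
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "analyze-wordcount"); // hypothetical name
            job.setJarByClass(AnalyzeWordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // combiner further shrinks shuffle volume
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

      An ETL-style job such as TeraSort, by contrast, shuffles and writes roughly its full input, and adding output replication (the Explode pattern) multiplies the write traffic again.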
    • 12. Data Locality in HDFS. Data locality: the ability to process data where it is locally stored. Note: during the map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data, so it is a consideration when choosing the replication factor: more replicas tend to create a higher probability of data locality. Observations: the initial spike in RX traffic occurs before the reducers kick in; it represents data each map task needs that is not local, and looking at the spike, it is mainly data from only a few nodes. (Chart markers: Maps Start, Reducers Start, Maps Finish, Job Complete; map tasks show an initial spike for non-local data, since a task may sometimes be scheduled on a node that does not have the data available locally.)
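      Since the slide ties locality odds to the replication factor, here is a minimal sketch of adjusting replication with the standard HDFS client API (the path and the factor of 4 are hypothetical, not values from the study):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class RaiseReplication {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default replication for files this client writes; the cluster-wide
            // default normally lives in hdfs-site.xml as dfs.replication.
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);
            // Raise replication on an existing, frequently read input set so more nodes
            // hold a local copy and more map tasks can be scheduled node-local.
            fs.setReplication(new Path("/data/hot-input"), (short) 4); // hypothetical path
            fs.close();
          }
        }

      More replicas cost disk space and replication (write) traffic, so this is a trade-off rather than a free locality win.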
    • 13. Multi-Job Cluster Characteristics. Hadoop clusters are generally multi-use, and the effect of background use can affect any single job's completion. A given cluster may be running many different types of jobs, importing into HDFS, etc. Example view of 24-hour cluster use: data is being imported into HDFS while a large ETL job overlaps with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs and purple lines are BI jobs).
    • 14. Map to Reducer Ratio Impact on Job Completion. A 1 TB file with 128 MB blocks == 7,813 map tasks. Job completion time is directly related to the number of reducers, and average network buffer usage lowers as the number of reducers gets lower, and vice versa. (Charts: job completion time in seconds versus number of reducers, swept across 192, 96, 48, 24, 12 and 6 reducers.)
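      The map-task arithmetic above follows from roughly one map task per HDFS block, while the reducer count is set explicitly per job. A minimal sketch (the sizes mirror the slide's decimal arithmetic and the 96-reducer point from the sweep; the job name is hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class ReducerSizing {
          public static void main(String[] args) throws Exception {
            long inputBytes = 1000000000000L;      // 1 TB of input
            long blockBytes = 128L * 1000 * 1000;  // 128 MB blocks (decimal, as on the slide)
            long mapTasks = (inputBytes + blockBytes - 1) / blockBytes;
            System.out.println("Expected map tasks: " + mapTasks); // ~7,813, one per block

            Job job = new Job(new Configuration(), "terasort-style-etl"); // hypothetical name
            // Fewer reducers ease network buffer pressure but stretch job completion time;
            // the slide sweeps 192 down to 6 reducers to show that trade-off.
            job.setNumReduceTasks(96);
          }
        }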
    • 15. Network Traffic with Variable Reducers. Network traffic decreases as fewer reducers are available. (Charts: traffic with 96, 48 and 24 reducers.)
    • 16. Summary. Running a single ETL or Explode job pattern on the entire cluster is the most network-intensive case, while Analyze jobs are the least network intensive. A mixed environment of multiple jobs is less intensive than one single job due to sharing of resources. A large number of reducers can create load on the network, but this depends on the job pattern and on when the reducers start.
    • 17. Integration into the Data Center
    • 18. Integration Considerations. Network attributes: architecture; availability; capacity, scale and oversubscription; flexibility; management and visibility.
    • 19. Data Node Speed Differences. Generally 1G is being used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload. (Chart: single 1GE 100% utilized; dual 1GE 75% utilized; 10GE 40% utilized.)
    • 20. Availability: Single Attached vs. Dual Attached Node. With dual attachment there is no single point of failure from the network viewpoint and no impact on job completion time. NIC bonding is configured in Linux with the LACP mode of bonding, giving effective load-sharing of traffic flows across the two NICs. It is recommended to change the hashing to src-dst-ip-port (on both the network and the NIC bonding in Linux) for optimal load-sharing.
    • 21. 1GE vs. 10GE Buffer Usage. Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer. (Charts: buffer cells used during the shuffle phase and during output replication, comparing 1G and 10G, alongside map and reduce completion percentages over the job.) By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, since the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to limits of I/O and compute capabilities.
    • 22. Network Latency. Generally network latency, while being important, does not represent a significant factor for Hadoop clusters. Note: there is a difference between network latency and application latency; optimization in the application stack can decrease application latency, which can potentially have a significant benefit. (Chart: job completion time for the N3K topology vs. the 5k/2k topology, showing consistent latency across 1 TB, 5 TB and 10 TB data set sizes on an 80-node cluster.)
    • 23. Integration Considerations. Goals: extensive validation of the Hadoop workload; a reference architecture; make it easy for the enterprise; demystify the network for Hadoop deployment; integration with the enterprise through efficient choices of network topology and devices. Findings: 10G and/or dual-attached servers provide consistent job completion time and better buffer utilization; 10G reduces bursts at the access layer; a dual-attached server is the recommended design, 1G or 10G, with 10G for future proofing; rack failure has the biggest impact on job completion time; a non-blocking network is not required; latency does not matter much in Hadoop workloads. More details at: http://www.slideshare.net/Hadoop_Summit/ref-arch-validated-and-tested-approach-to-define-a-network-design and http://youtu.be/YJODsK0T67A
    • 24. Multi-tenant Environments
    • 25. Various Multi-tenant Environments. Need to understand: Hadoop + HBase (traffic-pattern dependent); job based (scheduling dependent); department based (permissions and scheduling dependent).
    • 26. Hadoop + HBase. (Diagram: clients issue reads and updates against HBase region servers, which read from HDFS and run major compactions, while MapReduce Map 1..N shuffle into Reducer 1..N whose output replication also lands in HDFS; both workloads share the same storage and network.)
    • 27. HBase During Major Compaction. Comparison of non-QoS vs. QoS policy, with a network QoS policy used to prioritize HBase update/read operations: roughly a 45% read latency improvement. (Charts: read/update average latency in microseconds over time, with and without QoS, and switch buffer usage.)
    • 28. HBase + Hadoop MapReduce. Comparison of non-QoS vs. QoS policy, with a network QoS policy used to prioritize HBase update/read operations while a Hadoop TeraSort runs alongside HBase: roughly a 60% read latency improvement. (Charts: read/update average latency in microseconds over time, with and without QoS; switch buffer usage; operations timeline of the Hadoop TeraSort and HBase workloads.)
    • 29. THANK YOU FOR LISTENING. Cisco.com Big Data: www.cisco.com/go/bigdata. Cisco Unified Data Center: Unified Fabric (highly scalable, secure network fabric, www.cisco.com/go/nexus); Unified Computing (modular stateless computing elements, www.cisco.com/go/ucs); Unified Management (automated management, manages enterprise workloads, http://www.cisco.com/go/workloadautomation).
