Low Power High-Performance Computing on the BeagleBoard Platform

The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the other boards executing the actual processing. The software platform is based on the Ångström GNU/Linux distribution and is equipped with a distributed file system, to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured with a clamp meter. Experimental results obtained in the speaker diarization task show that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, once the bottleneck caused by the Ethernet interface is removed, the BeagleBoard-xM cluster achieves superior energy efficiency.


Transcript

  • 1. Low Power High-Performance Computing on the BeagleBoard Platform
     E. Principi, V. Colagiacomo, S. Squartini, and F. Piazza
     A3Lab, Department of Information Engineering, Università Politecnica delle Marche
     5th European DSP Education and Research Conference, 13th and 14th September 2012, Amsterdam, Netherlands
  • 2. Outline
     1 Introduction
     2 Purpose of this work
     3 The BeagleCluster
       - Hardware Platform
       - Software Platform
     4 Experiments
       - High-Performance Linpack
       - Matrix Multiplication
       - Speaker Diarization
       - Analysis of power consumption
     5 Conclusions and Future Developments
  • 3. Introduction
     High-performance computing clusters are employed in computationally intensive tasks (e.g., weather prediction, astronomical modelling).
     Usually, they are evaluated only in terms of Floating Point Operations Per Second (FLOPS) (e.g., the Top500 list).
     The costs of energy and infrastructure exceed the costs of the computational devices, and this gap is expected to grow by 2014 [Belady, 2007].
     A new metric: FLOPS/Watt
  • 7. Tendency in the industry
     • Use of processors traditionally employed in the mobile world.
     • Canonical built a 42-core ARM cluster for compiling the Ubuntu distribution.
     • Calxeda developed the EnergyCore ECX-1000 series of server-on-a-chip based on the ARM Cortex-A9.
     • Hewlett-Packard Redstone servers: four rack chassis replace 2800 conventional servers (energy saving: 90%; space saving: 94%); currently employed in the TryStack free cloud service (http://trystack.org).
  • 10. Purpose of this work
     Develop: develop an energy-efficient cluster computer composed of off-the-shelf, inexpensive hardware and open software, and propose it to the scientific community.
     Evaluate: evaluate the cluster both through conventional benchmarks and a real-time constrained speech processing application.
     Measure: measure the power consumption of the cluster, assess the energy efficiency, and compare it with a laptop PC.
  • 13. Hardware Platform
     Cluster description: the BeagleCluster is composed of five BeagleBoard-xM boards.
     BeagleBoard-xM:
       Processor             TI DM3730
       ARM subsystem         Cortex-A8 @ 1 GHz
       DSP subsystem         C64x+ @ 800 MHz
       Graphics accelerator  PowerVR SGX @ 200 MHz
       RAM                   512 MB DDR @ 200 MHz
       Network interface     Ethernet 10/100
  • 14. Hardware Platform (cont.)
     • Asymmetric topology: one head node, four worker nodes.
     • Nodes are connected to a Hewlett-Packard ProCurve 1410-8G switch through the BeagleBoard-xM 100 Mbit interface.
     • Nodes are powered by a Lambda AC-DC power supply.
  • 15. Software Platform
     • Operating system: Ångström GNU/Linux distribution (worker nodes do not have a GUI).
     • Tool-chain: CodeSourcery.
     • Network File System: data and code are shared throughout the cluster using NFS.
     • Cluster Command Control (C3): a suite of tools for managing the cluster (e.g., terminating processes, rebooting worker nodes, pushing drive images).
     • Message Passing Interface (Argonne National Laboratory MPICH2): an application programming interface that allows the exchange of messages and data among processes running on the nodes of a cluster (a minimal example is sketched below).
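    As an illustration of how a job runs on this software stack, here is a minimal MPICH2 program and launch line; the hostfile name and its contents are assumptions, not taken from the slides.

        /* hello.c - each rank reports itself; build with the cross
         * tool-chain's mpicc and run from the NFS-shared directory. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size, len;
            char name[MPI_MAX_PROCESSOR_NAME];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            MPI_Get_processor_name(name, &len);
            printf("rank %d of %d on %s\n", rank, size, name);
            MPI_Finalize();
            return 0;
        }

    Launched, for example, with:

        mpicc hello.c -o hello
        mpiexec -f hosts -n 5 ./hello    # "hosts" lists the head node and the four workers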
  • 16. Software Platform (cont.)
     • Ganglia: offers a web interface used to monitor the cluster activity and to detect abnormal functioning (a configuration sketch follows).
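    A minimal sketch of a gmond configuration such a deployment might use; the cluster name, address, and port values are assumptions, not the authors' settings.

        # /etc/ganglia/gmond.conf (excerpt, hypothetical values)
        cluster {
          name  = "BeagleCluster"
          owner = "A3Lab"
        }
        udp_send_channel {
          host = 192.168.1.1      # head node collects the metrics
          port = 8649
        }
        udp_recv_channel {
          port = 8649
        }
        tcp_accept_channel {
          port = 8649             # polled by gmetad / the web front-end
        }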
  • 17. High-Performance Linpack (HPL)
     • HPL is the de-facto standard benchmark for floating point performance measurement.
     • It is employed in the Top500 and Green500 lists.
     • HPL solves a dense system of linear equations using double precision arithmetic.
     • Parallelism is obtained by means of MPI.
     • Computation is based on BLAS (Vesperix ATLAS-ARM).
  • 18. High-Performance Linpack (HPL) (cont.)
     Results: 258.6 MFLOPS; 13.26 MFLOPS/W.
     For comparison, the 500th entry of the Green500 list (June 2012), a Cray XT5 SixCore (Opteron Six Core 6C 2.6 GHz, XT4 internal interconnect), achieves 32.05 MFLOPS/W.
     Note: arithmetic operations are performed in double precision in the Vector Floating Point unit, so the NEON unit cannot be employed.
  • 20. Matrix Multiplication
     • This benchmark shows the performance improvement that can be obtained using NEON-optimized code.
     • The benchmark multiplies an m × n matrix A with an n × p matrix B.
     • It operates by dividing the rows of matrix A into groups and processing each group on a different node (a sketch of this pattern follows the slide).
     • Communication among nodes is based on MPI.
     Platform                  Execution time
     BeagleCluster             42.13 s
     BeagleCluster w/ NEON     5.18 s
     NEON-optimized code significantly reduces the execution time ⇒ HPL performance can be improved by properly exploiting NEON.
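    A sketch of how the row-group distribution could be written with MPI. This is not the authors' benchmark code: the matrix size, the single-precision type, and the use of MPI_Scatter/MPI_Gather are assumptions. Single precision is chosen because the Cortex-A8 NEON unit handles single-precision floats, which is what the NEON-optimized variant exploits.

        /* Row-distributed matrix multiplication C = A * B (sketch).
         * Assumes M is divisible by the number of ranks; run e.g.
         * with mpiexec -n 5. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define M 500
        #define N 500
        #define P 500

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int rows = M / size;                    /* rows of A per node */
            float *A = NULL, *C = NULL;
            float *B     = malloc((size_t)N * P * sizeof(float));
            float *A_loc = malloc((size_t)rows * N * sizeof(float));
            float *C_loc = malloc((size_t)rows * P * sizeof(float));

            if (rank == 0) {                        /* head node fills A and B */
                A = malloc((size_t)M * N * sizeof(float));
                C = malloc((size_t)M * P * sizeof(float));
                for (int i = 0; i < M * N; i++) A[i] = (float)rand() / RAND_MAX;
                for (int i = 0; i < N * P; i++) B[i] = (float)rand() / RAND_MAX;
            }

            /* every node needs all of B; the rows of A are split in groups */
            MPI_Bcast(B, N * P, MPI_FLOAT, 0, MPI_COMM_WORLD);
            MPI_Scatter(A, rows * N, MPI_FLOAT,
                        A_loc, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

            /* local block product; this inner loop is what the
             * NEON-optimized version vectorizes */
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < P; j++) {
                    float acc = 0.0f;
                    for (int k = 0; k < N; k++)
                        acc += A_loc[i * N + k] * B[k * P + j];
                    C_loc[i * P + j] = acc;
                }

            MPI_Gather(C_loc, rows * P, MPI_FLOAT,
                       C, rows * P, MPI_FLOAT, 0, MPI_COMM_WORLD);

            if (rank == 0) printf("C[0][0] = %f\n", C[0]);
            MPI_Finalize();
            return 0;
        }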
  • 22. Speaker Diarization
     • A speaker diarization algorithm detects "who speaks now".
     • The algorithm addressed here is based on the real-time implementation described in [Colagiacomo et al., 2010].
     • The calculation of the cross-correlations between the channel i signal x_i(t) and the channel j signal x_j(t) is the most computationally demanding part:

       C_{ij}(t) = \max_{\tau}\left\{ \mathrm{IFFT}\left[ \mathrm{FFT}\big(x_i(t)\,x_j(t-\tau)\big) \bullet \mathrm{FFT}\big(w(t)\big) \right] \right\}

     Here, t is the time index, τ is the correlation lag, w(t) is the Hamming window, and • denotes the element-wise product.
  • 23. Speaker Diarization (cont.)
     • Cluster-wide parallelism has been obtained by assigning the feature extraction stage of each channel to one of the worker nodes.
     • The server process on the head node dispatches audio frames to the worker nodes through the MPI_Bcast primitive and performs the final classification (a sketch of this pattern follows the slide).
     • Performance has been evaluated in terms of the Real-Time Factor (RTF):

       \mathrm{RTF} = \frac{\text{total execution time}}{\text{speech segment duration}}
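    A sketch of the frame-dispatch pattern the slide describes; the frame length, the channel-to-worker mapping, and the feature send-back are assumptions, not the authors' code.

        /* Head node (rank 0) broadcasts each multichannel audio frame;
         * worker rank i extracts the features of channel i-1.
         * Run with mpiexec -n 5 (one head + four workers). */
        #include <mpi.h>
        #include <string.h>

        #define FRAME_LEN 1024     /* samples per frame (assumed) */
        #define CHANNELS  4        /* four lapel microphones */

        int main(int argc, char **argv)
        {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            float frame[CHANNELS][FRAME_LEN];
            for (int n = 0; n < 1000; n++) {          /* frame stream */
                if (rank == 0) {
                    /* read the next multichannel frame here (stub: zeros) */
                    memset(frame, 0, sizeof frame);
                }
                /* one collective call delivers the frame to every node */
                MPI_Bcast(frame, CHANNELS * FRAME_LEN, MPI_FLOAT, 0,
                          MPI_COMM_WORLD);
                if (rank > 0) {
                    float *x = frame[rank - 1];       /* this worker's channel */
                    (void)x;  /* feature extraction would run on x here;
                                 features go back to rank 0 for the final
                                 classification */
                }
            }
            MPI_Finalize();
            return 0;
        }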
  • 24. Speaker Diarization (cont.)
     • Audio data: four lapel microphone signals of meeting ES2009b contained in the AMI corpus.
     • Comparison with an Asus F9SG laptop (Intel Core2 Duo T8300 CPU running at 2.4 GHz, 2 GB of RAM).
     • Power consumption is measured with the LCD monitor switched off.
  • 25. Speaker Diarization (cont.)
     Single-board implementation results
     • Real-time execution is achieved through the NEON instruction set and by reducing the number of cross-correlations: the maximum of C_ij(t) is searched incrementing τ by ∆τ > 1 (see the sketch after this slide).
     RTF as a function of ∆τ:
       ∆τ     Laptop   BeagleBoard-xM
       1      2.47     12.73
       16     0.25     1.02
       32     0.18     0.63
       64     0.14     0.44
       128    0.12     0.36
     The choice of ∆τ is critical both for the laptop and for the BeagleBoard-xM.
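    A minimal sketch of the decimated lag search. This is not the authors' implementation: for self-containment it computes the correlation directly in the time domain, whereas the real system uses the FFT-based form above on Hamming-windowed frames.

        /* Decimated lag search: the maximum cross-correlation is looked
         * for while incrementing the lag tau by delta_tau > 1, trading
         * accuracy for speed (hypothetical stand-in code). */
        #include <stdio.h>
        #include <stddef.h>

        float max_xcorr(const float *xi, const float *xj, size_t len,
                        int max_lag, int delta_tau)
        {
            float best = -1e30f;
            for (int tau = 0; tau <= max_lag; tau += delta_tau) {
                float c = 0.0f;
                for (size_t t = (size_t)tau; t < len; t++)
                    c += xi[t] * xj[t - tau];     /* C_ij at lag tau */
                if (c > best) best = c;
            }
            return best;  /* larger delta_tau => fewer lags => lower RTF */
        }

        int main(void)
        {
            float a[256], b[256];
            for (int i = 0; i < 256; i++) { a[i] = i % 7; b[i] = (i + 3) % 7; }
            printf("max correlation: %f\n", max_xcorr(a, b, 256, 128, 16));
            return 0;
        }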
  • 26. Speaker Diarization (cont.)
     Cluster-wide implementation results; RTF as a function of ∆τ:
       ∆τ     Single-board   Five nodes
       1      12.73          4.71
       16     1.02           1.69
       32     0.63           1.63
       64     0.44           1.56
       128    0.36           1.55
     • The MPI version is almost three times as fast as the single-board one when ∆τ = 1.
     • As ∆τ increases, the performance of the MPI implementation degrades: the communication overhead becomes the new bottleneck.
  • 27. Speaker Diarization (cont.)
     Cluster-wide implementation with reduced communication overhead
     • This has been verified on a four-node cluster.
     • Nodes read audio data directly from the local file system.
     • One of the worker nodes performs both the feature extraction and the classification tasks.
     RTF as a function of ∆τ:
       ∆τ     Five nodes   Four nodes (w/ local data)
       1      4.71         3.35
       16     1.69         0.33
       32     1.63         0.23
       64     1.56         0.18
       128    1.55         0.16
     By reducing the communication overhead, real-time execution can be achieved already with ∆τ = 16.
  • 28. Analysis of power consumption
     Measured power: BeagleCluster 20.32 W; laptop 32.36 W.
     Energy ratio:

       E_r = \frac{\mathrm{RTF}_{\mathrm{cluster}} \cdot P_{\mathrm{cluster}}}{\mathrm{RTF}_{\mathrm{laptop}} \cdot P_{\mathrm{laptop}}} \approx 1.2

     The communication overhead limits the energy efficiency of the BeagleCluster.
     Energy ratio of the four-node cluster: E_r ≈ 0.69.
     Reducing the communication overhead, the BeagleCluster is more efficient than the laptop PC.
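    As a cross-check (our arithmetic, not shown on the slide, assuming the ∆τ = 1 case), the 1.2 figure is consistent with the ∆τ = 1 entries of the preceding RTF tables:

      E_r = \frac{4.71 \cdot 20.32\,\mathrm{W}}{2.47 \cdot 32.36\,\mathrm{W}} = \frac{95.7}{79.9} \approx 1.2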
  • 29. Conclusions
     • A cluster computer based on the BeagleBoard-xM platform has been described.
     • The cluster is based on open software for executing parallel tasks, managing the cluster, and monitoring the status of the nodes.
     • High-Performance Linpack has been used to obtain the number of floating point operations per second.
     • The performance improvement achievable with NEON-optimized code has been shown by means of a matrix multiplication benchmark.
     • Processing time and power consumption have been measured by means of a cluster-wide speaker diarization algorithm, to evaluate the real-time capabilities and the energy efficiency of the cluster.
  • 30. Conclusions (cont.)
     • Results showed that, using the 100 Mbit Ethernet interface, the BeagleCluster consumes 1.2 times the energy spent by the laptop PC.
     • Removing the communication bottleneck, the BeagleCluster achieves a superior energy efficiency.
     • The cost of the five-node cluster is 655 €. Compared to the laptop PC, whose cost is 1100 €, the BeagleCluster is about 450 € cheaper.
  • 31. Future developments
     • The software platform will be expanded with a resource manager and a scheduler to enable the execution of batch jobs.
     • The energy efficiency will be assessed in a High-Availability scenario, for example using the cluster for hosting websites.
     • The use of more efficient hardware platforms (e.g., PandaBoards) and of the DM3730 DSP will be considered.
  • 32. Thank you for your attention!
     Emanuele Principi (e.principi@univpm.it), Vito Colagiacomo (s1037562@studenti.univpm.it),
     Stefano Squartini (s.squartini@univpm.it), Francesco Piazza (f.piazza@univpm.it)
  • 33. Clamp meter used for the power measurements
       Manufacturer      AMPROBE
       Model             LH41A
       Measuring range   0-40 A, DC or AC peak
       Resolution        1 mA in the 4 A range; 10 mA in the 40 A range
       Accuracy          ±1.3% + 5 digits
       Frequency range   DC in DC; 40 Hz to 400 Hz in AC
  • 34. High-Performance Linpack: details
       Rmax          258.6 MFLOPS
       Problem size  15000
       Block size    16
       Grid ratio    2 × 2
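    For reference, these parameters would map onto HPL's standard HPL.dat input file roughly as follows. This is a sketch, not the authors' actual file; the remaining algorithmic tuning lines are elided and assumed to keep HPL defaults.

        HPLinpack benchmark input file
        BeagleCluster run (sketch)
        HPL.out      output file name (if any)
        6            device out (6=stdout,7=stderr,file)
        1            # of problems sizes (N)
        15000        Ns
        1            # of NBs
        16           NBs
        0            PMAP process mapping (0=Row-,1=Column-major)
        1            # of process grids (P x Q)
        2            Ps
        2            Qs
        ...          (remaining tuning lines left at HPL defaults)

    The 2 × 2 grid matches four MPI processes, one per worker board.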
  • 35. References
     H. W. Meuer, "The TOP500 Project: Looking Back Over 15 Years of Supercomputing Experience," Informatik-Spektrum, vol. 31, no. 3, pp. 203-222, 2008. [Online]. Available: http://www.top500.org
     C. L. Belady, "In the Data Center, Power and Cooling Cost More Than the IT Equipment It Supports," Electronics Cooling Magazine, vol. 13, no. 1, May 2007.
     W.-c. Feng and K. Cameron, "The Green500 List: Encouraging Sustainable Supercomputing," IEEE Computer, vol. 40, no. 12, pp. 50-55, Dec. 2007. [Online]. Available: http://www.green500.org
     I. Ahmad and S. Ranka, Eds., Handbook of Energy-Aware and Green Computing, 1st ed., ser. Information Science. Boca Raton, US: CRC Press, Jan. 2012.
     S. Andrade, J. Dourado, and C. Maciel, "Low-power cluster using OMAP3530," in Proc. of EDERC, Nice, France, Dec. 2010, pp. 220-224.
     K. Fürlinger, C. Klausecker, and D. Kranzlmüller, "Towards energy efficient parallel computing on consumer electronic devices," in Proc. of ICT-GLOW. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 1-9.
     M. Brim, R. Flanery, A. Geist, B. Luethke, and S. L. Scott, "Cluster Command and Control (C3) Tool Suite," Parallel and Distributed Computing Practices, vol. 4, no. 4, Dec. 2001.
  • 36. References (cont.)
     Argonne National Laboratory, "MPICH2," http://www.mcs.anl.gov/research/projects/mpich2/.
     M. L. Massie, B. N. Chun, and D. E. Culler, "The Ganglia distributed monitoring system: design, implementation, and experience," Parallel Computing, vol. 30, no. 7, pp. 817-840, 2004.
     M. Moattar and M. Homayounpour, "A review on speaker diarization systems and approaches," Speech Communication, vol. 54, no. 10, pp. 1065-1103, 2012.
     E. Principi, R. Rotili, M. Wöllmer, F. Eyben, S. Squartini, and B. Schuller, "Real-Time Activity Detection in a Multi-Talker Reverberated Environment," Cognitive Computation, pp. 1-12, 2012.
     V. Colagiacomo, E. Principi, S. Cifani, and S. Squartini, "Real-Time Speaker Diarization on TI OMAP3530," in Proc. of EDERC, Nice, France, Dec. 1st-2nd 2010.
     InfiniBand Trade Association, "InfiniBand Architecture Specification Release 1.2.1," Jan. 2008.
     N. J. Boden, D. Cohen, R. E. Felderman, A. Kulawik, C. Seitz, J. N. Seizovic, and W. Su, "Myrinet: A Gigabit-per-second Local Area Network," IEEE Micro, vol. 15, no. 1, pp. 29-36, Feb. 1995.