Programming with Relaxed Synchronization


Published on

Presentation by Ravi Nair.
Paper and more information:

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Programming with Relaxed Synchronization

  1. 1. © 2012 IBM CorporationTowards Approximate Computing:Programming with Relaxed SynchronizationLakshminarayanan RenganarayanaVijayalakshmi SrinivasanRavi Nair (presenting)Dan PrenerIBM T.J. Watson Research CenterOctober 21, 2012
  2. 2. © 2012 IBMCorporationRN 10/21/12 2Emerging Use of Computing  Emerging important applications  Search  Inexact answers to inexact queries  Proactive delivery of information  Advertisements, offers, news, alerts, advise  Media  Approximate rendition of audio, video, or images  Stream Processing  Fast decisions based on possibly incomplete data  These applications benefit in cost-performance throughthe use of prediction techniques  They allow trading off accuracy of results for speed ofdelivering results
  3. 3. © 2012 IBMCorporationRN 10/21/12 3Approximate Computing  Current techniques  Exact (hence energy-inefficient) computation done on exact(hence expensive) hardware to produce results that need not beexact  With increasing unreliability and variability of devices in newgenerations of technology, the cost of such computation willincrease dramatically  What is needed: Approximate Computing  Computational models that allow computational results to fall insome acceptable range, and  Hardware that produces results that fall in some acceptablerange
  4. 4. © 2012 IBMCorporationRN 10/21/12 4HardwareExact Less ExactAccurateLess Accurate, lessup-to-date, possiblycorruptedReliableVariableComputationDataComputing modeltodayDimensions of Approximate Computing
  5. 5. © 2012 IBMCorporationRN 10/21/12 5HardwareExact Less ExactAccurateLess Accurate, lessup-to-date, possiblycorruptedReliableVariableComputationDataComputing modeltodayFuture modelHuman BrainPath toEfficiencyImproving Efficiency through Approximation
  6. 6. © 2012 IBMCorporationRN 10/21/12 6HardwareExact Less ExactAccurateLess Accurate, lessup-to-date, possiblycorruptedReliableVariableComputationDataComputing modeltoday RelaxedSynchronizationHuman BrainPath toEfficiencyRelaxed Synchronization
  7. 7. © 2012 IBMCorporationRN 10/21/12 7Relaxing Synchronization: Challenges and Opportunities  Principal uses of synchronization  Ensuring that all threads see consistent values of sharedvariables (e.g. atomic updates with locks)  Best candidate for relaxing synchronization  Ensuring that threads reach various points in theirexecution in a predictable manner (e.g. barrier)  Can relax (depending on context)  Parallel update of data structures (e.g. linked lists)  Difficult to relax (can lead to fatal crash of the program)  Key questions about relaxed synchronization  Will the program produce results that are acceptable?  If acceptable, would the results be produced inconsiderably less time?
  8. 8. © 2012 IBMCorporationRN 10/21/12 8  Partition data into k clustersRandomlyselectK “initial”meansCreate k clusters byassociating everydata point with thenearest centroidRecomputecentroids of kclustersIterate until centroidsdo not change8  Widely used  Market segmentation, data mining, computer vision, astronomyKmeans Clustering
  9. 9. © 2012 IBMCorporationRN 10/21/12 9Kmeans Synchronization Overhead  Kmeans : 8 clusters with large input set  OpenMP based parallel code  Parallel execution on an IBM Power7 machine0%20%40%60%80%100%2 3 4 5 6 7 8# of threadsPercentClusteringTimeSynchronizationComputeSynchronization time as high as 90% for 8 threads
  10. 10. © 2012 IBMCorporationRN 10/21/12 10Relaxing Synchronization in Kmeans (NU-MineBench)  Parallel evaluation of all points moving between clusters  Synchronization is used to update list of points moving  Stopping criterion:} while (delta > threshold && loop++ < 500);  Number of points moving (delta) < 0.1% of total points (threshold)  Subject to a maximum of 500 iterations  This is already an approximation  What would happen if synchronization were relaxed?Performance for 8 Clusters01020304050602 3 4 5 6 7 8Number of ThreadsExecutionTime(secs)012345678910SpeedupOriginalRelaxedSpeedup
  11. 11. © 2012 IBMCorporationRN 10/21/12 11Performance ImprovementPerformance Speedup with Relaxed Synchronization024681012144 5 6 7 8 9 10 11 12 13Number of ClustersSpeedup2 threads3 threads4 threads5 threads6 threads7 threads8 threadsSignificant speedup observed in many combinations.No combination showed a degradation in performance.
  12. 12. © 2012 IBMCorporationRN 10/21/12 12Quality of Solution  Computed total sum ofdistances of all points fromthe centroids of clustersthey belong to  Good indicator of quality ofsolution  Each bar in graphrepresents a differentrelaxed run  Non-determinism is a side-effect of relaxationQuality of solution is not degraded due to relaxed synchronizationDegradation  in  Quality  of  Solution  -­‐  8  clusters,  8  thds0.00%0.05%0.10%0.15%0.20%0.25%0.30%0.35%1 2 3 4 5 6 7 8 9 10Run NumberDifferencebetweenRelaxedcostandOriginalCost
  13. 13. © 2012 IBMCorporationRN 10/21/12 13Graph500  Breadth-first search (BFS) of a very large scale-free graph  Representative of certain real-life analytics problems  E.g. Mail a flyer to all people who are connected directlyand transitively to a certain person or sets of persons  Experiments performed on version of award-winning IBMsolutionSample Scale-Free Graphs
  14. 14. © 2012 IBMCorporationRN 10/21/12 14Initial Experiment: Relaxing Synchronization  The parallel examination of nodes on a wavefrontrequires an atomic update of a data structurerepresenting list of visited nodes  Atomic update is a synchronizing operation  Question: What would happen if we updated thedata structure with a simple OR, rather than anatomic-OR operation?
  15. 15. © 2012 IBMCorporationRN 10/21/12 15Sample Result  Each column represents a different relaxed run  Relaxed results are non-deterministic  Very few vertices missing  Results  Very few vertices with wrong level numbers  Better than 3x the performance  In most cases, it was possible to detect all missing vertices by simplyvalidating the obtained treeCPUS 8THREADS 16SCALE 20Time taken with accurate sync 0.248 0.248 0.248 0.248 0.248Time taken with relaxed sync 0.077 0.078 0.078 0.077 0.078Performance gain with relaxed sync 3.20 3.20 3.20 3.20 3.19Vertices visited in accurate list 646056 646056 646056 646056 646056Vertices visited in relaxed list 645985 645991 645990 646002 645993Vertices common in both 645985 645991 645990 646002 645993Missing vertices 29 35 34 46 37Other vertices with wrong level numbers 71 49 66 34 59
  16. 16. © 2012 IBMCorporationRN 10/21/12 16Relax-and-Check: A Framework to Exploit RelaxedSynchronization  Assume there is a criterion that allows us to estimate the quality of asolution without adding high cost  Relax-and-Check  Modifies an existing program  Run initially without synchronizations  Revert to fully synchronized version if quality does not fall within acceptable limits  Such criteria are often already explicit in programs  E.g. convergence criteria  Other “syndromes” can often be identified
  17. 17. © 2012 IBMCorporationRN 10/21/12 17Sample Results  Three benchmarks fromSTAMP suite  Acceptability criteria of resultsidentical to original  Significant reduction of runtime due to relaxation ofsynchronization requirementSSCA2Kmeans (STAMP) Labyrinth
  18. 18. © 2012 IBMCorporationRN 10/21/12 18Relax-and-Check Strategy for BFS: A Slight Twist  Very high correlationbetween validation errorsand time to solution  Possibly because ofredundant visit of vertices  Also, these occurredprincipally when hardwareparameters were not suitablefor problem size  Use validation errors asmeasure of quality  Validate the solutionproduced by approximatetechnique  If number of errors exceeds athreshold, say 1000, revert tosynchronized solution forfuture instances with similarproblem sizePerformance vs. validation errors00.511.522.533.50 500 1000 1500 2000 2500 3000 3500 4000Number of validation errorsPerformancerelativetoaccuratesolutionscale = 18scale = 20scale = 22scale = 24Relax-and-Check provides a technique toexploit the advantages of relaxedsynchronization with a quality safety-net
  19. 19. © 2012 IBMCorporationRN 10/21/12 19Applications Suitable for Relax-and-CheckDomain / Applications ExamplesScientific Finite Difference Methods, Successive over-relaxation, N-body methodsData Mining, Text Mining, Risk Analysis, MachineLearningK-means, Hierarchical clustering, Multinomialregression, Feature reductionNumerical optimization, Forecasting, FinancialModeling, and Mathematical ProgrammingNon-convex optimization solvers, constrainedoptimization, multi-dimensional cubingEDA VLSI place and routeGraph processing and Social Network Analysis Mesh refinement, Metis style graph partitioning,Graph traversalsProgram analyses in compilers and softwareengineeringIterative data flow analysisA large majority of applications in these domains choose a good solution frommany correct solutions, and use acceptance criteria such as convergence tests,fixed-point tests, or boolean acceptability tests
  20. 20. © 2012 IBMCorporationRN 10/21/12 20HardwareExact Less ExactAccurateLess Accurate, lessup-to-date, possiblycorruptedReliableVariableComputationDataComputing modeltodayFuture modelHuman BrainPath toEfficiencyMove Towards Attributes of the Human Brain