Multiclass Classification using Massively Threaded Multiprocessors


Multiclass Classification using Massively Threaded Multiprocessors
Sergio Herrero
January 2010

Agenda
  - Data Growth and Moore's Law
  - Multiclass Classification: Binary SVM, Multiclass SVM
  - Related Work: Serial Acceleration, Parallel or Distributed Acceleration
  - Massively Threaded Multiprocessor (GPU): Architecture, Representations, Programming Model
  - Parallel-Parallel SMO (P2SMO): Algorithm and Implementation, Task-Parallelization Implications
  - Performance Results: Datasets & Resources, Classification Accuracy, Training Time, Classification Time, Kernel Cache Hit Rate, Cross-Generational Comparison
  - Conclusions & Future Work

Data Growth & Moore's Law (I): Evolution of Data and Computation
  [Chart: growth of data volume against available computation, contrasting serial acceleration with parallel acceleration.]

Data Growth & Moore's Law (II): Parallel Statistical Learning
  - Physical limitations: exponential frequency scaling has ended, so performance gains now come from architectural parallelism; economic limitations reinforce the same trend.
  - Web companies (Google, Microsoft, Yahoo, Amazon, IBM, Facebook) collect massive datasets from the web, social networks, and mobile/sensor data, driving large-scale statistical learning.
  - Both trends converge on parallel statistical learning algorithms, an active topic at NIPS 2009, IEEE ICDM 2009, IEEE ICMLA 2009, SIAM DM 2010, JMLR, IEEE Transactions on Pattern Analysis and Machine Intelligence, and IEEE Transactions on Knowledge and Data Engineering.

Data Growth & Moore's Law (III): Machine Learning Parallelization
  - Level 1, embarrassingly parallel (cluster): run the same algorithm with different parameters on different machines; does not speed up individual runs.
  - Level 2, statistical query and summation (MapReduce): decompose the algorithm into an adaptive sequence of steps; effective mainly for slow algorithms.
  - Level 3, fine-grained structural parallelism (data parallelism): one thread per data point (or per few points) on a massively parallel processor; avoids communication latency. Complexity increases from Level 1 to Level 3.
  - Machine learning primitives: inner products (vector or matrix), outer products (between vectors), linear algebra (vector or matrix), non-linearities, matrix transpose.
  - Parallelizable machine learning algorithms: Naive Bayes, K-means, neural networks, principal component analysis, expectation maximization, support vector machine classification, hidden Markov models.
  - Goal: design and implement a Level 3 SVM classifier.

Agenda (section divider; next section: Multiclass Classification)

Multiclass Classification (I): Geometrical Representation of the Binary SVM
  Binary classification: given $\ell$ training samples $(x_i, y_i)$, $i = 1, \dots, \ell$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, a binary classifier predicts the label $y \in \{-1, +1\}$ of an unseen sample $x$.
  [Figure: maximum-margin separating hyperplane between the two classes.]

Multiclass Classification (II): Primal & Dual Form of the SVM
  Find the function $f$ that solves the following regularization problem:
    $\min_{f} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr) + \lambda \|f\|_K^2$,
  where $V(y, f(x)) = \max(0, 1 - y\,f(x))$ is the hinge loss. Slack variables $\xi_i \ge 0$ are then introduced to classify non-separable data.
  Primal form:
    $\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} \xi_i$
    subject to: $y_i \bigl(w \cdot \Phi(x_i) + b\bigr) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, \ell$.
  Dual form:
    $\max_{\alpha} \; \sum_{i=1}^{\ell} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
    subject to: $0 \le \alpha_i \le C$, $\sum_{i=1}^{\ell} \alpha_i y_i = 0$,
  where $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ is the kernel function.
  Solving the dual yields the classifier $f(x) = \sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) + b$, where $b$ is an unregularized bias term.

Multiclass Classification (III): Multiclass SVM
  Multiclass classification: given $\ell$ samples $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, M\}$, a multiclass classifier predicts the label $y \in \{1, \dots, M\}$ of an unseen sample $x$.
  Multiclass SVM: combination of $T$ independent binary classification tasks. The binary tasks $t = 1, \dots, T$ are defined by an output code matrix $R$ of size $M \times T$ (one row per class, one column per task).
    - One vs All (OVA): $T = M$ tasks, each separating one class from all the others.
    - All vs All (AVA): $T = M(M-1)/2$ tasks, one per pair of classes.
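  As a concrete illustration (the specific sign convention is an assumption for exposition, not taken from the deck), for $M = 3$ classes, with one row per class and one column per binary task, the code matrices could be:

    $R_{\mathrm{OVA}} = \begin{pmatrix} +1 & -1 & -1 \\ -1 & +1 & -1 \\ -1 & -1 & +1 \end{pmatrix}, \qquad
     R_{\mathrm{AVA}} = \begin{pmatrix} +1 & +1 & 0 \\ -1 & 0 & +1 \\ 0 & -1 & -1 \end{pmatrix}$

  Here OVA produces $T = M = 3$ tasks, AVA produces $T = M(M-1)/2 = 3$ tasks, and a zero entry means the corresponding class does not take part in that task.
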
Multiclass Classification (IV): Dual Multiclass SVM
  Each binary task $t = 1, \dots, T$ is trained separately, where each label vector $y^{(t)}$ is constructed from the output code matrix as $y_i^{(t)} = R_{y_i, t}$. This gives one dual problem per task:
    $\max_{\alpha^{(t)}} \; \sum_{i} \alpha_i^{(t)} - \tfrac{1}{2} \sum_{i,j} \alpha_i^{(t)} \alpha_j^{(t)} y_i^{(t)} y_j^{(t)} K(x_i, x_j)$
    subject to: $0 \le \alpha_i^{(t)} \le C$, $\sum_i \alpha_i^{(t)} y_i^{(t)} = 0$,
  where $K$ is the kernel function.
  The outputs of the trained binary classifiers are then used to predict the class label that best agrees with the binary predictions.
  In total, we need to solve $T$ Quadratic Programming optimization problems, all on the same values of $x_i$.
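  A hedged sketch of the decoding step under One vs All (the function name predictOVA and the precomputed array fTask are illustrative placeholders, not the original implementation):

    /* fTask[t] is assumed to hold the decision value of binary task t on the
       test sample, f^(t)(x) = sum_i alpha_i^(t) y_i^(t) K(x_i, x) + b^(t).   */
    int predictOVA(const float *fTask, int nClasses)
    {
        int best = 0;
        for (int t = 1; t < nClasses; ++t)   /* pick the task that most strongly */
            if (fTask[t] > fTask[best])      /* claims the sample for its class  */
                best = t;
        return best;                         /* predicted class label            */
    }
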
Agenda (section divider; next section: Related Work)

Related Work: Large-Scale SVM Training
  - Serial SVMs: Decomposition (Osuna 1997), Shrinking & Caching (Joachims 1999), Sequential Minimal Optimization (Platt 1999), SMO Improvements (Keerthi 2001), Working Set Selection (Fan 2005).
  - Multicore and distributed/cluster SVMs: Cascade SVM (Graf et al. 2005), Parallel SMO (Cao 2006), MPI SVM (Zanni 2006), MapReduce Multicore (Chu et al. 2006), MapReduce on DFS (Chang 2006), Distributed Parallel SVM (Lu 2008).
  - Massively Threaded Multiprocessor SVMs: Regression on GPU (Do et al. 2008), MapReduce on GPU (Catanzaro 2008).

Agenda (section divider; next section: Massively Threaded Multiprocessor (GPU))

Massively Threaded Multiprocessor (GPU) (I): From GPGPU to CUDA and OpenCL
  GPGPU (early 2000s): use Graphics Processing Units for general-purpose computing by casting the problem as graphics:
    - Turn data into images ("texture maps")
    - Turn algorithms into image synthesis ("rendering passes")
  Inconveniences:
    - Requires a graphics background
    - Graphics API overhead
    - Constraints on memory layout and access
    - Many passes are needed, driving up bandwidth consumption
  CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language): frameworks for writing programs that execute across heterogeneous platforms consisting of GPUs and CPUs:
    - Flexible memory access: a thread can access any location, and as many locations as needed
    - User-managed cache (shared memory)
    - Low learning curve: C extensions, no graphics background required, no graphics API

Massively Threaded Multiprocessor (GPU) (II): Tesla Architecture
  The device contains N streaming multiprocessors; each multiprocessor holds M scalar processors with their own registers, a shared memory, an instruction unit, a constant cache, and a texture cache. Approximate access costs:
    - Registers: ~0 cycles
    - Shared memory: 1 cycle with coalesced access, ~10 cycles when uncoalesced
    - Constant/texture caches: ~10 cycles on a cache hit
    - Device (global) memory: ~400 cycles, 102 GB/s
    - Host <-> device memory transfers: PCI-Express 16x, 8 GB/s
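  To make the latency figures concrete, here is a hedged CUDA sketch of a kernel-row style computation (one thread per training point) written against this hierarchy; the feature-major layout of X and all identifiers are assumptions for illustration, not the original code:

    // Squared Euclidean distances from a query point q to all training points.
    // X is stored feature-major (X[d * nPoints + i]) so consecutive threads read
    // consecutive global addresses (coalesced); q is staged once per block in
    // shared memory; the per-thread accumulator lives in a register.
    __global__ void sqDistKernel(const float *X, const float *q,
                                 float *dist, int nPoints, int nDims)
    {
        extern __shared__ float sq[];                     // user-managed cache
        for (int d = threadIdx.x; d < nDims; d += blockDim.x)
            sq[d] = q[d];                                 // one read of q per block
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nPoints) {
            float acc = 0.0f;                             // register accumulator (~0 cycles)
            for (int d = 0; d < nDims; ++d) {
                float diff = X[d * nPoints + i] - sq[d];  // coalesced global read, shared-memory broadcast
                acc += diff * diff;
            }
            dist[i] = acc;                                // coalesced global write
        }
    }
    // Launch: sqDistKernel<<<blocks, threads, nDims * sizeof(float)>>>(X, q, dist, nPoints, nDims);
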
Massively Threaded Multiprocessor (GPU) (III): Representations
  Logical vs physical representation:
    - Thread -> scalar processor
    - Block -> multiprocessor (threads in a block share that multiprocessor's registers and shared memory)
    - Grid -> device

Massively Threaded Multiprocessor (GPU) (IV): CUDA Programming Model
  The host launches kernels on the device; each kernel executes over a grid of thread blocks.
    - Grid dimensions: up to (65535, 65535) blocks
    - Block dimensions: up to (512, 512, 64), but at most 512 threads per block
    - Each thread is identified by its (x, y, z) index within its block, and each block by its index within the grid
  [Diagram: Kernel 1 runs on Grid 1 (2 x 5 blocks); Kernel 2 runs on Grid 2 (3 x 2 blocks).]
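  A minimal sketch of how a kernel might be configured under this model, with one thread per (data point, binary task) pair as in the P2SMO design; the kernel body and all identifiers are illustrative placeholders, not the original code:

    #include <cuda_runtime.h>

    // Hypothetical kernel: one thread per (data point, binary task) pair.
    __global__ void touch(float *f, int nPoints, int nTasks)
    {
        int point = blockIdx.x * blockDim.x + threadIdx.x;  // data-parallel index
        int task  = blockIdx.y;                             // task-parallel index
        if (point < nPoints && task < nTasks)
            f[task * nPoints + point] += 1.0f;              // placeholder work
    }

    int main(void)
    {
        int nPoints = 60000, nTasks = 10;
        float *d_f;
        cudaMalloc(&d_f, nPoints * nTasks * sizeof(float));
        cudaMemset(d_f, 0, nPoints * nTasks * sizeof(float));

        dim3 block(256);                                      // <= 512 threads per block on Tesla-era GPUs
        dim3 grid((nPoints + block.x - 1) / block.x, nTasks); // grid.x and grid.y <= 65535
        touch<<<grid, block>>>(d_f, nPoints, nTasks);
        cudaDeviceSynchronize();

        cudaFree(d_f);
        return 0;
    }
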
Agenda (section divider; next section: Parallel-Parallel SMO (P2SMO))

Parallel-Parallel SMO (P2SMO) (I): SMO Algorithm
  Sequential Minimal Optimization (SMO) iterates three phases until convergence:
    1. Reduction: choose the smallest possible optimization problem at every step, which involves two Lagrange multipliers (a max/min reduction over the optimality conditions).
    2. Analytic: for two Lagrange multipliers, the QP subproblem can be solved analytically.
    3. Update: update the SVM to reflect the new optimal values.
  The problem being solved is the binary SVM dual above: maximize $\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$, where $K$ is the kernel function.
  J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
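  A hedged host-side sketch of that loop for a single binary task; the three phase functions stand for the CUDA kernels described on the following slides and are placeholders, not the original function names:

    /* Hedged sketch of the host-side control loop for one binary task. */
    void reduceMinMax(int *iHigh, float *bHigh, int *iLow, float *bLow); /* Reduction kernels (placeholder) */
    void analyticStep(int iHigh, int iLow, float C);                     /* Analytic phase (placeholder)    */
    void updateF(int iHigh, int iLow);                                   /* Update kernels (placeholder)    */

    void smoTrain(float C, float tau, int maxIter)
    {
        for (int iter = 0; iter < maxIter; ++iter) {
            int iHigh, iLow;
            float bHigh, bLow;
            reduceMinMax(&iHigh, &bHigh, &iLow, &bLow); /* 1. pick the two Lagrange multipliers      */
            if (bLow <= bHigh + 2.0f * tau)             /* KKT-based stopping criterion              */
                break;
            analyticStep(iHigh, iLow, C);               /* 2. solve the two-variable QP analytically */
            updateF(iHigh, iLow);                       /* 3. refresh the optimality values          */
        }
    }
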
Parallel-Parallel SMO (P2SMO) (II): Algorithm Parallelization
  Parallel-Parallel SMO parallelizes along two axes:
    1. Data parallel: each data point in each binary task gets its own thread within the Reduction, Analytic, and Update phases.
    2. Task parallel: the binary tasks are executed concurrently.

Parallel-Parallel SMO (P2SMO) (III): Reduction Phase
  [Diagram: Grid 1 holds one group of thread blocks per binary task (1 ... k ... N); each group performs the max/min reduction for its task, the CPU gathers the per-task results, and the Analytic and Update phases follow.]

Parallel-Parallel SMO (P2SMO) (IV): Reduction Phase Details
  [Diagram: detail of the reduction phase, showing the division of work between CPU and GPU for each binary task in Grid 1.]

Parallel Reduction (I): Recursive Kernel Invocation
  Goal: find the max and min on a GPU.
    - Keep all multiprocessors busy
    - Process large arrays
    - Each thread block reduces a portion of the array
  Kernel 1 produces one min/max per block (blocks 1 ... P); Kernel 2 reduces those P partial results into P' values; the kernel is invoked recursively until the last invocation produces a single min/max.

Parallel Reduction (II): Optimization
  - Baseline: t threads for t data points, one data point per thread, interleaved addressing.
  - Switching to sequential addressing (still t threads, one data point per thread) avoids shared-memory bank conflicts and divergent branches.
  - Using t/2 threads for t data points, with the first comparison performed during the load, halves the number of idle threads.
  - The slide reports speedups of roughly 5x and 2x for these two optimization steps. (A combined code sketch follows the next slide.)

Parallel Reduction (III): Latency Hiding
  The work assigned to each thread is very small; latency is hidden better when each thread does more work:
    - Set a fixed number of blocks and give more points to each thread.
    - Algorithm cascading: combine sequential and parallel reduction.
    - Set block sizes at compile time, use C++ templates, and unroll loops.
  The slide reports additional speedups of ~1.5x and ~2.5x from these steps; overall, the parallel reduction was accelerated ~37.5x.
Parallel-Parallel SMO (P2SMO) (V): Analytic Phase
  [Diagram: within the Reduction -> Analytic -> Update loop, the Analytic phase (Grid 2) computes the kernel rows it needs with CUDA Basic Linear Algebra Subroutines (SGEMV, SGEMM), backed by an LRU kernel cache.]
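  A hedged sketch of how the dot products needed for one RBF kernel row could be obtained with a single SGEMV call; the handle-based cuBLAS API shown here post-dates the 2010 implementation, and the identifiers are illustrative:

    #include <cublas_v2.h>

    // Dot products of every training point with the selected point x_high,
    // computed as one SGEMV. d_X is nDims x nPoints, column-major (feature-major),
    // d_xhigh has length nDims, d_dots has length nPoints.
    void dotWithAll(cublasHandle_t handle, const float *d_X, const float *d_xhigh,
                    float *d_dots, int nPoints, int nDims)
    {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_T,
                    nDims, nPoints,        // A is nDims x nPoints
                    &one, d_X, nDims,      // lda = nDims
                    d_xhigh, 1,
                    &zero, d_dots, 1);
        // For an RBF kernel, K(x_high, x_i) = exp(-gamma * (||x_high||^2 + ||x_i||^2 - 2 * dots[i]))
        // would then be evaluated by a small element-wise kernel.
    }
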
Parallel-Parallel SMO (P2SMO) (VI): Analytic Phase Details
  Calculate the new alpha values (Grid 2). Following the standard SMO analytic step, the selected pair is updated as
    $\alpha_{low}^{new} = \alpha_{low} + \dfrac{y_{low}(b_{high} - b_{low})}{\eta}$, \quad $\alpha_{high}^{new} = \alpha_{high} + y_{low} y_{high} (\alpha_{low} - \alpha_{low}^{new})$,
  where $\eta = K(x_{high}, x_{high}) + K(x_{low}, x_{low}) - 2K(x_{high}, x_{low})$ and the new values are clipped to $[0, C]$.
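  A hedged sketch of that arithmetic (illustrative names; a full implementation also keeps alphaHigh inside the box constraint):

    #include <math.h>

    /* Standard two-variable SMO step, as reconstructed above. kHH, kLL, kHL are
       the cached kernel values K(x_high,x_high), K(x_low,x_low), K(x_high,x_low). */
    void analyticStep(float kHH, float kLL, float kHL,
                      float bHigh, float bLow, float yHigh, float yLow, float C,
                      float *alphaHigh, float *alphaLow)
    {
        float eta = kHH + kLL - 2.0f * kHL;                  /* curvature along the constraint    */
        float aLowNew = *alphaLow + yLow * (bHigh - bLow) / eta;
        aLowNew = fminf(fmaxf(aLowNew, 0.0f), C);            /* clip to the box constraint [0, C] */
        *alphaHigh += yLow * yHigh * (*alphaLow - aLowNew);  /* keeps sum(alpha_i * y_i) constant */
        *alphaLow = aLowNew;
    }
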
Parallel-Parallel SMO (P2SMO) (VII): Update Phase
  [Diagram: after the Reduction and Analytic phases, Grid 3 assigns one group of thread blocks per binary task (1 ... k ... N) to update the optimality values of every data point.]

Parallel-Parallel SMO (P2SMO) (VIII): Update Phase Details
  In Grid 3, one thread per data point refreshes its optimality value for each task (1 ... k ... N). The unregularized bias term is taken from the optimality bounds, $b = (b_{high} + b_{low})/2$, and the stopping criterion checks convergence of the same bounds, $b_{low} \le b_{high} + 2\tau$.
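  A hedged sketch of the per-point update, one thread per data point; kHighRow and kLowRow are the cached kernel rows of the two selected points, and all identifiers are illustrative rather than the original code:

    // dAlphaHigh = (alphaHigh_new - alphaHigh_old) * y_high,
    // dAlphaLow  = (alphaLow_new  - alphaLow_old)  * y_low, precomputed on the host.
    __global__ void updateF(float *f, const float *kHighRow, const float *kLowRow,
                            float dAlphaHigh, float dAlphaLow, int nPoints)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nPoints)
            f[i] += dAlphaHigh * kHighRow[i] + dAlphaLow * kLowRow[i];
    }
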
Parallel-Parallel SMO (P2SMO) (IX): Complete Binary Task
  [Diagram: one full iteration of a binary task; blocks 1 ... P each compute a local max/min on the device, the host completes the final max/min selection, and the Analytic and Update phases follow.]

Parallel-Parallel SMO (P2SMO) (IX): Shared Support Vectors
  [Diagram: support vectors shared across binary tasks under the All vs All and One vs All decompositions.]

Parallel-Parallel SMO (P2SMO) (X): Progressive Grid Reduction
  [Diagram: tasks and subsets plotted against the number of iterations; as tasks #2, #3, #4, and #1 converge they are removed from the grid, so the grid shrinks progressively and the remaining work concentrates on the unconverged tasks.]

Agenda (section divider; next section: Performance Results)

Performance Results (I): Experiment and Hardware
  Compared the GPU SVM against LIBSVM (Chang et al.) under matched conditions:
    - Same kernel type (RBF)
    - Same regularization parameter C
    - Same stopping criteria
    - Both SMO based
    - Both use One vs All for multiclass problems
    - Both have a 1 GB kernel cache
    - Both consider I/O an intrinsic part of the classifier

Performance Results (II): Datasets & Resources
  - ADULT: age, work class, education, marital status, ... -> income >50K or <=50K
  - WEB: areas of the web in group 1 vs areas of the web in group 2
  - MNIST: handwritten digit in a 20x20 pixel box -> digit
  - USPS: US Postal Service handwritten digits -> digit
  - SHUTTLE: shuttle operation measurements -> shuttle state
  - LETTER: English character image -> character

Performance Results (III): Classification Accuracy
  - The GPU used single precision, LIBSVM used double precision
  - Both classifiers reached the same accuracy
  - Variations appear in the number of support vectors, the regularization parameter, and the number of iterations
  - LIBSVM uses working set selection heuristics

Performance Results (IV): Training Time
  [Chart: training times per dataset; the annotated extremes are ~35 min versus ~33 hours for the most expensive dataset.]

Performance Results (V): Testing Time
  [Chart: classification (testing) times per dataset; the annotated extremes are ~5 s versus ~8 min for the most expensive dataset.]

Performance Results (VI): Kernel Cache Hit Rate
  [Charts: kernel cache hit rates over the course of training for the MNIST, USPS, LETTER, and SHUTTLE datasets.]

Performance Results (VII): Cross-Generational Comparison
  [Chart: performance of the Tesla C1060 (240 cores) versus the GeForce 8800 GT (112 cores), showing how the SVM scales with the number of cores across GPU generations.]

Agenda (section divider; next section: Conclusions & Future Work)

Conclusions
  - Designed and implemented a massively threaded multiclass classifier.
  - Experimental results (adding only one GPU, costing under $1K, to the same machine):
    - Dataset-dependent speedups of 3-57x for training and 3-112x for classification.
    - Training time reduced by more than an order of magnitude while maintaining the accuracy of the classification tasks.
    - The largest speedups occur when the bottleneck is computational (SGEMM, SGEMV, reduction).
    - The SVM scales with the number of cores.
  - Since November 2009: 1,000 visits, ~190 downloads.

Questions
