Scaling up Machine Learning Algorithms for Classification

Slides for private meeting of mathematical informatics assistant professors.

  1. Scaling up Machine Learning algorithms for classification. Shin Matsushima, Department of Mathematical Informatics, The University of Tokyo.
  2. How can we scale up machine learning to massive datasets?
     • Exploit hardware traits: disk IO is the bottleneck; Dual Cached Loops run disk IO and computation simultaneously.
     • Distributed asynchronous optimization (ongoing): current work using multiple machines.
  3. LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS
  4. Intuition of linear SVM: x_i is the i-th data point and y_i ∈ {+1, −1} its label; a larger y_i w · x_i is better, a smaller one is worse. (Figure: positive and negative points on either side of a linear boundary.)
  5. Formulation of linear SVM: n is the number of data points and d the number of features; the problem is convex, non-smooth optimization. (Equation on slide.)
  6. Formulation of linear SVM: the primal and the dual problems (equations on slide; a reconstruction is given below).
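
The equations on slides 5 and 6 do not survive the transcript. As a reference, here is the standard L1-loss linear SVM primal and dual that the rest of the deck builds on (a reconstruction, not a copy of the slides):

```latex
% Primal: C > 0 is the regularization parameter, (x_i, y_i), i = 1, ..., n, the data
\min_{w \in \mathbb{R}^d} \; \frac{1}{2}\|w\|^{2}
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr)

% Dual: Q_{ij} = y_i y_j x_i^{\top} x_j, and the primal solution is w = \sum_i y_i \alpha_i x_i
\min_{\alpha \in \mathbb{R}^{n}} \; \frac{1}{2}\,\alpha^{\top} Q\, \alpha - \mathbf{1}^{\top} \alpha
  \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; i = 1, \dots, n
```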
  7. Coordinate descent
  8.–14. (Figure-only slides: step-by-step illustration of coordinate descent.)
  15. Coordinate descent method: each update solves a one-variable optimization problem with respect to the coordinate being updated (see the note below).
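
For the record, the generic coordinate descent step behind slides 7–15 is (reconstructed): with objective f and current iterate α, pick a coordinate i and solve the one-variable problem

```latex
\alpha_i \;\leftarrow\; \operatorname*{arg\,min}_{t \in \mathbb{R}} \;
  f(\alpha_1, \dots, \alpha_{i-1},\, t,\, \alpha_{i+1}, \dots, \alpha_n)
```

keeping all other coordinates fixed.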
  16. Applying coordinate descent to the dual formulation of the SVM (derivation on slide).
  17. Applying coordinate descent to the dual formulation of the SVM (continued).
  18. Dual coordinate descent [Hsieh et al. 2008] (algorithm on slide; a sketch follows below).
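
A minimal sketch of the dual coordinate descent update of Hsieh et al. (2008) for the L1-loss dual above; the function and variable names are mine, not the slides':

```python
import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One epoch of dual coordinate descent for the L1-loss linear SVM,
    following the update rule of Hsieh et al. (2008).  X is a dense (n, d)
    array here purely for brevity; assumes no x_i is the zero vector."""
    for i in np.random.permutation(len(y)):
        x_i, y_i = X[i], y[i]
        G = y_i * w.dot(x_i) - 1.0              # gradient of the dual w.r.t. alpha_i
        PG = G                                  # projected gradient
        if alpha[i] == 0.0:
            PG = min(G, 0.0)
        elif alpha[i] == C:
            PG = max(G, 0.0)
        if PG != 0.0:
            old = alpha[i]
            Q_ii = x_i.dot(x_i)
            alpha[i] = min(max(old - G / Q_ii, 0.0), C)   # closed-form 1-D solution
            w += (alpha[i] - old) * y_i * x_i             # maintain w = sum_i y_i alpha_i x_i
    return alpha, w
```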
  19. Attractive properties:
     • Suitable for large-scale learning: each update needs only one data point.
     • Theoretical guarantees: linear convergence (cf. SGD).
     • Shrinking [Joachims 1999]: we can eliminate "uninformative" data points.
  20. Shrinking [Joachims 1999]. Intuition: a data point far from the current decision boundary is unlikely to become a support vector. (Figure: points far from the boundary.)
  21. Shrinking [Joachims 1999]: the condition (on slide; a typical form is given below) is available only in the dual problem.
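
The exact condition is not recoverable from the transcript; in the dual coordinate descent setting it typically takes the following form (my paraphrase of the shrinking test in Hsieh et al., 2008, in the spirit of Joachims, 1999). With dual gradient ∇_i D(α) = y_i w·x_i − 1, a point i is temporarily removed from the working set when

```latex
\bigl(\alpha_i = 0 \;\wedge\; \nabla_i D(\alpha) > \bar{M}\bigr)
\quad \text{or} \quad
\bigl(\alpha_i = C \;\wedge\; \nabla_i D(\alpha) < \bar{m}\bigr),
```

where M̄ and m̄ are thresholds maintained from the previous pass over the data. The test needs the dual variables and the dual gradient, which is why it is only available in the dual problem.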
  22. The problem in scaling up to massive data: for small-scale data we first copy the entire dataset into main memory, but for large-scale data the dataset cannot be copied at once. (Figure: reading data from disk into memory.)
  23.–26. Schemes when data cannot fit in memory, 1: Block Minimization [Yu et al. 2010] splits the entire dataset into blocks so that each block fits in memory. (Figures: alternating read-from-disk and train-in-RAM phases.)
  27. Block Minimization [Yu et al. 2010] (algorithm on slide; a sketch follows below).
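
A rough sketch of the block minimization outer loop as I read Yu et al. (2010); `load_block` and `alpha_store` are assumed helpers/containers, and `dual_cd_epoch` is the sketch shown earlier:

```python
def block_minimization(block_files, alpha_store, w, C,
                       outer_iters=10, inner_epochs=1):
    """The data live on disk split into blocks that each fit in memory.
    Blocks are loaded one at a time, their dual variables are optimized
    while w stays consistent, and then the block is evicted."""
    for _ in range(outer_iters):
        for path in block_files:
            X, y, idx = load_block(path)        # read one block from disk
            alpha = alpha_store[idx]            # dual variables of this block
            for _ in range(inner_epochs):
                alpha, w = dual_cd_epoch(X, y, alpha, w, C)
            alpha_store[idx] = alpha            # persist alpha, then drop the block
    return w
```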
  28.–33. Schemes when data cannot fit in memory, 2: Selective Block Minimization [Chang and Roth 2011] keeps "informative" data in memory across blocks. (Figures: alternating read and train phases with a persistent in-memory block.)
  34. Selective Block Minimization [Chang and Roth 2011] (algorithm on slide; a sketch follows below).
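
Selective block minimization differs mainly in what it keeps between blocks; a sketch under the same assumptions as above (the exact "informative" criterion of Chang and Roth, 2011 may differ from the simple alpha-in-the-interior test used here):

```python
import numpy as np

def selective_block_minimization(block_files, alpha_store, w, C, cache_limit=100_000):
    """Like block minimization, but a cache of 'informative' points (likely
    support vectors) is kept in memory and retrained with every new block."""
    cache = None                                    # (X_c, y_c, idx_c) or None
    for path in block_files:
        X, y, idx = load_block(path)
        if cache is not None:
            X = np.vstack([X, cache[0]])
            y = np.concatenate([y, cache[1]])
            idx = np.concatenate([idx, cache[2]])
        alpha = alpha_store[idx]
        alpha, w = dual_cd_epoch(X, y, alpha, w, C)
        alpha_store[idx] = alpha
        keep = np.where((alpha > 0.0) & (alpha < C))[0][:cache_limit]
        cache = (X[keep], y[keep], idx[keep])
    return w
```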
  35. Previous schemes alternate between CPU and disk IO: training (CPU) is idle while reading, and reading (disk IO) is idle while training.
  36. We want to exploit modern hardware: (1) multicore processors are commonplace; (2) CPU/memory IO is often 10–100 times faster than hard-disk IO.
  37. Dual Cached Loops (a threaded sketch follows below):
     1. Make the reader and the trainer run simultaneously and almost asynchronously.
     2. The trainer updates the parameter many times faster than the reader loads new data points.
     3. Keep informative data in main memory (that is, preferentially evict uninformative data).
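
A very rough, illustrative sketch of the dual cached loops pattern; `stream_examples` (a generator over the file on disk) and `evict_uninformative` (a possible form is sketched under slide 43 below) are assumed helpers, and the real implementation is multithreaded C++ with far finer-grained synchronization:

```python
import threading
import numpy as np

def dual_cd_update_one(x, y_i, a, w, C):
    """Same closed-form one-variable dual CD update as in the earlier sketch."""
    G = y_i * w.dot(x) - 1.0
    a_new = min(max(a - G / x.dot(x), 0.0), C)
    w += (a_new - a) * y_i * x
    return a_new

def dual_cached_loops(stream_examples, d, C, cache_capacity=100_000, max_sweeps=1000):
    """Reader thread streams (index, x, y) triples from disk into a bounded
    in-memory cache; the trainer (here, the main thread) keeps sweeping dual
    coordinate descent over whatever is currently cached."""
    w = np.zeros(d)
    cache = {}                                   # global index -> [x, y, alpha]
    lock = threading.Lock()

    def reader():
        for idx, x, y in stream_examples():      # pass(es) over the data on disk
            with lock:
                if len(cache) >= cache_capacity:
                    evict_uninformative(cache, w, C)   # make room: drop 'settled' points
                cache[idx] = [x, y, 0.0]

    threading.Thread(target=reader, daemon=True).start()
    for _ in range(max_sweeps):                  # trainer: many sweeps per disk pass
        with lock:
            entries = list(cache.values())
        for entry in entries:
            x, y, a = entry
            entry[2] = dual_cd_update_one(x, y, a, w, C)
    return w
```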
  38.–41. (Figures: Dual Cached Loops architecture. A reader thread moves data from disk into the in-memory working set W, while a trainer thread updates the parameter using the data in memory.)
  42. Which data is "uninformative"? A data point far from the current decision boundary is unlikely to become a support vector, so we ignore such a data point for a while. (Figure: points far from the boundary.)
  43. Which data is "uninformative"? The condition (on slide; a hedged guess is sketched below).
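
The condition itself is lost in the transcript; what follows is a hedged guess that matches the shrinking-style test above and fills in the `evict_uninformative` helper assumed in the earlier sketch:

```python
def evict_uninformative(cache, w, C, delta=0.1):
    """Drop cached points whose dual variable sits at a bound while the margin
    pushes it further against that bound, i.e. points currently far from the
    decision boundary on the 'settled' side; delta is an illustrative slack."""
    for i in list(cache.keys()):
        x, y, a = cache[i]
        margin = y * w.dot(x)
        if (a == 0.0 and margin > 1.0 + delta) or (a == C and margin < 1.0 - delta):
            del cache[i]
```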
  44. Experimental setup: datasets with various characteristics; 2 GB of memory for storing data points; we measure the relative function value.
  45. Comparison with (Selective) Block Minimization as implemented in Liblinear, on ocr: dense, 45 GB.
  46. Comparison with (Selective) Block Minimization (Liblinear) on dna: dense, 63 GB.
  47. Comparison with (Selective) Block Minimization (Liblinear) on webspam: sparse, 20 GB.
  48. Comparison with (Selective) Block Minimization (Liblinear) on kddb: sparse, 4.7 GB.
  49.–52. When C gets larger: dna with C = 1, 10, 100, and 1000.
  53. When memory gets larger: ocr with C = 1.
  54. Expanding features on the fly: expand features explicitly when the reader thread loads an example into memory. Read (y, x) from disk, compute f(x), and load (y, f(x)) into RAM. Here x is a raw DNA string (e.g. GTCCCACCT…) and f(x) ∈ R^12495340. (A sketch follows below.)
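
A toy sketch of this on-the-fly expansion, using a short k-mer count map in place of the real feature map (which, per the slide, has 12,495,340 dimensions):

```python
from itertools import product
from scipy.sparse import csr_matrix

K = 4   # toy k-mer length; 4**4 = 256 features instead of the real 12,495,340
KMER_INDEX = {"".join(p): j for j, p in enumerate(product("ACGT", repeat=K))}

def expand_features(seq):
    """Expand a raw DNA string into a sparse k-mer count row vector inside the
    reader thread, so only the expanded (y, f(x)) pair occupies RAM while the
    compact string representation is what sits on disk."""
    counts = {}
    for i in range(len(seq) - K + 1):
        j = KMER_INDEX.get(seq[i:i + K])
        if j is not None:                       # skip k-mers with non-ACGT characters
            counts[j] = counts.get(j, 0) + 1
    cols = sorted(counts)
    data = [counts[j] for j in cols]
    return csr_matrix((data, ([0] * len(cols), cols)), shape=(1, 4 ** K))

# e.g. expand_features("GTCCCACCT") -> 1 x 256 sparse row of 4-mer counts
```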
  55. Scale reached: 2 TB of data (50M examples, 12M features) trained with 16 GB of memory in 10 hours.
  56. Summary:
     • Linear SVM optimization when data cannot fit in memory.
     • Use the Dual Cached Loops scheme.
     • Outperforms the state of the art by orders of magnitude.
     • Can be extended to logistic regression, support vector regression, and multiclass classification.
  57. DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK)
  58. Future/current work: use the same principle as Dual Cached Loops in a multi-machine algorithm.
     • Transporting data can be done efficiently without harming optimization performance.
     • The key is to run communication and computation simultaneously and asynchronously.
     • Can we handle the more sophisticated communication patterns that arise in multi-machine optimization?
  59. (Selective) Block Minimization scheme for large-scale SVM. (Figure: data moves from the HDD/file system to a single machine, which runs the optimization.)
  60. Map-Reduce scheme for a multi-machine algorithm. (Figure: parameters move between a master node and worker nodes, which run the optimization.)
  61.–63. (Figure-only slides.)
  64. Stratified Stochastic Gradient Descent [Gemulla, 2011]
  65.–66. (Figure-only slides.)
  67. Map-Reduce scheme for a multi-machine algorithm (same diagram as slide 60).
  68. Asynchronous multi-machine scheme. (Figure: parameter communication and parameter updates proceed concurrently.)
  69.–70. NOMAD. (Figure-only slides.)
  71.–74. (Figure-only slides.)
  75. Asynchronous multi-machine scheme (a toy sketch follows below):
     • Each machine holds a subset of the data.
     • The machines keep passing portions of the parameter vector to one another.
     • Each machine simultaneously updates the parameters it currently possesses.
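
A toy sketch of this asynchronous pattern (illustrative only, not the NOMAD implementation): parameter blocks circulate between workers, and each worker updates whichever block it currently holds against its local data shard. `block_gradient` is an assumed helper returning the gradient of the local loss restricted to parameter block j:

```python
def async_worker(data_shard, inbox, outbox, rounds=1000, lr=0.01):
    """Each machine holds a fixed data shard.  Parameter blocks arrive in its
    inbox (e.g. a queue.Queue or a socket in a ring of workers), are updated
    against the local data only, and are immediately forwarded via the outbox,
    so communication and computation overlap across the ring."""
    for _ in range(rounds):
        j, w_j = inbox.get()             # receive one block of the parameter vector
        for x, y in data_shard:          # local, communication-free updates
            w_j = w_j - lr * block_gradient(w_j, j, x, y)
        outbox.put((j, w_j))             # pass the block on to the next machine
```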
  76. Distributed stochastic gradient descent for saddle-point problems: another formulation of the SVM (and of regularized risk minimization in general) that is suitable for parallelization (see the reconstruction below).
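
The formulation this slide most likely refers to is the standard saddle-point rewriting of the regularized hinge loss (my reconstruction), obtained from max(0, z) = max over α in [0,1] of αz:

```latex
\min_{w \in \mathbb{R}^{d}} \; \max_{\alpha \in [0,1]^{n}} \;
  \frac{\lambda}{2}\|w\|^{2}
  + \frac{1}{n} \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w^{\top} x_i\bigr)
```

The double sum over examples i and, through the inner product, over features is what allows block-partitioned stochastic updates of w and α to be distributed across machines.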
  77. How can we scale up machine learning to massive datasets?
     • Exploit hardware traits: disk IO is the bottleneck, so run disk IO and computation simultaneously.
     • Distributed asynchronous optimization (ongoing): current work using multiple machines.
