Scaling up Machine Learning Algorithms for Classification

Slides for a private meeting of mathematical informatics assistant professors.


Transcript

  • 1. Scaling up Machine Learning algorithms for classification Department of Mathematical Informatics The University of Tokyo Shin Matsushima
  • 2. How can we scale up Machine Learning to massive datasets? • Exploit hardware traits – Disk IO is the bottleneck – Dual Cached Loops – Run Disk IO and computation simultaneously • Distributed asynchronous optimization (ongoing) – Current work using multiple machines 2
  • 3. LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS 3
  • 4. • Intuition of linear SVM – xi: i-th datapoint – yi: i-th label, +1 or -1 – yi w·xi: larger is better, smaller is worse 4 [figure: scatter plot of the two classes yi = +1 and yi = -1]
  • 5. • Formulation of Linear SVM – n: number of data points – d: number of features – Convex non-smooth optimization 5
  • 6. • Formulation of Linear SVM – Primal – Dual 6
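The primal and dual objectives on this slide are embedded as images. For reference, the standard L2-regularized L1-loss (hinge) linear SVM, the form used by LIBLINEAR and by Hsieh et al. 2008 (cited on slide 18), is written below; this assumes that variant is the one intended here.

    % Primal: L2-regularized hinge-loss linear SVM
    \min_{w \in \mathbb{R}^d} \; \frac{1}{2}\lVert w \rVert^2
        + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^\top x_i \bigr)

    % Dual: box-constrained quadratic program with one variable per datapoint
    \min_{\alpha \in \mathbb{R}^n} \; \frac{1}{2}\alpha^\top Q \alpha - \mathbf{1}^\top \alpha
        \quad \text{s.t. } 0 \le \alpha_i \le C, \qquad
        Q_{ij} = y_i y_j\, x_i^\top x_j, \qquad
        w = \sum_{i=1}^{n} y_i \alpha_i x_i .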
  • 7. Coordinate descent 7
  • 8. 8
  • 9. 9
  • 10. 10
  • 11. 11
  • 12. 12
  • 13. 13
  • 14. 14
  • 15. • Coordinate Descent Method – For each update, we solve a one-variable optimization problem with respect to the variable being updated. 15
  • 16. • Applying Coordinate Descent to the dual formulation of SVM 16
  • 17. • Applying Coordinate Descent to the dual formulation of SVM 17
  • 18. Dual Coordinate Descent [Hsieh et al. 2008] 18
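The update rule on slide 18 is shown only as an image. The following is a minimal sketch of dual coordinate descent in the style of Hsieh et al. 2008, assuming the L1-loss dual above, dense NumPy arrays X (n x d) and y in {-1, +1}, and illustrative variable names; it is not the authors' code.

    import numpy as np

    def dual_coordinate_descent(X, y, C=1.0, epochs=10):
        """Minimal dual coordinate descent for the L1-loss linear SVM dual."""
        n, d = X.shape
        alpha = np.zeros(n)                        # one dual variable per datapoint
        w = np.zeros(d)                            # maintained as w = sum_i y_i alpha_i x_i
        Qii = np.einsum('ij,ij->i', X, X) + 1e-12  # diagonal of Q: ||x_i||^2 (guarded)
        for _ in range(epochs):
            for i in np.random.permutation(n):
                G = y[i] * w.dot(X[i]) - 1.0       # gradient of the dual objective in alpha_i
                # the one-variable subproblem has a closed-form, box-constrained solution
                a_new = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (a_new - alpha[i]) * y[i] * X[i]   # keep w consistent with alpha
                alpha[i] = a_new
        return w, alpha

Only one datapoint x_i is touched per update, which is what makes the method attractive for the out-of-core schemes on the following slides.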
  • 19. Attractive properties • Suitable for large-scale learning – We need only one datapoint for each update. • Theoretical guarantees – Linear convergence (cf. SGD) • Shrinking [Joachims 1999] – We can eliminate “uninformative” data 19
  • 20. Shrinking [Joachims 1999] • Intuition: a datapoint far from the current decision boundary is unlikely to become a support vector 20 [figure: points far from the decision boundary]
  • 21. Shrinking [Joachims 1999] • Condition • Available only in the dual problem 21
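The condition on slide 21 is an image. One representative shrinking rule in the dual, the one used by the coordinate descent solver of Hsieh et al. 2008 (and LIBLINEAR), removes a variable that sits at a bound and whose gradient indicates it will stay there:

    G_i = y_i\, w^\top x_i - 1, \qquad
    \text{shrink } i \;\text{ if }\;
    (\alpha_i = 0 \text{ and } G_i > \bar{M})
    \;\text{ or }\;
    (\alpha_i = C \text{ and } G_i < \bar{m}),

where \bar{M} and \bar{m} are the largest and smallest projected gradients observed in the previous pass. Whether the slide states exactly this variant or Joachims' original criterion is not visible from the transcript.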
  • 22. Problem in scaling up to massive data • In dealing with small-scale data, we first copy the entire dataset into main memory • In dealing with large-scale data, we cannot copy the dataset at once 22 [diagram: data read from disk into memory]
  • 23. • Schemes when data cannot fit in memory 1. Block Minimization [Yu et al. 2010] – Split the entire dataset into blocks so that each block can fit in memory [diagram: read data]
  • 24. • Schemes when data cannot fit in memory 1. Block Minimization [Yu et al. 2010] – Split the entire dataset into blocks so that each block can fit in memory [diagram: train in RAM]
  • 25. • Schemes when data cannot fit in memory 1. Block Minimization [Yu et al. 2010] – Split the entire dataset into blocks so that each block can fit in memory [diagram: read data]
  • 26. • Schemes when data cannot fit in memory 1. Block Minimization [Yu et al. 2010] – Split the entire dataset into blocks so that each block can fit in memory [diagram: train in RAM]
  • 27. Block Minimization[Yu et al. 2010] 27
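A rough, self-contained sketch of the block-minimization loop of Yu et al. 2010, under the assumption that the dataset has been pre-split into .npz files that each fit in memory and that each file also stores its block's dual variables between outer passes; the file layout and inner solver are illustrative, not the paper's implementation.

    import numpy as np

    def block_minimization(block_paths, d, C=1.0, outer_iters=10, inner_epochs=3):
        """Sketch of Block Minimization: load one block from disk, run dual
        coordinate descent on that block only, write its alphas back, discard it."""
        w = np.zeros(d)
        for _ in range(outer_iters):
            for path in block_paths:                    # Disk IO and CPU strictly alternate
                blk = np.load(path)                     # read one block into RAM
                X, y, alpha = blk['X'], blk['y'], blk['alpha']
                Qii = np.einsum('ij,ij->i', X, X) + 1e-12
                for _ in range(inner_epochs):           # optimize this block's dual variables
                    for i in np.random.permutation(len(y)):
                        G = y[i] * w.dot(X[i]) - 1.0
                        a_new = min(max(alpha[i] - G / Qii[i], 0.0), C)
                        w += (a_new - alpha[i]) * y[i] * X[i]
                        alpha[i] = a_new
                np.savez(path, X=X, y=y, alpha=alpha)   # persist alphas; block leaves RAM
        return w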
  • 28. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: read data]
  • 29. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: train in RAM with retained block]
  • 30. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: train in RAM with retained block]
  • 31. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: read data]
  • 32. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: train in RAM with retained block]
  • 33. • Schemes when data cannot fit in memory 2. Selective Block Minimization [Chang and Roth 2011] – Keep “informative data” in memory [diagram: train in RAM with retained block]
  • 34. Selective Block Minimization[Chang and Roth 2011] 34
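What distinguishes Selective Block Minimization from plain Block Minimization is the selection step that decides which points stay in RAM across blocks. The exact criterion of Chang and Roth 2011 is not reproduced in the transcript; the sketch below is an illustrative stand-in that keeps free dual variables (likely support vectors) and points near the margin.

    import numpy as np

    def select_informative(X, y, alpha, w, C, margin_slack=0.1):
        """Boolean mask of points worth keeping in the RAM-resident block:
        free dual variables (0 < alpha_i < C) and points with y_i * w.x_i close to 1."""
        margins = y * (X @ w)
        free = (alpha > 0) & (alpha < C)
        near_margin = np.abs(margins - 1.0) < margin_slack
        return free | near_margin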
  • 35. • Previous schemes alternate between CPU and Disk IO – Training (CPU) is idle while reading – Reading (Disk IO) is idle while training 35
  • 36. • We want to exploit modern hardware 1. Multicore processors are commonplace 2. CPU (memory IO) is often 10-100 times faster than hard disk IO 36
  • 37. Dual Cached Loops 1. Make the reader and trainer run simultaneously and almost asynchronously. 2. The trainer updates the parameters many times faster than the reader loads new datapoints. 3. Keep informative data in main memory (i.e., preferentially evict uninformative data from main memory). 37
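A minimal threading sketch of the three points above, assuming datapoints arrive from a disk-reading generator and that update_fn is a dual coordinate descent step that also reports whether the point looks uninformative; the cache policy and all names are illustrative, not the paper's exact implementation.

    import collections
    import threading

    cache = collections.OrderedDict()   # in-memory working set: index -> (x, y, alpha)
    lock = threading.Lock()
    CAPACITY = 100_000                  # how many datapoints fit in RAM (illustrative)

    def reader(stream):
        """Reader loop: keep pulling (index, x, y) from disk into the cache,
        evicting the oldest entry when the cache is full."""
        for idx, x, y in stream:
            with lock:
                if len(cache) >= CAPACITY:
                    cache.popitem(last=False)          # make room for the new point
                cache[idx] = (x, y, 0.0)

    def trainer(update_fn, stop_event):
        """Trainer loop: sweep over whatever is currently cached, updating dual
        variables far more often than the reader can load new datapoints."""
        while not stop_event.is_set():
            with lock:
                items = list(cache.items())
            for idx, (x, y, alpha) in items:
                new_alpha, uninformative = update_fn(idx, x, y, alpha)
                with lock:
                    if idx not in cache:
                        continue
                    if uninformative:
                        del cache[idx]                 # evict uninformative data early
                    else:
                        cache[idx] = (x, y, new_alpha)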
  • 38. Dual Cached Loops [diagram: reader thread moves data from disk into RAM; trainer thread updates the parameter using the data in RAM] 38
  • 39. Dual Cached Loops [diagram: reader thread moves data from disk into RAM; trainer thread updates the parameter using the data in RAM] 39
  • 40. [diagram: read data from disk into memory] W: working index set 40
  • 41. [diagram: train; update the parameter using the data in memory] 41
  • 42. Which data is “uninformative”? • A datapoint far from the current decision boundary is unlikely to become a support vector • Ignore such a datapoint for a while. 42 [figure: points far from the decision boundary]
  • 43. Which data is “uninformative”? – Condition 43
  • 44. • Datasets with various characteristics • 2GB of memory for storing datapoints • Measured the relative function value 45
  • 45. • Comparison with (Selective) Block Minimization (implemented in Liblinear) – ocr: dense, 45GB 46
  • 46. • Comparison with (Selective) Block Minimization (implemented in Liblinear) – dna: dense, 63GB 47
  • 47. • Comparison with (Selective) Block Minimization (implemented in Liblinear) – webspam: sparse, 20GB 48
  • 48. • Comparison with (Selective) Block Minimization (implemented in Liblinear) – kddb: sparse, 4.7GB 49
  • 49. • When C gets larger (dna, C=1) 51
  • 50. • When C gets larger (dna, C=10) 52
  • 51. • When C gets larger (dna, C=100) 53
  • 52. • When C gets larger (dna, C=1000) 54
  • 53. • When memory gets larger (ocr, C=1) 55
  • 54. • Expanding features on the fly – Expand features explicitly when the reader thread loads an example into memory • Read (y, x) from the disk • Compute f(x) and load (y, f(x)) into RAM [diagram: x = GTCCCACCT…, f(x) ∈ R^12495340] 56
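In the reader/trainer split above, the expansion f can be folded into the reader loop. The sketch below illustrates this for a DNA string like the one on the slide, using hashed k-mer counts as a stand-in for f; the actual feature map behind the R^12495340 expansion is not specified in the transcript.

    def feature_map(seq, k=4, dim=2 ** 20):
        """Illustrative expansion f: map a DNA string to a sparse dict of hashed
        k-mer counts, standing in for f(x) in a high-dimensional space.
        (hash() is process-local, which is fine inside a single training run.)"""
        x = {}
        for i in range(len(seq) - k + 1):
            j = hash(seq[i:i + k]) % dim
            x[j] = x.get(j, 0.0) + 1.0
        return x

    def reader_with_expansion(raw_stream, cache_put):
        """Reader thread body: read (y, x) from disk, compute f(x) on the fly,
        and hand (y, f(x)) to the in-memory cache used by the trainer."""
        for y, seq in raw_stream:      # e.g. seq = "GTCCCACCT..."
            cache_put(y, feature_map(seq))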
  • 55. 50M examples with 12M features, corresponding to 2TB of data, trained in 10 hrs with 16GB of memory 57
  • 56. • Summary – Linear SVM optimization when data cannot fit in memory – Use the Dual Cached Loops scheme – Outperforms the state of the art by orders of magnitude – Can be extended to • Logistic regression • Support vector regression • Multiclass classification 58
  • 57. DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK) 59
  • 58. Future/Current Work • Utilize the same principle as Dual Cached Loops in a multi-machine algorithm – Data can be transported efficiently without harming optimization performance – The key is to run communication and computation simultaneously and asynchronously – Can we handle the more sophisticated communication patterns that emerge in multi-machine optimization? 60
  • 59. • (Selective) Block Minimization scheme for large-scale SVM 61 [diagram: one machine moves data from the HDD/file system and processes the optimization]
  • 60. • Map-Reduce scheme for a multi-machine algorithm 62 [diagram: the master node moves parameters; worker nodes process the optimization]
  • 61. 63
  • 62. 64
  • 63. 65
  • 64. Stratified Stochastic Gradient Descent [Gemulla, 2011] 66
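Slide 64 only names the method; for context, the stratification of Gemulla et al. 2011 (stated there for matrix factorization) partitions the data matrix into a grid of blocks and, in each sub-epoch, runs SGD in parallel on a set of blocks that share no rows or columns, so the workers' updates cannot conflict. A minimal sketch of that block schedule, with illustrative names:

    def strata(num_workers):
        """Yield the stratum schedule: in sub-epoch s, worker p processes block
        (p, (p + s) mod num_workers); no two blocks in a stratum share a
        row-block or column-block, so parallel updates do not conflict."""
        for s in range(num_workers):
            yield [(p, (p + s) % num_workers) for p in range(num_workers)]

    # Example with 3 workers: three strata, each a conflict-free set of blocks.
    for stratum in strata(3):
        print(stratum)
    # [(0, 0), (1, 1), (2, 2)]
    # [(0, 1), (1, 2), (2, 0)]
    # [(0, 2), (1, 0), (2, 1)]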
  • 65. 67
  • 66. 68
  • 67. • Map-Reduce scheme for a multi-machine algorithm 69 [diagram: the master node moves parameters; worker nodes process the optimization]
  • 68. Asynchronous multi-machine scheme 70 [diagram: parameter communication and parameter updates run concurrently]
  • 69. NOMAD 71
  • 70. NOMAD 72
  • 71. 73
  • 72. 74
  • 73. 75
  • 74. 76
  • 75. Asynchronous multi-machine scheme • Each machine holds a subset of the data • Machines keep passing portions of the parameters to each other • Each machine simultaneously keeps updating the parameters it currently holds 77
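A toy sketch of this scheme in the spirit of the NOMAD slides: each worker owns a fixed shard of the data, blocks of parameters circulate through per-worker queues, and a worker updates whichever block it currently holds before passing it on. Threads and queues stand in for machines and network communication, and update_fn is left abstract; this is not the actual NOMAD implementation.

    import queue
    import threading

    def worker(rank, in_q, out_qs, local_shard, update_fn, num_rounds):
        """One machine: repeatedly receive a parameter block, update it using only
        the locally held data shard, then forward it to the next machine."""
        for _ in range(num_rounds):
            block_id, block = in_q.get()                    # receive a circulating block
            block = update_fn(block_id, block, local_shard) # compute with local data only
            out_qs[(rank + 1) % len(out_qs)].put((block_id, block))  # pass it on

    def run(local_shards, init_blocks, update_fn, num_rounds=100):
        n = len(local_shards)
        qs = [queue.Queue() for _ in range(n)]
        for b, (block_id, block) in enumerate(init_blocks):
            qs[b % n].put((block_id, block))                # scatter the parameter blocks
        threads = [threading.Thread(target=worker,
                                    args=(r, qs[r], qs, local_shards[r],
                                          update_fn, num_rounds))
                   for r in range(n)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

Communication (queue puts) and computation (update_fn) overlap across workers, mirroring the reader/trainer overlap of Dual Cached Loops on a single machine.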
  • 76. • Distributed stochastic gradient descent for saddle point problems – Another formulation of SVM (Regularized Risk Minimization in general) – Suitable for parallelization 78
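The formulation itself is not shown in the transcript; one standard saddle-point rewriting of the regularized hinge-loss objective, obtained from max(0, z) = max over 0 <= a <= 1 of a*z and presumably close to what is meant here, is:

    \min_{w \in \mathbb{R}^d} \; \max_{\alpha \in [0,1]^n} \;
        \frac{\lambda}{2}\lVert w \rVert^2
        + \frac{1}{n}\sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w^\top x_i\bigr),

which is bilinear in w and α apart from the regularizer and therefore amenable to distributed stochastic primal-dual updates.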
  • 77. How can we scale up Machine Learning to massive datasets? • Exploit hardware traits – Disk IO is the bottleneck – Run Disk IO and computation simultaneously • Distributed asynchronous optimization (ongoing) – Current work using multiple machines 79