Parallel Artificial Intelligence and Parallel Optimization: a Bias and Variance Point of View

A paper on parallel Monte-Carlo Tree Search:

@inproceedings{bourki:inria-00512854,
hal_id = {inria-00512854},
url = {http://hal.inria.fr/inria-00512854},
title = {{Scalability and Parallelization of Monte-Carlo Tree Search}},
author = {Bourki, Amine and Chaslot, Guillaume and Coulm, Matthieu and Danjean, Vincent and Doghmen, Hassen and H{\'e}rault, Thomas and Hoock, Jean-Baptiste and Rimmel, Arpad and Teytaud, Fabien and Teytaud, Olivier and Vayssi{\`e}re, Paul and Yu, Ziqin},
booktitle = {{The International Conference on Computers and Games 2010}},
address = {Kanazawa, Japan},
audience = {international},
collaboration = {Grid'5000},
year = {2010},
pdf = {http://hal.inria.fr/inria-00512854/PDF/newcluster.pdf},
}


And a paper on parallel optimization:
@inproceedings{teytaud:inria-00369781,
hal_id = {inria-00369781},
url = {http://hal.inria.fr/inria-00369781},
title = {{On the parallel speed-up of Estimation of Multivariate Normal Algorithm and Evolution Strategies}},
author = {Teytaud, Fabien and Teytaud, Olivier},
abstract = {{Motivated by parallel optimization, we experiment EDA-like adaptation-rules in the case of $\lambda$ large. The rule we use, essentially based on estimation of multivariate normal algorithm, is (i) compliant with all families of distributions for which a density estimation algorithm exists (ii) simple (iii) parameter-free (iv) better than current rules in this framework of $\lambda$ large. The speed-up as a function of $\lambda$ is consistent with theoretical bounds.}},
language = {English},
affiliation = {Institut National de la Recherche en Informatique et en Automatique - INRIA FUTURS , UFR Sciences - Universit{\'e} Paris-Sud XI , TAO - INRIA Futurs , Laboratoire de Recherche en Informatique - LRI , TAO - INRIA Saclay - Ile de France},
booktitle = {{EvoNum (evostar workshop)}},
publisher = {Springer},
address = {Tuebingen, Germany},
volume = {EvoNum},
audience = {international},
collaboration = {Grid'5000},
year = {2009},
pdf = {http://hal.inria.fr/inria-00369781/PDF/lambdaLarge.pdf},
}

1. High-performance computing in Artificial Intelligence & Optimization. Olivier.Teytaud@inria.fr + many people. TAO, Inria-Saclay IDF, CNRS 8623, LRI, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence. NCHC, Taiwan. November 2010.
2. Disclaimer. Many works in parallelism are about technical tricks: SMP programming, message passing, network organization. ==> often moderate improvements, but for all users of a given library/methodology. Here, the opposite point of view: don't worry about a 10% loss due to suboptimal programming; try to benefit from huge machines.
3. Outline: Parallelism; Bias & variance; AI & Optimization (optimization, supervised machine learning, multistage decision making); Conclusions.
4. Parallelism. Basic principle (here!): using more CPUs to be faster.
5. Parallelism. Basic principle (here!): using more CPUs to be faster. Various cases: many cores in one machine (shared memory).
6. Parallelism. Basic principle (here!): using more CPUs to be faster. Various cases: many cores in one machine (shared memory); many cores on the same fast network (explicit fast communications).
7. Parallelism. Basic principle (here!): using more CPUs to be faster. Various cases: many cores in one machine (shared memory); many cores on the same fast network (explicit fast communications); many cores on a network (explicit slow communications).
8. Parallelism. Various cases: many cores in one machine (shared memory) ==> your laptop; many cores on the same fast network (explicit fast communications) ==> your favorite cluster; many cores on a network (explicit slow communications) ==> your grid, your lab, or the internet.
9. Parallelism. Definitions: p = number of processors. Speed-up(P) = (time for reaching precision ε when p = 1) / (time for reaching precision ε when p = P). Efficiency(p) = speed-up(p) / p (usually at most 1).
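For instance (illustrative numbers, not from the slides): if reaching precision ε takes 120 seconds with p = 1 and 20 seconds with p = 8, then speed-up(8) = 120/20 = 6 and efficiency(8) = 6/8 = 0.75.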
10. Outline: Parallelism; Bias & variance; AI & Optimization (optimization, supervised machine learning, multistage decision making); Conclusions.
11. Bias and variance. I compute x on a computer. It is imprecise: I get an estimate x̂. How can I parallelize this to make it faster?
12. Bias and variance. I compute x on a computer; it is imprecise, I get x̂. What happens if I compute it 1000 times, on 1000 different machines? I get x̂1,...,x̂1000 and set x̂ = average(x̂1,...,x̂1000).
13. Bias and variance. x̂ = average(x̂1,...,x̂1000). If the algorithm is deterministic: all x̂i are equal, no benefit, speed-up = 1, efficiency → 0 ==> not good! (trouble = bias!)
14. Bias and variance. x̂ = average(x̂1,...,x̂1000). If the algorithm is deterministic: all x̂i are equal, no benefit, speed-up = 1, efficiency → 0 ==> not good! If it is an unbiased Monte-Carlo estimate: speed-up = p, efficiency = 1 ==> ideal case! (trouble = variance)
15. Bias and variance, concluding. Two classical notions for an estimator x̂ of x: Bias = E(x̂) - x; Variance = E((x̂ - E x̂)²). Parallelism can easily reduce the variance; parallelism cannot easily reduce the bias.
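A minimal sketch of this point (the pi estimator, sample sizes and seeds below are my illustrative choices, not from the slides): averaging p independent unbiased Monte-Carlo runs keeps the bias at zero and divides the variance by roughly p, whereas averaging p copies of a deterministic, biased computation changes nothing.

```python
import math
import random

def one_run(n, seed):
    """Unbiased Monte-Carlo estimate of pi from n random points in the unit square."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4.0 * hits / n

def parallel_estimate(p, n=10000):
    """Average of p independent runs (one per 'machine'): same (zero) bias, variance ~ 1/p."""
    return sum(one_run(n, seed) for seed in range(p)) / p

for p in (1, 10, 100):
    est = parallel_estimate(p)
    print("p=%4d  estimate=%.5f  error=%.5f" % (p, est, abs(est - math.pi)))
# A deterministic routine would return the same value on every machine:
# averaging p identical copies gives speed-up 1 and does not touch the bias.
```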
16. AI & optimization: bias & variance everywhere. Parallelism; Bias & variance; AI & Optimization (optimization, supervised machine learning, multistage decision making); Conclusions.
17. AI & optimization: bias & variance everywhere. Many (parts of) algorithms can be rewritten as follows: generate a sample x1,...,xλ using current knowledge; work on x1,...,xλ, get y1,...,yλ; update knowledge.
18. Example 1: evolutionary optimization. While (I have time): generate a sample x1,...,xλ using current knowledge; work on x1,...,xλ, get y1,...,yλ; update knowledge.
19. Example 1: evolutionary optimization. Initial knowledge = Gaussian distribution. While (I have time): generate a sample x1,...,xλ using current knowledge; work on x1,...,xλ, get y1,...,yλ; update knowledge.
20. Example 1: evolutionary optimization. Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1,...,yλ; update knowledge.
21. Example 1: evolutionary optimization. Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1 = fitness(x1),...,yλ = fitness(xλ); update knowledge.
22. Example 1: evolutionary optimization. Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1 = fitness(x1),...,yλ = fitness(xλ); update G (rank the xi's): m = mean(x1,...,xμ), σ² = var(x1,...,xμ), computed on the selected best points.
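A minimal sketch of this loop, under assumptions of mine (sphere fitness, λ = 100, μ = 25, fixed iteration count; none of these values come from the slides); the λ fitness evaluations are independent, which is exactly the part that parallelizes:

```python
import numpy as np

def fitness(x):
    return float(np.sum(x ** 2))             # sphere function, to be minimized

def emna(dim=10, lam=100, mu=25, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(dim), np.ones(dim)   # current Gaussian G(m, sigma^2)
    for _ in range(iters):
        # Generate sample x1,...,x_lambda using G
        xs = m + sigma * rng.standard_normal((lam, dim))
        # Work on x1,...,x_lambda: the lam evaluations are independent,
        # hence embarrassingly parallel over lam cores/machines.
        ys = np.array([fitness(x) for x in xs])
        # Update G: rank, keep the mu best, unweighted mean and variance
        best = xs[np.argsort(ys)[:mu]]
        m = best.mean(axis=0)
        sigma = best.std(axis=0) + 1e-12
    return m

print(fitness(emna()))   # should be close to 0
```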
23. Example 1: evolutionary optimization. [Call-out: MANY EVOLUTIONARY ALGORITHMS ARE WEAK FOR LAMBDA LARGE. CAN BE EASILY OPTIMIZED BY A BIAS / VARIANCE ANALYSIS.] Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1 = fitness(x1),...,yλ = fitness(xλ); update G (rank the xi's): m = mean(x1,...,xμ), σ² = var(x1,...,xμ).
24. Ex. 1: bias & variance for EO. Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1 = fitness(x1),...,yλ = fitness(xλ); update G (rank the xi's): m = mean(x1,...,xμ) <== unweighted!, σ² = var(x1,...,xμ).
25. Ex. 1: bias & variance for EO. Huge improvement in EMNA for λ large just by taking into account the bias/variance decomposition: reweighting is necessary for cancelling the bias. Other improvements by classical statistical tricks: reducing σ for λ large; using quasi-random mutations. ==> Really simple, and crucial for large population sizes (not just for publishing :-) ). A sketch of a reweighted update follows below.
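The slides do not give the exact reweighting scheme; purely as an illustration (my assumption, not the authors' rule), here is a log-rank weighted recombination in the style of weighted-recombination evolution strategies, replacing the unweighted mean/variance over the μ best points:

```python
import numpy as np

def weighted_update(xs, ys, mu):
    """Rank-weighted replacement for the unweighted mean/variance update.

    xs: sampled points of shape (lam, dim); ys: their fitness values.
    """
    order = np.argsort(ys)[:mu]                            # indices of the mu best points
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))    # decreasing log-rank weights
    w /= w.sum()
    best = xs[order]
    m = (w[:, None] * best).sum(axis=0)                    # weighted mean
    var = (w[:, None] * (best - m) ** 2).sum(axis=0)       # weighted variance
    return m, np.sqrt(var) + 1e-12                         # new mean and standard deviation
```

In the EMNA sketch above, the two lines computing m and sigma would be replaced by a call to weighted_update(xs, ys, mu).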
26. Ex. 1: bias & variance for EO. Initial knowledge = Gaussian distribution G (mean m, variance σ²). While (I have time): generate a sample x1,...,xλ using G; work on x1,...,xλ, get y1 = fitness(x1),...,yλ = fitness(xλ); update G (rank the xi's): m = mean(x1,...,xμ) <== unweighted!, σ² = var(x1,...,xμ).
27. Example 2: supervised machine learning (huge dataset). Generate a sample x1,...,xλ using current knowledge; work on x1,...,xλ, get y1,...,yλ; update knowledge.
28. Example 2: supervised machine learning (huge dataset D). Generate data sets D1,...,Dλ using current knowledge (subsets of the database); work on D1,...,Dλ, get f1,...,fλ (by learning); average the fi's. ==> (Su)bagging: Di = subset of D. ==> Random subspace: Di = projection of D on a random vector space. ==> Random noise: Di = D + noise. ==> Random forest: Di = D, but a noisy algorithm.
29. Example 2: supervised machine learning (huge dataset D), continued. Easy tricks for parallelizing supervised machine learning: use (su)bagging; use random subspaces; use the average of randomized algorithms (random forests); do the cross-validation in parallel. ==> From my experience, complicated parallel tools are not that important... Polemical issue: many papers on sophisticated parallel supervised machine learning algorithms; I might be wrong :-). A sketch is given below.
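A minimal sketch of the subagging trick with off-the-shelf tools (scikit-learn and the synthetic dataset below are my assumptions, not something the slides prescribe): each learner is trained on a random subset Di of D, the learners are trained in parallel on all available cores, and their predictions are averaged.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# bootstrap=False + max_samples=0.2: each of the 50 learners sees a random
# 20% subset of D without replacement (subagging); n_jobs=-1 trains the
# learners in parallel on all cores; prediction averages them (majority vote).
model = BaggingClassifier(n_estimators=50, max_samples=0.2,
                          bootstrap=False, n_jobs=-1, random_state=0)
model.fit(X, y)
print(model.score(X, y))
```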
30. Example 2: active supervised machine learning (huge dataset). While I have time: generate a sample x1,...,xλ using current knowledge (e.g. sample the max-uncertainty region); work on x1,...,xλ, get y1,...,yλ (labels by experts / expensive code); update knowledge (approximate model).
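A sketch of that loop, where everything below (logistic model, a cheap stand-in for the expert oracle, λ = 16, pool size) is an illustrative assumption of mine; the λ expert calls in each round are independent, hence parallelizable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pool = rng.normal(size=(2000, 5))                          # unlabelled pool
expert = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in for the expensive oracle

idx = list(rng.choice(len(pool), size=20, replace=False))  # small initial labelled set
model = LogisticRegression().fit(pool[idx], expert(pool[idx]))
for _ in range(10):
    proba = model.predict_proba(pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)        # probability closest to 0.5 = most uncertain
    labelled = set(idx)
    new = [i for i in np.argsort(uncertainty)[::-1] if i not in labelled][:16]
    idx += new                                # the 16 expert labels could be computed in parallel
    model = LogisticRegression().fit(pool[idx], expert(pool[idx]))
print(model.score(pool, expert(pool)))
```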
31. Example 3: decision making under uncertainty. While I have time: generate simulations x1,...,xλ using current knowledge; work on x1,...,xλ, get y1,...,yλ (get rewards); update knowledge (approximate model).
32. UCT (Upper Confidence Trees). Coulom (06); Chaslot, Saito & Bouzy (06); Kocsis & Szepesvari (06).
33.-36. UCT (figure-only slides).
37. UCT. Kocsis & Szepesvari (06).
38. Exploitation ... (figure-only slide)
39.-41. Exploitation ... SCORE = 5/7 + k.sqrt( log(10)/7 )
42. ... or exploration? SCORE = 0/2 + k.sqrt( log(10)/2 )
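The two scores above are the UCB-style rule used to pick the next move in the tree; a tiny helper to reproduce the arithmetic (the value of k below is mine, the slides do not fix it):

```python
import math

def ucb_score(wins, visits, parent_visits, k):
    """wins/visits + k * sqrt(log(parent_visits) / visits)."""
    return wins / visits + k * math.sqrt(math.log(parent_visits) / visits)

k = 1.0                              # illustrative exploration constant
print(ucb_score(5, 7, 10, k))        # the "exploitation" move of the slides
print(ucb_score(0, 2, 10, k))        # the "exploration" move of the slides
```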
43. Example 3: decision making under uncertainty. While I have time: generate simulations x1,...,xλ using current knowledge (= a scoring rule based on statistics); work on x1,...,xλ, get y1,...,yλ (get rewards); update knowledge (= update statistics in memory).
44. Example 3: decision making under uncertainty: parallelizing. While I have time: generate simulations x1,...,xλ using current knowledge (= a scoring rule based on statistics); work on x1,...,xλ, get y1,...,yλ (get rewards); update knowledge (= update statistics in memory). ==> "Easily" parallelized on multicore machines.
45. Example 3: decision making under uncertainty: parallelizing. Same loop, parallelized on clusters: one knowledge base per machine, and statistics are averaged only for crucial nodes: nodes with more than 5% of the simulations, and nodes at depth < 4. A sketch of this merge rule follows below.
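A sketch of that merge rule, with data structures of my choosing (per-machine statistics as a dict mapping a node key to (wins, simulations); the communication layer itself, e.g. MPI, is not shown):

```python
def crucial(key, stats, total_sims, depth_of):
    """The slide's rule: share a node if it holds > 5% of the simulations and depth < 4."""
    wins, sims = stats[key]
    return sims > 0.05 * total_sims and depth_of(key) < 4

def merge(local_stats, remote_stats_list, total_sims, depth_of):
    """Average the (wins, simulations) statistics of crucial nodes across machines."""
    merged = dict(local_stats)
    for key in local_stats:
        if not crucial(key, local_stats, total_sims, depth_of):
            continue
        entries = [local_stats[key]] + [r[key] for r in remote_stats_list if key in r]
        n = len(entries)
        merged[key] = (sum(w for w, _ in entries) / n,
                       sum(s for _, s in entries) / n)
    return merged
```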
46. Example 3: decision making under uncertainty: parallelizing. Good news first: it's simple and it works on huge clusters!!! Comparison with voting schemes; 40 machines, 2 seconds per move.
47. Example 3: decision making under uncertainty: parallelizing. Good news first: it's simple and it works on huge clusters!!! Comparing N machines and P machines ==> consistent with a linear speed-up in 19x19!
48. Example 3: decision making under uncertainty: parallelizing. Good news first: it's simple and it works on huge clusters!!! When we had produced these numbers, we believed we were ready to play Go against very strong players. Unfortunately, not at all :-)
49. Go: from 29 to 6 stones.
1998: loss against an amateur (6d), 19x19, H29.
2008: win against a pro (8p), 19x19, H9 (MoGo).
2008: win against a pro (4p), 19x19, H8 (CrazyStone).
2008: win against a pro (4p), 19x19, H7 (CrazyStone).
2009: win against a pro (9p), 19x19, H7 (MoGo).
2009: win against a pro (1p), 19x19, H6 (MoGo).
2010: win against a pro (4p), 19x19, H6 (Zen).
2007: win against a pro (5p), 9x9 blitz (MoGo).
2008: win against a pro (5p), 9x9 as white (MoGo).
2009: win against a pro (5p), 9x9 as black (MoGo).
2009: win against a pro (9p), 9x9 as white (Fuego).
2009: win against a pro (9p), 9x9 as black (MoGoTW).
==> Still 6 stones at least!
50. Go: from 29 to 6 stones (same results as slide 49), with the remark: wins with H6 / H7 are lucky (rare) wins. ==> Still 6 stones at least!
51. Example 3: decision making under uncertainty: parallelizing. So what happened? A great speed-up + moderate results = contradiction???
52. Example 3: decision making under uncertainty: parallelizing. So what happened? A great speed-up + moderate results = contradiction??? OK, we can simulate the sequential algorithm very quickly = success. But even the sequential algorithm is limited, even with huge computation time!
53. Example 3: decision making under uncertainty: parallelizing. A poorly handled situation, even with 10 days of CPU!
54. Example 3: decision making under uncertainty: limited scalability (game of Havannah). ==> Killed by the bias!
55. Example 3: decision making under uncertainty: limited scalability (game of Go). ==> Bias trouble!!! We reduce the variance but not the systematic bias.
56. Conclusions. We have seen that “good old” bias/variance analysis is quite efficient, and not widely known / used.
57. Conclusions. Easy tricks for evolutionary optimization on grids ==> we published papers with great speed-ups obtained with just one line of code: reweighting mainly, and also quasi-random mutations and a selective pressure modified for large population sizes.
58. Conclusions. Easy tricks for supervised machine learning ==> the bias/variance analysis here boils down to: choose an algorithm with more variance than bias and average: random subspaces; random subsets (subagging); noise introduction; “hyper”parameters to be tuned (cross-validation).
59. Conclusions. For sequential decision making under uncertainty, disappointing results: the best algorithms are not “that” scalable. A systematic bias remains.
60. Conclusions and references. Our experiments: often on Grid5000: ~5000 cores, Linux, homogeneous environment, a union of high-performance clusters, containing multi-core machines. Monte-Carlo Tree Search for decision making under uncertainty: Coulom; Kocsis & Szepesvari; Chaslot et al., ... For parallel evolutionary algorithms: Beyer et al.; Teytaud et al. (this Teytaud is not me...).
