
Machine Learning in PHP (PHPCon Poland)


Machine learning is teaching the computer how to learn by itself. It is far easier than it sounds, especially when you have a small data set and a good level of expertise in your field. Classifying objects, predicting who will buy, or spotting code in comments can be achieved with nature-inspired algorithms like neural networks, genetic algorithms or ant herding. PHP is in a good position to make use of such techniques, and to take advantage of related technologies like FANN. By the end of the session, you'll know where you want to try it.

Published in: Technology


  1. Machine Learning in PHP • Poland, Warsaw, October 2016 • "Learn, someday this pain will be useful to you"
  2. Agenda • How to teach tricks to your PHP • Application: searching for code in comments • Complex learning
  3. Speaker • Damien Seguy • Exakat CTO • Static analysis of PHP code
  4. Machine Learning • Teaching the machine • Supervised learning: learning, then applying • The application builds its own model: training phase • It applies its model to real cases: applying phase
  5. Applications • Play go, chess, tic-tac-toe and beat everyone else • Fraud detection and risk analysis • Automated translation or automated transcription • OCR and face recognition • Medical diagnostics • Walk, welcome guests at hotels, play football • Finding good PHP code
  6. PHP Applications • Recommendation systems • Predicting user behavior • Spam • Conversion of users to customers • ETA • Detect code in comments
  7. Real Use Case • Identify code in comments • Classic problem • Good problem for machine learning • Complex, no simple solution • A lot of data and expertise are available
  8. Supervised Training • [Diagram: History data, Training, Model, Real data, Results]
  9. Supervised Training • [Diagram: History data, Training, Model, Real data, Results]
  10. The Fann Extension • ext/fann (https://pecl.php.net/package/fann) • Fast Artificial Neural Network • http://leenissen.dk/fann/wp/ • Neural networks in PHP • Works on PHP 7, thanks to the hard work of Jakub Zelenka • https://github.com/bukka/php-fann
  11. Neural Networks • Imitation of nature • Input layer • Output layer • Intermediate layers
  12. Neural Networks • Imitation of nature • Input layer • Output layer • Intermediate layers
  13. Initialisation
        <?php
        // Three layers in total: input, hidden and output
        $num_layers         = 3;
        $num_input          = 5;
        $num_neurons_hidden = 3;
        $num_output         = 1;
        $ann = fann_create_standard($num_layers, $num_input,
                                    $num_neurons_hidden, $num_output);
        // Activation function
        fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
        fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);
  14. Preparing Data • Raw data → Extract → Filter → Human review → Fann ready • Extract data from the raw source • Remove any useless data from the extract • Apply some human review to the filtered data • Format data for FANN
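
The "Extract" step above can be sketched with PHP's tokenizer. This is only an illustration of one way to pull comments out of source code: the function name `extractComments()` is hypothetical, and the talk does not say how the comments were actually collected.

```php
<?php
/**
 * Collect every comment found in a piece of PHP source code.
 * Hypothetical helper, not taken from the slides.
 *
 * @param string $source PHP code, including the opening tag
 * @return string[] comments, in order of appearance
 */
function extractComments(string $source): array
{
    $comments = [];
    foreach (token_get_all($source) as $token) {
        // Tokens are either single characters or [id, text, line] arrays.
        if (is_array($token) && in_array($token[0], [T_COMMENT, T_DOC_COMMENT], true)) {
            // Some PHP versions keep the trailing newline of one-line comments.
            $comments[] = trim($token[1]);
        }
    }
    return $comments;
}

$code = "<?php\n// \$x = 3;\n\$a = 1; /* plain text */\n";
print_r(extractComments($code));
```

Filtering and human review then work on the returned list, before anything is formatted for FANN.
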
  15. Expert At Work
        // Test if the if is in a compressed format
        // nie mowie po polsku
        // There is a parser specified in `Parser::$KEYWORD_PARSERS`
        // $result should exist, regardless of $_message
        // TODO : fix this; var_dump($var);
        // $a && $b and multidimensional
        // numGlyphs + 1
        //$annots .= ' /StructParent ';
        // $cfg['Servers'][$i]['controlpass'] = 'pmapass';
        // if(ob_get_clean()){
  16. Input Vector • 'length': size of the comment • 'countDollar': number of $ • 'countEqual': number of = • 'countObjectOperator': number of -> operators ($o->p) • 'countSemicolon': number of semicolons ;
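
The five characteristics above map directly onto string functions. Here is a minimal sketch of the `makeVector()` helper that the Application slide calls but does not show. The order of the values, the choice to count the comment delimiters in the length, and the cast to float are assumptions.

```php
<?php
/**
 * Turn a comment into the 5 input values described on the Input Vector slide.
 * Sketch only: the author's real makeVector() is not shown in the slides.
 *
 * @return float[] length, count of $, count of =, count of ->, count of ;
 */
function makeVector(string $comment): array
{
    return [
        (float) strlen($comment),              // 'length'
        (float) substr_count($comment, '$'),   // 'countDollar'
        (float) substr_count($comment, '='),   // 'countEqual'
        (float) substr_count($comment, '->'),  // 'countObjectOperator'
        (float) substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

echo implode(' ', makeVector('//$gvars = $this->getGraphicVars();')), "\n";
```

Note that `substr_count($comment, '=')` also counts the `=` inside `==` or `=>`, which is consistent with a plain character count.
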
  17. Input Data • First line: number of input cases (47), number of incoming data per case (5), number of outgoing data per case (1) • Then, for each case, one line of incoming data followed by one line of outgoing data
        47 5 1
        825 0 0 0 1
        0
        37 2 0 0 0
        0
        55 2 2 0 1
        1
        61 2 1 3 1
        1
        ...
      Source comments (truncated on the slide):
         * (at your option) any later v
         *
         * Exakat is distributed in the
         * but WITHOUT ANY WARRANTY; wi
         * MERCHANTABILITY or FITNESS F
         * GNU Affero General Public Li
         *
         * You should have received a c
         * along with Exakat.  If not,
         *
         * The latest code can be found
         *
         */
        // $x[3] or $x[] and multidimen
        //if ($round == 3) { die('Round
        //$this->errors[] = $this->lang
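
The file layout above can be produced with a few lines of PHP. This is a sketch under assumptions: `buildTrainingData()` is a hypothetical name, and each sample is taken to be a pair of an input vector and a human-reviewed flag (1 for code, 0 for plain text).

```php
<?php
/**
 * Format reviewed samples as a FANN training file.
 * Hypothetical helper, not taken from the slides.
 *
 * @param array $samples list of [inputVector, isCode] pairs
 * @return string file contents: a header line, then input and output lines
 */
function buildTrainingData(array $samples): string
{
    if (count($samples) === 0) {
        throw new InvalidArgumentException('At least one sample is required');
    }
    $numInputs = count($samples[0][0]);
    // Header: number of cases, inputs per case, outputs per case
    $lines = [count($samples) . ' ' . $numInputs . ' 1'];
    foreach ($samples as $sample) {
        list($vector, $isCode) = $sample;
        $lines[] = implode(' ', $vector);
        $lines[] = $isCode ? '1' : '0';
    }
    return implode("\n", $lines) . "\n";
}

// Two of the cases shown on the slide
echo buildTrainingData([
    [[37, 2, 0, 0, 0], false],
    [[55, 2, 2, 0, 1], true],
]);
```

Saving the returned string with `file_put_contents('incoming.data', ...)` gives a file in the same shape as the one used for training.
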
  18. Black Magic • [Diagram: 151 372000 0 • // $x[3] or $x[] and multidimensional • ext/fann • It's a comment]
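  19. Training
        <?php
        // $ann is the network created on the Initialisation slide
        $max_epochs             = 500000;
        $epochs_between_reports = 1000;   // not set on the slide; assumed value
        $desired_error          = 0.001;
        // the actual training
        if (fann_train_on_file($ann,
                               'incoming.data',
                               $max_epochs,
                               $epochs_between_reports,
                               $desired_error)) {
            fann_save($ann, 'model.out');
        }
        fann_destroy($ann);
        ?>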
  20. Training • 47 cases • 5 characteristics • 3 hidden neurons • + 5 input + 1 output • Duration: 5.711 s
  21. Application • [Diagram: History data, Training, Model, Real data, Results]
  22. Application
        <?php
        $ann = fann_create_from_file('model.out');
        $comment = '//$gvars = $this->getGraphicVars();';
        // makeVector() turns a comment into the 5 input values (not shown on the slide)
        $input   = makeVector($comment);
        $results = fann_run($ann, $input);
        if ($results[0] > 0.8) {
            print "\"$comment\" -> $results[0]\n";
        }
        ?>
  23. Results > 0.8 • Answer between 0 and 1 • Values range from -14 to 0.999 • The closer to 1, the safer. The closer to 0, the safer. • Is this a percentage? Is this a carrot count? • It's a mix of counts…
  24. Scores Distribution • [Chart: distribution of scores; axis ticks from -16 to 0 and from 60.000000 to 100.000000]
  25. Real Cases • Tested on 14093 comments • Duration: 68.01 ms • Found 1960 issues (14%)
  26. 0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';
      0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();
      0.99999928 /*
        if (defined('SESSIONUPLOAD')) {
            // write sessionupload back into the loaded PMA session
            $sessionupload = unserialize(SESSIONUPLOAD);
            foreach ($sessionupload as $key => $value) {
                $_SESSION[$key] = $value;
            }
            // remove session upload data that are not set anymore
            foreach ($_SESSION as $key => $value) {
                if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX))
                    == UPLOAD_PREFIX
                    && ! isset($sessionupload[$key])
  27. 0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232
      0.99361396 // We have server(s) => apply default configuration
      0.98383027 // Duration = as configured
      0.99999928 // original -> translation mapping
      0.97590065 // = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in
  28. [Diagram: what is found by FANN (machine learning) compared with the target (expert work): true positive, false positive, true negative, false negative]
  29. [Diagram: found by FANN compared with the target: true positive, false positive, true negative, false negative] • Scores: 0.99999923, 0.73295981, 0.99999851, 0.2104115
        // $cfg['Servers'][$i]['table_coords'] = 'pma__
        //(isset($attribs['height'])?$attribs['height']
        // if ($key != null) did not work for index "0"
        // the PASSWORD() function
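
The four categories above can be tallied by comparing each score against the expert's verdict. This is a sketch with a hypothetical `confusionMatrix()` helper. It reuses the 0.8 threshold from the Application slide and the four scores from the slide above, but the expert labels in the usage example are assumed for illustration, since the transcript does not say which score belongs to which category.

```php
<?php
/**
 * Count true/false positives and negatives.
 * Hypothetical helper, not taken from the slides.
 *
 * @param float[] $scores    network outputs, one per comment
 * @param bool[]  $expected  expert verdict: true when the comment is code
 * @param float   $threshold scores above this value are reported as code
 * @return int[] counts keyed by TP, FP, TN, FN
 */
function confusionMatrix(array $scores, array $expected, float $threshold = 0.8): array
{
    $matrix = ['TP' => 0, 'FP' => 0, 'TN' => 0, 'FN' => 0];
    foreach ($scores as $i => $score) {
        if ($score > $threshold) {
            // Reported by the network: right if the expert agrees
            $matrix[$expected[$i] ? 'TP' : 'FP']++;
        } else {
            // Not reported: a miss if the expert says it is code
            $matrix[$expected[$i] ? 'FN' : 'TN']++;
        }
    }
    return $matrix;
}

// Scores from the slide; the expert labels are assumed for illustration.
$scores   = [0.99999923, 0.73295981, 0.99999851, 0.2104115];
$expected = [true, true, false, false];
print_r(confusionMatrix($scores, $expected));
```

The share of false positives among reported issues is then `FP / (TP + FP)`.
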
  30. Results • 1960 issues • 50+% of false positives • With an easy clean, 822 issues reported • 14k comments, analyzed in 68 ms (367 ms in PHP 5) • Total time of coding: 27 mins.
        // = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in
        /* vim: set expandtab sw=4 ts=4 sts=4: */
  31. Learn Better, Not Harder • Better training data • Improve characteristics • Configure the neural network • Change algorithm • Automate learning • Update constantly • [Diagram: Real data, History data, Training, Model, Results, Feedback]
  32. Better Training Data • More data, more data, more data • Varied situations, real case situations • Include specific cases • Experience is crucial • https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
  33. Improve Characteristics • Add new characteristics • Remove the ones that are less interesting • Find the right set of characteristics
  34. Network Configuration • Input vector • Intermediate neurons • Activation function • Output vector • [Chart: time of training (ms), from 0 to 20000, against values 1 to 10; series: 1 layer, 2 layers, 3 layers, 4 layers]
  35. Change Algorithm • First add more data before changing algorithm • Try the cascade2 algorithm from FANN • 0.6 => 0 found • 0.5 => 2 found • Not found by the first algorithm • Ant colony, genetic algorithms, gravitational search, artificial immune, nie mowie po polsku, annealing, harmony search, interior point search, tabu search
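
Trying FANN's cascade training from PHP could look like the sketch below. It assumes ext/fann is installed and that `fann_create_shortcut()` and `fann_cascadetrain_on_file()` are the matching entry points in the extension. The function name `trainWithCascade()`, the limit of 10 neurons and the report interval are illustrative choices, not values from the talk.

```php
<?php
/**
 * Train a model with FANN cascade training and save it.
 * Hypothetical helper, not taken from the slides.
 *
 * @return bool true when the model was trained and saved
 */
function trainWithCascade(string $dataFile, string $modelFile, int $maxNeurons = 10): bool
{
    // Nothing to do without the extension or the training file
    if (!extension_loaded('fann') || !is_readable($dataFile)) {
        return false;
    }
    // Cascade training starts without hidden layer: 5 inputs, 1 output.
    // Hidden neurons are added one by one during training.
    $ann = fann_create_shortcut(2, 5, 1);
    if ($ann === false) {
        return false;
    }
    $neuronsBetweenReports = 1;     // assumed value
    $desiredError          = 0.001; // same target as the Training slide
    $trained = fann_cascadetrain_on_file($ann, $dataFile, $maxNeurons,
                                         $neuronsBetweenReports, $desiredError);
    $saved = $trained && fann_save($ann, $modelFile);
    fann_destroy($ann);
    return $saved;
}

var_dump(trainWithCascade('incoming.data', 'model-cascade.out'));
```

The saved model can then be loaded with `fann_create_from_file()` and compared with the first one on the same comments.
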
  36. Finding The Best • Test with 2-4 layers, 10 neurons • Measure results • [Chart: values from 0 to 9000 against 1 to 13; series: 1 layer, 2 layers, 3 layers, 4 layers]
  37. Deep Learning • Chaining the neural networks • Translators, scorers, auto-encoders • Unsupervised learning
  38. Other Tools • PHP ext/fann • R language • https://github.com/kachkaev/php-r • Scikit-learn • https://github.com/scikit-learn/scikit-learn • Mahout • https://mahout.apache.org/
  39. Conclusion • Machine learning is about data, not code • There are tools to use it with PHP • Fast to try, easy results or fast fail • Use it for complex problems that tolerate errors
  40. http://www.exakat.io • @exakat • http://www.slideshare.net/dseguy/ • PHP 7.1 preparation workshop • Dzięki czemu
