Practical Machine Learning with ML Pipeline

Slides from a lightning talk (LT) given at Spark Casual Talk #1.

Published in: Engineering

  1. 1. Practical Machine Learning with ML Pipeline. Kazuki Taniguchi (谷口 和輝), AdTech Studio, AdTech Division, CyberAgent, Inc.
  2. 2. Introduction ■ Affiliation: SAT (Scientific Advertising Team) ■ Occupation: Data Scientist ■ Spark activities ■ OSS (Apache Spark) ■ QConTokyo2015: "Big Data Analysis with Spark" ■ Qiita: "Summary of How to Install Apache Zeppelin" @kazk1018
  3. 3. Practical Machine Learning
  4. 4. Machine learning workflow: Training Data → Feature Engineering → Machine Learning → Prediction (label, …)
  5. 5. ML Pipeline
  6. 6. ML Pipeline ■ The machine learning workflow can be written declaratively ■ Supports Cross Validation and Grid Search ■ Grid Search: searching for the optimal parameters ■ Cross Validation: evaluating the model
  7. 7. ML Pipeline: Training Data → Feature Engineering → Machine Learning → Prediction (diagram labels: Pipeline, Pipeline, DataFrame, Model)
  8. 8. Ex. Logistic Regression
  9. 9. Training Data (DataFrame), columns (id, label, text): (1, 1, "I am a student"), (2, 0, "I have no money"), (3, 0, "Where are you"), …
  10. 10. Training Data, columns (id, label, text): (1, 1, "I am a student"), (2, 0, "I have no money"), (3, 0, "Where are you"), … → Tokenizer → DataFrame, columns (id, label, text, words): (1, 1, "I am a student", ("I", "am", "a", "student")), (2, 0, "I have no money", ("I", "have", "no", "money")), (3, 0, "Where are you", ("Where", "are", "you")), …
  11. 11. Columns (id, label, text, words): (1, 1, "I am a student", ("I", "am", "a", "student")), (2, 0, "I have no money", ("I", "have", "no", "money")), (3, 0, "Where are you", ("Where", "are", "you")), … → hashingTF → columns (id, label, text, words, features): (1, 1, …, …, Vector), (2, 0, …, …, Vector), (3, 0, …, …, Vector), …
  12. 12. Columns (id, label, text, words, features): (1, 1, …, …, Vector), (2, 0, …, …, Vector), (3, 0, …, …, Vector), … → Logistic Regression → Model
  13. 13. Transformer
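  14. 14. Transformer ■ One element (Stage) of a Pipeline. Example: a Transformer that extracts the domain from a URL. Input columns: uid (STRING), time (DATETIME), page (STRING). Output columns: uid (STRING), time (DATETIME), page (STRING), domain (STRING). The implementation of UnaryTransformer in Transformer.scala is easy to follow.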
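  15. 15. VectorUDT ■ Features are ultimately passed to the ML pipeline as VectorUDT. Input columns: uid (STRING), time (DATETIME), page (STRING) → Transformer → output columns: uid (STRING), time (DATETIME), page (STRING), feature (VectorUDT) ■ VectorUDT is private, so you would have to write your own, but…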
  16. 16. LogisticRegression: require(feature.dataType == VectorUDT). A Transformer ahead of its time. "This is currently private[spark] but will be made public later once it is stabilized." Be stabilized!!
  17. 17. Summary
  18. 18. Summary ■ ML Pipeline lets you write the processing declaratively ■ Machine learning can be expressed in a single unified framework ■ Unless VectorUDT becomes public, using a custom Transformer looks difficult
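
The data flow on slides 9 to 11 (text, then words, then features) can be illustrated with a minimal sketch in plain Python. This is not the Spark API: `tokenizer`, `hashing_tf`, `run_pipeline`, and `NUM_FEATURES` are hypothetical names, the tokenizer only splits on whitespace, and `zlib.crc32` is a stand-in hash chosen for reproducibility. The sketch stops before model fitting; it only shows how each stage adds one column and how stages chain.

```python
import zlib
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Stage = Callable[[Row], Row]

# Hypothetical vector size, kept small so the output is easy to read.
NUM_FEATURES = 16


def tokenizer(row: Row) -> Row:
    """Return a copy of the row with a 'words' column (whitespace split of 'text')."""
    return {**row, "words": row["text"].split()}


def hashing_tf(row: Row) -> Row:
    """Return a copy of the row with a 'features' column of hashed term counts."""
    vec = [0] * NUM_FEATURES
    for word in row["words"]:
        # Stand-in hash (an assumption of this sketch), bucketed by modulo.
        vec[zlib.crc32(word.encode("utf-8")) % NUM_FEATURES] += 1
    return {**row, "features": vec}


def run_pipeline(stages: List[Stage], rows: List[Row]) -> List[Row]:
    """Apply every stage, in order, to every row."""
    out = rows
    for stage in stages:
        out = [stage(r) for r in out]
    return out


# The three training rows shown on slide 9.
training = [
    {"id": 1, "label": 1, "text": "I am a student"},
    {"id": 2, "label": 0, "text": "I have no money"},
    {"id": 3, "label": 0, "text": "Where are you"},
]

featurized = run_pipeline([tokenizer, hashing_tf], training)
```

Each stage returns a new row instead of mutating its input, which loosely mirrors how a pipeline stage yields a new DataFrame with one extra column.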

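Slide 14 gives a Transformer that extracts the domain from a URL as its example. The column-level logic might look like the following plain-Python sketch, which assumes the `page` column holds a full URL string. `add_domain` and the sample rows are hypothetical and made up for illustration; in Spark the same logic would be wrapped in a Transformer, which this sketch does not attempt.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

Row = Dict[str, Any]


def add_domain(row: Row) -> Row:
    """Return a copy of the row with a 'domain' column derived from 'page'."""
    # hostname is None when the string has no network location part.
    host = urlparse(row["page"]).hostname
    return {**row, "domain": host or ""}


# Hypothetical sample rows with the (uid, time, page) columns from slide 14.
rows: List[Row] = [
    {"uid": "u1", "time": "2015-01-01 00:00:00", "page": "http://example.com/a/b?x=1"},
    {"uid": "u2", "time": "2015-01-01 00:00:05", "page": "https://www.example.org/"},
]

with_domain = [add_domain(r) for r in rows]
```

The output rows carry the same uid, time, and page values plus a string `domain` column, matching the input and output schemas listed on the slide.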