
AI&BigData Lab. Peter Rudenko. Automation and optimisation of machine learning pipelines on top of Apache Spark

23.05.15, Odessa. Impact Hub Odessa. AI&BigData Lab conference

Peter Rudenko (Software Engineer, DataRobot). Automation and optimisation of machine learning pipelines on top of Apache Spark

At DataRobot we build accurate predictive models automatically. Besides training the model itself, data preprocessing (feature selection/normalization/transformation) plays an important role in the overall process. In this talk I will share our experience of using the Apache Spark platform and, in particular, the new ml API, which provides the functionality for building pipelines (Pipeline) and for finding optimal values of model hyperparameters (cross-validation).

More details:
http://geekslab.co/
https://www.facebook.com/GeeksLab.co
https://www.youtube.com/user/GeeksLabVideo



  1. Automation and optimisation of machine learning pipelines on top of Apache Spark. Peter Rudenko, @peter_rud, peter.rudenko@datarobot.com
  2. DataRobot data pipeline: data upload → exploratory data analysis → training models, selecting best models & hyperparameters → models leaderboard → Prediction API
  3. Our journey to Apache Spark: PySpark vs Scala API?
     Sending instructions from the Python process to the Spark worker JVM (over py4j), e.g. df.agg({"age": "max"}): FAST!
     Sending data between the Spark worker JVM and the Python process (ipc/serde), e.g. data.map(lambda x: …) or data.filter(lambda x: …): SLOW!
  4. Our journey to Apache Spark: RDD vs DataFrame
     RDD[Row[(Double, String, Vector)]] vs DataFrame columns (DoubleType, nullable=true), (StringType, nullable=true), (VectorType, nullable=true), each carrying Attributes (in spark-1.4).
     Attributes: NumericAttribute, NominalAttribute (Ordinal), BinaryAttribute
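     To make the attribute idea concrete, here is a minimal sketch (not from the slides, assuming a DataFrame df with an "age" and a "gender" column) of attaching ml.attribute metadata to DataFrame columns with the spark-1.4 API:

         import org.apache.spark.ml.attribute.{NominalAttribute, NumericAttribute}
         import org.apache.spark.sql.functions.col

         // Attach ML attribute metadata to DataFrame columns (spark-1.4 ml.attribute API).
         // "age" is treated as numeric, "gender" as nominal with two known categories.
         val ageMeta = NumericAttribute.defaultAttr.withName("age").toMetadata()
         val genderMeta = NominalAttribute.defaultAttr
           .withName("gender")
           .withValues("male", "female")
           .toMetadata()

         val withAttrs = df.select(
           col("age").as("age", ageMeta),
           col("gender").as("gender", genderMeta))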
  5. Our journey to Apache Spark: MLlib vs ML
     MLlib: ● low-level implementations of machine learning algorithms ● based on RDDs
     ML: ● high-level pipeline abstractions ● based on DataFrames ● uses MLlib under the hood
  6. Columnar format
     ● Compression ● Scan optimization ● Null-imputer improvement:
     - val na2mean = { value: Double =>
     -   if (value.isNaN) meanValue else value
     - }
     - dataset.withColumn(map(outputCol),
     -   callUDF(na2mean, DoubleType, dataset(map(inputCol))))
     + dataset.na.fill(map(inputCols).zip(meanValues).toMap)
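     As a hedged usage sketch (column names are illustrative, assuming missing values are stored as nulls), the new approach computes the column means in one scan and then imputes them with a single na.fill call:

         import org.apache.spark.sql.functions.avg

         // Compute per-column means in one pass over the columnar data,
         // then fill the missing values of both columns in a single call.
         val means = df.agg(avg("age"), avg("income")).first()
         val imputed = df.na.fill(Map(
           "age"    -> means.getDouble(0),
           "income" -> means.getDouble(1)))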
  7. Typical machine learning pipeline
     ● Feature extraction ● Missing values imputation ● Variable encoding ● Dimensionality reduction ● Training the model (finding the optimal model parameters) ● Selecting hyperparameters ● Model evaluation on some metric (AUC, R2, RMSE, etc.)
     Diagram: train data (features + label) and test data (features) flow into the model state (parameters + hyperparameters), which produces a prediction.
  8. Introducing Blueprint
  9. Pipeline config
     pipeline: {
       "1": { input: ["NUM"], class: "org.apache.spark.ml.feature.MeanImputor" },
       "2": { input: ["CAT"], class: "org.apache.spark.ml.feature.OneHotEncoder" },
       "3": { input: ["1", "2"], class: "org.apache.spark.ml.feature.VectorAssembler" },
       "4": { input: "3", class: "org.apache.spark.ml.classification.LogisticRegression",
              params: { optimizer: "LBFGS", regParam: [0.5, 0.1, 0.01, 0.001] } }
     }
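     One way a config entry like this could be mapped onto a pipeline stage (a hypothetical sketch, not the actual Blueprint code) is to instantiate the stage class by name:

         import org.apache.spark.ml.PipelineStage

         // Hypothetical helper: build a stage from a config entry's class name.
         // Input wiring and param-grid handling are deliberately left out.
         def stageFor(className: String): PipelineStage =
           Class.forName(className).newInstance().asInstanceOf[PipelineStage]

         val encoder = stageFor("org.apache.spark.ml.feature.OneHotEncoder")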
  10. Introducing Blueprint (architecture diagram): Blueprint, Spark jobserver, YARN cluster
  11. Transformer (pure function)
      abstract class Transformer extends PipelineStage with Params {
        /**
         * Transforms the dataset with provided parameter map as additional parameters.
         * @param dataset input dataset
         * @param paramMap additional parameters, overwrite embedded params
         * @return transformed dataset
         */
        def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
      }
      Example:
      (new HashingTF).
        setInputCol("categorical_column").
        setOutputCol("Hashing_tf_1").
        setNumFeatures(1 << 20).
        transform(data)
  12. Estimator (fitting an Estimator produces a Transformer: Estimator => Transformer)
      abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
        /**
         * Fits a single model to the input data with optional parameters.
         *
         * @param dataset input dataset
         * @param paramPairs optional list of param pairs;
         *                   these values override any specified in this Estimator's embedded ParamMap
         * @return fitted model
         */
        @varargs
        def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
          val map = ParamMap(paramPairs: _*)
          fit(dataset, map)
        }
      }
      Example:
      val oneHotEncoderModel = (new OneHotEncoder).
        setInputCol("vector_col").
        fit(trainingData)
      oneHotEncoderModel.transform(trainingData)
      oneHotEncoderModel.transform(testData)
  13. Predictor: an Estimator that predicts a value
      Hierarchy: Predictor → Classifier → ProbabilisticClassifier; Predictor → Regressor
  14. Evaluator
      abstract class Evaluator extends Identifiable {
        /**
         * Evaluates the output.
         *
         * @param dataset a dataset that contains labels/observations and predictions
         * @param paramMap parameter map that specifies the input columns and output metrics
         * @return metric
         */
        def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
      }
      Example:
      val areaUnderROC = (new BinaryClassificationEvaluator).
        setScoreCol("prediction").
        evaluate(data)
  15. Pipeline: an Estimator that encapsulates other transformers / estimators
      val tokenizer = new Tokenizer()
        .setInputCol("text")
        .setOutputCol("words")
      val hashingTF = new HashingTF()
        .setInputCol(tokenizer.getOutputCol)
        .setOutputCol("features")
      val lr = new LogisticRegression()
        .setMaxIter(10)
      val pipeline = new Pipeline()
        .setStages(Array(tokenizer, hashingTF, lr))
      Diagram: input data → Tokenizer → HashingTF → Logistic Regression; fit produces a PipelineModel.
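      A short usage sketch (assuming DataFrames trainingData with "text" and "label" columns and testData with "text"): fitting the pipeline yields a PipelineModel whose transform adds the intermediate and prediction columns:

          // Fit the whole pipeline once, then score held-out data with the PipelineModel.
          val model = pipeline.fit(trainingData)
          val scored = model.transform(testData)  // adds "words", "features", "prediction", ...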
  16. CrossValidator
      val crossval = new CrossValidator()
        .setEstimator(pipeline)
        .setEvaluator(new BinaryClassificationEvaluator)
      val paramGrid = new ParamGridBuilder()
        .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
        .addGrid(lr.regParam, Array(0.1, 0.01))
        .build()
      crossval.setEstimatorParamMaps(paramGrid)
      crossval.setNumFolds(3)
      val cvModel = crossval.fit(training.toDF)
      Diagram: input data → Tokenizer → HashingTF → Logistic Regression, fit over the folds with numFeatures: {10, 100, 1000} and regParam: {0.1, 0.01}, producing a CrossValidatorModel.
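      A brief follow-up sketch (assuming a test collection analogous to training): the fitted CrossValidatorModel keeps the best pipeline found on the grid and can score new data directly:

          // bestModel is the pipeline refit on the full training data with the winning ParamMap.
          val best = cvModel.bestModel
          // Scoring delegates to that best pipeline under the hood.
          val predictions = cvModel.transform(test.toDF)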
  17. Pluggable backend: ● H2O ● Flink ● DeepLearning4J ● http://keystone-ml.org/ ● etc.
  18. Optimization
      ● Disable k-fold cross-validation
      ● Minimize redundant pre-processing
      ● Parallel grid search (see the sketch below)
      ● Parallel DAG pipeline
      ● Pluggable optimizer
      ● Non-gridsearch hyperparameter optimization (Bayesian & hypergradient):
        http://arxiv.org/pdf/1502.03492v2.pdf
        http://arxiv.org/pdf/1206.2944.pdf
        http://arxiv.org/pdf/1502.05700v1.pdf
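      For the parallel grid search bullet, a hedged sketch (not the built-in Spark 1.x behavior, where CrossValidator fits candidates one by one; pipeline, paramGrid, an evaluator and train/valid DataFrames are assumed from the earlier slides): each candidate ParamMap is fit concurrently from the driver with Scala parallel collections, while every fit remains a distributed Spark job:

          // Launch one fit per candidate ParamMap in parallel from the driver and
          // score each candidate on a held-out validation set.
          val results = paramGrid.par.map { params =>
            val model = pipeline.fit(train, params)
            (params, evaluator.evaluate(model.transform(valid)))
          }
          val (bestParams, bestScore) = results.maxBy(_._2)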
  19. Minimize redundant pre-processing
      Two grid candidates (regParam: 0.1 and regParam: 0.01) share the same pre-processing, yet
      val rdd1 = rdd.map(function)
      val rdd2 = rdd.map(function)
      rdd1 != rdd2
      so the shared transformation is recomputed for every candidate unless it is reused.
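      A hedged sketch of the fix (featurePipeline is an illustrative name for the shared feature stages; the cached DataFrame is assumed to end up with "features" and "label" columns): apply the shared pre-processing once, cache it, and sweep only the final estimator:

          // Fit and apply the shared feature stages once, cache the result, then
          // grid-search only the estimator's hyperparameters over the cached data.
          val featurized = featurePipeline.fit(train).transform(train).cache()
          val candidates = Seq(0.1, 0.01).map { reg =>
            new LogisticRegression().setRegParam(reg).fit(featurized)
          }
          featurized.unpersist()  // clean up after yourself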
  20. Summary
      ● A good model != a good result. Feature engineering is the key.
      ● Spark provides good abstractions, but some parts need tuning to achieve good performance.
      ● The ml pipeline API gives pluggable and reusable building blocks.
      ● Don't forget to clean up after yourself (unpersist cached data).
  21. Thanks, Demo & Q&A
