
What's new in Hivemall v0.5.0


This deck covers what's new in Apache Hivemall v0.5.0 (English-only version).



  1. What’s New in Hivemall v0.5.0. Makoto YUI (@myui), Research Engineer, Treasure Data. @ApacheHivemall
  2. Released the first Apache release, v0.5.0, on Mar 5, 2018.
  3. What’s new in v0.5.0?
     • Anomaly/change-point detection (algorithms: ChangeFinder, SST)
     • Topic modeling / soft clustering (algorithms: LDA, pLSA)
     • Hivemall on Spark v2.0/v2.1/v2.2: SparkSQL/DataFrame support, top-k data processing
  4. Generic Classifier/Regressor: the old per-algorithm style is superseded by a new generic style from v0.5.0.
  5. Generic Classifier/Regressor: available loss functions and options.
     • For binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss
     • For regression: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss
     • Regularization: L1, L2, ElasticNet, RDA
     • Optimizer: SGD, AdaGrad, AdaDelta, ADAM
     • Other options: iteration support, mini-batch, early stopping
  6. Generic Classifier/Regressor hyperparameters (AdaGrad + RDA by default):
     -eta0 <arg>                          The initial learning rate [default: 0.1]
     -iter,--iterations <arg>             The maximum number of iterations [default: 10]
     -lambda <arg>                        Regularization term [default: 0.0001]
     -loss,--loss_function <arg>          Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
     -mini_batch,--mini_batch_size <arg>  Mini-batch size [default: 1]; a value in the range [1,100] or so is expected
     -opt,--optimizer <arg>               Optimizer to update weights [default: adagrad; also sgd, adadelta, adam]
     -reg,--regularization <arg>          Regularization type [default: rda; also l1, l2, elasticnet]
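     As a sketch of the new generic style (the table name `train` and the option string are assumptions; omitting -opt/-reg falls back to the AdaGrad + RDA defaults above):
       SELECT feature, avg(weight) AS weight
       FROM (
         SELECT train_classifier(features, label, '-loss logloss -opt adagrad -reg rda -iter 20')
                  AS (feature, weight)
         FROM train
       ) t
       GROUP BY feature;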
  7. RandomForest in Hivemall: an ensemble of decision trees.
  8. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0 in addition to dense vectors!
     train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
  9. Supported feature vector formats of Random Forests:
     • Dense vector (array<double>), e.g., [1.0, 0.0, 3.0]
     • Sparse vector (array<string>) in LIBSVM format, e.g., ["1:1.0", "2:0.0", "3:3.0"] or ["1:1.0", "3:3.0"]
       feature := <index>[":"<value>], where index := <integer> starting at 1 (index 0 is reserved for the bias clause) and value := <floating point> (defaults to 1.0 if omitted, so "1:1.0, 3" sets features 1 and 3)
     Sparse string features can be produced with feature_hashing:
       select feature_hashing(array("userid#4505:3.3", "movieid#2331:4.999", "movieid#2331"));
       > ["1828616:3.3", "6238429:4.999", "6238429"]
  10. Feature Engineering – Feature Hashing
  11. Random Forests training hyperparameters:
     -attrs,--attribute_types <arg>     Comma-separated attribute types (Q for quantitative and C for categorical variables, e.g., [Q,C,Q,C])
     -depth,--max_depth <arg>           The maximum tree depth [default: Integer.MAX_VALUE]
     -leafs,--max_leaf_nodes <arg>      The maximum number of leaf nodes [default: Integer.MAX_VALUE]
     -min_samples_leaf <arg>            The minimum number of samples in a leaf node [default: 1]
     -rule,--split_rule <arg>           Split algorithm [default: GINI; also ENTROPY, CLASSIFICATION_ERROR]
     -seed <arg>                        Seed value as a long [default: -1 (random)]
     -splits,--min_split <arg>          A node with at least `min_split` examples will split [default: 2]
     -stratified,--stratified_sampling  Enable stratified sampling for unbalanced data
     -subsample <arg>                   Sampling rate in range (0.0,1.0]
     -trees,--num_trees <arg>           The number of trees per task [default: 50]
     -vars,--num_variables <arg>        The number of randomly selected features [default: ceil(sqrt(x[0].length))]; int(num_variables * x[0].length) is used if num_variables is in (0.0,1.0]
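     A minimal training sketch (the `training` table is an assumption; the UDTF's output columns, e.g. model_id and model, vary slightly across versions, so they are left unnamed here):
       CREATE TABLE rf_model AS
       SELECT train_randomforest_classifier(features, label, '-trees 50 -seed 31')
       FROM training;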
  12. Prediction of RandomForest: posterior probability based on voting of decision trees; model reliability based on the OOB (out-of-bag) error rate.
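     A prediction sketch following the tree_predict/rf_ensemble pattern from the user guide (table names are assumptions; the final `true` requests classification, while newer versions accept an option string instead):
       SELECT rowid, rf_ensemble(predicted) AS predicted
       FROM (
         -- apply every tree in the forest to every test row
         SELECT t.rowid,
                tree_predict(m.model_id, m.model, t.features, true) AS predicted
         FROM rf_model m
         LEFT OUTER JOIN testing t
       ) p
       GROUP BY rowid;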
  13. Decision Tree Visualization
  14. Decision Tree Visualization (continued)
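     Such visualizations are driven by tree_export, which serializes a single trained tree for rendering (a sketch; the supported -type values, e.g. javascript or graphviz output, depend on the version):
       SELECT tree_export(model, '-type javascript') AS exported
       FROM rf_model
       LIMIT 1;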
  15. Efficient all-pairs cosine similarity using DIMSUM. All-pairs similarity is very computation-heavy: O(N^2), where N is the number of items or users. Twitter's solution is DIMSUM.
  16. All-pairs cosine similarity using DIMSUM. Find a concrete example in …
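     A condensed sketch of the DIMSUM flow (assuming `item_features` holds one feature vector per row and `item_magnitude` a map of per-column L2 norms; -threshold trades accuracy for speed):
       SELECT j, k, sum(b_jk) AS similarity
       FROM (
         SELECT dimsum_mapper(f.feature_vector, m.mag, '-threshold 0.5') AS (j, k, b_jk)
         FROM item_features f
         CROSS JOIN item_magnitude m
       ) t
       GROUP BY j, k;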
  17. Feature Engineering – Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles of the distribution, e.g., mapping ages into 3 bins.
  18. Feature Engineering – Feature Binning (continued)
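     A sketch of the two-step pattern (assuming a `users` table with an `age` column): derive quantile boundaries with build_bins, then map each value to its bin with feature_binning:
       WITH bins AS (
         SELECT build_bins(age, 3) AS quantiles
         FROM users
       )
       SELECT feature_binning(age, quantiles) AS age_bin
       FROM users
       CROSS JOIN bins;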
  19. Evaluation Metrics
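     For instance, regression metrics are exposed as aggregate functions (the `predictions` table with predicted and actual columns is an assumption):
       SELECT rmse(predicted, actual) AS rmse,
              mae(predicted, actual)  AS mae
       FROM predictions;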
  20. Map tiling functions
  21. Map tiling functions: tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^zoom (examples shown at zoom = 10 and zoom = 15).
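     A quick usage sketch (coordinates are illustrative; xtile/ytile expose the components that tile combines):
       -- tile id for a point near Tokyo at zoom level 12
       SELECT tile(35.6762, 139.6503, 12);
       SELECT xtile(139.6503, 12), ytile(35.6762, 12);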
  22. Sketch and NLP functions. An exact distinct count can be approximated by a sketch:
       SELECT count(distinct id) FROM data
       SELECT approx_count_distinct(id) FROM data
     Japanese tokenization with a user dictionary (the Japanese input and output strings did not survive this transcript):
       select tokenize_ja("…", "normal", null, null, "hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
  23. Other supported features.
     • Anomaly detection: ✓ Local Outlier Factor (LOF) ✓ ChangeFinder
     • Change-point detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation
     • Clustering / topic models: ✓ Online mini-batch LDA ✓ Online mini-batch pLSA
  24. Anomaly/change-point detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  25. Anomaly/Change-point Detection by ChangeFinder: take this… (input time series shown)
  26. Anomaly/Change-point Detection by ChangeFinder: …and do this! (detected outliers and change points shown)
  27. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data (reference as on slide 24).
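     A usage sketch mirroring the user guide (the `timeseries` table and threshold values are assumptions; changefinder emits both an outlier score and a change-point score):
       SELECT ts,
              changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035')
                AS (outlier_score, changepoint_score, is_outlier, is_changepoint)
       FROM timeseries
       ORDER BY ts;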
  28. Change-point detection by Singular Spectrum Transformation (SST).
     • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005.
     • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
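     A usage sketch (the table and the threshold value are assumptions):
       SELECT ts, sst(value, '-th 0.005') AS (changepoint_score, is_changepoint)
       FROM timeseries
       ORDER BY ts;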
  29. Online mini-batch LDA
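     A training sketch (the `docs` table and the option values are assumptions; each output row is a per-topic word weight):
       SELECT train_lda(features, '-topics 2 -iter 20') AS (label, word, lambda)
       FROM docs;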
  30. Probabilistic Latent Semantic Analysis - training
  31. Probabilistic Latent Semantic Analysis - predict
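     A sketch of the analogous pLSA call (the signature mirrors train_lda; table and output column names are assumptions):
       SELECT train_plsa(features, '-topics 2 -iter 20') AS (label, word, prob)
       FROM docs;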
  32. Future work for v0.5.2 and later (tracked in PR#91, PR#116, PR#58, PR#111, PR#135):
     ✓ Word2Vec support
     ✓ Multi-class Logistic Regression
     ✓ Field-aware Factorization Machines
     ✓ SLIM recommendation
     ✓ Merge Brickhouse UDFs
     ✓ XGBoost support
     ✓ LightGBM support
     ✓ Gradient Boosting
  33. Brickhouse functions:
       SELECT from_json(
         to_json(
           ARRAY(
             NAMED_STRUCT("country", "japan", "city", "tokyo"),
             NAMED_STRUCT("country", "japan", "city", "osaka")
           )
         ),
         'array<struct<city:string>>'
       );
  34. Prediction tracing of Decision Tree: trace how a prediction was made.
  35. XGBoost support in Hivemall (experimental; not yet supported in TD):
       SELECT train_xgboost_classifier(features, label) AS (model_id, model)
       FROM training_data;

       SELECT rowid, AVG(predicted) AS predicted
       FROM (
         -- predict with each model: join each test record with each model
         SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
         FROM xgboost_models CROSS JOIN test_data_with_id
       ) t
       GROUP BY rowid;
  36. Hivemall and Digdag
  37. Machine Learning Workflow using Digdag
  38. Machine Learning Workflow using Digdag (continued)