
What's new in Apache Hivemall v0.5.0

Published on

April 17, 2018 at Dots
https://techplay.jp/event/663945

Published in: Data & Analytics

  1. Hivemall v0.5.0. Makoto YUI (@myui), Research Engineer, Treasure Data. @ApacheHivemall. 2018/4/17 Hivemall meetup
  2. v0.5.0
  3. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Key properties: multi/cross-platform, versatile, scalable, ease of use.
  4. Hivemall is easy and scalable
     Ease-of-use: ML made easy for SQL developers
     Scalable: born to be parallel and scalable
     100+ lines of code
       CREATE TABLE lr_model AS
       SELECT
         feature, -- reducers perform model averaging in parallel
         avg(weight) as weight
       FROM (
         SELECT logress(features,label,..) as (feature,weight)
         FROM train
       ) t -- map-only task
       GROUP BY feature; -- shuffled to reducers
     This query automatically runs in parallel on Hadoop.
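The model-averaging step in the query above can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall code: `average_weights` and the sample rows are hypothetical, and each inner list stands in for the (feature, weight) rows that one map task would emit.

```python
from collections import defaultdict


def average_weights(partial_models):
    """Average per-feature weights emitted by independent learners.

    Each element of `partial_models` is a list of (feature, weight) pairs,
    standing in for the rows produced by one map task.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in partial_models:
        for feature, weight in rows:
            sums[feature] += weight
            counts[feature] += 1
    # Same effect as: SELECT feature, avg(weight) ... GROUP BY feature
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical output of two map tasks.
mapper_outputs = [
    [("age", 0.5), ("income", 1.0)],
    [("age", 1.5), ("clicks", -2.0)],
]
model = average_weights(mapper_outputs)
```

Features seen by only one learner keep that learner's weight, which mirrors how `avg` behaves over a single-row group.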
  5. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
  6. Hivemall on Apache Hive
  7. Hivemall on Apache Spark Dataframe
  8. Hivemall on SparkSQL
  9. Hivemall on Apache Pig
  10. Online Prediction by Apache Streaming
  11. What’s new in v0.5.0?
      • Anomaly/Change Point Detection. Algorithms: ChangeFinder, SST
      • Topic Modeling (Soft Clustering). Algorithms: LDA, pLSA
      • Hivemall on Spark v2.0/v2.1/v2.2: SparkSQL/Dataframe support, Top-k data processing
  12. Generic Classifier/Regressor: old style vs. new style from v0.5.0
  13. Generic Classifier/Regressor
      Available loss functions
        For binary classification: • HingeLoss • LogLoss (synonym: logistic) • SquaredHingeLoss • ModifiedHuberLoss
        For regression: • Squared Loss • Quantile Loss • Epsilon Insensitive Loss • Squared Epsilon Insensitive Loss • Huber Loss
      Optimizer: • SGD • AdaGrad • AdaDelta • ADAM
      Regularization: • L1 • L2 • ElasticNet • RDA
      Other options: • Iteration support • mini-batch • Early stopping
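For readers unfamiliar with the loss names on this slide, here is a minimal Python sketch of a few of them using textbook definitions. It is an illustration only: the function names, the 0.5 scaling, and the default `epsilon` and `delta` values are assumptions and may differ from Hivemall's implementation. Labels for the classification losses are assumed to be -1 or +1.

```python
import math


def hinge_loss(p, y):
    """Hinge loss for prediction p and label y in {-1, +1}."""
    return max(0.0, 1.0 - y * p)


def squared_hinge_loss(p, y):
    """Square of the hinge loss."""
    return hinge_loss(p, y) ** 2


def log_loss(p, y):
    """Logistic loss log(1 + exp(-y * p)), computed without overflow."""
    z = y * p
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def squared_loss(p, y):
    """Squared error with a conventional 0.5 factor (assumed scaling)."""
    return 0.5 * (p - y) ** 2


def epsilon_insensitive_loss(p, y, epsilon=0.1):
    """Zero inside the epsilon tube around y, linear outside it."""
    return max(0.0, abs(y - p) - epsilon)


def huber_loss(p, y, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(y - p)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

The classification losses penalize small or negative margins `y * p`, while the regression losses depend only on the residual `y - p`.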
  14. Generic Classifier/Regressor: Hyperparameters (AdaGrad + RDA by default)
      -eta0 <arg>                          The initial learning rate [default: 0.1]
      -iter,--iterations <arg>             The maximum number of iterations [default: 10]
      -lambda <arg>                        Regularization term [default: 0.0001]
      -loss,--loss_function <arg>          Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
      -mini_batch,--mini_batch_size <arg>  Mini batch size [default: 1]. Expecting the value in range [1,100] or so.
      -opt,--optimizer <arg>               Optimizer to update weights [default: adagrad, sgd, adadelta, adam]
      -reg,--regularization <arg>          Regularization type [default: rda, l1, l2, elasticnet]
  15. RandomForest in Hivemall: an ensemble of decision trees
  16. What’s OOB in RandomForests? Uniform/stratified sampling. Image borrowed from http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
  17. Stratified Sampling (source: https://bellcurve.jp/statistics/course/8007.html)
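Stratified sampling draws rows separately from each class so that the sample keeps the class proportions of the full data set. A minimal Python sketch follows; `stratified_sample` is a hypothetical helper that samples without replacement, which is a simplification and not necessarily how Hivemall's `-stratified` option is implemented.

```python
import random
from collections import defaultdict


def stratified_sample(labels, rate, seed=0):
    """Return sorted row indices, sampled at `rate` within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class so rare classes never vanish.
        k = max(1, round(len(indices) * rate))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Unbalanced toy data: 90 negative rows followed by 10 positive rows.
labels = [0] * 90 + [1] * 10
sample = stratified_sample(labels, rate=0.5)
```

With uniform sampling the ten positive rows could easily be under-represented in a small sample; here exactly half of each class is kept.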
  18. What’s OOB in RandomForests? Data not used for training (out-of-bag data) is used to evaluate the model's accuracy. Image source: http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
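The out-of-bag idea can be sketched in a few lines of Python. This is an illustration under simple assumptions (uniform bootstrap sampling with replacement, one tree); `bootstrap_split` is a hypothetical helper, not a Hivemall function.

```python
import random


def bootstrap_split(n_rows, seed=0):
    """Draw a bootstrap sample and return (in_bag, out_of_bag) row indices."""
    rng = random.Random(seed)
    # Sample n_rows indices with replacement: these rows train one tree.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    # Rows never drawn are "out of bag" and can be used to evaluate that tree.
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


in_bag, oob = bootstrap_split(1000)
```

On average roughly 37% of the rows end up out of bag for each tree, so every tree gets a held-out evaluation set without a separate validation split.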
  19. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vectors!
      train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
  20. Supported Feature Vector Format of Random Forests
      • Dense Vector (array<double>), e.g. 1.0, 0.0, 3.0
      • Sparse Vector (array<string>) in a LIBSVM format, e.g. 1:1.0, 2:0.0, 3:3.0 or 1:1.0, 3:3.0 or 1:1.0, 3
      • feature := <index>[":"<value>], where index := <integer> starting with 1 (index = 0 is reserved for the bias clause) and value := <floating point> (default 1.0 if not provided)
        select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));
        ["1828616:3.3","6238429:4.999","6238429"]
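The sparse feature grammar above is small enough to parse by hand. The following Python sketch follows the rule as stated on the slide (1-based index, value defaulting to 1.0); `parse_feature` and `to_dense` are hypothetical helper names for illustration, not part of Hivemall.

```python
def parse_feature(token):
    """Parse '<index>[:<value>]' into an (index, value) pair."""
    index_part, sep, value_part = token.strip().partition(":")
    index = int(index_part)
    # The value is optional and defaults to 1.0 when omitted.
    value = float(value_part) if sep else 1.0
    return index, value


def to_dense(sparse, size):
    """Expand a sparse feature list into a dense list of length `size`."""
    dense = [0.0] * size
    for token in sparse:
        index, value = parse_feature(token)
        if not 1 <= index <= size:
            raise ValueError(f"index out of range: {index}")
        dense[index - 1] = value  # indices start at 1; 0 is reserved
    return dense
```

For example, `to_dense(["1:1.0", "3:3.0"], 3)` yields the same dense vector as the slide's `1.0, 0.0, 3.0`, while `["1:1.0", "3"]` sets the third feature to the default value 1.0.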
  21. Feature Engineering – Feature Hashing
  22. Random Forests Training Hyperparameters
      -attrs,--attribute_types <arg>     Comma-separated attribute types (Q for quantitative variable and C for categorical variable, e.g., [Q,C,Q,C])
      -depth,--max_depth <arg>           The maximum tree depth [default: Integer.MAX_VALUE]
      -leafs,--max_leaf_nodes <arg>      The maximum number of leaf nodes [default: Integer.MAX_VALUE]
      -min_samples_leaf <arg>            The minimum number of samples in a leaf node [default: 1]
      -rule,--split_rule <arg>           Split algorithm [default: GINI, ENTROPY, CLASSIFICATION_ERROR]
      -seed <arg>                        Seed value in long [default: -1 (random)]
      -splits,--min_split <arg>          A node that has greater than or equal to `min_split` examples will split [default: 2]
      -stratified,--stratified_sampling  Enable stratified sampling for unbalanced data
      -subsample <arg>                   Sampling rate in range (0.0,1.0]
      -trees,--num_trees <arg>           The number of trees for each task [default: 50]
      -vars,--num_variables <arg>        The number of randomly selected features [default: ceil(sqrt(x[0].length))]. int(num_variables * x[0].length) is used if num_variables is in (0.0,1.0]
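The `-vars` rule combines a default with a ratio interpretation, which is easier to see as code. This Python sketch is one reading of the option text above; `resolve_num_variables` is a hypothetical name, and treating values above 1.0 as absolute counts is an assumption that the slide does not state.

```python
import math


def resolve_num_variables(num_features, num_variables=None):
    """Number of features to try per split, following the -vars description."""
    if num_variables is None:
        # Default: ceil(sqrt(number of features))
        return math.ceil(math.sqrt(num_features))
    if 0.0 < num_variables <= 1.0:
        # Values in (0.0, 1.0] are treated as a ratio of the feature count.
        return int(num_variables * num_features)
    # Assumption: anything larger is an absolute number of features.
    return int(num_variables)
```

With 10 features the default is 4, and `-vars 0.5` on 100 features gives 50.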
  23. Prediction of RandomForest. The output includes a posterior probability based on voting over the classes predicted by the decision trees, and model reliability based on the OOB error rate.
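The voting-based posterior can be sketched as follows. This is a plain, unweighted majority vote in Python for illustration; `vote` is a hypothetical helper, and the OOB-based reliability weighting mentioned on the slide is not modeled here.

```python
from collections import Counter


def vote(tree_predictions):
    """Return (majority label, per-class vote share) for one test row."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    total = len(tree_predictions)
    # Vote share serves as a simple posterior estimate for each class.
    posterior = {cls: n / total for cls, n in counts.items()}
    return label, posterior


# Hypothetical predictions from four trees for the same row.
label, posterior = vote([1, 0, 1, 1])
```

Here three of four trees predict class 1, so the row is labeled 1 with a posterior of 0.75.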
  24. Decision Tree Visualization
  25. Decision Tree Visualization http://viz-js.com/
  26. Feature Engineering – Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
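Quantile-based binning can be sketched with the Python standard library. This is an illustration of the general technique, not Hivemall's implementation; the helper names and the sample ages are made up, and the exact cut points depend on the quantile method used.

```python
import bisect
import statistics


def quantile_cuts(values, num_bins):
    """Return the num_bins - 1 cut points that split values by quantile."""
    return statistics.quantiles(values, n=num_bins)


def assign_bin(value, cuts):
    """Return the 0-based bin index of `value` for sorted cut points."""
    return bisect.bisect_right(cuts, value)


# Hypothetical ages mapped into 3 bins.
ages = [18, 22, 25, 33, 41, 47, 52, 60, 68]
cuts = quantile_cuts(ages, 3)
bins = [assign_bin(age, cuts) for age in ages]
```

Because the cut points come from the data's own distribution, each bin receives a similar number of rows even when the ages are skewed.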
  27. Feature Engineering – Feature Binning
  28. Evaluation Metrics
  29. Map tiling functions
  30. Map tiling functions: Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n, where n is the zoom level. Examples shown at Zoom=10 and Zoom=15.
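The tile formula can be tried out with a short Python sketch. Two assumptions are made here: the exponent in 2^n is taken to be the zoom level, and `xtile`/`ytile` are assumed to follow the widely used Web Mercator "slippy map" tile convention. Hivemall's own functions may differ in edge-case handling.

```python
import math


def xtile(lon, zoom):
    """Column of the tile containing longitude `lon` at `zoom`."""
    n = 1 << zoom  # number of tiles per axis: 2^zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range


def ytile(lat, zoom):
    """Row of the tile containing latitude `lat` at `zoom`."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)


def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

Multiplying the row by 2^zoom gives every tile at a zoom level a unique integer id in the range [0, 4^zoom), which makes it usable as a grouping key.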
  31. Sketch and NLP functions
        SELECT count(distinct id) FROM data
        SELECT approx_count_distinct(id) FROM data
        select tokenize_ja("[Japanese input text]", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
        [Japanese output tokens]
  32. Other Supported Features
      Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder
      Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA
      Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation
  33. Anomaly/Change-point Detection by ChangeFinder. An efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  34. Anomaly/Change-point Detection by ChangeFinder: take this…
  35. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  36. Anomaly/Change-point Detection by ChangeFinder. An efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  37. Change-point detection by Singular Spectrum Transformation
      • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
      • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  38. Online mini-batch LDA
  39. Probabilistic Latent Semantic Analysis - training
  40. Probabilistic Latent Semantic Analysis - predict
  41. Future work for v0.5.2 and later
      ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ Merge Brickhouse UDFs ✓ XGBoost support ✓ LightGBM support ✓ Gradient Boosting
      Related pull requests: PR#91, PR#116, PR#58, PR#111, PR#135
  42. Brickhouse functions https://github.com/klout/brickhouse
        SELECT from_json(
          to_json(
            ARRAY(
              NAMED_STRUCT("country", "japan", "city", "tokyo"),
              NAMED_STRUCT("country", "japan", "city", "osaka")
            )
          ),
          'array<struct<city:string>>'
        )
  43. Prediction tracing of Decision Tree: trace how a prediction was made
  44. XGBoost support in Hivemall (experimental; not yet supported in TD)
        SELECT train_xgboost_classifier(features, label) as (model_id, model)
        FROM training_data
        SELECT rowid, AVG(predicted) as predicted
        FROM (
          -- predict with each model
          SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
          -- join each test record with each model
          FROM xgboost_models CROSS JOIN test_data_with_id
        ) t
        GROUP BY rowid;
  45. Conclusion and Takeaway. Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/Dataframe API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. Try out the first Apache release (v0.5.0)! We welcome your contributions to Apache Hivemall.
  46. Any feature requests or questions? BTW, we are hiring!
  47. Hivemall and Digdag
  48. Machine Learning Workflow using Digdag
  49. Machine Learning Workflow using Digdag
