Incubating Apache Hivemall

Talk at Plazma OSS day on Feb 15, 2018.
https://techplay.jp/event/650389

Published in: Data & Analytics

  1. Incubating Apache Hivemall. Makoto YUI (@myui), Research Engineer, Treasure Data. @ApacheHivemall. 2018/2/15, Plazma OSS day.
  2. Hivemall entered the Apache Incubator on Sept 13, 2016. Since then, we have invited 3 contributors as new committers (one committer has been voted in as a PPMC member). Currently, we are working toward the first Apache release (v0.5.0). hivemall.incubator.apache.org
  3. What's new in v0.5.0? Anomaly/Change Point Detection (algorithms: ChangeFinder, SST). Topic Modeling (Soft Clustering) (algorithms: LDA, pLSA). Hivemall on Spark 2.0/2.1: SparkSQL/Dataframe support, Top-k data processing.
  4. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Key properties: Ease-of-use, Scalable, Versatile, Multi/Cross platform.
  5. Hivemall is easy and scalable. Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code. This query automatically runs in parallel on Hadoop:
          CREATE TABLE lr_model AS
          SELECT
            feature, -- reducers perform model averaging in parallel
            avg(weight) as weight
          FROM (
            SELECT logress(features,label,..) as (feature,weight)
            FROM train
          ) t -- map-only task
          GROUP BY feature; -- shuffled to reducers
  6. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.
  7. Hivemall's Technology Stack. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3.
  8. Hivemall on Apache Hive
  9. Hivemall on Apache Spark Dataframe
  10. Hivemall on SparkSQL
  11. Hivemall on Apache Pig
  12. Online Prediction by Apache Streaming
  13. Generic Classifier/Regressor: OLD style vs. new style from v0.5.0.
  14. Generic Classifier/Regressor. Available loss functions for binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss. Available loss functions for regression: Squared Loss, Quantile Loss, Epsilon Insensitive Loss, Squared Epsilon Insensitive Loss, Huber Loss. Optimizer: SGD, AdaGrad, AdaDelta, ADAM. Regularization: L1, L2, ElasticNet, RDA. Other options: iteration support, mini-batch, early stopping.
  15. Hivemall is a versatile library. ✓ Not only for machine learning. ✓ Provides a bunch of generic utility functions. Each organization has its own set of UDFs for data preprocessing. Don't Repeat Yourself!
  16. Hivemall generic functions: Array and Map, Bit and compress, String and NLP, Geo Spatial, Top-k processing. Examples: TF/IDF, TILE, MAP_URL. Brickhouse UDFs are merged in the v0.5.2 release. We welcome contributions of your generic UDFs to Hivemall.
  17. Top-k query processing. Example table (student, class, score): (1, b, 70), (2, a, 80), (3, a, 90), (4, b, 50), (5, a, 70), (6, b, 60). List the top-2 students for each class:
          SELECT * FROM (
            SELECT *, rank() over (partition by class order by score desc) as rank
            FROM table
          ) t
          WHERE rank <= 2
      This RANK() OVER query does not finish within 24 hours when there are 20 million MOOC classes with an average of 1,000 students in each class.
  18. Top-k query processing. Same example table and task (list the top-2 students for each class), using each_top_k:
          SELECT each_top_k(2, class, score, class, student) as (rank, score, class, student)
          FROM (
            SELECT * FROM table
            DISTRIBUTE BY class SORT BY class
          ) t
      EACH_TOP_K finishes in 2 hours.
  19. Map tiling functions
  20. Map tiling functions. Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n. Examples shown: Zoom=10, Zoom=15.
  21. More useful functions (Sketch, NLP). Exact distinct count: SELECT count(distinct id) FROM data. Approximate distinct count: SELECT approx_count_distinct(id) FROM data. Japanese tokenization with a user dictionary: select tokenize_ja("", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz"); (the Japanese input string and the resulting token array did not survive in this transcript).
  22. List of Supported Algorithms. Classification: ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification. Regression: ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression. Tips: SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines work well where features are sparse and categorical.
  23. RandomForest in Hivemall: an ensemble of decision trees
  24. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vectors!
  25. Prediction of RandomForest
  26. Decision Tree Visualization
  27. Decision Tree Visualization
  28. XGBoost support in Hivemall (beta version). Training:
          SELECT train_xgboost_classifier(features, label) as (model_id, model)
          FROM training_data
      Prediction:
          SELECT rowid, AVG(predicted) as predicted
          FROM (
            -- predict with each model
            SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
            -- join each test record with each model
            FROM xgboost_models CROSS JOIN test_data_with_id
          ) t
          GROUP BY rowid;
  29. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular). Matrix Completion: ✓ Matrix Factorization ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
  30. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier. NLP: ✓ Basic English text Tokenizer ✓ Japanese Tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision, recall@K, etc.
  31. Feature Engineering: Feature Hashing
  32. Feature Engineering: Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
  33. Feature Engineering: Feature Binning
  34. Evaluation Metrics
  35. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA. Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation.
  36. Anomaly/Change-point Detection by ChangeFinder. An efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  37. Anomaly/Change-point Detection by ChangeFinder. Take this…
  38. Anomaly/Change-point Detection by ChangeFinder. …and do this!
  39. Anomaly/Change-point Detection by ChangeFinder. An efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  40. Change-point detection by Singular Spectrum Transformation. T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005. T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  41. Online mini-batch LDA
  42. Probabilistic Latent Semantic Analysis - training
  43. Probabilistic Latent Semantic Analysis - predict
  44. Future work for v0.5.2 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ More efficient XGBoost support ✓ LightGBM support ✓ DecisionTree prediction tracing ✓ Gradient Boosting. Pull requests shown on the slide: PR#91, PR#116, PR#58, PR#111.
  45. Future work for v0.5.2 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ More efficient XGBoost support ✓ LightGBM support ✓ DecisionTree prediction tracing ✓ Gradient Boosting. Pull requests shown on the slide: PR#91, PR#116, PR#58, PR#111.
  46. Conclusion and Takeaway. Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/Dataframe API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. The first Apache release (v0.5.0) will appear soon! We welcome your contributions to Apache Hivemall.
  47. Any feature requests or questions? BTW, we are hiring!
  48. Hivemall Digdag
  49. Machine Learning Workflow using Digdag
  50. Machine Learning Workflow using Digdag
  51. Feature Selection: Signal Noise Ratio
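
The top-k idea from slides 17 and 18 can be illustrated outside Hive. The slides do not show how each_top_k is implemented, so the following is only a rough sketch of the general technique, assuming a bounded min-heap per group: each class keeps at most k candidates, so no class is ever fully sorted. The Python function name `each_top_k` and its signature are chosen here for illustration and are not Hivemall's API; the rows are the example table from the slides.

```python
import heapq
from collections import defaultdict


def each_top_k(rows, k):
    """Return (rank, score, class, student) for the k best scores per class.

    Illustrative only: a bounded min-heap per class, not Hivemall's code.
    """
    heaps = defaultdict(list)  # class -> min-heap of (score, student), size <= k
    for student, cls, score in rows:
        heap = heaps[cls]
        if len(heap) < k:
            heapq.heappush(heap, (score, student))
        elif (score, student) > heap[0]:
            # The new row beats the weakest kept candidate, so swap it in.
            heapq.heapreplace(heap, (score, student))

    result = []
    for cls in sorted(heaps):
        ranked = sorted(heaps[cls], reverse=True)  # best score first
        for rank, (score, student) in enumerate(ranked, start=1):
            result.append((rank, score, cls, student))
    return result


# Example table from the slides: (student, class, score)
rows = [
    (1, "b", 70), (2, "a", 80), (3, "a", 90),
    (4, "b", 50), (5, "a", 70), (6, "b", 60),
]
top2 = each_top_k(rows, 2)
for row in top2:
    print(row)
```

Unlike rank(), this sketch breaks ties by student id and always returns at most k rows per class. Memory per class is bounded by k, which is the property that makes the approach attractive when there are many groups.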

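The tile formula on slide 20 can be sketched in a few lines. This is a hedged illustration, not Hivemall's implementation: it assumes that `n` in `2^n` is the zoom level, and that xtile and ytile follow the widely used Web Mercator ("slippy map") tile numbering, neither of which the slide states explicitly. The clamping of latitude and tile indices is also an assumption added to keep the sketch well-defined at the edges.

```python
import math

MAX_LAT = 85.05112878  # Web Mercator latitude limit (assumed convention)


def xtile(lon, zoom):
    """Column index of the tile containing the longitude at this zoom."""
    n = 1 << zoom
    x = math.floor((lon + 180.0) / 360.0 * n)
    return min(max(x, 0), n - 1)


def ytile(lat, zoom):
    """Row index of the tile containing the latitude at this zoom."""
    n = 1 << zoom
    lat = min(max(lat, -MAX_LAT), MAX_LAT)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return min(max(y, 0), n - 1)


def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom (assuming n == zoom)."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)


# Central Tokyo at the two zoom levels mentioned on the slide
for zoom in (10, 15):
    print(zoom, tile(35.6895, 139.6917, zoom))
```

Because each zoom level has 2^zoom columns, multiplying the row index by 2^zoom gives every tile a distinct integer, which makes the tile number usable as a grouping key for nearby points.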