Introduction to Apache Hivemall v0.5.0

Talk at ApacheCon North America, 2018
https://apachecon.dukecon.org/acna/2018/#/scheduledEvent/0cbf85b79b554dee6

Published in: Data & Analytics

Introduction to Apache Hivemall v0.5.0

  1. 1. Introduction to Apache Hivemall v0.5.0: Machine Learning on Hive/Spark. Makoto YUI (@myui), Principal Engineer; Takashi Yamamuro (@maropu), Research Engineer. @ApacheHivemall. ApacheCon North America 2018
  2. 2. Plan of the talk. 1. Introduction to Hivemall: background, a quick walk-through of features and usage, what's new in v0.5.0, and the future roadmap. 2. Hivemall on Spark: the new top-k join enhancement, and a feature plan for supporting Spark 2.3 and feature selection. Slides available: bit.ly/hivemall-apachecon18
  3. 3. We released the first Apache release, v0.5.0, on Mar 3rd, 2018! hivemall.incubator.apache.org. We plan to start voting for the 2nd Apache release (v0.5.2) next month (Oct 2018).
  4. 4. What’s new in v0.5.0? Anomaly/Change Point Detection (algorithms: ChangeFinder, SST). Topic Modeling (Soft Clustering) (algorithms: LDA, pLSA). Hivemall on Spark 2.0/2.1/2.2 (SparkSQL/Dataframe support, Top-k data processing).
  5. 5. Background. Suppose … running machine learning on massive data stored in a data warehouse. Make it!
  6. 6. Running machine learning on massive data stored in a data warehouse. Concerns: Scalability? Data movement? Tool?
  7. 7. Approach #1: Typical Data Scientist’s Solution. Data warehouse, data preprocessing, machine learning. Small data?
  8. 8. Approach #2: Data Engineer’s Solution. Data warehouse, data preprocessing, machine learning.
  9. 9. Q: Is Dataframe a great idea for data (pre-)processing?
  10. 10. Q: Do you like it? (for production-ready data preprocessing) ☐ Yes ☐ No ☐ Maybe. I like it for simple data processing.
  11. 11. Q: Do you really like it? (for messy real-world data preprocessing) ☐ Yes ☐ No ☐ Maybe
  12. 12. Real-world ML pipelines (could be more complex): Datasource #1, Datasource #2, Datasource #3; Extract Feature; Join; Feature Scaling; Feature Hashing; Feature Engineering; Feature Selection; Train by Logistic Regression; Train by RandomForest; Train by Factorization Machines; Ensemble; Evaluate; Predict.
  13. 13. Q: Have you ever seen or written hundreds to thousands of lines of preprocessing in Dataframe? Hundreds of lines of SQL queries for data preprocessing are common.
  14. 14. Q: Is it fun to play with? (Scala/Python coding for trivial things) Do you write test code? In my opinion, notebook code is error-prone for production use.
  15. 15. My Suggestion: Data warehouse, data preprocessing, machine learning. Push more work back to the DB where the data resides (including some ML logic). + Scalability + Durability/Stability + Functionalities (UDFs, JSON, windowing functions). One size does not fit all, though ...
  16. 16. Machine Learning in SQL queries
  17. 17. BigQuery ML at Google I/O 2018. https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
  18. 18. Could I use ML-in-SQL in my cluster?
  19. 19. Open-source Machine Learning Solution for SQL-on-Hadoop. https://hivemall.apache.org (incubating)
  20. 20. What is Apache Hivemall: a scalable machine learning library built as a collection of Hive UDFs. Multi/Cross platform, Versatile, Scalable, Ease-of-use.
  21. 21. Hivemall is easy and scalable … Ease-of-use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code.
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This query automatically runs in parallel on Hadoop.
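
The query on slide 21 trains partial models in parallel map tasks and then averages the weights per feature in the reducers. The averaging step can be sketched in plain Python as follows. This is only an illustration under assumptions: the function name `average_models` and the sample weights are hypothetical and are not part of Hivemall.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across independently trained partial models.

    partial_models: iterable of dicts mapping feature -> weight, one dict per
    map task. Mirrors `SELECT feature, avg(weight) ... GROUP BY feature`:
    a feature is averaged only over the tasks that emitted it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical output of two map tasks.
mapper_outputs = [
    {"age": 0.5, "income": 1.0},
    {"age": 1.5, "clicks": -2.0},
]
averaged = average_models(mapper_outputs)
```

Because each map task only needs its own partition of the training data, the expensive part runs without coordination and the reducers do a cheap aggregation.
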
  22. 22. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
  23. 23. Hivemall’s Technology Stack. Machine Learning: Hivemall, MLlib. Query Processing: Hive, Pig, SparkSQL. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark. Resource Management: Apache YARN, MESOS. Distributed File System: Hadoop HDFS. Cloud Storage: Amazon S3.
  24. 24. Hivemall on Apache Hive
  25. 25. Hivemall on Apache Spark Dataframe
  26. 26. Hivemall on SparkSQL
  27. 27. Hivemall on Apache Pig
  28. 28. Online Prediction by Apache Streaming
  29. 29. List of Supported Algorithms. Classification: ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification. Regression: ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression. Tips: SCW is a good first choice. Try RandomForest if SCW does not work. Logistic regression is good for getting the probability of a positive class. Factorization Machines are good where features are sparse and categorical.
  30. 30. Generic Classifier/Regressor: OLD Style vs. New Style from v0.5.0
  31. 31. Generic Classifier/Regressor. Available loss functions. For Binary Classification: • HingeLoss • LogLoss (synonym: logistic) • SquaredHingeLoss • ModifiedHuberLoss. For Regression: • Squared Loss • Quantile Loss • Epsilon Insensitive Loss • Squared Epsilon Insensitive Loss • Huber Loss. Optimizer: • SGD • AdaGrad • AdaDelta • ADAM. Regularization: • L1 • L2 • ElasticNet • RDA. Other options: • Iteration support • mini-batch • Early stopping.
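
To make a few of the loss functions named on slide 31 concrete, here is a minimal Python sketch using their common textbook definitions. The function names, the label convention (+1/-1), and the default `delta` and `epsilon` values are assumptions for illustration; Hivemall's own implementation may scale or parameterize them differently.

```python
import math


def hinge_loss(score, label):
    """Hinge loss for a binary label in {-1, +1} and a raw model score."""
    return max(0.0, 1.0 - label * score)


def squared_hinge_loss(score, label):
    """Squared hinge loss: penalizes margin violations quadratically."""
    return hinge_loss(score, label) ** 2


def log_loss(score, label):
    """Logistic loss log(1 + exp(-label * score)), computed stably."""
    z = label * score
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def huber_loss(predicted, target, delta=1.0):
    """Huber loss: quadratic near zero, linear for large residuals."""
    residual = abs(target - predicted)
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)


def epsilon_insensitive_loss(predicted, target, epsilon=0.1):
    """Ignores residuals smaller than epsilon, linear beyond that."""
    return max(0.0, abs(target - predicted) - epsilon)
```

The classification losses take a raw score and a signed label, while the regression losses compare a prediction against a real-valued target, which is why the slide lists them separately.
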
  32. 32. RandomForest in Hivemall: Ensemble of Decision Trees
  33. 33. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vector input.
  34. 34. Prediction of RandomForest
  35. 35. Decision Tree Visualization
  36. 36. Decision Tree Visualization
  37. 37. XGBoost support in Hivemall (beta version)
      SELECT train_xgboost_classifier(features, label) as (model_id, model)
      FROM training_data
      SELECT rowid, AVG(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN test_data_with_id
      ) t
      GROUP BY rowid;
  38. 38. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular). Matrix Completion: ✓ Matrix Factorization ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
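
Slide 38 names `each_top_k` as the tool for recommending top-k items. The underlying idea, keeping the k best-scored items per group (for example per user), can be sketched in Python as below. The function `top_k_per_group` and its row layout are hypothetical and do not reproduce the signature of Hivemall's UDTF.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring items for every group.

    rows: iterable of (group, score, item) tuples.
    Result: {group: [(rank, score, item), ...]} with rank starting at 1.
    """
    groups = defaultdict(list)
    for group, score, item in rows:
        groups[group].append((score, item))

    result = {}
    for group, scored in groups.items():
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        result[group] = [
            (rank, score, item)
            for rank, (score, item) in enumerate(best, start=1)
        ]
    return result


# Hypothetical (user, predicted score, item) rows.
rows = [
    ("u1", 0.9, "a"),
    ("u1", 0.2, "b"),
    ("u1", 0.7, "c"),
    ("u2", 0.4, "d"),
]
ranked = top_k_per_group(rows, k=2)
```
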
  39. 39. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier. NLP: ✓ Basic English text Tokenizer ✓ English/Japanese/Chinese Tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision/recall@K, etc.
  40. 40. Feature Engineering: Feature Hashing
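
The transcript keeps only the title of slide 40, so as a reminder of what feature hashing does, here is a minimal Python sketch of the hashing trick. It assumes CRC32 from the standard library as a stand-in hash and a hypothetical bucket count; the hash function and defaults that Hivemall actually uses are not shown on the slide.

```python
import zlib


def hash_feature(name, num_buckets=2**24):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    # CRC32 is only a deterministic stand-in hash for this sketch.
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def hash_features(features, num_buckets=2**24):
    """Convert {feature name: value} into {bucket index: value}.

    Values of features that collide in the same bucket are summed.
    """
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, num_buckets)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed


# Hypothetical sparse feature vector with string feature names.
example = hash_features({"country=JP": 1.0, "device=mobile": 1.0, "price": 12.5})
```

The payoff is a fixed-size index space: no dictionary of feature names has to be built or shipped to workers, at the cost of occasional collisions.
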
  41. 41. Feature Engineering: Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
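
The "map ages into 3 bins" example on slide 41 can be sketched in Python as follows. The helper names `quantile_boundaries` and `assign_bin`, the inclusive quantile method, and the sample ages are assumptions for illustration, not Hivemall's functions or data.

```python
import bisect
import statistics


def quantile_boundaries(values, num_bins):
    """Bin boundaries that put roughly equal numbers of values in each bin."""
    cuts = statistics.quantiles(values, n=num_bins, method="inclusive")
    return [float("-inf")] + cuts + [float("inf")]


def assign_bin(value, boundaries):
    """Index of the bin whose interval (low, high] contains value."""
    return bisect.bisect_left(boundaries, value) - 1


# Hypothetical ages.
ages = [10, 20, 30, 40, 50, 60, 70]
boundaries = quantile_boundaries(ages, num_bins=3)
binned = [assign_bin(age, boundaries) for age in ages]
```

Open-ended first and last boundaries mean that values outside the range seen at training time still fall into a valid bin.
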
  42. 42. Feature Engineering: Feature Binning
  43. 43. Evaluation Metrics
  44. 44. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA. Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation.
  45. 45. Anomaly/Change-point Detection by ChangeFinder. An efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  46. 46. Anomaly/Change-point Detection by ChangeFinder. Take this…
  47. 47. Anomaly/Change-point Detection by ChangeFinder. …and do this!
  48. 48. Change-point detection by Singular Spectrum Transformation. • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005. • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  49. 49. Online mini-batch LDA
  50. 50. Probabilistic Latent Semantic Analysis: training
  51. 51. Probabilistic Latent Semantic Analysis: prediction
  52. 52. What’s new in the coming v0.5.2: ✓ Spark 2.3 support ✓ Merged Brickhouse UDFs ✓ Field-aware Factorization Machines (state-of-the-art method for CTR prediction, an algorithm often used in Kaggle) ✓ SLIM recommendation (a very promising algorithm for top-k recommendation). References: Xia Ning and George Karypis, "SLIM: Sparse Linear Methods for Top-N Recommender Systems", Proc. ICDM, 2011. Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin, "Field-aware Factorization Machines for CTR Prediction", Proc. RecSys, 2016.
  53. 53. Future work for v0.6 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ More efficient XGBoost support ✓ LightGBM support ✓ Gradient Boosting ✓ Kafka KSQL UDF porting. PR#91 PR#116
  54. 54. Copyright©2018 NTT corp. All Rights Reserved.
  55. 55. (slide text not recoverable)
  56. 56. (slide text not recoverable)
  57. 57. (slide text not recoverable)
  58. 58. (slide text not recoverable)
  59. 59. (slide text not recoverable)
  60. 60. (slide text not recoverable)
  61. 61. (slide text not recoverable)
  62. 62. (slide text not recoverable)
  63. 63. (slide text not recoverable)
  64. 64. (slide text not recoverable)
  65. 65. // Downloads Spark v2.3 and launches a spark-shell with Hivemall (remaining slide text not recoverable)
  66. 66. (slide text not recoverable)
  67. 67. (slide text not recoverable)
  68. 68. (slide text not recoverable)
  69. 69. (slide text not recoverable)
  70. 70. (slide text not recoverable)
  71. 71. (slide text not recoverable)
  72. 72. (slide text not recoverable)
  73. 73. (slide text not recoverable)
  74. 74. (slide text not recoverable)
  75. 75. K-length priority queue. Computes top-K rows by using a priority queue.
  76. 76. K-length priority queue. Computes top-K rows by using a priority queue. Only joins top-K rows.
  77. 77. (slide text not recoverable)
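
Slides 75 and 76 describe computing the top-K rows with a K-length priority queue and then joining only those rows. A minimal Python sketch of that idea is shown below. It is one reading of the surviving slide text, not the Spark implementation: the names `top_k_rows` and `top_k_join`, the row layout, and the lookup-table join are all assumptions.

```python
import heapq


def top_k_rows(rows, k, score):
    """Keep the k highest-scoring rows using a bounded min-heap of size k."""
    heap = []  # entries are (score, sequence number, row); lowest score on top
    for seq, row in enumerate(rows):
        entry = (score(row), seq, row)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new row beats the current k-th best, so replace it.
            heapq.heapreplace(heap, entry)
    return [row for _, _, row in sorted(heap, reverse=True)]


def top_k_join(left_rows, right_by_key, k, score, key):
    """Join only the top-k left rows against a keyed right-hand table."""
    return [
        (row, right_by_key[key(row)])
        for row in top_k_rows(left_rows, k, score)
        if key(row) in right_by_key
    ]


# Hypothetical scored rows and a lookup table to join against.
scored = [
    {"id": 1, "s": 0.2},
    {"id": 2, "s": 0.9},
    {"id": 3, "s": 0.5},
    {"id": 4, "s": 0.7},
]
names = {1: "a", 2: "b", 3: "c", 4: "d"}
joined = top_k_join(scored, names, k=2, score=lambda r: r["s"], key=lambda r: r["id"])
```

The heap never holds more than K entries, so memory stays constant in the input size, and the join touches K rows instead of the whole table.
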
  78. 78. (slide text not recoverable)
  79. 79. (slide text not recoverable)
  80. 80. (slide text not recoverable)
  81. 81. Data Extraction (e.g., by SQL), Feature Selection (e.g., by scikit-learn), Selected Features. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proceedings of SIGMOD, 2016.
  82. 82. Data Extraction + Feature Selection. Join Pruning by Data Statistics. Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, "To Join or Not to Join?: Thinking Twice about Joins before Feature Selection", Proceedings of SIGMOD, 2016.
  83. 83. Conclusion and Takeaway. Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/Dataframe API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. The 2nd Apache release (v0.5.2) will appear soon! We welcome your contributions to Apache Hivemall.
  84. 84. Thank you! Questions? Mentors wanted!