2nd Hivemall meetup 20151020

Talk at http://eventdots.jp/event/571107

Published in: Engineering

  1. Introduction to Hivemall and its new features in v0.4. Research Engineer Makoto YUI (@myui). 2015/10/20, Hivemall meetup #2. Tweet w/ #hivemallmtup. http://eventdots.jp/event/571107
  2. Who am I?
     • 2015.04: Joined Treasure Data, Inc. as its first Research Engineer. My mission at TD is developing ML-as-a-Service.
     • 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale machine learning project and parallel databases.
     • 2009.03: Ph.D. in Computer Science from NAIST.
     • Super programmer award from the MITOU Foundation.
  3. Agenda
     1. What is Hivemall
     2. How to use Hivemall
     3. New Features in Hivemall v0.4: (1) Random Forest, (2) Factorization Machine
     4. Development Roadmap of Hivemall
  4. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall
  5. What is Hivemall (stack diagram, top layer first)
     • Machine Learning: Hivemall
     • Query Processing: Hive / Pig
     • Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
     • Resource Management: Apache YARN
     • Distributed File System: Hadoop HDFS
  6. Hivemall’s Vision: ML on SQL
     • Machine learning made easy for SQL developers (ML for the rest of us)
     • Interactive and stable APIs w/ SQL abstraction
     • The slide contrasts classification with Mahout against this SQL query, which automatically runs in parallel on Hadoop:
         CREATE TABLE lr_model AS
         SELECT
           feature, -- reducers perform model averaging in parallel
           avg(weight) as weight
         FROM (
           SELECT logress(features,label,..) as (feature,weight)
           FROM train
         ) t -- map-only task
         GROUP BY feature; -- shuffled to reducers
  7. List of Features in Hivemall v0.3.2 (Treasure Data supports Hivemall v0.3.2-3)
     • Classification (both binary- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA
     • Regression: Logistic Regression (SGD); PA Regression; AROW Regression; AdaGrad; AdaDELTA
     • kNN and Recommendation: Minhash and b-Bit Minhash (LSH variant); Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular); Matrix Factorization
     • Feature engineering: Feature Hashing; Feature Scaling (normalization, z-score); TF-IDF vectorizer; Polynomial Expansion
     • Anomaly Detection: Local Outlier Factor
  8. Industry use cases of Hivemall
     • CTR prediction of ad click logs. Algorithm: Logistic regression. Freakout Inc. and more
     • Gender prediction of ad click logs. Algorithm: Classification. Scaleout Inc.
     • Churn detection. Algorithm: Regression. OISIX and more
     • Item/user recommendation. Algorithm: Recommendation (Matrix Factorization / kNN). Adtech companies, ISP portal, and more
     • Value prediction of real estate. Algorithm: Regression. Livesense
  9. How to use Hivemall (workflow diagram: labeled feature vectors go through training to produce a prediction model; the model then predicts a label for each new feature vector). Current step: Data preparation
  10. How to use Hivemall - Data preparation. Define a Hive table for training/testing data:
         CREATE EXTERNAL TABLE e2006tfidf_train (
           rowid int,
           label float,
           features ARRAY<STRING>
         )
         ROW FORMAT DELIMITED
           FIELDS TERMINATED BY '\t'
           COLLECTION ITEMS TERMINATED BY ","
         STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
  11. How to use Hivemall (same workflow diagram). Current step: Feature Engineering
  12. How to use Hivemall - Feature Engineering. Applying min-max normalization: the label value is transformed to a value between 0.0 and 1.0.
         create view e2006tfidf_train_scaled as
         select
           rowid,
           rescale(label, ${min_label}, ${max_label}) as label,
           features
         from e2006tfidf_train;
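The min-max step on slide 12 can be sketched outside Hive as follows. This is a minimal illustration, assuming `rescale(value, min, max)` performs standard min-max scaling; the handling of a zero-width range and the sample labels are choices of this sketch, not taken from the slides.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max scale `value` into [0.0, 1.0] given the observed range."""
    if max_value == min_value:
        # Degenerate range (all values identical): 0.5 is an arbitrary
        # choice made by this sketch, not documented behaviour.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values, for illustration only.
labels = [-3.0, -1.0, 1.0]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the Hive query, `${min_label}` and `${max_label}` play the role of `lo` and `hi` and would be computed beforehand over the training table.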
  13. How to use Hivemall (same workflow diagram). Current step: Training
  14. How to use Hivemall - Training. Training by logistic regression:
         CREATE TABLE lr_model AS
         SELECT
           feature,
           avg(weight) as weight
         FROM (
           SELECT logress(features,label,..) as (feature,weight)
           FROM train
         ) t
         GROUP BY feature
     • The inner query is a map-only task that learns a prediction model.
     • Map outputs are shuffled to reducers by feature.
     • Reducers perform model averaging in parallel.
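The model-averaging step (`avg(weight) ... GROUP BY feature`) can be pictured with a small sketch. It assumes each mapper emits its own `(feature, weight)` pairs; the function name and the sample weights are hypothetical.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across models learned independently by each mapper."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            totals[feature] += weight
            counts[feature] += 1
    # Same effect as: SELECT feature, avg(weight) ... GROUP BY feature
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical outputs of two map tasks.
mapper_outputs = [
    {"f1": 0.5, "f2": -1.0},
    {"f1": 1.5, "f3": 2.0},
]
model = average_models(mapper_outputs)
```

As in the SQL, a feature is averaged only over the mappers that actually emitted a weight for it.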
  15. How to use Hivemall - Training. Training of the Confidence Weighted (CW) classifier:
         CREATE TABLE news20b_cw_model1 AS
         SELECT
           feature,
           voted_avg(weight) as weight
         FROM (
           SELECT train_cw(features,label) as (feature,weight)
           FROM news20b_train
         ) t
         GROUP BY feature
     • voted_avg votes on whether to use the negative or the positive weights for the average, e.g., for +0.7, +0.3, +0.2, -0.1, +0.7
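The voting idea behind `voted_avg` can be sketched as below, applied to the weights shown on slide 15. This assumes the sign held by the majority wins and only those weights are averaged; the treatment of ties, zeros, and empty input is an assumption of this sketch rather than documented behaviour.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption of this sketch: ties go to the negative side, zeros are ignored.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0


# Weights from the slide: four positive votes against one negative vote.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

Here the positive side wins, so the single negative weight is discarded and the result is the mean of the four positive weights, roughly 0.475.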
  16. Ensemble learning for stable prediction performance. Just stack prediction models by union all:
         create table news20mc_ensemble_model1 as
         select
           label,
           cast(feature as int) as feature,
           cast(voted_avg(weight) as float) as weight
         from (
           select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
           from news20mc_train_x3
           union all
           select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
           from news20mc_train_x3
           union all
           select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
           from news20mc_train_x3
         ) t
         group by label,feature;
  17. How to use Hivemall (same workflow diagram). Current step: Prediction
  18. How to use Hivemall - Prediction
         CREATE TABLE lr_predict as
         SELECT
           t.rowid,
           sigmoid(sum(m.weight)) as prob
         FROM testing_exploded t
         LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
         GROUP BY t.rowid
     • Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model.
     • There is no need to load the entire model into memory.
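The join-based prediction on slide 18 can be mirrored in a few lines. This is a sketch under stated assumptions: `testing_exploded` is taken to hold one `(rowid, feature)` pair per row, and the model is a plain dictionary standing in for the `lr_model` table. The sample data is hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """Mirror sigmoid(sum(m.weight)) ... GROUP BY t.rowid over a left outer join."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # Left outer join: a feature absent from the model adds nothing.
        # (In SQL a row whose features are all unmatched yields NULL;
        # this sketch reports sigmoid(0) = 0.5 for it instead.)
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}


# Hypothetical model and exploded test rows.
model = {"a": 2.0, "b": -0.5}
rows = [(1, "a"), (1, "b"), (2, "unseen")]
probs = predict(rows, model)
```

Like the query, the sketch sums raw weights without multiplying by a feature value, which fits binary or categorical features.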
  19. How to use Hivemall (workflow diagram: batch training on Hadoop, online prediction on an RDBMS). Current step: Export prediction model
  20. Online Prediction on MySQL (RDBMS): quick (msec) response on an RDBMS by adding an index to the feature column. bit.ly/hivemall-mysql
  21. Agenda: 1. What is Hivemall; 2. How to use Hivemall; 3. New Features in Hivemall v0.4 (Random Forest, Factorization Machine); 4. Development Roadmap of Hivemall
  22. Features to be supported in Hivemall v0.4 (planned for release in Oct.)
     1. RandomForest: classification, regression. Based on Smile (github.com/haifengl/smile)
     2. Factorization Machine: classification, regression (factorization). Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction)
  23. RandomForest in Hivemall v0.4: an ensemble of decision trees (bagging). Already available on a development (smile) branch; its usage is explained in the project wiki.
  24. Training of RandomForest
  25. Out-of-bag tests and Variable Importance
  26. Prediction of RandomForest
  27. RandomForest DEMO: http://bit.ly/hivemall-rf
  28. Factorization Machine: Matrix Factorization
  29. Factorization Machine: context information (e.g., time) can be considered. Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
  30. Factorization Machine: factorization model with degree=2 (2-way interaction). The equation's terms are the global bias, the regression coefficient of the j-th variable, and the factorized pairwise interaction.
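The equation image from slide 30 did not survive in this transcript. For reference, the degree-2 model in the Rendle (2010) paper cited on slide 29 has the standard form below; the symbol names are those of the paper and may differ from the notation used on the slide.

```latex
\hat{y}(\mathbf{x}) =
  \underbrace{w_0}_{\text{global bias}}
  + \underbrace{\sum_{j=1}^{n} w_j x_j}_{\text{regression coefficient of the $j$-th variable}}
  + \underbrace{\sum_{j=1}^{n} \sum_{j'=j+1}^{n}
      \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}}_{\text{pairwise interaction (factorized)}},
\qquad
\langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle = \sum_{f=1}^{k} v_{j,f} \, v_{j',f}
```

Each pairwise weight is not learned freely but is the inner product of two k-dimensional latent vectors, which is what lets the model estimate interactions between features that rarely co-occur.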
  31. Factorization Machine ≈ Polynomial Regression + Factorization. For a feature [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. bit.ly/hivemall-poly
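The degree-2 expansion quoted on slide 31 can be reproduced with a short sketch. The function name is hypothetical and this is not Hivemall's own polynomial expansion UDF; it only shows which monomials arise for an input [a, b].

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(values, degree=2):
    """Return all monomials of the inputs up to `degree`, starting with the constant 1."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(values, d):
            # The empty combination (d == 0) has product 1, the bias term.
            features.append(prod(combo))
    return features


# With a = 2 and b = 3 the order is [1, a, b, a^2, ab, b^2].
expanded = polynomial_features([2, 3])
```

Plain polynomial regression would learn one free weight per monomial; a factorization machine instead ties the weight of each cross term such as `ab` to an inner product of latent vectors.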
  32. Factorization Machine DEMO
  33. Agenda: 1. What is Hivemall; 2. How to use Hivemall; 3. New Features in Hivemall v0.4 (Random Forest, Factorization Machine); 4. Development Roadmap of Hivemall
  34. Features to be supported in Hivemall v0.4.1 (planned for release in Nov/Dec.)
     1. Gradient Tree Boosting: classification, regression
     2. Field-aware Factorization Machine: classification, regression (factorization). The existing implementation, i.e., LibFFM, can only be applied to classification
  35. Gradient Tree Boosting (or Gradient Boosting Trees)
     • RF ≈ Bagging + Decision Trees: parallel execution of decision trees
     • GBT ≈ Boosting + Decision Trees: sequential execution of decision trees
  36. Gradient Tree Boosting
  37. Features to be supported in Hivemall v0.4.2 (planned for release in Dec/Jan.)
     1. Online LDA: topic modeling, clustering
     2. Mix server on Apache YARN: a service for parameter sharing among workers; working w/ @maropu
  38. What’s Mix Server? An external service through which distributed training processes share parameters in the middle of training.
     • Diagram: classifiers send model updates (async add) to Mix servers over a non-blocking channel (a single shared TCP connection w/ TCP keepalive)
     • Each feature is routed to a Mix server by hash(feature) % N
     • Each Mix server keeps an AVG/Argmin KLD accumulator
     • Piggy back if …
     • Computation/training is not blocked
     • Taking advantage of asynchronous non-blocking I/O is the core idea behind Hivemall’s MIX protocol
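The routing and averaging labels on slide 38 can be illustrated with a toy sketch. It is synchronous and in-process, so it shows only the `hash(feature) % N` partitioning and an averaging accumulator, not the asynchronous protocol itself. The class name, the use of CRC32 as the hash, and the sample updates are all assumptions of this sketch.

```python
import zlib


class MixServer:
    """Toy stand-in for one Mix server: keeps a running average per feature."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def add(self, feature, weight):
        self.sums[feature] = self.sums.get(feature, 0.0) + weight
        self.counts[feature] = self.counts.get(feature, 0) + 1
        # Return the averaged weight that would be sent back to the worker.
        return self.sums[feature] / self.counts[feature]


def route(feature, n_servers):
    # CRC32 is a deterministic stand-in; the slide does not say which hash is used.
    return zlib.crc32(feature.encode("utf-8")) % n_servers


servers = [MixServer() for _ in range(3)]
# Hypothetical updates arriving from different workers.
for feature, weight in [("f1", 0.5), ("f1", 1.5), ("f2", -1.0)]:
    mixed = servers[route(feature, len(servers))].add(feature, weight)
```

Because routing depends only on the feature, every worker's update for a given feature reaches the same server, which is what makes a per-feature accumulator sufficient.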
  39. How to use Mix Server
         create table kdd10a_pa1_model1 as
         select
           feature,
           cast(voted_avg(weight) as float) as weight
         from (
           select train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
           from kdd10a_train_x3
         ) t
         group by feature;
  40. Conclusion and Takeaway
     • Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
     • The latest version of Hivemall is available on Treasure Data and is used by several companies, including OISIX, Livesense, Scaleout, and Freakout.
     • New features in v0.4: Random Forest and Factorization Machine. More will follow in v0.4.1.
     • Next actions: propose Hivemall to the Apache Incubator; new Hivemall logo.
  41. Beyond Query-as-a-Service! We Open-source! We invented .. We are hiring machine learning engineers!
  42. Additional slides
  43. Recommendation: rating prediction of a matrix. Can be applied to user/item recommendation.
  44. Matrix Factorization: factorize a matrix into a product of matrices having k latent factors.
  45. Criteria of Biased MF. The equation's terms are the mean rating, the bias for each user/item, the matrix factorization, and the regularization.
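The equation image from slide 45 is not preserved here. A commonly used form of the biased matrix factorization objective, with terms matching the labels on the slide, is given below; the symbols are the conventional ones and are assumed rather than copied from the slide.

```latex
\min_{P,\,Q,\,b}\;
\sum_{(u,i) \in \mathcal{K}}
\Bigl( r_{ui}
  - \underbrace{\mu}_{\text{mean rating}}
  - \underbrace{b_u - b_i}_{\text{user/item bias}}
  - \underbrace{\mathbf{p}_u^{\top} \mathbf{q}_i}_{\text{factorization}}
\Bigr)^{2}
+ \underbrace{\lambda \bigl( \lVert \mathbf{p}_u \rVert^{2} + \lVert \mathbf{q}_i \rVert^{2}
  + b_u^{2} + b_i^{2} \bigr)}_{\text{regularization}}
```

Here \(\mathcal{K}\) is the set of observed (user, item) ratings, and the predicted rating is \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^{\top} \mathbf{q}_i\).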
  46. Training of Matrix Factorization: supports iterative training using a local disk cache
  47. Prediction of Matrix Factorization
  48. Comparison to Spark MLlib
     • Algorithm is different. Spark: ALS-WR (considers regularization). Hivemall: Biased-MF (considers regularization and biases)
     • Usability. Spark: 100+ lines of Scala code. Hivemall: SQL (easier to use)
     • Prediction accuracy: almost the same on the MovieLens 10M dataset
  49. Unsupervised Learning: Anomaly Detection. Sensor data etc. Anomaly detection runs as a series of SQL queries.
         rowid  features
         1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
         2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
         3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
  50. Anomalies in Sensor Data. Source: https://codeiq.jp/q/207
  51. Local Outlier Factor (LOF). Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
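The density comparison behind LOF can be made concrete with a brute-force sketch. This is not the Hivemall implementation, which the deck says runs as a series of SQL queries. The function name and sample points are hypothetical, and the sketch assumes distinct points and cuts distance ties at exactly k neighbours.

```python
import math


def local_outlier_factors(points, k):
    """Compute the LOF score of every point by brute force (O(n^2) distances)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point, excluding the point itself.
    neighbours = []
    k_distance = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def reach_dist(a, b):
        # Reachability distance of a from b: never smaller than b's k-distance.
        return max(k_distance[b], dist[a][b])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbours[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Four points in a tight square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(points, k=2)
```

Points inside the square score about 1 because their density matches that of their neighbours, while the isolated point scores far above 1.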
  52. DEMO: Local Outlier Factor
         rowid  features
         1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
         2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
         3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]