HadoopCon'16, Taipei @myui

Keynote talk at HadoopCon'16, Taipei on Sept 10.

Published in: Engineering

  1. Hivemall: Machine Learning Library for Apache Hive/Spark. Makoto YUI (油井 誠), Research Engineer, @myui <myui@treasure-data.com>. 2016/09/09 HadoopCon 16, Taipei
  2. A little about me
     • 2015.04~: Research Engineer at Treasure Data, Inc. My mission is developing ML-as-a-Service in a Hadoop-as-a-service company.
     • 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology (産業技術総合研究所), Japan. Developed Hivemall as a personal research project.
     • 2009.03: Ph.D. in Computer Science from NAIST. Majored in parallel data processing, not ML, at the time.
     • Visiting scholar at CWI, Amsterdam and the University of Edinburgh.
  3. Treasure Data
     • Hiro Yoshikawa, CEO: open source business veteran
     • Kaz Ota, CTO: founder of the world's largest Hadoop group
     • Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack
     • Timeline: 2012, founded in Mountain View, CA; 2013, new office in Tokyo, Japan; 2015, new office in Seoul, Korea; today, 100+ employees and 30M+ funding
     • Investors: Jerry Yang (Yahoo! founder); Bill Tai (angel investor); Yukihiro Matsumoto (Ruby inventor); Sierra Ventures, Tim Guleri (enterprise software); Scale Ventures, Andy Vitus (B2B SaaS)
  4. We open-source! TD invented: a streaming log collector; bulk data import/export; efficient binary serialization; a streaming query processor; machine learning on Hadoop; a workflow engine (digdag.io, beta)
  5. Our technology users. Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) are using Fluentd for log collection.
  7. Treasure Data's Solution
  8. Big Data Stats in TD
  9. Our Customers, by domain: Ad-tech (Agency / Trading Desk, DMP / DSP, Ad-Network); IoT; EC; Media; Game/SNS; Gaming; e-Commerce; Internet Service; Retail; Finance; Technology; Telecommunication; Maker; Other domain. Customer logos include 三菱重工 (Mitsubishi Heavy Industries).
  11. Agenda: 1. What is Hivemall (introduction) 2. Why Hivemall (motivations etc.) 3. Hivemall Internals 4. How to use Hivemall 5. Future roadmap
  12. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall
  13. Hivemall's Technology Stack (diagram)
     • Machine Learning: Hivemall, MLlib
     • Query Processing: Hive, Pig, SparkSQL
     • Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
     • Resource Management: Apache YARN, MESOS
     • Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
  14. Hivemall's Vision: ML on SQL
     • Machine learning made easy for SQL developers (ML for the rest of us)
     • Interactive and stable APIs w/ SQL abstraction
     • Contrast panel: Classification with Mahout
     • This SQL query automatically runs in parallel on Hadoop:
         CREATE TABLE lr_model AS
         SELECT
           feature, -- reducers perform model averaging in parallel
           avg(weight) as weight
         FROM (
           SELECT logress(features,label,..) as (feature,weight)
           FROM train
         ) t -- map-only task
         GROUP BY feature; -- shuffled to reducers
  15. List of supported Algorithms
     • Classification: Perceptron; Passive Aggressive (PA, PA1, PA2); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA; Factorization Machines; RandomForest Classification
     • Regression: Logistic Regression (SGD); AdaGrad (logistic loss); AdaDELTA (logistic loss); PA Regression; AROW Regression; Factorization Machines; RandomForest Regression
     • Tips: SCW is a good first choice. Try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines work well when features are sparse and categorical.
  16. List of Algorithms for Recommendation
     • K-Nearest Neighbor: Minhash and b-Bit Minhash (LSH variant); Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
     • Matrix Completion: Matrix Factorization; Factorization Machines (regression)
     • The each_top_k function of Hivemall is useful for recommending top-k items
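The idea behind the similarity-search and top-k items on the slide above can be sketched in plain Python. This is only an illustration of cosine similarity plus a per-item top-k selection; it does not call Hivemall, and the function names `cosine_similarity` and `top_k_similar` as well as the sample items are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {feature: value} dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(items, k):
    """For each item, return its k most similar other items as (similarity, other_id) pairs."""
    result = {}
    for item_id, vec in items.items():
        scored = (
            (cosine_similarity(vec, other), other_id)
            for other_id, other in items.items()
            if other_id != item_id
        )
        result[item_id] = heapq.nlargest(k, scored)
    return result


# Hypothetical item vectors, made up for illustration only.
items = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0, "y": 0.9},
    "c": {"z": 1.0},
}
top = top_k_similar(items, k=1)
```

This brute-force version compares every pair of items, which is exactly the cost that LSH-style techniques such as Minhash are meant to reduce on large catalogs.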
  17. Other Supported Algorithms
     • Anomaly Detection: Local Outlier Factor (LoF)
     • Feature Engineering: Feature Hashing; Feature Scaling (normalization, z-score); TF-IDF vectorizer; Polynomial Expansion (Feature Pairing); Amplifier
     • NLP: Basic English text Tokenizer; Japanese Tokenizer (Kuromoji)
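Feature hashing, listed on the slide above, maps arbitrary feature names to a fixed number of integer buckets. The sketch below shows the general technique in plain Python. It uses `zlib.crc32` purely as a stand-in hash, which is an assumption for illustration; Hivemall's own hash function and bucket count may differ, and `hash_feature` and `hash_features` are hypothetical names.

```python
import zlib


def hash_feature(name, num_buckets=2 ** 20):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    # crc32 is a stand-in hash chosen for this sketch, not Hivemall's hash.
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def hash_features(features, num_buckets=2 ** 20):
    """Turn {name: value} into {bucket: value}, summing values when names collide."""
    hashed = {}
    for name, value in features.items():
        idx = hash_feature(name, num_buckets)
        hashed[idx] = hashed.get(idx, 0.0) + value
    return hashed


# Hypothetical input row, made up for illustration.
row = {"gender=man": 1.0, "age": 34.0, "height": 173.0}
hashed_row = hash_features(row, num_buckets=1024)
```

A smaller `num_buckets` saves memory at the price of more collisions, where unrelated features end up sharing one weight.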
  18. Industry use cases of Hivemall
     • CTR prediction of ad click logs. Algorithm: logistic regression. Users: Freakout Inc., Smartnews, and more
     • Gender prediction of ad click logs. Algorithm: classification. Users: Scaleout Inc.
  19. Industry use cases of Hivemall (continued)
     • Item/user recommendation. Algorithm: recommendation. Users: Wish.com, GMO pepabo
     • Problem (minne.com): recommending hot items is hard in a handcrafted-product market because each creator sells only a few one-off items, which soon go out of stock
  20. Industry use cases of Hivemall (continued)
     • Value prediction of real estate. Algorithm: regression. Users: Livesense
  21. Industry use cases of Hivemall (continued)
     • User score calculation. Algorithm: regression. Users: Klout (influencer marketing, klout.com). See bit.ly/klout-hivemall
  22. Churn Detection of Monthly Payment Service: OISIX, a leading food delivery service company in Japan, used Hivemall's logistic regression to estimate churn probability. The churn rate dropped by almost half after gift points were given to customers predicted to leave.
  23. Agenda: 1. What is Hivemall 2. Why Hivemall (motivations etc.) 3. Hivemall Internals 4. How to use Hivemall 5. Future roadmap
  24. Motivation: why a new ML framework? Machine learning frameworks out there that run with Hadoop: Mahout? Vowpal Wabbit (w/ Hadoop streaming)? Spark MLlib? 0xdata H2O? Cloudera Oryx? Quick poll: how many people in this room are using them?
  25. How I used to do ML projects before Hivemall. Given raw data stored on Hadoop HDFS: Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file (height:173cm weight:60kg age:34 gender: man …) -> Machine Learning
  26. How I used to do ML projects before Hivemall (continued). The Extract-Transform-Load step needs expensive data preprocessing: joins, filtering, and formatting of data that does not fit in memory.
  27. How I used to do ML projects before Hivemall (continued). The Machine Learning step does not scale, and you have to learn R/Python APIs.
  28. Hivemall's Vision: ML on SQL (again)
     • Machine learning made easy for SQL developers (ML for the rest of us)
     • Interactive and stable APIs w/ SQL abstraction
     • Contrast panel: Classification with Mahout
     • This SQL query automatically runs in parallel on Hadoop:
         CREATE TABLE lr_model AS
         SELECT
           feature, -- reducers perform model averaging in parallel
           avg(weight) as weight
         FROM (
           SELECT logress(features,label,..) as (feature,weight)
           FROM train
         ) t -- map-only task
         GROUP BY feature; -- shuffled to reducers
  29. Hivemall on Apache Spark. Installation is very easy:
         $ spark-shell --packages maropu:hivemall-spark:0.0.6
  30. Agenda: 1. What is Hivemall 2. Why Hivemall (motivations etc.) 3. Hivemall Internals 4. How to use Hivemall 5. Future roadmap
  31. How Hivemall works in training. Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs); a UDTF is a function that returns a relation.
     • Diagram: training table of tuple<label, array<features>> rows (+1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; +1, <3,8>) -> parallel train tasks (UDTF) -> tuple<feature, weights> -> shuffle by feature -> param-mix -> prediction model, a relation <feature, weights>
     • The resulting prediction model is a relation of features and their weights
     • The numbers of mappers and reducers are configurable
     • Parallelism is powerful
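The training flow on the slide above (parallel train tasks emit feature/weight pairs, which are shuffled by feature and mixed) can be sketched in plain Python. This is a toy illustration, not Hivemall's implementation: the learner is a simple SGD logistic regression over binary features, labels are encoded as 0/1 rather than the slide's -1/+1, and the names `train_partition` and `average_by_feature`, the learning rate, and the partition contents are all assumptions.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1):
    """Map side: one SGD pass of logistic regression over a single partition.

    rows is a list of (label, features), with label in {0, 1} and features a
    list of ids of binary features. Returns (feature, weight) pairs, the way a
    table-generating function emits rows.
    """
    weights = defaultdict(float)
    for label, features in rows:
        predicted = sigmoid(sum(weights[f] for f in features))
        gradient = label - predicted
        for f in features:
            weights[f] += eta * gradient
    return list(weights.items())


def average_by_feature(emitted):
    """Reduce side: group (feature, weight) pairs by feature and average the weights."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two hypothetical partitions of a training table (labels recoded to 0/1).
partition_1 = [(1, [1, 2]), (1, [1, 7, 9]), (0, [1, 3, 9])]
partition_2 = [(0, [2, 7, 9]), (1, [3, 8])]

emitted = train_partition(partition_1) + train_partition(partition_2)
model = average_by_feature(emitted)
```

Each partition is trained independently, so the map side needs no coordination; the only cross-partition step is the per-feature average, which corresponds to the `GROUP BY feature` with `avg(weight)` in the slide's queries.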
  32. Parameter averaging (bagging). Diagram: the training table of tuple<label, features> rows (+1, <1,2>; +1, <1,7,9>; -1, <1,3,9>; -1, <2,7,9>; +1, <3,8>) is split across parallel train tasks; each task produces an array<weight>, and the arrays are combined in a MIX step.
  33. Alternative Approach in Hivemall. Hivemall provides the amplify UDTF to emulate the effect of multiple training iterations without running several MapReduce steps:
         SET hivevar:xtimes=3;
         CREATE VIEW training_x3 as
         SELECT *
         FROM (
           SELECT amplify(${xtimes}, *) as (rowid, label, features)
           FROM training
         ) t
         CLUSTER BY rand()
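What the query on the slide above does to the data can be sketched in plain Python: every training row is repeated xtimes times and the result is put in random order, so a single training pass sees each example several times. The helper names `amplify` and `amplified_and_shuffled`, the seed parameter, and the sample rows are hypothetical, and the shuffle is only a loose analogue of `CLUSTER BY rand()`.

```python
import random


def amplify(rows, xtimes):
    """Emit every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def amplified_and_shuffled(rows, xtimes, seed=None):
    """Repeat the rows, then shuffle them (a loose analogue of CLUSTER BY rand())."""
    out = list(amplify(rows, xtimes))
    random.Random(seed).shuffle(out)
    return out


# Hypothetical (rowid, label, features) rows.
training = [(1, 1, ["1", "2"]), (2, -1, ["1", "3", "9"])]
training_x3 = amplified_and_shuffled(training, xtimes=3, seed=42)
```

The shuffle matters for online learners: without it, the repeated copies of a row would arrive back to back instead of being spread across the pass.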
  34. Agenda: 1. What is Hivemall 2. Why Hivemall (motivations etc.) 3. Hivemall Internals 4. How to use Hivemall 5. Future roadmap
  35. How to use Hivemall: Data preparation. Workflow diagram: Training (Label, Feature Vector -> Machine Learning -> Prediction Model); Prediction (Feature Vector, Prediction Model -> Label)
  36. How to use Hivemall - Data preparation: define a Hive table for training/testing data
         CREATE EXTERNAL TABLE e2006tfidf_train (
           rowid int,
           label float,
           features ARRAY<STRING>
         )
         ROW FORMAT DELIMITED
           FIELDS TERMINATED BY '\t'
           COLLECTION ITEMS TERMINATED BY ","
         STORED AS TEXTFILE
         LOCATION '/dataset/E2006-tfidf/train';
  37. How to use Hivemall: Feature Engineering. Workflow diagram: Training (Label, Feature Vector -> Machine Learning -> Prediction Model); Prediction (Feature Vector, Prediction Model -> Label)
  38. How to use Hivemall - Feature Engineering: applying min-max normalization, which transforms a label value to a value between 0.0 and 1.0
         create view e2006tfidf_train_scaled as
         select
           rowid,
           rescale(label, ${min_label}, ${max_label}) as label,
           features
         from e2006tfidf_train;
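Min-max normalization as described on the slide above is a one-line formula, shown here as a plain Python sketch. The function name mirrors the `rescale` call in the query, but this is not Hivemall's code, and the handling of a zero-width range is an assumption made for the sketch.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: a zero-width range maps to the midpoint.
        return 0.5
    return (value - min_value) / (max_value - min_value)


scaled = [rescale(v, 0.0, 10.0) for v in (0.0, 5.0, 10.0)]
```

The minimum and maximum have to be computed over the training data beforehand, which is why the slide's query takes them as the `${min_label}` and `${max_label}` variables.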
  39. How to use Hivemall: Training. Workflow diagram: Training (Label, Feature Vector -> Machine Learning -> Prediction Model); Prediction (Feature Vector, Prediction Model -> Label)
  40. How to use Hivemall - Training: training by logistic regression
         CREATE TABLE lr_model AS
         SELECT
           feature,
           avg(weight) as weight -- reducers perform model averaging in parallel
         FROM (
           SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
           FROM train
         ) t
         GROUP BY feature -- shuffle map outputs to reducers by feature
  41. How to use Hivemall - Training: training of the Confidence Weighted (CW) classifier
         CREATE TABLE news20b_cw_model1 AS
         SELECT
           feature,
           voted_avg(weight) as weight -- vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
         FROM (
           SELECT train_cw(features,label) as (feature,weight)
           FROM news20b_train
         ) t
         GROUP BY feature
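The slide above describes voted_avg as a vote on whether to average the positive or the negative weights. A plain Python sketch of that idea follows, under stated assumptions: the sign with more votes wins and only the weights of that sign are averaged; how ties and all-zero inputs are handled here is a guess for illustration and may differ from Hivemall's UDAF.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return None  # assumption: nothing to vote on
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    # Assumption: ties fall through to the negative side.
    return sum(negatives) / len(negatives)


# The example weights shown on the slide.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's weights, four of the five values are positive, so the result is the mean of those four values and the single -0.1 is ignored instead of dragging the average down.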
  42. How to use Hivemall: Prediction. Workflow diagram: Training (Label, Feature Vector -> Machine Learning -> Prediction Model); Prediction (Feature Vector, Prediction Model -> Label)
  43. How to use Hivemall - Prediction. Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model, so there is no need to load the entire model into memory.
         CREATE TABLE lr_predict as
         SELECT
           t.rowid,
           sigmoid(sum(m.weight)) as prob
         FROM testing_exploded t
         LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
         GROUP BY t.rowid
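The join-based prediction on the slide above can be mimicked in plain Python to show what the query computes: each test row, exploded into (rowid, feature) pairs, looks up its feature weights in the model, sums them, and passes the sum through a sigmoid. The function name `predict` and the sample data are hypothetical. One simplification is flagged in the code: a feature missing from the model contributes 0.0 here, whereas in SQL a row whose features are all unmatched would produce a NULL sum.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: {feature: weight}."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN analogue: unmatched features add nothing to the sum.
        # Simplification: SQL would give NULL if no feature of a row matched.
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Hypothetical model and exploded test rows.
model = {"a": 1.0, "b": -1.0}
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probabilities = predict(rows, model)
```

Like the slide's query, this sums weights without multiplying by feature values, so it assumes binary features.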
  44. Real-time prediction (bit.ly/hivemall-rtp). Batch training on Hadoop (Label, Feature Vector -> Machine Learning -> Prediction Model), then export the prediction model for online prediction on an RDBMS (Feature Vector -> Label).
  45. RandomForest in Hivemall: Ensemble of Decision Trees
  46. Training of RandomForest
  47. Prediction of RandomForest
  48. Agenda: 1. What is Hivemall 2. Why Hivemall (motivations etc.) 3. Hivemall Internals 4. How to use Hivemall 5. Future roadmap
  49. Future of Hivemall: Hivemall will become Apache Hivemall (?). The vote is still in progress, though.
  50. Apache Incubation status
  51. Initial committers
     • Makoto Yui <Treasure Data>
     • Takeshi Yamamuro <NTT>: Hivemall on Apache Spark
     • Daniel Dai <Hortonworks>: Hivemall on Apache Pig; Apache Pig PMC member
     • Tsuyoshi Ozawa <NTT>: Apache Hadoop PMC member
     • Kai Sasaki <Treasure Data>
  52. Project mentors (Champion and Nominated Mentors)
     • Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
     • Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
     • Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
     • Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member
  53. Roadmap
     • Possibly enter the Apache Incubator soon
     • Sept to Nov: IP clearance and project/repository site setup; contribution guideline; create a list of who uses Hivemall; more documentation!
     • The initial Apache release will be in Dec (or late Nov?)
  54. Coming New Features (already merged in master): Hivemall on Spark 2.0 w/ DataFrame support; XGBoost support. Please refer to bit.ly/hivemall-xgboost for details.
  55. Coming New Features (already merged in master): ChangeFinder, an efficient algorithm for finding change points and outliers in time series data. J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  57. Coming New Features (already merged in master): various evaluation metrics (PR #326)
  58. Other new features under way, for the v0.5-beta{1,2} release (Oct-Nov): one-hot encoding; Field-aware Factorization Machines; Kernelized Passive Aggressive; Generalized Linear Model; Optimizer framework including ADAM; L1/L2 regularization; Gradient Tree Boosting; Online LDA
  59. Conclusion and Takeaway. Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
     • Hivemall's positioning: for SQL users who need ML; for those already using Hive; built with ease of use and scalability in mind
     • It does not require coding, packaging, compiling, or introducing a new programming language or APIs
     • We welcome your contributions to Apache Hivemall
  60. Any feature requests or questions? #hivemall
