
Hivemall dbtechshowcase 20160713 #dbts2016


Talk at DB tech showcase on July/13/2016

Published in: Engineering


  1. Machine Learning Made Easy by using Hivemall
     Research Engineer Makoto YUI @myui <myui@treasure-data.com>
     bit.ly/hivemall
     2016/07/13 DB tech showcase
  2. Who am I?
     ➢ 2015/04 Joined Treasure Data, Inc.
     ➢ 1st Research Engineer in Treasure Data
     ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
     ➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
     ➢ Worked on a large-scale Machine Learning project and Parallel Databases
     ➢ 2009/03 Ph.D. in Computer Science from NAIST
     ➢ XML native database and Parallel Database systems
  3. Treasure Data Cloud Service [architecture diagram]
     Data sources: SQL Server, CRM, RDBMS, App log, Sensor, Apache log, ERP
     Data collectors: Treasure Agent, Embulk, Mobile SDK, JS SDK (embedded)
     Processing: Hive (batch), Presto (ad hoc), Machine Learning
     Access: API, ODBC, JDBC, PUSH, BI tools, data analysis, external integrations
     900,000 records stored per sec.
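  4. Data Imported to Treasure Data [chart; unit: billions of records, axis 0 to 12,000]
     Milestones on the chart: service launch, Series A funding, 100 customer companies, selected by Gartner as a "Cool Vendor in Big Data", 5 trillion records, 10 trillion records
     Treasure Data by the numbers (October 2014):
     • 400,000 records imported per second
     • More than 10 trillion records imported in total
     • 12 billion records sent every day by a single customer in the ad-tech industry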
  5. Agenda
     1. What is Hivemall (short intro.)
     2. Why Hivemall (motivations etc.)
     3. How to use Hivemall
  6. What is Hivemall
     Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
     [Stack diagram]
     Machine Learning: Hivemall, MLlib
     Query Processing: Hive, Pig, SparkSQL
     Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
     Resource Management: Apache YARN, MESOS
     Distributed File System: Hadoop HDFS
  7. Won IDG's InfoWorld Bossie Awards 2014: The best open source big data tools
     InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
     bit.ly/hivemall-award
  8. List of supported Algorithms
     Classification
     ✓ Perceptron
     ✓ Passive Aggressive (PA, PA1, PA2)
     ✓ Confidence Weighted (CW)
     ✓ Adaptive Regularization of Weight Vectors (AROW)
     ✓ Soft Confidence Weighted (SCW)
     ✓ AdaGrad+RDA
     ✓ Factorization Machines
     ✓ RandomForest Classification
     Regression
     ✓ Logistic Regression (SGD)
     ✓ PA Regression
     ✓ AROW Regression
     ✓ AdaGrad (logistic loss)
     ✓ AdaDELTA (logistic loss)
     ✓ Factorization Machines
     ✓ RandomForest Regression
  9. List of supported Algorithms (same list as the previous slide), with guidance on choosing one:
     • SCW is a good first choice
     • Try RandomForest if SCW does not work
     • Logistic regression is good for getting the probability of a positive class
     • Factorization Machines work well where features are sparse and categorical
  10. List of Algorithms for Recommendation
      K-Nearest Neighbor
      ✓ Minhash and b-Bit Minhash (LSH variant)
      ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
      Matrix Completion
      ✓ Matrix Factorization
      ✓ Factorization Machines (regression)
      The each_top_k function of Hivemall is useful for recommending top-k items
  11. Other Supported Algorithms
      Anomaly Detection
      ✓ Local Outlier Factor (LoF)
      Feature Engineering
      ✓ Feature Hashing
      ✓ Feature Scaling (normalization, z-score)
      ✓ TF-IDF vectorizer
      ✓ Polynomial Expansion (Feature Pairing)
      ✓ Amplifier
      NLP
      ✓ Basic English text Tokenizer
      ✓ Japanese Tokenizer (Kuromoji)
  12. Industry use cases of Hivemall
      ➢ CTR prediction of Ad click logs
      • Freakout Inc., Fan Communications, and more
      • Replaced Spark MLlib w/ Hivemall at company X
      http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  13. Industry use cases of Hivemall
      ➢ Gender prediction of Ad click logs
      • Scaleout Inc. and Fan Communications
      http://eventdots.jp/eventreport/458208
  14. Industry use cases of Hivemall
      ➢ Value prediction of real estate
      • Livesense
      http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  15. Industry use cases of Hivemall
      Source: http://itnp.net/article/2016/02/18/2286.html
  16. Industry use cases of Hivemall
      ➢ Churn Detection
      • OISIX
      http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
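  17. Churn prediction for a membership service
      Problem
      • Recurring purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of churning in advance and prevent it
      Approach
      • Used machine learning on the past month of data to build a list of customers likely to churn in the coming month
      • Specifically, each step (creating the training table -> normalization -> building the model -> logistic regression) was implemented simply as queries using TD + Hivemall
      Results
      • Machine learning without statistical expertise
      • Churn rate halved by granting points to members on the churn prediction list
      • Surfaced the campaigns and events that carry churn risk, and also revealed the characteristic behavior of members who do not churn
      • Enabled indirect service improvements as well, such as changing the UI according to the level of risk
      Data used for prediction (Web, Mobile): attribute information, behavior logs, complaint information, acquisition source, service usage information
      Direct measures: granting points, care calls
      Indirect measures: guiding members toward a successful experience, UI changes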
  18. Industry use cases of Hivemall
      ➢ Recommendation
      • Portal site
  19. Agenda
      1. What is Hivemall (short intro.)
      2. Why Hivemall (motivations etc.)
      3. How to use Hivemall
  20. Why Hivemall
      1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
      2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
      That's why I built Hivemall.
  21. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS:
      Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file (height:173cm weight:60kg age:34 gender: man …) -> Machine Learning
  22. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS:
      Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file (height:173cm weight:60kg age:34 gender: man …) -> Machine Learning
      Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
  23. How I used to do ML projects before Hivemall
      Given raw data stored on Hadoop HDFS:
      Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector file (height:173cm weight:60kg age:34 gender: man …)
      • Do not scale
      • Have to learn R/Python APIs
  24. How I used to do ML before Hivemall
      Given raw data stored on Hadoop HDFS:
      Raw Data (HDFS, S3) -> Extract-Transform-Load -> Feature Vector (height:173cm weight:60kg age:34 gender: man …)
      Does not meet my needs in terms of its scalability, ML algorithms, and usability
      I ❤ scalable SQL query
  25. Survey on existing ML frameworks
      Framework: User interface
      • Mahout: Java API programming
      • Spark MLlib/MLI: Scala API programming, Scala Shell (REPL)
      • H2O: R programming, GUI
      • Cloudera Oryx: HTTP REST API programming
      • Vowpal Wabbit (w/ Hadoop streaming): C++ API programming, command line
      Existing distributed machine learning frameworks are NOT easy to use
  26. Hivemall's Vision: ML on SQL
      Classification with Mahout [panel content not captured in the transcript]
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      ✓ Machine Learning made easy for SQL developers (ML for the rest of us)
      ✓ Interactive and Stable APIs w/ SQL abstraction
      This SQL query automatically runs in parallel on Hadoop
  27. Hivemall on Apache Spark
      Installation is very easy as follows:
      $ spark-shell --packages maropu:hivemall-spark:0.0.6
  28. Agenda
      1. What is Hivemall
      2. Why Hivemall
      3. How to use Hivemall
  29. How to use Hivemall [pipeline diagram; this step: Data preparation]
      Training: Label + Feature Vector -> Machine Learning -> Prediction Model
      Prediction: Feature Vector + Prediction Model -> Label
  30. How to use Hivemall - Data preparation
      Define a Hive table for training/testing data:
      CREATE EXTERNAL TABLE e2006tfidf_train (
        rowid int,
        label float,
        features ARRAY<STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        COLLECTION ITEMS TERMINATED BY ","
      STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
  31. How to use Hivemall [pipeline diagram; this step: Feature Engineering]
      Training: Label + Feature Vector -> Machine Learning -> Prediction Model
      Prediction: Feature Vector + Prediction Model -> Label
  32. How to use Hivemall - Feature Engineering
      Applying a Min-Max Feature Normalization:
      create view e2006tfidf_train_scaled as
      select
        rowid,
        rescale(target,${min_label},${max_label}) as label,
        features
      from e2006tfidf_train;
      Transforming a label value to a value between 0.0 and 1.0
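The min-max rescaling applied on this slide can be sketched in plain Python. This is a minimal sketch of the arithmetic only, not Hivemall's implementation: the function name `min_max_rescale` and the sample values are hypothetical, and returning 0.5 for a zero-width range is a choice made for this sketch.

```python
def min_max_rescale(value, min_value, max_value):
    """Map value from the range [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range with no spread to normalize (sketch-specific choice).
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Made-up label values; in the slide, min and max come from ${min_label} and ${max_label}.
labels = [2.0, 5.0, 10.0]
lo, hi = min(labels), max(labels)
scaled = [min_max_rescale(v, lo, hi) for v in labels]
```

The smallest label maps to 0.0 and the largest to 1.0, which is the range logistic regression expects for its target.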
  33. How to use Hivemall [pipeline diagram; this step: Training]
      Training: Label + Feature Vector -> Machine Learning -> Prediction Model
      Prediction: Feature Vector + Prediction Model -> Label
  34. How to use Hivemall - Training
      Training by logistic regression:
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t
      GROUP BY feature
      • Inner SELECT: map-only task to learn a prediction model
      • GROUP BY feature: shuffle map outputs to reducers by feature
      • avg(weight): reducers perform model averaging in parallel
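The map-then-average pattern in this query can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Hivemall's `logress` implementation: `train_shard`, `average_models`, the learning rate, the epoch count, and the toy data are all hypothetical. Each shard stands in for one map task that trains independently and emits (feature, weight) pairs, and the averaging step stands in for `GROUP BY feature` with `avg(weight)`.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_shard(rows, eta=0.1, epochs=20):
    """Plain SGD for logistic regression over one shard (one 'map task').

    rows: list of (features, label), features as {name: value}, label in {0, 1}.
    Returns the learned (feature, weight) pairs.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            p = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on the log loss
    return list(w.items())


def average_models(emitted):
    """Group emitted (feature, weight) pairs by feature and average the weights."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two toy shards; feature "c" only appears in the second one.
shard1 = [({"a": 1.0}, 1), ({"b": 1.0}, 0)]
shard2 = [({"a": 1.0}, 1), ({"b": 1.0}, 0), ({"c": 1.0}, 1)]
emitted = train_shard(shard1) + train_shard(shard2)
model = average_models(emitted)
```

A feature seen by only one shard is averaged over a single weight, which mirrors what a SQL `avg` over that group would do.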
  35. How to use Hivemall - Training
      Training of Confidence Weighted Classifier:
      CREATE TABLE news20b_cw_model1 AS
      SELECT
        feature,
        voted_avg(weight) as weight
      FROM (
        SELECT train_cw(features,label) as (feature,weight)
        FROM news20b_train
      ) t
      GROUP BY feature
      • train_cw: training for the CW classifier
      • voted_avg: vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
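One reading of the "vote" described on this slide can be sketched in plain Python: count the positive and negative weights, let the majority sign win, and average only the weights with that sign. This is an assumption about the behavior based on the slide text, not a copy of Hivemall's `voted_avg`. The handling of ties and of empty input in particular is a choice made for this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    if not negatives:
        # No non-zero weights at all (sketch-specific choice).
        return 0.0
    # Ties fall through to the negative side (sketch-specific choice).
    return sum(negatives) / len(negatives)


# The weights shown on the slide: four positive votes, one negative vote.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the positive side wins, so the single negative weight is dropped and the result is the mean of the four positive weights, about 0.475.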
  36. How to use Hivemall [pipeline diagram; this step: Prediction]
      Training: Label + Feature Vector -> Machine Learning -> Prediction Model
      Prediction: Feature Vector + Prediction Model -> Label
  37. How to use Hivemall - Prediction
      CREATE TABLE lr_predict as
      SELECT
        t.rowid,
        sigmoid(sum(m.weight)) as prob
      FROM testing_exploded t
        LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid
      • Prediction is done by LEFT OUTER JOIN between test data and prediction model
      • No need to load the entire model into memory
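The join-and-sum logic of this query can be sketched in plain Python. This is an illustration of the arithmetic only: the function `predict` and the toy rows are hypothetical, and the in-memory dict merely stands in for the `lr_model` table, whereas Hive evaluates the join in a distributed fashion without holding the model in one process.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(test_rows, model):
    """test_rows: iterable of (rowid, feature); model: {feature: weight}.

    Returns {rowid: probability}, summing the weights of each row's features.
    """
    totals = {}
    for rowid, feature in test_rows:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing to the sum.
        # (In SQL a row with no matching feature at all would yield NULL; here it yields 0.5.)
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


model = {"a": 2.0, "b": -1.0}  # made-up weights
rows = [(1, "a"), (1, "b"), (2, "unseen")]  # exploded test data: one row per feature
probs = predict(rows, model)
```

Row 1 sums to 2.0 - 1.0 = 1.0 before the sigmoid, while row 2 contains only a feature the model has never seen.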
  38. Real-time prediction
      Batch Training on Hadoop: Label + Feature Vector -> Machine Learning -> Prediction Model
      Export prediction model
      Online Prediction on RDBMS: Feature Vector + Prediction Model -> Label
      bit.ly/hivemall-rtp
  39. Export Prediction Model to a RDBMS
      TD export -> Any RDBMS
      Periodical export is very easy in Treasure Data
      Prediction Model (feature, weight):
      103 -0.4896543622016907
      104 -0.0955817922949791
      105 0.12560302019119263
      106 0.09214721620082855
  40. Real-time Prediction on MySQL
      SIGMOID(x) = 1.0 / (1.0 + exp(-x))
      Online prediction on MySQL (Feature Vector + Prediction Model -> Label):
      SELECT
        sigmoid(sum(t.value * m.weight)) as prob
      FROM testing_exploded t
        LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
      Index lookups are very efficient in RDBMSs
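The online-prediction query can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for MySQL, the table schemas and the test feature vector are made up for illustration, and `sigmoid` is registered as a user-defined function because SQLite does not ship one. The model weights are the ones shown on the export slide.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no sigmoid(); register one, passing NULL through unchanged.
conn.create_function(
    "sigmoid", 1, lambda x: None if x is None else 1.0 / (1.0 + math.exp(-x))
)
conn.executescript(
    """
    -- The primary key gives the indexed lookup on feature.
    CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE testing_exploded (feature INTEGER, value REAL);
    """
)
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [
        (103, -0.4896543622016907),
        (104, -0.0955817922949791),
        (105, 0.12560302019119263),
        (106, 0.09214721620082855),
    ],
)
# One incoming feature vector, exploded to one row per feature (made-up values).
# Feature 999 is absent from the model, so the outer join yields NULL and sum() skips it.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],
)
(prob,) = conn.execute(
    """
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
      LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
    """
).fetchone()
```

Only the model rows matching the request's features are read, which is why the per-request cost stays small even when the model table is large.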
  41. RandomForest in Hivemall
      Ensemble of Decision Trees
  42. Training of RandomForest
  43. Prediction of RandomForest
  44. https://console.treasuredata.com/jobs/75633717
  45. Conclusion
      Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
      ➢ For SQL users that need ML
      ➢ For those already using Hive
      ➢ Ease of use and scalability in mind
      Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
      [Figure: Hivemall's Positioning]
      Treasure Data provides ML-as-a-Service using the latest version of Hivemall
  46. We support machine learning in the Cloud
      Any feature requests? Or questions?
