3rd Hivemall meetup

Talk at the 3rd Hivemall meet-up on 2016/09/08.
https://eventdots.jp/event/597518

Published in: Engineering

  1. Recent progress and future roadmap of Hivemall. Research Engineer Makoto YUI @myui <myui@treasure-data.com>, #hivemallmtup, 2016/09/08 3rd Hivemall meetup
  2. Agenda: (1) Short introduction to Hivemall, including Hivemall use cases; (2) Recent updates; (3) Roadmap of Hivemall, including coming new features
  3. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall Thanks to everyone who contributed to the project!
  4. What is Hivemall (stack diagram, top to bottom):
     • Machine Learning: Hivemall, MLlib
     • Query Processing: Hive, Pig, SparkSQL
     • Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
     • Resource Management: Apache YARN, MESOS
     • Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
  5. Hivemall’s Vision: ML on SQL, in contrast to classification with Mahout
       CREATE TABLE lr_model AS
       SELECT
         feature, -- reducers perform model averaging in parallel
         avg(weight) as weight
       FROM (
         SELECT logress(features,label,..) as (feature,weight)
         FROM train
       ) t -- map-only task
       GROUP BY feature; -- shuffled to reducers
     • Machine learning made easy for SQL developers (ML for the rest of us)
     • Interactive and stable APIs with SQL abstraction
     This SQL query automatically runs in parallel on Hadoop.
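
The comments in the query on slide 5 describe a pattern: each map task trains its own model on its share of the data, the (feature, weight) pairs are shuffled by feature, and reducers average the weights. Below is a minimal Python sketch of that pattern, assuming a plain SGD logistic regression per partition. The helper names `train_partition`, `average_models`, and `predict` are hypothetical and are not Hivemall functions; the sketch does not reproduce what `logress` does internally.

```python
import math
import random
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=20):
    """Train one logistic-regression model with plain SGD on a single partition.

    Hypothetical stand-in for the map-only training step; not Hivemall's logress.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return dict(w)


def average_models(models):
    """Mimic GROUP BY feature with avg(weight): average each feature's weight."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, weight in model.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


def predict(model, features):
    """Return the predicted probability of the positive class."""
    z = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic data (an assumption for illustration): label is 1 when x > 0.
random.seed(0)
rows = []
for _ in range(400):
    x = random.uniform(-1, 1)
    rows.append(({"x": x, "bias": 1.0}, 1.0 if x > 0 else 0.0))

# Four partitions play the role of four map tasks.
partitions = [rows[i::4] for i in range(4)]
model = average_models([train_partition(p) for p in partitions])
```

Averaging per feature keeps the reduce step independent across features, which is why the query can express it as a simple `GROUP BY`.
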
  6. Industry use cases of Hivemall: CTR prediction of ad click logs. Freakout Inc., Fan Communications, and more. Replaced Spark MLlib with Hivemall at company X. http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  7. Industry use cases of Hivemall: gender prediction of ad click logs. Scaleout Inc. and Fan Communications. http://eventdots.jp/eventreport/458208
  8. Industry use cases of Hivemall: value prediction of real estate. Livesense. http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  9. Industry use cases of Hivemall: churn detection. OISIX. http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
  10. Agenda: (1) Short introduction to Hivemall, including Hivemall use cases; (2) Recent updates; (3) Roadmap of Hivemall, including coming new features
  11. Recent Releases: v0.4.2-rc.2, released on 2016/06/28. Minor hotfixes. The latest release.
  12. Recent Releases: v0.4.2-rc.1, released on 2016/06/07. Hivemall on Spark v1.6 (kudos to @maropu). BPR-MF (Matrix Factorization for Implicit Feedback).
  13. Hivemall on Apache Spark. Installation is very easy: $ spark-shell --packages maropu:hivemall-spark:0.0.6
  14. Feature Hashing: a frequently used technique for dealing with high-dimensional data (figure: high-dimensional input mapped to a low-dimensional space).
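
Feature hashing, as named on slide 14, maps an unbounded set of feature names onto a fixed number of buckets. Below is a minimal Python sketch under stated assumptions: CRC32 from the standard library is used only for convenience and is not claimed to be the hash function Hivemall uses, and `hash_features` and `num_buckets` are hypothetical names.

```python
import zlib


def hash_features(features, num_buckets=2 ** 10):
    """Map {feature name: value} onto {bucket index: value}.

    Names that collide in the same bucket have their values summed,
    which is the usual price paid for the fixed dimensionality.
    """
    hashed = {}
    for name, value in features.items():
        idx = zlib.crc32(name.encode("utf-8")) % num_buckets
        hashed[idx] = hashed.get(idx, 0.0) + value
    return hashed


row = {"user=alice": 1.0, "page=/top": 1.0, "price": 0.5}
hashed_row = hash_features(row, num_buckets=16)
print(hashed_row)
```

A smaller `num_buckets` saves memory but raises the collision rate, so the bucket count is a trade-off between model size and accuracy.
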
  15. Kernel trick: the input feature space is mapped to a higher-dimensional feature space. Drawing a hyperplane in the higher-dimensional space yields a nonlinear separation in the original low-dimensional space. For two-dimensional features [a, b], the degree-2 polynomial features are [(1,) a, b, a^2, ab, b^2].
  16. Polynomial Expansion
  17. Polynomial Expansion: b^b:1.0 and b^b^b:1.0 are omitted with the truncate option; a^a:0.25 and c^c:0.09 are omitted with the interaction-only option.
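
The two options on slide 17 can be illustrated with a small Python sketch. It encodes one possible reading of the slide's notes and is not Hivemall's implementation: `polynomial_features` here is a hypothetical helper, and the input values a=0.5, b=1.0, c=0.3 are inferred from the products quoted on the slide.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(features, degree=2, interaction_only=False, truncate=False):
    """Expand {name: value} with product terms up to the given degree.

    interaction_only: skip terms that repeat a feature (a^a, c^c, ...).
    truncate: skip terms that repeat a feature whose value is 1.0, since
              they add nothing new (b^b equals b). Under this reading,
              mixed terms such as a^b^b are dropped as well.
    """
    out = dict(features)
    names = sorted(features)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(names, d):
            if interaction_only and len(set(combo)) < d:
                continue
            if truncate and any(
                features[n] == 1.0 and combo.count(n) > 1 for n in combo
            ):
                continue
            out["^".join(combo)] = prod(features[n] for n in combo)
    return out


x = {"a": 0.5, "b": 1.0, "c": 0.3}
full = polynomial_features(x, degree=2)
truncated = polynomial_features(x, degree=3, truncate=True)
interactions = polynomial_features(x, degree=2, interaction_only=True)
```

Both options shrink the expanded vector, which matters because the number of terms grows quickly with the degree.
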
  18. Feature Vector Formatter Functions: quantitative variables are formatted as "column_name:value" and categorical variables as "column_name#value". Features that are null or have a weight of 0.0 are not generated.
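
The formatting rules on slide 18 can be sketched in a few lines of Python. The helper names `format_quantitative` and `format_categorical` are hypothetical and only mirror the rules stated on the slide; they are not the Hivemall functions themselves.

```python
def format_quantitative(names, values):
    """Format quantitative variables as "name:value".

    Null (None) values and values equal to 0.0 produce no feature.
    """
    return [
        f"{n}:{v}"
        for n, v in zip(names, values)
        if v is not None and float(v) != 0.0
    ]


def format_categorical(names, values):
    """Format categorical variables as "name#value"; nulls produce no feature."""
    return [f"{n}#{v}" for n, v in zip(names, values) if v is not None]


quantitative = format_quantitative(["age", "height", "debt"], [30, 1.5, 0.0])
categorical = format_categorical(["gender", "job"], ["male", None])
features = quantitative + categorical
```

Dropping null and zero-weight entries keeps the feature vector sparse, so only informative features reach the learner.
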
  19. Mini-batch Gradient Descent. Caution: mini-batch generally requires more iterations than SGD.
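
For readers unfamiliar with the term on slide 19, mini-batch gradient descent averages the gradient over a small batch of rows before each update, whereas SGD updates after every row. Below is a minimal Python sketch for a one-variable linear regression. The function name `minibatch_gd`, the learning rate, and the noise-free data are all assumptions chosen for illustration.

```python
import random


def minibatch_gd(data, batch_size=10, eta=0.5, epochs=100, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on squared error.

    With batch_size=1 this reduces to plain SGD.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient over the batch, then take one step.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


# Noise-free points on the line y = 2x + 1.
points = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = minibatch_gd(points)
```

Each pass over the data performs `len(data) / batch_size` updates instead of `len(data)`, which is one way to see why more passes may be needed than with per-row updates.
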
  20. Japanese Tokenizer using Kuromoji. This feature was requested by a Treasure Data customer. Thanks to company R for providing us with a reference implementation.
  21. Agenda: (1) Short introduction to Hivemall, including Hivemall use cases; (2) Recent updates; (3) Roadmap of Hivemall, including coming new features
  22. Important Announcement: Hivemall will become Apache Hivemall (?). The vote is still in progress.
  23. Apache Incubation status
  24. Initial committers
     • Makoto Yui <Treasure Data>
     • Takeshi Yamamuro <NTT>: Hivemall on Apache Spark
     • Daniel Dai <Hortonworks>: Hivemall on Apache Pig; Apache Pig PMC member
     • Tsuyoshi Ozawa <NTT>: Apache Hadoop PMC member
     • Kai Sasaki <Treasure Data>
  25. Project mentors (champion and nominated mentors)
     • Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
     • Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
     • Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
     • Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member
  26. Roadmap
     • Possibly enter the Apache Incubator in Sept 2016
     • Sept to Nov: IP clearance and project/repository site setup (contribution guideline, a "who uses Hivemall" list, more documentation)
     • Dec (or late Nov?): initial Apache release, v0.5
     • A non-Apache release of v0.5-beta.xx will be published on GitHub in Oct
  27. Coming New Features (already merged in master)
     • Hivemall on Spark 2.0 with DataFrame support (kudos to @maropu)
     • ChangeFinder: change point and anomaly detection (kudos to @L3sota and @takuti, PR #333)
     • XGBoost support (kudos to @maropu)
  28. Coming New Features (already merged in master): ChangeFinder. cf_detect(array<double> x [, const string options]). Reference: J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  29. Coming New Features (already merged in master): ChangeFinder. cf_detect(array<double> x [, const string options])
  30. Coming New Features (already merged in master): various evaluation metrics. Kudos to @takuti (PR #326); R2 and logloss were also contributed, with credit on the slide to Fan-cs and sakai-san.
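
Slide 30 names two of the metrics, R2 and logloss. Their textbook definitions can be sketched in Python as follows. The function names and the clipping constant `eps` are assumptions made for this sketch and do not describe Hivemall's own functions.

```python
import math


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities.

    Probabilities are clipped to [eps, 1 - eps] to keep the logarithm finite.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
coin_flip = logloss([1, 0], [0.5, 0.5])
```

R2 applies to regression outputs, while logloss applies to predicted class probabilities, so the two cover different kinds of models.
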
  31. Coming New Features (already merged in master): Feature Binning (kudos to @amaya382, PR #382). Maps quantitative variables to bins: for example, age (a quantitative variable) is mapped into a meaningful bin (a categorical variable) based on quantiles.
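
Quantile-based binning, as described on slide 31, can be sketched with the Python standard library. The helper names `build_bins` and `to_bin` and the sample ages are hypothetical; the sketch shows the general idea rather than the behavior of the Hivemall function.

```python
from bisect import bisect_right
from statistics import quantiles


def build_bins(values, num_bins):
    """Return the num_bins - 1 quantile cut points that split values into bins."""
    return quantiles(values, n=num_bins, method="inclusive")


def to_bin(value, boundaries):
    """Return the index of the bin that value falls into (0 .. len(boundaries))."""
    return bisect_right(boundaries, value)


ages = [18, 22, 25, 31, 38, 45, 52, 67]
boundaries = build_bins(ages, num_bins=4)
bins = [to_bin(age, boundaries) for age in ages]
```

Because the cut points come from quantiles, each bin receives a similar share of the rows even when the raw values are skewed.
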
  32. Other ongoing new features, for the v0.5-beta{1,2} releases (Oct-Nov)
     • System test framework (kudos to @amaya382)
     • One-hot encoding (kudos to @kai)
     • Field-aware Factorization Machines
     • Kernelized Passive Aggressive (kudos to @L3sota)
     • Generalized Linear Model: optimizer framework including ADAM; L1/L2 regularization (kudos to @maropu)
     • Disk-based iteration support, to avoid an overly large amplify
     • Gradient Tree Boosting
     • Online LDA
  33. We support machine learning in the cloud. Any feature requests or questions? bit.ly/td-wants-you