Keynote talk at https://eventdots.jp/event/602633

Published in:
Data & Analytics


- 1. Apache Hivemall: Machine Learning Library for Apache Hive/Spark/Pig. Research Engineer Makoto YUI, @myui <myui@treasure-data.com>. 2016/10/29 @Dots
- 2. A little about me
      - 2015.04~ Research Engineer at Treasure Data, Inc. My mission is developing ML-as-a-Service in a Hadoop-as-a-service company.
      - 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan. Developed Hivemall as a personal research project.
      - 2009.03 Ph.D. in Computer Science from NAIST. Majored in Parallel Data Processing, not ML, at the time.
      - Visiting scholar at CWI, Amsterdam and Univ. Edinburgh.
- 3. Treasure Data
      - Hiro Yoshikawa, CEO: open source business veteran
      - Kaz Ota, CTO: founder of the world's largest Hadoop group
      - Sada Furuhashi, Chief Architect: invented Fluentd, MessagePack
      - Timeline: 2012 founded in Mountain View, CA; 2013 new office in Tokyo, Japan; 2015 new office in Seoul, Korea; today 100+ employees, 30M+ funding
      - Investors: Jerry Yang (Yahoo! founder), Bill Tai (angel investor), Yukihiro Matsumoto (Ruby inventor), Tim Guleri (Sierra Ventures; enterprise software), Andy Vitus (Scale Ventures; B2B SaaS)
- 4. Big Data Stats in Treasure Data
- 5. We Open-source! TD invented: a streaming log collector; bulk data import/export; efficient binary serialization; a streaming query processor; machine learning on Hadoop; a workflow engine (Beta, digdag.io)
- 6. Treasure Data's Solution
- 7. Agenda: 1. What is Hivemall (introduction) 2. How to use Hivemall 3. Roadmap and coming new features
- 8. Hivemall entered Apache Incubator on Sept 13, 2016 🎉 hivemall.incubator.apache.org @ApacheHivemall
- 9. Initial committers
      - Makoto Yui <Treasure Data>
      - Takeshi Yamamuro <NTT>: Hivemall on Apache Spark
      - Daniel Dai <Hortonworks>: Hivemall on Apache Pig; Apache Pig PMC member
      - Tsuyoshi Ozawa <NTT>: Apache Hadoop PMC member
      - Kai Sasaki <Treasure Data>
- 10. Project mentors (champion and nominated mentors)
      - Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
      - Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
      - Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member
      - Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member
- 11. What is Apache Hivemall: a scalable machine learning library built as a collection of Hive UDFs. Multi/Cross platform, Versatile, Scalable, Ease-of-use
- 12. Hivemall is easy and scalable: ML made easy for SQL developers (ease-of-use) and born to be parallel and scalable. Shown in contrast to classification with Mahout, training is a single SQL query that automatically runs in parallel on a Hadoop cluster:

          CREATE TABLE lr_model AS
          SELECT
            feature, -- reducers perform model averaging in parallel
            avg(weight) as weight
          FROM (
            SELECT logress(features,label,..) as (feature,weight)
            FROM train
          ) t -- map-only task
          GROUP BY feature; -- shuffled to reducers

- 13. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
- 14. Hivemall's Technology Stack
- 15. Hivemall on Apache Hive
- 16. Hivemall on Apache Spark Dataframe
- 17. Hivemall on SparkSQL
- 18. Hivemall on Apache Pig
- 19. Hivemall is a Versatile library
      - Hivemall is not only for Machine Learning
      - Hivemall provides a bunch of generic utility functions
      - Each organization has its own set of UDFs for data preprocessing. Don't Repeat Yourself!
- 20. Hivemall generic functions: Array and Map; Bit and compress; String and NLP. We welcome contributions of your generic UDFs to Hivemall!
- 21. List of supported Algorithms
      - Classification: Perceptron; Passive Aggressive (PA, PA1, PA2); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA; Factorization Machines; RandomForest Classification
      - Regression: Logistic Regression (SGD); AdaGrad (logistic loss); AdaDELTA (logistic loss); PA Regression; AROW Regression; Factorization Machines; RandomForest Regression
      - SCW is a good first choice. Try RandomForest if SCW does not work.
      - Logistic regression is good for getting the probability of a positive class.
      - Factorization Machines work well where features are sparse and categorical.
- 22. List of Algorithms for Recommendation
      - K-Nearest Neighbor: Minhash and b-Bit Minhash (LSH variant); Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
      - Matrix Completion: Matrix Factorization; Factorization Machines (regression)
      - The each_top_k function of Hivemall is useful for recommending top-k items.
- 23. Top-k query processing: list the top-2 students for each class. Input:

          student  class  score
          1        b      70
          2        a      80
          3        a      90
          4        b      50
          5        a      70
          6        b      60

      Result:

          student  class  score
          3        a      90
          2        a      80
          1        b      70
          6        b      60

- 24. Top-k query processing: list the top-2 students for each class (same input table) with rank():

          SELECT * FROM (
            SELECT
              *,
              rank() over (partition by class order by score desc) as rank
            FROM table
          ) t
          WHERE rank <= 2

- 25. Top-k query processing: list the top-2 students for each class (same input table) with each_top_k:

          SELECT
            each_top_k(
              2, class, score,
              class, student
            ) as (rank, score, class, student)
          FROM (
            SELECT * FROM table
            DISTRIBUTE BY class SORT BY class
          ) t

- 26. Top-k query processing by RANK OVER(): rows are partitioned by class; each node sorts by class and score, applies rank over(), and then filters on rank <= 2.
- 27. Top-k query processing by EACH_TOP_K: rows are distributed by class; each node sorts by class only, and each_top_k outputs only K items.
- 28. Comparison between RANK and EACH_TOP_K
      - RANK: sort by class and score (sorting is heavy), then rank over() needs to process all rows before the rank <= 2 filter.
      - EACH_TOP_K: sort by class only; each_top_k outputs only K items.
      - each_top_k is very efficient where the number of classes is large. A bounded priority queue is utilized.
- 29. Performance reported by a TD customer: 1,000 students in each class, 20 million classes. The RANK over() query did not finish in 24 hours; EACH_TOP_K finished in 2 hours. For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
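
The bounded-priority-queue idea behind the top-k slides can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall's implementation: the function name `each_top_k_sketch` and the row layout are assumptions made for illustration. It keeps at most k rows per class in a min-heap, so memory per class stays bounded and no full sort by score is needed.

```python
import heapq
from collections import defaultdict


def each_top_k_sketch(rows, k):
    """Return (rank, score, class, student) for the k best scores per class.

    Hypothetical helper: one bounded min-heap of size k per class.
    """
    heaps = defaultdict(list)  # class -> min-heap of (score, student)
    for student, group, score in rows:
        heap = heaps[group]
        if len(heap) < k:
            heapq.heappush(heap, (score, student))
        elif score > heap[0][0]:
            # The new row beats the current k-th best: evict the smallest.
            heapq.heapreplace(heap, (score, student))

    result = []
    for group in sorted(heaps):
        ranked = sorted(heaps[group], reverse=True)  # at most k items to sort
        for rank, (score, student) in enumerate(ranked, start=1):
            result.append((rank, score, group, student))
    return result


# The student table from the slides: (student, class, score)
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]
top2 = each_top_k_sketch(rows, 2)
```

Unlike the HiveQL query, which relies on DISTRIBUTE BY and SORT BY to bring rows of one class together, this sketch holds one heap per class in memory; it is meant only to show why emitting K items per class is cheaper than ranking every row.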
- 30. Other Supported Algorithms
      - Anomaly Detection: Local Outlier Factor (LoF)
      - Feature Engineering: Feature Hashing; Feature Scaling (normalization, z-score); TF-IDF vectorizer; Polynomial Expansion (Feature Pairing); Amplifier
      - NLP: Basic English text Tokenizer; Japanese Tokenizer (Kuromoji)
- 31. Industry use cases of Hivemall (http://www.slideshare.net/eventdotsjp/hivemall)
      - CTR prediction of Ad click logs. Algorithm: Logistic regression. Freakout Inc., Smartnews, and more
      - Gender prediction of Ad click logs. Algorithm: Classification. Scaleout Inc.
- 32. Industry use cases of Hivemall (continued)
      - Item/User recommendation. Algorithm: Recommendation. Wish.com, GMO pepabo (minne.com)
- 33. Industry use cases of Hivemall (continued)
      - Value prediction of real estate. Algorithm: Regression. Livesense
- 34. Industry use cases of Hivemall (continued)
      - User score calculation. Algorithm: Regression. Klout (influencer marketing, klout.com; bit.ly/klout-hivemall)
- 35. Churn Detection of Monthly Payment Service: OISIX, a leading food delivery service company in Japan, used Hivemall's Logistic Regression to get churn probability. The churn rate dropped almost by half by giving gift points to customers predicted to leave.
- 36. Agenda: 1. What is Hivemall (introduction) 2. How to use Hivemall 3. Roadmap and coming new features
- 37. How to use Hivemall: Data preparation (workflow diagram: Training takes labels and feature vectors and produces a prediction model; Prediction applies the model to feature vectors to produce labels)
- 38. How to use Hivemall - Data preparation: define a Hive table for training/testing data.

          CREATE EXTERNAL TABLE e2006tfidf_train (
            rowid int,
            label float,
            features ARRAY<STRING>
          )
          ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '\t'
            COLLECTION ITEMS TERMINATED BY ","
          STORED AS TEXTFILE
          LOCATION '/dataset/E2006-tfidf/train';

- 39. How to use Hivemall
- 40. How to use Hivemall: Feature Engineering (same workflow diagram)
- 41. How to use Hivemall - Feature Engineering: applying min-max normalization, which transforms a label value to a value between 0.0 and 1.0.

          create view e2006tfidf_train_scaled
          as
          select
            rowid,
            rescale(label,${min_label},${max_label}) as label,
            features
          from
            e2006tfidf_train;

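
Min-max normalization as described on the slide maps a value x to (x - min) / (max - min). The following Python sketch shows that arithmetic only; the function is a stand-in written for illustration, and Hivemall's own `rescale` UDF may treat edge cases (such as min equal to max) differently. The label values are hypothetical.

```python
def rescale(value, min_value, max_value):
    """Min-max scaling: map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: treat a zero-width range as an error in this sketch.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical labels; min and max play the role of ${min_label}, ${max_label}.
labels = [2.0, 4.0, 6.0]
scaled = [rescale(x, min(labels), max(labels)) for x in labels]
```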
- 42. How to use Hivemall: Training (same workflow diagram)
- 43. How to use Hivemall - Training: training by logistic regression. A map-only task learns a prediction model, map outputs are shuffled to reducers by feature, and reducers perform model averaging in parallel.

          CREATE TABLE lr_model AS
          SELECT
            feature,
            avg(weight) as weight
          FROM (
            SELECT logress(features,label,..) as (feature,weight)
            FROM train
          ) t
          GROUP BY feature

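
The train-then-average pattern in this query can be sketched in plain Python. This is an illustration under stated assumptions, not Hivemall's `logress` implementation: the learning rate, the epoch count, the feature names, and the two tiny partitions are all made up. Each partition is trained independently by SGD (the map-only step), and the weights are then averaged per feature (the GROUP BY feature, avg(weight) step).

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """'Map' step: SGD logistic regression on one partition -> {feature: weight}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            gradient = sigmoid(margin) - label  # derivative of the logistic loss
            for f, v in features.items():
                weights[f] -= eta * gradient * v
    return dict(weights)


def average_models(models):
    """'Reduce' step: average the weights emitted for each feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, w in model.items():
            sums[f] += w
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two hypothetical partitions of (features, label) rows.
partitions = [
    [({"a": 1.0}, 1.0), ({"b": 1.0}, 0.0)],
    [({"a": 1.0, "c": 1.0}, 1.0), ({"b": 1.0, "c": 1.0}, 0.0)],
]
model = average_models([train_partition(p) for p in partitions])
```

Because each partition is trained without coordination, the only shuffle is the final grouping of (feature, weight) pairs, which is what lets the SQL version parallelize across mappers.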
- 44. How to use Hivemall - Training: training of the Confidence Weighted (CW) classifier. voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7.

          CREATE TABLE news20b_cw_model1 AS
          SELECT
            feature,
            voted_avg(weight) as weight
          FROM (
            SELECT train_cw(features,label) as (feature,weight)
            FROM news20b_train
          ) t
          GROUP BY feature

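
One plausible reading of the voting step is: count the positive and the negative weights, then average only the side that wins the vote. The Python sketch below follows that reading; it is an assumption about the behavior, not a copy of Hivemall's `voted_avg`, and the tie-breaking rule (ties go to the negative side) is chosen arbitrarily for illustration.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie, the negative side is used.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes, one negative.
example = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example, the positive side wins, so the single negative weight is ignored and the result is the mean of the four positive weights.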
- 45. How to use Hivemall: Prediction (same workflow diagram)
- 46. How to use Hivemall - Prediction: prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.

          CREATE TABLE lr_predict
          as
          SELECT
            t.rowid,
            sigmoid(sum(m.weight)) as prob
          FROM
            testing_exploded t LEFT OUTER JOIN
            lr_model m ON (t.feature = m.feature)
          GROUP BY
            t.rowid

- 47. Real-time prediction: batch training on Hadoop, then export the prediction model for online prediction on an RDBMS. bit.ly/hivemall-rtp
- 48. Export Prediction Model to an RDBMS: TD export works with any RDBMS, and periodical export is very easy in Treasure Data. Sample rows of the prediction model:

          103  -0.4896543622016907
          104  -0.0955817922949791
          105  0.12560302019119263
          106  0.09214721620082855

- 49. Real-time Prediction on MySQL: index lookups are very efficient in RDBMSs!

          SELECT
            sigmoid(sum(t.value * m.weight)) as prob
          FROM
            testing_exploded t LEFT OUTER JOIN
            prediction_model m ON (t.feature = m.feature)

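
The join-based prediction can be mirrored with a dictionary lookup. This Python sketch is illustrative only: the model weights are the sample rows from the export slide, while the test row (including the unseen feature 999) is hypothetical. Features absent from the model are skipped, which corresponds to the NULL products that SUM ignores after a LEFT OUTER JOIN.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(features, model):
    """features: {feature: value}; model: {feature: weight} -> probability."""
    # Unmatched features contribute nothing, like NULLs skipped by SUM.
    total = sum(value * model[f] for f, value in features.items() if f in model)
    return sigmoid(total)


# Sample weights taken from the exported prediction model on the slide.
model = {
    103: -0.4896543622016907,
    104: -0.0955817922949791,
    105: 0.12560302019119263,
    106: 0.09214721620082855,
}
row = {103: 1.0, 105: 2.0, 999: 1.0}  # hypothetical test row
prob = predict(row, model)
```

One difference from SQL: when no feature matches at all, SUM yields NULL, whereas this sketch returns sigmoid(0) = 0.5.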
- 50. Online Prediction by Apache Streaming
- 51. RandomForest in Hivemall: Ensemble of Decision Trees
- 52. Training of RandomForest
- 53. Prediction of RandomForest
- 54. Agenda: 1. What is Hivemall (introduction) 2. How to use Hivemall 3. Roadmap and coming new features
- 55. Roadmap
      - IP clearance and project/repository site setup
      - Create contribution guidelines
      - Move repository from GitHub to ASF
      - Add more tests and documentation
      - The initial Apache release is planned for December or January
- 56. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. Reference: J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
- 57. Anomaly/Change-point Detection by ChangeFinder (continued)
- 58. Change-point detection by Singular Spectrum Transformation: fewer hyperparameters than ChangeFinder. References: T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005; T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
- 59. Evaluation Metrics
- 60. Feature Engineering: Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
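
Quantile-based binning can be sketched with the Python standard library. The helper names `quantile_boundaries` and `assign_bin` and the age values are hypothetical, and this is not Hivemall's binning code: the sketch computes the interior quantile cut points, then assigns each value the index of the bin it falls into.

```python
import bisect
import statistics


def quantile_boundaries(values, num_bins):
    """Return the num_bins - 1 interior cut points that split values by quantile."""
    return statistics.quantiles(values, n=num_bins)


def assign_bin(value, boundaries):
    """Return the 0-based index of the bin that contains value."""
    return bisect.bisect_right(boundaries, value)


# Hypothetical ages, mapped into 3 bins as in the slide's example.
ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]
boundaries = quantile_boundaries(ages, 3)
bins = [assign_bin(age, boundaries) for age in ages]
```

With nine evenly spread ages and three bins, each bin receives three values; skewed data would still get roughly equal-sized bins, which is the point of binning on quantiles rather than fixed widths.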
- 61. Feature Selection: Signal Noise Ratio
- 62. Feature Selection: Chi-Square
- 63. Feature Transformation: Onehot encoding. Maps a categorical variable to a unique number starting from 1.
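
The mapping described on this slide, from each distinct category to a unique number starting at 1, can be sketched as follows. The function name and the color values are made up for illustration, and numbering in order of first appearance is an assumption; the slide does not say how Hivemall orders the categories.

```python
def build_index(categories):
    """Assign each distinct category a unique integer, starting from 1."""
    index = {}
    for category in categories:
        if category not in index:
            index[category] = len(index) + 1  # next unused number
    return index


# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
index = build_index(colors)
encoded = [index[c] for c in colors]
```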
- 64. Other new features to come
      - Spark 2.0 Dataframe support
      - XGBoost Integration
      - Field-aware Factorization Machines
      - Generalized Linear Model: optimizer framework including ADAM; L1/L2 regularization
- 65. Conclusion and Takeaway
      - Hivemall is a machine learning library that is Multi/Cross platform, Versatile, Scalable, and Easy to use. hivemall.incubator.apache.org
      - Hivemall's Positioning: for Data Engineers who need ML; Deep Learning is out of scope; Recommendation is high-priority for us
      - We welcome your contributions to Apache Hivemall
- 66. Any questions or comments?
