Datascientistsymp1113

Talk at the 2nd Data Scientist Symposium on Nov 13, 2015.
http://www.datascientist.or.jp/symp/2015/

Published in: Data & Analytics

  1. 1. Machine Learning as a Service in Treasure Data. Makoto YUI (@myui, <myui@treasure-data.com>), Research Engineer. 2014/11/13 Japan DataScientist Org. 2nd Symposium
  2. 2. Who am I? • 2015.04: Joined Treasure Data, Inc. as its first Research Engineer; my mission at TD is developing ML-as-a-Service. • 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan; worked on a large-scale machine learning project and parallel databases. • 2009.03: Ph.D. in Computer Science from NAIST. 2014/09/17 Talk@Japan DataScientist Society
  3. 3. What Treasure Data provides: Treasure Data = Cloud Data Lake (not just a Hadoop-as-a-Service). 1. Collect and Store 2. Transform 3. Export 4. Analyze. Diagram labels: integration with other products, SQL Server, CRM, RDBMS, app logs, sensors, web logs, ERP, batch analysis, ad hoc analysis, API, ODBC, JDBC, PUSH, Treasure Agent, analytics tool integration, data visualization and sharing, Treasure Data Collectors, embedded, Embulk, mobile SDK, JS SDK.
  4. 4. Stats in Treasure Data: 100+ customers in Japan; 22 trillion records stored; 4,000 servers at most owned by a single customer; 900,000 records stored per second.
  5. 5. Customers of Treasure Data: http://www.treasuredata.com/jp/customers
  6. 6. Treasure Data supports ML-as-a-Service (platform diagram with the same labels as slide 3, plus "Machine Learning").
  7. 7. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. https://github.com/myui/hivemall 2015/10/20 Hivemall meetup #2
  8. 8. What is Hivemall: a scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2. Stack, top to bottom: Machine Learning: Hivemall; Query Processing: Hive / PIG; Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez DAG processing, MR v2; Resource Management: Apache YARN; Distributed File System: Hadoop HDFS.
  9. 9. Awarded in IDG's InfoWorld Bossie Awards 2014: The best open source big data tools. InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem. bit.ly/hivemall-award
  10. 10. Hivemall's Vision: ML on SQL. Classification with Mahout. Hivemall query:
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This SQL query automatically runs in parallel on Hadoop. ✓ Machine Learning made easy for SQL developers (ML for the rest of us) ✓ Interactive and stable APIs w/ SQL abstraction
  11. 11. Industry use cases of Hivemall • CTR prediction of ad click logs: Freakout Inc. and more • Gender prediction of ad click logs: Scaleout Inc. • Churn detection: OISIX and more • Item/user recommendation: adtech companies, ISP portal, and more • Value prediction of real estate: Livesense
  12. 12. List of Features in Hivemall v0.3.2. Classification (both binary- and multi-class): ✓ Perceptron ✓ Passive Aggressive (PA) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA. Regression: ✓ Logistic Regression (SGD) ✓ PA Regression ✓ AROW Regression ✓ AdaGrad ✓ AdaDELTA. kNN and Recommendation: ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search using k-NN (Euclid/Cosine/Jaccard/Angular) ✓ Matrix Factorization. Feature engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ TF-IDF vectorizer ✓ Polynomial Expansion. Anomaly Detection: ✓ Local Outlier Factor
  13. 13. Features supported in Hivemall v0.4: 1. RandomForest • classification, regression 2. Factorization Machine • classification, regression (factorization). Treasure Data now supports v0.4.0-2. Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction).
  14. 14. RandomForest in Hivemall v0.4: Ensemble of Decision Trees
  15. 15. Training of RandomForest
  16. 16. Prediction of RandomForest
  17. 17. Features to be supported in Hivemall v0.4.1: 1. Gradient Tree Boosting • classification, regression 2. Field-aware Factorization Machine • classification, regression (factorization) • The existing implementation, i.e., LibFFM, can only be applied to classification 3. NLP Tokenizer (morphological analysis). v0.4.1 is planned for release in Dec.
  18. 18. Features to be supported in Hivemall v0.4.2: 1. Online LDA • topic modeling, clustering 2. Mix server on Apache YARN • Service for parameter sharing among workers. v0.4.2 is planned for release in Jan.
  19. 19. Conclusion and Takeaway. Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs. Hivemall's positioning: • For SQL users that need ML • Ease of use and scalability in mind. Treasure Data provides ML-as-a-Service using Hivemall. Major development leaps in v0.4: • Random Forest • Factorization Machine. More will follow in v0.4.1.
  20. 20. Beyond Query-as-a-Service! We Open-source! We invented ..
  21. 21. Real-time Prediction on Treasure Data: run a batch training job periodically; export periodically; predict in real time on an RDBMS.
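
The query on slide 10 and the flow on slide 21 can be sketched together in plain Python. This is only an illustration under stated assumptions, not Hivemall's implementation: `train_partition`, `average_models`, and `predict` are hypothetical names, the SGD update is a textbook logistic regression rather than the actual `logress` UDTF, and the idea that the exported artifact is a feature-to-weight table is an assumption about what the periodic export carries. Each partition trains its own model (the map-only step), weights are averaged per feature (the `GROUP BY feature` with `avg(weight)`), and prediction is a lookup plus a sigmoid, which is simple enough to run against a table in an RDBMS.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=20):
    """Train logistic regression by SGD on one partition (stand-in for one mapper).

    rows: list of (features, label), features is {name: value}, label is 0 or 1.
    Returns a {feature: weight} dict, analogous to (feature, weight) rows.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            gradient = label - p
            for f, v in features.items():
                w[f] += eta * gradient * v
    return dict(w)


def average_models(models):
    """Average weights per feature over the models that emitted that feature.

    Mirrors SELECT feature, avg(weight) ... GROUP BY feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in models:
        for f, weight in model.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


def predict(model, features):
    """Score one row: sigmoid of the weighted sum; unknown features count as 0."""
    z = sum(model.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (made up for illustration): two partitions of the training table.
partition_1 = [({"a": 1.0, "bias": 1.0}, 1), ({"b": 1.0, "bias": 1.0}, 0)]
partition_2 = [({"a": 1.0, "bias": 1.0}, 1), ({"c": 1.0, "bias": 1.0}, 0)]

models = [train_partition(partition_1), train_partition(partition_2)]
model = average_models(models)  # the table that would be exported periodically

print(predict(model, {"a": 1.0, "bias": 1.0}))
print(predict(model, {"b": 1.0, "bias": 1.0}))
```

Keeping the model as a flat feature-to-weight table is what makes the split between batch training and online scoring cheap in this sketch: the serving side needs only a join on feature names and one sigmoid per row.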