
Hadoopsummit16 myui


Published on

Talk at Hadoop Summit 2016, Tokyo on Oct 26, 2016.
http://hadoopsummit.org/tokyo/agenda/



  1. 1. Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig Makoto Yui (Research Engineer, Treasure Data) @myui • Takeshi Yamamuro (Research Engineer, NTT) @maropu 2016/10/26 Hadoop Summit '16, Tokyo #hivemall
  2. 2. Plan of the talk 1. Introduction to Hivemall 2. Hivemall on Spark 2016/10/26 Hadoop Summit '16, Tokyo 2
  3. 3. 2016/10/26 Hadoop Summit '16, Tokyo 3 Hivemall enters Apache Incubator on Sept 13, 2016 🎉 http://incubator.apache.org/projects/hivemall.html @ApacheHivemall
  4. 4. • Makoto Yui <Treasure Data> • Takeshi Yamamuro <NTT> – Hivemall on Apache Spark • Daniel Dai <Hortonworks> – Hivemall on Apache Pig – Apache Pig PMC member • Tsuyoshi Ozawa <NTT> – Apache Hadoop PMC member • Kai Sasaki <Treasure Data> 2016/10/26 Hadoop Summit '16, Tokyo 4 Initial committers
  5. 5. Champion Nominated Mentors 2016/10/26 Hadoop Summit '16, Tokyo 5 Project mentors • Reynold Xin <Databricks, ASF member> Apache Spark PMC member • Markus Weimer <Microsoft, ASF member> Apache REEF PMC member • Xiangrui Meng <Databricks, ASF member> Apache Spark PMC member • Roman Shaposhnik <Pivotal, ASF member> Apache Bigtop/Incubator PMC member
  6. 6. What is Apache Hivemall Scalable machine learning library built as a collection of Hive UDFs 2016/10/26 Hadoop Summit '16, Tokyo github.com/myui/hivemall hivemall.incubator.apache.org
  7. 7. Hivemall’s Vision: ML on SQL 2016/10/26 Hadoop Summit '16, Tokyo 7 Classification with Mahout CREATE TABLE lr_model AS SELECT feature, -- reducers perform model averaging in parallel avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t -- map-only task GROUP BY feature; -- shuffled to reducers ✓ Machine Learning made easy for SQL developers ✓ Interactive and Stable APIs w/ SQL abstraction This SQL query automatically runs in parallel on Hadoop
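The model-averaging pattern in the query above can be sketched in plain Scala (a simplified illustration, not Hivemall code; the shard data is made up): each parallel learner emits (feature, weight) pairs, and averaging per feature merges them into one model, just as the `GROUP BY feature` with `avg(weight)` does on the reducers.

```scala
// A plain-Scala sketch (not Hivemall code) of the model-averaging
// pattern: parallel learners each emit (feature, weight) pairs,
// and the GROUP BY side averages them per feature.
val mapperOutputs: Seq[(String, Double)] = Seq(
  ("age", 0.4), ("age", 0.6),       // two map tasks, different shards
  ("income", 1.0), ("income", 2.0)
)

// SELECT feature, avg(weight) ... GROUP BY feature
val model: Map[String, Double] =
  mapperOutputs.groupBy(_._1).map { case (f, ws) =>
    f -> ws.map(_._2).sum / ws.size
  }
```

Because averaging is associative over partial sums, each reducer can merge its features independently, which is why the query parallelizes so naturally on Hadoop.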
  8. 8. Hadoop HDFS MapReduce (MRv1) Hivemall Apache YARN Apache Tez DAG processing Machine Learning Query Processing Parallel Data Processing Framework Resource Management Distributed File System Cloud Storage SparkSQL Apache Spark MESOS Hive Pig MLlib Hivemall’s Technology Stack 2016/10/26 Hadoop Summit '16, Tokyo 8 Amazon S3
  9. 9. List of supported Algorithms Classification ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification 2016/10/26 Hadoop Summit '16, Tokyo 9 Regression ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class. Factorization Machines are a good fit when features are sparse and categorical.
  10. 10. List of Algorithms for Recommendation 2016/10/26 Hadoop Summit '16, Tokyo 10 K-Nearest Neighbor ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular) Matrix Completion ✓ Matrix Factorization ✓ Factorization Machines (regression) each_top_k function of Hivemall is useful for recommending top-k items
  11. 11. 2016/10/26 Hadoop Summit '16, Tokyo 11 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing student class score 3 a 90 2 a 80 1 b 70 6 b 60 List top-2 students for each class
  12. 12. 2016/10/26 Hadoop Summit '16, Tokyo 12 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT * FROM ( SELECT *, rank() over (partition by class order by score desc) as rank FROM table ) t WHERE rank <= 2
  13. 13. 2016/10/26 Hadoop Summit '16, Tokyo 13 student class score 1 b 70 2 a 80 3 a 90 4 b 50 5 a 70 6 b 60 Top-k query processing List top-2 students for each class SELECT each_top_k( 2, class, score, class, student ) as (rank, score, class, student) FROM ( SELECT * FROM table DISTRIBUTE BY class SORT BY class ) t
  14. 14. 2016/10/26 Hadoop Summit '16, Tokyo 14 Top-k query processing by RANK OVER(): each node partitions by class, sorts by (class, score), runs rank over(), and filters rank <= 2
  15. 15. 2016/10/26 Hadoop Summit '16, Tokyo 15 Top-k query processing by EACH_TOP_K: rows are distributed by class and sorted by class, then each_top_k OUTPUTs only K items per class
  16. 16. 2016/10/26 Hadoop Summit '16, Tokyo 16 Comparison between RANK and EACH_TOP_K: RANK must sort by (class, score) and process ALL rows (SORTING IS HEAVY), while EACH_TOP_K only distributes by class, sorts by class, and OUTPUTs only K items. each_top_k is very efficient when the number of classes is large because it uses a Bounded Priority Queue
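The bounded-priority-queue idea can be sketched in plain Scala (an illustration of the technique, not Hivemall's implementation; `eachTopK` and the row tuples are made up for this example): keep at most k rows per class in a min-heap, so memory stays O(k) per class and no full sort over all rows is needed.

```scala
import scala.collection.mutable.PriorityQueue

// (student, class, score) rows from slide 11
type Row = (Int, String, Int)

// Keep at most k rows per class in a min-heap on score:
// the weakest kept row sits on top and is evicted first.
def eachTopK(k: Int, rows: Seq[Row]): Map[String, Seq[Row]] =
  rows.groupBy(_._2).map { case (cls, members) =>
    val heap = PriorityQueue.empty[Row](Ordering.by((r: Row) => -r._3))
    for (row <- members) {
      heap.enqueue(row)
      if (heap.size > k) heap.dequeue()  // evict the lowest score
    }
    cls -> heap.dequeueAll.reverse       // highest score first
  }

val rows = Seq((1, "b", 70), (2, "a", 80), (3, "a", 90),
               (4, "b", 50), (5, "a", 70), (6, "b", 60))
val top2 = eachTopK(2, rows)
// top2("a") == Seq((3, "a", 90), (2, "a", 80))
// top2("b") == Seq((1, "b", 70), (6, "b", 60))
```

Each insertion costs O(log k), so processing n rows is O(n log k) instead of the O(n log n) full sort that RANK OVER() requires.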
  17. 17. Performance reported by TD customer 2016/10/26 Hadoop Summit '16, Tokyo 17 • 1,000 students in each class • 20 million classes The RANK over() query does not finish in 24 hours, while EACH_TOP_K finishes in 2 hours. For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
  18. 18. Other Supported Algorithms 2016/10/26 Hadoop Summit '16, Tokyo 18 Anomaly Detection ✓ Local Outlier Factor (LOF) Feature Engineering ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ TF-IDF vectorizer ✓ Polynomial Expansion (Feature Pairing) ✓ Amplifier NLP ✓ Basic English text Tokenizer ✓ Japanese Tokenizer (Kuromoji)
  19. 19. Industry use cases of Hivemall CTR prediction of Ad click logs • Algorithm: Logistic regression • Freakout Inc., Smartnews, and more Gender prediction of Ad click logs • Algorithm: Classification • Scaleout Inc. 2016/10/26 Hadoop Summit '16, Tokyo 19 http://www.slideshare.net/eventdotsjp/hivemall
  20. 20. Industry use cases of Hivemall Item/User recommendation • Algorithm: Recommendation • Wish.com, GMO pepabo 2016/10/26 Hadoop Summit '16, Tokyo 20 Problem: Hot-item recommendation is hard in a hand-crafted product market because each creator sells only a few one-of-a-kind items, which quickly go out of stock minne.com
  21. 21. Industry use cases of Hivemall Value prediction of Real estates • Algorithm: Regression • Livesense 2016/10/26 Hadoop Summit '16, Tokyo 21
  22. 22. Industry use cases of Hivemall User score calculation • Algorithm: Regression • Klout 2016/10/26 Hadoop Summit '16, Tokyo 22 bit.ly/klout-hivemall Influencer marketing klout.com
  23. 23. 2016/10/26 Hadoop Summit '16, Tokyo 23
  24. 24. Efficient algorithm for finding change point and outliers from timeseries data 2016/10/26 Hadoop Summit '16, Tokyo 24 J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006. Anomaly/Change-point Detection by ChangeFinder
  25. 25. Efficient algorithm for finding change point and outliers from timeseries data 2016/10/26 Hadoop Summit '16, Tokyo 25 Anomaly/Change-point Detection by ChangeFinder J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  26. 26. 2016/10/26 Hadoop Summit '16, Tokyo 26 • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005. • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007. Change-point detection by Singular Spectrum Transformation
  27. 27. 2016/10/26 Hadoop Summit '16, Tokyo 27 Evaluation Metrics
  28. 28. 2016/10/26 Hadoop Summit '16, Tokyo 28 Feature Engineering – Feature Binning Maps quantitative variables to a fixed number of bins based on quantiles/distribution, e.g., mapping ages into 3 bins
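Quantile-based binning can be sketched in a few lines of Scala (a rough equal-frequency illustration, not Hivemall's exact algorithm; the ages, `thresholds`, and `binIndex` are made up for this example): pick bin boundaries from the sorted values, then map each value to its 1-based bin index.

```scala
// A rough sketch of equal-frequency (quantile) binning into 3 bins.
val ages = Seq(18.0, 22.0, 25.0, 31.0, 40.0, 55.0)

val sorted = ages.sorted
// upper edges of bin 1 and bin 2 (roughly the 1/3 and 2/3 quantiles)
val thresholds = Seq(sorted(ages.size / 3 - 1), sorted(2 * ages.size / 3 - 1))

// 1-based bin index: count how many thresholds the value exceeds
def binIndex(thresholds: Seq[Double], v: Double): Int =
  thresholds.count(_ < v) + 1

val bins = ages.map(binIndex(thresholds, _))
// 18 and 22 fall in bin 1, 25 and 31 in bin 2, 40 and 55 in bin 3
```

Choosing boundaries from quantiles (rather than equal-width ranges) keeps each bin roughly equally populated even when the distribution is skewed.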
  29. 29. 2016/10/26 Hadoop Summit '16, Tokyo 29 Feature Selection – Signal Noise Ratio
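The idea behind signal-to-noise-ratio feature selection can be sketched as follows (an illustration assuming one common two-class definition, |mean1 - mean2| / (std1 + std2), not necessarily Hivemall's exact formula; the sample values are made up): features whose per-class means are far apart relative to their spreads score high and get selected.

```scala
// SNR feature selection sketch: |mean1 - mean2| / (std1 + std2).
def mean(xs: Seq[Double]): Double = xs.sum / xs.size
def std(xs: Seq[Double]): Double = {
  val m = mean(xs)
  math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
}
def snr(pos: Seq[Double], neg: Seq[Double]): Double =
  math.abs(mean(pos) - mean(neg)) / (std(pos) + std(neg))

// feature values for positive vs. negative examples
val separating = snr(Seq(5.0, 6.0, 7.0), Seq(1.0, 2.0, 3.0))
val noisy      = snr(Seq(5.0, 1.0, 3.0), Seq(4.0, 2.0, 3.0))
// separating > noisy, so the first feature would be kept
```

A higher SNR means the feature discriminates the classes better, so ranking features by SNR and keeping the top ones is a cheap filter-style selection step.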
  30. 30. 2016/10/26 Hadoop Summit '16, Tokyo 30 Feature Selection – Chi-Square
  31. 31. 2016/10/26 Hadoop Summit '16, Tokyo 31 Feature Transformation – Onehot encoding Maps a categorical variable to a unique number starting from 1
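The mapping step the slide describes can be sketched in Scala (an illustration of the indexing, not Hivemall's implementation; `categoryIndex` is a made-up helper name): each distinct category gets a unique 1-based index, here assigned in order of first appearance.

```scala
// Assign each distinct category a unique 1-based index.
def categoryIndex(values: Seq[String]): Map[String, Int] =
  values.distinct.zipWithIndex.map { case (v, i) => v -> (i + 1) }.toMap

val colors = Seq("red", "blue", "red", "green")
val index = categoryIndex(colors)   // red -> 1, blue -> 2, green -> 3
val encoded = colors.map(index)
// encoded == Seq(1, 2, 1, 3)
```

These indices then serve as the feature positions of the one-hot representation, where each row sets exactly one of them.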
  32. 32. ✓ Spark 2.0 support ✓ XGBoost Integration ✓ Field-aware Factorization Machines ✓ Generalized Linear Model • Optimizer framework including ADAM • L1/L2 regularization 2016/10/26 Hadoop Summit '16, Tokyo 32 Other new features to come
  33. 33. Copyright©2016 NTT corp. All Rights Reserved. Hivemall on Takeshi Yamamuro @ NTT Hadoop Summit 2016 Tokyo, Japan
  34. 34. 34Copyright©2016 NTT corp. All Rights Reserved. Who am I?
  35. 35. 35Copyright©2016 NTT corp. All Rights Reserved. What's Spark? • 1. Unified Engine • supports end-to-end apps, e.g., MLlib and Streaming • 2. High-level APIs • easy to use, rich optimization • 3. Integrates broadly • storages, libraries, ...
  36. 36. 36Copyright©2016 NTT corp. All Rights Reserved. • Hivemall wrapper for Spark • Wrapper implementations for DataFrame/SQL • plus some utilities that make it easy to use in Spark • The wrapper lets you... • run most of the Hivemall functions in Spark • try examples easily on your laptop • get better performance for some functions in Spark What's Hivemall on Spark?
  37. 37. 37Copyright©2016 NTT corp. All Rights Reserved. • Hivemall already has many fascinating ML algorithms and useful utilities • High barriers to adding newer algorithms in MLlib Why Hivemall on Spark? https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  38. 38. 38Copyright©2016 NTT corp. All Rights Reserved. • Most of the Hivemall functions are supported in Spark v1.6 and v2.0 Current Status - For Spark v2.0 $ git clone https://github.com/myui/hivemall $ cd hivemall $ mvn package -Pspark-2.0 -DskipTests $ ls target/*spark* target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar ...
  39. 39. 39Copyright©2016 NTT corp. All Rights Reserved. • Most of the Hivemall functions are supported in Spark v1.6 and v2.0 Current Status - For Spark v2.0 $ git clone https://github.com/myui/hivemall $ cd hivemall $ mvn package -Pspark-2.0 -DskipTests $ ls target/*spark* target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar ... -Pspark-1.6 for Spark v1.6
  40. 40. 40Copyright©2016 NTT corp. All Rights Reserved. • 1. Download a Spark binary • 2. Fetch training and test data • 3. Load these data in Spark • 4. Build a model • 5. Do predictions Running an Example
  41. 41. 41Copyright©2016 NTT corp. All Rights Reserved. 1. Download a Spark binary Running an Example http://spark.apache.org/downloads.html
  42. 42. 42Copyright©2016 NTT corp. All Rights Reserved. • E2006 tfidf regression dataset • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf 2. Fetch training and test data Running an Example $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2 $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
  43. 43. 43Copyright©2016 NTT corp. All Rights Reserved. 3. Load data in Spark Running an Example $ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar // Create DataFrame from the bzip'd libsvm-formatted file scala> val trainDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.train.bz2") scala> trainDf.printSchema root |-- label: double (nullable = true) |-- features: vector (nullable = true)
  44. 44. 44Copyright©2016 NTT corp. All Rights Reserved. 3. Load data in Spark Running an Example [Figure: trainDf is split into Partition1..PartitionN, each holding rows of the libsvm-formatted data] Loaded in parallel because bzip2 is splittable
  45. 45. 45Copyright©2016 NTT corp. All Rights Reserved. 4. Build a model - DataFrame Running an Example scala> :paste val modelDf = trainDf.train_logregr($"features", $"label") .groupBy("feature") .agg("weight" -> "avg")
  46. 46. 46Copyright©2016 NTT corp. All Rights Reserved. 4. Build a model - SQL Running an Example scala> trainDf.createOrReplaceTempView("TrainTable") scala> :paste val modelDf = sql(""" | SELECT feature, AVG(weight) AS weight | FROM ( | SELECT train_logregr(features, label) | AS (feature, weight) | FROM TrainTable | ) | GROUP BY feature """.stripMargin)
  47. 47. 47Copyright©2016 NTT corp. All Rights Reserved. 5. Do predictions - DataFrame Running an Example scala> :paste val df = testDf.select(rowid(), $"features") .explode_vector($"features") .cache // Do predictions df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER") .groupBy("rowid") .avg(sigmoid(sum($"weight" * $"value")))
  48. 48. 48Copyright©2016 NTT corp. All Rights Reserved. 5. Do predictions - SQL Running an Example scala> modelDf.createOrReplaceTempView("ModelTable") scala> df.createOrReplaceTempView("TestTable") scala> :paste sql(""" | SELECT rowid, sigmoid(sum(value * weight)) AS predicted | FROM TestTable t | LEFT OUTER JOIN ModelTable m | ON t.feature = m.feature | GROUP BY rowid """.stripMargin)
  49. 49. 49Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d)) scala> val sparkUdf = functions.udf(sigmoidFunc) scala> df.select(sparkUdf($"value"))
  50. 50. 50Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark scala> val hiveUdf = HivemallOps.sigmoid scala> df.select(hiveUdf($"value"))
  51. 51. 51Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.1) Compute a sigmoid function Improve Some Functions in Spark
  52. 52. 52Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group Improve Some Functions in Spark scala> :paste df.withColumn( "rank", rank().over(Window.partitionBy($"key").orderBy($"score".desc))) .where($"rank" <= topK)
  53. 53. 53Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group • Fixed the overhead issue for each_top_k • See pr#353: "Implement EachTopK as a generator expression" in Spark Improve Some Functions in Spark scala> df.each_top_k(topK, "key", "score", "value")
  54. 54. 54Copyright©2016 NTT corp. All Rights Reserved. • Spark has overheads to call Hive UD*Fs • Hivemall heavily depends on them • ex.2) Compute top-k for each key group Improve Some Functions in Spark ~4 times faster than rank!!
  55. 55. 55Copyright©2016 NTT corp. All Rights Reserved. • XGBoost support under development • a fast implementation of gradient tree boosting • widely used in Kaggle competitions • This integration will let you... • load built models and predict in parallel • build multiple models in parallel for cross-validation 3rd-Party Library Integration
  56. 56. Conclusion and Takeaway Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs 2016/10/26 Hadoop Summit '16, Tokyo 56 We welcome your contributions to Apache Hivemall! HiveQL SparkSQL/Dataframe API Pig Latin
  57. 57. 2016/10/26 Hadoop Summit '16, Tokyo 57 Any feature request or questions?
