
Apache Hivemall @ Apache BigData '17, Miami



Talk at Apache BigData 2017, Miami



  1. Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig. 1) Makoto YUI (@myui), Research Engineer, Treasure Data. 2) Takeshi Yamamuro (@maropu), Research Engineer, NTT. @ApacheHivemall
  2. Plan of the talk: 1. Introduction to Hivemall. 2. Hivemall on Spark.
  3. Hivemall entered the Apache Incubator on Sept 13, 2016 🎉 Since then, we have invited two contributors as new committers (one committer has also been voted into the PPMC). We are currently working toward the first Apache release in Q2. hivemall.incubator.apache.org
  4. What is Apache Hivemall: a scalable machine learning library built as a collection of Hive UDFs. Multi/cross-platform, versatile, scalable, and easy to use.
  5. Hivemall is easy and scalable: ML made easy for SQL developers, born to be parallel and scalable. This query automatically runs in parallel on Hadoop:

      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight -- reducers perform model averaging in parallel
      FROM (
        SELECT logress(features, label, ..) as (feature, weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
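     To complement the training query above, here is a minimal prediction sketch, assuming the test data has been exploded into a test_exploded(rowid, feature, value) table (table and column names here are hypothetical, not from the slides):

      -- Join test features against the learned weights; the sigmoid of the
      -- weighted sum gives the positive-class probability per row.
      SELECT
        t.rowid,
        sigmoid(sum(m.weight * t.value)) AS prob
      FROM test_exploded t
      LEFT OUTER JOIN lr_model m ON t.feature = m.feature
      GROUP BY t.rowid;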
  6. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, and Pig Latin. Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.
  7. Hivemall's Technology Stack: Hivemall (machine learning) sits on top of query processing (Hive, Pig, SparkSQL, MLlib), DAG/parallel data processing frameworks (MapReduce (MRv1), Apache Tez, Apache Spark), resource management (Apache YARN, Mesos), and distributed file systems / cloud storage (Hadoop HDFS, Amazon S3).
  8. Hivemall on Apache Hive
  9. Hivemall on Apache Spark DataFrame
  10. Hivemall on SparkSQL
  11. Hivemall on Apache Pig
  12. Online Prediction by Apache Streaming
  13. Hivemall is a versatile library: it is not only for machine learning; it also provides a bunch of generic utility functions. Each organization has its own set of UDFs for data preprocessing. Don't Repeat Yourself!
  14. Hivemall generic functions: Array and Map, Bit and compress, String and NLP, Geo Spatial, Top-k processing (e.g., TF/IDF, TILE, MAP_URL). We welcome contributions of your generic UDFs to Hivemall!
  15. Map tiling functions
  16. Map tiling functions: Tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^zoom (examples shown at zoom = 10 and zoom = 15).
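     As a sketch of how the tiling function named above might be used, assuming Hivemall's geo-spatial UDF tile(lat, lon, zoom) and a hypothetical geo_events(lat, lon) table:

      -- Count events per map tile at zoom level 15.
      SELECT
        tile(lat, lon, 15) AS tile_id,
        count(1) AS cnt
      FROM geo_events
      GROUP BY tile(lat, lon, 15);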
  17. Top-k query processing: list the top-2 students for each class. Input:

      student  class  score
      1        b      70
      2        a      80
      3        a      90
      4        b      50
      5        a      70
      6        b      60

     Expected output:

      student  class  score
      3        a      90
      2        a      80
      1        b      70
      6        b      60
  18. Top-k query processing with RANK OVER(): list the top-2 students for each class.

      SELECT * FROM (
        SELECT *,
          rank() over (partition by class order by score desc) as rank
        FROM table
      ) t
      WHERE rank <= 2

     The RANK OVER() query does not finish within 24 hours when there are 20 million (MOOC) classes with roughly 1,000 students per class on average.
  19. Top-k query processing with EACH_TOP_K: list the top-2 students for each class.

      SELECT
        each_top_k(2, class, score, class, student)
          as (rank, score, class, student)
      FROM (
        SELECT * FROM table
        DISTRIBUTE BY class SORT BY class
      ) t

     EACH_TOP_K finishes in 2 hours.
  20. Top-k query processing by RANK OVER(): partition by class; on each node, sort by (class, score); compute rank over(); then filter rank <= 2.
  21. Top-k query processing by EACH_TOP_K: distribute by class; on each node, sort by class; each_top_k then outputs only K items.
  22. Comparison between RANK and EACH_TOP_K: the RANK plan sorts by (class, score), and sorting is heavy; it must process all rows before filtering. The EACH_TOP_K plan sorts only by class and outputs only K items per group using a bounded priority queue, so each_top_k is very efficient when the number of classes is large.
  23. List of Supported Algorithms. Classification: ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification. Regression: ✓ Logistic Regression (SGD) ✓ AdaGrad (logistic loss) ✓ AdaDELTA (logistic loss) ✓ PA Regression ✓ AROW Regression ✓ Factorization Machines ✓ RandomForest Regression. SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of the positive class; Factorization Machines are good where features are sparse and categorical.
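     Training with SCW looks just like the logistic regression query on slide 5. A minimal sketch, assuming Hivemall's train_scw UDTF over the same train(features, label) table (exact option strings may differ by version):

      -- Soft Confidence Weighted learning with model averaging, as on slide 5.
      CREATE TABLE scw_model AS
      SELECT
        feature,
        avg(weight) AS weight
      FROM (
        SELECT train_scw(features, label) AS (feature, weight)
        FROM train
      ) t
      GROUP BY feature;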
  24. RandomForest in Hivemall: ensemble of decision trees.
  25. Training of RandomForest
  26. Prediction of RandomForest
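     A minimal train/predict sketch for RandomForest, assuming the v0.5-era UDFs train_randomforest_classifier, tree_predict, and rf_ensemble (the exact model columns, option strings, and signatures differ by Hivemall version, so treat this as illustrative only):

      -- Train an ensemble of decision trees.
      CREATE TABLE rf_model AS
      SELECT train_randomforest_classifier(features, label, '-trees 50')
      FROM train;

      -- Predict by majority vote over the trees (each test row meets every tree).
      SELECT
        t.rowid,
        rf_ensemble(tree_predict(m.model_id, m.model, t.features, true)) AS predicted
      FROM test t
      CROSS JOIN rf_model m
      GROUP BY t.rowid;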
  27. Supported Algorithms for Recommendation. K-Nearest Neighbor: ✓ MinHash and b-Bit MinHash (LSH variants) ✓ Similarity search on vector space (Euclid/Cosine/Jaccard/Angular). Matrix Completion: ✓ Matrix Factorization ✓ Factorization Machines (regression). The each_top_k function of Hivemall is useful for recommending top-k items.
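     A sketch of top-k similar-item search that combines cosine similarity with each_top_k, assuming Hivemall's cosine_similarity UDF and a hypothetical item_features(itemid, features) table (the naive self-join is quadratic and shown only for illustration; MinHash would prune candidates at scale):

      SELECT
        each_top_k(10, itemid, sim, itemid, other)
          AS (rank, sim, itemid, other)
      FROM (
        SELECT * FROM (
          SELECT
            a.itemid,
            b.itemid AS other,
            cosine_similarity(a.features, b.features) AS sim
          FROM item_features a
          CROSS JOIN item_features b
          WHERE a.itemid != b.itemid
        ) s
        DISTRIBUTE BY itemid SORT BY itemid
      ) t;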
  28. Other Supported Algorithms. Feature Engineering: ✓ Feature Hashing ✓ Feature Scaling (normalization, z-score) ✓ Feature Binning ✓ TF-IDF vectorizer ✓ Polynomial Expansion ✓ Amplifier. NLP: ✓ Basic English text tokenizer ✓ Japanese tokenizer. Evaluation metrics: ✓ AUC, nDCG, logloss, precision/recall@K, etc.
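     For the NLP side, a minimal tokenization sketch, assuming Hivemall's tokenize(text, toLowerCase) UDF and a hypothetical docs(docid, contents) table:

      -- Split raw text into lowercase word arrays, e.g. as input to TF-IDF.
      SELECT
        docid,
        tokenize(contents, true) AS words
      FROM docs;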
  29. Feature Engineering – Feature Hashing
  30. Feature Engineering – Feature Hashing
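     A minimal feature hashing sketch, assuming Hivemall's feature_hashing UDF over the train(features, label) table used earlier (the hashing options are version-dependent):

      -- Hash string feature names into integer indices, bounding the
      -- dimensionality of the feature space.
      SELECT
        feature_hashing(features) AS features,
        label
      FROM train;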
  31. Feature Engineering – Feature Binning: maps quantitative variables to a fixed number of bins based on quantiles/distribution (e.g., map ages into 3 bins).
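     A feature binning sketch, assuming the v0.5-era build_bins and feature_binning UDFs and a hypothetical users(age, ...) table (exact signatures may differ by version):

      -- Derive quantile-based bin thresholds for age, then map each age to a bin.
      WITH bins AS (
        SELECT build_bins(age, 3) AS quantiles
        FROM users
      )
      SELECT
        u.age,
        feature_binning(u.age, b.quantiles) AS age_bin
      FROM users u
      CROSS JOIN bins b;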
  32. Feature Selection – Signal Noise Ratio
  33. Evaluation Metrics
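     A sketch of scoring predictions with the evaluation metrics listed on slide 28, assuming Hivemall's auc and logloss UDAFs and a hypothetical predictions(prob, label) table (auc expects its input ordered by the predicted score):

      SELECT
        auc(prob, label) AS auc,
        logloss(prob, label) AS logloss
      FROM (
        SELECT prob, label
        FROM predictions
        ORDER BY prob DESC
      ) t;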
  34. Other Supported Features. Anomaly Detection: ✓ Local Outlier Factor (LOF) ✓ ChangeFinder. Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch pLSA. Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation.
  35. Anomaly/Change-point Detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  36. Anomaly/Change-point Detection by ChangeFinder: take this…
  37. Anomaly/Change-point Detection by ChangeFinder: …and do this!
  38. Anomaly/Change-point Detection by ChangeFinder (continued; same reference as slide 35).
  39. Anomaly/Change-point Detection by ChangeFinder (continued; same reference as slide 35).
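     A ChangeFinder sketch, assuming the changefinder UDF and a hypothetical timeseries(num, value) table (the option strings and the exact output columns may differ by version):

      -- Emit an outlier score and a change-point score per observation.
      SELECT
        changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035')
          AS (outlier_score, changepoint_score, is_outlier, is_changepoint)
      FROM timeseries
      ORDER BY num ASC;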
  40. Change-point detection by Singular Spectrum Transformation. • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005. • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
  41. Online mini-batch LDA
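     A topic-model training sketch, assuming the train_lda UDTF and the tokenized docs table from the slide 28 example (the expected input format, output columns, and options of train_lda vary by version, so this is only a shape of the query):

      -- Learn 10 topics from per-document word arrays with online mini-batch LDA.
      SELECT train_lda(words, '-topics 10')
      FROM (
        SELECT tokenize(contents, true) AS words
        FROM docs
      ) t;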
  42. Other new features in development: ✓ XGBoost integration ✓ Sparse vector support in RandomForest ✓ Docker support (for evaluation) ✓ Field-aware Factorization Machines ✓ Generalized Linear Model, with an optimizer framework including ADAM and L1/L2 regularization.
  43. Hivemall on Spark
  44. Who am I: Takeshi Yamamuro, NTT Corp., Japan. OSS activities: Apache Hivemall PPMC, Apache Spark contributor.
  45. What's Spark? A distributed data analytics engine, generalizing MapReduce.
  46. What's Spark? 1. Unified engine: supports end-to-end applications, e.g., MLlib and Streaming. 2. High-level APIs: easy to use, with rich optimization. 3. Integrates broadly: storage systems, libraries, ...
  47. What's Hivemall on Spark? A Hivemall wrapper for Spark: wrapper implementations for DataFrame/SQL, plus some utilities for ease of use in Spark. The wrapper lets you run most Hivemall functions in Spark, try the examples easily on your laptop, and get improved performance for some functions in Spark.
  48. Why Hivemall on Spark? Hivemall already has many fascinating ML algorithms and useful utilities, and there are high barriers to adding newer algorithms to MLlib. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  49. Current Status: most Hivemall functions are supported in Spark v2.0 and v2.1. To compile for Spark v2.1:

      $ git clone https://github.com/apache/incubator-hivemall
      $ cd incubator-hivemall
      $ mvn package -Pspark-2.1 -DskipTests
      $ ls target/*spark*
      target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar ...
  50. Current Status: most Hivemall functions are supported in Spark v2.0 and v2.1. To compile for Spark v2.0:

      $ git clone https://github.com/apache/incubator-hivemall
      $ cd incubator-hivemall
      $ mvn package -Pspark-2.0 -DskipTests
      $ ls target/*spark*
      target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar ...
  51. 4-Step Example: 1. Fetch training and test data. 2. Load the data in Spark. 3. Build a model. 4. Make predictions.
  52. Step 1. Fetch training and test data: the E2006-tfidf regression dataset, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

      $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
      $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
  53. Step 2. Load data in Spark:

      // Download Spark v2.1 and launch a spark-shell with Hivemall
      $ <HIVEMALL_HOME>/bin/spark-shell

      // Create a DataFrame from the bzip'd libsvm-formatted file
      scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
      scala> trainDf.printSchema
      root
       |-- label: double (nullable = true)
       |-- features: vector (nullable = true)
  54. Step 2. Load data in Spark (detailed): each libsvm line is a label followed by index:value pairs (e.g., 0.000357499151147113 6066:0.00079327... 6069:0.00031137...). The file is loaded into the partitions of trainDf (Partition1 ... PartitionN) in parallel, because bzip2 is splittable.
  55. Step 3. Build a model - DataFrame:

      scala> :paste
      val modelDf = trainDf.train_logregr($"features", $"label")
        .groupBy("feature")
        .agg("weight" -> "avg")

  56. Step 3. Build a model - SQL:

      scala> trainDf.createOrReplaceTempView("TrainTable")
      scala> :paste
      val modelDf = sql("""
        | SELECT feature, AVG(weight) AS weight
        | FROM (
        |   SELECT train_logregr(features, label)
        |     AS (feature, weight)
        |   FROM TrainTable
        | )
        | GROUP BY feature
        """.stripMargin)
  57. Step 4. Make predictions - DataFrame:

      scala> :paste
      val df = testDf.select(rowid(), $"features")
        .explode_vector($"features")
        .cache

      // Make predictions
      df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
        .groupBy("rowid")
        .agg(sigmoid(sum($"weight" * $"value")))

  58. Step 4. Make predictions - SQL:

      scala> modelDf.createOrReplaceTempView("ModelTable")
      scala> df.createOrReplaceTempView("TestTable")
      scala> :paste
      sql("""
        | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
        | FROM TestTable t
        | LEFT OUTER JOIN ModelTable m
        |   ON t.feature = m.feature
        | GROUP BY rowid
        """.stripMargin)
  59. Add New Features in Spark – Top-K Join: join two relations and compute top-K for each group, using vanilla Spark APIs:

      scala> :paste
      val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
        .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
        .withColumn(
          "rank",
          rank().over(partitionBy($"group").orderBy($"score".desc))
        )
        .where($"rank" <= topK)

  60. The vanilla-API version above works in three steps: 1. join leftDf with rightDf; 2. sort the data in each group; 3. filter the top-K rows.
  61. Add New Features in Spark – Top-K Join, using a fused API implemented in Hivemall:

      scala> :paste
      val topkDf = leftDf.top_k_join(
        lit(topK),
        rightDf,
        leftDf("group") === rightDf("group"),
        (leftDf("x") + rightDf("y")).as("score")
      )
  62. How does top_k_join work? leftDf(group, x) is joined with rightDf(group, y).
  63. How does top_k_join work? The top-K rows are computed using a K-length priority queue.
  64. How does top_k_join work? Only the top-K rows are joined.
  65. Codegen'd top_k_join for fast processing: Spark internally generates Java code from a built physical plan, and compiles/executes it (Spark planner (Catalyst) overview).
  66. Codegen'd top_k_join for fast processing: Spark internally generates Java code from a built physical plan, and compiles/executes it (codegen of a physical plan).
  67. Codegen'd top_k_join for fast processing:

      scala> topkDf.explain
      == Physical Plan ==
      *ShuffledHashJoinTopK 1, [group#10], [group#27]
      :- Exchange hashpartitioning(group#10, 200)
      :  +- LocalTableScan [group#10, x#11]
      +- Exchange hashpartitioning(group#27, 200)
         +- LocalTableScan [group#27, y#28]

     A '*' at the head of an operator means a codegen'd plan.
  68. Benchmark result: top_k_join is ~13 times faster than the vanilla APIs!
  69. Add New Features in Spark – support more useful functions. Spark only implements naive and basic functions, in terms of usability and maintainability, e.g.: flatten (flatten a nested schema into a flat one), from_csv/to_csv (interconversion of CSV strings and structured data with schemas), ... See more in the Hivemall user guide: https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html
  70. Improve Some Functions in Spark: Spark has overheads when calling Hive UD*Fs, and Hivemall heavily depends on them. Ex. 1) Compute a sigmoid function with a plain Spark UDF:

      scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
      scala> val sparkUdf = functions.udf(sigmoidFunc)
      scala> df.select(sparkUdf($"value"))

  71. Ex. 1, continued) Compute a sigmoid function with the Hivemall wrapper:

      scala> val hiveUdf = HivemallOps.sigmoid
      scala> df.select(hiveUdf($"value"))

  72. Ex. 1, continued) Benchmark of the two sigmoid implementations.
  73. Ex. 2) Compute top-k for each key group with rank over a window:

      scala> :paste
      df.withColumn(
        "rank",
        rank().over(Window.partitionBy($"key").orderBy($"score".desc))
      )
      .where($"rank" <= topK)

  74. Ex. 2, continued) The overhead issue was fixed for each_top_k; see pr#353, "Implement EachTopK as a generator expression":

      scala> df.each_top_k(topK, "key", "score", "value")

  75. Ex. 2, continued) each_top_k is ~4 times faster than rank!
  76. Support a 3rd-Party Library in Spark: XGBoost support is under development. XGBoost is a fast implementation of gradient tree boosting, widely used in Kaggle competitions. This integration will let you load built models and predict in parallel, and build multiple models in parallel for cross-validation.
  77. Conclusion and Takeaway: Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs, usable from HiveQL, the SparkSQL/DataFrame API, and Pig Latin. We welcome your contributions to Apache Hivemall!
  78. Any feature requests or questions?
