Hivemall: Scalable Machine Learning Library for Apache Hive

  • 30m talk + 10m QA
  • I would like to know how many people in this room are using them.
    I’m going to take a quick poll now.

    Could you raise your hand if you are using
  • Hivemall: Scalable Machine Learning Library for Apache Hive

    1. Hivemall: Scalable Machine Learning Library for Apache Hive
        Makoto YUI (m.yui@aist.go.jp, @myui)
        National Institute of Advanced Industrial Science and Technology (AIST), Japan
        Hadoop Summit 2014, San Jose 1 / 43
    2. Plan of the talk
        • What is Hivemall
        • Why Hivemall
        • What Hivemall can do
        • How to use Hivemall
        • How Hivemall works
        • How to deal with iterations, compared with Spark
        • Experimental Evaluation
        • Conclusion
    3. What is Hivemall
        • A collection of machine learning algorithms implemented as Hive UDFs/UDTFs
            • Classification & Regression
            • Recommendation
            • k-Nearest Neighbor Search
            • ... and more
        • An open source project on GitHub
            • Licensed under LGPL
            • github.com/myui/hivemall (bit.ly/hivemall)
            • 4 contributors
    4. Reactions to the release
    5. Reactions to the release
    6. Motivation: Why a new ML framework?
        Machine learning frameworks out there that run with Hadoop:
        • Mahout?
        • Vowpal Wabbit? (w/ Hadoop streaming)
        • Spark MLlib?
        • 0xdata H2O?
        • Cloudera Oryx?
        Quick poll: How many people in this room are using them?
    7. Motivation: Why a new ML framework?
        Existing distributed machine learning frameworks are NOT easy to use.

        Framework                             User interface
        Mahout                                Java API programming
        Spark MLlib/MLI                       Scala API programming; Scala shell (REPL)
        H2O                                   R programming; GUI
        Cloudera Oryx                         HTTP REST API programming
        Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming; command line
    8. Classification with Mahout
        org/apache/mahout/classifier/sgd/TrainNewsGroups.java
        Find the complete code at bit.ly/news20-mahout
    9. Why Hivemall
        1. Ease of use
        • No programming
            • Every machine learning step is done within HiveQL
        • No compilation/packaging overhead
        • Easy for existing Hive users
        • You can evaluate Hivemall within 5 minutes or so
            • Installation is just as follows
    10. Why Hivemall
        2. Scalable to data
        • Scalable to the number of training/testing instances
        • Scalable to the number of features
            • Built-in support for feature hashing
        • Scalable to the size of the prediction model
            • Suppose there are 200 labels * 100 million features ⇒ requires 150 GB
            • Hivemall does not require the prediction model to fit in memory, in either training or prediction
        • The feature engineering step is also scalable and parallelized using Hive
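A minimal sketch of the two ideas on this slide: feature hashing and the size of a dense multi-class model. It is not Hivemall's implementation. The CRC32 hash, the `name:value` feature format, the 2^24 bucket count, and the 8 bytes per weight are all assumptions chosen for illustration.

```python
import zlib


def hash_feature(name: str, num_buckets: int = 2 ** 24) -> int:
    """Map a feature name to a bucket index in [0, num_buckets).

    CRC32 is only a stand-in hash for this sketch.
    """
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def hash_features(features, num_buckets: int = 2 ** 24) -> dict:
    """Hash 'name:value' strings into {bucket_index: summed_value}.

    Features that collide in the same bucket have their values added.
    """
    hashed = {}
    for item in features:
        name, _, value = item.rpartition(":")
        idx = hash_feature(name, num_buckets)
        hashed[idx] = hashed.get(idx, 0.0) + float(value)
    return hashed


def dense_model_bytes(num_labels: int, num_features: int,
                      bytes_per_weight: int = 8) -> int:
    """Memory needed to hold one weight per (label, feature) pair."""
    return num_labels * num_features * bytes_per_weight


if __name__ == "__main__":
    print(hash_features(["like:1.0", "dull:1.0"], num_buckets=1024))
    size = dense_model_bytes(200, 100_000_000)
    print(f"{size / 2 ** 30:.0f} GiB")  # roughly the 150 GB quoted on the slide
```

Under the 8-bytes-per-weight assumption, 200 labels times 100 million features comes to 1.6e11 bytes, about 149 GiB, which is consistent with the 150 GB figure.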
    11. Why Hivemall
        3. Scalable to computing resources
        • Exploits the benefits of Hadoop & Hive
        • Provisioning the machine learning service on Amazon Elastic MapReduce
            • Provides an EMR bootstrap for automated setup
        Find an example at bit.ly/hivemall-emr
    12. Why Hivemall
        4. Supports state-of-the-art online learning algorithms (for classification)
        • Fewer configuration parameters (no learning rate, unlike SGD)
        • CW, AROW [1], and SCW [2] are not yet supported in the other ML frameworks
        • Surprisingly fast convergence (a few iterations are enough)
        [1] Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
        [2] Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012
    13. Why Hivemall
        4. Supports state-of-the-art online learning algorithms (for classification)

        Algorithm                                 News20.binary classification accuracy
        Perceptron                                0.9460
        Passive-Aggressive (a.k.a. Online-SVM)    0.9604
        LibLinear                                 0.9636
        LibSVM/TinySVM                            0.9643
        Confidence Weighted (CW)                  0.9656
        AROW [1]                                  0.9660
        SCW [2]                                   0.9662

        CW variants are very smart online ML algorithms.
    14. Why are CW variants so good?
        Suppose a binary classification setting that classifies sentences as positive or negative → learn the weight for each word (each word is a feature).

        Feature vector                                   Label
        I like this author                               Positive
        I like this author, but found this book dull     Negative

        • A naïve update will reduce both W_like and W_dull at the same rate
        • CW variants adjust weights at different rates
    15. Why are CW variants so good?
        [Figure with two panels: "Adjust a weight" and "Adjust a weight & confidence". Axes: weight and confidence (covariance). Annotation: "At this confidence, the weight is 0.5".]
    16. Why Hivemall
        4. Supports state-of-the-art online learning algorithms (for classification)
        • Fast convergence properties
            • Perform a small update where confidence is already high
            • Perform a large update where confidence is low (e.g., at the beginning)
        • A few iterations are enough
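The confidence-dependent update size described above can be sketched with AROW, one of the CW variants cited in the talk. This is a simplified version that keeps only a diagonal covariance; the function name, the dictionary-based sparse vectors, and the default `r = 0.1` are assumptions for illustration, not Hivemall's code.

```python
def arow_update(w, sigma, x, y, r=0.1):
    """One AROW step with a diagonal covariance (illustrative sketch).

    w     : dict feature -> weight (mean), missing entries count as 0.0
    sigma : dict feature -> variance, missing entries count as 1.0
            (high variance means low confidence)
    x     : dict feature -> value
    y     : +1 or -1
    """
    margin = sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - y * margin)
    if loss == 0.0:
        return  # confident and correct: no update
    variance = sum(sigma.get(f, 1.0) * v * v for f, v in x.items())
    beta = 1.0 / (variance + r)
    alpha = loss * beta
    for f, v in x.items():
        s = sigma.get(f, 1.0)
        w[f] = w.get(f, 0.0) + alpha * y * s * v  # step scaled by variance
        sigma[f] = s - beta * s * s * v * v       # confidence grows


if __name__ == "__main__":
    w, sigma = {}, {}
    for _ in range(2):
        arow_update(w, sigma, {"like": 1.0, "author": 1.0}, +1)
    before = dict(w)
    arow_update(w, sigma, {"like": 1.0, "dull": 1.0}, -1)
    print("change in like:", w["like"] - before["like"])
    print("change in dull:", w["dull"] - before.get("dull", 0.0))
```

In the usage example, "like" has already been seen twice, so its variance is low and the negative example moves it only a little, while the unseen word "dull" takes most of the correction.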
    17. Plan of the talk (same outline as slide 2)
    18. What Hivemall can do
        • Classification (both one- and multi-class)
            • Perceptron
            • Passive Aggressive (PA)
            • Confidence Weighted (CW)
            • Adaptive Regularization of Weight Vectors (AROW)
            • Soft Confidence Weighted (SCW)
        • Regression
            • Logistic Regression using Stochastic Gradient Descent (SGD)
            • PA Regression
            • AROW Regression
        • k-Nearest Neighbor & Recommendation
            • Minhash and b-Bit Minhash (LSH variant)
            • Brute-force search using similarity measures (cosine similarity)
        • Feature engineering
            • Feature hashing
            • Feature scaling (normalization, z-score)
    19. How to use Hivemall
        [Workflow diagram with labels: Machine Learning, Training, Prediction, Prediction Model, Label, Feature Vector. Highlighted step: Data preparation]
    20. How to use Hivemall - Data preparation
        Define a Hive table for training/testing data:

        CREATE EXTERNAL TABLE e2006tfidf_train (
          rowid int,
          label float,
          features ARRAY<STRING>
        )
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ","
        STORED AS TEXTFILE
        LOCATION '/dataset/E2006-tfidf/train';
    21. How to use Hivemall
        [Workflow diagram with labels: Machine Learning, Training, Prediction, Prediction Model, Label, Feature Vector. Highlighted step: Feature Engineering]
    22. How to use Hivemall - Feature Engineering
        Applying min-max normalization: transform the label value to a value between 0.0 and 1.0.

        CREATE VIEW e2006tfidf_train_scaled AS
        SELECT
          rowid,
          rescale(label, ${min_label}, ${max_label}) AS label,
          features
        FROM e2006tfidf_train;
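For readers unfamiliar with min-max normalization, the arithmetic behind a `rescale`-style function can be sketched as follows. The Python function only mirrors the formula; how the Hive UDF treats the degenerate case `min == max` is not stated in the slides, so raising an error here is an assumption.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: treat a zero-width range as an error.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


if __name__ == "__main__":
    labels = [-7.9, -3.5, -0.5]
    lo, hi = min(labels), max(labels)
    print([rescale(v, lo, hi) for v in labels])
```

The minimum and maximum have to be computed over the training labels beforehand, which is what the `${min_label}` and `${max_label}` Hive variables stand for in the query.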
    23. How to use Hivemall
        [Workflow diagram with labels: Machine Learning, Training, Prediction, Prediction Model, Label, Feature Vector. Highlighted step: Training]
    24. How to use Hivemall - Training
        Training by logistic regression:

        CREATE TABLE lr_model AS
        SELECT
          feature,
          avg(weight) AS weight
        FROM (
          SELECT logress(features, label, ..) AS (feature, weight)
          FROM train
        ) t
        GROUP BY feature

        • The inner query is a map-only task that learns a prediction model
        • Map outputs are shuffled to reducers by feature
        • Reducers perform model averaging in parallel
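The train-then-average pattern of this query can be imitated in a few lines of plain Python. This is a sketch under stated assumptions: each "mapper" is a simple SGD logistic regression over one partition, labels are 0 or 1, and the learning rate `eta` is fixed. None of these details come from the slides.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1):
    """One 'mapper': SGD logistic regression over a single partition.

    rows: iterable of (features: dict name -> value, label in {0, 1}).
    Returns the (feature, weight) pairs as a dict.
    """
    w = defaultdict(float)
    for x, y in rows:
        z = sum(w[f] * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))
        for f, v in x.items():
            w[f] += eta * (y - p) * v
    return dict(w)


def average_models(models):
    """The 'reducers': GROUP BY feature, then avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


if __name__ == "__main__":
    part1 = [({"a": 1.0}, 1), ({"b": 1.0}, 0)]
    part2 = [({"a": 1.0}, 1), ({"c": 1.0}, 0)]
    print(average_models([train_partition(part1), train_partition(part2)]))
```

Because each feature is averaged independently, the averaging step parallelizes by feature, which corresponds to the shuffle key in the query.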
    25. How to use Hivemall - Training
        Training of the Confidence Weighted (CW) classifier:

        CREATE TABLE news20b_cw_model1 AS
        SELECT
          feature,
          voted_avg(weight) AS weight
        FROM (
          SELECT train_cw(features, label) AS (feature, weight)
          FROM news20b_train
        ) t
        GROUP BY feature

        • voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
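One plausible reading of the voting rule, sketched in Python: count positive and negative weights, then average only the majority side. The slide does not spell out the exact semantics, so the tie-breaking (positives win ties) and the handling of zeros (ignored) are assumptions.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote.

    Assumptions for this sketch: zeros are ignored, ties go to the
    positive side, and an empty vote yields 0.0.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


if __name__ == "__main__":
    # The slide's example: four positive votes against one negative.
    print(voted_avg([0.7, 0.3, 0.2, -0.1, 0.7]))
```

For the slide's example the positive side wins four to one, so under this reading the result is the mean of 0.7, 0.3, 0.2, and 0.7, and the outlying -0.1 is discarded instead of dragging the average down.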
    26. Ensemble learning for stable prediction performance
        Just stack prediction models with UNION ALL:

        CREATE TABLE news20mc_ensemble_model1 AS
        SELECT
          label,
          cast(feature AS int) AS feature,
          cast(voted_avg(weight) AS float) AS weight
        FROM (
          SELECT train_multiclass_cw(addBias(features), label) AS (label, feature, weight)
          FROM news20mc_train_x3
          UNION ALL
          SELECT train_multiclass_arow(addBias(features), label) AS (label, feature, weight)
          FROM news20mc_train_x3
          UNION ALL
          SELECT train_multiclass_scw(addBias(features), label) AS (label, feature, weight)
          FROM news20mc_train_x3
        ) t
        GROUP BY label, feature;
    27. How to use Hivemall
        [Workflow diagram with labels: Machine Learning, Training, Prediction, Prediction Model, Label, Feature Vector. Highlighted step: Prediction]
    28. How to use Hivemall - Prediction

        CREATE TABLE lr_predict AS
        SELECT
          t.rowid,
          sigmoid(sum(m.weight)) AS prob
        FROM testing_exploded t
        LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
        GROUP BY t.rowid

        • Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model
        • No need to load the entire model into memory
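The join-based prediction can be mimicked in Python to show what the query computes per row. This is a sketch, not the Hive execution: the model is a plain dict here, whereas the point of the slide is that Hive performs the join without holding the model in memory. Like the query, it sums weights without multiplying by feature values. One difference is an assumption of this sketch: a row with no matching feature gets a sum of 0.0 here, while SQL `sum` over only NULLs yields NULL.

```python
import math
from collections import defaultdict


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature), one pair per feature.
    model: dict feature -> weight.

    Equivalent in spirit to LEFT OUTER JOIN on feature, GROUP BY rowid,
    then sigmoid(sum(weight)).
    """
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # A feature absent from the model contributes nothing to the sum.
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


if __name__ == "__main__":
    model = {"like": 1.2, "dull": -2.0}
    rows = [(1, "like"), (1, "author"), (2, "like"), (2, "dull")]
    print(predict(rows, model))
```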
    29. Plan of the talk (same outline as slide 2)
    30. How Hivemall works in training
        Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs).
        [Diagram labels: training table with rows of tuple<label, array<features>>, e.g. (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>); train UDTFs emitting tuple<feature, weights>; shuffle by feature; param-mix; prediction model as a relation <feature, weights>]
        • Friendly to the Hive relational query engine
            • The resulting prediction model is a relation of features and their weights
        • Embarrassingly parallel
            • The numbers of mappers and reducers are configurable
        • Bagging-like effect, which helps to reduce the variance of each classifier/partition
    31. Why not UDAF (as in MADlib)
        Machine learning as an aggregate function.
        [Diagram labels: training table with rows of tuple<label, array<features>>; four train steps emitting array<weight> (4 ops in parallel); two merge steps emitting array<sum of weight>, array<count> (2 ops in parallel); one final merge producing the prediction model (no parallelism)]
        • Bottleneck in the final merge
        • Throughput limited by its fan-out
        • Memory consumption grows
        • Parallelism decreases
    32. How to deal with iterations
        Iterations are mandatory to get a good prediction model.
        • However, MapReduce is not suited to iterations because the input and output of each MR job go through HDFS
        • Spark avoids this by in-memory computation
        [Diagram labels: Input, iter. 1, iter. 2, ...; with MapReduce, an HDFS read and an HDFS write around every iteration]
    33. Training with iterations in Spark
        Logistic regression example of Spark:

        val data = spark.textFile(...).map(readPoint).cache()
        for (i <- 1 to ITERATIONS) {
          val gradient = data.map(p =>
            (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
          ).reduce(_ + _)
          w -= gradient
        }

        • Each node loads the data into memory once
        • Repeated MapReduce steps do the gradient descent
        • This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required)
    34. What does MLlib actually do?
        Mini-batch gradient descent with sampling (GradientDescent.scala, bit.ly/spark-gd):

        val data = ..
        for (i <- 1 to numIterations) {
          val sampled =    // sample a subset of the data (partitioned RDD)
          val gradient =   // average the subgradients over the sampled data using Spark MapReduce
          w -= gradient
        }

        Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
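The elided loop above can be fleshed out as a single-machine Python sketch of mini-batch gradient descent with sampling. It is not MLlib's code: the logistic loss, the Bernoulli sampling with `fraction`, the decaying step size `step / sqrt(i)`, and all names are assumptions made to keep the example self-contained.

```python
import math
import random


def minibatch_gd(data, num_iterations, fraction=0.5, step=1.0, seed=0):
    """Mini-batch gradient descent for logistic regression (sketch).

    data: list of (x: list of floats, y in {0, 1}).
    Each iteration samples roughly `fraction` of the data and averages
    the per-example gradients over that sample only.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for i in range(1, num_iterations + 1):
        sampled = [p for p in data if rng.random() < fraction]
        if not sampled:
            continue  # nothing drawn this round
        grad = [0.0] * dim
        for x, y in sampled:
            z = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        scale = step / math.sqrt(i) / len(sampled)
        w = [wj - scale * gj for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    # Toy data: bias term plus one feature whose sign determines the label.
    points = [([1.0, v / 10.0], 1 if v > 0 else 0) for v in range(-10, 11) if v != 0]
    print(minibatch_gd(points, num_iterations=50))
```

Since every iteration sees only a sample, a single pass is rarely enough, which is the slide's point about iterations being mandatory in this scheme.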
    35. How to deal with iterations in Hivemall
        Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps:

        SET hivevar:xtimes=3;

        CREATE VIEW training_x3 AS
        SELECT *
        FROM (
          SELECT amplify(${xtimes}, *) AS (rowid, label, features)
          FROM training
        ) t
        CLUSTER BY rand()
    36. Map-only shuffling and amplifying
        The rand_amplify UDTF randomly shuffles the input rows within each map task:

        CREATE VIEW training_x3 AS
        SELECT rand_amplify(${xtimes}, ${shufflebuffersize}, *) AS (rowid, label, features)
        FROM training;
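The idea of amplifying rows and shuffling them inside a bounded buffer can be sketched as a Python generator. This illustrates the concept only; the real UDTF's buffering strategy is not described in the slides, so the flush-when-full behaviour and the `seed` parameter are assumptions.

```python
import random


def rand_amplify(rows, xtimes, buffer_size, seed=None):
    """Emit every input row `xtimes` times, shuffled within a local buffer.

    Rows are collected into a buffer of at most `buffer_size` entries;
    whenever the buffer fills up it is shuffled and flushed. The shuffle
    is therefore local (like a map-local shuffle), not global.
    """
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        for _ in range(xtimes):
            buffer.append(row)
            if len(buffer) >= buffer_size:
                rng.shuffle(buffer)
                yield from buffer
                buffer.clear()
    rng.shuffle(buffer)  # flush whatever is left at the end
    yield from buffer


if __name__ == "__main__":
    training = [(1, 1.0, ["a:1"]), (2, 0.0, ["b:1"]), (3, 1.0, ["c:1"])]
    for row in rand_amplify(training, xtimes=3, buffer_size=4, seed=42):
        print(row)
```

A larger buffer mixes rows from further apart at the cost of memory. The global alternative on the previous slide reshuffles across the whole cluster, which matches the higher elapsed time reported for it later in the talk.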
    37. Detailed plan w/ map-local shuffle
        [Diagram labels: map tasks, each with Table scan, Rand Amplifier, Logress UDTF, Partial aggregate, Map write; Shuffle (distributed by feature); reduce tasks, each with Merge, Aggregate, Reduce write]
        • Scanned entries are amplified and then shuffled
        • Note that this is a pipelined operation: the Rand Amplifier operator is interleaved between the table scan and the training operator
    38. Performance effects of amplifiers

        Method                                             Elapsed time (sec)   AUC
        Plain                                              89.718               0.734805
        amplifier + clustered by (a.k.a. global shuffle)   479.855              0.746214
        rand_amplifier (a.k.a. map-local shuffle)          116.424              0.743392

        With the map-local shuffle, prediction accuracy improved with an acceptable overhead.
    39. Plan of the talk (same outline as slide 2)
    40. Experimental Evaluation
        Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit.
        • Dataset: the KDD Cup 2012, Track 2 dataset, which is one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider (bit.ly/hivemall-kdd-dataset)
            • The training data is about 235 million records in 33 GB
            • The number of feature dimensions is about 54 million
        • Task: predicting click-through rates of search engine ads
        • Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory
    41. Performance comparison
        Elapsed time (sec) for training (the lower, the better):

        System     Elapsed time (sec)
        Hivemall   116.4
        VW1        596.67
        VW32       493.81
        Bismarck   755.24

        [Chart: AUC for Hivemall, VW1, VW32, and Bismarck; axis range 0.64 to 0.76] Prediction performance (AUC) is good.
        • Throughput: 2.3 million tuples/sec on 32 nodes
        • Latency: 96 sec for training 235 million records of 23 GB
    42. How about Spark 1.0 MLlib?

        val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/small/training_libsvmfmt", multiclass = false)
        val model = LogisticRegressionWithSGD.train(training, numIterations)
        ..

        • Works fine for small data (10k training examples in about 1.5 MB) on 33 nodes with 5 GB of memory allocated to each worker
        • The LoC is small and easy to understand
        • However, Spark does not work for the large dataset (235 million training examples of 2^24 feature dimensions in about 33 GB)
        • Further investigation is required
    43. Conclusion
        Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
        • Easy to use
        • Scalable to computing resources
        • Runs on Amazon EMR
        • Supports state-of-the-art classification algorithms
        • Plans to support Shark/Spark SQL
        Project site: github.com/myui/hivemall or bit.ly/hivemall
        Message of this talk: please evaluate Hivemall yourself. 5 minutes is enough for a quick start.
        Slides available at bit.ly/hivemall-slide
