
Tensors Are All You Need: Faster Inference with Hummingbird


The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g., GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.



In this talk, we'll show how you can use Hummingbird to achieve up to 1000x speedups in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird



This talk is for intermediate audiences that use traditional machine learning and want to speed up the time it takes to perform inference with these models. After watching the talk, the audience should be able to convert their traditional models to tensor-based models in ~5 lines of code and try them out on GPUs.
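For instance, the conversion boils down to a few lines like the following (a minimal sketch using Hummingbird's convert API; the model choice, data shapes, and backend are illustrative):

    import numpy as np
    import torch
    from sklearn.ensemble import RandomForestClassifier
    from hummingbird.ml import convert

    # Train any supported traditional ML model (random data, for illustration).
    X = np.random.rand(1000, 28).astype(np.float32)
    y = np.random.randint(2, size=1000)
    skl_model = RandomForestClassifier(n_estimators=10).fit(X, y)

    # Convert the scikit-learn model into a tensor-based (PyTorch) model,
    # move it to the GPU if one is available, and run inference.
    hb_model = convert(skl_model, "pytorch")
    if torch.cuda.is_available():
        hb_model.to("cuda")
    predictions = hb_model.predict(X)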



Outline:

Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A


Transcript:

  1. Matteo Interlandi, Karla Saur: Hummingbird
  2. Overview: Machine Learning Prediction Serving
  3. Machine Learning Prediction Serving
  4. Machine Learning Prediction Serving: 1. Models are learned from data (Data → Training → Model)
  5. Machine Learning Prediction Serving: 1. Models are learned from data; 2. Models are deployed and served together (Data → Training → Model → Deploy → Server, serving predictions to users)
  6. Model Serving: specialized systems have been developed
  7. Model Serving: specialized systems have been developed, with a focus on Deep Learning (DL)
  8. Model Serving: specialized systems have been developed, with a focus on Deep Learning (DL); support for traditional ML methods is largely overlooked
  9. Traditional ML Models: 2019 Kaggle Survey: The State of Data Science & Machine Learning; Data Science through the looking glass: https://arxiv.org/abs/1912.09536
  10. Problem: Lack of Optimizations for Traditional ML Serving. Systems for training traditional ML models are not optimized for serving. Traditional ML models are expressed using imperative code in an ad-hoc fashion, not using a shared logical abstraction. Traditional ML models cannot natively exploit hardware acceleration.
  11. How do "Traditional ML Models" look inside? <Example: Binary Classification> Tokenizer → Char Ngram / Word Ngram → Concat → Logistic Regression → 0 vs 1
  12. How do "Traditional ML Models" look inside? A DAG of operators (aka a pipeline): Split → Scaler / OneHot → Concat → Logistic Regression → 0 vs 1
  13. In this DAG, Split, Scaler, OneHot, and Concat are featurizers
  14. ...and Logistic Regression is the predictor
  15. Walking through the pipeline step by step:
  16. Split: split the input into categorical (cat) and numerical (num) columns
  17. Scaler: normalize num; OneHot: one-hot encode cat
  18. Concat: merge the two vectors
  19. Logistic Regression: compute the final score (0 vs 1)
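(For reference, a pipeline like the one on these slides is what scikit-learn's ColumnTransformer produces; a minimal sketch with made-up column names and data:)

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Illustrative input: one categorical and one numerical column.
    df = pd.DataFrame({"cat": ["a", "c", "a", "b"], "num": [0.1, 0.5, 0.3, 0.9]})
    y = np.array([0, 1, 0, 1])

    # Split -> (Scaler | OneHot) -> Concat is exactly what ColumnTransformer does.
    featurizer = ColumnTransformer([
        ("scaler", StandardScaler(), ["num"]),  # normalize the numerical column
        ("onehot", OneHotEncoder(), ["cat"]),   # one-hot encode the categorical column
    ])
    pipeline = Pipeline([("features", featurizer), ("predictor", LogisticRegression())])
    pipeline.fit(df, y)
    print(pipeline.predict(df))  # the final 0-vs-1 predictions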
  20. Deep Learning
  21. Deep Learning primarily relies on the abstraction of tensors: DL models are expressed as a DAG of tensor operators (the figure shows user input X flowing through MatMul → Add → ReLU → MatMul → Add → Sigmoid, with weights w1, b1, ...)
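(In PyTorch terms, that slide's DAG is just a handful of tensor operations; a minimal sketch with illustrative shapes and random weights:)

    import torch

    # A batch of user input and random weights for a two-layer network.
    X = torch.randn(8, 4)
    w1, b1 = torch.randn(4, 16), torch.randn(16)
    w2, b2 = torch.randn(16, 1), torch.randn(1)

    # The DAG of tensor operators: MatMul -> Add -> ReLU -> MatMul -> Add -> Sigmoid.
    h = torch.relu(X @ w1 + b1)
    out = torch.sigmoid(h @ w2 + b2)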
  22. Systems for DL Prediction Serving exploit the abstraction of tensor operations to support multiple DL frameworks on multiple target environments. Benefits: efficient implementations, declarative, seamless hardware acceleration, reduced engineering efforts.
  23. Hummingbird bridges traditional ML and DL prediction serving systems by compiling traditional ML models into tensor operations (MatMul, Add, ReLU, ...)
  24. Converting ML Operators into Tensor Operations. Observation: pipelines are composed of two classes of operators. Algebraic operations (e.g., linear regression: Y = wX + b) map directly to tensor computations. Algorithmic operations (e.g., RandomForest, OneHotEncoder) have complex data access and control-flow patterns. Our solution: make data access patterns and control flow uniform for all inputs, which introduces redundancies, both computational and storage. Depending on the level of redundancy introduced, there can be more than one potential compilation approach; Hummingbird picks the one that works best given pipeline statistics.
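(To make the two classes concrete, here is a minimal NumPy sketch with made-up data: the algebraic operator is already one tensor expression, while the algorithmic operator is recast so every input follows the same code path, at the cost of some redundant work:)

    import numpy as np

    X = np.random.rand(8, 4).astype(np.float32)
    w, b = np.random.rand(4).astype(np.float32), 0.5

    # Algebraic operation: linear regression is already a tensor expression.
    Y = X @ w + b

    # Algorithmic operation: one-hot encoding. Instead of branching per row,
    # compare every value against every category at once (uniform control
    # flow and data access, with redundant comparisons as the price).
    categories = np.array(["a", "b", "c"])
    x_cat = np.array(["a", "c", "a", "b"])
    onehot = (x_cat[:, None] == categories[None, :]).astype(np.float32)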
  25.-33. Converting decision tree-based models (a step-by-step figure animation across slides 25-33, compiling a decision tree's "<=" threshold comparisons into matrix operations; the text on these slides is otherwise identical)
  34. Compiling Decision Tree-based Models. The above approach (the GEMM approach) essentially evaluates all paths in a decision tree model, introducing computation redundancy, yet it works surprisingly well on modern hardware in many cases! There are also two other tree-traversal-based methods that exploit the tree structure: one for tall trees (e.g., LightGBM) and one for bushy trees (e.g., XGBoost).
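(The GEMM idea fits in a few lines; a minimal sketch, not Hummingbird's actual code, for a made-up three-leaf tree:)

    import numpy as np

    # Toy tree: N0 tests f0 < 0.5; if true go to N1 (f1 < 2.0), else leaf L2.
    # N1 true -> leaf L0, N1 false -> leaf L1.
    A = np.array([[1.0, 0.0],    # A maps features to the nodes that test them
                  [0.0, 1.0]])
    B = np.array([0.5, 2.0])     # per-node thresholds

    # C encodes each leaf's root-to-leaf path: +1 if the node condition must
    # be true, -1 if it must be false, 0 if the node is not on the path.
    C = np.array([[ 1.0,  1.0, -1.0],
                  [ 1.0, -1.0,  0.0]])
    D = np.array([2.0, 1.0, 0.0])  # number of "true" conditions on each path

    def gemm_tree_predict(X):
        T = (X @ A < B).astype(np.float32)   # evaluate ALL node conditions at once
        S = (T @ C == D).astype(np.float32)  # exactly one leaf matches per row
        return S.argmax(axis=1)              # index of the selected leaf

    X = np.array([[0.3, 3.0],   # f0 < 0.5 true, f1 < 2.0 false -> leaf L1
                  [0.7, 0.0]])  # f0 < 0.5 false               -> leaf L2
    print(gemm_tree_predict(X))  # [1 2]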
  35. Tree Traversal Method
  36. Tree Traversal Method: start every input at node 0 and repeat while below the max tree depth: gather the current node's feature id, gather that feature's value from X, gather the node's threshold, compare (feature value < threshold), and select the next node with where(cond, Lefts, Rights)
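(A minimal NumPy sketch of that loop for the same made-up tree; leaves point to themselves, so extra iterations are harmless:)

    import numpy as np

    # The toy tree flattened into per-node tensors. Nodes 0-1 are internal;
    # nodes 2-4 are the leaves L2, L0, L1, and they point back to themselves.
    feature_ids = np.array([0, 1, 0, 0, 0])
    thresholds  = np.array([0.5, 2.0, -np.inf, -np.inf, -np.inf])
    lefts       = np.array([1, 3, 2, 3, 4])
    rights      = np.array([2, 4, 2, 3, 4])
    leaf_label  = np.array([-1, -1, 2, 0, 1])  # -1 marks internal nodes
    MAX_DEPTH = 2

    def traverse(X):
        node = np.zeros(len(X), dtype=np.int64)   # every input starts at the root
        for _ in range(MAX_DEPTH):
            feat = feature_ids[node]              # gather feature ids
            thr = thresholds[node]                # gather thresholds
            val = X[np.arange(len(X)), feat]      # gather feature values
            node = np.where(val < thr, lefts[node], rights[node])
        return leaf_label[node]

    X = np.array([[0.3, 3.0], [0.7, 0.0]])
    print(traverse(X))  # [1 2], matching the GEMM version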
  37. Perfect Tree Traversal Method: pad the tree into a perfect tree by duplicating leaves so that every root-to-leaf path has the same length (the figure shows a tree with conditions F3 < 0.5, F2 < 2.0, F5 < 5.5, F3 < 2.4 and classes C1/C2, before and after padding)
  38. Supported Operators
      Linear Classifiers: LogisticRegression, LinearSVC, LinearSVR, SVC, SVR, NuSVC, SGDClassifier, LogisticRegressionCV
      Tree Methods: DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor, (Hist)GradientBoostingClassifier/Regressor, ExtraTreesClassifier/Regressor, XGBClassifier/Regressor, LGBMClassifier/Regressor/Ranker
      Neural Networks: MLPClassifier
      Others: BernoulliNB, KMeans, MeanShift
      Feature Selectors: SelectKBest
      Decomposition: PCA, TruncatedSVD
      Feature Pre-Processing: SimpleImputer, Imputer, ColumnTransformer, RobustScaler, MaxAbsScaler, MinMaxScaler, StandardScaler, Binarizer, KBinsDiscretizer, Normalizer, PolynomialFeatures, OneHotEncoder, LabelEncoder, FeatureHasher
  39. High-level System Design: trained traditional ML pipelines → Hummingbird → DL prediction serving systems
  40. End-to-End Pipeline Evaluation. Experimental workload: scikit-learn pipelines for the OpenML-CC18 benchmark, which has 72 datasets; Hummingbird can translate 2328 pipelines (88%); inference is performed on 20% of each dataset; TorchScript is the backend for Hummingbird. Hardware setup: Azure NC6 v2 machine, Intel Xeon E5-2690 v4 @ 2.6GHz (6 cores), 112 GB RAM, Nvidia P100, Ubuntu 18.04, PyTorch 1.3, TVM 0.6, CUDA 10, RAPIDS 0.9
  41. End-to-End Pipeline Evaluation [chart, CPU: 60%, 1200x, 60x]
  42. End-to-End Pipeline Evaluation [charts, CPU: 60%, 1200x, 60x; GPU: 73%, 1000x, 130x]. Main reasons for slowdowns: sparse input data, small inference datasets.
  43. Demo
  44. Hummingbird Updates
      • Hummingbird has reached >21K PyPI downloads and 2.4k stars
      • Demoed at Microsoft Ignite
      • Integrated with ONNX converter tools
      • OSDI paper
      • New features include: Pandas DataFrames, PySparkML support, TVM support
      • Looking for new users/contributors!
  45. Thank you! hummingbird-dev@microsoft.com
  46. Tree-Models Microbenchmark. Experimental workload: Nvidia Gradient Boosting Algorithm Benchmark (https://github.com/NVIDIA/gbm-bench). Three models: RandomForest, XGBoost, LightGBM; 80/20 train/test split; batch inference (batch size 10k), with and without GPU.
      Dataset   Rows   #Features   Task
      Fraud     285k   28          BinaryClass
      Year      512k   90          Regression
      Covtype   581k   54          MultiClass
      Epsilon   500k   2000        BinaryClass
  47. Tree-Models Microbenchmark: the results table (headers only; filled in on the next two slides)
  48. Tree-Models Microbenchmark: CPU results only (see the complete table on the next slide)
  49. Tree-Models Microbenchmark: complete results. All runtimes are reported in seconds; more datasets and experimental results in the paper; "not supp." marks configurations RAPIDS does not support.

      Algorithm     Dataset   Sklearn     Hummingbird (CPU)     RAPIDS      Hummingbird (GPU)
                              (CPU base)  TorchScript    TVM    (GPU base)  TorchScript    TVM
      Rand. Forest  Fraud     2.5         7.8            3.0    not supp.   0.044          0.015
                    Year      1.9         7.7            1.4    not supp.   0.045          0.026
                    Covtype   5.9         16.5           6.8    not supp.   0.110          0.047
                    Epsilon   9.8         13.9           6.6    not supp.   0.130          0.13
      LightGBM      Fraud     3.4         7.6            1.7    0.014       0.044          0.014
                    Year      5.0         7.6            1.6    0.023       0.045          0.025
                    Covtype   51.1        79.5           27.2   not supp.   0.620          0.250
                    Epsilon   10.5        14.5           4.0    0.150       0.130          0.120
      XGBoost       Fraud     1.9         7.6            1.6    0.013       0.044          0.015
                    Year      3.1         7.6            1.6    0.022       0.045          0.026
                    Covtype   42.3        79.0           26.4   not supp.   0.620          0.250
                    Epsilon   7.6         14.8           4.2    0.150       0.130          0.120
