This document surveys options for deploying machine learning models, including AWS Lambda, Kubernetes, KServe (formerly KFServing), and AWS SageMaker. It gives an overview of deploying models on Kubernetes with Deployments and Services, then walks through concrete examples: serving a Keras model with TensorFlow Serving behind a gRPC gateway, and deploying with KServe, including the use of transformers. It concludes that there is no one-size-fits-all solution: the best option depends on factors such as load, ease of use, transparency, and cost.
5. Plan
● Different options to deploy a model (Lambda, Kubernetes, SageMaker)
● Kubernetes 101
● Deploying an XGB model with Flask and Kubernetes
● Deploying a Keras model with TF-Serving and Kubernetes
● Deploying a Keras model with KServe (formerly KFServing)
6. Ways to deploy a model
● Flask + AWS Elastic Beanstalk
● Serverless (AWS Lambda)
● Kubernetes (EKS)
● KServe (EKS)
● AWS SageMaker
● ...
(or their alternatives in other cloud providers)
14. Lambda vs SageMaker vs Kubernetes
● Lambda
○ Cheap for small load
○ Easy to manage
○ Not always transparent
● SageMaker (serving)
○ Easy to use/manage
○ Needs wrappers
○ Not always transparent
○ Expensive
● Kubernetes
○ Complex (for me)
○ More flexible
○ Cloud-agnostic *
○ Requires support
○ Cheaper for high load
* sort of
16. Kubernetes glossary
● Pod ~ one instance of your service
● Deployment - a set of identical pods managed together
● HPA - horizontal pod autoscaler
● Node - a server (e.g. EC2 instance)
● Service - an interface to the deployment
● Ingress - an interface to the cluster
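To make these terms concrete, here is a minimal Deployment + Service sketch for a model container; every name, the image tag, and the ports are illustrative assumptions, not something from the slides:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving        # hypothetical name
spec:
  replicas: 2                # the Deployment manages this many pods
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: model
          image: model:v1    # assumed image tag
          ports:
            - containerPort: 9696
---
apiVersion: v1
kind: Service
metadata:
  name: model-serving        # the Service fronts the Deployment's pods
spec:
  selector:
    app: model-serving
  ports:
    - port: 80
      targetPort: 9696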
20. import pickle

import xgboost as xgb
from flask import Flask, request, jsonify

app = Flask('predict')

# load the model from the pickle file (filename assumed)
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    result = apply_model(data)  # apply_model: feature prep + model.predict (elided on the slide)
    return jsonify(result)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=9696)
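To smoke-test the endpoint locally - the field names and values here are made up; the payload is whatever apply_model expects:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{"feature_1": 1.0, "feature_2": 0.5}' \
    http://localhost:9696/predict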
21. FROM python:3.9-slim
RUN pip install flask gunicorn xgboost
COPY "model.py" "model.py"
# the pickled model has to ship with the image too (filename assumed)
COPY "model.bin" "model.bin"
EXPOSE 9696
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "model:app"]
28. import tensorflow as tf
from tensorflow import keras

# load the Keras HDF5 model and re-save it in the SavedModel format
model = keras.models.load_model('keras-model.h5')
tf.saved_model.save(model, 'tf-model')
29. $ ls -lhR
.:
total 3,1M
4,0K assets
3,1M saved_model.pb
4,0K variables
./assets:
total 0
./variables:
total 83M
83M variables.data-00000-of-00001
15K variables.index
30. saved_model_cli show --dir tf-model --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
...
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input_8'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 299, 299, 3)
name: serving_default_input_8:0
The given SavedModel SignatureDef contains the following output(s):
outputs['dense_7'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 10)
name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
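Note the tensor names in the signature - input_8 for the input and dense_7 for the output. The gRPC client has to refer to the model by exactly these names, as the next slides show.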
31. docker run -it --rm \
  -p 8500:8500 \
  -v "$(pwd)/tf-model:/models/tf-model/1" \
  -e MODEL_NAME=tf-model \
  tensorflow/serving:2.3.0
2021-09-07 21:03:58.579046: I tensorflow_serving/model_servers/server.cc:367]
Running gRPC ModelServer at 0.0.0.0:8500 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
2021-09-07 21:03:58.582097: I tensorflow_serving/model_servers/server.cc:387]
Exporting HTTP/REST API at:localhost:8501 ...
35. Not so fast

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# connect to the TF-Serving container started above
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# gRPC requests must be protobuf messages, not plain NumPy arrays
def np_to_protobuf(data):
    return tf.make_tensor_proto(data, shape=data.shape)

pb_request = predict_pb2.PredictRequest()
pb_request.model_spec.name = 'tf-model'
pb_request.model_spec.signature_name = 'serving_default'
pb_request.inputs['input_8'].CopyFrom(np_to_protobuf(X))

pb_result = stub.Predict(pb_request, timeout=20.0)
pred = pb_result.outputs['dense_7'].float_val
36. A 2.0 GB dependency?
Get only the things you need - pre-compiled protobuf files instead of the full TensorFlow package:
https://github.com/alexeygrigorev/tensorflow-protobuf
37. # the heavyweight way - pulls in all of TensorFlow just for preprocessing:
from tensorflow.keras.applications.xception import preprocess_input

# the lightweight alternative:
https://github.com/alexeygrigorev/keras-image-helper

from keras_image_helper import create_preprocessor

preprocessor = create_preprocessor('xception', target_size=(299, 299))

url = 'http://bit.ly/mlbookcamp-pants'
X = preprocessor.from_url(url)
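X is now a preprocessed NumPy batch of shape (1, 299, 299, 3) - exactly what np_to_protobuf on slide 35 turns into the gRPC request payload.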
38. Next steps...
● Bake the model into the TF-Serving image
● Wrap the gRPC calls in a Flask app for the Gateway (sketched below)
● Write a Dockerfile for the Gateway
● Publish the images to ECR
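A minimal sketch of that Gateway, stitching together the preprocessor and the gRPC client from the previous slides; the TF-Serving hostname, the response shape, and the port are assumptions:

import grpc
import tensorflow as tf
from flask import Flask, request, jsonify
from keras_image_helper import create_preprocessor
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# in-cluster, the host would be the TF-Serving service name (assumed here)
channel = grpc.insecure_channel('tf-serving:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

preprocessor = create_preprocessor('xception', target_size=(299, 299))

app = Flask('gateway')

# for a slim image, swap tf.make_tensor_proto for the
# tensorflow-protobuf approach from slide 36
def np_to_protobuf(data):
    return tf.make_tensor_proto(data, shape=data.shape)

@app.route('/predict', methods=['POST'])
def predict():
    url = request.get_json()['url']
    X = preprocessor.from_url(url)

    pb_request = predict_pb2.PredictRequest()
    pb_request.model_spec.name = 'tf-model'
    pb_request.model_spec.signature_name = 'serving_default'
    pb_request.inputs['input_8'].CopyFrom(np_to_protobuf(X))

    pb_result = stub.Predict(pb_request, timeout=20.0)
    return jsonify({'predictions': list(pb_result.outputs['dense_7'].float_val)})

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=9696)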
64. Summary
● AWS SageMaker vs AWS Lambda vs Kubernetes vs Kubeflow
● Deploying models with Kubernetes: deployment + service
● Deploying Keras models: TF-Serving + Gateway (over gRPC)
● KServe (formerly KFServing): transformers + model
● No one size fits all