This document summarizes the steps to turn an analytics script into a scalable data science service in Python: defining an importable module, exposing an API endpoint, making it asynchronous with Celery, containerizing it with Docker, and deploying it on Kubernetes. Each step is discussed in detail with code examples. The overall goal is a data science solution that is robust, responsive, and able to handle production-scale loads.
Data Science with Python
From analytics scripts to services in scale
Giuseppe Broccolo
Data Engineer @ Decibel ltd.
@giubro
/gbroccolo
gbroccolo@decibelinsight.com
PyLondinium 2019
London, UK, June 15th
The repo
https://github.com/gbroccolo/k8s_example
Outline of the talk
● Requirements to deploy a service in scale
● A simple case study: from a prototype to a service in scale
● The roadmap
● Conclusions
The requirements of a scalable service
https://12factor.net/
● Prefer stateless applications
● Async is better than sync
● Ingest data via HTTP requests, or by consuming streams
● Store the configuration as ENVs – avoid filesystem persistence
● Avoid even temporary artifacts on the filesystem – prefer in-memory execution
● Rely on backing services if necessary
● High portability and easy deployment of the processing units – use Docker!
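As a small illustration of the "configuration as ENVs" point, a minimal sketch (the fallback value here is just for local development, not taken from the repo):

```python
import os

# Read configuration from environment variables instead of files on disk.
# REDIS_HOST is the variable the service in this talk actually uses.
REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")
REDIS_URL = "redis://%s:6379/0" % REDIS_HOST
```

The same image can then run unchanged in every environment; only the injected ENVs differ.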
The scenario

Data scientists produced a fantastic algorithm to detect anomalies in Gaussian time series. Data engineers have to design the pipeline of a responsive, scalable service able to ingest time series provided via HTTP requests:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
     -d "(18:27:26.345, 2.345) (18:27:26.346, 2.352) ..." \
     http://<url>/get_anomalous_data
from itertools import count

import numpy as np


def nd_rolling(data, window_size):
    sample = list(zip(count(), data.values[:, 0], data.values[:, 1]))
    for idx in range(0, len(sample)):
        idx0 = idx if idx - window_size < 0 else idx - window_size
        # keep only the points that fall inside the window
        # (note: 'and', not 'or' – with 'or' every point matches)
        window = [it for it in sample
                  if it[0] >= idx0 and it[0] <= idx0 + window_size]
        x = np.array([it[2] for it in window])
        yield {'idx': sample[idx][1],
               'value': sample[idx][2],
               'window_mean': np.mean(x),
               'window_std': np.std(x)}


def get_anomalous_values(data, window_size=100):
    return [(p['idx'], p['value'])
            for p in nd_rolling(data, window_size)
            if abs(p['value'] - p['window_mean'])
            > 5 * p['window_std']]
Input: a time series of (timestamp, value) pairs, e.g.
(18:27:26.345, 2.345),
(18:27:26.346, 2.352),
(18:27:26.347, 2.348),
...
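The rolling-window idea above can be sketched with the standard library alone (a simplified illustration of the same technique, not the talk's exact NumPy implementation; the function name is made up):

```python
from statistics import mean, pstdev


def rolling_anomalies(values, window_size=100, n_sigma=5):
    """Return (index, value) pairs for points whose distance from the
    mean of their window exceeds n_sigma window standard deviations."""
    anomalies = []
    for i, v in enumerate(values):
        # same window bounds as nd_rolling: [idx0, idx0 + window_size]
        lo = max(0, i - window_size)
        window = values[lo:lo + window_size + 1]
        m, s = mean(window), pstdev(window)
        if abs(v - m) > n_sigma * s:
            anomalies.append((i, v))
    return anomalies
```

For a flat series with a single large spike, only the spike is reported; the spike inflates the window's standard deviation, which is why the threshold multiplier matters.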
Roadmap
Step 1
define an importable module
anomaly/
__init__.py
anomaly.py
setup.cfg
setup.py
requirements.txt
(the anomaly-detection code shown earlier moves, unchanged, into anomaly/anomaly.py)
https://setuptools.readthedocs.io
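A minimal packaging sketch matching the layout above (all metadata values are illustrative, not taken from the repo):

```ini
# setup.cfg
[metadata]
name = anomaly
version = 0.1.0

[options]
packages = find:
install_requires =
    numpy
    pandas
```

With this, setup.py reduces to `from setuptools import setup; setup()`, and the module becomes pip-installable.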
$ pip install --no-cache-dir . && rm -rf ./*
$ python
Python 3.7.1 (default, Dec 20 2018, 10:12:31)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more
information.
>>> from anomaly import anomaly
>>>
Step 2
expose a responsive endpoint
from flask import Flask
from flask import request
import json
import re

import pandas as pd

from anomaly.anomaly import get_anomalous_values

app = Flask(__name__)


def bloat2float(pair):
    # parse a "(timestamp, value)" token into a (timestamp, float) tuple
    t, v = re.search(r"\(([^,]+),([^)]+)\)", pair).group(1, 2)
    return t.strip(), float(v)


@app.route('/get_anomalous_data', methods=['POST'])
def get_anomalous_data():
    body = request.get_data().decode('utf-8')
    # split the body into "(timestamp, value)" tokens before parsing
    stream = pd.DataFrame([bloat2float(x)
                           for x in re.findall(r"\([^)]*\)", body)])
    stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
    response = get_anomalous_values(stream)
    return json.dumps(response), 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)
main.py
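The body parsing in main.py can be exercised in isolation; a stdlib-only sketch of the same idea (the helper name parse_pairs is made up for the example):

```python
import re


def parse_pairs(body):
    """Parse a body like '(18:27:26.345, 2.345) (18:27:26.346, 2.352)'
    into a list of (timestamp, float) tuples."""
    pairs = re.findall(r"\(([^,]+),\s*([^)]+)\)", body)
    return [(t.strip(), float(v)) for t, v in pairs]
```

Keeping the parsing in a small pure function like this makes the endpoint easy to unit-test without spinning up Flask.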
Step 3
make it asynchronous
Reduce the impact on the requestor: POST submits the data, GET polls for the result.
[Diagram: source → APP (redis & celery clients) → broker pipeline → celery worker → backend pipeline]
[supervisord]
nodaemon=true

[program:uwsgi]
environment=PYTHONPATH=/app/
command=/usr/local/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini --die-on-term --need-app --plugin python3
autostart=true
autorestart=true

[program:nginx]
command=/usr/sbin/nginx
autostart=true
autorestart=true

[program:celery_worker]
environment=PYTHONPATH=/app/
command=/usr/local/bin/celery worker -A main.celery --loglevel=info
autostart=true
autorestart=true

supervisord.ini
[…]
from celery import Celery
from celery.signals import after_setup_logger
from redis import StrictRedis
[…]
app = Flask(__name__)
app.config['CELERY_RESULT_BACKEND'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
app.config['CELERY_BROKER_URL'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)
[…]
REDIS_CACHE = StrictRedis(
    host='%s' % os.environ.get('REDIS_HOST'), port=6379, db=1, decode_responses=True)
[…]
@celery.task(bind=True)
def wrap_long_task(self, arg):
    return get_anomalous_values(arg)
@app.route("/get_anomalous_data", methods=['GET', 'POST'])
def get_anomalous_data():
    if request.method == 'GET':
        task_id = REDIS_CACHE.rpop("running_task_ids")
        task = wrap_long_task.AsyncResult(task_id)
        if task.state == 'SUCCESS':
            return json.dumps(task.result), 200
        elif task.state == 'FAILURE':
            return str(task.traceback), 500
        else:
            # still running: push the id back and tell the client to retry
            REDIS_CACHE.lpush("running_task_ids", task_id)
            return '', 202
    elif request.method == 'POST':
        body = request.get_data().decode('utf-8')
        stream = pd.DataFrame([bloat2float(x)
                               for x in re.findall(r"\([^)]*\)", body)])
        stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
        task = wrap_long_task.apply_async(args=[stream.to_json(orient='records')])
        REDIS_CACHE.lpush("running_task_ids", task.id)
        return 'submitted', 200
Note: Celery ingests JSON-serializable inputs only!
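The task-id bookkeeping above (LPUSH on submit, RPOP on poll) is simply a FIFO queue; the same pattern can be sketched with collections.deque in place of Redis (function names are made up for the illustration):

```python
from collections import deque

# stands in for the Redis list "running_task_ids"
task_ids = deque()


def submit(task_id):
    task_ids.appendleft(task_id)   # LPUSH: newest id goes on the left


def poll():
    if not task_ids:
        return None
    task_id = task_ids.pop()       # RPOP: oldest id comes off the right
    # if the task were still running, the handler pushes the id back:
    # task_ids.appendleft(task_id)
    return task_id
```

Pushing left and popping right keeps the ids in submission order, which is why the GET handler serves results oldest-first.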
[…]
def get_anomalous_values(data, window_size=100):
    """
    data : str (JSON, orient='records')
    window_size: int
    return: list
    """
    # deserialise the JSON payload received from Celery into a DataFrame
    pd_data = pd.read_json(data, orient='records')
    # calculate the moving window for each point, and report an anomaly if
    # the distance of the idx-th point from the window mean is greater
    # than 5 times the window standard deviation
    return [(p['idx'], p['value']) for p in nd_rolling(pd_data, window_size)
            if abs(p['value'] - p['window_mean']) > 5 * p['window_std']]
anomaly.py – the ingested data needs to be deserialised into a pandas DataFrame.
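The 'records' orientation used in this hand-off is just one dict per row; a stdlib-only illustration of the round trip (pandas' to_json/read_json do the equivalent on real DataFrames):

```python
import json

# what each DataFrame row becomes under orient='records'
rows = [{"0": "18:27:26.345", "1": 2.345},
        {"0": "18:27:26.346", "1": 2.352}]

# conceptually what stream.to_json(orient='records') produces...
payload = json.dumps(rows)

# ...and what pd.read_json(data, orient='records') recovers on the worker
recovered = json.loads(payload)
```

Because only this JSON string crosses the broker, the worker never has to share memory with the web process.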
Step 4
microservices (Docker)
[Diagram: two microservices – the main application, and the broker & backing service]
Dockerfile
FROM decibel/uwsgi-nginx-flask-docker:python3.6-alpine3.8-pandas
MAINTAINER Giuseppe Broccolo <gbroccolo@decibelinsight.com>

RUN mkdir -p /source
COPY . /source
COPY ./main.py /app/
COPY ./supervisord.ini /etc/supervisor.d/
RUN cd /source && \
    pip install --no-cache-dir . && \
    cd / && \
    rm -rf /source
inherited from: https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/
version: '3.5'

networks:
  redis_network:

services:
  python-app:
    build: .
    environment:
      - REDIS_HOST=redis
    ports:
      - "0.0.0.0:80:80"
    depends_on:
      - redis
    networks:
      - redis_network
  redis:
    image: redis:5.0.3-alpine3.8
    networks:
      - redis_network
docker-compose.yml
Step 5
kubernetes on cloud
● Dynamic cluster of worker units, autoscaled based on the load of the single units
● Available from several cloud providers – GKE, EKS, …
● Deployable through YAML configurations:
  ● Pods – the elementary unit, composed of one or more containers
  ● Deployments – how the Pods are deployed
  ● Services – how the deployed Pods are exposed
  ● Horizontal autoscalers – how the cluster autoscales
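A sketch of the autoscaler piece of that list (resource names and thresholds are illustrative, not taken from the talk's repo):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: python-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: python-app
  minReplicas: 1
  maxReplicas: 5
  # add replicas when average CPU usage crosses 50% of the requested CPU
  targetCPUUtilizationPercentage: 50
```

The CPU requests/limits set on the Pods below are what this percentage is measured against.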
https://kubernetes.io
Step 5
kubernetes on cloud – static cluster
pods.yml
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: python-app
    image: <some-docker-hub>/python-app:latest
    env:
    - name: REDIS_HOST
      value: "localhost"
    resources:
      limits:
        cpu: "400m"
      requests:
        cpu: "200m"
  - name: redis
    image: redis:5.0.3-alpine3.8
Step 5
kubernetes on cloud – scalable cluster
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: python-app
    image: <some-docker-hub>/python-app:latest
    env:
    - name: REDIS_HOST
      value: "redis"
    resources:
      limits:
        cpu: "400m"
      requests:
        cpu: "200m"
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: redis
    image: redis:5.0.3-alpine3.8
---
apiVersion: v1
kind: Service
[…]
spec:
  type: ClusterIP
  ports:
  - port: 6379
    targetPort: 6379
Room for improvements
● Consume streams, publish results on streams
● Make the cache backing service more reliable (HA cluster, persist data in a volume in case of pod restarts)
● Decouple the Celery workers from the HTTP data ingestion