This document summarizes the steps to turn an analytics script into a scalable data science service in Python: defining an importable module, exposing an API endpoint, making it asynchronous with Celery, containerizing it with Docker, and deploying it on Kubernetes. Each step is discussed in detail with code examples. The overall goal is a data science solution that is robust, responsive, and able to handle production-scale loads.
Data Science with Python
From analytics scripts to services in scale
Giuseppe Broccolo
Data Engineer @ Decibel ltd.
@giubro
/gbroccolo
gbroccolo@decibelinsight.com
PyLondinium 2019
London, UK, June 15th
The repo
https://github.com/gbroccolo/k8s_example
Outline of the talk
● Requirements to deploy a service in scale
● A simple case study: from a prototype to a service in scale
● The roadmap
● Conclusions
The requirements of a scalable service
https://12factor.net/
● Prefer stateless applications
● Async is better than sync
● Ingest data via HTTP requests, or by consuming streams
● Store the configuration as ENVs – avoid filesystem persistence
● Avoid even temporary artifacts on the filesystem – prefer in-memory execution
● Rely on backing services if necessary
● High portability and easy deployment of the processing units – use Docker!
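As a small illustration of the "configuration as ENVs" point, a minimal sketch (the fallback value here is just for local development, not taken from the repo):

```python
import os

# Read configuration from environment variables instead of files on disk.
# REDIS_HOST is the variable the service in this talk actually uses.
REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")
REDIS_URL = "redis://%s:6379/0" % REDIS_HOST
```

The same image can then run unchanged in every environment; only the injected ENVs differ.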
The scenario

Data scientists produced a fantastic algorithm to detect anomalies in Gaussian time series. Data engineers have to design the pipeline of a responsive, scalable service able to ingest time series provided via HTTP requests:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
     -d "(18:27:26.345, 2.345) (18:27:26.346, 2.352) ..." \
     http://<url>/get_anomalous_data
from itertools import count

import numpy as np


def nd_rolling(data, window_size):
    sample = list(zip(count(), data.values[:, 0], data.values[:, 1]))
    for idx in range(0, len(sample)):
        idx0 = idx if idx - window_size < 0 else idx - window_size
        # keep only the points that fall inside the window
        # (note: 'and', not 'or' – with 'or' every point matches)
        window = [it for it in sample
                  if it[0] >= idx0 and it[0] <= idx0 + window_size]
        x = np.array([it[2] for it in window])
        yield {'idx': sample[idx][1],
               'value': sample[idx][2],
               'window_mean': np.mean(x),
               'window_std': np.std(x)}


def get_anomalous_values(data, window_size=100):
    return [(p['idx'], p['value'])
            for p in nd_rolling(data, window_size)
            if abs(p['value'] - p['window_mean'])
            > 5 * p['window_std']]
Input: a time series of (timestamp, value) pairs, e.g.
(18:27:26.345, 2.345),
(18:27:26.346, 2.352),
(18:27:26.347, 2.348),
...
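The rolling-window idea above can be sketched with the standard library alone (a simplified illustration of the same technique, not the talk's exact NumPy implementation; the function name is made up):

```python
from statistics import mean, pstdev


def rolling_anomalies(values, window_size=100, n_sigma=5):
    """Return (index, value) pairs for points whose distance from the
    mean of their window exceeds n_sigma window standard deviations."""
    anomalies = []
    for i, v in enumerate(values):
        # same window bounds as nd_rolling: [idx0, idx0 + window_size]
        lo = max(0, i - window_size)
        window = values[lo:lo + window_size + 1]
        m, s = mean(window), pstdev(window)
        if abs(v - m) > n_sigma * s:
            anomalies.append((i, v))
    return anomalies
```

For a flat series with a single large spike, only the spike is reported; the spike inflates the window's standard deviation, which is why the threshold multiplier matters.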
Roadmap
Step 1
define an importable module
anomaly/
__init__.py
anomaly.py
setup.cfg
setup.py
requirements.txt
(the anomaly-detection code shown earlier moves, unchanged, into anomaly/anomaly.py)
https://setuptools.readthedocs.io
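A minimal packaging sketch matching the layout above (all metadata values are illustrative, not taken from the repo):

```ini
# setup.cfg
[metadata]
name = anomaly
version = 0.1.0

[options]
packages = find:
install_requires =
    numpy
    pandas
```

With this, setup.py reduces to `from setuptools import setup; setup()`, and the module becomes pip-installable.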
$ pip install --no-cache-dir . && rm -rf ./*
$ python
Python 3.7.1 (default, Dec 20 2018, 10:12:31)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more
information.
>>> from anomaly import anomaly
>>>
Step 2
expose a responsive endpoint
from flask import Flask
from flask import request
import json
import re

import pandas as pd

from anomaly.anomaly import get_anomalous_values

app = Flask(__name__)


def bloat2float(pair):
    # parse a "(timestamp, value)" token into a (timestamp, float) tuple
    t, v = re.search(r"\(([^,]+),([^)]+)\)", pair).group(1, 2)
    return t.strip(), float(v)


@app.route('/get_anomalous_data', methods=['POST'])
def get_anomalous_data():
    body = request.get_data().decode('utf-8')
    # split the body into "(timestamp, value)" tokens before parsing
    stream = pd.DataFrame([bloat2float(x)
                           for x in re.findall(r"\([^)]*\)", body)])
    stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
    response = get_anomalous_values(stream)
    return json.dumps(response), 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)
main.py
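The body parsing in main.py can be exercised in isolation; a stdlib-only sketch of the same idea (the helper name parse_pairs is made up for the example):

```python
import re


def parse_pairs(body):
    """Parse a body like '(18:27:26.345, 2.345) (18:27:26.346, 2.352)'
    into a list of (timestamp, float) tuples."""
    pairs = re.findall(r"\(([^,]+),\s*([^)]+)\)", body)
    return [(t.strip(), float(v)) for t, v in pairs]
```

Keeping the parsing in a small pure function like this makes the endpoint easy to unit-test without spinning up Flask.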
Step 3
make it asynchronous
Reduce the impact on the requestor: POST submits the data, GET polls for the result.
[Diagram: source → APP (redis & celery clients) → broker pipeline → celery worker → backend pipeline]
[supervisord]
nodaemon=true

[program:uwsgi]
environment=PYTHONPATH=/app/
command=/usr/local/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini --die-on-term --need-app --plugin python3
autostart=true
autorestart=true

[program:nginx]
command=/usr/sbin/nginx
autostart=true
autorestart=true

[program:celery_worker]
environment=PYTHONPATH=/app/
command=/usr/local/bin/celery worker -A main.celery --loglevel=info
autostart=true
autorestart=true

supervisord.ini
[…]
from celery import Celery
from celery.signals import after_setup_logger
from redis import StrictRedis
[…]
app = Flask(__name__)
app.config['CELERY_RESULT_BACKEND'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
app.config['CELERY_BROKER_URL'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)
[…]
REDIS_CACHE = StrictRedis(
    host='%s' % os.environ.get('REDIS_HOST'), port=6379, db=1, decode_responses=True)
[…]
@celery.task(bind=True)
def wrap_long_task(self, arg):
    return get_anomalous_values(arg)
@app.route("/get_anomalous_data", methods=['GET', 'POST'])
def get_anomalous_data():
    if request.method == 'GET':
        task_id = REDIS_CACHE.rpop("running_task_ids")
        task = wrap_long_task.AsyncResult(task_id)
        if task.state == 'SUCCESS':
            return json.dumps(task.result), 200
        elif task.state == 'FAILURE':
            return str(task.traceback), 500
        else:
            # still running: push the id back and tell the client to retry
            REDIS_CACHE.lpush("running_task_ids", task_id)
            return '', 202
    elif request.method == 'POST':
        body = request.get_data().decode('utf-8')
        stream = pd.DataFrame([bloat2float(x)
                               for x in re.findall(r"\([^)]*\)", body)])
        stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
        task = wrap_long_task.apply_async(args=[stream.to_json(orient='records')])
        REDIS_CACHE.lpush("running_task_ids", task.id)
        return 'submitted', 200
Note: Celery ingests JSON-serializable inputs only!
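The task-id bookkeeping above (LPUSH on submit, RPOP on poll) is simply a FIFO queue; the same pattern can be sketched with collections.deque in place of Redis (function names are made up for the illustration):

```python
from collections import deque

# stands in for the Redis list "running_task_ids"
task_ids = deque()


def submit(task_id):
    task_ids.appendleft(task_id)   # LPUSH: newest id goes on the left


def poll():
    if not task_ids:
        return None
    task_id = task_ids.pop()       # RPOP: oldest id comes off the right
    # if the task were still running, the handler pushes the id back:
    # task_ids.appendleft(task_id)
    return task_id
```

Pushing left and popping right keeps the ids in submission order, which is why the GET handler serves results oldest-first.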
[…]
def get_anomalous_values(data, window_size=100):
    """
    data : str (JSON, orient='records')
    window_size: int
    return: list
    """
    # deserialise the JSON payload received from Celery into a DataFrame
    pd_data = pd.read_json(data, orient='records')
    # calculate the moving window for each point, and report an anomaly if
    # the distance of the idx-th point from the window mean is greater
    # than 5 times the window standard deviation
    return [(p['idx'], p['value']) for p in nd_rolling(pd_data, window_size)
            if abs(p['value'] - p['window_mean']) > 5 * p['window_std']]
anomaly.py – the ingested data needs to be deserialised into a pandas DataFrame.
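The 'records' orientation used in this hand-off is just one dict per row; a stdlib-only illustration of the round trip (pandas' to_json/read_json do the equivalent on real DataFrames):

```python
import json

# what each DataFrame row becomes under orient='records'
rows = [{"0": "18:27:26.345", "1": 2.345},
        {"0": "18:27:26.346", "1": 2.352}]

# conceptually what stream.to_json(orient='records') produces...
payload = json.dumps(rows)

# ...and what pd.read_json(data, orient='records') recovers on the worker
recovered = json.loads(payload)
```

Because only this JSON string crosses the broker, the worker never has to share memory with the web process.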
Step 4
microservices (Docker)
[Diagram: two microservices – the main application, and the broker & backing service]
Dockerfile
FROM decibel/uwsgi-nginx-flask-docker:python3.6-alpine3.8-pandas
MAINTAINER Giuseppe Broccolo <gbroccolo@decibelinsight.com>

RUN mkdir -p /source
COPY . /source
COPY ./main.py /app/
COPY ./supervisord.ini /etc/supervisor.d/
RUN cd /source && \
    pip install --no-cache-dir . && \
    cd / && \
    rm -rf /source
inherited from: https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/
version: '3.5'

networks:
  redis_network:

services:
  python-app:
    build: .
    environment:
      - REDIS_HOST=redis
    ports:
      - "0.0.0.0:80:80"
    depends_on:
      - redis
    networks:
      - redis_network
  redis:
    image: redis:5.0.3-alpine3.8
    networks:
      - redis_network
docker-compose.yml
Step 5
kubernetes on cloud
● Dynamic cluster of worker units, autoscaled based on the load of the single units
● Available from several cloud providers – GKE, EKS, …
● Deployable through YAML configurations:
  ● Pods – the elementary unit, composed of one or more containers
  ● Deployments – how the Pods are deployed
  ● Services – how the deployed Pods are exposed
  ● Horizontal autoscalers – how the cluster autoscales
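A sketch of the autoscaler piece of that list (resource names and thresholds are illustrative, not taken from the talk's repo):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: python-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: python-app
  minReplicas: 1
  maxReplicas: 5
  # add replicas when average CPU usage crosses 50% of the requested CPU
  targetCPUUtilizationPercentage: 50
```

The CPU requests/limits set on the Pods below are what this percentage is measured against.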
https://kubernetes.io
Step 5
kubernetes on cloud – static cluster
pods.yml
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: python-app
    image: <some-docker-hub>/python-app:latest
    env:
    - name: REDIS_HOST
      value: "localhost"
    resources:
      limits:
        cpu: "400m"
      requests:
        cpu: "200m"
  - name: redis
    image: redis:5.0.3-alpine3.8
Step 5
kubernetes on cloud – scalable cluster
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: python-app
    image: <some-docker-hub>/python-app:latest
    env:
    - name: REDIS_HOST
      value: "redis"
    resources:
      limits:
        cpu: "400m"
      requests:
        cpu: "200m"
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
  - name: redis
    image: redis:5.0.3-alpine3.8
---
apiVersion: v1
kind: Service
[…]
spec:
  type: ClusterIP
  ports:
  - port: 6379
    targetPort: 6379
Room for improvements
● Consume streams, publish results on streams
● Make the cache backing service more reliable (HA cluster, persist data in a volume in case of pod restarts)
● Decouple the Celery workers from the HTTP data ingestion