Data Science with Python
From analytics scripts to services in scale
Giuseppe Broccolo
Data Engineer @ Decibel ltd.
@giubro
/gbroccolo
gbroccolo@decibelinsight.com
PyLondinium 2019
London, UK, June 15th
Giuseppe Broccolo Data Science with Python:
London, UK, 2019 June 15th
Intro
From analytics scripts to services in scale
PyLondinium 2019
DNA primase enzyme:
- unrolls human DNA (3.2G base pairs)
- DNA is unrolled at a rate of 5 base pairs per sec (~20 yr)
- distributed processing: every 500k base pairs (~20 yr → ~20 min)
Wikipedia, © 2019
The repo
https://github.com/gbroccolo/k8s_example
Outline of the talk
● Requirements to deploy a service in scale
● A simple case study: from a prototype to a service in scale
● The roadmap
● Conclusions
The requirements of
a scalable service
https://12factor.net/
● Prefer stateless applications
● Async is better than sync
● Ingest data via HTTP requests, or by consuming streams
● Store the configuration as ENVs – avoid filesystem persistence
● Avoid even temporary artifacts on the filesystem – prefer in-memory execution
● Rely on backing services if necessary
● High portability and easy deployment of the processing units – use Docker!
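The configuration-in-ENVs point can be sketched in a few lines (REDIS_HOST is the variable used later in the deck; the localhost default is an assumption for local runs):

```python
import os

# 12-factor: configuration lives in the environment, not on the filesystem
redis_host = os.environ.get("REDIS_HOST", "localhost")  # assumed local default
redis_url = f"redis://{redis_host}:6379/0"
```

The same code then runs unchanged in Docker or Kubernetes, where the orchestrator injects REDIS_HOST.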
The scenario

Data scientists produced a fantastic algorithm
to detect anomalies in Gaussian time series.

Data engineers have to design the pipeline of a
responsive, scalable service able to ingest time
series provided via HTTP requests:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
     -d "(18:27:26.345, 2.345) (18:27:26.346, 2.352) ..." \
     http://<url>/get_anomalous_data
from itertools import count

import numpy as np
import pandas as pd

def nd_rolling(data, window_size):
    sample = list(zip(count(), data.values[:, 0], data.values[:, 1]))
    for idx in range(0, len(sample)):
        idx0 = idx if idx - window_size < 0 else idx - window_size
        # keep only the points inside the rolling window
        window = [it for it in sample
                  if it[0] >= idx0 and it[0] <= idx0 + window_size]
        x = np.array([it[2] for it in window])
        yield {'idx': sample[idx][1],
               'value': sample[idx][2],
               'window_mean': np.mean(x),
               'window_std': np.std(x)}

def get_anomalous_values(data, window_size=100):
    return [(p['idx'], p['value'])
            for p in nd_rolling(data, window_size)
            if abs(p['value'] - p['window_mean'])
               > 5 * p['window_std']]
Example input time series:
(18:27:26.345, 2.345),
(18:27:26.346, 2.352),
(18:27:26.347, 2.348),
...
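As a self-contained sketch of the rolling-window rule above (simplified to a plain list of values; the function name and sample data are illustrative, not from the repo):

```python
import numpy as np

def rolling_anomalies(values, window_size=100, n_sigma=5):
    """Flag points deviating more than n_sigma rolling stds from the rolling mean."""
    out = []
    for i, v in enumerate(values):
        lo = max(0, i - window_size)                   # window start, clamped at 0
        window = np.asarray(values[lo:lo + window_size])
        if abs(v - window.mean()) > n_sigma * window.std():
            out.append((i, v))
    return out

# a flat series with one spike: only the spike is reported
series = [0.0] * 50
series[25] = 10.0
print(rolling_anomalies(series))  # -> [(25, 10.0)]
```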
Roadmap
Step 1
define an importable module
anomaly/
    __init__.py
    anomaly.py
setup.cfg
setup.py
requirements.txt
anomaly/anomaly.py contains the anomaly-detection code shown above.
https://setuptools.readthedocs.io
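A minimal setup.py for this layout might look like the following sketch (the package name, version, and pinned requirements are assumptions, not taken from the repo):

```python
from setuptools import setup, find_packages

setup(
    name='anomaly',            # hypothetical package name
    version='0.1.0',           # assumed version
    packages=find_packages(),
    install_requires=['numpy', 'pandas'],
)
```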
$ pip install --no-cache-dir . && rm -rf ./*
$ python
Python 3.7.1 (default, Dec 20 2018, 10:12:31)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from anomaly import anomaly
>>>
Step 2
expose a responsive endpoint
References: a lot of blogs, articles, ...
from flask import Flask
from flask import request
import json
from io import StringIO
import re

import pandas as pd

from anomaly.anomaly import get_anomalous_values

app = Flask(__name__)

def bloat2float(x):
    # "(18:27:26.345,2.345)" -> ("18:27:26.345", 2.345)
    y = re.search(r"\((.*),(.*)\)", x).group(1, 2)
    return y[0], float(y[1])

@app.route('/get_anomalous_data', methods=['POST'])
def get_anomalous_data():
    stream = pd.DataFrame([bloat2float(x) for x in StringIO(
        request.get_data().decode('utf-8')).getvalue().split()])
    stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
    response = get_anomalous_values(stream)
    return json.dumps(response), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)
main.py
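The payload tokens can be parsed in isolation; here is a hypothetical helper mirroring bloat2float (the raw-string regex escapes the literal parentheses, which the slide rendering dropped):

```python
import re

def parse_pair(token):
    """'(18:27:26.345,2.345)' -> ('18:27:26.345', 2.345)"""
    t, v = re.search(r"\((.*),(.*)\)", token).groups()
    return t, float(v)

print(parse_pair("(18:27:26.345,2.345)"))  # -> ('18:27:26.345', 2.345)
```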
Step 3
make it asynchronous
reduce the impact on the requestor:
submit work with a POST, poll for results with a GET
Architecture: the APP, equipped with redis & celery clients, pushes work
from the source to a celery worker over the broker pipeline, and collects
results over the backend pipeline.
[supervisord]
nodaemon=true

[program:uwsgi]
environment=PYTHONPATH=/app/
command=/usr/local/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini --die-on-term --need-app --plugin python3
autostart=true
autorestart=true

[program:nginx]
command=/usr/sbin/nginx
autostart=true
autorestart=true

[program:celery_worker]
environment=PYTHONPATH=/app/
command=/usr/local/bin/celery worker -A main.celery --loglevel=info
autostart=true
autorestart=true

supervisord.ini
[…]
from celery import Celery
from celery.signals import after_setup_logger
from redis import StrictRedis
[…]

app = Flask(__name__)
app.config['CELERY_RESULT_BACKEND'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
app.config['CELERY_BACKEND_URL'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')

celery = Celery(app.name, broker=app.config['CELERY_BACKEND_URL'])
celery.conf.update(app.config)
[…]

REDIS_CACHE = StrictRedis(
    host='%s' % os.environ.get('REDIS_HOST'), port=6379, db=1, decode_responses=True)
[…]

@celery.task(bind=True)
def wrap_long_task(self, arg):
    return get_anomalous_values(arg)
@app.route("/get_anomalous_data", methods=['GET', 'POST'])
def get_anomalous_data():
    if request.method == 'GET':
        task_id = REDIS_CACHE.rpop("running_task_ids")
        task = wrap_long_task.AsyncResult(task_id)
        if task.state == 'SUCCESS':
            return json.dumps(task.result), 200
        elif task.state == 'FAILURE':
            return str(task.traceback), 500
        else:
            # still running: push the id back and tell the client to retry
            REDIS_CACHE.lpush("running_task_ids", task_id)
            return '', 202
    elif request.method == 'POST':
        stream = pd.DataFrame([bloat2float(x) for x in StringIO(
            request.get_data().decode('utf-8')).getvalue().split()])
        stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
        task = wrap_long_task.apply_async(args=[stream.to_json(orient='records')])
        REDIS_CACHE.lpush("running_task_ids", task.id)
        return 'submitted', 200
NB: Celery ingests JSON-serialisable inputs only!
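Since Celery only accepts JSON-serialisable arguments, the DataFrame is serialised with to_json on the web side and rebuilt with read_json inside the task; a minimal round-trip sketch (the sample values are illustrative):

```python
from io import StringIO

import pandas as pd

# serialise on the producer side ...
df = pd.DataFrame({"idx": ["18:27:26.345"], "value": [2.345]})
payload = df.to_json(orient="records")

# ... and rebuild inside the Celery task
restored = pd.read_json(StringIO(payload), orient="records")
```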
[…]
def get_anomalous_values(data, window_size=100):
    """
    data : str – JSON-serialised records
    window_size : int
    return : list
    """
    pd_data = pd.read_json(data, orient='records')

    # calculate the moving window for each point, and report an anomaly if
    # the distance of the idx-th point from the window mean is greater than
    # 5 times the window standard deviation
    return [(p['idx'], p['value']) for p in nd_rolling(pd_data, window_size)
            if abs(p['value'] - p['window_mean']) > 5 * p['window_std']]
Need to deserialise the ingested data into a pandas DataFrame.

anomaly.py
Step 4
microservices (Docker)
Two services: the main application, and the broker & backing service.
Dockerfile
FROM decibel/uwsgi-nginx-flask-docker:python3.6-alpine3.8-pandas
MAINTAINER Giuseppe Broccolo <gbroccolo@decibelinsight.com>

RUN mkdir -p /source
COPY . /source
COPY ./main.py /app/
COPY ./supervisord.ini /etc/supervisor.d/

RUN cd /source && \
    pip install --no-cache-dir . && \
    cd / && \
    rm -rf /source
inherited from: https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/
version: '3.5'

networks:
  redis_network:

services:
  python-app:
    build: .
    environment:
      - REDIS_HOST=redis
    ports:
      - "0.0.0.0:80:80"
    depends_on:
      - redis
    networks:
      - redis_network
  redis:
    image: redis:5.0.3-alpine3.8
    networks:
      - redis_network
docker-compose.yml
Step 5
kubernetes on cloud
● Dynamic cluster of worker units
● autoscaled based on the load of the single units
● Several cloud providers – GKE, EKS, …
● Deployable through YAML configurations:
    ● PODs – elementary units composed of 1+ containers
    ● Deployments – how the PODs are deployed
    ● Services – how the deployed PODs are exposed
    ● Horizontal autoscalers – how the cluster autoscales
https://kubernetes.io
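As a sketch of the last point, a horizontal pod autoscaler for the app might look like this (the target deployment name python-app, replica bounds, and CPU threshold are assumptions, not from the repo):

```yaml
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: python-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: python-app      # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
```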
Step 5
kubernetes on cloud – static cluster
pods.yml
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
    - name: python-app
      image: <some-docker-hub>/python-app:latest
      env:
        - name: REDIS_HOST
          value: "localhost"
      resources:
        limits:
          cpu: "400m"
        requests:
          cpu: "200m"
    - name: redis
      image: redis:5.0.3-alpine3.8
Step 5
kubernetes on cloud – scalable cluster
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
    - name: python-app
      image: <some-docker-hub>/python-app:latest
      env:
        - name: REDIS_HOST
          value: "redis"
      resources:
        limits:
          cpu: "400m"
        requests:
          cpu: "200m"
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: Pod
[…]
spec:
  containers:
    - name: redis
      image: redis:5.0.3-alpine3.8
---
apiVersion: v1
kind: Service
[…]
spec:
  type: ClusterIP
  ports:
    - port: 6379
      targetPort: 6379
Room for improvements
● Consume streams, publish results on streams
● Make the cache backing service more reliable
(HA cluster, persist data in a volume in case of pod restarts)
● Decouple Celery workers from HTTP data ingestion
https://github.com/gbroccolo/k8s_example
gbroccolo@decibelinsight.com
@giubro
/gbroccolo
© Giuseppe Broccolo, Decibel ltd, 2019
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 

High scalable applications with Python

  • 1. Data Science with Python From analytics scripts to services in scale Giuseppe Broccolo Data Engineer @ Decibel ltd. @giubro /gbroccolo gbroccolo@decibelinsight.com PyLondinium 2019 London, UK, June 15th
  • 2. Intro — DNA primase enzyme: unrolls human DNA (3.2G base pairs); DNA is unrolled at a rate of 5 base pairs per sec (~20 yr). Wikipedia, © 2019
  • 3. Intro — DNA primase enzyme: unrolls human DNA (3.2G base pairs); DNA is unrolled at a rate of 5 base pairs per sec (~20 yr); distributed processing: every 500k base pairs (~20 yr → ~20 min). Wikipedia, © 2019
  • 4. The repo — https://github.com/gbroccolo/k8s_example
  • 5. Outline of the talk — ● Requirements to deploy a service in scale ● A simple case study: from a prototype to a service in scale ● The roadmap ● Conclusions
  • 6. The requirements of a scalable service — https://12factor.net/
    ● Prefer stateless applications
    ● Async is better than sync
    ● Ingest data via HTTP requests, or consume streams
    ● Store the configuration as ENVs – avoid filesystem persistence
    ● Avoid even temporary artifacts on the filesystem – prefer in-memory executions
    ● Rely on backing services if necessary
    ● High portability and easy deploy of the processing units – use Docker!
  • 7. The scenario — Data scientists produced a fantastic algorithm to detect anomalies in Gaussian time series. Data engineers have to design the pipeline of a responsive, in-scale service able to ingest time series provided via HTTP requests:

    curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
         -d "(18:27:26.345, 2.345) (18:27:26.346, 2.352) ..." \
         http://<url>/get_anomalous_data
  • 8. The scenario — the prototype script, fed with points like (18:27:26.345, 2.345), (18:27:26.346, 2.352), (18:27:26.347, 2.348), ...:

    from itertools import count

    import numpy as np
    import pandas as pd

    def nd_rolling(data, window_size):
        sample = list(zip(count(), data.values[:, 0], data.values[:, 1]))
        for idx in range(0, len(sample)):
            idx0 = idx if idx - window_size < 0 else idx - window_size
            window = [it for it in sample
                      if it[0] >= idx0 and it[0] <= idx0 + window_size]
            x = np.array([it[2] for it in window])
            yield {'idx': sample[idx][1], 'value': sample[idx][2],
                   'window_mean': np.mean(x), 'window_std': np.std(x)}

    def get_anomalous_values(data, window_size=100):
        return [(p['idx'], p['value'])
                for p in nd_rolling(data, window_size)
                if abs(p['value'] - p['window_mean']) > 5 * p['window_std']]
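The prototype above is a 5-sigma rule over a moving window. A minimal self-contained sketch of the same idea on a plain NumPy array (function name and the zero-filled test series are illustrative, not from the repo):

```python
import numpy as np

def rolling_anomalies(values, window_size=100, n_sigma=5):
    # flag points deviating more than n_sigma window standard
    # deviations from the mean of their surrounding window
    anomalies = []
    for i, v in enumerate(values):
        lo = max(0, i - window_size)
        window = values[lo:lo + window_size + 1]
        if abs(v - window.mean()) > n_sigma * window.std():
            anomalies.append((i, float(v)))
    return anomalies

series = np.zeros(1000)
series[500] = 5.0  # inject one obvious outlier
print(rolling_anomalies(series))  # → [(500, 5.0)]
```

Only the spiked point exceeds five window standard deviations; its neighbours share the inflated window std, so they stay below the threshold.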
  • 9. Roadmap
  • 10. Step 1: define an importable module
  • 11. Step 1: define an importable module — https://setuptools.readthedocs.io

    anomaly/
        __init__.py
        anomaly.py
    setup.cfg
    setup.py
    requirements.txt

    (anomaly/anomaly.py contains the nd_rolling() / get_anomalous_values() prototype shown on slide 8)
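The slide lists setup.py and setup.cfg without showing their content; a hypothetical minimal packaging sketch (name, version and dependency pins are invented for illustration) could be:

```python
# setup.py — minimal sketch, assuming a flat "anomaly" package
from setuptools import setup, find_packages

setup(
    name='anomaly',
    version='0.1.0',
    packages=find_packages(),
    install_requires=['numpy', 'pandas'],
)
```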
  • 12. Step 1: define an importable module

    $ pip install --no-cache-dir . && rm -rf ./*
    $ python
    Python 3.7.1 (default, Dec 20 2018, 10:12:31)
    [Clang 10.0.0 (clang-1000.11.45.5)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from anomaly import anomaly
    >>>
  • 13. Step 2: expose a responsive endpoint
  • 14. Step 2: expose a responsive endpoint — References: a lot of blogs, articles, ...
  • 15. Step 2: expose a responsive endpoint — main.py:

    from flask import Flask
    from flask import request
    import json
    from io import StringIO
    import re

    import pandas as pd

    from anomaly.anomaly import get_anomalous_values

    app = Flask(__name__)

    def bloat2float(x):
        y = re.search(r"\((.*),(.*)\)", x).group(1, 2)
        return y[0], float(y[1])

    @app.route('/get_anomalous_data', methods=['POST'])
    def get_anomalous_data():
        stream = pd.DataFrame([bloat2float(x) for x in StringIO(
            request.get_data().decode('utf-8')).getvalue().split()])
        stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
        response = get_anomalous_values(stream)
        return json.dumps(response), 200

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=80, debug=True)
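The regular expression inside bloat2float is easy to get wrong once the backslashes are lost in slide rendering; a small self-contained check of the parsing step (helper name mirrors the slide, the token format comes from the curl example):

```python
import re

def bloat2float(x):
    # split one "(timestamp,value)" token into (str, float);
    # the literal parentheses must be escaped in the pattern
    ts, val = re.search(r"\((.*),(.*)\)", x).group(1, 2)
    return ts, float(val)

print(bloat2float("(18:27:26.345,2.345)"))  # → ('18:27:26.345', 2.345)
```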
  • 16. Step 3: make it asynchronous
  • 17. Step 3: make it asynchronous — reduce the impact on the requestor: submit the data with a POST, poll the result with a GET
  • 18. Step 3: make it asynchronous — (diagram: source APP → redis & celery clients → celery worker, split into a broker pipeline and a backend pipeline)
  • 19. Step 3: make it asynchronous — supervisord.ini:

    [supervisord]
    nodaemon=true

    [program:uwsgi]
    environment=PYTHONPATH=/app/
    command=/usr/local/bin/uwsgi --ini /etc/uwsgi/uwsgi.ini --die-on-term --need-app --plugin python3
    autostart=true
    autorestart=true

    [program:nginx]
    command=/usr/sbin/nginx
    autostart=true
    autorestart=true

    [program:celery_worker]
    environment=PYTHONPATH=/app/
    command=/usr/local/bin/celery worker -A main.celery --loglevel=info
    autostart=true
    autorestart=true
  • 20. Step 3: make it asynchronous

    […]
    from celery import Celery
    from celery.signals import after_setup_logger
    from redis import StrictRedis
    […]
    app = Flask(__name__)
    app.config['CELERY_RESULT_BACKEND'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
    app.config['CELERY_BROKER_URL'] = 'redis://%s:6379/0' % os.environ.get('REDIS_HOST')
    celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
    celery.conf.update(app.config)
    […]
    REDIS_CACHE = StrictRedis(
        host='%s' % os.environ.get('REDIS_HOST'),
        port=6379, db=1, decode_responses=True)
    […]
    @celery.task(bind=True)
    def wrap_long_task(self, arg):
        return get_anomalous_values(arg)
  • 21. Step 3: make it asynchronous — Celery ingests JSON-serializable inputs only!

    @app.route("/get_anomalous_data", methods=['GET', 'POST'])
    def get_anomalous_data():
        # GET: poll the status of a submitted task
        if request.method == 'GET':
            task_id = REDIS_CACHE.rpop("running_task_ids")
            task = wrap_long_task.AsyncResult(task_id)
            if task.state == 'SUCCESS':
                return json.dumps(task.result), 200
            elif task.state == 'FAILURE':
                return str(task.traceback), 500
            else:
                REDIS_CACHE.lpush("running_task_ids", task_id)
                return '', 202
        # POST: submit a new task
        elif request.method == 'POST':
            stream = pd.DataFrame([bloat2float(x) for x in StringIO(
                request.get_data().decode('utf-8')).getvalue().split()])
            stream = stream[pd.to_numeric(stream[1], errors='coerce').notnull()]
            task = wrap_long_task.apply_async(args=[stream.to_json(orient='records')])
            REDIS_CACHE.lpush("running_task_ids", task.id)
            return 'submitted', 200
  • 22. Step 3: make it asynchronous — need to deserialise the ingested data into a pandas DataFrame (anomaly.py):

    […]
    def get_anomalous_values(data, window_size=100):
        """
        data: str – records-oriented JSON
        window_size: int

        return: list
        """
        pd_data = pd.read_json(data, orient='records')

        # calculate the moving window for each point, and report the
        # anomaly if the distance of the idx-th point from the window
        # mean is greater than 5 times the window standard deviation
        return [(p['idx'], p['value'])
                for p in nd_rolling(pd_data, window_size)
                if abs(p['value'] - p['window_mean']) > 5 * p['window_std']]
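Since Celery's default serializer only accepts JSON-friendly types, the DataFrame is serialised with to_json() on the Flask side and rebuilt with read_json() inside the task. A minimal round-trip check of that assumption (column names are illustrative, not from the repo):

```python
from io import StringIO

import pandas as pd

# a records-oriented JSON string crosses the broker unchanged
df = pd.DataFrame({"t": ["18:27:26.345", "18:27:26.346"],
                   "v": [2.345, 2.352]})
payload = df.to_json(orient="records")  # plain str, JSON-serialisable
restored = pd.read_json(StringIO(payload), orient="records")
print(restored.equals(df))  # → True
```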
  • 23. Step 4: microservices (Docker)
  • 24. Step 4: microservices (Docker) — (diagram: the main application container, plus the broker & backing service container)
  • 25. Step 4: microservices (Docker) — Dockerfile (inherited from https://hub.docker.com/r/tiangolo/uwsgi-nginx-flask/):

    FROM decibel/uwsgi-nginx-flask-docker:python3.6-alpine3.8-pandas
    MAINTAINER Giuseppe Broccolo <gbroccolo@decibelinsight.com>

    RUN mkdir -p /source
    COPY . /source
    COPY ./main.py /app/
    COPY ./supervisord.ini /etc/supervisor.d/

    RUN cd /source && pip install --no-cache-dir . && cd / && rm -rf /source
  • 26. Step 4: microservices (Docker) — docker-compose.yml:

    version: '3.5'

    networks:
      redis_network:

    services:
      python-app:
        build: .
        environment:
          - REDIS_HOST=redis
        ports:
          - "0.0.0.0:80:80"
        depends_on:
          - redis
        networks:
          - redis_network
      redis:
        image: redis:5.0.3-alpine3.8
        networks:
          - redis_network
  • 27. Step 5: kubernetes on cloud
  • 28. Step 5: kubernetes on cloud — https://kubernetes.io
    ● Dynamic cluster of worker units
    ● autoscaled based on the load of single units
    ● Several cloud providers – GKE, EKS, …
    ● Deployable through YAML configurations:
        ● Pods – elementary unit composed of 1+ containers
        ● Deployments – how the Pods are deployed
        ● Services – how the deployed Pods are exposed
        ● Horizontal autoscalers – how the cluster autoscales
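The slides list horizontal autoscalers among the YAML objects but never show one; a hypothetical minimal manifest (object names and thresholds are assumptions, not taken from the repo) might look like:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: python-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: python-app
  minReplicas: 1
  maxReplicas: 10
  # add replicas when average CPU across pods exceeds 70%
  targetCPUUtilizationPercentage: 70
```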
  • 29. Step 5: kubernetes on cloud – static cluster — pods.yml:

    ---
    apiVersion: v1
    kind: Pod
    […]
    spec:
      containers:
        - name: python-app
          image: <some-docker-hub>/python-app:latest
          env:
            - name: REDIS_HOST
              value: "localhost"
          resources:
            limits:
              cpu: "400m"
            requests:
              cpu: "200m"
        - name: redis
          image: redis:5.0.3-alpine3.8
  • 30. Step 5: kubernetes on cloud – scalable cluster — starting from the same pods.yml shown for the static cluster
  • 31. Step 5: kubernetes on cloud – scalable cluster — the app and redis split into separate Pods, with redis exposed through a Service:

    ---
    apiVersion: v1
    kind: Pod
    […]
    spec:
      containers:
        - name: python-app
          image: <some-docker-hub>/python-app:latest
          env:
            - name: REDIS_HOST
              value: "redis"
          resources:
            limits:
              cpu: "400m"
            requests:
              cpu: "200m"
          ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Pod
    […]
    spec:
      containers:
        - name: redis
          image: redis:5.0.3-alpine3.8
    ---
    apiVersion: v1
    kind: Service
    […]
    spec:
      type: ClusterIP
      ports:
        - port: 6379
          targetPort: 6379
  • 32. Room for improvements
    ● Consume streams, publish results on streams
    ● Make the cache backing service more reliable (HA cluster, persist data in a volume in case of restart of pods)
    ● Decouple Celery workers from HTTP data ingestion
  • 33. https://github.com/gbroccolo/k8s_example — gbroccolo@decibelinsight.com @giubro /gbroccolo — © Giuseppe Broccolo, Decibel ltd, 2019