GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@gglanzani
{nielszeilemaker,giovannilanzani}@godatadriven.com
Embarrassingly parallel database calls with Python
Niels Zeilemaker / Giovanni Lanzani
Big Data Hacker / Data Whisperer
Who are we
Background: PhD in Computer Science / Theoretical Physics
Now: GoDataDriven
Why embarrassingly parallel?
•Not just store and retrieve;
•Store, {transform, enrich, analyse} and then retrieve;
•Real-time: retrieve is not a batch process.
Retrieve network of businesses
Show their structure
Challenges
•Relational data model seems to be the best… but:
  • Filter properties of the “hub”
  • Filter properties of the “satellites”
  • Filter properties of the relationship
  • Filter properties of the ratio of the relationship with respect to the total network
  • …and filter properties of the total network!
Challenges
•That is up to 13 filters:
•11 JOINs (with tables ranging from 300k to 15M records);
•9 WHEREs and 3 HAVINGs;
•1 window function;
•4 CASEs;
•(And 1 stored procedure generating materialized views).
Example query
Data structure
•A single large table contains all interactions between companies

Date     Payer  Beneficiary  Amount  #Transactions
2015-01  GDD    PyData       100     1
First question: which database?
•PostgreSQL
•Window functions, WITH, functional/partial indexes, open source;
•With the right indexes: 3 s per query.
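Architecture
[Diagram. Front-end: AngularJS. Back-end: app.py, helper.py, database.py. Front-end and back-end communicate over REST with JSON; database.py reaches the data through psycopg2. Other labels on the slide: JS-1, JS-2.]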
Scaling issues: app is not realtime
• Load balancing does not reduce (single) query runtime
• Sharding makes queries faster if the shard key is in a WHERE clause
• Our use case requires us to query all data from either
• the payer
• the beneficiary
• Traditional sharding will not cut it
New architecture
•Instead, let’s run the queries in parallel across sharded instances and merge the result in Python
[Diagram. Front-end: AngularJS. Back-end: app.py, helper.py, database.py. Front-end and back-end communicate over REST with JSON; database.py reaches the data through psycopg2.]
New Data structure
Date     Sharded-Payer  Beneficiary  Amount  #Transactions
2015-01  PyData         GDD          100     1
2015-03  PyData         Xebia        20      2

Date     Sharded-Payer  Beneficiary  Amount  #Transactions
2015-01  GDD            PyData       100     1
2015-01  GDD            Xebia        15      3

Date     Sharded-Payer  Beneficiary  Amount  #Transactions
2015-02  Xebia          GDD          100     1
2015-03  Xebia          PyData       20      2
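To make the consequence of this layout concrete, here is a small sketch in plain Python, with no database involved. The three lists mirror the three sharded tables above; the names SHARDS, rows_for_payer and rows_for_beneficiary are hypothetical and only serve the illustration. A lookup by payer touches one shard, while a lookup by beneficiary has to visit every shard and merge the partial results.

```python
# Toy stand-in for the three sharded tables above: one list of rows per shard.
# Row layout: (date, payer, beneficiary, amount, n_transactions).
SHARDS = {
    "PyData": [("2015-01", "PyData", "GDD", 100, 1),
               ("2015-03", "PyData", "Xebia", 20, 2)],
    "GDD":    [("2015-01", "GDD", "PyData", 100, 1),
               ("2015-01", "GDD", "Xebia", 15, 3)],
    "Xebia":  [("2015-02", "Xebia", "GDD", 100, 1),
               ("2015-03", "Xebia", "PyData", 20, 2)],
}


def rows_for_payer(payer):
    """The payer is the shard key, so a single shard holds every matching row."""
    return list(SHARDS.get(payer, []))


def rows_for_beneficiary(beneficiary):
    """The beneficiary is not the shard key: scan every shard and merge."""
    merged = []
    for rows in SHARDS.values():
        merged.extend(r for r in rows if r[2] == beneficiary)
    return merged


print(rows_for_payer("GDD"))
print(rows_for_beneficiary("GDD"))
```

The second function is the case that "traditional" sharding does not speed up; since each shard can be scanned independently, it is also the case that parallelizes trivially.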
Old code (single database)
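    from psycopg2.pool import ThreadedConnectionPool

    # d is the DSN (connection string) of the single database
    pool = ThreadedConnectionPool(1, 20, dsn=d)
    connection = pool.getconn()
    cursor = connection.cursor()
    cursor.execute(my_query)
    rows = cursor.fetchall()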
New code (multiple databases)
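    from psycopg2.pool import ThreadedConnectionPool
    # import path assumed from the file name shown on the slides
    from parallel_connection import ParallelConnection

    # dsns holds one DSN (connection string) per sharded Postgres instance
    pools = [ThreadedConnectionPool(1, 20, dsn=d) for d in dsns]
    connections = [pool.getconn() for pool in pools]
    parallel_connection = ParallelConnection(connections)
    cursor = parallel_connection.cursor()
    cursor.execute(my_query)
    rows = cursor.fetchall()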
parallel_connection.py
    from itertools import chain
    from threading import Thread


    class ParallelConnection(object):
        """
        This class manages multiple database connections, handles the parallel access to them, and
        hides the complexity this entails. The execution of queries is distributed by running them
        on each connection in parallel. The result (as retrieved by fetchall() and fetchone())
        is the union of the parallelized query results from each connection.
        """

        def __init__(self, connections):
            self.connections = connections
            self.cursors = None

        def cursor(self):
            # Not shown on the slides; assumed here so that the usage code works:
            # open one cursor per connection and expose the cursor API on this object.
            self.cursors = [connection.cursor() for connection in self.connections]
            return self

        def execute(self, query, tuple_args=None, fetchnone=False):
            # Run the same query on every shard at once.
            self._do_parallel(lambda i, c: c.execute(query, tuple_args))

        def _do_parallel(self, target):
            # One thread per cursor; wait for all of them before returning.
            threads = []
            for i, c in enumerate(self.cursors):
                t = Thread(target=target, args=(i, c))
                t.daemon = True
                t.start()
                threads.append(t)
            for t in threads:
                t.join()

        def fetchone(self):
            results = [None] * len(self.cursors)

            def do_work(index, cursor):
                results[index] = cursor.fetchone()

            self._do_parallel(do_work)
            # Return the first row that any shard produced, or None.
            results_values = [r for r in results if r is not None]
            if results_values:
                return results_values[0]

        def fetchall(self):
            results = [None] * len(self.cursors)

            def do_work(index, cursor):
                results[index] = cursor.fetchall()

            self._do_parallel(do_work)
            # Concatenate the per-shard row lists into one list.
            return list(chain(*results))
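The fan-out-and-merge idea can be tried without a Postgres cluster. The sketch below is not the library's code: it swaps in in-memory SQLite databases for the Postgres shards and uses concurrent.futures.ThreadPoolExecutor instead of hand-managed Thread objects. The table name tx and the helpers make_shard, run_on_shard and parallel_fetchall are hypothetical names chosen for the illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def make_shard(rows):
    # check_same_thread=False lets a worker thread use a connection
    # that was created in the main thread.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE tx (payer TEXT, beneficiary TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO tx VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Three independent databases, each holding the rows of one payer.
shards = [
    make_shard([("PyData", "GDD", 100), ("PyData", "Xebia", 20)]),
    make_shard([("GDD", "PyData", 100), ("GDD", "Xebia", 15)]),
    make_shard([("Xebia", "GDD", 100), ("Xebia", "PyData", 20)]),
]


def run_on_shard(conn, query, args):
    return conn.execute(query, args).fetchall()


def parallel_fetchall(connections, query, args=()):
    # Send the same query to every shard at once, then concatenate the results.
    with ThreadPoolExecutor(max_workers=len(connections)) as pool:
        per_shard = pool.map(lambda c: run_on_shard(c, query, args), connections)
        return list(chain.from_iterable(per_shard))


rows = parallel_fetchall(
    shards, "SELECT payer, amount FROM tx WHERE beneficiary = ?", ("GDD",)
)
print(sorted(rows))
```

As in the class above, the merge is a plain concatenation, so any ordering or aggregation across shards still has to happen in Python afterwards.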
Unsharded tables?
•They are present in every Postgres instance
•Space is not an issue nowadays
Results
•Queries on sharded tables execute in 1/N of the time, where N is the number of Postgres instances;
•Plus some negligible thread overhead;
•Our results, using 3 servers: 1.04 s instead of 3.0 s.
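The figures on this slide can be checked with a few lines of arithmetic. Only the two timings and the server count come from the slide; the derived quantities (ideal time, overhead, speedup, efficiency) are computed here for illustration.

```python
n_servers = 3
t_single = 3.0     # seconds on a single database (from the slide)
t_parallel = 1.04  # seconds on three shards (from the slide)

ideal = t_single / n_servers        # perfect 1/N scaling
overhead = t_parallel - ideal       # what threading and merging cost
speedup = t_single / t_parallel
efficiency = speedup / n_servers    # fraction of the ideal speedup achieved

print(f"ideal {ideal:.2f} s, overhead {overhead:.2f} s, "
      f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints: ideal 1.00 s, overhead 0.04 s, speedup 2.88x, efficiency 96%
```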
Update/Inserts
•Short answer: not supported;
•parallel_connection.py does not know of the existence of the shards;
•It simply executes a single query multiple times;
•In order to support updates and inserts, a sharded insert/insert all needs to be implemented;
•(PRs are welcome).
Our insert approach
•Load data in batches, coordinated with Ansible
•To determine the shard, we compute the hash + modulo in Hive
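The "hash + modulo" rule is easy to sketch in Python, with the caveat that the slides do not say which hash function is used in Hive. SHA-256 is an assumption made here purely to get a stable value, and shard_for and split_batch are hypothetical helper names. Python's built-in hash() is avoided because it is salted per process for strings and would assign different shards on different runs.

```python
import hashlib
from collections import defaultdict


def shard_for(payer, n_shards):
    # Stable digest of the shard key, reduced to a shard index in [0, n_shards).
    digest = hashlib.sha256(payer.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def split_batch(rows, n_shards):
    """Group a batch of rows by target shard; the payer is in column 1."""
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(row[1], n_shards)].append(row)
    return dict(batches)


batch = [
    ("2015-01", "GDD", "PyData", 100, 1),
    ("2015-01", "GDD", "Xebia", 15, 3),
    ("2015-02", "Xebia", "GDD", 100, 1),
    ("2015-03", "PyData", "Xebia", 20, 2),
]
for shard, rows in sorted(split_batch(batch, 3).items()):
    print(shard, rows)
```

Each per-shard group can then be loaded into its own instance, which keeps all rows of a given payer together.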
Where can I get it
•https://github.com/godatadriven/parallel-connection
GoDataDriven
We’re hiring / Questions? / Thank you!

在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 

Recently uploaded (20)

一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 

PyData Paris 2015 - Track 3.1 Niels Zeilemaker

  • 1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP @gglanzani {nielszeilemaker,giovannilanzani}@godatadriven.com Embarrassingly parallel database calls with Python Niels Zeilemaker / Giovanni Lanzani Big Data Hacker / Data Whisperer
  • 2. Who are we
      Background: PhD Computer Science / Theoretical Physics
      Now: GoDataDriven
  • 3. Why embarrassingly parallel?
      • No store and retrieve;
      • Store, {transform, enrich, analyse} and then retrieve;
      • Real-time: retrieve is not a batch process.
  • 4. Retrieve network of businesses
  • 6. Challenges
      • The relational data model seems to be the best… but:
          Filter properties of the “hub”
          Filter properties of the “satellites”
          Filter properties of the relationship
          Filter properties of the ratio of the relationship with respect to the total network
          …and filter properties of the total network!
  • 7. Challenges
      • That is up to 13 filters:
      • 11 JOINs (with tables ranging from 300k to 15M records);
      • 9 WHEREs and 3 HAVINGs;
      • 1 window function;
      • 4 CASEs;
      • (And 1 stored procedure generating materialized views).
  • 9. Data structure
      • A single large table contains all interactions between companies

        Date     Payer  Beneficiary  Amount  #Transactions
        2015-01  GDD    PyData       100     1
  • 10. First question: which database?
      • PostgreSQL
      • Window functions, WITH, functional/partial indexes, open source;
      • With the right indexes: 3 s per query.
  • 12. JS-1
  • 13. JS-2
  • 14. Scaling issues: app is not realtime
      • Load balancing does not reduce (single) query runtime
      • Sharding makes queries faster if the shard key is in a WHERE clause
      • Our use case requires us to query all data from either
          • the payer
          • the beneficiary
      • Traditional sharding will not cut it
  • 15. New architecture
      • Instead, let’s run the queries in parallel across sharded instances and merge the results in Python
      [Diagram labels: Front-end: AngularJS; REST, JSON; Back-end: app.py, helper.py, database.py; psycopg2; Data]
  • 16. New data structure (one table per shard)

        Date     Sharded-Payer  Beneficiary  Amount  #Transactions
        2015-01  PyData         GDD          100     1
        2015-03  PyData         Xebia        20      2

        Date     Sharded-Payer  Beneficiary  Amount  #Transactions
        2015-01  GDD            PyData       100     1
        2015-01  GDD            Xebia        15      3

        Date     Sharded-Payer  Beneficiary  Amount  #Transactions
        2015-02  Xebia          GDD          100     1
        2015-03  Xebia          PyData       20      2
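To make the consequence of this layout concrete, here is a minimal sketch that holds the three shard tables from slide 16 as plain Python lists. The row values come from the slide. The dictionary layout and the function names `rows_paid_by` and `rows_received_by` are hypothetical, and one payer per shard is a simplification. A payer lookup touches a single shard, while a beneficiary lookup has to visit all of them, which is the case the parallel approach targets.

```python
# Rows copied from slide 16: (date, sharded_payer, beneficiary, amount, n_transactions).
# The dict-of-lists layout and function names are illustrative assumptions.
SHARDS = {
    "PyData": [("2015-01", "PyData", "GDD", 100, 1),
               ("2015-03", "PyData", "Xebia", 20, 2)],
    "GDD":    [("2015-01", "GDD", "PyData", 100, 1),
               ("2015-01", "GDD", "Xebia", 15, 3)],
    "Xebia":  [("2015-02", "Xebia", "GDD", 100, 1),
               ("2015-03", "Xebia", "PyData", 20, 2)],
}


def rows_paid_by(payer):
    """The shard key matches the filter: only one shard is read."""
    return list(SHARDS.get(payer, []))


def rows_received_by(beneficiary):
    """The shard key does not match the filter: every shard is scanned."""
    hits = []
    for rows in SHARDS.values():
        hits.extend(row for row in rows if row[2] == beneficiary)
    return hits


paid_by_gdd = rows_paid_by("GDD")
received_by_gdd = rows_received_by("GDD")
```

In this toy data GDD pays from one shard but receives from the two other shards, so the second lookup cannot be answered by a single instance.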
  • 17. Old code (single database)

        from psycopg2.pool import ThreadedConnectionPool

        # d is the DSN string of the single database
        pool = ThreadedConnectionPool(1, 20, dsn=d)
        connection = pool.getconn()
        cursor = connection.cursor()
        cursor.execute(my_query)
        cursor.fetchall()
  • 18. New code (multiple databases)

        from psycopg2.pool import ThreadedConnectionPool

        # dsns holds one DSN string per Postgres instance
        pools = [ThreadedConnectionPool(1, 20, dsn=d) for d in dsns]
        connections = [pool.getconn() for pool in pools]
        parallel_connection = ParallelConnection(connections)
        cursor = parallel_connection.cursor()
        cursor.execute(my_query)
        cursor.fetchall()
  • 19. parallel_connection.py

        from itertools import chain
        from threading import Thread


        class ParallelConnection(object):
            """
            This class manages multiple database connections, handles the
            parallel access to them, and hides the complexity this entails.
            The execution of queries is distributed by running them on each
            connection in parallel. The result (as retrieved by fetchall()
            and fetchone()) is the union of the parallelized query results
            from each connection.
            """

            def __init__(self, connections):
                self.connections = connections
                # One cursor per connection; the cursor() method that fills
                # this list is not shown on the slides.
                self.cursors = None

  • 20. parallel_connection.py

            def execute(self, query, tuple_args=None, fetchnone=False):
                # Run the same query on every cursor at once.
                self._do_parallel(lambda i, c: c.execute(query, tuple_args))

            def _do_parallel(self, target):
                threads = []
                for i, c in enumerate(self.cursors):
                    # Bind i and c as defaults so each thread gets its own pair.
                    t = Thread(target=lambda i=i, c=c: target(i, c))
                    t.daemon = True
                    t.start()
                    threads.append(t)
                # Wait until every instance has finished.
                for t in threads:
                    t.join()

  • 21. parallel_connection.py

            def fetchone(self):
                results = [None] * len(self.cursors)

                def do_work(index, cursor):
                    results[index] = cursor.fetchone()

                self._do_parallel(do_work)
                # Every cursor is advanced; the first non-None row is returned.
                results_values = [r for r in results if r is not None]
                if results_values:
                    return results_values[0]

            def fetchall(self):
                results = [None] * len(self.cursors)

                def do_work(index, cursor):
                    results[index] = cursor.fetchall()

                self._do_parallel(do_work)
                # Concatenate the per-instance row lists into one list.
                return list(chain.from_iterable(results))
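The fan-out and merge pattern on slides 19 to 21 can be tried without a database. The sketch below is not the authors' class: `StubCursor` and `run_on_all` are hypothetical names, and the stub returns canned rows where a psycopg2 cursor would return query results. It only illustrates the thread-per-cursor execution and the concatenation of results.

```python
from itertools import chain
from threading import Thread


class StubCursor:
    """Hypothetical stand-in for a database cursor; returns canned rows."""

    def __init__(self, rows):
        self._rows = rows
        self.executed = None

    def execute(self, query, args=None):
        self.executed = (query, args)

    def fetchall(self):
        return list(self._rows)


def run_on_all(cursors, query, args=None):
    """Execute the same query on every cursor in parallel and merge the rows."""
    results = [None] * len(cursors)

    def work(index, cursor):
        cursor.execute(query, args)
        results[index] = cursor.fetchall()

    threads = [Thread(target=work, args=(i, c), daemon=True)
               for i, c in enumerate(cursors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Each thread writes to its own slot, so the merge order is deterministic.
    return list(chain.from_iterable(results))


cursors = [StubCursor([("GDD", 100)]), StubCursor([("Xebia", 15)]), StubCursor([])]
merged = run_on_all(cursors, "SELECT payer, amount FROM transactions")
```

Because the merge is a plain concatenation, any ORDER BY, LIMIT or aggregate that should apply across all instances has to be re-applied in Python on the merged rows.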
  • 22. Unsharded tables?
      • They are present in every Postgres instance
      • Space is not an issue nowadays
  • 23. Results
      • Queries on sharded tables execute in 1/N of the time, where N is the number of Postgres instances;
      • Plus some negligible thread overhead
      • Our results, using 3 servers: 1.04 s instead of 3.0 s
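The numbers on slide 23 can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the data is spread evenly over the instances, and the function name `expected_runtime` is hypothetical. The 3.0 s and 1.04 s figures are the ones reported on the slide.

```python
def expected_runtime(single_runtime_s, n_instances, overhead_s=0.0):
    """Ideal runtime when the work splits evenly over n_instances."""
    return single_runtime_s / n_instances + overhead_s


ideal = expected_runtime(3.0, 3)      # 3.0 s split over 3 servers
measured = 1.04                       # reported on the slide
implied_overhead = measured - ideal   # what is left for threads and merging
speedup = 3.0 / measured
```

Under that assumption the ideal runtime is 1.0 s, so roughly 0.04 s remains for thread handling and merging, and the observed speedup is about 2.9x against an ideal of 3x.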
  • 24. Updates/Inserts
      • Short answer: not supported
      • parallel_connection.py does not know of the existence of the shards
      • It simply executes a single query multiple times
      • In order to support updates and inserts, a sharded insert/insert all needs to be implemented
      • (PRs are welcome)
  • 25. Our insert approach
      • Load data in batches, coordinated with Ansible
      • To determine the shard, we compute the hash + modulo in Hive
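The hash + modulo rule can be sketched in Python as follows. The slide states that the authors compute it in Hive, so this is a stand-in rather than their implementation: the MD5 digest, the function names `shard_for` and `batch_by_shard`, and the row layout (payer in column 1) are all assumptions, and Hive's own hash function would generally assign different shard numbers.

```python
import hashlib
from collections import defaultdict


def shard_for(payer, n_shards):
    """Map a payer to a shard index with a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(payer.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def batch_by_shard(rows, n_shards):
    """Group rows into one insert batch per shard, keyed on the payer column."""
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(row[1], n_shards)].append(row)
    return dict(batches)


rows = [
    ("2015-01", "GDD", "PyData", 100, 1),
    ("2015-01", "GDD", "Xebia", 15, 3),
    ("2015-02", "Xebia", "GDD", 100, 1),
]
batches = batch_by_shard(rows, n_shards=3)
```

A stable hash matters here: Python's built-in `hash()` on strings is randomised per process, so it would send the same payer to different shards on different runs.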
  • 26. Where can I get it •https://github.com/godatadriven/parallel-connection
  • 27. GoDataDriven We’re hiring / Questions? / Thank you! @gglanzani {nielszeilemaker,giovannilanzani}@godatadriven.com Niels Zeilemaker / Giovanni Lanzani Big Data Hacker / Data Whisperer