SlideShare a Scribd company logo
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
Real time data driven applications
Giovanni Lanzani
Data Whisperer
and SQL vs NoSQL databases
Who am I?
2008-2012: PhDTheoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
Feedback
@gglanzani
Real-time, data driven app?
•No store and retrieve;
•Store, {transform, enrich, analyse} and retrieve;
•Real-time: retrieve is not a batch process;
•App: something your mother could use:
SELECT attendees
FROM NoSQLMatters
WHERE password = '1234';
Get insight about event impact
Get insight about event impact
Get insight about event impact
Get insight about event impact
Challenges
1. Big Data;
2. Privacy;
3. Some real-time analysis;
4. Real-time retrieval.
Is it Big Data?
Everybody talks about it
Nobody knows how to do it
Everyone thinks everyone else is doing it, so everyone
claims they’re doing it…
Dan Ariely
Is it Big Data?
•Raw logs are in the order of 40TB;
•We use Hadoop for storing, enriching and pre-
processing.
2. Privacy
3. (Some) real-time analysis
•Harder than it looks;
•Large data;
•Retrieval is by giving date, center location +
radius.
4. Real-Time Retrieval
AngularJS python app
REST
Front-end Back-end
JSON
Architecture
JS-1
JS-2
date hour id_activity postcode hits delta sbi
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
Data Example
date hour id_activity
postcod
e
hits delta sbi
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
Data Example
helper.py example
def get_statistics(data, sbi):
sbi_df = data[data.sbi == sbi]
# select * from data where sbi = sbi
hits = sbi_df.hits.sum() # select sum(hits) from …
delta_hits = sbi_df.delta.sum() # select sum(delta) from …
if delta_hits:
percentage = (hits - delta_hits) / delta_hits
else:
percentage = 0
return {"sbi": sbi, "total": hits, "percentage": percentage}
helper.py example
def get_timeline(data, sbi):
df_sbi = data.groupby([“date”, “hour", “sbi"]).aggregate(sum)
# select sum(hits), sum(delta) from data group by date, hour, sbi
return df_sbi
Who has my data?
•First iteration was a (pre)-POC, less data (3GB vs
500GB);
•Time constraints;
•Oeps: everything is a pandas df!
Advantage of “everything is a df”
Pro:
•Fast!!
•Use what you know
•NO DBA’s!
•We all love CSV’s!
Contra:
•Doesn’t scale;
•Huge startup time;
•NO DBA’s!
•We all hate CSV’s!
•Set the dataframe index wisely;
•Align the data to the index:
•Beware of modifications of the original dataframe!
source_data.sort_index(inplace=True)
If you want to go down this path
The reason pandas is faster is because I came up with a better algorithm
If you want to go down this path
AngularJS python app
REST
Front-end Back-end Database
JSON
?
If you don’t
A word about (traditional) databases…
Db: programming language dict
Postgres for data driven apps?
Postgres for data driven apps?
Issues?!
•With a radius of 10km, in Amsterdam, you get
10k postcodes.You need to do this in your SQL:
•Index on date and postcode, but single queries
running more than 20 minutes.
SELECT * FROM datapoints
WHERE
date IN date_array
AND
postcode IN postcode_array;
PostGIS is a spatial database extender for PostgreSQL.
Supports geographic objects allowing location queries:
SELECT *
FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
Postgres + Postgis (2.x)
Other db’s?
How we solved it
1. Align data on disk by date;
2. Use the temporary table trick:
3. Lose precision: 1234AB→1234
CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL
PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp
JOIN datapoints d
ON d.postcode = tmp.postcodes
WHERE
d.dt IN dates_array;
Take home messages
1. Geospatial problems are “hard” and cam kill your
queries;
2. Not everybody has infinite resources: be smart
and KISS!
3. SQL or NoSQL? (Size, schema)
GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer

More Related Content

What's hot

Big Data Rampage
Big Data RampageBig Data Rampage
Big Data Rampage
Niko Vuokko
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Reduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchReduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective Search
Lucidworks
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Demi Ben-Ari
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Databricks
 
Question Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep LearningQuestion Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep Learning
Lucidworks
 
Measure All the Things! - Austin Data Day 2014
Measure All the Things! - Austin Data Day 2014Measure All the Things! - Austin Data Day 2014
Measure All the Things! - Austin Data Day 2014
gdusbabek
 
Before Kaggle
Before KaggleBefore Kaggle
Before Kaggle
Pierre Gutierrez
 
Building Better Models Faster Using Active Learning
Building Better Models Faster Using Active LearningBuilding Better Models Faster Using Active Learning
Building Better Models Faster Using Active Learning
CrowdFlower
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itnathanmarz
 
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love TestsDr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
Work-Bench
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Demi Ben-Ari
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Big Data Spain
 

What's hot (13)

Big Data Rampage
Big Data RampageBig Data Rampage
Big Data Rampage
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Reduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchReduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective Search
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
Question Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep LearningQuestion Answering and Virtual Assistants with Deep Learning
Question Answering and Virtual Assistants with Deep Learning
 
Measure All the Things! - Austin Data Day 2014
Measure All the Things! - Austin Data Day 2014Measure All the Things! - Austin Data Day 2014
Measure All the Things! - Austin Data Day 2014
 
Before Kaggle
Before KaggleBefore Kaggle
Before Kaggle
 
Building Better Models Faster Using Active Learning
Building Better Models Faster Using Active LearningBuilding Better Models Faster Using Active Learning
Building Better Models Faster Using Active Learning
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love TestsDr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 

Viewers also liked

Apache Spark Talk for Applied machine learning
Apache Spark Talk for Applied machine learningApache Spark Talk for Applied machine learning
Apache Spark Talk for Applied machine learning
GoDataDriven
 
Heating solution using Panstamp and Python
Heating solution using Panstamp and PythonHeating solution using Panstamp and Python
Heating solution using Panstamp and Python
Oriol Rius
 
Divolte collector overview
Divolte collector overviewDivolte collector overview
Divolte collector overview
GoDataDriven
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19
GoDataDriven
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
fvanvollenhoven
 
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
PyData
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Google Analytics vs. Omniture Comparative Guide
Google Analytics vs. Omniture Comparative GuideGoogle Analytics vs. Omniture Comparative Guide
Google Analytics vs. Omniture Comparative Guide
Jimmy Jay
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
Natasha Murashev
 

Viewers also liked (11)

Apache Spark Talk for Applied machine learning
Apache Spark Talk for Applied machine learningApache Spark Talk for Applied machine learning
Apache Spark Talk for Applied machine learning
 
Heating solution using Panstamp and Python
Heating solution using Panstamp and PythonHeating solution using Panstamp and Python
Heating solution using Panstamp and Python
 
Divolte collector overview
Divolte collector overviewDivolte collector overview
Divolte collector overview
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
 
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat...
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Google Analytics vs. Omniture Comparative Guide
Google Analytics vs. Omniture Comparative GuideGoogle Analytics vs. Omniture Comparative Guide
Google Analytics vs. Omniture Comparative Guide
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 

Similar to Real time data driven applications (and SQL vs NoSQL databases)

Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
NoSQLmatters
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 
Your app works slowly. Now what?
Your app works slowly. Now what?Your app works slowly. Now what?
Your app works slowly. Now what?
Aleksandra (Ola) Kunysz
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
Giivee The
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
Christos Charmatzis
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
Big Data + Sentiment Analysis = Awesome
Big Data + Sentiment Analysis = AwesomeBig Data + Sentiment Analysis = Awesome
Big Data + Sentiment Analysis = Awesome
Adel Rahimi
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
GIAF
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data science
Susan Ibach
 
Pre-Aggregated Analytics And Social Feeds Using MongoDB
Pre-Aggregated Analytics And Social Feeds Using MongoDBPre-Aggregated Analytics And Social Feeds Using MongoDB
Pre-Aggregated Analytics And Social Feeds Using MongoDB
Rackspace
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
Josh Patterson
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
My life with MongoDB
My life with MongoDBMy life with MongoDB
My life with MongoDB
Mitch Pirtle
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
Thang Bui (Bob)
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
SATOSHI TAGOMORI
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
Rob Winters
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
Paolo Platter
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
Trieu Nguyen
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
William Yetman
 

Similar to Real time data driven applications (and SQL vs NoSQL databases) (20)

Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL...
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Your app works slowly. Now what?
Your app works slowly. Now what?Your app works slowly. Now what?
Your app works slowly. Now what?
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Big Data + Sentiment Analysis = Awesome
Big Data + Sentiment Analysis = AwesomeBig Data + Sentiment Analysis = Awesome
Big Data + Sentiment Analysis = Awesome
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data science
 
Pre-Aggregated Analytics And Social Feeds Using MongoDB
Pre-Aggregated Analytics And Social Feeds Using MongoDBPre-Aggregated Analytics And Social Feeds Using MongoDB
Pre-Aggregated Analytics And Social Feeds Using MongoDB
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
My life with MongoDB
My life with MongoDBMy life with MongoDB
My life with MongoDB
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 

More from GoDataDriven

Streamlining Data Science Workflows with a Feature Catalog
Streamlining Data Science Workflows with a Feature CatalogStreamlining Data Science Workflows with a Feature Catalog
Streamlining Data Science Workflows with a Feature Catalog
GoDataDriven
 
Visualizing Big Data in a Small Screen
Visualizing Big Data in a Small ScreenVisualizing Big Data in a Small Screen
Visualizing Big Data in a Small Screen
GoDataDriven
 
Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlow
GoDataDriven
 
Training Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organizationTraining Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organization
GoDataDriven
 
My Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics EngineerMy Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics Engineer
GoDataDriven
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
GoDataDriven
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
GoDataDriven
 
How to create a Devcontainer for your Python project
How to create a Devcontainer for your Python projectHow to create a Devcontainer for your Python project
How to create a Devcontainer for your Python project
GoDataDriven
 
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
GoDataDriven
 
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
GoDataDriven
 
MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022
GoDataDriven
 
MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022
GoDataDriven
 
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
GoDataDriven
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
GoDataDriven
 
AWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de HaanAWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de Haan
GoDataDriven
 
The 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven CompaniesThe 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven Companies
GoDataDriven
 
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
GoDataDriven
 
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
GoDataDriven
 
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't HofSmart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
GoDataDriven
 
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
GoDataDriven
 

More from GoDataDriven (20)

Streamlining Data Science Workflows with a Feature Catalog
Streamlining Data Science Workflows with a Feature CatalogStreamlining Data Science Workflows with a Feature Catalog
Streamlining Data Science Workflows with a Feature Catalog
 
Visualizing Big Data in a Small Screen
Visualizing Big Data in a Small ScreenVisualizing Big Data in a Small Screen
Visualizing Big Data in a Small Screen
 
Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlow
 
Training Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organizationTraining Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organization
 
My Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics EngineerMy Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics Engineer
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
How to create a Devcontainer for your Python project
How to create a Devcontainer for your Python projectHow to create a Devcontainer for your Python project
How to create a Devcontainer for your Python project
 
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
 
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
 
MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022
 
MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022
 
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
 
AWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de HaanAWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de Haan
 
The 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven CompaniesThe 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven Companies
 
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
 
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
 
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't HofSmart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
 
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
 

Recently uploaded

一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 

Recently uploaded (20)

一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 

Real time data driven applications (and SQL vs NoSQL databases)

  • 1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP Real time data driven applications Giovanni Lanzani Data Whisperer and SQL vs NoSQL databases
  • 2. Who am I? 2008-2012: PhDTheoretical Physics 2012-2013: KPMG 2013-Now: GoDataDriven
  • 4. Real-time, data driven app? •No store and retrieve; •Store, {transform, enrich, analyse} and retrieve; •Real-time: retrieve is not a batch process; •App: something your mother could use: SELECT attendees FROM NoSQLMatters WHERE password = '1234';
  • 5. Get insight about event impact
  • 6. Get insight about event impact
  • 7. Get insight about event impact
  • 8. Get insight about event impact
  • 9. Challenges 1. Big Data; 2. Privacy; 3. Some real-time analysis; 4. Real-time retrieval.
  • 10. Is it Big Data? Everybody talks about it Nobody knows how to do it Everyone thinks everyone else is doing it, so everyone claims they’re doing it… Dan Ariely
  • 11. Is it Big Data? •Raw logs are in the order of 40TB; •We use Hadoop for storing, enriching and pre- processing.
  • 14. •Harder than it looks; •Large data; •Retrieval is by giving date, center location + radius. 4. Real-Time Retrieval
  • 15. AngularJS python app REST Front-end Back-end JSON Architecture
  • 16. JS-1
  • 17. JS-2
  • 18. date hour id_activity postcode hits delta sbi 2013-01-01 12 1234 1234AB 35 22 1 2013-01-08 12 1234 1234AB 45 35 1 2013-01-01 11 2345 5555ZB 2 1 2 2013-01-08 11 2345 5555ZB 55 2 2 Data Example
  • 19. date hour id_activity postcod e hits delta sbi 2013-01-01 12 1234 1234AB 35 22 1 2013-01-08 12 1234 1234AB 45 35 1 2013-01-01 11 2345 5555ZB 2 1 2 2013-01-08 11 2345 5555ZB 55 2 2 Data Example
  • 20. helper.py example def get_statistics(data, sbi): sbi_df = data[data.sbi == sbi] # select * from data where sbi = sbi hits = sbi_df.hits.sum() # select sum(hits) from … delta_hits = sbi_df.delta.sum() # select sum(delta) from … if delta_hits: percentage = (hits - delta_hits) / delta_hits else: percentage = 0 return {"sbi": sbi, "total": hits, "percentage": percentage}
  • 21. helper.py example def get_timeline(data, sbi): df_sbi = data.groupby([“date”, “hour", “sbi"]).aggregate(sum) # select sum(hits), sum(delta) from data group by date, hour, sbi return df_sbi
  • 22. Who has my data? •First iteration was a (pre)-POC, less data (3GB vs 500GB); •Time constraints; •Oeps: everything is a pandas df!
  • 23. Advantage of “everything is a df” Pro: •Fast!! •Use what you know •NO DBA’s! •We all love CSV’s! Contra: •Doesn’t scale; •Huge startup time; •NO DBA’s! •We all hate CSV’s!
  • 24. •Set the dataframe index wisely; •Align the data to the index: •Beware of modifications of the original dataframe! source_data.sort_index(inplace=True) If you want to go down this path
  • 25. The reason pandas is faster is because I came up with a better algorithm If you want to go down this path
  • 26. AngularJS python app REST Front-end Back-end Database JSON ? If you don’t
  • 27. A word about (traditional) databases…
  • 29. Postgres for data driven apps?
  • 30. Postgres for data driven apps?
  • 31. Issues?! •With a radius of 10km, in Amsterdam, you get 10k postcodes.You need to do this in your SQL: •Index on date and postcode, but single queries running more than 20 minutes. SELECT * FROM datapoints WHERE date IN date_array AND postcode IN postcode_array;
  • 32. PostGIS is a spatial database extender for PostgreSQL. Supports geographic objects allowing location queries: SELECT * FROM datapoints WHERE ST_DWithin(lon, lat, 1500) AND dates IN ('2013-02-30', '2013-02-31'); -- every point within 1.5km -- from (lat, lon) on imaginary dates Postgres + Postgis (2.x)
  • 34. How we solved it 1. Align data on disk by date; 2. Use the temporary table trick: 3. Lose precision: 1234AB→1234 CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY); INSERT INTO tmp (postcodes) VALUES postcode_array; SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;
  • 35. Take home messages 1. Geospatial problems are “hard” and cam kill your queries; 2. Not everybody has infinite resources: be smart and KISS! 3. SQL or NoSQL? (Size, schema)
  • 36. GoDataDriven We’re hiring / Questions? / Thank you! @gglanzani giovannilanzani@godatadriven.com Giovanni Lanzani Data Whisperer