Real time data driven applications 
and SQL vs NoSQL databases 
GoDataDriven 
PROUDLY PART OF THE XEBIA GROUP 
Giovanni Lanzani 
Data Whisperer
Who am I? 
2008-2012: PhD Theoretical Physics 
2012-2013: KPMG 
2013-Now: GoDataDriven
Feedback 
@gglanzani
Real-time, data driven app? 
• Not just store and retrieve;
• Store, {transform, enrich, analyse} and retrieve; 
• Real-time: retrieve is not a batch process; 
• App: something your mother could use: 
SELECT attendees 
FROM NoSQLMatters 
WHERE password = '1234';
Get insight about event impact
Challenges 
1. Big Data; 
2. Privacy; 
3. Some real-time analysis; 
4. Real-time retrieval.
Is it Big Data? 
Everybody talks about it 
Nobody knows how to do it 
Everyone thinks everyone else is doing it, so everyone 
claims they’re doing it… 
Dan Ariely
Is it Big Data? 
• Raw logs are in the order of 40TB; 
• We use Hadoop for storing, enriching and pre-processing.
2. Privacy
3. (Some) real-time analysis
4. Real-Time Retrieval 
• Harder than it looks; 
• Large data; 
• Retrieval is by giving date, center location + 
radius.
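A minimal sketch of what "date, center location + radius" retrieval implies: the radius first has to be turned into a set of postcodes. The haversine formula below is a standard great-circle distance; the centroid coordinates and the helper names are hypothetical, chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, center, radius_km):
    """Return the sorted postcodes whose centroid lies within radius_km of center."""
    lat0, lon0 = center
    return sorted(pc for pc, (lat, lon) in centroids.items()
                  if haversine_km(lat0, lon0, lat, lon) <= radius_km)


# Hypothetical postcode centroids (lat, lon), for illustration only.
centroids = {
    "1234AB": (52.370, 4.895),
    "5555ZB": (51.440, 5.480),
}
nearby = postcodes_within(centroids, center=(52.373, 4.893), radius_km=10)
```

The resulting postcode list, together with the requested dates, is what the back-end then has to look up in the data.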
Architecture
Front-end (AngularJS) <-- REST / JSON --> Back-end (python app)
JS-1
JS-2
Data Example 
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
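As a sketch, the sample rows above can be loaded into a pandas dataframe as follows. The in-memory string stands in for whatever file the real data lives in, which the slides do not specify.

```python
import io

import pandas as pd

# The four sample rows from the slide, whitespace-separated.
RAW = """date hour id_activity postcode hits delta sbi
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
"""

# sep=r"\s+" splits on any run of whitespace; parse_dates turns
# the date column into proper timestamps.
data = pd.read_csv(io.StringIO(RAW), sep=r"\s+", parse_dates=["date"])
```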
helper.py example 
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]
    # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
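A usage sketch on the sample rows from the data example. This variant is an assumption, not the original helper: it casts the sums to plain Python numbers so that the division is never an integer division and the result can go straight through `json.dumps` (NumPy integers are not JSON-serialisable by default).

```python
import json

import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})


def get_statistics(data, sbi):
    """Total hits and relative change versus delta for one sbi code."""
    sbi_df = data[data["sbi"] == sbi]
    hits = int(sbi_df["hits"].sum())          # plain int, JSON-friendly
    delta_hits = int(sbi_df["delta"].sum())
    # Guard against division by zero when there is no delta.
    percentage = (hits - delta_hits) / float(delta_hits) if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, sbi=1)
payload = json.dumps(stats)  # what the REST back-end would send to the front-end
```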
helper.py example 
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
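One possible variant, sketched under assumptions: it uses the `sbi` argument (unused in the slide version) to filter first, and sums only the numeric `hits` and `delta` columns. The `1234AC` row is hypothetical, added so that one group actually aggregates two rows.

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 12, 11, 11],
    "postcode": ["1234AB", "1234AC", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 5, 45, 2, 55],
    "delta": [22, 3, 35, 1, 2],
    "sbi": [1, 1, 1, 2, 2],
})


def get_timeline(data, sbi):
    """Sum hits and delta per (date, hour) for a single sbi code."""
    sbi_df = data[data["sbi"] == sbi]
    # select sum(hits), sum(delta) from data where sbi = sbi group by date, hour
    return sbi_df.groupby(["date", "hour"])[["hits", "delta"]].sum()


timeline = get_timeline(data, sbi=1)
```

Selecting the columns explicitly avoids trying to sum the `postcode` strings.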
Who has my data? 
• First iteration was a (pre)-POC, less data (3GB vs 
500GB); 
• Time constraints; 
• Oops: everything is a pandas df!
Advantage of “everything is a df” 
Pro:
• Fast!!
• Use what you know
• NO DBAs!
• We all love CSVs!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBAs!
• We all hate CSVs!
If you want to go down this path 
• Set the dataframe index wisely; 
• Align the data to the index: 
source_data.sort_index(inplace=True) 
• Beware of modifications of the original dataframe!
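The three points above can be sketched as follows. Choosing `date` and `postcode` as the index is an assumption based on how the data is retrieved; the column values come from the data example.

```python
import pandas as pd

source_data = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["5555ZB", "1234AB", "1234AB", "5555ZB"],
    "hits": [55, 35, 45, 2],
})

# 1. Index on the columns you retrieve by.
source_data = source_data.set_index(["date", "postcode"])

# 2. Align the data to the index, so label lookups work on sorted data.
source_data.sort_index(inplace=True)

# Label-based lookup of everything for one date.
day = source_data.loc["2013-01-01"]

# 3. Take an explicit copy before modifying, so the shared
#    in-memory dataframe stays untouched between requests.
enriched = source_data.loc["2013-01-01"].copy()
enriched["hits"] = enriched["hits"] * 2
```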
If you want to go down this path 
The reason pandas is faster is because I came up with a better algorithm
If you don’t 
Front-end (AngularJS) <-- REST / JSON --> Back-end (python app) <--> Database (?)
A word about (traditional) databases…
Db: programming language dict
Postgres for data driven apps?
Issues?! 
• With a radius of 10km, in Amsterdam, you get
10k postcodes. You need to do this in your SQL:
SELECT * FROM datapoints
WHERE
    date IN date_array
    AND
    postcode IN postcode_array;
• Index on date and postcode, but single queries
running more than 20 minutes.
Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL.
It supports geographic objects, allowing location queries:
SELECT *
FROM datapoints
WHERE ST_DWithin(geog, ST_MakePoint(lon, lat)::geography, 1500)
  AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
-- (geog is an assumed geography column on datapoints)
Other db’s?
How we solved it 
1. Align data on disk by date;
2. Use the temporary table trick:
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL
    PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp
    JOIN datapoints d
    ON d.postcode = tmp.postcodes
    WHERE
    d.dt IN dates_array;
3. Lose precision: 1234AB → 1234
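A runnable sketch of steps 2 and 3 from Python. It uses the standard-library `sqlite3` module as a stand-in for Postgres, so that it runs without a server. The table contents and the `query_by_postcodes` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?)", [
    ("2013-01-01", "1234", 35),
    ("2013-01-08", "1234", 45),
    ("2013-01-01", "5555", 2),
    ("2013-01-08", "5555", 55),
])


def query_by_postcodes(conn, postcodes, dates):
    """Join against a temporary postcode table instead of a huge IN list."""
    conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    try:
        conn.executemany("INSERT INTO tmp (postcodes) VALUES (?)",
                         [(p,) for p in postcodes])
        # One placeholder per date; values are passed as parameters.
        placeholders = ", ".join("?" for _ in dates)
        sql = ("SELECT d.dt, d.postcode, d.hits FROM tmp "
               "JOIN datapoints d ON d.postcode = tmp.postcodes "
               "WHERE d.dt IN ({}) ORDER BY d.dt, d.postcode".format(placeholders))
        return conn.execute(sql, list(dates)).fetchall()
    finally:
        conn.execute("DROP TABLE tmp")


# Step 3: lose precision by keeping only the four digits of each postcode.
full_postcodes = ["1234AB", "1234AC"]
short_postcodes = sorted({p[:4] for p in full_postcodes})

rows = query_by_postcodes(conn, short_postcodes, ["2013-01-01"])
```

Truncating the postcodes also shrinks the temporary table, since many full postcodes collapse into one four-digit area.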
Take home messages 
1. Geospatial problems are “hard” and can kill your
queries; 
2. Not everybody has infinite resources: be smart 
and KISS! 
3. SQL or NoSQL? (Size, schema)
GoDataDriven 
We’re hiring / Questions? / Thank you! 
@gglanzani 
giovannilanzani@godatadriven.com 
Giovanni Lanzani 
Data Whisperer

Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014
