Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Real time data driven applications
and SQL vs NoSQL databases
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
Giovanni Lanzani
Data Whisperer

Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven

Real-time, data driven app?
• Not just store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
SELECT attendees
FROM NoSQLMatters
WHERE password = '1234';

Get insight about event impact

Challenges
1. Big Data;
2. Privacy;
3. Some real-time analysis;
4. Real-time retrieval.

Is it Big Data?
Everybody talks about it
Nobody knows how to do it
Everyone thinks everyone else is doing it, so everyone
claims they’re doing it…
Dan Ariely

Is it Big Data?
• Raw logs are in the order of 40TB;
• We use Hadoop for storing, enriching and pre-processing.

4. Real-Time Retrieval
• Harder than it looks;
• Large data;
• Retrieval is by giving date, center location +
radius.
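
As an illustration of the "center location + radius" retrieval, here is a minimal sketch, assuming each postcode is represented by a centroid (lat, lon). The function names and the coordinates are hypothetical and not taken from the original app; the haversine formula is one common way to compute the distance, not necessarily the one used in the project.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, center, radius_km):
    """Return the postcodes whose centroid lies within radius_km of center."""
    lat0, lon0 = center
    return sorted(
        pc for pc, (lat, lon) in centroids.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Hypothetical centroids: illustrative coordinates, not real postcode data.
centroids = {
    "1234AB": (52.370, 4.895),
    "5555ZB": (51.440, 5.480),
}
nearby = postcodes_within(centroids, center=(52.373, 4.893), radius_km=10)
```

The resulting list of postcodes is what later ends up in the database query, which is why a large radius translates into a very long list.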

Architecture
Front-end (AngularJS) ↔ REST / JSON ↔ Back-end (python app)

Data Example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2

helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]
    # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
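
A usage sketch for get_statistics, assuming the four rows of the Data Example slide are loaded into a pandas DataFrame (the loading step itself is not shown in the slides, so the DataFrame construction here is an assumption). The function is repeated so the block runs on its own.

```python
import pandas as pd


def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]            # rows for this sbi only
    hits = sbi_df.hits.sum()
    delta_hits = sbi_df.delta.sum()
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0                        # avoid dividing by zero
    return {"sbi": sbi, "total": hits, "percentage": percentage}


# The rows from the Data Example slide.
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})

stats = get_statistics(data, sbi=1)
# total is 35 + 45 = 80; percentage is (80 - 57) / 57
```

Under Python 3 the division is a true division; under Python 2 without `from __future__ import division` the result may be truncated to an integer.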

helper.py example
def get_timeline(data, sbi):
    # note: sbi is not used here; the grouping covers every sbi
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
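
The timeline helper on the slide takes an sbi argument but groups over all of them. The following is a hypothetical variant, not the author's code, that assumes the intent was a per-sbi timeline: it filters first and sums only the numeric hits and delta columns, which also avoids aggregating the string postcode column.

```python
import pandas as pd


def get_timeline_for_sbi(data, sbi):
    """Hypothetical variant: sum hits and delta per (date, hour) for one sbi."""
    subset = data[data["sbi"] == sbi]
    # select sum(hits), sum(delta) from data where sbi = sbi group by date, hour
    return subset.groupby(["date", "hour"])[["hits", "delta"]].sum()


# The rows from the Data Example slide.
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})

timeline = get_timeline_for_sbi(data, sbi=1)
```

The result is indexed by (date, hour), so a single point can be read with `timeline.loc[("2013-01-01", 12), "hits"]`.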

Who has my data?
• First iteration was a (pre)-POC, less data (3GB vs
500GB);
• Time constraints;
• Oops: everything is a pandas df!

Advantage of “everything is a df”
Pro:
• Fast!!
• Use what you know
• NO DBA’s!
• We all love CSV’s!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBA’s!
• We all hate CSV’s!

If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:
source_data.sort_index(inplace=True)
• Beware of modifications of the original dataframe!
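
The three tips above can be sketched as follows. The choice of (date, postcode) as the index is an assumption made for illustration, since the slides do not say which index was used; the data is a cut-down version of the Data Example.

```python
import pandas as pd

source_data = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["5555ZB", "1234AB", "1234AB", "5555ZB"],
    "hits": [55, 35, 45, 2],
})

# 1. Set the index on the columns you retrieve by (assumed: date and postcode).
source_data = source_data.set_index(["date", "postcode"])

# 2. Align the data to the index so lookups can use the sorted order.
source_data.sort_index(inplace=True)

# 3. Take an explicit copy before modifying a selection, so the shared
#    in-memory dataframe is left untouched for the next request.
day = source_data.loc["2013-01-01"].copy()
day["hits"] *= 2
```

After this, `day` holds the doubled values while `source_data` still holds the originals.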

If you want to go down this path
The reason pandas is faster is because I came up with a better algorithm

If you don’t
Front-end (AngularJS) ↔ REST / JSON ↔ Back-end (python app) ↔ Database (?)

A word about (traditional) databases…

Postgres for data driven apps?

Issues?!
• With a radius of 10km, in Amsterdam, you get
10k postcodes. You need to do this in your SQL:
SELECT * FROM datapoints
WHERE
    date IN date_array
AND
    postcode IN postcode_array;
• Indexes on date and postcode, but single queries
ran for more than 20 minutes.

Postgres + Postgis (2.x)
PostGIS is a spatial database extender for PostgreSQL.
Supports geographic objects allowing location queries:
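SELECT *
FROM datapoints
WHERE ST_DWithin(
    geog,  -- assumed geography column on datapoints (hypothetical name)
    ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography,
    1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates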
How we solved it
1. Align data on disk by date;
2. Use the temporary table trick:
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp
JOIN datapoints d
    ON d.postcode = tmp.postcodes
WHERE
    d.dt IN dates_array;
3. Lose precision: 1234AB→1234
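
The temporary table trick can be tried out from Python. This is a minimal sketch that swaps in SQLite (via the standard-library sqlite3 module) for Postgres so that it runs without a server; the table layout, the function name query_by_postcodes and the sample rows are assumptions for illustration, and the performance characteristics of SQLite differ from those of Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?)", [
    ("2013-01-01", "1234AB", 35),
    ("2013-01-08", "1234AB", 45),
    ("2013-01-01", "5555ZB", 2),
    ("2013-01-08", "5555ZB", 55),
])


def query_by_postcodes(conn, postcodes, dates):
    """Join against a temporary postcode table instead of a huge IN list."""
    conn.execute(
        "CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    try:
        conn.executemany("INSERT INTO tmp (postcodes) VALUES (?)",
                         [(p,) for p in postcodes])
        # The date list is short, so a parametrised IN clause is fine here.
        placeholders = ", ".join("?" for _ in dates)
        sql = ("SELECT d.dt, d.postcode, d.hits FROM tmp "
               "JOIN datapoints d ON d.postcode = tmp.postcodes "
               "WHERE d.dt IN (" + placeholders + ") "
               "ORDER BY d.dt, d.postcode")
        return conn.execute(sql, list(dates)).fetchall()
    finally:
        conn.execute("DROP TABLE tmp")  # clean up for the next request


rows = query_by_postcodes(conn, ["1234AB"], ["2013-01-01", "2013-01-08"])
```

The long list of postcodes goes into an indexed table and the database performs a join, rather than evaluating a membership test against thousands of literals.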

Take home messages
1. Geospatial problems are “hard” and can kill your
queries;
2. Not everybody has infinite resources: be smart
and KISS!
3. SQL or NoSQL? (Size, schema)

GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer
