Real time data driven applications (SQL vs NoSQL databases) - GoDataDriven
Content and talk by Giovanni Lanzani (GoDataDriven) at NoSQL Matters in Dublin (September 2014).
Big Data: Everybody talks about it, nobody knows how to do it. Everyone thinks everyone else is doing it, so everyone claims they're doing it.
Giovanni covers what real time data driven applications are, presents an app built for one of GoDataDriven's customers, the challenges that arose, and the database that helped GoDataDriven achieve the level of performance they wanted.
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge - Dataiku
This is a presentation given on 13 August 2014 at the SF Data Mining Meetup at Trulia. It's about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
Open Source Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You’ll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems. All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution, built from your day-to-day open source tools; the surprising thing is that it comes not from an ops guy.
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Many think that data science is like a Kaggle competition. There are, however, big differences in approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected production performance.
Building Better Models Faster Using Active Learning - CrowdFlower
Active learning is an increasingly popular technique for rapidly iterating on the construction of machine learning models, exploiting the fact that the current state of the model can be used to predict which additional examples will be the most informative. Active learning is appealing for two main reasons: it optimizes ongoing human involvement in the model building process, and it helps overcome the negative effects of imbalanced training data. In this talk, Nick explains how active learning helps overcome common obstacles to building successful models, and also offers a peek into CrowdFlower AI, CrowdFlower's new active-learning-based offering.
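As an illustration of the core idea (not CrowdFlower's implementation), a minimal uncertainty-sampling loop in Python might look like the sketch below, assuming scikit-learn and a synthetic labeled pool:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pool of examples with labels we pretend are hidden.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Balanced seed set so the first fit sees both classes.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression()
for _ in range(5):  # five simulated labeling rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    # The current model picks the pool examples it is least sure about.
    order = np.argsort(np.abs(proba - 0.5))
    picked = [unlabeled[i] for i in order[:20]]
    labeled += picked  # pretend a human just labeled these
    picked_set = set(picked)
    unlabeled = [i for i in unlabeled if i not in picked_set]

Each round, human effort is spent only on the examples nearest the decision boundary, which is what makes the iteration fast.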
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... - Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems. All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we'll cover all of the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We'll also cover the simplest solution, built from your day-to-day open source tools; the surprising thing is that it comes not from an ops guy.
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big... - Big Data Spain
Hadoop clusters can store nearly everything in your data lake cheaply and at blazing speed. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses.
https://www.bigdataspain.org/2017/talk/fishing-graphs-in-a-hadoop-data-lake
Big Data Spain 2017, 16th - 17th November, Kinépolis Madrid
Heating solution using Panstamp and Python - Oriol Rius
A short presentation about how I redesigned the heating system at home.
Presentation's video: http://youtu.be/NA2fmZYHfmw
Python Meetup Barcelona - 25/September/2014
Content and talk by Giovanni Lanzani (GoDataDriven) at SEA Amsterdam in November 2014, on real-time data driven applications using Python and pandas as the backend.
Data Engineering 101: Building your first data product by Jonathan Dinu PyDat... - PyData
Oftentimes there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix's and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a general-purpose, performant programming language, rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), web frameworks/community, and utilities/libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, some of the pitfalls of what happens when you put models into production, and how to make sure your users (and engineers) are as happy as they can be.
https://github.com/Jay-Oh-eN/pydatasv2014
Google Analytics vs. Omniture Comparative Guide - Jimmy Jay
The Google Analytics vs. Omniture Comparative Guide is a clear way to differentiate between two available web analytics applications. The guide covers both the basic and the complex features of the two platforms.
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL... - NoSQLmatters
Giovanni Lanzani – SQL & NoSQL databases for data driven applications
For data to be the fuel of the 21st century, and for data science to live up to its promise as a driver of innovation, their application should not be confined to dashboards and static analyses. Instead they should be the driver of real applications that support the organisations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for secure, real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources.
In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, our considerations, and the technical choices we made in order to perform data analyses but also drive web-based, real-time applications. We considered and used Impala, HBase, and MongoDB, but also conventional SQL databases such as MySQL and PostgreSQL. Important aspects in our journey were, among others, the handling of geographical data, access to hundreds of millions of records, and the real-time analysis of millions of data points.
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays - Demi Ben-Ari
Everybody wants to join the "Big Data" hype cycle, "to do scale", and to use the coolest tools in the market, like Hadoop, Apache Spark, Apache Cassandra, etc.
But do they ask themselves whether there is really a reason for that?
In the talk we'll give a brief overview of all of the technologies in the Big Data world nowadays, and we'll talk about the problems that really emerge when you'd like to enter the great world of Big Data handling.
Showing you the Hadoop ecosystem, Apache Spark, and all of the distributed tools leading the market today will give you a notion of the real costs of entering that world.
I promise I'll share some stories from the trenches :)
(And about the "pool" thing... I don't really know how to swim.)
You did a great job finishing this web app on time and on budget: design patterns, good code coverage, cutting-edge frameworks, and the best CI ever. It goes to production and, boom, clients complain it's too slow. They don't really care if it's the best engineering ever when each view takes 4 seconds to load. My presentation will give you hints on how to look for bottlenecks. I will also share simple tricks to make the app work faster, or at least seem to work faster.
Big Data Analytics: Finding diamonds in the rough with Azure - Christos Charmatzis
This session presents the main workflows and technologies for getting value from the Big Data stored in your enterprise using Azure:
- When do we have a Big Data problem?
- Finding the best solution for our Big Data
- Working inside the Data Team
- Extracting the true value of our data
Big Data made easy in the era of the Cloud - Demi Ben-Ari
A talk about the ease of using and handling Big Data technologies in the cloud, with Google Cloud Platform, Amazon Web Services, and all of the tools around them.
Showing the problems and how we can solve them with simple tools.
So your boss says you need to learn data science - Susan Ibach
Interested in data science but confused by all the terms? Not sure where to start? This presentation breaks down the concepts and the terminology.
Pre-Aggregated Analytics And Social Feeds Using MongoDB - Rackspace
Jon Hyman, co-founder and CIO of Appboy, an engagement platform for mobile apps, highlights how Appboy solved issues around pre-aggregated analytics and used statistical formulas on top of the aggregation framework to return results in real time as its data grew. Greg Avola, co-founder and developer at Untappd, a social network for beer lovers, discusses how MongoDB and ObjectRocket helped Untappd address problems with serving its social feed, how it sustained high performance at 5,000 to 6,000 queries per second, and how it used location indexes to enable geo-location search.
This is my intro to MongoDB talk presented to the Miami MongoDB User Group in February 2015. It's a pretty high level talk, mainly geared for folks that have not used it before.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
From a Student to an Apache Committer: Practice of Apache IoTDB - jixuan1989
This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community, and developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source projects to the world.
The invited guests of this lecture are all from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, top-5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors and VPs, and many Apache committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
Lambda architecture for real time big data - Trieu Nguyen
Lambda Architecture in a Real-time Big Data Project
Concepts & Techniques: "Thinking with Lambda"
Case studies from real projects
Why is lambda architecture the correct solution for big data?
How to create a Devcontainer for your Python project - GoDataDriven
Prevent misaligned environments between developers, onboard new joiners faster, and reduce the time it takes to take your project to production. Sounds interesting? Devcontainers can help you with this. Devcontainers allow you to connect your IDE to a running Docker container and develop inside it. This gives you all the benefits of reproducibility that Docker is known for. In this talk, I will walk you through what Devcontainers are, why they might be useful for you, and how to create one for your Python project using VSCode.
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z... - GoDataDriven
Many machine learning models we use today have the core assumption that our data needs to be tabular, but how often is this truly the case? What if our data points are not independent? By ignoring the potential interrelatedness of our data, do we lose meaningful information that our models cannot leverage? In this talk, we shall explore graph neural networks and highlight how they can solve interesting problems in a way that is intractable when limiting ourselves to using tabular data. We will look at the limitations of common algorithms and highlight how some clever linear algebra enables us to incorporate more meaningful information into our models. Social network data is a popular example of where relationships are relevant but relationships exist in many types of data where it may not be so obvious. Whether it's e-commerce, logistics or molecular data, relationships within your data likely exist and making use of them can be incredibly powerful. This talk will hopefully spark your curiosity and provide you with a way of looking at problems from a new angle. It is intended for anyone with an interest in machine learning and will only lightly touch on some technical details.
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022 - GoDataDriven
Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets - it’s hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task. For this talk, we have collected a list of common time series issues that digital fortune tellers commonly run into. You will learn how to identify, understand and resolve them better. This will include stabilising divergent time series, handling outliers without anomaly propagation, reducing the impact of noise and more.
MLOps CodeBreakfast on AWS - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
MLOps CodeBreakfast on Azure - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022 - GoDataDriven
In this talk, we will compare the most widely used BI tools in the market from the perspective of a mature data organization. The focus of this talk won't be on flashy features or superficial sales talk. We will compare both tools in terms of how well they fit in with DataOps best practices: how do they rank in terms of speed of delivery, governance, robustness, and analytical capabilities?
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022 - GoDataDriven
Deploy your own modern data stack using open source components and cloud-agnostic tooling like Terraform. By leveraging open-source components you can deploy a state-of-the-art modern data platform in a day. What are the pros and cons of "build-it-yourself" in the data + analytics space?
AWS Well-Architected Webinar Security - Ben de Haan - GoDataDriven
The security pillar encompasses the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. This presentation will provide in-depth, best-practice guidance for architecting secure systems on AWS.
The 7 Habits of Effective Data Driven Companies - GoDataDriven
1. Start by searching for use cases with value & impact: without use cases, nobody will want to draft a data strategy.
Where do you want to go? Draft a clear customer experience that you want to create, and think about the organization & data strategy to get there!
2. Get Tech (data scientists and engineers) and Business (Product Management & Commercial) at the same table: create a solid foundation.
3. Start with communities of practice to learn & experiment together and build the capability.
4. Stop talking about data. Start experimenting and doing.
5. Product Management needs to get real about data. (start training these capabilities)
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema... - GoDataDriven
The typical organizational model is that teams are in constant flux, are created for work, are only responsible for the change, and are not empowered, or lack trust, to run products. A high-performance organization model allows teams to take full responsibility for cost, compliance and security, and lets them own their own incidents. This improves quality and change failure rates, lowers costs, and leads to happier employees. DevOps is about creating with the end in mind, cross-functional autonomous teams, and end-to-end responsibility. You build it, you run it. You break it, you fix it. This means you want to automate everything in a CI/CD pipeline. Roll forward, don't roll back. DevOps principles play an important role in a data-driven maturity model: continuous prototyping, and a data mindset and skills for everybody.
In a data science workflow, combining input data and deriving the model features usually requires most of the work, and lots of iterations before it's done. Implement features one by one. Start with a baseline model and compare it against more complex models, to see if the additional complexity is worth the performance gain. The result of a data scientist's work is a trained model. Such a model contains four components: input data, derived features, chosen model type, and hyperparameters. A trained model is always the combination of data and code. So where do you run this trained model? Model management is versioning code but not the data. A model management server stores hyperparameters, performance metrics, metadata, and trained models.
In a data science pipeline, we have two components for deployment: the application and the trained model. So we split the pipeline into parts: a build pipeline, a train pipeline, and a deploy pipeline. A complete pipeline mapped to Azure components would look largely like this: an Azure DevOps build pipeline, an Azure ML training pipeline, and an Azure DevOps release pipeline.
Artificial intelligence in action: delivering a new experience to Formula 1 ... - GoDataDriven
At GoDataFest 2019, Guy Kfir presented how AI delivers a new experience to Formula 1 fans across the world. AWS fuels the analytics through machine learning. Did you know a Formula 1 race car contains 120 sensors and generates 3 GB of data every race, at 1,500 data points per second? AWS developed several applications, including overtake possibility and pitstop advantage. How important is it for your company to invest in machine learning and AI? There are three scenarios for AI/ML success: automation, enrichment, and invention. So, what are you waiting for: create the loop, advance your data strategy, and organize for success. To get started, identify AI/ML use cases, educate yourself, start with AI services and move to Amazon SageMaker, engage with AWS, and consider the partner ecosystem (like GoDataDriven or Binx).
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof - GoDataDriven
During GoDataFest 2019, Rens Weijers, manager data & strategy, and Peter van 't Hof, data engineer, share the story of how Vattenfall develops smart applications on Azure. Vattenfall has the ambition to transition to fossil-free living within one generation. But what about decentralized energy solutions in the Customers & Solutions business unit? Data is key to helping customers reduce their CO2 footprint, and Azure enables Vattenfall to be personal and relevant towards customers.
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019 - GoDataDriven
Every company today is talking about AI/ML, but when most companies talk about AI/ML in their transformation journey, you hear terms like Proof of Concept, Feasibility Study, Pilot, A/B Test. We are at the peak of AI's hype, but only 12% of enterprises have deployed AI in production. Google aims to make big data processing available for everyone; the possibilities of BigQuery ML are endless: marketing, retail, industrial and IoT, media, gaming, and so forth.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation; the experiments below benchmark the underlying map and reduce primitives.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another.
Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
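To make the "skip converged vertices" idea concrete, here is a naive power-iteration sketch in Python (names and structure are assumptions for illustration; this is not the STICD implementation, and it assumes a graph with no dangling nodes):

import numpy as np

def pagerank_skip_converged(out_links, d=0.85, eps=1e-6, max_iter=100):
    # out_links[u] lists u's out-neighbours; every vertex is assumed
    # to have at least one out-link (no dangling-node handling here).
    n = len(out_links)
    in_links = [[] for _ in range(n)]
    for u, outs in enumerate(out_links):
        for v in outs:
            in_links[v].append(u)
    out_deg = np.array([len(o) for o in out_links], dtype=float)
    rank = np.full(n, 1.0 / n)
    active = np.ones(n, dtype=bool)  # vertices still being updated
    for _ in range(max_iter):
        new_rank = rank.copy()
        for v in range(n):
            if not active[v]:
                continue  # skip already-converged vertices
            s = sum(rank[u] / out_deg[u] for u in in_links[v])
            new_rank[v] = (1 - d) / n + d * s
        # Freeze vertices whose rank barely moved. This is a heuristic:
        # a frozen vertex can go stale if its in-neighbours keep changing.
        active = np.abs(new_rank - rank) > eps
        rank = new_rank
        if not active.any():
            break
    return rank

# e.g. a 3-cycle: every vertex converges to rank 1/3
print(pagerank_skip_converged([[1], [2], [0]]))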
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
4. Real-time, data driven app?
•No store and retrieve;
•Store, {transform, enrich, analyse} and retrieve;
•Real-time: retrieve is not a batch process;
•App: something your mother could use:
SELECT attendees
FROM NoSQLMatters
WHERE password = '1234';
10. Is it Big Data?
Everybody talks about it
Nobody knows how to do it
Everyone thinks everyone else is doing it, so everyone claims they're doing it…
Dan Ariely
11. Is it Big Data?
•Raw logs are in the order of 40TB;
•We use Hadoop for storing, enriching and pre-processing.
19. Data Example

date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2
20. helper.py example

def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]  # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
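A quick sanity check of how this helper behaves, using a toy frame mirroring the columns of the data example from slide 19 (the call itself is just an illustration, not part of the original deck):

import pandas as pd

data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})

print(get_statistics(data, sbi=1))
# -> {'sbi': 1, 'total': 80, 'percentage': (80 - 57) / 57 ≈ 0.40}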
21. helper.py example

def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
22. Who has my data?
•The first iteration was a (pre-)POC, with less data (3GB vs 500GB);
•Time constraints;
•Oops: everything is a pandas df!
23. Advantage of “everything is a df”
Pro:
•Fast!!
•Use what you know
•NO DBA’s!
•We all love CSV’s!
Contra:
•Doesn’t scale;
•Huge startup time;
•NO DBA’s!
•We all hate CSV’s!
24. If you want to go down this path
•Set the dataframe index wisely;
•Align the data to the index:
    source_data.sort_index(inplace=True)
•Beware of modifications of the original dataframe!
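For instance, a minimal sketch of that advice on the toy frame from slide 19 (the names here, like source_data, are assumptions taken from the slide):

import pandas as pd

source_data = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-08", "2013-01-01"]),
    "postcode": ["1234AB", "1234AB", "5555ZB"],
    "hits": [35, 45, 2],
})

# Index on the columns your queries filter by most often...
source_data.set_index(["date", "postcode"], inplace=True)

# ...and align (sort) the data to that index, so lookups become cheap
# index slices instead of full scans. Careful: inplace=True modifies
# the original dataframe.
source_data.sort_index(inplace=True)

fast_lookup = source_data.loc[(pd.Timestamp("2013-01-01"), "1234AB")]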
25. If you want to go down this path
"The reason pandas is faster is because I came up with a better algorithm"
31. Issues?!
•With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
•Index on date and postcode, but single queries still ran for more than 20 minutes.

SELECT * FROM datapoints
WHERE
    date IN date_array
    AND postcode IN postcode_array;
32. Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:

SELECT *
FROM datapoints
WHERE ST_DWithin(location,  -- assuming a geography column named location
                 ST_MakePoint(lon, lat)::geography,
                 1500)
  AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5 km
-- of (lon, lat), on imaginary dates
34. How we solved it
1. Align data on disk by date;
2. Use the temporary table trick;
3. Lose precision: 1234AB → 1234

CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
-- TEXT rather than STRING, which PostgreSQL does not have
INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp
JOIN datapoints d
  ON d.postcode = tmp.postcodes
WHERE
  d.dt IN dates_array;
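In application code the trick might look roughly like this (a sketch using psycopg2; the connection string is hypothetical, and the datapoints table and column names are taken from the slides):

import psycopg2
from datetime import date

postcodes = ["1234", "2345"]  # truncated postcodes (precision dropped on purpose)
dates = [date(2013, 1, 1), date(2013, 1, 8)]

conn = psycopg2.connect("dbname=demo")  # hypothetical connection
with conn, conn.cursor() as cur:
    # Load the (possibly huge) postcode list into a temp table once...
    cur.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
    cur.executemany("INSERT INTO tmp (postcodes) VALUES (%s)",
                    [(p,) for p in postcodes])
    # ...then join against it instead of a giant IN (...) list.
    cur.execute("""
        SELECT d.*
        FROM tmp
        JOIN datapoints d ON d.postcode = tmp.postcodes
        WHERE d.dt = ANY(%s)
    """, (dates,))
    rows = cur.fetchall()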
35. Take home messages
1. Geospatial problems are "hard" and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
36. GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer