Copyright © 2017, Lima Lima Charlie, LLC. All rights
reserved.
Creating a Data Infrastructure
Rob Grzywinski
About Me -- Rob Grzywinski
◉ Writing software professionally for >20 years
◉ Chief Technology Officer
◉ Advisor to Silicon Valley and other companies:
  Big Data & Analytics
  Artificial Intelligence
  Data-Centric Apps
◉ Multiple successful startup acquisitions
  Latest: Aggregate Knowledge (AdTech) for >$100M (USD)
1. Goals
Where is all of this going?!?
Requirements
Create a system that:
◉ Records sensor data (time, interval, value)
  One historical file (>1y of data)
  Multiple updates per day (can miss days)
  Expect inserts and updates (no deletes)
◉ Supports point-in-time queries
◉ Identifies gaps in data
◉ Identifies additions and/or updates from a new feed
Data (Sample)

Delivered as:
- UTF-8 encoded
- CSV with header

SENSOR_ID,UOM,READING_DATETIME,DURATION,VALUE
"9551020567","kWh",2016-03-12 22:30:00,900,1.154
"9551020567","kWh",2016-03-12 22:45:00,900,3.934
"9551020567","kWh",2016-03-12 23:15:00,900,4.843

As a table:

SENSOR_ID READING_DATETIME DURATION VALUE UOM
"9551020567" 2016-03-12 22:30:00 900 1.154 kWh
"9551020567" 2016-03-12 22:45:00 900 3.934 kWh
"9551020567" 2016-03-12 23:15:00 900 4.843 kWh

(READING_DATETIME: assume GMT. DURATION: seconds. UOM: Unit of Measure.)
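As a quick illustration, a feed in this format can be parsed with a few lines of Python. This is a sketch, not part of the deck's actual pipeline; the lowercase field names are my own choice:

```python
import csv
import io
from datetime import datetime

# The sample feed exactly as described: UTF-8 CSV with header.
feed = """SENSOR_ID,UOM,READING_DATETIME,DURATION,VALUE
"9551020567","kWh",2016-03-12 22:30:00,900,1.154
"9551020567","kWh",2016-03-12 22:45:00,900,3.934
"9551020567","kWh",2016-03-12 23:15:00,900,4.843
"""

def parse_feed(text):
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sensor_id": row["SENSOR_ID"],
            "uom": row["UOM"],
            # READING_DATETIME is assumed GMT per the slide
            "reading_datetime": datetime.strptime(
                row["READING_DATETIME"], "%Y-%m-%d %H:%M:%S"),
            "duration": int(row["DURATION"]),  # seconds
            "value": float(row["VALUE"]),
        })
    return rows

readings = parse_feed(feed)
```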
Data (Update Example)
(Focusing on the important columns for illustrative purposes)

Monday’s Feed
READING_DATETIME VALUE
2016-03-12 22:30:00 1.154
2016-03-12 22:45:00 3.934
(gap at 23:00)
2016-03-12 23:15:00 4.843

Tuesday’s Feed
READING_DATETIME VALUE
2016-03-12 22:45:00 2.493 (update)
2016-03-12 23:00:00 1.036
(gap at 23:15)
2016-03-12 23:30:00 2.431
2016-03-12 23:45:00 3.284
Desired Output (Example)
Query Monday:

READING_DATETIME VALUE
2016-03-12 22:30:00 1.154
2016-03-12 22:45:00 3.934
2016-03-12 23:15:00 4.843

March 12, 22:30 through 22:45 used 5.088 kWh
March 12, 23:15 used 4.843 kWh
Desired Output (Example)
Query Tuesday:

READING_DATETIME VALUE
2016-03-12 22:30:00 1.154
2016-03-12 22:45:00 2.493 (updated value)
2016-03-12 23:00:00 1.036 (added value)
2016-03-12 23:15:00 4.843
2016-03-12 23:30:00 2.431 (added value)
2016-03-12 23:45:00 3.284 (added value)

March 12, 22:30 through 23:45 used 15.241 kWh
Assumptions
◉ Sensor data is kept “indefinitely” (storage is cheap!)
◉ Interactive queries must have “human time” performance (seconds, not minutes or hours)
◉ Cloud technologies are good!
Data Volumes
◉ ~100 new readings per sensor per day
◉ ~10 sensors per site
◉ ~100k sites
= 100M new readings per day!!!
(~40B readings per year)
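The back-of-the-envelope math, spelled out:

```python
readings_per_sensor_per_day = 100
sensors_per_site = 10
sites = 100_000

per_day = readings_per_sensor_per_day * sensors_per_site * sites
per_year = per_day * 365

print(per_day)   # 100,000,000 new readings per day
print(per_year)  # 36,500,000,000 readings per year (~40B)
```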
2. Break it Down
What are we trying to solve?
Problems to Solve
◉ Quickly insert 100M rows (while still allowing access to the data)
◉ Get “latest” data (as of some feed or date)
◉ Identify additions / updates (as of some feed or date)
◉ Identify islands and gaps (as of some feed or date)
◉ Test deleting a “bad feed” (just in case!)
◉ Understand the performance characteristics
Initial Thoughts
◉ Store it all in a database (modern DBs are big, fast, and cheap!)
◉ Use the DB for all of the hard work
◉ No need to do updates -- just keep the raw data
  Updates are bad (and row-by-row)
  Too much data to realistically use INSERT INTO ...
  Ditto for “upserts” (they’re row-by-row)
3. Inspiration!
Let’s think about the problem for a minute!
Inspiration! (Data from Prior Example)
Assign each feed a unique identifier and simply insert the data.

FEED_ID READING_DATETIME VALUE
1 2016-03-12 22:30:00 1.154
1 2016-03-12 22:45:00 3.934
1 2016-03-12 23:15:00 4.843
2 2016-03-12 22:45:00 2.493
2 2016-03-12 23:00:00 1.036
2 2016-03-12 23:30:00 2.431
2 2016-03-12 23:45:00 3.284

The FEED_ID can reference a rich set of metadata:
- Date
- Source
- Processing
- …
(For now just assume it’s sequential -- feed “2” occurred after feed “1”, etc.)
Inspiration!
Sort the data by READING_DATETIME and then FEED_ID:

FEED_ID READING_DATETIME VALUE
1 2016-03-12 22:30:00 1.154
1 2016-03-12 22:45:00 3.934
2 2016-03-12 22:45:00 2.493
2 2016-03-12 23:00:00 1.036
1 2016-03-12 23:15:00 4.843
2 2016-03-12 23:30:00 2.431
2 2016-03-12 23:45:00 3.284
Identify Latest Data

FEED_ID READING_DATETIME VALUE
1 2016-03-12 22:30:00 1.154
1 2016-03-12 22:45:00 3.934
2 2016-03-12 22:45:00 2.493
2 2016-03-12 23:00:00 1.036
1 2016-03-12 23:15:00 4.843
2 2016-03-12 23:30:00 2.431
2 2016-03-12 23:45:00 3.284

Notice that the “last” value in each READING_DATETIME group (ordered by FEED_ID) will always be the latest value.
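That “last row in each READING_DATETIME group wins” rule can be sketched in plain Python (an illustrative stand-in for the window-function SQL; the row data is from the example above):

```python
rows = [  # (feed_id, reading_datetime, value)
    (1, "2016-03-12 22:30:00", 1.154),
    (1, "2016-03-12 22:45:00", 3.934),
    (1, "2016-03-12 23:15:00", 4.843),
    (2, "2016-03-12 22:45:00", 2.493),
    (2, "2016-03-12 23:00:00", 1.036),
    (2, "2016-03-12 23:30:00", 2.431),
    (2, "2016-03-12 23:45:00", 3.284),
]

def latest(rows):
    # Iterating in feed_id order and overwriting means the highest
    # feed_id (the most recent feed) wins for each reading_datetime.
    result = {}
    for feed_id, dt, value in sorted(rows):
        result[dt] = value
    return dict(sorted(result.items()))

print(latest(rows))
```

Note how the Tuesday value 2.493 replaces Monday’s 3.934 at 22:45, matching the desired output.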
Latest Data SQL (BigQuery Standard SQL)

SELECT DISTINCT
       reading_datetime
     , duration
     , LAST_VALUE(value) OVER (PARTITION BY reading_datetime
                               ORDER BY feed_id
                               ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING)
       AS latest_value
  FROM raw_sensor_interval_data
Latest Data SQL (BigQuery Standard SQL) -- GROUP BY variant

SELECT reading_datetime
     , duration
     , ARRAY_AGG(value ORDER BY feed_id DESC LIMIT 1)[OFFSET(0)]
       AS latest_value
  FROM raw_sensor_interval_data
 GROUP BY 1, 2
Identify Updates

FEED_ID READING_DATETIME VALUE
1 2016-03-12 22:30:00 1.154
1 2016-03-12 22:45:00 3.934
2 2016-03-12 22:45:00 2.493
2 2016-03-12 23:00:00 1.036
1 2016-03-12 23:15:00 4.843
2 2016-03-12 23:30:00 2.431
2 2016-03-12 23:45:00 3.284

Multiple values in a READING_DATETIME group indicate updates.
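Counting rows per READING_DATETIME group flags the updates; a minimal Python sketch over the example data:

```python
from collections import Counter

rows = [  # (feed_id, reading_datetime, value)
    (1, "2016-03-12 22:30:00", 1.154),
    (1, "2016-03-12 22:45:00", 3.934),
    (1, "2016-03-12 23:15:00", 4.843),
    (2, "2016-03-12 22:45:00", 2.493),
    (2, "2016-03-12 23:00:00", 1.036),
    (2, "2016-03-12 23:30:00", 2.431),
    (2, "2016-03-12 23:45:00", 3.284),
]

# More than one row for a reading_datetime means a later feed updated it.
counts = Counter(dt for _feed_id, dt, _value in rows)
updated = [dt for dt, n in counts.items() if n > 1]
print(updated)
```

Only 22:45 appears in both feeds, so it is the lone update.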
Updates SQL (BigQuery Standard SQL)

SELECT DISTINCT
       reading_datetime
     , duration
     , LAST_VALUE(value) OVER (PARTITION BY reading_datetime
                               ORDER BY feed_id
                               ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING)
       AS latest_value
     , COUNT(*) OVER (PARTITION BY reading_datetime)
       AS update_count
  FROM raw_sensor_interval_data
Identify Additions

FEED_ID READING_DATETIME VALUE
1 2016-03-12 22:30:00 1.154
1 2016-03-12 22:45:00 3.934
2 2016-03-12 22:45:00 2.493
2 2016-03-12 23:00:00 1.036
1 2016-03-12 23:15:00 4.843
2 2016-03-12 23:30:00 2.431
2 2016-03-12 23:45:00 3.284

A FEED_ID equal to MAX(FEED_ID) when the update count is one indicates an addition.
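The same idea in Python: a reading is an addition when its only row comes from the latest feed (an illustrative sketch over the example data):

```python
from collections import defaultdict

rows = [  # (feed_id, reading_datetime, value)
    (1, "2016-03-12 22:30:00", 1.154),
    (1, "2016-03-12 22:45:00", 3.934),
    (1, "2016-03-12 23:15:00", 4.843),
    (2, "2016-03-12 22:45:00", 2.493),
    (2, "2016-03-12 23:00:00", 1.036),
    (2, "2016-03-12 23:30:00", 2.431),
    (2, "2016-03-12 23:45:00", 3.284),
]

latest_feed = max(feed_id for feed_id, _dt, _value in rows)

feeds_by_dt = defaultdict(list)
for feed_id, dt, _value in rows:
    feeds_by_dt[dt].append(feed_id)

# An addition: exactly one row, and it came from the latest feed.
additions = sorted(dt for dt, feeds in feeds_by_dt.items()
                   if feeds == [latest_feed])
print(additions)
```

This recovers the three “Added Value” rows from the desired Tuesday output.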
Additions SQL (BigQuery Standard SQL)

SELECT reading_datetime
     , duration
     , MAX(feed_id) OVER (PARTITION BY sensor_id)
       AS latest_feed_id_for_sensor
     , MAX(feed_id) OVER (PARTITION BY sensor_id, reading_datetime)
       AS latest_feed_id_for_reading
     , LAST_VALUE(value) OVER (PARTITION BY reading_datetime
                               ORDER BY feed_id
                               ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING)
       AS latest_value
  FROM raw_sensor_interval_data
Problems to Solve
◉ Insert 100M rows (while still allowing access to the data)
◉ Get “latest” data (as of some feed or date)
◉ Identify additions / updates (as of some feed or date)
◉ Identify islands and gaps (as of some feed or date)
◉ Test deleting a “bad feed” (just in case!)
◉ Understand the performance characteristics
4. Islands and Gaps
This is going to blow your mind!!!
Islands and Gaps (Different Data to Illustrate Example)
(15min data ordered by READING_DATETIME)

READING_DATETIME
2016-03-12 22:30:00 (island 1)
2016-03-12 22:45:00 (island 1)
2016-03-12 23:00:00 (island 1)
2016-03-12 23:15:00 (island 1)
(gap)
2016-03-13 00:15:00 (island 2)
2016-03-13 00:30:00 (island 2)
2016-03-13 00:45:00 (island 2)
(gap)
2016-03-13 01:15:00 (island 3)
Islands
(15min data ordered by READING_DATETIME)

READING_DATETIME
2016-03-12 22:30:00
2016-03-12 22:45:00
2016-03-12 23:00:00
2016-03-12 23:15:00
2016-03-13 00:15:00
2016-03-13 00:30:00
2016-03-13 00:45:00
2016-03-13 01:15:00

Island 1: Start 2016-03-12 22:30:00, End 2016-03-12 23:15:00
Island 2: Start 2016-03-13 00:15:00, End 2016-03-13 00:45:00
Island 3: Start 2016-03-13 01:15:00, End 2016-03-13 01:15:00
Simpler Example (Same Islands and Gaps)
(15min data ordered by READING_DATETIME)

VALUE
3 (island 1)
4 (island 1)
5 (island 1)
6 (island 1)
(gap)
11 (island 2)
12 (island 2)
13 (island 2)
(gap)
25 (island 3)
Islands

VALUE
3
4
5
6
11
12
13
25

How can these groups be identified?
(Once we have the groups, the answer is just MIN() and MAX().)
Arithmetic Progression [3]
Compute each value relative to the first value:

VALUE
a1 = 3
a2 = 4 = a1 + 1
a3 = 5 = a1 + 2
a4 = 6 = a1 + 3
...
an = a1 + (n - 1)
Arithmetic Progression
an = a1 + (n - 1) * d
Where
  a1 = first value of sequence
  n = rank (position) of value
  d = distance between values

VALUE
a1 = 3
a2 = 4 = a1 + 1
a3 = 5 = a1 + 2
a4 = 6 = a1 + 3
...
an = a1 + (n - 1)
(with d = 1 here)
Arithmetic Progression
Flip it around:
an = a1 + (n - 1) * d
a1 = an - (n - 1) * d
One step further:
a1 - d = an - n * d
const = an - n * d

VALUE n CONST
3 1 2
4 2 2
5 3 2
6 4 2
(d = 1 in this case)
What If There Are Gaps?

VALUE n CONST
3 1 2
4 2 2
5 3 2
6 4 2
11 5 6
12 6 6
13 7 6
25 8 17

Notice that the value of const doesn’t matter. What does matter is that it’s constant within a sequence of values and different for different sequences!
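The trick is easy to verify in Python: subtracting rank × d from each value yields a per-island constant, and grouping consecutive rows on that constant recovers the islands (values from the table above):

```python
from itertools import groupby

values = [3, 4, 5, 6, 11, 12, 13, 25]
d = 1  # distance between consecutive values within an island

def const(ranked):
    n, value = ranked
    return value - n * d  # an - n*d: constant within each run

# enumerate(..., start=1) supplies the rank n for each value.
islands = [[value for _n, value in group]
           for _key, group in groupby(enumerate(values, start=1), key=const)]
print(islands)
```

The constants come out as 2, 2, 2, 2, 6, 6, 6, 17 -- exactly the CONST column -- so the grouping splits the list into the three islands.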
Works for Dates too!
(15min data ordered by READING_DATETIME)

Where
  RANK = n
  DURATION = d
  OFFSET = n * d = RANK * 15min (the duration)

READING_DATETIME RANK OFFSET
2016-03-12 22:30:00 1 15min
2016-03-12 22:45:00 2 30min
2016-03-12 23:00:00 3 45min
2016-03-12 23:15:00 4 60min
2016-03-13 00:15:00 5 75min
2016-03-13 00:30:00 6 90min
2016-03-13 00:45:00 7 105min
2016-03-13 01:15:00 8 120min
Islands
(15min data ordered by READING_DATETIME)
READING_DATETIME RANK OFFSET CONST
2016-03-12 22:30:00 1 15min 2016-03-12 22:15:00
2016-03-12 22:45:00 2 30min 2016-03-12 22:15:00
2016-03-12 23:00:00 3 45min 2016-03-12 22:15:00
2016-03-12 23:15:00 4 60min 2016-03-12 22:15:00
2016-03-13 00:15:00 5 75min 2016-03-12 23:00:00
2016-03-13 00:30:00 6 90min 2016-03-12 23:00:00
2016-03-13 00:45:00 7 105min 2016-03-12 23:00:00
2016-03-13 01:15:00 8 120min 2016-03-12 23:15:00
READING_DATETIME - OFFSET = CONST
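The datetime version, sketched in Python (same example readings; the `const` helper plays the role of READING_DATETIME - OFFSET):

```python
from datetime import datetime, timedelta
from itertools import groupby

duration = timedelta(minutes=15)
readings = [
    datetime(2016, 3, 12, 22, 30), datetime(2016, 3, 12, 22, 45),
    datetime(2016, 3, 12, 23, 0),  datetime(2016, 3, 12, 23, 15),
    datetime(2016, 3, 13, 0, 15),  datetime(2016, 3, 13, 0, 30),
    datetime(2016, 3, 13, 0, 45),  datetime(2016, 3, 13, 1, 15),
]

def const(ranked):
    rank, dt = ranked
    return dt - rank * duration  # READING_DATETIME - OFFSET

islands = []
for _key, group in groupby(enumerate(sorted(readings), start=1), key=const):
    dts = [dt for _rank, dt in group]
    islands.append((dts[0], dts[-1]))  # island start and end

for start, end in islands:
    print(start, "->", end)
```

This reproduces the three Start/End pairs from the Islands slide.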
Islands SQL (BigQuery Standard SQL)

WITH
sequence_offsets AS (
  SELECT reading_datetime
       , (RANK() OVER (ORDER BY reading_datetime) *
          duration) AS offset_seconds
    FROM sensor_interval_data
)
, islands AS (
  SELECT MIN(reading_datetime) AS start_date
       , MAX(reading_datetime) AS end_date
    FROM sequence_offsets
   GROUP BY DATETIME_SUB(reading_datetime,
                         INTERVAL offset_seconds SECOND)
)
SELECT * FROM islands
Islands SQL (BigQuery Standard SQL)

WITH
sequence_offsets AS (
  SELECT reading_datetime
       , (RANK() OVER (ORDER BY reading_datetime) *
          duration) AS offset_seconds
       , value
    FROM sensor_interval_data
)
, islands AS (
  SELECT MIN(reading_datetime) AS start_date
       , MAX(reading_datetime) AS end_date
       , SUM(value) AS total_value
    FROM sequence_offsets
   GROUP BY DATETIME_SUB(reading_datetime,
                         INTERVAL offset_seconds SECOND)
)
SELECT * FROM islands
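A Python sketch of the same grouping with per-island totals, using the merged example data (which forms a single island, so the total matches the “15.241 kWh” from the desired output):

```python
from datetime import datetime, timedelta
from itertools import groupby

duration = timedelta(minutes=15)
rows = [  # (reading_datetime, value) -- the merged data from the earlier example
    (datetime(2016, 3, 12, 22, 30), 1.154),
    (datetime(2016, 3, 12, 22, 45), 2.493),
    (datetime(2016, 3, 12, 23, 0),  1.036),
    (datetime(2016, 3, 12, 23, 15), 4.843),
    (datetime(2016, 3, 12, 23, 30), 2.431),
    (datetime(2016, 3, 12, 23, 45), 3.284),
]

def const(ranked):
    rank, (dt, _value) = ranked
    return dt - rank * duration  # READING_DATETIME - OFFSET

islands = []
for _key, group in groupby(enumerate(sorted(rows), start=1), key=const):
    grp = [(dt, value) for _rank, (dt, value) in group]
    # MIN(), MAX(), and SUM() over each island:
    islands.append((grp[0][0], grp[-1][0], sum(v for _dt, v in grp)))

for start, end, total in islands:
    print(start, "->", end, "total", round(total, 3))
```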
Problems to Solve
◉ Insert 100M rows (while still allowing access to the data)
◉ Get “latest” data (as of some feed or date)
◉ Identify additions / updates (as of some feed or date)
◉ Identify islands and gaps (as of some feed or date)
◉ Test deleting a “bad feed” (just in case!)
◉ Understand the performance characteristics
Thanks!
Any questions?
You can find me at:
◉ rob@limalimacharlie.com
Appendix
References
1. http://www.sqltopia.com/?page_id=83
2. http://sqlmag.com/sql-server-2012/solving-gaps-and-islands-enhanced-window-functions
3. https://en.wikipedia.org/wiki/Arithmetic_progression
Attribution
1. http://www.slidescarnival.com/
2. http://www.clipartbest.com/
3. https://thenounproject.com/term/mind-the-gap/1965/

