Processing Planetary Sized Datasets
Tim Park (@timpark)
# UserId, ActivityId, Latitude, Longitude, Timestamp
101528757,285751033,51.517227,-0.101553,1429087808
101528757,285751033,51.517296,-0.101817,1429087812
101528757,285751033,51.517353,-0.102064,1429087816
101528757,285751033,51.517445,-0.102144,1429087820
101528757,285751033,51.517475,-0.102259,1429087824
101528757,285751033,51.51743,-0.102343,1429087828
101528757,285751033,51.517338,-0.102309,1429087837
101528757,285751033,51.517307,-0.102303,1429087857
101528757,285751033,51.517296,-0.102346,1429087864
101528757,285751033,51.517284,-0.102388,1429087877
101528757,285751033,51.51729,-0.102321,1429087959
101528757,285751033,51.51729,-0.102248,1429087961
101528757,285751033,51.517338,-0.102088,1429087965
101528757,285751033,51.51737,-0.10196,1429087969
Total Size: 560GB
Activities: 2.5 million
GPS Locations: 3.5 billion
• Many activities per user.
• Want to be able to pull a time range of user locations for activity display.
User Id Timestamp Latitude Longitude
…
10152875766888406 1445374423000 36.966819 -122.012298
10152875766888406 1445377625000 36.966845 -122.012248
10152875766888406 1445377627000 36.966877 -122.012228
10152875766888406 1445377629000 36.966913 -122.012236
10152875766888406 1445377630000 36.966946 -122.012236
10152875766888406 1445377631000 36.966984 -122.012263
10152875766888406 1445379512000 36.967027 -122.012281
…
This is a challenge with a large dataset:
• A traditional relational database (e.g., Postgres) typically requires hand sharding to scale to petabytes of data.
• Highly indexed non-relational solutions (e.g., MongoDB) can be very expensive.
• Lightly indexed solutions are a good fit because we really only have one query we need to execute against the data.
PartitionKey (userId)  RowKey (timestamp)  Latitude   Longitude
10152875766888406      1445377623000       36.966819  -122.012298
10152875766888406      1445377625000       36.966845  -122.012248
10152875766888406      1445377627000       36.966877  -122.012228
10152875766888406      1445377629000       36.966913  -122.012236
10152875766888406      1445377630000       36.966946  -122.012236
…
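Pulling a time range of one user's locations is then a RowKey range scan within a single PartitionKey. A minimal Python sketch with the azure-data-tables SDK; the connection string and table name are hypothetical, since the slides don't include service configuration:

from azure.data.tables import TableClient

CONNECTION_STRING = '...'  # hypothetical; not part of the slides
table = TableClient.from_connection_string(CONNECTION_STRING,
                                           table_name='locations')

# One user's locations for a time range: a RowKey range scan within a
# single PartitionKey. Lexicographic comparison works here because the
# millisecond timestamps are all the same width.
query = ("PartitionKey eq '10152875766888406' "
         "and RowKey ge '1445377623000' "
         "and RowKey le '1445377631000'")
for entity in table.query_entities(query):
    print(entity['RowKey'], entity['Latitude'], entity['Longitude'])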
• Want to query a set of activities in a bounding box.
• Also want to filter activities based on distance and duration.
activity id  start (sec)  finish (sec)  distance (m)  duration (m)  bbox (geometry)
101528       1445377625   1445383025    50023         6222          [-104.990, 39.7392...
101643       1445362577   1445373616    28778         2498          [-122.01228, 36.96…
101843       1445377627   1445382432    4629          701           [0.1278, 51.5074…
101901       1445362577   1445374713    99691         14232         [139.6917, 35.699...
102102       1445374713   1445374713    25259         6657          [1.3521, 103.8129…
Location Data (Azure Table Storage):

user id            timestamp   latitude   longitude
10152875766888406  1445377623  36.966819  -122.012298
10152875766888406  1445377625  36.966845  -122.012248
…
10152875766888406  1445383025  36.966913  -122.012236
10152875766888406  1445383030  36.966946  -122.012236

Activity Data (Postgres + PostGIS):

activity id  start       finish      …  bbox
101528       1445362577  1445373616  …  [-104.990, 39.7392...
101643       1445377625  1445383025  …  [-122.01228, 36.96…
101843       1445377627  1445382432  …  [0.1278, 51.5074…
101901       1445362577  1445374713  …  [139.6917, 35.699...
102102       1445374713  1445374713  …  [1.3521, 103.8129…
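On the Postgres side, the bounding-box query might look like the sketch below, using psycopg2 against a hypothetical activities table with the columns shown above; the slides show the columns but not the actual SQL or schema:

import psycopg2

conn = psycopg2.connect('dbname=trailmap')  # hypothetical database
cur = conn.cursor()

# Activities whose bounding box intersects the current map view,
# filtered by distance and duration. ST_MakeEnvelope builds the view
# rectangle (lon/lat order, SRID 4326); && is PostGIS's bounding-box
# intersection operator, which can use a GiST index on bbox.
cur.execute("""
    SELECT activity_id, start, finish, distance, duration
    FROM activities
    WHERE bbox && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
      AND distance < %s
      AND duration < %s
""", (-122.1, 36.9, -121.9, 37.0, 10000, 3600))
for row in cur.fetchall():
    print(row)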
• Total number of location samples in a geographical area.
• Whole dataset operation.
• Based on Apache Spark.
• Offered as a service in Azure as HDInsight Spark.
• Can think of it as "Hadoop the Next Generation":
  • Better performance (10-100x)
  • Cleaner programming model
• Divides the world up into tiles.
• Each tile has four children at the next higher zoom level.
• Maps 2-dimensional space to 1 dimension.
For each location, map to tiles at every zoom level:
(36.9741, -122.0308) → [
  (10_398_164, 1), (11_797_329, 1),
  (12_1594_659, 1), (13_3189_1319, 1),
  (14_6378_2638, 1), (15_12757_5276, 1),
  (16_25514_10552, 1), (17_51028_21105, 1),
  (18_102057_42211, 1)
]
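The tile IDs above follow the standard XYZ/slippy-map convention (here written as zoom_row_column). A minimal sketch of the underlying math; the geotile library's exact rounding conventions may differ slightly at tile edges:

import math

def tile_id(lat, lon, zoom):
    # Standard slippy-map tile math: lat/lon -> (zoom, row, column).
    n = 2 ** zoom
    col = int((lon + 180.0) / 360.0 * n)
    row = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi)
              / 2.0 * n)
    return '%d_%d_%d' % (zoom, row, col)

print(tile_id(36.9741, -122.0308, 10))  # '10_398_164', as in the example above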
def tile_id_mapper(location):
    # Map one location to a (tileId, 1) pair for every zoom level of
    # interest, using the geotile library's tile math.
    tileMappings = []
    tileIds = Tile.tile_ids_for_zoom_levels(
        location['latitude'],
        location['longitude'],
        MIN_ZOOM_LEVEL,
        MAX_ZOOM_LEVEL
    )
    for tileId in tileIds:
        tileMappings.append((tileId, 1))
    return tileMappings
Reduce all these mappings with the same key into an aggregate value:
(10_398_164, 151) ← [
  (10_398_164, 15), (10_398_164, 28),
  (10_398_164, 29), (10_398_164, 17),
  (10_398_164, 31), (10_398_164, 2),
  (10_398_164, 16), (10_398_164, 2),
  (10_398_164, 11)
]
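Because the reduce is just a per-key sum, the map/reduce logic can be sanity-checked locally on a handful of locations before anything is submitted to a cluster. This assumes tile_id_mapper and the zoom-level constants from above, with the geotile package installed:

from collections import Counter

locations = [
    {'latitude': 36.9741, 'longitude': -122.0308},
    {'latitude': 36.9750, 'longitude': -122.0300},
]

counts = Counter()
for location in locations:
    for tile_key, one in tile_id_mapper(location):
        counts[tile_key] += one   # same as reduceByKey(lambda a, b: a + b)

print(counts.most_common(5))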
Building the heatmap then boils down to this in Spark:

# Read the raw location data from blob storage and parse it
lines = sc.textFile('wasb://locations@loc.blob.core.windows.net/')
locations = lines.flatMap(json_loader)

# Map each location to (tileId, 1) pairs, then sum the counts per tile
heatmap = (locations
    .flatMap(tile_id_mapper)
    .reduceByKey(lambda agg1, agg2: agg1 + agg2))

heatmap.saveAsTextFile('wasb://heatmap@loc.blob.core.windows.net/')
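The json_loader function isn't shown on the slides; a plausible, hypothetical definition is a per-line JSON parser that returns a list so that flatMap silently drops malformed lines:

import json

def json_loader(line):
    # Hypothetical definition; not taken from the slides.
    try:
        return [json.loads(line)]
    except ValueError:
        return []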
Example of the precomputed resultsets: a lower-zoom container tile and the higher-zoom tile IDs it covers:
6_25_31
7_50_62, 7_50_63, 7_50_64
8_100_124, 8_100_125, 8_100_126
9_201_245, 9_201_247, 9_201_248, 9_201_249, 9_201_250
User Id Activity Id Timestamp Latitude Longitude
…
10152875766888406 57169639 1445377623000 36.966819 -122.012298
10152875766888406 57169639 1445377625000 36.966845 -122.012248
10152875766888406 57169639 1445377627000 36.966877 -122.012228
10152875766888406 57169639 1445377629000 36.966913 -122.012236
10152875766888406 57169639 1445377630000 36.966946 -122.012236
10152875766888406 57169639 1445377631000 36.966984 -122.012263
10152875766888406 57169639 1445377632000 36.967027 -122.012281
…
Raw Data → Elevation Enriched
let azure = require('azure-storage'),
    elevationService = require('../services/elevation');

// Azure Function triggered whenever a new location blob lands in storage.
module.exports = function(context, locationBlob) {
    let locations = locationBlob.split('\n');
    elevationService.enrichLocations(locations, (err, enrichedLocations) => {
        if (err) return context.done(err);
        // ... save enrichedLocations to blob ...
        context.done();
    });
};
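The elevationService module itself isn't shown. Per the editor's notes, elevations are sourced from Bing; a hedged Python sketch of a batch lookup, with the endpoint and response layout recalled from the Bing Elevations REST API rather than taken from the slides:

import requests

BING_MAPS_KEY = '...'  # assumption: a Bing Maps API key

def enrich_locations(locations):
    # Batch-resolve elevations for a list of location dicts. Real code
    # would also need to respect the API's per-request point limits.
    points = ','.join('%f,%f' % (l['latitude'], l['longitude'])
                      for l in locations)
    resp = requests.get(
        'http://dev.virtualearth.net/REST/v1/Elevation/List',
        params={'points': points, 'key': BING_MAPS_KEY})
    elevations = resp.json()['resourceSets'][0]['resources'][0]['elevations']
    for location, elevation in zip(locations, elevations):
        location['elevation'] = elevation
    return locations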
JSON: 560 GB → Avro: 224 GB (60% smaller)
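As an illustration of the Avro point, here is a sketch of serializing the location records with the fastavro Python library; the schema is illustrative and not from the slides:

from fastavro import parse_schema, writer

schema = parse_schema({
    'name': 'Location', 'type': 'record',
    'fields': [
        {'name': 'userId',     'type': 'long'},
        {'name': 'activityId', 'type': 'long'},
        {'name': 'timestamp',  'type': 'long'},
        {'name': 'latitude',   'type': 'double'},
        {'name': 'longitude',  'type': 'double'},
    ],
})

# One sample record, taken from the tables above.
records = [{'userId': 10152875766888406, 'activityId': 57169639,
            'timestamp': 1445377623000, 'latitude': 36.966819,
            'longitude': -122.012298}]

with open('locations.avro', 'wb') as out:
    writer(out, schema, records)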
• geotile: http://github.com/timfpark/geotile (XYZ tile math in C#, JavaScript, and Python)
• heatmap: http://github.com/timfpark/heatmap (Spark code for building heatmaps)

Tim Park
@timpark
Editor's Notes
  1. I'm Tim Park, and today I'm going to talk about processing geospatial data at scale.
  2. We are living in a world where everything and everybody has location data associated with it, whether from the vehicle they are in, the app on the phone they are using, or the package they are sending. Increasingly, this geospatial data is woven into the applications we are building as developers. My talk today is about processing that geospatial data in Azure.
  3. My team at Microsoft has done a number of projects with customers and partners involving geospatial data. We've used geotagged social data to help the UN monitor humanitarian zones.
  4. We've worked with Guide Dogs for the Blind to build a device and companion app that helps blind people discover where they are and what's around them, and helps navigate them to locations. Sara Spalding is going to talk more about this overall project here at DecodedConf, and Erik Schlegel is going to talk about how Guide Dogs has used Open Street Maps from a technical perspective as well.
  5. And we've worked with the German physical display advertising firm Stroeer to process geotagged data to build "emotion maps" of the overall sentiment in different parts of the country, and of how different advertising interventions have affected these sentiments on both a micro and macro scale. In the course of working on all of these projects, we've identified a number of common patterns. My talk today is about those patterns.
  6. To make this discussion simpler than using one of our project datasets, I'm going to frame the talk around a dataset near and dear to my heart. I love to explore the outdoors, so I am going to frame my talk around how we could take a large crowdsourced set of GPS traces from outdoor activities and turn it into a trail map of the whole world, and along the way describe all of the patterns we've discovered in the projects we have done.
  7. This scenario also allows me to open source the dataset … and the code that I used so that you can try it at home.
  8. So what does the shape of this dataset look like? It's basically a whole lot of CSV files with location information that includes: user id, activity id (which identifies all of the data that is part of the same run, walk, hike, bike ride, etc.), timestamp, latitude, and longitude. There are many, many users, who each have many activities, which all have many timestamped locations.
  9. The dataset we are going to be working with is fairly large. It consists of over 500GB of data, which is larger than any one computing node, and therefore we have to use larger-scale data techniques to process it. In this 500GB, we have over 2.5 million different activities, with 3.5 billion GPS locations associated with those activities.
  10. So what does the shape of this dataset look like? It's basically a whole lot of CSV files with location information that includes: a user id, activity id, timestamp, latitude, and longitude. There are many, many users, who each have many activities, which all have many timestamped locations.
  11. Now that we have an idea of the dataset, let's look at the final application we are trying to build, so we have better context on what we need in terms of data processing.
  12. So we have this giant pile of incoming data. How do we process all of it at scale?
  13. Let's start by talking about how we store the GPS locations. As we discussed previously, an activity has a timestamp-ordered set of locations, and to display an activity, we need to pull all of the associated locations.
  14. So what we need from the storage system we use for them is the ability to pull a range of timestamps With this, we can pull all of the locations for a particular activity.
  15. This is pretty standard stuff for a database but becomes a challenge with a large dataset (reasons above)
  16. Fortunately, Azure has a very good solution for this. Azure Table Storage sits just above blob storage. First off, it is very inexpensive, at 2 cents per GB per month. Unlike blob storage, you can access ranges and individual rows in the data. You can only query on a set of RowKeys within the same PartitionKey, but we only need to query on timestamp. It also scales with constant performance up to 500TB per storage account, which is well beyond our dataset, and it satisfies our need to query a range of user locations by timestamp.
  17. Looking at Activity Storage itself now, one of the key queries we want to do is around querying the activities in a bounding box. We also, in the future, want to be able to filter activities on a distance or duration.
  18. What this essentially means is that, unlike the location data, we need the activity data to be highly indexed. We want to be able to do queries like "give me all of the activities under 10km in length near Dublin" or "give me all of the activities over 1 hour in duration near Dublin", and this means we need the highlighted columns of this data to be indexed, which is not a great fit for the Azure Table Storage we saw in the first pattern. The good news is that there is three orders of magnitude less data as well, and this makes it a great fit for a relational database like MySQL, Postgres, or Azure SQL: set up schema'd tables and make rich queries against them.
  19. Which brings us to the second pattern for dealing with high-scale data like this: "polyglot persistence", which means using multiple database types to solve a particular application's needs, each chosen because it excels at a portion of the problem we are trying to solve. As we saw before, I am using Azure Table Storage for the location data, and for the activity data I am using PostgreSQL + PostGIS. The way this works is that we query for activities using Postgres and then query for the location data, when needed, from Table Storage. My colleague David Makogon is going to do a much longer talk about this topic, and will describe the strengths of each database type and the scenarios in which each makes sense. OK, so that is how we are storing and querying activities: we load locations into Table Storage and activities into Postgres at creation and can then query them.
  20. When we back out to the progressive heatmap of activities we saw, however, we realize we need a whole different set of tools. Our heatmap is generated by summing up the number of location samples in a particular geographical area; in this case, we are using small squares as the boundaries for our summaries. This means it operates over the whole dataset, and given the size of the dataset, we need to open our big data toolbox to realistically accomplish this.
  21. I'm going to use HDInsight Spark to accomplish this. For those of you who have used Hadoop, it operates on data in a similar paradigm but offers much better performance and a much nicer programming model than the original Hadoop engine. HDInsight Spark is Azure's hosted version of Apache Spark. It makes it easy to spin up a cluster of machines to process the data, which allows you to ignore the operations work of running a Spark cluster and focus on the actual problem we are trying to solve.
  22. Before we dive into the Spark code, let's discuss how we map a particular location point to a geographic summary. One of the common patterns we've seen is using XYZ tiles as summarization buckets. This is not something we've invented: Google, Apple, and Open Street Maps use the same concept for addressing the tiles in their maps. It is a fairly simple concept. The top-level world is divided into 4 tiles, and for each zoom level below that, you take the parent tile and divide it into 4 tiles. An individual tile is then addressed by its zoom level and its row and column within that zoom level.
  23. In order to compute our heatmap, we use a pretty standard-issue map/reduce algorithm. For every location in the dataset, we generate a tile id key/value pair for every zoom level that we want results for. In this case, we are generating tile key/value pairs for zoom levels 10 to 18, because we know that is the only set of tiles the user interface will end up using.
  24. In Python, this is what this mapper looks like. We first compute the tileIds for all of the zoom levels that we want to collect results for, and then use that to build tuples for each tileId.
  25. We then reduce all of these mappings with the same key down into an aggregate heatmap value. In Spark, the first element in a tuple is considered the key and the second element is considered the value, which is to say we take all of the locations with the same tile ids from the previous step and count them.
  26. With this in mind, the actual implementation in Spark is fairly straightforward. We point it at the dataset in blob storage, parse it as JSON, and then use the tile_id_mapper function we defined earlier to map each location to the appropriate zoom level result. We then reduce all of these individual results by key to get a final total for each tile. We implement the reducer as an anonymous function, called a lambda in Python, that essentially sums the intermediate aggregates for tiles with the same id.
  27. From a programming standpoint, Spark makes this look really easy, but at the physical level Spark is doing a lot of work for us. Remember, we are working on a dataset that doesn't fit on any one node, so there will potentially be billions of these summaries floating around across the cluster. During the map stage, any tile id could be generated by any of the mappers in the Spark cluster, since locations are uniformly distributed in the dataset. This means there needs to be a shuffle step in which these tile id mappings are routed to the same reducer so that a correct aggregate can be calculated. The good news is that Spark handles all of this under the covers for us.
  28. * We don't usually have static data but data that is continually arriving. * In this case, we have an app using an API. > Spark can't run against a database. > Spark works better with a small number of large files. > Ideally we'd like to map these incoming small activities to a set of large aggregate files. * Event Hub: a giant cloud buffer in front of backend infrastructure. * Data automatically expires in the future - this is a feature, helpful in 1) bursty situations and 2) when you don't need the data until later. > We then use Stream Analytics as a receiver on this Event Hub. > Stream Analytics allows you to transform or summarize streams using a SQL-like language, but we are just using it to land the data. * Using these pieces of infrastructure in conjunction with each other to enable incremental ingestion of data is a key pattern for high-scale data ingestion.
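As a sketch of the producing side of this pattern, sending an activity into Event Hubs with the azure-eventhub Python SDK might look like this; the connection details and event hub name are hypothetical:

from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STRING = '...'  # hypothetical; not part of the talk

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STRING, eventhub_name='locations')

# Each completed activity is sent as one event; Event Hubs buffers them
# until the Stream Analytics receiver lands them into large blobs.
batch = producer.create_batch()
batch.add(EventData('{"userId": 101528757, "activityId": 285751033}'))
producer.send_batch(batch)
producer.close()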
  29. Incremental ingestion leads us to our next pattern, which is processing this incremental data in slices. We process these slices individually, in a manner analogous to how we processed the whole dataset in the previous slides, but since we are only processing an individual slice, we do not need nearly as large a cluster. We then load the previous complete result, fold in the newly computed partial heatmap, and write out the new heatmap. Although this adds a second step, we are operating on a much smaller set of data overall, and it is much more efficient.
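The fold step is just a per-tile addition of the slice heatmap into the running aggregate; a minimal sketch, with the blob storage reads and writes omitted:

from collections import Counter

def fold_slice(previous, partial):
    # Per-tile counts simply add; loading the previous result and
    # writing the merged one back to blob storage are omitted here.
    merged = Counter(previous)
    merged.update(partial)
    return dict(merged)

# fold_slice({'10_398_164': 151}, {'10_398_164': 7})
#   -> {'10_398_164': 158}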
  30. Many of you might have suspected it when you saw this the first time, but there is a fairly large set of data sent with each of these heatmap queries. If we made a traditional database query against a datastore for each of these XYZ summaries, you'd likely end up with a solution that didn't scale well or scaled pretty expensively.
  31. Instead of querying for the heatmaps, we precompute the heatmap elements that should be displayed within a particular view. We use a lower-zoom-level block, shown here, as a container for the higher-zoom-level elements, and precompute what should be displayed. This not only turns a querying problem into a "sending text" problem for our web frontends, but also allows us to cache each of these resultsets in the browser. We store all of these resultsets in blob storage, which at 2 cents per GB per month is very inexpensive compared to the comparable number of VMs you'd need to do this.
  32. So architecturally, this looks like this. We use the slice architecture that I talked about in a previous slide, using the deltas to determine which heatmaps require updates, to build the new heatmaps for each of these resultsets, and to write them out to blob storage.
  33. One other thing that I glossed over when we described displaying activities: as you remember, the application displays an elevation graph with each activity.
  34. But, also remember, our input data set does not have elevation as part of it.
  35. This is another common pattern. It's almost always the case that you need to transform your raw data by removing dirty data points and/or enriching it with related data. In this case, we want to transform the raw data by adding elevation data that we are sourcing from Bing, output these enriched slices into their own blobs, and then use this enriched data for our processing. We want to run this enrichment on the blobs as they are added to blob storage.
  36. The great news is that Azure has a nice new feature called Functions that does just this. Azure Functions allows you to set up a trigger on a wide variety of events, including queued events, HTTP requests, Service Bus messages, timers, etc. It builds on top of Azure App Service's WebJobs functionality to provide a super easy interface: when one of these triggers fires, a function that you have written is executed. In this case, I wrote my function in JavaScript, but C# and a host of other languages are also supported. I set up this Azure Function to trigger on a new blob being added to a storage account container, so each time a blob is created by the previous incremental ingestion pattern, it is passed into this function to be enriched with elevation data. Note that the whole blob is passed into the function, so you need to make sure that the blobs for your scenario will fit in memory. Once the locations in the blob have been enriched, we push them out to another blob, which we use downstream.
  37. JSON is a fantastic format and very developer friendly. I used it in this project for clarity, since it results in output that is human readable. However, if you do a project like this for real, choose a data serialization format like Avro instead. By establishing a schema and using a binary serialization format, you can achieve a 60% size reduction. This improves the performance of deserializing and serializing the data from and to blob storage, and it linearly cuts your data storage costs.
  38. OK, that's what I have for you today. I've open sourced a couple of things as part of this presentation that you should have a look at if you are interested in more details: geotile and heatmap, which include a sample dataset that you can work against.
  39. The other thing I will share out is these slides via Twitter, so follow me @timpark if you'd like to get a copy. And with that, thank you for coming out today, and I'd be happy to take any questions.