A Century Of Weather Data - Midwest.io

Weather of the Century
J. Randall Hunt
@jrhunt 
Developer Advocate, MongoDB
@midwestio

What was the weather the day you were born?

Agenda
• Data and Schema
• Application
• Operational Concerns

What Is It And Why Use It?
• Document Data Store
• Geo Indexing
• "Simple" Sharded deployments

Terminology
RDBMS MongoDB (Document Store)
Database Database
Table Collection
Row(s) (bson) Document
Index Index
Join Nope.

A Weather Datum
• A station ID
• A timestamp
• Lat, Long, Elevation
• A LOT OF WEATHER DATA (135 page manual for
parsing)
• Lots of optional sections

How much of it do we have?
• 2.5 billion distinct data points
• 4 Terabytes
• Number of documents is huge, overall data size is
reasonable
• We'll call this: "moderately big" data

Who Else Is This Relevant For?
• Particle Physics
• Stocks, high frequency trading
• Insurance
• People with lots of small pieces data

Things We Care About
• Performance
‣ Ingestion
‣ App Speciﬁc
‣ Ad-hoc
• Cost
• Flexibility

Performance Breakdown
• Bulk Loading
• Latency and throughput for queries
• point in space-time
• one station, one year
• the whole world at one time
• Aggregation and Exploration
• warmest and coldest day ever, average temperature, etc.

0303725053947282013060322517+40779-073969FM-15+0048KNYC
V0309999C00005030485MN0080475N5+02115+02005100975
ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999
GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
{
"st" : "u725053",
"ts" : ISODate("2013-06-03T22:51:00Z"),
"airTemperature" : {
"value" : 21.1,
"quality" : "5"
},
"atmosphericPressure" : {
"value" : 1009.7,
"quality" : "5"
}
}
Station ID:
NYC Central Park

Schema
{!
st: "u724463",!
ts: ISODate("1991-01-01T00:00:00Z"),!
position: {!
type: "Point",!
coordinates: [!
-94.6,!
39.117!
]!
},!
elevation: 231,!
… other fields …!
}!
station ID and source

Stations
• USAF and WBAN IDs exist for most of North America.
Preﬁx with "u" and "w" then the ID
• For ships we use the preﬁx "x" and their lat and lng to
create a station id.

Schema
{!
st: "u724463",!
ts: ISODate("1991-01-01T00:00:00Z"),!
position: {!
type: "Point",!
coordinates: [!
-94.6,!
39.117!
]!
},!
elevation: 231,!
… other fields …!
}!
GeoJSON

GeoJSON
• A rich geographical data format
• Lines, MultiLines, Polygons, Geometries
• Able to perform queries on complex structures

Schema
!
airTemperature: {!
value: -4.9,!
quality: "1"!
}!

Choice: Embedding?
Problem: ~100 "weather codes" and optional sections
• Store them inline
• Store them in another collection

Choice: Embedding?
• Embedding keeps your logic in the schema instead of
the application.
• Depends on cardinality, don't embed "squillions"
• Don't embed objects that have to change frequently.

Choice: Unique Identiﬁer
{_id: ObjectId("53a33f823ed4ac438f8c63b7")}!
• Simple, guaranteed unique identiﬁer
• 12 bytes

Choice: Unique Identiﬁer
!
{_id: {!
'st': 'w12345',!
'ts': ISODate("2014-06-19T19:53:58.680Z")!
}!
}
• Not great if there are duplicates
• Slightly More complex queries
• ~12 bytes saved per document

Choice: Field Shortening
• Indexes are still the same size
• Decreases readability
• In our example you can save ~40% space with
minimum ﬁeld lengths
• Probably better to go for semi-readable with ~20%
space savings

{!
"_id": ObjectId("5298c40f3004e2fe02922e29"),!
"st": "w13731",!
"ts": ISODate("1949-01-01T05:00:00Z"),!
"airTemperature": {!
"quality": "5",!
"value": 1.1!
},!
"skyCondition": {!
"cavok": "N",!
"ceilingHeight": {!
"determination": "9",!
"quality": "4",!
"value": 1433!
}!
},!
... ... ...!
}!
1236 Bytes

{!
"_id": ObjectId("5398c40f3004e2fe02922e29"),!
"st": "w13731",!
"ts": ISODate("1949-01-01T05:00:00Z"),!
"aT": {!
"q": "5",!
"v": 1.1!
},!
"sC": {!
"c": "N",!
"cH": {!
"d": "9",!
"q": "4",!
"v": 1433!
}!
},!
... ... ...!
}!
786 Bytes

Choice: Indexes
• Prefer sparse indexes! All Geo indexes are sparse.
• Relying on index intersection can reduce storage
needs but compound indexes are more performant.
• Build indexes AFTER ingesting the data!

Overview
Javascript
!
Chrome
!
Google Earth
browser plugin
KML
!
Python
PyMongo
Data
Data
ClientServer

Aggregation
pipeline = [{!
'$match': {!
'ts': {!
'$gte': dt,!
'$lt': dt + timedelta(hours=1)},!
'airTemperature.quality': {!
'$in': ['0', '1', '5', '9']}!
}!
}, {!
'$group': {!
'_id': '$st',!
'position': {'$first': '$position'},!
'airTemperature': {'$first': '$airTemperature'}}!
}]!
!
cursor = db.data.aggregate(pipeline, cursor={})!

{!
name : "New York",!
! geometry : {!
type: "MultiPolygon",!
coordinates: [!
[!
[-71.94, 41.28],!
[-71.92, 41.29],!
/* 2000 more points... */!
[-71.94, 41.28]!
]!
]!
}!
}!
db.states.createIndex({!
geometry: '2dsphere'!
});!
GeoFencing

GeoFencing
db.states.find_one({!
'geometry': {!
'$geoIntersects': {!
'$geometry': {!
'type': 'Point',!
'coordinates': [lng, lat]}}}})!

Single Server
Application mongod
i2.8xlarge
251 GB RAM
6 TB SSD
c3.8xlarge

Sharded Cluster
Application / mongos
...
100 x r3.2xlarge
61 GB RAM
@
100 GB disk
mongod
c3.8xlarge

Cost?
..
$60,000 / yr
$700,000 / yr

Bulk Loading: Single Server 8 threads
100 batch size

Bulk Loading: Single Server
Settings
8 Threads
100 Batch Size
Total loading time: 10 h 20 min
Documents per second: ~70,000
Index build time 7 h 40 min (ts_1_st_1)

Bulk Loading: Sharded Cluster 144 threads 
200 batch size

Bulk Loading: Sharded Cluster
Shard Key Station ID, hashed
Settings
10 mongos @ 144 threads
200 batch size
Total loading time: 3 h 10 min
Documents per second: ~228,000
Index build time 5 min (ts_1_st_1)

Queries: Point in Space-Time
db.data.find({"st" : "u747940", 
"ts" : ISODate("1969-07-16T12:00:00Z")})

Queries: Point in Space-Time
"ts" : ISODate("1969-07-16T12:00:00Z")})
0
0.5
1
1.5
2
single server cluster
ms
avg
95th
99th
max.
throughput:
40,000/s 610,000/s
(10 mongos)

Queries: One Station, One Year
"ts" : {"$gte": ISODate("1989-01-01"), 
"$lt" : ISODate("1990-01-01")}})

Queries: One Station, One Year
"ts" : {"$gte": ISODate("1989-01-01"), 
"$lt" : ISODate("1990-01-01")}})
0
1000
2000
3000
4000
5000
ms
avg
95th
99th
max.
throughput: 20/s 430/s
(10 mongos)
targeted query

Queries: The Whole World
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})

Queries: The Whole World
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
0
2000
4000
6000
8000
10000
ms
avg
95th
99th
max.
throughput: 8/s
310/s
(10 mongos)
scatter/gather query

Analytics: Maximum Temperature
db.data.aggregate
([

{
"$match"
:
{
"airTemperature.quality"
:

{
"$in"
:
[
"1",
"5"
]
}
}
},

{
"$group"
:
{
"_id"

:
null, 

"maxTemp"
:
{
"$max"
:
 

"$airTemperature.value"
}
}
}

])

61.8 °C = 143 °F
2 h 30 min
Single Server
2 min
Cluster

Summary: Single Server
Pro
• Cost Effective
• Low latency for single queries
Con
• Table scans are still slow

Summary: Cluster
!
Con
• High cost
!
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields signiﬁcant speed-up
• Analytics are possible
!
..

Thank You!
J. Randall Hunt
@jrhunt 
Developer Advocate, MongoDB
@midwest.io

A Century Of Weather Data - Midwest.io

More Related Content

What's hot

Viewers also liked

Similar to A Century Of Weather Data - Midwest.io

More from Randall Hunt

A Century Of Weather Data - Midwest.io