A Century Of Weather Data - Midwest.io

Use MongoDB to store and query 4 TB of weather data. Presented at Midwest.io.

Presentation Transcript

  • Weather of the Century J. Randall Hunt @jrhunt
 Developer Advocate, MongoDB @midwestio
  • What was the weather the day you were born?
  • Agenda • Data and Schema • Application • Operational Concerns
  • MONGODB INTERLUDE!
  • What Is It And Why Use It? • Document Data Store • Geo Indexing • "Simple" Sharded deployments
  • Terminology
      RDBMS      MongoDB (Document Store)
      Database   Database
      Table      Collection
      Row(s)     (BSON) Document
      Index      Index
      Join       Nope.
  • The Data
  • Where To Get Data?
  • A Weather Datum • A station ID • A timestamp • Lat, Long, Elevation • A LOT OF WEATHER DATA (135 page manual for parsing) • Lots of optional sections
  • How much of it do we have? • 2.5 billion distinct data points • 4 Terabytes • Number of documents is huge, overall data size is reasonable • We'll call this: "moderately big" data
  • How does it grow?
  • Who Else Is This Relevant For? • Particle Physics • Stocks, high frequency trading • Insurance • People with lots of small pieces of data
  • Schema Design 101
  • Things We Care About • Performance ‣ Ingestion ‣ App Specific ‣ Ad-hoc • Cost • Flexibility
  • Performance Breakdown • Bulk Loading • Latency and throughput for queries • point in space-time • one station, one year • the whole world at one time • Aggregation and Exploration • warmest and coldest day ever, average temperature, etc.
  • Raw record:
      0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975
      ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999
      GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
    Parsed document:
      {
        "st" : "u725053",
        "ts" : ISODate("2013-06-03T22:51:00Z"),
        "airTemperature" : { "value" : 21.1, "quality" : "5" },
        "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" }
      }
    Station ID: NYC Central Park
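The mapping from the raw record to the document can be sketched as a fixed-width parse. This is a minimal sketch, not the talk's loader: the column offsets and the divide-by-10 and divide-by-1000 scaling are assumptions, checked only by confirming that they reproduce the document shown on the slide. The names `RECORD` and `parse_mandatory` are illustrative, and treating the timestamp as UTC is also an assumption. The 135-page format manual remains the authority, especially for the optional sections, which are ignored here.

```python
from datetime import datetime, timezone

# First 105 characters of the record on the slide (optional sections dropped).
RECORD = (
    "0303725053947282013060322517+40779-073969FM-15+0048KNYC "
    "V0309999C00005030485MN0080475N5+02115+02005100975"
)


def parse_mandatory(rec):
    """Turn the leading fixed-width section into a document.

    Offsets and scale factors are assumptions inferred from this one record.
    """
    ts = datetime.strptime(rec[15:27], "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return {
        "st": "u" + rec[4:10],
        "ts": ts,
        "position": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude].
            "coordinates": [int(rec[34:41]) / 1000, int(rec[28:34]) / 1000],
        },
        "elevation": int(rec[46:51]),
        "airTemperature": {"value": int(rec[87:92]) / 10, "quality": rec[92]},
        "atmosphericPressure": {"value": int(rec[99:104]) / 10, "quality": rec[104]},
    }


doc = parse_mandatory(RECORD)
print(doc["st"], doc["ts"].isoformat(), doc["airTemperature"])
```

Run against the record above, the sketch yields station `u725053`, the 2013-06-03 22:51 timestamp, an air temperature of 21.1 and a pressure of 1009.7, matching the slide.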
  • Schema (highlighted: st, the station ID and source)
      {
        st: "u724463",
        ts: ISODate("1991-01-01T00:00:00Z"),
        position: {
          type: "Point",
          coordinates: [-94.6, 39.117]
        },
        elevation: 231,
        … other fields …
      }
  • Stations • USAF and WBAN IDs exist for most of North America. Prefix with "u" and "w" then the ID • For ships we use the prefix "x" and their lat and lng to create a station id.
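The prefix scheme for station IDs can be sketched as follows. The function name `station_id` is hypothetical, preferring the USAF ID over the WBAN ID when both exist is an assumption, and the ship encoding is invented for illustration because the talk does not give its exact format.

```python
def station_id(usaf=None, wban=None, lat=None, lng=None):
    """Build a station ID using the "u" / "w" / "x" prefixes from the slide."""
    if usaf:
        return "u" + usaf
    if wban:
        return "w" + wban
    # Ships have no fixed ID; this position encoding is a made-up placeholder.
    return "x{:+.3f}{:+.3f}".format(lat, lng)


print(station_id(usaf="724463"))
print(station_id(wban="13731"))
print(station_id(lat=40.779, lng=-73.969))
```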
  • Schema (highlighted: position, stored as GeoJSON)
      {
        st: "u724463",
        ts: ISODate("1991-01-01T00:00:00Z"),
        position: {
          type: "Point",
          coordinates: [-94.6, 39.117]
        },
        elevation: 231,
        … other fields …
      }
  • GeoJSON • A rich geographical data format • Lines, MultiLines, Polygons, Geometries • Able to perform queries on complex structures
  • Schema
      airTemperature: {
        value: -4.9,
        quality: "1"
      }
  • Choice: Embedding? Problem: ~100 "weather codes" and optional sections • Store them inline • Store them in another collection
  • Choice: Embedding? • Embedding keeps your logic in the schema instead of the application. • Depends on cardinality, don't embed "squillions" • Don't embed objects that have to change frequently.
  • Choice: Unique Identifier
      {_id: ObjectId("53a33f823ed4ac438f8c63b7")}
    • Simple, guaranteed unique identifier • 12 bytes
  • Choice: Unique Identifier
      {_id: {
        'st': 'w12345',
        'ts': ISODate("2014-06-19T19:53:58.680Z")
      }}
    • Not great if there are duplicates • Slightly more complex queries • ~12 bytes saved per document
  • Choice: Field Shortening • Indexes are still the same size • Decreases readability • In our example you can save ~40% space with minimum field lengths • Probably better to go for semi-readable with ~20% space savings
  • Full field names: 1236 bytes
      {
        "_id": ObjectId("5298c40f3004e2fe02922e29"),
        "st": "w13731",
        "ts": ISODate("1949-01-01T05:00:00Z"),
        "airTemperature": {
          "quality": "5",
          "value": 1.1
        },
        "skyCondition": {
          "cavok": "N",
          "ceilingHeight": {
            "determination": "9",
            "quality": "4",
            "value": 1433
          }
        },
        ... ... ...
      }
  • Shortened field names: 786 bytes
      {
        "_id": ObjectId("5398c40f3004e2fe02922e29"),
        "st": "w13731",
        "ts": ISODate("1949-01-01T05:00:00Z"),
        "aT": {
          "q": "5",
          "v": 1.1
        },
        "sC": {
          "c": "N",
          "cH": {
            "d": "9",
            "q": "4",
            "v": 1433
          }
        },
        ... ... ...
      }
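Field shortening amounts to a recursive key rename. The sketch below assumes a mapping read off the two example documents (`SHORT_NAMES` and `shorten` are illustrative names) and uses JSON length only as a rough stand-in for BSON size. It also works out the saving implied by the two byte counts quoted on the slides.

```python
import json

# Hypothetical mapping, read off the long and short example documents.
SHORT_NAMES = {
    "airTemperature": "aT", "skyCondition": "sC", "ceilingHeight": "cH",
    "cavok": "c", "determination": "d", "quality": "q", "value": "v",
}


def shorten(doc):
    """Recursively rename keys; keys without a mapping (st, ts) are kept."""
    if isinstance(doc, dict):
        return {SHORT_NAMES.get(k, k): shorten(v) for k, v in doc.items()}
    return doc


long_doc = {
    "st": "w13731",
    "airTemperature": {"quality": "5", "value": 1.1},
    "skyCondition": {
        "cavok": "N",
        "ceilingHeight": {"determination": "9", "quality": "4", "value": 1433},
    },
}
short_doc = shorten(long_doc)

# JSON text length is not BSON size; it only shows the direction of the effect.
json_saving = 1 - len(json.dumps(short_doc)) / len(json.dumps(long_doc))
# Saving implied by the byte counts on the slides.
slide_saving = (1236 - 786) / 1236

print(short_doc)
print(round(slide_saving, 3))
```

The slide figures work out to a saving of about 36%, in line with the "~40%" quoted for minimum field lengths.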
  • Choice: Indexes • Prefer sparse indexes! All Geo indexes are sparse. • Relying on index intersection can reduce storage needs but compound indexes are more performant. • Build indexes AFTER ingesting the data!
  • The Application
  • Overview (architecture diagram) • Client: JavaScript in Chrome with the Google Earth browser plugin • Server: Python with PyMongo • Arrows between them labelled KML and Data
  • Aggregation
      from datetime import timedelta

      # dt (a datetime) and db (a PyMongo database) are defined elsewhere.
      pipeline = [{
          '$match': {
              'ts': {'$gte': dt, '$lt': dt + timedelta(hours=1)},
              'airTemperature.quality': {'$in': ['0', '1', '5', '9']}
          }
      }, {
          '$group': {
              '_id': '$st',
              'position': {'$first': '$position'},
              'airTemperature': {'$first': '$airTemperature'}
          }
      }]

      cursor = db.data.aggregate(pipeline, cursor={})
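What the two stages compute can be shown over an in-memory list, without a database. This is a sketch of the semantics only: `first_reading_per_station` is an illustrative name, the sample readings are made up, and "first" here means first in list order, just as `$first` without a preceding `$sort` depends on the order in which documents reach the group stage.

```python
from datetime import datetime, timedelta

GOOD_QUALITY = {"0", "1", "5", "9"}


def first_reading_per_station(docs, dt):
    """Mimic the $match and $group stages over a list of dicts."""
    result = {}
    for d in docs:
        in_hour = dt <= d["ts"] < dt + timedelta(hours=1)
        good = d["airTemperature"]["quality"] in GOOD_QUALITY
        if in_hour and good and d["st"] not in result:
            # Like $first: keep the first matching document per station.
            result[d["st"]] = {
                "_id": d["st"],
                "position": d["position"],
                "airTemperature": d["airTemperature"],
            }
    return list(result.values())


# Made-up sample readings.
pos = {"type": "Point", "coordinates": [-94.6, 39.117]}
dt = datetime(1991, 1, 1, 0)
docs = [
    {"st": "u724463", "ts": datetime(1991, 1, 1, 0, 0), "position": pos,
     "airTemperature": {"value": -4.9, "quality": "1"}},
    {"st": "u724463", "ts": datetime(1991, 1, 1, 0, 30), "position": pos,
     "airTemperature": {"value": -5.0, "quality": "1"}},
    {"st": "w13731", "ts": datetime(1991, 1, 1, 0, 15), "position": pos,
     "airTemperature": {"value": 1.1, "quality": "3"}},  # fails the quality filter
    {"st": "w13731", "ts": datetime(1991, 1, 1, 1, 0), "position": pos,
     "airTemperature": {"value": 1.2, "quality": "5"}},  # outside the hour
]
rows = first_reading_per_station(docs, dt)
print(rows)
```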
  • GeoFencing
      {
        name: "New York",
        geometry: {
          type: "MultiPolygon",
          coordinates: [
            [
              [
                [-71.94, 41.28],
                [-71.92, 41.29],
                /* 2000 more points... */
                [-71.94, 41.28]
              ]
            ]
          ]
        }
      }

      db.states.createIndex({
        geometry: '2dsphere'
      });
  • GeoFencing
      db.states.find_one({
          'geometry': {
              '$geoIntersects': {
                  '$geometry': {
                      'type': 'Point',
                      'coordinates': [lng, lat]}}}})
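The idea behind testing a point against a state outline can be illustrated with a planar ray-casting check. This is only a sketch of the concept: MongoDB's 2dsphere index works with spherical geometry, while the code below treats longitude and latitude as flat x and y. The function name `point_in_ring` and the square outline are made up.

```python
def point_in_ring(lng, lat, ring):
    """Ray casting: count ring edges crossed by a ray going east from the point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


# Made-up square outline; first and last point are equal, as in GeoJSON rings.
square = [[-75.0, 40.0], [-73.0, 40.0], [-73.0, 42.0], [-75.0, 42.0], [-75.0, 40.0]]

print(point_in_ring(-73.969, 40.779, square))
print(point_in_ring(-94.6, 39.117, square))
```

An odd number of crossings means the point is inside, so the first call reports a point inside the square and the second a point outside it.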
  • Operational Concerns
  • Single Server • Application: c3.8xlarge • mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
  • Sharded Cluster • Application / mongos: c3.8xlarge • mongod: 100 x r3.2xlarge, 61 GB RAM @ 100 GB disk
  • Cost? • Single server: $60,000 / yr • Sharded cluster: $700,000 / yr
  • Performance Breakdown • Bulk Loading • Latency and throughput for queries • point in space-time • one station, one year • the whole world at one time • Aggregation and Exploration • warmest and coldest day ever, average temperature, etc.
  • Bulk Loading: Single Server • Settings: 8 threads, batch size 100 • Total loading time: 10 h 20 min • Documents per second: ~70,000 • Index build time: 7 h 40 min (ts_1_st_1)
  • Bulk Loading: Sharded Cluster • Shard key: station ID, hashed • Settings: 10 mongos @ 144 threads, batch size 200 • Total loading time: 3 h 10 min • Documents per second: ~228,000 • Index build time: 5 min (ts_1_st_1)
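The "threads plus batch size" settings suggest a loader that cuts the input into fixed-size batches and hands them to a pool of workers. The sketch below shows that pattern under stated assumptions: `batches`, `load` and `fake_insert` are illustrative names, and `fake_insert` stands in for a real bulk insert (with PyMongo, a caller might wrap `collection.insert_many(batch)` instead).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load(docs, insert_batch, threads=8, batch_size=100):
    """Send batches to `insert_batch` from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(insert_batch, batches(docs, batch_size)))


def fake_insert(batch):
    """Stand-in for a real bulk insert; only reports the batch length."""
    return len(batch)


total = load(({"n": i} for i in range(1050)), fake_insert)
print(total)
```

As a rough consistency check on the slide figures, 2.5 billion documents at ~70,000 per second is about 9.9 hours, and at ~228,000 per second about 3 hours, close to the reported loading times.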
  • Queries: Point in Space-Time
      db.data.find({"st" : "u747940",
                    "ts" : ISODate("1969-07-16T12:00:00Z")})
    (Chart: latency in ms, avg / 95th / 99th percentile, single server vs. cluster)
    Max. throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
  • Queries: One Station, One Year (targeted query)
      db.data.find({"st" : "u103840",
                    "ts" : {"$gte": ISODate("1989-01-01"),
                            "$lt" : ISODate("1990-01-01")}})
    (Chart: latency in ms, avg / 95th / 99th percentile, single server vs. cluster)
    Max. throughput: 20/s (single server), 430/s (cluster, 10 mongos)
  • Queries: The Whole World (scatter/gather query)
      db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
    (Chart: latency in ms, avg / 95th / 99th percentile, single server vs. cluster)
    Max. throughput: 8/s (single server), 310/s (cluster, 10 mongos)
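The difference between a targeted and a scatter/gather query follows from the shard key. With a hashed station ID as the key, a query that names a station can be routed to one shard, while a query on the timestamp alone has to be sent to every shard. The sketch below is a toy model: it assumes each of the 100 mongod instances is its own shard, and it places stations with a hash modulo the shard count, whereas MongoDB assigns ranges of hash values to chunks. `shard_for` and `shards_to_query` are illustrative names.

```python
import hashlib

N_SHARDS = 100  # assumption: one shard per mongod in the cluster from the talk


def shard_for(st):
    """Toy placement: hash the station ID, then take it modulo the shard count."""
    digest = hashlib.md5(st.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS


def shards_to_query(query):
    """One shard if the query pins the shard key, otherwise all of them."""
    st = query.get("st")
    if isinstance(st, str):
        return [shard_for(st)]
    return list(range(N_SHARDS))


targeted = shards_to_query({"st": "u103840", "ts": {"$gte": "1989-01-01"}})
scattered = shards_to_query({"ts": "2000-01-01T00:00:00Z"})
print(len(targeted), len(scattered))
```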
  • Analytics: Maximum Temperature
      db.data.aggregate([
        { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } },
        { "$group" : { "_id" : null,
                       "maxTemp" : { "$max" : "$airTemperature.value" } } }
      ])
    Result: 61.8 °C = 143 °F • Single server: 2 h 30 min • Cluster: 2 min
  • Summary: Single Server • Pro: cost effective; low latency for single queries • Con: table scans are still slow
  • Summary: Cluster • Pro: high throughput; very good latency for single queries; scatter-gather yields significant speed-up; analytics are possible • Con: high cost
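The trade-off in the two summaries can be put into numbers using only the figures quoted in the talk. The sketch assumes that the $60,000 / yr figure belongs to the single server and the $700,000 / yr figure to the cluster. The variable names are illustrative.

```python
# Max. throughput per second as quoted on the slides: (single server, cluster).
throughput = {
    "point in space-time": (40_000, 610_000),
    "one station, one year": (20, 430),
    "the whole world": (8, 310),
}
# Analytics run time in minutes: (single server, cluster).
max_temp_minutes = (150, 2)
# Yearly cost in dollars; which figure is which is an assumption.
cost_per_year = (60_000, 700_000)

cost_ratio = cost_per_year[1] / cost_per_year[0]
speedups = {name: cluster / single for name, (single, cluster) in throughput.items()}
analytics_speedup = max_temp_minutes[0] / max_temp_minutes[1]

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x throughput for {cost_ratio:.1f}x cost")
print(f"maximum temperature: {analytics_speedup:.0f}x faster")
```

Under that assumption the cluster costs roughly 12 times as much, while the quoted gains range from about 15x for point queries to 75x for the full-collection aggregation.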
  • Thank You! J. Randall Hunt @jrhunt
 Developer Advocate, MongoDB @midwest.io