A Century Of Weather Data - Midwest.io

345 views

Published on

Use MongoDB to store and query 4TB of weather data. At midwest.io .

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
345
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

A Century Of Weather Data - Midwest.io

  1. 1. Weather of the Century J. Randall Hunt @jrhunt
 Developer Advocate, MongoDB @midwestio
  2. 2. What was the weather the day you were born?
  3. 3. Agenda • Data and Schema • Application • Operational Concerns
  4. 4. MONGODB INTERLUDE!
  5. 5. What Is It And Why Use It? • Document Data Store • Geo Indexing • "Simple" Sharded deployments
  6. 6. Terminology RDBMS MongoDB (Document Store) Database Database Table Collection Row(s) (bson) Document Index Index Join Nope.
  7. 7. The Data
  8. 8. Where To Get Data?
  9. 9. A Weather Datum • A station ID • A timestamp • Lat, Long, Elevation • A LOT OF WEATHER DATA (135 page manual for parsing) • Lots of optional sections
  10. 10. How much of it do we have? • 2.5 billion distinct data points • 4 Terabytes • Number of documents is huge, overall data size is reasonable • We'll call this: "moderately big" data
  11. 11. How does it grow?
  12. 12. How does it grow?
  13. 13. Who Else Is This Relevant For? • Particle Physics • Stocks, high frequency trading • Insurance • People with lots of small pieces data
  14. 14. Schema Design 101
  15. 15. Things We Care About • Performance ‣ Ingestion ‣ App Specific ‣ Ad-hoc • Cost • Flexibility
  16. 16. Performance Breakdown • Bulk Loading • Latency and throughput for queries • point in space-time • one station, one year • the whole world at one time • Aggregation and Exploration • warmest and coldest day ever, average temperature, etc.
  17. 17. 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } } Station ID: NYC Central Park
  18. 18. Schema {! st: "u724463",! ts: ISODate("1991-01-01T00:00:00Z"),! position: {! type: "Point",! coordinates: [! -94.6,! 39.117! ]! },! elevation: 231,! … other fields …! }! station ID and source
  19. 19. Stations • USAF and WBAN IDs exist for most of North America. Prefix with "u" and "w" then the ID • For ships we use the prefix "x" and their lat and lng to create a station id.
  20. 20. Schema {! st: "u724463",! ts: ISODate("1991-01-01T00:00:00Z"),! position: {! type: "Point",! coordinates: [! -94.6,! 39.117! ]! },! elevation: 231,! … other fields …! }! GeoJSON
  21. 21. GeoJSON • A rich geographical data format • Lines, MultiLines, Polygons, Geometries • Able to perform queries on complex structures
  22. 22. Schema ! airTemperature: {! value: -4.9,! quality: "1"! }!
  23. 23. Choice: Embedding? Problem: ~100 "weather codes" and optional sections • Store them inline • Store them in another collection
  24. 24. Choice: Embedding? • Embedding keeps your logic in the schema instead of the application. • Depends on cardinality, don't embed "squillions" • Don't embed objects that have to change frequently.
  25. 25. Choice: Unique Identifier {_id: ObjectId("53a33f823ed4ac438f8c63b7")}! • Simple, guaranteed unique identifier • 12 bytes
  26. 26. Choice: Unique Identifier ! {_id: {! 'st': 'w12345',! 'ts': ISODate("2014-06-19T19:53:58.680Z")! }! } • Not great if there are duplicates • Slightly More complex queries • ~12 bytes saved per document
  27. 27. Choice: Field Shortening • Indexes are still the same size • Decreases readability • In our example you can save ~40% space with minimum field lengths • Probably better to go for semi-readable with ~20% space savings
  28. 28. {! "_id": ObjectId("5298c40f3004e2fe02922e29"),! "st": "w13731",! "ts": ISODate("1949-01-01T05:00:00Z"),! "airTemperature": {! "quality": "5",! "value": 1.1! },! "skyCondition": {! "cavok": "N",! "ceilingHeight": {! "determination": "9",! "quality": "4",! "value": 1433! }! },! ... ... ...! }! 1236 Bytes
  29. 29. {! "_id": ObjectId("5398c40f3004e2fe02922e29"),! "st": "w13731",! "ts": ISODate("1949-01-01T05:00:00Z"),! "aT": {! "q": "5",! "v": 1.1! },! "sC": {! "c": "N",! "cH": {! "d": "9",! "q": "4",! "v": 1433! }! },! ... ... ...! }! 786 Bytes
  30. 30. Choice: Indexes • Prefer sparse indexes! All Geo indexes are sparse. • Relying on index intersection can reduce storage needs but compound indexes are more performant. • Build indexes AFTER ingesting the data!
  31. 31. The Application
  32. 32. Overview Javascript ! Chrome ! Google Earth browser plugin KML ! Python PyMongo Data Data ClientServer
  33. 33. Aggregation pipeline = [{! '$match': {! 'ts': {! '$gte': dt,! '$lt': dt + timedelta(hours=1)},! 'airTemperature.quality': {! '$in': ['0', '1', '5', '9']}! }! }, {! '$group': {! '_id': '$st',! 'position': {'$first': '$position'},! 'airTemperature': {'$first': '$airTemperature'}}! }]! ! cursor = db.data.aggregate(pipeline, cursor={})!
  34. 34. {! name : "New York",! ! geometry : {! type: "MultiPolygon",! coordinates: [! [! [-71.94, 41.28],! [-71.92, 41.29],! /* 2000 more points... */! [-71.94, 41.28]! ]! ]! }! }! db.states.createIndex({! geometry: '2dsphere'! });! GeoFencing
  35. 35. GeoFencing db.states.find_one({! 'geometry': {! '$geoIntersects': {! '$geometry': {! 'type': 'Point',! 'coordinates': [lng, lat]}}}})!
  36. 36. Operational Concerns
  37. 37. Single Server Application mongod i2.8xlarge 251 GB RAM 6 TB SSD c3.8xlarge
  38. 38. Sharded Cluster Application / mongos ... 100 x r3.2xlarge 61 GB RAM @ 100 GB disk mongod c3.8xlarge
  39. 39. Cost? .. $60,000 / yr $700,000 / yr
  40. 40. Performance Breakdown • Bulk Loading • Latency and throughput for queries • point in space-time • one station, one year • the whole world at one time • Aggregation and Exploration • warmest and coldest day ever, average temperature, etc.
  41. 41. Bulk Loading: Single Server 8 threads 100 batch size
  42. 42. Bulk Loading: Single Server Settings 8 Threads 100 Batch Size Total loading time: 10 h 20 min Documents per second: ~70,000 Index build time 7 h 40 min (ts_1_st_1)
  43. 43. Bulk Loading: Sharded Cluster 144 threads
 200 batch size
  44. 44. Bulk Loading: Sharded Cluster Shard Key Station ID, hashed Settings 10 mongos @ 144 threads 200 batch size Total loading time: 3 h 10 min Documents per second: ~228,000 Index build time 5 min (ts_1_st_1)
  45. 45. Queries: Point in Space-Time db.data.find({"st" : "u747940",
 "ts" : ISODate("1969-07-16T12:00:00Z")})
  46. 46. Queries: Point in Space-Time db.data.find({"st" : "u747940",
 "ts" : ISODate("1969-07-16T12:00:00Z")}) 0 0.5 1 1.5 2 single server cluster ms avg 95th 99th max. throughput: 40,000/s 610,000/s (10 mongos)
  47. 47. Queries: One Station, One Year db.data.find({"st" : "u103840",
 "ts" : {"$gte": ISODate("1989-01-01"),
 "$lt" : ISODate("1990-01-01")}})
  48. 48. Queries: One Station, One Year db.data.find({"st" : "u103840",
 "ts" : {"$gte": ISODate("1989-01-01"),
 "$lt" : ISODate("1990-01-01")}}) 0 1000 2000 3000 4000 5000 single server cluster ms avg 95th 99th max. throughput: 20/s 430/s (10 mongos) targeted query
  49. 49. Queries: The Whole World db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
  50. 50. Queries: The Whole World db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")}) 0 2000 4000 6000 8000 10000 single server cluster ms avg 95th 99th max. throughput: 8/s 310/s (10 mongos) scatter/gather query
  51. 51. Analytics: Maximum Temperature db.data.aggregate  ([      {  "$match"  :  {  "airTemperature.quality"  :                                                                    {  "$in"  :  [  "1",  "5"  ]  }  }  },      {  "$group"  :  {  "_id"          :  null,
                                  "maxTemp"  :  {  "$max"  :  
                                                              "$airTemperature.value"  }  }  }   ])     61.8 °C = 143 °F 2 h 30 min Single Server 2 min Cluster
  52. 52. Summary: Single Server Pro • Cost Effective • Low latency for single queries Con • Table scans are still slow
  53. 53. Summary: Cluster ! Con • High cost ! Pro • High throughput • Very good latency for single queries • Scatter-gather yields significant speed-up • Analytics are possible ! ..
  54. 54. Thank You! J. Randall Hunt @jrhunt
 Developer Advocate, MongoDB @midwest.io

×