Weather of the Century: Design and Performance

This talk walks you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application.

Published in: Technology

  1. 1. #MongoDB • The Weather of the Century: Design and High Performance • André Spiegel, Consulting Engineer, MongoDB
  2. 2. What was the weather when you were born?
  3. 3. Data Format: Raw and in MongoDB • Raw record: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... • MongoDB document: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } }
  4. 4. Data Format: Raw and in MongoDB • Raw record: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... • MongoDB document: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } } • Callout: Station Identifier (»NYC Central Park«)
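The mapping from the raw record to the MongoDB document on the two slides above can be sketched as a fixed-width parse. This is a minimal sketch, not the loader used in the talk: the character offsets are an assumption based on the fixed-width layout of this kind of weather record and were checked only against the one sample shown on the slide, the "u" prefix simply mirrors the slide's "st" value, and `field` and `parseRecord` are hypothetical names. Missing-value sentinels are not handled.

```javascript
// First 105 characters of the sample record shown on the slide.
const RAW =
  "0303725053947282013060322517+40779-073969FM-15+0048KNYC " +
  "V0309999C00005030485MN0080475N5+02115+02005100975";

// Return characters start..end of the record (1-based, inclusive).
function field(raw, start, end) {
  return raw.slice(start - 1, end);
}

// Build a document shaped like the one on the slide.
// All offsets below are assumptions, verified only against RAW.
function parseRecord(raw) {
  const date = field(raw, 16, 23); // YYYYMMDD
  const time = field(raw, 24, 27); // HHMM, taken to be UTC
  return {
    st: "u" + field(raw, 5, 10),
    ts: new Date(Date.UTC(
      Number(date.slice(0, 4)),
      Number(date.slice(4, 6)) - 1, // JavaScript months are 0-based
      Number(date.slice(6, 8)),
      Number(time.slice(0, 2)),
      Number(time.slice(2, 4))
    )),
    airTemperature: {
      value: parseInt(field(raw, 88, 92), 10) / 10, // tenths of a degree C
      quality: field(raw, 93, 93),
    },
    atmosphericPressure: {
      value: parseInt(field(raw, 100, 104), 10) / 10, // tenths of a hPa
      quality: field(raw, 105, 105),
    },
  };
}

console.log(JSON.stringify(parseRecord(RAW), null, 2));
```

For the sample record, the parsed values agree with the document on the slide (station "u725053", 21.1 and 1009.7 with quality "5").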
  5. 5. How Big Is It? • 2.5 billion data points • 4 terabytes (1.6 kB per document) • “moderately big”
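The size figure follows from the document count and the average document size. A quick check, assuming "1.6k" means 1.6 kB (1,600 bytes) and decimal terabytes:

```javascript
const documents = 2.5e9;       // data points, from the slide
const bytesPerDocument = 1600; // "1.6k per document", read as 1.6 kB (assumption)

// Total raw data size in decimal terabytes (1 TB = 1e12 bytes).
const totalTerabytes = (documents * bytesPerDocument) / 1e12;

console.log(`${totalTerabytes} TB`);
```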
  6. 6. How to do this with MongoDB?
  7. 7. First Deployment • A single server with a really big disk • Application: c3.8xlarge • mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
  8. 8. Second Deployment • A really big cluster where everything is in RAM • Application / mongos: c3.8xlarge • mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
  9. 9. Second Deployment • A really big cluster where everything is in RAM • Application / mongos • mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
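The difference between the two deployments comes down to how much of the data fits in memory. The sketch below compares the RAM figures on the slides with the 4 TB data size. It is a rough estimate only: it ignores indexes, storage overhead and compression, and the per-shard figure assumes the hashed shard key spreads data evenly. `fractionInRam` is a hypothetical helper.

```javascript
const dataGB = 4000;           // 4 TB, from the "How Big Is It?" slide
const singleServerRamGB = 251; // i2.8xlarge
const shards = 100;
const ramPerShardGB = 61;      // r3.2xlarge

// Fraction of the raw data that could be held in RAM, capped at 100%.
const fractionInRam = (ramGB) => Math.min(1, ramGB / dataGB);

const singleServerFraction = fractionInRam(singleServerRamGB);
const clusterFraction = fractionInRam(shards * ramPerShardGB);
const dataPerShardGB = dataGB / shards; // assumes an even distribution

console.log(`single server: ${(singleServerFraction * 100).toFixed(1)}% of the data in RAM`);
console.log(`cluster: ${(clusterFraction * 100).toFixed(1)}% of the data in RAM`);
console.log(`per shard: ${dataPerShardGB} GB of data vs ${ramPerShardGB} GB RAM`);
```

Under these assumptions the single server can hold only a small fraction of the data in memory, while each shard holds less data than it has RAM.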
  10. 10. Now... how much would you pay?
  11. 11. Now... how much would you pay? • $60,000 / yr
  12. 12. Now... how much would you pay? • $60,000 / yr • $700,000 / yr
  13. 13. Use Cases • Bulk loading – getting all data into the system • Latency and throughput for queries – point in space-time – one station, one year – the whole world, once upon a time • Aggregation and Exploration – warmest and coldest day ever, etc.
  14. 14. Bulk Loading: Principles • On the application side: – batch size – number of client threads – use unordered bulk writes • On the server side: – Journaling off (temporarily!) – Index later – In cluster: pre-split, no balancing
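The application-side principles (batch size, several concurrent clients, unordered bulk writes) can be sketched as follows. This is an illustration under stated assumptions, not the loader from the talk: it assumes a collection object with a Node.js-driver-style `insertMany(docs, { ordered: false })` that resolves to an object with `insertedCount`, it models the slide's "client threads" as concurrent async loops, and `toBatches`, `bulkLoad` and `makeFakeCollection` are hypothetical names. The fake collection exists only so the sketch runs without a server.

```javascript
// Split an array into consecutive batches of at most batchSize elements.
function toBatches(docs, batchSize) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Load all documents with `workers` concurrent loops sharing one batch queue.
async function bulkLoad(collection, docs, { batchSize, workers }) {
  const queue = toBatches(docs, batchSize);
  let inserted = 0;

  async function worker() {
    for (let batch = queue.shift(); batch !== undefined; batch = queue.shift()) {
      // ordered: false asks the server not to stop at the first failed insert.
      const result = await collection.insertMany(batch, { ordered: false });
      inserted += result.insertedCount;
    }
  }

  await Promise.all(Array.from({ length: workers }, () => worker()));
  return inserted;
}

// Stand-in for a driver collection; it only records the calls it receives.
function makeFakeCollection() {
  return {
    calls: [],
    async insertMany(batch, options) {
      this.calls.push({ size: batch.length, options });
      return { insertedCount: batch.length };
    },
  };
}

const demoCollection = makeFakeCollection();
const demoDocs = Array.from({ length: 250 }, (_, i) => ({ st: "u725053", n: i }));
bulkLoad(demoCollection, demoDocs, { batchSize: 100, workers: 8 })
  .then((count) => console.log(`inserted ${count} documents in ${demoCollection.calls.length} batches`));
```

Batch size and worker count are the two knobs the benchmark slides vary; the values used here match the single-server settings from the talk but are otherwise arbitrary.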
  15. 15. Bulk Loading: Single Server • Chart: throughput by batch size and number of threads
  16. 16. Bulk Loading: Single Server • Chart: throughput by batch size and number of threads • 8 threads, batch size 100 → 85,000 doc/s
  17. 17. Bulk Loading: Single Server • Settings: 8 threads, batch size 100 • Total loading time: 10 h 20 min • Documents per second: 70,000 • Index build time: 7 h 40 min (ts_1_st_1)
  18. 18. Bulk Loading: Cluster
  19. 19. Bulk Loading: Cluster 144 threads, batch size 200 → 220,000 doc/s
  20. 20. Bulk Loading: Cluster • Shard Key: Station ID, hashed • Settings: 10 mongos @ 144 threads, batch size 200 • Total loading time: 3 h 10 min • Documents per second: 228,000 • Index build time: 5 min (ts_1_st_1)
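The documents-per-second figures on the two loading slides can be cross-checked against the total loading times. The sketch assumes exactly 2.5 billion documents, which is itself a rounded figure, so the results (roughly 67,000 and 219,000 documents per second) agree with the slides only to within that rounding.

```javascript
const DOCUMENTS = 2.5e9; // rounded count from the "How Big Is It?" slide

// Convert hours and minutes to seconds.
const toSeconds = (hours, minutes) => hours * 3600 + minutes * 60;

// Average load rate over the whole run, in documents per second.
const docsPerSecond = (hours, minutes) =>
  Math.round(DOCUMENTS / toSeconds(hours, minutes));

const singleServerRate = docsPerSecond(10, 20); // 10 h 20 min
const clusterRate = docsPerSecond(3, 10);       // 3 h 10 min

console.log(`single server: ${singleServerRate} doc/s`);
console.log(`cluster: ${clusterRate} doc/s`);
```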
  21. 21. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
  22. 22. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")}) • Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis 0 to 1.6 • max. throughput: single server 40,000/s, cluster 610,000/s (10 mongos)
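The latency charts in this part of the deck report an average plus 95th and 99th percentiles. How the benchmark computed them is not stated; the sketch below uses the nearest-rank definition, which is one common choice. `percentile` and `summarize` are hypothetical helpers, and the sample latencies are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a list of latencies the way the charts do.
function summarize(samples) {
  const avg = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  return { avg, p95: percentile(samples, 95), p99: percentile(samples, 99) };
}

// Hypothetical latencies in milliseconds, not measurements from the talk.
const latenciesMs = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1.5];
console.log(summarize(latenciesMs));
```

Reporting percentiles next to the average matters because a few slow outliers can hide behind a good mean.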
  23. 23. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
  24. 24. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}}) • Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis 0 to 5000 • max. throughput: single server 20/s, cluster 430/s (10 mongos) • targeted query
  25. 25. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
  26. 26. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")}) • Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis 0 to 10000 • max. throughput: single server 8/s, cluster 310/s (10 mongos) • scatter/gather query
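The slides label the one-station query "targeted" and the whole-world query "scatter/gather". With the station ID as the hashed shard key, the distinction depends on whether the query pins "st" to a single value. The sketch below is a deliberately simplified model of that rule, not the router's actual logic: it treats only a literal value as an equality match (operator forms such as `$eq` or `$in` are not handled), and `isEquality` and `routing` are hypothetical names. Plain `Date` objects stand in for the shell's `ISODate`.

```javascript
const SHARD_KEY_FIELD = "st"; // from the slides: shard key is the station ID, hashed

// A condition counts as equality if it is a literal value rather than
// an operator document such as { $gte: ... }.
function isEquality(condition) {
  if (condition === null || typeof condition !== "object" || condition instanceof Date) {
    return true;
  }
  return !Object.keys(condition).some((key) => key.startsWith("$"));
}

// Classify a query as going to one shard or to all of them.
function routing(query) {
  const pinned = SHARD_KEY_FIELD in query && isEquality(query[SHARD_KEY_FIELD]);
  return pinned ? "targeted" : "scatter/gather";
}

const pointInSpaceTime = { st: "u747940", ts: new Date("1969-07-16T12:00:00Z") };
const oneStationOneYear = {
  st: "u103840",
  ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") },
};
const wholeWorld = { ts: new Date("2000-01-01T00:00:00Z") };

console.log(routing(pointInSpaceTime));
console.log(routing(oneStationOneYear));
console.log(routing(wholeWorld));
```

The first two queries name a station and so can be answered by one shard; the third has no station and must be sent to every shard and merged.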
  27. 27. Analytics and Exploration • Analytics means ad-hoc queries for which we do not have an index – Find all tornados – Maximum reported temperature • We cannot just index everything – memory – write performance
  28. 28. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" })
  29. 29. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" }) • Single server: 1 h 28 min
  30. 30. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" }) • Single server: 1 h 28 min • Cluster: 47 s
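Because this query has no supporting index, every document has to be examined, so the run times translate directly into a scan rate. The arithmetic below assumes the rounded figure of 2.5 billion documents and, for the per-shard rate, that the 100 shards hold equal shares of the data.

```javascript
const DOCUMENTS = 2.5e9; // rounded count from the "How Big Is It?" slide
const SHARDS = 100;

const singleServerSeconds = 1 * 3600 + 28 * 60; // 1 h 28 min
const clusterSeconds = 47;

// Documents examined per second during the unindexed scan.
const singleServerScanRate = DOCUMENTS / singleServerSeconds;
const clusterScanRate = DOCUMENTS / clusterSeconds;
const perShardScanRate = clusterScanRate / SHARDS; // assumes even distribution

console.log(`single server: ${Math.round(singleServerScanRate)} docs/s`);
console.log(`cluster total: ${Math.round(clusterScanRate)} docs/s`);
console.log(`per shard:     ${Math.round(perShardScanRate)} docs/s`);
```

Under these assumptions each shard scans at a rate in the same range as the single server, which suggests the cluster's advantage here comes mostly from scanning in parallel.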
  31. 31. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ])
  32. 32. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) • Result: 61.8 °C = 143 °F
  33. 33. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) • Result: 61.8 °C = 143 °F • Single server: 4 h 45 min
  34. 34. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) • Result: 61.8 °C = 143 °F • Single server: 4 h 45 min • Cluster: 2 min
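To make the two pipeline stages concrete, here is an in-memory equivalent in plain JavaScript: the filter plays the role of `$match`, the reduction plays the role of `$group` with `$max`. It illustrates the logic only and says nothing about how the server executes the pipeline. The sample documents are made up, and `maxTemperature` and `toFahrenheit` are hypothetical helpers; returning `null` for an empty input is a choice made for this sketch.

```javascript
// Made-up sample documents in the shape shown earlier in the deck.
const sampleDocs = [
  { st: "u725053", airTemperature: { value: 21.1, quality: "5" } },
  { st: "u725053", airTemperature: { value: 999.9, quality: "9" } }, // rejected by the filter
  { st: "u103840", airTemperature: { value: -3.4, quality: "1" } },
];

// Keep readings with an accepted quality code, then take the maximum value.
function maxTemperature(docs, acceptedQualities = ["1", "5"]) {
  const values = docs
    .filter((d) => d.airTemperature && acceptedQualities.includes(d.airTemperature.quality))
    .map((d) => d.airTemperature.value);
  if (values.length === 0) return null;
  return values.reduce((max, v) => Math.max(max, v), -Infinity);
}

// Convert degrees Celsius to degrees Fahrenheit.
const toFahrenheit = (celsius) => (celsius * 9) / 5 + 32;

console.log(maxTemperature(sampleDocs));
console.log(toFahrenheit(61.8).toFixed(1)); // the slide's 61.8 °C, in °F
```

The quality filter is what keeps implausible readings, such as the made-up 999.9 above, out of the maximum.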
  35. 35. Summary: Single Server • Pro: – Cost-effective – Very good latency for single queries • Con: – Some operations are prohibitive: indexing, table scans
  36. 36. Summary: Cluster • Con: – High cost • Pro: – High throughput – Very good latency for single queries – Scatter-gather yields significant speed-up – Analytics are possible
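The trade-off in the two summary slides can be put into numbers by dividing the cluster's figures by the single server's. The sketch takes all values from earlier slides. One assumption is not stated explicitly in the deck: that the $60,000 per year figure belongs to the single server and the $700,000 per year figure to the cluster, which is consistent with the "cost-effective" and "high cost" labels.

```javascript
// Figures from the slides; the cost assignment is an assumption (see above).
const singleServer = {
  costPerYear: 60000,
  pointQueriesPerSec: 40000,
  stationYearQueriesPerSec: 20,
  wholeWorldQueriesPerSec: 8,
  maxTempSeconds: 4 * 3600 + 45 * 60, // 4 h 45 min
};
const cluster = {
  costPerYear: 700000,
  pointQueriesPerSec: 610000,
  stationYearQueriesPerSec: 430,
  wholeWorldQueriesPerSec: 310,
  maxTempSeconds: 2 * 60, // 2 min
};

// Cluster relative to single server; larger is better except for cost.
const ratios = {
  cost: cluster.costPerYear / singleServer.costPerYear,
  pointThroughput: cluster.pointQueriesPerSec / singleServer.pointQueriesPerSec,
  stationYearThroughput: cluster.stationYearQueriesPerSec / singleServer.stationYearQueriesPerSec,
  wholeWorldThroughput: cluster.wholeWorldQueriesPerSec / singleServer.wholeWorldQueriesPerSec,
  maxTempSpeedUp: singleServer.maxTempSeconds / cluster.maxTempSeconds,
};

for (const [name, value] of Object.entries(ratios)) {
  console.log(`${name}: ${value.toFixed(1)}x`);
}
```

Read this way, the cluster costs roughly twelve times as much, gains somewhat more than that in indexed query throughput, and gains far more on the unindexed analytics workload.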
  37. 37. Thank you.
