Weather of the Century: Design and Performance

This talk walks you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application.

Transcript

  • 1. #MongoDB The Weather of the Century: Design and High Performance André Spiegel Consulting Engineer, MongoDB
  • 2. What was the weather when you were born?
  • 3. Data Format: Raw and in MongoDB Raw: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... In MongoDB: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } }
  • 4. Data Format: Raw and in MongoDB Raw: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... In MongoDB: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } } Callout: Station Identifier (»NYC Central Park«)
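The mapping from the raw record to the MongoDB document on this slide can be sketched as a fixed-width parse. This is an illustration only, not the loader used in the talk: `parseRecord` is a hypothetical name, the column offsets are an assumption read off the sample record (they match the fixed-width mandatory section of NOAA's ISD layout), and the `"u"` prefix on the station id is inferred from the example document.

```javascript
// Sketch only: offsets are assumed from the sample record on the slide;
// parseRecord is a hypothetical helper, not the talk's actual loader.
const RAW =
  "0303725053947282013060322517+40779-073969FM-15+0048KNYC " +
  "V0309999C00005030485MN0080475N5+02115+02005100975";

function parseRecord(line) {
  const d = line.slice(15, 23); // date as YYYYMMDD
  const t = line.slice(23, 27); // time as HHMM, assumed UTC
  return {
    st: "u" + line.slice(4, 10), // assumption: "u" + 6-digit station number
    ts: new Date(Date.UTC(+d.slice(0, 4), +d.slice(4, 6) - 1, +d.slice(6, 8),
                          +t.slice(0, 2), +t.slice(2, 4))),
    airTemperature: {
      value: Number(line.slice(87, 92)) / 10, // stored in tenths of a degree
      quality: line.slice(92, 93),
    },
    atmosphericPressure: {
      value: Number(line.slice(99, 104)) / 10, // stored in tenths of a unit
      quality: line.slice(104, 105),
    },
  };
}

const doc = parseRecord(RAW);
console.log(JSON.stringify(doc));
```

Only the four fields shown on the slide are extracted here; the variable-length section after `ADD` is left unparsed.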
  • 5. How Big Is It? • 2.5 billion data points • 4 Terabyte (1.6k per document) • “moderately big”
  • 6. How to do this with MongoDB?
  • 7. First Deployment • A single server with a really big disk • Application: c3.8xlarge • mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
  • 8. Second Deployment • A really big cluster where everything is in RAM • Application / mongos: c3.8xlarge • mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
  • 9. Second Deployment • A really big cluster where everything is in RAM • Application / mongos • mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
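A quick back-of-the-envelope check of the two deployments against the data size from slide 5. The figures come from the slides; treating them as decimal units (1 TB = 1e12 bytes) and ignoring index and storage overhead are simplifying assumptions.

```javascript
// Figures from the slides; decimal units and zero overhead are assumptions.
const documents = 2.5e9;    // data points
const bytesPerDoc = 1.6e3;  // 1.6k per document
const dataTB = (documents * bytesPerDoc) / 1e12;

const shards = 100;         // r3.2xlarge instances
const ramPerShardGB = 61;
const clusterRamTB = (shards * ramPerShardGB) / 1000;

const singleServerRamTB = 251 / 1000; // i2.8xlarge

console.log({ dataTB, clusterRamTB, singleServerRamTB });
console.log("fits in cluster RAM:", dataTB < clusterRamTB);
console.log("fits in single-server RAM:", dataTB < singleServerRamTB);
```

Under these assumptions the 4 TB of documents fit in the cluster's combined RAM but exceed the single server's RAM many times over, which is why the first deployment depends on its SSD.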
  • 10. Now... how much would you pay?
  • 11. Now... how much would you pay? • Single server: $60,000 / yr
  • 12. Now... how much would you pay? • Single server: $60,000 / yr • Cluster: $700,000 / yr
  • 13. Use Cases • Bulk loading – getting all data into the system • Latency and throughput for queries – point in space-time – one station, one year – the whole world, once upon a time • Aggregation and Exploration – warmest and coldest day ever, etc.
  • 14. Bulk Loading: Principles • On the application side: – batch size – number of client threads – use unordered bulk writes • On the server side: – Journaling off ( temporarily! ) – Index later – In cluster: pre-split, no balancing
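The application-side principles (batch size, several concurrent clients, unordered bulk writes) might look roughly like the sketch below. It assumes a `collection` object exposing `insertMany(docs, options)` as in the official Node.js driver; `loadInBatches` and its parameters are hypothetical, and the "threads" from the slide are modelled as concurrent in-flight requests rather than OS threads.

```javascript
// Sketch: `collection` is assumed to offer insertMany(docs, options) like the
// official Node.js driver; loadInBatches is a hypothetical helper.
async function loadInBatches(collection, docs, batchSize, workers) {
  let next = 0; // index of the next unclaimed document

  async function worker() {
    let inserted = 0;
    while (next < docs.length) {
      const batch = docs.slice(next, next + batchSize);
      next += batch.length; // claim the batch before yielding at the await
      // ordered: false lets the server keep going past individual failures
      await collection.insertMany(batch, { ordered: false });
      inserted += batch.length;
    }
    return inserted;
  }

  // Start `workers` concurrent loops and add up what each one inserted.
  const counts = await Promise.all(Array.from({ length: workers }, () => worker()));
  return counts.reduce((a, b) => a + b, 0);
}
```

A call such as `loadInBatches(db.collection("data"), docs, 100, 8)` would correspond to the single-server settings reported later in the talk; the server-side steps (journaling, indexes, pre-splitting) are not covered by this sketch.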
  • 15. Bulk Loading: Single Server [Chart: throughput by batch size and number of threads]
  • 16. Bulk Loading: Single Server [Chart: throughput by batch size and number of threads] 8 threads, batch size 100 → 85,000 doc/s
  • 17. Bulk Loading: Single Server • Settings: 8 threads, batch size 100 • Total loading time: 10 h 20 min • Documents per second: 70,000 • Index build time: 7 h 40 min (ts_1_st_1)
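The reported rate and loading time can be cross-checked with simple arithmetic. The helper names below are hypothetical, and 2.5 billion is the rounded document count from slide 5.

```javascript
// Hypothetical helpers for a consistency check on the slide's figures.
function loadHours(docs, docsPerSecond) {
  return docs / docsPerSecond / 3600;
}

function docsLoaded(hours, minutes, docsPerSecond) {
  return (hours * 3600 + minutes * 60) * docsPerSecond;
}

const totalDocs = 2.5e9; // rounded figure from slide 5
console.log(loadHours(totalDocs, 70000).toFixed(1) + " h at 70,000 doc/s");
console.log(docsLoaded(10, 20, 70000) + " docs in 10 h 20 min");
```

At 70,000 documents per second, 2.5 billion documents take a little under 10 hours, and 10 h 20 min corresponds to about 2.6 billion documents, so the slide's numbers are consistent with the rounded total.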
  • 18. Bulk Loading: Cluster
  • 19. Bulk Loading: Cluster 144 threads, batch size 200 → 220,000 doc/s
  • 20. Bulk Loading: Cluster • Shard Key: Station ID, hashed • Settings: 10 mongos @ 144 threads, batch size 200 • Total loading time: 3 h 10 min • Documents per second: 228,000 • Index build time: 5 min (ts_1_st_1)
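As a rough sketch of the cluster setup described here, the function below builds the command documents one might pass to `db.adminCommand()` on a mongos, plus the index specification for the later index build. The talk does not show its actual commands: the database name `weather` and the chunks-per-shard value are hypothetical, and only the collection name `data`, the hashed `st` key, and the index name `ts_1_st_1` come from the slides.

```javascript
// Sketch: command documents for db.adminCommand() on a mongos.
// "weather" and chunksPerShard are hypothetical choices, not from the talk.
function shardingPlan(dbName, chunksPerShard, shardCount) {
  return {
    enableSharding: { enableSharding: dbName },
    shardCollection: {
      shardCollection: dbName + ".data",
      key: { st: "hashed" }, // station ID, hashed
      // pre-split the empty collection into this many chunks
      numInitialChunks: chunksPerShard * shardCount,
    },
    // Built after loading ("index later"); the default name is ts_1_st_1.
    index: { ts: 1, st: 1 },
  };
}

const plan = shardingPlan("weather", 2, 100);
console.log(JSON.stringify(plan, null, 2));
```

Creating the chunks up front means the load does not have to wait for splits and migrations, which matches the "pre-split, no balancing" principle from slide 14.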
  • 21. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
  • 22. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")}) [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster] Max. throughput: single server 40,000/s, cluster 610,000/s (10 mongos)
  • 23. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
  • 24. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}}) [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster] Max. throughput: single server 20/s, cluster 430/s (10 mongos) • Targeted query
  • 25. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
  • 26. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")}) [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster] Max. throughput: single server 8/s, cluster 310/s (10 mongos) • Scatter/gather query
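The difference between a targeted query and a scatter/gather query follows from the hashed shard key on `st`. The toy model below is not MongoDB's routing code: `toyHash` merely stands in for the real hash function, and shards are simply numbered from 0.

```javascript
// Toy model only: toyHash is a stand-in for MongoDB's hashed shard key
// function, and shards are numbered 0..shardCount-1.
function toyHash(s) {
  let h = 0;
  for (const ch of s) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function targetShards(query, shardCount) {
  if (typeof query.st === "string") {
    // Equality on the shard key: exactly one shard can hold the data.
    return [toyHash(query.st) % shardCount];
  }
  // No shard key in the query: every shard has to be asked.
  return Array.from({ length: shardCount }, (_, i) => i);
}

const oneStation = targetShards({ st: "u103840", ts: { $gte: new Date("1989-01-01") } }, 100);
const wholeWorld = targetShards({ ts: new Date("2000-01-01T00:00:00Z") }, 100);
console.log(oneStation.length, wholeWorld.length);
```

A query that names the station goes to a single shard, while the whole-world query fans out to all of them, with each shard searching only its own share of the data.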
  • 27. Analytics and Exploration • Analytics means ad-hoc queries for which we do not have an index – Find all tornados – Maximum reported temperature • We cannot just index everything – memory – write performance
  • 28. Analytics: Find all Tornados db.data.find ({ "presentWeatherObservation.condition" : "99" })
  • 29. Analytics: Find all Tornados db.data.find ({ "presentWeatherObservation.condition" : "99" }) • Single server: 1 h 28 min
  • 30. Analytics: Find all Tornados db.data.find ({ "presentWeatherObservation.condition" : "99" }) • Cluster: 47 s • Single server: 1 h 28 min
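The size of the gap between the two timings is easier to see as a ratio. The helper below is a hypothetical convenience; the durations are the ones reported on this slide.

```javascript
// Hypothetical helper: convert a duration to seconds.
function seconds(h, m, s) {
  return h * 3600 + m * 60 + s;
}

const singleServer = seconds(1, 28, 0);
const cluster = seconds(0, 0, 47);
const speedup = singleServer / cluster;

console.log("speed-up: about " + Math.round(speedup) + "x");
```

The ratio is slightly more than the shard count of 100, which would be consistent with the work being split across shards and with the cluster holding the data in RAM; the talk itself does not break the speed-up down into these factors.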
  • 31. Analytics: Maximum Temperature db.data.aggregate ([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ])
  • 32. Analytics: Maximum Temperature db.data.aggregate ([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F
  • 33. Analytics: Maximum Temperature db.data.aggregate ([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F • Single server: 4 h 45 min
  • 34. Analytics: Maximum Temperature db.data.aggregate ([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F • Cluster: 2 min • Single server: 4 h 45 min
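To make the pipeline's logic concrete, here is a minimal in-memory sketch of what the `$match` and `$group` stages compute. The sample documents are made up for illustration (only the values 21.1 and 61.8 echo the slides), and `maxTemp` and `toF` are hypothetical helper names.

```javascript
// In-memory sketch of the $match + $group pipeline; sample data is made up.
function maxTemp(docs) {
  const accepted = new Set(["1", "5"]); // quality codes from the $in filter
  let max = null;
  for (const d of docs) {
    const t = d.airTemperature;
    if (!t || !accepted.has(t.quality)) continue;      // $match
    if (max === null || t.value > max) max = t.value;  // $group with $max
  }
  return max;
}

const toF = (celsius) => celsius * 9 / 5 + 32;

const sample = [
  { airTemperature: { value: 21.1, quality: "5" } },
  { airTemperature: { value: 99.9, quality: "9" } }, // excluded by the filter
  { airTemperature: { value: 61.8, quality: "1" } },
];

console.log(maxTemp(sample), Math.round(toF(61.8)));
```

Because no index covers `airTemperature`, the real pipeline has to visit every document, which is why the run time depends so strongly on the deployment.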
  • 35. Summary: Single Server Pro: • Cost-effective • Very good latency for single queries Con: • Some operations are prohibitive: – Indexing – Table scans
  • 36. Summary: Cluster Con: • High cost Pro: • High throughput • Very good latency for single queries • Scatter-gather yields significant speed-up • Analytics are possible
  • 37. Thank you.