Weather of the Century: Design and Performance

This talk walks you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application.

Published in: Technology

Transcript of "Weather of the Century: Design and Performance"

  1. #MongoDB The Weather of the Century: Design and High Performance. André Spiegel, Consulting Engineer, MongoDB
  2. What was the weather when you were born?
  3. Data Format: Raw and in MongoDB. Raw: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... In MongoDB: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } }
  4. Data Format: Raw and in MongoDB. Raw: 0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859... In MongoDB: { "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } } Annotation: Station Identifier (»NYC Central Park«)
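The mapping from the raw record to the document on slides 3 and 4 can be sketched as a fixed-width parse. This is only an illustration, not the speaker's loader: the column positions are assumptions based on the NOAA ISD fixed-width layout, the `u` prefix is read off the slide's `"st"` value, the space after `KNYC` is assumed to be part of the record, and `field` and `parseRecord` are hypothetical helper names.

```javascript
// Sketch: slice the leading fixed-width section of one raw record into a document.
// ASSUMPTION: column positions follow the NOAA ISD layout; they are not stated in the talk.
// ASSUMPTION: the space after "KNYC" belongs to the record (5-character call-letter field).
const RAW =
  "0303725053947282013060322517+40779-073969FM-15+0048KNYC " +
  "V0309999C00005030485MN0080475N5+02115+02005100975";

// Hypothetical helper: extract columns using 1-based, inclusive positions.
function field(record, from, to) {
  return record.substring(from - 1, to);
}

function parseRecord(record) {
  const date = field(record, 16, 23); // YYYYMMDD
  const time = field(record, 24, 27); // HHMM, taken as UTC
  const ts = new Date(Date.UTC(
    Number(date.slice(0, 4)),
    Number(date.slice(4, 6)) - 1, // JavaScript months are 0-based
    Number(date.slice(6, 8)),
    Number(time.slice(0, 2)),
    Number(time.slice(2, 4))
  ));
  return {
    st: "u" + field(record, 5, 10), // station identifier with the slide's "u" prefix
    ts: ts,
    airTemperature: {
      value: parseInt(field(record, 88, 92), 10) / 10, // stored in tenths of a degree Celsius
      quality: field(record, 93, 93),
    },
    atmosphericPressure: {
      value: parseInt(field(record, 100, 104), 10) / 10, // stored in tenths of a hectopascal
      quality: field(record, 105, 105),
    },
  };
}

const doc = parseRecord(RAW);
console.log(JSON.stringify(doc, null, 2));
```

Only the mandatory leading section is parsed here; the additional data after position 105 (the `ADD...` groups) is left untouched.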
  5. How Big Is It? • 2.5 billion data points • 4 terabytes (1.6 kB per document) • “moderately big”
  6. How to do this with MongoDB?
  7. First Deployment • A single server with a really big disk • Diagram: Application (c3.8xlarge) → mongod (i2.8xlarge, 251 GB RAM, 6 TB SSD)
  8. Second Deployment • A really big cluster where everything is in RAM • Diagram: Application / mongos (c3.8xlarge) → 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk, running mongod
  9. Second Deployment • A really big cluster where everything is in RAM • Diagram: Application / mongos → 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk, running mongod
  10. Now... how much would you pay?
  11. Now... how much would you pay? $60,000 / yr
  12. Now... how much would you pay? $60,000 / yr; $700,000 / yr
  13. Use Cases • Bulk loading – getting all data into the system • Latency and throughput for queries – point in space-time – one station, one year – the whole world, once upon a time • Aggregation and Exploration – warmest and coldest day ever, etc.
  14. Bulk Loading: Principles • On the application side: – batch size – number of client threads – use unordered bulk writes • On the server side: – Journaling off (temporarily!) – Index later – In cluster: pre-split, no balancing
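The application-side principles on slide 14 (batching and unordered bulk writes) might look roughly like the sketch below. It is not the loader used in the talk: `toBatches`, `loadAll`, and `stubCollection` are hypothetical names, and the stub stands in for a real driver collection so the example runs without a server. The `insertMany(docs, { ordered: false })` call shape is the Node.js driver's way of requesting an unordered bulk insert.

```javascript
// Sketch: split documents into fixed-size batches and send each as an unordered insert.
function toBatches(docs, batchSize) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// `collection` is anything exposing insertMany(docs, options), e.g. a driver collection.
async function loadAll(collection, docs, batchSize) {
  let inserted = 0;
  for (const batch of toBatches(docs, batchSize)) {
    // ordered: false lets the server continue past individual failures
    // and apply the writes of a batch in any order.
    const result = await collection.insertMany(batch, { ordered: false });
    inserted += result.insertedCount;
  }
  return inserted;
}

// HYPOTHETICAL stub so the sketch runs without a MongoDB server.
const stubCollection = {
  async insertMany(batch, options) {
    return { insertedCount: batch.length, ordered: options.ordered };
  },
};

const sampleDocs = Array.from({ length: 250 }, (_, i) => ({ st: "u000000", n: i }));
loadAll(stubCollection, sampleDocs, 100).then((n) => console.log("inserted", n));
```

The "number of client threads" knob from the slide would correspond to running several such loaders in parallel, each on its own slice of the input.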
  15. Bulk Loading: Single Server [chart: throughput as a function of batch size and number of threads]
  16. Bulk Loading: Single Server [chart: throughput as a function of batch size and number of threads] 8 threads, batch size 100 → 85,000 doc/s
  17. Bulk Loading: Single Server • Settings: 8 threads, batch size 100 • Total loading time: 10 h 20 min • Documents per second: 70,000 • Index build time: 7 h 40 min (ts_1_st_1)
  18. Bulk Loading: Cluster
  19. Bulk Loading: Cluster • 144 threads, batch size 200 → 220,000 doc/s
  20. Bulk Loading: Cluster • Shard Key: Station ID, hashed • Settings: 10 mongos @ 144 threads, batch size 200 • Total loading time: 3 h 10 min • Documents per second: 228,000 • Index build time: 5 min (ts_1_st_1)
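As a rough cross-check of slides 17 and 20, average throughput can be recomputed from the total loading time and the 2.5 billion documents given on slide 5. Because the document count is itself rounded, the results should only be expected to land in the same range as the figures on the slides, not to match them exactly.

```javascript
// Sketch: average documents per second = total documents / total loading time.
// ASSUMPTION: the data set holds exactly 2.5e9 documents (the slide's rounded figure).
const totalDocuments = 2.5e9;

function toSeconds(hours, minutes) {
  return hours * 3600 + minutes * 60;
}

function averageDocsPerSecond(docs, hours, minutes) {
  return docs / toSeconds(hours, minutes);
}

const singleServerRate = averageDocsPerSecond(totalDocuments, 10, 20); // 10 h 20 min
const clusterRate = averageDocsPerSecond(totalDocuments, 3, 10);       // 3 h 10 min

console.log("single server:", Math.round(singleServerRate), "doc/s");
console.log("cluster:", Math.round(clusterRate), "doc/s");
console.log("cluster / single server:", (clusterRate / singleServerRate).toFixed(2));
```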
  21. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
  22. Queries: Point in Space-Time db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")}) [chart: latency in ms (avg, 95th, 99th), single server vs. cluster] max. throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
  23. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
  24. Queries: One Station, One Year db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}}) [chart: latency in ms (avg, 95th, 99th), single server vs. cluster] max. throughput: 20/s (single server), 430/s (cluster, 10 mongos); targeted query
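The one-station, one-year filter uses a half-open time interval: `$gte` the first instant of the year and `$lt` the first instant of the next year, so no boundary reading is counted twice. A small sketch of building that filter document follows; `stationYearQuery` is a hypothetical helper name, and plain `Date` objects stand in for the shell's `ISODate`.

```javascript
// Sketch: build the filter document for "one station, one year".
// The interval is half-open: [Jan 1 of `year`, Jan 1 of `year + 1`), both in UTC.
function stationYearQuery(station, year) {
  return {
    st: station,
    ts: {
      $gte: new Date(Date.UTC(year, 0, 1)),
      $lt: new Date(Date.UTC(year + 1, 0, 1)),
    },
  };
}

const query = stationYearQuery("u103840", 1989);
console.log(JSON.stringify(query));
```

The resulting object could be passed as the filter to `find` in a driver or shell session.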
  25. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
  26. Queries: The Whole World, Once Upon... db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")}) [chart: latency in ms (avg, 95th, 99th), single server vs. cluster] max. throughput: 8/s (single server), 310/s (cluster, 10 mongos); scatter/gather query
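The difference between the "targeted" query on slide 24 and the "scatter/gather" query on slide 26 comes down to whether the filter contains the shard key (the station ID, hashed). The toy model below illustrates the idea only: `toyHash` is NOT MongoDB's hashed-index function, real routing works on chunk ranges of hashed values rather than a simple modulo, and `shardsFor` is a hypothetical name.

```javascript
// Sketch: which shards must a query visit, given a hashed shard key on "st"?
const SHARD_COUNT = 100; // matches the 100-node cluster on slide 8

// TOY hash for illustration only; not the function MongoDB uses.
function toyHash(value) {
  let h = 0;
  for (const ch of String(value)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep within unsigned 32 bits
  }
  return h;
}

function shardsFor(query, shardKeyField, shardCount) {
  const value = query[shardKeyField];
  if (typeof value === "string") {
    // Equality match on the shard key: exactly one shard can hold the data.
    return [toyHash(value) % shardCount];
  }
  // Shard key absent from the filter: every shard has to be asked.
  return Array.from({ length: shardCount }, (_, i) => i);
}

const targeted = shardsFor(
  { st: "u103840", ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") } },
  "st",
  SHARD_COUNT
);
const scattered = shardsFor({ ts: new Date("2000-01-01T00:00:00Z") }, "st", SHARD_COUNT);

console.log("targeted query visits", targeted.length, "shard(s)");
console.log("scatter/gather query visits", scattered.length, "shard(s)");
```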
  27. Analytics and Exploration • Analytics means ad-hoc queries for which we do not have an index – Find all tornados – Maximum reported temperature • We cannot just index everything – memory – write performance
  28. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" })
  29. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" }) Single server: 1 h 28 min
  30. Analytics: Find all Tornados db.data.find({ "presentWeatherObservation.condition" : "99" }) Cluster: 47 s; single server: 1 h 28 min
  31. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ])
  32. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F
  33. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F. Single server: 4 h 45 min
  34. Analytics: Maximum Temperature db.data.aggregate([ { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } }, { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } } ]) Result: 61.8 °C = 143 °F. Cluster: 2 min; single server: 4 h 45 min
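To make the semantics of the maximum-temperature pipeline concrete, here is an in-memory sketch of what the `$match` and `$group` stages compute. The sample documents are invented for illustration (they are not from the data set), `maxTemperature` is a hypothetical name, and the sketch only handles numeric values, which is narrower than the server's `$max` behavior.

```javascript
// Sketch: in-memory equivalent of
//   $match { "airTemperature.quality": { $in: ["1", "5"] } }
//   $group { _id: null, maxTemp: { $max: "$airTemperature.value" } }
// HYPOTHETICAL sample data, not taken from the weather data set.
const sample = [
  { st: "a", airTemperature: { value: 21.1, quality: "5" } },
  { st: "b", airTemperature: { value: 35.4, quality: "1" } },
  { st: "c", airTemperature: { value: 99.9, quality: "9" } }, // filtered out by quality
  { st: "d", airTemperature: { value: -12.0, quality: "5" } },
  { st: "e" }, // no temperature reading at all
];

function maxTemperature(docs) {
  const acceptedQualities = ["1", "5"];
  let maxTemp = null;
  let matched = 0;
  for (const doc of docs) {
    const t = doc.airTemperature;
    // $match stage: keep only readings with an accepted quality code.
    if (!t || !acceptedQualities.includes(t.quality)) continue;
    matched += 1;
    // $group stage with $max: track the largest numeric value seen so far.
    if (typeof t.value === "number" && (maxTemp === null || t.value > maxTemp)) {
      maxTemp = t.value;
    }
  }
  // Like the pipeline, produce no result document when nothing matched.
  return matched === 0 ? [] : [{ _id: null, maxTemp: maxTemp }];
}

console.log(JSON.stringify(maxTemperature(sample)));
```

Since no index covers `airTemperature.quality`, the server has to evaluate this against every document, which is why the timings on slides 33 and 34 differ so much between the two deployments.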
  35. Summary: Single Server • Pro: – Cost-effective – Very good latency for single queries • Con: – Some operations are prohibitive: Indexing, Table Scans
  36. Summary: Cluster • Con: – High cost • Pro: – High throughput – Very good latency for single queries – Scatter-gather yields significant speed-up – Analytics are possible
  37. Thank you.
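The "significant speed-up" can be quantified from the analytics timings on slides 30 and 34. The cost comparison at the end rests on an assumption: the slides do not say which yearly price belongs to which deployment, so the sketch takes $60,000 for the single server and $700,000 for the cluster, in line with the "cost-effective" and "high cost" labels in the summaries.

```javascript
// Sketch: speed-up of the cluster over the single server for the two full scans.
function toSeconds(hours, minutes, seconds) {
  return hours * 3600 + minutes * 60 + seconds;
}

// Tornado scan: 1 h 28 min on the single server vs 47 s on the cluster.
const tornadoSpeedup = toSeconds(1, 28, 0) / toSeconds(0, 0, 47);

// Maximum temperature: 4 h 45 min on the single server vs 2 min on the cluster.
const maxTempSpeedup = toSeconds(4, 45, 0) / toSeconds(0, 2, 0);

// ASSUMPTION: $60,000 / yr is the single server and $700,000 / yr is the cluster;
// the slide text lists the two prices without labels.
const costRatio = 700000 / 60000;

console.log("tornado scan speed-up:", tornadoSpeedup.toFixed(1) + "x");
console.log("max temperature speed-up:", maxTempSpeedup.toFixed(1) + "x");
console.log("yearly cost ratio:", costRatio.toFixed(1) + "x");
```

Under that assumption, the cluster costs roughly an order of magnitude more per year while cutting these unindexed scans by about two orders of magnitude.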