The Weather of the Century
Part II: High Performance
André Spiegel
Consulting Engineer, MongoDB
#MongoDBWorld
The Weather of the Century Part 2: High Performance

  1. The Weather of the Century, Part II: High Performance
     André Spiegel, Consulting Engineer, MongoDB
     #MongoDBWorld
  2. What was the weather when you were born?
  3. Data Format: Raw and in MongoDB
     Raw:
         0303725053947282013060322517+40779-073969FM-15+0048KNYC
         V0309999C00005030485MN0080475N5+02115+02005100975
         ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999
         GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
     In MongoDB:
         {
           "st" : "u725053",
           "ts" : ISODate("2013-06-03T22:51:00Z"),
           "airTemperature" : { "value" : 21.1, "quality" : "5" },
           "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" }
         }
     Station Identifier (»NYC Central Park«)
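The mapping from the raw record to the "st" and "ts" fields can be sketched as follows. This is a minimal sketch, not the loader used for the talk: the function name `parseHeader` is hypothetical, the fixed-width offsets are inferred from the sample record above, the time is assumed to be UTC, and the "u" prefix is simply copied from the sample document.

```javascript
// Extract the station ID and timestamp from the fixed-width header of a raw record.
// Assumption: offsets are inferred from the sample record on the slide.
function parseHeader(raw) {
  const station = raw.slice(4, 10);  // station number, e.g. "725053"
  const date = raw.slice(15, 23);    // YYYYMMDD, e.g. "20130603"
  const time = raw.slice(23, 27);    // HHMM, e.g. "2251" (assumed UTC)
  const ts = new Date(Date.UTC(
    Number(date.slice(0, 4)),
    Number(date.slice(4, 6)) - 1,    // JavaScript months are zero-based
    Number(date.slice(6, 8)),
    Number(time.slice(0, 2)),
    Number(time.slice(2, 4))
  ));
  // The "u" prefix mirrors the sample document; its meaning is not stated on the slide.
  return { st: "u" + station, ts };
}

const raw = "0303725053947282013060322517+40779-073969FM-15+0048KNYC";
const doc = parseHeader(raw);
console.log(doc.st, doc.ts.toISOString());
```

For the sample record this yields the same station ID and timestamp as the MongoDB document on the slide.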
  4. How Big Is It?
     • 2.5 billion data points
     • 4 terabytes (1.6 KB per document)
     • “moderately big”
  5. How to do this with MongoDB?
  6. First Deployment
     • A single server with a really big disk
     Application: c3.8xlarge
     mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
  7. Second Deployment
     • A really big cluster where everything is in RAM
     Application / mongos: c3.8xlarge
     mongod: 100 x r3.2xlarge, 61 GB RAM @ 100 GB disk
  8. Second Deployment
     • A really big cluster where everything is in RAM
     Application / mongos
     mongod: 100 x r3.2xlarge, 61 GB RAM @ 100 GB disk
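The sizing behind the two deployments can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted on the slides; it assumes decimal units (1 TB = 1e12 bytes) and ignores indexes and storage overhead.

```javascript
// Figures quoted on the slides (decimal units assumed).
const dataPoints = 2.5e9;           // 2.5 billion documents
const dataBytes = 4e12;             // 4 terabytes

// Average document size.
const bytesPerDocument = dataBytes / dataPoints;

// Total RAM of each deployment.
const clusterRamBytes = 100 * 61e9; // 100 shards with 61 GB each
const singleServerRamBytes = 251e9; // one server with 251 GB

// Fraction of the raw data that fits in RAM (values above 1 mean all of it).
const clusterRamCoverage = clusterRamBytes / dataBytes;
const singleServerRamCoverage = singleServerRamBytes / dataBytes;

console.log(`bytes per document: ${bytesPerDocument}`);
console.log(`cluster RAM / data: ${clusterRamCoverage.toFixed(2)}`);
console.log(`single server RAM / data: ${singleServerRamCoverage.toFixed(2)}`);
```

Under these assumptions the cluster has about 1.5 times as much RAM as there is data, which matches "everything is in RAM", while the single server can hold only about 6% of the data in memory.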
  9. Now... how much would you pay?
     $60,000 / yr
     $700,000 / yr
  10. Use Cases
      • Bulk loading
        – getting all data into the system
      • Latency and throughput for queries
        – point in space-time
        – one station, one year
        – the whole world, once upon a time
      • Aggregation and Exploration
        – warmest and coldest day ever, etc.
  11. Bulk Loading: Principles
      • On the application side:
        – batch size
        – number of client threads
        – use unordered bulk writes
      • On the server side:
        – Journaling off (temporarily!)
        – Index later
        – In cluster: pre-split, no balancing
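The application-side principles (batch size, number of client threads, unordered bulk writes) might look like this in client code. This is a sketch under assumptions, not the loader used for the talk: `toBatches` and `bulkLoad` are hypothetical names, `collection` is assumed to be any object with an `insertMany(docs, options)` method such as a MongoDB Node.js driver collection, and concurrent async workers stand in for client threads.

```javascript
// Split an array of documents into batches of at most `batchSize` documents.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical loader: `workers` plays the role of the client threads.
// The defaults are the single-server settings quoted in the talk.
async function bulkLoad(collection, docs, { batchSize = 100, workers = 8 } = {}) {
  const queue = toBatches(docs, batchSize);
  let inserted = 0;

  async function worker() {
    // Each worker pulls the next batch until the queue is empty.
    for (let batch = queue.shift(); batch !== undefined; batch = queue.shift()) {
      // ordered: false requests an unordered bulk write, so the server does
      // not have to apply the documents strictly one after another.
      await collection.insertMany(batch, { ordered: false });
      inserted += batch.length;
    }
  }

  await Promise.all(Array.from({ length: workers }, () => worker()));
  return inserted;
}

console.log(toBatches([1, 2, 3, 4, 5], 2).length);
```

The server-side settings (journaling, deferred index builds, pre-splitting, balancer) are administrative steps and are not covered by this sketch.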
  12. Bulk Loading: Single Server
      [Chart: throughput by batch size and number of threads]
      8 threads, batch size 100 → 85,000 doc/s
  13. Bulk Loading: Single Server
      • Settings: 8 threads, batch size 100
      • Total loading time: 10 h 20 min
      • Documents per second: 70,000
      • Index build time: 7 h 40 min (ts_1_st_1)
  14. Bulk Loading: Cluster
      144 threads, batch size 200 → 220,000 doc/s
  15. Bulk Loading: Cluster
      • Shard Key: Station ID, hashed
      • Settings: 10 mongos @ 144 threads, batch size 200
      • Total loading time: 3 h 10 min
      • Documents per second: 228,000
      • Index build time: 5 min (ts_1_st_1)
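As a plausibility check, the reported loading times can be compared with what the sustained insert rates imply for 2.5 billion documents. This is a back-of-the-envelope sketch; `loadHours` is a hypothetical helper, and the document count is the rounded figure from the "How Big Is It?" slide.

```javascript
const totalDocuments = 2.5e9; // "2.5 billion data points"

// Hours needed to load `docs` documents at a sustained rate of `docsPerSecond`.
function loadHours(docs, docsPerSecond) {
  return docs / docsPerSecond / 3600;
}

const singleServerHours = loadHours(totalDocuments, 70000);
const clusterHours = loadHours(totalDocuments, 228000);

console.log(`single server: ${singleServerHours.toFixed(1)} h`);
console.log(`cluster: ${clusterHours.toFixed(1)} h`);
```

The estimates come out at roughly 9.9 h and 3.0 h, a little below the reported 10 h 20 min and 3 h 10 min. That is consistent with the document count and rates being rounded.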
  16. Queries: Point in Space-Time
          db.data.find({"st" : "u747940",
                        "ts" : ISODate("1969-07-16T12:00:00Z")})
  17. Queries: Point in Space-Time
      [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis from 0 to 1.6 ms]
      Max. throughput: single server 40,000/s, cluster 610,000/s (10 mongos)
  18. Queries: One Station, One Year
          db.data.find({"st" : "u103840",
                        "ts" : {"$gte": ISODate("1989-01-01"),
                                "$lt" : ISODate("1990-01-01")}})
  19. Queries: One Station, One Year
      [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis from 0 to 5000 ms]
      Max. throughput: single server 20/s, cluster 430/s (10 mongos)
      Targeted query
  20. Queries: The Whole World, Once Upon...
          db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
  21. Queries: The Whole World, Once Upon...
      [Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis from 0 to 10000 ms]
      Max. throughput: single server 8/s, cluster 310/s (10 mongos)
      Scatter/gather query
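Why the slides label one query "targeted" and another "scatter/gather" follows from the shard key (station ID, hashed). The sketch below is a deliberately simplified model of that routing decision, not the actual mongos logic: `routing` is a hypothetical helper, and plain `Date` objects stand in for the shell's `ISODate`.

```javascript
// Assumption: the collection is sharded on a hashed "st" (station ID) key,
// as stated on the "Bulk Loading: Cluster" slide.
const SHARD_KEY = "st";

// Returns "targeted" when the filter pins the shard key to specific values,
// and "scatter/gather" when every shard has to be asked.
function routing(filter, shardKey = SHARD_KEY) {
  const condition = filter[shardKey];
  if (condition === undefined) return "scatter/gather";
  const isPlainObject =
    typeof condition === "object" && condition !== null && !(condition instanceof Date);
  const operators = isPlainObject
    ? Object.keys(condition).filter((k) => k.startsWith("$"))
    : [];
  // With a hashed key, only equality-style conditions identify the owning shards.
  const equalityOnly = operators.every((op) => op === "$eq" || op === "$in");
  return equalityOnly ? "targeted" : "scatter/gather";
}

const pointInSpaceTime = { st: "u747940", ts: new Date("1969-07-16T12:00:00Z") };
const oneStationOneYear = {
  st: "u103840",
  ts: { $gte: new Date("1989-01-01"), $lt: new Date("1990-01-01") },
};
const wholeWorld = { ts: new Date("2000-01-01T00:00:00Z") };

for (const [name, filter] of Object.entries({ pointInSpaceTime, oneStationOneYear, wholeWorld })) {
  console.log(`${name}: ${routing(filter)}`);
}
```

In this model the first two queries name a station and go to a single shard, while the third has no station and is sent to all shards, which agrees with the labels on the slides.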
  22. Analytics and Exploration
      • Analytics means ad-hoc queries for which we do not have an index
        – Find all tornados
        – Maximum reported temperature
      • We cannot just index everything
        – memory
        – write performance
  23. Analytics: Find all Tornados
          db.data.find({
            "presentWeatherObservation.condition" : "99"
          })
      Cluster: 47 s
      Single server: 1 h 28 min
  24. Analytics: Maximum Temperature
          db.data.aggregate([
            { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } },
            { "$group" : { "_id" : null,
                           "maxTemp" : { "$max" : "$airTemperature.value" } } }
          ])
      Result: 61.8 °C = 143 °F
      Cluster: 2 min
      Single server: 4 h 45 min
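What the maximum-temperature pipeline computes can be illustrated on a handful of in-memory documents. This is only a plain JavaScript emulation of the two stages, not how the server executes them; `maxTemperature` is a hypothetical helper, and all sample documents except the first (taken from the data format slide) are made up for illustration.

```javascript
// Sample documents in the shape shown on the data format slide.
// Only the first is from the slides; the others are hypothetical.
const sample = [
  { st: "u725053", airTemperature: { value: 21.1, quality: "5" } },
  { st: "x000001", airTemperature: { value: 35.4, quality: "1" } },
  { st: "x000002", airTemperature: { value: 99.9, quality: "9" } }, // filtered out
];

function maxTemperature(docs, acceptedQualities = ["1", "5"]) {
  // Equivalent of the $match stage: keep readings with an accepted quality code.
  const matched = docs.filter((d) => acceptedQualities.includes(d.airTemperature.quality));
  // An aggregation over zero matching documents returns no result document.
  if (matched.length === 0) return null;
  // Equivalent of $group with _id: null and $max over airTemperature.value.
  const maxTemp = matched.reduce(
    (max, d) => Math.max(max, d.airTemperature.value),
    -Infinity
  );
  return { _id: null, maxTemp };
}

console.log(maxTemperature(sample));
```

The implausible 99.9 reading is excluded because its quality code is not in the accepted list, which is presumably the purpose of the `$match` stage on the slide.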
  25. Summary: Single Server
      Pro
      • Cost-effective
      • Very good latency for single queries
      Con
      • Some operations are prohibitive:
        – Indexing
        – Table Scans
  26. Summary: Cluster
      Con
      • High cost
      Pro
      • High throughput
      • Very good latency for single queries
      • Scatter-gather yields significant speed-up
      • Analytics are possible
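The trade-off in the two summaries can be put into numbers by comparing the cost ratio with the speed-ups reported throughout the talk. This is a sketch using only figures from the slides; it assumes that $60,000 / yr refers to the single server and $700,000 / yr to the cluster, which the cost slide does not state explicitly.

```javascript
// Assumption: $60,000 is the single server, $700,000 the cluster (USD per year).
const cost = { single: 60000, cluster: 700000 };

// Figures from the slides. Rates are "higher is better";
// durations are converted to seconds and are "lower is better".
const measurements = {
  "bulk load (doc/s)":          { single: 70000, cluster: 228000, higherIsBetter: true },
  "point query (1/s)":          { single: 40000, cluster: 610000, higherIsBetter: true },
  "one station, one year (1/s)": { single: 20, cluster: 430, higherIsBetter: true },
  "whole world (1/s)":          { single: 8, cluster: 310, higherIsBetter: true },
  "index build (s)":            { single: (7 * 60 + 40) * 60, cluster: 5 * 60, higherIsBetter: false },
  "tornado scan (s)":           { single: (1 * 60 + 28) * 60, cluster: 47, higherIsBetter: false },
  "max temperature (s)":        { single: (4 * 60 + 45) * 60, cluster: 2 * 60, higherIsBetter: false },
};

// Factor by which the cluster outperforms the single server.
function speedup({ single, cluster, higherIsBetter }) {
  return higherIsBetter ? cluster / single : single / cluster;
}

const costRatio = cost.cluster / cost.single;
console.log(`cost ratio: ${costRatio.toFixed(1)}x`);
for (const [name, m] of Object.entries(measurements)) {
  console.log(`${name}: ${speedup(m).toFixed(1)}x`);
}
```

Under these assumptions the cluster costs about 11.7 times as much as the single server. Bulk loading speeds up by only about 3.3 times, the indexed queries by roughly 15 to 39 times, and the index build and unindexed scans by about 90 to 140 times.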
  27. Thank you.
