Webinar: Best Practices for Getting Started with MongoDB

MongoDB adoption continues to grow at a record pace due to the significant enhancements in developer productivity and scalability that the database provides. Occasionally, however, organizations new to the technology make mistakes that limit their ability to leverage the significant advantages MongoDB provides. This webinar will discuss some of the common mistakes made by users when they first start working with MongoDB, how to identify when you've made those mistakes, and how to resolve them.

Published in: Technology

  1. 1. MongoDB Best Practices Jay Runkel Principal Solutions Architect jay.runkel@mongodb.com @jayrunkel
  2. 2. About Me • Solution Architect • Part of Sales Organization • Work with many organizations new to MongoDB
  3. 3. Everyone Loves MongoDB’s Flexibility • Document Model • Dynamic Schema • Powerful Query Language • Secondary Indexes
  5. 5. Sometimes Organizations Struggle with Performance
  6. 6. Good News! • Poor Performance Usually Due to Common (and often simple) mistakes
  7. 7. Agenda • Quick MongoDB Introduction • Best Practices 1. Hardware/OS 2. Schema/Queries 3. Loading Data
  8. 8. MongoDB Introduction
  9. 9. Document Data Model Relational MongoDB { first_name: ‘Paul’, surname: ‘Miller’, city: ‘London’, location: [45.123,47.232], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ] }
  10. 10. Documents are Rich Data Structures { first_name: ‘Paul’, surname: ‘Miller’, cell: 447557505611, city: ‘London’, location: [45.123,47.232], Profession: [‘banking’, ‘finance’, ‘trader’], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ] } Callouts: Fields • Typed fields • Fields can contain arrays • Fields can contain an array of sub-documents
  11. 11. Do More With Your Data { first_name: ‘Paul’, surname: ‘Miller’, city: ‘London’, location: [45.123,47.232], cars: [ { model: ‘Bentley’, year: 1973, value: 100000, … }, { model: ‘Rolls Royce’, year: 1965, value: 330000, … } ] } • Rich Queries: Find everybody in London with a car built between 1970 and 1980 • Geospatial: Find all of the car owners within 5km of Trafalgar Sq. • Text Search: Find all the cars described as having leather seats • Aggregation: Calculate the average value of Paul’s car collection • Map Reduce: What is the ownership pattern of colors by geography over time? (is purple trending up in China?)
  12. 12. Automatic Sharding • Three types: hash-based, range-based, location-aware • Increase or decrease capacity as you go • Automatic balancing
  13. 13. Query Routing Multiple query optimization models Each sharding option appropriate for different apps mongos
  14. 14. Replica Sets Replica Set – 2 to 50 copies Self-healing shard Data Center Aware Addresses availability considerations: High Availability Disaster Recovery Maintenance Workload Isolation: operational & analytics
  15. 15. Assumptions
  16. 16. Assumptions MongoDB 3.0 or 3.2
  17. 17. Storage Engine Architecture in 3.2 Content Repo IoT Sensor Backend Ad Service Customer Analytics Archive MongoDB Query Language (MQL) + Native Drivers MongoDB Document Data Model WT MMAP Supported in MongoDB 3.2 Management Security In-memory (beta) Encrypted 3rd party
  18. 18. Best Practices Hardware/Operating System
  19. 19. Servers • Specifications Good Fit For MongoDB? • Correct Number of Servers? • Properly Configured?
  20. 20. What Type of Servers • RAM – 64 to 256 GB+ • Fast IO Systems – RAID-10/SSDs • Many cores – Compress/Uncompress – Encrypt/Decrypt – Aggregation queries
  21. 21. What about a SAN? • Mostly Random Disk Access • IOPS • Need dedicated IOPS or performance will vary • Configure your SAN properly • Suitability of any IO system will depend upon IOPS
  22. 22. How Many Servers Do I Need? • How Many Shards Do I Need?
  23. 23. MongoDB cluster sizing at 30,000 ft • Disk Space • RAM • Query Throughput
  24. 24. Disk Space: How Many Shards Do I Need? • Sum of disk space across shards must be greater than required storage size
  25. 25. Disk Space: How Many Shards Do I Need? • Sum of disk space across shards must be greater than required storage size • Example: Data Size = 9 TB; WiredTiger Compression Ratio = .33; Storage Size = 9 TB × .33 ≈ 3 TB; Server Disk Capacity = 2 TB; 3 TB / 2 TB = 1.5, so 2 Shards Required
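The disk-space rule on this slide can be sketched in a few lines of JavaScript. This is a minimal sketch under the slide's own numbers, not an official sizing tool; the function name `shardsForDisk` and its parameter names are hypothetical.

```javascript
// Hypothetical helper: smallest number of shards whose combined disk
// capacity covers the compressed storage size.
function shardsForDisk(dataSizeTB, compressionRatio, serverDiskTB) {
  // Compressed on-disk size, e.g. 9 TB * 0.33 is roughly 3 TB.
  const storageSizeTB = dataSizeTB * compressionRatio;
  // Round up: a fraction of a shard still requires a whole server.
  return Math.ceil(storageSizeTB / serverDiskTB);
}

console.log(shardsForDisk(9, 0.33, 2)); // 2
```

In practice one would likely leave headroom for growth and indexes rather than fill each disk completely; that margin is not part of the slide's example.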
  26. 26. RAM: How Many Shards Do I Need? • Working set should fit in RAM – Sum of RAM across shards > Working Set • Working Set = Indexes plus the set of documents accessed frequently • Working Set in RAM means – Shorter latency – Higher throughput
  27. 27. RAM: How Many Shards Do I Need? • Measuring Index Size – db.coll.stats() – index size of collection • Estimate frequently accessed documents – Ex: total size of documents accessed per day
  28. 28. RAM: How Many Shards Do I Need? • Measuring Index Size – db.coll.stats() – index size of collection • Estimate frequently accessed documents – Ex: total size of documents accessed per day • Example: Working Set = 428 GB; Server RAM = 128 GB; 428/128 = 3.34, rounded up: 4 Shards Required
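The RAM calculation follows the same pattern. The sketch below assumes the working set has already been estimated (index size plus frequently accessed documents); `shardsForRam` is a hypothetical name, and the inputs are the slide's example values.

```javascript
// Hypothetical helper: smallest number of shards whose combined RAM
// holds the working set (indexes + frequently accessed documents).
// The index portion could be read from the totalIndexSize field that
// db.coll.stats() reports in the mongo shell.
function shardsForRam(workingSetGB, serverRamGB) {
  return Math.ceil(workingSetGB / serverRamGB);
}

console.log(shardsForRam(428, 128)); // 4
```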
  29. 29. Query Rate: How Many Shards Do I Need? • Measure max sustained query rate of a single server (with replication) – build a prototype and measure • Assume sharding overhead of 20-30%
  30. 30. Query Rate: How Many Shards Do I Need? • Measure max sustained query rate of a single server (with replication) – build a prototype and measure • Assume sharding overhead of 20-30% • Example: Require 50K ops/sec; Prototype performance: 20K ops/sec (1 replica set); 4 Shards Required: 80K ops/sec * .7 = 56K ops/sec
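The throughput estimate can be sketched the same way. This assumes the prototype measured about 20K ops/sec per replica set, which is the figure consistent with the slide's 56K result, and takes the upper end (30%) of the stated sharding overhead; `shardsForThroughput` is a hypothetical name.

```javascript
// Hypothetical helper: smallest shard count whose combined throughput,
// after discounting sharding overhead, meets the required rate.
function shardsForThroughput(requiredOpsPerSec, perShardOpsPerSec, shardingOverhead) {
  const effectivePerShard = perShardOpsPerSec * (1 - shardingOverhead);
  return Math.ceil(requiredOpsPerSec / effectivePerShard);
}

const shards = shardsForThroughput(50000, 20000, 0.3);
console.log(shards); // 4
```

The final cluster size would be the largest of the disk, RAM, and throughput estimates.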
  31. 31. Configure Them Properly • Default OS Settings Often Don’t Provide Optimal Performance • See MongoDB Production Notes – https://docs.mongodb.org/manual/administration/production-notes • Also Review: – Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/ – Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
  32. 32. Server/OS Configuration • Server configuration recommendations – XFS – Turn off atime and diratime – NOOP scheduler – File descriptor limits – Disable transparent huge pages and NUMA – Read ahead of 32 – Separate data volumes for data files, the journal, and the log. – Change the default TCP keepalive time to 300 seconds.
  33. 33. These are important • Ignore them and your performance may suffer • The first 100 lines of the MongoDB logs identify suboptimal OS settings
  34. 34. Best Practices Schema Design
  35. 35. Don’t Use a Relational Schema
  36. 36. Tailor MongoDB Schema to Application Workload • Design schema to provide good query performance • Schema design will impact required number of shards! • Application Query Workload: { Name: “john”, Height: 12, Address: {…} } db.cust.find({…}) db.cust.aggregate({…})
  37. 37. Compare Alternative Schemas • Build a spreadsheet • Calculate # of shards for each schema • Estimate query performance – # of documents – # of inserts – # of deletes – Required indexes – Number of documents inspected – Number of documents sent across network
  38. 38. Modeling Decisions • Referencing vs. Embedding • Aggregating data by device, customer, product, etc.
  39. 39. Referencing • Procedure: { "_id" : 333, "date" : "2003-02-09T05:00:00", "hospital" : "County Hills", "patient" : "John Doe", "physician" : "Stephen Smith", "type" : "Chest X-ray", "result" : 134 } • Results: { "_id" : 134, "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: "abc", … } }
  40. 40. Embedding • Procedure: { "_id" : 333, "date" : "2003-02-09T05:00:00", "hospital" : "County Hills", "patient" : "John Doe", "physician" : "Stephen Smith", "type" : "Chest X-ray", "result" : { "type" : "txt", "size" : NumberInt(12), "content" : { value1: 343, value2: "abc", … } } }
  41. 41. Embedding • Advantages – Retrieve all relevant information in a single query/document – Avoid implementing joins in application code – Update related information as a single atomic operation • MongoDB doesn’t offer multi-document transactions • Limitations – Large documents mean more overhead if most fields are not relevant – Might mean replicating data – 16 MB document size limit
  42. 42. Referencing • Advantages – Smaller documents – Less likely to reach 16 MB document limit – Infrequently accessed information not accessed on every query – No duplication of data • Limitations – Two queries required to retrieve information – Cannot update related information atomically
  43. 43. One-to-Many & Many-to-Many Relationships • Embed (Patients): { _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [ { id: 12345, date: 2015-02-15, type: “Cat scan”, …}, { id: 12346, date: 2015-02-15, type: “blood test”, …}] } • Reference (Patients): { _id: 2, first: “Joe”, last: “Patient”, addr: { …}, procedures: [12345, 12346]} (Procedures): { _id: 12345, date: 2015-02-15, type: “Cat scan”, …} { _id: 12346, date: 2015-02-15, type: “blood test”, …}
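The trade-off between the two models can be illustrated without a database. The sketch below uses in-memory stand-ins for the Patients and Procedures collections; the helper names and the `reads` counter are hypothetical and only illustrate that embedding answers the question with one read while referencing needs a second lookup.

```javascript
// In-memory stand-in for the Procedures collection (referenced model).
const procedureCollection = new Map([
  [12345, { _id: 12345, date: "2015-02-15", type: "Cat scan" }],
  [12346, { _id: 12346, date: "2015-02-15", type: "blood test" }],
]);

// Embedded model: procedures live inside the patient document.
const embeddedPatient = {
  _id: 2, first: "Joe", last: "Patient",
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan" },
    { id: 12346, date: "2015-02-15", type: "blood test" },
  ],
};

// Referenced model: the patient only stores procedure ids.
const referencedPatient = { _id: 2, first: "Joe", last: "Patient", procedures: [12345, 12346] };

// Embedded: the patient document already contains everything (1 read).
function procedureTypesEmbedded(patient) {
  return { reads: 1, types: patient.procedures.map((p) => p.type) };
}

// Referenced: read the patient, then fetch the procedures by id
// (conceptually a second query, e.g. one using the $in operator).
function procedureTypesReferenced(patient, procedures) {
  const docs = patient.procedures.map((id) => procedures.get(id));
  return { reads: 2, types: docs.map((p) => p.type) };
}

console.log(procedureTypesEmbedded(embeddedPatient));
console.log(procedureTypesReferenced(referencedPatient, procedureCollection));
```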
  44. 44. Schema Alternatives – Do the math • How complex are the queries? • How much hardware (how many shards) will I need?
  45. 45. Vital Sign Monitoring Device Vital Signs Measured: • Blood Pressure • Pulse • Blood Oxygen Levels Produces data at regular intervals • Once per minute
  46. 46. We have one or more hospitals full of devices
  47. 47. Data From Vital Signs Monitoring Device { deviceId: 123456, spO2: 88, pulse: 74, bp: [128, 80], ts: ISODate("2013-10-16T22:07:00.000-0500") } • One document per minute per device • Relational approach
  48. 48. Document Per Hour (By minute) { deviceId: 123456, spO2: { 0: 88, 1: 90, …, 59: 92}, pulse: { 0: 74, 1: 76, …, 59: 72}, bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]}, ts: ISODate("2013-10-16T22:00:00.000-0500") } • Store per-minute data at the hourly level • Update-driven workload • 1 document per device per hour
  49. 49. Characterizing Write Differences • Example: data generated every minute • Recording the data for 1 patient for 1 hour: Document Per Event: 60 inserts; Document Per Hour: 1 insert, 59 updates
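One way the "1 insert, 59 updates" pattern might be produced is an upsert keyed on device and hour, with each reading written into a per-minute field. The sketch below only builds the filter and update documents; `hourlyBucketUpdate` is a hypothetical helper, and truncating to the hour in UTC is an assumption (it matches local hours only for whole-hour time zone offsets).

```javascript
// Hypothetical helper: turn one per-minute reading into the arguments for
// an upsert against the document-per-hour schema.
function hourlyBucketUpdate(reading) {
  const ts = new Date(reading.ts);
  const minute = ts.getUTCMinutes();

  // Truncate the timestamp to the start of its hour (UTC).
  const hour = new Date(ts);
  hour.setUTCMinutes(0, 0, 0);

  return {
    filter: { deviceId: reading.deviceId, ts: hour },
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
    // With upsert, the first reading of the hour inserts the document
    // and the remaining 59 readings update it.
    options: { upsert: true },
  };
}

const op = hourlyBucketUpdate({
  deviceId: 123456, spO2: 88, pulse: 74, bp: [128, 80],
  ts: "2013-10-16T22:07:00.000-05:00",
});
console.log(JSON.stringify(op, null, 2));
```

In the mongo shell the result could be passed to something like `db.vitals.updateOne(op.filter, op.update, op.options)`, where the collection name `vitals` is made up for illustration.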
  50. 50. Characterizing Read Differences • Want to graph 24 hours of vital signs for a patient: Document Per Event: 1440 reads; Document Per Hour: 24 reads • Read performance is greatly improved
  51. 51. Characterizing Memory and Storage Differences • 100K Devices • 1 year's worth of data
        Metric                 Document Per Minute   Document Per Hour
        Number of Documents    52.6 B                876 M
        Total Index Size       6364 GB               106 GB
        _id index              1468 GB               24.5 GB
        {ts: 1, deviceId: 1}   4895 GB               81.6 GB
        Document Size          92 Bytes              758 Bytes
        Database Size          4503 GB               618 GB
  52. 52. Characterizing Memory and Storage Differences (same table as slide 51) • Number of Documents: 100000 * 365 * 24 * 60 (per minute) vs. 100000 * 365 * 24 (per hour)
  53. 53. Characterizing Memory and Storage Differences (same table as slide 51) • Total Index Size in bytes: 100000 * 365 * 24 * 60 * 130 (per minute) vs. 100000 * 365 * 24 * 130 (per hour)
  54. 54. Characterizing Memory and Storage Differences (same table as slide 51) • Database Size in bytes: 100000 * 365 * 24 * 60 * 92 (per minute) vs. 100000 * 365 * 24 * 758 (per hour)
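The figures in the table can be reproduced with a short calculation, which is essentially the "build a spreadsheet" step in code. This is a sketch under the slides' inputs (100K devices, one year, 130 bytes of index per document across both indexes, 92 and 758 byte documents); it assumes the slide's "GB" means 2^30 bytes, and `estimateSchema` is a hypothetical name.

```javascript
const BYTES_PER_GB = 1024 ** 3; // assumption: the slide's GB is 2^30 bytes

// Hypothetical helper: document count, index size and data size for one schema.
function estimateSchema(devices, docsPerDevicePerYear, docBytes, indexBytesPerDoc) {
  const documents = devices * docsPerDevicePerYear;
  return {
    documents,
    indexGB: Math.round((documents * indexBytesPerDoc) / BYTES_PER_GB),
    dataGB: Math.round((documents * docBytes) / BYTES_PER_GB),
  };
}

const perMinute = estimateSchema(100000, 365 * 24 * 60, 92, 130);
const perHour = estimateSchema(100000, 365 * 24, 758, 130);

console.log(perMinute); // { documents: 52560000000, indexGB: 6364, dataGB: 4503 }
console.log(perHour);   // { documents: 876000000, indexGB: 106, dataGB: 618 }
```

Feeding the index size into the RAM estimate and the data size into the disk estimate shows how the schema choice changes the number of shards required.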
  55. 55. Best Practices Loading Data
  56. 56. Rule of Thumb • To saturate a MongoDB cluster: loader hardware ~= MongoDB hardware • Many threads • Many mongos
  57. 57. Loader Architecture (diagram: loader, mongos, and three shards, each with one primary and two secondaries)
  58. 58. Loader Architecture (same diagram) • Where are the bottlenecks?
  59. 59. Loader Architecture (same diagram) • Where are the bottlenecks?
  60. 60. Loader Architecture (diagram: three loader (8) and mongos (4) pairs in front of the same three shards) • Use many threads • Use multiple loader servers
  61. 61. When Sharding • If you care about initial performance, you must pre-split • Otherwise, initial performance will be slow • (hash sharding automatically presplits collection)
  62. 62. Without presplitting Shard 1 Shard 2 Shard 3 Shard 4 • Chunks: -∞ … ∞ • sh.shardCollection("records.patients", {zipcode : 1})
  63. 63. Without presplitting Shard 1 Shard 2 Shard 3 Shard 4 • Chunks: -∞ … 11305 | 11306 … 44506 | 44507 … ∞ • 64K chunks • Splitting will occur quickly • Balancing occurs much more slowly • The entire query workload goes to Shard 1
  64. 64. Without presplitting Shard 1 Shard 2 Shard 3 Shard 4 -∞ … 11305 11306 … 44506 44507 … ∞ Loader mongos
  65. 65. Split collection Shard 1 Shard 2 Shard 3 Shard 4 • Split and distribute empty chunks before loading any data • Evenly distribute query load across cluster -∞ … 08333 08334 … 16667 16668 … 25000 25001… 33334 33335 … 41668 41669 … 50000 50001 … 58334 58335 … 66668 66669 … 75000 75001 … 83334 88335 … 96668 96669 … 99999
  66. 66. Split collection Shard 1 Shard 2 Shard 3 Shard 4 -∞ … 08333 08334 … 16667 16668 … 25000 25001… 33334 33335 … 41668 41669 … 50000 50001 … 58334 58335 … 66668 66669 … 75000 75001 … 83334 88335 … 96668 96669 … 99999 Loader mongos
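A pre-split plan like the one on these slides could be generated ahead of the load. The sketch below computes evenly spaced zip-code boundaries and assigns the resulting chunks to shards; the round-robin assignment, the zero-padded string zip codes, and the shard names are assumptions, and the boundaries it produces are close to, but not necessarily identical to, those on the slide.

```javascript
// Hypothetical helper: evenly spaced split points over 00000..maxZip.
function zipSplitPoints(chunkCount, maxZip = 99999) {
  const points = [];
  for (let i = 1; i < chunkCount; i++) {
    const boundary = Math.round((i * (maxZip + 1)) / chunkCount);
    points.push(String(boundary).padStart(5, "0"));
  }
  return points;
}

// Hypothetical helper: spread the chunks over the shards round-robin.
// Chunk i is the range that ends at splitPoints[i] (the last chunk is open-ended).
function assignChunks(splitPoints, shardNames) {
  const plan = [];
  for (let i = 0; i <= splitPoints.length; i++) {
    plan.push({ chunk: i, shard: shardNames[i % shardNames.length] });
  }
  return plan;
}

const points = zipSplitPoints(12);
const plan = assignChunks(points, ["shard1", "shard2", "shard3", "shard4"]);
console.log(points);
console.log(plan);

// In the mongo shell, each split point could then be applied with
// sh.splitAt("records.patients", { zipcode: point }) and each empty chunk
// placed with sh.moveChunk("records.patients", { zipcode: point }, shardName),
// before any data is loaded.
```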
  67. 67. Summary
  68. 68. Best Practices 1. Use servers with specifications that will provide good MongoDB performance – 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs) 2. Calculate How Many Shards? 1. Calculate required RAM and Disk Space 2. Build a prototype to determine the ops/sec capacity of a server 3. Do the math 3. Configure OS for Optimal MongoDB Performance – See MongoDB Production Notes – Review logs for warnings (Don’t ignore)
  69. 69. Best Practices (cont.) 4. Create a Document Schema – Denormalized 5. Tailor schema to application workload – Use application queries to guide schema design decisions – Consider alternative schemas – Compare cluster size (# of shards) and performance – Build a spreadsheet
  70. 70. Best Practices 6. Loading Data – Loader Hardware ~= MongoDB hardware – Many threads – Many mongos 7. Pre-split – Ensure query workload is evenly distributed across the cluster from the start
  71. 71. Questions? jay.runkel@mongodb.com @jayrunkel