Cignex mongodb-sharding-mongodbdays


Published in: Technology

  1. CIGNEX Datamatics Confidential. For MongoDB and CIGNEX Datamatics Use Only.
     Scaling MongoDB with Sharding – A Case Study
     Presented by: Nikhil Naib, Lead Consultant – Big Data
  2. Who We Are
     • Since 2000, delivering solutions using Open Source technologies to:
       – Address business goals
       – Increase business velocity
       – Lower the cost of doing business
       – Gain competitive advantage
     • Dramatically reduce Total Cost of Ownership (TCO) & deployment time of IT solutions
     • 400+ Implementations | 450+ Experts | 200+ Integrations | 13 Books | 5000+ Community Contributions
     • Offices: America | India | UK | Europe | Singapore | Australia
     • Solutions: Portal Solutions | Content Solutions | Big Data Analytics Solutions
  3. Our Big Data Analytics Practice
     • Team Size: 110+
     • Projects: 10+
     • 20+ Big Data, 100+ Analytics & DW/BI
     • Partnership – MongoDB, Cloudera, IBM
     • Technical expertise – MongoDB, Hadoop, Neo4j, Solr, Pentaho, Talend, Cognos, Business Objects, Tableau, Jasper Reports
     • Research & Analytics division with data scientists
     • Connectors/Accelerators, Frameworks
     • BIGArchive – Enterprise Scale Archival
     • Liferay MongoDB Store
     • Drupal MongoDB Connector
     (Logo panels: Big Data Partners; Business Intelligence Expertise)
  4. Agenda
     • Use Case & Database Requirements
     • Why MongoDB?
     • Solution
     • To Shard Or Not To Shard
     • Scaling with Sharding
       – Sharding Basics
       – Architecture and Hardware Sizing
       – Sharding – Choosing the RIGHT Shard Key
       – Benchmarking with Results
     • Key Takeaways
  5. Use Case
     • Users: 7 million users across geographies, 8 devices per user, connecting from home, office, or anywhere
     • Load Balancer: a high volume of concurrent CRUD requests is routed to the DB cluster
     • Database: MongoDB data storage cluster with sharding enabled, automatic replication for failover, and indexes
     • Goal: let users access the service provider's digital assets across the array of devices they have registered, with the ability to resume a session on another device (session shifting)
  6. Database Requirements
     • Agility in Development & Deployment
     • High Availability
     • Flexibility in Schema
     • Enterprise Level Support
     • High Performance
  7. Why MongoDB?
     • Agility in Development & Deployment (Driver Support): programming language drivers, shorter dev cycle, faster deployment
     • High Availability (Replication): automatic failover, redundancy, ~100% uptime
     • Flexibility in Schema (Loose Schema): easy integration, ease of schema design, document-oriented storage
     • Enterprise Level Support (Strong Community): global coverage, 24x7 support, ease of maintenance
     • High Performance (Indexes & Sharding): concurrent CRUD, fast updates, write distribution with sharding
  8. Sharding – What is it?
     • Distributes a single logical database across multiple mongod nodes
     • Advantages:
       – Raises the limits of data size beyond a single node
       – Increases write capacity
       – Supports larger working sets
       – Scales reads by targeting specific shards through routed requests over distributed data. A good number of scatter-gather requests can also be supported if they are used judiciously.
  9. Sharding – When to use?
     • Storage: your data set approaches or exceeds the storage capacity of a single node in your system
     • Working set (RAM): the size of your system's active working set will soon exceed the maximum amount of RAM available to your system
     • Write load (disk I/O): your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention
  10. Sharding – Features
      • Range-based data partitioning
      • Automatic data volume distribution
      • Transparent query routing
      • Horizontal capacity
        – Additional write capacity through distribution
        – The right shard key allows expansion of the working set
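Range-based partitioning and transparent routing can be pictured with a toy router. This is only a sketch of the idea, not how mongos is implemented: the split points, the chunk-to-shard assignment, and the function name `route` are all hypothetical, and string keys are used purely for readability.

```python
from bisect import bisect_right

# Hypothetical chunk layout on a string shard key.
# Three split points define four chunks:
#   [min, "g"), ["g", "n"), ["n", "t"), ["t", max)
SPLIT_POINTS = ["g", "n", "t"]
# Hypothetical owner of each chunk, in the same order as the ranges above.
CHUNK_OWNERS = ["shard1", "shard2", "shard1", "shard2"]


def route(shard_key_value: str) -> str:
    """Return the shard that owns the chunk whose range contains the key."""
    chunk_index = bisect_right(SPLIT_POINTS, shard_key_value)
    return CHUNK_OWNERS[chunk_index]


if __name__ == "__main__":
    for key in ["alice", "mallory", "zed"]:
        print(key, "->", route(key))
```

The caller only supplies a key value; which shard holds it is decided by the chunk table, which is the sense in which routing is transparent to the application.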
  11. Solution: Approach
      • Schema: schema design; collections and field definitions
      • Database Size: document size; total expected data size
      • Concurrent Load: frequency of CRUD operations; read/write ratio
      • Availability: replication, backup and automatic failover; right Replication Factor (RF); read scaling for use cases that tolerate eventual consistency
      • Indexing: working set; access patterns
      • Sharding: horizontal scaling; read/write scaling
      • Hardware Sizing: cluster sizing; RAM and disk storage
  12. To Shard Or Not To Shard?
      • Sharding is a very powerful scaling technique provided by MongoDB, but it should be used only after due diligence; otherwise it is overkill.
      • It brings a substantial amount of overhead from an infrastructure and maintenance standpoint.
      • Use it only when you have made every possible optimization on a single node and that node's write capacity is still the bottleneck.
      • In production, a minimum of 6 server instances is required for a sharded cluster with no failover capability.
      • In production we cannot afford to run without redundancy/failover. Hence a minimum RF of 2 is required, which also brings an arbiter node into the picture.
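The instance counts can be checked with a little arithmetic. The slide gives only the total of 6; the breakdown used here (2 shards, 3 config servers, 1 mongos) is an assumption that happens to add up to that figure, and the function name is hypothetical.

```python
def sharded_cluster_instances(shards: int, members_per_shard: int,
                              config_servers: int = 3, mongos: int = 1) -> int:
    """Count the processes in a sharded cluster under the assumed layout."""
    return shards * members_per_shard + config_servers + mongos


# Assumed minimal layout with no failover: each shard is a single mongod.
no_failover = sharded_cluster_instances(shards=2, members_per_shard=1)

# With RF 2 plus an arbiter: each shard is primary + secondary + arbiter.
with_failover = sharded_cluster_instances(shards=2, members_per_shard=3)

if __name__ == "__main__":
    print("no failover:", no_failover)
    print("RF 2 + arbiter:", with_failover)
```

Under these assumptions, adding redundancy raises the count from 6 to 10 processes, which illustrates the infrastructure overhead mentioned in the slide.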
  13. To Shard Or Not To Shard? (Figure: Inserts and Updates With No Sharding)
  14. Solution: Architecture
      • App Tier: load balancer, three app servers, three mongos routers
      • Data Tier:
        – Shard 1, Shard 2, ... Shard n, each with a mongod Primary, a mongod Secondary, and a mongod Arbiter
        – Three config servers (mongod)
      • Requests are routed from mongos to the shards
  15. Shard Keys
      • Shard keys exist in every document in a collection. MongoDB uses the shard key to distribute documents among the shards. Just like indexes, they can be either a single field or a compound key.
      • The ideal shard key:
        – High cardinality, which makes it easy for MongoDB to split the chunks
        – Higher "randomness"
        – Targeted queries
        – May need to be computed
  16. Choosing the Right Shard Key
      Different approaches for shard keys:
      • Approach 1: Random Key
        – UserId + AssetId
      • Approach 2: Coarsely ascending key + Random Key
        – YearMonth + UserId + AssetId
      • Hashed Shard Keys (not tested/applicable here)
        – New in version 2.4.
        – Hashed shard keys use a hashed index of a single field as the shard key to partition data across your sharded cluster.
        – The field should have good cardinality.
        – Hashed keys work well with fields that increase monotonically.
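The two tested approaches can be written down as compound key specifications. The field names come from the slide; everything else is an assumption made for illustration: the `media.sessions` namespace, the integer YYYYMM encoding of `YearMonth`, and the helper names. The dictionaries follow the shape of MongoDB's `shardCollection` admin command, but nothing here connects to a server.

```python
from datetime import date


def year_month(d: date) -> int:
    """Coarsely ascending prefix, assumed here to be encoded as YYYYMM."""
    return d.year * 100 + d.month


# Approach 1: random key (field order matters in a compound key).
APPROACH_1_KEY = {"UserId": 1, "AssetId": 1}

# Approach 2: coarsely ascending key followed by the random key.
APPROACH_2_KEY = {"YearMonth": 1, "UserId": 1, "AssetId": 1}


def shard_collection_command(namespace: str, key: dict) -> dict:
    """Build the command document; a driver would send it to the admin database."""
    return {"shardCollection": namespace, "key": key}


if __name__ == "__main__":
    print(year_month(date(2013, 7, 4)))
    print(shard_collection_command("media.sessions", APPROACH_2_KEY))
```

With Approach 2, every document written in the same month shares the same leading value, while `UserId` and `AssetId` still spread writes within that month.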
  17. Benchmarking / Load Testing Approach: automated scripts with varied load
  18. Results – INSERTS
      • Approach 1: over 80 million documents inserted, with the insert rate decreasing beyond 10 million documents
      • Approach 2: over 225 million documents inserted at a stable rate of 6000 documents/sec
      • Benchmarks done on bare-metal machines: 2.2 GHz 8-core CPU, 32 GB RAM, 7200 RPM spinning drives with no RAID support
  19. Results – UPDATES
      • Approach 1: over 50 million documents updated at an average of 400 documents/sec
      • Approach 2: over 100 million documents updated at up to 4000 documents/sec
  20. Results – INSERT, UPDATE (Approach 2)
      • Simultaneous INSERT: >6000 documents/second, >70 million records
      • Simultaneous UPDATE: >6000 documents/second, >50 million records
  21. Benchmarking – Sharding Vs Non Sharding
      Operation          Sharding (YearMonth + UserId)       Non-Sharding
      INSERTS            ~6000 docs/sec                      ~2900 docs/sec
      UPDATES            ~4000 docs/sec                      ~620 updates/sec
      INSERT & UPDATES   ~6000 docs/sec & ~6100 docs/sec     ~2000 docs/sec & ~600 docs/sec
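The ratios implied by the comparison can be computed directly from the figures on the slide. The dictionary labels and the `speedup` helper are illustrative names only, and the inputs are the approximate rates reported above.

```python
# (sharded docs/sec, non-sharded docs/sec), taken from the comparison slide.
RESULTS = {
    "INSERTS": (6000, 2900),
    "UPDATES": (4000, 620),
    "INSERTS during mixed load": (6000, 2000),
    "UPDATES during mixed load": (6100, 600),
}


def speedup(sharded: float, unsharded: float) -> float:
    """Ratio of sharded to non-sharded throughput, rounded to 2 decimals."""
    return round(sharded / unsharded, 2)


if __name__ == "__main__":
    for operation, (sharded, unsharded) in RESULTS.items():
        print(f"{operation}: {speedup(sharded, unsharded)}x")
```

On these numbers the gain is roughly 2x for inserts alone and 6x to 10x for updates, so the benefit of sharding in this benchmark is largest on the update path.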
  22. Key Takeaways
      • MongoDB scales & shines.
        – Expected: 690 million CRUD operations per day.
        – Achieved: 840 million CRUD operations per day.
      • Plan early for sharding.
      • Sharding scales INSERTS/UPDATES compared with a non-sharded setup.
      • There is no magic recipe for finding an ideal shard key.
      • DO NOT go to production without benchmarking the shard key. The shard key cannot be changed for the given configuration.
      • Use MMS. It is a great tool for assessing the health of the cluster and identifying bottlenecks well in advance.
      • Sharding with Approach 2 (Coarsely ascending Key + Random Key) provides sustained results and better utilization of RAM (better index locality).
      Disclaimer: A suitable shard key depends on your data, so while this shard key approach delivers good results for this use case, it is not a generic approach.
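To relate the daily totals to the per-second rates quoted in the benchmarks, the figures can be converted as follows. This assumes, purely for illustration, that the load is spread evenly over 24 hours, which the slides do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

EXPECTED_PER_DAY = 690_000_000  # from the slide
ACHIEVED_PER_DAY = 840_000_000  # from the slide


def per_second(operations_per_day: int) -> float:
    """Average rate, assuming a perfectly even load across the day."""
    return operations_per_day / SECONDS_PER_DAY


def headroom_percent(achieved: int, expected: int) -> float:
    """How far the achieved figure exceeds the expected one, in percent."""
    return round((achieved / expected - 1) * 100, 1)


if __name__ == "__main__":
    print(round(per_second(EXPECTED_PER_DAY)), "ops/sec expected")
    print(round(per_second(ACHIEVED_PER_DAY)), "ops/sec achieved")
    print(headroom_percent(ACHIEVED_PER_DAY, EXPECTED_PER_DAY), "% above target")
```

Under that even-load assumption the target works out to about 7,986 operations/sec and the achieved figure to about 9,722 operations/sec, roughly 21.7% above the requirement.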
  23. Key Takeaways
      • Routed requests are always faster than scatter/gather requests.
      • Identify the consistency requirements of your read queries. Where eventual consistency is acceptable, the secondaryPreferred read preference can help you squeeze out more performance.
      • Use a different set of servers for NON-sharded collections.
      • Define indexes carefully. A larger number of indexes substantially reduces write throughput.
      • Sharded collections should have a minimal number of indexes.
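Whether a request is routed or scatter/gather depends on whether the query constrains the shard key or a leading prefix of a compound shard key. The sketch below is a deliberate simplification: it only checks which fields appear in the filter, ignores operators and ranges, and uses the Approach 2 key from the earlier slide. The function names are hypothetical.

```python
SHARD_KEY = ("YearMonth", "UserId", "AssetId")  # Approach 2 key from the deck


def matched_prefix(query_filter: dict, shard_key: tuple = SHARD_KEY) -> tuple:
    """Return the leading shard key fields that the filter constrains."""
    prefix = []
    for field in shard_key:
        if field not in query_filter:
            break
        prefix.append(field)
    return tuple(prefix)


def routing(query_filter: dict) -> str:
    """Classify a filter as a routed (targeted) or scatter/gather request."""
    return "targeted" if matched_prefix(query_filter) else "scatter-gather"


if __name__ == "__main__":
    print(routing({"YearMonth": 201307, "UserId": 42}))  # has the key prefix
    print(routing({"UserId": 42}))                       # skips the leading field
```

A filter on `UserId` alone skips the leading `YearMonth` field, so in this model it must be sent to every shard, which is one cost of putting a coarse time component first in the key.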
  24. Our Success Stories: At a Glance
      • Big Data Analytics for Telecom: optimum network bandwidth management & policy configuration for telecom companies
      • Social Media Research Platform for Legal Firms: leveraging social media & unstructured data analytics to collect supporting evidence for trials
      • US-based Advanced GPS Solutions Provider: real-time analysis of data accumulated from 200,000 GPS-based devices
      • Global Provider of Risk Management Solutions: collection and analysis of data from external and internal applications, delivered to a dashboard
      • US-based Networking Equipment Leader: cluster configuration for high-volume video uploads, including 30 million inserts/hour
      • European Chemical Giant: patent search with a 10x increase in performance and a 20x reduction in TCO
      • US-based Social Security e-Benefits System: managing a billion-object repository with enterprise search and retrieval
  25. Thank You. Any Questions? For queries, reach out to us at [contact missing in source]. Making Open Source Work