
Webinar: Scaling MongoDB through Sharding - A Case Study with CIGNEX Datamatics



This webinar walks through the solution CIGNEX developed for a real-time event logging application, along with some of the key technical considerations, such as selecting the proper shard key. Yash will explain the key decision factors and performance statistics that went into the solution. With the correct shard key, MongoDB is able to handle approximately 30 million inserts and 5 million updates per hour. This case study covers everything from hardware recommendations to cluster configuration management at scale.

Published in: Technology

  1. Scaling MongoDB with Sharding – A Case Study. Presented by Yash Badiani and Rahul Nair. CIGNEX Datamatics Confidential
  2. About CIGNEX Datamatics: a subsidiary of Datamatics Global Services Limited
  3. Introduction of Datamatics (DGSL)
     • Mission
       – Experts in improving enterprise productivity through Process Engineering & Information Management Solutions
     • Key Highlights
       – Founded in 1975
       – Publicly listed in India
       – Annual consolidated revenue of US$100 million
       – Fortune 500 clients
       – 4,400+ employees across 22 offices in 9 countries
     • Strategic Alliances
  4. What Does CIGNEX Datamatics Do? Since 2000, making Open Source work for the enterprise through adoption and integration to:
     • Address business goals
     • Increase business velocity
     • Lower the cost of doing business
     • Reduce TCO
     • Gain competitive advantage
     Solution areas: Portal Solutions, Content Solutions, Big Data Solutions. 400+ implementations worldwide across industries.
  5. Where We Can Help You
     • Solutions
       – Portals (User eXperience Platform): Intranet, Extranet, Mobile Portals, Social Collaboration, EAI, SOA. Technologies: Liferay, Drupal, JBoss, ZK, HTML5, MuleSoft
       – Content (Enterprise Content Management Solutions): WCM, DM, RM, CMS, DAM, E-Commerce, E-learning, ERP, Imaging. Technologies: Alfresco, Adobe CQ, Drupal, Magento, JBoss, Moodle, Ephesoft, Liferay
       – Big Data (Making Data Work): Analytics, DW - BI, Mobile, Social, Log Processing and Analysis, Search (Web, Enterprise, Real-time). Technologies: Hadoop, MongoDB, Neo4j, Flume, Hive, Solr, Pentaho, Jaspersoft
     • Services
       – UI, Development, Integration, Customization, Migration, Testing, Training, Support (24*7)
       – Managed Cloud Services - Develop, Deploy, Manage
       – VAR/Annual Product Subscription - Liferay, Alfresco, Cloudera Hadoop, MongoDB
       – Extended Development Center – Center of Excellence
  6. About the Presenters
     • Yash Badiani is the Big Data Practice Lead at CIGNEX Datamatics and focuses on Big Data technologies including MongoDB & Hadoop. He has worked extensively on large data warehousing & business intelligence projects with tools such as Business Objects, Microsoft SQL Server, MicroStrategy and IBM Cognos.
     • Gaurav Khambhala works at CIGNEX Datamatics as Technical Lead. He is a senior member of the PHP Practice at CIGNEX Datamatics and is involved in various technology initiatives such as Big Data, where he focuses on integrating PHP with NoSQL sources like MongoDB. He has wide industry experience in software development & management with Open Source technologies such as Drupal & Moodle.
  7. Agenda
     • CIGNEX Datamatics – Introduction & Offerings
     • Use Case & Database Requirements
     • Challenges with Traditional Databases
     • Why MongoDB?
     • Solution
       – Approach
       – Architecture and Hardware Sizing
     • Scaling with Sharding
       – Sharding Basics
       – Sharding – Choosing the RIGHT Shard Key
       – Benchmarking with Results
     • Key Takeaways
  8. Big Data Practice at CIGNEX Datamatics: Brief Snapshot
     • ~40-employee Big Data Practice focused on Hadoop, MongoDB, Neo4j, Solr
     • Professionals formally trained / certified by Cloudera and 10gen
     • Expertise in the Hadoop ecosystem (HBase, Pig, Hive, Flume, Sqoop, Oozie, Zookeeper)
     • Strong technology partnerships:
       – System Integration partner with Cloudera for CDH
       – Global partner with 10gen for MongoDB – multiple webinars on different solutions
  9. Our Offerings – Big Data
     • Consulting: Business Analysis, Technology Evaluation, Architecture, Design Framework, Cluster sizing, Deployment planning, Proof-of-Concept
     • Implementation: UI Development, Application Integration, Customization, Migration, Testing, Performance Tuning, Health Check, Performance Benchmarking
     • Support & Training: DBA Support, Application Support, Enhancements, 24*7 Production Support (Tier 1/2/3), Trainings
  10. Use Case: Users → Devices → Load Balancer → Database
      • End Users: 7 million users, spread across geography
      • Devices: 8 devices / user, used from home, office or anywhere
      • Load Balancer (App. Layer): receives a high volume of concurrent CRUD requests and routes request traffic to the DB cluster
      • MongoDB cluster (Data Storage): sharding, replication with automatic failover, indexes
  11. Database Requirements
      • Flexibility in Schema
      • High Performance
      • Agility in Development & Deployment
      • Availability
      • Enterprise Level Support
  12. Limitations of RDBMS
      • Support limited to terabytes
      • Manages only structured data
      • RDBMS doesn't scale inherently
      • Feature rich but slow performance
      • Complex to shard/partition due to maintenance of schema
      • Limitations in scaling a high volume of concurrent CRUD
      • Specialized hardware - expensive
      • Vertical scaling is expensive and difficult
      RDBMS can't manage all dimensions of data with speed & at lower cost.
  13. Why MongoDB?
      • Flexibility in Schema (schema free): easy integration, ease of schema design, document-oriented storage
      • High Performance (indexes & sharding): concurrent CRUD, fast updates, write distribution with sharding
      • Agility in Development & Deployment (driver support): programming language drivers, shorter dev cycle, faster deployment
      • Availability (replication): automatic failover, redundancy, 100% uptime
      • Enterprise Level Support (strong community): global coverage, 24x7 support, ease of maintenance
  14. Solution: Approach
      • Schema: schema design, collections and field definitions
      • Database Size: document size, total expected data size
      • Concurrent Load: frequency of CRUD operations, read/write ratio
      • Availability: automatic failover, replication and backup
      • Indexing: working set, access patterns
      • Sharding: horizontal scaling, query performance
      • Hardware Sizing: cluster sizing, RAM and disk storage
  15. Solution: Architecture
      • App Tier: a load balancer in front of several application servers, each running the app alongside its own mongos router
      • Config Servers: dedicated mongod instances
      • Data Tier: mongos routes requests to Shards 1 to 4; each shard is a replica set with a mongod primary, a mongod secondary and a mongod arbiter
      • Non-sharded collections: routed to a separate replica set (mongod primary, secondary and arbiter)
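As a rough illustration of the data tier in the architecture slide above, the sketch below builds one replica set configuration document per shard, each with two data-bearing members and an arbiter. The document shape (`_id`, `members`, `arbiterOnly`) is what MongoDB's replica set initiation accepts; the host names, port and set names are hypothetical placeholders, not values from the presentation.

```python
def replica_set_config(set_name, data_hosts, arbiter_host):
    """Build a replica set configuration document for one shard.

    data_hosts: data-bearing members (one is elected primary, the rest
                act as secondaries)
    arbiter_host: a vote-only member that stores no data
    """
    members = [{"_id": i, "host": host} for i, host in enumerate(data_hosts)]
    members.append(
        {"_id": len(data_hosts), "host": arbiter_host, "arbiterOnly": True}
    )
    return {"_id": set_name, "members": members}


# Hypothetical host names: four shards, as in the architecture slide.
shard_configs = [
    replica_set_config(
        f"shard{n}",
        [f"shard{n}-a.example.net:27018", f"shard{n}-b.example.net:27018"],
        f"shard{n}-arb.example.net:27018",
    )
    for n in range(1, 5)
]
```

Each document would typically be passed to `rs.initiate()` in the mongo shell on one member of the corresponding set.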
  16. Sharding – What is it?
      • Distributes a single logical database system across a cluster of machines
      • Allows a collection to be partitioned across a number of mongod instances (shards)
      • Advantages:
        – Increases write capacity
        – Ability to support larger working sets
        – Raises limits of data size beyond a single node
  17. Sharding - Features
      • Range-based data partitioning
      • Automatic data volume distribution
      • Transparent query routing
      • Horizontal capacity
        – Additional write capacity through distribution
        – The right shard key allows expansion of the working set
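To make range-based partitioning and transparent routing more concrete, here is a toy model, assuming integer shard-key values and a fixed chunk layout. It is only a sketch of the idea, not MongoDB's implementation; the class name, split points and shard names are invented for illustration.

```python
import bisect


class RangeRouter:
    """Toy model of range-based partitioning (not MongoDB's actual code).

    Sorted split points cut the shard-key space into contiguous chunks,
    and each chunk is owned by exactly one shard.
    """

    def __init__(self, split_points, chunk_owners):
        if list(split_points) != sorted(split_points):
            raise ValueError("split points must be sorted")
        # N split points define N + 1 chunks.
        if len(chunk_owners) != len(split_points) + 1:
            raise ValueError("need one owner per chunk")
        self.split_points = list(split_points)
        self.chunk_owners = list(chunk_owners)

    def shard_for(self, key):
        # Chunk i covers [split_points[i - 1], split_points[i]); a key equal
        # to a split point belongs to the chunk on its right.
        return self.chunk_owners[bisect.bisect_right(self.split_points, key)]


# Hypothetical layout: four chunks, one per shard.
router = RangeRouter([250, 500, 750], ["shard1", "shard2", "shard3", "shard4"])
print(router.shard_for(10), router.shard_for(250), router.shard_for(900))
```

The caller only supplies a key and never names a shard, which is the sense in which routing is transparent. Because Python compares tuples element by element, the same lookup also works for compound keys such as `(year_month, user_id)`.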
  18. Sharding – When to use?
      • Storage: your data set approaches or exceeds the storage capacity of a single node in your system
      • RAM: the size of your system's active working set will soon exceed the maximum amount of RAM available to your system
      • Write load: your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention
  19. Shard Keys
      • Shard keys exist in every document in a collection, and MongoDB uses them to distribute documents among the shards. Like indexes, they can be either a single field or a compound key.
      • The ideal shard key:
        – Is easily divisible, which makes it easy for MongoDB to distribute content among the shards
        – Has higher "randomness"
        – Supports targeted queries
        – May need to be computed
  20. Choosing the Right Shard Key: two approaches were compared
      • Approach 1: Random key – UserId
      • Approach 2: Coarsely ascending key + random key – YearMonth + UserId
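The two approaches can be sketched as shard-key definitions. The slides do not show the actual schema, so the field names `yearMonth` and `userId`, the integer YYYYMM encoding and the `eventlog.events` namespace below are assumptions made for illustration; `shardCollection` is the MongoDB admin command that takes a namespace and a key pattern.

```python
from datetime import datetime


def shard_key(user_id, event_time):
    """Approach 2: a coarsely ascending prefix followed by a random-like suffix.

    Encoding YearMonth as an integer such as 201303 is an assumption.
    """
    return {
        "yearMonth": event_time.year * 100 + event_time.month,
        "userId": user_id,
    }


def shard_collection_command(namespace, key_fields):
    """Build the document for the `shardCollection` admin command.

    A value of 1 marks an ascending, range-partitioned key field.
    """
    return {"shardCollection": namespace, "key": {f: 1 for f in key_fields}}


# Hypothetical namespace; field order in the key pattern matters.
approach_1 = shard_collection_command("eventlog.events", ["userId"])
approach_2 = shard_collection_command("eventlog.events", ["yearMonth", "userId"])

print(shard_key(42, datetime(2013, 3, 15)))
```

With a driver such as PyMongo, a command document like `approach_2` could be sent through the `admin` database once sharding is enabled for the database. The coarse prefix keeps a given month's documents in adjacent chunks, while the UserId suffix spreads writes within that month.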
  21. Benchmarking / Load Testing Approach: automated scripts with varied load
  22. Results - INSERTS (benchmarks done on 8GB test hardware machines)
      • Approach 1: over 80 million documents inserted, with a decreasing threshold over 10 million
      • Approach 2: over 225 million documents inserted at a stable rate of 6000 documents/sec
  23. Results - UPDATES (benchmarks done on 8GB test hardware machines)
      • Approach 1: over 50 million documents updated at an average of 400 documents/sec
      • Approach 2: over 100 million documents updated at up to 4000 documents/sec
  24. Results – INSERT, UPDATE with Approach 2 (benchmarks done on 8GB test hardware machines)
      • Simultaneous INSERT: >6000 documents/second, >70 million records
      • Simultaneous UPDATE: >6000 documents/second, >50 million records
  25. Benchmarking – Sharding vs. Non-Sharding (benchmarks done on 8GB test hardware machines)

      Operation          Sharding (YearMonth + UserId)       Non-Sharding
      INSERTS            ~6000 docs/sec                      ~2900 docs/sec
      UPDATES            ~4000 docs/sec                      ~620 updates/sec
      INSERT & UPDATES   ~6000 docs/sec & ~6100 docs/sec     ~2000 docs/sec & ~600 docs/sec
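The comparison above is easier to read as ratios. The short calculation below uses the approximate single-operation rates from the table (the variable names are illustrative) to derive the speedup from sharding and the equivalent hourly volume.

```python
# Approximate throughput from the comparison table, in operations per second.
sharded = {"inserts": 6000, "updates": 4000}
non_sharded = {"inserts": 2900, "updates": 620}


def speedup(op):
    """Ratio of sharded to non-sharded throughput for one operation type."""
    return sharded[op] / non_sharded[op]


def per_hour(rate_per_sec):
    """Convert a per-second rate into operations per hour."""
    return rate_per_sec * 3600


for op in sharded:
    millions = per_hour(sharded[op]) / 1e6
    print(f"{op}: {speedup(op):.1f}x faster, {millions:.1f} million/hour sharded")
```

On these figures sharding gives roughly a 2.1x gain for inserts and a 6.5x gain for updates, which suggests the update path benefited most from distributing writes.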
  26. Key Takeaways
      • Take a comprehensive approach to performance tuning
      • Plan early for performance
      • MongoDB scales & shines
      • Sharding scales INSERTS/UPDATES compared with non-sharding
      • Sharding with Approach 2 (coarsely ascending key + random key) provides sustained results & better utilization of RAM
      • Use a different set of servers for NON-sharded collections
      • Define indexes carefully
      • Keep the number of indexes on sharded collections to a minimum
  27. Thank You. Any Questions? Making Open Source Work. For queries reach out to us at