Scaling Teams, Processes and Architectures

A general presentation on scalability challenges, given at the first London Scalability Meetup, with a quick overview of the DataSift architecture.

Published in: Technology, Business

  1. Lorenzo Alberton (@lorenzoalberton). Scaling Teams, Processes and Architectures: Managing growth. London Scalability Group, Innovation Warehouse, 16th April 2012
  2. Scalability Is About... People, Processes, Technology
  3. People: Staffing, Roles, Management, Teams
  4. Staffing: Never compromise. Only hire people smarter than you. (Image: http://www.earthrangers.com/content/wildwire/toxic_spill.jpg)
  5. Staffing: Hire people who can fit the company culture. Promote fun in your working environment.
  6. Staffing: Beware of toxic people.
  9. Team Size and Structure. Too small: micromanaging managers, overworked team members, can’t accomplish much. Too big: poor communication, low morale, low productivity. [Org charts: a functional structure, with the CTO over PMs and separate groups of Designers, Developers and Testers; and a matrix structure, where Proj 1 to Proj 4 each have a PM, Designer, Developer and Tester drawn from those groups]
  10. Processes
  15. Why are processes critical? Improve management of teams and employees; standardise actions in repetitive tasks; reduce mundane decisions to focus on grander ideas; allow the team to react quickly to crisis; determine system capacity and scalability needs. Challenge: the right amount of the right process at the right time.
  17. Determining Headroom: Capacity vs. Current Load. Why? Capacity planning, annual budget, hiring plan, prioritisation.
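
A minimal sketch of the headroom idea on the slide above, assuming headroom is the usable share of maximum capacity minus the current load. The `ideal_usage` fraction, the compound monthly growth model and every figure below are hypothetical and chosen only for illustration; they do not come from the talk.

```python
import math


def headroom(max_capacity: float, current_load: float, ideal_usage: float = 0.5) -> float:
    """Spare capacity left before load reaches the usable share of max capacity."""
    return max_capacity * ideal_usage - current_load


def months_until_exhausted(max_capacity: float, current_load: float,
                           monthly_growth: float, ideal_usage: float = 0.5) -> float:
    """Months of compound growth before the headroom is used up.

    Assumes current_load > 0. Returns 0 if the load already reached the
    usable capacity, and infinity if the load is not growing.
    """
    usable = max_capacity * ideal_usage
    if current_load >= usable:
        return 0
    if monthly_growth <= 0:
        return math.inf
    return math.ceil(math.log(usable / current_load) / math.log(1 + monthly_growth))
```

For example, with a hypothetical maximum of 10,000 requests per second, a 50% ideal usage and a current load of 3,000, the headroom is 2,000 requests per second; a figure like this can then feed capacity planning, budgeting and prioritisation.
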
  19. Controlling Change: Determine Risk (http://dilbert.com/strips/comic/2008-05-08/)
  20. Risk Management: Risk is cumulative. Determine limits and tolerance.
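
One way to read "risk is cumulative" is that the changes scheduled in the same release window add up, and the total is compared against an agreed tolerance. The sketch below assumes a simple additive model; the `Change` class, the 1 to 5 scoring scale and the limit are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    name: str
    risk: int  # hypothetical score, e.g. 1 (trivial) to 5 (high)


def cumulative_risk(changes: List[Change]) -> int:
    """Total risk of all changes in the same release window."""
    return sum(change.risk for change in changes)


def can_release(scheduled: List[Change], candidate: Change, limit: int) -> bool:
    """Accept the candidate only if the window stays within the agreed limit."""
    return cumulative_risk(scheduled) + candidate.risk <= limit
```

A window that already holds a risky change can then reject another risky one while still accepting low-risk fixes.
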
  21. Load / Stress Testing. Load testing: identify, document and eliminate bottlenecks through a strict, controlled process of measurement and analysis; measure the system’s response and stability; verify the app can meet the desired performance objectives (SLA). Stress testing: determine the app’s stability when subjected to above-normal loads; verify the app’s behaviour when close to the breaking point; test the application’s recoverability (negative testing).
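
As a rough illustration of the load-testing step above, the sketch below calls a handler concurrently, records latencies and checks a percentile against an SLA. It is a toy harness using only the standard library, not a replacement for a real load-testing tool; the function names, the request counts and the nearest-rank percentile method are assumptions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(handler):
    """Run handler once and return its duration in seconds."""
    start = time.perf_counter()
    handler()
    return time.perf_counter() - start


def load_test(handler, requests=200, concurrency=10):
    """Call handler `requests` times from `concurrency` threads; return sorted latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: timed_call(handler), range(requests)))
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def meets_sla(sorted_latencies, pct, limit_seconds):
    """True if the given latency percentile is within the SLA limit."""
    return percentile(sorted_latencies, pct) <= limit_seconds
```

Raising `requests` and `concurrency` beyond the expected production load turns the same harness into a crude stress test, where the interesting result is how the system behaves and recovers rather than whether it meets the SLA.
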
  22. Barrier Conditions: Code reviews; manual and automated QA processes; performance and stress testing; release documentation checks (runbook); Dev, Test, Stage and Live environments; instrumentation checks. Protection from significant failures.
  23. Technology: Architecting Scalable Solutions
  33. Architectural Principles: N + 1 design; design for rollback; design to be disabled; design to be monitored; design for multiple live sites; use mature technology; asynchronous design; stateless systems; buy when non-core.
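
To make "design to be disabled" concrete, here is a minimal sketch of a feature flag guarding a non-essential feature, so that it can be switched off at runtime without a deployment. The `FeatureFlags` class, the `recommendations` flag and `render_page` are hypothetical names; a real system would more likely load flags from configuration or a shared store.

```python
class FeatureFlags:
    """In-memory flag store, standing in for config or a database."""

    def __init__(self, **flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown flags default to off.
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False


def render_page(flags, recommend):
    """Build a page; the recommendations block is optional and can be disabled."""
    page = {"content": "article body"}
    if flags.is_enabled("recommendations"):
        try:
            page["recommendations"] = recommend()
        except Exception:
            # Degrade gracefully rather than fail the whole page.
            page["recommendations"] = []
    return page
```

The core content is always served; only the optional block depends on the flag, which is what makes it safe to turn off under load or during an incident.
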
  34. Stateless, Asynchronous Systems (image: http://upload.wikimedia.org/wikipedia/commons/4/46/Synchronized_swimming_-_Russian_team.jpg)
  39. Fault Isolative Structures: Increase availability; limit impact of failures; easier debugging. First: functions causing repetitive problems; natural layout or topology of the site.
  43. Caching for Performance and Scale. Object Caches: usually serialized (marshalling / unmarshalling); get() / set() / replace(); APC, Memcached. Application Caches: proxy caches; reverse proxy caches; HTTP headers; ISP/Uni proxies; Squid, Varnish, mod_cache. CDNs: multiple locations / backbones; CNAME entries; Akamai, Coral, Limelight...
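
The object-cache column above can be sketched as a get() / set() store with serialized values, plus a read-through helper. This is an in-memory stand-in and not the Memcached or APC API; the `ObjectCache` class, the `cached_fetch` helper and the TTL default are assumptions made for illustration.

```python
import pickle
import time


class ObjectCache:
    """Tiny in-memory cache that serializes values, as object caches usually do."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl=60.0):
        # Marshal the value and remember when it expires.
        self._store[key] = (pickle.dumps(value), time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        data, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return pickle.loads(data)


def cached_fetch(cache, key, loader, ttl=60.0):
    """Return the cached value, or load it from the slow source and cache it.

    A loader result of None is never cached, since None signals a miss here.
    """
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl)
    return value
```

Only the first request for a key pays the cost of the slow source; later requests are served from the cache until the entry expires.
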
  46. Managing “Big Data”. The more storage... the more storage management: storage costs; people and software; power and space; processing power; backup time and costs. Evaluate data retention policy; consider multi-tiered storage; distribute data/work (Hadoop, M/R).
  50. Monitoring: Measure Everything (StatsD). 1. Is there a problem? User experience / business metrics monitors. 2. Where is the problem? System monitors (threshold - variance). 3. What is the problem? Application monitors. Keep the signal-to-noise ratio high.
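
Since the slide names StatsD, here is a minimal sketch of emitting metrics in the StatsD plain-text format (`name:value|type`, with `c` for counters, `ms` for timers and `g` for gauges) over UDP. The `StatsdClient` class and the metric names are hypothetical; port 8125 is the conventional StatsD default, and a production setup would normally use an existing client library instead.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a StatsD line such as 'api.requests:1|c'."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Fire-and-forget UDP sender: monitoring must never break the application."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._address = (host, port)
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self._socket.sendto(payload.encode("ascii"), self._address)
        except OSError:
            pass  # drop the metric rather than raise into application code

    def incr(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def timing(self, name, milliseconds):
        self._send(format_metric(name, milliseconds, "ms"))

    def gauge(self, name, value):
        self._send(format_metric(name, value, "g"))
```

Because UDP is connectionless, calls such as `StatsdClient().incr("signups")` cost very little, which is what makes "measure everything" affordable in practice.
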
  51. DataSift Architecture: Some Architecture Pr0n
  54. DataSift Architecture: SOA - loosely coupled, independently scalable services. Simple APIs (example). http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
  55. SOA - Scale Each Component
  57. Our Stack. Languages: C++, PHP, Java, Scala, Ruby, Node.js. Storage: MySQL, HBase. Cache: Memcached, APC, Redis. Queues: ZeroMQ, Kafka, Redis. Development/Deployment: Git, Jenkins CI, RPM, Chef. Monitoring: StatsD + Graphite, Zenoss. Secret recipe: amazing people and working environment.
  58. Messaging. ZeroMQ: PUSH-PULL, REQ-REP, PUB-SUB (multicast, broadcast); internal communication: pass messages to the next processing stage, control events, monitoring. Kafka/Redis: PUSH-PULL with persistence; internal message / workload buffering and distribution. Node.js: WebSockets / HTTP Streaming; message delivery (output).
  59. 0mq PUSH-PULL (workload distribution): Consumer 1, Consumer 2, Consumer 3 [Round-Robin-ish]
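
The workload-distribution pattern above can be modelled without ZeroMQ itself. The sketch below is a pure-Python simulation of strict round-robin assignment across consumers; it does not use the ZeroMQ API, and real PUSH sockets are only "round-robin-ish", as the slide says, because delivery also depends on which consumers are connected and ready. The `distribute` function and the consumer names are hypothetical.

```python
from itertools import cycle


def distribute(messages, consumers):
    """Assign each message to the next consumer in turn (strict round-robin)."""
    assignments = {name: [] for name in consumers}
    for message, name in zip(messages, cycle(consumers)):
        assignments[name].append(message)
    return assignments
```

Adding a consumer to the list spreads the same workload more thinly, which is the property that lets a processing stage scale horizontally.
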
  60. 0mq PUB-SUB (High Availability): Publisher 1, Publisher 2; Listener 1, Listener 2, Listener 3 [Broadcast] [Dynamic Subscriptions]
  61. 0mq PUB-SUB (High Availability) [Diagram labels: DC 1, Publisher 1, Publisher 2, DC 2]
  62. Internal “Firehose” [Diagram: Publishers and Subscribers connected through a Data Bus carrying topics X, Y and Z; subscribers subscribe to topic X or to topic Y. Labels: Alice’s timeline, John’s Inbox, System Monitor, Fred’s Followers, Tech Blog Feed]
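
The internal "firehose" can be pictured as a topic-based data bus: publishers send to a topic and every current subscriber of that topic receives a copy. The sketch below is an in-process illustration only, not the DataSift implementation or the ZeroMQ API; the `DataBus` class and its method names are hypothetical.

```python
from collections import defaultdict


class DataBus:
    """In-process topic bus with dynamic subscriptions."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic, callback):
        self._subscribers[topic].remove(callback)

    def publish(self, topic, message):
        """Broadcast to every current subscriber; return how many received it."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)
```

Publishers never need to know who is listening, so new consumers such as a monitor or a feed can be attached or detached at runtime without touching the publishing side.
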
  63. Instrumentation (https://play.google.com/store/apps/details?id=net.networksaremadeofstring.rhybudd)
  64. We’re Hiring! http://datasift.com/whoweare/jobs
  65. References: M. L. Abbott, M. T. Fisher, “The Art Of Scalability”, Addison Wesley. http://theartofscalability.com/ http://www.slideshare.net/quipo/the-art-of-scalability-managing-growth http://www.slideshare.net/postwait/scalable-internet-architecture http://bit.ly/IJKwuc http://agile.dzone.com/news/approaches-organizational https://bitly.com/vCSd49
  66. Lorenzo Alberton (@lorenzoalberton). Thank you! lorenzo@alberton.info http://www.alberton.info/talks Questions?
