Scaling teams, processes and architectures

A talk on the soft side of scalability, covering team management, process implementation and some solid technology-related principles. Based on 10 years of experience building scalable teams and scalable data platforms.

  1. 1. MANAGING GROWTH: SCALING TEAMS, PROCESSES, ARCHITECTURES. Lorenzo Alberton, CTO @ DataSift. MEST, Accra, 10 December 2017
  2. 2. LORENZO ALBERTON Chief Technology Officer, DataSift @lorenzoalberton
  4. 4. SCALABILITY IS ABOUT… People Technology Processes TRUE FOUNDATION
  5. 5. PART 1. PEOPLE Staffing, Roles, Management, Teams
  6. 6. CULTURE ➤ Treat people as volunteers (*) ➤ Lead by living the values you promote ➤ Respect, collaboration ➤ Promote fun in the workplace ➤ Culture of safety at work (**) (*) Peter Drucker (**) Google, Project Aristotle
  7. 7. EFFECTIVE TEAMS: PROJECT ARISTOTLE (2012) Psychological safety: team climate characterised by interpersonal trust and mutual respect in which people are comfortable being themselves. Feeling free to share the things that scare us without fear of recriminations. Behaviours: conversational turn-taking and empathy.
  8. 8. TEAMS VS. INDIVIDUAL CONTRIBUTORS ➤ Beware of toxic people ➤ Value communication and team work over super-heroes (*) Sunday afternoon test
  9. 9. STAFFING Don’t hire experts. Technologies come and go. Focus more on people with passion and less on people with specific skills.
  10. 10. TEAM SIZE ➤ Never underestimate the power of a small team ➤ Small teams force alignment and focus ➤ Bigger teams need an insane amount of overhead ➤ Parkinson's Law: “Work expands to fill the time available for its completion” (busy work: work that keeps a person busy but has little value in itself)
  11. 11. TEAM STRUCTURE No artificial boundaries around languages or skills. Try cross-functional teams (less friction, better end-to-end collaboration, project ownership)
  12. 12. MIDDLE-MANAGEMENT CURSE Mistakes: ➤ Prematurely re-organise for scale (deep hierarchy, over-specialisation) ➤ Process managers (factory mentality) vs Problem solvers ➤ Micromanagement ➤ Non-engineering culture ➤ 1-on-1s as calendar-filler ➤ Not being “on the ground” ➤ Over-confidence in tooling ➤ OTOH, coordination can be hard
  13. 13. PART 2. PROCESSES How to make day to day operations smooth
  14. 14. WHY ARE PROCESSES CRITICAL? Ease management of teams/projects; standardise actions in repetitive tasks; reduce mundane decisions to focus on grander ideas; allow the team to react quickly to crisis. ➤ A process shouldn’t exist for the sake of it ➤ Introduce processes gradually, only keep what works ➤ Don’t put too much confidence in tools alone to fix issues
  15. 15. EXAMPLE PROCESSES ➤ Development methodology ➤ Risk / Benefit analysis ➤ Prioritisation / Planning ➤ Design and code reviews ➤ Evaluating headroom / scale ➤ Load / Stress testing ➤ Test automation ➤ Deployment automation ➤ Release checklists ➤ Risk assessment/management ➤ Blameless postmortems
  16. 16. PROMOTING SYSTEMS TO PROD: BARRIER CONDITIONS (protection from significant failures) ➤ Code reviews ➤ Dev, Test, Stage and Live environments ➤ Manual and automated QA processes ➤ Performance and stress testing ➤ Release checklists (runbook) ➤ Instrumentation checks ➤ Testing roll-back capability
  17. 17. DESIGN AND CODE REVIEWS ➤ Promote collaboration ➤ Validate ideas, assess risk, detect flaws, simplify the solution ➤ Reason about behaviour before coding DAILY STAND-UPS ➤ Important for knowledge sharing, collaboration, alignment
  18. 18. CONTROLLING CHANGE: RISK ESTIMATION ➤ Limit / log the impact of changes ➤ Risk assessment methodologies: • Gut feeling / finger in the air • Semaphore method • Failure Mode and Effect Analysis
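As an illustration of the last methodology: Failure Mode and Effect Analysis conventionally scores each failure mode for severity, occurrence and detectability, and multiplies the three into a Risk Priority Number (RPN). The sketch below assumes the usual 1 to 10 scales; the failure modes and their scores are made up for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught early) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: higher means the change deserves more scrutiny.
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes for a release (illustrative values only).
modes = [
    FailureMode("config typo in feature flag", 4, 5, 2),
    FailureMode("schema migration locks table", 8, 3, 4),
    FailureMode("cache stampede after deploy", 5, 6, 3),
]

# Review the riskiest items first.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:4d}  {m.name}")
```

Ranking by RPN gives the team a shared, repeatable ordering instead of relying on gut feeling alone.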
  19. 19. RISK MANAGEMENT ➤ Risk is cumulative ➤ Determine limits and tolerance ➤ Stress, long hours, peer pressure can multiply risk
  20. 20. WHEN/WHAT TO SCALE: DETERMINING HEADROOM (chart: Capacity vs. Current Load) Why? Budget plan, prioritisation, hiring plan. Determine starting point, remaining capacity, expected demand
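A rough sketch of a headroom estimate, assuming load grows at a constant compound monthly rate and that only a fraction of raw capacity is safely usable. The function names, the 80% usable fraction and the example numbers are assumptions for illustration, not figures from the talk.

```python
import math


def headroom(capacity: float, current_load: float, usable_fraction: float = 0.8) -> float:
    """Remaining capacity before hitting the safe operating limit."""
    return capacity * usable_fraction - current_load


def months_until_exhausted(capacity: float, current_load: float,
                           monthly_growth: float, usable_fraction: float = 0.8) -> float:
    """Whole months until load reaches the safe limit, under compound growth."""
    limit = capacity * usable_fraction
    if current_load >= limit:
        return 0  # already out of headroom
    if monthly_growth <= 0 or current_load <= 0:
        return math.inf  # no growth (or no load): the limit is never reached
    return math.ceil(math.log(limit / current_load) / math.log(1 + monthly_growth))


# Hypothetical system: 10,000 req/s capacity, 4,000 req/s today, 10% growth per month.
print(headroom(10_000, 4_000))                      # remaining req/s before the safe limit
print(months_until_exhausted(10_000, 4_000, 0.10))  # months left at this growth rate
```

The second number is what feeds the budget, prioritisation and hiring plans: it says how long the team has before scaling work stops being optional.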
  21. 21. LOAD TESTING ➤ Identify, document and eliminate bottlenecks through a strictly controlled process of measurement and analysis ➤ Measure the system’s response and stability ➤ Verify the app can meet the desired performance objectives (SLA) ➤ Establish success criteria, test environment, tests, what needs to be monitored, what data needs to be collected
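One way to turn "success criteria" into something checkable is a percentile latency objective evaluated over the samples collected during a load test. This is a minimal sketch using the nearest-rank percentile definition; the function names, the 95th percentile and the 200 ms objective are assumptions, not values from the talk.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_objective(latencies_ms, p: int, objective_ms: float) -> bool:
    """True if the p-th percentile latency is within the objective."""
    return percentile(latencies_ms, p) <= objective_ms


# Hypothetical load-test result: 100 requests with latencies 1..100 ms.
latencies = list(range(1, 101))
print(percentile(latencies, 95))            # 95
print(meets_objective(latencies, 95, 200))  # True
```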
  22. 22. STRESS TESTING ➤ Determine the app’s stability when subjected to above-normal loads ➤ Verify the app’s behaviour when close to the breaking point ➤ Positive testing: progressively increase load to overwhelm the system’s resources ➤ Negative testing: take away resources (memory, threads, connections) to test the application’s recoverability
  23. 23. PART 3. TECHNOLOGY Architecting Robust, Scalable Solutions
  24. 24. DO NOT SCALE UNTIL YOU CAN’T AVOID IT ANYMORE ➤ “Go meet your people. Do things that don’t scale.” (Paul Graham to Airbnb’s founders) ➤ Solve for specific problems ➤ Don’t generalise until you have rebuilt something for the 3rd time ➤ Don’t over-engineer the solution ➤ Automate repetitive and error-prone tasks ➤ Avoid complicating things ✴ Phone system
  25. 25. MVP APPROACH ➤ Test ideas before spending a year building something you haven’t proven in the market first ➤ Fake it till you make it ➤ Example: Zappos
  26. 26. ARCHITECTURAL / DESIGN PRINCIPLES ➤ N + 1 nodes ➤ Design for rollback ➤ Design to be disabled (feature flags) ➤ Design to be monitored ➤ Design for multiple live systems/sites ➤ Use mature technology ➤ Asynchronous communications ➤ Stateless systems ➤ +1 buy when non core
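To illustrate the "to be disabled (feature flags)" principle, here is a minimal sketch of a guarded code path with a safe fallback. The in-memory `FLAGS` dictionary, the flag name and the `recommendations` function are hypothetical; a real system would typically read flags from configuration or a flag service.

```python
# Hypothetical in-memory flag store (assumption: real flags live in config or a service).
FLAGS = {"new_recommendations": True}


def is_enabled(flag: str) -> bool:
    # Unknown flags default to off, so a missing entry never enables new code.
    return FLAGS.get(flag, False)


def recommendations(user_id: int):
    if is_enabled("new_recommendations"):
        # New, riskier code path.
        return [f"personalised-for-{user_id}"]
    # Known-good fallback used when the feature is switched off.
    return ["bestsellers"]


print(recommendations(7))            # new path while the flag is on
FLAGS["new_recommendations"] = False  # "disable" the feature without a redeploy
print(recommendations(7))            # fallback path
```

Switching the flag off is much cheaper than a rollback, which is why the two principles sit next to each other on the slide.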
  27. 27. FAULT-TOLERANT STRUCTURES ➤ Swim lanes: isolate and limit the impacts of failure within the system by segmenting pipelines ➤ Barrier and Guide (shard) ➤ Increase availability ➤ Make incidents easier to detect, identify and resolve ➤ Favour the transactions making the company money first ➤ Isolate functions causing repetitive problems (or busy tenants) ➤ Consider the natural layout or topology of the site
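A toy sketch of the swim-lane idea: each lane gets its own bounded queue, so an overloaded lane sheds its own load without affecting the others. The lane names and the tiny queue size are assumptions chosen to keep the example short.

```python
import queue

# Hypothetical lanes, each with its own small bounded buffer.
LANES = {
    "checkout": queue.Queue(maxsize=2),
    "search": queue.Queue(maxsize=2),
}


def submit(lane: str, request: str) -> bool:
    """Enqueue a request in its lane; reject it if that lane is saturated."""
    try:
        LANES[lane].put_nowait(request)
        return True
    except queue.Full:
        return False  # only this lane sheds load; other lanes are untouched


# A burst of search traffic fills the search lane...
results = [submit("search", f"q{i}") for i in range(3)]
print(results)                      # the third search request is rejected
# ...while the revenue-generating checkout lane keeps accepting work.
print(submit("checkout", "order-1"))
```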
  28. 28. SCALING IN DIFFERENT DIRECTIONS (AKF Scaling Cube, “The Art of Scalability”, M. L. Abbott, M. T. Fisher) x: cloning of services and data without any bias (e.g. more serving nodes in a worker pool where any node can do the work) y: separation of work responsibility by type of data or type of work (different specialised worker pools) z: separation of work by customer or requestor (dedicated highly specialised worker pools)
  29. 29. SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS x: cloning of entities or data - unbiased distribution of work (LB, web site mirror 1, web site mirror 2) y: separation of work by activity or data (search server, shopping cart server) z: separation of work by person for whom the work is done (premium site, standard site)
  30. 30. SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS x: mirroring (+ scale transactions, - scale data) y: split by service (+ scale isolation, + scale function data, - scale customer data) z: split by need / location / value (+ scale isolation, + scale customer data, - scale function data)
  31. 31. SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA x: data cloning (replication / clustering) + load balancer y: split different things by service / resource / data affinity z: split similar things by modulus / hash-based lookups (diagram labels: copy 1, copy 2, copy 3; ABC, DEF, GHI)
  32. 32. SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA x: data cloning (replication / clustering) + load balancer (+ easy to implement, + scale transaction volume, + useful in case of high read to write ratio, - scale data size and growth) y: split different things by service / resource / data affinity (+ fault isolation, + reduce query time, - more difficult, - data migration) z: split similar things by modulus / hash-based lookups (+ uniformly balanced demand, + fault isolation, + scale data and transactions, - more costly)
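The z-axis split "by modulus / hash-based lookups" can be sketched as follows. The shard names and the use of SHA-256 are assumptions for illustration; the important property is a hash that is stable across processes, which Python's built-in `hash()` is not for strings.

```python
import hashlib

# Hypothetical shard names.
SHARDS = ["db-0", "db-1", "db-2"]


def shard_for(customer_id: str) -> str:
    """Map a customer to a shard with a stable hash followed by a modulus."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[bucket]


# The same customer always lands on the same shard.
print(shard_for("customer-42") == shard_for("customer-42"))  # True
```

A plain modulus remaps most keys whenever the number of shards changes, so growing the shard count implies a data migration; that is part of the cost this approach carries.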
  33. 33. QUEUES ➤ Asynchronous communication ➤ Workload distribution ➤ Failure isolation
  34. 34. MESSAGE QUEUES AS BUFFERS (ASYNC COMM - DECOUPLING) Source / producer side: unpredictable load spikes. Sink / consumer side: load normalisation / smoothing. Batching ⇒ higher throughput
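A small single-threaded simulation of a queue acting as a buffer: arrivals are bursty, while the consumer drains a fixed-size batch per tick, so the downstream load stays flat. The arrival pattern and batch size are made-up values for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, batch_size):
    """Return (messages processed per tick, queue depth after each tick)."""
    buffer = deque()
    processed, depth = [], []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Producer side: a burst of messages lands in the buffer.
        for _ in range(arrivals):
            buffer.append(next_id)
            next_id += 1
        # Consumer side: drain at most one batch per tick.
        batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
        processed.append(len(batch))
        depth.append(len(buffer))
    return processed, depth


processed, depth = simulate([0, 9, 0, 0, 1, 0], batch_size=3)
print(processed)  # [0, 3, 3, 3, 1, 0]: the spike of 9 is smoothed into batches of 3
print(depth)      # [0, 6, 3, 0, 0, 0]: the queue absorbs the spike, then empties
```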
  35. 35. WORKLOAD DISTRIBUTION - LOAD BALANCING The Producer pushes to the queue; Consumer 1, Consumer 2 and Consumer 3 each pull from it
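The push/pull arrangement on this slide is often called competing consumers. A minimal sketch with the standard library, assuming three consumer threads and an in-process `queue.Queue` standing in for a real message broker:

```python
import queue
import threading


def run_pool(jobs, n_consumers=3):
    """Push all jobs onto one queue and let n consumers pull until it is empty."""
    q = queue.Queue()
    for job in jobs:  # producer: push
        q.put(job)

    done = {cid: [] for cid in range(n_consumers)}

    def consumer(cid):
        while True:
            try:
                job = q.get_nowait()  # consumer: pull
            except queue.Empty:
                return  # nothing left to do
            done[cid].append(job)

    threads = [threading.Thread(target=consumer, args=(cid,)) for cid in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


done = run_pool(list(range(20)))
all_jobs = sorted(job for jobs in done.values() for job in jobs)
print(all_jobs == list(range(20)))  # every job handled exactly once
```

Which consumer handles which job is not deterministic; the queue only guarantees that each job is delivered to exactly one of them.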
  36. 36. MULTIPLEXING Producer 1, Producer 2 and Producer 3 push their requests (R4; R1, R2, R3; R5, R6). The Consumer pulls with fair-queuing: R1, R4, R5, R2, R6, R3
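The fair-queuing order on this slide is what a simple round-robin over per-producer queues produces. The sketch below assumes one queue per producer; which producer owns which queue is not recoverable from the transcript, so the queues are identified only by their contents.

```python
from collections import deque


def fair_multiplex(producer_queues):
    """Round-robin: take one message from each non-empty queue in turn."""
    queues = [deque(q) for q in producer_queues]
    out = []
    while any(queues):  # an empty deque is falsy
        for q in queues:
            if q:
                out.append(q.popleft())
    return out


order = fair_multiplex([["R1", "R2", "R3"], ["R4"], ["R5", "R6"]])
print(order)  # ['R1', 'R4', 'R5', 'R2', 'R6', 'R3']
```

No single chatty producer can starve the others: the producer with three requests still only gets one slot per round.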
  37. 37. HIGH AVAILABILITY (PUB-SUB / BROADCAST) Publisher 1 and Publisher 2 [Broadcast] to Listener 1, Listener 2 and Listener 3 [Dynamic Subscriptions]
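A minimal in-process sketch of broadcast with dynamic subscriptions. The `Broker` class and its method names are hypothetical; a real deployment would use a message broker rather than direct callbacks.

```python
class Broker:
    """Toy pub-sub broker: every published message goes to all current subscribers."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, name, callback):
        self.subscribers[name] = callback

    def unsubscribe(self, name):
        self.subscribers.pop(name, None)

    def publish(self, message):
        # Copy the callbacks so subscribers may (un)subscribe while handling a message.
        for callback in list(self.subscribers.values()):
            callback(message)


broker = Broker()
inbox = {"listener-1": [], "listener-2": []}
broker.subscribe("listener-1", inbox["listener-1"].append)
broker.subscribe("listener-2", inbox["listener-2"].append)

broker.publish("m1")              # broadcast: both listeners receive it
broker.unsubscribe("listener-2")  # subscriptions can change at runtime
broker.publish("m2")              # only listener-1 receives it
print(inbox)
```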
  39. 39. MONITORING ➤ Measure all the things! ➤ Think about what metrics to track when you design your app: system/app/user level ➤ Engage with Ops / QA early on in the design phase ➤ Invest in a good monitoring solution ➤ Data integrity checks (bucket analysis, statistical analysis) ➤ Alerting and monitoring dashboards should be intuitive
  43. 43. OTHER SCALING TIPS ➤ Use caching aggressively (CDNs, app & object caches) ➤ Design to scale out horizontally ➤ Simplify scope, design, implementation: lean == fast ➤ Know latencies ➤ Relax temporal constraints ➤ Discuss and learn from mistakes ➤ Design for fault tolerance, graceful failure, and resilience ➤ Avoid SPOFs ➤ Avoid or distribute state ➤ Be competent
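As a small illustration of an application-level object cache, the sketch below memoises an expensive lookup with the standard library's `functools.lru_cache`. The `load_profile` function and the call counter are hypothetical stand-ins for a slow database or network call.

```python
from functools import lru_cache

backend_calls = 0  # counts how often the (pretend) slow backend is hit


@lru_cache(maxsize=1024)
def load_profile(user_id: int):
    """Pretend this is an expensive database or network lookup."""
    global backend_calls
    backend_calls += 1
    # Return an immutable value: cached results are shared between callers.
    return ("user", user_id)


load_profile(1)
load_profile(1)  # served from the cache, no backend call
load_profile(2)
print(backend_calls)              # 2
print(load_profile.cache_info())  # hit/miss statistics
```

An in-process cache like this is per instance; it trades freshness for speed, which ties in with the "relax temporal constraints" tip.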
  44. 44. REFERENCES scalability-managing-growth Made-Easy-QCon-London-2012 internet-architecture organizational M. L. Abbott, M. T. Fisher, “The Art of Scalability”, Addison-Wesley
  45. 45. @lorenzoalberton THANK YOU! /in/lorenzoalberton