1. From Zero to LOTS
ScaleCamp UK
Josh Devins, Software Architect
2. Who are we?
- Nokia Devices (duh) and Services (ovi.com)
- Nokia Maps (Berlin)
  - Device (native and WebKit-based clients)
  - Web: maps.ovi.com
- Map & Explore group
  - Place registration and management
  - Place discovery
6. The beginning
- Small group
  - New services division of Nokia
  - Big ambition
- Big company
  - Lots of stuff to do
- Early problems
  - No existing traffic to study
  - No idea how popular services will be
  - Lots of pressure to assume huge traffic
7. From 0 to N-1
- 200% increase in number of teams and team size
- Started transition from “chaos” to Scrum
- Initial launch of place services, summer 2009
  - Strict focus on basic feature set: core dataset, search, ratings
  - Start simple, but know where you need to get to
- ~6.3M places
- Web only
8. Iteration N-1 choices
- Two main teams, core competencies leveraged
  - EJB 3.0 + JBoss, Spring + Tomcat
- Support contracts in place
  - JBoss – JBoss AS, JBoss Messaging
  - MySQL – cluster, then InnoDB
- Existing operations group
  - Existing deployment mechanism
  - Static, read-only PXE Linux image
  - Used to deploying only every couple of months
9. N-1 technology stack
- Client
  - Firefox plugin
- Server
  - Java, Maven (Nexus), CI (Hudson)
  - RESTful aggregated services
  - EJB 3.0 + JBoss, Spring + Tomcat
  - JPA, Hibernate
  - JBoss Messaging
  - MySQL (master-master)
  - Apache 2
- Testing
  - JUnit, soapUI, JMeter
- Operations
  - PXE Linux-based server images (prod)
  - Debian
  - Nagios
10. From N-1 to N
- Today-ish
- 50% increase in number of teams and team size
- 120% increase in traffic
- 120% increase in number of places
- Focus on more community involvement and enhancing place metadata
  - Create a place
  - Prime Place (business-owner content)
  - Additional place metadata
- ~14M places
- Web and N900 devices
11. Iteration N choices
- Rapid development and release
  - Spring + Tomcat everywhere
  - Common configuration mechanism
  - Common logging infrastructure/mechanism
  - Standardized file system layout on servers
  - Automated static analysis with Sonar
- Slack in resources not matching growth requires automation
  - Built out replica QA environment with its own team
  - Puppet + Webistrano
  - Hyperic monitoring ($)
12. N technology stack
- Client
  - Plugin not required (although it enhances the experience)
  - JS frameworks: MooTools
- Server
  - Sonar
  - Spring + Tomcat (standardized)
  - Grails + Tomcat (administration)
  - RESTful APIs (external)
  - 2-legged OAuth
  - Nokia CDN
- Testing
  - Grinder, Selenium (some FitNesse)
  - Replicated QA environment
- Operations
  - Unchanged (prod)
  - Puppet, Debian packages, Webistrano (QA)
  - Hyperic (QA) and Nagios (prod)
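The external RESTful APIs are protected with 2-legged OAuth. As a rough sketch of what that involves (illustrative only, not the Ovi implementation — the key names and the fixed nonce are placeholders), a minimal OAuth 1.0 HMAC-SHA1 request signer looks like this:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def percent_encode(s: str) -> str:
    # OAuth 1.0 requires RFC 3986 encoding: only unreserved chars are safe.
    return urllib.parse.quote(s, safe="-._~")


def sign_request(method: str, url: str, params: dict,
                 consumer_key: str, consumer_secret: str) -> dict:
    """Return params extended with oauth_* fields and an HMAC-SHA1 signature.

    Two-legged OAuth: there is no user token, so the signing key is just
    "<encoded consumer secret>&".
    """
    oauth_params = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": "abc123",  # placeholder; use a random nonce in real code
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(int(time.time())),
        "oauth_version": "1.0",
    }
    all_params = {**params, **oauth_params}
    # Signature base string: METHOD & encoded-URL & encoded-sorted-params
    param_str = "&".join(
        f"{percent_encode(k)}={percent_encode(v)}"
        for k, v in sorted(all_params.items())
    )
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_str)]
    )
    signing_key = percent_encode(consumer_secret) + "&"  # no token secret
    digest = hmac.new(signing_key.encode(), base_string.encode(),
                      hashlib.sha1).digest()
    oauth_params["oauth_signature"] = base64.b64encode(digest).decode()
    return {**params, **oauth_params}
```

The signed parameters can then be sent as a query string or an `Authorization: OAuth …` header; the server repeats the same computation with its copy of the consumer secret and compares signatures.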
13. From N to N+1
- Planned for summer 2010
- 10% increase in team size (planned)
- 200% increase in traffic (expected)
- 100% increase in number of places
- Scalability, reliability and robustness
- Limited new feature set
  - It’s a secret… shhhhh…
  - Additional Navteq content
  - Additional premium content
- ~30M places
- Web and N900, S60 devices
14. Iteration N+1 choices
- Scale, and scale fast
  - Caching (HTTP/app? TBD, pending load testing)
  - Async business processes
  - Decouple/isolate persistence layers for protection and performance
  - Reconciliation/cleanup jobs
- Learning
  - Hadoop data warehouse
  - Trending and tracking
- Continued slack in operations resources
  - Push automation developed in the QA environment to production processes
  - Kickstart, Puppet, RPMs, yum
  - Hyperic monitoring (prod)
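The "async business processes" point is about taking slow work off the request path: acknowledge the client immediately and let a background consumer do the persistence. A toy sketch of the pattern with an in-process queue (the rating example is hypothetical; in the deck's stack this role is played by JBoss Messaging/ActiveMQ, not `queue.Queue`):

```python
import queue
import threading

# Pending jobs and completed writes; the list stands in for the real database.
jobs: "queue.Queue" = queue.Queue()
saved = []


def submit_rating(place_id: int, stars: int) -> str:
    """Web-tier handler: enqueue and return before the write happens."""
    jobs.put({"place_id": place_id, "stars": stars})
    return "accepted"


def worker() -> None:
    """Background consumer: drains the queue, doing the slow persistence."""
    while True:
        job = jobs.get()
        if job is None:      # shutdown sentinel
            break
        saved.append(job)    # stand-in for the real DB write
        jobs.task_done()
```

A slow or briefly unavailable persistence layer then backs up the queue instead of blocking user requests, which is exactly the decoupling the slide is after.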
15. N+1 technology stack
- Client
  - JS frameworks: combining the “good parts” from MooTools, Dojo, jQuery
  - SDK for Maemo devices
- Server
  - Varnish HTTP “accelerator” and/or app caching
  - ActiveMQ (RabbitMQ, Atom feeds, other?)
  - MySQL (master-master + N slaves)
- Operations
  - Kickstart, Puppet, RPMs, yum mirrors
  - CentOS
  - Hyperic (QA, prod)
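Moving from plain master-master to master-master + N slaves means the application has to route statements: writes to a master, reads spread across the slaves. A minimal illustration of such a router (names and the statement-sniffing heuristic are illustrative; real setups often use a driver- or proxy-level splitter instead):

```python
import itertools


class ConnectionRouter:
    """Toy read/write splitter: writes round-robin across masters,
    reads round-robin across slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, masters, slaves):
        self.masters = itertools.cycle(masters)
        self.slaves = itertools.cycle(slaves)

    def host_for(self, sql: str) -> str:
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return next(self.masters)
        return next(self.slaves)
```

The catch this sketch ignores is replication lag: a read-your-own-writes flow must be pinned to a master for some window after a write, or the user may not see their own change.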
16. The future
- Move out of the database
  - Search already based on Lucene, but still DB-backed results (good NoSQL candidate)
  - Complex place matching and de-duplication algorithms will bottom out
- Proxying and caching
  - Pragmatic approach: only where needed and where measured
  - Memcached, Ehcache + Terracotta, JBoss TreeCache, Ehcache L2 cache? Depends…
  - Protect ourselves against persistence-layer failures and spikes in traffic
- Multi-homed, co-location, worldwide application distribution
  - Continuity during outages, lower latency, legal (China)
  - Master/slave, master/master, Paxos?
- Application robustness
  - Robustness patterns (“Release It!”)
  - Partial failure/outage modes
  - Failure auto-detection and recovery (in the application)
- NoSQL
  - Pragmatic approach: likely to stick with MySQL until it falls over
  - Looking only at very special cases for NoSQL and k/v stores (like search results)
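One of the best-known robustness patterns from "Release It!" is the circuit breaker: after repeated failures against a dependency, stop calling it and fail fast, then probe again after a cooldown. A minimal sketch (thresholds and the injectable clock are illustrative choices, not from the deck):

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; while open, calls
    fail fast instead of hammering a sick dependency. After `reset_after`
    seconds, one trial call is allowed through (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast is what turns a persistence-layer outage into a degraded page instead of a thread-pool pile-up behind a hung backend — the "partial failure/outage modes" bullet above.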
17. A few lessons learned (so far)
- Consider possible sharding strategies and implications early
  - Semi-opaque IDs
- End-to-end continuous integration from day one
  - No matter how many components are involved, or how hard it may seem
- Scaling Scrum is really hard!
  - Self-organization works when you have great people
  - Ensure tools and support are in place to guide them from day one (static analysis, strong mentors, etc.)
  - Build truly cross-functional teams
  - Promote Agile everything from the inside out (your team, group, division, org)
- Automate, automate, automate
- Don’t be fooled by frameworks
  - Shipping quality production software requires in-depth knowledge of the frameworks you use
- Be humble – know when you need help
  - Find world-class support and use it
- Building an application with all of the *ilities:
  - Takes time, patience, expertise and flexibility
  - Requires the entire team, group, division and organization
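The "semi-opaque IDs" lesson is about leaving room for sharding in the public identifiers: the ID looks opaque to clients, but the service can extract routing information from it without a directory lookup. One common scheme (illustrative only; the bit layout is not from the deck) packs the owning shard into the low bits:

```python
SHARD_BITS = 8  # room for 256 shards; an illustrative choice


def make_place_id(sequence: int, shard: int) -> int:
    """Pack the owning shard into the low bits of the public ID."""
    assert 0 <= shard < (1 << SHARD_BITS), "shard out of range"
    return (sequence << SHARD_BITS) | shard


def shard_of(place_id: int) -> int:
    """Recover the shard from any ID, with no directory query."""
    return place_id & ((1 << SHARD_BITS) - 1)
```

Deciding this before launch matters because retrofitting routing information into IDs that are already in clients' bookmarks and caches means a painful migration.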
- Services group is “organizationally immature”
- Large shift in organizational thinking needed to get to a real Agile organization
- Core functionality built in two teams; future functionality built in two other teams
- All distributed components, no code sharing (except minor libraries), no integration between “now” and “future”
- Almost went live with MySQL cluster, but it proved to be unstable and unwarranted; simplified to master-master
- Legacy deployment mechanism
- No monitoring!
- Prime Place went live yesterday
- 2 hr deployment process
- Need more automation in QA and production
- Hyperic – moan, vendor lock-in, expensive, blah – but it gets the job done for us, and quickly; offers all the features we need, including JMX cluster management; no time for anything else right now
- Decouple through caching and proxying to ride out emergency outages and traffic spikes without scaling out, or to buy time to get hardware
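One concrete form of that decoupling is a cache that serves stale data when the persistence layer is down, instead of propagating the outage to clients. A sketch (class and parameter names are hypothetical):

```python
import time


class StaleOnErrorCache:
    """Entries expire after `ttl` seconds, but if refreshing from the
    backend fails, the stale copy is served rather than raising."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self.loader = loader      # callable: key -> value, may raise
        self.ttl = ttl
        self.clock = clock
        self.store = {}           # key -> (value, fetched_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]       # fresh hit
        try:
            value = self.loader(key)
        except Exception:
            if entry is not None:
                return entry[0]   # backend down: stale beats an error page
            raise                 # nothing cached, nothing to fall back on
        self.store[key] = (value, self.clock())
        return value
```

During a spike or outage this degrades gracefully to slightly-old place data, which for read-heavy traffic like place pages is usually the right trade.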
- Frequent production releases are hard with lots of devices to test – need some support from outside the team, which slows things down a lot
- Caching: prove it before you use it
- Sticking with MySQL due to support contracts, existing expertise, etc. – until it becomes an issue
- Build teams, don’t throw them together
  - It takes a lot of care and attention to scale Scrum teams
  - Baby steps – tackle Agile in the team, then promote good ideas and process to other teams, groups and out to the organization
  - Promote the goodness from within
- …then automate some more
  - Still struggling with automation; part of the problem is teams waiting for solutions instead of creating and proposing them themselves
- Patience: some people see all of the problems and flip out, want to give up, complain a lot – take it in stride and chip away where you can