

  1. 1. Keeping Your Software Ticking: Testing with Metronome and the NMI Lab
  2. 2. Background: Why (In a Slide!) <ul><li>Grid Software: Important to Science and Industry </li></ul><ul><li>Quality of Grid Software: Not So Much </li></ul><ul><li>Testing: Key to Quality </li></ul><ul><li>Testing Distributed Software: Hard </li></ul><ul><li>Testing Distributed Software Stacks: Harder </li></ul><ul><li>Distributed Software Testing Tools: Nonexistent (before) </li></ul><ul><li>We Needed Help, We Built Something to Help Ourselves and Our Friends, We Think It Can Help Others </li></ul>
  3. 3. Background: What (In a Slide!) <ul><li>A Framework and Tool: Metronome </li></ul><ul><ul><li>Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools </li></ul></ul><ul><ul><li>Portable, open source </li></ul></ul><ul><ul><li>Language/harness independent </li></ul></ul><ul><ul><li>Assumes >1 user, >1 project, >1 environment needing resources at >1 site. </li></ul></ul><ul><ul><li>Encourages explicit, well-controlled build/test environments for reproducibility </li></ul></ul><ul><ul><li>Central results repository </li></ul></ul><ul><ul><li>Fault-tolerant </li></ul></ul><ul><ul><li>Encourages build/test separation </li></ul></ul><ul><li>A Facility: The NMI Lab </li></ul><ul><ul><li>200+ cores, 50+ platforms @ UW (Noah’s Ark; the Anti-Cluster) </li></ul></ul><ul><ul><li>Built to use distributed resources at other sites, grids, etc. </li></ul></ul><ul><ul><li>200 users, dozens of registered projects (most of them “real”) </li></ul></ul><ul><ul><li>84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB </li></ul></ul><ul><li>A Team </li></ul><ul><ul><li>Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself. (More coming.) </li></ul></ul><ul><li>A Community </li></ul><ul><ul><li>Working with TeraGrid, OSG, ETICS, others towards a common intl. build/test infrastructure. </li></ul></ul>
  4. 4. Metronome Architecture (In a Slide!) [Diagram; labels: INPUT, Customer Source Code, Customer Build/Test Scripts, Spec File, Metronome, DAGMan, DAG, Condor Queue, build/test jobs, Distributed Build/Test Pool, results, MySQL Results DB, Web Status Pages, Finished Binaries, OUTPUT]
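The architecture slide names a spec file, a DAG, DAGMan, and a Condor queue of build/test jobs. As a rough illustration of how a list of target platforms might turn into a DAG that DAGMan can run, here is a minimal sketch. It is not Metronome's actual code: the `make_dag` helper, the platform names, and the `.sub` submit file names are all hypothetical. Only the `JOB` and `PARENT ... CHILD` lines follow DAGMan's input file syntax.

```python
def make_dag(platforms):
    """Return DAGMan input text: one fetch job, then build -> test per platform."""
    # Submit file names (fetch.sub, build_*.sub, test_*.sub) are hypothetical.
    lines = ["JOB fetch fetch.sub"]
    for p in platforms:
        lines.append(f"JOB build_{p} build_{p}.sub")
        lines.append(f"JOB test_{p} test_{p}.sub")
    # Dependencies: every build waits for the fetch, every test waits for its build.
    for p in platforms:
        lines.append(f"PARENT fetch CHILD build_{p}")
        lines.append(f"PARENT build_{p} CHILD test_{p}")
    return "\n".join(lines) + "\n"


# Platform names are made up for the example.
dag_text = make_dag(["x86_rhel", "ppc_aix"])
print(dag_text)
```

Because each platform gets its own build and test nodes, a failure on one platform does not have to block the others, which fits the fault-tolerance theme of the slides.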
  5. 5. Why Is This Architecture Powerful? <ul><li>Fault tolerance, resource management. </li></ul><ul><li>Real scheduler, not a toy or afterthought. </li></ul><ul><li>Flexible workflow tools. </li></ul><ul><li>Nothing to deploy in advance on worker nodes except Condor </li></ul><ul><ul><li>can harness “unprepared” resources. </li></ul></ul><ul><li>Advanced job migration capabilities </li></ul><ul><ul><li>critical for goal of a common build/test infrastructure across projects, sites, countries. </li></ul></ul>
  6. 6. Example: NMI Lab / ETICS Site Federation with Condor-C
  7. 7. 10k Foot View <ul><li>Past: </li></ul><ul><ul><li>humble beginnings, ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...) </li></ul></ul><ul><li>Present: </li></ul><ul><ul><li>now we have tax money and users should have higher expectations </li></ul></ul><ul><ul><li>good news: six months into a new three-year funding cycle, our “professionalism” has improved from our humble beginnings -- better hardware, better processes, better staffing </li></ul></ul><ul><ul><li>bad news: we’re still a bit ragtag -- inconsistent tracking of support/development requests, inconsistent info on resource/lab improvements, issues, and resolutions, generally reactive to problems </li></ul></ul><ul><ul><li>we’re clearly contributing to the build & test capabilities of the community, but we’d like to deliver much more, especially WRT testing. </li></ul></ul>
  8. 8. 10k Foot View: Future <ul><li>Maintain Metronome and the NMI Lab </li></ul><ul><ul><li>continue to professionalize lab infrastructure, improve availability, stability, uptime </li></ul></ul><ul><ul><li>Better monitoring -> more proactive response to issues </li></ul></ul><ul><ul><li>Better scheduling of jobs, better use of VMs to respond to uneven x86 platform demand </li></ul></ul><ul><li>Enhance Metronome and the NMI Lab </li></ul><ul><ul><li>New features, new capabilities – but might be less important than clarity, usability, fit & finish of existing features. </li></ul></ul>
  9. 9. 10k Foot View: Future <ul><li>Support Metronome and the NMI Lab </li></ul><ul><ul><li>more systematic support operation (ticketing, etc.) </li></ul></ul><ul><ul><li>more utilization of basic testing capabilities by new users </li></ul></ul><ul><ul><li>more utilization of advanced testing capabilities by existing users </li></ul></ul><ul><ul><li>more & better information for users, admins, and pointy-haired bosses </li></ul></ul><ul><ul><ul><li>better reporting on users, resources, usage, operations, etc. </li></ul></ul></ul><ul><li>Nurture Distributed Software Testing Community </li></ul><ul><ul><li>to identify common B&T needs to improve software quality. </li></ul></ul><ul><ul><li>to challenge and help us to provide software & services to help meet B&T needs. </li></ul></ul><ul><ul><li>Tuesday’s meeting was a good start, I hope… </li></ul></ul>
  10. 10. Maslow’s Pyramid of Testing Needs
  11. 11. Testing Opportunities <ul><li>more resources == more possibilities (just like science) </li></ul><ul><ul><li>don’t just test under normal conditions, test the not-so-edge cases too (e.g., with CPU load!) </li></ul></ul><ul><ul><li>test everywhere your users run, not just where you develop </li></ul></ul><ul><ul><li>old/exotic/unique resources you don’t own (NMI Lab, TeraGrid) </li></ul></ul><ul><li>“black box” </li></ul><ul><ul><li>run your existing test harness (Tinderbox, etc.) inside Metronome </li></ul></ul><ul><li>decoupled builds & tests </li></ul><ul><ul><li>run new tests on old builds </li></ul></ul><ul><ul><li>cross-platform binary compatibility testing </li></ul></ul><ul><ul><li>run quick smoke tests continuously, heavy tests nightly, performance/scalability tests before release </li></ul></ul>
  12. 12. Testing Opportunities <ul><li>managed (static) vs. “unmanaged” (auto-updating) platforms </li></ul><ul><ul><li>isolate your changes from the OS vendors’ changes </li></ul></ul><ul><ul><li>test your changes against a fixed target </li></ul></ul><ul><ul><li>test your working code against a moving target </li></ul></ul><ul><li>root-level testing </li></ul><ul><li>automated reports from testing tools </li></ul><ul><ul><li>Valgrind, Purify, Coverity, etc. </li></ul></ul><ul><li>cross-platform binary testing (build on A, test on B) </li></ul>
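For cross-platform binary testing (build on A, test on B), the set of jobs is every ordered pair of distinct platforms. A minimal sketch, with hypothetical platform names and a hypothetical `cross_pairs` helper:

```python
from itertools import permutations


def cross_pairs(platforms):
    """Return every (build_platform, test_platform) pair with A != B."""
    return list(permutations(platforms, 2))


# Platform names are made up for the example.
platforms = ["rhel5_x86", "rhel5_x86_64", "sles10_x86"]
pairs = cross_pairs(platforms)
for build_on, test_on in pairs:
    print(f"build on {build_on}, test on {test_on}")
```

The number of pairs grows as n * (n - 1), so in practice one would likely restrict the list to pairs where binary compatibility is actually expected.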
  13. 13. Testing Opportunities <ul><li>Parameterized dependencies </li></ul><ul><ul><li>build with multiple library versions, compilers, etc. </li></ul></ul><ul><ul><li>test against every Java VM, Maven, Ant version around </li></ul></ul><ul><ul><li>test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), batch systems </li></ul></ul><ul><ul><li>make sure new versions of Condor, Globus, etc. don’t break your code </li></ul></ul><ul><li>Parallel scheduled testbeds </li></ul><ul><ul><li>cross-platform testing (A to B) </li></ul></ul><ul><ul><li>deploy software stack across many hosts, test whole stack </li></ul></ul><ul><ul><li>multi-site testing (US to Europe) </li></ul></ul><ul><ul><li>network testing (cross-firewall, low-bandwidth, etc.) </li></ul></ul><ul><ul><li>scalability testing </li></ul></ul>
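Parameterized dependencies amount to expanding a table of options into one build/test configuration per combination. The sketch below shows that expansion under stated assumptions: the `matrix` contents, the version strings, and the `expand` helper are hypothetical, and nothing here reflects how Metronome itself declares dependencies.

```python
from itertools import product

# Hypothetical dependency axes and versions, chosen only for illustration.
matrix = {
    "compiler": ["gcc-3.4", "gcc-4.1"],
    "db": ["mysql", "postgres"],
    "jvm": ["1.5", "1.6"],
}


def expand(matrix):
    """Return one dict per combination of dependency choices."""
    keys = list(matrix)
    combos = product(*(matrix[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


configs = expand(matrix)
print(len(configs), "configurations")
for cfg in configs:
    print(cfg)
```

Each resulting dict could become one independent job, which is where a large pool of heterogeneous machines pays off.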
  14. 14. Upshot <ul><li>This is all work we’d like to help this community do. </li></ul><ul><li>Start small -- automated builds are an excellent start. </li></ul><ul><li>Think big -- what kinds of testing would pay dividends? </li></ul><ul><li>Let us know what we can do to help make it happen. </li></ul>