SlideShare a Scribd company logo
1 of 28
TestOps: 
Continuous Integration 
when infrastructure 
is the product 
Barry Jaspan 
Senior Architect, Acquia Inc.
Watch the video with slide 
synchronization on InfoQ.com! 
http://www.infoq.com/presentations 
/acquia-cloud-paas 
InfoQ.com: News & Community Site 
• 750,000 unique visitors/month 
• Published in 4 languages (English, Chinese, Japanese and Brazilian 
Portuguese) 
• Post content from our QCon conferences 
• News 15-20 / week 
• Articles 3-4 / week 
• Presentations (videos) 12-15 / week 
• Interviews 2-3 / week 
• Books 1 / month
Presented at QCon New York 
www.qconnewyork.com 
Purpose of QCon 
- to empower software development by facilitating the spread of 
knowledge and innovation 
Strategy 
- practitioner-driven conference designed for YOU: influencers of 
change and innovation in your teams 
- speakers and topics driving the evolution and innovation 
- connecting and catalyzing the influencers and innovators 
Highlights 
- attended by more than 12,000 delegates since 2007 
- held in 9 cities worldwide
This talk is about 
the hard parts. 
Rainbows and ponies 
have left the building.
Intro to Continuous Integration 
○ Maintain a code repository 
○ Automate the build 
○ Make the build self-testing 
○ Everyone commits to the baseline every day 
○ Every commit (to baseline) should be built 
○ Keep the build fast 
○ Test in a clone of the production environment 
○ Make it easy to get the latest deliverables 
○ Everyone can see the results of the latest build 
○ Automate deployment
Intro to Acquia Cloud 
● PaaS for PHP apps 
○ Multiple environments: Dev, Stage, Prod, … 
○ Continuous Integration environment for your app 
○ Special sauce for Drupal 
● Obligatory Impressive Numbers, ca. 03/2014 
○ 27 billion origin hits per month 
○ 422 TB data xfer/month 
○ 8000+ EC2 instances 
● Release every 1.6 days on average 
○ Each release alters the infrastructure under 
thousands of web apps that we do not control 
● Our customers REALLY hate downtime
Server configuration is software 
● Puppet, Chef, and similar tools turn server 
config into software 
○ Hurray! 
○ Deserves the same best practices as app 
development 
● TestOps == test-related CI principles for 
infrastructure software 
○ Make the build self-testing 
○ Test in a clone of the production environment 
○ Keep the build fast 
● “If it isn’t tested, it doesn’t work.”
Unit tests vs. System tests 
● Unit tests isolate individual program modules 
○ Injection mocks out external systems 
● Problem: You can’t mock out the real world 
and get accurate results 
○ Server configuration interacts with the OS, network, 
and services 
○ “puppetd –noop” doesn’t tell the whole story 
● Sample failure 
○ crontab and cron daemon race condition
Unit tests vs. System tests 
● System tests are end-to-end 
○ Apply code changes to real, running servers 
○ Exercise the infrastructure as the app(s) will 
● Problem: Reality is very messy! 
○ Launch failures 
○ Race conditions 
○ Vendor API scheduled maintenance 
○ Cosmic rays
System tests FTW 
● For infrastructure, system tests are essential 
● Making Acquia Cloud’s system tests reliable 
is the hardest engineering challenge I’ve 
ever faced. 
● We would be totally dead without them.
Test in a clone of production 
● No back-doors to make tests “easier” 
● Ex: HIPPA, PCI, etc. security requirements 
○ Access only from a bastion server using two-factor 
authentication 
○ No root logins, even from the bastion 
○ Any failure here locks Ops out from all servers! 
● Tests operate just as admins do 
○ Most tests operate “from a bastion”: 
ssh testadmin@server ‘sudo bash -c “cmd”’ 
○ Ensures the code works in production
Server build strategy 
● Always build from a reference base 
○ e.g. “Ubuntu 12.04 Server 64-bit” 
○ No incrementally evolved images 
○ Puppet makes this natural 
○ Sidebar: Docker gets this totally wrong! 
● Puppet can take a while 
○ Makes tests and MTTR slow 
● More on this later…
Basic build tests 
● Launch VMs, run puppet 
○ Replicate a functional production environment 
○ Isolated from production 
● Scan syslog for errors 
● Test config files, daemons, users, cron 
jobs… 
● Sample failure 
○ Incorrect Puppet dependencies work while iterating 
on development instances but not on clean launch
Functionally test the moving parts 
● Backup and restore 
● Message queues 
● Worker auto-scaling 
● Load balancing with up and down workers 
● ELB health check and recovery 
● Database failover 
● Monitoring & Alerting 
● Self healing
Application test 
● Install and verify application(s) 
○ Real site code 
○ Real site db (scrubbed) 
● Cause app to exercise the infrastructure 
○ Write to database, message queues, etc. 
○ Verify success on the back end 
● Operate app on degraded infrastructure 
○ Failed web nodes 
○ Database failover
Reboot test 
● Reboot all test servers 
● Re-run build tests 
○ Filesystems mounted? 
○ Services restarted? 
● Re-run functional and application tests 
● Sample failure 
○ Database quota daemon starts via /etc/init.d before 
MySQL daemon, then aborts
Relaunch test 
● Relaunch all test servers from base image 
○ Simulates server crash and recovery 
○ Persistent data retained? 
○ Server rejoins services? (e.g. MySQL replication 
restarts) 
○ Unexpected issues, ex: tmpfs 
● Re-run build, functional, and application 
tests 
● Sample failure 
○ Non-deployable customer application prevents 
relaunch from completing normally
Upgrade test 
● So far we've talked about new servers 
○ Use case: Your service is growing. 
● Also need to test upgrading existing servers 
○ Use case: You add a feature. 
● The Upgrade Test Dance 
○ Launch servers in test environment on current 
production code 
○ Run smoke tests to ensure system is operating 
○ Upgrade servers to latest development code 
■ Requires a fully automated upgrade process 
○ Run build, functional, and application tests from 
development code
Upgrade release process 
● Puppet cannot orchestrate all upgrades 
○ Rolling upgrade across HA clusters 
○ Server type upgrade order requirements 
○ Post-release tasks 
■ ex: Uninstall package X once all servers are 
upgraded 
○ Devs document release procedure with the commit 
○ Different devs run the release on an internal-use 
installation 
● Sample failure 
○ nginx failed to restart after version upgrade because 
prod server has more domain names than test
Server builds: continuous imaging 
● Remember: Always launch from a reference 
image. No evolved images! 
● Building servers from scratch can be slow 
● Automate pre-built images from 
development branch 
○ Speeds intra-day tests, reduces MTTR in prod 
● You will hit unexpected bumps in the road 
● Sample failure 
○ MySQL server, EBS, and init.d
Continuous Imaging image tests 
● Create development images nightly 
● Create per-branch images at release 
● Run system tests on both base and pre-built 
images 
● Test upgrade from per-branch to 
development pre-built images
Testing in parallel 
● Infrastructure system tests are slow 
● Run them in parallel 
○ Workers may alter server-wide behavior (e.g. kill 
Apache) 
○ Each worker needs an isolated set of servers 
○ Workers that break their servers need to self 
destruct, or they will cause false failures 
● Optimize running time 
○ Add more workers 
○ Reduce setup time 
○ Run the slower tests first
Management issues
Who writes the tests? 
● Our tests are as, or more, complex than the 
product 
○ Tests often take longer to write 
● Subtle cases require white-box testing 
○ Triggering specific failure scenarios requires 
understanding OS and code details together 
● First try: QA department 
○ Did not work, they could not keep up or go deep 
● Now: Engineering 
○ Every dev writes unit and system tests for their own 
code
Who fixes the tests? 
● Infrastructure system tests are fragile 
○ The damn things break for every little bug! 
○ … and every race condition imaginable 
○ … and every cosmic ray 
● Code reviews require a “passing” run 
○ Author must analyze any failures, confirm they are 
unrelated, and refer to or open a ticket for it 
● Bugs often only occur post-commit 
● Permanent, rotating team handles failures 
○ Authority to revert any commit causing a failure 
○ Usually it is easier to fix it instead
Who invests in the tests? 
● Management must accept that infrastructure 
system tests are 
○ hard 
○ time-consuming 
○ essential 
○ worth it 
● Under-investing will bite you badly 
○ “If it isn’t tested, it doesn’t work.” 
○ It will fail, at the worst possible time
Questions? 
● Barry Jaspan, barry.jaspan@acquia.com 
● Please evaluate this session! 
● Acquia is hiring! 
○ Boston, New York, Portland 
○ Europe! Australia!!! 
○ wherever you are
Watch the video with slide synchronization on 
InfoQ.com! 
http://www.infoq.com/presentations/acquia-cloud- 
paas

More Related Content

More from C4Media

Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 

More from C4Media (20)

Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 

Recently uploaded

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

TestOps: Continuous Integration when Infrastructure is the Product

  • 1. TestOps: Continuous Integration when infrastructure is the product Barry Jaspan Senior Architect, Acquia Inc.
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /acquia-cloud-paas InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. This talk is about the hard parts. Rainbows and ponies have left the building.
  • 5. Intro to Continuous Integration ○ Maintain a code repository ○ Automate the build ○ Make the build self-testing ○ Everyone commits to the baseline every day ○ Every commit (to baseline) should be built ○ Keep the build fast ○ Test in a clone of the production environment ○ Make it easy to get the latest deliverables ○ Everyone can see the results of the latest build ○ Automate deployment
  • 6. Intro to Acquia Cloud ● PaaS for PHP apps ○ Multiple environments: Dev, Stage, Prod, … ○ Continuous Integration environment for your app ○ Special sauce for Drupal ● Obligatory Impressive Numbers, ca. 03/2014 ○ 27 billion origin hits per month ○ 422 TB data xfer/month ○ 8000+ EC2 instances ● Release every 1.6 days on average ○ Each release alters the infrastructure under thousands of web apps that we do not control ● Our customers REALLY hate downtime
  • 7. Server configuration is software ● Puppet, Chef, and similar tools turn server config into software ○ Hurray! ○ Deserves the same best practices as app development ● TestOps == test-related CI principles for infrastructure software ○ Make the build self-testing ○ Test in a clone of the production environment ○ Keep the build fast ● “If it isn’t tested, it doesn’t work.”
  • 8. Unit tests vs. System tests ● Unit tests isolate individual program modules ○ Injection mocks out external systems ● Problem: You can’t mock out the real world and get accurate results ○ Server configuration interacts with the OS, network, and services ○ “puppetd –noop” doesn’t tell the whole story ● Sample failure ○ crontab and cron daemon race condition
  • 9. Unit tests vs. System tests ● System tests are end-to-end ○ Apply code changes to real, running servers ○ Exercise the infrastructure as the app(s) will ● Problem: Reality is very messy! ○ Launch failures ○ Race conditions ○ Vendor API scheduled maintenance ○ Cosmic rays
  • 10. System tests FTW ● For infrastructure, system tests are essential ● Making Acquia Cloud’s system tests reliable is the hardest engineering challenge I’ve ever faced. ● We would be totally dead without them.
  • 11. Test in a clone of production ● No back-doors to make tests “easier” ● Ex: HIPPA, PCI, etc. security requirements ○ Access only from a bastion server using two-factor authentication ○ No root logins, even from the bastion ○ Any failure here locks Ops out from all servers! ● Tests operate just as admins do ○ Most tests operate “from a bastion”: ssh testadmin@server ‘sudo bash -c “cmd”’ ○ Ensures the code works in production
  • 12. Server build strategy ● Always build from a reference base ○ e.g. “Ubuntu 12.04 Server 64-bit” ○ No incrementally evolved images ○ Puppet makes this natural ○ Sidebar: Docker gets this totally wrong! ● Puppet can take a while ○ Makes tests and MTTR slow ● More on this later…
  • 13. Basic build tests ● Launch VMs, run puppet ○ Replicate a functional production environment ○ Isolated from production ● Scan syslog for errors ● Test config files, daemons, users, cron jobs… ● Sample failure ○ Incorrect Puppet dependencies work while iterating on development instances but not on clean launch
  • 14. Functionally test the moving parts ● Backup and restore ● Message queues ● Worker auto-scaling ● Load balancing with up and down workers ● ELB health check and recovery ● Database failover ● Monitoring & Alerting ● Self healing
  • 15. Application test ● Install and verify application(s) ○ Real site code ○ Real site db (scrubbed) ● Cause app to exercise the infrastructure ○ Write to database, message queues, etc. ○ Verify success on the back end ● Operate app on degraded infrastructure ○ Failed web nodes ○ Database failover
  • 16. Reboot test ● Reboot all test servers ● Re-run build tests ○ Filesystems mounted? ○ Services restarted? ● Re-run functional and application tests ● Sample failure ○ Database quota daemon starts via /etc/init.d before MySQL daemon, then aborts
  • 17. Relaunch test ● Relaunch all test servers from base image ○ Simulates server crash and recovery ○ Persistent data retained? ○ Server rejoins services? (e.g. MySQL replication restarts) ○ Unexpected issues, ex: tmpfs ● Re-run build, functional, and application tests ● Sample failure ○ Non-deployable customer application prevents relaunch from completing normally
  • 18. Upgrade test ● So far we've talked about new servers ○ Use case: Your service is growing. ● Also need to test upgrading existing servers ○ Use case: You add a feature. ● The Upgrade Test Dance ○ Launch servers in test environment on current production code ○ Run smoke tests to ensure system is operating ○ Upgrade servers to latest development code ■ Requires a fully automated upgrade process ○ Run build, functional, and application tests from development code
  • 19. Upgrade release process ● Puppet cannot orchestrate all upgrades ○ Rolling upgrade across HA clusters ○ Server type upgrade order requirements ○ Post-release tasks ■ ex: Uninstall package X once all servers are upgraded ○ Devs document release procedure with the commit ○ Different devs run the release on an internal-use installation ● Sample failure ○ nginx failed to restart after version upgrade because prod server has more domain names than test
  • 20. Server builds: continuous imaging ● Remember: Always launch from a reference image. No evolved images! ● Building servers from scratch can be slow ● Automate pre-built images from development branch ○ Speeds intra-day tests, reduces MTTR in prod ● You will hit unexpected bumps in the road ● Sample failure ○ MySQL server, EBS, and init.d
  • 21. Continuous Imaging image tests ● Create development images nightly ● Create per-branch images at release ● Run system tests on both base and pre-built images ● Test upgrade from per-branch to development pre-built images
  • 22. Testing in parallel ● Infrastructure system tests are slow ● Run them in parallel ○ Workers may alter server-wide behavior (e.g. kill Apache) ○ Each worker needs an isolated set of servers ○ Workers that break their servers need to self destruct, or they will cause false failures ● Optimize running time ○ Add more workers ○ Reduce setup time ○ Run the slower tests first
  • 24. Who writes the tests? ● Our tests are as, or more, complex than the product ○ Tests often take longer to write ● Subtle cases require white-box testing ○ Triggering specific failure scenarios requires understanding OS and code details together ● First try: QA department ○ Did not work, they could not keep up or go deep ● Now: Engineering ○ Every dev writes unit and system tests for their own code
  • 25. Who fixes the tests? ● Infrastructure system tests are fragile ○ The damn things break for every little bug! ○ … and every race condition imaginable ○ … and every cosmic ray ● Code reviews require a “passing” run ○ Author must analyze any failures, confirm they are unrelated, and refer to or open a ticket for it ● Bugs often only occur post-commit ● Permanent, rotating team handles failures ○ Authority to revert any commit causing a failure ○ Usually it is easier to fix it instead
  • 26. Who invests in the tests? ● Management must accept that infrastructure system tests are ○ hard ○ time-consuming ○ essential ○ worth it ● Under-investing will bite you badly ○ “If it isn’t tested, it doesn’t work.” ○ It will fail, at the worst possible time
  • 27. Questions? ● Barry Jaspan, barry.jaspan@acquia.com ● Please evaluate this session! ● Acquia is hiring! ○ Boston, New York, Portland ○ Europe! Australia!!! ○ wherever you are
  • 28. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/acquia-cloud- paas