Rise of the Phoenix
Lessons Learned While Building an AI-powered Test Gen Engine
by Steve Brudz
Principal Engineer @ Def Method
Artificial Ruby NYC March 4, 2025
Agenda
● Story
● Phoenix Demo
● Key Learnings Developing Phoenix
Meet Maple
Newly Hired CTO at Cat’s Paw, LLC
Pain Points at Cat’s Paw, LLC
“Our last two releases were so buggy that we had to roll them back. Customers are pissed
off and some are threatening to leave.”
– Chocolate, Head of Sales
“The engineering team takes forever to get anything done and they never meet their
estimates.”
– Fluffy, Head of Product
“Some of this code is like the Fire Swamp from the Princess Bride. Every time I change
something in there, unexpected things break. I hate it.”
– Mochi, Senior Developer
Churn vs. Complexity of Cat’s Paw, LLC’s code
Test Coverage vs LOC of Cat’s Paw, LLC’s code
Impact of Technical Debt
● High risk of accidental breakage
● Slows development down
● Work is hard to estimate
● Lowers morale
How to fix things?
1. Add tests before changing code
2. Make the changes
3. Refactor
4. Rinse and repeat
“But that takes so long…”
“There’s got to be a faster way!”
Enter Phoenix
● AI-powered test generation
● Generates a full test suite for the system
● Uses Rails testing best practices
● Provides PR feedback*
● Maintains and updates the tests*
* coming soon
Phoenix Demo
Key Learnings Developing Phoenix
Key Learnings: LLMs
● Test out different providers & models
● Newest model isn’t always the best
● Keep the info you’re sending the LLM small and focused
● LLMs use probabilities – roll the dice multiple times and pick the best (sketched below)
● Prefer traditional automation to LLMs
[Images: OpenAI, StarCoder, and Claude logos; “Robot House”]
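The “roll the dice” bullet can be made concrete with best-of-n sampling: request several completions and keep the highest-scoring one. Here is a minimal sketch assuming the openai Python client; score_candidate is a hypothetical heuristic, not part of Phoenix.

    # Best-of-n sampling: ask for several completions, keep the best.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def score_candidate(text: str) -> float:
        # Hypothetical heuristic: prefer output that looks like a real
        # RSpec file. In practice you might parse it or actually run it.
        return float("RSpec.describe" in text)

    def best_of_n(prompt: str, n: int = 3) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            n=n,  # roll the dice n times in a single request
        )
        candidates = [choice.message.content for choice in response.choices]
        return max(candidates, key=score_candidate)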
Key Learnings: Monitoring
● Capturing traces is essential for troubleshooting
● There are many options
○ LangSmith, ArizeAI, Langtrace, AgentOps, MLFlow, OpenLit, etc.
● Capture errors, LLM completions, tool calls
● Monitor tokens and cost
● Time-outs and recursion limits are a must-have (see the sketch below)
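Two of these bullets lend themselves to a small sketch: a hard time-out around every LLM call, plus basic token logging. This assumes the openai async client; the logger name is illustrative.

    # Wrap every LLM call in a time-out and log token usage.
    import asyncio
    import logging

    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    log = logging.getLogger("phoenix.llm")  # illustrative logger name

    async def traced_completion(prompt: str, timeout_s: float = 60.0) -> str:
        response = await asyncio.wait_for(
            client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            ),
            timeout=timeout_s,  # a runaway call fails fast instead of running for an hour
        )
        usage = response.usage
        log.info("tokens: prompt=%d completion=%d",
                 usage.prompt_tokens, usage.completion_tokens)
        return response.choices[0].message.content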
Key Learnings: Concurrency
● Important for processing large amounts of data quickly
● Many LLM apps are I/O bound, not CPU bound
● Asyncio works differently in Python than async does in JavaScript
● Watch out for API rate limits when doing concurrent programming (sketched below)
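A minimal sketch of the asyncio pattern: one task per file, with a semaphore capping in-flight API calls so you stay under the provider's rate limit. generate_tests_for is a stand-in for the real LLM call, not Phoenix's actual code.

    # I/O-bound concurrency with a rate-limit guard.
    import asyncio

    MAX_IN_FLIGHT = 5  # tune to your provider's rate limits
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def generate_tests_for(path: str) -> str:
        # Stand-in for the real LLM-backed call.
        await asyncio.sleep(0.1)
        return f"# specs for {path}"

    async def generate_with_limit(path: str) -> str:
        async with semaphore:  # at most MAX_IN_FLIGHT calls at once
            return await generate_tests_for(path)

    async def main(paths: list[str]) -> list[str]:
        return await asyncio.gather(*(generate_with_limit(p) for p in paths))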
Key Learnings: Agents
● Agent-based workflows are powerful and flexible but cost more
● CrewAI’s Agents, Tasks, and Tools allow LLMs to collaborate (sketched below)
● Postel’s Law: “Be liberal in what you accept, and validate what you send”
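The Agents/Tasks shape in CrewAI looks roughly like this; the roles, goals, and file names below are illustrative, not Phoenix's actual configuration.

    # Two collaborating agents: one writes specs, one reviews them.
    from crewai import Agent, Task, Crew

    writer = Agent(
        role="Test writer",
        goal="Write RSpec tests for a given Ruby class",
        backstory="A meticulous Rails developer.",
    )
    reviewer = Agent(
        role="Test reviewer",
        goal="Check generated tests for correctness and Rails best practices",
        backstory="A senior engineer who rejects anything that is not Ruby code.",
    )

    write = Task(
        description="Generate RSpec tests for app/models/order.rb",
        expected_output="A complete RSpec file containing only Ruby code",
        agent=writer,
    )
    review = Task(
        description="Review the generated tests and return the approved file",
        expected_output="The approved RSpec file",
        agent=reviewer,
    )

    crew = Crew(agents=[writer, reviewer], tasks=[write, review])
    result = crew.kickoff()

    # Postel's Law in practice: validate what the crew produced before using it.
    if "RSpec.describe" not in str(result):
        raise ValueError("Crew output does not look like an RSpec file")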
Key Learnings: Tools
● A very specific function improves reliability (sketched below)
● General tools can be useful as a backup
● Agents may use them in surprising ways
● CrewAI supports hand-offs between agents (ask-question tool)
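A narrowly scoped tool, per the first bullet above, might look like this. The @tool decorator is from crewai_tools; the path guard is there because agents really do call tools in surprising ways. All names here are illustrative.

    # A very specific tool: read exactly one file under app/models.
    from pathlib import Path

    from crewai_tools import tool

    @tool("Read one Rails model file")
    def read_model_source(relative_path: str) -> str:
        """Return the Ruby source of a single file under app/models."""
        root = Path("app/models").resolve()
        target = (root / relative_path).resolve()
        if root not in target.parents:
            # Guard against surprising uses like "../config/secrets.yml".
            return "Error: path must stay inside app/models"
        return target.read_text()

    # Attach it to an agent with: Agent(..., tools=[read_model_source])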
Happy Business + Happy Team = Happy Cat
● More frequent releases
● Faster speed to market
● Greatly reduced failure rate
● Happier developers
Steve Brudz
Principal Engineer
steve.brudz@defmethod.com
Thank you!
Def Method
336 W 37th St #335
New York, NY 10018
(212) 256-1460
Any questions?
Demo Backup Slides
Access Repo
Fork Repo
Generate Specs
Review the Results

Editor's Notes

  • #2 First, I’m going to tell you a story. Then I’ll show you a demo of Phoenix. Finally, I’ll share some key learnings and tips about working effectively with AI.
  • #3 This is Maple. She just started a new job as CTO at Cat’s Paw, LLC. Cat’s Paw is a startup that keeps cats happy by sending them new toys every week. They’ve grown really fast.
  • #4 But now they’re hitting problems: quality problems, productivity issues, low morale. They’ve hired Maple to fix things.
  • #5 To get a bird’s-eye view of the code base, Maple runs RubyCritic. She looks at the graph of code complexity and churn (how often each file has changed). She can see there are some files that change a lot and are super complex. That file in the upper right looks like a cat-astrophe. But files like that usually have a lot of important business rules in them.
  • #6 Next, she looks at test coverage. There’s some, but there are a lot of files with no or low coverage, and the big files in particular have under 50% coverage.
  • #7 The complaints listed earlier are classic signs of technical debt and low test coverage.
  • #8 It’s a lot of work: hard to estimate, hard to convince the business to pay for (even with their complaints), and tedious work that people would rather not do. Maple is a smart cat and a problem solver.
  • #9 Maple hears about Phoenix while browsing Reddit. AI-powered test generation? A full test suite with minimal developer effort? Let’s try it. Phoenix churns through the code base, generating tests for all those huge models and controllers.
  • #12 LLMs are like a band of lovable misfits – they’re capable of great things, but take your eye off them and they’ll cause trouble. We’ve had success with OpenAI’s GPT-4o, Anthropic’s Claude Sonnet, and StarCoder. With OpenAI’s o1, we’ve actually seen worse results. If you send too much information to an LLM, it’s more likely to get confused. So if you paste a 2,000-line model class into ChatGPT and say “generate tests for this,” it will generate tests, but they won’t be great. Every time you send data to an LLM, you’re rolling the dice. If you can implement something using normal automation, do it. LLMs are powerful and flexible, but they’re not consistent and they’re expensive.
  • #13 Capturing traces is essential. On the right is a trace of an issue we found last week. Keith and I were looking at a run and noticed some outliers in the number of tokens used (why 89k when the rest took 10-12k?). We could dig into the traces and find that the second task didn’t output Ruby code like it was supposed to. Instead it output a summary paragraph, which caused the agent handling the next step to get confused. One of our early test runs got out of hand and ran for an hour, racking up a $200 bill. Put in time-outs so this doesn’t happen to you.
  • #14 Concurrency is hard. It infects your program, and it’s easy to shoot yourself in the foot. But it’s important for churning through a large code base in a reasonable time. Choose the right concurrency model for your situation: in Python, if your app is calling an API, it’ll be I/O bound, so use asyncio instead of threads. Be careful when mixing concurrency models. When running a lot of API calls in parallel, you’ll hit rate limits; as you use the APIs more, the providers will upgrade you.
  • #15 With Phoenix, we started off using LangChain, which is one of the earlier frameworks for working with LLMs. We hit limits with its flexibility, though – our chains were linear – and decided to switch over to CrewAI, a newer agent-based framework. With an agent framework, you give agents tools to do their work and tasks to accomplish. It’s powerful and pretty cool to see in action. But this is where Postel’s Law becomes very important: validate those outputs (nod to Scott Werner), or you’ll end up furious like Dean Vernon here. CrewAI provides task outputs and guardrails.
  • #17 Phoenix gave Maple and her team the safety net they needed to start addressing technical debt. A few months in, there’s still a lot of work to do, but the engineering team is able to focus on making improvements and delivering value instead of fighting fires.
  • #19 Just in case.