Rise of the Phoenix: Lessons Learned Building an AI-powered Test Gen Engine
In this talk, I give an overview and demo of Phoenix, an AI-powered test generation engine for Ruby on Rails applications, and share lessons learned while building it. I presented this at the Artificial Ruby meetup in NYC on March 4, 2025.
1.
Rise of the Phoenix
Lessons Learned While Building an AI-powered Test Gen Engine
by Steve Brudz
Principal Engineer @ Def Method
Artificial Ruby NYC March 4, 2025
Pain Points at Cat’s Paw, LLC
“Our last two releases were so buggy that we had to roll them back. Customers are pissed off and some are threatening to leave.”
– Chocolate, Head of Sales
“The engineering team takes forever to get anything done and they never meet their estimates.”
– Fluffy, Head of Product
“Some of this code is like the Fire Swamp from The Princess Bride. Every time I change something in there, unexpected things break. I hate it.”
– Mochi, Senior Developer
Impact of Technical Debt
● High risk of accidental breakage
● Slows development down
● Work is hard to estimate
● Lowers morale
8.
How to fix things?
1. Add tests before changing code
2. Make the changes
3. Refactor
4. Rinse and repeat
“But that takes so long… There’s got to be a faster way!”
9.
Enter Phoenix
● AI-powered test generation
● Generates a full test suite for the system
● Uses Rails testing best practices
● Provides PR feedback*
● Maintains and updates the tests*
* coming soon
Key Learnings: LLMs
● Test out different providers & models
● Newest model isn’t always the best
● Keep the info you’re sending the LLM small and focused
● LLMs use probabilities – roll the dice multiple times and pick the best
● Prefer traditional automation to LLMs
(Slide art: OpenAI, StarCoder, and Claude logos, plus a “Robot House” image)
13.
Key Learnings: Monitoring
● Capturing traces is essential for troubleshooting
● There are many options
○ LangSmith, ArizeAI, Langtrace, AgentOps, MLFlow, OpenLit, etc.
● Capture errors, LLM completions, tool calls
● Monitor tokens and cost
● Time-outs and recursion limits are a must-have
14.
Key Learnings: Concurrency
● Important for processing large amounts of data quickly
● Many LLM apps are I/O-bound, not CPU-bound
● Python’s asyncio works differently than JavaScript’s async model
● Watch out for API rate limits when doing concurrent programming
15.
Key Learnings: Agents
● Agent-based workflows are powerful and flexible but cost more
● CrewAI’s Agents, Tasks, and Tools allow LLMs to collaborate
● Postel’s Law: “Be liberal in what you accept, and validate what you send”
16.
Key Learnings: Tools
● Tools with a very specific function improve reliability (see the sketch after this list)
● General tools can be useful as a backup
● Agents may use them in surprising ways
● CrewAI supports hand-offs between agents (ask question tool)
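To make the “very specific function” point concrete, here’s a minimal sketch of a narrowly scoped tool in Python, assuming CrewAI’s @tool decorator (the import path has moved between CrewAI versions, and this particular tool is hypothetical, not one of Phoenix’s):

```python
# Sketch only: a narrowly scoped CrewAI tool. Assumes the @tool decorator
# from crewai.tools; the tool itself is hypothetical, not Phoenix's.
from pathlib import Path

from crewai.tools import tool


@tool("Read one Rails model file")
def read_model_file(model_name: str) -> str:
    """Return the source of app/models/<model_name>.rb.

    The docstring matters: the agent reads it to decide when to use the tool.
    """
    path = Path("app/models") / f"{model_name}.rb"
    if not path.is_file():
        # A readable error message keeps the agent from hallucinating paths.
        return f"No model file found at {path}"
    return path.read_text()
```

A tool that reads exactly one model file gives the agent far less room to wander than a general “read any file” tool, which is better kept as a fallback.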
17.
Happy Business + Happy Team = Happy Cat
● More frequent releases
● Faster speed to market
● Greatly reduced failure rate
● Happier developers
#2 First, I’m going to tell you a story
Then I’ll show you a demo of Phoenix
Finally, I’ll share some key learnings and tips about working effectively with AI
#3 This is Maple
She just started a new job as CTO
At Cat’s Paw, LLC
Cat’s Paw is a startup that keeps cats happy by sending them new toys every week
They’ve grown really fast
#4 But now they’re hitting problems
Quality problems, productivity issues, low morale
They’ve hired Maple to fix things
#5 To get a bird’s-eye view of the code base, Maple runs RubyCritic
She looks at the graph of code complexity and churn (how often the file has changed)
She can see there are some files that change a lot and are super complex
That file in the upper right looks like a cat-astrophe
But files like that usually have a lot of important business rules in them
#6 Next, she looks at test coverage
There’s some
But there are a lot of files with no or low coverage
And the big files in particular have under 50% coverage
#7 Those complaints listed earlier are classic signs of technical debt and low test coverage
#8 It’s a lot of work
Hard to estimate
Hard to convince the business to pay for it (even with their complaints)
Tedious work that people would rather not do
Maple is a smart cat and a problem solver
#9 Maple hears about Phoenix while browsing Reddit
AI-powered test generation? A full test suite with minimal developer effort?
Let’s try it
Phoenix churns through the code base
Generating tests for all those huge models and controllers
#12 LLMs are like a band of lovable misfits: they’re capable of great things, but take your eye off them and they’ll cause trouble
We’ve had success with OpenAI’s GPT-4o, Anthropic’s Claude Sonnet, and StarCoder
With OpenAI’s o1, we’ve actually seen worse results
If you send too much information to an LLM, it’s more likely to get confused
So if you paste a 2,000-line model class into ChatGPT and say “generate tests for this”
It will generate tests, but they won’t be great
Every time you send data to an LLM, you’re rolling the dice
If you can implement something using normal automation, do it.
LLMs are powerful and flexible but they’re not consistent and they’re expensive
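Here’s a rough sketch of the “roll the dice” tip: request several completions and keep the best-scoring one. It uses the OpenAI Python SDK’s n parameter; the scoring heuristic is a toy stand-in, not Phoenix’s actual check (which would be closer to “do the generated specs parse and run?”):

```python
# Sketch of best-of-n sampling with the OpenAI Python SDK. score_candidate
# is a hypothetical heuristic, not Phoenix's actual scoring.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_candidate(code: str) -> int:
    """Toy check: prefer outputs that at least look like an RSpec file."""
    return int("RSpec.describe" in code) + int("expect(" in code)


def best_of_n(prompt: str, n: int = 3) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        n=n,  # ask for n independent samples in one request
        messages=[{"role": "user", "content": prompt}],
    )
    candidates = [choice.message.content or "" for choice in response.choices]
    return max(candidates, key=score_candidate)
```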
#13 Capturing traces is essential
On the right is a trace of an issue we found last week
Keith and I were looking at a run and noticed some outliers in the number of tokens used (why 89k, when the rest took 10-12k?)
We dug into the traces and found that the second task didn’t output Ruby code like it was supposed to
Instead it output a summary paragraph, which caused the agent handling the next step to get confused
One of our early test runs got out of hand and ran for an hour, racking up a $200 bill. Put in time-outs so this doesn’t happen to you.
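A minimal sketch of that time-out advice using Python’s asyncio (generate_tests_for is a hypothetical stand-in for a pipeline step, not Phoenix’s actual code):

```python
# Sketch: cap how long any single generation step may run.
import asyncio


async def generate_tests_for(file_path: str) -> str:
    """Hypothetical stand-in for an LLM-backed generation step."""
    await asyncio.sleep(1)  # pretend to await an LLM API call
    return f"# generated specs for {file_path}"


async def generate_with_timeout(file_path: str, timeout_s: float = 300.0) -> str:
    try:
        return await asyncio.wait_for(generate_tests_for(file_path), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Losing one file's tests is cheaper than an hour-long runaway bill.
        return f"# TIMED OUT after {timeout_s}s: {file_path}"
```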
#14 Concurrency is hard. It infects your program.
Easy to shoot yourself in the foot
But it’s important for churning through a large code base in a reasonable time
Choose the right concurrency model for your situation
In Python, if your app is mostly waiting on an API, it’s I/O-bound, so use asyncio instead of threads
Be careful when mixing concurrency models
When running a lot of API calls in parallel, you’ll hit rate limits
As you use the APIs more, providers will bump you up to higher rate-limit tiers
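Here’s what bounded concurrency can look like with asyncio: a semaphore caps the number of in-flight API calls so you stay under the provider’s rate limit (fetch_completion and the limit of 5 are placeholders, not Phoenix’s actual values):

```python
# Sketch: bounded concurrency for I/O-bound LLM calls.
import asyncio

MAX_IN_FLIGHT = 5  # tune to your API tier's rate limit


async def fetch_completion(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    await asyncio.sleep(1)  # stands in for an awaited HTTP request
    return f"completion for: {prompt[:30]}"


async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(prompt: str) -> str:
        async with sem:  # at most MAX_IN_FLIGHT calls run at once
            return await fetch_completion(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))


# asyncio.run(run_all(["spec for User", "spec for OrdersController"]))
```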
#15 With Phoenix, we started off using LangChain, which is one of the earlier frameworks for working with LLMs
We hit limits with flexibility, though: our chains were linear.
And decided to switch over to using CrewAI, which is a newer agent-based framework
With an agent framework, you give agents tools to do their work and tasks to accomplish
It’s powerful and pretty cool to see in action
But this is where Postel’s Law becomes very important
Validate those outputs (nod to Scott Werner)
Or you’ll end up furious like Dean Vernon here
CrewAI provides task outputs and guardrails
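A hedged sketch of that shape, assuming CrewAI’s Agent/Task/Crew API and the task guardrail hook from recent versions; the roles, task, and validation rule here are simplified stand-ins, not Phoenix’s actual setup:

```python
# Sketch: a CrewAI agent and task with a guardrail that validates the output.
from crewai import Agent, Crew, Task


def looks_like_ruby(output) -> tuple[bool, str]:
    """Guardrail: reject prose summaries that should have been Ruby code."""
    text = output.raw if hasattr(output, "raw") else str(output)
    if "RSpec.describe" in text:
        return (True, text)
    return (False, "Output must be a Ruby RSpec file, not a summary.")


test_writer = Agent(
    role="Rails test writer",
    goal="Write RSpec tests for a given model",
    backstory="An experienced Rails developer who writes focused specs.",
)

write_tests = Task(
    description="Write RSpec tests for app/models/subscription.rb",
    expected_output="A complete Ruby RSpec file",
    agent=test_writer,
    guardrail=looks_like_ruby,  # a failed check is fed back to the agent
)

crew = Crew(agents=[test_writer], tasks=[write_tests])
# result = crew.kickoff()
```

When the guardrail returns False, CrewAI can feed the error message back to the agent and retry, which catches exactly the failure from the trace story above: a task that returns a summary paragraph instead of Ruby code gets stopped before it confuses the next step.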
#17 Phoenix gave Maple and her team the safety net they needed to start addressing technical debt
A few months in, there’s still a lot of work to do
But the engineering team is able to focus on making improvements and delivering value
Instead of fighting fires