by johnjwang on 10/24/24, 4:03 PM with 77 comments
by renegade-otter on 10/26/24, 10:21 PM
"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).
Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.
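For illustration, a minimal pytest sketch of that idea: one parameterized test covering a batch of unhappy paths with very little test code (the `parse_age` function and its rules are invented for this example).

```python
import pytest

# Hypothetical function under test: parses a user-supplied age string.
def parse_age(raw: str) -> int:
    value = int(raw.strip())  # raises ValueError for non-numeric input
    if not 0 <= value <= 150:
        raise ValueError("age out of range")
    return value

# One parameterized test covers many unhappy paths in a few lines.
@pytest.mark.parametrize("raw", ["", "  ", "abc", "-1", "151", "12.5"])
def test_parse_age_rejects_invalid_input(raw):
    with pytest.raises(ValueError):
        parse_age(raw)

@pytest.mark.parametrize("raw,expected", [("0", 0), (" 42 ", 42), ("150", 150)])
def test_parse_age_accepts_valid_input(raw, expected):
    assert parse_age(raw) == expected
```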
by mastersummoner on 10/27/24, 12:24 AM
I was, however, extremely impressed with Claude this time around. Not only did it do a great job right off the bat, but it also taught me some techniques and tricks available in the language/framework (Ruby, RSpec) that I wasn't familiar with.
I'm certain it helped that I had a decent prompt, asking it to consider all the potential user paths and edge cases, and that I had a very good understanding of the code myself. Still, this was the first time I could honestly say that an LLM actually saved me time as a developer.
by mkleczek on 10/27/24, 3:48 AM
In the past I've been involved in several projects that made deep use of MDA (Model Driven Architecture) techniques, using various code generation methods to develop software. One of the main obstacles was maintaining the generated code.
IOW: how should we treat generated code?
If we treat it the same way as code produced by humans (i.e. we maintain it), then the maintenance cost grows (super-linearly) with the amount of code we generate. To make matters worse for LLMs: since the code they generate is buggy, we have more buggy code to maintain. Code review is not the answer, because code review is very weak at finding bugs.
This is unlike compilers (which also generate code), because we don't maintain code generated by compilers - we regenerate it any time we need to.
The fundamental issue is: for a given set of requirements the goal is to produce less code, not more. _Any_ code generation (however smart it might be) goes against this goal.
EDIT: typos
by DeathArrow on 10/27/24, 7:47 AM
If you have injected services in your current service, the LLM doesn't know anything about them, so it makes poor guesses. You have to bring them into context so they can be mocked properly.
You end up spending a lot of time guiding the LLM, so it's not measurably faster than writing the tests by hand.
I want my prompt to be: "write unit tests for XYZ method" without having to describe in the prompt what the method does, how it does it, and why it does it. Writing that many details in the prompt takes as much time as writing the code myself.
GitHub Copilot should be better, since it's supposed to have access to your entire code base. But somehow it doesn't look at dependencies and just uses its knowledge of the codebase for stylistic purposes.
It's probably my fault, and there are surely better ways to use LLMs for code, but I'm probably not the only one who struggles.
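As a rough sketch of the mocking problem described above (the `OrderService` class and its gateway dependency are hypothetical): the test only works because the injected dependency's contract is known, which is exactly what the LLM is missing when that dependency isn't in its context.

```python
from unittest.mock import Mock

# Hypothetical service with a constructor-injected dependency (a payment gateway).
class OrderService:
    def __init__(self, gateway):
        self.gateway = gateway  # injected, so tests can substitute a mock

    def checkout(self, order_id: str, amount: float) -> bool:
        return self.gateway.charge(order_id, amount)

def test_checkout_charges_the_gateway():
    gateway = Mock()
    gateway.charge.return_value = True   # the gateway's contract must be known
    service = OrderService(gateway)      # to set up this mock sensibly

    assert service.checkout("o-1", 9.99) is True
    gateway.charge.assert_called_once_with("o-1", 9.99)
```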
by nazgul17 on 10/26/24, 11:38 PM
by satisfice on 10/26/24, 8:31 PM
The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.
LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea with an important hazard attached: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.
by tsv_ on 10/27/24, 4:41 PM
- The models often create several tests within the same equivalence class, which barely expands test coverage
- They either skip parameterization, creating multiple redundant tests, or go overboard with 5+ parameters that make tests hard to read and maintain
- The models seem focused on "writing a test at any cost," often resorting to excessive mocking or monkey-patching without much thought
- The models don’t leverage existing helper functions or classes in the project, requiring me to upload the whole project context each time or customize GPTs for every individual project
Given these limitations, I primarily use LLMs for refactoring tests where the IDE isn't as efficient:
- Extracting repetitive code in tests into helpers or fixtures
- Merging multiple tests into a single parameterized test
- Breaking up overly complex parameterized tests for readability
- Renaming tests to maintain a consistent style across a module, without getting stuck on names
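For illustration, a minimal pytest sketch of the kind of refactor listed above (the `ShoppingCart` class and test names are invented): repetitive setup is pulled into a fixture, and several same-equivalence-class tests are merged into one parameterized test.

```python
import pytest

# Hypothetical class under test, defined inline to keep the sketch self-contained.
class ShoppingCart:
    def __init__(self):
        self.items = []

    def add(self, name: str, price: float, qty: int) -> None:
        self.items.append((name, price, qty))

    def total(self, discount: float = 0.0) -> float:
        subtotal = sum(price * qty for _, price, qty in self.items)
        return subtotal * (1 - discount)

# Repetitive setup extracted into a shared fixture instead of being
# copy-pasted into every test.
@pytest.fixture
def cart():
    cart = ShoppingCart()
    cart.add("widget", price=10.0, qty=2)
    return cart

# Several near-identical tests merged into one parameterized test with a
# small, readable parameter set.
@pytest.mark.parametrize("discount,expected_total", [
    (0.0, 20.0),
    (0.10, 18.0),
    (0.50, 10.0),
])
def test_total_with_discount(cart, discount, expected_total):
    assert cart.total(discount=discount) == pytest.approx(expected_total)
```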
by iambateman on 10/26/24, 9:03 PM
Happy to open source if anyone is interested.
by gengstrand on 10/27/24, 6:39 PM
by simonw on 10/26/24, 8:25 PM
by apwell23 on 10/26/24, 10:25 PM