from Hacker News

Semantic unit testing: test code without executing it

by alexmolas on 5/3/25, 9:44 AM with 71 comments

by vouwfietsman on 5/5/25, 7:40 AM
Maybe someone can help me out here:
I always get the feeling that fundamentally our software should be built on a foundation of sound logic and reasoning. That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there's no issues injected by the LLM tools that inherently lack logic and reasoning, or at least such validation must be on par with human authored code + review. Because of this, the validation cannot be done by an LLM, as it would just compound the problem.
Unless we get a drastic change in the level of error detection and self-validation that can be done by an LLM, this remains a problem for the foreseeable future.
How is it then that people build tooling where the LLM validates the code they write? Or claim 2x speedups for code written by LLMs? Is there some kind of false positive/negative tradeoff I'm missing that allows people to extract robust software from an inherently not-robust generation process?
I'm not talking about search and documentation, where I'm already seeing a lot of benefit from LLMs today, because between the LLM output and the code is me, sanity checking and filtering everything. What I'm asking about is the: "LLM take the wheel!" type engineering.
by RainyDayTmrw on 5/5/25, 6:17 AM
I'm skeptical. Most of us maintaining medium sized codebases or larger are constantly fighting nondeterminism in the form of flaky tests. I can't imagine choosing a design that starts with nondeterminism baked in.
And if you're really dead-set on paying nondeterminism to get more coverage, property-based testing has existed for a long time and has a comparatively solid track record.
by dragonwriter on 5/5/25, 6:03 AM
This is more of "LLM code review" than any kind of testing, and calling it "testing" is just badly misleading.
by jonathanlydall on 5/5/25, 5:40 AM
If you’re stuck with dynamically typed languages, then tests like this can make a lot of sense.
On statically typed languages this happens for free at compile time.
I’ve often heard proponents of dynamically typed languages say how all the typing and boiler plate required by statically typed languages feels like such a waste of time, and on a small enough system maybe they are right.
But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.
They also allow trivial refactoring that people using dynamically typed languages wouldn’t even consider due to the risk being so high.
So keep this all in mind when you next choose your language for a new project.
by yuliyp on 5/5/25, 5:40 AM
Did the author do any analysis of the effectiveness of their tool on something beyond multiplication? Did they look to see if it caught any bugs in any codebases? What's the false positive rate? False negative?
As is it's neat that they wrote some code to generate some prompts for an LLM but there's no idea if it actually works.
by lgiordano_notte on 5/5/25, 11:36 AM
Treating docstrings as the spec and asking an LLM to flag mismatches feels promising in theory but personally I'd b wary of overfitting to underspecified docs. Might be useful as a lint-like signal, but hard to see it replacing real tests just yet.
by masklinn on 5/5/25, 6:52 AM
> But here’s the catch: you’re missing some edge cases. What about negative inputs?
The docstring literally says it only works with positive integers, and the LLM is supposed to follow the docstring (per previous assertions).
> The problem is that traditional tests can only cover a narrow slice of your function’s behavior.
Property tests? Fuzzers? Symbolic execution?
> Just because a high percentage of tests pass doesn’t mean your code is bug-free.
Neither does this thing. If you want your code to be bug-free what you're looking for is a proof assistant not vibe-reviewing.
Also
> One of the reasons to use suite is its seamless integration with pytest.
Exposing a predicate is not "seamless integration with pytest", it's just exposing a predicate.
by evanb on 5/5/25, 10:44 AM
> Beware of bugs in the above code; I have only proved it correct, not tried it.
-- Donald Knuth, Notes on the van Emde Boas construction of priority deques: An instructive use of recursion (1977)
https://www-cs-faculty.stanford.edu/~knuth/faq.html
by simianwords on 5/5/25, 5:38 AM
I was a bit skeptical at first but I think this is a good idea. Although I'm not convinced with the usage of max_depth parameter. In real life you rarely know what type your dependencies are if they are loaded at run time. This is kind of why we explicitly mock our dependencies.
On a side note: I have wondered whether LLM's are particularly good with functional languages. Imagine if your code entirely consisted of just pure functions and no side effects. You pass all parameters required and do not use static methods/variables and no OOP concepts like inheritance. I imagine every program can be converted in such a way, the tradeoff being human readability.
by rollulus on 5/5/25, 5:50 AM
I wonder if the random component of the LLM makes every test flaky by definition.
by gnabgib on 5/5/25, 5:37 AM
This seems to be your site @op.. your CSS needs attention. On a narrower screen (ie. portrait) the text is enormous, and worse, zooming out shrinks the quantity of words (increases the font-size).. which is the surely the opposite of expected? It's basically unusable.
Your CSS seems to assume all portrait screens (whether 80" or 3") deserve the same treatment.
by cerpins on 5/5/25, 7:35 AM
It sounds like it might be a good use case for testing documentation - verifying whether what documentation describes is actually in accordance with the code, and then you can act on it. With that in mind, it's also probably pointless to re-run if relevant code or documentation hasn't changed.
by gavmor on 5/5/25, 8:22 PM
If you don't try static typing, first, I feel like you're leaving money on the table... on your way to burn a pile of money.
Right? If you're looking to reduce bugs and errors... this is like putting a jetpack on a window-washer without even considering a carabiner harness.
by jmull on 5/5/25, 11:52 AM
This is probably better thought of as AI-assisted code review rather than unit testing.
Although you can automate running this test...
1. You may not want to blow up your token budget.
2. You probably want to manually review/use the results.
by stephantul on 5/5/25, 5:37 AM
This is cool! I think that, in general, generating test cases “offline” using an LLM and then running them using regular unit testing also solves this particular issue.
It also might be more transparent and cheaper.
by brap on 5/5/25, 11:56 AM
Skepticism aside, I think this would have worked better as a linter rule. 100% coverage out of the box. Or opt-in with linter comments.
by stoical1 on 5/5/25, 9:11 AM
Test driving a car by looking at it
by JanSchu on 5/5/25, 11:37 AM
Interesting experiment. I like that you framed it as “tests that read the docs” rather than “AI will magically find bugs”, because the former is exactly where LLMs shine: cross‑checking natural‑language intent with code.
A couple of thoughts after playing with a similar idea in private repos:
Token pressure is the real ceiling. Even moderately sized modules explode past 32k tokens once you inline dependencies and long docstrings. Chunking by call‑graph depth helps, but at some point you need aggressive summarization or cropping, otherwise you burn GPU time on boilerplate.
False confidence is worse than no test. LLMs love to pass your suite when the code and docstring are both wrong in the same way. I mitigated this by flipping the prompt: ask the model to propose three subtle, realistic bugs first, then check the implementation for each. The adversarial stance lowered the “looks good to me” rate.
Structured outputs let you fuse with traditional tests. If the model says passed: false, emit a property‑based test via Hypothesis that tries to hit the reasoning path it complained about. That way a human can reproduce the failure locally without a model in the loop.
Security review angle. LLM can spot obvious injection risks or unsafe eval calls even before SAST kicks in. Semantic tests that flag any use of exec, subprocess, or bare SQL are surprisingly helpful.
CI ergonomics. Running suite on pull requests only for files that changed keeps latency and costs sane. We cache model responses keyed by file hash so re‑runs are basically free.
Overall I would not drop my pytest corpus, but I would keep an async “semantic diff” bot around to yell when a quick refactor drifts away from the docstring. That feels like the sweet spot today.
P.S. If you want a local setup, Mistral‑7B‑Instruct via Ollama is plenty smart for doc/code mismatch checks and fits on a MacBook
by cjfd on 5/5/25, 5:35 AM
Much better solution: don't write useless docstrings.
by sigtstp on 5/5/25, 8:17 AM
I feel this makes some fundamental conceptual mistakes and is just riding the LLM wave.
"Semantics" is literally behavior under execution. This is syntactical analysis by a stochastic language model. I know the NLP literature uses "semantics" to talk about representations but that is an assertion which is contested [1].
Coming back to testing, this implicitly relies on the strong assumption of the LLM correctly associating the code (syntax) with assertions of properties under execution (semantic properties). This is a very risky assumption considering, once again, these things are stochastic in nature and cannot even guarantee syntactical correctness, let alone semantic. Being generous with the former, there is a track record of the latter often failing and producing subtle bugs [2][3][4][5]. Not to mention the observed effect of LLMs often being biased to "agree" with the premise presented to them.
It also kind of misses the point of testing, which is the engineering (not automation) task of reasoning about code and doing QC (even if said tests are later run automatically, I'm talking about their conception). I feel it's a dangerous, albeit tempting, decision to relegate that to an LLM. Fuzzing, sure. But not assertions about program behavior.
[1] A Primer in BERTology: What we know about how BERT works https://arxiv.org/abs/2002.12327 (Layers encode a mix of syntactic and semantic aspects of natural language, and it's problem-specific.)
[2] Large Language Models of Code Fail at Completing Code with Potential Bugs https://arxiv.org/abs/2306.03438
[3] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? https://arxiv.org/abs/2502.12115 (best models unable to solve the majority of coding problems)
[4] Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT https://arxiv.org/abs/2304.10778
[5] Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions https://arxiv.org/abs/2308.02312v4
EDIT: Added references
by jonstewart on 5/5/25, 11:21 AM
Does this buy carbon offsets, too?
by noodletheworld on 5/5/25, 6:10 AM
I don't think this is particularly terrible.
Broadly speaking, linters are good, and if you have a way of linting implementation errors it's probably helpful.
I would say it's probably more helpful while you're coding than at test/CI time because it will be, indubitably, flakey.
However, for a local developer workflow I can see a reasonable value in being able to go:
Take every function in my code and scan it to figure out if you think it's implemented correctly, and let me know if you spot anything that looks weird / wrong / broken. Ideally only functions that I've touched in my branch.
So... you know. Cool idea. I think it's overselling how useful it is, but hey, smash your AI into every possible thing and eventually you'll find a few modestly interesting uses for it.
This is probably a modestly interesting use case.
> suite allows you to run the tests asynchronously, and since the main bottleneck is IO (all the computations happen in a GPU in the cloud) it means that you can run your tests very fast. This is a huge advantage in comparison to standard tests, which need to be run sequentially.
uh... that said, saying that it's fast to run your functions through an LLM compared to, you know, just running tests, is a little bit strange.
I'm certain your laptop will melt if you run 500 functions in parallel through ollama gemma-3.
Running it over a network is, obviously, similarly insane.
This would also be enormously and time consuming and expensive to use with a hosted LLM api.
The 'happy path' is probably having a plugin in your IDE that scans the files you touch and then runs this in the background when you make a commit somehow using a local LLM of sufficient complexity it can be useful (gemma3 would probably work).
Kind of like having your tests in 'watch mode'; you don't expect instant feedback, but some-time-after you've done something you get a popup saying 'oh hey, are you sure you meant to return a string here..?'
Maybe it would just be annoying. You'd have to build it out properly and see. /shrug
I think it's not implausible though, that you could see something vaguely like this that was generally useful.
Probably what you see in this specific implementation is only the precursory contemplations of something actually useful though. Not really useful on its own, in its current form, imo.