by dakshgupta on 12/18/24, 4:31 PM with 169 comments
by Retr0id on 12/22/24, 11:44 AM
Straightforwardly true, and yet I'd never thought about it like this before, i.e. that there's a perverse incentive for LLM vendors to tune for verbose outputs. We rightly raise eyebrows at the idea of developers being paid by volume of code, but it's the default for LLMs.
by iLoveOncall on 12/21/24, 11:24 PM
Something that has an impact on the long-term maintainability of code is definitely not nitpicky, and in the majority of cases "define a type" fits this category, as it makes refactors and extensions MUCH easier.
On top of that, I think the approach they went with is a huge mistake. The same comment can be a nitpick on one CR but crucial on another; clustering them is destined to result in false positives and false negatives.
I'm not sure I'd want to use a product to review my code for which 1) I cannot customize the rules, 2) it seems like the rules chosen by the creators are poor.
To be honest I wouldn't want to use any AI-based code reviewer at all. We have one at work (FAANG, so something with a large dedicated team) and it has not once produced a useful comment and instead has been factually wrong many times.
by pama on 12/21/24, 8:13 PM
by righthand on 12/22/24, 5:32 AM
Furthermore, in most of the code reviews I perform I rarely leave commentary. There are so many frameworks and libraries today that solve whatever problem that, unless someone adds complex code or puts a file in a goofy spot, it's an instant approval. So an AI bot doesn't help with what is already a minimal, non-problem task.
by XenophileJKO on 12/21/24, 10:25 PM
Usually the issue is that the models have a bias for action, so you need to give them an acceptable action to take when there isn't a good comment: some other output/determination.
I've seen this in many other similar applications.
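A minimal sketch of that idea (the JSON schema and wording below are my assumptions, not something from this thread): give the model an explicit no-op action so that declining to comment is itself an acceptable output.

    import json

    # Sketch only: make "no significant issues" a first-class action in the
    # output schema, so staying silent is a valid completion rather than a
    # failure to act.
    REVIEW_INSTRUCTIONS = (
        'Respond with JSON. Either {"action": "comment", "body": "..."} '
        'or, if nothing rises above a nitpick, {"action": "no_significant_issues"}.'
    )

    def parse_review(raw_response: str) -> dict | None:
        """Return the comment payload, or None when the model took the no-op action."""
        result = json.loads(raw_response)
        if result.get("action") == "no_significant_issues":
            return None
        return result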
by jerrygoyal on 12/22/24, 5:07 AM
I recently signed up for Korbit AI, but it's too soon to provide feedback. Honestly, I’m getting a bit fed up with experimenting with different PR bots.
Question for the author: In what ways is your solution better than Coderabbit and Korbit AI?
by utdiscant on 12/22/24, 6:36 PM
This metric would go up if you leave almost no comments. Would it not be better to find a metric that rewards you for generating many comments which are addressed, not just having a high relevance?
You even mention this challenge yourselves: "Sadly, even with all kinds of prompting tricks, we simply could not get the LLM to produce fewer nits without also producing fewer critical comments."
If that were happening, it doesn't sound like it would be reflected in your performance metric.
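To make the concern concrete, one hedged illustration (mine, not a metric from the article): a pure address rate is maximized by posting almost nothing, so you would also want something that grows with the number of useful comments actually surfaced.

    def address_rate(addressed: int, posted: int) -> float:
        """Precision-style metric: fraction of posted comments that were addressed.
        Can be gamed by posting very few comments."""
        return addressed / posted if posted else 0.0

    def addressed_per_pr(addressed: int, prs_reviewed: int) -> float:
        """Volume-aware counterpart: addressed comments per PR reviewed.
        Rewards surfacing more genuinely useful comments, not just staying quiet."""
        return addressed / prs_reviewed if prs_reviewed else 0.0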
by extr on 12/22/24, 5:49 AM
I have seen this pattern a few times actually, where you want the AI to mimic some heuristic humans use. You never want to ask it for the heuristic directly; just have it produce the constituent data so you can do some simple regression or whatever on top of it and control the cutoff yourself.
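A sketch of that pattern (the feature names, labels, and cutoff are illustrative assumptions): ask the LLM only for narrow constituent judgments, then fit a simple model on human feedback and keep the decision threshold in your own hands.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical per-comment features scored individually by the LLM,
    # e.g. [is_stylistic, touches_correctness, touches_security].
    X = np.array([
        [1, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [1, 0, 1],
    ])
    y = np.array([0, 1, 1, 1])  # 1 = humans found the comment worth posting

    clf = LogisticRegression().fit(X, y)

    def should_post(features, cutoff=0.7):
        """Post only when the fitted probability clears a cutoff we control."""
        return clf.predict_proba([features])[0, 1] >= cutoff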
by jumploops on 12/22/24, 1:05 AM
We’ve found that by having the LLM provide a “severity” level (simply low, medium, high), we’re able to filter out all the nitpicky feedback (a rough sketch of this filtering follows below).
It’s important to note that this severity level should be specified at the end of the LLM’s response, not the beginning or middle.
There’s still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. make sure to sanitize X input).
We haven’t found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.
[0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
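A bare-bones sketch of that severity filter (the field names and threshold are my assumptions, not the linked Magic Loop):

    import json

    SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

    def filter_by_severity(raw_llm_output: str, min_severity: str = "medium") -> list[dict]:
        """Keep only comments whose severity tag clears the bar.
        Asking for severity at the end lets the model reason first and grade last."""
        comments = json.loads(raw_llm_output)  # e.g. [{"body": "...", "severity": "low"}]
        return [c for c in comments
                if SEVERITY_RANK[c["severity"]] >= SEVERITY_RANK[min_severity]]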
by hsbauauvhabzb on 12/22/24, 12:55 PM
It should not substitute for a human, and it probably wastes more effort than it saves, by a wide margin.
by panarchy on 12/22/24, 12:19 AM
by anonzzzies on 12/22/24, 5:34 AM
by dbetteridge on 12/21/24, 10:59 PM
If the comment could be omitted without affecting the code's functionality, but is stylistic or can otherwise be ignored, then preface the comment with
NITPICK
I'm guessing you've tried something like the above and then filtering on the preface, as you mentioned the LLM being bad at understanding what is and isn't important.
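For what it's worth, a minimal sketch of that prefix-and-filter approach (it assumes the model reliably emits the literal NITPICK tag, which is exactly the part the article says it struggled with):

    def drop_nitpicks(comments: list[str]) -> list[str]:
        """Discard comments the model itself prefixed with NITPICK.
        Only as reliable as the model's own judgment of what counts as a nitpick."""
        return [c for c in comments if not c.lstrip().upper().startswith("NITPICK")]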
by untech on 12/22/24, 12:49 AM
by AgentOrange1234 on 12/22/24, 3:42 PM
by throw310822 on 12/21/24, 11:11 PM
by iandanforth on 12/22/24, 5:09 PM
- Hilarious that a cutting edge solution (document embedding and search) from 5-6 years ago was their last resort.
- Doubly hilarious that "throw more AI at it" surprised them when it didn't work.
by tayo42 on 12/22/24, 2:41 AM
Wouldn't this be achievable with a classifier model? Maybe even a combo of getting the embedding and then putting it through a classifier? Kind of like how GANs work.
Edit: I read the article before the comment section, silly me lol
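As a hedged sketch of that direction (the library choice and training data here are placeholders): embed each candidate comment and train a small classifier on past human reactions, then gate on its output.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder data: embeddings of past comments plus labels
    # (1 = useful, 0 = nitpick) derived from how developers reacted.
    rng = np.random.default_rng(0)
    past_embeddings = rng.random((200, 768))
    past_labels = rng.integers(0, 2, 200)

    clf = LogisticRegression(max_iter=1000).fit(past_embeddings, past_labels)

    def keep_comment(embedding: np.ndarray, cutoff: float = 0.5) -> bool:
        """Gate a new comment on the classifier's probability of being useful."""
        return clf.predict_proba(embedding.reshape(1, -1))[0, 1] >= cutoff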
by pedrovhb on 12/22/24, 12:56 AM
It's hard to avoid thinking of a pink elephant, but easy enough to consciously recognize it's not relevant to the task at hand.
by profsummergig on 12/22/24, 9:05 AM
I found this article surprisingly enjoyable and interesting, and I'd like to find more like it.
by pcwelder on 12/22/24, 1:18 PM
Anything you do today might become irrelevant tomorrow.
by callamdelaney on 12/22/24, 2:11 AM
by kgeist on 12/21/24, 11:25 PM
As I see it, the solution assumes the embeddings only capture the form: say, if developers previously downvoted suggestions to wrap code in unnecessary try..catch blocks, then similar suggestions will be successfully blocked in the future, regardless of the module/class etc. (i.e. a kind of generalization)
But what if enough suggestions regarding class X (or module X) get downvoted, and then the mechanism starts assuming class X/module X doesn't need review at all? I mean the case when a lot of such embeddings end up clustering around the class itself (or a function), not around the general form of the comment.
How do you prevent this? Or it's unlikely to happen? The only metric I've found in the article is the percentage of addressed suggestions that made it to the end user.
by just-another-se on 12/22/24, 3:15 AM
by planetpluta on 12/22/24, 12:28 PM
The solution of filtering after the comment is generated doesn’t seem to address the “paid by the token” piece.
by aarondia on 12/22/24, 6:18 PM
by Havoc on 12/22/24, 2:17 AM
by wzdd on 12/22/24, 12:28 PM
by lupire on 12/22/24, 8:24 PM
by keybored on 12/22/24, 11:07 AM
You can run that locally.
by nikolayasdf123 on 12/22/24, 1:54 AM
wow. this is really expensive... especially given that the core of this technology is open source and target customers can set it up themselves, self-hosted
by Kwpolska on 12/22/24, 11:12 AM
by Falimonda on 12/21/24, 11:28 PM
by fnqi8ckfek on 12/22/24, 8:49 AM
by thomasahle on 12/22/24, 11:54 AM
> Giving few-shot examples to the generator didn't work.
> Using an LLM-judge (with no training) didn't work.
> Using an embedding + KNN-classifier (lots of training data) worked.
I don't know why they didn't try fine-tuning the LLM-judge, or at least give it some few-shot examples.
But it shows that embeddings can make very simple classifiers work well.
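For reference, a bare-bones sketch of the embedding + KNN filter being described (k, the vote threshold, and the use of cosine similarity are my assumptions about the details):

    import numpy as np

    def is_probable_nit(new_emb, past_embs, past_was_nit, k=5):
        """Majority vote over the k most similar past comments (cosine similarity),
        using stored human feedback on whether each one was a nit."""
        sims = past_embs @ new_emb / (
            np.linalg.norm(past_embs, axis=1) * np.linalg.norm(new_emb))
        nearest = np.argsort(sims)[-k:]
        return past_was_nit[nearest].mean() > 0.5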
by Nullabillity on 12/21/24, 9:17 PM
by dcreater on 12/22/24, 4:15 AM