from Hacker News

How we made our AI code review bot stop leaving nitpicky comments

by dakshgupta on 12/18/24, 4:31 PM with 169 comments

  • by Retr0id on 12/22/24, 11:44 AM

    > LLMs (which are paid by the token)

    Straightforwardly true, and yet I'd never thought about it like this before, i.e. that there's a perverse incentive for LLM vendors to tune for verbose outputs. We rightly raise eyebrows at the idea of developers being paid per volume of code, but it's the default for LLMs.

  • by iLoveOncall on 12/21/24, 11:24 PM

    It's funny because I wouldn't consider the comment that they highlight in their post as a nitpick.

    Something that has an impact on the long-term maintainability of code is definitely not nitpicky, and in the majority of cases "define a type" fits this category, as it makes refactors and extensions MUCH easier.

    On top of that, I think the approach they went with is a huge mistake. The same comment can be a nitpick on one CR but crucial on another; clustering them is destined to result in false positives and false negatives.

    I'm not sure I'd want to use a product to review my code for which 1) I cannot customize the rules, 2) it seems like the rules chosen by the creators are poor.

    To be honest I wouldn't want to use any AI-based code reviewer at all. We have one at work (FAANG, so something with a large dedicated team) and it has not once produced a useful comment and instead has been factually wrong many times.

  • by pama on 12/21/24, 8:13 PM

    After failing with three reasonable ideas, they solved the problem with an idea that previously would have seemed unlikely to work (get similarity to past known nits and use a threshold of 3 similar nits above a certain cutoff similarity score for filtering). A lot of ML applications have a similarly hacky trial-and-error flavor. Intuition is often built in retrospect and may not transfer well to other domains. I would have guessed that finetuning would also work, but I agree with the author that it would be more expensive and less portable across different models.
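
    A minimal sketch of the filtering rule described above, assuming embeddings are computed elsewhere; the 0.85 cutoff is an illustrative placeholder, not the authors' actual value:

      import numpy as np

      def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def should_suppress(comment_vec: np.ndarray,
                          past_nit_vecs: list[np.ndarray],
                          cutoff: float = 0.85,   # assumed similarity cutoff
                          min_matches: int = 3) -> bool:
          # Count past downvoted nits that look sufficiently similar to the new
          # comment; suppress it once at least min_matches such matches exist.
          matches = sum(1 for v in past_nit_vecs if cosine_sim(comment_vec, v) >= cutoff)
          return matches >= min_matches
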
  • by righthand on 12/22/24, 5:32 AM

    Our AI code review bot (Codacy) is just an LLM that compiles all linter rules and might be the most annoying useless thing. For example it will ding your PR for not considering Opera browser limitations on a backend NodeJS PR.

    Furthermore, in most of the code reviews I perform I rarely leave commentary. There are so many frameworks and libraries today that solve whatever problem that, unless someone adds complex code or puts a file in a goofy spot, it’s an instant approval. So an AI bot doesn’t help with what is already a minimal, non-problem task.

  • by XenophileJKO on 12/21/24, 10:25 PM

    Unless they were using a model too stupid for this operation, the fundamental problem is usually solvable via prompting.

    Usually the issue is that the models have a bias for action, so you need to give it an acceptable action when there isn't a good comment. Some other output/determination.

    I've seen this in many other similar applications.

  • by jerrygoyal on 12/22/24, 5:07 AM

    I'm in the market for PR review bots, as the nitpicking issue is real. So far, I have tried Coderabbit, but it adds too much noise to PRs, and only a very small percentage of comments are actually useful. I specifically instructed it to ignore nitpicks, but it still added such comments. Their cringy ASCII art comments make it even harder to take them seriously.

    I recently signed up for Korbit AI, but it's too soon to provide feedback. Honestly, I’m getting a bit fed up with experimenting with different PR bots.

    Question for the author: In what ways is your solution better than Coderabbit and Korbit AI?

  • by utdiscant on 12/22/24, 6:36 PM

    "We picked the latter, which also gave us our performance metric - percentage of generated comments that the author actually addresses."

    This metric would go up if you leave almost no comments. Would it not be better to find a metric that rewards you for generating many comments which are addressed, not just having a high relevance?

    You even mention this challenge yourselves: "Sadly, even with all kinds of prompting tricks, we simply could not get the LLM to produce fewer nits without also producing fewer critical comments."

    If that was happening, that doesn't sound like it would be reflected in your performance metric.

  • by extr on 12/22/24, 5:49 AM

    I'm surprised the authors didn't try the "dumb" version of the solution they went with: instead of using fancy cosine similarity to create implicit clusters, just ask it to classify the comment along a few dimensions and then do your own filtering on that (or create your own 0-100 scoring!). Seems like you would have more control that way and actually derive some rich(er) data to fine-tune on. It seems they are already almost doing this: all the examples in the article start with "style"!

    I have seen this pattern a few times actually, where you want the AI to mimic some heuristic humans use. You never want to ask it for the heuristic directly; just create the constituent data so you can do some simple regression or whatever on top of it and control the cutoff yourself.
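
    A rough sketch of that "classify on fixed dimensions, then filter yourself" idea; the dimensions, weights, and threshold below are made up for illustration:

      import json

      CLASSIFY_PROMPT = """For the review comment below, return JSON with:
        "category": one of ["style", "correctness", "security", "performance"],
        "blocking": true or false.
      Comment: {comment}"""

      WEIGHTS = {"style": 10, "performance": 40, "correctness": 80, "security": 90}

      def score(classification: dict) -> int:
          # Convert the model's structured labels into a number you control.
          s = WEIGHTS.get(classification["category"], 0)
          if classification.get("blocking"):
              s += 20
          return s

      def keep(raw_llm_json: str, threshold: int = 50) -> bool:
          return score(json.loads(raw_llm_json)) >= threshold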

  • by jumploops on 12/22/24, 1:05 AM

    We do something internally[0] but specifically for security concerns.

    We’ve found that by having the LLM provide a “severity” level (simply low, medium, high), we’re able to filter out all the nitpicky feedback.

    It’s important to note that this severity level should be specified at the end of the LLM’s response, not the beginning or middle.

    There’s still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. make sure to sanitize X input).

    We haven’t found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.

    [0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
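
    A small sketch of the "severity label at the end of the response" filter described above; the "severity:" convention is an assumed prompt format, not necessarily theirs:

      import re

      SEVERITY_RE = re.compile(r"severity:\s*(low|medium|high)\s*$", re.IGNORECASE)

      def parse_severity(response: str) -> str | None:
          match = SEVERITY_RE.search(response.strip())
          return match.group(1).lower() if match else None

      def keep_comment(response: str) -> bool:
          # Keep only medium/high; treat unparseable responses as low severity.
          return parse_severity(response) in {"medium", "high"}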

  • by hsbauauvhabzb on 12/22/24, 12:55 PM

    I can’t fathom a world where an LLM would be able to review code in any meaningful way -at all-.

    It should not substitute for a human, and it probably wastes more effort than it saves, by a wide margin.

  • by panarchy on 12/22/24, 12:19 AM

    I wonder how long until we end up with questionable devs making spurious changes just to try and game the LLM output to give them a pass.

  • by anonzzzies on 12/22/24, 5:34 AM

    We found that in general it's pretty hard to make the LLM just stop without human intervention. You can see with things like Cline that if the LLM has to check its own work, it'll keep making 'improvements' in a loop: removing all comments, adding all comments, etc. It needs to generate something and seems overly eager to give you something.

  • by dbetteridge on 12/21/24, 10:59 PM

    Prompt:

    If the comment could be omitted without affecting the code's functionality, but is stylistic or otherwise can be ignored, then preface the comment with

    NITPICK

    I'm guessing you've tried something like the above and then filtering on the preface, as you mentioned the LLM being bad at understanding what is and isn't important.
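
    A minimal version of that prefix-filter idea; the prompt wording is from the comment above, and the helper name is just for illustration:

      def filter_nitpicks(comments: list[str]) -> list[str]:
          # Drop anything the model chose to preface with NITPICK.
          return [c for c in comments if not c.lstrip().upper().startswith("NITPICK")]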

  • by untech on 12/22/24, 12:49 AM

    I am not sure about using UPPERCASE in prompts for emphasis. I feel intuitively that uppercase is less “understandable” for LLMs because it is more likely to be tokenized as a sequence of characters. I have no data to back this up, though.

  • by AgentOrange1234 on 12/22/24, 3:42 PM

    I’d be curious to hear more on Attempt 2. It sounds like the approach was basically to ask an LLM for a score for each comment. Adding specifics to this prompt might go a long way? Like, what specifically is the rationale for this change, is this likely to be a functional bug, is it a security issue, how does it impact maintainability over the long run, etc. Basically, I wonder if asking about more specific criteria and trying to define what you mean by nits can help the LLM give you more reliable scores.

  • by throw310822 on 12/21/24, 11:11 PM

    Seems easier than getting the same from my colleagues.

  • by iandanforth on 12/22/24, 5:09 PM

    "As a last resort we tried machine learning ..."

    - Hilarious that a cutting edge solution (document embedding and search) from 5-6 years ago was their last resort.

    - Doubly hilarious that "throw more AI at it" surprised them when it didn't work.

  • by tayo42 on 12/22/24, 2:41 AM

    > Attempt 2: LLM-as-a-judge

    Wouldn't this be achievable with a classifier model? Maybe even a combo of getting the embedding and then putting it through a classifier? Kind of like how GANs work.

    Edit: I read the article before the comment section, silly me lol

  • by pedrovhb on 12/22/24, 12:56 AM

    Here's an idea: have the LLM output each comment with a "severity" score ranging from 0-100 or maybe a set of possible values ("trivial", "small", "high"). Let it get everything off of its chest outputting the nitpicks but recognizing they're minor. Filter the output to only contain comments above a given threshold.

    It's hard to avoid thinking of a pink elephant, but easy enough to consciously recognize it's not relevant to the task at hand.
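
    A sketch of that "let it vent, then threshold" idea, assuming the model is asked to return every comment as JSON with a 0-100 severity field; the shape and cutoff are illustrative:

      def surface(comments: list[dict], threshold: int = 60) -> list[dict]:
          # comments: [{"body": "...", "severity": 0-100}, ...]
          # Everything below the threshold stays off the PR.
          return [c for c in comments if c.get("severity", 0) >= threshold]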

  • by profsummergig on 12/22/24, 9:05 AM

    Is there a name for the activity of trying out different strategies to improve the output of AI?

    I found this article surprisingly enjoyable and interesting, and I'd like to find more like it.

  • by pcwelder on 12/22/24, 1:18 PM

    A whole article without mentioning the name of the LLM. It's not like Sonnet and O1 have the same modalities.

    Anything you do today might become irrelevant tomorrow.

  • by callamdelaney on 12/22/24, 2:11 AM

    Does it work on real engineers?
  • by kgeist on 12/21/24, 11:25 PM

    What about false positives?

    As I see it, the solution assumes the embeddings only capture the form: say, if developers previously downvoted suggestions to wrap code in unnecessary try..catch blocks, then similar suggestions will be successfully blocked in the future, regardless of the module/class etc. (i.e. a kind of generalization)

    But what if enough suggestions regarding class X (or module X) get downvoted, and then the mechanism starts assuming class X/module X doesn't need review at all? I mean the case when a lot of such embeddings end up clustering around the class itself (or a function), not around the general form of the comment.

    How do you prevent this? Or is it unlikely to happen? The only metric I've found in the article is the percentage of addressed suggestions that made it to the end user.

  • by just-another-se on 12/22/24, 3:15 AM

    And how do you guys deal with the cold-start problem? Suppose the repo is new and has very few previous comments?

  • by planetpluta on 12/22/24, 12:28 PM

    > Essentially we needed to teach LLMs (which are paid by the token) to only generate a small number of high quality comments.

    The solution of filtering after the comment is generated doesn’t seem to address the “paid by the token” piece.

  • by aarondia on 12/22/24, 6:18 PM

    I find these retro blog posts from people building LLM solutions super useful for helping me try out new prompting, eval, etc. techniques. Which blog posts have you all found most useful? Would love a link.

  • by Havoc on 12/22/24, 2:17 AM

    Think it would have gone better initially had they not used "nits" but rather "nitpicks", i.e. something that's in the dictionary that the chatbot is likely to understand.

  • by lupire on 12/22/24, 8:24 PM

    Instead of training an LLM to make comments, train the LLM to generate test cases and then guess the line of code that breaks the test.
  • by keybored on 12/22/24, 11:07 AM

    Why review bots? Why not a bot that runs as part of the validation suite? Then you dismiss and follow up on what you want.

    You can run that locally.

  • by nikolayasdf123 on 12/22/24, 1:54 AM

    > $0.45/file capped at $50/dev/month

    Wow, this is really expensive... especially given that the core of this technology is open source and target customers can set it up themselves, self-hosted.

  • by Kwpolska on 12/22/24, 11:12 AM

    > Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

    https://news.ycombinator.com/newsguidelines.html

  • by Falimonda on 12/21/24, 11:28 PM

    fwiw, not all the mobile site's menu items work when clicked
  • by fnqi8ckfek on 12/22/24, 8:49 AM

    Reading the other comments here, am I the only one thinking that this is just busywork? Just get rid of the thing; it's a solution without a problem.

  • by thomasahle on 12/22/24, 11:54 AM

    TLDR:

    > Giving few-shot examples to the generator didn't work.

    > Using an LLM-judge (with no training) didn't work.

    > Using an embedding + KNN-classifier (lots of training data) worked.

    I don't know why they didn't try fine-tuning the LLM-judge, or at least giving it some few-shot examples.

    But it shows that embeddings can make very simple classifiers work well.
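
    A toy illustration of that last combination, embeddings plus a k-NN classifier; random vectors stand in for real comment embeddings and addressed/ignored labels:

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier

      rng = np.random.default_rng(0)
      past_vecs = rng.random((200, 768))        # embeddings of past review comments
      past_labels = rng.integers(0, 2, 200)     # 1 = author addressed it, 0 = ignored

      clf = KNeighborsClassifier(n_neighbors=5, metric="cosine")
      clf.fit(past_vecs, past_labels)

      new_vec = rng.random((1, 768))            # embedding of a candidate comment
      post_it = bool(clf.predict(new_vec)[0])   # post only if predicted "addressed"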

  • by Nullabillity on 12/21/24, 9:17 PM

    [flagged]

  • by dcreater on 12/22/24, 4:15 AM

    Are you still looking for slaves, err I mean employees, who are going to work 80+hrs a week? Do you sponsor visas?