from Hacker News

Don't mock machine learning models in unit tests

by 7d7n on 2/28/24, 6:51 AM with 76 comments

  • by dangrossman on 2/28/24, 8:06 AM

    I was expecting an article about side effects of hurting an LLM's feelings in tests.
  • by necovek on 2/28/24, 10:26 AM

    On a more serious note, the author is describing a scenario where mocks are generally not useful, ML or not: never mock the code that is under your control if you can help it.

    Also, any test that calls out to another "function" (not necessarily a programming language function) is more than a unit test, and is usually considered an "integration" test (it tests that the code that calls out to something else is written properly).

    In general, an integration point is sufficiently well covered if the logic for the integration is tested. If you properly apply DI (Dependency Inversion/Injection), replacing an external function with a fake/mock/stub implementation allows the integration point to be tested sufficiently, depending on the quality of the fake/mock/stub.
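
    As a sketch (with a hypothetical SentimentClassifier that takes its model as a constructor argument, and a fake exposing the same predict() interface), injecting the fake keeps the test fast and deterministic:

      class FakeModel:
          # Stands in for the real model; always returns a fixed score.
          def predict(self, texts):
              return [0.9 for _ in texts]

      def test_classifier_applies_threshold():
          clf = SentimentClassifier(model=FakeModel(), threshold=0.5)
          assert clf.classify(["great product"]) == ["positive"]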

    If you really want to test unpredictable output (this also applies to e.g. performance testing), you want to introduce an acceptable range (error deltas) and limit the test to exactly the point that's unpredictable by structuring the code appropriately. All the other code and tests should be able to trust that this bit of unpredictable behaviour is tested elsewhere, and be free to test against different outputs.
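
    A minimal sketch of such a tolerance-based test (train_model, evaluate and the sample datasets are hypothetical stand-ins):

      import pytest

      def test_model_accuracy_within_tolerance():
          model = train_model(sample_train_data, seed=42)
          accuracy = evaluate(model, sample_holdout_data)
          # Assert an acceptable range rather than an exact value.
          assert accuracy == pytest.approx(0.85, abs=0.05)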

  • by politelemon on 2/28/24, 7:55 AM

    A better phrasing would be: ML models are better suited to integration testing than unit testing, since the test is no longer running in isolation.
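
    One way to draw that line in practice is a marker (sketch only; assumes an "integration" marker registered in the pytest config and a hypothetical load_real_model helper):

      import pytest

      @pytest.mark.integration
      def test_model_end_to_end():
          model = load_real_model()
          assert model.predict(["some sample input"]) is not None

    The fast unit suite can then skip those with pytest -m "not integration".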
  • by pooper on 2/28/24, 11:51 AM

    I don't do any fancy research, but for my simple stuff I've mostly given up on the idea of unit tests. I still use them for some things, and they totally help in places where the logic is wonky or unintuitive, but I see my unit tests as living documentation of requirements more than actual tests. Things like making sure you get new tokens if your current ones will expire in five minutes or less.
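
    A sketch of what I mean (should_refresh and Token are hypothetical names):

      from datetime import datetime, timedelta, timezone

      def test_refreshes_token_expiring_within_five_minutes():
          soon = datetime.now(timezone.utc) + timedelta(minutes=4)
          assert should_refresh(Token(expires_at=soon)) is True

      def test_keeps_token_with_plenty_of_time_left():
          later = datetime.now(timezone.utc) + timedelta(hours=1)
          assert should_refresh(Token(expires_at=later)) is False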

    > Don’t test external libraries. We can assume that external libraries work. Thus, no need to test data loaders, tokenizers, optimizers, etc.

    I disagree with this. At $work I don't have all day to write perfect code. Neither does anyone else. I don't mock/substitute http anymore. I directly call my dependencies. If they fail, I try things out manually. If something goes wrong, I send them a message or go through their code if necessary.

    Life is too short to be dogmatic about tests. Do what works for your (dysfunctional) organization.

  • by hiddencost on 2/28/24, 8:18 AM

    The author is not describing unit tests.

    The concepts the author is looking for are integration tests and release evals.

  • by sarusso on 2/28/24, 9:53 AM

    You might also want to fix all random seeds so that you can check for exact numerical values and not “convergence” or similar concepts.
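
    For example, a minimal sketch (assuming NumPy and PyTorch are the sources of randomness in play):

      import random
      import numpy as np
      import torch

      def set_seeds(seed: int = 42) -> None:
          # Pin every source of randomness so repeated runs produce identical numbers.
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)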
  • by noduerme on 2/28/24, 9:55 AM

    >> Software : Input Data + Handcrafted Logic = Expected Output

    >> Machine Learning : Input Data + Expected Output = Learned Logic

    Let me stop you right there. No logic is learned in this process.

    [edit] Also, the LLM is inductive, not deductive. That is, it can only generalize based on observable facts, not universalize based on logical conditions. This also goes to the question of whether a logical statement itself can ever be arrived at by induction, such as whether the absence of life in the observable universe is a problem of our ability to observe or a generally applicable phenomenon. But for the purpose of LLMs we have to conclude that no, it can't find logic by reducing a set of outcomes, regardless of the size of the set. All it can do is find a set of incomprehensible equations that seem to fit the set in every example you throw at it. That's not logic, it's a lens.

  • by Hackbraten on 2/28/24, 11:58 AM

    > Avoid loading CSVs or Parquet files as sample data. (It’s fine for evals but not unit tests.) Define sample data directly in unit test code to test key functionality

    How does it matter whether I inline my test data inside the unit test code, or have my unit test code load that same data from a checked-in file instead?
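
    For context, the inlined variant presumably looks something like this (pandas, with a hypothetical add_click_through_rate under test):

      import pandas as pd

      def test_feature_engineering_adds_ctr_column():
          # Inline sample data instead of pd.read_csv("tests/fixtures/sample.csv")
          df = pd.DataFrame({"clicks": [10, 0], "impressions": [100, 50]})
          out = add_click_through_rate(df)
          assert out["ctr"].tolist() == [0.1, 0.0]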

  • by javier_e06 on 2/28/24, 1:04 PM

    The problem with mocks happens when all your unit tests pass and the program fails on integration. The mocks are a pristine place where your library's unit tests work like a champ. Bad mocks or bad library? Or both. Developers are then sent to debug the unit tests... overhead. I don't know much about ML, but I would think that it should follow some rules resembling judicial rules of precedence and witness cross-examination techniques.
  • by elif on 2/28/24, 12:18 PM

    Depends on the model honestly. If you include a GPT model in your unit tests, be prepared to run them over and over again until you get a pass, or to chase your own shadow debugging non-errors.
  • by posix_monad on 2/28/24, 11:07 AM

    Prediction for the future:

    - Algebraic Effects will land in mainstream languages, in the same way that anonymous lambda functions have

    - This will render "mocks" pointless

  • by yawpitch on 2/28/24, 6:57 AM

    Oh dear (possibly artificial) god, have they developed _feelings_?!?

    Sorry… with a title like that, I couldn’t help myself.

  • by mindcrime on 2/28/24, 3:24 PM

    Not intended as a comment on the current TFA, but based on observing many conversations on the topic of unit testing in the past, I believe this to be a true statement:

    "If you're ever lost in a wilderness setting, far from civilization, and need to be rescued, just start talking about unit testing. Somebody will immediately show up to tell you that you're doing it wrong."

  • by mellutussa on 2/28/24, 1:33 PM

    > never mock the code that is under your control if you can help it.

    This is just nonsense. It'd effectively mean you only had integration tests. While they are absolutely fantastic, they are too slow during development.