from Hacker News

Loading Pydantic models from JSON without running out of memory

by itamarst on 5/22/25, 6:06 PM with 45 comments

  • by scolvin on 5/23/25, 10:56 AM

    Pydantic author here. We have plans for an improvement to pydantic where JSON is parsed iteratively, which will make way for reading a file as we parse it. Details in https://github.com/pydantic/pydantic/issues/10032.

    Our JSON parser, jiter (https://github.com/pydantic/jiter), already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.

    This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.

  • by fidotron on 5/22/25, 11:40 PM

    I only recently ran into this. Does anyone have any insight into why it takes 2GB to handle a 100MB file?

    This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for XML parsing.
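
One contributing factor can be sketched with the standard library alone: every parsed value becomes a full Python object with its own header, so the object graph is usually several times larger than the text it came from. The helper `deep_sizeof` and the sample records below are made up for illustration, and the sketch measures only the final parsed structure, not parser temporaries or the copy of the file held in memory while parsing.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Roughly sum sys.getsizeof over nested dicts, lists and scalars.

    Hypothetical helper: shared objects (such as repeated keys) are counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, list):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


# Made-up sample data: 1000 small records.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(1000)]
text = json.dumps(records)
parsed = json.loads(text)

text_bytes = len(text.encode("utf-8"))
object_bytes = deep_sizeof(parsed)
print(f"JSON text: {text_bytes} bytes, parsed objects: {object_bytes} bytes")
```

The exact ratio depends on the Python version and the shape of the data, so treat the printed numbers as an indication rather than a benchmark.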

  • by jmugan on 5/22/25, 7:29 PM

    My problem isn't running out of memory; it's loading a complex model whose fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load all the way down and leaves some of the deeper parts as dictionaries. I almost need a parser that searches the space of possible loads. Does anyone have ideas for software that does that?
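
The idea of searching the space of possible loads can be sketched with plain dataclasses: for a union, try each member in turn and keep the first one that loads completely, recursing into nested fields. This is an illustration only, not how Pydantic resolves unions (Pydantic has its own union handling, including discriminated unions). The classes `Cat`, `Dog`, `Owner` and the function `load` are hypothetical, and the sketch assumes every field is required and that keys must match exactly.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Union, get_args, get_origin, get_type_hints


@dataclass
class Cat:
    name: str
    lives: int


@dataclass
class Dog:
    name: str
    breed: str


@dataclass
class Owner:
    name: str
    pets: list[Union[Cat, Dog]]


def load(tp, data):
    """Build a value of type `tp` from plain JSON data, or raise TypeError/ValueError."""
    origin = get_origin(tp)
    if origin is Union:
        # Try each union member in order; keep the first that loads fully.
        errors = []
        for candidate in get_args(tp):
            try:
                return load(candidate, data)
            except (TypeError, ValueError) as exc:
                errors.append(f"{candidate}: {exc}")
        raise ValueError("no union member matched: " + "; ".join(errors))
    if origin is list:
        if not isinstance(data, list):
            raise TypeError(f"expected list, got {type(data).__name__}")
        (item_type,) = get_args(tp)
        return [load(item_type, item) for item in data]
    if is_dataclass(tp):
        if not isinstance(data, dict):
            raise TypeError(f"expected dict for {tp.__name__}, got {type(data).__name__}")
        hints = get_type_hints(tp)
        names = {f.name for f in fields(tp)}
        # Simplifying assumption: keys must match the field names exactly.
        if set(data) != names:
            raise ValueError(f"{tp.__name__}: keys {sorted(data)} do not match {sorted(names)}")
        return tp(**{name: load(hints[name], data[name]) for name in names})
    if not isinstance(data, tp):
        raise TypeError(f"expected {tp.__name__}, got {type(data).__name__}")
    return data


raw = {
    "name": "Ann",
    "pets": [{"name": "Tom", "lives": 9}, {"name": "Rex", "breed": "pug"}],
}
owner = load(Owner, raw)
print(owner)
```

Because a failed candidate raises instead of falling back to a raw dictionary, a partial load surfaces as an error that lists why each union member was rejected.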
  • by deepsquirrelnet on 5/23/25, 1:00 AM

    Alternatively, if you have to go with JSON, you could consider using JSONL. I think I’d start by evaluating whether this is a good application for JSON at all. I tend to only want to use it for small files. Binary formats are usually much better in this scenario.
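
A minimal sketch of the JSONL approach, assuming one JSON object per line: each line is parsed on its own, so only one record needs to be in memory at a time. The function name `iter_jsonl` and the sample data are made up for illustration.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: {exc}") from exc


# Made-up sample; with a real file this would be open(path, encoding="utf-8").
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "name": "b"}\n'
    '\n'
    '{"id": 3, "name": "c"}\n'
)
total = sum(record["id"] for record in iter_jsonl(sample))
print(total)
```

Each yielded record could then be handed to a validator one at a time instead of validating the whole file as a single document.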
  • by dgan on 5/22/25, 8:24 PM

    I gave up on Python dataclasses and JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

    Automatic, statically typed deserialization is worth the trouble, in my opinion.

  • by fjasdfas on 5/22/25, 7:05 PM

    So are there downsides to just always setting slots=True on all of my Python data types?
  • by thisguy47 on 5/22/25, 6:57 PM

    I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.

  • by zxilly on 5/22/25, 8:21 PM

    Maybe using mmap would also save some memory, though I'm not quite sure whether this can be implemented in Python.
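
Python does ship an `mmap` module, so a memory-mapped read is possible. A minimal sketch, assuming a line-delimited file rather than one large JSON document: the standard `json.loads` accepts only `str`, `bytes` or `bytearray`, not an mmap object, so mapping a single big document and parsing it in one call would still require copying it out. The function name `iter_jsonl_mmap` and the temporary sample file are made up for illustration.

```python
import json
import mmap
import os
import tempfile


def iter_jsonl_mmap(path):
    """Yield parsed records from a JSON Lines file through a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this fails on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() copies out only the current line as bytes.
            for line in iter(mm.readline, b""):
                if line.strip():
                    yield json.loads(line)


# Made-up sample file for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    tmp.write(b'{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    sample_path = tmp.name

ids = [record["id"] for record in iter_jsonl_mmap(sample_path)]
os.remove(sample_path)
print(ids)
```

Whether this saves memory in practice depends on the parser: the map avoids reading the whole file into a Python bytes object, but the parsed objects still dominate if every record is kept.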
  • by kayson on 5/23/25, 4:07 AM

    How does the speed of the dataclass version compare?

  • by m_ke on 5/22/25, 7:31 PM

    Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/