from Hacker News

How Does GPT-4o Encode Images?

by olooney on 6/7/24, 12:54 PM with 112 comments

  • by ComputerGuru on 6/7/24, 4:37 PM

    We desperately need a modern open-source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs (which, aside from being the wrong tool and far too overpowered for the job, are also prone to hallucinations and have insanely expensive training and inference costs) for this purpose, because the “best” non-LLM solution is so bad it can’t even correctly OCR monospaced hi-res scans of ASCII text with sufficient accuracy.
  • by valine on 6/7/24, 2:05 PM

    LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it’s not strictly necessary for the model to read the text out of the image.

    CLIP embeddings can absolutely “read” text if the text is large enough. Tiling enables the model to read small text.
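
    As a rough illustration of the tiling idea, a sketch in Python (the tile size and the non-overlapping layout are my own assumptions here, not any model's documented pipeline):

        from PIL import Image

        def tile_image(path, tile=512):
            # Split an image into tile x tile crops; edge tiles may be smaller.
            img = Image.open(path)
            w, h = img.size
            tiles = []
            for top in range(0, h, tile):
                for left in range(0, w, tile):
                    tiles.append(img.crop((left, top,
                                           min(left + tile, w),
                                           min(top + tile, h))))
            return tiles  # each tile gets embedded separately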

  • by riemannzeta on 6/7/24, 2:19 PM

    Love this curious and open-minded exploration of how this stuff works.

    The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models:

    https://arxiv.org/abs/1410.3831

    I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive.

  • by enjoylife on 6/7/24, 6:42 PM

    > Interestingly enough, it’s actually more efficient to send text as images: A 512x512 image with a small but readable font can easily fit 400-500 tokens worth of text, yet you’re only charged for 170 input tokens plus the 85 for the ‘master thumbnail’ for a grand total of 255 tokens—far less than the number of words on the image.

    Sounds like an arbitrage opportunity for all those gpt wrappers. Price your cost per token the same, send over the prompt via image, pocket the difference?
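
    For concreteness, the billed cost can be reproduced from the pricing rules OpenAI has published for high-detail images (fit within a 2048x2048 square, downscale the shortest side to at most 768px, then charge 170 tokens per 512px tile plus 85 base tokens); this sketch matches the article's 255-token figure for a 512x512 image:

        import math

        def image_tokens(width, height):
            # Fit within a 2048x2048 square (downscale only).
            if max(width, height) > 2048:
                s = 2048 / max(width, height)
                width, height = int(width * s), int(height * s)
            # Downscale so the shortest side is at most 768px.
            if min(width, height) > 768:
                s = 768 / min(width, height)
                width, height = int(width * s), int(height * s)
            # 170 tokens per 512px tile, plus 85 for the master thumbnail.
            tiles = math.ceil(width / 512) * math.ceil(height / 512)
            return 170 * tiles + 85

        print(image_tokens(512, 512))    # 255
        print(image_tokens(2048, 4096))  # 1105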

  • by simonw on 6/7/24, 1:42 PM

    Something I don't get is why OpenAI doesn't provide clear, comprehensive documentation as to how this actually works.

    I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.

    To make good judgements about how to use this stuff I need to know how it works!

    I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...

    If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
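
    The failure mode is easy to see with the resize rules sketched above (the page count and pixel sizes here are made-up numbers for illustration):

        # Ten PDF pages stitched vertically into one 1000 x 12000 px image:
        w, h = 1000, 12000
        s = 2048 / max(w, h)             # fit within 2048x2048
        w, h = int(w * s), int(h * s)    # -> 170 x 2048
        print(w, h, round(16 * s, 1))    # a 16px-tall line of text is now ~2.7px: illegible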

  • by rafaelero on 6/7/24, 2:31 PM

    They are very likely using a VQ-VAE to create a dictionary of tokens and then just converting images into them with an encoder.
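
    If that's right, the core of the encoding step is just a nearest-neighbor lookup into the learned dictionary; a minimal sketch (the dictionary size and vector width are arbitrary here):

        import torch

        def vq_quantize(z, codebook):
            # Nearest-neighbor lookup at the heart of a VQ-VAE/VQGAN: each
            # encoder output vector is replaced by the index of the closest
            # codebook entry. All shapes are illustrative.
            dists = torch.cdist(z, codebook)   # (n, K) pairwise distances
            ids = dists.argmin(dim=1)          # (n,) discrete image "tokens"
            return ids, codebook[ids]          # token ids and quantized vectors

        ids, zq = vq_quantize(torch.randn(169, 64), torch.randn(8192, 64))
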
  • by comboy on 6/7/24, 7:35 PM

    I love how well this is written. Definitely "look how interesting this is" rather than "look how much I know". And it dives as deep as it needs to, while remaining accessible to almost everyone. One really needs to master a topic to be able to describe it simply. Great job.
  • by GaggiX on 6/7/24, 1:40 PM

    An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), meaning it's very likely trained on sequential image tokens and the images are quantized using a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches that would output 13x13 tokens on a 512x512 image (unless they just added a padding of 4 to the entire image and the patches are not overlapping).
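
    The arithmetic behind that guess:

        print(13 * 13)      # 169 patch tokens per 512x512 tile
        print(13 * 13 + 1)  # 170, matching the per-tile token charge
        print(512 / 13)     # ~39.4px per patch: not an integer, hence the
                            # overlap-or-padding speculation (512 + 2*4 = 520 = 13 * 40)
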
  • by cs702 on 6/7/24, 2:42 PM

    One possibility is that mapping images to a token embedding consumes ~170x more compute+space than mapping a token id.

    Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.
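
    One speculative shape for such a model is a bank of 170 learned queries cross-attending over a variable number of patch vectors (Perceiver/Flamingo-style). A sketch, with all sizes illustrative and nothing here disclosed by OpenAI:

        import torch
        import torch.nn as nn

        class ImageResampler(nn.Module):
            # Maps any number of patch embeddings to exactly 170 output vectors.
            def __init__(self, d_model=1024, n_out=170):
                super().__init__()
                self.queries = nn.Parameter(torch.randn(n_out, d_model))
                self.attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                  batch_first=True)

            def forward(self, patches):               # (B, n_patches, d_model)
                q = self.queries.expand(patches.size(0), -1, -1)
                out, _ = self.attn(q, patches, patches)
                return out                            # (B, 170, d_model)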

  • by HarHarVeryFunny on 6/7/24, 3:49 PM

    I don't think a 13x13 tiling (of N channels/features) can be ruled out just because it can't recognize a grid of 13x13 objects. There is presumably a lot of overlap between the receptive fields of the tiles (due to kernel step sizes).

    A pyramid of overlapped tiling resolutions is of course possible too.
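
    The standard output-size formula makes the point concrete. For example, a (made-up) 64px kernel with stride 37 yields a 13x13 grid over a 512px tile, with heavy overlap between neighboring receptive fields:

        def conv_out(size, kernel, stride, pad=0):
            # Standard convolution output-size formula.
            return (size + 2 * pad - kernel) // stride + 1

        print(conv_out(512, kernel=64, stride=37))  # 13; kernel > stride means overlap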

  • by simonw on 6/7/24, 1:35 PM

    The way this tests GPT-4o's performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about halfway down the page) is really clever.
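
    That probe is easy to reproduce; here is a sketch that generates such a grid along with its ground truth (the colors, shapes, and cell size are stand-ins, not the article's exact choices):

        import random
        from PIL import Image, ImageDraw

        COLORS = ["red", "green", "blue", "yellow", "purple", "orange", "black"]

        def make_grid(n=7, cell=64):
            img = Image.new("RGB", (n * cell, n * cell), "white")
            draw = ImageDraw.Draw(img)
            truth = []
            for r in range(n):
                for c in range(n):
                    color = random.choice(COLORS)
                    shape = random.choice(["circle", "square"])
                    box = (c * cell + 8, r * cell + 8,
                           (c + 1) * cell - 8, (r + 1) * cell - 8)
                    (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
                    truth.append({"row": r, "col": c, "color": color, "shape": shape})
            return img, truth  # score the model's JSON answer against `truth`
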
  • by geor9e on 6/7/24, 3:32 PM

    Nit: I'd push back on the implied premise that this isn't a beautiful and skilled painting: https://www.oranlooney.com/post/gpt-cnn_files/malicious_dogs...
  • by iknownothow on 6/7/24, 2:13 PM

    I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you find its embedding in the lookup table.

    The author is speculating about an embedding model, but in reality they're speculating about the image tokenizer.

    If I'm not mistaken, the text tokenizer tiktoken has a dictionary size of 50k. The image tokenizer could have a very large or a very small dictionary size. The 170 tokens this image tokenizer generates might actually include repeating tokens!

    EDIT: PS. What I meant to say was that input embeddings do not come from another trained model; tokens do. The input embedding matrix undergoes backpropagation (learning). This is very important: it allows the model to move the embeddings of tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
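
    In code, the text-side lookup the comment describes is just an nn.Embedding (sizes illustrative):

        import torch
        import torch.nn as nn

        vocab_size, d_model = 50_000, 768          # illustrative sizes
        embed = nn.Embedding(vocab_size, d_model)  # learned via backprop, as described
        token_ids = torch.tensor([[15496, 995]])   # whatever the tokenizer emits
        vectors = embed(token_ids)                 # (1, 2, 768): a pure table lookup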

  • by blixt on 6/7/24, 1:47 PM

    I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the exact pixel dimensions of the image (post-resize, since the pipeline has a max image size in addition to the 512x512 tiles), but I'm 99% sure they're not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from the system prompt to tool details, but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get the dimensions is to tell it to output a structure that contains width and height and to just pick something reasonable, and the values will "randomly" be correct:

    https://x.com/blixt/status/1722298733470024076
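
    A minimal version of that probe with the OpenAI Python client (the model name, prompt, and URL are placeholders; blixt's exact setup may differ):

        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": 'Reply only with JSON like {"width": 0, "height": 0} for this image.'},
                    {"type": "image_url",
                     "image_url": {"url": "https://example.com/photo.jpg"}},
                ],
            }],
        )
        print(resp.choices[0].message.content)  # the values come back suspiciously correct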

  • by joelburget on 6/7/24, 2:04 PM

    Vision transformers should be our default guess as to how GPT-4o works, yet this article never mentions them.
  • by sva_ on 6/7/24, 2:46 PM

    Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to compute use per token in the LM).

  • by yorwba on 6/7/24, 2:29 PM

    It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy.
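
    A simple way to run that experiment is to paste the same test grid at increasing horizontal offsets and score each variant (the step sizes here are arbitrary):

        from PIL import Image

        def shifted_variants(img, step=16, max_shift=128):
            variants = []
            for dx in range(0, max_shift + 1, step):
                canvas = Image.new("RGB", (img.width + max_shift, img.height), "white")
                canvas.paste(img, (dx, 0))    # objects drift across tile boundaries
                variants.append((dx, canvas))
            return variants                   # score model accuracy at each offset
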
  • by SubiculumCode on 6/7/24, 7:28 PM

    I'm not sure how GPT-4o routes information. If a picture containing text is submitted, does the text then get resubmitted to GPT-4o as a textual query, or do the model weights themselves essentially transform the images of text into textual tokens? I do wonder if a response to images of text is similar to a response to text queries, i.e. processed by the same weights.
  • by imranhou on 6/7/24, 6:15 PM

    Not to be nit-picky, but double-checking myself: isn't a token just ~0.75 words, so 170 tokens would be about 127 words, not 227?
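
    (For reference: 170 tokens × ~0.75 words/token ≈ 128 words, which agrees with the 127 above.)
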
  • by tantalor on 6/7/24, 1:44 PM

    > CLIP embeds the entire image as a single vector, not 170 of them.

    Single token?

    > GPT-4o must be using a different, more advanced strategy internally

    Why?

  • by jmount on 6/7/24, 2:51 PM

    Scanning images is quite the problem in the presence of compression (and now interpolation): https://www.bbc.com/news/technology-23588202
  • by jamesy0ung on 6/7/24, 10:44 PM

    I've always wondered how text-to-image models like Stable Diffusion work. Do they just encode RGB values into a matrix and then have a helper tool convert that data into a JPG?
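
    Roughly: latent diffusion models like Stable Diffusion denoise a latent tensor, a separate VAE decoder turns that latent into RGB pixels, and writing the JPG is ordinary file I/O. A sketch with the diffusers library (the model id is one common choice, not the only one):

        import torch
        from diffusers import StableDiffusionPipeline

        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")
        image = pipe("a lighthouse at dusk").images[0]  # VAE-decoded latents as a PIL image
        image.save("lighthouse.jpg")
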
  • by rvnx on 6/7/24, 1:54 PM

    The author claims that the most likely explanation is that Tesseract is running behind GPT-4V/GPT-4o.

    There is no way that this is Tesseract.

    -> Tesseract's accuracy is very low; it can barely do OCR on printed documents.
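
    That claim is easy to test yourself: run Tesseract on the same image you give GPT-4o and compare the transcripts, e.g. via the pytesseract wrapper:

        from PIL import Image
        import pytesseract  # thin wrapper around the tesseract CLI

        print(pytesseract.image_to_string(Image.open("scan.png")))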

  • by alach11 on 6/7/24, 1:48 PM

    I really hope we see improvements in the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldy workarounds in applications.
  • by eminence32 on 6/7/24, 1:52 PM

    I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
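
    The text side, at least, is public; whether image tokens share this ID space is undisclosed. A quick look via tiktoken (GPT-4o's text encoding is o200k_base):

        import tiktoken

        enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's text tokenizer
        print(enc.n_vocab)                         # ~200k text token ids
        print(enc.encode("hello world"))           # text-only; image tokens are undisclosed
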
  • by sashank_1509 on 6/7/24, 7:31 PM

    I would be disappointed if OpenAI had a separate model for OCR, though I guess that is believable. Much cooler if the LLM just understands the language in the image directly.