by grbsh on 2/3/24, 7:39 PM with 0 comments
I started with a single prompt to GPT4. The results were pretty terrible in terms of layout (many overlapping / obstructed elements), but (I think) show promise. Also, generation takes a really long time, because SVG path data comes out at roughly 1-2 characters per token.
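To give a feel for why the single-prompt approach is slow: the cost is dominated by output tokens, and path-heavy SVG tokenizes badly. A rough back-of-envelope sketch (the chars-per-token and tokens-per-second numbers below are illustrative guesses, not measurements):

```python
def estimate_generation_seconds(svg_chars: int,
                                chars_per_token: float = 1.5,
                                tokens_per_second: float = 20.0) -> float:
    """Estimate wall-clock time to stream an SVG of `svg_chars` characters.

    Assumes SVG path data tokenizes at ~1.5 chars/token and the model
    streams ~20 tokens/s -- both hypothetical numbers for illustration.
    """
    tokens = svg_chars / chars_per_token
    return tokens / tokens_per_second

# A 15 KB SVG under these assumptions takes on the order of minutes:
print(round(estimate_generation_seconds(15_000)))  # prints 500
```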
Next, I implemented visual critique and self-refinement. I wrote a rubric with a bunch of questions about layout and educational / communicative value. I take the initially generated SVG, convert it to PNG, and send it to gpt-4-vision-preview along with the rubric, asking for a critique and a regeneration.
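The critique-and-refine loop looks roughly like this. Here `generate_svg`, `render_png`, and `critique` are hypothetical stand-ins for the real GPT-4 call, the SVG-to-PNG rasterization, and the gpt-4-vision-preview rubric call; this is a sketch of the control flow, not the actual implementation:

```python
from typing import Callable, Tuple

def refine(generate_svg: Callable[[str], str],
           render_png: Callable[[str], bytes],
           critique: Callable[[bytes], Tuple[float, str]],
           prompt: str,
           max_rounds: int = 2,
           good_enough: float = 0.8) -> str:
    """Generate an SVG, then critique/regenerate until the rubric score clears
    a threshold or we run out of rounds. Thresholds are placeholder values."""
    svg = generate_svg(prompt)
    for _ in range(max_rounds):
        score, feedback = critique(render_png(svg))
        if score >= good_enough:
            break
        # Fold the vision model's rubric feedback back into the prompt.
        svg = generate_svg(f"{prompt}\n\nCritique of last attempt:\n{feedback}")
    return svg
```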
The generation + visual self-critique and refinement takes about 3 minutes per SVG, so I generated about 300 examples, eliminated the bottom 50% by rubric score, and fine-tuned gpt-3.5-turbo on the rest. The fine-tuned model now takes about 7-10 seconds to generate one SVG with comparable quality to the GPT4 + GPT4-V refinement pipeline (according to GPT4-V's own scoring of the rubric).
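The filtering + fine-tuning-data step is simple: rank by rubric score, keep the top half, and write chat-format JSONL for gpt-3.5-turbo fine-tuning. A minimal sketch (the system prompt and the 50% cutoff are assumptions for illustration):

```python
import json

def build_finetune_file(examples, path="train.jsonl"):
    """examples: list of (prompt, svg, rubric_score) triples.

    Keeps the top half by score and writes them in the chat-format JSONL
    that OpenAI fine-tuning expects. Returns the number of examples kept.
    """
    ranked = sorted(examples, key=lambda e: e[2], reverse=True)
    kept = ranked[: len(ranked) // 2]  # drop the bottom 50% by rubric score
    with open(path, "w") as f:
        for prompt, svg, _score in kept:
            record = {"messages": [
                # Hypothetical system prompt -- not the one actually used.
                {"role": "system", "content": "You draw educational diagrams as SVG."},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": svg},
            ]}
            f.write(json.dumps(record) + "\n")
    return len(kept)
```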
Thoughts on this approach? I'm considering doing 10-100x the compute / dataset size, wondering if this interests anyone else. Happy to expose a (free) API to the fine-tuned model if people are interested in playing around with it.