by reqo on 4/24/25, 3:53 PM with 17 comments
by i5heu on 4/24/25, 4:38 PM
- Vision Transformer blocks can be run in parallel rather than strictly sequentially, reducing latency and easing optimization without sacrificing accuracy (first sketch below).
- Fine-tuning only the attention layers is often sufficient to adapt a ViT to a new task or resolution, saving compute and memory (second sketch below).
- An MLP-based patch-preprocessing stem improves masked self-supervised learning by preserving the independence of the patch embeddings (third sketch below).
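For the first point, here is a minimal PyTorch sketch (hypothetical module and parameter names, not the paper's code): instead of stacking attention and MLP sub-blocks one after another, several sub-blocks read the same input and their residual contributions are summed, which shortens the sequential depth.

```python
import torch
import torch.nn as nn

class ParallelViTBlock(nn.Module):
    """Runs `n_parallel` attention sub-blocks and `n_parallel` MLP sub-blocks
    side by side instead of one after another."""
    def __init__(self, dim=384, heads=6, mlp_ratio=4, n_parallel=2):
        super().__init__()
        self.attn_norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_parallel)])
        self.attns = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(n_parallel)
        ])
        self.mlp_norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_parallel)])
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                          nn.Linear(dim * mlp_ratio, dim))
            for _ in range(n_parallel)
        ])

    def forward(self, x):  # x: (batch, tokens, dim)
        # Parallel attention sub-blocks: each reads the same x; residuals are summed.
        attn_out = 0
        for norm, attn in zip(self.attn_norms, self.attns):
            h = norm(x)
            attn_out = attn_out + attn(h, h, h, need_weights=False)[0]
        x = x + attn_out
        # Parallel MLP sub-blocks, same pattern.
        x = x + sum(mlp(norm(x)) for norm, mlp in zip(self.mlp_norms, self.mlps))
        return x
```

A block like this drops in where a sequential ViT block would go, e.g. `ParallelViTBlock()(torch.randn(2, 197, 384))`.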
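For the second point, a sketch of attention-only fine-tuning: freeze everything except the attention weights and the classifier head. This assumes attention parameters carry "attn" in their names (as in common ViT implementations such as timm's VisionTransformer); adjust the match for other codebases.

```python
import torch.nn as nn

def freeze_all_but_attention(model: nn.Module, head_name: str = "head"):
    """Leave only attention (and head) parameters trainable."""
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = ("attn" in name) or name.startswith(head_name)
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    print(f"trainable params: {trainable:,}  frozen params: {frozen:,}")
```

Then pass only the trainable parameters to the optimizer, e.g. `torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)`.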
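For the third point, a generic sketch of a patch-independent MLP stem (not the exact hMLP stem from the paper): every patch is embedded on its own, with no cross-patch mixing, so masking patches before or after the stem is equivalent, which is what masked self-supervised pretraining needs.

```python
import torch
import torch.nn as nn

class PatchwiseMLPStem(nn.Module):
    """Embeds each patch independently with a small MLP."""
    def __init__(self, patch_size=16, in_chans=3, dim=384, hidden=768):
        super().__init__()
        self.patch_size = patch_size
        in_dim = in_chans * patch_size * patch_size
        # Everything below acts on one patch at a time: no cross-patch mixing.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, dim),
        )

    def forward(self, imgs):  # imgs: (B, C, H, W)
        p = self.patch_size
        # (B, C*p*p, num_patches): each column is one flattened patch.
        patches = nn.functional.unfold(imgs, kernel_size=p, stride=p)
        patches = patches.transpose(1, 2)  # (B, num_patches, C*p*p)
        return self.mlp(patches)           # (B, num_patches, dim)
```

`PatchwiseMLPStem()(torch.randn(2, 3, 224, 224))` returns a `(2, 196, 384)` token tensor.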