IMHO, your article is missing an important point: 90% of implementations today flatten documents to plain text before chunking them. Why not take into account the visual layout the author gave the document?
By combining layout information with semantics, you can improve RAG performance by 160% (tested via benchmarks), so why do most of us still use only the text?
Note: multimodal ≠ layout
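
To make the contrast concrete, here is a rough sketch of what I mean, not a reference implementation. It assumes some upstream parser has already labeled each block of the document; the `Block` class, the `kind` labels, and both function names are made up for illustration, and the benchmark figure above does not come from this code.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # hypothetical layout label, e.g. "heading" or "paragraph"
    text: str


def flat_chunks(blocks, size):
    """Baseline: flatten everything to plain text, then cut fixed-size windows."""
    text = " ".join(b.text for b in blocks)
    return [text[i:i + size] for i in range(0, len(text), size)]


def layout_chunks(blocks):
    """Layout-aware: start a new chunk at each heading, keep it with its body."""
    groups, current = [], []
    for block in blocks:
        if block.kind == "heading" and current:
            groups.append(current)
            current = []
        current.append(block)
    if current:
        groups.append(current)
    return ["\n".join(b.text for b in group) for group in groups]


doc = [
    Block("heading", "Refund policy"),
    Block("paragraph", "Refunds are issued within 30 days."),
    Block("heading", "Shipping"),
    Block("paragraph", "Orders ship in 2 business days."),
]

for chunk in layout_chunks(doc):
    print(repr(chunk))
```

The flat version can cut a window in the middle of a section and glue the tail of one section to the heading of the next, while the layout-aware version keeps each heading together with the text it introduces.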