by ericjang on 7/30/22, 4:16 PM with 3 comments
by ilaksh on 7/30/22, 7:25 PM
For example, translating the visual sampling into a 3D model first, or maybe using neural representations that can generate the 3D models, and then training the movement on that rather than on raw pixels.
Similarly, for textual prompts describing interactions, first create a model that relates the word embeddings to the same 3D modeling and physics interactions (a rough sketch of that kind of pipeline follows below).
Obviously much easier said than done.
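A minimal sketch of what such a pipeline might look like, assuming a PyTorch-style setup. Everything here (module names, latent size, action dimension, the MLP stand-in for a neural 3D scene encoder) is a placeholder for illustration, not anything described in the post:

    # Hypothetical sketch: a policy that acts on a learned scene latent plus a
    # text-goal latent, instead of raw pixels. All names/dimensions are assumptions.
    import torch
    import torch.nn as nn

    LATENT_DIM = 256
    ACTION_DIM = 7  # e.g. a 7-DoF arm; arbitrary choice for this sketch


    class SceneEncoder(nn.Module):
        """Stand-in for a neural 3D scene representation; here just an MLP over
        flattened pixels so the sketch stays runnable."""
        def __init__(self, image_shape=(3, 64, 64)):
            super().__init__()
            in_dim = image_shape[0] * image_shape[1] * image_shape[2]
            self.net = nn.Sequential(
                nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU(),
                nn.Linear(512, LATENT_DIM),
            )

        def forward(self, images):
            return self.net(images)


    class GoalEncoder(nn.Module):
        """Projects precomputed word embeddings into the same latent space, so
        language goals and scene state share one representation."""
        def __init__(self, word_dim=300):
            super().__init__()
            self.proj = nn.Linear(word_dim, LATENT_DIM)

        def forward(self, word_embeddings):
            # Mean-pool token embeddings into a single goal vector.
            return self.proj(word_embeddings.mean(dim=1))


    class Policy(nn.Module):
        """Maps (scene latent, goal latent) to an action, never touching raw pixels."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * LATENT_DIM, 256), nn.ReLU(),
                nn.Linear(256, ACTION_DIM), nn.Tanh(),
            )

        def forward(self, scene_z, goal_z):
            return self.net(torch.cat([scene_z, goal_z], dim=-1))


    if __name__ == "__main__":
        images = torch.randn(4, 3, 64, 64)   # batch of camera frames
        words = torch.randn(4, 8, 300)       # batch of 8-token goal embeddings
        scene_z = SceneEncoder()(images)
        goal_z = GoalEncoder()(words)
        actions = Policy()(scene_z, goal_z)
        print(actions.shape)                 # torch.Size([4, 7])

The point of the split is that the policy only ever sees the shared latent space; swapping the pixel MLP for an actual 3D or implicit scene model would not change the policy's interface.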