by borzunov on 12/13/22, 3:32 PM
https://colab.research.google.com/drive/1Ervk6HPNS6AYVr3xVdQ...
Thing is, even though the BLOOM weights were publicly released, it was extremely difficult to run inference efficiently unless you had enough hardware to load the entire model into GPU memory (at least 3x A100 or 8x 3090 GPUs). E.g., with offloading, you can only reach ~10 sec/step for sequential (non-parallel) generation. A possible alternative is to use APIs, but they are paid and not always flexible (you can't adopt new fine-tuning/sampling methods or take a look at hidden states). So, Petals comes to the rescue!
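For reference, here's roughly what client-side usage looks like. This is a minimal sketch based on the Petals README from around this release; the import path, class name, and model ID (DistributedBloomForCausalLM, "bigscience/bloom-petals") may have changed in later versions:

  from transformers import BloomTokenizerFast
  from petals import DistributedBloomForCausalLM

  # Only the embeddings run on your machine; the transformer blocks
  # are served by volunteers across the public swarm.
  model_name = "bigscience/bloom-petals"
  tokenizer = BloomTokenizerFast.from_pretrained(model_name)
  model = DistributedBloomForCausalLM.from_pretrained(model_name)

  inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
  outputs = model.generate(inputs, max_new_tokens=5)
  print(tokenizer.decode(outputs[0]))

Since the model object behaves like a regular PyTorch module, you can also inspect hidden states or plug in custom sampling/fine-tuning code, which is exactly what the paid APIs don't let you do.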
Please share what you think of it!