from Hacker News

Llasa: Llama-Based Speech Synthesis

by CalmStorm on 5/1/25, 4:43 PM with 22 comments

  • by ks2048 on 5/1/25, 8:03 PM

    Odd that the page doesn't seem to link to either:

    paper: https://arxiv.org/abs/2502.04128

    github: https://github.com/zhenye234/LLaSA_training

  • by CalmStorm on 5/1/25, 4:43 PM

    LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
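
    A minimal sketch of what that single-token-stream framing implies in practice (the vocabulary sizes, offsets, and example ids below are illustrative assumptions, not the paper's actual values): speech is discretized by the single-layer VQ codec into code indices, those indices are mapped into an extended LLaMA vocabulary, and synthesis becomes ordinary next-token prediction over the combined text-plus-speech sequence.

      # Illustrative only: offsets and vocabulary sizes are assumptions,
      # not values from the Llasa paper or repo.
      TEXT_VOCAB_SIZE = 128_000        # assumed size of the base LLaMA text vocabulary
      CODEC_VOCAB_SIZE = 65_536        # assumed size of the single-layer VQ codebook
      SPEECH_OFFSET = TEXT_VOCAB_SIZE  # speech code k is mapped to token id k + offset

      def speech_codes_to_token_ids(codes: list[int]) -> list[int]:
          """Map VQ codec indices into the LM's extended token-id space."""
          return [SPEECH_OFFSET + c for c in codes]

      def token_ids_to_speech_codes(ids: list[int]) -> list[int]:
          """Recover codec indices from generated ids, ignoring text/special tokens."""
          return [i - SPEECH_OFFSET for i in ids if i >= SPEECH_OFFSET]

      # Training example: the LM sees text ids followed by speech ids as one
      # sequence and is trained with the usual causal-LM objective; at inference
      # it generates the speech ids, which the VQ decoder turns back into audio.
      text_ids = [101, 2054, 2003]                        # hypothetical tokenized prompt
      speech_ids = speech_codes_to_token_ids([7, 42, 9])  # hypothetical codec indices
      sequence = text_ids + speech_ids

      assert token_ids_to_speech_codes(sequence) == [7, 42, 9]

    The appeal of this framing is that the whole TTS stack reuses standard LLM training and inference code; only the codec encoder/decoder sits outside the Transformer.
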
  • by dheera on 5/1/25, 8:02 PM

    > employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align

    I really wish that when new models were released, they would include a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in/out capabilities if needed, using D3.js or whatever visualization framework. Every single layer should be on there with its input and output sizes.

    These one-sentence descriptions and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.
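
    Until model authors publish that kind of diagram, one generic way to get the per-layer tensor sizes yourself is to register a forward hook on every submodule and run a single forward pass. The sketch below uses plain PyTorch hooks on a toy model standing in for a real checkpoint; nothing here is Llasa-specific tooling.

      import torch
      import torch.nn as nn

      def shape_of(x):
          # Tensors get their shape; anything else (tuples, None) just shows its type.
          return tuple(x.shape) if isinstance(x, torch.Tensor) else type(x).__name__

      def attach_shape_hooks(model: nn.Module):
          """Print input/output sizes for every submodule during a forward pass."""
          handles = []
          for name, module in model.named_modules():
              if name == "":  # skip the root module itself
                  continue
              def hook(mod, inputs, output, name=name):
                  ins = [shape_of(i) for i in inputs]
                  print(f"{name:30s} in={ins} out={shape_of(output)}")
              handles.append(module.register_forward_hook(hook))
          return handles

      # Toy stand-in model; swap in any nn.Module loaded from a checkpoint.
      model = nn.Sequential(
          nn.Embedding(1000, 64),
          nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
          nn.Linear(64, 1000),
      )
      handles = attach_shape_hooks(model)
      model(torch.randint(0, 1000, (1, 16)))  # prints every layer with its shapes
      for h in handles:
          h.remove()
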

  • by StevenNunez on 5/1/25, 6:00 PM

    I can't wait to see this integrated into Open WebUI! These sound amazing.

  • by mring33621 on 5/1/25, 7:08 PM

    the long 'uuuuhhhhhhh' from some of the lesser models is killing me.