from Hacker News

Hertz-dev, the first open-source base model for conversational audio

by mnk47 on 11/3/24, 11:30 PM with 56 comments

by reissbaker on 11/4/24, 7:10 AM
This is really cool. FWIW, existing open-source TTS engines are really bad in comparison to what you have here: I know this is voice-to-voice, but I think there'd be a lot of appetite to get this to also be multimodal and accept text (essentially making it a really good TTS model, in addition to a great voice-to-voice model).
I suppose someone could hack their way around the problem by finetuning it to essentially replay Piper (or whatever) output, only with more natural prosody and intonation. And then have the text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would be pretty useful to have it accept text natively!
by blixt on 11/4/24, 7:24 AM
They say Hertz is first of its kind but Moshi is another duplex audio model from earlier this year that seems to perform similarly (and it runs on a MacBook): https://github.com/kyutai-labs/moshi
by wwwlouishinofun on 11/4/24, 4:26 AM
Tesla’s approach to pure vision-based autonomous driving—temporarily setting aside lidar and other sensors—seems designed to make this technology more accessible and scalable. By focusing on a vision-only model, they can accelerate adoption and gather large datasets for quicker iterations. Once the vision-based system reaches a mature stage, I imagine Tesla might reintegrate additional sensor data, like lidar or radar, to refine their autonomous driving suite, making it even more robust and closer to perfection.
Additionally, I’ve been exploring an idea about voice interaction systems. Currently, most voice interactions are processed by converting voice input into text, generating a text-based response, and then turning this text back into audio. But what if we could train the system to respond directly in voice, without involving text at all? If developed to maturity, this model could produce responses that feel more natural and spontaneous, possibly diverging from traditional text-to-speech outputs. Natural speech has unique syntax and rhythm, not to mention dialect and tone variations, which could make a purely voice-trained system fascinating and more human-like.
Could you let me know if your current voice interaction model follows the standard speech-to-text-to-speech process, or if there is exploration in voice-to-voice processing?
by BrandiATMuhkuh on 11/4/24, 3:35 AM
That's really cool. I'm currently exploring VUI (Voice User Interface) and this might come in handy.
I might be a bit biased (did my PhD exploring how VUI can persuade humans), but I think VUI is "the future" of computer interaction. If it's not the future, than at least it adds a new group of people (kids + elderly people) as potential users.
by jcims on 11/4/24, 12:16 PM
If the authors or anyone else that works on a voice model are in here, do you ever get creeped out or feel the sounds you’re getting from the system have a physiological effect on you?
by wg0 on 11/4/24, 12:55 AM
So it is kind of LLM but audio LLM where prompt is audio and generated output is audio too?
by m11a on 11/4/24, 8:23 PM
> Base models are uniquely valuable as a research product because they accurately model the distribution of the data that they were trained on, as opposed to models that have had substantial RL tuning done to collapse their generation distributions. This makes base models the best starting point to fine-tune for a large number of different tasks.
Is this idea (‘collapse of their generation distributions’) a researched topic? If so, under what name?
Sounds interesting and maybe related to the whole continual learning / how to finetune properly line of work
by nitizaz on 11/14/24, 11:41 AM
How to pretrain the hertz-dev base model in a different language? Where can I get the information?
by codedokode on 11/4/24, 9:56 AM
The voice sounds a little bit distorted, and there is often a noise in the background (especially noticeable when this noise disappears when the voice pauses). I wonder, is it model limitations or is it the problem with quality of training data?
by mazoza on 11/4/24, 6:13 PM
Can one of the authors explain what this actually means from the post?
hertz-vae: a 1.8 billion parameter transformer decoder which acts as a learned prior for the audio VAE. The model uses a context of 8192 sampled latent representations (17 minutes) and predicts the next encoded audio frame as a mixture of gaussians. 15 bits of quantized information from the next token act as semantic scaffolding to steer the generation in a streamable manner.
by mnk47 on 11/3/24, 11:30 PM
Repo: https://github.com/Standard-Intelligence/hertz-dev
by zachthewf on 11/4/24, 5:35 PM
Cool, looks like this is trained on 16 million hours of audio (500B tokens at ~.11 seconds per token).
Even the large open source TTS models (see F5 TTS, Mask GCT) are mostly trained on very small audio datasets (say 100k hours) relative to the amount of audio available on the internet, so it's cool to see an open source effort to scale up training significantly.
by briansm on 11/4/24, 9:39 AM
The codec parameters remind me of the ~300bps NRV military speech codec from 2010. It also uses 120ms (8hz) frames, vbr encoded using 16KHz audio (closed source though).
https://ieeexplore.ieee.org/document/5680311
by lordofgibbons on 11/4/24, 2:04 AM
Can it effectively be used as a TTS model?
by xarope on 11/4/24, 7:43 AM
the One-channel generation seems to be speaking gibberish english. I'm not sure what it is supposed to represent?
And is the interactive generation just doing an ELIZA? i.e. "P: tell us about how AI will be interesting", "A: Yeah AI will, yeah, be interesting".
by kunley on 11/4/24, 2:54 PM
Anything more about the company, founders, affiliations..?
by Jayakumark on 11/4/24, 1:21 PM
What is the license on model weights ?
by nitizaz on 11/7/24, 11:02 AM
Any information about the architecture config of the model?
by awinter-py on 11/4/24, 6:08 AM
what is up with the first sample? and/or am I having a stroke
by Dawny33 on 11/4/24, 5:59 AM
Congrats, team.
Does Hertz support multi-lingual audio right now?
by timnetworks on 11/6/24, 4:12 AM
The Sims could really use this.
by ryukoposting on 11/4/24, 1:19 PM
The voice samples are speaking gibberish a lot of the time, but sonically the voices are fantastic. They sound human, even if it's nonsense syllables.
With SD and LLMs, there's a lot you can do to debug it by studying the way it responds to small changes in the prompt. But, since Hertz-dev is using sound as its input, it would be hard to discern which token you should tweak. Of course, if it's meant to be used in real time, that kind of fiddling isn't an option at all. How would you go about systematically studying Hertz-dev's behavior?
by blixt on 11/4/24, 7:16 AM
Gotta say I was confused for a second but yeah apparently si.inc and ssi.inc are the domains for two different AGI companies and I can only assume it’s intentional?