by punnerud on 7/24/24, 8:34 PM
by supermatt on 7/25/24, 3:46 AM
I haven't trained any LLMs, so please accept my comment with all the naivety with which it is given - but in the "examples of MINT multimodal documents" graphic at the top of the README, it feels to me as though the labeling (for the images on the left) couldn't be much worse? Is this normal for these datasets? How are we able to build such powerful models with such poor-quality data?
by naveen99 on 7/25/24, 11:42 AM
Copyright and intellectual property are directly at odds with these types of efforts, and have been losing to Linux, GNU, GitHub, Wikipedia, MIT OpenCourseWare, YouTube, LLMs and their datasets. But copyright did slay Napster, The Pirate Bay, Anna's Archive etc…
by sva_ on 7/25/24, 12:26 AM
Does it make sense to measure a dataset in tokens? Shouldn't it be tokenizer-agnostic? I.e., the OpenAI tokenizer encodes roughly 4 characters per token, but I could also have a tokenizer that does 1 character per token, leading to a ~4x increase in token count (relative to the OpenAI tokenizer).
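(To make the point concrete, here's a minimal sketch - the example text and the two toy tokenizers are made up for illustration, not taken from any real dataset's tooling - showing that the same string produces very different "token" counts depending on the tokenizer:)

```python
# The same text, measured under two different toy tokenizers.
text = "Measuring datasets in tokens is tokenizer-dependent."

# Character-level tokenizer: one token per character.
char_tokens = list(text)

# Crude whitespace tokenizer: roughly word-level.
word_tokens = text.split()

print(len(char_tokens))  # 52
print(len(word_tokens))  # 6
```

So "N tokens" only pins down dataset size once you fix the tokenizer; a character-level scheme here yields almost 9x the token count of a word-level one for the identical text.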
by ks2048 on 7/25/24, 1:41 AM
It looks like it contains data from CommonCrawl and arXiv. It's not clear what kind of processing they did, but sometimes these releases seem like just repackaging existing datasets with your own name on them. It's not hard to get bulk downloads from these sources directly.
I thought CommonCrawl truncated files at 1MB. I wonder if the PDFs from CommonCrawl were re-fetched from the original URLs. That could be useful if they provide a simple way to get those full files.
by wsc981 on 7/25/24, 9:10 AM
So, I read the blog post and checked the GitHub page, but I still don't have a clear picture. I am still kinda new to the LLM space.
What would the use-case be for this model? What are the advantages over something like Llama?
by stealthcat on 7/25/24, 3:10 PM
Marketed as “multimodal” but it's actually just text and images.
A multimodal dataset should be multimedia: text, audio, images, video, and optionally more, like sensor readings and robot actions.
by optimalsolver on 7/24/24, 8:44 PM
How effective would modeling raw byte sequences be, with individual bytes as the "tokens" and a vocabulary of 256 elements?
You could then train on any kind of digital data.
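(The mechanics of this are simple - here's a minimal sketch, with a made-up example string, of what "bytes as tokens" means; the known trade-off, explored by byte-level models like ByT5, is that sequences get several times longer than with a subword vocabulary:)

```python
# Byte-level tokenization: any data becomes integers in [0, 256),
# so one fixed 256-entry vocabulary covers text and arbitrary binaries.
data = "héllo 🌍".encode("utf-8")  # works for text...
tokens = list(data)                # ...or any other byte stream

assert all(0 <= t < 256 for t in tokens)
# 7 characters expand to 11 byte-tokens (é is 2 bytes, 🌍 is 4),
# so context windows effectively shrink relative to subword tokenizers.
print(len(tokens))  # 11
```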
by brianjking on 7/24/24, 11:14 PM
What's the license though?
by EGreg on 7/25/24, 5:42 AM
License: None
Means we can’t legally use it?
by benreesman on 7/25/24, 9:02 AM
Salesforce quietly does some truly tier-one stuff. They don't showboat it, which makes them seem more, not less, serious, at least from my seat.
They use Bazel and shit, which is an acid test for being professionals, it’s a real shop.
The Magnificent 7 are about to get the taste slapped out of their mouth by skittish momentum guys and their chattels on Sand Hill Road. I look forward to the space this week will create for shops like Salesforce.