by chunkles on 12/26/24, 7:06 PM with 116 comments
by antirez on 12/30/24, 4:44 PM
So for instance:
> In my pasta I put a lot of [cheese]
LLM top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil]
The real next token is "cheese" so I'll store "1". And so forth.
Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2
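A minimal sketch of the rank-storing idea described above, with a toy bigram counter standing in for the LLM. The corpus, the `<s>` start marker, and the function names are all made up for illustration; a real setup would rank the model's actual next-token predictions instead.

```python
from collections import Counter, defaultdict

def build_model(corpus):
    """Count which word follows which; a toy stand-in for an LLM."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def ranked_candidates(model, prev, vocab):
    """Deterministic ranking: most frequent followers first, ties broken alphabetically."""
    counts = model.get(prev, {})
    return sorted(vocab, key=lambda w: (-counts.get(w, 0), w))

def encode(model, vocab, words):
    """Replace each word by its rank in the predictor's candidate list."""
    ranks, prev = [], "<s>"
    for w in words:
        ranks.append(ranked_candidates(model, prev, vocab).index(w))
        prev = w
    return ranks

def decode(model, vocab, ranks):
    """Rebuild the words by re-running the same predictor and picking each rank."""
    words, prev = [], "<s>"
    for r in ranks:
        w = ranked_candidates(model, prev, vocab)[r]
        words.append(w)
        prev = w
    return words

# Hypothetical training text: "tomato" follows "of" more often than "cheese" or "oil".
corpus = (
    "<s> in my pasta i put a lot of tomato "
    "<s> in my pasta i put a lot of cheese "
    "<s> in my pasta i put a lot of tomato "
    "<s> in my salad i put a lot of oil"
)
model = build_model(corpus)
vocab = sorted(set(corpus.split()))

message = "in my pasta i put a lot of cheese".split()
ranks = encode(model, vocab, message)
print(ranks)
print(" ".join(decode(model, vocab, ranks)))
```

With a good predictor most ranks are 0 or close to it, so the rank stream is far more compressible than the text. Decoding only works if both sides run exactly the same predictor.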
by userbinator on 12/31/24, 5:12 AM
The brotli comparison is IMHO slightly misleading. Yes, it "embeds a dictionary to optimize the compression of small messages", but that dictionary is a few orders of magnitude smaller than the embedded "dictionary" which is the LLM in ts_sms.
There's a reason the Hutter Prize (and the demoscene) counts all the data necessary to reproduce its output. In other words, ts_sms took around 18 bytes + ~152MB while brotli took around 70 bytes + ~128KB (approximately the size of its dictionary and decompressor).
by tshaddox on 12/30/24, 5:30 PM
https://en.wikipedia.org/wiki/Hutter_Prize
From a cursory web search it doesn't appear that LLMs have been useful for this particular challenge, presumably because the challenge imposes rather strict size, CPU, and memory constraints.
by kianN on 12/30/24, 4:40 PM
> The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities. [1]
It’s also mentioned that the model is configured to be deterministic, which is how I would guess the decompression is able to map a set of token likelihoods to the original token?
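To make the determinism point concrete, here is a rough sketch of arithmetic coding driven by a next-symbol predictor. It is not how ts_sms is implemented: `toy_model` is a hypothetical two-letter "language model", the `$` end marker is an assumption, and exact fractions are used instead of the fixed-precision integer arithmetic a real coder would need.

```python
from fractions import Fraction
from math import ceil

EOM = "$"  # assumed end-of-message symbol

def toy_model(prev):
    """Hypothetical stand-in for the LLM: next-symbol probabilities given the previous symbol."""
    if prev == "a":
        return {"a": Fraction(7, 10), "b": Fraction(2, 10), EOM: Fraction(1, 10)}
    return {"a": Fraction(4, 10), "b": Fraction(4, 10), EOM: Fraction(2, 10)}

def encode(message):
    """Narrow [low, low + width) once per symbol, then emit the shortest bitstring inside it."""
    low, width = Fraction(0), Fraction(1)
    prev = None
    for symbol in message + EOM:
        for s, p in toy_model(prev).items():
            if s == symbol:
                width *= p
                break
            low += width * p
        else:
            raise ValueError(f"symbol {symbol!r} not in the model's alphabet")
        prev = symbol
    k = 1
    while True:
        n = ceil(low * 2**k)  # smallest k-bit binary fraction that is >= low
        if Fraction(n, 2**k) < low + width:
            return format(n, "b").zfill(k)
        k += 1

def decode(bits):
    """Replay the same model and follow the sub-interval that contains the encoded number.

    Only guaranteed to terminate on bitstrings produced by encode().
    """
    x = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    prev, out = None, []
    while True:
        for s, p in toy_model(prev).items():
            if low <= x < low + width * p:
                break
            low += width * p
        width *= p
        if s == EOM:
            return "".join(out)
        out.append(s)
        prev = s

code = encode("aab")
print(code, decode(code))
```

The decoder never sees the original symbols. It recovers them because it recomputes the same probabilities at every step, which is why any nondeterminism in the model would break decompression.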
by giovannibonetti on 12/30/24, 4:46 PM
Since my JSON(B) data is fairly repetitive, my bet would be to store some sort of JSON schema in a parent table. I'm storing the response body from an API call to a third-party API, so normalizing it by hand is probably out of the question.
I wonder if Avro can be helpful for storing the JSON schema. Even if I had to create custom PL/SQL functions for my top 10 JSON schemas it would be ok, since the data is growing very quickly and I imagine it could be compressed at least 10x compared to regular JSON or JSONB columns.
[1] https://github.com/citusdata/citus?tab=readme-ov-file#creati... [2] https://cloud.google.com/sql/docs/postgres/extensions
by tdiff on 12/31/24, 12:12 AM
by max_ on 12/30/24, 4:41 PM
by mNovak on 12/30/24, 5:46 PM
by stabbles on 12/30/24, 5:00 PM
by slater on 12/30/24, 7:25 PM
by j_juggernaut on 12/30/24, 9:15 PM
by lxgr on 12/30/24, 6:39 PM
I wonder if this is at all similar to what Apple uses for their satellite iMessage/SMS service, as that's a domain where it's probably worth spending significant compute on both sides to shave off even a single byte to transmit.
by Retr0id on 12/30/24, 5:01 PM
by crazygringo on 12/30/24, 6:44 PM
> 뮭䅰㼦覞㻪紹陠聚牊
I've never seen that before. The base64 below it, in contrast, is quite familiar.
by SeptiumMMX on 12/31/24, 5:52 AM
by deadbabe on 12/30/24, 4:50 PM
by the5avage on 12/31/24, 8:44 AM
by gcr on 12/31/24, 7:09 PM
It's really fun to see what happens when you feed the model keysmash! Each part of the input space seems highly semantically meaningful.
Here's a few decompressions of short strings (in base64):
$ ./ts_sms.exe d -F base64 sAbC
Functional improvements of the wva
$ ./ts_sms.exe d -F base64 aBcDefGh
In the Case of Detained Van Vliet {#
$ ./ts_sms.exe d -F base64 yolo9000
Give the best tendering
$ ./ts_sms.exe d -F base64 elonMuskSuckss=
As a result, there are safety mandates on radium-based medical devices
$ ./ts_sms.exe d -F base64 trump4Prezident=
Order Fostering Actions Supported in May
In our yellow
$ ./ts_sms.exe d -F base64 harris4Prezident=
Colleges Beto O'Rourke voted with Cher ¡La
$ ./ts_sms.exe d -F base64 obama4Prezident=
2018 AFC Champions League activity televised live on Telegram:
$ ./ts_sms.exe d -F base64 hunter2=
All contact and birthday parties
$ ./ts_sms.exe d -F base64 'correctHorseBatteryStaples='
---
author:
- Stefano Vezzalini
- Paolo Di Rio
- Petros Maev
- Chris Copi
- Andreas Smit
bibliography:
$ ./ts_sms.exe d -F base64 'https//news/ycombinator/com/item/id/42517035'
Allergen-specific Tregs or Treg used in cancer immunotherapy.
Tregs are a critical feature of immunotherapies for cancer. Our previous
studies indicated a role of Tregs in multiple
cancers such as breast, liver, prostate, lung, renal and pancreatitis. Ten years ago, most clinical studies were positive, and zero percent response rates
$ ./ts_sms.exe d -F base64 'helloWorld='
US Internal Revenue Service (IRS) seized $1.6 billion worth of bitcoin and
In terms of compression, set phrases are pretty short:
$ ./ts_sms.exe c -F base64 'I love you'
G5eY
$ ./ts_sms.exe c -F base64 'Happy Birthday'
6C+g
Common mutations lead to much shorter output than uncommon mutations / typos, as expected:
$ ./ts_sms.exe c -F base64 'one in the hand is worth two in the bush'
Y+ox+lmtc++G
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush'
kC4Y5cUJgL3s
$ ./ts_sms.exe c -F base64 'One in the hand is worth two in the bush.'
kC4Y5cUJgL3b
$ ./ts_sms.exe c -F base64 'One in the hand .is worth two in the bush.'
kC4Y5c+urSDmrod4
Note that the correct version of this idiom is a couple bits shorter:
$ ./ts_sms.exe c -F base64 'A bird in the hand is worth two in the bush.'
ERdNZC0WYw==
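For a rough size comparison of the outputs above, the base64 strings can be decoded and counted. This sketch assumes the standard base64 alphabet and that the decoded bytes are exactly the compressed payload, neither of which is confirmed here.

```python
import base64

def payload_bytes(b64):
    """Number of raw bytes carried by a base64 string (standard alphabet assumed)."""
    return len(base64.b64decode(b64))

# Compressed outputs copied from the commands above.
samples = {
    "I love you": "G5eY",
    "Happy Birthday": "6C+g",
    "one in the hand is worth two in the bush": "Y+ox+lmtc++G",
    "One in the hand is worth two in the bush": "kC4Y5cUJgL3s",
    "One in the hand .is worth two in the bush.": "kC4Y5c+urSDmrod4",
    "A bird in the hand is worth two in the bush.": "ERdNZC0WYw==",
}

for text, b64 in samples.items():
    print(f"{len(text.encode()):3d} bytes of text -> {payload_bytes(b64):2d} bytes compressed: {text!r}")
```

Base64 only shows whole bytes, so this gives byte counts, not the exact number of bits the coder used.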
Slight corruptions at different points lead to wildly different (but meaningful) output:
$ ./ts_sms.exe d -F base64 FRdNZC0WYw==
Dionis Ellison
Dionis Ellison is an American film director,
$ ./ts_sms.exe d -F base64 ERcNZC0WYw==
A preliminary assessment of an endodontic periapical fluor
$ ./ts_sms.exe d -F base64 ERdNYC0WYw==
A bird in the hand and love of the divine
$ ./ts_sms.exe d -F base64 ERdNZC1WYw==
A bird in the hand is worth thinking about
$ ./ts_sms.exe d -F base64 ERdNZD0WYw==
A bird in the hand is nearly as big as the human body
$ ./ts_sms.exe d -F base64 ERdNZC0wYw==
A bird in the hand is worth something!
Friday
$ ./ts_sms.exe d -F base64 ERdNZC0XYw==
A bird in the hand is worth two studies
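One way to see where each corrupted string above departs from the original is to find the first differing bit in the decoded bytes. This sketch assumes the standard base64 alphabet and says nothing about the internals of ts_sms.

```python
import base64

def first_diff_bit(a_b64, b_b64):
    """Index (0 = first bit of the first byte) of the first bit where two payloads differ."""
    a = base64.b64decode(a_b64)
    b = base64.b64decode(b_b64)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # bit_length() of the XOR locates the highest differing bit in this byte
            return i * 8 + (8 - (x ^ y).bit_length())
    return None  # no difference within the common prefix

original = "ERdNZC0WYw=="
corrupted = [
    "FRdNZC0WYw==",
    "ERcNZC0WYw==",
    "ERdNYC0WYw==",
    "ERdNZD0WYw==",
    "ERdNZC1WYw==",
    "ERdNZC0wYw==",
    "ERdNZC0XYw==",
]

for c in corrupted:
    print(c, "first differing bit:", first_diff_bit(original, c))
```

In the outputs above, the later the first flipped bit, the more of "A bird in the hand is worth two" survives, which is consistent with a decoder that consumes the bitstream from left to right.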
by yalok on 12/30/24, 4:47 PM
by jonplackett on 12/30/24, 10:54 PM
Get Sora to guess the next frame and then correct any parts that are wrong?
I mean, it would be an absolutely insane waste of power, but maybe one day it’ll make sense!
by MPSimmons on 12/31/24, 4:38 PM
This is more like a book cipher than a compression algorithm.
by mlok on 12/30/24, 6:49 PM
by RandomThoughts3 on 12/30/24, 6:32 PM
I could see it becoming very useful if on device LLM becomes a thing. That might allow storing a lot of original sources for not much additional data. We might be able to get an on device chat bot sending you to a copy of Wikipedia/reference material all stored on device and working fully offline.