from Hacker News

Tesla Dojo Whitepaper

by edison112358 on 10/26/21, 11:52 AM with 2 comments

  • by childintime on 10/26/21, 4:14 PM

    I'd like to mention a thought I had some time ago regarding the idea of using a byte FP format for ML training: instead of describing the byte in a sign/mantissa/exponent format, it might be advantageous to map the 256 possible byte values, via a lookup table, to ideally chosen FP values. The curve implemented could be a sigmoid, for example. This would reduce quantization effects, likely resulting not only in better convergence, but in more consistent convergence.

    Maybe it would be necessary to adjust the curve to facilitate the reverse lookup and to reduce the time and silicon needed (a rough sketch of the idea follows).
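
    A minimal sketch of that idea, assuming a tanh-shaped curve and a scale parameter chosen purely for illustration: an 8-bit code indexes a 256-entry table of float values, and encoding is a nearest-neighbour reverse lookup. None of the specific numbers come from the whitepaper.

        import numpy as np

        # 256-entry lookup table: codes are densest near zero, where most
        # weights/activations live. SCALE is an assumed free parameter.
        SCALE = 3.0
        codes = np.arange(256)
        uniform = (codes - 127.5) / 127.5                 # evenly spaced in (-1, 1)
        LUT = np.tanh(SCALE * uniform) / np.tanh(SCALE)   # sigmoid-shaped value curve

        def decode(c: np.ndarray) -> np.ndarray:
            """Byte -> float: a single table lookup."""
            return LUT[c]

        def encode(x: np.ndarray) -> np.ndarray:
            """Float -> byte: nearest-neighbour reverse lookup (the direction
            the comment suggests shaping the curve to keep cheap)."""
            x = np.clip(x, LUT[0], LUT[-1])
            i = np.clip(np.searchsorted(LUT, x), 1, 255)
            lower_is_closer = (x - LUT[i - 1]) < (LUT[i] - x)
            return (i - lower_is_closer).astype(np.uint8)

        x = np.array([0.01, -0.2, 0.7])
        print(decode(encode(x)))   # values snapped to the nearest table entry

    The table replaces quantization onto a fixed exponent grid; whether the reverse lookup can be made cheap enough in silicon is the open question the comment raises.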

  • by francoisp on 10/26/21, 3:17 PM

    Interesting read. I wonder whether this is only a bandwidth optimization to throw more hardware at the problem, or an actual shift in perspective: for example, there is no NaN/Inf, and overflow instead clamps to the max value. Could this introduce artifacts, will math libs need to code around it, or will it enable some new insight?
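
    As a rough illustration of the clamping behaviour that comment refers to (the max value of 240 and the 8-bit framing are assumptions for the sketch, not taken from the whitepaper): overflow saturates to the largest finite magnitude, and there is no NaN/Inf encoding.

        import numpy as np

        MAXVAL = 240.0   # assumed largest finite value of the narrow format

        def to_saturating_fp8(x: np.ndarray) -> np.ndarray:
            """Clamp to +/- MAXVAL instead of producing Inf; map NaN to 0."""
            x = np.nan_to_num(x, nan=0.0, posinf=MAXVAL, neginf=-MAXVAL)
            return np.clip(x, -MAXVAL, MAXVAL)

        # IEEE math would propagate Inf/NaN; here a huge intermediate silently
        # pins at MAXVAL, which is the kind of artifact the question is about.
        print(to_saturating_fp8(np.array([1e6, -np.inf, 0.5])))  # [ 240. -240.  0.5]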