from Hacker News

Fixing C Strings

by ushakov on 12/17/24, 9:00 AM with 65 comments

  • by WalterBright on 12/21/24, 9:17 PM

        struct str {
            char *dat;
            sz len;
        };
    
    It's the same solution D uses, except that it's a builtin type, and works for all arrays. I proposed this solution for C:

    https://www.digitalmars.com/articles/C-biggest-mistake.html

    It's hard to overstate what a huge win this is. D has had 23 years of experience with it, and the virtual elimination of array overflow bugs is just win, win, win.

    I will never understand why C keeps adding extensions consisting of marginal features, and ignores this foundational fix. I guess they still aren't tired of buffer overflow bugs always being the #1 security vulnerability of shipped C code (and C++, too!).
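
    As a rough illustration of why carrying the length next to the pointer helps, here is a minimal sketch of a bounds-checked read on the struct above. The `sz` typedef (assumed here to be `ptrdiff_t`) and the helper name `str_get` are assumptions for this example, not taken from the article or from D.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef ptrdiff_t sz; /* assumption: sz is a signed size type */

struct str {
    char *dat;
    sz len;
};

/* Hypothetical helper: copy s.dat[i] into *out only when i is in range. */
static bool str_get(struct str s, sz i, char *out)
{
    assert(out != NULL);
    if (i < 0 || i >= s.len)
        return false; /* out of bounds: refuse instead of reading */
    *out = s.dat[i];
    return true;
}
```

    Because the length travels with the pointer, the check needs no scan for a terminator and no extra argument.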

  • by kevin_thibedeau on 12/21/24, 9:01 PM

    > Current compilers warn you if the format string doesn’t match its arguments. But this only works on functions that have the same signature as printf so it doesn’t work on my implementation.

    GCC has the format attribute that lets you have printf type checking on your own variadic functions:

    https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/Common-Functio...
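
    A minimal sketch of that attribute on a user-defined variadic function. The wrapper name `fmt_into` and the `PRINTF_LIKE` macro are made up for this example; `format(printf, 3, 4)` says that argument 3 is the format string and that checking starts at argument 4.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Only GCC-compatible compilers understand the attribute. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define PRINTF_LIKE(fmt_idx, first_arg)
#endif

/* Hypothetical wrapper: formats into buf like snprintf does. */
static int fmt_into(char *buf, size_t cap, const char *fmt, ...)
    PRINTF_LIKE(3, 4);

static int fmt_into(char *buf, size_t cap, const char *fmt, ...)
{
    assert(buf != NULL && fmt != NULL);
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, cap, fmt, ap); /* forwards the variadic args */
    va_end(ap);
    return n;
}
```

    With this declaration, a mismatched call such as `fmt_into(buf, cap, "%d", "text")` should be flagged by `-Wformat` (enabled by `-Wall`).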

  • by simscitizen on 12/17/24, 11:56 PM

    There are quite a few of these "better C string" idioms floating around.

    Another one to consider is e.g. https://github.com/antirez/sds (used by Redis), which instead stores the string contents in-line with the metadata.
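
    The general idea of keeping the metadata in-line can be sketched with a header placed directly in front of the characters, so a single allocation holds both. This is only an illustration of the layout; the names `istr_new`, `istr_len` and `istr_free` are hypothetical and are not the sds API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Header and characters share one allocation. */
struct istr_hdr {
    size_t len;
    char buf[]; /* flexible array member holds the bytes */
};

/* Returns a pointer to the characters; the header sits just before them. */
static char *istr_new(const char *src, size_t len)
{
    assert(src != NULL);
    struct istr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->buf, src, len);
    h->buf[len] = '\0'; /* terminator kept so plain C functions still work */
    return h->buf;
}

/* Step back from the characters to the header to read the length. */
static size_t istr_len(const char *s)
{
    const struct istr_hdr *h =
        (const struct istr_hdr *)(s - offsetof(struct istr_hdr, buf));
    return h->len;
}

static void istr_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(struct istr_hdr, buf));
}
```

    The trade-off is that such a string owns its bytes, so it cannot point into the middle of another buffer.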

  • by ropejumper on 12/18/24, 8:36 AM

    Two people have already mentioned things like storing the length inline or including a null-terminator to be backwards-compatible. What's described there is basically the same as std::string_view or &str, and to me one of the biggest reasons to use these structures is that your particular view of the string doesn't interfere with someone else's. You can slice your string in the middle and just look at it piecewise without bothering anyone else.

    Choosing between these trade-offs just depends on what you're doing. I'd definitely choose this pattern if I were to write a parser for instance.
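
    To make the slicing point concrete, here is a small sketch of a non-owning view. The field types and the helper `str_slice` are assumptions for this example. Taking a slice only produces a new pointer/length pair; the underlying bytes are never modified.

```c
#include <assert.h>
#include <stddef.h>

/* Non-owning view: points into someone else's bytes. */
struct str {
    const char *dat;
    ptrdiff_t len;
};

/* Hypothetical helper: view of s[start..end), without copying or writing. */
static struct str str_slice(struct str s, ptrdiff_t start, ptrdiff_t end)
{
    assert(0 <= start && start <= end && end <= s.len);
    return (struct str){ s.dat + start, end - start };
}
```

    A parser could, for instance, split "key=value" into two views over the same buffer, with no terminator written into the middle of it.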

  • by jdblair on 12/21/24, 9:08 PM

    I've done something similar, but unlike the author, I always reserved one extra byte and I always null terminated the string. This was so I could use existing string output functions.

  • by cozzyd on 12/18/24, 3:04 AM

    Why not have the null terminator so you can pass to normal printf?

    You could even do something crazy with packing a null byte with sz on 64-bit systems (since you will never have a string that long anyway...)
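
    For output specifically, the standard printf family can already print a length-delimited string through the `%.*s` conversion, which reads at most the given number of bytes and does not need a terminator. A minimal sketch, where the struct layout and the helper `str_format` are assumptions for this example:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

struct str {
    const char *dat;
    ptrdiff_t len;
};

/* Hypothetical helper: writes "[<contents>]" into out via %.*s. */
static int str_format(char *out, size_t cap, struct str s)
{
    /* The precision argument of %.*s must be an int. */
    assert(s.len >= 0 && s.len <= INT_MAX);
    return snprintf(out, cap, "[%.*s]", (int)s.len, s.dat);
}
```

    Functions that take a bare `char *` with no length, such as `fopen`, would still need a terminated copy.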

  • by up2isomorphism on 12/21/24, 9:34 PM

    For all the complaints, all you need to do is include another .h file from some string lib and that’s it.

    But I would say that for 95% of cases, using a fixed-length char array with strncpy will work just fine.
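
    One caveat with this approach: strncpy does not write a terminator when the source is at least as long as the limit, so the terminator has to be added by hand. A minimal sketch, where the helper name `copy_fixed` is made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into a fixed-size buffer and always
 * terminate it. Returns 1 if the whole string fit, 0 if it was truncated.
 */
static int copy_fixed(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    strncpy(dst, src, cap - 1); /* copies at most cap - 1 bytes */
    dst[cap - 1] = '\0';        /* strncpy alone may leave dst unterminated */
    return strlen(src) < cap;
}
```

    Whether silent truncation is acceptable depends on the use case, which is why this sketch reports it to the caller.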

  • by superjared on 12/21/24, 11:50 PM

    The bstring library[0] has been around a _long_ time.

    [0]: https://bstring.sourceforge.net/

  • by codr7 on 12/21/24, 10:25 PM

    I would consider putting the buffer last in the structure and making it flexible to allow skipping one allocation.

  • by Levitating on 12/21/24, 10:30 PM

    > I liked this kind of pattern at the bottom of OpenAI's site :)

    Where on OpenAI's site do I find a footer like that?

  • by Quis_sum on 12/22/24, 2:32 PM

    Sorry, but there is a significant misunderstanding: There is no such thing as a string in C. What you call a string is a pointer to char (typically "int8") - nothing more nothing less. The \0 termination is just a convention/convenience to avoid passing the bounds of the memory segment, resp. when to stop processing earlier.

    Once you go down the route proposed by many of the comments here - why not enhance it to deal with UTF8... Or rather implement a proper "array" type? What about the lack of multidimensional arrays instead of the pointer to pointer to ... approach? Idiosyncrasies such as "int a[2][3];" decaying to "int (*)[3]" and not "int **"?

    C was never intended to shield you from mistakes, but rather replace a macro assembler. ANSI C addressed some of the issues in the original K&R C, but that is about it.

    If your use case would benefit from all of these protections, there are plenty of higher level language alternatives...
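
    For reference on the "int a[2][3]" point, a small sketch of how such an array behaves: it is one contiguous block of six ints, and when passed to a function it converts to a pointer to its first row, `int (*)[3]`. The function name `sum_rows` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A 2x3 array is one contiguous block, not an array of pointers. */
_Static_assert(sizeof(int[2][3]) == 6 * sizeof(int), "contiguous storage");

/* The parameter type is what "int a[2][3]" converts to when passed. */
static int sum_rows(int (*rows)[3], size_t n)
{
    assert(rows != NULL);
    int total = 0;
    for (size_t r = 0; r < n; r++)
        for (size_t c = 0; c < 3; c++)
            total += rows[r][c];
    return total;
}
```

    Calling `sum_rows(a, 2)` with `int a[2][3]` compiles without a cast, whereas a function declared to take `int **` would not accept it.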

  • by teo_zero on 12/22/24, 8:58 AM

    Good attempt at a topic that annoys many programmers.

    I see a problem with the separation between str and str_buf, though: you create new strings with the latter, but most functions take the former as arguments. Do you convert them every time? Isn't your code littered with str_from_buf()?

    Put another way, it's like the mess with const that you mention in your article. If str is the type you use for a const read-only string, and str_buf for a non-const mutable string, you would like to pass a non-const even to those functions that "only" require a const. (I say "only" because being const is a weaker requirement than being mutable; the fact that it's more wordy is another thing that C's syntax makes confusing, but this is an entirely different topic!)

    It would be nice if the compiler could be instructed to automatically cast str_buf into str and not vice versa, just like it does for non-const to const.

    The only way out I can think of would be to get rid of the two types and only use the one with the cap field, with the convention that if cap is zero, then the string is read-only. The drawback is that certain mistakes are only detected at run-time and not enforced by the compiler. For example, a function that takes a string s and replaces every substring s1 with s2 could have the following prototype in the two-type system:

      replace(str_buf s, str s1, str s2);
    
    And it would be immediate to recognize that you cannot pass a read-only string as the first argument. With a one-type system you lose this ability.
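
    A minimal sketch of the one-type convention, to show where the check ends up. The field layout and the helper `str_push` are assumptions for this example, not the article's API. A string with cap == 0 is treated as read-only, and the mistake is only caught when the call runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Single type: cap == 0 marks the string as read-only by convention. */
struct str {
    char *dat;
    ptrdiff_t len;
    ptrdiff_t cap;
};

/* Hypothetical helper: append one byte, refusing read-only or full strings. */
static bool str_push(struct str *s, char c)
{
    assert(s != NULL);
    if (s->cap == 0 || s->len >= s->cap)
        return false; /* detected at run time, not by the compiler */
    s->dat[s->len++] = c;
    return true;
}
```

    In the two-type version the equivalent function would take a str_buf, and passing a str would fail to compile.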

    Oh well, I guess if a perfect solution existed, it would have been adopted by the C committee, wouldn't it? /s

  • by zwnow on 12/21/24, 10:36 PM

    Never had a string related bug in any programming language in 4 years. I sincerely don't know what people talk about when they claim strings are buggy? What kinda tasks do these happen in?

  • by zabzonk on 12/21/24, 8:59 PM

    I have been using null terminated strings since the mid 1970s - before using C, and have never had any problems with them. I have never seen an explanation that makes any sense from someone who has.