from Hacker News

String tokenization in C

by throwaway2419 on 12/15/18, 9:38 AM with 114 comments

  • by kazinator on 12/15/18, 3:15 PM

    The actions of strtok can easily be coded using strspn and strcspn.

    https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]

    https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]

    strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

    The bag is like a one-character regex class; that is, strspn(s, "abcd") computes the length of the token at the front of input s matching the regex [abcd]*, and strcspn(s, "abcd") does the same for [^abcd]*.
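
    A minimal sketch of that idea, assuming a helper named next_token (a hypothetical name, not a standard function). It skips delimiters with strspn, measures the token with strcspn, and never writes to the input:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the standard library.
   Finds the next token at or after *cursor without modifying the input.
   Returns a pointer to the token and stores its length in *len,
   or returns NULL when no tokens remain. */
static const char *next_token(const char **cursor, const char *bag, size_t *len)
{
    assert(cursor && *cursor && bag && len);
    const char *s = *cursor;

    s += strspn(s, bag);        /* skip leading delimiters: [bag]*  */
    if (*s == '\0') {
        *cursor = s;
        return NULL;            /* only delimiters were left */
    }
    *len = strcspn(s, bag);     /* measure the token: [^bag]*       */
    *cursor = s + *len;         /* resume after the token next time */
    return s;
}
```

    Because the token is not null-terminated, a caller would print it with something like printf("%.*s\n", (int)len, tok).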

  • by jstimpfle on 12/15/18, 12:31 PM

    strtok is one of the silliest parts of the standard library. (And there are many bad ones). It's broken. It's not thread safe (yes there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

    If you program in C please just write those four obvious lines yourself.
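
    One possible reading of the "a+1" case, as a sketch rather than the four lines the comment has in mind: alphanumeric runs form one token and any other non-space character is a token by itself. The struct span type and the lex function are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token type: a view into the input, no copy, no terminator. */
struct span {
    const char *start;
    size_t len;
};

/* Splits s into at most max tokens without modifying it.
   Returns the number of tokens written to out. */
static size_t lex(const char *s, struct span *out, size_t max)
{
    assert(s && out);
    size_t n = 0;

    while (*s != '\0' && n < max) {
        if (isspace((unsigned char)*s)) {
            s++;                            /* whitespace separates tokens */
            continue;
        }
        const char *start = s;
        if (isalnum((unsigned char)*s)) {
            while (isalnum((unsigned char)*s))
                s++;                        /* identifier or number run */
        } else {
            s++;                            /* any other char is its own token */
        }
        out[n].start = start;
        out[n].len = (size_t)(s - start);
        n++;
    }
    return n;
}
```

    strtok cannot produce this split because there is no delimiter byte between "a", "+" and "1" for it to overwrite.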

  • by stochastic_monk on 12/15/18, 2:30 PM

    I recommend ksplit/ksplit_core from Heng Li’s excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

    [0] https://github.com/attractivechaos/klib
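
    To illustrate the technique only (this is not klib's API; see kstring.h for the real signatures), here is a sketch of an in-place split that records offsets. The name split_offsets and the caller-provided offsets array are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not klib's ksplit.
   Replaces each delim in s with '\0' and stores the starting offset of
   every field in offsets. Stops once max fields have been recorded, leaving
   the rest of the string untouched. Empty fields are kept.
   Returns the number of fields recorded. */
static size_t split_offsets(char *s, char delim, size_t *offsets, size_t max)
{
    assert(s && offsets);
    size_t n = 0;
    size_t start = 0;

    for (size_t i = 0; n < max; i++) {
        if (s[i] == delim || s[i] == '\0') {
            int at_end = (s[i] == '\0');
            offsets[n++] = start;   /* field begins at s + start */
            s[i] = '\0';            /* terminate the field in place */
            if (at_end)
                break;
            start = i + 1;
        }
    }
    return n;
}
```

    Field k is then simply s + offsets[k], so tokens can be read by index without copying.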

  • by lixtra on 12/15/18, 12:56 PM

    I have an obsession with unsafe example code:

      strcpy(str,"abc,def,ghi");
      token = strtok(str,",");
      printf("%s \n",token);
    
    Even if the author knows how many tokens are returned I would prefer a check for NULL here since a good fraction might not read further than this bad example.
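
    A sketch of the same usage with the NULL check folded into the loop condition. The print_tokens name is hypothetical; the strtok calls are the standard ones.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints each token of str on its own line and
   returns how many were found. str is modified by strtok. */
static int print_tokens(char *str, const char *delims)
{
    assert(str && delims);
    int count = 0;

    /* strtok returns NULL when no token is left, including on the very
       first call if str holds nothing but delimiters. */
    for (char *token = strtok(str, delims);
         token != NULL;
         token = strtok(NULL, delims)) {
        printf("%s\n", token);
        count++;
    }
    return count;
}
```

    With an input such as ",,," the first strtok call already returns NULL, and passing that to printf("%s") is undefined behaviour.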
  • by jfries on 12/15/18, 11:32 AM

    Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.

  • by graycat on 12/15/18, 1:12 PM

    A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing ran on about 3600 mainframe computers around the world running VM/CMS, with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished scripting language, and really good at handling strings.

    A little example of some Rexx code with some string parsing is in

    https://news.ycombinator.com/item?id=18648999

  • by pasokan on 12/15/18, 11:34 AM

    It used to be that gcc would warn against strtok and recommend strsep instead. I do not know what the status is today.

  • by caf on 12/15/18, 12:11 PM

    Note though that strsep() is not as portable, because it is an extension to standard C.

  • by satyenr on 12/16/18, 5:45 AM

    > Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

    I wonder why strtok() does not use an output parameter similar to scanf() — and return the number of tokens. Something like:

      int strtok(char *str, char *delim, char **tokens);
    
    Granted, it would involve dynamic memory allocation and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?

    Does anyone here have the historical perspective?
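
    A sketch of what such an interface might look like if the caller supplies the tokens array, which sidesteps dynamic allocation. The name split_into and the extra max parameter are assumptions added here, not part of the proposal above. It keeps no static state, so it is reentrant, but it still writes null bytes into str.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function: splits str in place on any character in delim,
   storing up to max token pointers in tokens.
   Returns the number of tokens stored. */
static size_t split_into(char *str, const char *delim, char **tokens, size_t max)
{
    assert(str && delim && tokens);
    size_t n = 0;
    char *s = str;

    while (n < max) {
        s += strspn(s, delim);      /* skip a run of delimiters */
        if (*s == '\0')
            break;                  /* nothing left */
        tokens[n++] = s;            /* token starts here */
        s += strcspn(s, delim);     /* advance to the end of the token */
        if (*s == '\0')
            break;                  /* token ran to the end of the string */
        *s++ = '\0';                /* terminate it and step past */
    }
    return n;
}
```

    If max is reached before the input is exhausted, the remainder of str is left unsplit; a real interface would need to decide how to report that.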

  • by megous on 12/15/18, 6:22 PM

    Another approach, besides library calls and flex, is re2c. It preprocesses the source code and inlines regular expression matching where you need it. It's very powerful in combination with goto.

  • by saagarjha on 12/15/18, 6:30 PM

      str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
    
      strcpy(str,TESTSTRING);
    
    str = strdup(TESTSTRING)?
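
    For what it's worth, a sketch of the strdup route with the failure check and the matching free. strdup comes from POSIX (it was only added to ISO C in C23), hence the feature-test macro; count_tokens is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L  /* request strdup from POSIX headers */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts tokens in a read-only string by tokenizing
   a private copy, since strtok modifies its input.
   Returns -1 if the copy cannot be allocated. */
static int count_tokens(const char *src, const char *delims)
{
    assert(src && delims);
    char *copy = strdup(src);       /* malloc + strcpy in one call */
    if (copy == NULL)
        return -1;

    int count = 0;
    for (char *t = strtok(copy, delims); t != NULL; t = strtok(NULL, delims))
        count++;

    free(copy);                     /* strdup memory is released with free */
    return count;
}
```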
  • by rurban on 12/15/18, 2:46 PM

    AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are Unicode nowadays, not ASCII.

    https://en.cppreference.com/w/c/string/byte/strtok

  • by bsenftner on 12/15/18, 2:48 PM

    ...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...

  • by the_clarence on 12/15/18, 5:21 PM

    Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?

  • by setquk on 12/15/18, 12:23 PM

    I just use flex. You don’t have to ship flex as a dependency either.

  • by alexandernst on 12/15/18, 5:08 PM

    How about just using a properly suited language for string manipulation?