from Hacker News

5% of 666 Python repos had comma typo bugs (inc V8, TensorFlow and PyTorch)

by rikatee on 1/7/22, 5:00 PM with 327 comments

  • by aeturnum on 1/7/22, 6:25 PM

    The high-level goals of python end up creating these little syntactic landmines that can get even experienced coders. My personal nomination for the worst one of these is that having a comma after a single value often (depending on the surrounding syntax) creates a tuple. It's easy to miss and creates maddening errors where nothing works how you expect.

    I've moved away from working in Python in general, but I think the #1 feature I want in the core of the language is the ability to make violating type hints an exception[1]. The core team has been slowly integrating type information, but it feels like they have really struggled to articulate a vision about what type information is "for" in the core ecosystem. I think a little more opinion from them would go a long way to ecosystem health.

    [1] I know there are libraries that do this, I am not seeking recommendations.

  • by micimize on 1/7/22, 6:15 PM

    For those looking to avoid this specific problem, there is a flake8 rule: https://pypi.org/project/flake8-no-implicit-concat.

    More broadly, the https://codereview.doctors makers are making the point that their tool caught an easy-to-miss issue that most wouldn't think to add a rule for. A bit of an open question to me how many of those there really are at the language level, but still seems like a neat project.

  • by arusahni on 1/7/22, 5:45 PM

    The removal of implicit string concatenation was proposed for Py3k[1], but was rejected.

    [1] https://www.python.org/dev/peps/pep-3126/

  • by shoyer on 1/7/22, 7:28 PM

    Most of the "bugs" caught here (including in TensorFlow and in my own project, Xarray) seems to actually be typos in the test suite. This is certainly a good catch (and yes, linters should check for this!), but seems a little oversold to me.
  • by usrbinbash on 1/7/22, 5:49 PM

    Literally the second item in the "Zen of Python" (https://www.python.org/dev/peps/pep-0020/):

    Explicit is better than implicit.

    And yet, s = ["one", "two" "three"] will implicitly and silently do something, that is probably wrong most of the time.

  • by LAC-Tech on 1/8/22, 5:29 AM

    A lot of people are criticising dynamic typing for this.

    It doesn't seem to have anything to do with typing discipline.

        words = (
            'yes',
            'correct',
            'affirmative'
            'agreed',
         )
    
    Would be a tuple (immutable list) of strings, while

        words = (
            'yes',
            'correct',
            'affirmative',
            'agreed',
         )
    
    would also be a tuple of strings.

    If haskell had for some reason decided to have the same syntax sugar, it also would have caused an issue.

  • by oaiey on 1/7/22, 5:49 PM

    I am a bit in shock. Accidental string concatenation. Python just lost a lot of reputation in my brain.
  • by karolkozub on 1/7/22, 5:40 PM

    I really like the idea of automated code review tools that point out unusual or suspicious solutions and code patterns. Kind of like an advanced linter that looks deeper into the code structure. With emerging AI tools like Github Copilot, it seems like the inevitable future. Programming is very pattern-oriented and even though these kinds of tools might not necessarily be able to point out architectural flaws in a codebase, there might be lots of low-hanging fruits in this area and opportunities to add automated value.
  • by pmontra on 1/7/22, 5:49 PM

    As a comparison, in Ruby

      puts "a" "b" == "ab" # true
    
    and

      puts "a"
        "b" == "ab"
    
    prints "a" with "b" == "ab" evaluated to false and discarded. This could create bugs as with Python. However

      ["a"
         "b"] == ["ab"]
    
    is syntax error at the beginning of the second line. The parser expects a ] It would evaluate to true if it were on one line.
  • by wirthjason on 1/7/22, 7:14 PM

    Ironic to see this today. I spent an hour debugging this very same issue this morning.

    I was just doing some simple refactoring, changing a hard coded sting into a parameterized list of f-strings that’s filtered and joined back into a string.

    I’m glad that I had unit tests that caught the problem! I couldn’t figure out why it was breaking, that comma is very devilish to spot with the naked eye. I’m surprised my linters didn’t catch it either. Maybe time to revisit them.

  • by wartijn_ on 1/7/22, 6:28 PM

    I like this. It's clearly meant as marketing for their product, but imo the best kind of marketing. They don't just run their tool and automatically make tickets, but check for false positive and (offer to) make pr's.

    It's both good for those projects and for the company that does the marketing since they reach there exact target group. Plus it gets them on the front page of HN.

  • by ehsankia on 1/7/22, 8:34 PM

    Nice! Internally we have a PCRE support on our code search and I regularly run a regex to find and fix these. I've also found a ton on opensource project which I've been trying to fix:

    https://github.com/YosysHQ/prjtrellis/pull/176

    https://github.com/UWQuickstep/quickstep/pull/9

    https://github.com/tensorflow/tensorflow/pull/51578

    https://github.com/mono/mono/pull/21197

    https://github.com/llvm/llvm-project/pull/335

    https://github.com/PyCQA/baron/pull/156

    https://github.com/dagwieers/pygments/pull/1

    https://github.com/zhuyifei1999/guppy3/pull/12

    https://github.com/pyusb/pyusb/pull/277

    https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull...

    It is indeed a very common mistake in Python, and can be very hard to debug. It bit me once and wasted a whole day for me, so I've been finding/fixing them ever since trying to save others the same pain I went through.

    EDIT: I will point out that I've found this error in other non-Python code too, such as c++ (see the 2nd PR for example).

    Here's the regex for anyone curious:

    [([{]\s*\n?(\s*['"](\w)+['"],\n)+(\s*['"]\w+['"]\n)(\s*['"]\w+['"],\n)*

  • by titzer on 1/7/22, 6:04 PM

    Just to be clear, the V8 "bug" was in the test runner code and caused mis-parsing of command line options for testing for non-SSE hardware. Not exactly a critical bug.
  • by bilalq on 1/7/22, 6:17 PM

    The whole "666" thing really threw me off. I thought it was some Python specific term or something at first glance. They open with a sentence that mentions "5% of the 666 Python open source GitHub repositories" as though there were only 666 total open source Python GH repos. Picking a number with other fun connotations or whatever to use as a sample is fine, but without setting that context, it was kind of distracting from their main content.
  • by routerl on 1/7/22, 5:37 PM

    tl;dr: Python concatenates space separated strings, so ['foo' 'bar'] becomes ['foobar'], leading to silent bugs due to typos.

    I've been bitten by this one at work, and can't help but think it is an insane behaviour, given that ['foo' + 'bar'] explicitly concatenates the strings, and ['foo', 'bar'] is the much more common desired result.

    edit: This also applies to un-separated strings, so ['foo''bar'] also becomes ['foobar']

  • by prepend on 1/7/22, 5:40 PM

    This seems like not a big deal. It’s a common mistake and is in 5% of repos but it’s not causing major damage.

    And there’s no evaluation of importance as to whether these instances are in test files or non-critical code. Packages are big and can have hundreds or thousands of files.

    It could be that if these mattered, they would have been detected and fixed.

    A good example for unit tests and perhaps checking to see if these bugs are covered or not covered.

    I like these kinds of analyses but don’t like the presented like it’s some significant failure.

  • by pxeger1 on 1/8/22, 7:18 AM

    The first one, the implicit concatenation, I can see. But the rest of the things seem like most of the time they're intentional.

        {
            'key': (
                'long string long string long string'
            )
        }
    
    Using parentheses like this to put long strings on their own line is standard practice.

        title = 'Hello world',
    
    I, for one, have often used this deliberately.
  • by the_gigi on 1/8/22, 3:06 AM

    I often use split().

    Instead of:

      s = ['a', 'b', 'c']
    
    I'll type:

      s = 'a b c'.split()
    
    For multiline lists where I want to get rid of leading whitespace I'll add lstrip():

      lines = """line 1
                 line 2
                 line 3
      """.split('\n')
      lines = [line.lstrip() for line in lines]
  • by anonymousiam on 1/8/22, 5:04 AM

    Heh. "cromulent" again.

    ..."there are perfectly cromulent reasons a developer would do implicit string concatenation spanning multiple lines"...

    https://www.merriam-webster.com/words-at-play/what-does-crom...

  • by tyingq on 1/7/22, 5:43 PM

    Seems expected, as linters can't be sure when it's not intentional. Like this request to pylint:

    https://github.com/PyCQA/pylint/issues/1589

    Is there usually enough context for a linter to make an educated guess?

  • by Subsentient on 1/8/22, 12:00 AM

    Interesting. I've hit this bug before, but not often in Python as far as I can remember. I guess if I need a huge list of something, I'm more likely to look to a dict than use a list with normal indexes.
  • by einpoklum on 1/7/22, 11:51 PM

    I wonder how many of those 666 have syntax bugs which are _difficult_ to locate using code analysis tools, because they are legit in themselves and you need to know what the author meant to make the call.
  • by delgaudm on 1/8/22, 3:31 PM

    When I used to write code, especially SQL statements I would:

        "put"
      , "Commas"
      , "first"
      , "to"
    
    avoid these kinds of things.
  • by gumby on 1/8/22, 1:39 AM

    Clearly bugs by programmers who don’t adhere to the Oxford comma.
  • by tus666 on 1/7/22, 6:36 PM

    Alternative title: 5% of Python repos has inadequate test coverage.
  • by timzaman on 1/8/22, 4:24 AM

    Haha the bug in Tensorflow is in "tensorflow/tensorflow/python/keras/engine/training_generator_test.py". clickbait.
  • by codeptualize on 1/7/22, 10:45 PM

    And then people make fun of JavaScript! (Just joking, I like Python, also JS, I guess everything has its quirks, it's a good thing we have linters)
  • by ficklepickle on 1/7/22, 8:53 PM

    Ironically there are a variety of typos in the article.

    A paragraph is repeated and the markdown links at the end are broken because there is a space between ] and (.

  • by asow92 on 1/7/22, 5:32 PM

    I'm sure the devil is in the details on this bug.
  • by dragonwriter on 1/8/22, 10:43 PM

    v8 may be a repo that includes some Python, but there is no reasonable standard by which it is a “Python repo”.
  • by Forge36 on 1/7/22, 5:46 PM

    I wonder if any of the found issues will turn out to be important issues.
  • by jiveturkey on 1/7/22, 8:28 PM

    nice ad!
  • by Pensacola on 1/7/22, 10:18 PM

    Why 666?
  • by xvilka on 1/8/22, 1:41 AM

    Python was never supposed to be a language for anything more complex than basic scripting and prototyping. Use proper languages with static typing (and better speed) for anything serious. And no, JavaScript isn't a good language either.
  • by dannymi on 1/7/22, 11:25 PM

    I can see the value of a lint (if there's a newline without a comma, warn), but concatenating strings by multiplication is the correct thing to do (since it's also used this way in mathematics of parsers).

    Using the plus operator to concatenate strings is just weird.

    Think of the usual algebraic properties these operators are supposed to have.

    "+" always is supposed to be commutative--so "a"+"b" = "b"+"a", if those mean alternatives (they usually do mean that in mathematics), is just fine.

    On the other hand, multiplication is often not commutative--also not here. "a" "b" != "b" "a".

    So string concatenation should be the latter. And indeed that's how it's in regular expression mathematics for example.

  • by kazinator on 1/7/22, 5:54 PM

    Not in Lisp! ("foo" "bar") and ("foobar") are lists of length 2 and 1, respectively.

    (Python copies some bad ideas from C. Another one is having to import everything you use. It seems that since Python is written in C, its designer took it for granted that there will be something analogous to #include for using libraries, even standard ones that come with the language.)

    Implicit string literal catenation is tempting to implement because it solves problems like:

       printf("long %s string"
              "nicely breaks up"
              "with indentation and all",
              arg, arg, ...)
    
    and if you're working in a language which has comma separation everywhere, you can get away with it easily.

    There are other ways to solve it. In TXR Lisp, I allow string literals to go across multiple lines with a backslash newline sequence. All contiguous unescaped whitespace adjacent to the backslash is eaten:

      This is the TXR Lisp interactive listener of TXR 273.
      Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
      TXR needs money, so even abnormal exits now go through the gift shop.
      1> "abcd \
          efg"
      "abcdefg"
    
    If you want a significant space, you can backslash escape it; the exact placement is up to you:

      2> "abcd\ \
          efg"
      "abcd efg"
      3> "abcd    \
         \ efg"
      "abcd efg"
      4> "abcd    \ \
                   efg"
      "abcd     efg"
      5> "abcd    \ \
         \         efg"
      "abcd              efg"