from Hacker News

Semgrep: Semantic grep for code

by ievans on 4/22/21, 4:51 PM with 104 comments

  • by hyper_reality on 4/22/21, 6:06 PM

    This is an excellent tool to have as a security consultant, and it just keeps getting better and better. When approaching a large codebase, it enables you to write custom rules that match on certain antipatterns you've spotted that may be unique to the codebase. That's the real value of the tool, but the repository of per-language rules is also convenient for quickly finding low-hanging fruit (like every use of a potentially injectable function such as exec, system, etc. in PHP).

    For example, a webapp may have been designed such that authorisation needs to be explicitly added with a line or two to each controller. A semgrep rule can be written to match all the controllers which are missing this line. Then these controllers can be manually reviewed to assess whether unauthorised access should be allowed. Depending on what you are trying to match, this is something that may be very complex or even impossible to implement accurately in plain grep. Some languages like Ruby have powerful static analysis tools (Brakeman) that can also do this, but the benefit of Semgrep is the flexibility across multiple languages and how readable the rulesets are. [1]

    [1] https://blog.includesecurity.com/2021/01/custom-static-analy...
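
    The "controllers missing an authorisation line" check described above can be sketched with Python's standard ast module. This is not a Semgrep rule and not how Semgrep is implemented; it only illustrates the idea of matching on structure rather than text. The names require_auth, list_users and delete_user are hypothetical.

```python
import ast

# Hypothetical controller module: one handler checks authorisation, one does not.
SOURCE = '''
def list_users(request):
    require_auth(request)
    return render_users()

def delete_user(request, user_id):
    return do_delete(user_id)
'''


def functions_missing_call(source: str, required: str) -> list[str]:
    """Return the names of functions whose body never calls `required`."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Collect every plain-name call made anywhere inside this function.
            called = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
            if required not in called:
                missing.append(node.name)
    return missing


if __name__ == "__main__":
    print(functions_missing_call(SOURCE, "require_auth"))  # ['delete_user']
```

    The flagged functions are candidates for manual review, not confirmed bugs: a handler may legitimately allow unauthenticated access.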

  • by thesuperbigfrog on 4/22/21, 6:16 PM

    The name "Semantic Grep" does not give a good idea for what this tool is and what it does.

    The web page states: "Static analysis at ludicrous speed. Find bugs and enforce code standards"

    "grep" is short for "global regular expression print". It finds matches for the given regular expression and prints them.

    "Semantic Grep" is a static analyzer with configurable rules, style checks, etc. It does much more than search and print.

    Perhaps a better name is needed?

    Edit: How about "omnilint" or "omnicritic" since semgrep is more of a "lint" (https://en.wikipedia.org/wiki/Lint_(software)) or "critic" (https://en.wikipedia.org/wiki/Perl::Critic) type of tool that handles multiple languages?

    Edit2: "Static analysis at ludicrous speed" ==> "turbolint"? ("ludicrous speed" reminds me of the hilarious Spaceballs scene :) "turbolint, GO!"
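
    For contrast, the "find regex matches and print them" behaviour of plain grep can be sketched in a few lines of Python. This is a simplification of grep, assuming line-by-line matching with Python's re module; the sample lines are made up.

```python
import re


def grep(pattern: str, lines: list[str]) -> list[str]:
    """Return every line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


if __name__ == "__main__":
    sample = [
        "print(x)",
        "print(",              # a call split across two lines
        "    x)",
        "# print(x) is banned",
    ]
    for line in grep(r"print\(x\)", sample):
        print(line)
```

    On the sample above, the line-based match reports the comment and misses the call that is split across two lines, which is the gap a syntax-aware matcher is meant to close.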

  • by westurner on 4/22/21, 6:17 PM

    Is there a more complete example of how to call semgrep from pre-commit (which gets called before every git commit) in order to prevent e.g. Python print calls (print(), print \\n(), etc.) from being checked in?

    https://semgrep.dev/docs/extensions/ describes how to do pre-commit.

    Nvm, here's semgrep's own .pre-commit-config.yaml for semgrep itself: https://github.com/returntocorp/semgrep/blob/develop/.pre-co...
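
    To make the "no print calls" check concrete, here is a rough sketch of what such a check has to do, written with Python's standard ast module rather than Semgrep. The function names find_print_calls and main are made up for illustration, and this is not the hook configuration from the linked file.

```python
import ast
import sys


def find_print_calls(source: str) -> list[int]:
    """Return the line numbers of every call to the name `print`."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    )


def main(paths: list[str]) -> int:
    """Report print calls in the given files; return 1 if any were found."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno in find_print_calls(handle.read()):
                print(f"{path}:{lineno}: print call found", file=sys.stderr)
                status = 1
    return status
```

    Because the match is on the parsed call, spacing and line breaks inside the call do not matter, and print mentioned in a comment or string is ignored. pre-commit passes the staged file names to a hook as arguments, so a local hook could call sys.exit(main(sys.argv[1:])) and block the commit on a non-zero exit.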

  • by SavantIdiot on 4/22/21, 5:42 PM

    Since the capability has never existed, I don't think in terms of being able to semgrep. If that makes any sense. My brain is not wired this way, yet.

    Like, if you've never tasted lychee, it would never occur to you how to cook with it.

    I'm going to need to see some useful, real-world examples to jumpstart my brain to think this way.

  • by joshuamorton on 4/23/21, 12:24 AM

    There's lots of confusion about what semgrep does here, which is kind of unfortunate. I haven't touched it much, but I have built a very similar tool (I'm one of the contributors to refex[1], which is a very similar project).

    The starting point of semantic grep is very useful. When you have a big codebase, you often want to detect antipatterns, or not even antipatterns, but just uses of a thing, say you're renaming a method and want to track down the callers.

    Being able to act on the AST, instead of hoping you searched up all of the variants of whitespace and line breaks and, depending on the specific example, different uses of argument passing, is really useful.

    But often when you're semantically grepping, your goal is to replace something with something else (this is what refex was initially built for: to aid in large scale changes in Python, as a sort of equivalent to the C++ tools that Google uses).

    But then you want to shift left even further: once you have a pattern that you want to replace once, you can just enforce that a linter yell at you when anyone does it again. So it's very natural to develop a linter-style thing on top of one of these[2].

    This is, as I understand it, sort of the same thing that happens in C++: clang-tidy and clang-format are written on top of AST libraries that can be used for ad-hoc analysis and transformations, but you can also just plug them into a linter.

    The thing is, for most organizations, enforcing code style and best practices is more valuable than applying a refactoring to 10M lines of code, because most organizations don't have 10M lines of code to refactor. That doesn't mean that these tools aren't also useful for ad-hoc transforms and exploratory analysis. They absolutely are!

    [1]: https://github.com/ssbr/refex

    [2]: https://github.com/ssbr/refex/tree/main/refex/fix
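
    The "rename a method and track down the callers" case above can be sketched with Python's standard ast module. This does not use refex or Semgrep and does not reflect their APIs; the class RenameMethod, the helper rename_method_calls, and the method names fetch and download are hypothetical. It assumes Python 3.9+ for ast.unparse.

```python
import ast


class RenameMethod(ast.NodeTransformer):
    """Rename calls of the form `<anything>.old(...)` to `<anything>.new(...)`."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new
        self.call_lines: list[int] = []

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # handle nested calls first
        if isinstance(node.func, ast.Attribute) and node.func.attr == self.old:
            self.call_lines.append(node.lineno)
            node.func.attr = self.new
        return node


def rename_method_calls(source: str, old: str, new: str) -> tuple[str, list[int]]:
    """Return the rewritten source and the lines where calls were found."""
    renamer = RenameMethod(old, new)
    new_tree = renamer.visit(ast.parse(source))
    return ast.unparse(new_tree), sorted(renamer.call_lines)


if __name__ == "__main__":
    example = (
        "client.fetch(url)\n"
        "client . fetch(\n"
        "    url,\n"
        "    timeout=5,\n"
        ")\n"
        'note = "client.fetch(url)"\n'
    )
    rewritten, lines = rename_method_calls(example, "fetch", "download")
    print(lines)
    print(rewritten)
```

    Both call variants are found regardless of spacing and line breaks, and the string literal is left alone. Note that ast.unparse discards comments and original formatting, so a real large-scale change tool would need to edit the source text in place instead of regenerating it.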

  • by enriquto on 4/22/21, 5:24 PM

    > You need to enable JavaScript to run this app.

    Wait, is this a web app? I was expecting a command line tool to navigate my code locally.

  • by unwind on 4/22/21, 6:45 PM

    When tools like this use terms like "legacy languages", and don't show that C is supported unless you click "More Languages", it makes me feel old. :)

    Still, it seems rather cool, I like the idea of being able to search code at a higher level than just raw source text.

  • by kesterallen on 4/22/21, 6:04 PM

    Typo in the "Trying Semgrep" screenshot ("ruleste"): https://semgrep.dev/static/media/Step1.df848497.png
  • by jhgb on 4/22/21, 5:21 PM

    Isn't "grep for code" called just "grep"?
  • by leafmeal on 4/22/21, 8:52 PM

    What does this give you over writing a flake8 plugin (for Python at least)?

    I've found the flake8 API and documentation lacking, so perhaps just a cleaner interface?

  • by rmetzler on 4/22/21, 6:17 PM

    Looks like a useful tool for me and I would like to try it.

    Go down, see "brew install semgrep" and try to copy paste it. And it's an image :(

  • by hn_throwaway_99 on 4/22/21, 6:40 PM

    I currently use a highly opinionated ESLint config (based on the airbnb one) together with strict checking in my TypeScript config, and it is configured to run on every commit with husky git hooks. The example given on the Semgrep homepage is an exact match to one that exists in my ESLint config (eslint's no-console rule).

    How does Semgrep compare to ESLint+a strict tsconfig?

  • by shuringai on 4/22/21, 10:05 PM

    This is a much better alternative to CodeQL used by Google and does not use a shameless registration-only model! Thanks for sharing
  • by vlovich123 on 4/22/21, 7:50 PM

    I want the ease of use of their AST specification with the power of clang’s refactor tool. Has anyone attempted to do that?
  • by pabs3 on 4/22/21, 10:31 PM

    Does it come with a standard set of rules that finds bad code without any false positives out of the box? Or is it more of a tool for people doing code security audits & pentesting who know what they are looking for and want to read the surrounding code?
  • by layer8 on 4/22/21, 6:18 PM

  • by CGamesPlay on 4/22/21, 11:33 PM

    How much does the CI service cost? I can't seem to find any information about it on the website without creating an account.
  • by nojvek on 4/23/21, 6:50 AM

    The underlying package tree-sitter that semgrep uses is pretty amazing too. It’s an incremental parser for many different languages written in C.

    It blows my mind how fast it is compared to many tools in the JS ecosystem. Tree-sitter was parsing millions of files in half a minute. JS, TS, Ruby, YAML, HTML, CSS. It’s quite magical. Such great engineering.

  • by vindarel on 4/23/21, 8:11 AM

    Interesting. Looks similar to Comby: https://comby.dev/ "a tool for searching and changing code structure". Comby is more focused on rewriting; it has less CI integration (though you can do it) and is less geared towards reporting.
  • by wdb on 4/22/21, 11:10 PM

    Apparently this is invalid TypeScript (cannot parse it says):

      try {
        const parsedURL = new URL(url)
        requestPath = parsedURL.pathname
      } catch (error: unknown) {
        // NOOP
      }
    
    It's complaining about the ": unknown" bit, which one of the newer TypeScript ESLint rules enforces.
  • by realquadrant on 4/22/21, 10:38 PM

    Hi, this is very cool. I have been building up a suite of tools to roll out across major open source projects to improve security. I like what I have seen so far, this is a great use case. Whom can I connect with to learn more? Also, how does it compare with Sourcegraph, which I also like a lot?
  • by silasb on 4/22/21, 6:19 PM

    Just the tool that I was looking for. We are looking to do Service linting in our organization as a method of making sure our services don't drift too far apart.

    Anyone else know of a Service linting tool? OPA/conftest come close but lack syntax parsers for Ruby/Javascript.

  • by more_corn on 4/22/21, 8:11 PM

    I used to use SAST-SCAN but that seems to be abandonware. I like that this exists. Everyone should go from nothing to something in the SAST space. A free/freemium tool/service for that is pretty great. The first couple of runs have found useful results.
  • by afro88 on 4/22/21, 7:46 PM

    No swift support yet. What would be involved in adding it?
  • by minusf on 4/22/21, 10:50 PM

    probably doing something wrong but running the ci ruleset on a tiny django hobby project made all cores spin at 100% after 33% of the progress bar and made the OS almost unresponsive. ctrl-c after 5 minutes and i still had to pkill every semgrep process... never seen the M1 airbook overheat this much before.
  • by sriram_malhar on 4/23/21, 9:31 AM

    Nice looking tool.

    Is there a way to search for functions in C (other than printf!) whose return value is ignored at the call site?

  • by pantuza on 4/22/21, 8:59 PM

    Those guardrail rules from semgrep are really outstanding. Useful for enforcing code standards. Thanks for sharing the tool.
  • by globular-toast on 4/23/21, 8:05 AM

    Whenever I see "at ludicrous speed" or something to that effect, I now assume it's slow.
  • by Annatar on 4/23/21, 6:27 AM

    I click on the link above and I get a seemingly blank page, all because the website uses some JavaScript garbage and violates W3C standards. That's the ridiculous, disgusting state of the information technology industry in the 21st century. I rue the day I decided to do this professionally, and I am deeply ashamed and despondent.
  • by hardon4semgrep on 4/23/21, 8:15 AM

    How does this compare to the tools available at large companies like Google and Facebook?
  • by solipsism on 4/23/21, 5:30 AM

    What's the status of C++ support?