from Hacker News

The technology behind GitHub’s new code search

by joshbetz on 2/6/23, 5:32 PM with 181 comments

by ezekg on 2/6/23, 6:10 PM
I use their new code search a lot to grok how people use certain features, or implement certain things. But I do wish there was a way to filter out forks. Sometimes I search a string and just get a bunch of forks all with the same result. For example, searching a common class in a Rails app often just shows a bunch of rails/rails forks, which is a lot of noise to sift through when you're trying to see how devs commonly use a certain feature.
by andrewmcwatters on 2/6/23, 8:06 PM
> Just use grep? First though, let’s explore the brute force approach to the problem. We get this question a lot: “Why don’t you just use grep?” To answer that, let’s do a little napkin math using ripgrep on that 115 TB of content. On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.
But you don't NEED to do this do you? I'm ALREADY in a repository, I just don't want to check out, say all of WebKit, I just need to find where a specific reference is defined.
Maybe, maybe on a really serious day do I need to search an entire organization. But hardly ever.
I have never, in over a decade ever, wanted sophisticated symbolic searching from GitHub code search, I just need remote grep.
Why is the code search not feature bisected into this 99% use case, and then the occasional global repository search, which can behave entirely differently?
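For anyone who wants to check the quoted napkin math, here is a rough sketch. The 13 GB, 2.769 s, eight cores and 115 TB come from the quote; the 1,000-core cluster at the end is a made-up number for illustration, not a figure from the article.

```python
# Figures taken from the quoted passage.
CORPUS_TB = 115
FILE_GB = 13
SECONDS = 2.769
CORES = 8

# Throughput of one core, in GB/sec.
per_core = FILE_GB / SECONDS / CORES

# Total work for one exhaustive scan, in core-seconds (1 TB = 1000 GB).
core_seconds = CORPUS_TB * 1000 / per_core


def wall_clock_seconds(total_cores):
    """Time for one query if the scan parallelises perfectly (an idealisation)."""
    return core_seconds / total_cores


print(f"{per_core:.2f} GB/sec/core")
print(f"{core_seconds / 3600:.0f} core-hours per query")
# Hypothetical cluster size, chosen only to make the scale concrete.
print(f"{wall_clock_seconds(1000):.0f} s per query on 1000 cores")
```

Even under perfect parallelism, that is minutes of cluster time per query for the global case, which is separate from the single-repository case the comment is asking about.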
by PaulHoule on 2/6/23, 7:44 PM
My beef with GitHub's code search is that it doesn't distinguish between the definition of a symbol and the uses of the symbol, so you need to wade through 5 pages of results to get the one result you're looking for. I would contrast that to my IDE which usually scores a direct hit if I enter a search in the right box.
The indexing they talk about in that article seems like rearranging the deck chairs on the Titanic so far as that is concerned.
by kirillbobyrev on 2/6/23, 10:07 PM
This is exciting! I see a lot of familiar pieces here that propagated from Google's Code Search, and I know a few people from Code Search went to GitHub, probably specifically to work on this. I always wondered why GitHub didn't invest in decent code search features, but I'm happy it is finally getting to the state of the art one step at a time. Some of the folks I know who went to GitHub to work on this are just incredible, and I have no doubt GitHub's code search will be amazing.
I also worked on something similar to the search engine that is described here for the purposes of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018 and it was very successful in reducing the delays and latencies in the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is a hot path of Clangd. I made a huge effort to document it as much as I could and the code is, I believe, very readable (although I'm obviously very biased because I spent a lot of time with it).
Here is the implementation:
https://github.com/llvm/llvm-project/tree/main/clang-tools-e...
I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...
by Scaevolus on 2/6/23, 6:26 PM
As a comparison to Sourcegraph: Sourcegraph shards and indexes a repository at a time, and uses trigrams and bloom filters (to skip shards).
GitHub shards and indexes individual files according to their hashes. It also uses variable-length ngrams (neat!). This makes horizontal scaling simpler, but also means more of the index needs to be scanned for org/repo-scoped queries ("Due to our sharding strategy, a query request must be sent to each shard in the cluster.").
by boyter on 2/6/23, 7:11 PM
The sparse grams solution to deal with stupidly common ngrams such as "for" or "tes" is very interesting.
I’d love to see more discussion on how they are dealing with the false positives though. It looks like a positional index is being used to achieve this, but that usually blows out your index size.
Additional information about deduplication would be especially interesting to me as well. It seems to solve this quite well. I usually try a search for jQuery to test this, and it does not return multiple copies of different versions of it, which is a good indicator that it's slightly fuzzy.
What I find really interesting about all the code search engines I know of is that each one implemented its own index. Nobody is using off the shelf software for this. I suspect that might be down to no off the shelf software providing a decent enough solution, and none providing a solution that scales. At least none that scales with decent costs.
I did a small comparison of GitHub code search a while ago https://twitter.com/boyter/status/1480667185475244036?s=61&t... But I should note a lot has improved since then, and it looks like sourcegraph now also does default AND of terms rather than exact match, so my complaints there are resolved.
Impressive work by GitHub. I am sure some of the people behind it will read this comment, let me say well done to you all. I am very impressed. Also please post more information like this. There is so little out there.
by ZephyrBlu on 2/6/23, 6:12 PM
I really hope they release this soon and that it’s actually good.
The current search sucks ass, you can’t find anything.
I was trying to search for something in the WebKit source the other day and I had to use Sourcegraph because the GitHub search gave me zero results.
by colin353 on 2/6/23, 6:43 PM
Hey everyone, I'm Colin from GitHub's code search team: happy to answer any questions people have about it. Also, you can sign up to get access here: https://github.com/features/code-search
by simonw on 2/6/23, 6:45 PM
I really appreciate that this includes details about how search permissions work - how they ensure that search results include data from my private repos.
I'd always wondered how they implemented that: it turns out they add extra internal filters to their searches along the lines of "RepoIDs(...) or PublicRepo".
Question for the team: Do you have an additional permission check in the view layer before the results are shown to the end-user? I worry that if I switch a repo from public to private it may take a while for the code search index to catch up to the new permissions.
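To make the "RepoIDs(...) or PublicRepo" idea concrete, here is a minimal sketch of a permission filter applied alongside the text match. The `Doc` type, field names and `search` function are hypothetical, invented for illustration; this is not GitHub's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    repo_id: int
    public: bool
    path: str
    text: str


def visible(doc, allowed_repo_ids):
    # Mirrors the shape of the filter: RepoIDs(...) or PublicRepo.
    return doc.public or doc.repo_id in allowed_repo_ids


def search(docs, term, allowed_repo_ids):
    """Return paths of matching documents the caller is allowed to see."""
    return [d.path for d in docs if term in d.text and visible(d, allowed_repo_ids)]


docs = [
    Doc(1, True, "a.py", "token = 1"),
    Doc(2, False, "b.py", "token = 2"),
    Doc(3, False, "c.py", "token = 3"),
]

print(search(docs, "token", {2}))    # public repo 1 plus private repo 2
print(search(docs, "token", set()))  # anonymous caller: public only
```

The staleness question still applies to a scheme like this: if `public` is read from the index, it is only as fresh as the last re-index, which is why a second check against live permissions at render time would be the safer design.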
by saagarjha on 2/6/23, 7:39 PM
I’ve been using the new code search for a couple of months and I like it, but the UI is kind of antagonistic to how I typically want to search for things. For one, the new experience doesn’t actually load code onto the page, it does some sort of lazy loading thing as you scroll around, so ⌘F doesn’t work. I understand that there’s a custom search box to try to get around this but it’s pretty slow and fiddly and I don’t really want to use it. I also find the layout to be pretty annoying, because invariably there’s a symbol panel on the side that doesn’t work for the code I want to look at, and then it’s just there taking space. If I hit “t” to enter a file name and start typing the text field loses focus after a second and I need to click on it again. I know there are a couple of people on the team in this thread: I search a lot of code on GitHub and I feel like there’s a couple of tweaks that would greatly improve my experience. Like, I think I could even show you a video of all the places where the UI has gotten less usable for me. What would be the best way to get this feedback to you? I’ve posted stuff on the forum or whatever but it’s unclear to me if this is the intended way to raise issues.
by kjuulh on 2/6/23, 6:22 PM
I really like the new search, though sometimes it is a bit deceptive. E.g. when you search for a function name by clicking on a piece of code, you can suddenly end up in an entirely different code base with an unrelated function that happens to share the name.
It feels like GitHub code browsing is a step between a full editor with LSP and a static site. I hope they work out the kinks and make it smoother.
by tonymet on 2/6/23, 6:33 PM
This is a great intro / overview of full-text search for those wondering how to build your own search engine.
It's a great 101-level exercise to write an inverted index implementation (you can do it in an afternoon), and then expand to a leaf/aggregator architecture in follow-up exercises.
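As a starting point for that exercise, here is a minimal sketch of a word-level inverted index with AND semantics. The tokenizer and function names are my own choices for illustration, and there is no ranking or leaf/aggregator split here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into identifier-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


docs = {
    1: "def parse_config(path):",
    2: "config = parse_config('app.toml')",
    3: "print('hello')",
}
index = build_index(docs)
print(search(index, "parse_config"))
print(search(index, "config parse_config"))
```

Because this indexes whole tokens, it cannot answer substring or regex queries, which is the gap that ngram indexes are meant to close.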
by drcongo on 2/6/23, 7:40 PM
With current search, I can search [0] the Django repo for a class that definitely exists [1] in Django, there are 0 code results. Zero. GitHub search is mystifyingly bad, I hope this is a LOT better.
[0] https://github.com/django/django/search?q=DeleteView&type=co...
[1] https://github.com/django/django/blob/main/django/views/gene...
by debdut on 2/6/23, 6:49 PM
https://grep.app
by gavinray on 2/6/23, 9:01 PM
I just want to say thank-you to the folks who work on Code Search at GitHub.
It's the number one way I research and understand new libraries/API's and programming languages.
There's a lot more you can learn from usage in the wild than tutorial posts sometimes.
by Daffodils on 2/6/23, 7:18 PM
Was looking for more details on the data structure 'Geometric filter' mentioned in the footnotes. Couldn't find anything (a few unrelated papers in object recognition aside). If anybody can share anything, that would be great!
by solarkraft on 2/6/23, 6:22 PM
Damn, it's about time, the current search sucks. What I have found to work very well is SourceGraph; they offer search for public repos. Maybe this'll be an alternative to it.
by tuan on 2/6/23, 9:51 PM
I wish they provided short-name versions of their filters. For example: instead of "withContext language:python path:tests", I could write "withContext l:python p:tests".
by Existenceblinks on 2/6/23, 6:18 PM
Writing Blackbird in Rust is a natural approach. Those who try to sell building the whole thing with one thing are unwise (looking at you, isomorphic JavaScript).
by purkka on 2/7/23, 7:03 AM
My biggest feature request would be sorting or filtering by code/commit/repo age, or even repo stars.
Most often I end up using code search for figuring out where a piece of code originated, just to find thousands of random projects that have also copied the same code verbatim. Sorting for "relevance" or "latest/oldest indexed" are equally useless.
by mperham on 2/6/23, 8:20 PM
On the spectrum of "build vs buy", this is a good example where a business should build it. Scaling code search is their core value.
by chatmasta on 2/6/23, 8:42 PM
In general, I really recommend code search as a tool for supplementing reading the documentation and source code of your dependencies (you are reading the source code, right?). I reach for it almost every day, and I find it's a reliable tool for identifying "the right way" to use a library, especially one that isn't fully documented.
by robertlagrant on 2/6/23, 7:31 PM
Not to diminish this excellent work, but:
1) I never want to search all repos globally. At worst I want to search all of my org's repos.
2) the search UI is a little clunky, in a way I'd need to be using it again to remember.
Between those two I think there's loads of progress to be made outside of raw search power. Of course it's nice to have that, but that's what I'm really after.
by loginatnine on 2/6/23, 6:13 PM
I'm curious if they'll open source Blackbird, it does not seem mentioned in the post.
by Beefin on 2/6/23, 6:16 PM
If you ever want to search binary files (image, video, pdf, etc.) within github repos: https://learn.mixpeek.com/github-search/
by _pastel on 2/7/23, 4:00 AM
So in the sparse grams explanation, what are the bigram weights?
Is it inverse frequency, so common bigrams get split last? And the goal is to be able to search on a larger gram that covers the more common trigrams as often as possible?
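The thread doesn't say what the weights are, so the following is only a way to experiment with the question. It assumes one possible selection rule (emit the span between two bigrams when every bigram strictly between them weighs less than both borders) and uses invented weights for the string "chester". Both the rule as coded and the numbers are assumptions, not GitHub's actual function.

```python
def sparse_grams(text, weight):
    """Enumerate variable-length grams under the assumed border rule."""
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    w = [weight(b) for b in bigrams]
    grams = []
    for i in range(len(w)):
        for j in range(i + 1, len(w)):
            border = min(w[i], w[j])
            # Adjacent bigrams have no inner weights, so every trigram is kept.
            if all(w[k] < border for k in range(i + 1, j)):
                # Bigram j covers characters j and j + 1, hence the j + 2.
                grams.append(text[i:j + 2])
    return grams


# Invented weights, purely for illustration.
HYPOTHETICAL_WEIGHTS = {"ch": 9, "he": 6, "es": 3, "st": 1, "te": 4, "er": 7}

grams = sparse_grams("chester", HYPOTHETICAL_WEIGHTS.__getitem__)
print(sorted(grams, key=len))
```

With a rule like this, a low weight on a common bigram lets a longer gram span across it, so an inverse-frequency weighting would fit the guess in the question, but that remains a guess.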
by Waterluvian on 2/6/23, 7:07 PM
This looks delightful!
One nit I have about current search: I’ll look something up and find I’m getting results for some obtuse commit in some old branch somewhere. I’d like to be able to optionally say “latest commit on branches only please” or “main branch only please.”
Another thing, which might betray that I don’t understand search all that well: language aware searching that knows, for example, that a single or a double quote are syntactically interchangeable. Don’t omit half the results because I used one quote over the other when looking up `interpolation = ‘nearest’`
by j1elo on 2/6/23, 11:07 PM
Will this allow for a happy closure of this question about searching partial words? [0]
Like searching for "OPTION" and getting "-DOPTION=TRUE" among the results. Very commonly needed to find all usages of a flag, even instances where the flag is being passed to (at least, that I know of) CMake and Meson.
[0]: https://stackoverflow.com/questions/43891605/search-partial-...
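To illustrate why an ngram index can answer this kind of partial-word query where a whole-word index cannot, here is a small sketch: trigrams narrow down candidate lines, then a plain substring check confirms them. The sample lines and function names are made up for the example.

```python
def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_trigram_index(lines):
    """Map each trigram to the set of line numbers containing it."""
    index = {}
    for line_no, line in enumerate(lines):
        for gram in trigrams(line):
            index.setdefault(gram, set()).add(line_no)
    return index


def substring_search(index, lines, needle):
    grams = trigrams(needle)
    if not grams:
        # Needle shorter than 3 characters: fall back to scanning everything.
        candidates = set(range(len(lines)))
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The index only yields candidates; confirm each with a real match.
    return sorted(n for n in candidates if needle in lines[n])


lines = [
    "cmake -DOPTION=TRUE ..",
    'option(OPTION "desc" OFF)',
    "meson setup build",
]
index = build_trigram_index(lines)
print(substring_search(index, lines, "OPTION"))
```

A tokenizer that treats "-DOPTION=TRUE" as a single word would never surface line 0 for the query "OPTION", whereas the trigram lookup does not care about word boundaries.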
by imadethis on 2/6/23, 5:50 PM
Sourcegraph should’ve accepted that offer from GitHub.
by Cian911 on 2/7/23, 2:49 PM
> Shard by Git blob object ID which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication. There won’t be any hot servers due to special repositories and we can easily scale the number of shards as necessary.
What exactly do they mean by "special repositories" here?
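For context on the quoted sentence: a Git blob ID is the SHA-1 of a small header plus the file content, so it depends only on the bytes of the file and not on which repository holds it. A minimal sketch follows; the modulo mapping to shards and the shard count are assumptions of mine, not something the article specifies.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def shard_for(content: bytes, num_shards: int) -> int:
    # Assumed mapping for illustration: interpret the ID as an integer, take it modulo N.
    return int(git_blob_id(content), 16) % num_shards


same_file_in_two_forks = b"hello world\n"
print(git_blob_id(same_file_in_two_forks))
print(shard_for(same_file_in_two_forks, 32))
```

Under this reading, identical files in thousands of forks collapse to one ID on one shard, and a huge or very popular repository (if that is what "special" means, which is my assumption) has its files spread across all shards instead of landing on a single hot server.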
by WoodenChair on 2/7/23, 12:22 AM
The biggest problems I have with their code search are basic usability features, not the search itself. I need a way to exclude private repositories in the result so I’m not clogged by internal instances of what I’m looking for. I need the UI to improve so I don’t have to go to advanced search for every filter I want to do.
by Royaljj on 2/7/23, 12:37 AM
Interesting stuff. I was curious how they search for repeated letters through the ngram index. I understand their example search with the string “limits” (find the intersection of “lim”, “imi”, “mit” and “its”). However, if the user wants to search for the string “aaaaa”, how would they go about searching that?
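One way to see the problem: "aaaaa" decomposes into the single distinct trigram "aaa", so a plain intersection would also return documents that contain only "aaa". A positional index is one possible fix (not necessarily what GitHub does); the sketch below, with made-up names and documents, requires the trigrams to appear at consecutive offsets.

```python
from collections import defaultdict


def trigram_positions(text):
    """Return (trigram, offset) pairs for every position in the text."""
    return [(text[i:i + 3], i) for i in range(len(text) - 2)]


def build_positional_index(docs):
    """Map trigram -> doc id -> set of offsets where it occurs."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for gram, pos in trigram_positions(text):
            index[gram][doc_id].add(pos)
    return index


def search(index, query):
    grams = trigram_positions(query)
    if not grams:
        raise ValueError("query must be at least 3 characters")
    first_gram, _ = grams[0]
    hits = set()
    for doc_id, starts in index.get(first_gram, {}).items():
        for start in starts:
            # Every query trigram must sit at its expected offset from start.
            if all(start + off in index.get(g, {}).get(doc_id, ()) for g, off in grams):
                hits.add(doc_id)
                break
    return hits


docs = {1: "rate limits apply", 2: "aaa", 3: "xaaaaay", 4: "limit"}
index = build_positional_index(docs)
print(search(index, "limits"))
print(search(index, "aaaaa"))
```

The cost is the one raised elsewhere in the thread: storing offsets makes the index considerably larger than storing document ids alone.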
by ZephyrBlu on 2/6/23, 7:10 PM
Search is a fascinating topic because it's such a fundamental problem and every search engine is based around the same extremely simple data structure (Posting list/inverted index). Despite that, search isn't easy and every search engine seems to be quite unique. It also seems to get exponentially harder with scale.
You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: https://pagefind.app/.
For anyone who doesn't know anything about search I highly recommend reading this (It's mentioned in the blog post as well): https://swtch.com/~rsc/regexp/regexp4.html.
Algolia also has a series of blog posts describing how their search engine works: https://www.algolia.com/blog/engineering/inside-the-algolia-....
---
It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts which replicate data and they embed their search engine in Nginx:
"Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."
I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:
"Each index is a binary file in our own format. We put the information in a specific order so that it is very fast to perform queries on it."
"Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."
https://stackshare.io/posts/how-algolia-built-their-realtime...
100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs ranking. I've seen Dan Luu say that the majority of time should be spent ranking rather than searching, and when I've snooped on https://hn.algolia.com I've seen single-digit millisecond search times in the responses, which seems to corroborate this.
I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though; I have to assume that's a bigger win than 50% faster ingestion.
Since they're indexing by language I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy though: "Assume you have some function that given a bigram gives a weight".
Search is so cool. I could talk about it all day.
by hbn on 2/6/23, 7:46 PM
I've been using this since it was still an email signup beta. I don't do anything too complicated, but man it's been invaluable to do exact-string searches across all of my organization's repos. I use it most days at work
by webmaven on 2/7/23, 12:28 AM
Blackbird? I wonder if the name is coincidence or irony:
https://en.m.wikipedia.org/wiki/Blackbird_(online_platform)
by sidcool on 2/7/23, 4:18 AM
I feel search is the most complex domain tech wise. I always feel overwhelmed how people design such systems. Would love to learn more about search. Any books or courses? Right now I can only do binary search.
by jd3 on 2/6/23, 6:26 PM
I was working on a research project awhile ago and every time I searched for something particular it immediately thought I was a bot after like 2-3 particular/exact queries.
Ever since then, I've exclusively used sourcegraph.
by latchkey on 2/6/23, 11:55 PM
I wonder if they've done any work to deal with index pollution, such that one can achieve higher ranking in the results?
by user3939382 on 2/6/23, 10:16 PM
The cursor position in the free-form query terms in the search input doesn’t align correctly when the input contains tags.
by tantalor on 2/6/23, 7:08 PM
Why not kythe?
https://kythe.io/
by bjd2385 on 2/6/23, 7:08 PM
When can we have a usable search in GitLab?
by duckydude20 on 2/7/23, 8:27 AM
working with this much data is like high voltage engineering. so fascinating...
by cozos on 2/6/23, 6:20 PM
I have been waiting for this for so long.
by thinking001001 on 2/6/23, 9:42 PM
Hmm not sure if I should delete my (2nd) Github account again, just thinking about how much data they are getting from users, it could become the Facebook of Git.