by joshbetz on 2/6/23, 5:32 PM with 181 comments
by ezekg on 2/6/23, 6:10 PM
by andrewmcwatters on 2/6/23, 8:06 PM
But you don't NEED to do this do you? I'm ALREADY in a repository, I just don't want to check out, say all of WebKit, I just need to find where a specific reference is defined.
Maybe, maybe on a really serious day do I need to search an entire organization. But hardly ever.
I have never, in over a decade ever, wanted sophisticated symbolic searching from GitHub code search, I just need remote grep.
Why is the code search not feature bisected into this 99% use case, and then the occasional global repository search, which can behave entirely differently?
by PaulHoule on 2/6/23, 7:44 PM
The indexing they talk about in that article seems like rearranging the deck chairs on the Titanic so far as that is concerned.
by kirillbobyrev on 2/6/23, 10:07 PM
I also worked on something similar to the search engine that is described here for the purposes of making auto-complete fast for C++ in Clangd. That was my intern project back in 2018 and it was very successful in reducing the delays and latencies in the auto-complete pipeline. That project was a lot of fun and was also based on Russ Cox's original Google Code Search trigram index. My implementation of the index is still largely untouched and is a hot path of Clangd. I made a huge effort to document it as much as I can and the code is, I believe, very readable (although I'm obviously very biased because I spent a lot of time with it).
Here is the implementation:
https://github.com/llvm/llvm-project/tree/main/clang-tools-e...
I also wrote a... very long design document about how exactly this works, so if you're interested in understanding the internals of a code search engine, you can check it out:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiG...
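For anyone who wants the trigram-index idea in concrete form, here is a minimal sketch. It is not the Clangd implementation or Google Code Search itself; the class and function names (`TrigramIndex`, `trigrams`) are made up for illustration, and it only handles plain substring queries. The index maps every 3-character substring to the documents containing it, intersects the posting sets for the query's trigrams to get candidates, and then verifies each candidate, because sharing all trigrams does not guarantee the full substring is present.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical toy index: trigram -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query):
        grams = trigrams(query)
        if not grams:
            # Queries shorter than 3 characters have no trigrams,
            # so every document is a candidate.
            candidates = set(self.docs)
        else:
            # A match must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Verification step: drop candidates that only share trigrams.
        return sorted(d for d in candidates if query in self.docs[d])


idx = TrigramIndex()
idx.add("a.cc", "int main() { return 0; }")
idx.add("b.cc", "void maintain() {}")
print(idx.search("main"))
print(idx.search("main()"))
```

The verification pass is what removes false positives; the index itself only narrows down which documents are worth scanning.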
by Scaevolus on 2/6/23, 6:26 PM
GitHub shards and indexes individual files according to their hashes. It also uses variable length ngrams (neat!). This makes horizontal scaling simpler, but also means more of the index needs to be scanned for org/repo-scoped queries ("Due to our sharding strategy, a query request must be sent to each shard in the cluster.").
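A rough sketch of why hash-based sharding forces a fan-out even for repo-scoped queries. Everything here is an assumption for illustration: the shard count, the use of SHA-1 over file content as the hash, and the `Cluster` class are hypothetical, not GitHub's actual design. Because placement depends only on content, one repository's files end up scattered over all shards, so a query scoped to that repository still has to visit each shard.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(content: bytes) -> int:
    """Pick a shard from the content hash (assumed scheme)."""
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class Cluster:
    def __init__(self):
        self.shards = [[] for _ in range(NUM_SHARDS)]

    def index(self, repo, path, content: bytes):
        # Placement ignores the repo: identical content always
        # lands on the same shard, whichever repo it came from.
        self.shards[shard_for(content)].append((repo, path, content))

    def search(self, needle: bytes, repo=None):
        hits = []
        shards_queried = 0
        for shard in self.shards:
            # The repo filter is applied inside each shard, so it
            # cannot reduce the number of shards contacted.
            shards_queried += 1
            for r, p, c in shard:
                if (repo is None or r == repo) and needle in c:
                    hits.append((r, p))
        return sorted(hits), shards_queried


cluster = Cluster()
cluster.index("org/app", "main.py", b"print('hello')")
cluster.index("org/app", "util.py", b"def hello(): pass")
cluster.index("org/lib", "hello.py", b"def hello(): pass")
hits, queried = cluster.search(b"hello", repo="org/app")
print(hits, queried)
```

A side effect of hashing content is that byte-identical files map to the same shard, which is convenient for deduplication.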
by boyter on 2/6/23, 7:11 PM
I’d love to see more discussion on how they are dealing with the false positives though. It looks like a positional index is being used to achieve this, but that usually blows out your index size.
Additional information about deduplication would be especially interesting to me as well. It seems to solve this quite well. I usually try a search of jQuery to test this and it does not return multiple copies of different versions of it which is a good indicator that it’s slightly fuzzy.
What I find really interesting about all the code search engines I know of is that each one implemented its own index. Nobody is using off the shelf software for this. I suspect that might be down to no off the shelf software providing a decent enough solution, and none providing a solution that scales. At least none that scales with decent costs.
I did a small comparison of GitHub code search a while ago: https://twitter.com/boyter/status/1480667185475244036?s=61&t... But I should note a lot has improved since then, and it looks like Sourcegraph now also does default AND of terms rather than exact match, so my complaints there are resolved.
Impressive work by GitHub. I am sure some of the people behind it will read this comment, let me say well done to you all. I am very impressed. Also please post more information like this. There is so little out there.
by ZephyrBlu on 2/6/23, 6:12 PM
The current search sucks ass, you can’t find anything.
I was trying to search for something in the WebKit source the other day and I had to use Sourcegraph because the GitHub search gave me zero results.
by colin353 on 2/6/23, 6:43 PM
by simonw on 2/6/23, 6:45 PM
I'd always wondered how they implemented that: it turns out they add extra internal filters to their searches along the lines of "RepoIDs(...) or PublicRepo".
Question for the team: Do you have an additional permission check in the view layer before the results are shown to the end-user? I worry that if I switch a repo from public to private it may take a while for the code search index to catch up to the new permissions.
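To make the question concrete, here is a small sketch of the two layers being discussed. It is purely hypothetical: the `Doc` type, `search`, and `recheck` are invented names, and nothing here is known to match GitHub's code. `search` applies the "RepoIDs(...) or PublicRepo" style filter against the visibility stored in the index, while `recheck` is the extra view-layer check against current visibility that would cover a repo switched from public to private before the index catches up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    repo_id: int
    public: bool  # visibility as recorded at indexing time
    text: str


def search(docs, term, accessible_repo_ids):
    """Index-side filter: repo in RepoIDs(...) or repo is public."""
    allowed = set(accessible_repo_ids)
    return [
        d for d in docs
        if term in d.text and (d.public or d.repo_id in allowed)
    ]


def recheck(results, current_public, accessible_repo_ids):
    """View-layer check against the current (not indexed) visibility.

    current_public maps repo_id -> bool; unknown repos count as private.
    """
    allowed = set(accessible_repo_ids)
    return [
        d for d in results
        if current_public.get(d.repo_id, False) or d.repo_id in allowed
    ]


docs = [
    Doc(1, True, "foo"),       # indexed while public
    Doc(2, False, "foo"),      # private, user has no access
    Doc(3, False, "foo bar"),  # private, user has access
]
stale = search(docs, "foo", accessible_repo_ids=[3])
# Repo 1 has since been made private, but the index still says public.
fresh = recheck(stale, {1: False, 2: False, 3: False}, [3])
print([d.repo_id for d in stale], [d.repo_id for d in fresh])
```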
by saagarjha on 2/6/23, 7:39 PM
by kjuulh on 2/6/23, 6:22 PM
It feels like GitHub code browsing is a step between a full editor with LSP and a static site. I hope they work out the kinks and make it smoother.
by tonymet on 2/6/23, 6:33 PM
It's a great 101-level exercise to write an inverted index implementation. You can do it in an afternoon, and then expand to a leaf/aggregator in follow-up exercises.
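As a starting point for that exercise, a minimal sketch might look like the following. The tokenizer, the `InvertedIndex` class, and the `aggregate` function are all assumptions chosen for brevity: terms are lowercased identifier-like words, queries are an AND of terms, and each "leaf" is just an in-process index over one partition of the documents.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into identifier-like terms."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class InvertedIndex:
    """One leaf: term -> set of document ids for its partition."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        terms = tokenize(query)
        if not terms:
            return []
        # AND semantics: a document must contain every query term.
        matches = set.intersection(
            *(self.postings.get(t, set()) for t in terms)
        )
        return sorted(matches)


def aggregate(leaves, query):
    """Aggregator: fan the query out to every leaf and merge results."""
    merged = set()
    for leaf in leaves:
        merged.update(leaf.search(query))
    return sorted(merged)


leaf_a, leaf_b = InvertedIndex(), InvertedIndex()
leaf_a.add(1, "def parse_query(q): ...")
leaf_a.add(2, "class Query: pass")
leaf_b.add(3, "query planner notes")
print(leaf_a.search("class query"))
print(aggregate([leaf_a, leaf_b], "query"))
```

In a real deployment the leaves would be separate processes and the aggregator would also merge ranking scores, but the shape of the exercise is the same.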
by drcongo on 2/6/23, 7:40 PM
[0] https://github.com/django/django/search?q=DeleteView&type=co...
[1] https://github.com/django/django/blob/main/django/views/gene...
by debdut on 2/6/23, 6:49 PM
by gavinray on 2/6/23, 9:01 PM
It's the number one way I research and understand new libraries/APIs and programming languages.
There's a lot more you can learn from usage in the wild than tutorial posts sometimes.
by Daffodils on 2/6/23, 7:18 PM
by solarkraft on 2/6/23, 6:22 PM
by tuan on 2/6/23, 9:51 PM
by Existenceblinks on 2/6/23, 6:18 PM
by purkka on 2/7/23, 7:03 AM
Most often I end up using code search for figuring out where a piece of code originated, just to find thousands of random projects that have also copied the same code verbatim. Sorting for "relevance" or "latest/oldest indexed" are equally useless.
by mperham on 2/6/23, 8:20 PM
by chatmasta on 2/6/23, 8:42 PM
by robertlagrant on 2/6/23, 7:31 PM
1) I never want to search all repos globally. At worst I want to search all of my org's repos.
2) the search UI is a little clunky, in a way I'd need to be using it again to remember.
Between those two I think there's loads of progress to be made outside of raw search power. Of course it's nice to have that, but that's what I'm really after.
by loginatnine on 2/6/23, 6:13 PM
by Beefin on 2/6/23, 6:16 PM
by _pastel on 2/7/23, 4:00 AM
Is it inverse frequency, so common bigrams get split last? And the goal is to be able to search on a larger gram that covers the more common trigrams as often as possible?
by Waterluvian on 2/6/23, 7:07 PM
One nit I have about current search: I’ll look something up and find I’m getting results for some obtuse commit in some old branch somewhere. I’d like to be able to optionally say “latest commit on branches only please” or “main branch only please.”
Another thing, which might betray that I don’t understand search all that well: language aware searching that knows, for example, that a single or a double quote are syntactically interchangeable. Don’t omit half the results because I used one quote over the other when looking up `interpolation = 'nearest'`
by j1elo on 2/6/23, 11:07 PM
Like searching for "OPTION" and getting "-DOPTION=TRUE" among the results. This is very commonly needed to find all usages of a flag, including places where the flag is being passed to CMake and Meson (at least, those are the tools I know of).
[0]: https://stackoverflow.com/questions/43891605/search-partial-...
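A tiny illustration of the difference between whole-token matching and substring matching, using the flag example above. The tokenizer regex and both function names are assumptions for the sake of the example, not a description of how any particular search engine splits text.

```python
import re


def token_match(query, line):
    """True if query equals a whole word-like token in line."""
    tokens = re.findall(r"[A-Za-z0-9_]+", line)
    return query in tokens


def substring_match(query, line):
    """True if query occurs anywhere in line (grep-like)."""
    return query in line


lines = [
    "cmake -DOPTION=TRUE ..",
    'option(OPTION "Enable the thing" OFF)',
]
for line in lines:
    print(token_match("OPTION", line), substring_match("OPTION", line))
```

In the first line the tokenizer sees "DOPTION" as a single token, so a token-based search for "OPTION" misses it while a substring search finds it.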
by imadethis on 2/6/23, 5:50 PM
by Cian911 on 2/7/23, 2:49 PM
What exactly do they mean by "special repositories" here?
by WoodenChair on 2/7/23, 12:22 AM
by Royaljj on 2/7/23, 12:37 AM
by ZephyrBlu on 2/6/23, 7:10 PM
You can write your own search engine that will perform very well on a surprisingly large amount of data, even doing naive full-text search. A search tool I came across a while back is a great example of something at that scale: https://pagefind.app/.
For anyone who doesn't know anything about search I highly recommend reading this (It's mentioned in the blog post as well): https://swtch.com/~rsc/regexp/regexp4.html.
Algolia also has a series of blog posts describing how their search engine works: https://www.algolia.com/blog/engineering/inside-the-algolia-....
---
It's interesting that GitHub seems to have quite a few shards. Algolia basically has a monolithic architecture with 3 different hosts which replicate data and they embed their search engine in Nginx:
"Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client."
I'm guessing GitHub probably doesn't store repos in a custom binary format like Algolia does though:
"Each index is a binary file in our own format. We put the information in a specific order so that it is very fast to perform queries on it."
"Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure."
https://stackshare.io/posts/how-algolia-built-their-realtime...
100ms p99 seems pretty good, but I'm curious what the p50 is and how much time is spent searching vs ranking. I've seen Dan Luu say that the majority of time should be spent ranking rather than searching and when I've snooped on https://hn.algolia.com I've seen single digit millisecond search times in the responses, which seems to corroborate this.
I'm curious why they chose to optimize ingestion when it only took 36hrs to re-index the entire corpus without optimizations. A 50% speedup is nice, but 36hrs and 18hrs are the same order of magnitude and it sounds like there was a fair amount of engineering effort put into this. An index 1/5 of the size is pretty sweet though, I have to assume that's a bigger win than 50% faster ingestion.
Since they're indexing by language I wonder if they have custom indexing/searching for each language, or if their ngram strategy is generic over all languages. Perhaps their "sparse grams" naturally tokenize differently for every language. Hard to tell when they leave out the juiciest part of the strategy though: "Assume you have some function that given a bigram gives a weight".
Search is so cool. I could talk about it all day.
by hbn on 2/6/23, 7:46 PM
by webmaven on 2/7/23, 12:28 AM
by sidcool on 2/7/23, 4:18 AM
by jd3 on 2/6/23, 6:26 PM
Ever since then, I've exclusively used Sourcegraph.
by latchkey on 2/6/23, 11:55 PM
by user3939382 on 2/6/23, 10:16 PM
by tantalor on 2/6/23, 7:08 PM
by bjd2385 on 2/6/23, 7:08 PM
by duckydude20 on 2/7/23, 8:27 AM
by cozos on 2/6/23, 6:20 PM
by thinking001001 on 2/6/23, 9:42 PM