by zxt_tzx on 3/8/25, 12:23 PM with 51 comments
by kevmo314 on 3/8/25, 7:56 PM
> Supabase is easily the most expensive part of my stack (at $200/month, if we ran it in XL, i.e. the lowest tier with a 4-core CPU)
That could get you a pretty decent VPS and allow you to colocate everything with less complexity. This is exemplified in some of the gotchas, like
> Cloudflare Workers demand an entirely different pattern, even compared to other serverless runtimes like Lambda
If I'm hacking something together, learning an entirely different pattern for some third-party service is the last thing I want to do.
All that being said though, maybe all it would've done is delay the inevitable, given the product gap the author concludes with.
by zxt_tzx on 3/8/25, 2:03 PM
I have also summarized my key lessons here:
1. Default to pgvector, avoid premature optimization.
2. You can probably get away with shorter embeddings if you’re using a Matryoshka embedding model.
3. Filtering with vector search may be harder than you expect (a rough sketch covering points 1–3 follows this list).
4. If you love full stack TypeScript and use AWS, you’ll love SST. One day, I hope to be able to recommend Cloudflare in equally strong terms too.
5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.
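To make points 1–3 concrete, here is a rough sketch (not SemHub's actual code: the issues table shape, the 256-dimension truncation, and the model choice are all illustrative), assuming the pg and openai npm packages:

    import { Client } from "pg";
    import OpenAI from "openai";

    const db = new Client({ connectionString: process.env.DATABASE_URL });
    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
    await db.connect();

    // Point 2: Matryoshka-style models (e.g. text-embedding-3-small) can be
    // truncated at request time; 256 dims keeps the index small and cheap.
    async function embed(text: string): Promise<number[]> {
      const res = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
        dimensions: 256,
      });
      return res.data[0].embedding;
    }

    // Points 1 and 3: a plain pgvector query with a metadata filter. The WHERE
    // clause is applied to the HNSW candidates, so a very selective filter can
    // return fewer rows than the LIMIT asks for.
    async function searchIssues(query: string, repo: string) {
      const vec = await embed(query);
      const { rows } = await db.query(
        `SELECT id, title, embedding <=> $1::vector AS distance
           FROM issues
          WHERE repo = $2 AND state = 'open'
          ORDER BY embedding <=> $1::vector
          LIMIT 20`,
        [JSON.stringify(vec), repo],
      );
      return rows;
    }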
by whakim on 3/9/25, 6:28 AM
I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
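To make those gotchas concrete, here is a sketch of the knobs mentioned above (not whakim's code: the table, index name, and settings values are made up), run through node-postgres:

    import { Client } from "pg";

    const db = new Client({ connectionString: process.env.DATABASE_URL });
    await db.connect();

    // HNSW builds are slow: give this session more memory and let pgvector
    // (0.6+) use parallel workers for graph construction.
    await db.query(`SET maintenance_work_mem = '8GB'`);
    await db.query(`SET max_parallel_maintenance_workers = 7`);

    // Low-cardinality filter (open vs. closed): a partial index keeps the
    // graph small and makes the filter exact instead of post-filtered.
    await db.query(`
      CREATE INDEX IF NOT EXISTS issues_open_embedding_idx
        ON issues USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
        WHERE state = 'open'
    `);

    // Update/delete-heavy workloads leave dead tuples that hurt index
    // performance; vacuum this table more aggressively than the default.
    await db.query(`
      ALTER TABLE issues SET (
        autovacuum_vacuum_scale_factor = 0.02,
        autovacuum_analyze_scale_factor = 0.02
      )
    `);

    await db.end();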
Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
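On the keeping-data-in-sync point, a vendor-agnostic sketch of one common pattern (the upsertToVectorStore helper is a hypothetical stand-in for whatever SDK your vector store provides, and the issues table is the same illustrative one as above):

    import { Client } from "pg";

    const db = new Client({ connectionString: process.env.DATABASE_URL });

    // Hypothetical stand-in for your vector store's SDK; swap in the real client.
    async function upsertToVectorStore(
      rows: { id: string; embedding: number[]; metadata: Record<string, unknown> }[],
    ): Promise<void> {
      // no-op in this sketch; `rows` would be batched to the store here
    }

    // Cursor-based incremental sync: re-push anything that changed since the
    // last run so the filterable metadata next to each vector stays truthful.
    async function syncSince(lastRun: Date): Promise<Date> {
      await db.connect();
      const { rows } = await db.query(
        `SELECT id, embedding, repo, state, updated_at
           FROM issues
          WHERE updated_at > $1
          ORDER BY updated_at
          LIMIT 1000`,
        [lastRun],
      );
      await upsertToVectorStore(
        rows.map((r) => ({
          id: r.id,
          // pgvector columns come back from node-postgres as text like "[0.1,0.2]"
          embedding: JSON.parse(r.embedding),
          metadata: { repo: r.repo, state: r.state },
        })),
      );
      await db.end();
      // Persist this as the cursor for the next run.
      return rows.length ? rows[rows.length - 1].updated_at : lastRun;
    }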
by johnfn on 3/8/25, 7:02 PM
If you don't mind me giving you some unsolicited product feedback: I think SemHub didn't do well because it's unclear what problem it's actually solving. Who actually wants your product? What's the use case? I use GitHub issues all the time, and I can't think of a reason I'd want SemHub. If I need to find a particular issue on, say, TypeScript, I'll just google "github typescript issue [description]" and pull up the correct thing 9 times out of 10. And that's already a pretty small fraction of the time I spend on GitHub.
by nchmy on 3/8/25, 10:20 PM
https://manticoresearch.com/blog/manticoresearch-github-issu...
You can index any GH repo and then search it with vector, keyword, hybrid and more. There's faceting and anything else you could ever want. And it is astoundingly fast - even vector search.
Here's the direct link to the demo https://github.manticoresearch.com/
by VirgilShelton on 3/9/25, 5:40 PM
GL!
by serjester on 3/8/25, 8:25 PM
by scottyeager on 3/9/25, 4:34 AM
I don't quite understand, because searching issues across all of GitHub and also within orgs is already supported. Those searches show both open and closed issues by default.
For searches on a single repo, just removing the "state" filter entirely from the query also shows open and closed issues (see the sketch below).
I do think that semantic search on issues is a cool idea, and the semantic/fuzzy aspect is probably the biggest motivator for the project. It just felt funny to see stuff that GitHub can already do listed at the top of the motivating issues.
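For reference, the same kind of query through GitHub's own search API via Octokit (the repo and keywords are purely illustrative): leaving out the is:open / is:closed qualifier returns both states.

    import { Octokit } from "octokit";

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

    // No state qualifier in the query, so open and closed issues both come back.
    const { data } = await octokit.rest.search.issuesAndPullRequests({
      q: "repo:microsoft/TypeScript is:issue label:bug narrowing",
      per_page: 20,
    });

    for (const issue of data.items) {
      console.log(`${issue.state}  #${issue.number}  ${issue.title}`);
    }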
by brian-armstrong on 3/8/25, 8:28 PM
I know GitHub kind of added this, but their version still falls apart even in common languages like C++. It's not unusual for it to just completely miss cross-references, even in smaller repos. A proper compiler's-eye view of symbolic data would be super useful, and GitHub's halfway attempt can be frustratingly daft about it.
by franky47 on 3/8/25, 8:09 PM
This article might be super helpful, thanks! I don't intend to make a product out of it though, so I can cut a lot of corners, like using a PAT for auth and running everything locally.
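That corner-cutting version stays pretty small; a sketch of the local script (hypothetical repo, assumes the octokit package and a personal access token in GITHUB_PAT):

    import { Octokit } from "octokit";

    const octokit = new Octokit({ auth: process.env.GITHUB_PAT });

    // Pull every issue (open and closed) from one repo for local indexing.
    const issues = await octokit.paginate(octokit.rest.issues.listForRepo, {
      owner: "vercel",   // illustrative repo
      repo: "next.js",
      state: "all",
      per_page: 100,
    });

    // This endpoint also returns pull requests; keep plain issues only.
    const plainIssues = issues.filter((issue) => !issue.pull_request);
    console.log(`fetched ${plainIssues.length} issues`);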
by nosefrog on 3/8/25, 8:17 PM
Yikes, these sorts of errors are so hard to debug. Especially if you don't have a real server to log into to get pcaps.
by gregorvand on 3/9/25, 6:43 AM