from Hacker News

Pg_bm25: Elastic-Quality Full Text Search Inside Postgres

by billwashere on 10/8/23, 9:43 AM with 71 comments

by samokhvalov on 10/8/23, 4:50 PM
I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.
Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.
https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);
This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.
GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.
Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.
by retakeming on 10/8/23, 5:11 PM
Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!
pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.
To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).
by phamilton on 10/8/23, 3:33 PM
With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?
My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?
by hardwaresofton on 10/8/23, 2:47 PM
pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.
Awesome to see so many high quality extensions come out of it.
https://github.com/pgcentralfoundation/pgrx
by wkoszek on 10/8/23, 7:20 PM
Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count() from table where text match is there`?
This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.
ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.
by machty on 10/11/23, 11:18 PM
What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).
by mkleczek on 10/8/23, 2:29 PM
This is really exciting and I hope to try it out at my company ASAP.
by iamdanieljohns on 10/8/23, 7:13 PM
Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?
by ckok on 10/9/23, 8:30 PM
Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.
by antman on 10/8/23, 2:25 PM
An important step, could be a good combination with pg_vector if they are fast enough
by tristan957 on 10/9/23, 1:27 AM
Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?
by rawsh on 10/8/23, 6:56 PM
Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres
by anon373839 on 10/8/23, 10:58 PM
This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.
by est on 10/8/23, 3:28 PM
looks like a cool project https://github.com/paradedb/paradedb
by aiunboxed on 10/8/23, 2:56 PM
I wonder how do legacy search players like elastic / solr compete against the new age startups combining semantic and regular search ?
by canadiantim on 10/8/23, 3:04 PM
ParadeDB and the work they’re doing with this extension is incredibly exciting. Love to see it.
by eclectic29 on 10/8/23, 9:56 PM
Is BM25 still used by "modern" search engines? I wasn't aware.
by mugivarra69 on 10/8/23, 4:37 PM
is this better than lucene
by stopman on 10/8/23, 4:51 PM
Excited to give this a try.