by Sirupsen on 7/9/24, 2:48 PM with 64 comments
by softwaredoug on 7/9/24, 8:01 PM
My ideal is that turbopuffer ultimately works like a Polars dataframe, where all my ranking is expressed through my search API. I could lazily express some lexical or embedding similarity, boost by various attributes like recency or popularity to get a first pass (again, all just dataframe math), then compute features for a reranking model I run on my side - also dataframe math - and it "just works": it runs all of this as some kind of query execution DAG and stays out of my way.
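A hypothetical sketch of what that shape of query could look like, written with plain Polars over precomputed per-document scores; the column names, weights, and candidate data here are made up for illustration and are not any actual turbopuffer API:

```python
import polars as pl

# Hypothetical candidate set: per-document scores already produced by a
# first-pass retrieval (lexical + embedding similarity) plus metadata.
candidates = pl.LazyFrame({
    "doc_id":     [1, 2, 3, 4],
    "bm25":       [12.3, 9.1, 15.0, 7.4],    # lexical score
    "cosine_sim": [0.82, 0.91, 0.64, 0.88],  # embedding similarity
    "age_days":   [3, 40, 1, 365],           # recency signal
    "popularity": [0.7, 0.2, 0.9, 0.5],      # e.g. normalized click rate
})

# First-pass ranking: all "just dataframe math", expressed lazily so the
# engine is free to plan it however it likes (the imagined query DAG).
first_pass = (
    candidates
    .with_columns(
        (
            0.4 * pl.col("bm25") / pl.col("bm25").max()  # illustrative weights
            + 0.4 * pl.col("cosine_sim")
            + 0.1 * (1.0 / (1.0 + pl.col("age_days")))   # recency boost
            + 0.1 * pl.col("popularity")
        ).alias("score")
    )
    .sort("score", descending=True)
    .head(100)  # candidates to hand to a reranking model on my side
)

print(first_pass.collect())
```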
by cmcollier on 7/9/24, 8:18 PM
by nh2 on 7/10/24, 3:58 AM
It doesn't have to be that way.
At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.
Sometimes you can reach the goal faster with less complexity by removing the part with the 20x markup.
by omneity on 7/9/24, 10:40 PM
This is irking me. pg_vector has existed since before that, doesn't require in-memory storage, and can definitely handle vector search over 100M+ documents with decent performance. Did they have a particular requirement somewhere?
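For reference, a minimal sketch of that kind of setup with pgvector and psycopg2; the connection string, table, and column names are placeholders, and the HNSW index assumes pgvector >= 0.5:

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=search user=search")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id        bigserial PRIMARY KEY,
        body      text,
        embedding vector(768)
    )
""")
# HNSW index (pgvector >= 0.5): the graph lives on disk / in the page cache,
# so the whole dataset does not need to be pinned in RAM.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()

# Nearest neighbours by cosine distance for a query embedding.
query_vec = [0.01] * 768  # stand-in for a real query embedding
vec_literal = "[" + ",".join(map(str, query_vec)) + "]"
cur.execute(
    "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
    (vec_literal,),
)
print(cur.fetchall())
```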
by bigbones on 7/9/24, 8:35 PM
by eknkc on 7/9/24, 9:29 PM
DuckDB can open Parquet files over HTTP and query them, but I found it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.
I mostly need key/value lookups and could potentially store each key as a separate object in S3, but for a couple hundred million objects... it would be a lot more manageable to have a single file and maybe a cacheable index.
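A rough illustration of the pattern (the URL and schema are hypothetical): DuckDB's httpfs extension reads remote Parquet via HTTP range requests, and writing the file sorted by key lets row-group min/max statistics prune most of those reads for a point lookup:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical single Parquet file holding key/value pairs, sorted by key at
# write time so each row group covers a narrow key range.
# A point lookup fetches the footer, then only the row groups whose min/max
# key statistics can contain the requested key; if the file isn't sorted by
# key, far more row groups (and HTTP range requests) get touched.
rows = con.execute(
    "SELECT value FROM read_parquet('https://example.com/kv.parquet') "
    "WHERE key = ?",
    ["user:12345"],
).fetchall()
print(rows)
```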
by solatic on 7/10/24, 4:19 PM
Having witnessed some very large Elasticsearch production deployments, being able to throw everything into S3 would be incredible. The applicability here isn't only for vector search.
by zX41ZdbW on 7/10/24, 6:45 AM
Warehouse (BigQuery, Snowflake, Clickhouse): read latency ≥1s, write latency: minutes
For ClickHouse, it should be: read latency <= 100ms, write latency <= 1s. Logging, real-time analytics, and RAG are also suitable for ClickHouse.
by drodgers on 7/9/24, 9:22 PM
by cdchn on 7/10/24, 4:32 AM
by arnorhs on 7/10/24, 12:31 PM
Seems like a topic I need to delve into a bit more.
by endisneigh on 7/10/24, 1:33 AM
Am I alone in this?
In any case this seems like a pretty interesting approach. Reminds me of Warpstream, which does something similar with S3 to replace Kafka.
by CyberDildonics on 7/9/24, 9:08 PM
by yawnxyz on 7/10/24, 4:21 AM
by vidar on 7/9/24, 8:45 PM
by yamumsahoe on 7/9/24, 11:29 PM
by hipadev23 on 7/10/24, 1:19 AM