from Hacker News

Undergraduate shows that searches within hash tables can be much faster

by Jhsto on 2/10/25, 5:05 PM with 584 comments

by brink on 2/10/25, 9:13 PM
Krapivin made this breakthrough by being unaware of Yao's conjecture.
The developer of Balatro made an award winning deck builder game by not being aware of existing deck builders.
I'm beginning to think that the best way to approach a problem is by either not being aware of or disregarding most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected, that we rarely see such novelty with their tendency to "fall in the rut of thought" of those that came before. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.
by abetusk on 2/11/25, 6:51 AM
Ok, big shout out to monort [0] for the link to the video [1].
This is just a quick overview from a single viewing of the video, but it's called "funnel hashing". The idea is to split into exponentially smaller sub arrays, so the first chunk is n/m, the second is n/(m^2), etc. until you get down to a single element. Call them A0, A1, etc., so |A0| = n/m, |A1| = n/(m^2) etc., k levels in total.
Try inserting into A0 c times. If it fails, try inserting into A1 c times. If it fails, go down the "funnel" until you find a free slot.
Call \delta the fraction of slots that are empty (I'm unclear if this is a parameter that gets set at hash table creation or one that's dynamically updated). Setting c = log(1/d) and k = log(1/d) to get worst case complexity O(log^2(1/d)).
This circumvents Yao's result by not being greedy. Yao's result holds true for greedy insertion and search policies and the above is non-greedy, as it's cascading down the funnels.
There are probably many little hairy details to work out but that's the idea, as far as I've been able to understand it. People should let me know if I'm way off base.
This very much reminds me of the "Distinct Elements in Streams" idea by Chakraborty, Vinodchandran and Meel[2].
[0] https://news.ycombinator.com/item?id=43007860
[1] https://www.youtube.com/watch?v=ArQNyOU1hyE
[2] https://arxiv.org/pdf/2301.10191
by monort on 2/11/25, 1:29 AM
Talk by the inventor: https://www.youtube.com/watch?v=ArQNyOU1hyE
by orlp on 2/10/25, 8:37 PM
Skimming the paper [1], the key difference they used is that their hash table insertion algorithm will probe further than the first empty slot, instead of greedily filling the first empty slot it finds. They combine this with a clever probing sequence which provably finds empty slots efficiently, even if the table is very full.
This means insertions when the hash table is less full are slower, but you avoid the worst-case scenario where you're probing for the last (few) remaining open slot(s) without any idea as to where they are.
[1]: https://arxiv.org/pdf/2501.02305
---
An interesting theoretical result but I would expect the current 'trick' of simply allocating a larger table than necessary to be the superior solution in practice. For example, Rust's hashbrown intentionally leaves 1/8th (12.5%) of the table empty, which does cost a bit more memory but makes insertions/lookups very fast with high probability.
by trebligdivad on 2/10/25, 10:09 PM
Anyone got a simple implementation of 'Tiny pointers'? My mind prefers code/pseudo-code first rather than the proof.
by quantum2022 on 2/10/25, 10:00 PM
This is neat! I always wondered if there would be a way to 'containerize' tables like this. IE a regular table is like a bulk carrier ship, with everything stuffed into it. If you could better organize it like a container ship, you could carry much more stuff more efficiently (and offload it faster too!)
by joe_the_user on 2/10/25, 8:41 PM
The theoretical properties of hash table always seemed so impressive to me that they bordered on magic (and this just extends them). What seemed crazy was how they could be so much better than trees, which to me were intuitively the most efficient way to store data.
What I realized is that the theory of hash tables involves a fixed-sized collection of objects. For this fixed collection, you create a hash-function and used that like a vector-index and store the collection in a (pre-allocated) vector. This gives a (fuzzy-lens'd) recipe for O(1) time insert, deletion and look-up. (The various tree structures, in contrast, don't assume a particular size).
The two problems are you have to decide size beforehand and if your vector gets close to full, you insert etc processes might bog-down. So scanning the article, it seems this is a solution to the bogging down part - it allows quick insertion to a nearly-full table. It seems interesting and clever but actually not a great practical advance. In practice, rather than worrying a clever way to fill the table, I'd assume you just increase your assumed size.
Edit: I'm posting partly to test my understanding, so feel to correct me if I'm not getting something.
by default-kramer on 2/10/25, 8:47 PM
> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)2 — far faster than x.
> The team’s results may not lead to any immediate applications
I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?
by dooglius on 2/10/25, 9:29 PM
It looks like the result only matters in the case where the hash table is close to full. But couldn't one just deal with this case by making the table size 10% bigger? (Or, if it is resizeable, resizing earlier)
by throwme_123 on 2/11/25, 12:17 AM
Is someone aware of a GitHub repo with an implementation of this?
by matsemann on 2/11/25, 5:18 PM
The intro picture about pointers in a drawer immediately reminded me of a talk I saw at FUN with Algorithms 2018 called Mind the Gap that gave me an aha moment about leaving space in data structures. Cool then to try to locate it, and see that it was by the same professor in the article, Martín Farach-Colton.
Not sure if it's viewable somewhere. But the conference itself was so fun. https://sites.google.com/view/fun2018/home
I'm not an academic and got my company to sponsor a trip to this Italian island to relax on the beach and watch fun talks, heh.
by ThinkBeat on 2/11/25, 1:52 AM
Do we have some nice implementations yet? I do better reading code than math.
by _1tan on 2/11/25, 6:19 AM
Neat, started on some implementation: https://kraftwerk.social/innovation-in-hash-tables/
by cb321 on 2/11/25, 1:09 PM
For a different, perhaps more practical take on small pointers in hash tables, you might find this interesting: https://probablydance.com/2018/05/28/a-new-fast-hash-table-i... with contemporaneous discussion at https://news.ycombinator.com/item?id=17176713
by sternma on 2/11/25, 8:25 AM
For anyone looking for a PoC implementation, here's python:
https://github.com/sternma/optopenhash
by foota on 2/11/25, 5:19 AM
I guess the most we could hope for here is that this leads to some other discovery down the road, either in hashtables or maybe one of the similar structures like bloom filters?
by nexawave-ai on 2/11/25, 11:31 AM
I would like to see this being applied practically. Is there a video demonstrating this or is it still too soon? Is the algorithm secret sauce or will it be open sourced?
by elcritch on 2/11/25, 8:01 AM
Anyone else think this could be used with distributed hash tables to dramatically speed up searching or building them? Maybe more exoticly to LLMs and lookup tables. A clever algorithm like this should be applicable in a lot of more specialized data structures or applications.
It's likely a DHT would greatly benefit from this sort of algorithmic reduction in time and be less susceptible to constant factor overheads (if there are any).
by froh on 2/11/25, 7:07 AM
(2021) for the paper itself
https://arxiv.org/abs/2111.12800
by Canigou on 2/12/25, 4:44 PM
I unfortunately did not study well enough to understand the paper.
Can someone explain to me how this isn't some kind of Dewey Decimal Classification (https://en.wikipedia.org/wiki/Dewey_Decimal_Classification) ?
by duskwuff on 2/10/25, 8:01 PM
Paper: https://arxiv.org/pdf/2111.12800
by shaganer on 2/13/25, 3:00 PM
Read this within my half hour break and man, wow what a story. I'm not a software guy, I'm a sys and net guy. Despite not caring or knowing about hash tables, that articles a great read! Thanks for sharing!
by varjag on 2/10/25, 7:46 PM
tl;dr sublinear worst case query and insertion in hash tables.
by seinecle on 2/11/25, 6:22 AM
Anyone competent enough here to venture a guess on the speed gain to expect under various scenarios?
by isaacfrond on 2/11/25, 9:26 AM
The paper is here: https://arxiv.org/pdf/2111.12800
Curiously, Andrew Krapivin, the genious undergrad in the article, is not one of the authors.
by reportgunner on 2/11/25, 12:29 PM
Sad that the article doesn't say what his approach actually is.
by bnly on 2/12/25, 6:11 PM
Step one: Be a genius
Step two: Try to solve hard problems
Step three: Avoid reading too much of other people's work in the area
Step four: (Maybe) Invent a brilliant new solution
But really, really don't skip step one.
by jjallen on 2/10/25, 8:22 PM
Is it just me or did the article not go in to how the improvement works, just the speed of it?
by lupire on 2/10/25, 11:50 PM
The older a conjecture is, the more likely it is false.
That's why the conjecture resists proof -- there is an counterexample that people aren't seeing.
by DeathArrow on 2/11/25, 8:45 AM
And we are taught to not try reinventing the wheel!
by pizza on 2/11/25, 8:02 PM
Just realized that the Mixture of Million Experts paper from last year is similar in some respects to this tiny pointers idea
by aqueueaqueue on 2/11/25, 2:09 AM
How full is your typical production hashtable?
by hoseja on 2/11/25, 8:15 AM
Is this just theoretically better O(n) or is there an actually faster implementation somewhere?
by victor106 on 2/11/25, 11:52 AM
> The team’s results may not lead to any immediate applications
Why not?
by qntty on 2/10/25, 8:12 PM
A cool result, but it seems like it should be called a computer science conjecture
by EternalFury on 2/11/25, 1:38 AM
What’s the time and space complexity of the new approach?
by amazingamazing on 2/10/25, 8:36 PM
This is a good test because it’s recent. Let’s see if deep research can come up with this result without just copying this.
Edit: gpt4, Gemini 2 and Claude had no luck. Human driven computer science is still safe.
by hemant1041 on 2/11/25, 1:28 PM
Interesting read!
by nickhodge on 2/10/25, 10:39 PM
I bet this guy would still fail a first round FAANG developer interview requiring a Hash Table solution to move on in the process.
"Yeah, sorry. You didn't use the right Hash Table"
by ziofill on 2/11/25, 1:11 AM
"it is well known that a vital ingredient of success is not knowing that what you are attempting can’t be done." — Terry Pratchett (equal rites)
by percentcer on 2/10/25, 11:34 PM
"arrowlike entities"
by MR4D on 2/10/25, 8:44 PM
Reading through this article is like reading a description of the Monty-Hall problem. [0]
It's as through the conclusion seems to defy common sense, yet is provable. [1]
[0] - https://priceonomics.com/the-time-everyone-corrected-the-wor...
[1] - 2nd to the last paragraph: "The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves."
by jheriko on 2/11/25, 5:33 AM
i feel this article is missing some detail or incorrect in reporting the actual development here. either that or i am missing something myself...
hash tables are constant time on average for all insertion, lookup and deletion operations, and in some special cases, which i've seen used in practice very, very often, they have very small constant run-time just like a fixed-size array (exactly equivalent in-fact).
this came up in an interview question i had in 2009 where i got judged poorly for deriding the structure as "not something i've often needed", and i've seen it in much older code.
i'm guessing maybe there are constraints at play here, like having to support unbounded growth, and some generic use case that i've not encountered in the wild...?
by ascorbic on 2/10/25, 8:45 PM
And they wouldn't make him first named author on the paper
by ryao on 2/10/25, 9:22 PM
<deleted>
by travisgriggs on 2/10/25, 11:24 PM
This is cool enough. But I find the "celebrification" style of the piece a bit off putting. Did I really need to see multiple posed shots of this young man reposing in various university settings? It's like we need our own version of La La Land to glorify the survivors of computer success to motivate more to participate.
by pmags on 2/11/25, 12:39 AM
Nice result!
<rhetorical> Hmm....I wonder how such research gets funded?... </rhetorical>
by jimnotgym on 2/10/25, 10:28 PM
Now we have faster data structures we can fill that extra time by writing less efficient code, and loading more pointless libraries. This is the march of computer science.
by ChrisMarshallNY on 2/10/25, 10:40 PM
As the villain in Scooby Doo always said:
"And I would have gotten away with it, if it hadn't been for those meddling kids!"
by zombiwoof on 2/10/25, 10:16 PM
Take that AI :)
by sam0x17 on 2/10/25, 7:52 PM
This is huge, when can we get a rust implementation?
by bruce343434 on 2/11/25, 11:26 AM
Ok so what's the algorithm? Ass article
by kittikitti on 2/11/25, 1:11 AM
I read through this and I'm not sure if people have heard of dictionary trees for hash tables. Of course, quantamagazine.org has been known to sensationalize these types of things.