by subset on 12/22/22, 12:28 PM with 316 comments
by dagss on 12/22/22, 3:18 PM
It's quite another when experienced seniors ban the use of SQL features because it's not "modern" or there is an architectural principle to ban "business logic" in SQL.
In our team we use SQL quite heavily: Process millions of input events, sum them together, produce some output events, repeat -- perfect cases for pushing compute to where the data is, instead of writing a loop in a backend that fetches events and updates projections.
Almost every time we interact with other programmers or architects it's an uphill battle to explain this -- "why can't you just put your millions of events into a service bus and write some backend to react to them to update your aggregate". Yes we CAN do that, but why do that when it's 15 lines of SQL and 5 seconds of compute -- instead of a new microservice or whatever and some minutes of compute.
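A minimal sketch of the pattern being described -- summing piles of input events into an aggregate with one set-based statement instead of a fetch-and-loop backend. The schema and numbers are made up for illustration, with SQLite standing in for the real database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (account_id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 10), (1, 5), (2, 7), (2, 3), (2, 1)])

# One set-based statement does the summing where the data lives,
# instead of a backend loop that fetches events and updates projections.
rows = con.execute("""
    SELECT account_id, SUM(amount) AS balance
    FROM events
    GROUP BY account_id
    ORDER BY account_id
""").fetchall()
# rows == [(1, 15), (2, 11)]
```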
People bend over backwards and basically re-implement what the database does for you in their service mesh.
And with events and business logic in SQL we can do simulations, debugging, inspect state at every point with very low effort and without relying on getting logging right in our services (because you know -- doing JOIN in SQL is not modern, but pushing the data to your service logs and joining those to do some debugging is just fine...)
I think a lot of blame is with the database vendors. They only targeted some domains and not others, so writing SQL is something of an acquired taste. I wish there was a modern language that compiled to SQL (like PRQL, but with data mutation).
by Jupe on 12/22/22, 1:16 PM
With that said, the JOIN is a very powerful concept which, unfortunately, has been given a terrible reputation by the NoSQL community. Moving such logic out of the database and into the DB's client is just a waste of I/O and computing bandwidth.
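The waste is easy to see side by side. A hedged sketch (hypothetical users/orders schema, SQLite for illustration) comparing the client-side loop against a single JOIN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 50), (11, 1, 25), (12, 2, 40);
""")

# Anti-pattern: one round trip per user, joining client-side.
slow = []
for uid, name in con.execute("SELECT id, name FROM users"):
    for (total,) in con.execute(
            "SELECT total FROM orders WHERE user_id = ?", (uid,)):
        slow.append((name, total))

# The JOIN produces the same rows in a single query, inside the database.
fast = con.execute("""
    SELECT u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchall()

assert sorted(slow) == sorted(fast)
```

With N users the loop version issues N+1 queries; the JOIN issues one.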
SQL has been the ONLY technology/language that has stuck with me for > 25 years. The fact that it is (apparently) not being taught by institutions of higher learning is just a shame.
by phamilton on 12/22/22, 2:25 PM
While the original example of not understanding JOIN might just be a lack of general knowledge, the later steps are great examples of this, especially if someone else comes along and is told to fix the error.
Making something execute slow code in parallel is pretty easy to do generically. It doesn't require understanding much about the slow code. It's fairly low risk, you probably won't have to tweak tests, there won't be additional side effects. The major risks will be around error handling and it's easy to turn a blind eye to partial success/failure and leave that as a problem for a future team. You can confidently build the parallel for loop, call the task done and move on.
Striving for a deeper understanding requires a lot more effort and a lot more risk. Re-writing the slow code is a lot more risk. All side effects must be accounted for. Tests might have to be re-written. The new implementation might be slower. The new index might confuse the query planner and make unrelated queries slower somehow. It's not just a matter of investing time, it's investing energy/focus and taking on risk. But the result will have comparatively fewer failure modes, it'll be cheaper to operate and less likely to have security implications.
I've been in both spots and while I wish I could say we always went with the deeper understanding that wouldn't be an honest statement. But the framing has been really helpful, especially as I work with other execs in the company to prioritize our limited resources.
by yyyk on 12/22/22, 1:25 PM
It's the "all signup errors warrant paging the on-call, even at 4am" bureaucratic decision, followed by being unable to apply any fix quickly. No surprise the author did not stay.
by rubyist5eva on 12/22/22, 2:07 PM
The NoSQL people have really done a lot of brain-damage to this industry.
It's so pervasive that I've started using this kind of question in our technical interviews; doing a double round-trip ends the interview for anyone higher than a junior.
by acdha on 12/22/22, 5:52 PM
I knew something was off performance-wise since the entire product catalog was only on the order of tens of thousands of records. As soon as I looked at the source code, the mystery was explained: three allegedly experienced developers had worked on it, but none of them knew about SQL WHERE clauses! Instead, they were doing nested for loops to repeatedly retrieve every row of every table and doing the equality checks in VBScript. Finishing the rest of the project backlog took me a couple of days and the customer was quite happy that the slowest pages were now measured in hundreds of milliseconds rather than tens of minutes.
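The fix being described is just pushing the filter into the database. An illustrative sketch (invented product schema, SQLite in place of the real catalog) of the two approaches:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (sku TEXT, category TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [("A1", "tools"), ("B2", "toys"), ("C3", "tools")])

# The contractors' approach: retrieve every row, compare in application code.
in_app = [sku for sku, cat in con.execute("SELECT sku, category FROM products")
          if cat == "tools"]

# Letting WHERE do the filtering ships only the matching rows,
# and can use an index instead of a full scan per lookup.
in_db = [sku for (sku,) in con.execute(
    "SELECT sku FROM products WHERE category = ?", ("tools",))]

assert sorted(in_app) == sorted(in_db)
```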
I was proud of how quickly we were able to turn that project around but the PM & I were discussing how even our rush rate wasn't enough to get us anywhere close to the amount of money the previous contractors had charged.
by funstuff007 on 12/22/22, 1:54 PM
Upvoted just because of the chuckle this gave me.
by btown on 12/22/22, 1:55 PM
How does one JOIN across not just tables but opaque services, in the general case? Or does every team doing microservices silently expect that one day a data team will start querying for a massive number of records-by-ID from every service, and the veterans in each team plan for this load pattern accordingly?
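In practice the answer tends to be a hand-rolled hash join in application code: collect the foreign keys, fetch them from the other service in bounded batches, then stitch the rows together. A hypothetical sketch (all names and the stubbed service call are invented for illustration):

```python
def fetch_users_by_ids(ids):
    # Stand-in for a batched records-by-ID call to another service.
    fake_user_service = {1: "ada", 2: "bob", 3: "eve"}
    return {i: fake_user_service[i] for i in ids if i in fake_user_service}

def service_join(orders, batch_size=100):
    ids = sorted({o["user_id"] for o in orders})
    users = {}
    for i in range(0, len(ids), batch_size):   # chunk the ID lookups
        users.update(fetch_users_by_ids(ids[i:i + batch_size]))
    # Hash join in application code -- what a database JOIN does for free.
    return [(users.get(o["user_id"]), o["total"]) for o in orders]

orders = [{"user_id": 1, "total": 50}, {"user_id": 2, "total": 40}]
result = service_join(orders, batch_size=1)
# result == [("ada", 50), ("bob", 40)]
```

Which is exactly the load pattern the comment describes: every service eventually grows a batch-by-ID endpoint, whether or not its team planned for it.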
by Ayesh on 12/22/22, 2:01 PM
I find SQL, Regular Expressions, DNS, Client-side caching, CORS, TLS, and a few other things to be a MUST when hiring people, because most of the over-engineered crap can be avoided with a little bit of expertise with these. I spend most of my semi-leisure time with some good Regex books and golfing too.
Modern databases are amazing. Every few months, I take pleasure in (and don't shy away from) refactoring some complex and frequent queries into SQL views, carefully moving data logic (but not business logic) into stored procedures, and replacing certain batch scripts with one-off queries.
by icedchai on 12/22/22, 6:09 PM
by pier25 on 12/22/22, 3:33 PM
I'm certain Mongo only became popular because of this even though for many years it was crap.
That said I do think we need a better SQL. It's still not there but EdgeDB looks very promising.
by Thaxll on 12/22/22, 1:37 PM
by nightpool on 12/22/22, 3:29 PM
by darepublic on 12/22/22, 3:18 PM
by Nihilartikel on 12/22/22, 9:34 PM
DuckDB and Apache Spark expose nice APIs that almost completely remove the need to faff around with textual SQL strings. Each projection returns a view that can be treated like another table, so composition and reuse are simple. It would be nice if such a thing were more standard and available in the other DBMSes I have to work with.
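The closest standard-SQL analogue to that composition style is stacking views: each named projection behaves like a table for the next query. A small sketch (invented schema, SQLite for illustration, not the DuckDB/Spark APIs themselves):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);

    -- Each view is a reusable projection the next query treats as a table.
    CREATE VIEW regional_totals AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    CREATE VIEW big_regions AS
        SELECT region FROM regional_totals WHERE total > 15;
""")
rows = con.execute("SELECT region FROM big_regions").fetchall()
# rows == [("east",)]
```

The difference is that the relational APIs give you this composition as ordinary language objects, without generating view DDL as text.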
I feel like, in the continuum of abstraction, SQL is like OpenGL 3: high level and a bit inflexible. Taking the analogy further, an ORM would be like the game engine on top of OpenGL. What doesn't exist, as far as I know, is the Vulkan equivalent: a low-level API that exposes the relational algebra and exactly how to execute it. There are cases where I would have saved a lot of effort if I could just write the damned physical plan for a query execution myself, rather than rearranging table join orders and sending hints that the query optimizer is just going to passive-aggressively ignore anyway.
by bayesian_horse on 12/22/22, 3:03 PM
by LudwigNagasena on 12/22/22, 2:16 PM
by jeffreygoesto on 12/22/22, 2:43 PM
Haha, so true. We triggered a static code analyzer error "Cyclomatic Complexity bigger than 1.000.000.000!". The vendor was very interested in that code snippet (generated classifier code) and we shared a good laugh.
by Ensorceled on 12/22/22, 3:00 PM
by johnthuss on 12/22/22, 4:08 PM
This is good advice. Share your code early and often so you can get feedback before you're fully committed to one approach.
by data-ottawa on 12/22/22, 2:29 PM
On a recent project I needed to process a couple of years of data by a hard deadline of Monday, and it was Friday. Our DB had a query timeout and a memory limit that blocked running the full analysis without building new data models, which would take days to ship and build. The deadline couldn't be moved, so hacks were needed.
The solution: write some Python code to generate one query per week of data going back two years (over 100 queries), save the results to individual scratch tables, and then use a second query to union all the results together in our BI tool.
Of course the first time I ran it serially it was too slow, so I parallelized it. That was too many queries so I added a limit. Then one query failure broke the whole thing so I added retries… by the end of the day it looked exactly like this article.
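The end state described above can be sketched in a few lines: per-week work items, a bounded thread pool for the concurrency limit, and a retry wrapper. Everything here (names, the stubbed query runner, the simulated flakiness) is illustrative, not the actual script:

```python
from concurrent.futures import ThreadPoolExecutor

attempts_log = {}

def run_week_query(week):
    # Stand-in for executing one generated per-week SQL query;
    # simulate a transient failure on the first try for some weeks.
    n = attempts_log[week] = attempts_log.get(week, 0) + 1
    if week % 10 == 0 and n == 1:
        raise RuntimeError(f"query for week {week} timed out")
    return week, f"scratch_table_{week}"

def run_with_retries(week, attempts=3):
    for attempt in range(attempts):
        try:
            return run_week_query(week)
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt

weeks = range(104)  # roughly two years of weekly chunks
with ThreadPoolExecutor(max_workers=8) as pool:  # the concurrency limit
    results = dict(pool.map(run_with_retries, weeks))
```

A second query can then UNION the scratch tables together, as in the comment.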
It worked though! I got all the data we needed processed for Monday, I presented it to our execs and our project was approved. We only needed to manually run that script once more before I built the real solution and deleted the script.
by im3w1l on 12/22/22, 4:53 PM
In this case, doing the join manually isn't a huge deal, chunking isn't a huge deal, parallel requests isn't a huge deal. But "concurrent limit reached" is the point in this story where Bob should have put on the thinking cap and reasoned that "this shouldn't be hard, other people do things like this with bigger datasets all the time, I wonder how". Before that point it's literally just a matter of changing a couple lines to solve the issue. So what? After that point however, it's starting to affect the overall design around it in harmful ways, and turning the issue into a bigger one.
by outsidetheparty on 12/22/22, 2:28 PM
by gsvclass on 12/22/22, 10:49 PM
Shameless plug, but this was my motivation behind building GraphJin, a GraphQL-to-SQL compiler, and it's my single go-to force multiplier for most projects. https://github.com/dosco/graphjin
by Arwill on 12/22/22, 6:09 PM
This applies to graphics programming very well: nobody questions that you should use DX, OpenGL, or Vulkan instead of writing your own pixel rasterizer, for example.
The big recognition is that when doing business apps, SQL database functionality is the underlying API, and you should prefer using that.
by ivanhoe on 12/22/22, 6:39 PM
by LAC-Tech on 12/22/22, 9:30 PM
The technical capabilities are all there on the team, from the description. What was probably missing is someone both technical and assertive, who could politely say to the deadline setters, "This is fucking stupid and it's not going to work."
by tmp60beb0ed on 12/22/22, 7:42 PM
Why junior SWEs and not all SWEs?
by jmull on 12/22/22, 4:58 PM
The article explains how the original bad code gets checked in which seems plausible enough.
But that doesn't explain why the first fix wasn't just to start using a JOIN. Or the second fix.
I guess it's a made up story, to make a point? Anyway, I found the plot holes distracting.
by tantaman on 12/22/22, 7:19 PM
by phendrenad2 on 12/22/22, 6:01 PM
by brightball on 12/22/22, 6:51 PM
How much does this problem grow and spread the longer it goes unfixed?
by tomerbd on 12/22/22, 4:47 PM
by jmartrican on 12/22/22, 8:02 PM
by bjornsing on 12/22/22, 2:07 PM
by mgaunard on 12/22/22, 9:19 PM