from Hacker News

Wrangling 2000 Git Repos at Reddit

by jdorfman on 3/18/24, 8:00 PM with 62 comments

by conjecTech on 3/18/24, 9:08 PM
I worked at Reddit in the not so distant past. The entire recommendation system lived in 3 repos. I'm pretty sure there are just 2000 repos because the onboarding tutorials have you create one, and that number is probably around the number of engineers that have worked there. I'd guess 100-200 have some production component.
by airstrike on 3/18/24, 8:49 PM
You know, big sweeping refactors deservedly get a bad rep, but as with everything else in life, there are always exceptions
At some point, I don't know, maybe when you cross the 100 repos mark, you've gotta ask yourself "maybe we could try a different approach?"
It's not like reddit has been known for its wonderful stability over the years
I'm sure the scale here is completely unlike anything I've ever worked on, but how hard can it be to write a sane implementation of a message board?
I'd be curious how much of this problem is caused by the junk that is "new reddit". I've been there since 2007... The day old.reddit.com goes away is the day I abandon it for good
by mebazaa on 3/18/24, 8:49 PM
Yes, the Reddit dev team might have spawned a 2000+ repo mess, but they also host it under the snooguts.net domain name, which is objectively adorable, so all is forgiven.
by tayo42 on 3/18/24, 8:34 PM
I worked with a monorepo and multiple teams dedicated to the dev experience, I had my complaints but I was spoiled in hindsight.
I know they did a lot with git to make it manageable, hopefully whatever they did makes it to the open source world eventually so we can all avoid these crazy thousand repo worlds.
by heads on 3/18/24, 10:31 PM
We used to have an ecosystem like this. In our case it reflected an entrenched set of divisions between warring teams. In some ways it may have then enhanced those positions and we still bear a few of the scars today.
A lot of the old guard have left the company though and our main product moved from four repos to just one. The threat from the legal team to have enforced OWNERS files — essentially replicating the divisive politics of the old repos but in the monorepo — thankfully withered on the vine. We still audit what goes into each release but it’s no longer part of any active permissions thing. We trust our developers but verify, for legal reasons, that nothing went wrong.
You either want one engineering team to act in unison behind your company’s mission, or you want to live a divisive narrative that you are actually multiple teams “working” together with none of the advantages of living under one roof and all the disadvantages of hard repository boundaries crisscrossing your intellectual property.
So many factors threaten to curdle your team dynamic: multiple offices, multiple floors, work from home hermits, bad management, etc. It’s simply org entropy and it takes much effort to keep the weeds out of the garden. Multiple repositories is one less bullet you can keep out of your feet while fighting all the other battles that threaten to turn your team from 1990s Sun Microsystems into 2010 Sun Microsystems.
by IshKebab on 3/18/24, 9:18 PM
That's crazy. Monorepo definitely makes more sense.
Though I always wonder - how do Google, Microsoft, Facebook etc. deal with developing code near the root of their dependency tree? Utility libraries for example. Technically you're going to have every change you make there building all the code and running all the tests, which is obviously unworkable. What do they do?
by miduil on 3/18/24, 8:25 PM
I can't believe how it is working in such a big structure with just GitHub alone. GitLab with groups/subgroups and also integrated Sourcegraph seems much more practical at this scale.
by nolist_policy on 3/18/24, 9:40 PM
I don't get the hating here. With the right tooling it doesn't matter if it's 10, 100 or 2000 repos. And it buys you some nice things like per-repo permission settings.
by ivanjermakov on 3/18/24, 9:12 PM
Out of those 2k repos, how many of them are actually used in production?
by ydnaclementine on 3/18/24, 8:48 PM
Sounds made up to explain why the R&D costs in their IPO docs were 450 million or whatever
by sethammons on 3/18/24, 9:07 PM
To those who are swinging towards monorepos, I don't think that is a good solution. The reason is that developers simply cannot be trusted to "do the right thing" on data and module boundaries. Someone comes in new to the project and does something they don't know they shouldn't. It is the honor system backed by weak linters and tooling.
In our monorepo, everyone passes around Django ORM objects and boundaries are practically non-existent. N+1 queries abound. Tests are full of patching and mocking and are _slow_. Our build takes over an hour to run tests. Someone on team A can and absolutely will mess up what someone on team B is doing. We are now having to spend quarter upon quarter defining and enforcing domain boundaries within the Python codebase. It is all bolted-on checks. Tests are getting worse and people are actively trying to figure out ways around the testing system because it sucks.
Compare to my last gig. We had several hundred production repos. Each repo starts from a template with its own build pipeline. All production repos are gated so that any PR must pass tests before it can merge. Any merge has to pass tests before it could be deployed. As the base build processes matured, teams could, at their leisure, pull their services up to the latest and greatest. We even migrated from Jenkins to Buildkite; yeah, it took N pulls into N repos. Not a big deal. Most projects' tests and builds could get code out to production in under 10 minutes, including all those checks. Due to the network boundary, you couldn't accidentally get around someone's abstraction. And if one team blew up their build doing something dumb? No problem, it only affects that one team.
The argument is "gah, managing all those services!" Keep data behind APIs. Keep APIs backwards compatible. Keep dependencies acyclic. This is _possible_ with monorepos, but you have to do extra work compared to networked services -- yes, when any particular team/service can deploy in minutes due to low build system complexity you are winning. Can you get that wrong and make strange cyclic dependencies and introduce performance issues due to network hops? Yeah, of course. However, we were processing, literally, 10s of billions of api requests on this system and teams could work untethered from one another. The new gig does eerily similar software, but is several orders of magnitude slower in their ability to process data and their ability to move new features.
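As a rough illustration of the "keep dependencies acyclic" rule, here is a minimal sketch of a check that could run in CI. It assumes a hypothetical map from each service to the services it calls; the service names and the `find_cycle` helper are made up for the example and do not come from any system described here.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic.

    `deps` maps a service name to the services it depends on (hypothetical data).
    Uses a depth-first search with the usual three-state marking.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # services on the current DFS path, in order

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in deps:
        if state.get(name, UNSEEN) == UNSEEN:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical service graphs for demonstration only.
healthy = {"feed": ["users", "billing"], "billing": ["users"], "users": []}
tangled = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(find_cycle(healthy))
print(find_cycle(tangled))
```

A check like this only helps if the dependency map is kept honest, for example by generating it from service configs rather than maintaining it by hand.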
Yes, yes, you could have networked services and a monorepo, and you can leverage tooling like Pants to limit testing to only what the changed files affect. It is just fighting what I have found to be a better model. Keep things separate. Keep things fast to change.
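For anyone wondering what "only test what changed" means in practice, the general idea can be sketched as a walk over reverse dependencies. This is a simplified illustration, not how Pants itself is implemented; the `owners` and `deps` maps, the file paths, and the `affected_targets` helper are all assumptions made up for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def affected_targets(
    owners: Dict[str, str],       # file path -> target that owns it (hypothetical)
    deps: Dict[str, List[str]],   # target -> targets it depends on (hypothetical)
    changed_files: List[str],
) -> Set[str]:
    """Return every target that owns a changed file or transitively depends on one."""
    # Invert the edges: for each target, who depends on it?
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for target, target_deps in deps.items():
        for dep in target_deps:
            dependents[dep].add(target)

    # Start from the targets that directly own a changed file.
    start = {owners[path] for path in changed_files if path in owners}
    affected = set(start)
    queue = deque(start)

    # Breadth-first walk up the reverse dependency graph.
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical repo layout for demonstration only.
owners = {
    "libs/util/strings.py": "util",
    "services/feed/app.py": "feed",
    "services/ads/app.py": "ads",
}
deps = {"feed": ["util"], "ads": [], "util": []}

print(sorted(affected_targets(owners, deps, ["services/ads/app.py"])))   # ['ads']
print(sorted(affected_targets(owners, deps, ["libs/util/strings.py"])))  # ['feed', 'util']
```

Note what the second call shows: a change to a widely used library still pulls in everything above it, so this kind of tooling saves the most time for changes near the leaves of the graph.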
by ZephyrBlu on 3/18/24, 8:33 PM
2000 repos what the fuck. More repos than engineers sounds terrible. Having worked in a large monolithic repo, I much prefer that. Everything (Shipping, testing, debugging, etc) is much easier that way.
by MilStdJunkie on 3/18/24, 8:38 PM
Holy Jesus Buddha Muhammad on a Harley. 2000 repos for a messageboard?!
I don't think someone knows what "repository" means.
At least they're bringing in Sourcegraph. That tool's helped me make sense of some chaos. Not 2000 repos' worth of chaos, but still, some chaos.
by hackmiester on 3/18/24, 8:37 PM
Non-legacy Reddit link: https://reddit.com/r/RedditEng/comments/1bdtrjq/wrangling_20...