by broken_broken_ on 2/29/24, 2:11 PM with 347 comments
by eschneider on 2/29/24, 8:48 PM
After inheriting quite a few giant C++ projects over the years, there are a few obvious big wins to start with:
* Reproducible builds. Pro-tip: wrap your build environment with docker (or your favorite packager) so that your tooling and dependencies become both explicit and reproducible. The sanity you save will be your own.
* Get the code to build clean with -Wall. This is for a couple of reasons. a) You'll turn up some amount of bad code/undefined behavior/bugs this way. Fix them and make the warnings go away. It's ok to #pragma away some warnings once you've determined you understand what's happening and it's "ok" in your situation (a sketch follows this list), but that should be rare. b) Once the build is clean, you'll get obvious warnings when YOU do something sketchy and you can fix that shit immediately. Again, the sanity you save will be your own.
* Do some early testing with something like valgrind and investigate any read/write errors it turns up. This is an easy win from a bugfix/stability point of view.
* At least initially, keep refactorings localized. If you work on a section and learn what it's doing, it's fine to clean it up and make it better, but rearchitecting the world before you have a good grasp on what's going on globally is just asking for pain and agony.
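A sketch of the "#pragma away an understood warning" point above, scoped to one audited call site rather than disabling the warning project-wide (GCC/Clang pragma syntax; the function and the specific warning are illustrative):

    #include <cstring>

    // Audited call site: truncation is intended, and we re-terminate below.
    void copy_field(char* dst, std::size_t dst_size, const char* src) {
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wstringop-truncation"
        std::strncpy(dst, src, dst_size - 1);
    #pragma GCC diagnostic pop
        dst[dst_size - 1] = '\0';  // always null-terminated
    }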
by Jtsummers on 2/29/24, 4:33 PM
Things that get flagged by a static analysis tool (today) will often be areas where you can tear out entire functions and maybe even classes and files, because they'll be re-creations of STL concepts. Like homegrown iterator libraries (with subtle problems) that can be replaced with the STL algorithms library, or homegrown smart pointers that can just be replaced with actual smart pointers, or replacing the C string functions with C++'s own string class (and related classes) and functions/methods.
But you won't see that as easily until you start scanning the code. And you won't be able to evaluate the consequences until you push towards more rapid test builds (at least) if not deployment.
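For the smart-pointer case, the tear-out is often mechanical. A hedged before/after sketch ("ScopedPtr" is an invented stand-in for the homegrown class):

    #include <memory>

    struct Widget { void draw() {} };

    int main() {
        // Before: ScopedPtr<Widget> w(new Widget);  // homegrown, subtly buggy
        auto w = std::make_unique<Widget>();          // after: standard, well-tested
        w->draw();
    }   // Widget freed here automatically, same as before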
by vijucat on 2/29/24, 10:51 PM
I used to use a tool called Source Navigator (written in Tcl/Tk!) that was great at indexing code bases. You could then check the call hierarchy of the current method, for example, then use that to make UML sequence diagrams. A similar tool, Source Insight, is shown below [1].
And oh, notes. Writing as if you're teaching someone is key.
Over the years, I got quite good at comprehending code, even code written by an entire team over years. For a brief period, I was the only person actively supporting and developing an algorithmic trading code base in Java that traded ~$200m per day on 4 or 5 exchanges. I had 35 MB of documentation on that, lol. Loved the responsibility (ignoring the key man risk :|). Honestly, there's a lot of overengineering and redundancy in most large code bases.
[1] References in "Source Insight" https://d4.alternativeto.net/6S4rr6_0rutCUWnpHNhVq7HMs8GTBs6...
by Night_Thastus on 2/29/24, 4:38 PM
I'm pretty sure my stomach did somersaults on that.
But as for the advice:
>Get out the chainsaw and rip out everything that’s not absolutely required to provide the features your company/open source project is advertising and selling
I hear you, but this is incredibly dangerous. Might as well take that chainsaw to yourself if you want to try this.
It's dangerous for multiple reasons. Mainly it's a case of Chesterton's fence. Unless you fully understand why X was in the software and fully understand how the current version of the software is used, you cannot remove it. A worst case scenario would be that maybe a month or so later you make a release and the users find out an important feature is subtly broken. You'll spend days trying to track down exactly how it broke.
>Make the project enter the 21st century by adding CI, linters, fuzzing, auto-formatting, etc
It's a nice idea, but it's hard to do. One person is using Vim, another is using Emacs, another is using Qt Creator, another primarily edits in VS Code... Trying to get everyone on the same page about all this is very, very hard.
If it's an optional step that requires that they install something new (like commit hook) it's just not going to happen.
Linters also won't do you any good when you open the project and 2000+ warnings appear.
by keepamovin on 2/29/24, 4:04 PM
0. You reach out to the previous maintainers, visit them, buy them tea/beer and chat (eventually) about the codebase. Learned Wizards will teach you much.
But I didn't see that anywhere. I think the rest of the suggestions (like get it running cross-platform, get tests passing) are useful stress tests likely to lead you to robustness and understanding, however. But I'd def be going for that sweet, sweet low-hangin' fruit of just talking to the ol' folks who came that way before. Haha :)
by mk_chan on 2/29/24, 8:09 PM
If it’s a big enough change, export whatever you need out of the legacy code (by calling an external function/introducing a network layer/pulling the exact same code out into a library/other assorted ways of separating code) and do the rest in a fresh environment.
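A minimal sketch of the library-extraction route, with invented names throughout; the point is the single narrow entry point, not the code:

    #include <iostream>
    #include <string>

    namespace legacy {
    // Stand-in for the existing tangled implementation.
    double computePrice(const std::string& id) { return id.size() * 1.5; }
    }

    namespace facade {
    // The only function new code may call; swap the body out later.
    double price_for(const std::string& instrument_id) {
        return legacy::computePrice(instrument_id);
    }
    }

    int main() { std::cout << facade::price_for("ACME-2024") << "\n"; }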
I wouldn’t try to do any major refactor unless several people are going to work on the code in the future and the code needs to have certain assumptions and standards so it is easy for the group to work together on it.
by summerlight on 3/1/24, 1:57 AM
* You can find out the most relevant files/functions for your upcoming work. If some functions/files have been frequently changed, they're going to be the hot spots for your work. Focus on them to improve your quality of life. Want to introduce unit tests? Focus on the hot spots. Suffering from lots of merge conflicts? Same. (A minimal hot-spot counter is sketched after this list.)
* You can also figure out correlations among the project's source files. Some seemingly distant files are frequently changed together? That might suggest an implicit structure that might not be clear from the code itself. This kind of information from external contexts can be useful for the bird's eye view.
* Real ownership models of each module can be inferred from the history. Having a clear ownership model helps, especially if you want to introduce some form of code review. If some code/data/module seems to have unclear ownership, that might be a signal of refactoring needs.
* Specific to C/C++ contexts, build time improvements can be focused on important modules, in a data-driven way. Incremental build time matters a lot: break down frequently changed modules rather than blindly removing dependencies on random files. You can even combine this with header dependencies to score each module by its real build time impact.
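The hot-spot sketch mentioned above: feed it `git log --name-only --pretty=format:` on stdin and it prints the 20 most frequently changed files (illustrative code, not a real tool):

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        std::map<std::string, int> counts;
        std::string line;
        while (std::getline(std::cin, line))
            if (!line.empty()) ++counts[line];   // one line per touched file
        std::vector<std::pair<std::string, int>> byFreq(counts.begin(), counts.end());
        std::sort(byFreq.begin(), byFreq.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        for (std::size_t i = 0; i < byFreq.size() && i < 20; ++i)
            std::cout << byFreq[i].second << "\t" << byFreq[i].first << "\n";
    }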
There could be so many other things if you can integrate other development tools with VCS. In the era of LLMs, I guess we can even try to feed the project history and metadata to a model and ask for some interesting insights, though I haven't tried this. It might need some dedicated model engineering to do this without a huge context window, but my gut tells me it should be worth trying.
by bArray on 2/29/24, 5:09 PM
I would break this down:
a) CI - Ensure not just you can build this, but it can be built elsewhere too. This should prevent compile-based regressions.
b) Compiler warnings and static analysers - They are likely both smarter than you. When it says "warning, you're doing weird things with a pointer and it scares me", it's a good indication you should go check it out.
c) Unit testing - Set up a series of tests for important parts of the code to ensure it performs precisely the task you expect it to, all the way down to the low level. There's a really good chance it doesn't, and you need to understand why. Fixing something could cause something else to blow up, because it was written around the bugged code. You also end up with a series of regression tests for the most important code (a bare-bones sketch follows this list).
n) Auto-formatting - Not a priority. You should adopt the same style as the original maintainer.
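For c), even framework-free tests pay off. A bare-bones sketch, where parse_order_id is an invented stand-in for some small legacy function whose current behavior you want pinned down:

    #include <cassert>
    #include <string>

    // Stand-in for a small legacy function you are characterizing.
    int parse_order_id(const std::string& s) { return std::stoi(s.substr(3)); }

    int main() {
        assert(parse_order_id("ORD42") == 42);
        assert(parse_order_id("ORD007") == 7);  // pin down quirks, even ugly ones
        return 0;
    }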
> 5. If you can, contemplate rewrite some parts in a memory safe language
The last step of an inherited C++ codebase is to rewrite it in a memory safe language? A few reasons why this probably won't work:
1. Getting resources to do additional work on something that isn't broken can be difficult.
2. Rather than just needing knowledge in C++, you now also need knowledge in an additional language too.
3. Your testing potentially becomes more complex.
4. Your project likely won't lend itself to being written in multiple languages, due to memory/performance constraints. The problem must be hard enough that you couldn't just have written it from scratch yourself.
5. You have chosen to inherit a legacy codebase rather than write something from scratch. It's an admission that you lack some resource (time/money/knowledge/etc) to do so.
by delta_p_delta_x on 2/29/24, 10:37 PM
It is strange that the author complains so much about automating BOMs, package versioning, dependency sources, etc, and then proceeds to suggest git submodules as superior to package managers.
The author needs to try vcpkg before making these criticisms; almost all of them are straightforwardly addressed by vcpkg, barring a few sharp edges (updating dependencies is a little harder than with git submodules, but that's IMO a feature, not a bug). Dependencies are built in individual sandboxes and then installed to a specified directory. vcpkg can set internal repositories as the registry instead of the official one, thus maintaining the 'vendored in' aspect; it can chainload toolchains to compile everything with a fixed set of flags, and it allows users to specify per-port customisations.
These are useful abstractions and it's why package managers are so popular, rather than having everyone deal with veritable bedsheets' worth of strings containing compile flags, macros, warnings, etc.
by tehnub on 2/29/24, 4:54 PM
by Kon-Peki on 2/29/24, 10:04 PM
The approach I've taken is, when you do work on a function and find that it uses a global variable, try to add the GV as a function parameter (and update the calling sites). Even if it's just a pointer to the global variable, you now have another function that is more easily testable. Eventually you can get to the point where the GV can be trivially changed to a local variable somewhere appropriate.
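A hedged before/after of that same trick (Config, g_config, and apply_discount are invented names):

    #include <cassert>

    struct Config { double discount; };
    Config g_config{0.25};  // the legacy global

    // Before: the body read g_config directly, so it was untestable.
    // After: the global is a parameter; production call sites pass
    // g_config for now, and tests pass a local Config.
    double apply_discount(double price, const Config& cfg) {
        return price * (1.0 - cfg.discount);
    }

    int main() {
        assert(apply_discount(100.0, Config{0.5}) == 50.0);  // now testable
        assert(apply_discount(100.0, g_config) == 75.0);     // behavior unchanged
    }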
by bluetomcat on 2/29/24, 4:46 PM
by VyseofArcadia on 2/29/24, 8:19 PM
Except every legacy C++ codebase I've worked on is decades old. Just enumerating the different "features" is a fool's errand. Because of reshuffling and process changes, even marketing doesn't have a complete list of our "features". And even if there were a complete list of features, we have too many customers that rely on spacebar heating[0] to just remove code that we think doesn't map to a feature.
That's if we can even tease apart which bits of code map to a feature. It's not like we only added brand new code for each feature. We also relied on and modified existing code. The only code that's "safe" to remove is dead code, and sometimes that's not as dead as you might think.
Even if we had a list of features and even if code mapped cleanly to features, the idea of removing all code not related to "features your company is advertising or selling" is absurd. Sometimes a feature is so widely used that you don't advertise it anymore. It's just there. Should Microsoft remove boldface text from Word because they're not actively advertising it?
The only way this makes sense is if the author and I have wildly different ideas about what "legacy" means.
by ecshafer on 2/29/24, 4:34 PM
by grandinj on 2/29/24, 5:42 PM
We built our own find-dead-code tool, because the extant ones were imprecise, and boy oh boy did they find lots of dead stuff. And more dead stuff. And more dead stuff. Like peeling an onion, it went on for quite a while. But totally worth it in the end, made various improvements much easier.
by pvarangot on 2/29/24, 11:10 PM
by ilitirit on 3/1/24, 6:02 AM
Most of the time I spend with C++ code revolves around figuring out compile/link errors. Heaven forbid you need to deal with non-portable `make` files that for some reason work on the old box, but not yours... Oh, and I hope you have a ton of spare space, because for some reason building a 500k exe takes 4GB.
Keep in mind, this advice only applies to inherited C++ code bases. If you've written your own or are working on an actively maintained project these are non-issues. Sort-of.
by sjc02060 on 2/29/24, 4:25 PM
by hgs3 on 2/29/24, 11:23 PM
[1] https://www.joelonsoftware.com/2000/04/06/things-you-should-...
by sk11001 on 2/29/24, 4:49 PM
by jcarrano on 2/29/24, 9:58 PM
This is an extremely crucial step that you must do first: familiarize yourself with the system, its uses and the reasons it works like it does. Most things will be there for a reason, even if not written to the highest standard. Other parts might at first sight seem very problematic yet be only minor issues.
Be careful with number 4 and 5. Do not rush to fix or rewrite things just because they look like they can be improved. If it is not causing issues and it is not central to the system, better spend your resources somewhere else.
Get the team to adopt good practices, both in the actual code and in the process. Observe the team and how they work and address the worst issues first, but do not overwhelm them. They may not even be aware of their inefficiencies (e.g. they might consider complete rebuilds as something normal to do).
by happyweasel on 3/1/24, 12:34 PM
by bobnamob on 2/29/24, 10:04 PM
So much talk about change and large LoC deltas without capturing the expected behavior of the system first
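One cheap way to capture that behavior first is a golden-master check. A sketch with invented names (render_report, report.golden), where the golden file is recorded once from the old binary:

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Stand-in for any deterministic output path through the legacy code.
    std::string render_report() { return "total=3\nitems=2\n"; }

    int main() {
        std::ifstream golden("report.golden");  // recorded from the old binary
        std::stringstream expected;
        expected << golden.rdbuf();
        if (render_report() != expected.str()) {
            std::cerr << "output differs from golden master\n";
            return 1;
        }
    }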
by rurban on 3/1/24, 7:01 AM
Their shell build script was called makefile, I kid you not. So first create proper dependency management: a GNUmakefile. A BSD makefile would have been prettier, but not many are used to those. Then dos2unix, chmod -x `find . -name '*.c' -o -name '*.h'`, clang-format -i. All in separate commits.
Turns out one .h file was not a header at all, but some custom list of algorithm states, and the formatter broke it. Don't do that. Either keep it a real header file, or rename it to .lst or such.
Fix all the warnings, hundreds. Check with sanitizers. Check the tests, disable broken algorithms, and mark them as such.
Improve the codebase. There are lots of hints of thought about features. write them. Simplify the state handling. Improve the tests.
Add make check lint. Check all the linter warnings.
Add a CI. Starting with Linux, Windows mingw, macos and aarch64. Turns out the code is Linux x64 only, ha. Make it compat with sse checks, windows quirks.
Waiting for GH Actions sucks, so write Dockerfiles and qemu drivers into your makefile. Maybe automake would have been a better idea after all. Or even proper autoconf.
Find the missing algorithms described elsewhere. Add them. Check their limitations.
Reproducible builds? Not for this one, sorry. This is luxury. Rather check clang-tidy, and add fuzzing.
by rayiner on 3/1/24, 3:18 PM
I observed from afar when the Gwydion Dylan folks (the Dylan successor to the CMU CL compiler) inherited Harlequin’s Dylan compiler and IDE and decided to switch to that going forward: https://opendylan.org. The work (done out in the open in public mailing lists and IRC) is a very nicely done case study in taking a large existing code base developed by someone else, studying it, and refactoring it to bring it incrementally into the present. They started with retooling the build system and documenting the internals. Then over time they addressed major pain points, like creating an LLVM backend to avoid the need to maintain custom code generators.
by t43562 on 3/1/24, 9:41 AM
We could have left customers with old operating systems on the older versions of the product. A lot of them never upgraded anyhow. We absolutely destroyed our productivity by not making this kind of decision. We also really hurt ourselves by supporting Windows - as soon as there are 2 or more completely different compilers things turn to **t. I'm not even sure we made much money from it.
Given the ability to use new tools (clang, gcc and others) that are only available on newer operating systems we could have done amazing things. All those address sanitizers etc would have been wonderful and I would like to have done some automated refactoring which I know clang has tools for.
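For anyone who hasn't seen them: this is the class of bug the address sanitizer pinpoints with exact file/line stack traces (an illustrative snippet, not from that codebase; build with g++ -g -fsanitize=address):

    #include <vector>

    int main() {
        std::vector<int> v{1, 2, 3};
        int* p = &v[0];
        v.push_back(4);  // likely reallocates, invalidating p
        return *p;       // ASan flags this as heap-use-after-free
    }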
Most of the problems were just with understanding the minds of the developers. They were doing something difficult, at a level of complexity that somewhat overmatched the problem most of the time, but the complexity was there to handle the edge cases. I wanted to go around adding comments to the files and classes as I understood bits of it. I was working with one of the original developers, who was of course not at all interested in anyone understanding it or making it clearer, and this kind of effort tended to get shot down.
If you don't have good tests you're dead in the water. I have twice inherited python projects without tests at all and those were a complete nightmare until I added some. One was a long running build process in which unit tests were only partially helpful. Until I came up with a fake android source tree that could build in under a minute I was extremely handicapped. Once I had that everything started to get much better.
My favorite game ... is an open source C++ thing called warzone2100 - no tests. It's not easy to make changes with confidence. I imagine to myself that one day my contribution will be to add some. The problem is that I cannot imagine the current developers taking all that kindly to it. Some people get to competence in a codebase and leave it at that.
by myrmidon on 2/29/24, 5:04 PM
Something that's kinda implied that I would really stress: Establish a "single source of truth" for any release/binary that reaches production/customers, before even touching ANY code (Ideally CI. And ideally builds are reproducible).
If you build from different machines/environments/toolchains, it's only a matter of time before that in itself breaks something, and those kinds of problems can be really "interesting" to find (an obscure race condition that only occurs when using a newer compiler, etc.)
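One small trick in that spirit (a sketch, not the parent's setup): have the single blessed build environment stamp every binary it produces, so stray local builds identify themselves. BUILD_ID is an invented define:

    // Build in CI with e.g.:  g++ -DBUILD_ID="\"ci-abc123\"" main.cc
    #include <iostream>

    #ifndef BUILD_ID
    #define BUILD_ID "unofficial-local-build"  // anything built elsewhere says so
    #endif

    int main() { std::cout << "build: " << BUILD_ID << "\n"; }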
by sega_sai on 2/29/24, 8:53 PM
by cljacoby on 3/1/24, 3:34 PM
I recently started on a new Rust project, and despite not having to worry about things like sanitizers as much, I followed a similar approach of getting it to compile locally, getting it compile in a docker container, setup automated CI/CD against all PRs.
Although I would order the steps as 1, 3, 4, 2. Don't get out the chainsaw until you have CI/CD tests evaluating your code changes as you go.
by joshmarinacci on 2/29/24, 10:26 PM
docs:
https://docs.trunk.io/check/configuration/configuring-existi...
by jxramos on 3/1/24, 8:45 AM
by cesaref on 3/1/24, 2:36 PM
So buy the book, read it, apply the ideas.
by w10-1 on 3/1/24, 1:55 AM
Structure101 is the best way to grok the architecture (er, tangles) of a large code base. They have a trial period that would give you the overview, but their refactoring support is fantastic (in Java at least).
by FpUser on 2/29/24, 11:00 PM
So no. I would not call C++ anything special in this regard.
by AtlasBarfed on 3/2/24, 2:25 AM
2) Ask how long it would take to write from scratch? Only ask if the answer to #1 is "salaried well" btw.
Dumping an unsupported codebase on a new employee is a major dick move, and it requires hazard pay. They are doing that because they are desperate.
by Jean-Papoulos on 3/1/24, 7:23 AM
by rwmj on 2/29/24, 4:07 PM
by dureuill on 2/29/24, 10:28 PM
Look for another job
> You’d be amazed at how many C++ codebase in the wild that are a core part of a successful product earning millions and they basically do not compile.
Wow, I really hope this is hyperbole. I feel like I was lucky to work on a codebase that had CI to test on multiple computers with -Werror.
by JoeAltmaier on 3/1/24, 2:33 PM
Make notes about mysterious things. Check them off once you've figured them out.
Try to find the 'business case' database. You will fail, nobody has one. Make one then, while you're reading the code.
by victorronin on 3/1/24, 8:13 PM
by Kapura on 2/29/24, 4:57 PM
Great advice! People do not often think about the value of de-cluttering the codebase, especially _before_ a refactor.
by geoelectric on 2/29/24, 11:43 PM
by codelobe on 2/29/24, 9:43 PM
#0: Replace the custom/proprietary Hashmap implementation with the STL version.
Once upon a time, C++ academics browbeat the lot of us into accepting the red-black tree as the only Map implementation, arguing (in good faith, yet from ignorance) that the "Big O" (an orgasm joke, among others) worst case scenario (oops, pregnancy) categorized the hash map as O(n) on insert, etc., because naive implementations frequently place hash-colliding keys in a bucket via linked list, or else iterate to "adjacent" buckets. Point being: the One True Objective Standard of "benchmark or die" was not considered, even though the average case is obviously the better deciding factor -- or, as Spock simply logic'd it, "The needs of the many outweigh the needs of the few." Thus it came to pass that the STL shipped without a hash map, even though the worst case is typically trivial (or a non-issue) to avoid, e.g. by iterative re-mapping of the hash map. So many "legacy" codebases built their own hash map implementations to get at that (academically forbidden) effective/average-case insert/access/etc. sweet spot of constant-time "O(1)" [emphasis on the scare quotes: benchmark it and see -- there is no real measure of the algo otherwise, riiight?]. Therefore the afore-prophesied fracturing of the collections APIs, via the STL's failure to fill the niche that a hash map would inevitably have to occupy, came to pass -- who could have foreseen this?!
What is done is done. The upshot is: one can typically familiarize oneself with a legacy codebase, while paying lip service to "future maintainability", by replacing (albeit usually needlessly) custom hash map implementations with the one the C++ standards body eventually accepted, despite the initial "academic" protesting-too-much via "Big O" notation (which is demonstrably a sex-humor-based system, of little use in the practical/average-case world we live in). Yes, once again the apprentice has been made the butt of the joke.
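For reference, the eventually-blessed replacement; average-case O(1), as the homegrown versions were chasing (a trivial sketch):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> counts;
        ++counts["inserts"];          // average O(1) insert/update
        counts.rehash(1024);          // the "iterative re-mapping" knob, exposed
        std::cout << counts.at("inserts") << "\n";
    }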
by asah on 3/1/24, 11:07 AM
...but grandpa, how did you know your code worked? Son, we didn't. <silence>
(wait until they hear that source control wasn't used...)
by davidw on 2/29/24, 4:53 PM
by Scubabear68 on 2/29/24, 5:53 PM
Make it compile, automate what you can, try not to poke the bear as much as you can, pray you can start strangling it by porting pieces to something else over time.
by girafffe_i on 3/1/24, 2:48 AM
by bfrog on 3/1/24, 3:57 PM
by m_a_g on 3/1/24, 12:24 PM
by xchip on 3/1/24, 10:18 AM
Even better: change jobs to do something more interesting.
by huqedato on 2/29/24, 4:56 PM
by Cloudef on 3/2/24, 2:28 AM
by jeffrallen on 2/29/24, 8:06 PM
Fun times. Don't work there anymore. Life is good. :)
by elzbardico on 2/29/24, 9:30 PM
by btbuildem on 3/1/24, 2:44 PM
Nope. Make a portable/virtualized env, and make it build there. That way it'll build on your machine, in the CI pipeline, on your co-worker's machine, etc etc.
> linters, fuzzing, auto-formatting
No, for at least two reasons:
1) Too risky, you change some "cosmetic" things and something will quietly shift in the depths, only to surface at the worst possible time
2) Stylistic refactors will bury the subtle footprints of the past contributors - these you will need as you walk through the tunnels on your forensic missions to understand Wat The Fuk and Why
Generally, touch as little as possible, focus on what adds value (actual value, like, sales dollars). The teetering jenga tower of legacy code is best not approached lightly.
Work with PMs to understand the roadmap, talk to sales and support to understand what features are most used and valuable. The code is just an artifact, a record of unhappy accidents and teeth-grinding compromises. Perhaps there's a way to walk away from it altogether.
by cratermoon on 2/29/24, 5:01 PM
by cloudhan on 3/1/24, 2:40 AM
by Merik on 2/29/24, 9:08 PM
by Dowwie on 3/1/24, 11:37 AM
by leecarraher on 2/29/24, 4:13 PM
by professorTuring on 2/29/24, 8:20 PM
by sealeck on 2/29/24, 4:38 PM
Problem solved
by throw_m239339 on 3/1/24, 4:10 AM
by nottorp on 3/1/24, 11:06 AM
... and goes on against using system packages.
Well if you're not using what your OS provides, why don't you statically link?
After all, it's the customer who pays for the extra storage and ram requirements, not you.
by bun_terminator on 2/29/24, 4:36 PM
like c++11 and later?
by dirkc on 3/1/24, 8:08 AM
1. Source control
2. Reproducible builds
3. Learn the functionality
4. ...
If you don't understand what the code does, you're probably going to regret any changes you make before step 3!
by EvgeniyZh on 3/1/24, 1:20 AM
by hesdeadjim on 2/29/24, 4:34 PM
by girafffe_i on 3/1/24, 3:05 AM
by BlueTemplar on 3/1/24, 6:23 AM
Ah yes, nothing like a veiled insult towards the people who might need that advice the most, blaming them for not using a developer paradigm that you fail to even name and for which you only give a hard-to-search two-letter acronym! (/s)
(CI stands for Continuous Integration: https://en.wikipedia.org/wiki/Continuous_integration)
by beanjuiceII on 2/29/24, 8:05 PM
by legacybob on 3/1/24, 9:59 AM
I then sit on my legacy chair to browse the Internet and read about brand new programming things (through a legacy monitor).
by jujube3 on 2/29/24, 9:11 PM
But you should rewrite it in Rust.