by TripleH on 5/24/21, 2:17 PM with 12 comments
Along the way I gathered knowledge of how to architect and build systems depending on the expected volume (number of active users) and its distribution over time (periods of activity).
Now I'm faced with the challenge of turning the research team's Matlab (or equivalent) scripts, which were designed to run on a single machine for a single data point, into a system able to perform those computations on a lot more data and eventually in a distributed manner.
I am well aware there is no silver bullet and will have to compose a solution based on the specifics of my use case. That's why I'm asking you, the truly diverse HN community, for the best resources you know on the subject.
by mustafa_pasi on 5/24/21, 8:53 PM
But moving past that, first of all there is no book that would help you here. Most introductory textbooks will focus on implementing numerical methods and the math behind those methods. You most definitely should not try to write any numerical code yourself. What you need is general know-how on building a simulation pipeline. There is no textbook or class that covers this as far as I know. Graduate students learn it on the job.
Your MATLAB/Python scripts are already using the most optimized libraries. For parallelization, you have two options: OpenMPI, which can be used with most languages, and MATLAB's own built-in parallelism library. OpenMPI is much more technical, and you need to actually understand what the code is doing to use it, so this is not an option for you. I suggest you get acquainted with MATLAB and do everything in MATLAB. If they have code written in other languages, it is easier to rewrite it in MATLAB than to try anything else. Translating numpy to MATLAB is not very hard. Since they have a MATLAB license, you might benefit from MATLAB training that they may have access to. I know MATLAB offers online classes and seminars and maybe even technical support. Thinking about it, a book on the MATLAB environment might be very useful for you. I cannot give you any suggestions though.
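To make the "single data point to many data points" step concrete, here is a minimal sketch in Python (used purely for illustration; the same idea applies to a parfor loop in MATLAB). It assumes each data point can be processed independently. The names analyze_point and run_batch are hypothetical stand-ins, not anything from the original scripts.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import math


def analyze_point(x):
    """Hypothetical stand-in for the research script that handles ONE data point."""
    return math.sqrt(x) + 1.0


def run_batch(points, executor_cls=ProcessPoolExecutor, max_workers=None, chunksize=1):
    """Apply analyze_point to every point and return results in input order.

    Assumption: the points are independent, so no worker needs another's result.
    executor_cls can be swapped for ThreadPoolExecutor when the work is I/O-bound
    or when process start-up cost dominates.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the inputs.
        return list(pool.map(analyze_point, points, chunksize=chunksize))
```

With the default process pool, the call should sit under an `if __name__ == "__main__":` guard in a script, because worker processes may re-import the module. For quick experiments, `run_batch(points, executor_cls=ThreadPoolExecutor)` avoids that requirement.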
And ignore the other comments. They aren't any good. You most definitely don't want to rewrite your simulation in FORTRAN. MATLAB is already using FORTRAN under the hood.
by truth_ on 5/24/21, 7:20 PM
by AKluge on 5/25/21, 1:10 AM
Do they have a hardware or software environment for large scale computing? I think a good place to start is understanding their environment and expectations. Do they expect a move to another language, like Fortran, or to exploit parallelism in Matlab? https://www.mathworks.com/help/parallel-computing/index.html
by hackermailman on 5/26/21, 7:29 PM
by khadgar25 on 5/26/21, 2:51 AM
MPI is the usual backbone in parallelizing scientific applications so unless you have experience with it, getting some familiarity will be helpful. A good resource is Parallel Programming with MPI by Pacheco. MPI itself is not very hard but thinking parallel can be challenging unless you have some experience.
Just a word of caution though: well-written MATLAB code is very hard to beat on performance. You will need to carefully understand the latency and bandwidth characteristics of the cluster in order to get the most benefit out of parallelization.
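A rough way to see why latency and bandwidth matter is a back-of-the-envelope cost model. The sketch below (in Python, for illustration) rests on strong simplifying assumptions that are mine, not measured facts about any cluster: the work divides perfectly across workers, each worker receives one message, and all transfers happen concurrently. The function names are hypothetical.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message: fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def estimated_speedup(compute_s, n_workers, n_bytes_per_worker,
                      latency_s, bandwidth_bytes_per_s):
    """Estimate speedup over a single machine under the assumptions above.

    compute_s is the single-machine compute time; communication is added once
    because transfers to the workers are assumed to overlap with each other.
    """
    comm_s = transfer_time(n_bytes_per_worker, latency_s, bandwidth_bytes_per_s)
    parallel_s = compute_s / n_workers + comm_s
    return compute_s / parallel_s
```

In this model, a job with 8 seconds of compute spread over 4 workers, where each worker first needs 2 seconds of data transfer, gets a 2x speedup rather than 4x. Real clusters will behave differently, so measure before trusting any such estimate.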
by OminousWeapons on 5/26/21, 5:32 PM
Additionally, it's a pretty safe assumption that any given piece of scientific code has never been profiled for performance. There are probably numerous opportunities to improve performance on a single host through refactoring before you need to think about scaling.
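As an illustration of what profiling before scaling can look like, here is a small sketch using Python's standard cProfile and pstats modules (MATLAB has its own profiler; Python is used here only as an example). The functions slow_sum_of_squares and profile_call are hypothetical helpers, with the first standing in for a piece of research code.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical stand-in for an unoptimized piece of scientific code."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Calling `result, report = profile_call(slow_sum_of_squares, 1_000_000)` and printing `report` shows where the time goes, which is usually the first thing to check before reaching for a cluster.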
by temp234 on 5/24/21, 5:28 PM
by Bostonian on 5/24/21, 2:28 PM
by imvetri on 5/26/21, 6:17 AM
Your task is still engineering. There doesn't seem to be anything scientific here.
The less you know, the easier it is to bring in your skills. Leave the scientific part to your colleagues and focus on your skills alone. Do not mix and match.
Did you join MathWorks?