from Hacker News

Ask HN: Resources to Learn Scientific Computing?

by TripleH on 5/24/21, 2:17 PM with 12 comments

After some years of software engineering (mobile apps and back end development) in classic business domains, I recently joined a company in a scientific domain (space).

Along the way I learned how to architect and build systems according to the expected load (number of active users) and how that load is distributed over time (periods of activity).

Now I'm faced with the challenge of turning the research team's Matlab (or equivalent) scripts, designed to run on a single machine for a single data point, into a system able to perform those computations on a lot more data and eventually in a distributed manner.

I am well aware there is no silver bullet and will have to compose a solution based on the specifics of my use case. That's why I'm asking you, the truly diverse HN community, for the best resources you know on the subject.

  • by mustafa_pasi on 5/24/21, 8:53 PM

    They made a shit hiring decision, because graduates in applied math and scientific computing are plentiful and this is their bread and butter.

    But moving past that, first of all there is no book that would help you here. Most introductory textbooks will focus on implementing numerical methods and the math behind those methods. You most definitely should not try to write any numerical code yourself. What you need is general know-how for building a simulation pipeline. There is no textbook or class that covers this as far as I know. Graduate students learn it on the job.

    Your MATLAB/Python scripts are already using the most optimized libraries. For parallelization, you have two options: OpenMPI, which can be used with most languages, and MATLAB's own built-in parallelism library. OpenMPI is much more technical and you need to actually understand what the code is doing to use it, so this is not an option for you. I suggest you get acquainted with MATLAB and do everything in MATLAB. If they have code written in other languages, it is easier to rewrite it in MATLAB than to try to do anything else. Translating numpy to MATLAB is not very hard. Since they have a MATLAB license, you might benefit from some MATLAB training that they might have access to. I know MATLAB offers online classes and seminars and maybe even technical support. Thinking about it, a book on the MATLAB environment might be very useful for you. I cannot give you any suggestions though.

    And ignore the other comments. They aren't any good. You most definitely don't want to rewrite your simulation in FORTRAN. MATLAB is already using FORTRAN under the hood.
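
    For a sense of what the "one script per data point, many data points" case looks like when the runs are independent, here is a minimal sketch in Python. It is only an illustration of the pattern, not anyone's actual pipeline: analyze() is a hypothetical stand-in for one research script, and analyze_all() and its executor parameter are names made up for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import math


def analyze(sample):
    """Hypothetical stand-in for one research script run on one data point."""
    return math.sqrt(sum(x * x for x in sample))


def analyze_all(samples, workers=2, executor=ProcessPoolExecutor):
    """Apply analyze() to every sample using a pool of workers.

    Results come back in the same order as the inputs. The executor
    argument is an assumption of this sketch: it lets you swap in
    ThreadPoolExecutor for a quick dry run without spawning processes.
    """
    with executor(max_workers=workers) as pool:
        return list(pool.map(analyze, samples))
```

    When this is run from a script with the default process pool, the call to analyze_all() should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely. The same map-over-inputs shape is what a parfor-style loop gives you in MATLAB.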

  • by truth_ on 5/24/21, 7:20 PM

  • by AKluge on 5/25/21, 1:10 AM

    Scalable Computing: Practice and Experience is a great one, especially the past issues where they talk about the basics of parallel and distributed programming. https://www.scpe.org/index.php/scpe/issue/archive

    Do they have a hardware or software environment for large scale computing? I think a good place to start is understanding their environment and expectations. Do they expect a move to another language, like Fortran, or to exploit parallelism in Matlab? https://www.mathworks.com/help/parallel-computing/index.html

  • by hackermailman on 5/26/21, 7:29 PM

    There's a course for this. Julia's syntax is similar to Matlab's, and the course shows how to run it distributed: https://m.youtube.com/playlist?list=PLCAl7tjCwWyGjdzOOnlbGnV... Lots of practical advice in the lectures I've seen so far.

  • by khadgar25 on 5/26/21, 2:51 AM

    I worked in a Scientific Computing team for a few years. Having some engineering background yourself and the curiosity to actually understand the code rather than "just port" it to C++ or Fortran will be more helpful. As others have mentioned, there is no one book or silver bullet, but developing a good relationship with the research team is a good starting point. Researchers love to talk and explain their stuff but often don't give scalability much thought.

    MPI is the usual backbone in parallelizing scientific applications so unless you have experience with it, getting some familiarity will be helpful. A good resource is Parallel Programming with MPI by Pacheco. MPI itself is not very hard but thinking parallel can be challenging unless you have some experience.

    Just a word of caution though: well-written MATLAB code is very hard to beat on performance. You will need to carefully understand the latency and bandwidth characteristics of the cluster in order to get the most benefit out of parallelization.
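
    To make the latency and bandwidth point concrete, here is a rough back-of-the-envelope sketch in Python. The model is an assumption of this example, not a measurement of any real cluster: compute time divides perfectly across workers, and each worker pays for one send and one receive of the same payload, each costing latency plus bytes divided by bandwidth.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message: fixed latency plus size over bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def parallel_time(compute_s, n_workers, n_bytes, latency_s, bandwidth_bytes_per_s):
    """Idealized wall time: evenly split compute plus one send and one receive."""
    comm = 2 * transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)
    return compute_s / n_workers + comm


def speedup(compute_s, n_workers, n_bytes, latency_s, bandwidth_bytes_per_s):
    """Serial time divided by the modeled parallel time."""
    return compute_s / parallel_time(
        compute_s, n_workers, n_bytes, latency_s, bandwidth_bytes_per_s
    )
```

    With 8 s of compute, 4 workers, a 1 GB payload, zero latency and 1 GB/s of bandwidth, speedup(8.0, 4, 1e9, 0.0, 1e9) comes out at 2.0 rather than 4: half of the ideal gain goes to moving data.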

  • by OminousWeapons on 5/26/21, 5:32 PM

    If you're not familiar with it, I would recommend checking out Matlab's Parallel Computing Toolbox for starters. IIRC MathWorks has several webinars on that toolbox which cover the basics of what you are trying to do. Matlab also has a product, "Parallel Server", which handles scheduling and monitoring of distributed jobs: https://www.mathworks.com/products/matlab-parallel-server.ht...

    Additionally, it's a pretty safe assumption that any given piece of scientific code has never been profiled for performance. There are probably numerous opportunities to improve performance on a single host through refactoring before you need to think about scaling.
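
    As a sketch of what "profile first" can look like, here is a small Python example using the standard library's cProfile and pstats. MATLAB has its own profiler; this only shows the general workflow. slow_sum_of_squares() is a hypothetical hot spot and profile_call() is a helper name invented for this example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: an element-by-element loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

    Printing the report from profile_call(slow_sum_of_squares, 1_000_000) lists where the time went, which is usually enough to decide what to refactor before reaching for a cluster.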

  • by temp234 on 5/24/21, 5:28 PM

    Beyond scalability, I think the biggest things you could bring to the scientists here would be test-driven development and logs. You won't become a Matlab expert on a deadline and they may not actually be Matlab experts either. Tests and visibility into actual behavior are everything here since they don't even seem to have a spec, just a prototype.
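
    One way to combine the two when the prototype is the only spec is to record the prototype's outputs for a few inputs and check the port against them, logging every comparison. A minimal Python sketch follows; ported_model(), check_against_reference() and the logger name are all hypothetical, and the reference pairs would come from actually running the original scripts.

```python
import logging
import math

logger = logging.getLogger("port_check")


def ported_model(x):
    """Hypothetical ported computation; stands in for a translated script."""
    return x * x + 1.0


def check_against_reference(func, reference, rel_tol=1e-9):
    """Compare func(input) with recorded prototype outputs.

    reference is a list of (input, expected_output) pairs. Every case is
    logged, and mismatches are returned as (input, expected, actual).
    """
    failures = []
    for x, expected in reference:
        actual = func(x)
        ok = math.isclose(actual, expected, rel_tol=rel_tol)
        logger.info("input=%r expected=%r actual=%r ok=%s", x, expected, actual, ok)
        if not ok:
            failures.append((x, expected, actual))
    return failures
```

    An empty list means the port still matches the prototype within tolerance; anything else points at the exact inputs to discuss with the researchers.
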
  • by Bostonian on 5/24/21, 2:28 PM

    I think Fortran is the closest compiled language to Matlab (both have multidimensional arrays with 1-based indexing), so modern Fortran could be considered for the conversion. Some sites are https://fortran-lang.org/ and, for Matlab-Fortran interoperability and partial translation, http://fortranwiki.org/fortran/show/Matlab . A good forum is https://fortran-lang.discourse.group/ .

  • by imvetri on 5/26/21, 6:17 AM

    Wait, so the task is to take a Matlab script that runs on a single system and scale it to multiple systems?

    Your task is still engineering. There doesn't seem to be anything scientific here.

    The less you know, the easier it is to bring in your skills. Leave the scientific part to your colleagues and focus on your own skills. Do not mix and match.

    Did you join MathWorks?