from Hacker News

Ask HN: How can I build up experience in ML infrastructure?

by minzi on 1/20/24, 9:58 PM with 5 comments

At the moment I do backend development in Go. In my free time I’ve been refreshing my C++ and doing some Nvidia CUDA courses.

Once I’ve built up the fundamentals, how can I get some real experience (at least, enough to land a first job in the field)?

I might also go back to school for an MSCS if I can get in somewhere. I figure that might help.

by AznHisoka on 1/20/24, 10:38 PM
The only thing that has worked for me is solving a problem you are passionate about with ML. I’m not talking about some pretend problem or a Kaggle competition. I’m talking about a problem you are genuinely interested in or curious about. Choose a model, try to get some data, tune the model, and deploy it somewhere. See how it performs, and keep iterating and learning as you go along.
by shipwright on 1/21/24, 4:06 PM
Getting to know some CUDA is a solid step forward. I'd recommend familiarizing yourself with Triton as a next move since you seem into that, alongside fundamentals like trying out a scalable deployment workflow: pick a framework, a training environment, an inference environment, and orchestration tooling of your choice. There's plenty of room to mix and match there (and it can be done at no cost in many cases thanks to generous free trials/tiers) :)
Simply finding something you love goes a long way, and there are always going to be opportunities to prove your worth, especially when it comes to infra. Here's a little example: a lot of popular streamlined training tools save checkpoints where the tensors are nested oddly in subkeys, and because of that, deserialization gets iffy when converting to safetensors. That wouldn't be a hassle if it were a one-off, but the affected models are really big and popular, not even counting the sheer number of finetunes that also get deployed. Once you're aware of it you can just act accordingly, but it stopped enough projects in their tracks for a while until someone pointed it out, haha.
I'm sure that whatever you pick to do will be a good choice, best of luck :)
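To make the nested-checkpoint example a bit more concrete, here is a minimal sketch of the kind of flattening step that tends to be needed before conversion. Everything in it is an assumption for illustration: the `flatten_checkpoint` helper, the `state_dict` wrapper key, and the layer names are hypothetical, and plain Python lists stand in for tensors so the snippet runs without any ML libraries. With real tensors, a flat name-to-tensor dict like this is roughly the shape that safetensors' save functions expect, but check the layout of your own checkpoint first.

```python
from typing import Any, Dict, Mapping


def flatten_checkpoint(
    obj: Mapping[str, Any], prefix: str = "", sep: str = "."
) -> Dict[str, Any]:
    """Recursively flatten nested mappings into a single-level dict.

    Nested keys are joined with `sep`, so {"a": {"b": x}} becomes {"a.b": x}.
    Anything that is not a mapping is treated as a leaf (e.g. a tensor).
    """
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            flat.update(flatten_checkpoint(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical checkpoint layout: weights are buried under a wrapper key
# and grouped per module, next to non-tensor bookkeeping like "epoch".
checkpoint = {
    "epoch": 3,
    "state_dict": {
        "encoder": {"weight": [[0.1, 0.2], [0.3, 0.4]], "bias": [0.0, 0.0]},
        "head": {"weight": [[1.0, 2.0]]},
    },
}

# Flatten only the weights; bookkeeping entries are left out on purpose.
flat = flatten_checkpoint(checkpoint["state_dict"])
print(sorted(flat))
```

The sketch does not check for key collisions or filter out non-tensor leaves, both of which a real conversion script would probably want.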
by catlover76 on 1/20/24, 10:48 PM
I am learning about this stuff right now as well (and fortunate to be in a job where I basically get to start building this stuff with 0 credentials). I am not sure C++ and deep Nvidia CUDA stuff matter, but we are probably just thinking of different things when we say or hear "ML infrastructure".
I think of MLOps: deploying, training, managing, and scaling ML systems on SageMaker, models on Bedrock, etc. Dealing with data ingestion/ETL for those systems. Managing costs. Doing SRE stuff for those systems. Stuff like that.