
Luma AI
We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team, collaborating closely with researchers to build the platforms for training our next generation of foundation models.
Responsibilities
- Work with researchers to scale up the systems required to train our next generation of models on multi-thousand-GPU clusters.
- Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.
- Build systems to distribute work efficiently across massive GPU clusters.
- Design and implement methods to train models robustly in the presence of hardware failures.
- Build tooling to help us better understand problems in our largest training jobs.
Experience
- 5+ years of work experience.
- Experience working with multi-modal ML pipelines, high-performance computing, and/or low-level systems.
- Passion for diving deep into systems implementations and understanding their fundamentals to improve their performance and maintainability.
- Experience building stable and highly efficient distributed systems.
- Strong generalist Python and software engineering skills, including significant experience with PyTorch.
- Experience with high-performance C++ or CUDA is a plus.
Your application is reviewed by real people.
Apply now