Cerebrium is a machine learning platform that makes it easy for companies to fine-tune, deploy and monitor ML models in production. We abstract away the complexity that comes with managing ML infrastructure and the need to stay up to date with the latest ML research and technologies.
As an Infrastructure Engineer, you'll work closely with our team to build and operate our infrastructure at scale. You'll help us optimize our model deployment pipeline, implement GPU sharing and time-slicing on top of K8s, implement parallelism, and scale all of this across thousands of machines so that our models and fine-tuning jobs run smoothly.
The ideal candidate has experience building and scaling infrastructure at large scale, expertise in Kubernetes and serverless architectures, and excellent communication skills. You should be able to communicate complex topics clearly and work closely with customers, who are typically machine learning engineers and software engineers.
How we work:
- We focus on output. We don’t care what hours you work or where you work from. Just do what it takes to meet your weekly sprint. If you finish early or are just not having a good day, take the day.
- We have a flat structure and want you to constantly challenge us on product and company decisions.
- We ship multiple times a week. Every time we can add value for the customer, we ship it. You also give weekly demos to team members on what you have been building.
Responsibilities:
- Build and operate our infrastructure at scale
- Optimize and deploy machine learning models and training jobs at scale
- Improve GPU-sharing and time-slicing capability
Qualifications:
- Experience building and scaling infrastructure at large scale
- Expertise in Kubernetes and serverless architectures
- Experience with Python or Go
- Experience with infrastructure-as-code tools such as Terraform
We offer a remote, international work environment and only require 4 hours of overlap with the team each day (between 9:00 and 18:00 EST). If you're passionate about machine learning and interested in joining a team dedicated to making it faster and easier to deploy machine learning models, please apply with your resume and cover letter.
Benefits:
- Competitive salary and meaningful equity
- Flexible work environment – work remotely from home or from a WeWork which we sponsor
- Health, dental, and vision benefits with 80% coverage for you
- Unlimited PTO
- Opportunities to speak and participate in events across the Cloud Native community
- 2-3 company off-sites a year. We have previously done Tulum and Greece.
- Learning budget, and much more