Overview

We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.

You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.

Responsibilities

- Manage and optimize HPC cluster operations
- Deploy and maintain infrastructure-as-code solutions
- Support ML/research teams with cluster usage optimization
- Operate, troubleshoot, and optimize Ceph storage clusters
- Develop automation and tooling

Minimum Qualifications

- 5+ years of experience in SRE or HPC operations
- Proficiency in Linux systems administration (Ubuntu/Debian)
- Experience with Kubernetes and container orchestration
- Experience deploying and maintaining Ceph at 1PB scale
- Knowledge of security best practices in multi-tenant environments
- Understanding of L2/L3 networking fundamentals
- Skilled in Python and Bash scripting

Preferred Qualifications

- Experience with infrastructure-as-code tools (Ansible/Terraform)
- Experience with GitOps (Helm, ArgoCD)
- Strong grasp of RDMA, InfiniBand, and GPUDirect technologies
- Familiarity with deep learning frameworks such as PyTorch and TensorFlow
- Familiarity with at least one cloud platform: AWS, Azure, or GCP

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Site Reliability Engineer, AI/ML Infrastructure
BOSON AI
Toronto