Become a Site Reliability Engineer to support cutting-edge AI technologies. Ensure system reliability and operational effectiveness utilizing your Linux and automation skills in a hybrid setup.

In this role, you will focus on the intersection of reliability and customer engineering, validating that our AI systems are production-ready. Engaging with internal teams, you will tackle complex issues and enhance monitoring and automation processes, contributing significantly to system performance and reliability.

Key Responsibilities:
• Maintain operational integrity of AI infrastructures
• Troubleshoot issues spanning compute, network, and software
• Collaborate with teams for incident response
• Enhance monitoring and observability frameworks
• Create automation solutions to boost reliability

Requirements:
• Expertise in site reliability or systems engineering
• Advanced Linux troubleshooting capabilities
• Knowledge of observability tools like Prometheus
• Proficiency in scripting with Python or Go
• Solid grasp of networking principles at scale

Elevate AI infrastructure through your role, ensuring robust and reliable systems with efficient operational practices.
AI Systems Reliability Engineer Position
TENSTORRENT
Toronto, Toronto
Published 27 days ago