Become a Site Reliability Engineer to support cutting-edge AI technologies. Ensure system reliability and operational effectiveness using your Linux and automation skills in a hybrid setup.

In this role, you will focus on the intersection of reliability and customer engineering, validating that our AI systems are production-ready. Engaging with internal teams, you will tackle complex issues and enhance monitoring and automation processes, contributing significantly to system performance and reliability.

Key Responsibilities:
• Maintain operational integrity of AI infrastructures
• Troubleshoot issues spanning compute, network, and software
• Collaborate with teams for incident response
• Enhance monitoring and observability frameworks
• Create automation solutions to boost reliability

Requirements:
• Expertise in site reliability or systems engineering
• Advanced Linux troubleshooting capabilities
• Knowledge of observability tools like Prometheus
• Proficiency in scripting with Python or Go
• Solid grasp of networking principles at scale

Elevate AI infrastructure through your role, ensuring robust and reliable systems with efficient operational practices.
AI Systems Reliability Engineer Position
TENSTORRENT
Toronto, Toronto
Published 27 days ago