What Is the Opportunity?
Join RBC's Site Reliability Engineering team as a founding member building the bank's first‑ever Agentic AI platform for software reliability and resiliency. You'll pioneer intelligent automation systems that autonomously prevent incidents, accelerate response times, and transform how we maintain resilience across enterprise systems.

What Will You Do?
Design and implement end‑to‑end Agentic AI solutions that autonomously detect anomalies, identify root causes, and resolve incidents with minimal human intervention.
Develop intelligent automation frameworks using LangChain and LangGraph to create context‑aware agents that learn from incident patterns and continually improve response strategies.
Build ML‑powered monitoring and alerting systems that distinguish signal from noise, dramatically reducing false positives and improving MTTD and MTTI.
Architect scalable, production‑grade solutions on OpenShift and Kubernetes that process real‑time system metrics and telemetry data at enterprise scale.
Implement infrastructure‑as‑code using Ansible and Docker for reproducibility, consistency, and rapid deployment across environments.
Partner with incident management and operations teams to translate operational pain points into AI‑driven automation opportunities that reduce toil.
Establish and track KPIs focused on reducing MTTR, MTTD, and MTTI while improving system reliability.
Lead technical design discussions and contribute to architectural decisions that shape RBC's AI‑powered reliability strategy.

Qualifications
Strong ML engineering background with experience designing, training, and deploying machine learning models in production.
Proven expertise in Agentic AI frameworks and tools (LangChain, LangGraph, AutoGen, CrewAI, or similar) and building autonomous, multi‑agent systems.
Deep understanding of Model Context Protocol (MCP) for enabling AI agents to interact with external systems and data sources.
Experience building AI agents with tool‑calling capabilities, memory management, and reasoning chains.
Proficiency in Python and experience with ML libraries (scikit‑learn, TensorFlow, PyTorch, or similar).
Working knowledge of containerization (Docker), orchestration (Kubernetes/OpenShift), and infrastructure‑as‑code principles (Ansible, Terraform).
Demonstrated ability to translate complex technical concepts into business value and collaborate effectively with cross‑functional teams.

Nice‑to‑have
Prior experience in Site Reliability Engineering, DevOps, or infrastructure monitoring roles.
Familiarity with observability tools (Prometheus, Grafana, ELK stack) and incident management platforms (PagerDuty, ServiceNow).
Experience with LLMs, prompt engineering, and retrieval‑augmented generation (RAG) architectures.
Background in financial services or other highly regulated industries with strict reliability requirements.

Benefits
A comprehensive Total Rewards Program including bonuses, flexible benefits, competitive compensation, commissions, and stock where applicable.
Leaders who support your development through coaching and management opportunities.
Ability to make a difference and lasting impact.
Work in a dynamic, collaborative, progressive, and high‑performing team.
A world‑class training program in financial services.
Flexible work/life balance options.
Opportunities to do challenging work.

Job Details
Address: RBC Waterpark Place, 88 Queens Quay W, Toronto, Canada
City: Toronto
Country: Canada
Work hours/week: 37.5
Employment Type: Full time
Platform: Technology and Operations
Job Type: Regular
Pay Type: Salaried
Posted Date: 2026‑04‑27
Application Deadline: 2026‑05‑29
Senior AI/ML Engineer - Site Reliability Engineering
RBC
Toronto, Canada
Published 26 days ago