We are seeking a Senior Consultant in Site Reliability Engineering (Network SRE) to lead network-centric reliability practices across the Shared Platform ecosystem. This role focuses on ensuring resilience, scalability, and operational excellence for all Shared Platform-hosted applications and their interfacing systems, including network and messaging dependencies such as Solace and colocation integrations. The ideal candidate will bring an SRE mindset to early-stage design, embedding reliability, observability, governance, and operational readiness into network architecture and integrations.

Key Responsibilities

- Lead the network SRE perspective during early design phases for application integrations and end-to-end network architecture.
- Define and enforce reliability standards across Shared Platform-hosted and interfacing applications.
- Own and govern the Data Flow Diagram (DFD) lifecycle, ensuring accuracy, quality, and alignment with architecture and operations.
- Establish and drive network reliability controls as part of onboarding and governance processes for new and existing integrations.
- Collaborate with application, platform, and operations teams to define and implement monitoring, alerting, and capacity planning standards.
- Identify risks, failure modes, and reliability gaps in network components and proactively drive improvements.
- Ensure operational readiness through SRE best practices, including observability, incident prevention, and continuous improvement.
- Promote consistent adoption of SRE-aligned network practices across cross-functional teams.

Required Skills & Experience

- Strong experience in Site Reliability Engineering (SRE) with a focus on network infrastructure and distributed systems.
- Deep understanding of network architecture, integration patterns, and messaging systems (experience with Solace is a plus).
- Proven experience in designing and implementing monitoring, alerting, and observability frameworks.
- Hands-on expertise in capacity planning, performance tuning, and reliability engineering.
- Experience with Data Flow Diagrams (DFDs), architecture documentation, and governance practices.
- Strong knowledge of failure analysis, risk mitigation, and operational readiness frameworks.
- Ability to work across teams in a matrix organization, influencing stakeholders and driving alignment.
- Excellent communication skills with the ability to translate technical concepts into actionable strategies.
Site Reliability Engineer
KUMARAN SYSTEMS
Toronto, Toronto
Published 26 days ago