Job Description: Site Reliability Engineer (SRE) – Observability

Toronto - Hybrid (1-2 days in office)

We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands‑on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers. You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production‑ready.

Key Responsibilities

Observability Implementation
- Implement and maintain metrics, logs, and traces for applications and infrastructure
- Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)
- Configure dashboards, alerts, and basic anomaly detection

Application Support & Instrumentation
- Work with development teams to enable structured logging, basic distributed tracing, and core metrics
- Validate observability requirements during Production Readiness Reviews (PRR)
- Troubleshoot missing or low‑quality telemetry

Monitoring & Alerting
- Configure alerts based on golden signals (latency, errors, traffic, saturation)
- Help reduce alert noise by tuning thresholds and alert logic
- Support incident response by gathering logs, metrics, and traces

Operations & Reliability
- Support root cause analysis using observability tools
- Maintain dashboards and documentation used by on‑call and support teams
- Participate in on‑call rotations (as applicable)

Automation & Continuous Improvement
- Assist in automating observability onboarding and validation tasks
- Create and maintain reusable dashboards and alert templates
- Follow established observability standards and best practices

Required Qualifications
- 2–4 years of experience in observability or SRE
- Working knowledge of metrics, logs, and basic tracing concepts
- Hands‑on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
- Basic understanding of SLIs/SLOs and service health indicators
- Experience with cloud platforms or hybrid environments
- Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting

Preferred Qualifications
- Experience with OpenTelemetry or APM agents
- Familiarity with Kubernetes or containerized workloads
- Experience working with incident management tools (PagerDuty, ServiceNow)
- Exposure to Dynatrace, Kibana/ELK, or similar cloud‑native monitoring tools
- Experience in regulated or enterprise environments
Site Reliability Engineer (SRE) – Observability
ASTRA-NORTH INFOTECK INC. ~ CONQUERING TODAY’S CHALLENGES, ACHIEVING TOMORROW’S VISION!
Toronto, Toronto
Published 27 days ago