We are seeking a skilled Site Reliability Engineer (SRE) to enhance the reliability, scalability, and performance of our systems and applications. The ideal candidate will have strong experience in automation, cloud platforms, observability, incident management, and DevOps practices. This role involves working closely with cross-functional teams to ensure high availability, continuous improvement, and efficient service delivery.

Key Responsibilities

Design, build, and maintain automation for infrastructure provisioning and configuration management.
Implement and manage monitoring, observability, and alerting systems to ensure service reliability.
Collaborate with development and operations teams to enhance CI/CD pipelines and deployment automation.
Lead incident response, root-cause analysis, and continuous improvement initiatives.
Manage cloud infrastructure, container orchestration platforms, and distributed systems at scale.
Ensure security, compliance, and governance across systems and processes.
Optimize application performance and conduct capacity planning and load testing.
Maintain documentation, runbooks, SLOs/SLAs, and operational processes.

Required Skills & Experience

1. Automation & Configuration Management
Ansible: Writing playbooks, roles, and modules.
Python: Scripting for automation, monitoring, API integration.
PowerShell: Automation for Windows, AD, and cloud resources.

2. Monitoring & Observability
Dynatrace: Synthetic & real user monitoring, alerting, performance analysis.
Elasticsearch Stack: Log aggregation & querying; familiarity with Kibana/Logstash.
ServiceNow: Ticket lifecycle, CMDB, workflow automation.

3. Database & Storage
SQL Server: Query tuning, replication, HA/DR setups.
Backup & disaster recovery planning.

4. Security & Compliance
IAM, encryption, secrets management (e.g., HashiCorp Vault).
Vulnerability scanning and compliance frameworks (e.g., SOC 2).

5. CI/CD & DevOps
CI/CD tools: Jenkins, GitHub Actions, UrbanCode Deploy (UCD).
Git workflows and branching strategies.

6. Performance Engineering
Load testing using JMeter.
Capacity planning & performance optimization.
Defining and measuring SLIs, SLOs, SLAs.
DevOps SRE
TECHDOQUEST
Montreal (administrative region)
Published 27 days ago