Description:
We are seeking an experienced Site Reliability Engineer (SRE) to join our Product Engineering department and play a pivotal role in maintaining and enhancing the performance, reliability, and scalability of our systems. The ideal candidate will lead the establishment of SLAs and SLOs, optimize resource usage, and proactively monitor system health. You will be instrumental in developing solutions that ensure seamless operations, cost-effective performance, and continuous service improvement.
What You Will Do
Primary responsibilities:
- Establish service level objectives (SLOs) and service level agreements (SLAs)
- Optimize system performance and scalability through effective resource management, load distribution, and latency reduction
- Develop proactive monitoring solutions and dashboards to provide visibility into system health and performance while alerting on potential issues
- Ensure services operate within defined budget constraints, identifying opportunities for cost savings and optimization
- Create and maintain comprehensive documentation for system architecture, configuration, and troubleshooting procedures
- Partner with development teams to ensure new features and enhancements meet reliability and performance standards
- Conduct post-incident root cause analysis and implement preventive measures to avoid recurrence
- Automate repetitive tasks and processes to enhance efficiency and minimize manual intervention
- Stay informed about industry best practices, emerging trends, and new technologies in site reliability engineering
- Identify technical debt and collaborate with application teams to establish remediation plans
- Deliver continuous service improvement through Infrastructure as Code development
Secondary responsibilities:
- Perform daily health and compliance checks on systems as required
- Validate and promptly resolve monitoring alerts and batch job failures
- Ensure sufficient capacity is available to support growth
- Respond promptly to emails sent to team distribution lists or mailboxes
- Handle incidents and requests efficiently, prioritizing a "customer-first" mindset
- Maintain highly available, reliable, secure, and performant infrastructure
- Conduct general server, database, and virtualization administration and maintenance activities
- Provide technical support to application support and development teams
- Offer consultation to application support and development teams
Key Requirements
Essential:
- Proficiency in deploying, scaling, and managing containerized applications with Docker and Kubernetes
- Expertise in managing and optimizing monitoring stacks such as Grafana, Prometheus, or Azure Monitor
- Strong skills in creating dashboards using PromQL or KQL
- Experience with CI/CD/CT platforms such as Azure DevOps
- Familiarity with "Infrastructure as Code" and "Continuous Integration and Continuous Delivery" principles and practices
- Knowledge of Agile, Site Reliability Engineering (SRE), and DevOps principles and practices
- Proficiency in scripting and programming languages such as PowerShell, Python, Bash, and C#
- Knowledge of backup and recovery processes and procedures
- Advanced understanding of clustering, high availability, replication, and disaster recovery techniques
- Strong ability to tune network, storage, server, and virtualization layers for optimal performance and reliability
- In-depth performance tuning skills and system internals knowledge
- Experience implementing CIS security hardening recommendations