Site Reliability Engineer

Description:

We are seeking an experienced Site Reliability Engineer (SRE) to join our Product Engineering department and play a pivotal role in maintaining and enhancing the performance, reliability, and scalability of our systems. The ideal candidate will lead the establishment of SLAs and SLOs, optimize resource usage, and proactively monitor system health. You will be instrumental in developing solutions that ensure seamless operations, cost-effective performance, and continuous service improvement.

What You Will Do

Primary responsibilities:

Establish service level objectives (SLOs) and service level agreements (SLAs)
Optimize system performance and scalability through effective resource management, load distribution, and latency reduction
Develop proactive monitoring solutions and dashboards to provide visibility into system health and performance while alerting on potential issues
Ensure services operate within defined budget constraints, identifying opportunities for cost savings and optimization
Create and maintain comprehensive documentation for system architecture, configuration, and troubleshooting procedures
Partner with development teams to ensure new features and enhancements meet reliability and performance standards
Conduct post-incident root cause analysis and implement preventive measures to avoid recurrence
Automate repetitive tasks and processes to enhance efficiency and minimize manual intervention
Stay informed about industry best practices, emerging trends, and new technologies in site reliability engineering
Identify technical debt and collaborate with application teams to establish remediation plans
Deliver continuous service improvement through Infrastructure as Code development

Secondary responsibilities:

Perform daily health and compliance checks on systems as required
Validate and promptly resolve monitoring alerts and batch job failures
Ensure sufficient capacity is available to support growth
Respond promptly to emails sent to team distribution lists or mailboxes
Handle incidents and requests efficiently, prioritizing a "customer-first" mindset
Maintain highly available, reliable, secure, and performant infrastructure
Conduct general server, database, and virtualization administration and maintenance activities
Provide technical support and consultation to application support and development teams

Key Requirements

Essential:

Proficiency in deploying, scaling, and managing containerized applications with Docker and Kubernetes
Expertise in managing and optimizing monitoring stacks such as Grafana, Prometheus, or Azure Monitor
Strong skills in creating dashboards using PromQL or KQL
Experience with CI/CD/CT platforms such as Azure DevOps
Familiarity with "Infrastructure as Code" and "Continuous Integration and Continuous Delivery" principles and practices
Knowledge of Agile, Site Reliability Engineering (SRE), and DevOps principles and practices
Proficiency in scripting and programming languages such as PowerShell, Python, Bash, and C#
Knowledge of backup and recovery processes and procedures
Advanced understanding of clustering, high availability, replication, and disaster recovery techniques
Strong ability to tune network, storage, server, and virtualization layers for optimal performance and reliability
In-depth performance tuning skills and system internals knowledge
Experience implementing CIS security hardening recommendations

Organization	Altitude Angel
Industry	Engineering Jobs
Occupational Category	Site Reliability Engineer
Job Location	London, UK
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2024-05-09 6:29 am
Expires on	2024-12-15