Posted 4 weeks ago
*Join Our Product Client as a Site Reliability Engineer*
Key Responsibilities:
- Design, implement, and maintain robust monitoring and alerting systems.
- Lead observability initiatives by improving metrics, logging, and tracing across services and infrastructure.
- Collaborate with development and infrastructure teams to instrument applications and ensure visibility into system health and performance.
- Write Python scripts and tools for automation, infrastructure management, and incident response.
- Participate in and improve the incident management and on-call processes to drive down Mean Time to Resolution (MTTR).
- Conduct root cause analysis and postmortems following incidents, and champion efforts to prevent recurrence.
- Optimize systems for scalability, performance, and cost-efficiency in cloud and containerized environments.
- Advocate for and implement SRE best practices, including SLOs/SLIs, capacity planning, and reliability reviews.
Required Skills & Qualifications:
- 3+ years of experience as a Site Reliability Engineer or in a similar role.
- Proficiency in Python for automation and tooling.
- Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, New Relic, or OpenTelemetry.
- Experience with log aggregation and analysis tools such as the ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd.
- Good understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
- Familiarity with infrastructure-as-code (Terraform, Ansible, or similar).
- Strong debugging and incident response skills.
- Knowledge of CI/CD pipelines and release engineering practices.
*Apply Now:*
Submit your resume and cover letter to cloudanglesrecruiters@kyanosai.com
*Equal Opportunity Employer:* We welcome applications from qualified candidates of all backgrounds.
Thanks & regards,
Pinki Arya