Posted 4 weeks ago

*Join Our Product Client*

Key Responsibilities:

  • Design, implement, and maintain robust monitoring and alerting systems.
  • Lead observability initiatives by improving metrics, logging, and tracing across services and infrastructure.
  • Collaborate with development and infrastructure teams to instrument applications and ensure visibility into system health and performance.
  • Write Python scripts and tools for automation, infrastructure management, and incident response.
  • Participate in and improve the incident management and on-call process, driving down Mean Time to Resolution (MTTR).
  • Conduct root cause analysis and postmortems following incidents, and champion efforts to prevent recurrence.
  • Optimize systems for scalability, performance, and cost-efficiency in cloud and containerized environments.
  • Advocate for and implement SRE best practices, including SLOs/SLIs, capacity planning, and reliability reviews.

 

Required Skills & Qualifications:

  • 3+ years of experience as a Site Reliability Engineer or in a similar role.
  • Proficiency in Python for automation and tooling.
  • Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, New Relic, or OpenTelemetry.
  • Experience with log aggregation and analysis tools such as the ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd.
  • Good understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
  • Familiarity with infrastructure-as-code (Terraform, Ansible, or similar).
  • Strong debugging and incident response skills.
  • Knowledge of CI/CD pipelines and release engineering practices.

*Apply Now:*

Submit your resume and cover letter to cloudanglesrecruiters@kyanosai.com.

*Equal Opportunity Employer:* We welcome applications from qualified candidates of all backgrounds.

Thanks & regards
Pinki Arya
