SRE – KyanosAi Career

Design, implement, and maintain robust monitoring and alerting systems.

Lead observability initiatives by improving metrics, logging, and tracing across services and infrastructure.

Collaborate with development and infrastructure teams to instrument applications and ensure visibility into system health and performance.

Write Python scripts and tools for automation, infrastructure management, and incident response.

Participate in and improve the incident management and on-call processes, driving down Mean Time to Resolution (MTTR).

Conduct root cause analysis and postmortems following incidents, and champion efforts to prevent recurrence.

Optimize systems for scalability, performance, and cost-efficiency in cloud and containerized environments.

Advocate for and implement SRE best practices, including SLOs/SLIs, capacity planning, and reliability reviews.
