Site Reliability Engineer
Perennial Systems
Eligibility: Experienced Professionals
Role Overview:
- We are seeking a Site Reliability Engineer (SRE) with strong expertise in observability and monitoring using Datadog. The ideal candidate will be responsible for ensuring system reliability, performance, scalability, and availability across cloud-native and distributed environments by leveraging Datadog for monitoring, alerting, and incident management.
Key Responsibilities:
- Design, implement, and maintain comprehensive monitoring and observability solutions using Datadog (APM, Infrastructure Monitoring, Logs, RUM, Synthetic Monitoring).
- Define and track SLIs, SLOs, and SLAs; build dashboards and alerts aligned with reliability and business goals.
- Proactively identify performance bottlenecks, system anomalies, and availability risks using Datadog metrics and traces.
- Lead incident response, root cause analysis (RCA), and postmortems; improve alert quality and reduce noise.
- Collaborate with engineering, DevOps, and security teams to improve system resilience and operational excellence.
- Automate monitoring, alerting, and remediation workflows using Infrastructure as Code (Terraform/CloudFormation).
- Support on-call rotations and continuously improve reliability practices.
- Integrate Datadog with CI/CD pipelines and cloud services for end-to-end visibility.
Required Skills & Experience:
- 3+ years of experience as an SRE, DevOps Engineer, or Production Support Engineer.
- Strong hands-on experience with Datadog (dashboards, monitors, APM, logs, synthetics).
- Solid understanding of SRE principles: error budgets, toil reduction, availability, latency, and reliability.
- Experience with cloud platforms (AWS, Azure, or GCP).
- Proficiency in Linux/Unix systems and networking fundamentals.
- Experience with containers and orchestration (Docker, Kubernetes).
- Scripting experience in Python, Bash, or Go.
- Familiarity with incident management and on-call best practices.
Good to Have:
- Experience implementing custom Datadog metrics and distributed tracing.
- Knowledge of CI/CD tools (Jenkins, GitHub Actions, GitLab CI).
- Experience with configuration management and IaC tools (Terraform, Ansible).
- Exposure to security monitoring and compliance observability.
- Prior experience scaling high-traffic, distributed systems.
If an employer asks you to pay any kind of fee, please notify us immediately. Unstop does not charge applicants any fee and does not allow other companies to do so either.
Important dates & deadlines:
- Registration Deadline: 3 Feb'26, 12:00 AM IST
Additional Information:
- Job Location(s): Pune
- Salary: Not Disclosed
- Working Days: 5 Days
- Job Type: In Office
- Job Timing: Full Time
*This opportunity has been listed by Perennial Systems.