Observability Engineer / Site Reliability Engineer (SRE)

Location: Bucharest, Romania
Compensation: Competitive salary + performance-based quarterly bonuses

About the Role:
Diomedes Technologies is seeking an Observability Engineer to join our team and play a critical role in maintaining the reliability and performance of our production systems. This position focuses on ensuring real-time visibility into our infrastructure and applications, enabling us to proactively monitor system health and resolve issues swiftly. If you’re passionate about observability, system optimization, and incident management in a fast-paced environment, this role is for you.

Key Responsibilities:

Monitoring and Observability:

Design, build, and maintain Grafana dashboards to visualize critical system metrics and application performance.
Develop custom Prometheus metrics to monitor systems and applications effectively.
Configure, test, and deploy alerting rules in Prometheus and Grafana to ensure system stability and quick issue detection.

Incident Management:

Continuously monitor production systems to identify and mitigate performance issues before they affect system availability.
Investigate, diagnose, and resolve system alerts in a timely manner to minimize downtime and service disruption.
Provide detailed root-cause analysis for production incidents and suggest preventative measures to avoid recurrence.

System Performance and Optimization:

Analyze trends in system and application performance data to proactively detect potential issues.
Collaborate with development teams to instrument applications for enhanced observability and improve monitoring coverage.

Tooling and Automation:

Evaluate, integrate, and implement new monitoring tools to enhance system visibility and reporting.
Automate common monitoring, alerting, and incident management tasks to reduce manual intervention and improve operational efficiency.

Documentation and Training:

Document observability processes, Grafana dashboards, Prometheus metrics, and alerting configurations.
Provide training and support to team members on best practices for using Grafana and Prometheus to ensure effective monitoring and observability.

What We’re Looking For:

Technical Skills:

Strong experience with Grafana and Prometheus, including dashboard development, metric collection, and alerting configurations.
Proficiency in setting up and managing monitoring systems, identifying key system metrics, and ensuring optimal visibility.
Hands-on experience with incident response, system troubleshooting, and performance tuning in production environments.
Familiarity with automating common tasks related to monitoring, alerting, and incident management.

Problem-Solving and Execution:

Strong analytical skills to identify trends, resolve performance issues, and optimize system reliability.
Proactive mindset with the ability to detect issues before they impact production.
Ability to work in a fast-paced environment, collaborating with cross-functional teams to ensure system health.

Qualifications:

Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).

Why Join Diomedes Technologies?

Play a crucial role in ensuring the reliability and performance of our cutting-edge systems.
Work with a team of talented engineers and technologists in a fast-paced, high-impact environment.
Collaborate closely with development teams to improve observability and enhance system performance.
Be part of a culture that values innovation, teamwork, and continuous improvement.

If you’re an experienced Observability Engineer/SRE passionate about monitoring, incident management, and system performance, we encourage you to apply.

Send your resume to [email protected].