Observability Engineer / Site Reliability Engineer (SRE)

Location: Bucharest, Romania
Compensation: Competitive salary + performance-based quarterly bonuses

About the Role:
Diomedes Technologies is seeking an Observability Engineer to join our team and play a critical role in maintaining the reliability and performance of our production systems. This position focuses on ensuring real-time visibility into our infrastructure and applications, enabling us to proactively monitor system health and resolve issues swiftly. If you’re passionate about observability, system optimization, and incident management in a fast-paced environment, this role is for you.


Key Responsibilities:

Monitoring and Observability:

  • Design, build, and maintain Grafana dashboards to visualize critical system metrics and application performance.
  • Develop custom Prometheus metrics to monitor systems and applications effectively.
  • Configure, test, and deploy alerting rules in Prometheus and Grafana so that issues are detected quickly and reliably.

Incident Management:

  • Continuously monitor production systems to identify and mitigate performance issues before they affect system availability.
  • Investigate, diagnose, and resolve system alerts in a timely manner to minimize downtime and service disruption.
  • Provide detailed root-cause analysis for production incidents and suggest preventative measures to avoid recurrence.

System Performance and Optimization:

  • Analyze trends in system and application performance data to proactively detect potential issues.
  • Collaborate with development teams to instrument applications for enhanced observability and improve monitoring coverage.

Tooling and Automation:

  • Evaluate, integrate, and implement new monitoring tools to enhance system visibility and reporting.
  • Automate common monitoring, alerting, and incident management tasks to reduce manual intervention and improve operational efficiency.

Documentation and Training:

  • Document observability processes, Grafana dashboards, Prometheus metrics, and alerting configurations.
  • Train and support team members in Grafana and Prometheus best practices for monitoring and observability.

What We’re Looking For:

Technical Skills:

  • Strong experience with Grafana and Prometheus, including dashboard development, metric collection, and alerting configurations.
  • Proficiency in setting up and managing monitoring systems, identifying key system metrics, and ensuring optimal visibility.
  • Hands-on experience with incident response, system troubleshooting, and performance tuning in production environments.
  • Familiarity with automating common tasks related to monitoring, alerting, and incident management.

Problem-Solving and Execution:

  • Strong analytical skills to identify trends, resolve performance issues, and optimize system reliability.
  • Proactive mindset with the ability to detect issues before they impact production.
  • Ability to work in a fast-paced environment, collaborating with cross-functional teams to ensure system health.

Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).

Why Join Diomedes Technologies?

  • Play a crucial role in ensuring the reliability and performance of our cutting-edge systems.
  • Work with a team of talented engineers and technologists in a fast-paced, high-impact environment.
  • Collaborate closely with development teams to improve observability and enhance system performance.
  • Be part of a culture that values innovation, teamwork, and continuous improvement.

If you’re an experienced Observability Engineer or SRE who is passionate about monitoring, incident management, and system performance, we encourage you to apply.

Send your resume to [email protected].