Location: Bucharest, Romania
Compensation: Competitive salary + performance-based quarterly bonuses
About the Role:
Diomedes Technologies is seeking an Observability Engineer to join our team and play a critical role in maintaining the reliability and performance of our production systems. This position focuses on ensuring real-time visibility into our infrastructure and applications, enabling us to proactively monitor system health and resolve issues swiftly. If you’re passionate about observability, system optimization, and incident management in a fast-paced environment, this role is for you.
Key Responsibilities:
Monitoring and Observability:
- Design, build, and maintain Grafana dashboards to visualize critical system metrics and application performance.
- Develop custom Prometheus metrics to monitor systems and applications effectively.
- Configure, test, and deploy alerting rules in Prometheus and Grafana to ensure system stability and quick issue detection.
Incident Management:
- Continuously monitor production systems to identify and mitigate performance issues before they affect system availability.
- Investigate, diagnose, and resolve system alerts in a timely manner to minimize downtime and service disruption.
- Provide detailed root-cause analysis for production incidents and suggest preventative measures to avoid recurrence.
System Performance and Optimization:
- Analyze trends in system and application performance data to proactively detect potential issues.
- Collaborate with development teams to instrument applications for enhanced observability and improve monitoring coverage.
Tooling and Automation:
- Evaluate, integrate, and implement new monitoring tools to enhance system visibility and reporting.
- Automate common monitoring, alerting, and incident management tasks to reduce manual intervention and improve operational efficiency.
Documentation and Training:
- Document observability processes, Grafana dashboards, Prometheus metrics, and alerting configurations.
- Provide training and support to team members on best practices for monitoring and observability with Grafana and Prometheus.
What We’re Looking For:
Technical Skills:
- Strong experience with Grafana and Prometheus, including dashboard development, metric collection, and alerting configurations.
- Proficiency in setting up and managing monitoring systems, identifying key system metrics, and ensuring optimal visibility.
- Hands-on experience with incident response, system troubleshooting, and performance tuning in production environments.
- Familiarity with automating common tasks related to monitoring, alerting, and incident management.
Problem-Solving and Execution:
- Strong analytical skills to identify trends, resolve performance issues, and optimize system reliability.
- Proactive mindset with the ability to detect issues before they impact production.
- Ability to work in a fast-paced environment, collaborating with cross-functional teams to ensure system health.
Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
Why Join Diomedes Technologies?
- Play a crucial role in ensuring the reliability and performance of our cutting-edge systems.
- Work with a team of talented engineers and technologists in a fast-paced, high-impact environment.
- Collaborate closely with development teams to improve observability and enhance system performance.
- Be part of a culture that values innovation, teamwork, and continuous improvement.
If you’re an experienced Observability Engineer or SRE who is passionate about monitoring, incident management, and system performance, we encourage you to apply.
Send your resume to [email protected].