Your mission
Join us at Rhesis AI – Open-source test generation and management for Gen AI applications that deliver value, not surprises!
At Rhesis AI, we empower organizations to develop and deploy Gen AI applications that meet high standards for reliability, robustness, and compliance. As the creators of an open-source solution for test generation and management, we enable AI teams to build context-specific tests and collaborate directly with domain experts.
We're currently part of the K.I.E.Z. Accelerator at the Merantix AI Campus in Berlin, where we’re building the testing infrastructure Gen AI needs to earn trust at scale.
If you’re passionate about advancing trustworthy AI through practical tools and collaborative infrastructure, we invite you to join our mission.
Your profile
This role is about understanding how to evaluate model behavior in realistic scenarios: multi-turn conversations, ambiguous queries, edge-case prompts, or compliance-critical interactions. You’ll design the metrics, evaluation logic, and infrastructure that help customers ensure their Gen AI applications are robust, reliable, and compliant.
What You’ll Do
- Develop evaluation strategies tailored to LLM-based systems, especially in production-like environments with nuanced conversational behavior or task-specific requirements.
- Design and refine metrics that capture key quality dimensions: robustness (handling edge cases and adversarial prompts), reliability (predictable, consistent behavior), and compliance (alignment with user, business, or regulatory expectations).
- Build automated pipelines that evaluate Gen AI application performance across real-world use cases, including prompt variation, context manipulation, and reasoning-based tasks.
- Work across disciplines with engineers and product leads to formalize evaluation goals, define benchmarks, and iterate on evaluation logic as the system evolves.
- Integrate emerging techniques and academic insights (e.g., adaptive testing, behavioral testing, hallucination metrics) into a practical, production-grade evaluation stack.
You’re a Great Fit If You Have:
Core Strengths:
- Experience evaluating natural language systems, especially LLM-based or generative applications (e.g., chatbots, RAG systems, decision support tools).
- Strong intuition for designing meaningful evaluation metrics beyond accuracy, e.g., for measuring factuality, coherence, sensitivity, or policy violations.
- Proficiency in Python and familiarity with automation tools and evaluation libraries (e.g., LangChain, OpenAI Evals, PromptLayer, custom frameworks).
- Deep curiosity about how language models behave in unexpected or high-stakes contexts, and a drive to improve their reliability.
Nice to Have:
- Exposure to Gen AI alignment methods, LLMOps practices, or compliance-related validation (e.g., GDPR-sensitive outputs, safety constraints).
- Familiarity with academic and industry benchmarks (e.g., TruthfulQA, MMLU), and an understanding of their limitations in real-world scenarios.
- Experience working in startup environments, OSS ecosystems, or cross-functional teams tackling novel AI product challenges.
Why us?
We’re excited to offer a fixed one-year contract (with a very likely extension), starting 1 July or 15 July, along with a range of benefits to support our team members, including:
- Work at the forefront of Gen AI: Collaborate with some of the most innovative companies building LLM applications. Contribute to the trustworthiness of AI by shaping open-source tools that define how Gen AI is tested and validated.
- Flexible work arrangements: We understand the importance of work-life balance and offer flexible working options to accommodate personal needs and preferences. We have offices in Berlin (AI Campus) and Potsdam (Griebnitzsee).
- Compensation: We offer salaries and benefits tailored to your experience and qualifications, along with the opportunity to gain ownership in the company.
- A supportive and collaborative work environment: We foster a culture of teamwork, collaboration, and mutual respect, where every team member is valued and supported in their professional and personal growth.
At Rhesis AI, we value diversity and inclusion, believing that diverse perspectives enrich our team and drive innovation. We encourage applications from individuals of all backgrounds, regardless of gender, nationality, religion, or other personal characteristics. Even if you don’t meet every requirement listed, we encourage you to apply—your unique skills and experiences could be exactly what we need to succeed.
Ready to join us?
If you’re passionate about QA engineering, excited by the prospect of working with cutting-edge AI technology, and committed to making AI responsible, we’d love to hear from you!
Apply now and help us build solutions that shape the future of AI!
About us
At Rhesis AI, we’re driven by the goal of making AI evaluation and testing seamless, thorough, and accessible. We’re not just another tech company – we’re building solutions to ensure that Gen AI applications are reliable, resilient, and ready to meet the demands of real-world use. Our focus is on providing a comprehensive, automated testing platform that validates AI applications across diverse scenarios and industries, helping businesses confidently deploy Gen AI.