Build better GenAI for
images,
with human feedback
Evaluate your models on metrics that truly matter,
using our self-serve platform and ready-to-use subjective studies.
Design the perfect study for your data from our 13+ highly configurable experiments
Evaluate your models, effortlessly
Well-designed studies with human observers remain the gold standard for evaluating the quality of audio-visual content, but are difficult to get right. Mabyduck allows you to start collecting results within minutes of uploading your data, with well-tested experiments designed for online participation.
Our real-time analytics help you to quickly make sense of your results. You can compare results between different rater pools, such as crowd-sourced participants and expert raters, and choose the most cost-effective rater pool for your data.
Create your own
rubrics and leaderboards
Create private leaderboards to track the performance of your internal models over time and to encourage healthy competition within your team.
When you're ready, share public rubrics with the world to demonstrate the performance of your models on metrics that truly matter.
Make every data point count,
with active selection
Our active selection strategies automatically decide which methods to evaluate next. On an image compression dataset (CLIC 2024), for example, active selection reduced rater time by 34% compared to uniform sampling while delivering the same signal.
This comes on top of the savings from using well-qualified raters: on the same dataset, pre-screened raters completed the task 32% faster than raters who had only passed the basic filters commonly available on crowd-sourcing platforms.
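To illustrate the idea behind active selection (this is a generic sketch, not Mabyduck's actual algorithm), one simple rule is to direct the next batch of ratings at the method whose score estimate is currently least certain. The function and data below are hypothetical:

```python
import math

def next_method_to_rate(ratings):
    """Pick the method whose mean rating is least certain,
    using the standard error of its ratings as a simple
    uncertainty proxy (a stand-in for fancier criteria)."""
    def standard_error(scores):
        n = len(scores)
        if n < 2:
            return float("inf")  # unrated methods get top priority
        mean = sum(scores) / n
        var = sum((s - mean) ** 2 for s in scores) / (n - 1)
        return math.sqrt(var / n)
    return max(ratings, key=lambda m: standard_error(ratings[m]))

# Hypothetical ratings collected so far, per method:
ratings = {
    "model_a": [4, 5, 4, 4, 5, 4],     # many, consistent
    "model_b": [2, 5, 1, 4],           # few, noisy
    "model_c": [3, 3, 3, 3, 3, 3, 3],  # many, consistent
}
print(next_method_to_rate(ratings))  # -> model_b
```

Compared to uniform sampling, which would keep spending raters on `model_a` and `model_c` even after their scores have stabilised, this rule concentrates effort where it still changes the conclusion, which is where the rater-time savings come from.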
Localised into 8 languages
Our experiments and rater pools are available in 8 different languages, with more languages available on request.
Data quality that you can trust
Our raters have passed rigorous pre-screenings that test both their abilities and their hardware's capabilities. We use additional automated quality controls and manual spot checks to ensure that the quality of our rater pools remains high at all times.
We also provide technical support to raters throughout the duration of your experiment to ensure smooth data collection, allowing you to focus on building better machine learning models.