Build better GenAI for
images,
with human feedback
Evaluate your models on metrics that truly matter,
using our self-serve platform and ready-to-use subjective studies.
Design the perfect study for your data from our 13+ highly configurable experiments
Evaluate your models, effortlessly
Well-designed studies with human observers remain the gold standard for evaluating the quality of audio-visual content, but are difficult to get right. Mabyduck allows you to start collecting results within minutes of uploading your data, with well-tested experiments designed for online participation.
Our real-time analytics help you to quickly make sense of your results. You can compare results between different rater pools, such as crowd-sourced participants and expert raters, and choose the most cost-effective rater pool for your data.
Create your own
rubrics and leaderboards
Create private leaderboards to track the performance of your internal models over time and to encourage healthy competition within your team.
When you're ready, share public rubrics with the world to demonstrate the performance of your models on metrics that truly matter.
Make every data point count,
with active selection
Our active selection strategies automatically decide which methods to evaluate next. On an image compression dataset (CLIC 2024), for example, active selection reduced rater time by 34% compared to uniform sampling while delivering the same signal.
This comes on top of the savings from using well-qualified raters: on the same dataset, pre-screened raters completed the task 32% faster than raters who had only passed the basic filters commonly available on crowd-sourcing platforms.
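To illustrate the idea behind active selection (this is a generic sketch, not Mabyduck's actual algorithm), one simple rule is to direct the next batch of ratings at the method whose score estimate is currently least certain. The function and data below are hypothetical:

```python
import math

def next_method_to_rate(ratings):
    """Pick the method whose mean rating is least certain,
    using the standard error of its ratings as a simple
    uncertainty proxy (a stand-in for fancier criteria)."""
    def standard_error(scores):
        n = len(scores)
        if n < 2:
            return float("inf")  # unrated methods get top priority
        mean = sum(scores) / n
        var = sum((s - mean) ** 2 for s in scores) / (n - 1)
        return math.sqrt(var / n)
    return max(ratings, key=lambda m: standard_error(ratings[m]))

# Hypothetical ratings collected so far, per method:
ratings = {
    "model_a": [4, 5, 4, 4, 5, 4],     # many, consistent
    "model_b": [2, 5, 1, 4],           # few, noisy
    "model_c": [3, 3, 3, 3, 3, 3, 3],  # many, consistent
}
print(next_method_to_rate(ratings))  # -> model_b
```

Compared to uniform sampling, which would keep spending raters on `model_a` and `model_c` even after their scores have stabilised, this rule concentrates effort where it still changes the conclusion, which is where the rater-time savings come from.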
Localised into 8 languages
Our experiments and rater pools are available in 8 different languages, with more languages available on request.
Data quality that you can trust
Our raters have passed rigorous pre-screenings that test both their abilities and their hardware's capabilities. We use additional automated quality controls and manual spot checks to ensure that the quality of our rater pools remains high at all times.
We also provide technical support to raters throughout the duration of your experiment to ensure smooth data collection, allowing you to focus on building better machine learning models.