RLHF for finer alignment with Gemma 3

How do we best align a model for human interaction? In RLHF we first learn a proxy for human preferences: the reward model (RM), which is later used to align a language policy. Yet the RM is an imperfect approximation of human preferences, and prolonged optimization against it inevitably leads to reward hacking. In this presentation we'll review techniques developed for Gemma to mitigate this hacking and allow for longer training.
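
As a rough illustration only (not Gemma's actual recipe, which the talk covers), one standard way to mitigate reward hacking is to regularize the policy toward a frozen reference model with a KL penalty, so the policy cannot drift far into regions where the RM is a poor proxy. The sketch below uses hypothetical names and values to show how such a penalty combines with a reward-model score.

# Illustrative sketch of a KL-regularized RLHF reward; all names and
# numbers are hypothetical, not taken from Gemma's training code.
import numpy as np

def kl_regularized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Combine a reward-model score with a KL penalty toward a reference policy.

    The KL term discourages the policy from exploiting blind spots of the
    reward model, one common mitigation for reward hacking.
    """
    # Per-sequence KL estimate between the current policy and the frozen reference.
    kl_penalty = np.sum(logprob_policy - logprob_ref)
    return rm_score - beta * kl_penalty

# Example: one completion scored by the RM, with per-token log-probs from
# the trained policy and from the frozen reference (e.g. SFT) model.
rm_score = 1.8                                  # hypothetical RM score
logprob_policy = np.array([-0.5, -1.2, -0.8])   # log pi_theta(y_t | x, y_<t)
logprob_ref = np.array([-0.7, -1.0, -1.1])      # log pi_ref(y_t | x, y_<t)

print(kl_regularized_reward(rm_score, logprob_policy, logprob_ref))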

Subscribe to Google for Developers → https://goo.gle/developers

Speakers: Louis Rouillard
Products Mentioned: Gemma
Category: Project
Tags: Google, developers, pr_pr: Gemma