Hi everyone,
Preference Tuning (DPO) is now supported for Gemini 2.5 Flash and Flash-Lite on Vertex AI, and I want to share a new tutorial for those of you looking to align Gemini models with specific user preferences.
How it works:
- You provide a JSONL file containing prompts with paired responses: one chosen (preferred) and one rejected.
- The model adjusts its output distribution to increase the likelihood of the preferred responses without needing a separate reward model.
- We recommend a two-step approach: tune with SFT first on the preferred responses, then continue tuning from that checkpoint with DPO to refine the behavior.
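To make the data format concrete, here is a minimal sketch of building one line of a preference-tuning JSONL file. The field names (`input`, `chosen`, `rejected`) are illustrative assumptions, not the official schema; check the Vertex AI preference-tuning documentation for the exact format.

```python
import json

# One preference example: a prompt plus a chosen/rejected response pair.
# Field names here are assumptions -- consult the Vertex AI docs for the
# exact schema expected by the tuning job.
example = {
    "input": {"prompt": "Summarize this release note in one sentence."},
    "chosen": "DPO tuning is now available for Gemini 2.5 Flash on Vertex AI.",
    "rejected": "There is a new thing you can do now, it is pretty cool.",
}

# Each line of the JSONL file is exactly one such example.
with open("preference_data.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

Every prompt in the file carries its own chosen/rejected pair, so no separate labels file is needed.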
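For intuition on the "no separate reward model" point: DPO optimizes a simple classification-style loss over the chosen/rejected pair, comparing the tuned policy against a frozen reference (the SFT checkpoint). Below is a toy per-example version of that loss from the DPO paper; the log-probabilities are made-up numbers for illustration, and Vertex AI handles all of this internally.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    measures how much more the policy prefers the chosen response than the
    frozen reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If policy and reference agree exactly, the margin is 0 and the loss is log 2.
# Pushing probability toward the chosen response drives the loss below log 2.
loss = dpo_loss(logp_chosen=-2.0, logp_rejected=-5.0,
                ref_logp_chosen=-3.0, ref_logp_rejected=-4.0)
```

Minimizing this loss raises the likelihood of chosen responses relative to rejected ones directly, which is why no reward model needs to be trained first.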
Here you can find the notebook and the documentation.
Happy building!
