Abhishek Gupta
@abhishekunique7
Assistant Professor at University of Washington. I like robots, and reinforcement learning. Previously: post-doc at MIT, PhD at Berkeley
ID: 495550336
https://homes.cs.washington.edu/~abhgupta
Joined: 18-02-2012 02:50:08
358 Tweets
6.6K Followers
680 Following
Sriyash Poddar Yanming Wan Given a latent-conditioned reward, optimizing policies against it is hard, due to scale ambiguity in RLHF methods. We show that methods like Self-Play Preference Optimization (SPO, from Gokul Swamy) can help, since rewards then correspond to preference likelihoods instead of arbitrarily scaled utilities. (3/7)
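To illustrate the ambiguity the tweet refers to (this sketch is not from the thread): under a Bradley-Terry preference model, adding a constant to all rewards leaves the preference likelihood unchanged, so a reward learned separately per latent context is only identified up to such transformations and its values are not comparable across contexts. Win rates, by contrast, are probabilities in [0, 1] and are directly comparable. A minimal NumPy sketch with made-up rewards:

```python
import numpy as np

def bradley_terry(r_a, r_b):
    # P(a preferred over b) under a Bradley-Terry model of the rewards
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

rng = np.random.default_rng(0)
rewards = rng.normal(size=5)  # made-up rewards for 5 candidate responses

# Adding any constant to the rewards (e.g. a different offset per latent
# context) explains the same preference data equally well, so absolute
# reward values carry no meaning across contexts:
shifted = rewards + 100.0
print(np.allclose(bradley_terry(rewards[0], rewards[1]),
                  bradley_terry(shifted[0], shifted[1])))  # True

# A win rate, by contrast, is a probability in [0, 1]: the average
# preference for one response over the others is comparable across
# contexts, the kind of quantity SPO-style self-play methods optimize.
win_rate = np.mean([bradley_terry(rewards[0], r) for r in rewards[1:]])
print(f"win rate of response 0: {win_rate:.3f}")
```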