@abhishekunique7 : @sriyash__ @yanming_wan Given latent conditional reward, optimizing policies with this is hard, due to scale ambiguity in RLHF methods. We show that methods like self-play optimization (SPO from @g_k_swamy) can help, since rewards correspond to likelihoods instead of arbitrarily scaled utilities (3/7)

Abhishek Gupta

@abhishekunique7


Assistant Professor at University of Washington. I like robots, and reinforcement learning. Previously: post-doc at MIT, PhD at Berkeley

ID: 495550336

https://homes.cs.washington.edu/~abhgupta · 18-02-2012 02:50:08

358 Tweets

6.6K Followers

680 Following

Abhishek Gupta

@abhishekunique7

24 days ago

Sriyash Poddar Yanming Wan Given a latent-conditional reward, optimizing policies against it is hard because of scale ambiguity in RLHF methods. We show that methods like self-play optimization (SPO, from Gokul Swamy) can help, since their rewards correspond to likelihoods instead of arbitrarily scaled utilities (3/7)
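
A rough sketch of the point about scale, not the authors' code. It assumes a Bradley-Terry preference model and reads "scale ambiguity" as the fact that preferences depend only on utility differences, so a learned utility can carry an arbitrary offset (here a hypothetical `+100`, standing in for an offset that could differ per latent). The helper names `bt_pref_prob` and `win_rate_rewards` are made up for illustration. An SPO-style reward, taken here as the average probability of beating the other sampled responses, stays in [0, 1] and does not change with that offset.

```python
import math

def bt_pref_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over B (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def win_rate_rewards(utilities):
    """Illustrative SPO-style reward: mean probability of beating the other samples."""
    n = len(utilities)
    return [
        sum(bt_pref_prob(utilities[i], utilities[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# Hypothetical utilities for three sampled responses under one latent.
utilities = [0.2, 1.5, -0.7]
# Same preferences, arbitrarily shifted utilities (e.g. a different latent's offset).
shifted = [u + 100.0 for u in utilities]

base = win_rate_rewards(utilities)
after_shift = win_rate_rewards(shifted)

# Raw utilities differ by 100, but the win-rate rewards agree up to float error.
print(base)
print(after_shift)
```

Under these assumptions, a policy optimized against the raw utilities would see very different reward magnitudes for the two cases, while the win-rate rewards are the same likelihood-valued quantities in both.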

