@ghandeharioun : 🧵Responses to adversarial queries can still remain latent in a safety-tuned model. Why are they revealed sometimes, but not others? And what are the mechanics of this latent misalignment? Does it matter *who* the user is? (1/n) • TwiDoom

Asma Ghandeharioun

@ghandeharioun

+ Follow

Sr. Research Scientist @GoogleDeepMind working on ML interpretability & human-centered AI, PhD from @MIT

ID: 708149090740281345

linkhttps://alum.mit.edu/www/asma_gh calendar_today11-03-2016 04:34:34

125 Tweet

2,2K Followers

522 Following

Asma Ghandeharioun

@ghandeharioun

a month ago

🧵Responses to adversarial queries can still remain latent in a safety-tuned model. Why are they revealed sometimes, but not others? And what are the mechanics of this latent misalignment? Does it matter *who* the user is? (1/n)

thumb_up_off_alt62

chat_bubble_outline1

repeat10

shareShare