Asma Ghandeharioun (@ghandeharioun) 's Twitter Profile
Asma Ghandeharioun

@ghandeharioun

Sr. Research Scientist @GoogleDeepMind working on ML interpretability & human-centered AI, PhD from @MIT

ID: 708149090740281345

linkhttps://alum.mit.edu/www/asma_gh calendar_today11-03-2016 04:34:34

125 Tweet

2,2K Followers

522 Following

Asma Ghandeharioun (@ghandeharioun) 's Twitter Profile Photo

🧵Responses to adversarial queries can still remain latent in a safety-tuned model. Why are they revealed sometimes, but not others? And what are the mechanics of this latent misalignment? Does it matter *who* the user is? (1/n)

🧵Responses to adversarial queries can still remain latent in a safety-tuned model. Why are they revealed sometimes, but not others? And what are the mechanics of this latent misalignment? Does it matter *who* the user is? (1/n)