Neel Nanda (@NeelNanda5)'s Twitter Profile
Neel Nanda

@NeelNanda5

Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

ID: 1542528075128348674

Link: http://neelnanda.io · Joined: 30-06-2022 15:18:58

1.8K Tweets

13.4K Followers

89 Following

Jacob Dunefsky (@jacobdunefsky)'s Twitter Profile Photo

Excited to share what I've been working on alongside Philippe Chlenski under Neel Nanda's mentorship: finding interpretable feature circuits in LLMs using a tool called transcoders!
Blogpost: lesswrong.com/posts/YmkjnWtZ…
Code: github.com/jacobdunefsky/…
(1/8)
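The tweet doesn't include code, but the core transcoder idea can be sketched briefly: train a wide, sparse ReLU layer to imitate an MLP's input→output map, so the MLP's computation factors through interpretable features. This is a minimal sketch under that reading; all names, shapes, and the L1 coefficient are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    # Wide sparse hidden layer: each row of W_enc is a candidate feature.
    f = np.maximum(W_enc @ x + b_enc, 0.0)   # sparse feature activations
    return W_dec @ f + b_dec, f              # approximated MLP output, features

def transcoder_loss(x, mlp_out, params, l1_coef=1e-3):
    # Trained to match the MLP's output (MSE) while keeping features sparse (L1).
    out, f = transcoder_forward(x, *params)
    return np.mean((out - mlp_out) ** 2) + l1_coef * np.abs(f).sum()

# Toy usage with random parameters and a stand-in for the real MLP's output.
rng = np.random.default_rng(0)
d_model, n_feat = 4, 16
params = (rng.normal(size=(n_feat, d_model)), np.zeros(n_feat),
          rng.normal(size=(d_model, n_feat)), np.zeros(d_model))
x = rng.normal(size=d_model)
mlp_out = rng.normal(size=d_model)           # hypothetical MLP output to imitate
loss = transcoder_loss(x, mlp_out, params)
```

Unlike a standard SAE (which reconstructs its own input), the target here is the MLP's *output*, which is what makes the learned features usable as circuit nodes through the MLP.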

Ryan Kidd (@ryan_kidd44)'s Twitter Profile Photo

ML Alignment & Theory Scholars has 1220 summer applicants and enough funding to accept 2.5% (ideally 7%). Accepting donations via existence.org and manifund.org/projects/mats-… at $24k/scholar. Help us support mentors like Neel Nanda, Owain Evans, Ethan Perez, Evan Hubinger, and more!

Neel Nanda (@NeelNanda5)'s Twitter Profile Photo

Great work from my MATS scholars!

Refusal in LLMs is mediated by a single vector - injecting it means harmless statements are refused, ablating it everywhere lets harmful prompts through

We can jailbreak model *weights* by projecting out this direction, no fine tuning needed!
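The weight edit described in the tweet amounts to projecting the refusal direction out of every matrix that writes into the residual stream. A minimal numpy sketch of that projection, assuming a single direction vector in the residual stream (function and variable names are my own, not the paper's):

```python
import numpy as np

def project_out_direction(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return W with the `direction` component removed from its output.

    W: (d_model, d_in) matrix whose output lives in the residual stream.
    direction: (d_model,) candidate refusal direction.
    After the edit, (W' @ x) has zero component along `direction` for any x.
    """
    d = direction / np.linalg.norm(direction)  # unit refusal direction
    return W - np.outer(d, d) @ W              # apply (I - d d^T) on the left

# Demo: the edited weights can no longer write along the ablated direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
refusal_dir = rng.normal(size=8)               # stand-in for the found direction
W_edited = project_out_direction(W, refusal_dir)
x = rng.normal(size=4)
residual_component = refusal_dir @ (W_edited @ x)   # ~0 up to float error
```

Because the projection is baked into the weights, no fine-tuning and no runtime intervention are needed: every forward pass simply cannot write along that direction.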

Neel Nanda (@NeelNanda5)'s Twitter Profile Photo

It's a great week for mech interp releases! I'm very excited to try out Anthropic's new recommendations for stable dictionary learning

Senthooran Rajamanoharan (@sen_r)'s Twitter Profile Photo

New Google DeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders.

They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ Arthur Conmy
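The architectural idea behind Gated SAEs is to split the encoder into two paths: a binary gate that decides *which* features fire and a magnitude path that decides *how strongly*, with the two encoders sharing weights up to a per-feature scale. A forward-pass sketch under that reading; parameter names and the weight-tying detail are my paraphrase, not the paper's code:

```python
import numpy as np

def gated_sae_forward(x, W_gate, r_mag, b_gate, b_mag, W_dec, b_dec):
    """Minimal Gated SAE forward pass (sketch; names are illustrative)."""
    x_cent = x - b_dec                           # center by the decoder bias
    pi_gate = W_gate @ x_cent + b_gate           # gating pre-activation
    gate = (pi_gate > 0).astype(x.dtype)         # binary on/off per feature
    W_mag = np.exp(r_mag)[:, None] * W_gate      # magnitude encoder tied to gate
    mag = np.maximum(W_mag @ x_cent + b_mag, 0)  # ReLU feature magnitudes
    f = gate * mag                               # gated feature activations
    return W_dec @ f + b_dec, f                  # reconstruction, features

# Toy usage with random parameters.
rng = np.random.default_rng(1)
d_model, n_feat = 4, 6
x = rng.normal(size=d_model)
W_gate = rng.normal(size=(n_feat, d_model))
W_dec = rng.normal(size=(d_model, n_feat))
recon, feats = gated_sae_forward(x, W_gate, np.zeros(n_feat), np.zeros(n_feat),
                                 np.zeros(n_feat), W_dec, np.zeros(d_model))
```

Separating the fire/don't-fire decision from the magnitude is what lets the model avoid the shrinkage that an L1 penalty imposes on a single ReLU encoder, which is the mechanism behind the "equally good reconstructions with fewer firing features" claim.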
