Neel Nanda
@NeelNanda5
Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
ID:1542528075128348674
http://neelnanda.io 30-06-2022 15:18:58
1.8K Tweets
13.4K Followers
89 Following
Excited to share what I've been working on alongside Philippe Chlenski under Neel Nanda's mentorship: finding interpretable feature circuits in LLMs using a tool called transcoders!
Blogpost: lesswrong.com/posts/YmkjnWtZ…
Code: github.com/jacobdunefsky/…
(1/8)
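The transcoder idea behind this thread can be sketched in a few lines: unlike a standard sparse autoencoder, which reconstructs its own input, a transcoder learns to map an MLP layer's *input* to the layer's *output* through a wide, sparsely-firing hidden layer, so each hidden feature is a candidate interpretable unit of the MLP's computation. This is a minimal illustrative sketch, not the linked repo's code; the class name, dimensions, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # hypothetical sizes, not from the tweet


class Transcoder:
    """Sparse stand-in for an MLP layer: maps the MLP's input to the
    MLP's output through a wide, sparsely-firing feature layer."""

    def __init__(self, d_model, d_hidden):
        self.W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
        self.b_dec = np.zeros(d_model)

    def features(self, x):
        # ReLU gives a sparse, non-negative feature vector.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def __call__(self, x):
        # Trained (not shown) to match the original MLP's output on x.
        return self.features(x) @ self.W_dec + self.b_dec


tc = Transcoder(d_model, d_hidden)
x = rng.normal(size=d_model)
f = tc.features(x)   # sparse feature activations
y = tc(x)            # approximation of the MLP's output
```

Because each output is a linear combination of decoder rows weighted by sparse feature activations, circuits can be traced feature-to-feature across layers.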
ML Alignment & Theory Scholars has 1,220 summer applicants and enough funding to accept 2.5% of them (ideally 7%). Accepting donations via existence.org and manifund.org/projects/mats-… at $24k/scholar. Help us support mentors like Neel Nanda, Owain Evans, Ethan Perez, Evan Hubinger, and more!
New Google DeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders.
They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI: 0–13% improvement). Joint w/ Arthur Conmy
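The core trick in a gated SAE can be sketched briefly: a binary gate path decides *which* features fire, while a separate magnitude path decides *how much*, with both paths sharing one encoder matrix (the magnitude path rescales it per-feature). This decouples detection from magnitude estimation, which is how the gated variant fires fewer features at equal reconstruction quality. A minimal sketch under assumed names and dimensions, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical sizes, not from the tweet


class GatedSAE:
    """Sketch of a gated sparse autoencoder: a binary gate picks which
    features are active; a tied, rescaled magnitude path sets their size."""

    def __init__(self, d_model, d_sae):
        self.W = rng.normal(0, 0.1, (d_model, d_sae))  # shared encoder weights
        self.b_gate = np.zeros(d_sae)                  # gate-path bias
        self.r_mag = np.zeros(d_sae)                   # per-feature log rescale
        self.b_mag = np.zeros(d_sae)                   # magnitude-path bias
        self.W_dec = rng.normal(0, 0.1, (d_sae, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        pre = x @ self.W
        gate = (pre + self.b_gate) > 0                          # which features fire
        mag = np.maximum(pre * np.exp(self.r_mag) + self.b_mag, 0.0)  # how much
        return gate * mag                                       # gated activations

    def __call__(self, x):
        return self.encode(x) @ self.W_dec + self.b_dec


sae = GatedSAE(d_model, d_sae)
x = rng.normal(size=d_model)
f = sae.encode(x)   # sparse, gated feature activations
x_hat = sae(x)      # reconstruction of the input activation vector
```

The binary gate is non-differentiable, so training uses an auxiliary loss on the gate path's pre-activations; that machinery is omitted here.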