Neel Nanda (@NeelNanda5)'s Twitter Profile
Neel Nanda

@NeelNanda5

Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

ID:1542528075128348674

Link: http://neelnanda.io · Joined: 30-06-2022 15:18:58

1.7K Tweets

13.2K Followers

89 Following

Anil Ananthaswamy (@anilananth)

I first heard Irina Rish mention grokking networks on Paul Middlebrooks's Brain Inspired podcast! That was in early '22. Years later, here's a story on grokking for Quanta Magazine, about the follow-up detective work of Neel Nanda, Ziming Liu, and others: quantamagazine.org/how-do-machine…

Neel Nanda (@NeelNanda5)

Extremely cool work from @Saprmarks! I think this is one of my favourite SAE papers since Towards Monosemanticity. I'm particularly excited about the use of error nodes, without which SAEs are a bit too janky to do reliable circuit analysis with

Neel Nanda (@NeelNanda5)

Applications to my MATS stream close on Friday April 12th - in just over a week! If you're excited about the idea of getting into mechanistic interpretability research, I'd love to get your application :) The app is designed to take a while, so I'd get started soon!

Neel Nanda (@NeelNanda5)

Great work from my MATS scholars Callum McDougall and Joseph Bloom, in honour of today's special occasion!

Turns out SAEs contain wild features, like a Neel Nanda feature, and this perseverance feature:
lesswrong.com/posts/BK8AMsNH…

Neel Nanda (@NeelNanda5)

Great visualisation library for Sparse Autoencoder features from Callum McDougall! My team has already been finding it super useful, go check it out:
lesswrong.com/posts/nAhy6Zqu…

johnny (@johnnylin)

1/ Introducing Neuronpedia: an open platform for interpretability research with hosting, visualizations, and tooling for Sparse Autoencoders (SAEs).

Let's try it out! ➡️

Neuronpedia lets us instantly test activations of SAE features with custom text. Here's a Star Wars feature:

Neel Nanda (@NeelNanda5)

Cool challenge, though don't assume the question! Do they specialise into anything at all? That's not at all obvious to me.

I would weakly guess that any given feature may be somewhat localised to an expert, but that similar features may be distributed across experts. In…

Neel Nanda (@NeelNanda5)

Cool post! I was very surprised by this result. My guess is that 'adding noise with the same norm as the SAE error' isn't a fair baseline, but I don't have a great story for why. I think better understanding WTF is going on here might help me understand SAEs better!
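The baseline questioned above, replacing the SAE reconstruction error with random noise of matched norm, can be sketched in a few lines of numpy. This is a toy illustration only; the shapes and the helper name are mine, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_like(err):
    """Random vector with the same L2 norm as the SAE error (the baseline in question)."""
    n = rng.normal(size=err.shape)
    return n * (np.linalg.norm(err) / np.linalg.norm(n))

x = rng.normal(size=128)                        # stand-in for a model activation
x_hat = x + rng.normal(scale=0.1, size=128)     # stand-in for its SAE reconstruction
err = x - x_hat                                 # the SAE error term

# The two interventions compared: true error vs norm-matched random noise.
x_noised = x_hat + noise_like(err)

# Both perturbations sit at the same distance from the reconstruction,
# yet (per the post) they affect downstream behaviour very differently.
assert np.isclose(np.linalg.norm(x - x_hat), np.linalg.norm(x_noised - x_hat))
```

The surprise is that matching the norm does not make the noise behaviourally equivalent to the structured error, which is what makes the baseline's fairness debatable.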

Neel Nanda (@NeelNanda5)

Sparse autoencoders are currently a big deal in mech interp, but there's not a good, concise intro to what they are. I'm currently taking a stab at writing one! Here's the draft TLDR:

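For readers without context, the object in question can be sketched in a few lines: a standard ReLU sparse autoencoder with an overcomplete dictionary and an L1 sparsity penalty. This is a toy numpy illustration with hypothetical dimensions, not code from the draft.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64          # activation dim; overcomplete SAE feature dim

W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features mostly zero
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)

# Training minimises reconstruction error plus an L1 penalty on the features,
# trading off faithfulness against sparsity/interpretability.
l1_coeff = 1e-3
loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
```

The hope is that each learned feature direction corresponds to an interpretable concept in the model's activations.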
Neel Nanda (@NeelNanda5)

I'm really excited about Neuronpedia's pivot to helping with sparse autoencoder research! Johnny has made a gorgeous UI for poking around inside models and I'm excited to see what new mech interp research this can enable/accelerate!

Chris Olah (@ch402)

Are you someone with management experience who would like to contribute to interpretability and AI safety? We're looking to hire another manager to help us continue growing our team, in partnership with our technical leads.

Neel Nanda (@NeelNanda5)

Cool work! I really like the result about adding a 'high ELO' steering vector to make it better at chess!

Neel Nanda (@NeelNanda5)

I've heard great things from Ethan's mentees, and they've put out cool work. If you want to get into alignment research, applying to MATS to work with Ethan is a great opportunity!

Neel Nanda (@NeelNanda5)

There's a new monthly update from the Anthropic mechanistic interpretability team! I'm particularly excited to see work building on work by me and others (David Bau, Mor Geva) reverse-engineering factual recall, this time through the lens of SAEs.
