Neel Nanda (@NeelNanda5)'s Twitter Profile
Neel Nanda

@NeelNanda5

Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

ID:1542528075128348674

Link: http://neelnanda.io · Joined: 30-06-2022 15:18:58

1.7K Tweets

13.2K Followers

89 Following

Anil Ananthaswamy (@anilananth)

I first heard Irina Rish mention grokking networks on Paul Middlebrooks's Brain Inspired podcast! That was in early '22. Years later, here's a story on grokking for Quanta Magazine, about the follow-up detective work of Neel Nanda, Ziming Liu, and others: quantamagazine.org/how-do-machine…

Neel Nanda (@NeelNanda5)

Extremely cool work from @Saprmarks! I think this is one of my favourite SAE papers since Towards Monosemanticity. I'm particularly excited about the use of error nodes, without which SAEs are a bit too janky to do reliable circuit analysis with

Neel Nanda (@NeelNanda5)

Applications to my MATS stream close on Friday April 12th - in just over a week! If you're excited about the idea of getting into mechanistic interpretability research, I'd love to get your application :) The app is designed to take a while, so I'd get started soon!

Neel Nanda (@NeelNanda5)

Great work from my MATS scholars Callum McDougall and Joseph Bloom, in honour of today's special occasion!

Turns out SAEs contain wild features, like a Neel Nanda feature, and this perseverance feature:
lesswrong.com/posts/BK8AMsNH…

Neel Nanda (@NeelNanda5)

Great visualisation library for Sparse Autoencoder features from Callum McDougall! My team has already been finding it super useful, go check it out:
lesswrong.com/posts/nAhy6Zqu…

johnny (@johnnylin)

1/ Introducing Neuronpedia: an open platform for interpretability research with hosting, visualizations, and tooling for Sparse Autoencoders (SAEs).

Let's try it out! ➡️

Neuronpedia lets us instantly test activations of SAE features with custom text. Here's a Star Wars feature:

Neel Nanda (@NeelNanda5)

Cool challenge, though don't assume the question! Do they specialise into anything at all? That's not at all obvious to me.

I would weakly guess that any given feature may be somewhat localised to an expert, but that similar features may be distributed across experts. In…

Neel Nanda (@NeelNanda5)

Cool post! I was very surprised by this result. My guess is that 'adding noise with the same norm as the SAE error' isn't a fair baseline, but I don't have a great story for why. I think better understanding WTF is going on here might help me understand SAEs better!
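The baseline questioned above, replacing the SAE reconstruction error with random noise of matched norm, can be sketched in a few lines of numpy. This is a toy illustration only; the shapes and the helper name are mine, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_like(err):
    """Random vector with the same L2 norm as the SAE error (the baseline in question)."""
    n = rng.normal(size=err.shape)
    return n * (np.linalg.norm(err) / np.linalg.norm(n))

x = rng.normal(size=128)                        # stand-in for a model activation
x_hat = x + rng.normal(scale=0.1, size=128)     # stand-in for its SAE reconstruction
err = x - x_hat                                 # the SAE error term

# The two interventions compared: true error vs norm-matched random noise.
x_noised = x_hat + noise_like(err)

# Both perturbations sit at the same distance from the reconstruction,
# yet (per the post) they affect downstream behaviour very differently.
assert np.isclose(np.linalg.norm(x - x_hat), np.linalg.norm(x_noised - x_hat))
```

The surprise is that matching the norm does not make the noise behaviourally equivalent to the structured error, which is what makes the baseline's fairness debatable.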

Neel Nanda (@NeelNanda5)

Sparse autoencoders are currently a big deal in mech interp, but there's not a good, concise intro to what they are. I'm currently taking a stab at writing one! Here's the draft TLDR:

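For readers without context, the object in question can be sketched in a few lines: a standard ReLU sparse autoencoder with an overcomplete dictionary and an L1 sparsity penalty. This is a toy numpy illustration with hypothetical dimensions, not code from the draft.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64          # activation dim; overcomplete SAE feature dim

W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features mostly zero
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)

# Training minimises reconstruction error plus an L1 penalty on the features,
# trading off faithfulness against sparsity/interpretability.
l1_coeff = 1e-3
loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
```

The hope is that each learned feature direction corresponds to an interpretable concept in the model's activations.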
Neel Nanda (@NeelNanda5)

I'm really excited about Neuronpedia's pivot to helping with sparse autoencoder research! Johnny has made a gorgeous UI for poking around inside models and I'm excited to see what new mech interp research this can enable/accelerate!

Chris Olah (@ch402)

Are you someone with management experience who would like to contribute to interpretability and AI safety? We're looking to hire another manager to help us continue growing our team, in partnership with our technical leads.

Neel Nanda (@NeelNanda5)

Cool work! I really like the result about adding a 'high ELO' steering vector to make it better at chess!

Neel Nanda (@NeelNanda5)

I've heard great things from Ethan's mentees, and they've put out cool work. If you want to get into alignment research, applying to MATS to work with Ethan is a great opportunity!

Neel Nanda (@NeelNanda5)

There's a new monthly update from the Anthropic mechanistic interpretability team! I'm particularly excited to see work building on work by me and others (David Bau, Mor Geva) reverse-engineering factual recall, this time through the lens of SAEs.
