Chris Olah (@ch402)'s Twitter Profile
Chris Olah

@ch402

Reverse engineering neural networks at @AnthropicAI. DMs open! Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

ID:153196789

Link: http://colah.github.io
Joined: 07-06-2010 23:08:04

5.2K Tweets

90.9K Followers

173 Following

Anthropic (@AnthropicAI)

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

Neel Nanda (@NeelNanda5)

Announcing a progress update from the Google DeepMind mech interp team! Inspired by Anthropic's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.

Neel Nanda (@NeelNanda5)

Great visualisation library for Sparse Autoencoder features from Callum McDougall! My team has already been finding it super useful, go check it out:
lesswrong.com/posts/nAhy6Zqu…

Chris Olah (@ch402)

I'm incredibly excited to have Craig joining us on the Anthropic Interpretability team!

I've been a huge fan of Colaboratory for nearly a decade (I used it internally at Google!) and have really admired Craig's work on it.

Craig Citro (@craigcitro)

big news for me: after 5000+ days and too many excellent colleagues to mention, I'm leaving Google.

it's been a fantastic ride, and the hardest part about leaving is saying goodbye to my teammates and colleagues.

Joshua Batson (@thebasepoint)

Next in our series of small monthly updates from the interpretability team, including a few fun things:

1. We use feature attribution to find features related to specific completions (following the athlete-sport association example of Neel Nanda)

Jesse Mu (@jayelmnop)

We’re hiring for the adversarial robustness team at Anthropic!

As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)

Chris Olah (@ch402)

I continue to be impressed by the work of Neel's scholars -- very excited to see what the next group will do!

Adam Jermyn (@AdamSJermyn)

Reflections on Qualitative Research:
transformer-circuits.pub/2024/qualitati…

[h/t to Chris Olah for originating & driving this!]
