Chris Olah (@ch402) Twitter Tweets • TwiCopy

Chris Olah

@ch402

Reverse engineering neural networks at @AnthropicAI. DMs open! Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

ID:153196789

Link: http://colah.github.io
Joined: 07-06-2010 23:08:04

5.2K Tweets

90.9K Followers

173 Following

Chris Olah

@ch402

4 days ago

Scaling laws for dictionary learning! transformer-circuits.pub/2024/april-upd…

Adam Jermyn

@AdamSJermyn

4 days ago

Some small updates from the Anthropic Interpretability team:

transformer-circuits.pub/2024/april-upd…

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

Neel Nanda

@NeelNanda5

1 week ago

Announcing a progress update from the @GoogleDeepMind mech interp team! Inspired by @AnthropicAI's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.

Neel Nanda

@NeelNanda5

1 month ago

Great visualisation library for Sparse Autoencoder features from Callum McDougall! My team has already been finding it super useful, go check it out:
lesswrong.com/posts/nAhy6Zqu…

Chris Olah

@ch402

1 month ago

I'm incredibly excited to have Craig joining us on the Anthropic Interpretability team!

I've been a huge fan of Colaboratory for nearly a decade (I used it internally at Google!) and have really admired Craig's work on it.

Likes: 131, Replies: 0, Retweets: 7

Craig Citro

@craigcitro

1 month ago

big news for me: after 5000+ days and too many excellent colleagues to mention, I'm leaving Google.

it's been a fantastic ride, and the hardest part about leaving is saying goodbye to my teammates and colleagues.

Likes: 107, Replies: 0, Retweets: 4

Joshua Batson

@thebasepoint

1 month ago

Next in our series of small monthly updates from the interpretability team, including a few fun things:

1. We use feature attribution to find features related to specific completions (following the athlete-sport association example of Neel Nanda)

Chris Olah

@ch402

1 month ago

Another small update from us, including some fun results about circuit analysis with SAEs.

Likes: 87, Replies: 0, Retweets: 8

Jesse Mu

@jayelmnop

1 month ago

We’re hiring for the adversarial robustness team at Anthropic!

As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)