Neel Nanda
@NeelNanda5
Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
ID:1542528075128348674
http://neelnanda.io 30-06-2022 15:18:58
1.8K Tweets
13.4K Followers
89 Following
Excited to share what I've been working on alongside Philippe Chlenski under Neel Nanda's mentorship: finding interpretable feature circuits in LLMs using a tool called transcoders!
Blogpost: lesswrong.com/posts/YmkjnWtZ…
Code: github.com/jacobdunefsky/…
(1/8)
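The transcoder idea behind this thread can be sketched in a few lines: unlike a standard sparse autoencoder, which reconstructs its own input, a transcoder learns to map an MLP layer's *input* to the layer's *output* through a wide, sparsely-firing hidden layer, so each hidden feature is a candidate interpretable unit of the MLP's computation. This is a minimal illustrative sketch, not the linked repo's code; the class name, dimensions, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # hypothetical sizes, not from the tweet


class Transcoder:
    """Sparse stand-in for an MLP layer: maps the MLP's input to the
    MLP's output through a wide, sparsely-firing feature layer."""

    def __init__(self, d_model, d_hidden):
        self.W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
        self.b_dec = np.zeros(d_model)

    def features(self, x):
        # ReLU gives a sparse, non-negative feature vector.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def __call__(self, x):
        # Trained (not shown) to match the original MLP's output on x.
        return self.features(x) @ self.W_dec + self.b_dec


tc = Transcoder(d_model, d_hidden)
x = rng.normal(size=d_model)
f = tc.features(x)   # sparse feature activations
y = tc(x)            # approximation of the MLP's output
```

Because each output is a linear combination of decoder rows weighted by sparse feature activations, circuits can be traced feature-to-feature across layers.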
ML Alignment & Theory Scholars has 1,220 summer applicants and enough funding to accept 2.5% of them (ideally 7%). Accepting donations via existence.org and manifund.org/projects/mats-… at $24k/scholar. Help us support mentors like Neel Nanda, Owain Evans, Ethan Perez, Evan Hubinger, and more!
New Google DeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders.
They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI: 0–13% improvement). Joint w/ Arthur Conmy
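The core trick in a gated SAE can be sketched briefly: a binary gate path decides *which* features fire, while a separate magnitude path decides *how much*, with both paths sharing one encoder matrix (the magnitude path rescales it per-feature). This decouples detection from magnitude estimation, which is how the gated variant fires fewer features at equal reconstruction quality. A minimal sketch under assumed names and dimensions, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical sizes, not from the tweet


class GatedSAE:
    """Sketch of a gated sparse autoencoder: a binary gate picks which
    features are active; a tied, rescaled magnitude path sets their size."""

    def __init__(self, d_model, d_sae):
        self.W = rng.normal(0, 0.1, (d_model, d_sae))  # shared encoder weights
        self.b_gate = np.zeros(d_sae)                  # gate-path bias
        self.r_mag = np.zeros(d_sae)                   # per-feature log rescale
        self.b_mag = np.zeros(d_sae)                   # magnitude-path bias
        self.W_dec = rng.normal(0, 0.1, (d_sae, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        pre = x @ self.W
        gate = (pre + self.b_gate) > 0                          # which features fire
        mag = np.maximum(pre * np.exp(self.r_mag) + self.b_mag, 0.0)  # how much
        return gate * mag                                       # gated activations

    def __call__(self, x):
        return self.encode(x) @ self.W_dec + self.b_dec


sae = GatedSAE(d_model, d_sae)
x = rng.normal(size=d_model)
f = sae.encode(x)   # sparse, gated feature activations
x_hat = sae(x)      # reconstruction of the input activation vector
```

The binary gate is non-differentiable, so training uses an auxiliary loss on the gate path's pre-activations; that machinery is omitted here.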