Chris Olah (@ch402) Twitter Tweets • TwiCopy

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

Neel Nanda

@NeelNanda5

1 week ago

Announcing a progress update from the Google DeepMind mech interp team! Inspired by Anthropic's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.

Neel Nanda

@NeelNanda5

1 month ago

Great visualisation library for Sparse Autoencoder features from Callum McDougall! My team has already been finding it super useful, go check it out:
lesswrong.com/posts/nAhy6Zqu…

Chris Olah

@ch402

1 month ago

I'm incredibly excited to have Craig joining us on the Anthropic Interpretability team!

I've been a huge fan of Colaboratory for nearly a decade (I used it internally at Google!) and have really admired Craig's work on it.

Craig Citro

@craigcitro

1 month ago

big news for me: after 5000+ days and too many excellent colleagues to mention, I'm leaving Google.

it's been a fantastic ride, and the hardest part about leaving is saying goodbye to my teammates and colleagues.

Joshua Batson

@thebasepoint

1 month ago

Next in our series of small monthly updates from the interpretability team, including a few fun things:

1. We use feature attribution to find features related to specific completions (following the athlete-sport association example of Neel Nanda)

Chris Olah

@ch402

1 month ago

Another small update from us, including some fun results about circuit analysis with SAEs.

Jesse Mu

@jayelmnop

1 month ago

We’re hiring for the adversarial robustness team at Anthropic!

As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)