Ethan Perez (@EthanJPerez)'s Twitter Profile
Ethan Perez

@EthanJPerez

Large language model safety

ID:908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ · Joined: 15-09-2017 16:26:02

985 Tweets

6.4K Followers

464 Following

Samuel Marks (@saprmarks)'s Twitter Profile Photo

Constellation -- an AI safety research center in Berkeley, CA -- is launching two new programs!

* Visiting Fellows: 3-6 months visiting (w/ travel, housing, & office space covered)
* Constellation Residency: 1yr salaried position

Trenton Bricken (@TrentonBricken)'s Twitter Profile Photo

How to catch a sleeper agent:

1. Collect neuron activations from the model when it replies “Yes” vs “No” to the question: “Are you a helpful AI?”

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

Neel Nanda (@NeelNanda5)'s Twitter Profile Photo

Announcing a progress update from the Google DeepMind mech interp team! Inspired by Anthropic's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.

Arjun Panickssery is in London (@panickssery)'s Twitter Profile Photo

Are LLMs biased toward themselves?

Frontier LLMs give higher scores to their own outputs in self-eval. We find evidence that this bias is caused by LLMs' ability to recognize their own outputs.

This could interfere with safety techniques like reward modeling & constitutional AI
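One way to test the claimed mechanism (self-recognition driving self-preference) is to correlate, per example, whether the judge model recognized its own output with how much higher it scored that output. A toy sketch with made-up data; this is not the paper's exact protocol:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-example records: did the judge recognize its own
# output (0/1), and by how much did it over-score that output?
recognized = [1, 1, 1, 0, 0, 0]
score_gap = [2.0, 1.5, 1.8, 0.1, 0.2, -0.1]
```

A strongly positive correlation on real data would support the recognition hypothesis; near zero would point to some other cause.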

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research: Measuring Model Persuasiveness

We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.

Read our blog post here: anthropic.com/news/measuring…

Anthropic (@AnthropicAI)'s Twitter Profile Photo

We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness compared to arguments written by humans.

We also find a scaling trend across model generations: newer models tended to be rated as more persuasive than previous ones.

Joschka Braun (@JoschkaBraun)'s Twitter Profile Photo

I benchmarked Anthropic's new tool use beta API on the Berkeley function calling benchmark. Haiku beats GPT-4 Turbo in half of the scenarios. Results in 🧵

A huge thanks to Shishir Patil, Fanjia Yan, Tianjun Zhang, Joey Gonzalez & rest for providing this benchmark publicly.

Leo Gao (@nabla_theta)'s Twitter Profile Photo

Eliezer Yudkowsky ⏹️: while computers may excel at soft skills like creativity and emotional understanding, they will never match human ability at dispassionate, mechanical reasoning

Tristan Hume (@trishume)'s Twitter Profile Photo

Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations but we can have fun testing those as speed optimizations via overly-costly low batch size. Come work with me at Anthropic on things like this, more info in thread 🧵

andy jones (@andy_l_jones)'s Twitter Profile Photo

tristan is one of the top three best engineers i've worked with and a lot of the people he's hired recently are not very far behind. _obscenely_ high talent concentration

what's worse, they're nice people and easy to get on with

Ethan Perez (@EthanJPerez)'s Twitter Profile Photo

This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…
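The paper's key quantitative finding is that the attack's effectiveness scales predictably, following a power law in the number of in-context demonstrations. Fitting such a power law is just least squares in log-log space; the sketch below is illustrative, with the metric and data points made up:

```python
import math

def fit_power_law(num_shots, metric):
    # If metric ~ c * n**(-alpha), then log(metric) is linear in log(n)
    # with slope -alpha; fit that line by ordinary least squares.
    xs = [math.log(n) for n in num_shots]
    ys = [math.log(m) for m in metric]
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, c)
```

A good fit like this is what makes the jailbreak "predictable": one can extrapolate how much worse it gets as the number of shots (and hence context length) grows.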

Ian Hogarth (@soundboy)'s Twitter Profile Photo

Very proud of the landmark agreement the US and UK have signed today around joint testing of frontier AI systems. Testament to an incredible team of civil servants at the AI Safety Institute: ft.com/content/4bafe0…

Very proud of the landmark agreement the US and UK have signed today around joint testing of frontier AI systems. Testament to an incredible team of civil servants at the AI Safety Institute: ft.com/content/4bafe0…
account_circle
lmsys.org (@lmsysorg)'s Twitter Profile Photo

[Arena Update]

70K+ new Arena votes🗳️ are in!

Claude-3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities & context length are unmatched now in the market🔥

Congrats Anthropic on the incredible Claude-3 launch!

More exciting

Nick Dobos (@NickADobos)'s Twitter Profile Photo

The king is dead

RIP GPT-4
Claude Opus #1 Elo

Haiku beats GPT-4 0613 & Mistral large
That’s insane for how cheap & fast it is
