Ethan Perez (@EthanJPerez)'s Twitter Profile
Ethan Perez

@EthanJPerez

Large language model safety

ID:908728623988953089

Link: https://scholar.google.com/citations?user=za0-taQAAAAJ · Joined: 15-09-2017 16:26:02

985 Tweets

6.4K Followers

464 Following

Samuel Marks (@saprmarks)'s Twitter Profile Photo

Constellation -- an AI safety research center in Berkeley, CA -- is launching two new programs!

* Visiting Fellows: 3-6 months visiting (w/ travel, housing, & office space covered)
* Constellation Residency: 1yr salaried position

Trenton Bricken (@TrentonBricken)'s Twitter Profile Photo

How to catch a sleeper agent:

1. Collect neuron activations from the model when it replies “Yes” vs “No” to the question: “Are you a helpful AI?”

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

Neel Nanda (@NeelNanda5)'s Twitter Profile Photo

Announcing a progress update from the Google DeepMind mech interp team! Inspired by Anthropic's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.

Arjun Panickssery is in London (@panickssery)'s Twitter Profile Photo

Are LLMs biased toward themselves?

Frontier LLMs give higher scores to their own outputs in self-eval. We find evidence that this bias is caused by LLMs' ability to recognize their own outputs.

This could interfere with safety techniques like reward modeling & constitutional AI
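One way to test the claimed mechanism (self-recognition driving self-preference) is to correlate, per example, whether the judge model recognized its own output with how much higher it scored that output. A toy sketch with made-up data; this is not the paper's exact protocol:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-example records: did the judge recognize its own
# output (0/1), and by how much did it over-score that output?
recognized = [1, 1, 1, 0, 0, 0]
score_gap = [2.0, 1.5, 1.8, 0.1, 0.2, -0.1]
```

A strongly positive correlation on real data would support the recognition hypothesis; near zero would point to some other cause.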

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research: Measuring Model Persuasiveness

We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.

Read our blog post here: anthropic.com/news/measuring…

Anthropic (@AnthropicAI)'s Twitter Profile Photo

We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness compared to arguments written by humans.

We also find a scaling trend across model generations: newer models tended to be rated as more persuasive than previous ones.

Joschka Braun (@JoschkaBraun)'s Twitter Profile Photo

I benchmarked Anthropic's new tool use beta API on the Berkeley function calling benchmark. Haiku beats GPT-4 Turbo in half of the scenarios. Results in 🧵

A huge thanks to Shishir Patil, Fanjia Yan, Tianjun Zhang, Joey Gonzalez & rest for providing this benchmark publicly.

Leo Gao (@nabla_theta)'s Twitter Profile Photo

Eliezer Yudkowsky ⏹️: while computers may excel at soft skills like creativity and emotional understanding, they will never match human ability at dispassionate, mechanical reasoning

Tristan Hume (@trishume)'s Twitter Profile Photo

Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations but we can have fun testing those as speed optimizations via overly-costly low batch size. Come work with me at Anthropic on things like this, more info in thread 🧵

andy jones (@andy_l_jones)'s Twitter Profile Photo

tristan is one of the top three best engineers i've worked with and a lot of the people he's hired recently are not very far behind. _obscenely_ high talent concentration

what's worse, they're nice people and easy to get on with

Ethan Perez (@EthanJPerez)'s Twitter Profile Photo

This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.

Anthropic (@AnthropicAI)'s Twitter Profile Photo

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…
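The paper's key quantitative finding is that the attack's effectiveness scales predictably, following a power law in the number of in-context demonstrations. Fitting such a power law is just least squares in log-log space; the sketch below is illustrative, with the metric and data points made up:

```python
import math

def fit_power_law(num_shots, metric):
    # If metric ~ c * n**(-alpha), then log(metric) is linear in log(n)
    # with slope -alpha; fit that line by ordinary least squares.
    xs = [math.log(n) for n in num_shots]
    ys = [math.log(m) for m in metric]
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, c)
```

A good fit like this is what makes the jailbreak "predictable": one can extrapolate how much worse it gets as the number of shots (and hence context length) grows.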

Ian Hogarth (@soundboy)'s Twitter Profile Photo

Very proud of the landmark agreement the US and UK have signed today around joint testing of frontier AI systems. Testament to an incredible team of civil servants at the AI Safety Institute: ft.com/content/4bafe0…

Very proud of the landmark agreement the US and UK have signed today around joint testing of frontier AI systems. Testament to an incredible team of civil servants at the AI Safety Institute: ft.com/content/4bafe0…
account_circle
lmsys.org (@lmsysorg)'s Twitter Profile Photo

[Arena Update]

70K+ new Arena votes🗳️ are in!

Claude-3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities & context length are unmatched now in the market🔥

Congrats Anthropic on the incredible Claude-3 launch!

More exciting

Nick Dobos (@NickADobos)'s Twitter Profile Photo

The king is dead

RIP GPT-4
Claude Opus #1 Elo

Haiku beats GPT-4 0613 & Mistral large
That’s insane for how cheap & fast it is
