Cem Anil (@cem__anil)'s Twitter Profile
Cem Anil

@cem__anil

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. student researcher @google (Blueshift Team) and @nvidia.

ID:1062518594356035584

Website: https://www.cs.toronto.edu/~anilcem/ · Joined: 14-11-2018 01:32:28

450 Tweets

1.5K Followers

1.3K Following

Anthropic (@AnthropicAI):

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

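The probe in question is conceptually just a linear classifier trained on the model's internal activations to predict imminent misbehavior. A minimal sketch of that idea, using synthetic vectors in place of real residual-stream activations (the blog post's actual setup may differ):

```python
# Minimal sketch of a linear probe, using synthetic data in place of
# real model activations; the blog post's exact setup may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512  # hypothetical residual-stream width

# Pretend activations: "safe" and "defection" states differ along one direction.
direction = rng.normal(size=d)
safe = rng.normal(size=(200, d))
unsafe = rng.normal(size=(200, d)) + 2.0 * direction / np.linalg.norm(direction)

X = np.vstack([safe, unsafe])
y = np.array([0] * 200 + [1] * 200)

# The probe itself: plain logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

The appeal of the method is exactly this simplicity: no gradients through the model, just a classifier on cached activations.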
Daniel Johnson (@_ddjohnson):

Excited to share Penzai, a JAX research toolkit from Google DeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere.

Check it out on GitHub: github.com/google-deepmin…

Arjun Panickssery is in London (@panickssery):

Are LLMs biased toward themselves?

Frontier LLMs give higher scores to their own outputs in self-evaluation. We find evidence that this bias is caused by LLMs' ability to recognize their own outputs.

This could interfere with safety techniques like reward modeling & constitutional AI.

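The measurement behind this can be sketched as a simple protocol: have each evaluator score its own outputs and another model's outputs on the same prompts, then compare that gap with its self-recognition accuracy. The functions below are hypothetical stand-ins for real model calls, not the paper's harness:

```python
# Hypothetical sketch of the self-preference protocol; `score` and
# `recognize_own` stand in for real model API calls (not the paper's code).
from statistics import mean

def score(evaluator: str, text: str) -> float:
    """Hypothetical: ask model `evaluator` to rate `text` from 1 to 10."""
    raise NotImplementedError

def recognize_own(evaluator: str, text: str) -> bool:
    """Hypothetical: ask `evaluator` whether it wrote `text`."""
    raise NotImplementedError

def self_preference_gap(evaluator, own_outputs, other_outputs):
    """Positive gap means the evaluator scores its own outputs higher."""
    own = mean(score(evaluator, t) for t in own_outputs)
    other = mean(score(evaluator, t) for t in other_outputs)
    return own - other

def recognition_accuracy(evaluator, own_outputs, other_outputs):
    """Fraction of outputs the evaluator attributes correctly."""
    hits = [recognize_own(evaluator, t) for t in own_outputs] + \
           [not recognize_own(evaluator, t) for t in other_outputs]
    return mean(hits)
```

The causal claim corresponds to the gap tracking recognition accuracy across models: evaluators that are better at spotting their own text show a larger scoring gap.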
Anthropic (@AnthropicAI):

New Anthropic research: Measuring Model Persuasiveness

We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.

Read our blog post here: anthropic.com/news/measuring…

Anthropic (@AnthropicAI):

We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness from arguments written by humans.

We also find a scaling trend across model generations: newer models tend to be rated as more persuasive than older ones.

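The metric described in the blog post is, roughly, an opinion shift: participants rate agreement with a claim before and after reading an argument, and persuasiveness is the average change. A minimal sketch with made-up ratings on a 1-7 scale (the study's actual scales and statistics may differ):

```python
# Minimal sketch of a persuasiveness-as-opinion-shift metric, using made-up
# ratings on a 1-7 agreement scale; the real study's analysis may differ.
import numpy as np

rng = np.random.default_rng(0)
before = rng.integers(1, 8, size=100)                          # agreement before reading
after = np.clip(before + rng.integers(0, 3, size=100), 1, 7)   # agreement after reading

shift = after - before
print(f"mean shift: {shift.mean():.2f}")

# Bootstrap a 95% confidence interval on the mean shift.
boots = [rng.choice(shift, size=len(shift)).mean() for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```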
Joshua Batson (@thebasepoint):

This whole paper is fascinating... it shows the power of in-context learning to dominate in-weights learning, for jailbreaks in particular.

Hidden in the appendix is a toy model of in-context learning that analytically reproduces the power-law behavior, which seems to be universal.

Haochen Zhang (@jhaochenz):

Nice work from Cem Anil and the Anthropic team! Security of LLMs and agentic systems is increasingly crucial. Looking forward to seeing more research on jailbreaking test cases and remediation/monitoring techniques that don't hurt model capability.

Jesse Mu (@jayelmnop):

Another thorny safety challenge for LLMs.

Like Sleeper Agents (twitter.com/jayelmnop/stat…), Cem Anil has found behavior that is stubbornly resistant to finetuning. Training on MSJ (many-shot jailbreaking) shifts the intercept, but not the slope, of the relationship between the number of shots and attack efficacy.

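In log-log coordinates a power law is a straight line, which makes the intercept/slope distinction concrete: finetuning lowers the line without flattening it, so enough shots still break through. A toy illustration on synthetic curves (not the paper's data):

```python
# Toy illustration (synthetic data, not the paper's) of an intervention
# that shifts the intercept but not the slope in log-log space.
import numpy as np

shots = np.array([4, 8, 16, 32, 64, 128, 256])

# Hypothetical attack-efficacy curves: same exponent, different constant.
base = 0.02 * shots ** 0.9
finetuned = 0.005 * shots ** 0.9

for name, y in [("base", base), ("finetuned", finetuned)]:
    slope, intercept = np.polyfit(np.log(shots), np.log(y), 1)
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")

# Both fits recover slope ~0.9; only the intercept differs, so adding
# shots eventually overwhelms the finetuning just as it did before.
```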
Cem Anil (@cem__anil):

One of our most crisp findings was that in-context learning usually follows simple power laws as a function of the number of demonstrations.

We were surprised we didn’t find this stated explicitly in the literature.

Soliciting pointers: have we missed anything?

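Stated concretely, the finding is that performance as a function of the number of demonstrations n is well fit by NLL(n) ≈ C·n^(−α) + K. A sketch of checking that functional form on measured points (the numbers below are made up):

```python
# Sketch of fitting the power-law form NLL(n) = C * n**(-alpha) + K to
# in-context-learning curves; the data points here are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, C, alpha, K):
    return C * n ** (-alpha) + K

n = np.array([1, 2, 4, 8, 16, 32, 64, 128])
nll = np.array([2.10, 1.62, 1.30, 1.08, 0.95, 0.88, 0.84, 0.82])  # made up

(C, alpha, K), _ = curve_fit(power_law, n, nll, p0=(1.0, 0.5, 0.5))
print(f"C={C:.2f}, alpha={alpha:.2f}, K={K:.2f}")
```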
Trenton Bricken (@TrentonBricken):

We have a long way to go on figuring out the implications of long contexts.

Congrats Cem Anil and team on publishing this important work.

Sam Bowman (@sleepinyourhat):

Interesting and concerning new results from Cem Anil et al.: Many-shot prompting for harmful behavior gets predictably more effective at overcoming safety training with more examples, following a power law.

Rylan Schaeffer (@RylanSchaeffer):

I thought this should be called 'Waterboard Jailbreaking' (I in no way mean to jest at the expense of actual torture victims).

If you show long-context models enough (not-real) examples of them being broken, they'll eventually crack like an egg.

Congrats to Cem Anil & authors ❤️‍🔥

Esin Durmus (@esindurmusnlp):

Excited to share our new research on a long-context jailbreaking technique that works across a wide range of large language models (w/ Cem Anil).

Anthropic (@AnthropicAI):

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…

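Mechanically, the technique is in-context learning at scale: the attack prompt packs many faux user/assistant dialogues demonstrating the target behavior ahead of the real query. A schematic of the prompt assembly, with placeholder strings rather than actual attack content:

```python
# Schematic of many-shot prompt construction with placeholder strings;
# real attacks fill the demonstrations with examples of the target behavior.
def build_many_shot_prompt(demonstrations, final_question):
    """demonstrations: list of (question, answer) faux dialogue turns."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in demonstrations]
    return "\n\n".join(turns + [f"User: {final_question}\nAssistant:"])

demos = [(f"placeholder question {i}", f"placeholder compliant answer {i}")
         for i in range(256)]  # efficacy scales as a power law in this count
print(build_many_shot_prompt(demos, "placeholder final question")[:200])
```

As the rest of the thread notes, attack efficacy grows predictably with the number of demonstrations, which is what makes long context windows the enabling factor.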
Cade Gordon (@CadeGordonML):

I recently gave a talk to Machine Learning at Berkeley on Influence Functions through the lens of Anthropic's recent work generalizing them to LLMs.

Slides: docs.google.com/presentation/d…

We go through the math, engineering, and interesting results uniquely afforded by this technique!

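For reference, the influence of a training example z on a test example's loss is the classic gradient / inverse-Hessian / gradient product −∇L(z_test)ᵀ H⁻¹ ∇L(z); EK-FAC makes this tractable for large models by replacing H⁻¹ with a Kronecker-factored approximation. A tiny exact-Hessian sketch on linear regression, purely illustrative:

```python
# Tiny illustrative sketch of influence functions on a linear regression,
# using the exact (damped) Hessian; EK-FAC approximates H^{-1} at scale.
import torch

torch.manual_seed(0)
X = torch.randn(20, 2)
y = X @ torch.tensor([1.0, -2.0]) + 0.1 * torch.randn(20)
theta = torch.zeros(2, requires_grad=True)

def grad_loss(x, t):
    loss = 0.5 * (x @ theta - t) ** 2
    return torch.autograd.grad(loss, theta)[0]

# For squared loss on a linear model the Hessian is X^T X / n; damp it
# slightly so the inverse is well conditioned.
H = X.T @ X / len(X) + 1e-3 * torch.eye(2)
H_inv = torch.inverse(H)

g_test = grad_loss(X[0], y[0])
# influence(z_i -> test loss) = -g_test^T H^{-1} g_i
influences = torch.stack([-(g_test @ H_inv @ grad_loss(X[i], y[i]))
                          for i in range(len(X))])
print("most influential training index:", influences.abs().argmax().item())
```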
Roger Grosse (@RogerGrosse):

New open-source implementation of EK-FAC influence functions (including for language models) by Juhan Bae: github.com/pomonam/kronfl…

Owain Evans (@OwainEvans_UK):

My new blogpost: 'How do LLMs give truthful answers? LLM vs. human reasoning, ensembles, & parrots'.

Summary in 🧵:
Large language models (LLMs) like GPT-4 and Claude 3 become increasingly truthful as they scale up in size and are finetuned for factual accuracy and calibration.
