Cem Anil (@cem__anil)'s Twitter Profile
Cem Anil

@cem__anil

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. student researcher @google (Blueshift Team) and @nvidia.

ID:1062518594356035584

Website: https://www.cs.toronto.edu/~anilcem/ · Joined: 14-11-2018 01:32:28

450 Tweets

1.5K Followers

1.3K Following

Anthropic (@AnthropicAI):

New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored 'sleeper agent' models are about to behave dangerously, after they pretend to be safe in training.

Check out our first alignment blog post here: anthropic.com/research/probe…

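The probe in question is conceptually just a linear classifier trained on the model's internal activations to predict imminent misbehavior. A minimal sketch of that idea, using synthetic vectors in place of real residual-stream activations (the blog post's actual setup may differ):

```python
# Minimal sketch of a linear probe, using synthetic data in place of
# real model activations; the blog post's exact setup may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512  # hypothetical residual-stream width

# Pretend activations: "safe" and "defection" states differ along one direction.
direction = rng.normal(size=d)
safe = rng.normal(size=(200, d))
unsafe = rng.normal(size=(200, d)) + 2.0 * direction / np.linalg.norm(direction)

X = np.vstack([safe, unsafe])
y = np.array([0] * 200 + [1] * 200)

# The probe itself: plain logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

The appeal of the method is exactly this simplicity: no gradients through the model, just a classifier on cached activations.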
Daniel Johnson (@_ddjohnson):

Excited to share Penzai, a JAX research toolkit from Google DeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere.

Check it out on GitHub: github.com/google-deepmin…

Arjun Panickssery is in London (@panickssery):

Are LLMs biased toward themselves?

Frontier LLMs give higher scores to their own outputs in self-evaluation. We find evidence that this bias is caused by LLMs' ability to recognize their own outputs.

This could interfere with safety techniques like reward modeling & constitutional AI.

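The measurement behind this can be sketched as a simple protocol: have each evaluator score its own outputs and another model's outputs on the same prompts, then compare that gap with its self-recognition accuracy. The functions below are hypothetical stand-ins for real model calls, not the paper's harness:

```python
# Hypothetical sketch of the self-preference protocol; `score` and
# `recognize_own` stand in for real model API calls (not the paper's code).
from statistics import mean

def score(evaluator: str, text: str) -> float:
    """Hypothetical: ask model `evaluator` to rate `text` from 1 to 10."""
    raise NotImplementedError

def recognize_own(evaluator: str, text: str) -> bool:
    """Hypothetical: ask `evaluator` whether it wrote `text`."""
    raise NotImplementedError

def self_preference_gap(evaluator, own_outputs, other_outputs):
    """Positive gap means the evaluator scores its own outputs higher."""
    own = mean(score(evaluator, t) for t in own_outputs)
    other = mean(score(evaluator, t) for t in other_outputs)
    return own - other

def recognition_accuracy(evaluator, own_outputs, other_outputs):
    """Fraction of outputs the evaluator attributes correctly."""
    hits = [recognize_own(evaluator, t) for t in own_outputs] + \
           [not recognize_own(evaluator, t) for t in other_outputs]
    return mean(hits)
```

The causal claim corresponds to the gap tracking recognition accuracy across models: evaluators that are better at spotting their own text show a larger scoring gap.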
Anthropic (@AnthropicAI):

New Anthropic research: Measuring Model Persuasiveness

We developed a way to test how persuasive language models (LMs) are, and analyzed how persuasiveness scales across different versions of Claude.

Read our blog post here: anthropic.com/news/measuring…

Anthropic (@AnthropicAI):

We find that Claude 3 Opus generates arguments that don't statistically differ in persuasiveness from arguments written by humans.

We also find a scaling trend across model generations: newer models tend to be rated as more persuasive than older ones.

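The metric described in the blog post is, roughly, an opinion shift: participants rate agreement with a claim before and after reading an argument, and persuasiveness is the average change. A minimal sketch with made-up ratings on a 1-7 scale (the study's actual scales and statistics may differ):

```python
# Minimal sketch of a persuasiveness-as-opinion-shift metric, using made-up
# ratings on a 1-7 agreement scale; the real study's analysis may differ.
import numpy as np

rng = np.random.default_rng(0)
before = rng.integers(1, 8, size=100)                          # agreement before reading
after = np.clip(before + rng.integers(0, 3, size=100), 1, 7)   # agreement after reading

shift = after - before
print(f"mean shift: {shift.mean():.2f}")

# Bootstrap a 95% confidence interval on the mean shift.
boots = [rng.choice(shift, size=len(shift)).mean() for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```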
Joshua Batson (@thebasepoint):

This whole paper is fascinating... it shows the power of in-context learning to dominate in-weights learning, for jailbreaks in particular.

Hidden in the appendix is a toy model of in-context learning that analytically reproduces the power-law behavior, which seems to be universal.

Haochen Zhang (@jhaochenz):

Nice work from Cem Anil and the Anthropic team! Security of LLMs and agentic systems is increasingly crucial. Looking forward to seeing more research on jailbreaking test cases and remediation/monitoring techniques that don't hurt model capability.

Jesse Mu (@jayelmnop):

Another thorny safety challenge for LLMs.

Like Sleeper Agents (twitter.com/jayelmnop/stat…), Cem Anil has found behavior that is stubbornly resistant to finetuning. Training on MSJ (many-shot jailbreaking) shifts the intercept, but not the slope, of the relationship between the number of shots and attack efficacy.

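In log-log coordinates a power law is a straight line, which makes the intercept/slope distinction concrete: finetuning lowers the line without flattening it, so enough shots still break through. A toy illustration on synthetic curves (not the paper's data):

```python
# Toy illustration (synthetic data, not the paper's) of an intervention
# that shifts the intercept but not the slope in log-log space.
import numpy as np

shots = np.array([4, 8, 16, 32, 64, 128, 256])

# Hypothetical attack-efficacy curves: same exponent, different constant.
base = 0.02 * shots ** 0.9
finetuned = 0.005 * shots ** 0.9

for name, y in [("base", base), ("finetuned", finetuned)]:
    slope, intercept = np.polyfit(np.log(shots), np.log(y), 1)
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")

# Both fits recover slope ~0.9; only the intercept differs, so adding
# shots eventually overwhelms the finetuning just as it did before.
```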
Cem Anil (@cem__anil):

One of our most crisp findings was that in-context learning usually follows simple power laws as a function of the number of demonstrations.

We were surprised we didn’t find this stated explicitly in the literature.

Soliciting pointers: have we missed anything?

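Stated concretely, the finding is that performance as a function of the number of demonstrations n is well fit by NLL(n) ≈ C·n^(−α) + K. A sketch of checking that functional form on measured points (the numbers below are made up):

```python
# Sketch of fitting the power-law form NLL(n) = C * n**(-alpha) + K to
# in-context-learning curves; the data points here are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, C, alpha, K):
    return C * n ** (-alpha) + K

n = np.array([1, 2, 4, 8, 16, 32, 64, 128])
nll = np.array([2.10, 1.62, 1.30, 1.08, 0.95, 0.88, 0.84, 0.82])  # made up

(C, alpha, K), _ = curve_fit(power_law, n, nll, p0=(1.0, 0.5, 0.5))
print(f"C={C:.2f}, alpha={alpha:.2f}, K={K:.2f}")
```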
Trenton Bricken (@TrentonBricken):

We have a long way to go on figuring out the implications of long contexts.

Congrats Cem Anil and team on publishing this important work.

Sam Bowman (@sleepinyourhat):

Interesting and concerning new results from Cem Anil et al.: Many-shot prompting for harmful behavior gets predictably more effective at overcoming safety training with more examples, following a power law.

Rylan Schaeffer (@RylanSchaeffer):

I thought this should be called 'Waterboard Jailbreaking' (I in no way mean to jest at the expense of actual torture victims).

If you show long-context models enough (not-real) examples of them being broken, they'll eventually crack like an egg.

Congrats to Cem Anil & authors ❤️‍🔥

Esin Durmus (@esindurmusnlp):

Excited to share our new research on a long-context jailbreaking technique that works across a wide range of large language models (w/ Cem Anil).

Anthropic (@AnthropicAI):

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: anthropic.com/research/many-…

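Mechanically, the technique is in-context learning at scale: the attack prompt packs many faux user/assistant dialogues demonstrating the target behavior ahead of the real query. A schematic of the prompt assembly, with placeholder strings rather than actual attack content:

```python
# Schematic of many-shot prompt construction with placeholder strings;
# real attacks fill the demonstrations with examples of the target behavior.
def build_many_shot_prompt(demonstrations, final_question):
    """demonstrations: list of (question, answer) faux dialogue turns."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in demonstrations]
    return "\n\n".join(turns + [f"User: {final_question}\nAssistant:"])

demos = [(f"placeholder question {i}", f"placeholder compliant answer {i}")
         for i in range(256)]  # efficacy scales as a power law in this count
print(build_many_shot_prompt(demos, "placeholder final question")[:200])
```

As the rest of the thread notes, attack efficacy grows predictably with the number of demonstrations, which is what makes long context windows the enabling factor.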
Cade Gordon (@CadeGordonML):

I recently gave a talk to Machine Learning at Berkeley on Influence Functions through the lens of Anthropic's recent work generalizing them to LLMs.

Slides: docs.google.com/presentation/d…

We go through the math, engineering, and interesting results uniquely afforded by this technique!

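For reference, the influence of a training example z on a test example's loss is the classic gradient / inverse-Hessian / gradient product −∇L(z_test)ᵀ H⁻¹ ∇L(z); EK-FAC makes this tractable for large models by replacing H⁻¹ with a Kronecker-factored approximation. A tiny exact-Hessian sketch on linear regression, purely illustrative:

```python
# Tiny illustrative sketch of influence functions on a linear regression,
# using the exact (damped) Hessian; EK-FAC approximates H^{-1} at scale.
import torch

torch.manual_seed(0)
X = torch.randn(20, 2)
y = X @ torch.tensor([1.0, -2.0]) + 0.1 * torch.randn(20)
theta = torch.zeros(2, requires_grad=True)

def grad_loss(x, t):
    loss = 0.5 * (x @ theta - t) ** 2
    return torch.autograd.grad(loss, theta)[0]

# For squared loss on a linear model the Hessian is X^T X / n; damp it
# slightly so the inverse is well conditioned.
H = X.T @ X / len(X) + 1e-3 * torch.eye(2)
H_inv = torch.inverse(H)

g_test = grad_loss(X[0], y[0])
# influence(z_i -> test loss) = -g_test^T H^{-1} g_i
influences = torch.stack([-(g_test @ H_inv @ grad_loss(X[i], y[i]))
                          for i in range(len(X))])
print("most influential training index:", influences.abs().argmax().item())
```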
Roger Grosse (@RogerGrosse):

New open-source implementation of EK-FAC influence functions (including for language models) by Juhan Bae: github.com/pomonam/kronfl…

Owain Evans (@OwainEvans_UK):

My new blogpost: 'How do LLMs give truthful answers? LLM vs. human reasoning, ensembles, & parrots'.

Summary in 🧵:
Large language models (LLMs) like GPT-4 and Claude 3 become increasingly truthful as they scale up in size and are finetuned for factual accuracy and calibration.
