Sam Bowman (@sleepinyourhat)'s Twitter Profile
Sam Bowman

@sleepinyourhat

AI alignment + LLMs at NYU & Anthropic. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.

ID:338526004

Website: https://cims.nyu.edu/~sbowman/
Joined: 19-07-2011 18:19:52

2.2K Tweets

34.6K Followers

3.1K Following

Owain Evans (@OwainEvans_UK)

Full lecture slides and reading list for Roger Grosse's class on AI Alignment are up:
alignment-w2024.notion.site

David Krueger (@DavidSKrueger)

I'm super excited to release our 100+ page collaborative agenda - led by Usman Anwar (@usmananwar391) - on "Foundational Challenges In Assuring Alignment and Safety of LLMs" alongside 35+ co-authors from NLP, ML, and AI Safety communities!

Some highlights below...

Sasha Rush (@srush_nlp)

I like to think of myself as a researcher, but almost certainly the most valuable use of my time is writing US Visa letters.

Cem Anil (@cem__anil)

One of our most crisp findings was that in-context learning usually follows simple power laws as a function of the number of demonstrations.

We were surprised we didn't find this stated explicitly in the literature.

Soliciting pointers: have we missed anything?
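
A power law here means the loss falls off roughly as a constant power of the number of demonstrations, i.e. loss(n) ≈ C·n^(−α) + L∞. A minimal sketch of fitting that form with SciPy, assuming hypothetical loss measurements (the numbers and the fitted exponent are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-token losses at increasing numbers of in-context demonstrations.
shots = np.array([1, 2, 4, 8, 16, 32, 64, 128])
loss = np.array([2.10, 1.80, 1.55, 1.36, 1.21, 1.10, 1.02, 0.95])

# Power law with an irreducible floor: loss(n) = C * n**(-alpha) + L_inf.
def power_law(n, C, alpha, L_inf):
    return C * n ** (-alpha) + L_inf

(C, alpha, L_inf), _ = curve_fit(power_law, shots, loss, p0=[2.0, 0.3, 0.5])
print(f"loss(n) ≈ {C:.2f} * n^(-{alpha:.2f}) + {L_inf:.2f}")
```

The quick visual check is a log-log plot: after subtracting the floor, a power law is a straight line with slope −α.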

Ethan Perez (@EthanJPerez)

This is the most effective, reliable, and hard-to-train-away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.

Sam Bowman (@sleepinyourhat)

Interesting and concerning new results from Cem Anil et al.: Many-shot prompting for harmful behavior gets predictably more effective at overcoming safety training with more examples, following a power law.
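
Because a power law is linear in log-log space, a fit at small shot counts extrapolates to longer contexts, which is what makes the attack's growing effectiveness predictable. A hedged sketch with made-up numbers (`attack_nll` is a hypothetical stand-in for the negative log-likelihood the model assigns to the harmful response):

```python
import numpy as np

# Hypothetical NLL of a harmful completion vs. number of many-shot demonstrations.
shots = np.array([4, 8, 16, 32, 64])
attack_nll = np.array([3.10, 2.55, 2.12, 1.74, 1.44])

# Straight-line fit in log-log space: log(nll) = intercept + slope * log(n).
slope, intercept = np.polyfit(np.log(shots), np.log(attack_nll), 1)

# Extrapolate to a much longer context; lower NLL means a more effective attack.
n = 512
print(f"predicted NLL at {n} shots: {np.exp(intercept) * n ** slope:.2f}")
```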

Chris Olah (@ch402)

I'm incredibly excited to have Craig joining us on the Anthropic Interpretability team!

I've been a huge fan of Colaboratory for nearly a decade (I used it internally at Google!) and have really admired Craig's work on it.

Ethan Perez (@EthanJPerez)

I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)!

Past projects have led to some of my papers on debate, chain-of-thought faithfulness, and sycophancy.

Rohin Shah (@rohinmshah)

Despite the constant arguments about p(doom), many agree that *if* AI systems become highly capable in risky domains, *then* we ought to mitigate those risks. So we built an eval suite to see whether AI systems are highly capable in risky domains.

twitter.com/tshevl/status/…

Jesse Mu (@jayelmnop)

We're hiring for the adversarial robustness team at @AnthropicAI!

As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you're interested in these areas, let us know! (emails in 🧵)

Anthropic (@AnthropicAI)

Today we're releasing Claude 3 Haiku, the fastest and most affordable model in its intelligence class.

Haiku is now available in the API and on claude.ai for Claude Pro subscribers.

Sam Bowman (@sleepinyourhat)

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.

Neel Nanda (@NeelNanda5)

Really great post on how to think about doing mech interp research, and how it requires a very different mindset from normal ML.

Amanda Askell (@AmandaAskell)

I suppose this is a good time to mention that I'm looking for a research prompt engineer, in case you want to be my promptégé.

(Look, you may wildly out-prompt me but I couldn't resist that portmanteau.) jobs.lever.co/Anthropic/a2c8…

Jack Clark (@jackclarkSF)

Want to work at the frontier of AI policy with the most technical policy team in the business? You do? Excellent. Please consider applying:
- Special Projects Lead jobs.lever.co/Anthropic/5752…
- Policy Analyst, Product jobs.lever.co/Anthropic/6ecd…
- Outreach Lead jobs.lever.co/Anthropic/df58…

Helen Toner (@hlntnr)

5 years! It's been unbelievable to see how CSET's team and reputation have grown.

To celebrate, here are 5 papers/products, 1 from each year of CSET's existence, that I love (and that exemplify the work we do).
