Sam Bowman (@sleepinyourhat)'s Twitter Profile
Sam Bowman

@sleepinyourhat

AI alignment + LLMs at NYU & Anthropic. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.

ID:338526004

Website: https://cims.nyu.edu/~sbowman/
Joined: 19-07-2011 18:19:52

2.2K Tweets

34.6K Followers

3.1K Following

Owain Evans (@OwainEvans_UK)

Full lecture slides and reading list for Roger Grosse's class on AI Alignment are up:
alignment-w2024.notion.site

David Krueger (@DavidSKrueger)

I'm super excited to release our 100+ page collaborative agenda - led by Usman Anwar (@usmananwar391) - on "Foundational Challenges In Assuring Alignment and Safety of LLMs" alongside 35+ co-authors from NLP, ML, and AI Safety communities!

Some highlights below...

Sasha Rush (@srush_nlp)

I like to think of myself as a researcher, but almost certainly the most valuable use of my time is writing US Visa letters.

Cem Anil (@cem__anil)

One of our most crisp findings was that in-context learning usually follows simple power laws as a function of the number of demonstrations.

We were surprised we didn't find this stated explicitly in the literature.

Soliciting pointers: have we missed anything?
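
A power law here means the loss falls off roughly as a constant power of the number of demonstrations, i.e. loss(n) ≈ C·n^(−α) + L∞. A minimal sketch of fitting that form with SciPy, assuming hypothetical loss measurements (the numbers and the fitted exponent are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-token losses at increasing numbers of in-context demonstrations.
shots = np.array([1, 2, 4, 8, 16, 32, 64, 128])
loss = np.array([2.10, 1.80, 1.55, 1.36, 1.21, 1.10, 1.02, 0.95])

# Power law with an irreducible floor: loss(n) = C * n**(-alpha) + L_inf.
def power_law(n, C, alpha, L_inf):
    return C * n ** (-alpha) + L_inf

(C, alpha, L_inf), _ = curve_fit(power_law, shots, loss, p0=[2.0, 0.3, 0.5])
print(f"loss(n) ≈ {C:.2f} * n^(-{alpha:.2f}) + {L_inf:.2f}")
```

The quick visual check is a log-log plot: after subtracting the floor, a power law is a straight line with slope −α.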

Ethan Perez (@EthanJPerez)

This is the most effective, reliable, and hard-to-train-away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.

Sam Bowman (@sleepinyourhat)

Interesting and concerning new results from Cem Anil et al.: Many-shot prompting for harmful behavior gets predictably more effective at overcoming safety training with more examples, following a power law.
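
Because a power law is linear in log-log space, a fit at small shot counts extrapolates to longer contexts, which is what makes the attack's growing effectiveness predictable. A hedged sketch with made-up numbers (`attack_nll` is a hypothetical stand-in for the negative log-likelihood the model assigns to the harmful response):

```python
import numpy as np

# Hypothetical NLL of a harmful completion vs. number of many-shot demonstrations.
shots = np.array([4, 8, 16, 32, 64])
attack_nll = np.array([3.10, 2.55, 2.12, 1.74, 1.44])

# Straight-line fit in log-log space: log(nll) = intercept + slope * log(n).
slope, intercept = np.polyfit(np.log(shots), np.log(attack_nll), 1)

# Extrapolate to a much longer context; lower NLL means a more effective attack.
n = 512
print(f"predicted NLL at {n} shots: {np.exp(intercept) * n ** slope:.2f}")
```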

Chris Olah (@ch402)

I'm incredibly excited to have Craig joining us on the Anthropic Interpretability team!

I've been a huge fan of Colaboratory for nearly a decade (I used it internally at Google!) and have really admired Craig's work on it.

Ethan Perez (@EthanJPerez)

I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)!

Past projects have led to some of my papers on debate, chain-of-thought faithfulness, and sycophancy.

Rohin Shah (@rohinmshah)

Despite the constant arguments about p(doom), many agree that *if* AI systems become highly capable in risky domains, *then* we ought to mitigate those risks. So we built an eval suite to see whether AI systems are highly capable in risky domains.

twitter.com/tshevl/status/…

Jesse Mu (@jayelmnop)

We're hiring for the adversarial robustness team at @AnthropicAI!

As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you're interested in these areas, let us know! (emails in 🧵)

Anthropic (@AnthropicAI)

Today we're releasing Claude 3 Haiku, the fastest and most affordable model in its intelligence class.

Haiku is now available in the API and on claude.ai for Claude Pro subscribers.

Sam Bowman (@sleepinyourhat)

🚨📄 Following up on 'LMs Don't Always Say What They Think', Miles Turpin et al. now have an intervention that dramatically reduces the problem! 📄🚨

It's not a perfect solution, but it's a simple method with few assumptions and it generalizes *much* better than I'd expected.

Neel Nanda (@NeelNanda5)

Really great post on how to think about doing mech interp research, and how it requires a very different mindset from normal ML.

Amanda Askell (@AmandaAskell)

I suppose this is a good time to mention that I'm looking for a research prompt engineer, in case you want to be my promptégé.

(Look, you may wildly out-prompt me but I couldn't resist that portmanteau.) jobs.lever.co/Anthropic/a2c8…

Jack Clark (@jackclarkSF)

Want to work at the frontier of AI policy with the most technical policy team in the business? You do? Excellent. Please consider applying:
- Special Projects Lead jobs.lever.co/Anthropic/5752…
- Policy Analyst, Product jobs.lever.co/Anthropic/6ecd…
- Outreach Lead jobs.lever.co/Anthropic/df58…

Helen Toner (@hlntnr)

5 years! It's been unbelievable to see how CSET's team and reputation have grown.

To celebrate, here are 5 papers/products, 1 from each year of CSET's existence, that I love (and that exemplify the work we do).
