Abhishek Gupta (@abhishekunique7)

Who doesn't love a good method for reward inference? What if I told you that you could extract dense rewards from video by ranking frames temporally with the Bradley-Terry (BT) model from RLHF (aka just doing temporal classification with cross-entropy)? Let's see how, in rank2reward - a 🧵 (1/10)
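
A minimal sketch of the core idea as I read it (the encoder, dimensions, and names below are placeholders, not the rank2reward implementation): score each frame with a small network, and train with a Bradley-Terry / cross-entropy loss so that temporally later frames out-rank earlier ones. The learned score then acts as a dense reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameScorer(nn.Module):
    """Maps a frame embedding to a scalar 'progress' score (hypothetical architecture)."""
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def temporal_bt_loss(scorer: FrameScorer,
                     frames_early: torch.Tensor,
                     frames_late: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss on frame pairs from the same video: the later
    frame should get the higher score, i.e. binary cross-entropy on
    sigmoid(score_late - score_early) with target 1."""
    logits = scorer(frames_late) - scorer(frames_early)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```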

AK (@_akhaliq)

Self-Play Preference Optimization for Language Model Alignment

Traditional reinforcement learning from human feedback (RLHF) approaches, which rely on parametric models like the Bradley-Terry model, fall short of capturing the intransitivity and irrationality of human preferences.
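
For reference (notation mine), the Bradley-Terry model the abstract refers to induces preferences from a single scalar reward r(x, y):

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp\!\big(r(x, y_1)\big)}{\exp\!\big(r(x, y_1)\big) + \exp\!\big(r(x, y_2)\big)}
  = \sigma\!\big(r(x, y_1) - r(x, y_2)\big)
```

Because every comparison is driven by one scalar score, intransitive preference patterns (A preferred to B, B to C, C to A) cannot be represented, which is the limitation the paper targets.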

Greg Durrett (@gregd_nlp)

In Vienna for two 🔦 spotlights at #ICLR2024

🕵 MuSR by Zayne Sprague (@ZayneSprague) et al., Tues 10:45am

🧑‍💻✍️ Coeditor by Jiayi Wei (@MrVPlusOne) with Isil Dillig (@IsilDillig), Thurs 10:45am (presented by me)

DM me if you're interested in chatting about these, reasoning + factuality in LLMs, RLHF, or other topics!

Ge Gao (@ggaonlp)

RLHF research requires hiring and training annotators to explicitly choose between different model outputs.

What if we could get human preferences from user edits, which are naturally generated in applications like AI writing assistants? arxiv.org/abs/2404.15269
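
A naive illustration of the idea, not the method of the linked paper (the function and field names here are made up): treat the user's edited text as the preferred response and the model's original draft as the rejected one, skipping near-identical edits.

```python
import difflib

def preference_pair_from_edit(prompt: str, model_draft: str, user_edited: str,
                              min_change: float = 0.05):
    """Turn a (draft, user edit) pair into a DPO/RLHF-style preference example.
    Returns None when the user barely changed anything (no preference signal)."""
    similarity = difflib.SequenceMatcher(None, model_draft, user_edited).ratio()
    if similarity > 1.0 - min_change:
        return None
    return {"prompt": prompt, "chosen": user_edited, "rejected": model_draft}
```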

Wei Xiong (@weixiong_1)

🥳Our paper 'Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint' has been accepted to ICML 2024 (also an ORAL presentation at the ICLR ME-FoMo workshop)!

- Problem formulation: we first formally formulate RLHF as a…
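
For context (notation mine, not a quote from the paper), the KL-constrained RLHF objective the title refers to is usually written as maximizing reward while staying close to a reference policy \pi_0:

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim d_0,\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \eta \, \mathbb{E}_{x \sim d_0}\Big[ \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big) \Big]
```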

Roberta Raileanu (@robertarail)

I'll be at #ICLR2024 on Saturday for the LLM Agents Workshop Panel 🚀

Some of my collaborators will also be there throughout the week presenting our work on:
- how RLHF affects LLM generalisation and diversity with Robert Kirk (@_robertkirk)
- training NetHack agents using LLM feedback with…

Noam Razin ✈️ ICLR (@noamrazin)

Interested in language model finetuning? Stop by our poster Wednesday morning at #ICLR2024 to hear about a vanishing gradients problem of RLHF!

At the same session, Hattie Zhou (@oh_that_hat) will present fascinating work on understanding what algorithms Transformers can learn
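
For context (my paraphrase, with my own notation): the policy-gradient form of the RLHF objective hints at why gradients can vanish. For any input-dependent baseline b(x),

```latex
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ \big( r(x, y) - b(x) \big)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big]
```

so when the reward has little variance under the model for a given input, the expected gradient is close to zero; if I recall correctly, the paper ties vanishing gradients to exactly this small reward standard deviation under the model.
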
Costa Huang (@vwxyzjn)

Experimenting with some PPO / chat recipes. I noticed there is always a drop off in RLHF reward initially (`(score.mean() - per_token_kl.sum(1).mean())`). Do people observe similar phenomena?
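
A rough sketch of how I read the quantity in the parentheses (the variable names and the KL coefficient are my assumptions, not Costa's exact code): the reward-model score minus a summed per-token KL penalty against the reference policy.

```python
import torch

def rlhf_objective(score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   kl_coef: float = 0.05) -> torch.Tensor:
    """score: (batch,) reward-model scores for full responses.
    policy_logprobs / ref_logprobs: (batch, seq_len) log-probs of the sampled
    tokens under the current policy and the frozen reference model."""
    per_token_kl = policy_logprobs - ref_logprobs      # simple per-token KL estimate
    kl_penalty = kl_coef * per_token_kl.sum(dim=1)     # summed over the sequence
    return score.mean() - kl_penalty.mean()            # the quantity that initially drops
```

One plausible reading of the initial drop under this view: the KL penalty starts at zero and grows as soon as the policy moves away from the reference, while gains in the reward-model score take longer to materialize.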

Technical AI Safety Conference (TAIS) (@tais_2024)

At #TAIS2024, Scott Emmons discussed the issue of partial observability in reinforcement learning from human feedback (RLHF). He challenged the prevalent notion that human evaluators have complete awareness of the environment when providing feedback. Scott revealed that under…

Alireza Makhzani (@AliMakhzani)

Introducing “Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo”

Many capability and safety techniques of LLMs—such as RLHF, automated red-teaming, prompt engineering, and infilling—can be viewed from a probabilistic inference perspective, specifically…
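
In that framing (notation mine), these tasks amount to sampling from an unnormalized target that reweights the base LM p_\theta by a potential or reward over the full sequence,

```latex
\sigma(y \mid x) \;\propto\; p_\theta(y \mid x)\,\phi(x, y),
\qquad \text{e.g. } \phi(x, y) = \exp\!\big(\beta\, r(x, y)\big)
```

whose normalizing constant is intractable, which is where (twisted) sequential Monte Carlo comes in.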

Yannic Kilcher 🇸🇨 (@ykilcher)

🌎New Video🌏
Explaining ORPO: Monolithic Preference Optimization without Reference Model - a more stable, less costly, better performing, and single-step alternative to SFT + RLHF / DPO
Watch here: youtu.be/52kMBrAI_IM
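
Roughly, as I recall the ORPO paper (notation mine), it folds the preference signal into SFT as a single loss: the usual NLL on the chosen response plus a log-odds-ratio penalty pushing the chosen response's odds above the rejected one's,

```latex
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)},
\qquad
\mathcal{L}_{\mathrm{ORPO}} =
\mathbb{E}\!\left[ \mathcal{L}_{\mathrm{SFT}}(y_w)
  \;-\; \lambda \log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \right]
```

so there is no reference model and no separate RLHF/DPO stage, which is where the "single-step" claim comes from.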

Akifumi Wachi (@akifumi_wachi)

I've written the final installment (Part 4) of my short series of notes on RLHF/DPO.
Thank you to everyone who read all the way to the end!
akifumi-wachi-4.github.io/website/column…

By the way, we're recruiting interns whom I'll be mentoring ↓
lycorp.co.jp/ja/recruit/lan…

fly51fly (@fly51fly)

[LG] DPO Meets PPO: Reinforced Token Optimization for RLHF
arxiv.org/abs/2404.18922
- This paper models RLHF as an MDP, offering a token-wise characterization of LLM's generation process. It theoretically demonstrates advantages of token-wise MDP over sentence-wise bandit…
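
My shorthand for the contrast (not the paper's notation): the bandit view treats the whole response as one action with a single terminal reward, while the token-wise MDP view treats each token as an action taken in state (x, y_{<t}) with its own reward,

```latex
\text{bandit: } \;\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\qquad \text{vs.} \qquad
\text{token MDP: } \;\mathbb{E}_{\pi}\Big[ \textstyle\sum_{t} r_t\big( (x, y_{<t}),\, y_t \big) \Big]
```

which is what makes token-level credit assignment possible.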

Aran Komatsuzaki (@arankomatsuzaki)

Self-Play Preference Optimization for Language Model Alignment

SPPO serves as the RLHF counterpart of SPIN and outperforms iterative DPO, Snorkel AI, Self-Rewarding LM, GPT-4 0613 etc

arxiv.org/abs/2405.00675

machine learning (@Mlearning_ai)

Why Human Feedback?
Learning from Human Preferences

Reinforcement learning from human feedback (RLHF)

The article provides a detailed overview of reinforcement learning from human feedback (RLHF), emphasizing its integration with human-computer interaction to refine AI…

Teortaxes▶️ (@teortaxesTex)

GPT '3.5' was the biggest psyop in AI – starting with the very labeling. In retrospect it's clear that the LLM behind ChatGPT had zero relation to the davinci 175B@500B series and was more like overtrained late 2023 models. But for a year+, we had to debate the Magic Of RLHF.

Overly Literate Skater (@0xflashmine)

'Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full…
