Abhishek Gupta (@abhishekunique7)

Who doesn't love a good method for reward inference? What if I told you that you could extract dense rewards from video by ranking frames temporally with the Bradley-Terry (BT) model from RLHF (aka just doing temporal classification with cross-entropy)? Let's see how, in rank2reward - a 🧵 (1/10)
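
A minimal sketch of the core idea as I read it (the encoder, dimensions, and names below are placeholders, not the rank2reward implementation): score each frame with a small network, and train with a Bradley-Terry / cross-entropy loss so that temporally later frames out-rank earlier ones. The learned score then acts as a dense reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameScorer(nn.Module):
    """Maps a frame embedding to a scalar 'progress' score (hypothetical architecture)."""
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def temporal_bt_loss(scorer: FrameScorer,
                     frames_early: torch.Tensor,
                     frames_late: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss on frame pairs from the same video: the later
    frame should get the higher score, i.e. binary cross-entropy on
    sigmoid(score_late - score_early) with target 1."""
    logits = scorer(frames_late) - scorer(frames_early)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
```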

AK (@_akhaliq)

Self-Play Preference Optimization for Language Model Alignment

Traditional reinforcement learning from human feedback (RLHF) approaches, which rely on parametric models like the Bradley-Terry model, fall short of capturing the intransitivity and irrationality of human preferences.
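
For reference (notation mine), the Bradley-Terry model the abstract refers to induces preferences from a single scalar reward r(x, y):

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp\!\big(r(x, y_1)\big)}{\exp\!\big(r(x, y_1)\big) + \exp\!\big(r(x, y_2)\big)}
  = \sigma\!\big(r(x, y_1) - r(x, y_2)\big)
```

Because every comparison is driven by one scalar score, intransitive preference patterns (A preferred to B, B to C, C to A) cannot be represented, which is the limitation the paper targets.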

Greg Durrett (@gregd_nlp)

In Vienna for two 🔦 spotlights at #ICLR2024

🕵 MuSR by Zayne Sprague (@ZayneSprague) et al., Tues 10:45am

🧑‍💻✍️ Coeditor by Jiayi Wei (@MrVPlusOne) with Isil Dillig (@IsilDillig), Thurs 10:45am (presented by me)

DM me if you're interested in chatting about these, reasoning + factuality in LLMs, RLHF, or other topics!

Ge Gao (@ggaonlp)

RLHF research requires hiring and training annotators to explicitly choose between different model outputs.

What if we could get human preferences from user edits, which are naturally generated in applications like AI writing assistants? arxiv.org/abs/2404.15269
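
A naive illustration of the idea, not the method of the linked paper (the function and field names here are made up): treat the user's edited text as the preferred response and the model's original draft as the rejected one, skipping near-identical edits.

```python
import difflib

def preference_pair_from_edit(prompt: str, model_draft: str, user_edited: str,
                              min_change: float = 0.05):
    """Turn a (draft, user edit) pair into a DPO/RLHF-style preference example.
    Returns None when the user barely changed anything (no preference signal)."""
    similarity = difflib.SequenceMatcher(None, model_draft, user_edited).ratio()
    if similarity > 1.0 - min_change:
        return None
    return {"prompt": prompt, "chosen": user_edited, "rejected": model_draft}
```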

Wei Xiong (@weixiong_1)

🥳Our paper 'Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint' has been accepted to ICML 2024 (also an ORAL presentation at the ICLR ME-FoMo workshop)!

- Problem formulation: we first formally formulate RLHF as a…
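
For context (notation mine, not a quote from the paper), the KL-constrained RLHF objective the title refers to is usually written as maximizing reward while staying close to a reference policy \pi_0:

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim d_0,\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \eta \, \mathbb{E}_{x \sim d_0}\Big[ \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big) \Big]
```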

Roberta Raileanu (@robertarail)

I'll be at #ICLR2024 on Saturday for the LLM Agents Workshop Panel 🚀

Some of my collaborators will also be there throughout the week presenting our work on:
- how RLHF affects LLM generalisation and diversity with Robert Kirk (@_robertkirk)
- training NetHack agents using LLM feedback with…

Noam Razin ✈️ ICLR (@noamrazin)

Interested in language model finetuning? Stop by our poster Wednesday morning at #ICLR2024 to hear about a vanishing gradients problem of RLHF!

At the same session, Hattie Zhou (@oh_that_hat) will present fascinating work on understanding what algorithms Transformers can learn
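
For context (my paraphrase, with my own notation): the policy-gradient form of the RLHF objective hints at why gradients can vanish. For any input-dependent baseline b(x),

```latex
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ \big( r(x, y) - b(x) \big)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big]
```

so when the reward has little variance under the model for a given input, the expected gradient is close to zero; if I recall correctly, the paper ties vanishing gradients to exactly this small reward standard deviation under the model.
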
Costa Huang (@vwxyzjn)

Experimenting with some PPO / chat recipes. I noticed there is always a drop off in RLHF reward initially (`(score.mean() - per_token_kl.sum(1).mean())`). Do people observe similar phenomena?
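
A rough sketch of how I read the quantity in the parentheses (the variable names and the KL coefficient are my assumptions, not Costa's exact code): the reward-model score minus a summed per-token KL penalty against the reference policy.

```python
import torch

def rlhf_objective(score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   kl_coef: float = 0.05) -> torch.Tensor:
    """score: (batch,) reward-model scores for full responses.
    policy_logprobs / ref_logprobs: (batch, seq_len) log-probs of the sampled
    tokens under the current policy and the frozen reference model."""
    per_token_kl = policy_logprobs - ref_logprobs      # simple per-token KL estimate
    kl_penalty = kl_coef * per_token_kl.sum(dim=1)     # summed over the sequence
    return score.mean() - kl_penalty.mean()            # the quantity that initially drops
```

One plausible reading of the initial drop under this view: the KL penalty starts at zero and grows as soon as the policy moves away from the reference, while gains in the reward-model score take longer to materialize.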

Technical AI Safety Conference (TAIS) (@tais_2024)

At #TAIS2024, Scott Emmons discussed the issue of partial observability in reinforcement learning from human feedback (RLHF). He challenged the prevalent notion that human evaluators have complete awareness of the environment when providing feedback. Scott revealed that under…

Alireza Makhzani (@AliMakhzani)

Introducing “Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo”

Many capability and safety techniques of LLMs—such as RLHF, automated red-teaming, prompt engineering, and infilling—can be viewed from a probabilistic inference perspective, specifically…
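
In that framing (notation mine), these tasks amount to sampling from an unnormalized target that reweights the base LM p_\theta by a potential or reward over the full sequence,

```latex
\sigma(y \mid x) \;\propto\; p_\theta(y \mid x)\,\phi(x, y),
\qquad \text{e.g. } \phi(x, y) = \exp\!\big(\beta\, r(x, y)\big)
```

whose normalizing constant is intractable, which is where (twisted) sequential Monte Carlo comes in.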

Yannic Kilcher 🇸🇨 (@ykilcher)

🌎New Video🌏
Explaining ORPO: Monolithic Preference Optimization without Reference Model - a more stable, less costly, better performing, and single-step alternative to SFT + RLHF / DPO
Watch here: youtu.be/52kMBrAI_IM
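
Roughly, as I recall the ORPO paper (notation mine), it folds the preference signal into SFT as a single loss: the usual NLL on the chosen response plus a log-odds-ratio penalty pushing the chosen response's odds above the rejected one's,

```latex
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)},
\qquad
\mathcal{L}_{\mathrm{ORPO}} =
\mathbb{E}\!\left[ \mathcal{L}_{\mathrm{SFT}}(y_w)
  \;-\; \lambda \log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \right]
```

so there is no reference model and no separate RLHF/DPO stage, which is where the "single-step" claim comes from.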

Akifumi Wachi (@akifumi_wachi)

I've written the final installment (Part 4) of my short series of notes on RLHF/DPO.
Thank you to everyone who read all the way to the end!
akifumi-wachi-4.github.io/website/column…

By the way, we're recruiting interns whom I'll be mentoring ↓
lycorp.co.jp/ja/recruit/lan…

fly51fly (@fly51fly)

[LG] DPO Meets PPO: Reinforced Token Optimization for RLHF
arxiv.org/abs/2404.18922
- This paper models RLHF as an MDP, offering a token-wise characterization of LLM's generation process. It theoretically demonstrates advantages of token-wise MDP over sentence-wise bandit…
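
My shorthand for the contrast (not the paper's notation): the bandit view treats the whole response as one action with a single terminal reward, while the token-wise MDP view treats each token as an action taken in state (x, y_{<t}) with its own reward,

```latex
\text{bandit: } \;\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\qquad \text{vs.} \qquad
\text{token MDP: } \;\mathbb{E}_{\pi}\Big[ \textstyle\sum_{t} r_t\big( (x, y_{<t}),\, y_t \big) \Big]
```

which is what makes token-level credit assignment possible.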

Aran Komatsuzaki (@arankomatsuzaki)

Self-Play Preference Optimization for Language Model Alignment

SPPO serves as the RLHF counterpart of SPIN and outperforms iterative DPO, Snorkel AI, Self-Rewarding LM, GPT-4 0613 etc

arxiv.org/abs/2405.00675

machine learning (@Mlearning_ai)

Why Human Feedback?
Learning from Human Preferences

Reinforcement learning from human feedback (RLHF)

The article provides a detailed overview of reinforcement learning from human feedback (RLHF), emphasizing its integration with human-computer interaction to refine AI…

Teortaxes▶️ (@teortaxesTex)

GPT '3.5' was the biggest psyop in AI – starting with the very labeling. In retrospect it's clear that the LLM behind ChatGPT had zero relation to the davinci 175B@500B series and was more like overtrained late 2023 models. But for a year+, we had to debate the Magic Of RLHF.

Overly Literate Skater (@0xflashmine)

'Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full…
