anton (@abacaj)'s Twitter Profile
anton

@abacaj

Software engineer. Hacking on large language models

ID:70514287

Joined: 31-08-2009 22:06:04

10.8K Tweets

36.1K Followers

518 Following

Rafael Rafailov (@rm_rafailov)

We train a family of LLMs on the tiny stories dataset and indeed verify significant model collapse in the iterative (replace) setting. However, surprisingly, in the data accumulation regime the model not only does not degrade, but improves with more iterations!

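A minimal sketch of the two training regimes the tweet contrasts, written as illustrative pseudocode: `train_model` and `sample_synthetic` are hypothetical placeholders, not functions from the paper's code.

```python
# Sketch of iterative training on model-generated data in two regimes.
# "replace": each generation trains only on the previous model's outputs.
# "accumulate": synthetic data is appended to everything seen so far.
# `train_model` and `sample_synthetic` are hypothetical helpers.

def iterate(real_data, n_generations, regime="accumulate"):
    data = list(real_data)
    model = train_model(data)  # generation 0: fit on the real corpus
    for _ in range(n_generations):
        synthetic = sample_synthetic(model, n=len(real_data))
        if regime == "replace":
            data = synthetic          # discard earlier data -> collapse
        else:
            data = data + synthetic   # keep accumulating -> no collapse
        model = train_model(data)
    return model
```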
elvis (@omarsar0)

When to Retrieve?

This new paper presents an approach to train LLMs to effectively utilize information retrieval.

It first proposes a training approach to teach an LLM to generate a special token, <RET>, when it's not confident or doesn't know the answer to a question.

The…

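A hedged sketch of the inference loop implied by the <RET> token idea: if the first pass emits <RET>, fall back to retrieval before answering. `generate` and `retrieve` are assumed helpers for illustration, not the paper's API.

```python
# If the model signals low confidence with <RET>, retrieve external
# context and answer again; otherwise trust the parametric answer.
# `generate` and `retrieve` are hypothetical helpers.

RET_TOKEN = "<RET>"

def answer(question: str) -> str:
    first_pass = generate(prompt=question)
    if RET_TOKEN in first_pass:
        docs = retrieve(question, top_k=3)          # fetch supporting passages
        context = "\n".join(docs)
        return generate(prompt=f"{context}\n\nQuestion: {question}")
    return first_pass  # model was confident enough without retrieval
```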
Aran Komatsuzaki (@arankomatsuzaki)

Meta presents Better & Faster Large Language Models via Multi-token Prediction

- training language models to predict multiple future tokens at once results in higher sample efficiency
- up to 3x faster at inference

arxiv.org/abs/2404.19737

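An illustrative sketch of what multi-token prediction heads can look like, under the assumption of a shared transformer trunk feeding k independent output heads, head i trained to predict the token i+1 steps ahead. This is not Meta's implementation, only the general idea.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """k output heads on top of a shared trunk; head i predicts token t+i+1."""
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq, d_model) from the shared trunk
        return [head(hidden) for head in self.heads]  # k sets of logits

def multi_token_loss(logits_per_head, targets):
    # targets: (batch, seq) token ids; head i is scored against the
    # sequence shifted by i+1 positions (assumption for this sketch)
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift, :].reshape(-1, logits.size(-1))
        gold = targets[:, shift:].reshape(-1)
        loss = loss + nn.functional.cross_entropy(pred, gold)
    return loss / len(logits_per_head)
```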
Jason Weston (@jaseweston)

🚨 Iterative Reasoning Preference Optimization 🚨
- Iterative algorithm for reasoning tasks: generate pairs & apply DPO+NLL
- Improves accuracy over iterations on GSM8K, MATH, ARC & beats baselines
E.g. Llama2-70B GSM8K: 55.6%->81.6% (88.7% maj32)
arxiv.org/abs/2404.19733
🧵(1/5)

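A rough sketch of a DPO+NLL objective of the kind the thread describes: a standard DPO term over (chosen, rejected) pairs plus an NLL term that keeps the likelihood of the chosen reasoning chain high. The weighting `alpha` and the tensor names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_plus_nll(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_token_logps, beta=0.1, alpha=1.0):
    # DPO term: prefer chosen over rejected, relative to the reference model
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo_loss = -F.logsigmoid(logits).mean()

    # NLL term: maximize likelihood of the chosen (correct) sequence
    nll_loss = -chosen_token_logps.mean()

    return dpo_loss + alpha * nll_loss
```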
anton (@abacaj)

Turns out you can actually just run full 32k context on a single 3090 using vllm at higher precision (bf16). Just enable 'fp8' cache dtype. This is for llama-3 8B

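A sketch of the vLLM setup the tweet describes: bf16 weights with an fp8 KV cache so the 32k context fits on a single 24 GB 3090. The model id and memory settings below are assumptions; adjust them for your checkpoint.

```python
from vllm import LLM, SamplingParams

# Weights stay at bf16; only the KV cache is quantized to fp8,
# which is what frees enough memory for the long context.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed checkpoint
    dtype="bfloat16",
    kv_cache_dtype="fp8",
    max_model_len=32768,
    gpu_memory_utilization=0.95,
)

out = llm.generate(
    ["Summarize the benefits of an fp8 KV cache."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```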