Davis Blalock (@davisblalock)'s Twitter Profile
Davis Blalock

@davisblalock

Research scientist + first hire @MosaicML. @MIT PhD. I write + retweet technical machine learning content. If you write a thread about your paper, tag me for RT

ID:805547773944889344

Link: http://bit.ly/3OXJbDs · Joined: 04-12-2016 23:02:10

1.2K Tweets

12.2K Followers

165 Following

Davis Blalock (@davisblalock):

One fact I didn't appreciate when I was younger is that the '10,000 hour rule' is a joke. Like, 10k hours is less than 4 years of college + internships. It's new grad level.

Not until 20k, 30k, 40k hours are you starting to get good.

Like, I'm ~30k hours into machine learning…
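A quick back-of-the-envelope check on the "10k hours is new-grad level" claim, using assumed numbers (roughly 50 hours a week of classes, projects, and internships, 50 weeks a year):

```python
# Illustrative arithmetic only; the hours-per-week figure is an assumption.
hours_per_week = 50
weeks_per_year = 50
years_of_college = 4

total_hours = hours_per_week * weeks_per_year * years_of_college
print(total_hours)  # 10000 -> about four years of college + internships
```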

Pratyush Maini @ICLR 2024 🎡 (@pratyushmaini):

1/ 🥁Scaling Laws for Data Filtering 🥁

TLDR: Data Curation *cannot* be compute agnostic!
In our #CVPR2024 paper, we develop the first scaling laws for heterogeneous & limited web data.

w/ Sachin Goyal, Zachary Lipton, Aditi Raghunathan, and Zico Kolter
📝:arxiv.org/abs/2404.07177
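A minimal sketch of what fitting a scaling law to filtered-data runs could look like, assuming a simple power-law-plus-offset form. The functional form, data points, and numbers are illustrative, not the parameterization from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative only: fit loss = a * n^(-b) + c to (tokens seen, eval loss)
# pairs for one filtered data pool, then compare pools at a given compute.
def power_law(n, a, b, c):
    return a * np.power(n, -b) + c

# Hypothetical measurements for a single filtering strategy.
tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss   = np.array([4.10, 3.72, 3.41, 3.18, 3.02])

(a, b, c), _ = curve_fit(power_law, tokens, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"fit: loss ~ {a:.2f} * n^(-{b:.3f}) + {c:.2f}")

# Extrapolating like this is where curation stops being compute-agnostic:
# an aggressively filtered pool can win at small compute but lose once it
# must be repeated many times at large compute.
print("predicted loss at 1e11 tokens:", power_law(1e11, a, b, c))
```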

Mihir Patel (@mvpatel2000):

🚨Open Source Drop🚨

Databricks is adopting MegaBlocks, and we're releasing the MegaBlocks integration into LLMFoundry. This is a critical component in our Dbrx training stack, and we're super excited to bring MoE training to the community (1/N)
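Not the MegaBlocks kernels themselves, but a minimal sketch of the top-k expert routing that an MoE layer implements, so the announcement above has some concrete shape. All names, sizes, and the dense expert loop are illustrative; the real integration lives in MegaBlocks/LLMFoundry and uses block-sparse kernels instead of this loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts FFN (dense fallback, no block-sparse kernels)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.router(x)                # [tokens, n_experts]
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # dense loop; MegaBlocks replaces this
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)                      # torch.Size([16, 64])
```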

Davis Blalock (@davisblalock):

Oh my gosh, it was so hard to keep this secret once we saw the numbers (beating GPT-3.5 and Grok with 36B active params!). Feels good man.

Vitaliy Chiley (@vitaliychiley):

Introducing DBRX: A New Standard for Open LLMs 🔔

databricks.com/blog/introduci…

💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.

Is this thread mostly written by DBRX? Yes!
🧵
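Rough arithmetic connecting the "16x 12B" framing to the 36B-active-parameter figure mentioned earlier. The 132B total and 4-of-16 expert routing are the publicly reported DBRX numbers; the shared vs. per-expert split below is a back-of-the-envelope illustration, not an official breakdown:

```python
# Back-of-the-envelope on DBRX parameter counts (illustrative split).
total_params  = 132e9   # all experts + shared params (attention, embeddings, router)
active_params = 36e9    # shared params + the experts actually used per token
n_experts, k_active = 16, 4

# total  = shared + n_experts * per_expert
# active = shared + k_active  * per_expert
per_expert = (total_params - active_params) / (n_experts - k_active)
shared     = active_params - k_active * per_expert

print(f"per-expert FFN ~ {per_expert/1e9:.0f}B, shared ~ {shared/1e9:.0f}B")
# -> per-expert FFN ~ 8B, shared ~ 4B (rough; ignores fine-grained routing details)
```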

Atli Kosson (@AtliKosson):

Why does AdamW outperform Adam with L2-regularization?

Its effectiveness seems to stem from how it affects the angular update size of weight vectors!

This may also be the case for Weight Standardization, lr warmup and weight decay in general!
🧵 for arxiv.org/abs/2305.17212 1/10
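The mechanical difference the thread is about, as a minimal NumPy sketch with illustrative hyperparameters: Adam + L2 folds the decay term into the gradient before the adaptive rescaling, while AdamW applies weight decay directly to the weights so it bypasses the 1/sqrt(v) scaling. This is the standard Adam/AdamW distinction, not the angular-update analysis from the paper:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step. `l2` = classic L2 regularization (added to the gradient,
    then rescaled per-coordinate); `decoupled_wd` = AdamW-style weight decay
    (applied directly to the weights, bypassing the adaptive rescaling)."""
    g = g + l2 * w                           # Adam + L2: decay goes through the optimizer
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - lr * decoupled_wd * w            # AdamW: decay skips the 1/sqrt(v) scaling
    return w, m, v

w = np.ones(3); m = v = np.zeros(3); g = np.array([0.1, 1.0, 10.0])
wa, *_ = adam_step(w, g, m, v, t=1, l2=0.1)            # Adam + L2
ww, *_ = adam_step(w, g, m, v, t=1, decoupled_wd=0.1)  # AdamW
print(wa, ww)  # the L2 penalty gets normalized per-coordinate; AdamW's decay does not
```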

MLflow (@MLflow):

In this MLOps Community episode, MosaicML's Davis Blalock and Bandish Shah share war stories and lessons learned from pushing the limits of LLM training and helping dozens of customers get LLMs into production. 🤝

👀 Watch the full episode: home.mlops.community/public/videos/…

Davis Blalock (@davisblalock):

What does it look like to knock a million dollars off the cost of training huge models?

For us, it looked like this:

Davis Blalock (@davisblalock):

Underappreciated: The entire public internet is maybe a few hundred terabytes of text.

This is not that big.

Many organizations have *petabytes* of domain-specific data. CERN can generate a petabyte per second (information-technology.web.cern.ch/sites/default/…).
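A rough sense of the scales involved; every number below is an order-of-magnitude assumption, not a measurement:

```python
# Order-of-magnitude comparison: public web text vs. one org's data firehose.
public_web_text_bytes = 300e12          # "a few hundred terabytes" (assumed midpoint)
bytes_per_token       = 4               # rough average for BPE-style tokenizers
web_tokens = public_web_text_bytes / bytes_per_token
print(f"~{web_tokens/1e12:.0f}T tokens of public web text")   # ~75T tokens

cern_rate_bytes_per_s = 1e15            # ~1 PB/s cited above (raw, pre-filtering)
seconds_to_match_web = public_web_text_bytes / cern_rate_bytes_per_s
print(f"CERN could generate that volume in ~{seconds_to_match_web:.1f} s")
```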

Davis Blalock (@davisblalock):

I know this is an AMD commercial, but I am so happy to see Abhi Venigalla getting airtime. The man should be a top 5 name in LLMs, but just quietly does his job making . successful instead of seeking attention.

Kangwook Lee (@Kangwook_Lee):

🧵Let me explain why the early ascent phenomenon occurs🔥

We must first understand that in-context learning exhibits two distinct modes.

When given samples from a novel task, the model actually learns the pattern from the examples.

We call this mode the 'task learning' mode.

Davis Blalock (@davisblalock):

A fantastic post on large-scale infra pain. If you've wondered why MosaicML was a unicorn, it's this. tl;dr:

Every cluster and every PyTorch library is its own unique, broken, unstable snowflake. Everything is hard at scale. Nothing 'just works.'

We get paid to abstract this…
