Davis Blalock (@davisblalock)'s Twitter Profile
Davis Blalock

@davisblalock

Research scientist + first hire @MosaicML. @MIT PhD. I write + retweet technical machine learning content. If you write a thread about your paper, tag me for RT

ID:805547773944889344

Link: http://bit.ly/3OXJbDs · Joined: 04-12-2016 23:02:10

1.2K Tweets

12.2K Followers

165 Following

Davis Blalock (@davisblalock):

One fact I didn't appreciate when I was younger is that the '10,000 hour rule' is a joke. Like, 10k hours is less than 4 years of college + internships. It's new grad level.

Not until 20k, 30k, 40k hours are you starting to get good.

Like, I'm ~30k hours into machine learning…
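A quick back-of-the-envelope check on the "10k hours is new-grad level" claim, using assumed numbers (roughly 50 hours a week of classes, projects, and internships, 50 weeks a year):

```python
# Illustrative arithmetic only; the hours-per-week figure is an assumption.
hours_per_week = 50
weeks_per_year = 50
years_of_college = 4

total_hours = hours_per_week * weeks_per_year * years_of_college
print(total_hours)  # 10000 -> about four years of college + internships
```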

Pratyush Maini @ICLR 2024 🎡 (@pratyushmaini):

1/ 🥁Scaling Laws for Data Filtering 🥁

TLDR: Data Curation *cannot* be compute agnostic!
In our #CVPR2024 paper, we develop the first scaling laws for heterogeneous & limited web data.

w/ Sachin Goyal, Zachary Lipton, Aditi Raghunathan, and Zico Kolter
📝:arxiv.org/abs/2404.07177
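A minimal sketch of what fitting a scaling law to filtered-data runs could look like, assuming a simple power-law-plus-offset form. The functional form, data points, and numbers are illustrative, not the parameterization from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative only: fit loss = a * n^(-b) + c to (tokens seen, eval loss)
# pairs for one filtered data pool, then compare pools at a given compute.
def power_law(n, a, b, c):
    return a * np.power(n, -b) + c

# Hypothetical measurements for a single filtering strategy.
tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss   = np.array([4.10, 3.72, 3.41, 3.18, 3.02])

(a, b, c), _ = curve_fit(power_law, tokens, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"fit: loss ~ {a:.2f} * n^(-{b:.3f}) + {c:.2f}")

# Extrapolating like this is where curation stops being compute-agnostic:
# an aggressively filtered pool can win at small compute but lose once it
# must be repeated many times at large compute.
print("predicted loss at 1e11 tokens:", power_law(1e11, a, b, c))
```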

Mihir Patel (@mvpatel2000):

🚨Open Source Drop🚨

Databricks is adopting MegaBlocks, and we're releasing the MegaBlocks integration into LLMFoundry. This is a critical component in our Dbrx training stack, and we're super excited to bring MoE training to the community (1/N)
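Not the MegaBlocks kernels themselves, but a minimal sketch of the top-k expert routing that an MoE layer implements, so the announcement above has some concrete shape. All names, sizes, and the dense expert loop are illustrative; the real integration lives in MegaBlocks/LLMFoundry and uses block-sparse kernels instead of this loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts FFN (dense fallback, no block-sparse kernels)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.router(x)                # [tokens, n_experts]
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # dense loop; MegaBlocks replaces this
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)                      # torch.Size([16, 64])
```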

Davis Blalock (@davisblalock):

Oh my gosh, it was so hard to keep this secret once we saw the numbers (beating GPT-3.5 and Grok with 36B active params!). Feels good man.

Vitaliy Chiley (@vitaliychiley):

Introducing DBRX: A New Standard for Open LLMs 🔔

databricks.com/blog/introduci…

💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.

Is this thread mostly written by DBRX? Yes!
🧵
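Rough arithmetic connecting the "16x 12B" framing to the 36B-active-parameter figure mentioned earlier. The 132B total and 4-of-16 expert routing are the publicly reported DBRX numbers; the shared vs. per-expert split below is a back-of-the-envelope illustration, not an official breakdown:

```python
# Back-of-the-envelope on DBRX parameter counts (illustrative split).
total_params  = 132e9   # all experts + shared params (attention, embeddings, router)
active_params = 36e9    # shared params + the experts actually used per token
n_experts, k_active = 16, 4

# total  = shared + n_experts * per_expert
# active = shared + k_active  * per_expert
per_expert = (total_params - active_params) / (n_experts - k_active)
shared     = active_params - k_active * per_expert

print(f"per-expert FFN ~ {per_expert/1e9:.0f}B, shared ~ {shared/1e9:.0f}B")
# -> per-expert FFN ~ 8B, shared ~ 4B (rough; ignores fine-grained routing details)
```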

Atli Kosson (@AtliKosson):

Why does AdamW outperform Adam with L2-regularization?

Its effectiveness seems to stem from how it affects the angular update size of weight vectors!

This may also be the case for Weight Standardization, lr warmup and weight decay in general!
🧵 for arxiv.org/abs/2305.17212 1/10
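The mechanical difference the thread is about, as a minimal NumPy sketch with illustrative hyperparameters: Adam + L2 folds the decay term into the gradient before the adaptive rescaling, while AdamW applies weight decay directly to the weights so it bypasses the 1/sqrt(v) scaling. This is the standard Adam/AdamW distinction, not the angular-update analysis from the paper:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step. `l2` = classic L2 regularization (added to the gradient,
    then rescaled per-coordinate); `decoupled_wd` = AdamW-style weight decay
    (applied directly to the weights, bypassing the adaptive rescaling)."""
    g = g + l2 * w                           # Adam + L2: decay goes through the optimizer
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - lr * decoupled_wd * w            # AdamW: decay skips the 1/sqrt(v) scaling
    return w, m, v

w = np.ones(3); m = v = np.zeros(3); g = np.array([0.1, 1.0, 10.0])
wa, *_ = adam_step(w, g, m, v, t=1, l2=0.1)            # Adam + L2
ww, *_ = adam_step(w, g, m, v, t=1, decoupled_wd=0.1)  # AdamW
print(wa, ww)  # the L2 penalty gets normalized per-coordinate; AdamW's decay does not
```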

MLflow (@MLflow):

In this MLOps Community episode, MosaicML's Davis Blalock and Bandish Shah share war stories and lessons learned from pushing the limits of LLM training and helping dozens of customers get LLMs into production. 🤝

👀 Watch the full episode: home.mlops.community/public/videos/…

Davis Blalock (@davisblalock):

What does it look like to knock a million dollars off the cost of training huge models?

For us, it looked like this:

Davis Blalock (@davisblalock):

Underappreciated: The entire public internet is maybe a few hundred terabytes of text.

This is not that big.

Many organizations have *petabytes* of domain-specific data. CERN can generate a petabyte per second (information-technology.web.cern.ch/sites/default/…).
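A rough sense of the scales involved; every number below is an order-of-magnitude assumption, not a measurement:

```python
# Order-of-magnitude comparison: public web text vs. one org's data firehose.
public_web_text_bytes = 300e12          # "a few hundred terabytes" (assumed midpoint)
bytes_per_token       = 4               # rough average for BPE-style tokenizers
web_tokens = public_web_text_bytes / bytes_per_token
print(f"~{web_tokens/1e12:.0f}T tokens of public web text")   # ~75T tokens

cern_rate_bytes_per_s = 1e15            # ~1 PB/s cited above (raw, pre-filtering)
seconds_to_match_web = public_web_text_bytes / cern_rate_bytes_per_s
print(f"CERN could generate that volume in ~{seconds_to_match_web:.1f} s")
```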

Davis Blalock (@davisblalock):

I know this is an AMD commercial, but I am so happy to see Abhi Venigalla getting airtime. The man should be a top 5 name in LLMs, but just quietly does his job making . successful instead of seeking attention.

Kangwook Lee (@Kangwook_Lee):

🧵Let me explain why the early ascent phenomenon occurs🔥

We must first understand that in-context learning exhibits two distinct modes.

When given samples from a novel task, the model actually learns the pattern from the examples.

We call this mode the 'task learning' mode.

Davis Blalock (@davisblalock):

A fantastic post on large-scale infra pain. If you've wondered why MosaicML was a unicorn, it's this. tl;dr:

Every cluster and every PyTorch library is its own unique, broken, unstable snowflake. Everything is hard at scale. Nothing 'just works.'

We get paid to abstract this…
