Andrew Carr (e/🤸) (@andrew_n_carr)'s Twitter Profile
Andrew Carr (e/🤸)

@andrew_n_carr

Co-founder leading science @getcartwheel · AI writer @tldrnewsletter · Advisor @arcade_ai · Past: Codegen @OpenAI, Brain @GoogleAI · World-ranked Tetris player

ID: 3378986176

Link: https://getcartwheel.com · Joined: 16-07-2015 15:36:33

6.6K Tweets

16.16K Followers

3.3K Following

Sander Dieleman (@sedielem)'s Twitter Profile Photo

Diffusion is the rising tide that eventually submerges all frequencies, high and low 🌊 Diffusion is the gradual decomposition into feature scales, fine and coarse 🗼 Diffusion is just spectral autoregression 🤷🌈
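
A minimal sketch of that intuition (ours, not from the tweet): white Gaussian noise has a flat power spectrum while image-like signals decay with frequency, so the noise floor overtakes the signal from the high-frequency end first and, at large enough noise levels, submerges the low frequencies too. Assumes numpy; all names below are illustrative.

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    counts = np.bincount(r.ravel())
    return np.bincount(r.ravel(), weights=power.ravel()) / counts

rng = np.random.default_rng(0)
# A smooth surrogate "image" with a decaying spectrum (a Brownian sheet).
img = rng.standard_normal((128, 128)).cumsum(axis=0).cumsum(axis=1)
img = (img - img.mean()) / img.std()

for sigma in (0.0, 1.0, 32.0):
    noisy = img + sigma * rng.standard_normal(img.shape)
    spec = radial_power_spectrum(noisy)
    # The flat noise floor swamps the high-frequency bin long before the
    # low-frequency bin: diffusion's "rising tide" over feature scales.
    print(f"sigma={sigma:>4}: low-freq {spec[2]:.3g}, high-freq {spec[60]:.3g}")
```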

Andrew Carr (e/🤸) (@andrew_n_carr)'s Twitter Profile Photo

Here's a quick outline of the learning rate schedule for Llama 3.1 - this is likely the simplest and most powerful "shape" for a schedule when training models.

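A sketch of that shape, linear warmup followed by cosine decay; the constants below are illustrative placeholders, not Llama 3.1's published values:

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=3e-6, warmup_steps=2_000, total_steps=100_000):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1_000, 2_000, 50_000, 100_000):
    print(f"step {s:>7}: lr = {lr_at(s):.2e}")
```
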
Andrew Carr (e/🤸) (@andrew_n_carr)'s Twitter Profile Photo

There are really three classes of language models:

1. frontier reasoning models for solving coding problems
2. reliable, specific models for product deployment
3. good enough and fast models for batch processing

Andrew Carr (e/🤸) (@andrew_n_carr)'s Twitter Profile Photo

When training the SAEs for GemmaScope, the team had to store 17 petabytes of activations. They found, though, that if the activations were reused within 100 days, storing them was cheaper than recomputing them.

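A back-of-envelope sketch of that trade-off; every price and the recompute figure below are made-up assumptions for illustration (the tweet only gives the 17 PB size and the ~100-day break-even):

```python
# Hypothetical numbers, chosen only to show the shape of the calculation.
PB_MONTH_USD = 20_000.0      # assumed object-storage price per petabyte-month
ACTIVATIONS_PB = 17.0        # from the tweet
RECOMPUTE_USD = 2_000_000.0  # assumed one-off cost to regenerate the activations

storage_per_day = ACTIVATIONS_PB * PB_MONTH_USD / 30.0
break_even_days = RECOMPUTE_USD / storage_per_day
print(f"storage ≈ ${storage_per_day:,.0f}/day; break-even ≈ {break_even_days:.0f} days")
# Storing wins if the activations are reused before the break-even point;
# the real ~100-day figure depends on Google's internal storage and TPU costs.
```
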
Andrew Carr (e/🤸) (@andrew_n_carr)'s Twitter Profile Photo

The irony is that the 80 hr/week hustle bros are winning right now because Claude makes each marginal hour way more productive, essentially mitigating tiredness.