Stas Bekman (@stasbekman)'s Twitter Profile
Stas Bekman

@stasbekman

Toolmaker. Software creator, optimizer and harmonizer.

Makes things work and fly at @ContextualAI

Training LLM/RAG/Generative AI/Machine Learning/Scalability

ID: 1068360975898660864

Link: https://stasosphere.com/machine-learning/
Joined: 30-11-2018 04:28:00

1.1K Tweets

7.7K Followers

273 Following

Stas Bekman (@stasbekman)

Hear, hear, I'm excited to introduce a new performance metric: Maximum Achievable Matmul FLOPS (MAMF):

github.com/stas00/ml-engi…

Please read the notes at the URL above to see what's what. I have included the first measurements (snapshot). As I get access to more accelerators
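
To make the metric concrete, here is a rough sketch (my own illustration, not the benchmark script from the repo) of measuring achievable matmul TFLOPS on a single shape; the real measurement scans many shapes, and the shape and iteration counts below are arbitrary assumptions:

```python
# Time a large bf16 matmul and convert the measurement to TFLOPS.
# The shape and iteration counts are illustrative, not the repo's defaults.
import time
import torch

def achievable_tflops(m, n, k, dtype=torch.bfloat16, iters=100):
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")
    for _ in range(10):                       # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters
    return 2 * m * n * k / elapsed / 1e12     # a matmul does 2*M*N*K FLOPs

print(f"{achievable_tflops(16384, 16384, 16384):.1f} TFLOPS")
```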
Stas Bekman (@stasbekman)

If you need a working example of a cross-accelerator script, this one supports:

- NVIDIA: V100, A100, H100, ...
- AMD: MI250, MI300X, ...
- Intel Gaudi2+

github.com/stas00/ml-engi…

It, of course, doesn't cover everything, but it's a good starting point.

Thanks to Imtiaz
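
As a minimal sketch (mine, not the script linked above) of the kind of device selection such a cross-accelerator script needs, assuming a CUDA or ROCm build of torch and the Gaudi PyTorch bridge where applicable:

```python
# Pick a torch device across vendors. torch.version.hip is set on ROCm builds,
# so the "cuda" device covers both NVIDIA and AMD; Intel Gaudi exposes an "hpu"
# device once the habana_frameworks bridge has been imported.
import torch

def pick_device():
    if torch.cuda.is_available():
        vendor = "amd" if torch.version.hip else "nvidia"
        return torch.device("cuda"), vendor
    try:
        import habana_frameworks.torch.core  # noqa: F401  (Gaudi bridge)
        return torch.device("hpu"), "intel-gaudi"
    except ImportError:
        return torch.device("cpu"), "cpu"

device, vendor = pick_device()
x = torch.randn(4, 4, device=device)
print(vendor, (x @ x).sum().item())
```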
Stas Bekman (@stasbekman)

Has github.com/Dao-AILab/flas… got faster than torch's SDPA? I'm seeing ~15-20% faster throughput with FA2@main.

Back in Feb-24 I clocked the two to give about the same training speed, so I switched to SDPA since it was built in. I have just retested with torch-2.3.1 vs FA2 built
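
For reference, a micro-benchmark along these lines might look like the sketch below; the shapes, dtype, and iteration counts are my own assumptions, and it times a single forward attention call rather than end-to-end training throughput:

```python
# Times one forward attention call with torch SDPA and with flash-attn 2.
# Shapes and dtype are illustrative assumptions, not the tweet's setup.
import time
import torch
from torch.nn.functional import scaled_dot_product_attention as sdpa
from flash_attn import flash_attn_func  # pip install flash-attn

def bench(fn, *args, iters=50):
    for _ in range(5):                        # warmup
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

b, h, s, d = 4, 32, 4096, 128
q = torch.randn(b, h, s, d, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
# SDPA wants (batch, heads, seq, dim); flash_attn_func wants (batch, seq, heads, dim)
qf, kf, vf = (t.transpose(1, 2).contiguous() for t in (q, k, v))

t_sdpa, t_fa2 = bench(sdpa, q, k, v), bench(flash_attn_func, qf, kf, vf)
print(f"SDPA {t_sdpa*1e3:.2f} ms | FA2 {t_fa2*1e3:.2f} ms | ratio {t_sdpa/t_fa2:.2f}x")
```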

Stas Bekman (@stasbekman)

If your CI depends on orjson, it's likely broken now, as the new release is missing wheels for Linux x86_64: github.com/ijl/orjson/iss…

Workaround: pin to `orjson==3.10.6`

Thanks to github.com/smallsam for suggesting the workaround.
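
If it helps, the pin can live in whatever requirements or constraints file your CI already installs from (the file name below is just an example):

```
# constraints.txt -- temporary pin until upstream ships Linux x86_64 wheels again
orjson==3.10.6
```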

Byron Hsu (@hsu_byron)

We are thrilled to receive so much positive feedback on Liger Kernel! We're delighted that the open-source community values our contributions to LLM training. 

github.com/linkedin/Liger…

Within just one day:

1. Our repository stars increased from 0 to 800
2. Andrej Karpathy,
Stas Bekman (@stasbekman)

How to choose an inference framework

A list of questions to ask oneself before committing to one.

github.com/stas00/ml-engi…

Please let me know if I've missed some critical questions/considerations. Thank you.
Stas Bekman (@stasbekman)

Inference: 20 tokens per second per user is all you need.

The interesting thing about online inference is that, unlike normal web serving, it doesn't have to be as fast as possible, since it doesn't have to return the full generated response at once.

Moreover depending on the
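
A quick back-of-the-envelope check of why ~20 tokens/sec per user is plenty for a human reader (the words-per-token and reading-speed figures are common rules of thumb, not numbers from the tweet):

```python
# ~0.75 English words per token is a common rule of thumb; a fast reader
# does roughly 300 words per minute, i.e. about 5 words per second.
tokens_per_sec = 20
words_per_token = 0.75
reader_wps = 300 / 60

stream_wps = tokens_per_sec * words_per_token        # ~15 words/sec streamed
print(f"stream {stream_wps:.0f} w/s vs reader {reader_wps:.0f} w/s "
      f"-> {stream_wps / reader_wps:.1f}x headroom")
```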
Stas Bekman (@stasbekman)

If you use an openai completions client, make sure you use its asyncio API; otherwise you might be losing 5-10% of throughput to the blocking IO client.

github.com/vllm-project/v…

github.com/openai/openai-…

h/t github.com/njhill for the recommendation.
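
A minimal sketch of what that looks like with the openai Python package's async client; the base URL, model name, and prompts are placeholders (pointed at a local OpenAI-compatible server such as vLLM):

```python
# Fire many completion requests concurrently with the async client; the
# base_url, api_key, model name, and prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(model="my-model", prompt=prompt, max_tokens=64)
    return resp.choices[0].text

async def main():
    prompts = [f"Prompt {i}:" for i in range(32)]
    # asyncio.gather keeps the requests in flight concurrently, which is
    # where the async client wins over the blocking one
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"got {len(results)} completions")

asyncio.run(main())
```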
Stas Bekman (@stasbekman)

This is important work from the IBM and Hugging Face teams that will benefit us all: automating the packing of uneven-length samples and making it work correctly with FlashAttention, for much higher throughput and memory savings on the pad tokens.
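
As a rough sketch of the underlying idea (mine, not the IBM/HF implementation): uneven-length samples are concatenated with no padding, and attention is kept from crossing sample boundaries by handing the variable-length FlashAttention kernels cumulative sequence lengths plus per-sample position ids:

```python
# Build the metadata that variable-length FlashAttention needs for a packed batch.
# The sample lengths are made up; in training they'd come from the tokenizer.
import torch
import torch.nn.functional as F

sample_lengths = [7, 3, 12, 5]                 # tokens per sample (illustrative)
lengths = torch.tensor(sample_lengths, dtype=torch.int32)

# cumulative boundaries [0, 7, 10, 22, 27]: the cu_seqlens that the varlen
# FlashAttention kernels take, so attention never crosses a sample boundary
cu_seqlens = F.pad(lengths.cumsum(0), (1, 0)).to(torch.int32)

# position ids restart at 0 for every sample, keeping RoPE / masking consistent
position_ids = torch.cat([torch.arange(n) for n in sample_lengths])

print(cu_seqlens.tolist())                     # [0, 7, 10, 22, 27]
print(position_ids.tolist())
```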

Contextual AI (@contextualai)

Thank you NVIDIA for your ongoing partnership and covering us in your blog today. Contextual AI is proud to work with NVIDIA to develop and bring the next generation of LLMs, powered by RAG 2.0, to market.

If you're building a production-ready enterprise use case using RAG,
Stas Bekman (@stasbekman)

semianalysis.com/p/the-memory-w…

This is an important study: in case you're not aware, the new accelerators' efficiency keeps going down because compute is speeding up faster than HBM is. So, for example, H100 is less efficient than H200, even though the compute is the
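
To illustrate the compute-vs-bandwidth gap with rough numbers (the peak figures below are approximate public specs quoted from memory; verify against the vendor datasheets before relying on them):

```python
# FLOPs available per byte of HBM traffic for a few accelerators.
# Peak numbers are rounded approximations of public specs.
specs = {
    # name: (dense BF16 TFLOPS, HBM bandwidth in TB/s)
    "A100-80GB": (312, 2.0),
    "H100-SXM":  (989, 3.35),
    "H200":      (989, 4.8),
}
for name, (tflops, tbps) in specs.items():
    # higher FLOPs/byte means it's harder to keep the compute units fed,
    # i.e. more workloads end up memory-bandwidth-bound
    print(f"{name}: ~{tflops / tbps:.0f} FLOPs per HBM byte")
```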

Niklas Muennighoff (@muennighoff)

Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments

📜arxiv.org/abs/2409.02060
🧵1/9