Stas Bekman (@stasbekman)'s Twitter Profile
Stas Bekman

@stasbekman

Toolmaker. Software creator, optimizer and harmonizer.

Makes things work and fly at @ContextualAI

Training LLM/RAG/Generative AI/Machine Learning/Scalability

ID: 1068360975898660864

Link: https://stasosphere.com/machine-learning/
Joined: 30-11-2018 04:28:00

1.1K Tweets

7.7K Followers

273 Following

Stas Bekman (@stasbekman)

Hear, hear, I'm excited to introduce a new performance metric: Maximum Achievable Matmul FLOPS (MAMF):

github.com/stas00/ml-engi…

Please read the notes at the URL above to see what's what. I have included the first measurements (snapshot). As I get access to more accelerators
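
To make the metric concrete, here is a rough sketch (my own illustration, not the benchmark script from the repo) of measuring achievable matmul TFLOPS on a single shape; the real measurement scans many shapes, and the shape and iteration counts below are arbitrary assumptions:

```python
# Time a large bf16 matmul and convert the measurement to TFLOPS.
# The shape and iteration counts are illustrative, not the repo's defaults.
import time
import torch

def achievable_tflops(m, n, k, dtype=torch.bfloat16, iters=100):
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")
    for _ in range(10):                       # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters
    return 2 * m * n * k / elapsed / 1e12     # a matmul does 2*M*N*K FLOPs

print(f"{achievable_tflops(16384, 16384, 16384):.1f} TFLOPS")
```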
Stas Bekman (@stasbekman)

If you need a working example of a cross-accelerator script, this one supports:

- NVIDIA: V100, A100, H100, ...
- AMD: MI250, MI300X, ...
- Intel Gaudi2+

github.com/stas00/ml-engi…

It, of course, doesn't cover everything, but it's a good starting point.

Thanks to Imtiaz
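
As a minimal sketch (mine, not the script linked above) of the kind of device selection such a cross-accelerator script needs, assuming a CUDA or ROCm build of torch and the Gaudi PyTorch bridge where applicable:

```python
# Pick a torch device across vendors. torch.version.hip is set on ROCm builds,
# so the "cuda" device covers both NVIDIA and AMD; Intel Gaudi exposes an "hpu"
# device once the habana_frameworks bridge has been imported.
import torch

def pick_device():
    if torch.cuda.is_available():
        vendor = "amd" if torch.version.hip else "nvidia"
        return torch.device("cuda"), vendor
    try:
        import habana_frameworks.torch.core  # noqa: F401  (Gaudi bridge)
        return torch.device("hpu"), "intel-gaudi"
    except ImportError:
        return torch.device("cpu"), "cpu"

device, vendor = pick_device()
x = torch.randn(4, 4, device=device)
print(vendor, (x @ x).sum().item())
```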
Stas Bekman (@stasbekman)

Has github.com/Dao-AILab/flas… got faster than torch's SDPA? I'm seeing ~15-20% faster throughput with FA2@main.

Back in Feb-24 I clocked the two to give about the same training speed, so I switched to SDPA since it was built in. I have just retested with torch-2.3.1 vs FA2 built
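
For reference, a micro-benchmark along these lines might look like the sketch below; the shapes, dtype, and iteration counts are my own assumptions, and it times a single forward attention call rather than end-to-end training throughput:

```python
# Times one forward attention call with torch SDPA and with flash-attn 2.
# Shapes and dtype are illustrative assumptions, not the tweet's setup.
import time
import torch
from torch.nn.functional import scaled_dot_product_attention as sdpa
from flash_attn import flash_attn_func  # pip install flash-attn

def bench(fn, *args, iters=50):
    for _ in range(5):                        # warmup
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

b, h, s, d = 4, 32, 4096, 128
q = torch.randn(b, h, s, d, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
# SDPA wants (batch, heads, seq, dim); flash_attn_func wants (batch, seq, heads, dim)
qf, kf, vf = (t.transpose(1, 2).contiguous() for t in (q, k, v))

t_sdpa, t_fa2 = bench(sdpa, q, k, v), bench(flash_attn_func, qf, kf, vf)
print(f"SDPA {t_sdpa*1e3:.2f} ms | FA2 {t_fa2*1e3:.2f} ms | ratio {t_sdpa/t_fa2:.2f}x")
```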

Stas Bekman (@stasbekman)

If your CI depends on orjson, it's likely broken now, as the new release is missing wheels for Linux x86_64: github.com/ijl/orjson/iss…

Workaround: pin to `orjson==3.10.6`

Thanks to github.com/smallsam for suggesting the workaround.
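
If it helps, the pin can live in whatever requirements or constraints file your CI already installs from (the file name below is just an example):

```
# constraints.txt -- temporary pin until upstream ships Linux x86_64 wheels again
orjson==3.10.6
```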

Byron Hsu (@hsu_byron)

We are thrilled to receive so much positive feedback on Liger Kernel! We're delighted that the open-source community values our contributions to LLM training. 

github.com/linkedin/Liger…

Within just one day:

1. Our repository stars increased from 0 to 800
2. Andrej Karpathy,
Stas Bekman (@stasbekman)

How to choose an inference framework

A list of questions to ask oneself before committing to one.

github.com/stas00/ml-engi…

Please let me know if I've missed some critical questions/considerations. Thank you.
Stas Bekman (@stasbekman)

Inference: 20 tokens per second per user is all you need.

The interesting thing about online inference is that, unlike normal web serving, it doesn't have to be as fast as possible, since it doesn't have to return the full generated response at once.

Moreover depending on the
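
A quick back-of-the-envelope check of why ~20 tokens/sec per user is plenty for a human reader (the words-per-token and reading-speed figures are common rules of thumb, not numbers from the tweet):

```python
# ~0.75 English words per token is a common rule of thumb; a fast reader
# does roughly 300 words per minute, i.e. about 5 words per second.
tokens_per_sec = 20
words_per_token = 0.75
reader_wps = 300 / 60

stream_wps = tokens_per_sec * words_per_token        # ~15 words/sec streamed
print(f"stream {stream_wps:.0f} w/s vs reader {reader_wps:.0f} w/s "
      f"-> {stream_wps / reader_wps:.1f}x headroom")
```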
Stas Bekman (@stasbekman)

If you use an openai completions client, make sure you use its asyncio API; otherwise you might be losing 5-10% of throughput to the blocking IO client.

github.com/vllm-project/v…

github.com/openai/openai-…

h/t github.com/njhill for the recommendation.
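
A minimal sketch of what that looks like with the openai Python package's async client; the base URL, model name, and prompts are placeholders (pointed at a local OpenAI-compatible server such as vLLM):

```python
# Fire many completion requests concurrently with the async client; the
# base_url, api_key, model name, and prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(model="my-model", prompt=prompt, max_tokens=64)
    return resp.choices[0].text

async def main():
    prompts = [f"Prompt {i}:" for i in range(32)]
    # asyncio.gather keeps the requests in flight concurrently, which is
    # where the async client wins over the blocking one
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(f"got {len(results)} completions")

asyncio.run(main())
```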
Stas Bekman (@stasbekman)

This is important work from the IBM and Hugging Face teams that will benefit us all: automating the packing of uneven-length samples and making it work correctly with FlashAttention, for much higher throughput and memory savings on the pad tokens.
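
As a rough sketch of the underlying idea (mine, not the IBM/HF implementation): uneven-length samples are concatenated with no padding, and attention is kept from crossing sample boundaries by handing the variable-length FlashAttention kernels cumulative sequence lengths plus per-sample position ids:

```python
# Build the metadata that variable-length FlashAttention needs for a packed batch.
# The sample lengths are made up; in training they'd come from the tokenizer.
import torch
import torch.nn.functional as F

sample_lengths = [7, 3, 12, 5]                 # tokens per sample (illustrative)
lengths = torch.tensor(sample_lengths, dtype=torch.int32)

# cumulative boundaries [0, 7, 10, 22, 27]: the cu_seqlens that the varlen
# FlashAttention kernels take, so attention never crosses a sample boundary
cu_seqlens = F.pad(lengths.cumsum(0), (1, 0)).to(torch.int32)

# position ids restart at 0 for every sample, keeping RoPE / masking consistent
position_ids = torch.cat([torch.arange(n) for n in sample_lengths])

print(cu_seqlens.tolist())                     # [0, 7, 10, 22, 27]
print(position_ids.tolist())
```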

Contextual AI (@contextualai)

Thank you NVIDIA for your ongoing partnership and covering us in your blog today. Contextual AI is proud to work with NVIDIA to develop and bring the next generation of LLMs, powered by RAG 2.0, to market.

If you're building a production-ready enterprise use case using RAG,
Stas Bekman (@stasbekman)

semianalysis.com/p/the-memory-w…

This is an important study: in case you're not aware, the new accelerators' efficiency keeps going down because compute is speeding up faster than HBM is. So, for example, H100 is less efficient than H200, even though the compute is the
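
To illustrate the compute-vs-bandwidth gap with rough numbers (the peak figures below are approximate public specs quoted from memory; verify against the vendor datasheets before relying on them):

```python
# FLOPs available per byte of HBM traffic for a few accelerators.
# Peak numbers are rounded approximations of public specs.
specs = {
    # name: (dense BF16 TFLOPS, HBM bandwidth in TB/s)
    "A100-80GB": (312, 2.0),
    "H100-SXM":  (989, 3.35),
    "H200":      (989, 4.8),
}
for name, (tflops, tbps) in specs.items():
    # higher FLOPs/byte means it's harder to keep the compute units fed,
    # i.e. more workloads end up memory-bandwidth-bound
    print(f"{name}: ~{tflops / tbps:.0f} FLOPs per HBM byte")
```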

Niklas Muennighoff (@muennighoff)

Releasing OLMoE - the first good Mixture-of-Experts LLM that's 100% open-source
- 1B active, 7B total params for 5T tokens
- Best small LLM & matches more costly ones like Gemma, Llama
- Open Model/Data/Code/Logs + lots of analysis & experiments

📜arxiv.org/abs/2409.02060
🧵1/9