Horace He (@chhillee)'s Twitter Profile
Horace He

@chhillee

@PyTorch "My learning style is Horace twitter threads" - @typedfemale

ID: 117233133

Link: https://www.thonking.ai/p/strangely-matrix-multiplications
Joined: 24-02-2010 23:48:25

2.2K Tweets

26.26K Followers

482 Following

Horace He (@chhillee)

Some folks had some confusion about whether FlexAttention worked on H100s. To clarify, FlexAttention runs on H100 with fairly good perf (up to ~40% faster than FA2). However, it is still about 75-85% of the FLOPs of FA3. Check out Driss's thread for more details.

Horace He (@chhillee)

Fun fact: this was actually the first novel mask implemented in FlexAttention by someone not working on FlexAttention!
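A minimal sketch of how a custom mask is expressed through FlexAttention's mask_mod hook (PyTorch 2.5+); the tweet doesn't say which mask it refers to, so the causal mask and tensor shapes below are stand-ins rather than the mask in question:

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # arbitrary example sizes

def causal_mask(b, h, q_idx, kv_idx):
    # mask_mod returns True wherever attention is allowed
    return q_idx >= kv_idx

# Precompute a sparse block mask so fully-masked blocks are skipped entirely
block_mask = create_block_mask(causal_mask, B=B, H=H, Q_LEN=S, KV_LEN=S)

q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

out = flex_attention(q, k, v, block_mask=block_mask)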

Horace He (@chhillee)

We’re planning on adding learned bias support for FlexAttention. Unfortunately, this is somewhat nontrivial to do efficiently and generically. So, for flashattention, do you usually care more about 1. good performance or 2. avoiding quadratic memory usage?
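A minimal sketch of what learned-bias support could look like through FlexAttention's score_mod hook; the bias tensor and shapes are illustrative assumptions, and making gradients flow back into such a captured tensor efficiently and generically is exactly the nontrivial part mentioned above:

import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 1024, 64  # arbitrary example sizes

# A dense (H, S, S) learned bias is the simplest formulation, but materializing
# it is precisely the quadratic memory cost the poll asks about.
bias = torch.zeros(H, S, S, device="cuda", requires_grad=True)

def learned_bias(score, b, h, q_idx, kv_idx):
    # score_mod is called per (query, key) position and can read captured tensors
    return score + bias[h, q_idx, kv_idx]

q = torch.randn(B, H, S, D, device="cuda")
k = torch.randn(B, H, S, D, device="cuda")
v = torch.randn(B, H, S, D, device="cuda")

out = flex_attention(q, k, v, score_mod=learned_bias)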

Mike Shou (@mikeshou1)

Show-o update 🔥:
1. We have released training code on GitHub, including both pre-training and instruction tuning! 🔥
2. Added a FlexAttention implementation for a big speedup. Thanks Horace He 🚀 github.com/showlab/Show-o…
3. Gradio demo is up 🤗 huggingface.co/spaces/showlab…
Have fun!

LMSys Open Source (@lmsys_oss)

We're excited to announce the release of SGLang v0.3, featuring enhanced performance and extended support for novel architectures! Highlights include:

- Up to 7x higher throughput for DeepSeek Multi-Head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on…