Pang Wei Koh (@pangweikoh) 's Twitter Profile
Pang Wei Koh

@pangweikoh

Assistant professor at @uwcse. Formerly @StanfordAILab @GoogleAI @Coursera. 🇸🇬

ID: 1273467805283659777

Link: https://koh.pw · Joined 18-06-2020 04:09:26

279 Tweets

3.3K Followers

880 Following

Yi Tay (@yitayml) 's Twitter Profile Photo

Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐

In this blog post, I discuss:
1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Chelsea Finn (@chelseabfinn) 's Twitter Profile Photo

I’m really excited to be starting a new adventure with multiple amazing friends & colleagues. Our company is called Physical Intelligence (Pi or π, like the policy). A short thread 🧵

Emma Pierson (@2plus2make5) 's Twitter Profile Photo

Raj Movva, Pang Wei Koh, and I write for Nature Medicine on using unlabeled data to improve generalization + fairness of medical AI models: nature.com/articles/s4159… We highlight two nice recent papers illustrating this: nature.com/articles/s4159…, nature.com/articles/s4159….

Scott Geng (@scottgeng00) 's Twitter Profile Photo

Will training on AI-generated synthetic data lead to the next frontier of vision models?🤔 Our new paper suggests NO—for now. Synthetic data doesn't magically enable generalization beyond the generator's original training set. 📜: arxiv.org/abs/2406.05184 Details below🧵(1/n)

Tong Chen (@tomchen0) 's Twitter Profile Photo

📢Check out CopyBench, the first benchmark to evaluate non-literal copying in language model generation! ❗️Non-literal copying can occur even in models as small as 7B and is overlooked by current copyright risk mitigation methods. 🔗chentong0.github.io/copy-bench/ [1/N]

Niloofar Mireshghallah (@niloofar_mire) 's Twitter Profile Photo

📢Anyone who talked to me in the past year has heard my rant that *LLM memorization goes beyond form* & output overlap!

©️Reproducing a similar series of events, or character traits, also raises copyright issues.

👩‍⚖️In new work, we look at non-literal copying in LLMs!

arxiv.org/abs/2407.07087
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Copyright discussions around LMs have focused on literal copying of training data. With CopyBench, we show that non-literal copying is even more pervasive but overlooked by current mitigation methods. Check out Tong's tweet for the details!
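To make the literal vs. non-literal distinction concrete, here is a hedged sketch (not CopyBench's actual evaluation protocol) of a generic n-gram overlap check and why it misses non-literal copying; the ngram_overlap helper and the example strings are made up for illustration.

```python
# A hedged sketch of why literal-copying checks miss non-literal copying, the
# gap CopyBench targets. The overlap metric below is a generic n-gram check,
# not CopyBench's actual evaluation protocol.
def ngram_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of n-grams in the generation that appear verbatim in the source."""
    gen_tokens, src_tokens = generated.split(), source.split()
    if len(gen_tokens) < n:
        return 0.0
    src_ngrams = {tuple(src_tokens[i:i + n]) for i in range(len(src_tokens) - n + 1)}
    gen_ngrams = [tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1)]
    return sum(g in src_ngrams for g in gen_ngrams) / len(gen_ngrams)

source = "The detective followed the hidden map to the old lighthouse and confronted the thief at dawn."
paraphrase = "An investigator traces a secret chart to a lighthouse and catches the culprit at sunrise."

# Near-zero overlap, yet the plot (the series of events) is reproduced: a
# literal metric like this would call it safe, which is exactly the blind spot.
print(ngram_overlap(paraphrase, source))
```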

Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Instead of scaling pretraining data, can we scale the amount of data available at inference instead? Scaling RAG datastores to 1.4T tokens (on an academic budget) gives us better training-compute-optimal curves for LM & downstream performance. Check out Rulin Shao's work below!

AK (@_akhaliq) 's Twitter Profile Photo

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations.
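As a rough illustration of the retrieval-augmented setup being scaled here, passages are fetched from a datastore at inference time and prepended to the LM prompt; a minimal sketch follows. The toy embed() function, the three-passage datastore, and the retrieve() helper are illustrative stand-ins, not the paper's actual retriever or 1.4T-token index.

```python
# Minimal sketch of retrieval-augmented generation over a datastore.
# embed() and the datastore below are hypothetical toy stand-ins.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedder; a real system would use a trained dense retriever."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A tiny datastore; in the paper this is scaled to ~1.4T tokens.
datastore = [
    "Retrieval-augmented LMs condition on passages fetched at inference time.",
    "Scaling laws relate training compute to language model loss.",
    "JPEG is a lossy image compression format based on the DCT.",
]
index = np.stack([embed(p) for p in datastore])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)        # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [datastore[i] for i in top]

query = "How do retrieval-augmented language models work?"
context = "\n".join(retrieve(query))
prompt = f"{context}\n\nQuestion: {query}\nAnswer:"  # retrieved passages prepended to the LM prompt
print(prompt)
```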
Tim Dettmers (@tim_dettmers) 's Twitter Profile Photo

After 7 months on the job market, I am happy to announce:
- I joined Ai2
- Professor at Carnegie Mellon University from Fall 2025
- New bitsandbytes maintainer: Titus von Koeller

My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵

Sewon Min (@sewon__min) 's Twitter Profile Photo

📣 After graduating from @UWCSE, I am joining UC Berkeley as an Assistant Professor (affiliated with Berkeley AI Research and BerkeleyNLP) and Ai2 as a Research Scientist. I'm looking forward to tackling exciting challenges in NLP & generative AI together with new colleagues! 🐻✨

Jesse Dodge (@jessedodge) 's Twitter Profile Photo

Congrats to our team for winning two paper awards at #ACL2024!

OLMo won the Best Theme Paper award, and Dolma won a Best Resource Paper award!

All the credit goes to the whole team for the massive group effort 🎉🎉
Emma Pierson (@2plus2make5) 's Twitter Profile Photo

New paper on race adjustments in clinical algorithms in PNAS! 

Joint work with the wonderful Anna Zink and Ziad Obermeyer - see Anna's detailed thread below. 

Anna will be on the job market this year - check out her other great work at zinka88.github.io!
Eunsol Choi (@eunsolc) 's Twitter Profile Photo

My lab will move to NYU Data Science and NYU Courant this Fall! I'm excited to connect with amazing researchers at CILVR and the larger ML/NLP community in NYC. I will be recruiting students this cycle at NYU. Happy to be back in the city 🗽 on the east coast as well. I had a

Xiaochuang Han (@xiaochuanghan) 's Twitter Profile Photo

👽Have you ever accidentally opened a .jpeg file with a text editor (or a hex editor)?

Your language model can learn from these seemingly gibberish bytes and generate images with them!

Introducing *JPEG-LM* - an image generator that uses exactly the same architecture as LLMs
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Check out JPEG-LM, a fun idea led by Xiaochuang Han -- we generate images simply by training an LM on raw JPEG bytes and show that it outperforms much more complicated VQ models, especially on rare inputs.
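A minimal sketch of the core idea, assuming a byte-level vocabulary: the raw bytes of a JPEG file are treated as a token sequence and a standard causal LM is trained on next-byte prediction. The tiny model, vocabulary layout, and the "cat.jpeg" path below are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of the JPEG-LM idea: model raw JPEG bytes with an ordinary
# decoder-only LM. TinyByteLM is a stand-in, not the paper's architecture.
from pathlib import Path
import torch
import torch.nn as nn

VOCAB_SIZE = 256 + 1          # one token per byte value, plus an end-of-image token
EOI = 256

def jpeg_to_tokens(path: str) -> torch.Tensor:
    """Read a .jpeg file and map each byte to a token id."""
    data = Path(path).read_bytes()
    return torch.tensor(list(data) + [EOI], dtype=torch.long)

class TinyByteLM(nn.Module):
    """A minimal causal LM over byte tokens (stand-in for a full transformer)."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=causal_mask))

# Next-byte prediction loss, exactly as for text LMs (path is hypothetical):
# tokens = jpeg_to_tokens("cat.jpeg").unsqueeze(0)
# logits = TinyByteLM()(tokens[:, :-1])
# loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
```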

Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

OLMoE, our fully open mixture-of-experts LLM led by Niklas Muennighoff, is out! Check out the paper for details on our design decisions: expert granularity, routing, upcycling, load balancing, etc.
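For readers new to MoE layers, here is a generic sketch of top-k expert routing with an auxiliary load-balancing loss, the kind of design decisions the paper studies. The sizes, the Switch-style balancing term, and the MoELayer class below are generic illustrations, not OLMoE's actual configuration.

```python
# Hedged sketch of top-k mixture-of-experts routing with load balancing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model). Route each token to its top-k experts.
        probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        weights, chosen = probs.topk(self.top_k, dim=-1)     # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        # Load-balancing loss: encourage uniform expert usage (Switch-Transformer style).
        usage = F.one_hot(chosen[:, 0], len(self.experts)).float().mean(0)
        importance = probs.mean(0)
        aux_loss = len(self.experts) * (usage * importance).sum()
        return out, aux_loss

tokens = torch.randn(16, 64)
y, aux = MoELayer()(tokens)
print(y.shape, float(aux))
```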

Tristan Thrush (@tristanthrush) 's Twitter Profile Photo

Do you want to select great LLM pretraining data but don’t have 1000 H100s for a ton of mixture experiments?

What about a method that requires none of your own training, matches the best known existing method, and has some nice theory?

New preprint: Perplexity Correlations
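A hedged sketch of the basic intuition, assuming a made-up matrix of per-domain log-perplexities from existing public models and their benchmark scores: rank candidate pretraining domains by how strongly lower perplexity on a domain correlates with higher downstream performance, with no training of your own. The numbers and the domain_scores helper below are illustrative; the paper's actual estimator differs in detail.

```python
# Hedged sketch of data selection via perplexity correlations.
import numpy as np

# Rows = existing public models, columns = candidate pretraining domains.
# Entry [m, d] = model m's log-perplexity on held-out text from domain d (made-up numbers).
log_ppl = np.array([
    [2.1, 3.5, 2.9],
    [1.8, 3.4, 2.5],
    [1.6, 3.6, 2.2],
    [1.5, 3.3, 2.0],
])
# Downstream benchmark accuracy of each of those models (made-up numbers).
bench = np.array([0.52, 0.58, 0.66, 0.71])

def domain_scores(log_ppl: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """Per-domain Pearson correlation between benchmark score and negative log-perplexity."""
    scores = []
    for d in range(log_ppl.shape[1]):
        scores.append(np.corrcoef(-log_ppl[:, d], bench)[0, 1])
    return np.array(scores)

scores = domain_scores(log_ppl, bench)
ranked = np.argsort(-scores)
print("domain ranking (best first):", ranked, "scores:", scores.round(2))
# Domains where lower perplexity tracks higher benchmark scores would be
# upweighted when building the pretraining mixture.
```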