Pang Wei Koh (@pangweikoh) 's Twitter Profile
Pang Wei Koh

@pangweikoh

Assistant professor at @uwcse. Formerly @StanfordAILab @GoogleAI @Coursera. 🇸🇬

ID: 1273467805283659777

Link: https://koh.pw · Joined 18-06-2020 04:09:26

279 Tweets

3.3K Followers

880 Following

Yi Tay (@yitayml) 's Twitter Profile Photo

Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐

In this blog post, I discuss:
1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Chelsea Finn (@chelseabfinn) 's Twitter Profile Photo

I’m really excited to be starting a new adventure with multiple amazing friends & colleagues. Our company is called Physical Intelligence (Pi or π, like the policy). A short thread 🧵

Emma Pierson (@2plus2make5) 's Twitter Profile Photo

Raj Movva, Pang Wei Koh, and I write for Nature Medicine on using unlabeled data to improve generalization + fairness of medical AI models: nature.com/articles/s4159… We highlight two nice recent papers illustrating this: nature.com/articles/s4159…, nature.com/articles/s4159….

Scott Geng (@scottgeng00) 's Twitter Profile Photo

Will training on AI-generated synthetic data lead to the next frontier of vision models?🤔 Our new paper suggests NO—for now. Synthetic data doesn't magically enable generalization beyond the generator's original training set. 📜: arxiv.org/abs/2406.05184 Details below🧵(1/n)

Tong Chen (@tomchen0) 's Twitter Profile Photo

📢Check out CopyBench, the first benchmark to evaluate non-literal copying in language model generation! ❗️Non-literal copying can occur even in models as small as 7B and is overlooked by current copyright risk mitigation methods. 🔗chentong0.github.io/copy-bench/ [1/N]

Niloofar Mireshghallah (@niloofar_mire) 's Twitter Profile Photo

📢Anyone who talked to me in the past year has heard my rant that *LLM memorization goes beyond form* & output overlap!

©️Reproducing a similar series of events, or character traits, also raises copyright issues.

👩‍⚖️In new work, we look at non-literal copying in LLMs!

arxiv.org/abs/2407.07087
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Copyright discussions around LMs have focused on literal copying of training data. With CopyBench, we show that non-literal copying is even more pervasive but overlooked by current mitigation methods. Check out Tong's tweet for the details!
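To make the literal vs. non-literal distinction concrete, here is a hedged sketch (not CopyBench's actual evaluation protocol) of a generic n-gram overlap check and why it misses non-literal copying; the ngram_overlap helper and the example strings are made up for illustration.

```python
# A hedged sketch of why literal-copying checks miss non-literal copying, the
# gap CopyBench targets. The overlap metric below is a generic n-gram check,
# not CopyBench's actual evaluation protocol.
def ngram_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of n-grams in the generation that appear verbatim in the source."""
    gen_tokens, src_tokens = generated.split(), source.split()
    if len(gen_tokens) < n:
        return 0.0
    src_ngrams = {tuple(src_tokens[i:i + n]) for i in range(len(src_tokens) - n + 1)}
    gen_ngrams = [tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1)]
    return sum(g in src_ngrams for g in gen_ngrams) / len(gen_ngrams)

source = "The detective followed the hidden map to the old lighthouse and confronted the thief at dawn."
paraphrase = "An investigator traces a secret chart to a lighthouse and catches the culprit at sunrise."

# Near-zero overlap, yet the plot (the series of events) is reproduced: a
# literal metric like this would call it safe, which is exactly the blind spot.
print(ngram_overlap(paraphrase, source))
```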

Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Instead of scaling pretraining data, can we scale the amount of data available at inference instead? Scaling RAG datastores to 1.4T tokens (on an academic budget) gives us better training-compute-optimal curves for LM & downstream performance. Check out Rulin Shao's work below!

AK (@_akhaliq) 's Twitter Profile Photo

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations.
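As a rough illustration of the retrieval-augmented setup being scaled here, passages are fetched from a datastore at inference time and prepended to the LM prompt; a minimal sketch follows. The toy embed() function, the three-passage datastore, and the retrieve() helper are illustrative stand-ins, not the paper's actual retriever or 1.4T-token index.

```python
# Minimal sketch of retrieval-augmented generation over a datastore.
# embed() and the datastore below are hypothetical toy stand-ins.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedder; a real system would use a trained dense retriever."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# A tiny datastore; in the paper this is scaled to ~1.4T tokens.
datastore = [
    "Retrieval-augmented LMs condition on passages fetched at inference time.",
    "Scaling laws relate training compute to language model loss.",
    "JPEG is a lossy image compression format based on the DCT.",
]
index = np.stack([embed(p) for p in datastore])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)        # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [datastore[i] for i in top]

query = "How do retrieval-augmented language models work?"
context = "\n".join(retrieve(query))
prompt = f"{context}\n\nQuestion: {query}\nAnswer:"  # retrieved passages prepended to the LM prompt
print(prompt)
```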
Tim Dettmers (@tim_dettmers) 's Twitter Profile Photo

After 7 months on the job market, I am happy to announce:
- I joined Ai2
- Professor at Carnegie Mellon University from Fall 2025
- New bitsandbytes maintainer: Titus von Koeller

My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵

Sewon Min (@sewon__min) 's Twitter Profile Photo

📣 After graduating from @UWCSE, I am joining UC Berkeley as an Assistant Professor (affiliated with Berkeley AI Research and BerkeleyNLP) and Ai2 as a Research Scientist. I'm looking forward to tackling exciting challenges in NLP & generative AI together with new colleagues! 🐻✨

Jesse Dodge (@jessedodge) 's Twitter Profile Photo

Congrats to our team for winning two paper awards at #ACL2024!

OLMo won the Best Theme Paper award, and Dolma won a Best Resource Paper award!

All the credit goes to the whole team for the massive group effort 🎉🎉
Emma Pierson (@2plus2make5) 's Twitter Profile Photo

New paper on race adjustments in clinical algorithms in PNAS! 

Joint work with the wonderful Anna Zink and Ziad Obermeyer - see Anna's detailed thread below. 

Anna will be on the job market this year - check out her other great work at zinka88.github.io!
Eunsol Choi (@eunsolc) 's Twitter Profile Photo

My lab will move to NYU Data Science and NYU Courant this Fall! I'm excited to connect with amazing researchers at CILVR and the larger ML/NLP community in NYC. I will be recruiting students this cycle at NYU. Happy to be back in the city 🗽 on the east coast as well. I had a

Xiaochuang Han (@xiaochuanghan) 's Twitter Profile Photo

👽Have you ever accidentally opened a .jpeg file with a text editor (or a hex editor)?

Your language model can learn from these seemingly gibberish bytes and generate images with them!

Introducing *JPEG-LM* - an image generator that uses exactly the same architecture as LLMs
Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

Check out JPEG-LM, a fun idea led by Xiaochuang Han -- we generate images simply by training an LM on raw JPEG bytes and show that it outperforms much more complicated VQ models, especially on rare inputs.
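A minimal sketch of the core idea, assuming a byte-level vocabulary: the raw bytes of a JPEG file are treated as a token sequence and a standard causal LM is trained on next-byte prediction. The tiny model, vocabulary layout, and the "cat.jpeg" path below are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of the JPEG-LM idea: model raw JPEG bytes with an ordinary
# decoder-only LM. TinyByteLM is a stand-in, not the paper's architecture.
from pathlib import Path
import torch
import torch.nn as nn

VOCAB_SIZE = 256 + 1          # one token per byte value, plus an end-of-image token
EOI = 256

def jpeg_to_tokens(path: str) -> torch.Tensor:
    """Read a .jpeg file and map each byte to a token id."""
    data = Path(path).read_bytes()
    return torch.tensor(list(data) + [EOI], dtype=torch.long)

class TinyByteLM(nn.Module):
    """A minimal causal LM over byte tokens (stand-in for a full transformer)."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=causal_mask))

# Next-byte prediction loss, exactly as for text LMs (path is hypothetical):
# tokens = jpeg_to_tokens("cat.jpeg").unsqueeze(0)
# logits = TinyByteLM()(tokens[:, :-1])
# loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
```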

Pang Wei Koh (@pangweikoh) 's Twitter Profile Photo

OLMoE, our fully open mixture-of-experts LLM led by Niklas Muennighoff, is out! Check out the paper for details on our design decisions: expert granularity, routing, upcycling, load balancing, etc.
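For readers new to MoE layers, here is a generic sketch of top-k expert routing with an auxiliary load-balancing loss, the kind of design decisions the paper studies. The sizes, the Switch-style balancing term, and the MoELayer class below are generic illustrations, not OLMoE's actual configuration.

```python
# Hedged sketch of top-k mixture-of-experts routing with load balancing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model). Route each token to its top-k experts.
        probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        weights, chosen = probs.topk(self.top_k, dim=-1)     # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        # Load-balancing loss: encourage uniform expert usage (Switch-Transformer style).
        usage = F.one_hot(chosen[:, 0], len(self.experts)).float().mean(0)
        importance = probs.mean(0)
        aux_loss = len(self.experts) * (usage * importance).sum()
        return out, aux_loss

tokens = torch.randn(16, 64)
y, aux = MoELayer()(tokens)
print(y.shape, float(aux))
```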

Tristan Thrush (@tristanthrush) 's Twitter Profile Photo

Do you want to select great LLM pretraining data but don’t have 1000 H100s for a ton of mixture experiments?

What about a method that requires none of your own training, matches the best known existing method, and has some nice theory?

New preprint: Perplexity Correlations
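A hedged sketch of the basic intuition, assuming a made-up matrix of per-domain log-perplexities from existing public models and their benchmark scores: rank candidate pretraining domains by how strongly lower perplexity on a domain correlates with higher downstream performance, with no training of your own. The numbers and the domain_scores helper below are illustrative; the paper's actual estimator differs in detail.

```python
# Hedged sketch of data selection via perplexity correlations.
import numpy as np

# Rows = existing public models, columns = candidate pretraining domains.
# Entry [m, d] = model m's log-perplexity on held-out text from domain d (made-up numbers).
log_ppl = np.array([
    [2.1, 3.5, 2.9],
    [1.8, 3.4, 2.5],
    [1.6, 3.6, 2.2],
    [1.5, 3.3, 2.0],
])
# Downstream benchmark accuracy of each of those models (made-up numbers).
bench = np.array([0.52, 0.58, 0.66, 0.71])

def domain_scores(log_ppl: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """Per-domain Pearson correlation between benchmark score and negative log-perplexity."""
    scores = []
    for d in range(log_ppl.shape[1]):
        scores.append(np.corrcoef(-log_ppl[:, d], bench)[0, 1])
    return np.array(scores)

scores = domain_scores(log_ppl, bench)
ranked = np.argsort(-scores)
print("domain ranking (best first):", ranked, "scores:", scores.round(2))
# Domains where lower perplexity tracks higher benchmark scores would be
# upweighted when building the pretraining mixture.
```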