Chris Painter (@chrispainteryup) 's Twitter Profile
Chris Painter

@chrispainteryup

Evals accelerationist, working hard on responsible scaling policies, policy director @METR_Evals

ID: 930159122535866369

Joined: 13-11-2017 19:43:11

878 Tweets

1.1K Followers

1.1K Following

Sam Rodriques (@sgrodriques) 's Twitter Profile Photo

Today, we're releasing LAB-Bench, a set of >2000 evaluations for language models and agents on scientific research tasks in biology. Public models underperform PhD/postdoc-level humans on nearly all tasks. Claude 3.5 Sonnet is the clear frontrunner atm, but long way to go. 1/

METR (@metr_evals) 's Twitter Profile Photo

How well can LLM agents complete diverse tasks compared to skilled humans? Our preliminary results indicate that our baseline agents based on several public models (Claude 3.5 Sonnet and GPT-4o) complete a proportion of tasks similar to what humans can do in ~30 minutes. 🧵

METR (@metr_evals) 's Twitter Profile Photo

On average, when agents can do a task, they do so at ~1/30th of the cost of the median hourly wage of a US bachelor’s degree holder. One example: our Claude 3.5 Sonnet agent fixed bugs in an ORM library at a cost of <$2, while the human baseline took >2 hours.

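A rough back-of-envelope check of that cost comparison, using the ORM bug-fix example from the tweet. The hourly wage below is an illustrative assumption (not a METR figure), so the exact ratio is only indicative:

```python
# Sketch: compare agent cost vs. human baseline cost for the ORM bug-fix task.
# assumed_hourly_wage is a hypothetical stand-in for the median US bachelor's-degree wage.
assumed_hourly_wage = 40.0   # USD/hr, assumption for illustration only
human_hours = 2.0            # the human baseline took >2 hours
agent_cost = 2.0             # the Claude 3.5 Sonnet agent cost <$2

human_cost = assumed_hourly_wage * human_hours
print(f"human cost ~ ${human_cost:.0f}, agent cost ~ ${agent_cost:.0f}")
print(f"cost ratio ~ 1/{human_cost / agent_cost:.0f}")  # on the order of the ~1/30th claim
```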
Michael Chen (@miclchen) 's Twitter Profile Photo

AI agents still aren't reliable yet, even for short tasks, but simple agents can occasionally complete tasks that take humans hours. Managing AI risks will become much more challenging when agents can autonomously execute on long-term projects.

Eli Dourado (@elidourado) 's Twitter Profile Photo

Can’t wait for the amount of resources Casey Handmer, PhD will put into antimatter production and storage after Terraform exits caseyhandmer.wordpress.com/2024/08/18/ant…

METR (@metr_evals) 's Twitter Profile Photo

We’re open sourcing Vivaria, our platform for running LLM agent evaluations and conducting research in eliciting AI capabilities. 🪴

Epoch AI (@epochairesearch) 's Twitter Profile Photo

1/ Can AI scaling continue through 2030? We examine whether constraints on power, chip manufacturing, training data, or data center latencies might hinder AI growth. Our analysis suggests that AI scaling can likely continue its current trend through 2030.

Ethan Mollick (@emollick) 's Twitter Profile Photo

Wow, diffusion models (used in AI image generation) are also game engines - a type of world simulation. By predicting the next frame of the classic shooter DOOM, you get a playable game at 20 fps without any underlying real game engine. This video is from the diffusion model.

Chris Painter (@chrispainteryup) 's Twitter Profile Photo

We thought it would be helpful to have all of the similar themes/components from DeepMind's Frontier Safety Framework, OpenAI's Preparedness Framework, and Anthropic's Responsible Scaling Policy in one place.

METR (@metr_evals) 's Twitter Profile Photo

We ran o1-preview on our suite of ML R&D/SWE/general agency tasks, from Sep 3–9. 4 days of scaffolding iteration took it from well below GPT-4o to on par with the highest-scoring public model (3.5 Sonnet). We expect substantial performance gains from more elicitation/finetuning.
