Chris Painter (@chrispainteryup) 's Twitter Profile
Chris Painter

@chrispainteryup

Evals accelerationist, working hard on responsible scaling policies, policy director @METR_Evals

ID: 930159122535866369

Joined: 13-11-2017 19:43:11

878 Tweets

1.1K Followers

1.1K Following

Sam Rodriques (@sgrodriques) 's Twitter Profile Photo

Today, we're releasing LAB-Bench, a set of >2000 evaluations for language models and agents on scientific research tasks in biology. Public models underperform PhD/postdoc-level humans on nearly all tasks. Claude 3.5 Sonnet is the clear frontrunner atm, but long way to go. 1/

METR (@metr_evals) 's Twitter Profile Photo

How well can LLM agents complete diverse tasks compared to skilled humans? Our preliminary results indicate that our baseline agents based on several public models (Claude 3.5 Sonnet and GPT-4o) complete a proportion of tasks similar to what humans can do in ~30 minutes. 🧵

METR (@metr_evals) 's Twitter Profile Photo

On average, when agents can do a task, they do so at ~1/30th of the cost of the median hourly wage of a US bachelor’s degree holder. One example: our Claude 3.5 Sonnet agent fixed bugs in an ORM library at a cost of <$2, while the human baseline took >2 hours.

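A rough back-of-envelope check of that cost comparison, using the ORM bug-fix example from the tweet. The hourly wage below is an illustrative assumption (not a METR figure), so the exact ratio is only indicative:

```python
# Sketch: compare agent cost vs. human baseline cost for the ORM bug-fix task.
# assumed_hourly_wage is a hypothetical stand-in for the median US bachelor's-degree wage.
assumed_hourly_wage = 40.0   # USD/hr, assumption for illustration only
human_hours = 2.0            # the human baseline took >2 hours
agent_cost = 2.0             # the Claude 3.5 Sonnet agent cost <$2

human_cost = assumed_hourly_wage * human_hours
print(f"human cost ~ ${human_cost:.0f}, agent cost ~ ${agent_cost:.0f}")
print(f"cost ratio ~ 1/{human_cost / agent_cost:.0f}")  # on the order of the ~1/30th claim
```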
Michael Chen (@miclchen) 's Twitter Profile Photo

AI agents still aren't reliable yet, even for short tasks, but simple agents can occasionally complete tasks that take humans hours. Managing AI risks will become much more challenging when agents can autonomously execute on long-term projects.

Eli Dourado (@elidourado) 's Twitter Profile Photo

Can’t wait for the amount of resources Casey Handmer, PhD will put into antimatter production and storage after Terraform exits caseyhandmer.wordpress.com/2024/08/18/ant…

METR (@metr_evals) 's Twitter Profile Photo

We’re open sourcing Vivaria, our platform for running LLM agent evaluations and conducting research in eliciting AI capabilities. 🪴

Epoch AI (@epochairesearch) 's Twitter Profile Photo

1/ Can AI scaling continue through 2030? We examine whether constraints on power, chip manufacturing, training data, or data center latencies might hinder AI growth. Our analysis suggests that AI scaling can likely continue its current trend through 2030.

Ethan Mollick (@emollick) 's Twitter Profile Photo

Wow, diffusion models (used in AI image generation) are also game engines - a type of world simulation. By predicting the next frame of the classic shooter DOOM, you get a playable game at 20 fps without any underlying real game engine. This video is from the diffusion model.

Chris Painter (@chrispainteryup) 's Twitter Profile Photo

We thought it would be helpful to have all of the similar themes/components from DeepMind's Frontier Safety Framework, OpenAI's Preparedness Framework, and Anthropic's Responsible Scaling Policy in one place.

METR (@metr_evals) 's Twitter Profile Photo

We ran o1-preview on our suite of ML R&D/SWE/general agency tasks, from Sep 3–9. 4 days of scaffolding iteration took it from well below GPT-4o to on par with the highest-scoring public model (3.5 Sonnet). We expect substantial performance gains from more elicitation/finetuning.
