Ofir Press

@ofirpress

I build tough benchmarks for LMs and then I get the LMs to solve them. Postdoc @Princeton. PhD from @nlpnoah @UW. Ex-visiting researcher @MetaAI & @MosaicML.

ID: 746788615951355904

Link: https://ofir.io/about

Joined: 25-06-2016 19:34:15

1.1K Tweets

10.10K Followers

3.3K Following

Ofir Press

@ofirpress

3 months ago

The progress on SWE-bench is nuts. I think my prediction of 2 systems surpassing 35% pass@1 on the full test set by Aug 1 will come true. When we launched in October, nobody wanted to work on the dataset because it was considered "too hard" or "impossible". Acc was 1.96% then.

Likes: 134

Replies: 10

Retweets: 13
