Karthik Narasimhan (@karthik_r_n) Twitter Tweets • TwiDoom

Karthik Narasimhan

@karthik_r_n


Associate Professor @PrincetonCS, Head of Research @SierraPlatform. Previously @OpenAI, PhD @MIT_CSAIL, BTech @iitmadras

ID: 3272351166

Website: http://www.karthiknarasimhan.com/ | Joined: 09-07-2015 01:28:42

259 Tweets

3.3K Followers

456 Following

carlos

@_carlosejimenez

6 months ago

SWE-bench Lite is a smaller & slightly easier *subset* of SWE-bench, with 23 dev / 300 test examples (full SWE-bench is 225 dev / 2,294 test). We hope this makes SWE-bench evals easier. Special thanks to Jiayi Geng for making this happen. Download here: swebench.com/lite

Likes: 53 · Replies: 1 · Reposts: 12

John Yang

@jyangballin

6 months ago

SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source! We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code github.com/princeton-nlp/…

Likes: 2.2K · Replies: 65 · Reposts: 427

carlos

@_carlosejimenez

6 months ago

SWE-agent is an open-source software engineering agent with a 12.3% resolve rate on SWE-bench! Check out SWE-agent in action at swe-agent.com Repo: github.com/princeton-nlp/…

Likes: 561 · Replies: 30 · Reposts: 113

Ofir Press

@ofirpress

6 months ago

SWE-agent is blazing fast, and when it works it feels like magic! In this short demo I show how it solved a real bug in the neural network training code in scikit-learn. I also explain the process behind our agent-computer interface design choices.

Likes: 251 · Replies: 6 · Reposts: 44

John Yang

@jyangballin

5 months ago

The SWE-agent preprint has finally landed! Check it out at swe-agent.com/paper.pdf

Likes: 323 · Replies: 6 · Reposts: 67

Ben Shi

@benshi34

4 months ago

Our visualizer for our preprint, “Can Language Models Solve Olympiad Programming?”, is live. See the per-problem performance of models on USACO + more! Link here: princeton-nlp.github.io/USACOBench/ Ty again to my collaborators: Michael Tang, Shunyu Yao, and Karthik Narasimhan

Likes: 13 · Replies: 0 · Reposts: 4

Tejas Kulkarni

@tejasdkulkarni

3 months ago

x.com/i/article/1801…

Likes: 149 · Replies: 10 · Reposts: 36

Bret Taylor

@btaylor

3 months ago

Sierra's research team just published 𝜏-bench, a new benchmark to evaluate AI agents' performance and reliability in real-world settings. The results show that agents built with simple LLM constructs (like function calling or ReAct) perform poorly on even relatively

Likes: 146 · Replies: 10 · Reposts: 17

Shunyu Yao

@shunyuyao12

3 months ago

Excited to share what I did at Sierra with Noah Shinn, pedram, and Karthik Narasimhan! 𝜏-bench evaluates critical agent capabilities omitted by current benchmarks: robustness, complex rule following, and human interaction skills. Try it out!

Likes: 57 · Replies: 2 · Reposts: 11