Pengfei Liu(@stefan_fee) 's Twitter Profile
Pengfei Liu

@stefan_fee

Associate Prof. at SJTU, leading GAIR Lab (https://t.co/Nfd8KmZx3B). Co-founder of Inspired Cognition. Postdoc at @LTIatCMU. Previously FNLP, @MILAMontreal.

ID:2818867628

Link: http://pfliu.com/ | Joined: 19-09-2014 02:34:24

388 Tweets

2.4K Followers

633 Following

Junlong Li🚀🇦🇹ICLR24(@lockonlvange) 's Twitter Profile Photo

Will be presenting Auto-J on site, and could not be more excited (as a first-time conference attendee) to meet old and new friends🥰

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

4. The LLM disciplinary committee, which oversees whether LLM developers *secretly* use benchmark data without disclosure, demanding that developers maintain a sense of respect for *transparency*.

See how much data *leakage* popular LLMs have on public benchmarks:

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

How much data *leakage* do popular LLMs have on public benchmarks?

Our recent work will tell you:

Benchmarking Benchmark Leakage in Large Language Models
arxiv.org/pdf/2404.18824

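Roughly speaking, the idea is to compare how the model under test treats a benchmark's training split versus its held-out split (via perplexity- and n-gram-style signals). Below is a minimal sketch of one such signal using Hugging Face transformers; the model name and placeholder strings are illustrative only, not the paper's exact protocol.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_loss(model, tokenizer, texts):
    """Average per-token cross-entropy (negative log-likelihood) over texts."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

model_name = "meta-llama/Llama-2-7b-hf"   # any causal LM under test (illustrative)
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

train_items = ["<benchmark training-split question + answer>"]  # placeholders
test_items = ["<benchmark test-split question + answer>"]

# A model that memorized the benchmark's training examples tends to assign them
# noticeably lower loss than comparable held-out items.
gap = mean_loss(lm, tok, test_items) - mean_loss(lm, tok, train_items)
print(f"test-minus-train loss gap: {gap:.3f} (a large positive gap hints at leakage)")
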
Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Crazy finding!!!!! -> "Without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B's mathematical reasoning ability on GSM8K can be improved from 46.77% to 56.63% in accuracy"

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Do you worry about hallucination risks from using LLMs in high-stakes applications like medicine?
Please check out our work DocLens, an all-encompassing, automated tool designed to detect hallucinations tailored to user needs.

Yiqing Xie(@YiqingXieNLP) 's Twitter Profile Photo

(2/4) DocLens🔍 evaluates medical texts' completeness, conciseness, and attribution with fine-grained metrics. It can be computed with various evaluators and applied to diverse tasks like clinical note generation, radiology report summarization & medical question summarization.

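The general recipe behind claim-level metrics of this kind: split the texts into atomic claims, then check each claim against a reference (completeness) or the source document (attribution). A toy sketch of that idea follows; the helper names and the substring-matching judge are my own stand-ins, not DocLens's actual evaluators, which would use an LLM or entailment model.

def is_supported(claim: str, text: str) -> bool:
    # Stand-in judge: a real evaluator would ask an LLM or an NLI model whether
    # `text` supports `claim`; substring matching is only a toy approximation.
    return claim.lower() in text.lower()

def claim_recall(reference_claims, generated_text):
    """Completeness: share of reference claims that the generated text covers."""
    if not reference_claims:
        return 1.0
    return sum(is_supported(c, generated_text) for c in reference_claims) / len(reference_claims)

def claim_precision(generated_claims, source_text):
    """Attribution: share of generated claims grounded in the source document."""
    if not generated_claims:
        return 1.0
    return sum(is_supported(c, source_text) for c in generated_claims) / len(generated_claims)

source = "patient reports mild headache for two days, no fever"
note = "Patient reports mild headache for two days. No fever. Prescribed aspirin."
print(claim_recall(["mild headache", "no fever"], note))                  # 1.0
print(claim_precision(["mild headache", "prescribed aspirin"], source))   # 0.5
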
SambaNova Systems(@SambaNovaAI) 's Twitter Profile Photo

EXPERT RELEASE: SambaNova Systems is proud to feature Auto-J, a 13B parameter language model intended for use as an automated judge of other language models.

The model starts from Llama 2 13B Chat from AI at Meta, and is trained on a curated mix of open source LLM chat responses.…

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Scalability is an eternal topic in LLMOps. Existing work focuses on training superhuman systems by constructing scalable oversight. We focus on the dual problem, scalable evaluation, and argue that by introducing Agent Debate into meta-evaluation we can achieve this goal pretty well.

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Instruction-following is one of the most crucial foundational abilities for LLMs, but how can you know how well your large model adheres to instructions? InfoBench not only offers a benchmark but also proposes a very elegant evaluation metric.

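As I understand it, the metric decomposes each instruction into simple yes/no requirements and reports the fraction a response satisfies, which is what makes it both fine-grained and easy to compute. A minimal sketch of that ratio (my own illustration, not the authors' code; the function name is made up):

def requirements_following_ratio(judgments):
    """Fraction of decomposed yes/no requirements (pooled over all instructions)
    that the model's responses satisfy."""
    total = sum(len(per_instruction) for per_instruction in judgments)
    passed = sum(sum(per_instruction) for per_instruction in judgments)
    return passed / total if total else 0.0

# One inner list per instruction; each boolean is an evaluator's verdict on a
# single decomposed requirement (e.g., "is the answer given as bullet points?").
judgments = [
    [True, True, False],  # instruction 1: 2 of 3 requirements met
    [True, True],         # instruction 2: both requirements met
]
print(requirements_following_ratio(judgments))  # 0.8
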
Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Critiques will become the “Natural Language Gradient” for model optimization in the Generative AI era, but how reliable are they? Check out our new work:
The Critique of Critique!

Shichao Sun(@BruceSun1995) 's Twitter Profile Photo

Can I trust LLM critiques?

🔥 We are the pioneers in prioritizing critique evaluation and introducing the critique of critique, termed MetaCritique.

Repo: github.com/GAIR-NLP/MetaC…
Paper: arxiv.org/abs/2401.04518

(1/7)

Sinclair Wang(@SinclairWang1) 's Twitter Profile Photo

Thanks for your kind intro😍 We are continually iterating on our MathPile based on community feedback and plan to release a commercially available version. Stay tuned🥰

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Aligning for honesty is a MUCH more challenging task than 'harmlessness' and 'helpfulness.' Fortunately, we have the wisdom of the ancients to guide us. Our outstanding author Yuqing Yang made the dream collaboration between Confucius and LLMs happen.

KaiqiangSong(@SongKaiqiang) 's Twitter Profile Photo

Struggling with global attention's slow computing times? Worried about local attention's performance? Try our new Zebra model: Global capability with faster training and inference speeds. Quick, efficient results on new or existing models. paper: arxiv.org/pdf/2312.08618…

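For context on the local-vs-global trade-off the tweet refers to: full causal attention compares every token with every earlier token, while windowed (local) attention only looks back a fixed number of positions. A toy illustration of the two masks (just the general idea, not the Zebra architecture itself):

import numpy as np

def local_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Each token attends only to itself and the previous `window - 1` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def full_causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: each token attends to every earlier token."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Local attention scores O(seq_len * window) pairs instead of O(seq_len**2),
# which is where the training/inference speedup on long sequences comes from.
print(local_causal_mask(6, 2).astype(int))
print(full_causal_mask(6).astype(int))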