Pengfei Liu(@stefan_fee) 's Twitter Profile
Pengfei Liu

@stefan_fee

Associate Prof. at SJTU, leading GAIR Lab (https://t.co/Nfd8KmZx3B). Co-founder of Inspired Cognition. Postdoc at @LTIatCMU. Previously FNLP, @MILAMontreal.

ID:2818867628

Link: http://pfliu.com/ | Joined: 19-09-2014 02:34:24

388 Tweets

2.4K Followers

633 Following

Junlong Li🚀🇦🇹ICLR24(@lockonlvange) 's Twitter Profile Photo

Will be presenting Auto-J on site, and could not be more excited (as a first-time conference attendee) to meet old and new friends🥰

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

4. The LLM disciplinary committee, which oversees whether LLM developers *secretly* use benchmark data without disclosure, demanding that developers maintain a sense of respect for *transparency*.

See how much data *leakage* popular LLMs have on public benchmarks:

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

How much data *leakage* do popular LLMs have on public benchmarks?

Our recent work will tell you:

Benchmarking Benchmark Leakage in Large Language Models
arxiv.org/pdf/2404.18824

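Roughly speaking, the idea is to compare how the model under test treats a benchmark's training split versus its held-out split (via perplexity- and n-gram-style signals). Below is a minimal sketch of one such signal using Hugging Face transformers; the model name and placeholder strings are illustrative only, not the paper's exact protocol.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_loss(model, tokenizer, texts):
    """Average per-token cross-entropy (negative log-likelihood) over texts."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

model_name = "meta-llama/Llama-2-7b-hf"   # any causal LM under test (illustrative)
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

train_items = ["<benchmark training-split question + answer>"]  # placeholders
test_items = ["<benchmark test-split question + answer>"]

# A model that memorized the benchmark's training examples tends to assign them
# noticeably lower loss than comparable held-out items.
gap = mean_loss(lm, tok, test_items) - mean_loss(lm, tok, train_items)
print(f"test-minus-train loss gap: {gap:.3f} (a large positive gap hints at leakage)")
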
Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Crazy finding!!!!! -> "Without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B's mathematical reasoning ability on GSM8K can be improved from 46.77% to 56.63% in accuracy"

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Do you worry about hallucination risks from using LLMs in high-stakes applications like medicine?
Please check out our work DocLens, an all-encompassing, automated tool designed to detect hallucinations tailored to user needs.

Yiqing Xie(@YiqingXieNLP) 's Twitter Profile Photo

(2/4) DocLens🔍 evaluates medical texts' completeness, conciseness, and attribution with fine-grained metrics. It can be computed with various evaluators and applied to diverse tasks like clinical note generation, radiology report summarization & medical question summarization.

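The general recipe behind claim-level metrics of this kind: split the texts into atomic claims, then check each claim against a reference (completeness) or the source document (attribution). A toy sketch of that idea follows; the helper names and the substring-matching judge are my own stand-ins, not DocLens's actual evaluators, which would use an LLM or entailment model.

def is_supported(claim: str, text: str) -> bool:
    # Stand-in judge: a real evaluator would ask an LLM or an NLI model whether
    # `text` supports `claim`; substring matching is only a toy approximation.
    return claim.lower() in text.lower()

def claim_recall(reference_claims, generated_text):
    """Completeness: share of reference claims that the generated text covers."""
    if not reference_claims:
        return 1.0
    return sum(is_supported(c, generated_text) for c in reference_claims) / len(reference_claims)

def claim_precision(generated_claims, source_text):
    """Attribution: share of generated claims grounded in the source document."""
    if not generated_claims:
        return 1.0
    return sum(is_supported(c, source_text) for c in generated_claims) / len(generated_claims)

source = "patient reports mild headache for two days, no fever"
note = "Patient reports mild headache for two days. No fever. Prescribed aspirin."
print(claim_recall(["mild headache", "no fever"], note))                  # 1.0
print(claim_precision(["mild headache", "prescribed aspirin"], source))   # 0.5
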
SambaNova Systems(@SambaNovaAI) 's Twitter Profile Photo

EXPERT RELEASE: SambaNova Systems is proud to feature Auto-J, a 13B parameter language model intended for use as an automated judge of other language models.

The model starts from Llama 2 13B Chat from AI at Meta, and is trained on a curated mix of open source LLM chat responses.…

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Scalability is an eternal topic in LLMOps. Existing work focuses on training superhuman systems by constructing scalable oversight. We focus on the dual problem, scalable evaluation, and argue that by introducing Agent Debate into meta-evaluation we can achieve this goal pretty well.

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Instruction-following is one of the most crucial foundational abilities for LLMs, but how can you know how well your large model adheres to instructions? InfoBench not only offers a benchmark but also proposes a very elegant evaluation metric.

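As I understand it, the metric decomposes each instruction into simple yes/no requirements and reports the fraction a response satisfies, which is what makes it both fine-grained and easy to compute. A minimal sketch of that ratio (my own illustration, not the authors' code; the function name is made up):

def requirements_following_ratio(judgments):
    """Fraction of decomposed yes/no requirements (pooled over all instructions)
    that the model's responses satisfy."""
    total = sum(len(per_instruction) for per_instruction in judgments)
    passed = sum(sum(per_instruction) for per_instruction in judgments)
    return passed / total if total else 0.0

# One inner list per instruction; each boolean is an evaluator's verdict on a
# single decomposed requirement (e.g., "is the answer given as bullet points?").
judgments = [
    [True, True, False],  # instruction 1: 2 of 3 requirements met
    [True, True],         # instruction 2: both requirements met
]
print(requirements_following_ratio(judgments))  # 0.8
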
Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Critiques will become the “Natural Language Gradient” for model optimization in the Generative AI era, but how reliable are they? Check out our new work:
The Critique of Critique!

Shichao Sun(@BruceSun1995) 's Twitter Profile Photo

Can I trust LLM critiques?

🔥 We are the pioneers in prioritizing critique evaluation and introducing the critique of critique, termed MetaCritique.

Repo: github.com/GAIR-NLP/MetaC…
Paper: arxiv.org/abs/2401.04518

(1/7)

Sinclair Wang(@SinclairWang1) 's Twitter Profile Photo

Thanks for your kind intro😍 We are continually iterating on our MathPile based on community feedback and plan to release a commercially available version. Stay tuned🥰

Pengfei Liu(@stefan_fee) 's Twitter Profile Photo

Aligning for honesty is a MUCH more challenging task than 'harmlessness' and 'helpfulness.' Fortunately, we have the wisdom of the ancients to guide us. Our outstanding author Yuqing Yang made the dream collaboration between Confucius and LLMs happen.

KaiqiangSong(@SongKaiqiang) 's Twitter Profile Photo

Struggling with global attention's slow computing times? Worried about local attention's performance? Try our new Zebra model: Global capability with faster training and inference speeds. Quick, efficient results on new or existing models. paper: arxiv.org/pdf/2312.08618…

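For context on the local-vs-global trade-off the tweet refers to: full causal attention compares every token with every earlier token, while windowed (local) attention only looks back a fixed number of positions. A toy illustration of the two masks (just the general idea, not the Zebra architecture itself):

import numpy as np

def local_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Each token attends only to itself and the previous `window - 1` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def full_causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: each token attends to every earlier token."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Local attention scores O(seq_len * window) pairs instead of O(seq_len**2),
# which is where the training/inference speedup on long sequences comes from.
print(local_causal_mask(6, 2).astype(int))
print(full_causal_mask(6).astype(int))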