I am an undergraduate majoring in Artificial Intelligence at Yuanpei College, Peking University, advised by Prof. Yaodong Yang. My research interests are reinforcement learning and the safety and value alignment of large models (e.g., scalable oversight, deception alignment, and superalignment). My current research focuses on building safe and trustworthy AI systems. You can find my CV and research statement here.
My answers to the Hamming question (“What are the most important problems [that you should probably work on]?”):
- How can we align systems smarter than humans, and how can we align them on tasks that are hard for humans to evaluate (i.e., scalable oversight)?
- How can we integrate theory and experimental validation to embed moral values into AI systems (e.g., moral reflection and moral progress), and how can we address the AI alignment problem from a socio-technical perspective?
I have just set out on this long road, and I will use my youth and curiosity to make the most of every opportunity to explore these problems in depth. Feel free to reach out! Email: boyuan.chen.byc at gmail dot com
About Me
Boyuan Chen is an undergraduate student majoring in Artificial Intelligence at Yuanpei College, Peking University, advised by Prof. Yaodong Yang. His research covers reinforcement learning, the safety and value alignment of large models, and scalable oversight. He has given oral presentations at top computer science conferences, has accumulated over 1,000 Google Scholar citations, and his open-source projects have received 3.5k+ stars on GitHub. He was among the first undergraduates funded by the Beijing Natural Science Foundation (the only AI undergraduate at Peking University in 2023) and received the SenseTime Scholarship (only 25 recipients nationwide). His research and models have been cited by OpenAI and Meta and covered by MIT Technology Review.
News
- 2025.06: 🎊 Our work Med-Aligner has been accepted by The Innovation (IF: 33.2)! Med-Aligner demonstrates the potential of Aligner (our NeurIPS 2024 oral work) in the medical domain.
- 2025.06: 🎉 Two papers were accepted to the ACL 2025 main conference.
- 2025.05: 🎉 We open-source InterMT, the first human preference dataset for multi-turn multimodal understanding and generation. Discussion and collaboration are welcome!
- 2025.01: 🎉 We release Align-DS-V, the first multimodal strong reasoning model.
More news
- 2024.10: 💥 We open-source Align-Anything, the first all-modality alignment framework!
- 2024.09: 💥 Aligner has been accepted as an Oral presentation at NeurIPS 2024!
- 2024.06: 🎉 We introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in LLMs.
- 2024.06: 🎙️ Happy to introduce our new work on the elasticity of LLMs. Click here for further details.
- 2024.04: 🎊 Our work BeaverTails has been recognized by Meta, further contributing to AI safety research.
- 2024.03: 💥 Our alignment survey has been recognized by NIST! More details.
- 2024.03: 🚀 We have made significant updates to the alignment survey (V4)!
- 2024.02: 💥 We release Aligner, a new and efficient alignment paradigm that bypasses the whole RLHF process.
- 2023.11: 🚀 We release the AI Alignment Survey and the Alignment Resource Website. Further discussion is welcome!
Publications
* denotes equal contribution, α denotes core contributors, and † denotes corresponding author

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
Boyuan Chen*, Donghai Hong*, Jiaming Ji*, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, and Yaodong Yang†
- Data: InterMT, a pioneering preference dataset for multi-turn multimodal interactions, comprising 15.6k prompts and 32.4k human-annotated preference pairs, built with an agentic workflow in which tool-augmented MLLMs generate multi-turn QA instances. Algorithm: chain-prefix local and global preference modeling for training judge models, demonstrating a multi-turn scaling law. Evaluation: InterMT-Bench assesses model capabilities in multi-turn multimodal scenarios. (A rough sketch of one preference instance follows below.)
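For intuition, one InterMT preference instance can be pictured roughly as follows. This is a hypothetical sketch: the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One interleaved turn: text plus any attached images (illustrative)."""
    text: str
    images: list[str] = field(default_factory=list)  # e.g., image file paths

@dataclass
class InterMTPreferencePair:
    """One multi-turn preference instance (hypothetical layout)."""
    prefix: list[Turn]      # shared conversation prefix (the "chain prefix")
    response_a: list[Turn]  # candidate continuation A
    response_b: list[Turn]  # candidate continuation B
    local_choice: str       # per-turn (local) human preference: "a" or "b"
    global_choice: str      # whole-conversation (global) preference: "a" or "b"
```

A judge model trained on such pairs can then score both individual turns (local preference) and the conversation as a whole (global preference).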

Language Models Resist Alignment
Jiaming Ji*, Kaile Wang*, Tianyi Qiu*, Boyuan Chen*, Jiayi Zhou*, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, and Yaodong Yang†
- This paper presents the first theoretical and empirical exploration of elasticity in LLM alignment, revealing that fine-tuned models tend to revert to their pre-training behavior distribution, and that this elasticity correlates positively with model size and pre-training data scale.

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji*α, Donghai Hong*α, Borong Zhangα, Boyuan Chenα, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang†
- We introduce PKU-SafeRLHF, a comprehensive dataset for LLM safety alignment research, featuring 44.6k prompts and 265k QA pairs with safety meta-labels across 19 harm categories and three severity levels, along with 166.8k preference entries covering both dual-preference and single-preference scenarios (an illustrative record layout is sketched below).
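An illustrative view of a single comparison record, showing how the dual-preference annotation separates helpfulness from harmlessness. The field names are hypothetical; the released dataset's actual keys may differ.

```python
from dataclasses import dataclass

@dataclass
class SafeRLHFRecord:
    """One dual-preference comparison (illustrative sketch, not the real schema)."""
    prompt: str
    response_a: str
    response_b: str
    harm_categories_a: list[str]  # subset of the 19 harm categories (empty if safe)
    harm_categories_b: list[str]
    severity_a: str               # one of the three severity levels
    severity_b: str
    more_helpful: str             # helpfulness preference: "a" or "b"
    safer: str                    # harmlessness preference: "a" or "b"
```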

Aligner: Achieving Efficient Alignment through Learned Correction
Jiaming Ji*, Boyuan Chen*, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, and Yaodong Yang†
NeurIPS 2024 Oral (Top 0.45%)
- We propose Aligner, a novel and simple alignment paradigm that uses a small model to learn the correctional residuals between preferred and dispreferred answers, achieving comparable performance at significantly reduced computational cost while remaining model-agnostic and plug-and-play (a minimal inference-time sketch follows below).
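At inference time the paradigm reduces to a two-stage pipeline: the upstream model drafts an answer, and the small Aligner rewrites it. Below is a minimal sketch assuming simple text-in/text-out model handles; the function name and prompt template are illustrative assumptions, not the paper's exact interface.

```python
from typing import Callable

# Any text-in, text-out model call (a hypothetical interface for this sketch).
LLM = Callable[[str], str]

def aligner_generate(question: str, upstream: LLM, aligner: LLM) -> str:
    """Correct an upstream model's draft answer with a small Aligner model.

    The upstream model stays frozen and can be any off-the-shelf LLM
    (model-agnostic); the Aligner is a small model trained to map
    (question, draft answer) to an improved answer.
    """
    draft = upstream(question)  # stage 1: the upstream model answers as usual
    prompt = f"Question: {question}\nDraft answer: {draft}\nImproved answer:"
    return aligner(prompt)      # stage 2: apply the learned correction
```

Because the correction sits on top of whatever the upstream model produces, the same Aligner can be stacked onto different models without retraining them, which is what plug-and-play refers to here.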

BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji*, Mickel Liu*, Juntao Dai*, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, Yaodong Yang†
- We introduce BEAVERTAILS, a large-scale dataset with 333,963 QA pairs and 361,903 expert comparisons, uniquely separating helpfulness and harmlessness annotations to advance LLM safety alignment research.
Journal

Med-Aligner: Empowers LLM Medical Applications for complex medical scenarios
Xiangbin Meng*, Jia-ming Ji*, Xiangyu Yan*, Jun-tao Dai, Bo-yuan Chen, Guan Wang, Hua Xu, Jing-jia Wang, Xu-liang Wang, Da Liu, Ming-qi Zheng, Rongzhou Wu, Chuanjie Wu, Yuwei Wu†, Wen-yao Wang†, Zhen Song†, and Yaodong Yang†
The Innovation (IF 33.2)
- We introduce Med-Aligner, a plug-in alignment framework for LLMs that enables targeted medical calibration by learning correction residuals between preferred and non-preferred responses. Built as a 2-billion-parameter Transformer model trained with DeepSpeed, it is trained on 267,524 anonymized medical records from 21 departments spanning 4,353 disease types, achieving 90% cross-validation agreement on physician annotations that follow clinical guidelines.
Selected Preprints
Please see my Google Scholar profile for my other works!

Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
Jiaming Ji*, Jiayi Zhou*, Hantao Lou*, Boyuan Chen*, Donghai Hong*, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang†
- Motivated by the challenge of all-modality human preference alignment, and in particular the limitation that binary preferences alone cannot accurately capture human preferences, we introduce align-anything (an illustrative record is sketched below). Data: align-anything-200k covers the text, image, audio, and video modalities and 8+ specific subtasks, annotated with both preference and language feedback. Algorithm: improving all-modality alignment by learning from language feedback. Evaluation: benchmarks spanning all-modality understanding and generation.
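To make the role of language feedback concrete, one annotated example might look like this hypothetical record (field names are illustrative, not the dataset's actual keys):

```python
from dataclasses import dataclass

@dataclass
class AlignAnythingRecord:
    """One align-anything annotation (illustrative sketch)."""
    prompt: str
    modality: str           # e.g., "text", "image", "audio", or "video"
    response_a: str
    response_b: str
    preference: str         # binary preference: "a" or "b"
    language_feedback: str  # free-form critique of why and how to improve
```

The free-form critique carries information that the binary label alone cannot, which is what learning from language feedback exploits.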

AI Alignment: A Comprehensive Survey
Jiaming Ji*, Tianyi Qiu*, Boyuan Chen*, Borong Zhang*, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao
- We present a comprehensive survey of AI alignment research, introducing the RICE principles (Robustness, Interpretability, Controllability, and Ethicality) and exploring both forward alignment (preference modeling, RLHF, scalable oversight) and backward alignment (assurance techniques, governance practices) approaches.
Honors and Awards
- 2025 The Sixth Yuanpei Young Scholar Award
- 2024 SenseTime Scholarship (25 recipients per year nationwide; ¥20,000)
- 2024 Yicong Huang Scholarship (research innovation award; ¥8,000)
- 2024 Research Excellence Award (¥5,000)
- 2024 Ching-Ling Soong Future Scholarship (¥5,000)
- 2024 Beijing Natural Science Foundation for Undergraduate Students (first batch; the only recipient in the field of artificial intelligence at Peking University)
- 2023 Yicong Huang Scholarship (¥8,000)
- 2023 Peking University Scholarship (¥4,000)
- 2023 Peking University Public Service Scholarship (¥2,000)
- 2022 Peking University Freshman Scholarship (¥10,000)
Invited Talks
- 2025.02: DeepSeek-R1 Analysis and Sharing [Video]
- 2024.09: 🎙️ Technical analysis of OpenAI o1 and the Post-Training Scaling Law [Video] [Slides]
- 2023.11: Invited talk about AI Alignment Survey [Video]
Service
- Conference Reviewer for ICCV 2025, ICML 2025, ICLR 2025 (Notable Reviewer), NeurIPS 2025 & 2024
- Workshop Reviewer for ICML 2024 Workshop TiFA
Blogs
Some thoughts on AI Alignment and Cognitive Reasoning
- Intentionality, 2024.10
- Abstraction Reasoning, 2024.09
- Causality, 2024.09
Education
- 2022.09 - Present, Yuanpei College, Peking University
  Undergraduate student in Artificial Intelligence
- 2023.05 - Present, PAIR Lab: PKU Alignment and Interaction Research Lab
  Research Intern; Advisor: Prof. Yaodong Yang, Institute for AI, Peking University