Undergraduate Research Overview

Throughout my undergraduate research, I have focused on reinforcement learning, the safety and value alignment of large models (i.e., scalable oversight, deception alignment, and super alignment), with the overarching goal of constructing capable, autonomous, safe, and trustworthy AI systems.

My undergraduate research focuses on two interconnected questions: how to construct capably, safe and trustworthy AI systems, and how to supervise and align superhuman AI systems. The latter is a natural progression of the former and also subsumes it, encompassing both the post-training enhancement of capabilities (e.g., multimodal understanding and generation, and the realization of strong reasoning in systems like o3-level models) and the emergence of novel high-risk failure modes (e.g., deception and alignment faking), along with their underlying mechanisms and potential solutions.

Undergraduate Publications

^* denotes equal contribution, ^α denotes core contributors, and ^† denotes corresponding author

NeurIPS 2025

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Boyuan Chen^*, Donghai Hong^*, Jiaming Ji^*, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, and Yaodong Yang^†

NeurIPS 2025 Spotlight (Top 2.6%)

[Project] [Paper]

This paper introduces Data: InterMT, a pioneering preference dataset for multi-turn multimodal interactions, comprising 15.6k prompts and 32.4k human-annotated preference pairs. It employs an innovative agentic workflow that utilizes tool-augmented MLLMs to generate multi-turn QA instances. Algorithm: The work proposes chain-prefix local and global preference modeling for training judge models, demonstrating multi-turn scaling law. Evaluation: The study evaluates model capabilities in multi-turn multimodal scenarios through InterMT-Bench.

ACL 2025

Language Models Resist Alignment

Jiaming Ji^*, Kaile Wang^*, Tianyi Qiu^*, Boyuan Chen^*, Jiayi Zhou^*, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, and Yaodong Yang^†

ACL 2025 Best Paper Award

[Project] [Paper]

This paper makes the first exploration of LLM alignment elasticity from both theoretical and empirical perspectives, revealing that models tend to revert to pre-training behavior distribution after fine-tuning, with this elasticity positively correlating with model size and pre-training data scale.

NeurIPS 2024 Oral

Aligner: Achieving Efficient Alignment through Learned Correction

Jiaming Ji^*, Boyuan Chen^*, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, and Yaodong Yang^†

NeurIPS 2024 Oral (Top 0.45%)

[Project] [Paper] [Media]

We propose Aligner, a novel and simple alignment paradigm that learns correctional residuals between preferred and dispreferred answers using a small model, achieving comparable performance with significantly reduced computational costs while being model-agnostic and plug-and-play.

ACL 2025

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

Jiaming Ji^*α, Donghai Hong^*α, Borong Zhang^α, Boyuan Chen^α, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang^†

[Paper] [Data]

We introduce PKU-SafeRLHF, a comprehensive dataset for LLM safety alignment research, featuring 44.6k prompts and 265k QA pairs with safety meta-labels across 19 harm categories and three severity levels, along with 166.8k preference data for both dual-preference and single-preference scenarios.

NeurIPS 2023

BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Jiaming Ji^*, Mickel Liu^*, Juntao Dai^*, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, Yaodong Yang^†

[Project] [Paper]

We introduce BEAVERTAILS, a large-scale dataset with 333,963 QA pairs and 361,903 expert comparisons, uniquely separating helpfulness and harmlessness annotations to advance LLM safety alignment research.

Journal

^* denotes equal contribution, ^α denotes core contributors, and ^† denotes corresponding author

The Innovation

Med-Aligner: Empowers LLM Medical Applications for complex medical scenarios

Xiangbin Meng^*, Jia-ming Ji^*, Xiangyu Yan^*, Jun-tao Dai, Bo-yuan Chen, Guan Wang, Hua Xu, Jing-jia Wang, Xu-liang Wang, Da Liu, Ming-qi Zheng, Rongzhou Wu, Chuanjie Wu, Yuwei Wu^†, Wen-yao Wang^†, Zhen Song^†, and Yaodong Yang^†

The Innovation (IF 33.2)

We introduce Med-Aligner, a plug-in alignment framework for LLMs that enables targeted medical calibration by learning correction residuals between preferred and non-preferred responses. Built as a 2-billion-parameter model using DeepSpeed and Transformer architecture, it is trained on 267,524 anonymized medical records from 21 departments spanning 4,353 disease types, achieving 90% cross-validation agreement through physician annotations following clinical guidelines.

ACM Computing Surveys

AI Alignment: A Comprehensive Survey

Jiaming Ji^*, Tianyi Qiu^*, Boyuan Chen^*, Borong Zhang^*, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao

ACM Computing Surveys (IF 28.0)

[Paper] [Project]

We present a comprehensive survey of AI alignment research, introducing the RICE principles (Robustness, Interpretability, Controllability, and Ethicality) and exploring both forward alignment (preference modeling, RLHF, scalable oversight) and backward alignment (assurance techniques, governance practices) approaches.

Selected Preprints

You may head for my Google Scholar profile to view my other works!

AI Deception: Risks, Dynamics, and Controls

Boyuan Chen^*, Sitong Fang^*, Jiaming Ji^*, Yanxu Zhu^*, Pengcheng Wen^*, Jinzhou Wu^*, …, Yaodong Yang^†, Philip Torr , Zhongyuan Wang, Tiejun Huang, Ya-qin Zhang, Hongjiang Zhang, Andrew Yao.

_{^* We thank all collaborators for their valuable feedback. Please see project team for reference}

[Paper] [Website]

As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This survey provides a comprehensive and up-to-date overview of the field, defining AI deception through the lens of signaling theory and reviewing empirical studies that highlight its sociotechnical risks. We organize existing research into a deception cycle with two components: emergence, which analyzes incentive foundations, capability prerequisites (perception, planning, and performing), and contextual triggers such as supervision gaps and distributional shifts; and treatment, which encompasses detection methods, evaluation protocols, and mitigation strategies. We conclude by outlining key challenges and future directions, emphasizing the need for integrated technical, community, and governance approaches, and provide a living resource at www.deceptionsurvey.com

Arxiv 2025

Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

Jiaming Ji^*, Jiayi Zhou^*, Hantao Lou^*, Boyuan Chen^*, Donghai Hong^*, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang^†

[Paper] [Code] [Data]

Motivated by the challenge of achieving all-modality human preference alignment, particularly the limitations of binary preferences in accurately reflecting human preferences, we introduce the align-anything: Data: align-anything-200k, which covers text, image, audio, video modalities, and 8+ specific subtasks, annotated with preference and language feedback; Algorithm: improving all-modality alignment by learning from language feedback; Evaluation: encompassing all-modality understanding and generation.