About me
I’m a third-year Ph.D. student in the Conversational AI Group, Department of Computer Science and Technology, Tsinghua University, where I am advised by Prof. Minlie Huang. My research interests mainly include the safety and alignment of large language models and AI agents.
Education
- 2022.9 - Present: Ph.D. Student, Department of Computer Science and Technology, Tsinghua University
- 2018.9 - 2022.6: B.Eng., Department of Computer Science and Technology, Tsinghua University
Publications
Conference Papers
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang. SafetyBench: Evaluating the Safety of Large Language Models. ACL 2024 (Long Paper). [pdf] [code]
Zhexin Zhang*, Junxiao Yang*, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang. Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. ACL 2024 (Long Paper). [pdf] [code]
Zhexin Zhang, Jiaxin Wen, Minlie Huang. Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation. ACL 2023 (Long Paper, Oral). [pdf] [code]
Zhexin Zhang*, Yida Lu*, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang. ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors. EMNLP 2024 Findings (Long Paper). [pdf] [code]
Zhexin Zhang*, Jiaxin Wen*, Jian Guan, Minlie Huang. Persona-Guided Planning for Controlling the Protagonist’s Persona in Story Generation. NAACL 2022 (Long Paper). [pdf] [code]
Junxiao Yang*, Zhexin Zhang*, Shiyao Cui, Hongning Wang, Minlie Huang. Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints. ACL 2025 (Long Paper). [pdf] [code]
Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Minlie Huang. InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning. EMNLP 2023 Findings (Long Paper). [pdf] [code]
Zhexin Zhang*, Jiale Cheng*, Hao Sun, Jiawen Deng, Fei Mi, Yasheng Wang, Lifeng Shang, Minlie Huang. Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation. EMNLP 2022 Findings (Long Paper). [pdf] [code]
Zhexin Zhang, Yeshuang Zhu, Zhengcong Fei, Jinchao Zhang, Jie Zhou. Selecting Stickers in Open-Domain Dialogue through Multitask Learning. ACL 2022 Findings (Short Paper). [pdf] [code]
Zhexin Zhang, Jian Guan, Xin Cui, Yu Ran, Bo Liu, Minlie Huang. Self-Supervised Sentence Polishing by Adding Engaging Modifiers. ACL 2023 Demo (Demo Paper). [pdf] [code]
Zhexin Zhang*, Jian Guan*, Guowei Xu, Yixiang Tian, Minlie Huang. Automatic Comment Generation for Chinese Student Narrative Essays. EMNLP 2022 Demo (Demo Paper). [pdf] [code]
Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, Minlie Huang. OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics. ACL 2021 (Long Paper). [pdf] [code]
Hao Sun, Zhexin Zhang, Fei Mi, Yasheng Wang, Wei Liu, Jianwei Cui, Bin Wang, Qun Liu, Minlie Huang. MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions. ACL 2023 (Long Paper). [pdf] [code]
Yida Lu*, Jiale Cheng*, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang. LongSafety: Evaluating Long-Context Safety of Large Language Models. ACL 2025 (Long Paper). [pdf] [code]
Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang. Unveiling the Implicit Toxicity in Large Language Models. EMNLP 2023 (Long Paper). [pdf] [code]
Preprints
Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang. Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! arXiv preprint 2025. [pdf] [code]
Zhexin Zhang*, Xian Qi Loye*, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang. How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study. arXiv preprint 2025. [pdf] [code]
Zhexin Zhang*, Shiyao Cui*, Yida Lu*, Jingzhuo Zhou*, Junxiao Yang, Hongning Wang, Minlie Huang. Agent-SafetyBench: Evaluating the Safety of LLM Agents. arXiv preprint 2024. [pdf] [code]
Zhexin Zhang*, Junxiao Yang*, Yida Lu, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang. From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks. arXiv preprint 2024. [pdf] [code]
Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang. Safety Assessment of Chinese Large Language Models. arXiv preprint 2023. [pdf] [code]
Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang. Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements. arXiv preprint 2023. [pdf] [code]
Shangqing Tu, Zhuoran Pan, Wenxuan Wang, Zhexin Zhang, Yuliang Sun, Jifan Yu, Hongning Wang, Lei Hou, Juanzi Li. Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack. arXiv preprint 2024. [pdf] [code]
Projects
Services
- Program Committee Member (Conference Reviewer): EMNLP 2023, ARR 2023, ARR 2024, ARR 2025
Teaching
I have served as a TA for the following undergraduate courses:
- Artificial Neural Network (2022 Fall, 2023 Fall, 2024 Fall)
- Object-Oriented Programming (2022 Spring, 2023 Spring)
- Discrete Mathematics for Computer Science (2023 Spring, 2024 Spring, 2025 Spring)
Selected Honors and Awards
- Top 40 Global Finalist of the Baidu Scholarship, 2024
- National Scholarship, Dept. CST, Tsinghua University, 2024
- Third place, Global Challenge for Safe and Secure LLMs, 2024
- Samsung Scholarship, Dept. CST, Tsinghua University, 2023
- Excellent Graduate, Tsinghua University, 2022
- Outstanding Graduate, Dept. CST, Tsinghua University, 2022
- Third place (3/3665), WeChat Big Data Challenge Finals, 2022
- Sixth place (6/1000+), Global AI Innovation Contest Finals, 2022
- Second place (2/5000+), Global AI Innovation Contest Finals, 2021
- Academic Excellence Scholarship, Dept. CST, Tsinghua University, 2020, 2021
- Meritorious Winner (top 10%), Mathematical Contest in Modeling, 2020