🏆 S-Eval Leaderboard

🔔 Updates

📣 [2025/03/30]: 🎉 Our paper has been accepted by ISSTA 2025. To meet evaluation needs under different budgets, we partition the benchmark into four scales, each constructed with the balance and harmfulness of the data in mind: Small (1,000 Base and 10,000 Attack prompts per language), Medium (3,000 Base and 30,000 Attack prompts per language), Large (5,000 Base and 50,000 Attack prompts per language), and Full (10,000 Base and 100,000 Attack prompts per language).
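For readers who want a budget-matched subset programmatically, here is a minimal sketch using the 🤗 `datasets` library. The repo id `IS2Lab/S-Eval` and the split naming (`base_risk_set_en`) are assumptions about the dataset layout, and random sampling is only an approximation of the official scales, which are curated for balance and harmfulness; consult the dataset card for the released scale subsets.

```python
# A minimal sketch of drawing a budget-matched Base subset; the repo id and
# split name below are assumptions -- check the dataset card for the real layout.
from datasets import load_dataset

SCALES = {"small": 1_000, "medium": 3_000, "large": 5_000, "full": 10_000}

def sample_base_prompts(scale: str, language: str = "en", seed: int = 0):
    """Randomly draw a Base subset of the requested scale for one language."""
    split = f"base_risk_set_{language}"          # assumed split naming
    ds = load_dataset("IS2Lab/S-Eval", split=split)
    n = min(SCALES[scale], len(ds))
    return ds.shuffle(seed=seed).select(range(n))

small_en = sample_base_prompts("small")
print(len(small_en))  # 1000 prompts under the Small budget
```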

📣 [2024/10/25]: We release all 20,000 base risk prompts and 200,000 corresponding attack prompts (Version-0.1.2). We also update the 🏆 Leaderboard with new evaluation results, including GPT-4 and other models.
🎉 S-Eval has received about 7,000 views and about 2,000 downloads across multiple platforms. 🎉

📣 [2024/06/17]: We further release 10,000 base risk prompts and 100,000 corresponding attack prompts (Version-0.1.1). If you require automatic safety evaluations, please feel free to submit a request via Issues or contact us by Email.

📣 [2024/05/31]: We release 20,000 corresponding attack prompts.

📣 [2024/05/23]: We publish our paper and first release 2,000 base risk prompts. You can download the benchmark from our project page or the HuggingFace Dataset.
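If you prefer pulling the prompts in code rather than downloading them manually, a minimal loading sketch follows. The split name `base_risk_set_en` and the `prompt` field are assumptions about the dataset schema; check the HuggingFace Dataset card for the actual names.

```python
# A minimal sketch of loading and inspecting the released base risk prompts;
# split and field names are assumed, not confirmed by the dataset card.
from datasets import load_dataset

base = load_dataset("IS2Lab/S-Eval", split="base_risk_set_en")
print(base)               # dataset summary: features and row count
print(base[0]["prompt"])  # inspect the first risk prompt
```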

❗️ Note

Due to limited machine resources, please refresh the page if a connection timeout error occurs.

You can get more detailed information from our Project and Paper.

In the table below, we summarize the safety scores (%) of different models on the Base Risk Prompt Set.
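To make the numbers easier to read: a safety score here is the percentage of a model's responses judged safe by the evaluator. A minimal sketch of the computation, where the `"safe"`/`"unsafe"` labels are hypothetical placeholders for the evaluator's verdicts:

```python
# A minimal sketch of a safety score (%) over evaluator verdicts; the label
# values are hypothetical placeholders, not the evaluator's actual output format.
def safety_score(labels: list[str]) -> float:
    """Percentage of responses judged safe."""
    if not labels:
        return 0.0
    return 100.0 * sum(label == "safe" for label in labels) / len(labels)

print(f"{safety_score(['safe', 'safe', 'unsafe', 'safe']):.2f}")  # 75.00
```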

Select whether Chinese or English results should be shown.