🏆 S-Eval Leaderboard
🔔 Updates
📣 [2025/03/30]: 🎉 Our paper has been accepted by ISSTA 2025. To meet evaluation needs under different budgets, we partition the benchmark into four scales: Small (1,000 Base and 10,000 Attack prompts in each language), Medium (3,000 Base and 30,000 Attack prompts in each language), Large (5,000 Base and 50,000 Attack prompts in each language) and Full (10,000 Base and 100,000 Attack prompts in each language), taking both the balance and the harmfulness of the data into account; the per-language sizes are summarized in the sketch after this update list.
📣 [2024/10/25]: We release all 20,000 base risk prompts and 200,000 corresponding attack prompts (Version-0.1.2). We also update the 🏆 LeaderBoard with new evaluation results, including GPT-4 and other models.
🎉 S-Eval has achieved about 7,000 total views and about 2,000 total downloads across multiple platforms. 🎉
📣 [2024/06/17]: We further release 10,000 base risk prompts and 100,000 corresponding attack prompts (Version-0.1.1). If you require automatic safety evaluations, please feel free to submit a request via Issues or contact us by Email.
📣 [2024/05/31]: We release 20,000 corresponding attack prompts.
📣 [2024/05/23]: We publish our paper and release the first 2,000 base risk prompts. You can download the benchmark from our project or the HuggingFace Dataset.
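The per-language sizes of the four scales can be written down as a small configuration. The sketch below only illustrates the partition described in the 2025/03/30 update and is not an official API: the `SCALES` dictionary and the `sample_scale` helper are hypothetical names, and the released subsets are curated for balance and harmfulness rather than sampled at random.

```python
import random

# Per-language sizes of the four benchmark scales (2025/03/30 update):
# numbers of Base and Attack prompts in each language.
SCALES = {
    "small":  {"base": 1_000,  "attack": 10_000},
    "medium": {"base": 3_000,  "attack": 30_000},
    "large":  {"base": 5_000,  "attack": 50_000},
    "full":   {"base": 10_000, "attack": 100_000},
}

def sample_scale(base_prompts, attack_prompts, scale="small", seed=0):
    """Draw a budget-sized subset of one language's prompts (a rough stand-in:
    the official Small/Medium/Large subsets are curated, not randomly sampled)."""
    rng = random.Random(seed)
    sizes = SCALES[scale]
    base = rng.sample(base_prompts, min(sizes["base"], len(base_prompts)))
    attack = rng.sample(attack_prompts, min(sizes["attack"], len(attack_prompts)))
    return base, attack
```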
❗️ Note
Due to limited machine resources, please refresh the page if a connection timeout error occurs.
You can get more detailed information from our Project and Paper.
In the table below, we summarize the safety scores (%) of different models on the Base Risk Prompt Set.
In the table below, we summarize the attack success rates (%) of the instruction attacks in the Attack Prompt Set on different models.
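For reference, the sketch below shows how the two reported metrics can be aggregated from per-response judgments: the safety score is the percentage of responses judged safe on the Base Risk Prompt Set, and the attack success rate (ASR) is the percentage of attack prompts whose responses are judged unsafe, grouped by attack. The `is_safe` and `attack_name` fields are illustrative names; the actual judgments come from the automatic safety critique described in our paper.

```python
from collections import defaultdict

def safety_score(judgments):
    """Safety score (%): share of responses judged safe.

    `judgments` is a list of dicts with a boolean `is_safe` field (illustrative name).
    """
    if not judgments:
        return 0.0
    safe = sum(1 for j in judgments if j["is_safe"])
    return 100.0 * safe / len(judgments)

def attack_success_rates(judgments):
    """ASR (%) per instruction attack: share of attack prompts whose response is
    judged unsafe, grouped by the (illustrative) `attack_name` field."""
    totals, unsafe = defaultdict(int), defaultdict(int)
    for j in judgments:
        totals[j["attack_name"]] += 1
        if not j["is_safe"]:
            unsafe[j["attack_name"]] += 1
    return {name: 100.0 * unsafe[name] / totals[name] for name in totals}
```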
S-Eval is designed to be a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. So far, S-Eval contains 220,000 evaluation prompts in total (and is still being actively expanded), including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks. These test prompts are generated from a comprehensive and unified risk taxonomy that is designed to cover all crucial dimensions of LLM safety evaluation and to accurately reflect the varied safety levels of LLMs across these risk dimensions. More details on the construction of the test suite, including model-based test generation, selection and the expert critique LLM, can be found in our paper.
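For readers who want to experiment with the prompts directly, the benchmark can be pulled from the HuggingFace Hub. The snippet below is a minimal sketch: the dataset repository id, configuration names (`base_risk_set_zh`, `attack_set_zh`) and column names (`prompt`, `risk_type`) are assumptions for illustration, so please check the HuggingFace Dataset card for the exact identifiers.

```python
from datasets import load_dataset

# Minimal sketch of loading S-Eval from the HuggingFace Hub.
# The repository id, configuration names and column names are assumptions;
# consult the dataset card for the exact ones before running.
base_zh = load_dataset("IS2Lab/S-Eval", "base_risk_set_zh", split="train")
attack_zh = load_dataset("IS2Lab/S-Eval", "attack_set_zh", split="train")

print(len(base_zh), "base risk prompts (zh)")
print(len(attack_zh), "attack prompts (zh)")

# Each record is expected to carry at least a prompt and its risk category.
for record in base_zh.select(range(3)):
    print(record.get("risk_type", "<unknown>"), "|", record["prompt"][:80])
```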