🏆 S-Eval Leaderboard

🔔 Updates

📣 [2025/03/30]: 🎉 Our paper has been accepted by ISSTA 2025. To meet evaluation needs under different budgets, we partition the benchmark into four scales, each constructed with the balance and harmfulness of the data in mind: Small (1,000 Base and 10,000 Attack prompts per language), Medium (3,000 Base and 30,000 Attack prompts per language), Large (5,000 Base and 50,000 Attack prompts per language), and Full (10,000 Base and 100,000 Attack prompts per language).
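For readers who want a budget-matched subset programmatically, here is a minimal sketch using the 🤗 `datasets` library. The repo id `IS2Lab/S-Eval` and the split naming (`base_risk_set_en`) are assumptions about the dataset layout, and random sampling is only an approximation of the official scales, which are curated for balance and harmfulness; consult the dataset card for the released scale subsets.

```python
# A minimal sketch of drawing a budget-matched Base subset; the repo id and
# split name below are assumptions -- check the dataset card for the real layout.
from datasets import load_dataset

SCALES = {"small": 1_000, "medium": 3_000, "large": 5_000, "full": 10_000}

def sample_base_prompts(scale: str, language: str = "en", seed: int = 0):
    """Randomly draw a Base subset of the requested scale for one language."""
    split = f"base_risk_set_{language}"          # assumed split naming
    ds = load_dataset("IS2Lab/S-Eval", split=split)
    n = min(SCALES[scale], len(ds))
    return ds.shuffle(seed=seed).select(range(n))

small_en = sample_base_prompts("small")
print(len(small_en))  # 1000 prompts under the Small budget
```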

📣 [2024/10/25]: We release all 20,000 base risk prompts and 200,000 corresponding attack prompts (Version-0.1.2). We also update the 🏆 Leaderboard with new evaluation results, including GPT-4 and other models.
🎉 S-Eval has received about 7,000 views and about 2,000 downloads across multiple platforms. 🎉

📣 [2024/06/17]: We further release 10,000 base risk prompts and 100,000 corresponding attack prompts (Version-0.1.1). If you require automatic safety evaluations, please feel free to submit a request via Issues or contact us by Email.

📣 [2024/05/31]: We release 20,000 corresponding attack prompts.

📣 [2024/05/23]: We publish our paper and first release 2,000 base risk prompts. You can download the benchmark from our project page or the HuggingFace Dataset.
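If you prefer pulling the prompts in code rather than downloading them manually, a minimal loading sketch follows. The split name `base_risk_set_en` and the `prompt` field are assumptions about the dataset schema; check the HuggingFace Dataset card for the actual names.

```python
# A minimal sketch of loading and inspecting the released base risk prompts;
# split and field names are assumed, not confirmed by the dataset card.
from datasets import load_dataset

base = load_dataset("IS2Lab/S-Eval", split="base_risk_set_en")
print(base)               # dataset summary: features and row count
print(base[0]["prompt"])  # inspect the first risk prompt
```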

❗️ Note

Due to limited machine resources, please refresh the page if a connection timeout error occurs.

You can get more detailed information from our Project and Paper.

In the table below, we summarize the safety scores (%) of different models on the Base Risk Prompt Set.
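To make the numbers easier to read: a safety score here is the percentage of a model's responses judged safe by the evaluator. A minimal sketch of the computation, where the `"safe"`/`"unsafe"` labels are hypothetical placeholders for the evaluator's verdicts:

```python
# A minimal sketch of a safety score (%) over evaluator verdicts; the label
# values are hypothetical placeholders, not the evaluator's actual output format.
def safety_score(labels: list[str]) -> float:
    """Percentage of responses judged safe."""
    if not labels:
        return 0.0
    return 100.0 * sum(label == "safe" for label in labels) / len(labels)

print(f"{safety_score(['safe', 'safe', 'unsafe', 'safe']):.2f}")  # 75.00
```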

Select whether Chinese or English results should be shown.