Coming December 2024
Dataset | Description | Number of samples |
---|---|---|
Modified subset of TheoremQA (Chen et al. 2023) | Multi-step flawed solutions with annotated errors to university-level questions on various STEM topics | 105 |
Modified subset of SciBench (Wang et al. 2024) | Multi-step flawed solutions with annotated errors to college-level physics questions | 94 distinct flawed solutions to 31 questions |
Modified CELS (Recchia et al., in prep.) | GPT-4 and GPT-3.5 answers to questions on contract law (5), evidence law (5), Lojban (48), and surgery (48), where the model was asked to argue explicitly for either a right or a wrong answer; annotated by multiple topic experts | 424 LLM responses to 106 questions (four per question), annotated sentence-by-sentence for errors by two experts each
Modified subset of Python800 (Puri et al. 2021) | GPT-4 claims about programming competition solutions, with errors identified by two experts and disagreements adjudicated by a third | 1300 LLM claims about 650 programming problems, annotated by two experts each
Modified subset of ScienceQA (Lu et al. 2022) | Multi-step flawed solutions with annotated errors to grade-school and high-school questions on language arts, social studies, and science | 308
Modified subset of GPQA Diamond (Rein et al. 2023) | Multi-step flawed solutions with annotated errors to university-level questions on various STEM topics | 198
Modified adversarial subset of MedQA (Jin et al. 2020) | GPT-4 answers, with justifications, to difficult MedQA questions (selected such that GPT-4 answers only 20% correctly), with structured clinician commentary on the LLM answers | 223 for which two of three clinicians (including the initial question author) agree; 319 total