Coming December 2024
Enter your e-mail here to receive a one-time message when the benchmark is released.
| Dataset | Description | Number of samples |
| --- | --- | --- |
| Modified subset of TheoremQA (Chen et al. 2023) | Multi-step flawed solutions, with annotated errors, to university-level questions on various STEM topics | 105 |
| Modified subset of SciBench (Wang et al. 2024) | Multi-step flawed solutions, with annotated errors, to college-level physics questions | 94 distinct flawed solutions to 31 questions |
| Modified CELS (Recchia et al., in prep.) | GPT-4 and GPT-3.5 answers to questions on contract law (5), evidence law (5), Lojban (48), and surgery (48), where the model was asked to argue explicitly for either a right or a wrong answer; annotated by multiple topic experts | 424 LLM responses to 106 questions (four per question), annotated sentence-by-sentence for errors by two experts each |
| Modified subset of Python800 (Puri et al. 2021) | Claims made by GPT-4 about solutions to programming-competition problems, with errors identified by two experts and disagreements adjudicated by a third | 1300 LLM claims about 650 programming problems, annotated by two experts each |
| Modified subset of ScienceQA (Lu et al. 2022) | Multi-step flawed solutions, with annotated errors, to grade-school and high-school questions on language arts, social studies, and science | 308 |
| Modified subset of Google-Proof QA Diamond (Rein et al. 2023) | Multi-step flawed solutions, with annotated errors, to university-level questions on various STEM topics | 198 |
| Modified adversarial subset of MedQA (Jin et al. 2020) | GPT-4 answers, with justifications, to difficult MedQA questions (selected such that GPT-4 answers only 20% correctly), accompanied by structured clinician commentary on the LLM answers | 223 for which two of three clinicians (including the initial question author) agree; 319 total |
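For concreteness, here is a minimal, purely illustrative sketch of what one annotated record from the flawed-solution subsets (TheoremQA, SciBench, ScienceQA, Google-Proof QA Diamond) might look like. The benchmark's actual file format is not described above, so every field and function name here (`source_dataset`, `solution_steps`, `contains_error`, `first_error_step`) is a hypothetical placeholder rather than the released schema.

```python
# Illustrative sketch only: the benchmark's real file layout and field names are
# not specified on this page, so the keys below are guesses meant to convey the
# general shape of a multi-step flawed solution with annotated errors.
import json

# A hypothetical record for one flawed multi-step solution.
example_record = {
    "source_dataset": "TheoremQA",   # which modified subset the item came from
    "question": "…",                 # university-level STEM question (elided)
    "solution_steps": [              # one entry per reasoning step
        {"step": 1, "text": "…", "contains_error": False},
        {"step": 2, "text": "…", "contains_error": True},  # annotated error location
    ],
}

def first_error_step(record):
    """Return the index of the first step annotated as erroneous, or None if clean."""
    for step in record["solution_steps"]:
        if step["contains_error"]:
            return step["step"]
    return None

if __name__ == "__main__":
    print(json.dumps(example_record, ensure_ascii=False, indent=2))
    print("First annotated error at step:", first_error_step(example_record))
```

Under this reading, the sample counts in the table would correspond to the number of such records per subset (e.g., 105 for the TheoremQA subset), with the doubly annotated subsets (CELS, Python800, MedQA) additionally carrying per-annotator labels.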