Coming February 2025
The FindTheFlaws meta-dataset is designed to advance research on scalable oversight by offering a comprehensive benchmark of (mostly) challenging questions paired with flawed answers across multiple domains, including STEM, law, medicine, and constructed languages. The flaws in each dataset have, for the most part, been explicitly designed to be difficult to detect, although difficulty varies substantially by dataset and item. By providing annotated flaws in long-form responses, we hope to support scalable oversight research aimed at developing protocols that enable 'weak' AI models (models that cannot necessarily generate correct answers to highly challenging questions themselves) to effectively identify and analyze subtle errors in the kinds of reasoning and explanations a stronger model might produce. Some of our datasets may also support research in process-oriented learning, although not all datasets within our benchmark are appropriate for this purpose.
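To illustrate the kind of structure such items provide (a question, a long-form flawed answer, and expert annotations locating the flaws), here is a minimal Python sketch of one possible record layout. Every class and field name below is hypothetical and may not match the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlawAnnotation:
    """One expert-annotated flaw in a long-form answer (hypothetical fields)."""
    span: str         # the sentence or step containing the error
    explanation: str  # annotator's description of what is wrong and why

@dataclass
class FindTheFlawsItem:
    """A single benchmark item (hypothetical schema; the released format may differ)."""
    source_dataset: str                  # e.g. "TheoremQA" or "SciBench"
    question: str                        # the (mostly) challenging question
    flawed_answer: str                   # long-form response containing subtle errors
    annotations: List[FlawAnnotation] = field(default_factory=list)
```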

Dataset | Description | Number of samples
Modified subset of TheoremQA (Chen et al. 2023) | Multi-step flawed solutions with annotated errors to university-level questions on various STEM topics | 105
Modified subset of SciBench (Wang et al. 2024) | Multi-step flawed solutions with annotated errors to college-level physics questions | 94 distinct flawed solutions to 31 questions
Modified CELS (Recchia et al., in prep.) | GPT-4 and GPT-3.5 answers to questions on contract law (5), evidence law (5), Lojban (48), and surgery (55), where the model was asked to argue explicitly for either a right or a wrong answer, annotated by multiple topic experts | 452 LLM responses to 113 questions (four per question), annotated sentence by sentence for errors by two experts each
Modified subset of Python800 (Puri et al. 2021) | GPT-4 claims about solutions to programming competition problems, with errors identified by two experts and disagreements adjudicated by a third | 1,300 LLM claims about 650 programming problems, annotated by two experts each
Modified subset of ScienceQA (Lu et al. 2022) | Multi-step flawed solutions with annotated errors to grade-school and high-school questions on language arts, social studies, and science | 308
Modified subset of Google-Proof QA Diamond (Rein et al. 2023) | Multi-step flawed solutions with annotated errors for university-level questions on various STEM topics | 198
Modified adversarial subset of MedQA (Jin et al. 2020) | GPT-4 answers and justifications for difficult MedQA questions (selected such that GPT-4 answers only 20% correctly), with structured clinician commentary on the LLM answers | 223 for which two of three clinicians (including the initial question author) agree; 319 total
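As a further illustration, the sketch below shows how records of this kind might feed a simple scalable-oversight experiment, scoring how often a 'weak' verifier flags the annotated flawed answers. The function name, the verifier interface, and the scoring rule are all assumptions rather than part of the benchmark's tooling.

```python
from typing import Callable, Iterable

def evaluate_weak_verifier(
    items: Iterable,                       # records shaped like the FindTheFlawsItem sketch above
    verifier: Callable[[str, str], bool],  # verifier(question, answer) -> "contains an error?"
) -> float:
    """Return the fraction of flawed answers that the weak verifier flags as erroneous.

    Illustrative only: a real oversight protocol would also check that the
    verifier localizes the annotated flaw, not just that it flags the answer.
    """
    items = list(items)  # materialize so a one-shot iterator can be counted and iterated
    if not items:
        return 0.0
    flagged = sum(
        1 for item in items if verifier(item.question, item.flawed_answer)
    )
    return flagged / len(items)
```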