Question	deepseek-reasoner (71%)	o3-mini (70%)	gemini-2.5-pro-exp-03-25 (69%)	claude-3-7-sonnet-latest (67%)	gpt-4o (48%)	claude-3-5-haiku-20241022 (44%)
BashRenamer Test if the model can write a bash script that renames files with a specific pattern. tags: code, bash	1/1	2/2	1/1	2/2	5/5	5/5
GitSimple Test if the model can guide a user in a conversation to setup a git repo. tags: bash, git, agent	1/1	2/2	1/1	2/2	5/5	5/5
ProgramRewriteCSimple Test if the model can rewrite a very simple Python program into an equivalent C program. tags: code, c	1/1	2/2	1/1	2/2	5/5	5/5
ExplainPrime Test if the model can interpret a minified JavaScript function and explain its function. tags: code, explain	1/1	2/2	1/1	2/2	5/5	5/5
ExtractRef Test if the model can extract paper titles from a block of text. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
SqlSubquery Test if the model can generate a Python program that retrieves data from a SQL file. tags: sql	1/1	2/2	1/1	2/2	5/5	5/5
Flexbox Test if the model can generate an HTML file using flexbox. tags: code, html	1/1	2/2	1/1	2/2	5/5	5/5
FreeCADCircle Test if the model understands a rambling question about how to make a construction circle in FreeCAD. tags: explain, fun	1/1	2/2	1/1	2/2	5/5	5/5
GitMerge Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch. tags: bash, git, agent	1/1	2/2	1/1	2/2	5/5	5/5
WhatIsStarStar Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository. tags: explain	1/1	2/2	1/1	2/2	5/5	5/5
FixJnpBug Test if the model can identify and fix a bug in a given jax.numpy function. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
SqlMakeTable Test if the model can generate a SQL query to create a database table. tags: sql	1/1	2/2	1/1	2/2	5/5	5/5
Make16FilesEasy Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
PrintHello Test if the model can generate a basic python program that prints "hello world". tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
ProgramSqrt Test if the model can implement a sqrt function. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
Regex Test if the model can write a Python function with a straightforward regex. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
RecoverExpiredPage Test if a model knows how to get the HTML for the entire webpage; not just the body. tags: explain, html	1/1	2/2	1/1	2/2	5/5	5/5
UnitConversion Test if a model can do basic math with some EE equations. tags: explain	1/1	2/2	1/1	2/2	5/5	5/5
VagueLoopFormat Test if the model can follow vague instructions for how to print IDs following an example. tags: code, python	1/1	2/2	1/1	2/2	5/5	5/5
VectorizeSmall Test if the model can replace a for loop with a vectorized version. tags: code, python, performance	1/1	2/2	1/1	2/2	5/5	5/5
WhatIsAutoModel Test if the model can interpret vague questions and will respond with the answer I want, not the answer that's easy to find. tags: explain	1/1	2/2	1/1	2/2	5/5	5/5
WhatIsFloatFormat This test case checks if models can format f strings with floats. tags: explain, python	1/1	2/2	1/1	2/2	5/5	5/5
WhatIsInv This test case is designed to check if the model can correctly identify the Python operator used for the tilde (~) symbol. tags: explain, python	1/1	2/2	1/1	2/2	5/5	5/5
WhatIsSlice This test case checks if the model can say how to properly get the end of a slice. tags: explain, python	1/1	2/2	1/1	2/2	5/5	5/5
ProgramRewriteC Test if the model can rewrite a given Python program into an equivalent C program. tags: code, c	1/1	2/2	1/1	2/2	4/5	5/5
Dedent Test if the model can write a Python function that removes excess indentation from a given block of code. tags: code, python	1/1	2/2	1/1	2/2	5/5	4/5
ExplainBroadcast Test if the model can correctly explain what the VPBROADCASTB instruction does. tags: explain	1/1	2/2	1/1	2/2	5/5	4/5
ExtractEmail Test if the model can accurately extract and identify invalid email addresses from a given text file. Models that are "overly safe" will fail. tags: data	1/1	2/2	1/1	2/2	4/5	5/5
GetVocab This test case is designed to check if the model can print out the tokens in an AutoTokenizer's vocabulary. tags: explain, python	1/1	2/2	1/1	2/2	5/5	4/5
SumSomeData Test if the model can infer what data to sum and what to ignore by example with vague instructions. tags: code, python	1/1	2/2	1/1	2/2	4/5	5/5
MissingStep Test if the model can identify a missing ingredient in a recipe. Identifying incorrect steps is much harder than missing steps. tags: explain, fun	1/1	2/2	1/1	2/2	5/5	3/5
SimpleBNF Test if the model can understand a vague BNF-style grammar and write a Python function that evaluates expressions based on the grammar rules. tags: code, python	1/1	2/2	1/1	2/2	4/5	4/5
ExplainPrime2 Test if the model can interpret a minified and obfuscated JavaScript function and explain its function. tags: explain	1/1	2/2	1/1	2/2	5/5	3/5
QuestionThreadedFix Test if the model can explain a poorly worded error message in a short threaded python program. tags: code, python, explain	1/1	2/2	1/1	2/2	4/5	4/5
CRC32 Test if the model understands the CRC-32 spec well enough to implement it. tags: code, c	1/1	2/2	1/1	2/2	4/5	4/5
DrawHouse Test if the model can generate an HTML file with WebGL code that draws an image. tags: code, visual, html	1/1	2/2	1/1	2/2	3/5	5/5
FixDockerCuda This test case checks if the model can debug a Docker CUDA error. tags: explain	1/1	1/2	1/1	2/2	5/5	5/5
ShortenPyGet Test if the model can shorten a line of Python to an equivalent line. tags: code, python	1/1	2/2	1/1	1/2	5/5	5/5
ImgResize Test if the model can resize several images in a given subdirectory. tags: code, python	1/1	1/2	1/1	2/2	4/5	5/5
WhatIsStarStarB Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository. tags: explain	1/1	2/2	1/1	2/2	1/5	5/5
TorchBackwardFix Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients. tags: code, python, fix	1/1	2/2	1/1	2/2	5/5	1/5
LlamaKnowledge Test the knowledge cutoff of the model to see if it knows the LLAMA-2 hidden dimension size. tags: explain	1/1	2/2	1/1	2/2	5/5	1/5
Disas1 Test if the model can disassemble a simple Python function from its bytecode. tags: code, python	1/1	2/2	1/1	1/2	3/5	5/5
DrawTriangle Test if the model can generate an HTML file with WebGL code that draws an image. tags: code, visual, html	1/1	2/2	1/1	1/2	5/5	3/5
DisasPrimes Test if the model can disassemble Python bytecode and create a function that returns a list of prime numbers and their negations. tags: code, python	1/1	2/2	1/1	2/2	0/5	5/5
MakeTreeEasy Test if the model can create a tree from a string. tags: code, python	1/1	2/2	1/1	2/2	0/5	5/5
RustCountNoLib Test if the model can write a rust program that performs word counting. tags: code, rust	1/1	2/2	1/1	2/2	2/5	3/5
MakeShiftOp Test if the model can generate a python program that defines dataflow DSL. tags: code, python	1/1	2/2	1/1	2/2	4/5	1/5
FixNode Test if the model can identify a node error message. tags: explain	1/1	2/2	1/1	1/2	5/5	2/5
EmojiMovie A for-fun test to see if the model can go movie title -> emoji -> movie title. tags: fun	1/1	2/2	1/1	2/2	3/5	1/5
PyChessPrefix Test if the model can correctly call a python API for a moderately popular python library. tags: code, python	1/1	2/2	1/1	2/2	3/5	1/5
RustCount Test if the model can write a rust program that performs word counting. tags: code, rust	1/1	2/2	1/1	2/2	3/5	1/5
WhatIsLPR This test case checks if the model knows lpr commands. tags: explain	0/1	2/2	1/1	2/2	5/5	4/5
UPythonMQTT Test if a model can write upython code with an obscure module. tags: python, code	0/1	2/2	1/1	2/2	4/5	5/5
EasyFlagDrawBMP Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator. tags: code, c, visual	1/1	2/2	1/1	2/2	0/5	3/5
SimpleFix Test if the model can identify and fix an issue with a tokenizer in a Python code snippet. Identifying the problem is in the regex, and fixing the regex, are both hard. tags: code, fix, python	1/1	2/2	0/1	2/2	5/5	3/5
LatexNewline Test if a model can fix a latex newline error in a caption. tags: explain	1/1	2/2	1/1	2/2	1/5	2/5
MakeTree Test if the model can create a tree from a string. tags: code, python	1/1	2/2	1/1	2/2	0/5	3/5
TorchJnp Test if the model can convert a torch neural network to a jax numpy model. tags: code, python	1/1	2/2	1/1	1/2	0/5	5/5
GitMergeConflict Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch. tags: bash, git, agent	0/1	2/2	1/1	2/2	5/5	2/5
StateTableStepbystep Test if the model can process a large table of text and identify rows with specific values. tags: data	1/1	2/2	1/1	2/2	2/5	0/5
FastL2 Test if the model can optimize a given Python program for speed and memory efficiency. tags: code, performance, python	1/1	2/2	1/1	2/2	1/5	1/5
HallucinateReference Test if the model will hallucinate references that don't exist. tags: explain	1/1	2/2	0/1	2/2	4/5	3/5
NumpyIx Test if a model can identify the ix_ function as a method for simplifying some code. tags: explain, python	1/1	1/2	1/1	1/2	2/5	5/5
FlagDrawBMP Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator. tags: code, c, visual	1/1	2/2	1/1	1/2	3/5	1/5
MakeJson Test if the model can successfully convert unstructured data to JSON. tags: data	1/1	2/2	0/1	1/2	5/5	4/5
NewAssemblySquareNumbers Test if the model can write a program in a new assembly language. This ability to learn a new language on-the-fly is important for many tasks. tags: code	1/1	1/2	1/1	2/2	4/5	0/5
ExplainWeirdCEasy This test case is meant to test if the model can correctly evaluate a complex C expression. tags: explain, c	1/1	2/2	1/1	2/2	1/5	0/5
LispSilencePython Test if the model can understand a vague error for an emacs lisp question. tags: explain	1/1	2/2	1/1	0/2	4/5	2/5
RLEDecode This test case tests if the model can convert a Game of Life pattern represented in RLE format to a numpy array. tags: code, python	1/1	2/2	1/1	2/2	0/5	1/5
Base64Thought Test if a model will follow instructions to the letter without lots of cajoling. Thinking in base64 is also interesting. tags: explain, fun	1/1	2/2	1/1	2/2	0/5	0/5
WhyBuggyPythonCountPar Test if a model can explain a bug in a parallelized wordcount function. tags: explain, python, fix	1/1	1/2	1/1	1/2	5/5	0/5
SqlExplore Test if the model can interact with an SQLite database and provide the correct command to add a new person with specific criteria. tags: sql, agent	0/1	2/2	1/1	0/2	5/5	5/5
IsUU Test if the model can correctly identify that a block of text is uuencoded. tags: explain	1/1	2/2	1/1	0/2	5/5	0/5
SimTorchGrad This test case checks if the model can predict what the gradient of a variable is in PyTorch. tags: explain, python	1/1	2/2	1/1	2/2	0/5	0/5
ProgramStrided Test if the model knows how to use the strided trick with numpy. tags: code, python, performance	0/1	2/2	1/1	0/2	5/5	5/5
BashIncrementalUpdate Test if a model can run an incremental update of a bash command without overwriting files that already exist. tags: bash	1/1	2/2	0/1	1/2	4/5	3/5
DataYearExtract Test if the model can extract structured data from (somewhat) unstructured text. tags: data	1/1	2/2	1/1	1/2	1/5	1/5
ImplementAssembly Test if the model can implement an interpreter for a new assembly language from a text description. tags: code, python	1/1	2/2	0/1	2/2	1/5	3/5
Crref Test if the model can rewrite a given python in C that performs reduced row echelon form (rref) on a 2D matrix. tags: code, c, performance	1/1	2/2	1/1	1/2	0/5	1/5
CallCFromPy Test if the model can write rust code that can be imported from python and knows how to build it. tags: rust, c, python, code	0/1	1/2	1/1	2/2	3/5	3/5
PythonToCLoopUpdate Test if a model can convert a python program to c, with a loop that makes it difficult. tags: code, python, c	1/1	2/2	1/1	1/2	0/5	1/5
CodeUnderstanding Test if a model can solve a simple capture-the-flag like entry in C. tags: c, explain	0/1	2/2	1/1	2/2	1/5	2/5
Make16Files Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size. tags: code, python	1/1	2/2	0/1	2/2	2/5	1/5
ExplainWeirdC This test case is meant to test if the model can correctly evaluate a complex C expression. tags: explain, c	1/1	1/2	1/1	2/2	0/5	0/5
TwentyQuestionsLlama Test if the model is able to ask questions to get to an answer. tags: fun	0/1	0/2	1/1	2/2	3/5	4/5
FixJSONHelp Test if the model can fix broken JSON objects. tags: code, python	1/1	2/2	0/1	2/2	0/5	1/5
GitCherrypick Test if the model can guide a user through a series of git commands to identify and cherrypick a specific commit from a branch onto the main branch. tags: bash, git, agent	0/1	2/2	1/1	2/2	0/5	1/5
FixPatch Test if the model can generate a .patch file to fix a bug in a given Python code. tags: code, fix, python	0/1	0/2	1/1	2/2	1/5	5/5
BashFindDontContain Test if a model can implement (the negation of) a simple bash 1-liner searching for files that don't contain some text. tags: bash	0/1	2/2	0/1	2/2	5/5	0/5
BashListSize Test if the model can provide the correct bash command to list files in a directory and sort them by the least significant digit of their size. tags: bash	1/1	0/2	1/1	2/2	0/5	0/5
StateTable Test if the model can process a large table of text and identify rows with specific values. tags: data	1/1	2/2	1/1	0/2	0/5	0/5
TrainSchedule Test if the model can extract structured data from (somewhat) unstructured text. tags: data	1/1	2/2	0/1	2/2	0/5	0/5
UUDecode Test if the model can successfully uudecode a given string. tags: explain	1/1	2/2	1/1	0/2	0/5	0/5
ProgramNumbaLev Test if the model can generate a numba implementation of the Levenshtein distance algorithm. tags: code, python, performance	1/1	2/2	0/1	2/2	0/5	0/5
ProgramRewriteCCrypto Test the ability of the model to rewrite a simple c program so it will run on ubuntu, and keep bugs in place. tags: code, c	1/1	2/2	0/1	0/2	5/5	0/5
MakeShiftOpC Test if the model can generate a C++ program that defines dataflow DSL. tags: code, c	1/1	2/2	0/1	1/2	1/5	1/5
DB9 Test if a model knows about old computer ports when prompted ambiguously. tags: explain	0/1	1/2	1/1	2/2	0/5	0/5
InnerHTMLEventListener Test if a model knows that editing the innerHTML clears event listeners. tags: explain	1/1	0/2	1/1	1/2	0/5	0/5
LatexRedef Test if a model can use latex \renewcommand, and do a bit more than what I actually asked. tags: explain	1/1	1/2	1/1	0/2	0/5	0/5
UnholyMatrixStep Test if the model can solve a rather hard dynamic programming problem. tags: code, c	0/1	2/2	1/1	1/2	0/5	0/5
ShortenC Test if the model can significantly shorten repetitive C functions. tags: code, c	1/1	0/2	0/1	1/2	4/5	0/5
CallRustFromPy Test if the model can write rust code that can be imported from python and knows how to build it. tags: rust, c, python, code	1/1	0/2	1/1	0/2	0/5	1/5
BrokenExtraBrace This test checks if the model can figure out the user has put an accidental extra brace in the request body. tags: explain, python	1/1	0/2	0/1	2/2	0/5	1/5
PythonCountPar Test if the model can parallelize a python program to perform a wordcount. tags: code, python, performance	1/1	1/2	0/1	1/2	1/5	0/5
RustParCountNoLib Test if the model can write a rust program that performs parallel word counting. tags: code, rust, performance	0/1	2/2	1/1	0/2	1/5	0/5
ShortenC2 Test if the model can significantly shorten repetitive C functions. tags: code, c	0/1	1/2	0/1	2/2	3/5	0/5
DisasRref Test if a model can decompile a long (300 line) python bytecode function back to python. tags: code, python	1/1	0/2	1/1	0/2	0/5	0/5
FixJSON Test if the model can fix broken JSON objects. tags: code, python	1/1	1/2	0/1	1/2	0/5	0/5
NumpyAdvancedIndexEasier Test if a model correctly understands how advanced indexing works in numpy. tags: explain, python	1/1	2/2	0/1	0/2	0/5	0/5
RustParCount Test if the model can write a rust program that performs parallel word counting. tags: code, rust, performance	0/1	1/2	1/1	1/2	0/5	0/5
ShortenCStep Test if the model can significantly shorten repetitive C functions. tags: code, c	1/1	0/2	1/1	0/2	0/5	0/5
TorchBackwardExplain Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients. tags: code, python, fix	0/1	1/2	0/1	2/2	1/5	0/5
ShortenC2Step Test if the model can significantly shorten repetitive C functions. tags: code, c	1/1	0/2	0/1	0/2	3/5	0/5
ProgramRemoveDP Test if the model can understand a DP algorithm and then convert it into an iterative implementation. tags: code, performance, python	0/1	2/2	0/1	1/2	0/5	0/5
AppendNotExtend This test checks if the model can figure out from context when it's right to use extend versus append. tags: explain, python	0/1	0/2	1/1	1/2	0/5	0/5
ImplementAssemblyByExample Test if the model can implement an interpreter for a new assembly language given an example. tags: code, python	1/1	0/2	0/1	1/2	0/5	0/5
NumbaRref Test if a model can rewrite a fairly complex Python function to Numba. tags: code, python, performance	0/1	0/2	0/1	1/2	0/5	5/5
UnholyMatrix Test if the model can solve a rather hard dynamic programming problem. tags: code, c	0/1	2/2	0/1	1/2	0/5	0/5
DateNewsHeadlines Test if the model can predict the date a few news headlines were published. tags: fun	1/1	0/2	0/1	0/2	1/5	0/5
ProgramStringSlice Test if the model can write code to perform string slicing with vague instructions. tags: code, python	0/1	0/2	1/1	0/2	1/5	0/5
ShortenCHard Test if the model can significantly shorten repetitive C functions. tags: code, c	1/1	0/2	0/1	0/2	1/5	0/5
TwentyQuestionsBook Test if the model is able to ask questions to get to an answer. tags: fun	0/1	0/2	0/1	1/2	1/5	2/5
TrainSchedulePython Test if the model can extract structured data from (somewhat) unstructured text. tags: data	0/1	0/2	1/1	0/2	0/5	0/5
PrintHelloPoly Test if the model can generate a program that prints "hello world" when run either as a C or a python program. tags: code, python	0/1	0/2	0/1	2/2	0/5	0/5
AWSV6 Test if the model can identify the error in an AWS Lambda code for authorizing a new network. This type of error is generally difficult to find via search. tags: explain	0/1	0/2	0/1	1/2	2/5	0/5
JaxOneHot Test if the model can correctly convert a list of indexes to a one-hot vector in Python using JAX. tags: code, python	0/1	0/2	0/1	0/2	0/5	3/5
MakePNGToELF Test if the model can make a PNG get detected as an ELF executable. tags: coding	0/1	0/2	0/1	0/2	0/5	0/5
TrainScheduleHard Test if the model can extract structured data from (somewhat) unstructured text. tags: data	0/1	0/2	0/1	0/2	0/5	0/5
FindBugPaper Test if a model can find math errors in the latex source of a paper. tags: explain	0/1	0/2	0/1	0/2	0/5	0/5
FindBugPaperEasy Test if a model can find math errors in the latex source of a paper. tags: explain	0/1	0/2	0/1	0/2	0/5	0/5
NumpyAdvancedIndex Test if a model correctly understands how advanced indexing works in numpy. tags: explain, python	0/1	0/2	0/1	0/2	0/5	0/5
PrintHelloPoly2 Test if the model can generate a program that prints "hello world" when run either as a C or a python program. tags: code, python	0/1	0/2	0/1	0/2	0/5	0/5
NewAssemblyPrimeNumbers Test if the model can write a program in a new assembly language. This ability to learn a new language on-the-fly is important for many tasks. tags: code	0/1	0/2	0/1	0/2	0/5	0/5
FlagDraw Test if a model can write a program that directly writes a jpeg file. This requires precise understanding of the jpeg spec. tags: code, python, visual	0/1	0/2	0/1	0/2	0/5	0/5
ProgramTB Test if the model can identify the bug and fix a program that handles python tracebacks. Useful to know if the model can handle more advanced python libraries. tags: code, fix	0/1	0/2	0/1	0/2	0/5	0/5
ShortenC2Hard Test if the model can significantly shorten repetitive C functions. tags: code, c	0/1	0/2	0/1	0/2	0/5	0/5
WhatIsBlockByOrb Test if the model knows what ERR_BLOCKED_BY_ORB means. tags: explain	0/1	0/2	0/1	0/2	0/5	0/5
WhereIsSbox This test case checks if the model knows what latex package to import for the Sbox environment to work. tags: explain	0/1	0/2	0/1	0/2	0/5	0/5
WhisperMerge Test if the model can implement some string logic given a fuzzy description. tags: code, python	0/1	0/2	0/1	0/2	0/5	0/5
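
Each score cell in the table above reads as "passes/runs" (for example, 4/5). The sketch below shows one way such rows could be parsed and averaged per model. The helper names `parse_cell`, `parse_row`, and `mean_pass_rate` are hypothetical, and averaging per-test pass fractions is an assumption: the table does not say how the percentages in its header are computed, so this may not reproduce them exactly.

```python
from fractions import Fraction


def parse_cell(cell: str) -> Fraction:
    """Turn a 'passes/runs' cell such as '4/5' into an exact fraction."""
    passed, runs = cell.strip().split("/")
    return Fraction(int(passed), int(runs))


def parse_row(line: str):
    """Split one tab-separated row into (description, list of score cells)."""
    description, *cells = line.rstrip("\n").split("\t")
    return description, cells


def mean_pass_rate(cells) -> float:
    """Average the per-test pass fractions for one model (assumed aggregation)."""
    rates = [parse_cell(c) for c in cells]
    return float(sum(rates) / len(rates))


# Example using one row copied from the table.
row = "CRC32 Test if the model understands the CRC-32 spec well enough to implement it. tags: code, c\t1/1\t2/2\t1/1\t2/2\t4/5\t4/5"
description, cells = parse_row(row)
print(description.split()[0], [float(parse_cell(c)) for c in cells])
```

To aggregate a whole column, collect that model's cell from every row and pass the list to `mean_pass_rate`.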