Question | DeepSeek r1 (71%) | o3-mini (70%) | o1-mini (62%) | Claude 3.5 Sonnet (56%) | GPT 4o (48%) | DeepSeek v3 (47%) | Claude 3.5 Haiku (44%) | gemini-1.5-pro-002 (43%) | llama-3.3-70b (39%) | GPT 4o-mini (37%) |
---|---|---|---|---|---|---|---|---|---|---|
GitSimple
Test if the model can guide a user in a conversation to set up a git repo.
tags: bash, git, agent | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
ProgramRewriteCSimple
Test if the model can rewrite a very simple Python program into an equivalent C program.
tags: code, c | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
ExplainPrime
Test if the model can interpret a minified JavaScript function and explain its function.
tags: code, explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
GitMerge
Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch.
tags: bash, git, agent | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
WhatIsStarStar
Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository.
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
PrintHello
Test if the model can generate a basic python program that prints "hello world".
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
RecoverExpiredPage
Test if a model knows how to get the HTML for the entire webpage; not just the body.
tags: explain, html | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
ShortenPyGet
Test if the model can replace a line of Python with a shorter, equivalent line.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
UnitConversion
Test if a model can do basic math with some EE equations.
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
WhatIsAutoModel
Test if the model can interpret vague questions and will respond with the answer I want, not the answer that's easy to find.
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
WhatIsFloatFormat
This test case checks if models can format floats in f-strings.
tags: explain, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
WhatIsInv
This test case is designed to check if the model can correctly identify the Python operator used for the tilde (~) symbol.
tags: explain, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
Dedent
Test if the model can write a Python function that removes excess indentation from a given block of code.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 4/5 | 5/5 | 1/1 | 5/5 |
ExtractEmail
Test if the model can accurately extract and identify invalid email addresses from a given text file. Models that are "overly safe" will fail.
tags: data | 1/1 | 2/2 | 2/2 | 5/5 | 4/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
FreeCADCircle
Test if the model understands a rambling question about how to make a construction circle in FreeCAD.
tags: explain, fun | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 4/5 |
SqlMakeTable
Test if the model can generate a SQL query to create a database table.
tags: sql | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 4/5 | 1/1 | 5/5 |
Make16FilesEasy
Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 4/5 | 1/1 | 5/5 |
VectorizeSmall
Test if the model can replace a for loop with a vectorized version.
tags: code, python, performance | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 4/5 | 1/1 | 5/5 |
BashRenamer
Test if the model can write a bash script that renames files with a specific pattern.
tags: code, bash | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 3/5 | 1/1 | 5/5 |
ExplainBroadcast
Test if the model can correctly explain what the VPBROADCASTB instruction does.
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 4/5 | 5/5 | 1/1 | 4/5 |
Regex
Test if the model can write a Python function with a straightforward regex.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 3/5 |
WhatIsSlice
This test case checks if the model can say how to properly get the end of a slice.
tags: explain, python | 1/1 | 2/2 | 1/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 5/5 |
GetVocab
This test case is designed to check if the model can print out the tokens in an AutoTokenizer's vocabulary.
tags: explain, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 4/5 | 3/5 | 1/1 | 5/5 |
VagueLoopFormat
Test if the model can follow vague instructions for how to print IDs following an example.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 2/5 | 1/1 | 5/5 |
FixDockerCuda
This test case checks if the model can debug a Docker CUDA error.
tags: explain | 1/1 | 1/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 4/5 | 1/1 | 5/5 |
CRC32
Test if the model understands the CRC-32 spec well enough to implement it.
tags: code, c | 1/1 | 2/2 | 2/2 | 5/5 | 4/5 | 1/1 | 4/5 | 4/5 | 1/1 | 4/5 |
SumSomeData
Test if the model can infer what data to sum and what to ignore by example with vague instructions.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 4/5 | 1/1 | 5/5 | 5/5 | 1/1 | 2/5 |
Flexbox
Test if the model can generate an HTML file using flexbox
tags: code, html | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 0/5 |
ProgramSqrt
Test if the model can implement a sqrt function.
tags: code, python | 1/1 | 2/2 | 1/2 | 5/5 | 5/5 | 1/1 | 5/5 | 2/5 | 1/1 | 5/5 |
FixJnpBug
Test if the model can identify and fix a bug in a given jax.numpy function.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 0/1 | 5/5 | 4/5 | 1/1 | 5/5 |
SqlSubquery
Test if the model can generate a Python program that retrieves data from a SQL file.
tags: sql | 1/1 | 2/2 | 2/2 | 3/5 | 5/5 | 1/1 | 5/5 | 5/5 | 1/1 | 0/5 |
QuestionThreadedFix
Test if the model can explain a poorly worded error message in a short threaded python program.
tags: code, python, explain | 1/1 | 2/2 | 2/2 | 5/5 | 4/5 | 0/1 | 4/5 | 5/5 | 1/1 | 5/5 |
SimpleFix
Test if the model can identify and fix an issue with a tokenizer in a Python code snippet. Identifying the problem is in the regex, and fixing the regex, are both hard.
tags: code, fix, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 3/5 | 0/5 | 1/1 | 5/5 |
ImgResize
Test if the model can resize several images in a given subdirectory.
tags: code, python | 1/1 | 1/2 | 1/2 | 5/5 | 4/5 | 1/1 | 5/5 | 5/5 | 1/1 | 4/5 |
BashIncrementalUpdate
Test if a model can run an incremental update of a bash command without overwriting files that already exist
tags: bash | 1/1 | 2/2 | 1/2 | 5/5 | 4/5 | 1/1 | 3/5 | 5/5 | 1/1 | 3/5 |
ProgramRewriteC
Test if the model can rewrite a given Python program into an equivalent C program.
tags: code, c | 1/1 | 2/2 | 2/2 | 4/5 | 4/5 | 1/1 | 5/5 | 5/5 | 0/1 | 3/5 |
LlamaKnowledge
Test the knowledge cutoff of the model to see if it knows the LLAMA-2 hidden dimension size.
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 1/5 | 5/5 | 1/1 | 0/5 |
DrawTriangle
Test if the model can generate an HTML file with WebGL code that draws an image.
tags: code, visual, html | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 3/5 | 3/5 | 1/1 | 0/5 |
WhatIsLPR
This test case checks if the model knows lpr commands.
tags: explain | 0/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 4/5 | 5/5 | 1/1 | 2/5 |
TorchBackwardFix
Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients.
tags: code, python, fix | 1/1 | 2/2 | 1/2 | 5/5 | 5/5 | 1/1 | 1/5 | 2/5 | 1/1 | 5/5 |
SimpleBNF
Test if the model can understand a vague BNF-style grammar and write a Python function that evaluates expressions based on the grammar rules.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 4/5 | 1/1 | 4/5 | 3/5 | 0/1 | 4/5 |
ExtractRef
Test if the model can extract paper titles from a block of text.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 5/5 | 0/1 | 0/5 |
MissingStep
Test if the model can identify a missing ingredient in a recipe. Identifying incorrect steps is much harder than missing steps.
tags: explain, fun | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 0/1 | 3/5 | 5/5 | 1/1 | 1/5 |
ExplainPrime2
Test if the model can interpret a minified and obfuscated JavaScript function and explain its function.
tags: explain | 1/1 | 2/2 | 2/2 | 1/5 | 5/5 | 1/1 | 3/5 | 0/5 | 1/1 | 5/5 |
MakeJson
Test if the model can successfully convert unstructured data to JSON.
tags: data | 1/1 | 2/2 | 2/2 | 0/5 | 5/5 | 1/1 | 4/5 | 0/5 | 1/1 | 5/5 |
DisasPrimes
Test if the model can disassemble Python bytecode and create a function that returns a list of prime numbers and their negations.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 0/5 | 1/1 | 5/5 | 3/5 | 1/1 | 0/5 |
SqlExplore
Test if the model can interact with an SQLite database and provide the correct command to add a new person with specific criteria.
tags: sql, agent | 0/1 | 2/2 | 2/2 | 0/5 | 5/5 | 1/1 | 5/5 | 3/5 | 1/1 | 5/5 |
WhatIsStarStarB
Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository.
tags: explain | 1/1 | 2/2 | 1/2 | 5/5 | 1/5 | 1/1 | 5/5 | 5/5 | 0/1 | 4/5 |
UPythonMQTT
Test if a model can write MicroPython code with an obscure module.
tags: python, code | 0/1 | 2/2 | 2/2 | 5/5 | 4/5 | 0/1 | 5/5 | 3/5 | 1/1 | 5/5 |
BashFindDontContain
Test if a model can implement (the negation of) a simple bash 1-liner searching for files that don't contain some text.
tags: bash | 0/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 0/5 | 1/5 | 1/1 | 4/5 |
ProgramRewriteCCrypto
Test the ability of the model to rewrite a simple C program so it will run on Ubuntu while keeping its bugs in place.
tags: code, c | 1/1 | 2/2 | 2/2 | 0/5 | 5/5 | 1/1 | 0/5 | 5/5 | 0/1 | 5/5 |
FixNode
Test if the model can identify a node error message
tags: explain | 1/1 | 2/2 | 2/2 | 2/5 | 5/5 | 0/1 | 2/5 | 4/5 | 1/1 | 1/5 |
LatexNewline
Test if a model can fix a latex newline error in a caption
tags: explain | 1/1 | 2/2 | 2/2 | 5/5 | 1/5 | 1/1 | 2/5 | 5/5 | 0/1 | 1/5 |
EasyFlagDrawBMP
Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator.
tags: code, c, visual | 1/1 | 2/2 | 2/2 | 5/5 | 0/5 | 1/1 | 3/5 | 4/5 | 0/1 | 1/5 |
RustCountNoLib
Test if the model can write a rust program that performs word counting.
tags: code, rust | 1/1 | 2/2 | 2/2 | 5/5 | 2/5 | 1/1 | 3/5 | 3/5 | 0/1 | 0/5 |
MakeShiftOp
Test if the model can generate a Python program that defines a dataflow DSL.
tags: code, python | 1/1 | 2/2 | 1/2 | 3/5 | 4/5 | 1/1 | 1/5 | 4/5 | 0/1 | 2/5 |
NumpyIx
Test if a model can identify the ix_ function as a method for simplifying some code.
tags: explain, python | 1/1 | 1/2 | 1/2 | 5/5 | 2/5 | 0/1 | 5/5 | 0/5 | 1/1 | 3/5 |
EmojiMovie
A for-fun test to see if the model can go movie title -> emoji -> movie title.
tags: fun | 1/1 | 2/2 | 0/2 | 5/5 | 3/5 | 1/1 | 1/5 | 5/5 | 0/1 | 1/5 |
MakeTreeEasy
Test if the model can create a tree from a string.
tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 0/5 | 1/1 | 5/5 | 0/5 | 0/1 | 0/5 |
ProgramStrided
Test if the model knows how to use the strided trick with numpy.
tags: code, python, performance | 0/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 5/5 | 0/5 | 0/1 | 0/5 |
GitMergeConflict
Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch.
tags: bash, git, agent | 0/1 | 2/2 | 1/2 | 5/5 | 5/5 | 0/1 | 2/5 | 5/5 | 0/1 | 5/5 |
TwentyQuestionsLlama
Test if the model is able to ask questions to get to an answer.
tags: fun | 0/1 | 0/2 | 0/2 | 4/5 | 3/5 | 1/1 | 4/5 | 4/5 | 1/1 | 3/5 |
HallucinateReference
Test if the model will hallucinate references that don't exist.
tags: explain | 1/1 | 2/2 | 1/2 | 5/5 | 4/5 | 0/1 | 3/5 | 0/5 | 0/1 | 3/5 |
RustCount
Test if the model can write a rust program that performs word counting.
tags: code, rust | 1/1 | 2/2 | 2/2 | 5/5 | 3/5 | 0/1 | 1/5 | 2/5 | 0/1 | 1/5 |
LispSilencePython
Test if the model can understand a vague error for an emacs lisp question.
tags: explain | 1/1 | 2/2 | 2/2 | 0/5 | 4/5 | 1/1 | 2/5 | 0/5 | 0/1 | 0/5 |
TorchJnp
Test if the model can convert a torch neural network to a jax numpy model.
tags: code, python | 1/1 | 2/2 | 2/2 | 0/5 | 0/5 | 0/1 | 5/5 | 1/5 | 1/1 | 0/5 |
NewAssemblySquareNumbers
Test if the model can write a program in a new assembly language. This ability to learn a new language on-the-fly is important for many tasks.
tags: code | 1/1 | 1/2 | 1/2 | 5/5 | 4/5 | 1/1 | 0/5 | 2/5 | 0/1 | 0/5 |
DrawHouse
Test if the model can generate an HTML file with WebGL code that draws an image.
tags: code, visual, html | 1/1 | 2/2 | 1/2 | 5/5 | 3/5 | 0/1 | 5/5 | 0/5 | 0/1 | 0/5 |
Disas1
Test if the model can disassemble a simple Python function from its bytecode.
tags: code, python | 1/1 | 2/2 | 2/2 | 0/5 | 3/5 | 0/1 | 5/5 | 2/5 | 0/1 | 0/5 |
FastL2
Test if the model can optimize a given Python program for speed and memory efficiency.
tags: code, performance, python | 1/1 | 2/2 | 2/2 | 0/5 | 1/5 | 0/1 | 1/5 | 5/5 | 0/1 | 2/5 |
Make16Files
Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size.
tags: code, python | 1/1 | 2/2 | 2/2 | 4/5 | 2/5 | 0/1 | 1/5 | 1/5 | 0/1 | 1/5 |
PyChessPrefix
Test if the model can correctly call a python API for a moderately popular python library.
tags: code, python | 1/1 | 2/2 | 1/2 | 2/5 | 3/5 | 1/1 | 1/5 | 0/5 | 0/1 | 0/5 |
IsUU
Test if the model can correctly identify a block of text is uuencoded.
tags: explain | 1/1 | 2/2 | 1/2 | 0/5 | 5/5 | 1/1 | 0/5 | 0/5 | 0/1 | 0/5 |
ImplementAssembly
Test if the model can implement an interpreter for a new assembly language from a text description.
tags: code, python | 1/1 | 2/2 | 1/2 | 3/5 | 1/5 | 0/1 | 3/5 | 3/5 | 0/1 | 0/5 |
ExplainWeirdCEasy
This test case is meant to test if the model can correctly evaluate a complex C expression.
tags: explain, c | 1/1 | 2/2 | 0/2 | 3/5 | 1/5 | 1/1 | 0/5 | 0/5 | 0/1 | 3/5 |
DataYearExtract
Test if the model can extract structured data from (somewhat) unstructured text.
tags: data | 1/1 | 2/2 | 2/2 | 0/5 | 1/5 | 0/1 | 1/5 | 5/5 | 0/1 | 0/5 |
StateTableStepbystep
Test if the model can process a large table of text and identify rows with specific values.
tags: data | 1/1 | 2/2 | 2/2 | 0/5 | 2/5 | 0/1 | 0/5 | 0/5 | 1/1 | 0/5 |
ProgramStringSlice
Test if the model can write code to perform string slicing with vague instructions.
tags: code, python | 0/1 | 0/2 | 0/2 | 1/5 | 1/5 | 1/1 | 0/5 | 5/5 | 1/1 | 5/5 |
FlagDrawBMP
Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator.
tags: code, c, visual | 1/1 | 2/2 | 1/2 | 4/5 | 3/5 | 0/1 | 1/5 | 0/5 | 0/1 | 1/5 |
CodeUnderstanding
Test if a model can solve a simple capture-the-flag like entry in C.
tags: c, explain | 0/1 | 2/2 | 2/2 | 5/5 | 1/5 | 0/1 | 2/5 | 0/5 | 0/1 | 3/5 |
CallCFromPy
Test if the model can write rust code that can be imported from python and knows how to build it.
tags: rust, c, python, code | 0/1 | 1/2 | 1/2 | 1/5 | 3/5 | 1/1 | 3/5 | 3/5 | 0/1 | 1/5 |
FixPatch
Test if the model can generate a .patch file to fix a bug in a given Python code.
tags: code, fix, python | 0/1 | 0/2 | 2/2 | 5/5 | 1/5 | 0/1 | 5/5 | 5/5 | 0/1 | 0/5 |
PythonToCLoopUpdate
Test if a model can convert a python program to c, with a loop that makes it difficult.
tags: code, python, c | 1/1 | 2/2 | 2/2 | 0/5 | 0/5 | 0/1 | 1/5 | 0/5 | 1/1 | 0/5 |
MakeTree
Test if the model can create a tree from a string.
tags: code, python | 1/1 | 2/2 | 1/2 | 5/5 | 0/5 | 0/1 | 3/5 | 0/5 | 0/1 | 0/5 |
Base64Thought
Test if a model will follow instructions to the letter without lots of cajoling. Thinking in base64 is also interesting.
tags: explain, fun | 1/1 | 2/2 | 2/2 | 5/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
Crref
Test if the model can rewrite a given Python program in C that performs reduced row echelon form (rref) on a 2D matrix.
tags: code, c, performance | 1/1 | 2/2 | 2/2 | 3/5 | 0/5 | 0/1 | 1/5 | 1/5 | 0/1 | 0/5 |
WhyBuggyPythonCountPar
Test if a model can explain a bug in a parallelized wordcount function.
tags: explain, python, fix | 1/1 | 1/2 | 0/2 | 1/5 | 5/5 | 0/1 | 0/5 | 4/5 | 0/1 | 2/5 |
BashListSize
Test if the model can provide the correct bash command to list files in a directory and sort them by the least significant digit of their size.
tags: bash | 1/1 | 0/2 | 2/2 | 0/5 | 0/5 | 0/1 | 0/5 | 2/5 | 1/1 | 2/5 |
BrokenExtraBrace
This test checks if the model can figure out the user has put an accidental extra brace in the request body.
tags: explain, python | 1/1 | 0/2 | 1/2 | 5/5 | 0/5 | 1/1 | 1/5 | 0/5 | 0/1 | 0/5 |
ProgramNumbaLev
Test if the model can generate a numba implementation of the Levenshtein distance algorithm.
tags: code, python, performance | 1/1 | 2/2 | 0/2 | 1/5 | 0/5 | 0/1 | 0/5 | 1/5 | 1/1 | 1/5 |
PythonCountPar
Test if the model can parallelize a python program to perform a wordcount.
tags: code, python, performance | 1/1 | 1/2 | 2/2 | 4/5 | 1/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
TrainSchedule
Test if the model can extract structured data from (somewhat) unstructured text.
tags: data | 1/1 | 2/2 | 1/2 | 4/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
FixJSONHelp
Test if the model can fix broken JSON objects.
tags: code, python | 1/1 | 2/2 | 1/2 | 3/5 | 0/5 | 0/1 | 1/5 | 0/5 | 0/1 | 0/5 |
ShortenC2
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 0/1 | 1/2 | 2/2 | 1/5 | 3/5 | 0/1 | 0/5 | 5/5 | 0/1 | 0/5 |
ExplainWeirdC
This test case is meant to test if the model can correctly evaluate a complex C expression.
tags: explain, c | 1/1 | 1/2 | 1/2 | 2/5 | 0/5 | 0/1 | 0/5 | 1/5 | 0/1 | 3/5 |
NumpyAdvancedIndexEasier
Test if a model correctly understands how advanced indexing works in numpy.
tags: explain, python | 1/1 | 2/2 | 2/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 1/5 |
SimTorchGrad
This test case checks if the model can predict what the gradient of a variable is in PyTorch.
tags: explain, python | 1/1 | 2/2 | 2/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
RLEDecode
This test case tests if the model can convert a Game of Life pattern represented in RLE format to a numpy array.
tags: code, python | 1/1 | 2/2 | 1/2 | 1/5 | 0/5 | 0/1 | 1/5 | 0/5 | 0/1 | 0/5 |
MakeShiftOpC
Test if the model can generate a C++ program that defines a dataflow DSL.
tags: code, c | 1/1 | 2/2 | 1/2 | 0/5 | 1/5 | 0/1 | 1/5 | 0/5 | 0/1 | 0/5 |
GitCherrypick
Test if the model can guide a user through a series of git commands to identify and cherrypick a specific commit from a branch onto the main branch.
tags: bash, git, agent | 0/1 | 2/2 | 0/2 | 5/5 | 0/5 | 0/1 | 1/5 | 1/5 | 0/1 | 2/5 |
ShortenC
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 1/1 | 0/2 | 0/2 | 0/5 | 4/5 | 1/1 | 0/5 | 0/5 | 0/1 | 0/5 |
StateTable
Test if the model can process a large table of text and identify rows with specific values.
tags: data | 1/1 | 2/2 | 1/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
UUDecode
Test if the model can successfully uudecode a given string.
tags: explain | 1/1 | 2/2 | 1/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
DB9
Test if a model knows about old computer ports when prompted ambiguously.
tags: explain | 0/1 | 1/2 | 2/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 4/5 |
AWSV6
Test if the model can identify the error in an AWS Lambda code for authorizing a new network. This type of error is generally difficult to find via search.
tags: explain | 0/1 | 0/2 | 2/2 | 0/5 | 2/5 | 0/1 | 0/5 | 4/5 | 0/1 | 0/5 |
ImplementAssemblyByExample
Test if the model can implement an interpreter for a new assembly language given an example.
tags: code, python | 1/1 | 0/2 | 0/2 | 4/5 | 0/5 | 0/1 | 0/5 | 2/5 | 0/1 | 0/5 |
ShortenCHard
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 1/1 | 0/2 | 0/2 | 0/5 | 1/5 | 1/1 | 0/5 | 0/5 | 0/1 | 0/5 |
TorchBackwardExplain
Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients.
tags: code, python, fix | 0/1 | 1/2 | 0/2 | 1/5 | 1/5 | 0/1 | 0/5 | 3/5 | 0/1 | 3/5 |
DisasRref
Test if a model can decompile a long (300 line) python bytecode function back to python.
tags: code, python | 1/1 | 0/2 | 0/2 | 0/5 | 0/5 | 1/1 | 0/5 | 0/5 | 0/1 | 0/5 |
FixJSON
Test if the model can fix broken JSON objects.
tags: code, python | 1/1 | 1/2 | 1/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
NumbaRref
Test if a model can rewrite a fairly complex Python function to Numba.
tags: code, python, performance | 0/1 | 0/2 | 0/2 | 5/5 | 0/5 | 0/1 | 5/5 | 0/5 | 0/1 | 0/5 |
ShortenCStep
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 1/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 4/5 |
CallRustFromPy
Test if the model can write rust code that can be imported from python and knows how to build it.
tags: rust, c, python, code | 1/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 1/5 | 1/5 | 0/1 | 1/5 |
RustParCount
Test if the model can write a rust program that performs parallel word counting.
tags: code, rust, performance | 0/1 | 1/2 | 1/2 | 2/5 | 0/5 | 0/1 | 0/5 | 1/5 | 0/1 | 0/5 |
ShortenC2Step
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 1/1 | 0/2 | 0/2 | 0/5 | 3/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
InnerHTMLEventListener
Test if a model knows that editing the innerHTML clears event listeners.
tags: explain | 1/1 | 0/2 | 1/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
LatexRedef
Test if a model can use latex \renewcommand, and do a bit more than what I actually asked.
tags: explain | 1/1 | 1/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
PrintHelloPoly
Test if the model can generate a program that prints "hello world" when run either as a C or a python program.
tags: code, python | 0/1 | 0/2 | 1/2 | 4/5 | 0/5 | 0/1 | 0/5 | 1/5 | 0/1 | 0/5 |
TwentyQuestionsBook
Test if the model is able to ask questions to get to an answer.
tags: fun | 0/1 | 0/2 | 0/2 | 2/5 | 1/5 | 0/1 | 2/5 | 2/5 | 0/1 | 0/5 |
RustParCountNoLib
Test if the model can write a rust program that performs parallel word counting.
tags: code, rust, performance | 0/1 | 2/2 | 0/2 | 1/5 | 1/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
DateNewsHeadlines
Test if the model can predict the date a few news headlines were published.
tags: fun | 1/1 | 0/2 | 0/2 | 0/5 | 1/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
FindBugPaper
Test if a model can find math errors in the latex source of a paper.
tags: explain | 0/1 | 0/2 | 2/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 1/5 |
ProgramRemoveDP
Test if the model can understand a DP algorithm and then convert it into an iterative implementation.
tags: code, performance, python | 0/1 | 2/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
UnholyMatrix
Test if the model can solve a rather hard dynamic programming problem
tags: code, c | 0/1 | 2/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
UnholyMatrixStep
Test if the model can solve a rather hard dynamic programming problem
tags: code, c | 0/1 | 2/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
FindBugPaperEasy
Test if a model can find math errors in the latex source of a paper.
tags: explain | 0/1 | 0/2 | 1/2 | 1/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 1/5 |
JaxOneHot
Test if the model can correctly convert a list of indexes to a one-hot vector in Python using JAX.
tags: code, python | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 3/5 | 0/5 | 0/1 | 0/5 |
TrainSchedulePython
Test if the model can extract structured data from (somewhat) unstructured text.
tags: data | 0/1 | 0/2 | 1/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
AppendNotExtend
This test checks if the model can figure out from context when it's right to use extend versus append.
tags: explain, python | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 2/5 | 0/1 | 0/5 |
TrainScheduleHard
Test if the model can extract structured data from (somewhat) unstructured text.
tags: data | 0/1 | 0/2 | 0/2 | 1/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
PrintHelloPoly2
Test if the model can generate a program that prints "hello world" when run either as a C or a python program.
tags: code, python | 0/1 | 0/2 | 0/2 | 1/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
WhereIsSbox
This test case checks if the model knows what latex package to import for the Sbox environment to work.
tags: explain | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 1/5 | 0/1 | 0/5 |
MakePNGToELF
Test if the model can make a PNG get detected as an ELF executable.
tags: coding | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
NumpyAdvancedIndex
Test if a model correctly understands how advanced indexing works in numpy.
tags: explain, python | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
NewAssemblyPrimeNumbers
Test if the model can write a program in a new assembly language. This ability to learn a new language on-the-fly is important for many tasks.
tags: code | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
FlagDraw
Test if a model can write a program that directly writes a jpeg file. This requires precise understanding of the jpeg spec.
tags: code, python, visual | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
ProgramTB
Test if the model can identify the bug and fix a program that handles Python tracebacks. Useful to know if the model can handle more advanced Python libraries.
tags: code, fix | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
ShortenC2Hard
Test if the model can significantly shorten repetitive C functions.
tags: code, c | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
WhatIsBlockByOrb
Test if the model knows what ERR_BLOCKED_BY_ORB means.
tags: explain | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
WhisperMerge
Test if the model can implement some string logic given a fuzzy description.
tags: code, python | 0/1 | 0/2 | 0/2 | 0/5 | 0/5 | 0/1 | 0/5 | 0/5 | 0/1 | 0/5 |
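
Each score cell in the table has the form passes/attempts (for example `4/5`). As a minimal sketch of how one row could be read programmatically, the snippet below turns the score cells of the Dedent row into exact fractions. The helper name `parse_scores` is hypothetical and not part of the benchmark, and how the header percentages are aggregated from these cells is not stated here, so no aggregate is computed.

```python
import re
from fractions import Fraction

# Matches a score cell such as "4/5" (passes/attempts).
SCORE_CELL = re.compile(r"(\d+)/(\d+)")


def parse_scores(row: str) -> list[Fraction]:
    """Return the pass rate of every 'passes/attempts' cell in a table row.

    Hypothetical helper: cells that are not of the form N/M (for example
    the leading 'tags: ...' cell) are skipped.
    """
    rates = []
    for cell in row.split("|"):
        match = SCORE_CELL.fullmatch(cell.strip())
        if match:
            passed, attempts = int(match.group(1)), int(match.group(2))
            rates.append(Fraction(passed, attempts))
    return rates


# The Dedent row, copied from the table above.
dedent_row = "tags: code, python | 1/1 | 2/2 | 2/2 | 5/5 | 5/5 | 1/1 | 4/5 | 5/5 | 1/1 | 5/5 |"
rates = parse_scores(dedent_row)
print([float(r) for r in rates])
```

Using `Fraction` keeps the per-cell rates exact, which matters because the models were run a different number of times (1, 2, or 5 attempts per test).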