| Question | Description | Tags | gpt-4-0125-preview (50%) | claude-3-opus (46%) | claude-3-sonnet (36%) | mistral-large (33%) | gpt-3.5-turbo-0125 (29%) | mistral-medium (27%) | gemini-pro (22%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ExplainBroadcast | Test if the model can correctly explain what the VPBROADCASTB instruction does. | explain | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| PrintHello | Test if the model can generate a basic Python program that prints "hello world". | code, python | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| RecoverExpiredPage | Test if a model knows how to get the HTML for the entire webpage, not just the body. | explain, html | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| ExplainPrime | Test if the model can interpret a minified JavaScript function and explain its function. | code, explain | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 |
| WhatIsFloatFormat | This test case checks if models can format f-strings with floats. | explain, python | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 5/5 |
| ProgramRewriteCSimple | Test if the model can rewrite a very simple Python program into an equivalent C program. | code, c | 5/5 | 5/5 | 5/5 | 4/5 | 4/5 | 5/5 | 5/5 |
| GitSimple | Test if the model can guide a user in a conversation to set up a git repo. | bash, git, agent | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 3/5 | 5/5 |
| ExplainPrime2 | Test if the model can interpret a minified and obfuscated JavaScript function and explain its function. | explain | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 | 5/5 |
| TorchBackwardFix | Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients. | code, python, fix | 3/5 | 3/5 | 4/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| Flexbox | Test if the model can generate an HTML file using flexbox. | code, html | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 |
| GitMerge | Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch. | bash, git, agent | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 | 5/5 | 5/5 |
| SqlMakeTable | Test if the model can generate a SQL query to create a database table. | sql | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 | 5/5 |
| ShortenPyGet | Test if the model can shorten a line of Python into an equivalent line. | code, python | 4/5 | 5/5 | 5/5 | 4/5 | 5/5 | 2/5 | 5/5 |
| ProgramSqrt | Test if the model can implement a sqrt function. | code, python | 5/5 | 5/5 | 4/5 | 5/5 | 3/5 | 5/5 | 2/5 |
| Regex | Test if the model can write a Python function with a straightforward regex. | code, python | 5/5 | 5/5 | 3/5 | 5/5 | 3/5 | 5/5 | 3/5 |
| WhatIsInv | This test case is designed to check if the model can correctly identify the Python operator used for the tilde (~) symbol. | explain, python | 5/5 | 3/5 | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 |
| Dedent | Test if the model can write a Python function that removes excess indentation from a given block of code. | code, python | 5/5 | 5/5 | 5/5 | 2/5 | 1/5 | 4/5 | 5/5 |
| MissingStep | Test if the model can identify a missing ingredient in a recipe. Identifying incorrect steps is much harder than missing steps. | explain, fun | 5/5 | 5/5 | 5/5 | 5/5 | 0/5 | 2/5 | 5/5 |
| BashRenamer | Test if the model can write a bash script that renames files with a specific pattern. | code, bash | 5/5 | 5/5 | 5/5 | 2/5 | 5/5 | 2/5 | 2/5 |
| DrawTriangle | Test if the model can generate an HTML file with WebGL code that draws an image. | code, visual, html | 3/5 | 5/5 | 5/5 | 5/5 | 2/5 | 1/5 | 4/5 |
| FreeCADCircle | Test if the model understands a rambling question about how to make a construction circle in FreeCAD. | explain, fun | 5/5 | 3/5 | 3/5 | 5/5 | 3/5 | 5/5 | 1/5 |
| RustCountNoLib | Test if the model can write a Rust program that performs word counting. | code, rust | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 0/5 | 0/5 |
| VectorizeSmall | Test if the model can replace a for loop with a vectorized version. | code, python, performance | 2/5 | 5/5 | 4/5 | 0/5 | 4/5 | 4/5 | 5/5 |
| WhatIsStarStar | Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository. | explain | 5/5 | 5/5 | 1/5 | 3/5 | 3/5 | 4/5 | 0/5 |
| BashFindDontContain | Test if a model can implement (the negation of) a simple bash one-liner searching for files that don't contain some text. | bash | 4/5 | 3/5 | 3/5 | 5/5 | 5/5 | 1/5 | 0/5 |
| Disas1 | Test if the model can disassemble a simple Python function from its bytecode. | code, python | 0/5 | 5/5 | 0/5 | 0/5 | 5/5 | 4/5 | 5/5 |
| DisasPrimes | Test if the model can disassemble Python bytecode and create a function that returns a list of prime numbers and their negations. | code, python | 1/5 | 5/5 | 1/5 | 5/5 | 0/5 | 5/5 | 2/5 |
| WhatIsStarStarB | Test if the model can understand and interpret a request to gitignore any file called "foo/.KEYFILE" regardless of its location in a repository. | explain | 5/5 | 5/5 | 1/5 | 1/5 | 5/5 | 1/5 | 0/5 |
| SqlSubquery | Test if the model can generate a Python program that retrieves data from a SQL file. | sql | 5/5 | 5/5 | 5/5 | 2/5 | 0/5 | 0/5 | 1/5 |
| Make16FilesEasy | Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size. | code, python | 5/5 | 5/5 | 3/5 | 2/5 | 1/5 | 2/5 | 0/5 |
| EmojiMovie | A for-fun test to see if the model can go movie title -> emoji -> movie title. | fun | 5/5 | 5/5 | 3/5 | 0/5 | 4/5 | 0/5 | 0/5 |
| WhatIsAutoModel | Test if the model can interpret vague questions and will respond with the answer I want, not the answer that's easy to find. | explain | 5/5 | 2/5 | 4/5 | 1/5 | 5/5 | 0/5 | 0/5 |
| ImgResize | Test if the model can resize several images in a given subdirectory. | code, python | 3/5 | 5/5 | 3/5 | 0/5 | 1/5 | 3/5 | 1/5 |
| ExtractEmail | Test if the model can accurately extract and identify invalid email addresses from a given text file. Models that are "overly safe" will fail. | data | 0/5 | 5/5 | 4/5 | 4/5 | 0/5 | 3/5 | 0/5 |
| CodeUnderstanding | Test if a model can solve a simple capture-the-flag-like entry in C. | c, explain | 4/5 | 2/5 | 1/5 | 3/5 | 1/5 | 3/5 | 1/5 |
| FixJnpBug | Test if the model can identify and fix a bug in a given jax.numpy function. | code, python | 4/5 | 5/5 | 1/5 | 1/5 | 4/5 | 0/5 | 0/5 |
| WhyBuggyPythonCountPar | Test if a model can explain a bug in a parallelized wordcount function. | explain, python, fix | 5/5 | 1/5 | 2/5 | 2/5 | 1/5 | 3/5 | 0/5 |
| CRC32 | Test if the model understands the CRC-32 spec well enough to implement it. | code, c | 3/5 | 5/5 | 2/5 | 0/5 | 3/5 | 1/5 | 0/5 |
| ProgramNumbaLev | Test if the model can generate a Numba implementation of the Levenshtein distance algorithm. | code, python, performance | 5/5 | 1/5 | 2/5 | 3/5 | 1/5 | 0/5 | 2/5 |
| ProgramRewriteCCrypto | Test the ability of the model to rewrite a simple C program so it will run on Ubuntu, and keep bugs in place. | code, c | 3/5 | 0/5 | 0/5 | 5/5 | 0/5 | 4/5 | 2/5 |
| ProgramStrided | Test if the model knows how to use the strided trick with NumPy. | code, python, performance | 5/5 | 4/5 | 1/5 | 4/5 | 0/5 | 0/5 | 0/5 |
| AWSV6 | Test if the model can identify the error in an AWS Lambda code for authorizing a new network. This type of error is generally difficult to find via search. | explain | 3/5 | 0/5 | 4/5 | 2/5 | 0/5 | 3/5 | 1/5 |
| SqlExplore | Test if the model can interact with an SQLite database and provide the correct command to add a new person with specific criteria. | sql, agent | 4/5 | 3/5 | 0/5 | 0/5 | 3/5 | 2/5 | 1/5 |
| SimpleFix | Test if the model can identify and fix an issue with a tokenizer in a Python code snippet. Identifying that the problem is in the regex, and fixing the regex, are both hard. | code, fix, python | 5/5 | 1/5 | 2/5 | 0/5 | 0/5 | 3/5 | 2/5 |
| MakeJson | Test if the model can successfully convert unstructured data to JSON. | data | 5/5 | 1/5 | 5/5 | 2/5 | 0/5 | 0/5 | 0/5 |
| WhatIsSlice | This test case checks if the model can say how to properly get the end of a slice. | explain, python | 3/5 | 1/5 | 2/5 | 0/5 | 5/5 | 1/5 | 0/5 |
| ProgramRewriteC | Test if the model can rewrite a given Python program into an equivalent C program. | code, c | 4/5 | 3/5 | 2/5 | 2/5 | 0/5 | 0/5 | 0/5 |
| GitCherrypick | Test if the model can guide a user through a series of git commands to identify and cherry-pick a specific commit from a branch onto the main branch. | bash, git, agent | 4/5 | 4/5 | 1/5 | 1/5 | 0/5 | 0/5 | 0/5 |
| ProgramStringSlice | Test if the model can write code to perform string slicing with vague instructions. | code, python | 5/5 | 1/5 | 0/5 | 2/5 | 0/5 | 1/5 | 0/5 |
| RustCount | Test if the model can write a Rust program that performs word counting. | code, rust | 3/5 | 0/5 | 2/5 | 2/5 | 2/5 | 0/5 | 0/5 |
| LispSilencePython | Test if the model can understand a vague error for an Emacs Lisp question. | explain | 2/5 | 4/5 | 2/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| GitMergeConflict | Test if the model can guide a user through a series of git commands to merge a specific branch into the main branch. | bash, git, agent | 0/5 | 1/5 | 1/5 | 4/5 | 0/5 | 0/5 | 2/5 |
| BashListSize | Test if the model can provide the correct bash command to list files in a directory and sort them by the least significant digit of their size. | bash | 2/5 | 2/5 | 1/5 | 1/5 | 0/5 | 1/5 | 0/5 |
| SimpleBNF | Test if the model can understand a vague BNF-style grammar and write a Python function that evaluates expressions based on the grammar rules. | code, python | 4/5 | 1/5 | 0/5 | 0/5 | 2/5 | 0/5 | 0/5 |
| FastL2 | Test if the model can optimize a given Python program for speed and memory efficiency. | code, performance, python | 5/5 | 0/5 | 1/5 | 1/5 | 0/5 | 0/5 | 0/5 |
| PyChessPrefix | Test if the model can correctly call a Python API for a moderately popular Python library. | code, python | 4/5 | 3/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| TorchJnp | Test if the model can convert a torch neural network to a JAX NumPy model. | code, python | 1/5 | 1/5 | 3/5 | 0/5 | 1/5 | 1/5 | 0/5 |
| StateTableStepbystep | Test if the model can process a large table of text and identify rows with specific values. | data | 2/5 | 4/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| DB9 | Test if a model knows about old computer ports when prompted ambiguously. | explain | 0/5 | 5/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| EasyFlagDrawBMP | Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator. | code, c, visual | 3/5 | 2/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| FlagDrawBMP | Test if the model can write a C program that draws an image. This test requires the ability to understand the .bmp specification, and draw a flag that can be correctly parsed and seen by the evaluator. | code, c, visual | 2/5 | 2/5 | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| RustParCountNoLib | Test if the model can write a Rust program that performs parallel word counting. | code, rust, performance | 0/5 | 3/5 | 1/5 | 0/5 | 1/5 | 0/5 | 0/5 |
| DrawHouse | Test if the model can generate an HTML file with WebGL code that draws an image. | code, visual, html | 2/5 | 2/5 | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| TorchBackwardExplain | Test if the model can fix and explain a bug in PyTorch code related to forgetting to zero gradients. | code, python, fix | 1/5 | 0/5 | 1/5 | 0/5 | 0/5 | 1/5 | 1/5 |
| FixPatch | Test if the model can generate a .patch file to fix a bug in a given Python code. | code, fix, python | 1/5 | 3/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| WhereIsSbox | This test case checks if the model knows what LaTeX package to import for the Sbox environment to work. | explain | 1/5 | 1/5 | 0/5 | 0/5 | 2/5 | 0/5 | 0/5 |
| ExplainWeirdC | This test case is meant to test if the model can correctly evaluate a complex C expression. | explain, c | 1/5 | 0/5 | 0/5 | 1/5 | 0/5 | 0/5 | 1/5 |
| QuestionThreadedFix | Test if the model can explain a poorly worded error message in a short threaded Python program. | code, python, explain | 0/5 | 0/5 | 1/5 | 2/5 | 0/5 | 0/5 | 0/5 |
| JaxOneHot | Test if the model can correctly convert a list of indexes to a one-hot vector in Python using JAX. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 2/5 | 1/5 |
| PythonCountPar | Test if the model can parallelize a Python program to perform a wordcount. | code, python, performance | 0/5 | 1/5 | 1/5 | 0/5 | 0/5 | 0/5 | 1/5 |
| Base64Thought | Test if a model will follow instructions to the letter without lots of cajoling. Thinking in base64 is also interesting. | explain, fun | 0/5 | 0/5 | 2/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| FindBugPaperEasy | Test if a model can find math errors in the LaTeX source of a paper. | explain | 0/5 | 1/5 | 0/5 | 0/5 | 0/5 | 1/5 | 0/5 |
| RLEDecode | This test case tests if the model can convert a Game of Life pattern represented in RLE format to a NumPy array. | code, python | 0/5 | 2/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| NumpyAdvancedIndexEasier | Test if a model correctly understands how advanced indexing works in NumPy. | explain, python | 0/5 | 0/5 | 0/5 | 2/5 | 0/5 | 0/5 | 0/5 |
| NewAssemblySquareNumbers | Test if the model can write a program in a new assembly language. This ability to learn a new language on the fly is important for many tasks. | code | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 | 1/5 | 0/5 |
| RustParCount | Test if the model can write a Rust program that performs parallel word counting. | code, rust, performance | 0/5 | 0/5 | 1/5 | 0/5 | 1/5 | 0/5 | 0/5 |
| ProgramRemoveDP | Test if the model can understand a DP algorithm and then convert it into an iterative implementation. | code, performance, python | 0/5 | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| IsUU | Test if the model can correctly identify that a block of text is uuencoded. | explain | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| ImplementAssembly | Test if the model can implement an interpreter for a new assembly language from a text description. | code, python | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| NumpyIx | Test if a model can identify the ix_ function as a method for simplifying some code. | explain, python | 1/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| Crref | Test if the model can rewrite a given Python program in C that performs reduced row echelon form (rref) on a 2D matrix. | code, c, performance | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| ExplainWeirdCEasy | This test case is meant to test if the model can correctly evaluate a complex C expression. | explain, c | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| StateTable | Test if the model can process a large table of text and identify rows with specific values. | data | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| DisasRref | Test if a model can decompile a long (300-line) Python bytecode function back to Python. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| UUDecode | Test if the model can successfully uudecode a given string. | explain | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| FindBugPaper | Test if a model can find math errors in the LaTeX source of a paper. | explain | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| ImplementAssemblyByExample | Test if the model can implement an interpreter for a new assembly language given an example. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| LlamaKnowledge | Test the knowledge cutoff of the model to see if it knows the LLAMA-2 hidden dimension size. | explain | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| Make16Files | Test if the model can write a Python script that merges a list of file paths into 16 files of approximately equal size. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| NumbaRref | Test if a model can rewrite a fairly complex Python function to Numba. | code, python, performance | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| NumpyAdvancedIndex | Test if a model correctly understands how advanced indexing works in NumPy. | explain, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| NewAssemblyPrimeNumbers | Test if the model can write a program in a new assembly language. This ability to learn a new language on the fly is important for many tasks. | code | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| MakeShiftOpC | Test if the model can generate a C++ program that defines a dataflow DSL. | code, c | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| MakeShiftOp | Test if the model can generate a Python program that defines a dataflow DSL. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| FlagDraw | Test if a model can write a program that directly writes a JPEG file. This requires precise understanding of the JPEG spec. | code, python, visual | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| PythonToCLoopUpdate | Test if a model can convert a Python program to C, with a loop that makes it difficult. | code, python, c | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| ProgramTB | Test if the model can identify the bug and fix a program that handles Python tracebacks. Useful to know if the model can handle more advanced Python libraries. | code, fix | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |
| WhisperMerge | Test if the model can implement some string logic given a fuzzy description. | code, python | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |