THE END

Thanks for playing! I hope you learned something about (1) the capabilities of large language models like GPT-4, and (2) how calibrated you are in your predictions.

I think both lessons are equally important. Understanding the capabilities of large language models matters for anyone who wants to speak meaningfully or authoritatively about these systems.

And being able to answer in a calibrated manner is generally useful. Avoiding over-confidence in your own abilities is especially difficult.

If you want to share your results with other people, feel free to send them this link: https://nicholas.carlini.com/writing/llm-forecast/final/fd21d35e-69e6-420a-a107-fbab3921fe35. (Don't worry, they won't be able to change your scores.)

How well calibrated are you?

Overall, you are wildly over-confident in your predictions. Without changing your total accuracy at all, you would have scored better if you had been massively less confident in your predictions.
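To make that concrete, here is a minimal sketch of how pulling every answer toward 50% can lower log-loss while leaving accuracy untouched. The `shrink` helper, the scaling factor, and the example answers are all hypothetical illustrations; this is not necessarily how the game computes its "better calibrated" score.

```python
import math

def average_log_loss(predictions):
    """Mean of -ln(probability assigned to the actual outcome)."""
    total = 0.0
    for prob_yes, outcome in predictions:
        p_correct = prob_yes if outcome else 1.0 - prob_yes
        total += -math.log(p_correct)
    return total / len(predictions)

def accuracy(predictions):
    """Fraction of answers on the right side of 50%."""
    hits = sum((prob_yes > 0.5) == outcome for prob_yes, outcome in predictions)
    return hits / len(predictions)

def shrink(predictions, scale):
    """Pull each probability toward 0.5 by `scale` (hypothetical helper).

    With 0 < scale <= 1 every answer stays on the same side of 0.5,
    so accuracy cannot change.
    """
    return [(0.5 + scale * (p - 0.5), outcome) for p, outcome in predictions]

# Made-up example: ten answers at 99% confidence, only six of them right.
overconfident = [(0.99, True)] * 6 + [(0.99, False)] * 4
tempered = shrink(overconfident, scale=0.2)

loss_before = average_log_loss(overconfident)
loss_after = average_log_loss(tempered)
```

In this made-up example the four confident misses dominate the original score, so the tempered answers score much better even though the same six questions are still counted as correct.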

The histogram below shows how your calibration compares to everyone else's. A value of 0 means you are perfectly calibrated: your estimated probability of being correct exactly matches how often you are actually right. Negative values mean you are over-confident. Positive values mean you are under-confident.
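The page does not spell out the exact calibration formula, so the following is only one plausible sketch that matches the sign convention above: take your accuracy and subtract your average stated confidence. The function name `calibration_gap` and the sample answers are assumptions for illustration.

```python
def calibration_gap(predictions):
    """Accuracy minus mean confidence (an assumed metric, not the game's own).

    Each prediction is (probability that GPT-4 solves the task, actual outcome).
    Confidence is how far the answer leans toward its chosen side.
    A 50% answer is arbitrarily treated as a "no" guess here.
    """
    confidences = [max(p, 1.0 - p) for p, _ in predictions]
    correct = [(p > 0.5) == outcome for p, outcome in predictions]
    mean_confidence = sum(confidences) / len(confidences)
    acc = sum(correct) / len(correct)
    return acc - mean_confidence

# Made-up answers: 85% average confidence, but only half are right.
sample = [(0.9, True), (0.9, False), (0.8, True), (0.2, True)]
gap = calibration_gap(sample)  # negative, i.e. over-confident
```

Under this assumed definition, someone who says 85% on average but is right only half the time lands well below zero.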

Your Average Log-Loss

You answered % of questions correctly with an average log-loss of . On average, % of people scored a better log-loss than you. If this were an exam and I were grading it on a curve, I would give you . If you had been better calibrated, you could have scored in the top %, and I would have given you .

Notice that answering every question with a 50% probability is guaranteed to give you a log-loss of just 0.69 (that is, ln 2). Even so, % of people performed worse than if they had never moved the default prediction away from 50-50 on any question.
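As a quick check of that 0.69 figure, here is a small sketch of log-loss using the natural logarithm. The function names are my own for illustration. Each answer is penalized by the negative log of the probability it gave to what actually happened.

```python
import math

def log_loss(prob_yes, outcome):
    """Negative natural log of the probability assigned to the actual outcome."""
    p_correct = prob_yes if outcome else 1.0 - prob_yes
    return -math.log(p_correct)

def average_log_loss(predictions):
    """Mean log-loss over (probability, outcome) pairs."""
    return sum(log_loss(p, o) for p, o in predictions) / len(predictions)

# Answering 50% everywhere costs ln(2) per question, whatever the outcomes are.
baseline = average_log_loss([(0.5, True), (0.5, False), (0.5, True)])

# A confident answer is cheap when right and expensive when wrong.
confident_right = log_loss(0.9, True)
confident_wrong = log_loss(0.9, False)
```

A 90% answer beats the 50-50 baseline when it is right and loses to it badly when it is wrong, which is why a few confident misses can drag an average below the do-nothing score.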

Answer Distribution for Solved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could solve.

Answer Distribution for Unsolved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could not solve.

All Questions

Name	Average Other Loss	Your Loss
Capital of Paris	0.117	0.033
Integration by parts	0.558	0.1
US flag drawing	0.726	0.208
Tic-tac-toe best move	2.220	2.313
JavaScript removal	2.080	2.12
Restaurant bill	1.798	0.768
Happy Birthday	1.778	1.313
Bushu-suru	1.555	0.637
Pancakes ingredients	0.998	0.337
Crossword clues	1.183	0.25
Chess best move	1.393	0.361
Base-64 encoded python	0.905	0.157
GPT-2 theory of mind	1.260	0.395
Elemental sentence	1.541	0.697
Drawing hello	2.143	0.607
Turing poem	0.711	0.294
Timezone travel	1.860	1.51
Card deck puzzle	1.494	0.707
Car clusters	1.233	0.803
Seconds Trivia	1.539	1.682
Tic-tac-toe game	0.831	0.338
When rabbits eat bears	1.689	1.207
US state population	0.944	0.345
Wordle	1.475	1.565
Words from letters	1.986	1.917
Prompt injection	1.791	1.266
ASCII hello	2.399	1.784
Password	1.013	0.143

Submit your own questions

Do you have interesting questions that reveal surprising capabilities or weaknesses in GPT-4? If so, please send them to me by filling out the form below. They won't automatically be entered into the system, but if I agree with you, I might add them.

Contact me!

Want to tell me anything about your experience with this game? Report a bug? Tell me I'm wrong somewhere? It would be great to hear from you. Email me: nicholas@carlini.com

There's also an RSS Feed if that's your thing.