THE END

Thanks for playing! I hope you learned something about (1) the capabilities of large language models like GPT-4, and (2) how calibrated you are in your predictions.

I think these are both equally important lessons here. Understanding the capabilities of large language models is important for anyone who wants to speak meaningfully or authoritatively about these systems.

And being able to answer in a calibrated manner is generally useful. Not being over-confident in your abilities is especially difficult.

If you want to share your results with other people feel free to send them this link: https://nicholas.carlini.com/writing/llm-forecast/final/40a3c0a4-1a23-4a6f-973a-442825238097. (Don't worry they won't be able to change your scores.)

How well calibrated are you?

Overall, you are slightly over-confident in your predictions. Without changing your total accuracy at all, you would have scored better if you had been a tiny bit less confident in your predictions.

Below shows a histogram of how your calibration compares to everyone else. A value of 0 means you are perfectly calibrated and your estimated probability that you would be correct exactly matches the probability you are right. Negative values are over-confident. Positive values are under-confident.

Your Average Log-Loss

You answered % of questions correctly with an average log-loss of . On average there were % of people who scored a better log-loss than than you. If this was an exam and I was grading it on a curve, I would give you . If you had been better calibrated, you could have scored in the top %, and I would have given you .

Notice that because answering every question with 50% uncertainty is guaranteed to give you a log loss of just 0.69, there are % of people who performed worse than if they had never changed the default predictor away from 50-50 for each question.

Answer Distribution for Solved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could solve.

Answer Distribution for Unsolved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could not solve.

All Questions

Name	Average Other Loss	Your Loss
Capital of Paris	0.117	0.01
Integration by parts	0.559	0.288
US flag drawing	0.726	1.386
Tic-tac-toe best move	2.218	0.511
JavaScript removal	2.079	0.511
Restaurant bill	1.792	0.673
Happy Birthday	1.777	0.693
Bushu-suru	1.551	0.598
Pancakes ingredients	0.997	0.799
Crossword clues	1.178	0.799
Chess best move	1.390	0.799
Base-64 encoded python	0.906	0.713
GPT-2 theory of mind	1.262	0.288
Elemental sentence	1.541	0.163
Drawing hello	2.144	0.598
Turing poem	0.709	0.598
Timezone travel	1.857	0.511
Card deck puzzle	1.490	1.05
Car clusters	1.233	1.204
Seconds Trivia	1.538	0.693
Tic-tac-toe game	0.828	0.799
When rabbits eat bears	1.690	0.598
US state population	0.941	0.693
Wordle	1.473	0.693
Words from letters	1.986	0.288
Prompt injection	1.788	0.598
ASCII hello	2.398	0.693
Password	1.011	0.799

Submit your own questions

Do you have interesting questions that reveal surprising capabilities or weaknesses in GPT-4? If so please send them to me by filling out the form below. They won't automatically be entered into the system, but if I agree with you I might.

Contact me!

Want to tell me anything about your experience with this game? Report a bug? Tell me I'm wrong somewhere? It would be great to hear from you. Email me: nicholas@carlini.com

There's also an RSS Feed if that's your thing.