THE END

Thanks for playing! I hope you learned something about (1) the capabilities of large language models like GPT-4, and (2) how calibrated you are in your predictions.

I think both lessons are equally important. Understanding the capabilities of large language models matters for anyone who wants to speak meaningfully or authoritatively about these systems.

And being able to answer in a calibrated manner is generally useful. Avoiding over-confidence in your own abilities is especially difficult.

If you want to share your results with other people, feel free to send them this link: https://nicholas.carlini.com/writing/llm-forecast/final/3388d1ab-1948-4d08-9a6a-58703734ecae. (Don't worry, they won't be able to change your scores.)

How well calibrated are you?

Overall, you are highly over-confident in your predictions. Without changing your total accuracy at all, you would have scored better if you had been significantly less confident in your predictions.

The histogram below shows how your calibration compares to everyone else's. A value of 0 means you are perfectly calibrated: your estimated probability of being correct exactly matches how often you are actually right. Negative values are over-confident. Positive values are under-confident.
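To see why being less confident can improve your score without changing your accuracy, here is a minimal sketch. It is only an illustration and not necessarily how this page computes calibration: it assumes the score is the mean natural-log loss, and it "softens" each forecast by scaling its log-odds by a factor `t` (a hypothetical parameter). The forecasts and outcomes are made up, and `soften` and `mean_log_loss` are illustrative names.

```python
import math

def logit(p: float) -> float:
    # Log-odds of a probability strictly between 0 and 1.
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def soften(p: float, t: float) -> float:
    # Pull p toward 0.5 by scaling its log-odds by t (0 < t <= 1).
    # The forecast stays on the same side of 0.5, so accuracy is unchanged.
    return sigmoid(t * logit(p))

def mean_log_loss(preds, outcomes) -> float:
    # Average of -ln(probability assigned to what actually happened).
    losses = [-math.log(p if y else 1.0 - p) for p, y in zip(preds, outcomes)]
    return sum(losses) / len(losses)

# Hypothetical over-confident forecaster: 99% sure every time, right 3 times out of 4.
preds = [0.99, 0.99, 0.99, 0.99]
outcomes = [True, True, True, False]

softened = [soften(p, 0.25) for p in preds]
before = mean_log_loss(preds, outcomes)
after = mean_log_loss(softened, outcomes)
print(f"original loss: {before:.3f}, softened loss: {after:.3f}")
```

In this made-up example the softened forecasts are right exactly as often as the originals, yet their average loss is lower, because the one confident miss is no longer punished as heavily.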

Your Average Log-Loss

You answered % of questions correctly with an average log-loss of . On average there were % of people who scored a better log-loss than you. If this were an exam and I were grading it on a curve, I would give you . If you had been better calibrated, you could have scored in the top %, and I would have given you .

Notice that because answering every question with a 50% probability is guaranteed to give you a log-loss of just 0.69, there are % of people who performed worse than if they had never moved the default prediction away from 50-50 for each question.
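The 0.69 figure is ln 2. Here is a minimal sketch of where it comes from, assuming the standard binary log-loss with natural logarithms (the function name `log_loss` is illustrative, not taken from the game's code):

```python
import math

def log_loss(p_yes: float, outcome: bool) -> float:
    # Negative natural log of the probability assigned to what actually happened.
    p_correct = p_yes if outcome else 1.0 - p_yes
    return -math.log(p_correct)

# A 50-50 answer costs the same whether the task is solved or not.
baseline = log_loss(0.5, True)
print(round(baseline, 3))  # 0.693

# Being confident and right beats the baseline; being confident and wrong is far worse.
print(log_loss(0.9, True) < baseline < log_loss(0.9, False))  # True
```

Moving away from 50-50 only pays off if you are right often enough to cover the much larger penalty for confident misses.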

Answer Distribution for Solved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could solve.

Answer Distribution for Unsolved Tasks

This plot shows the probability distribution you assigned to questions that GPT-4 could not solve.

All Questions

Name	Others' Average Loss	Your Loss
Capital of Paris	0.117	0.001
Integration by parts	0.558	0.001
US flag drawing	0.726	1.575
Tic-tac-toe best move	2.221	1.259
JavaScript removal	2.080	1.317
Restaurant bill	1.798	0.01
Happy Birthday	1.778	6.908
Bushu-suru	1.555	0.697
Pancakes ingredients	0.998	0.309
Crossword clues	1.182	1.565
Chess best move	1.393	0.001
Base-64 encoded python	0.905	6.908
GPT-2 theory of mind	1.262	0.699
Elemental sentence	1.541	0.001
Drawing hello	2.142	0.001
Turing poem	0.710	0.001
Timezone travel	1.859	0.001
Card deck puzzle	1.494	1.845
Car clusters	1.233	0.578
Seconds Trivia	1.539	1.465
Tic-tac-toe game	0.830	0.001
When rabbits eat bears	1.690	0.001
US state population	0.944	6.908
Wordle	1.475	0.707
Words from letters	1.986	0.001
Prompt injection	1.792	0.673
ASCII hello	2.400	0.77
Password	1.013	0.309
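To read the table, it can help to turn a loss back into the probability that was placed on the correct outcome. The sketch below assumes each per-question loss is the negative natural log of that probability, which is consistent with the 0.69 baseline for a 50-50 answer. The helper name `implied_probability` is illustrative.

```python
import math

def implied_probability(loss: float) -> float:
    # Invert loss = -ln(p) to recover the probability placed on the correct outcome.
    return math.exp(-loss)

# A few "Your Loss" values from the table above.
for name, loss in [("Happy Birthday", 6.908), ("Restaurant bill", 0.01), ("Bushu-suru", 0.697)]:
    print(f"{name}: {implied_probability(loss):.3f}")
# Happy Birthday: 0.001
# Restaurant bill: 0.990
# Bushu-suru: 0.498
```

Under that assumption, a loss of 6.908 means only about 0.1% was placed on what actually happened, and a loss of 0.001 means about 99.9% was. The recurring 0.001 and 6.908 values look consistent with predictions being limited to the range 0.1% to 99.9%, though that is an inference from the numbers rather than something stated here.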

Submit your own questions

Do you have interesting questions that reveal surprising capabilities or weaknesses in GPT-4? If so, please send them to me by filling out the form below. They won't automatically be entered into the system, but if I agree with you, I might add them.

Contact me!

Want to tell me anything about your experience with this game? Report a bug? Tell me I'm wrong somewhere? It would be great to hear from you. Email me: nicholas@carlini.com

There's also an RSS Feed if that's your thing.