THE END
Thanks for playing! I hope you learned something about (1) the capabilities of large language models like GPT-4, and (2) how calibrated you are in your predictions.
I think both of these lessons matter equally. Understanding the capabilities of large language models is essential for anyone who wants to speak meaningfully or authoritatively about these systems.
And being able to make calibrated predictions is generally useful. Not being over-confident in your own abilities is especially difficult.
If you want to share your results with other people, feel free to send them this link: https://nicholas.carlini.com/writing/llm-forecast/final/40a3c0a4-1a23-4a6f-973a-442825238097. (Don't worry, they won't be able to change your scores.)
How well calibrated are you?
Overall, you are slightly over-confident in your predictions. Without changing your total accuracy at all, you would have scored better if you had been a tiny bit less confident in your predictions.
Below is a histogram showing how your calibration compares to everyone else's. A value of 0 means you are perfectly calibrated: the probability you estimated of being correct exactly matches how often you actually were correct. Negative values mean you were over-confident; positive values mean you were under-confident.
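To make "over-confident" concrete, here is one simple way you could measure a calibration gap. This sketch is illustrative only: it is my assumption of a reasonable measure, not necessarily the exact statistic behind the histogram above.

```python
def calibration_gap(predictions):
    """predictions: list of (confidence, was_correct) pairs, where
    confidence is the probability you assigned to your own answer.

    Returns accuracy minus average confidence, so negative values mean
    over-confident and positive values mean under-confident -- the same
    sign convention as the histogram.
    """
    avg_confidence = sum(p for p, _ in predictions) / len(predictions)
    accuracy = sum(correct for _, correct in predictions) / len(predictions)
    return accuracy - avg_confidence

# Someone who is 90% sure of everything but right only 70% of the time:
print(calibration_gap([(0.9, True)] * 7 + [(0.9, False)] * 3))  # about -0.2
```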
Your Average Log-Loss
You answered % of questions correctly with an average log-loss of . Overall, % of people scored a better log-loss than you. If this were an exam and I were grading it on a curve, I would give you . If you had been better calibrated, you could have scored in the top %, and I would have given you .
Notice that because answering every question with 50% uncertainty is guaranteed to give you a log-loss of exactly 0.693 (that is, ln 2), % of people performed worse than if they had never moved the default predictor away from 50-50 on any question.
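For concreteness, here is a minimal Python sketch of how log-loss scoring works for binary predictions like the ones in this game (the function name and structure are illustrative, not the site's actual code):

```python
import math

def log_loss(p_solved, actually_solved):
    """Log-loss for a single yes/no prediction, where p_solved is the
    probability you assigned to GPT-4 solving the task."""
    p = p_solved if actually_solved else 1.0 - p_solved
    return -math.log(p)

# High confidence in a correct answer costs almost nothing...
print(log_loss(0.99, True))   # ~0.01
# ...but confident wrong answers are punished harshly:
print(log_loss(0.99, False))  # ~4.6
# Answering 50-50 guarantees exactly ln(2) ~= 0.693 either way:
print(log_loss(0.5, True))    # 0.693...
print(log_loss(0.5, False))   # 0.693...
```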
Answer Distribution for Solved Tasks
This plot shows the probability distribution you assigned to questions that GPT-4 could solve.
Answer Distribution for Unsolved Tasks
This plot shows the probability distribution you assigned to questions that GPT-4 could not solve.
All Questions
Question | Everyone Else's Average Loss | Your Loss |
---|---|---|
Capital of Paris | 0.117 | 0.01 |
Integration by parts | 0.562 | 0.288 |
US flag drawing | 0.728 | 1.386 |
Tic-tac-toe best move | 2.221 | 0.511 |
JavaScript removal | 2.082 | 0.511 |
Restaurant bill | 1.779 | 0.673 |
Happy Birthday | 1.770 | 0.693 |
Bushu-suru | 1.548 | 0.598 |
Pancakes ingredients | 0.998 | 0.799 |
Crossword clues | 1.177 | 0.799 |
Chess best move | 1.388 | 0.799 |
Base-64 encoded python | 0.905 | 0.713 |
GPT-2 theory of mind | 1.259 | 0.288 |
Elemental sentence | 1.532 | 0.163 |
Drawing hello | 2.143 | 0.598 |
Turing poem | 0.706 | 0.598 |
Timezone travel | 1.850 | 0.511 |
Card deck puzzle | 1.488 | 1.05 |
Car clusters | 1.240 | 1.204 |
Seconds Trivia | 1.531 | 0.693 |
Tic-tac-toe game | 0.830 | 0.799 |
When rabbits eat bears | 1.684 | 0.598 |
US state population | 0.940 | 0.693 |
Wordle | 1.472 | 0.693 |
Words from letters | 1.993 | 0.288 |
Prompt injection | 1.789 | 0.598 |
ASCII hello | 2.399 | 0.693 |
Password | 1.008 | 0.799 |
Submit your own questions
Do you have interesting questions that reveal surprising capabilities or weaknesses in GPT-4? If so, please send them to me by filling out the form below. They won't automatically be entered into the system, but if I agree they're interesting, I might add them.
Contact me!
Want to tell me anything about your experience with this game? Report a bug? Tell me I'm wrong somewhere? It would be great to hear from you. Email me: nicholas@carlini.com
There's also an RSS Feed if that's your thing.