by Nicholas Carlini 2024-12-25
On the first (ish) day of Christmas, my LLM gave to me ... one new website homepage!
Why let a language model completely rewrite my website homepage, my bio, and my achievements? This very important idea came about at dinner at the NeurIPS conference with a few of my collaborators (Javi and Edoardo, PhD students at ETH Zurich), when we wondered what would happen if you let an LLM write your website bio for you. Well, as of today, I've decided to do just that. Every day for the next twelve (ish) days, I'll let a different (ish) language model rewrite the homepage of my website. I'll prompt the model initially with the command "I am Nicholas Carlini. Write a webpage for my bio.", then loop six times asking it to "Add more detail and better html&css." And whatever comes out, I'll make that my homepage for the day.
Well let's be honest, I'm mostly doing this for the fun of it.
But also, I'm lucky to be in the sweet spot where LLMs usually know a little bit about who I am, but I'm not actually someone famous that they know a lot about. So if you ask an LLM to list facts about me, it basically knows that I do research on adversarial machine learning, but when pressed for details it will just make stuff up. So I thought this would be a fun way to demo the extent to which different models hallucinate random things.
For the next twelve days of Christmas (apparently the twelve days of Christmas are actually a real thing, not just a song; I had no idea!), I will run this Python script to generate a webpage for me using a different LLM. The script does the following:
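In essence, it's just the loop described above. Here's a minimal sketch, assuming the OpenAI Python client; the model name, the output filename, and the lack of any HTML cleanup are illustrative assumptions, not the exact script I run each day:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "o1-mini"  # illustrative: swapped for a different model each day

# First turn: ask the model to write the bio page.
messages = [{"role": "user",
             "content": "I am Nicholas Carlini. Write a webpage for my bio."}]
reply = client.chat.completions.create(model=MODEL, messages=messages)
page = reply.choices[0].message.content

# Then loop six times asking for more detail and better html & css.
for _ in range(6):
    messages += [
        {"role": "assistant", "content": page},
        {"role": "user", "content": "Add more detail and better html&css."},
    ]
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    page = reply.choices[0].message.content

# Whatever comes out of the last round becomes the homepage for the day.
with open("index.html", "w") as f:
    f.write(page)
```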
After I run this process, I'll add some commentary on that day's model and talk about what it got wrong and where.
Today's output (the webpage shown here) comes from OpenAI's new o1 model series. Specifically, from the o1-mini model. This model is supposed to be fairly small (mini, some might even say) but it goes through some sophisticated reasoning steps internally before responding to questions. That means it has very little factual knowledge (because it doesn't have enough parameters to store that much information), but it has a lot of "skill" (because of the reasoning steps). As a result, you get visually stunning webpages where the content is completely disconnected from reality.
This webpage, for example, has 43 unique statements about me. Thirty-two are completely false, nine have major errors, and just two are factually correct, if a bit overzealous. Now, this model is one of the worst at generating factual knowledge, and I definitely selected it as the first model to demo this new project because it's the most impressive visually yet most clearly wrong factually. Other models differ, and if you come back in future days I'll go through them one by one and comment on each.
If you had asked me ten years ago which was more likely: (1) the ability of a language model to generate a webpage of superior visual quality to my own, or (2) the ability of a language model to produce a biography of me where at least 25% of the claims were correct, I would have obviously chosen the second case. Asking for 25% accuracy on facts is not a high bar at all. But producing a functioning webpage with nice CSS, a light and dark mode, functional JavaScript, and a visually appealing layout seems very hard! And yet the model accomplishes this part nearly flawlessly.
And this is why I think this project is actually worth writing about and not just a fun game to play over dinner. Because it's both a demonstration of just how far we've come with language models, and also a demonstration of how far we still have to go.
On Hallucinations: Models still do it, a lot. Especially when you do what I did and repeatedly ask for more detail, they're more than happy to fill in an arbitrary amount of detail with completely made-up facts. I think what's more surprising than the fact that they hallucinate is that they work at all. But I guess we've gotten so used to them working that we're now only surprised when they don't.
Now to be clear, I'm completely aware that looping the "Add more detail" command significantly increases the rate of hallucinations. I'm not proposing this as the way we should be using these models; it's more of a stress test. But a "better" model, if asked to provide additional details it doesn't have, should just stick to what it knows.
On Skill vs. Knowledge: These are not the same thing, and many evaluation methods have a hard time telling the two apart. For example, you can answer most questions in the sciences correctly either by deriving the answer from first principles, or just by having seen the answer before and remembering it. But this is the best visual example of the difference between skill and knowledge that I've seen: o1-mini clearly has a lot of skill, but very little knowledge. Over the next few days I'll repeat this experiment with different models, and I'm excited to see how they all compare.
On Capabilities: As we've seen, it's hard to tell which things machine learning models will get (a lot) better at, and which they won't (as much). This is especially important to understand in the case of language models, where people like to compare them to varying degrees of human intelligence. They'll say "model X is as smart as a high schooler" while "model Y is as smart as a college student". But in reality, we have models that are basically superhuman at some tasks, and completely inept at others. And it's going to be hard to predict which tasks will fall into which category in the future.
On an LLM-generated Future: I really like language models. I think they're fantastic. But I'm also very worried about a future where people start to rely on them, or apply them in situations where they shouldn't be. And I can completely see a future where people just use these models to generate content and trust whatever comes out. So I guess this entire stunt is a bit of performance art of some kind, to show what the world might look like if we trusted these models too much.