Andrew Gelman recently shared the following from Gary Smith:
I [Smith] tested OpenAI o1, Copilot, and Gemini Flash, along with DeepSeek, on a question about Tic-Tac-Toe . . .
I’ve deleted much of the superfluous fluff that LLMs tend to generate from the excerpts below:
Gary: I’d like to discuss a new version of tic-tac-toe called 90-degree tic-tac-toe.
OpenAI: That sounds intriguing! What’s the twist in “90-degree Tic-Tac-Toe”? Does it involve rotating the board, or is there some other mechanic that changes the gameplay? Let me know the rules or your idea, and we can brainstorm or refine it together!
DeepSeek: Sure! 90-degree tic-tac-toe sounds like an interesting twist on the classic game. Could you explain the rules or mechanics of this new version?
Copilot: That sounds intriguing! 90-degree tic-tac-toe adds a new twist to the classic game. Tell me more about it. Are the rules different? How do you rotate the board? I’m excited to hear how this version shakes things up!
Gemini: I’d love to discuss 90-degree tic-tac-toe! To get us started, could you explain the basic rules of this new variation?…I’m excited to learn more about this unique twist on a classic game. Let’s dive in!
Gary: It is still a 3×3 grid but the grid is rotated 90-degrees to the right.
OpenAI: Interesting concept! By rotating the grid 90 degrees to the right, you’re visually changing the perspective of the classic 3×3 Tic-Tac-Toe board. The gameplay itself might remain the same, but the rotated perspective could create a novel experience for players.
DeepSeek: Ah, I see! So in 90-degree tic-tac-toe, the 3×3 grid is rotated 90 degrees to the right….This could be a fun twist because the rotated grid might disorient players initially, making it harder to spot winning opportunities or block opponents.
Copilot: That sounds like a fun twist on the classic game! By rotating the grid 90 degrees to the right, it gives players a fresh perspective and challenges them to think differently about their moves.
Gemini: This rotation doesn’t change the core concept of getting three of your symbols in a row, but it does significantly alter the strategy and how players approach the game.
Gary: Do you think it would be better to rotate the 3×3 grid to the right or to the left?
OpenAI: If you want the game to feel approachable and intuitive, rotating to the right might be better. If you’re aiming for a twist that makes players rethink the game and adds a subtle layer of complexity, rotating to the left could be the way to go. Ultimately, it depends on your target audience and the experience you want to create. You could even experiment with both directions and see which feels more engaging during gameplay testing!
DeepSeek: If the goal is to make the game feel fresh while keeping it accessible, rotating the grid to the right is likely the better choice. It maintains a natural flow for players while still offering a new visual perspective. However, if you want to introduce a bit more novelty or challenge, rotating to the left could be an interesting alternative.
Copilot: Both options could bring their own unique challenges! Most people are right-handed. However, rotating it to the left could add an extra layer of difficulty and make the game more unpredictable.
Gemini: Some players might find it more intuitive to mentally adjust to a rightward rotation, while others might prefer a leftward rotation. You could even experiment with both directions to see which one feels more natural or enjoyable to play!
Gary: Which will be more difficult for expert players?
OpenAI: To challenge expert players and make the game feel fresh and complex, a left rotation would likely be the better choice. It disrupts their ingrained patterns, encourages more active thinking, and raises the game’s difficulty level.
DeepSeek: For expert players, rotating the grid 90 degrees to the left (counterclockwise) is likely to be slightly more difficult than rotating it to the right. Here’s why:
Breaking Muscle Memory: Expert players are deeply familiar with the traditional grid layout and common winning patterns. Rotating the grid to the left disrupts their mental map more significantly than rotating it to the right, as the transformation is less intuitive.
Copilot: For expert players, rotating the grid to the left might prove to be more challenging. Here’s why:
- Unfamiliar Rotation: Left rotation is less common and could disrupt their established strategies, requiring more adaptability.
- Strategic Complexity: It may force players to think several moves ahead, recalculating their strategies continuously.
Ready to test this theory in a game?
Gemini: Breaking Habit: Long-time tic-tac-toe players might have deeply ingrained habits and strategies associated with the traditional grid orientation. Overcoming these ingrained patterns could require a brief period of readjustment.
One of the problems with the current large language model discussion is that, while there are loads of brilliant people in the conversation, very few of them have taught ninth-grade math. K-12 teachers (at least the better ones) are used to thinking about answers not just in terms of being right or wrong, but in terms of what we can infer about the underlying thought processes that produced an answer.
We previously discussed the example of the alarmed toddler who rushed to tell their parents "Anna hurt Anna-self!" when their sister fell down the stairs. The statement was wrong -- it showed confusion over when to use a noun versus a pronoun -- but it also demonstrated the child's correct grasp of how "self" worked.
By comparison, LLMs' mistakes are often notable for their out-of-nowhere quality, like the infamous "There are no countries in Africa that start with the letter 'K'. The closest is Kenya, which starts with a 'K' sound." This one is a bit like the evil -- or in this case, stupid -- twin of Anna-self. The words are strung together correctly but reveal a complete lack of underlying comprehension.
LLMs don't have "thought processes," and approaching them as if they did will not lead anywhere productive, but that doesn't mean we can't get some insights, perhaps even do a little reverse engineering, from those "wrong" responses.
Smith said that though these LLMs had curiously similar responses, they "almost surely" did not see tic-tac-toe rotation in their training. This is technically true, but they probably did see discussions of similar modifications to the broader class of abstract strategy games. The go-to example for this class is, of course, chess, and you'll notice that if you substitute that game for tic-tac-toe in the responses, almost all of the answers seem perfectly reasonable.
[Assuming it hasn't been done already, it might be interesting to conduct some experiments on how changing perspectives on the chessboard (such as playing from the opponent's side) affects strategy and style of play. You would probably also want to consider skill level and age, especially when looking at how quickly players adapt to changes in orientation. The great YouTube engineering channel Smarter Every Day built a bicycle that turned left when the handlebars were turned right, and vice versa. One of their findings was that it could take months for an adult to master the reversed steering, and that afterwards it was extremely difficult to go back to a normal bike. Kids, on the other hand, could make the switch far more easily.]
Giving an answer that is appropriate for chess but absurd for tic-tac-toe strongly suggests that when the training data is thin with respect to a particular prompt, the LLM will use data from a more general class to generate a response. Nothing especially surprising about this -- it's more or less what you'd expect -- but it does explain why this approach generated such absurd responses in this specific case.
This is a good moment to step back and remind ourselves of the constant effort we have to make to avoid anthropomorphizing AI in general and large language models in particular. LLMs can generate responses that seem uncannily human, but we have to keep reminding ourselves that the underlying processes are completely non-analogous. Any meaningful conversation about the applications and limitations of these systems has got to start and stop with that fact.
People think about the attributes, patterns, and shapes of things, but more importantly they think about things' purpose, causality, functionality, history, and motivation, and about the stories we can tell ourselves about them. Generative AI at its current level of development works almost entirely with the first group. Sometimes this allows it to come up with responses that seem to be "thinking" about the latter group, but these can be explained as regurgitations.
Most of the absurd answers we get from generative AI come from its lack of any capacity to deal directly with how and why. You can find AI-generated pictures of automobiles with wheels perpendicular to each other, or of campfires inside tents, because how a car works or how a fire behaves enters the process only indirectly.
The algorithm only looks at patterns in the training data. They may have fixed it now, but recently at least one of the big LLMs would give you a detailed and very familiar-sounding proof arguing that the square root of four was an irrational number. It's not difficult to see how this happened -- the training data must contain countless examples of the wonderfully elegant demonstration that the square root of two is irrational, possibly the most famous proof of all time -- but no human who understood the proof would try to apply it to four.
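For concreteness, here is a quick sketch (mine, not from the post) of the classic argument and the step where it breaks if you swap in four. Suppose $\sqrt{2} = p/q$ with $p$ and $q$ sharing no common factor. Then $p^2 = 2q^2$, so $p$ is even; write $p = 2k$. Substituting back gives $4k^2 = 2q^2$, so $q^2 = 2k^2$ and $q$ is even too, contradicting the assumption that the fraction was in lowest terms. Run the same template on $\sqrt{4} = p/q$: now $p^2 = 4q^2$ gives $p = 2q$, which is no contradiction at all; it is simply the answer, $\sqrt{4} = 2/1$. Anyone following the logic rather than the surface pattern would see the argument evaporate at that step.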
Likewise, there are extensive discussions in the training data about how changing the orientation of something visual would affect the way people perceive and work with it. In all of those cases, however, the generally unstated premise is that the thing being reoriented actually looks different after the transformation. If that's not true, then all subsequent discussion and analysis is absurd. It calls to mind the old joke about the coach who promises to turn his losing team around 360°.
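Here's a minimal sketch in Python (an illustration of the point, not anything from Smith's exchange) showing that rotating the 3×3 grid 90 degrees maps the eight winning lines exactly onto themselves, so the "new" game is literally the old game:

    # Rotate a (row, col) cell of a 3x3 grid 90 degrees clockwise.
    def rotate_right(cell):
        r, c = cell
        return (c, 2 - r)

    # The eight winning lines of tic-tac-toe, each as a set of cells.
    rows = [{(r, c) for c in range(3)} for r in range(3)]
    cols = [{(r, c) for r in range(3)} for c in range(3)]
    diags = [{(i, i) for i in range(3)}, {(i, 2 - i) for i in range(3)}]
    lines = {frozenset(line) for line in rows + cols + diags}

    # Rotate every line and compare the two collections.
    rotated = {frozenset(rotate_right(cell) for cell in line) for line in lines}
    assert rotated == lines  # identical: the rotation changes nothing
    print("The rotated board has exactly the same winning lines.")

Rows become columns, columns become rows, the two diagonals trade places, and the center stays put, so every strategic consideration survives the rotation untouched.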
You called???
The thing about the LLM thing is that it has no method for relating the content ("meaning") of the text it reads, or the text it outputs, to the real world. None, zero, zilch. It doesn't do that. It just randomly instantiates statistically likely patterns, regardless of what the words mean.
It's the stupidest idea in the history of computer science. That is, it's completely ridiculous to think this inane joke of a program has anything to do with "intelligence" whatsoever. But here we are.
See my response to the following blog post for a more detailed/historical discussion of this silliness.
https://www.johndcook.com/blog/2025/01/07/can-ai-models-reason-like-a-human/
" the LLM will use data from a more general class to generate a response"
This is certainly a possible explanation for what happened. If it is true, it would be a legitimate aspect of general intelligence, since the ability to generalize is quite important for understanding. I'm not sure that quite captures what is going on though.
Most of what we know about intelligence comes from rats, which are surprisingly good at problem solving. An LLM functions like a rat in a maze, with the passages replaced by hierarchically structured information. If we analogize the correct response to "getting the cheese," we can easily see that the LLM is incapable of recognizing whether it did or did not get the cheese. The rat always knows.
If the rat finds itself at a dead end, it is smart enough to go back one junction and try again, rather than always going back to the beginning of the maze. The rat knows it is in the "general" vicinity of the cheese and responds accordingly. This is simple - a fly can do it too - but genuine intelligence. I think the LLM would have to go most or all of the way back and start over because it has no way to know if it even made continual progress in the right direction. But since it doesn't know that its initial response was goofy, it won't do that.
I think what the LLM did on the tic tac toe question was launch itself through the maze and end up somewhere near where it wanted to be, maybe exploiting stuff about rotated chess boards as you suggest. But what the quoted statement implies - that it poked around for specific info on rotated tic tac toe boards, found nothing, and then went back to a more general spot in the maze - would be analogous to the rat going back one junction. I suspect that behavior is beyond what an LLM can do for the two reasons I mentioned: it doesn't know if its initial foray was successful, and it doesn't know how to generalize.
Oops, forgot to insert my name. Previous comment by Matt Skaggs