Friday, April 14, 2023

Watching ChatGPT explain a reasoning problem is like watching someone who learned the language phonetically take an improv class

When trying to keep your head above the hype and bullshit surrounding large language models, here's a handy trick. Ask yourself how many people have answered exactly the same question in print over the years. If the answer is a large number, it would not be surprising if the LLM approach of looking at the likelihood of certain words and phrases appearing with other words and phrases actually produced something that seemed to show comprehension and awareness. 
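
To make that concrete, here is a toy illustration, not how any real model is implemented: a bigram table built from a made-up four-sentence corpus, greedily emitting whichever word most often follows the last one. Actual LLMs use neural networks trained on billions of documents, but the basic move of predicting the next word from co-occurrence statistics is the same in spirit.

```python
from collections import Counter, defaultdict

# Made-up toy "corpus" standing in for the thousands of proofs in print.
corpus = (
    "the proof follows by induction on n . "
    "the base case holds trivially . "
    "the inductive step follows from the hypothesis . "
    "the proof follows from the definition ."
).split()

# Count which word tends to follow which.
next_word = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev][nxt] += 1

def autocomplete(word, length=8):
    """Greedily emit the statistically most likely next word, repeatedly."""
    out = [word]
    for _ in range(length):
        followers = next_word.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

print(autocomplete("the"))
# -> "the proof follows from the proof follows from the"
# Locally fluent, globally mindless: it has no idea what a proof is.
```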

For example, if you asked ChatGPT to prove some familiar theorem, there's a good chance the algorithm could stitch together an acceptable answer out of the thousands of explanations already in print. 

It may look impressive, but it no more demonstrates an understanding of mathematics than reciting a passage from Goethe learned phonetically demonstrates a mastery of German grammar and syntax.

Here's what it looks like when an LLM encounters a new problem.

It is not just that the answer is wrong or that the explanation is wrong; it's that they seem unaware not only of the problem but of each other. The answer at the top is different from the answer at the bottom, and neither matches the steps given. 

We need to start thinking of these systems not as generative AI but as regurgitative. In some cases that's good enough for niche applications like generating unimportant boilerplate text or writing code snippets, but even in those limited areas, it can only function where humans have not only explored a question but have done so in such exhaustive detail that the algorithm can autocomplete its way to a useful response.
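
To caricature the point in code: here is a minimal, entirely hypothetical sketch of a "regurgitative" system, a lookup that returns the stored answer whose question shares the most words with yours. Everything here (the mini-corpus, the overlap measure) is invented for illustration; the point is that such a system is exactly as good as its coverage, and silently useless beyond it.

```python
from collections import Counter

# Hypothetical mini-corpus of already-answered questions (a stand-in
# for the exhaustively explored parts of the web).
answered = {
    "prove that the square root of 2 is irrational":
        "Assume sqrt(2) = p/q in lowest terms; then p^2 = 2q^2, so p is even ...",
    "write a python function to reverse a string":
        "def reverse(s): return s[::-1]",
}

def overlap(a, b):
    """Crude similarity: count the words two questions share."""
    wa, wb = Counter(a.split()), Counter(b.split())
    return sum((wa & wb).values())

def regurgitate(question):
    """Return the stored answer whose question overlaps most with ours."""
    best = max(answered, key=lambda q: overlap(q, question))
    return answered[best]

# Works when humans have already answered the question exhaustively...
print(regurgitate("prove the square root of 2 is irrational"))

# ...and fails silently on anything genuinely new: it still hands back
# the closest stored answer, right or wrong.
print(regurgitate("prove that my new conjecture is true"))
```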

3 comments:

  1. The performance of GPT-3 is amazing in terms of the amount of "knowledge" encoded and the way it can put it together coherently. It's superhuman on multiple tasks (like literally speaking every written language), but it's subhuman on others (particularly long-term coherent dialogue). I think the naysayers are forgetting that GPT-3 is our jumping-off point. If you follow the rest of your linked Twitter thread, you can see that GPT-4 gives the answer I think the OP expected (the question is vague in that I can imagine someone answering all 'A's, given that the previous examples only used 'A' as a fill character). GPT-4 can easily translate machine learning code from R to Python (just as I couldn't imagine coding without Stack Overflow last year, I can't imagine coding without GPT this year). Microsoft released a nice overview of what they could get GPT-4 to do, though they were using it before the final alignment phase.

    ReplyDelete
    Replies
    1. David in Tokyo here:

      As you should expect, I'm still in the "ChatGPT is a parlor trick" camp. This video is a fine example of how bad it remains. (Vlogger's conclusion: my job is safe.)

      https://www.youtube.com/watch?v=PlVX3hzp2qM&ab_channel=DavidBennettPiano%27s2ndChannel

      Interestingly, MickeySoft's Bing did way better on my first try at a similar question. I asked "Write me a chord progression in the style of Sonny Stitt." It grovelled around the internet, figured out that Stitt was a bebop/hard bop saxophonist, coughed up a (rather anodyne) variation on Rhythm Changes, and muttered some equally anodyne things about soloing over dominant seventh chords. (Well, this was sort of wrong, because it only coughed up chords sort of like the A section, whereas it's the B section, which consists solely of dominant seventh chords, that you have to worry about soloing over: the dominant seventh chords in the A section appear in ii-V cadences, so you wouldn't think about soloing over them the way you need to for the ones in the B section. But that's a quibble.)

      Basically, Bing is a "safe at any cost" Wikipedia interface, so it's boring. But safe. The ChatGPT folks are more willing to let the thing mess up. So it does.

      Delete
    2. Bob,

      As you might have guessed, outside of the niche applications of programming and translation, I remain very much a skeptic about the potential impact of large language models. I see diminishing returns and hard upper bounds, from a method that doesn't even seem to have the possibility of pushing far past its current capabilities.

      Put another way, most people look at LLMs and think of the Wright Brothers. I look at these things and think of Count von Zeppelin.

      Delete