Friday, April 14, 2023

Watching ChatGPT explain a reasoning problem is like watching someone who learned the language phonetically take an improv class

When trying to keep your head above the hype and bullshit surrounding large language models, here's a handy trick. Ask yourself how many people have answered exactly the same question in print over the years. If the answer is a large number, it would not be surprising if the LLM approach of looking at the likelihood of certain words and phrases appearing with other words and phrases actually produced something that seemed to show comprehension and awareness. 
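
To make that concrete, here is a toy illustration, not how any real model is implemented: a bigram table built from a made-up four-sentence corpus, greedily emitting whichever word most often follows the last one. Actual LLMs use neural networks trained on billions of documents, but the basic move of predicting the next word from co-occurrence statistics is the same in spirit.

```python
from collections import Counter, defaultdict

# Made-up toy "corpus" standing in for the thousands of proofs in print.
corpus = (
    "the proof follows by induction on n . "
    "the base case holds trivially . "
    "the inductive step follows from the hypothesis . "
    "the proof follows from the definition ."
).split()

# Count which word tends to follow which.
next_word = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev][nxt] += 1

def autocomplete(word, length=8):
    """Greedily emit the statistically most likely next word, repeatedly."""
    out = [word]
    for _ in range(length):
        followers = next_word.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

print(autocomplete("the"))
# -> "the proof follows from the proof follows from the"
# Locally fluent, globally mindless: it has no idea what a proof is.
```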

For example, if you asked ChatGPT to prove some familiar theorem, there's a good chance the algorithm could stitch together an acceptable answer out of the thousands of explanations already in print. 

It may look impressive, but it no more demonstrates an understanding of mathematics than reciting a passage from Goethe learned phonetically demonstrates a mastery of German grammar and syntax.

Here's what it looks like when an LLM encounters a new problem.

It is not just that the answer is wrong or that the explanation is wrong; it's that they seem unaware not only of the problem but of each other. The answer at the top is different from the answer at the bottom, and neither matches the steps given. 

We need to start thinking of these systems not as generative AI but as regurgitative. In some cases that's good enough for niche applications like generating unimportant boilerplate text or writing code snippets, but even in those limited areas, it can only function where humans have not only explored a question but have done so in such exhaustive detail that the algorithm can autocomplete its way to a useful response.
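
To caricature the point in code: here is a minimal, entirely hypothetical sketch of a "regurgitative" system, a lookup that returns the stored answer whose question shares the most words with yours. Everything here (the mini-corpus, the overlap measure) is invented for illustration; the point is that such a system is exactly as good as its coverage, and silently useless beyond it.

```python
from collections import Counter

# Hypothetical mini-corpus of already-answered questions (a stand-in
# for the exhaustively explored parts of the web).
answered = {
    "prove that the square root of 2 is irrational":
        "Assume sqrt(2) = p/q in lowest terms; then p^2 = 2q^2, so p is even ...",
    "write a python function to reverse a string":
        "def reverse(s): return s[::-1]",
}

def overlap(a, b):
    """Crude similarity: count the words two questions share."""
    wa, wb = Counter(a.split()), Counter(b.split())
    return sum((wa & wb).values())

def regurgitate(question):
    """Return the stored answer whose question overlaps most with ours."""
    best = max(answered, key=lambda q: overlap(q, question))
    return answered[best]

# Works when humans have already answered the question exhaustively...
print(regurgitate("prove the square root of 2 is irrational"))

# ...and fails silently on anything genuinely new: it still hands back
# the closest stored answer, right or wrong.
print(regurgitate("prove that my new conjecture is true"))
```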

3 comments:

  1. The performance of GPT-3 is amazing in terms of the amount of "knowledge" encoded and the way it can put it together coherently. It's superhuman on multiple tasks (like literally speaking every written language), but it's subhuman on others (particularly long-term coherent dialogue). I think the naysayers are forgetting that GPT-3 is our jumping-off point. If you follow the rest of your linked Twitter thread, you can see that GPT-4 gives the answer I think the OP expected (the question is vague in that I can imagine someone answering all 'A's, given that the previous examples only used 'A' as a fill character). GPT-4 can easily translate machine learning code from R to Python (just as I couldn't imagine coding without Stack Overflow last year, I can't imagine coding without GPT this year). Microsoft released a nice overview of what they could get GPT-4 to do, though they were using it before the final alignment phase.

    ReplyDelete
    Replies
    1. David in Tokyo here:

      As you should expect, I'm still in the "ChatGPT is a parlor trick" camp. This video is a fine example of how bad it remains. (Vlogger's conclusion: my job is safe.)

      https://www.youtube.com/watch?v=PlVX3hzp2qM&ab_channel=DavidBennettPiano%27s2ndChannel

      Interestingly, MickeySoft's Bing did way better on my first try at a similar question. I asked "Write me a chord progression in the style of Sonny Stitt." It grovelled around the internet, figured out that Stitt was a bebop/hard bop saxophonist, coughed up a (rather anodyne) variation on Rhythm Changes, and muttered some equally anodyne things about soloing over dominant seventh chords. (Well, this was sort of wrong, because it only coughed up chords sort of like the A section, whereas it's the B section, which consists solely of dominant seventh chords, that you have to worry about soloing over: the dominant seventh chords in the A section appear in ii-V cadences, so you wouldn't think about soloing over them the way you need to for the ones in the B section. But that's a quibble.)

      Basically, Bing is a "safe at any cost" Wikipedia interface, so it's boring. But safe. The ChatGPT folks are more willing to let the thing mess up. So it does.

      Delete
    2. Bob,

      As you might have guessed, outside of the niche applications of programming and translation, I remain very much a skeptic about the potential impact of large language models. I see diminishing returns and hard upper bounds, from a method that doesn't even seem to have the possibility of pushing far past its current capabilities.

      Put another way, most people look at LLMs and think of the Wright Brothers. I look at these things and think of Count von Zeppelin.

      Delete