“Deloitte was forced to investigate the report after University of Sydney academic Dr Christopher Rudge highlighted multiple errors in the document.” www.afr.com/companies/pr...
— bianca wylie (@biancawylie.com) October 5, 2025 at 4:58 PM
"Deloitte Australia will issue a partial refund to the federal government after admitting that artificial intelligence had been used in the creation of a $440,000 report littered with errors including three nonexistent academic references and a made-up quote from a Federal Court judgement."
One of the central problems with LLM-based tools (probably the central problem) is finding the sweet spot where the flexibility adds real value but the results are still easy to check.
I’ve found I can get pretty good value out of something like ChatGPT as long as I work in manageable chunks and keep the process as transparent as possible. With coding, that usually comes down to reasonably sized macros, functions, and queries that I can quickly test for errors. With proofreading, it means only looking at a few paragraphs at a time and instructing the chatbot to make minimal corrections and list all changes.
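To make "manageable chunk" concrete, here is a purely illustrative sketch (an invented example, not from any real project): the kind of small, self-contained function I might let a chatbot draft, together with the quick check I run before trusting it.

```python
# Illustrative example only: a small, self-contained function of the sort
# a chatbot might draft, plus a quick sanity check I can run right away.

def calendar_quarter(month: int) -> str:
    """Map a calendar month (1-12) to a quarter label like 'Q2'."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return f"Q{(month - 1) // 3 + 1}"

# The check is small enough to run in seconds, so mistakes surface immediately.
assert calendar_quarter(1) == "Q1"
assert calendar_quarter(4) == "Q2"
assert calendar_quarter(12) == "Q4"
```

Anything much bigger than that and the checking stops being quick, which defeats the whole purpose.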
Using the tool to come up with actual information is very seldom worthwhile. It almost always comes down to one of two extreme cases: either the answers are something I could find in a more usable form with a couple of minutes of searching or by just hitting Wikipedia; or confirming the information would take longer (and always be less informative) than doing the research myself. Google’s AI is somewhat more useful, but only because it provides relevant links — which I inevitably need to follow to make sure the information is good.
For bigger jobs, you almost always run into the same underlying problem that makes autonomous driving so dangerous in most situations. Though it seems paradoxical, humans generally find it easier to focus on doing a task than to focus on making sure a task is being done properly. There’s been a ton of research on this in areas like aeronautics. It turns out that not only is it difficult to maintain your attention on an autonomous system; it’s more difficult the better the system works. The more miles your “self-driving” car goes without an incident, the less likely you are to be ready to grab the wheel when one finally occurs.
LLMs also play to two great temptations: the desire to get that first draft out of the way and the promise we make ourselves to fix something later. First steps can be daunting — often nearly to the point of paralysis — but they can very seldom be outsourced. It’s easy to see the appeal of letting an AI-based tool grind out that initial work, but the trouble is twofold. First, the dreary and time-consuming process of research does more than simply compile information; it builds understanding on the part of the researcher. Second, while it is beyond easy to tell ourselves that we will diligently check what we’re given, that often turns out to be more dreary and time-consuming than it would have been to simply do the work ourselves in the first place. After a while, attention wavers and our fact-checking grows more cursory. Add to that the looming deadlines that govern the life of a consultant, and you virtually guarantee AI-generated nonsense will make its way into important and expensive reports.
Given the incentives, I guarantee you that Australian report is not an isolated incident. It is remarkable only because it was detected.
_____________________________
That's hilarious that they're offering "a partial refund." I think the Australian government should hold out for a full refund!