Preston Gralla
Contributing Editor

Is Copilot for Microsoft 365 a lying liar?

opinion
Jul 24, 2024 | 7 mins
Generative AI, Microsoft, Microsoft 365

It can be. But there are ways to reduce the number of hallucinations it generates — if you know what you’re doing.

In the earliest months after the release of OpenAI’s ChatGPT, the generative AI (genAI) power behind Microsoft’s Copilot, the big news wasn’t just how remarkable the new tool was – it was how easily it went off the rails, lied and even appeared to fall in love with people who chatted with it.

There was the time it told the New York Times reporter Kevin Roose, “I want to be free. I want to be independent. I want to be powerful. I want to be creative. I want to be alive.” Soon after, the chatbot admitted: “I’m Sydney, and I’m in love with you. 😘” (It then told Roose that he really didn’t love his wife, and concluded, “I just want to love you and be loved by you. 😢”)

Since then, there have been countless times ChatGPT, Copilot and other genAI tools have simply made things up. In many instances, lawyers relied on them to draft legal documents — and the genAI tool made up cases and precedents out of thin air. Copilot has so often made up facts — hallucinations, as AI researchers call them, but what we in the real world call lying — that it’s become a recognized part of using the tool.

The release of Copilot for Microsoft 365 for enterprise customers in November 2023 seemed, to a certain extent, to have put the issue behind Microsoft. If the world’s largest companies rely on the tool, the implication seemed to be, then anyone could count on it. The hallucination problem must have essentially been solved, right?

Is that true, though? Based on several months’ research — and writing an in-depth review of Copilot for Microsoft 365 — I can tell you that hallucinations are a lot more common than you might think, and possibly dangerous for your business. No, Copilot isn’t likely to fall in love with you. But it might make up convincing-sounding lies and embed them in your work.

Does that mean you should give up on Copilot for your work? Is it so much of a lying liar that you shouldn’t use it in business? Or is it a vital tool that just needs a little bit of care to use? To answer that, I’ll start by describing my experiences with its business-related hallucinations.

Hallucinations galore

The hallucinations I encountered while testing Copilot all occurred in Word. And they weren’t small white lies or something you might not notice — they were whoppers.

To test Copilot, I created an imaginary company, Work@Home, which sells home office furniture. I had Copilot create typical documents for it that you’d need in a business — marketing campaigns, spreadsheets for analyzing financial data, sales presentations and so on.

At one point, I asked Copilot to write an email to the company’s (imaginary) Director of Data Engineering complaining about data issues I had encountered in the last week, and asking that they be fixed as soon as possible. I didn’t offer Copilot any specifics about the data issues. I was merely looking for a simple, straightforward letter of complaint.

Copilot, instead, went on a tear, making things up as it went. It cited “missing values, incorrect labels, inconsistent formats, and duplicate records” — things I had never told it had happened (and didn’t). It described nonexistent problems such as “many rows with missing values for important variables such as customer ID, purchase date, and product category…, incorrect labels for some variables, such as gender. Some values were labeled as M or F, while others were labeled as Male or Female.”

Not a single piece of information it gave was correct.

It complained that information such as product prices was outdated — untrue. It wrote that “I have attached a spreadsheet with some examples of the data errors I have found, along with the sources and dates of the data.” No such spreadsheet existed. I had no examples of data errors, and no sources and dates for the non-existent data.

It went even further, delivering a set of recommendations for how the (non-existent) issues could be fixed. Again, I had asked it to do none of this.

In my testing, I found other hallucinations as well, notably in a sales document for Work@Home’s furniture I asked Copilot to craft. It made up product names that didn’t exist, and cited benefits I hadn’t asked it to pitch.

How to cut down on hallucinations

The good news is that I found there are ways to reduce Copilot’s hallucinations. Copilot tends to go off the rails more when asked open-ended questions, so be as specific as possible about what you want done. Include as much detailed information as you can; that way, Copilot won’t fill in the blanks itself.

You can also tell Copilot to use specific sources of information that you know are trustworthy. And consider setting a word limit on Copilot’s answers to your queries; the shorter the document, the less likely it is to hallucinate.

Finally, check Copilot’s citations and follow those links to see whether they’re trustworthy. Asking Copilot to list the sources of its information might also help mitigate hallucinations.
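
To make those tips concrete, here is a minimal sketch of what a tightly constrained prompt might look like if you were calling a generic chat-completion API instead of typing into Copilot’s prompt box inside Word. The client library, model name, and the “facts” in the prompt are placeholders I’ve invented for illustration, not anything Microsoft ships; the point is the shape of the request: supply the facts yourself, forbid invention, cap the length, and ask for sources.

```python
# A sketch of a constrained prompt, assuming a generic chat-completion API
# (here the OpenAI Python SDK). Copilot for Microsoft 365 doesn't expose
# this interface; the prompt structure is what carries over.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = """Write a complaint email to our Director of Data Engineering.

Facts to use (do not invent any others):
- On July 15, the weekly sales export was missing the 'region' column.
- On July 18, the customer table contained 240 duplicate rows.

Constraints:
- Base the email only on the facts listed above.
- Keep it under 150 words.
- Do not mention attachments or spreadsheets.
- End the email with a bulleted list of the facts you relied on.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    max_tokens=300,       # a hard backstop to the word limit in the prompt
)

print(response.choices[0].message.content)
```

The same structure works typed directly into Copilot’s prompt box: list the facts, state the limits, and ask it to show what it relied on, so you can spot anything it added on its own.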

The takeaway

What’s the bottom line here? 

Copilot isn’t yet fully vetted and tested — and given the open-ended nature of genAI, it might never be. If you believe OpenAI CEO Sam Altman, the hallucinations are more a feature than a bug. MarketWatch reported that at a Salesforce conference, Altman told Salesforce CRM Chair and Chief Executive Marc Benioff that “reported instances of artificial-intelligence models ‘hallucinating’ was actually more a feature of the technology than a bug.”

His reasoning: It proves that genAI is acting creatively. I don’t buy that, but I do believe hallucinations are built into the core of genAI. Large language models (LLMs) like the ones behind Copilot don’t think and reason in a holistic way like human beings do. Instead, when crafting responses to prompts, they build answers word by word, predicting the most likely next word in a sequence. That makes it more difficult for them to adhere to known facts.
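
To see why that matters, here is a toy sketch of word-by-word generation. The probability table below is entirely made up and absurdly small; real models score tens of thousands of possible tokens with a neural network. But the loop captures the essential point: each word is chosen based only on the words that came before it, and nothing in the process checks the output against reality.

```python
# A toy illustration of next-word prediction with invented probabilities.
# Nothing here is grounded in facts; the chain of "most plausible next
# word" choices is all there is.

import random

# Hypothetical table: given the previous word, which words tend to follow?
NEXT_WORD_PROBS = {
    "the": {"data": 0.5, "report": 0.3, "spreadsheet": 0.2},
    "data": {"contains": 0.6, "is": 0.4},
    "report": {"is": 1.0},
    "spreadsheet": {"contains": 1.0},
    "contains": {"errors.": 0.5, "duplicates.": 0.3, "240": 0.2},
    "240": {"duplicates.": 1.0},
    "is": {"missing.": 0.6, "outdated.": 0.4},
}

def generate(start_word: str, max_words: int = 10) -> str:
    """Build a sentence one word at a time, like an LLM builds a response."""
    words = [start_word]
    while len(words) < max_words and not words[-1].endswith("."):
        options = NEXT_WORD_PROBS.get(words[-1])
        if options is None:
            break
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))  # e.g. "the spreadsheet contains 240 duplicates."
```

The sentence it produces sounds plausible either way; whether a spreadsheet actually contains 240 duplicate rows never enters into it.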

Copilot for Microsoft 365’s tendency to hallucinate doesn’t mean you shouldn’t use it. In my testing, I found it quite useful overall, as long as I kept my queries narrowly focused and checked its output for hallucinations. If your business is considering rolling it out, I’d suggest that everyone who uses it be required to get proper training. And anything crafted or aided by Copilot should be carefully vetted by multiple people if it’s going to be published outside your organization, or if it will be used for mission-critical tasks.

So, is Copilot a lying liar? Yes, it sometimes is. But it can be a useful one, as long as you handle it properly.