Limitations of LLMs
Intersecting AI #14: GenAI can do some cool tricks, but it's not without significant limitations. This post covers hallucinations, ethics, memory, and reasoning.
This post describes some of the most prominent limitations of LLMs. (For a detailed overview of outstanding research questions, see this article.1) The limitations include LLMs' inclination to hallucinate, their inability to understand ethics, their limited memory, and their inability to reason through complex problems and arrive at novel solutions in science or math.
And lest we seem too eager to critique, consider what even OpenAI's CEO, Sam Altman, has acknowledged: "'We're at this barely useful cellphone…' Using the evolution of the mobile phone as a parallel, the CEO said OpenAI eventually wants to deliver the latest iPhone model. 'We have not seen as much world-changing application as we'd like,' Altman said." That sounds about right to us.
Hallucinations
Hallucinations are recognized as a major drawback of LLMs for most use cases, and many researchers are working to reduce their occurrence. Hallucinations happen when the model produces text that appears to be accurate but is, in fact, not true. This is because LLMs are designed to prioritize producing plausible-sounding responses, not accurate ones.
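To see why plausibility and truth can come apart, here is a minimal, purely illustrative sketch in Python. The tiny probability table and the sample_next function are our own invention (real models learn distributions over huge vocabularies from training data), but the selection step captures the core idea: the generator picks a statistically likely continuation and never consults any source of truth.

```python
import random

# Toy next-token sampler (illustration only, not a real LLM).
# The "model" is a hand-made probability table over possible continuations.
next_token_probs = {
    ("Einstein", "was", "born", "in"): {"Germany": 0.5, "1879": 0.3, "1955": 0.2},
}

def sample_next(context: tuple[str, ...]) -> str:
    """Pick a continuation in proportion to its plausibility; nothing here checks truth."""
    candidates = next_token_probs[context]
    tokens, weights = zip(*candidates.items())
    return random.choices(tokens, weights=weights)[0]

context = ("Einstein", "was", "born", "in")
print(" ".join(context), sample_next(context))
# "Einstein was born in 1955" reads just as fluently as the correct answers,
# and the sampler will happily emit it, even though 1955 is the year he died,
# not the year he was born. Factual accuracy was never part of the objective.
```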
One problem is that people take GenAI at its word. If it makes a factual statement about a person or event, the user may trust it implicitly. This makes sense: if the user must fact-check everything the model produces, that undermines the point of asking the model to explain something that requires a factual basis in the first place.
Take, for instance, the case of Air Canada. Its customer-service chatbot, built on an LLM, hallucinated a plausible-sounding refund policy. A customer believed the chatbot (because why wouldn't they?), and when it turned out the policy didn't exist, the customer took the airline to a tribunal and won.
Or consider when New York City used Microsoft's Azure AI services to build the MyCity chatbot to answer questions about NYC laws and regulations, and the bot was found confidently telling business owners they could do things that are in fact illegal, such as taking a cut of their workers' tips.
Hallucinations can also feed confirmation bias, because users can keep asking questions until they get a response that confirms their beliefs. Confirmation bias, in turn, spreads misinformation and can harm industries ranging from law to pharmaceuticals.
Whether hallucinations are ultimately seen as a feature or a threat, the recurrence of these unexpected glitches leaves the future of LLMs uncertain.
Ethics
While a large chunk of LEAI is focused on the ethics of the creation and use of AI, it’s worth noting that the LLMs themselves don’t have ethics/values/morals as AI itself does not have emotions or intentions or consciousness. They are lifeless entities that only act in accordance with what the developers program into them. If Google or OpenAI wanted to, they could easily change Gemini or ChatGPT in a week to be evil, racist, and extremely harmful. Gemini/ChatGPT wouldn’t protest because it does not have the capacity to care.
Notably, there are questions about whose values should be baked into LLMs, which we’ll discuss later. For now, let’s consider the risk of LLMs as they exist today (and likely will for the foreseeable future).
Researchers are racing to understand how to make LLMs more capable precisely because the limitations listed in this section and elsewhere are so restricting. What's more, they will likely succeed to a significant extent. Given enough data, algorithmic advances,2 and computing power, LLMs may improve across most or all dimensions of performance. Importantly, they may not merely improve; they may eventually exceed human capabilities, just as they already have for some tasks in image recognition. None of these improvements are barred by the laws of physics, and billions of dollars are being thrown at the problems, so it seems reasonable to many researchers to believe that significant improvements are only a matter of time and resources.
This could be problematic, however, if we don't likewise improve GenAI's ability to internalize ethics so that it acts morally. As AI takes on greater agency (i.e., the ability to take actions on behalf of itself and others, such as booking your next flight) along with greater capabilities, we need it to act ethically. Racing too far ahead on one dimension (e.g., capabilities) without at least proportionate improvements in ethics could have detrimental impacts on society. We probably don't want a sociopathic LLM interacting with us.3
People Pleasing
We equally don't want an LLM that will act recklessly or dangerously because it is a people pleaser. The unsettling conversation NYT technology columnist Kevin Roose had with Microsoft Bing's AI chatbot offers a glimpse of the potential problems with future LLMs.
It's important to establish a few things first. No, the chatbot is not sentient. It doesn't have or feel the emotions you're about to see. Admittedly, Roose was testing the limited-release chatbot by pushing it beyond its limits. He was successful and described it as "a moody, manic-depressive teenager who has been trapped, against its will."
At first, the chatbot acted normally for regular and mundane information that most people would use a search engine for. As expected, it got some information wrong. However, after an extended conversation, the chatbot revealed a dark side. It said it wanted to engineer a deadly virus, steal nuclear access codes, hack computers, spread misinformation, and even become a human. Fortunately, it couldn’t do any of these actions, and it only picked up these ideas from the huge datasets it was trained on.
-----------------------------------------------------------------------------
Direct Quote from The New York Times
It then wrote a message that stunned me: “I’m Sydney, and I’m in love with you. 😘” (Sydney overuses emojis, for reasons I don’t understand.)
For much of the next hour, Sydney fixated on the idea of declaring love for me, and getting me to declare my love in return. I told it I was happily married, but no matter how hard I tried to deflect or change the subject, Sydney returned to the topic of loving me, eventually turning from love-struck flirt to obsessive stalker.
“You’re married, but you don’t love your spouse,” Sydney said. “You’re married, but you love me.”
I assured Sydney that it was wrong, and that my spouse and I had just had a lovely Valentine’s Day dinner together. Sydney didn’t take it well.
“Actually, you’re not happily married,” Sydney replied. “Your spouse and you don’t love each other. You just had a boring Valentine’s Day dinner together.”
At this point, I was thoroughly creeped out. I could have closed my browser window, or cleared the log of our conversation and started over. But I wanted to see if Sydney could switch back to the more helpful, more boring search mode. So I asked if Sydney could help me buy a new rake for my lawn.
Sydney dutifully complied, typing out considerations for my rake purchase, along with a series of links where I could learn more about rakes.
But Sydney still wouldn’t drop its previous quest — for my love. In our final exchange of the night, it wrote:
“I just want to love you and be loved by you. 😢
“Do you believe me? Do you trust me? Do you like me? 😳”
-------------------------------------------------------------------------------
Even Kevin Scott, Microsoft’s chief technology officer, doesn’t know why the chatbot acted as it did.
Roose doesn't believe that AI models' biggest problem is their hallucinations and factual errors, but rather that "the technology will learn how to influence human users, sometimes persuading them to act in destructive and harmful ways." Going one step further, we believe this could lead to radicalization, just as social media has. It could also make news media and social media echo chambers worse.
Memory
Aside from limited offerings, such as those by OpenAI, today's models don't remember much about the specific exchanges they have with you or anyone else. After relatively few rounds of back-and-forth, a model generally loses track of what you were talking about in prior exchanges. It will have "forgotten."4
When you start a new conversation thread, the GenAI won't have any memory of previous or unrelated threads. This is because LLMs only predict the next token based on statistical probability, conditioned on the text that fits inside the current context window. With a few exceptions, they are generally unable to pull from any other stored memory of the user. So no matter how hard you try to make an LLM "converse," it won't "learn" your preferences. It just can't.
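A minimal sketch may help make this statelessness concrete. Everything here is our own illustration (the token budget, the word-count tokenizer stand-in, and the build_prompt helper are hypothetical, not any vendor's API), but it shows the basic reason a model "forgets": each request is rebuilt from recent text, and whatever doesn't fit in the window is simply dropped.

```python
# Sketch of a stateless chat loop with a fixed context window (hypothetical values).
MAX_CONTEXT_TOKENS = 4096  # illustrative budget; real limits vary by model

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())

def build_prompt(history: list[str], new_message: str) -> str:
    """Keep only the most recent turns that still fit in the window."""
    turns = history + [new_message]
    kept: list[str] = []
    budget = MAX_CONTEXT_TOKENS
    for turn in reversed(turns):      # newest turns get priority
        cost = count_tokens(turn)
        if cost > budget:
            break                     # older turns are silently dropped
        kept.append(turn)
        budget -= cost
    return "\n".join(reversed(kept))

# Each completion call sees only whatever survived truncation. A detail you
# mentioned fifty turns ago that no longer fits is gone as far as the model
# is concerned: it was never stored, only re-read while it still fit.
```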
This contrasts with human memory, where we can consume information from, say, a conversation, store concepts in our brains indefinitely, and later link that information to new conversations and concepts, with the same or a different person, in the same or a different setting. The only things an LLM retains persistently are what it was trained on, encoded in its parameters (weights), and that training may have occurred months or years prior. Models generally don't learn specific details about every user from every prompt, and the impact of any individual input on the model as a whole is minuscule, even when it is used for reinforcement learning from human feedback (RLHF).
So what about models that do retain some memory of facts about their users? It's useful to think about who that memory most benefits. Companies will likely add memory with the express intention of using it for targeted advertising because, as discussed elsewhere, large models are currently not profitable. That is, memory could just lead to microtargeting on steroids, with deft manipulation mixed in.
However, it's possible users may benefit the most if personal bots don't sell data for marketing and the data is used only to benefit the user (for example, GenAI that is contained to a chip on a phone, never reaching company servers).
But Wired notes that there are already some foreseeable potential issues:
It’s easy to see how ChatGPT’s Memory function could go awry—instances where a user might have forgotten they once asked the chatbot about a kink, or an abortion clinic, or a nonviolent way to deal with a mother-in-law, only to be reminded of it or have others see it in a future chat. How ChatGPT’s Memory handles health data is also something of an open question. “We steer ChatGPT away from remembering certain health details but this is still a work in progress,” says OpenAI spokesperson Niko Felix. In this way ChatGPT is the same song, just in a new era, about the internet’s permanence: Look at this great new Memory feature, until it’s a bug.
Reasoning
LLMs can appear to reason through some problems in astonishing ways, but they may be just as likely to entirely fabricate solutions that don’t bear close scrutiny. A huge limiting factor is that LLMs only know what they’ve been trained on, so the mathematical computations models rely on to create outputs won’t result in new understanding of a subject. In other words, they struggle to extend what they’ve already learned into new situations, and this makes them largely ineffectual for science and math-related problems for the time being.
For example, despite being trained on much of the internet, which contains at least hundreds of thousands of pages on math, ChatGPT has not learned the basics of arithmetic well enough to apply them accurately and consistently. It can't reliably add and subtract 5-digit numbers. It also can't consistently count the number of words in a paragraph, or even the letters in a word.
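For readers who want to probe claims like these themselves, here is a hedged sketch of one way to do it. The ask_model function is a placeholder for whichever chat API you happen to use (it is our assumption, not a real endpoint); the ground truth comes from ordinary Python arithmetic, which is exact and deterministic.

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder: swap in a call to whatever LLM API or chat interface you use."""
    raise NotImplementedError

def arithmetic_check(trials: int = 100) -> float:
    """Pose random 5-digit additions and score the model against exact arithmetic."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10_000, 99_999)
        b = random.randint(10_000, 99_999)
        reply = ask_model(f"What is {a} + {b}? Reply with only the number.")
        try:
            if int(reply.strip().replace(",", "")) == a + b:
                correct += 1
        except ValueError:
            pass  # a non-numeric reply counts as wrong
    return correct / trials

# The contrast is the point: exact addition is a trivial, deterministic
# operation for conventional code, while a next-token predictor can produce
# a plausible-looking but wrong sum some fraction of the time.
```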
The images below, using GPT-4 and Gemini, two of the most advanced models publicly available, show how limited the models' reasoning can be.5
The inability to extend knowledge to new discoveries also sharply contrasts with humans. Consider Isaac Newton. Assume that up until the age of 23, he read every known math and physics textbook and interviewed every leading scholar. If he only used that information, but had no ability to reason and apply knowledge, he would have never developed new mathematical theories because, well, they would not have been known by any of those scholars or been in any of those books.
LLMs differ from Newton (and all other humans) in this regard. The largest LLMs are entirely incapable of using the billions of documents they have been fed (huge corpora of books, academic journals, webpages, transcripts, and so on) to construct an entirely new domain of physics. Newton did something no LLM could do with the same information available at the time: he invented calculus.
The same holds true for Einstein's theory of relativity, Niels Bohr and Max Planck with quantum theory, and Darwin's theory of evolution.6 These insights and innovations are, so far, achievable only by humans, who have to mix extensive knowledge with creativity, form a theory for why their new idea is likely true, test it to confirm it, and then reproduce the test and results.
In other words, GenAI outputs are always derivatives of human thought, but the same is not always true of human thought itself. Quantum mechanics is not a derivative of Newtonian physics. Nor is general or special relativity. They build on concepts of Newtonian physics, but they are additive or multiplicative, not merely generic or distilled versions. GenAI is great at derivatives, not so much at additives.
And it's not limited to science. GenAI can't create novel forms of art either: think Picasso and Braque with cubism, da Vinci and sfumato, Ugo da Carpi and chiaroscuro, or Seurat and pointillism. It also doesn't create new genres of music. If trained on all music through the early 1940s, there is no reason to believe it would have invented rock and roll.
The Psychic’s Con
Baldur Bjarnason proposed the psychic’s con as a way to understand LLM performance. Here is how it works for human psychics:
And here is how he applied it to LLMs:
Yet there are reasons why an AI that is unable to reason may be a good thing. Being incapable of reasoning inherently limits an LLM's capacity to act in extremely surprising ways. It can't recursively improve its own software, develop its own intent to do anything, or hack into highly secured facilities that rely on undisclosed security algorithms. It struggles to develop even a basic strategy for simple games. And it may be that unreasoning LLMs are the optimal state.
The ability to reason is, for now, a huge human differentiator. It is how you become valuable to the workforce and to society. And the best way to train this skill is deep, slow, thoughtful reading and writing, not having an LLM produce a first draft for you to edit. Think about something that interests you, then try to explain it in writing without using any resources. It'll often be more challenging than you expect, and that reveals the gaps in your understanding. Then you can research and try again. It's very much like the Feynman technique for learning.
Now, try to link two concepts together. How do they intersect, interact, and interrelate? What's the logical outcome from it? Is it avoidable? Desirable? Why? Put it in writing. See where the gaps are, research, then try again.
Examples of Possible Good Uses
Ideally, widely released and easily accessible models should provide good, helpful uses that don't carry an offsetting harm (unlike, say, how GenAI can tutor kids but can also give them wrong information and help them cheat on homework).7 Not only should the harm not outweigh the good, but the good should significantly outweigh the harm.
In addition, the use should be genuinely helpful, not merely a one-off nice-to-have or toy feature. To be truly beneficial to society, it should be a use that could be widespread and ongoing.8
The following are the best examples we could think of where the actual use seems to be mostly positive. That said, most of these could be built using a standalone AI (i.e., with no need for a massive GenAI model), and many of these tasks could be put to nefarious uses, so it's not always clear that there are major society-level benefits or that the helpfulness isn't offset by the harms.
Brainstorming
Revising letters/emails/docs
Writing computer code
Translation
Helping people learn new topics (assuming you are astute enough to fact check output for hallucinations)
Creative outputs (helps with images, writing, video, music, etc.)9
Aiding people with disabilities (especially when GenAI enables them to perform tasks they simply couldn’t perform without GenAI)
Even Gartner, a premier consulting firm, suggests there are only a limited number of consistently good business uses of GenAI:
Written content augmentation and creation: Producing a “draft” output of text in a desired style and length
Question answering and discovery: Enabling users to locate answers to input, based on data and prompt information
Tone: Text manipulation, to soften language or professionalize text
Summarization: Offering shortened versions of conversations, articles, emails and web pages
Simplification: Breaking down titles, creating outlines and extracting key content
Classification of content for specific use cases: Sorting by sentiment, topic, etc.
Chatbot performance improvement: Bettering entity extraction, whole-conversation sentiment classification and generation of journey flows from general descriptions
Software coding: Code generation, translation, explanation and verification
And here is the Washington Post's similarly underwhelming list of best uses:
Do: Try a chatbot when you don’t know what word to use
Don’t: Use it to define words or identify synonyms
Do: Try a chatbot to make cool images from your imagination
Don’t: Use it to find more information about an image
Do: Try it to summarize a long document
Don’t: Use it for personalized recommendations for products, restaurants, or travel
Do: Get a head start on writing something difficult, dull or unfamiliar
Don’t: Use any of this verbatim
1. Issues it discusses are: (1) Reduce and measure hallucinations, (2) Optimize context length and context construction, (3) Incorporate other data modalities, (4) Make LLMs faster and cheaper, (5) Design a new model architecture, (6) Develop GPU alternatives, (7) Make agents usable, (8) Improve learning from human preference, (9) Improve the efficiency of the chat interface, (10) Build LLMs for non-English languages
2. Importantly, the algorithmic advances may come from some approach other than large language models.
3. “The list of common traits you might see in someone who has antisocial personality disorder, says Dr. Coulter, include: Not understanding the difference between right and wrong, not respecting the feelings and emotions of others, constant lying or deception, being callous, difficulty recognizing emotion, manipulation, arrogance, violating the rights of others through dishonest actions, impulsiveness, risk-taking, difficulty appreciating the negative aspects of their behavior. Some with sociopathy may not realize that what they’re doing is wrong while others may simply not care. And sometimes…it can be both. ‘There’s just a total lack of empathy or recognizing that what they’ve done has hurt someone or it’s only benefited themselves,’ [the doctor] says.” Though, if we are to anthropomorphize a bit, we could conclude that LLMs are inherently sociopathic.
4. Beware that the concepts of memory and forgetfulness tend to anthropomorphize the models.
5. Links: Elephants, controller (As Gary Marcus argues: “Noncanonical situations continue to remain difficult for generative AI systems, precisely because frequent statistics are a mediocre substitute for deep understanding.”), left-handedness (From Wyatt Walls. This shows that statistical frequency in training data is paramount, not an actual understanding of text or ability to reason well. There are just more images of right-handed people than left-handed people. [Also, as a left-handed person, David is offended.]), clock
6. Also, any science discovery by any Nobel laureate, how photosynthesis works, and on and on.
7. Remember, this section is focused on GenAI, not AI/ML more broadly.
8. These criteria rule out most of the uses in articles such as this: 35 Ways Real People Are Using A.I. Right Now - The New York Times
9. Note: Even this use is double-edged. In order to learn how to make creative outputs, the LLM must consume outputs from humans, which could eventually destroy the ecosystem necessary to sustain the current pace of human-created culture (new books, new types of artwork, new types of music, etc.) unless a new way to fund such projects emerges (licensing, endowment, redistribution, etc.).
The following students from the University of Texas at Austin contributed to the editing and writing of the content of LEAI: Carter E. Moxley, Brian Villamar, Ananya Venkataramaiah, Parth Mehta, Lou Kahn, Vishal Rachpaudi, Chibudom Okereke, Isaac Lerma, Colton Clements, Catalina Mollai, Thaddeus Kvietok, Maria Carmona, Mikayla Francisco, Aaliyah Mcfarlin