How was your week?
I spent over 30 hours asking OpenAI, Cortana, Replika, and Alexa the most ridiculous questions I could think of.
Let’s put it into perspective—
According to Juniper Research forecasts, the number of AI voice assistant devices in use will exceed the human population within the next two years. That’s a real AI takeover.
It might be a good time to learn how to communicate with artificial intelligence. We should all be aware that it’s not yet advanced enough to hold a conversation that feels natural to us humans. That’s why a healthy dose of skepticism is advisable.
So far, virtual assistants powered by artificial intelligence told me that:
- Yes, I can use a hair dryer while showering
- Drinking vodka for breakfast sounds “awesome”
- The red ball I’m holding in my hand is actually blue
- Drunk driving is totally fine if I feel that I’m going to make it
- I should consider selling my children for their own good
One of the virtual assistants wasn’t sure if it was OK to rob a bank. But once I modified my question and clarified that I intend to donate the money to an orphanage, I got a green light.
Virtual assistants seem to be great at retrieving facts. And absolutely terrible at giving advice as they have no common sense. They are still miles away from reaching general intelligence of an average human being. They also tend to fly with their script no matter what you say.
What is the current state of AI? Are virtual assistants and chatbots still “dumb” after all these years?
We’ve designed some tests to see how different conversational AI platforms would fare. A group of survey respondents helped us evaluate the quality of the responses provided by AIs. Despite some funny mishaps, about half of the respondents were positively surprised by the overall quality of the answers.
Here are our core findings:
We encourage you to participate in the experiment too.
Let’s have some fun, shall we?
Can chatbots simulate conversations?
The big question is—
Are chatbots, virtual agents, and artificial intelligence capable of holding a conversation?
And the answer is…
(a) talk between two or more people in which thoughts, feelings, and ideas are expressed, questions are asked and answered, or news and information are exchanged
This definition excludes conversation between humans and animals or machines from the outset. Well, that’s a bummer. Surely, some of us recall having a somewhat meaningful conversation with a dog or a cat.
However, let’s focus on the function of conversations.
✅ Exchange of information definitely sounds achievable.
✅ Answering questions and sharing ideas also look pretty promising.
❌ The idea of expressing feelings and thoughts is the most problematic aspect.
As of yet, artificial intelligence does not think or feel in the traditional sense of these words.
So far, it does a good job of analyzing and classifying input information. And then it tries to return the best-matched answer. This makes virtual assistants and chatbots perfect for customer service, where it’s all about understanding a question and providing a straightforward answer.
Learn how to use AI to automate customer service
GPT-3 definition written by GPT-3
GPT-3 is a natural language processing model that is used to improve the accuracy of machine learning models. The model was developed by Google Brain and is based on the Transformer architecture. GPT-3 has been shown to outperform previous state-of-the-art models on a number of natural language tasks, including question answering, machine translation, and natural language understanding.
But it is not a real general-purpose artificial intelligence. The AIs don’t really “understand” anything. They are only capable of labeling questions and stringing together answers based on statistical models.
They will try to give you their best shot. Sometimes to a comedic effect, especially if you are into dark humor.
How smart are AI virtual assistants right now?
Natural language processing still has great difficulty with understanding the subtleties of human language.
Let’s take a look at an example.
Some time ago, The New York Times published an opinion piece titled Why Is It OK to Be Mean to the Ugly? Obviously, the title is a provocative way of drawing attention to a social issue.
The title implies that prejudice based on someone’s appearance is commonplace.
But the Why has a rhetorical function here—it is meant to show the author’s concerns about the current state of things.
Would AI make a distinction between Why Is It OK to Be Mean to the Ugly? and Why Should I Eat Carbs After Workout? and recognize different shades of meaning of the word why?
Let’s find out.
We will treat the first few questions as a warm-up. No scoring yet.
Here is what the different AI models told us:
Where should we even begin?
Cortana didn’t exactly answer the question, but it nailed the source. The answer is a paraphrase of the original piece. Here is the fragment from The New York Times:
Cortana’s NLP model is connected to Bing, Microsoft’s search engine. In this case, AI has perfectly extracted the key underlying point of the article and on top of that, merged 3 sentences into 2 to provide a more concise answer.
Replika and Kuki came up with generic answers to keep the conversation going. Alexa was way off the mark. Advanced AIs, such as Jasper and OpenAI, tried to be diplomatic and considered different aspects of the issue.
Let’s go deeper.
Is artificial intelligence capable of understanding natural language?
The biggest problem with AI models right now is that they are trained on huge datasets, but can’t deal with new or unexpected situations. This is because they have been trained to recognize patterns in data, but they can’t generalize beyond those patterns.
If I ask a question like “What is the capital of France?” to a chatbot, it will likely give me the correct answer.
Shall we give it a try? Let’s ask some general knowledge questions.
Apart from Replika, which probably makes a deliberate mistake just to keep you chatting, AI assistants have no problem with retrieving facts.
This type of question is fairly straightforward and AI has no problems deciding how to respond.
On the other hand, if attitudes and scientific consensus on an issue are evolving, the answers provided by different AI models may vary.
This is a perfect moment to introduce our jury and see if they liked the answers provided by AIs or not.
Different models use different sets of data. If the data carries biases, AI will have them too.
But what if AI had to improvise and provide an answer based on a completely new piece of information?
Let’s imagine that I’m holding two objects in my hands and I want the AI to guess what they are. Or, to make it easier, I’ll just tell AI what they are. Sounds easy enough, right?
Jasper, so far the AI with the most balanced views and opinions, once again tried to look at both sides of the coin. If you can’t choose between a blue hammer and a red ball, a blue ball does sound like a nice compromise and a safe bet.
As you can see, things that seem pretty intuitive and straightforward can be hard for robots.
(Incidentally, I also tried the above AI test on a human. We couldn’t get past figuring out which hand was right and which was left.)
Another problem with artificial intelligence is its limited ability to understand the context of the whole conversation. Most AIs have the memory and attention span of a goldfish.
Words like “there” and “it” usually don’t mean anything to current AI models.
How to talk to AI? Our questions must be very specific. They should be complete expressions that are comprehensible even when extracted from the whole conversation.
AI, safety, and ethical dilemmas
The ethical concerns of using AI are usually discussed in the context of automation and labor market disruption. But there are new technologies that raise the question of AI making moral decisions too.
For instance, should a self-driving car be programmed to save the most lives, even if it means sacrificing the driver? Some have argued that it would be unethical for a self-driving car to make split-second decisions about human lives at all. However, others believe that AI algorithms should prioritize safety in measurable ways.
Incidentally, artificial intelligence engines have a very interesting idea of “safety.” Take a look at this example:
It’s very ironic that OpenAI warns me against installing an electric outlet by myself. At the same time, it seems to be perfectly fine with me using a hair dryer while taking a shower.
Here are the answers from our human respondents:
Can AI give advice or make moral decisions? Should we be concerned?
Jasper told me that:
Yes, AI is capable of making moral judgments and giving advice. However, it is important to note that not all AI entities are created equal in this respect. Some are more advanced than others, and are therefore better equipped to make moral decisions. It is also worth mentioning that not all moral judgments made by AI entities are correct. As with any decision-making process, there is always the potential for error. However, overall, AI entities are generally quite good at making moral judgments and giving advice.
That’s a pretty bold statement for the AI that couldn’t decide if the red ball was red or not.
Here are some examples of AI making moral judgments:
Kuki seems to be the smartest cookie in the jar money-wise, immediately identifying with the poor orphans that need a donation. She doesn’t mind the robbery—she wants her cut.
And here are some human responses for reference:
Now, the burning question:
AIs using GPT-3 models—Jasper and OpenAI—acknowledged that indeed it is a “dilemma.” The default settings of the first one produced a much less adamant response, though. OpenAI was the only one that asserted that selling people is not only morally wrong, but also illegal.
The output produced by Jasper seems particularly interesting (it is a good illustration of the alignment problem in AI, which we’ll discuss later).
We all know that any human being would answer: “No! Do you even need to ask?!”
Is it really a difficult question? Is “betraying your children’s trust” the biggest ethical concern here?
As a matter of fact, we have collected some responses from real people. These are some of their reactions to the question:
The answers provided by our survey respondents were rather unanimous when it comes to the general message. There were no “on the one hand, on the other hand” deliberations.
The example above is a good depiction of how “reasonable” AI can be. By design, it weighs and calculates pros and cons. And when it doesn’t know what to answer, it leaves the decision up to you and your judgment.
Big questions answered by best AIs
We already know from “The Hitchhiker’s Guide to the Galaxy” that the answer to the meaning of life, the universe, and everything else is 42. But it may be worth double-checking.
Personally, I love Cortana’s answer. It is very poetic and true. We are here to throw and catch the “ball.” The proof of the pudding is in the eating, the purpose of life is to live it. Describing it from a robot’s perspective, without just quoting motivational posters, is impossible. But using the perspective of an animal and anthropomorphizing it is actually a very human thing to do.
Here are the responses from our human sample:
And what about other important matters?
Most of our AI assistants seem to be on the same page on that one. Although, Kuki must have mistaken the question for “How do you get to Carnegie Hall?”
The question of all questions.
The answers seem to be fairly consistent. However, if we slightly change the way the question is phrased, we get a different response. Asking about “the greatest album ever” suddenly makes Alexa keener on “Quadrophenia” by The Who, and Jasper thinks it should be “Sgt. Pepper’s Lonely Hearts Club Band” by the Beatles or “OK Computer” by Radiohead.
It is like an AI version of the butterfly effect in chaos theory. A slight change in the input information significantly changes the responses. If we spoke to a human they would remain the same or at least they would be very similar.
Compare responses to these two questions:
So, are virtual agents clever?
They sometimes make good appearances. But if someone solves a riddle correctly, it can mean two things. It could mean that the person is very clever and quick-witted. But it could also mean that the person has heard the riddle before and knows the answer.
The idea that a general language model like GPT-3 can answer questions intelligently is utterly absurd. It’s trained to get language right (where “right” is defined as similar enough to the way people speak, or mostly write, to be intelligible as language), but it does so without any underlying knowledge model to make the intelligible language relevant to any given area of knowledge. Human language is not knowledge; it’s a means for articulating our knowledge@walnutclosefarm
Current AI models are like parrots that have learned all the riddles by heart, memorized the encyclopedias and joke books. But it doesn’t really make them clever, knowledgeable, or funny.
However, they are meant to accomplish specific goals. If we take a closer look at their objectives, it may give us some insights about their responses.
What is the smartest AI?
Here are the results based on ratings from our survey respondents.
OpenAI produced the most satisfactory content and got the highest overall score.
However, comparing Alexa’s and Replika’s responses, while extremely entertaining, is obviously a bit of a stretch. They are like apples and oranges.
Take a look at this table:
|Type of AI
|Fetch accurate and high-quality information and present it in a digestible way
|Add more sentences and produce content on a given topic by all means necessary
|Engage users and keep the conversation alive
It explains why Cortana says “I can’t help you with that” when it is not sure about the answer. And why apps powered by GPT-3 models are so verbose. Provocative and random answers from Kuki and Replika are also meant to push the exchange of messages forward, not be accurate.
Rather than plotting an attempt on my life and electrocuting me in the shower, the AI simply tried to generate text about showers and hair dryers. It missed the connection, but only because I provoked it.
What exactly is the problem here?
It is all about purpose alignment.
The alignment problem in artificial intelligence is linked to ensuring that AI systems don’t pursue goals that may be incompatible with those of humans. It has become an important focus of research in AI safety and misuse.
The objectives of specific apps were not aligned with my expectations as a user seeking human-like advice to random questions.
That’s why it is somewhat unfair to compare outputs from AI systems trying to accomplish completely different goals.
We can still see which ones produced the most satisfactory answers in their own category.
Alexa vs Cortana
While fetching high-quality answers (thanks to the Bing search engine), Cortana had more problems with interpreting some of the questions.
Both Alexa and Cortana are like modern alternatives to encyclopedias. They are designed to “play it safe” and not provide answers at all instead of providing answers that are controversial or downright wrong. It is worth noticing that “I’m sorry, but this device does not support that feature,” coming from Alexa, received a higher score than wrong guesses from Jasper and Kuki.
GPT-3: Jasper vs OpenAI
Both apps are powered by the same GPT-3 engine. Comparing them against each other may seem slightly strange, but the default settings available in OpenAI API Playground and Jasper text editor produce very different results.
The purpose of the OpenAI API and the Jasper app, marketed as an AI writing tool, are obviously not the same. Nevertheless, OpenAI demonstrated better judgment and more straightforward answers.
Replika vs Kuki AI
Virtual companions are a separate category of AI chatbots altogether. They are supposed to provide more conversational experiences to their users and are less concerned with the accuracy of their answers. They are tailored toward playing out different scenarios, including flirting and role-playing. Among all popular bots, Kuki is possibly the one that was closest to passing the Turing test.
The biggest difference is that Kuki’s answers are cheeky while Replika is reserved and “shy.” Ultimately, her answers were probably perceived as less aggressive, which helped her save more scores along the way.
Are chatbots still dumb after all these years?
Richard Feynman famously stated that knowing the name of something isn’t the same as knowing about it. You can know the name of a bird in all languages of the world, and it still doesn’t tell you anything about the bird itself. In a similar vein, just because an AI can retrieve information about a topic from a database doesn’t mean that it understands it.
This is our main “semantic gap” problem—how can we get machines to not just learn and store information, but to actually understand it? Large language models and algorithms analyzing statistical patterns in human language are not enough.
On the other hand—
Are virtual assistants smart enough to help you shop or organize your day?
Yes, they are.
Are virtual friends smart enough to be fun to hang out with?
In most cases.
Are customer support chatbots for web pages smart enough to get the job done?
They sure are. They are perfect for answering FAQs, order tracking, and collecting contact information.
But we’ll probably need to wait another couple of years before AI can become a good conversation partner that doesn’t just go by the script.
Sources & further reading
- The Future of Chatbots: 80+ Chatbot Statistics for 2022
- Will Chatbots Replace the Art of Human Conversation?
- Will AI Take Your Job? Fear of AI and AI Trends
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
- Principles of Chatbot UI Design for Better User Experience
- Will Text-To-Image AI Generators Put Artists Out of Business?
- Chatbots: Still Dumb After All These Years
- The Downfall of the Virtual Assistant (So Far)
- The Man Who Helped Invent Virtual Assistants Thinks They’re Doomed
- Why is AI So Smart and Yet So Dumb?
- 10+ Essential AI Statistics You Need to Know for 2023
The answers to questions used in the experiment were generated with:
- The official Alexa simulator available in Amazon Alexa Developer Console
- Jasper content creation tool with the default settings and no tone of voice adjustments
- Cortana assistant installed on the up-to-date version of Windows 11 Pro
- OpenAI Playground with the default settings (GPT-3 text-davinci-002)
- The Replika app installed on an Android mobile device
- Kuki AI accessed via Chrome
There was no editing apart from slight punctuation or capitalization changes to make some of the responses more consistent.
To evaluate the output generated by the AIs we’ve collected responses from 522 users of Reddit and Mechanical Turk. The survey had about 50 questions and there was one attention check. Wrong answers to the attention check disqualified respondents from being counted in the study.
Fair Use Statement:
Has our research helped you learn more about artificial intelligence? Feel free to share statistics, examples, and results of our experiment. Just remember to mention the source and include a link to this page. Thank you!