RECOGNITION THROUGH BENCHMARKS AND EVALUATIONS
PART 2
In from three to eight years, we will have a machine with the general intelligence of an average human being. I mean a machine that will be able to read Shakespeare, grease a car, play office politics, tell a joke, have a fight. At that point the machine will begin to educate itself with fantastic speed. In a few months it will be at genius level, and a few months after that, its powers will be incalculable. -- Marvin Minsky in 1970
A key takeaway from the previous post on human intelligence is that there are many ways to measure it, whether it is logical problem-solving ability, vocabulary size, juggling ability, trivia knowledge, or being able to recognize when someone is feeling sad. We will need to apply the same approach with AI, using different benchmarks and evaluation methods to understand where it equals or even surpasses human abilities and to measure and track progress. When AI reaches human-level performance across the board, we may have achieved Artificial General Intelligence (AGI); going beyond human abilities would be Artificial Superintelligence (ASI). As with human intelligence, though, there is still much debate about what exactly AGI is and how we will know if and when it happens.
A good way to start defining AGI is to contrast it with “narrow” or “weak” AI, which is designed to perform specific tasks within a limited domain. Think self-driving cars: they are very good at visual recognition, spatial mapping, and driving decisions, but they cannot hypothesize potential protein structures, write Python code, play chess, or mimic a person as a chatbot. Each of those tasks is handled by its own narrow AI, and all of them are very different from us: our nervous system can do all these things and much more, not just one task. AGI should be more like us, able to perform any intellectual task as well as or better than a human can. That’s a significant challenge!
When we get there, AGI will be able to think, learn, and understand on a similar level to humans and do what we can do (and, soon thereafter, do it better). Some likely capabilities of AGI include:
Human-like intelligence: the ability to reason, learn, and adapt in ways similar to human cognition.
Flexibility: able to apply its intelligence to a wide range of tasks and domains, like humans can.
Learning from experience: capable of learning from its experiences and improving its performance over time.
Generalization: able to take knowledge and skills learned in one context and apply them to new, unfamiliar situations.
Creativity and problem-solving: able to think creatively and come up with novel solutions to problems, much like humans do.
Self-improvement: the ability to design better neural networks, better AI hardware, and better training datasets, all to improve itself.
When we get there, it might change everything (coming next in part 3 of the series).
Just as we’ve taken a multi-faceted approach to measuring the various dimensions of human intelligence, we need to take the same approach with AGI. Benchmarks and evaluations will help measure the capabilities of AI systems and determine their progress towards AGI. Without these objective measures, it becomes hard to compare different approaches, identify areas for improvement, and quantitatively track the advancement of the field. They become common measures for researchers and developers to evaluate their models, share their findings, and collaborate on next steps towards AGI.
For decades, the benchmark for AGI was the Turing Test, proposed by Alan Turing to assess a machine’s ability to exhibit intelligent behavior indistinguishable from a human’s. However, simple mimicry can be quite convincing, and in some evaluations human judges now prefer the responses of today’s Large Language Models (LLMs) like ChatGPT over those of real people. The Winograd Schema Challenge, introduced in 2012, evaluates a system’s ability to resolve ambiguities in natural language using common-sense reasoning. While this benchmark addresses some of the limitations of the Turing Test, it still focuses on a narrow aspect of intelligence. The General Language Understanding Evaluation (GLUE) benchmark, developed in 2018, assesses a model’s performance on a range of natural language understanding tasks. While GLUE provides a more comprehensive evaluation of language understanding, it does not cover other essential aspects of intelligence, such as reasoning, problem-solving, and creativity. Newer benchmarks go further: GAIA tests practical, real-world knowledge and reasoning across domains like science and general knowledge, while AGIEval aims to measure the general abilities of AI models on tasks relevant to human cognition and problem-solving.
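To make this concrete, here is a minimal sketch of what a Winograd-style evaluation harness could look like. The schema pair is the classic “trophy/suitcase” example; model_resolve is a hypothetical stand-in for whatever system is under test (here just a naive baseline, not a real model):

```python
# A minimal sketch of a Winograd-style evaluation harness.
# model_resolve() is a hypothetical stand-in for the system under test;
# a real harness would prompt an LLM or a coreference model there.

from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str             # sentence containing an ambiguous pronoun
    pronoun: str              # the pronoun to resolve
    candidates: tuple         # the two possible referents
    answer: str               # the referent a human picks via common sense

SCHEMAS = [
    WinogradSchema(
        "The trophy doesn't fit in the brown suitcase because it is too big.",
        "it", ("the trophy", "the suitcase"), "the trophy"),
    WinogradSchema(
        "The trophy doesn't fit in the brown suitcase because it is too small.",
        "it", ("the trophy", "the suitcase"), "the suitcase"),
]

def model_resolve(schema: WinogradSchema) -> str:
    """Naive baseline: always pick the first candidate referent."""
    return schema.candidates[0]

def evaluate(schemas) -> float:
    """Fraction of schemas where the model picks the human answer."""
    correct = sum(model_resolve(s) == s.answer for s in schemas)
    return correct / len(schemas)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate(SCHEMAS):.0%}")  # the naive baseline scores 50%
```

The point of the paired sentences is that swapping a single word (“big” for “small”) flips the correct referent, so a system relying on surface statistics alone, like the naive baseline above, scores no better than chance; resolving the pronoun correctly requires common-sense knowledge about objects and containers.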
Google DeepMind has proposed “Levels of AGI,” a framework that organizes AI systems by their performance, generality, and autonomy, similar to the levels used for autonomous driving technologies. True AGI will be both flexible and general, with strong performance across many domains rather than specialization in just one. By this framework, we have made huge strides in narrow AI, surpassing human abilities at Level 5. For general AI, however, we are only at Level 1, “Emerging AI,” with systems like ChatGPT and Claude. Soon, we may hit Level 2, “Competent AI,” which is as good as the average human (cue relevant George Carlin quote).
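The original chart is not reproduced here, but the performance tiers from the DeepMind paper can be summarized roughly as follows (paraphrasing; the paper crosses each level with both narrow and general examples):
Level 0 (No AI): no machine intelligence involved, e.g., calculator software.
Level 1 (Emerging): equal to or somewhat better than an unskilled human; today’s general-purpose LLMs sit here.
Level 2 (Competent): at least the 50th percentile of skilled adults; not yet achieved for general AI.
Level 3 (Expert): at least the 90th percentile of skilled adults.
Level 4 (Virtuoso): at least the 99th percentile of skilled adults.
Level 5 (Superhuman): outperforms 100% of humans; narrow examples include AlphaFold and AlphaZero.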
According to Stanford’s AI Index Report, current AI models are as good as or better than humans on most measures. There are fewer and fewer domains in which we still outperform AI, even domains like creativity, which we thought was uniquely human.
Looking across standardized tests from the high-school level to the graduate level, GPT-4 exceeds the 80th percentile of human scores on several of them, including most AP exams and, more impressively, the Bar Exam, the LSAT, and the GRE.
Some experts believe that today’s LLMs already show the beginnings of AGI, or even amount to full-blown AGI. However, there is no consensus on what AGI is, let alone when we might get there. In a 2022 survey of over 300 experts, half expected AGI before 2061 (within 40 years), and 99% thought it would happen within the next 100 years.
Recent advancements have pulled some people’s timelines forward. Mustafa Suleyman, author of The Coming Wave, suggests AI will achieve this human-level performance soon, saying, “within the next few years, AI will become as ubiquitous as the Internet”. He adds, “The coming wave of technologies threatens to fail faster and on a wider scale than anything witnessed before”.
Improvements in neural network architectures, better training data, and faster hardware may quickly take us to AGI, which further underscores the importance of appropriate benchmarks and measurement. Once we start seeing real glimpses of AGI, full-blown AGI won’t be far behind, so we need to heed the warnings.
Skeptics like Meta’s Yann LeCun, a Turing Award winner, argue that we don’t yet have the right tools for AGI:
It’s astonishing how [LLMs] work, if you train them at scale, but it’s very limited. We see today that those systems hallucinate, they don't really understand the real world. They require enormous amounts of data to reach a level of intelligence that is not that great in the end. And they can't really reason. They can't plan anything other than things they’ve been trained on. So they're not a road towards what people call “AGI.” I hate the term. They're useful, there's no question. But they are not a path towards human-level intelligence. So the mission of FAIR [Meta’s Fundamental AI Research team] is human-level intelligence. This ship has sailed, it’s a battle I’ve lost, but I don't like to call it AGI because human intelligence is not general at all. There are characteristics that intelligent beings have that no AI systems have today, like understanding the physical world; planning a sequence of actions to reach a goal; reasoning in ways that can take you a long time. Humans, animals, have a special piece of our brain that we use as working memory. LLMs don't have that.
Meta is accordingly pursuing embodied intelligence, which is closer to how children learn about the world and develop their intelligence: by building a much deeper understanding of reality than pure text could ever capture. Google is similarly pursuing the combination of AI and robotics, which brings autonomy and action in the world. By connecting an LLM to a robot, the system can plan and execute commands to achieve its goals in the real world, like cleaning up a spill on the counter by grabbing a towel and wiping it up (the video still trips me out…).
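To illustrate the idea, here is a minimal sketch of an LLM-driven robot control loop in the spirit of systems like Google’s work connecting language models to robots. Everything in it (llm_plan, execute, the skill names) is a hypothetical stand-in rather than a real robotics API: the LLM repeatedly picks the next low-level skill that advances the goal, and the robot executes it.

```python
# A minimal sketch of an LLM-as-planner robot control loop.
# All names here are hypothetical stand-ins, not a real robotics API.

GOAL = "clean up the spill on the counter"

# The robot exposes a fixed library of low-level skills it can execute.
SKILLS = ["find towel", "pick up towel", "wipe counter", "discard towel", "done"]

def llm_plan(goal: str, skills: list[str], history: list[str]) -> str:
    """Hypothetical LLM call: given the goal, the available skills, and the
    steps taken so far, return the next skill to execute. A real system
    would prompt a language model and weigh each candidate skill by how
    useful it is for the goal and how feasible it is right now."""
    plan = ["find towel", "pick up towel", "wipe counter", "discard towel", "done"]
    return plan[len(history)] if len(history) < len(plan) else "done"

def execute(skill: str) -> None:
    """Hypothetical robot executor: dispatches a skill to motor control."""
    print(f"robot executes: {skill}")

history: list[str] = []
while True:
    step = llm_plan(GOAL, SKILLS, history)
    if step == "done":
        break
    execute(step)
    history.append(step)
```

The key design choice this sketch gestures at is constraining the language model to a fixed skill library, which grounds open-ended language plans in actions the robot can actually perform.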
As AI systems develop, benchmarks will need to develop too. Building them should be an interdisciplinary effort involving AI researchers, cognitive scientists, neuroscientists, and philosophers. With diverse perspectives and concrete benchmarks, we can trace AI’s pursuit of human intelligence and pinpoint the moment when it surpasses us.
We previously discussed the goalposts for AGI: human intelligence in its many facets. However, as AIs approach and surpass human-level capabilities on many tasks, the goalposts may shift. That’s why we need concrete and reliable benchmarks to trace AI development and provide some warning that AGI may be approaching. The field is advancing so rapidly that benchmarks can’t merely play catch-up; they need to stay ahead of the curve to accurately measure future advancements and provide clear comparisons to human abilities.
What does achieving AGI mean for us? What capabilities will it enable? What does it mean for our society? What does it mean for our species? Continue to the final part of this series, Potential Benefits of AGI to Society.