HOW GRAIG'S HUMOR DIMENSION TRANSFORMS LLM BENCHMARKING
Measure what can be measured, and make measurable what cannot be. -Galileo Galilei
Today’s Large Language Models (LLMs) are capable of many incredible tasks: creative writing, explaining difficult concepts like general relativity to an 8th grader, even writing Python code, and they stand to revolutionize industries from scientific research to education. As these models continue to evolve and improve, they are becoming remarkably capable across a wide range of topics.
If you follow the industry, you will often hear claims that a new model outperforms other well-known models like OpenAI’s GPT-4 or Google Gemini Pro, but what does that really mean? To help us evaluate and compare models, benchmarks have become a necessary practice. They serve as standardized tests (like the SAT or LSAT) that evaluate the performance of LLMs across a variety of dimensions, such as language understanding, reasoning, and factual knowledge. With the same set of questions asked to each model, benchmarks enable researchers to compare models, identify strengths and weaknesses, and track progress over time. Having benchmarks that cover a wide range of tasks and domains is necessary for understanding the generalizability and robustness of LLMs. Without these benchmarks, it is tough to gauge the true potential and limitations of these models, hindering development and adoption along the way.
ROLE OF BENCHMARKS IN EVALUATING LLMs
Not all models are created equal – whether it’s the underlying neural-network architecture, its size and complexity, the content and extent of the training dataset, or additional reinforcement training, many factors go into a model’s performance. For example, a model trained mostly on math and logic examples can excel in that domain while being a poor creative writer or AI girlfriend (unless what you really need is math tutoring). Another model trained on millions of code examples isn’t going to tell funny jokes (more on that later). We need repeatable ways to measure all sorts of capabilities that these models can exhibit – that’s where benchmarks come in.
Benchmarks quantify models’ performance across various dimensions, allowing researchers and developers to evaluate and compare performance and capabilities. These dimensions can include language understanding, translation, summarization, factual knowledge, maths, programming abilities, even understanding images. By testing across all these dimensions, we get a comprehensive view of a model’s abilities. And with a quantitative evaluation, we can statistically compare various models’ performance on each dimension to find out which performs best. By providing objective measurements, benchmarks help us identify state-of-the-art performance and make informed decisions about which model is best for a specific application. Large amounts of benchmark data also allow us to perform meta-analyses to find interesting relationships and trade-offs between model size, training data, computational resources, and performance.
At launch, Anthropic hypes Claude 3.5 Sonnet’s benchmark scores compared to competitors: https://www.anthropic.com/news/claude-3-5-sonnet
By using a wide variety of benchmarks across various subjects, we can build a valuable picture of the strengths and weaknesses of individual LLMs. For example, an LLM may score well on language understanding tasks but struggle with context or complex reasoning. By identifying those strengths and weaknesses, we can take a targeted approach to improving them. Benchmarks can also help us uncover biases, inconsistencies, or errors in an LLM’s outputs, highlighting the need for specific additional training. They help guide the development of neural networks in general as well. By establishing standardized evaluation criteria and datasets, benchmarks create a common framework for assessing progress and identifying areas for improvement. As new technologies emerge, we can directly compare their performance to current models. And as new benchmarks emerge, they drive innovation and encourage the development of more advanced and capable models. The insights we gain from benchmarks contribute to the overall advancement of AI.
If you can remember back to high school taking the SAT or really any standardized test, you’ll recall answering hundreds of questions to test your knowledge, math, and reasoning abilities. The SAT attempts to measure a person’s likelihood to succeed in college, so it’s trying to measure something big and real in the world through a variety of questions. With the thousands of SAT responses from students all over the world, we can directly compare students who took the same exam, as well as previous years’ students to look for trends.
What do LLM benchmarks measure? Mostly the same things as human standardized tests: language understanding, reasoning and problem-solving, knowledge retrieval and application (though they require fewer bio breaks and are less likely to cheat off their neighbor). For example, MMLU has roughly 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law and medicine. MMLU-Pro ups the ante with more complex reasoning questions, 10 answer options instead of 4 (making a correct guess by chance much less likely), and a culled, more robust set of about 12,000 questions across 14 domains. Most importantly, MMLU-Pro has more headroom to measure improvement (regular old MMLU isn’t challenging anymore).
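To make the scoring mechanics concrete, here is a minimal sketch of how a multiple-choice benchmark in the spirit of MMLU gets scored, using a stand-in model that guesses at random; the item format and function names are simplified illustrations, not the official evaluation harness.

```python
import random

# A toy multiple-choice item in the spirit of MMLU / MMLU-Pro;
# real harnesses load thousands of items from the published datasets.
questions = [
    {"prompt": "Which planet is known as the Red Planet?",
     "options": ["Venus", "Mars", "Jupiter", "Saturn"],
     "answer": 1},  # index of the correct option
]

def ask_model(prompt, options):
    """Stand-in for an LLM call; here it simply guesses at random."""
    return random.randrange(len(options))

def accuracy(questions):
    correct = sum(ask_model(q["prompt"], q["options"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)

# Chance baselines: 1/4 = 25% on classic 4-option MMLU versus 1/10 = 10% on
# 10-option MMLU-Pro, which is why the same score means more on MMLU-Pro.
print(f"accuracy: {accuracy(questions):.2%}")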
MATH and GSM8K focus on mathematical problem-solving – GSM8K on grade-school word problems and MATH on harder competition-style problems – something LLMs notoriously struggle with because math is difficult to learn from hearing people around you talk about it, especially without formal schooling. Given lots of math questions and answers, can models learn to generalize mathematical rules and actually understand maths? That’s what these benchmarks seek to measure.
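As a rough illustration of how such math benchmarks are typically graded, the sketch below extracts the final number from a model’s worked solution and compares it to the reference answer; the regex and the “#### 24”-style reference format follow common GSM8K-style conventions, but exact harness details vary and this is not the official scoring code.

```python
import re

def final_number(text: str) -> str | None:
    """Extract the last number in a worked solution (thousands separators removed)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Exact-match comparison of the final numeric answers."""
    return final_number(model_output) == final_number(reference_answer)

# Toy example in the spirit of a GSM8K item (not an actual dataset question).
reference = "Each box holds 6 eggs, so 4 boxes hold 4 * 6 = 24 eggs. #### 24"
model_out = "4 boxes times 6 eggs per box gives 24 eggs, so the answer is 24."
print(is_correct(model_out, reference))  # True
```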
ARC measures more general reasoning abilities and HellaSwag (a hella great benchmark name in my opinion) attempts to measure commonsense reasoning. These benchmarks focus on logical reasoning, ability to draw inferences, reading comprehension, and mathematical reasoning. They assess the model’s ability to analyze the relationships between concepts, make deductions, and generate solutions to problems posed in natural language. They can also assess when a model has gone from simple pattern recognition to possessing emergent reasoning skills.
TruthfulQA can help us understand how well models mimic human falsehoods – an interesting way to test models’ biases but also how “human” they can feel. These kinds of tests are also useful in measuring toxicity and censorship – you don’t want your customer service chatbot providing false information about reasonable defenses from suspected vampires.
And LMSYS’s Chatbot Arena takes a unique approach: various models respond to the same prompt so you can blindly and directly compare responses for accuracy, style, and so on. By selecting your preferred response, you help rank the models using a chess-style Elo rating system.
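For the curious, here is a minimal sketch of how those pairwise votes can be turned into ratings with a basic Elo update; the K-factor and starting rating are illustrative choices, and Chatbot Arena’s actual leaderboard uses more sophisticated statistical machinery, so treat this as the underlying idea rather than their exact method.

```python
# Minimal Elo-style update for pairwise chatbot votes.
# K and the 1000-point starting rating are illustrative, not LMSYS's settings.
K = 32

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift ratings toward the observed outcome of one head-to-head vote."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, winner="model_a", loser="model_b")  # one user preferred A
print(ratings)
```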
Despite their value, benchmarks have several limitations that must be addressed. Benchmark datasets often lack diversity, focusing on a narrow range of tasks or domains, which means a broad suite of benchmarks must be run to get a complete picture. Openness in benchmarks is important so others can evaluate the questions going into the benchmark and the conclusions drawn from the scores. However, with openness comes the risk of benchmark answers ending up in the training dataset to game the system for high scores (it’s like writing the answers to the test on your hand). And most of all, today’s benchmarks often fail to measure the “humanness” of creativity and humor, qualities that can help AI better understand and engage with humans.
THE F'INN GRAIG BENCHMARK
The new F'inn GRAIG (Generative Reasoning, Amusement and Intellect Grade) Benchmark is an innovative approach to evaluating the performance and capabilities of large language models (LLMs). The GRAIG benchmark aims to provide a general assessment of LLMs by incorporating a wide range of tasks and dimensions, including both traditional measures like reasoning and novel aspects like humor. By examining LLMs for some signs of “humanness”, GRAIG seeks to address some of the limitations of existing benchmarks and provide a deeper understanding of how LLMs can feel more human.
The most unique feature of the GRAIG benchmark is the focus on measuring “humanness” through humor. Why humor? As we have discussed previously in depth, humor is a complex and quintessentially human trait.
Humor relies on a deep understanding of language, context, and social norms as well as reasoning skills (humor has to make sense after all). It is a fundamental aspect of human communication and cognition, serving multiple purposes in social interactions and intellectual pursuits.
It is a powerful tool for building rapport, relieving tension, and conveying complex ideas in an engaging and memorable way.
It also demands the ability to think creatively and generate novel connections between seemingly unrelated concepts.
In short, it feels uniquely human. Measuring humor as a benchmark dimension helps assess an LLM’s ability to engage with language on a more human-like level.
However, quantifying and evaluating humor in AI systems is challenging because humor is highly subjective and culturally dependent, making it difficult to establish universal standards for what constitutes a successful joke or witty remark. What one person finds hilarious may fall flat for another, depending on their background, preferences, or even mood at the time. Developing benchmarks that can capture the nuances and diversity of human humor requires careful design and validation. Luckily, I have a Ph.D. in the psychology of humor and am totally hilarious!
GRAIG's approach to measuring humor in LLMs involves a range of tasks and evaluation criteria that assess different aspects of humor comprehension and generation. This benchmark includes tasks such as:
Pun Detection: the model must identify and explain the humorous wordplay in each text
Joke Completion: the model must generate a punchline that fits the setup of a joke
Joke Explanation: the model must explain why a joke is humorous
By providing a diverse set of humor-related challenges, GRAIG seeks to capture a more comprehensive picture of an LLM's capacity to engage with and produce humorous content.
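To give a flavor of what evaluating these tasks might look like in code, here is a hypothetical sketch of how humor items could be represented and scored by human raters or an LLM judge; the class, field names, and 0–100 scale are assumptions for illustration only, not GRAIG’s actual schema or scoring pipeline.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical structures for humor tasks like those listed above; the field
# names and the 0-100 scale are illustrative assumptions, not GRAIG's schema.
@dataclass
class HumorItem:
    task: str      # e.g. "pun_detection", "joke_completion", "joke_explanation"
    prompt: str    # the pun-bearing text, joke setup, or joke to explain
    response: str  # the model's output, filled in at evaluation time

def judge(item: HumorItem) -> float:
    """Stand-in for human raters (or an LLM judge) scoring 0-100;
    here it just returns a fixed placeholder score."""
    return 50.0

def humor_score(items: list[HumorItem]) -> float:
    """Average judged score across all humor items."""
    return mean(judge(item) for item in items)

items = [HumorItem(task="joke_completion",
                   prompt="Why did the LLM cross the road?",
                   response="To get to the other side of the context window.")]
print(humor_score(items))  # 50.0 with the placeholder judge
```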
Anthropic’s new Claude 3.5 Sonnet and Meta’s 70-billion-parameter Llama 3 model are currently the funniest models we’ve tested, and even they are not that funny (scoring only 59 out of 100). Most models provide canned responses with decent explanations as to why something is funny, but few demonstrate truly creative humor. We have some hypotheses and are developing systems that can help LLMs create truly unique and humorous responses – although they might not show these skills on the surface, perhaps we can coax them out.
In addition to its focus on humor, GRAIG also includes traditional benchmark dimensions like reasoning and problem-solving. These dimensions are important for assessing LLMs’ ability to perform logical inference, draw correct conclusions, and solve complex problems. By incorporating tasks such as reading comprehension, logic problems, and mathematical reasoning, GRAIG makes sure that the evaluation of LLMs is not limited to surface-level language understanding, but also encompasses higher-order cognitive abilities. This approach allows for a more nuanced assessment of an LLM’s performance and potential for real-world applications.
GRAIG provides a more diverse and challenging set of tasks to help identify the strengths and weaknesses of different models in a more human and meaningful way. Including humor adds a measure of “humanness” needed for creating deep interactions with humans, and it can help inform the development of models that are better equipped to handle the complexities and nuances of human communication. As LLMs continue to advance and become more integrated into applications all around us, benchmarks like GRAIG will be important for ensuring their humanness and alignment with human values and expectations.
BROADER IMPACT OF COMPREHENSIVE BENCHMARKS
Comprehensive benchmarks like GRAIG have the potential to advance the field of AI and LLM development by providing additional multifaceted approaches to evaluating model performance. Incorporating a wide range of tasks and dimensions, including novel aspects like humor, can paint a more detailed picture of an LLM’s strengths, weaknesses, and overall potential. This impact can guide researchers and developers in refining model architectures, training techniques, and evaluation methodologies to create more capable, versatile, and human-like AI. Additionally, the insights gained from comprehensive benchmarks can spark new research questions and inspire innovative approaches to language modeling, pushing the boundaries of what is possible in natural language processing and generation.
The potential applications of LLMs with human-like abilities could be transformative. In education, LLMs could serve as intelligent tutoring systems, providing personalized learning experiences and engaging students with interactive, humor-infused content. In healthcare, LLMs could assist in patient communication, providing empathetic and informative responses to medical questions with compassion, care and humor when appropriate. In customer service, LLMs could handle difficult queries with wit and charm, improving user satisfaction and brand loyalty. Can a genuinely funny chatbot defuse a situation with an angry customer? As LLMs continue to evolve and demonstrate more human-like capabilities, the possibilities for application are endless.
CONCLUSION
Benchmarks play a necessary role in evaluating the performance and capabilities of LLMs. They provide a standardized way to measure and compare the performance of different models across various dimensions, such as language understanding, reasoning and maths. Benchmarks enable researchers and developers to identify the strengths and weaknesses of LLMs, guiding future research and development efforts and ensuring reliable and effective deployment of these models in real-world applications. The GRAIG benchmark represents a step forward in LLM evaluation through the inclusion of humor as a benchmark dimension. By assessing an LLM’s ability to generate, recognize, and explain humor, GRAIG aims to measure the “humanness” of these models and their capacity to engage with language on a more intuitive and creative level. By combining humor with other dimensions such as reasoning, it provides a more holistic assessment of performance and “humanness”.