Evaluating Large Language Models: A Technical Guide

Published

January 29, 2024

Large language models (LLMs) like GPT-4, Claude, and LLaMA have exploded in popularity. Thanks to their ability to generate impressively human-like text, these AI systems are now being used for everything from content creation to customer service chatbots.

But how do we know if these models are actually any good? With new LLMs being announced constantly, all claiming to be bigger and better, how do we evaluate and compare their performance?

In this comprehensive guide, we'll explore the top techniques for evaluating large language models. We'll look at the pros and cons of each approach, when they are best applied, and how you can leverage them in your own LLM testing.

Task-Specific Metrics

One of the most straightforward ways to evaluate an LLM is to test it on established NLP tasks using standardized metrics. For example:

Summarization

For summarization tasks, metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are commonly used. ROUGE compares the model-generated summary to a human-written “reference” summary, counting the overlap of words or phrases.

There are several flavors of ROUGE, each with its own pros and cons:

ROUGE-N: Compares overlap of n-grams (sequences of N words). ROUGE-1 uses unigrams (single words), ROUGE-2 uses bigrams, etc. For N of 2 or more it captures local word order, but it can be too strict because matches must be contiguous.
ROUGE-L: Based on longest common subsequence (LCS). Rewards words that appear in the same order as in the reference without requiring them to be adjacent.
ROUGE-W: A weighted variant of LCS that gives more credit to consecutive matches. Attempts to improve on ROUGE-L.

In general, ROUGE metrics are fast, automatic, and work well for ranking system summaries. However, they don't measure coherence or meaning. A summary could get a high ROUGE score and still be nonsensical.

The formula for ROUGE-N is:

$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count(gram_n)}$

Where:

Count_{match}(gram_n) is the count of n-grams in both the generated and reference summary.
Count(gram_n) is the count of n-grams in the reference summary.

For example, for ROUGE-1 (unigrams):

Generated summary: “The cat sat.”
Reference summary: “The cat sat on the mat.”
Overlapping unigrams: “The”, “cat”, “sat”
ROUGE-1 score = 3/6 = 0.5 (the reference contains six unigrams, counting both occurrences of “the”)
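To make the counting concrete, here is a minimal Python sketch of ROUGE-N as a recall score. The `tokenize` helper (lowercasing and stripping basic punctuation) is a simplifying assumption for illustration, not part of any official ROUGE implementation. Note that it counts tokens rather than unique words, so the reference sentence contributes six unigrams.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation (an assumption)."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(generated, references, n=1):
    """ROUGE-N recall: matching n-grams divided by n-grams in the references."""
    gen_counts = Counter(ngrams(tokenize(generated), n))
    matched = total = 0
    for ref in references:
        ref_counts = Counter(ngrams(tokenize(ref), n))
        total += sum(ref_counts.values())
        # Clip each match so a repeated n-gram is not counted more often
        # than it appears in the generated summary.
        matched += sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0


print(rouge_n("The cat sat.", ["The cat sat on the mat."], n=1))  # 0.5
```

Production toolkits typically also report precision and F1 and apply stemming, so their numbers may differ from this sketch.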

ROUGE-L uses the longest common subsequence (LCS). It's more flexible with word order. The formula is:

$\text{ROUGE-L} = \frac{LCS(\text{generated}, \text{reference})}{\max(\text{length}(\text{generated}), \text{length}(\text{reference}))}$

Where LCS is the length of the longest common subsequence.
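As a rough illustration, the LCS length can be computed with dynamic programming and plugged into the simplified ratio given here. The function names and whitespace tokenization are assumptions made for this sketch. The original ROUGE definition combines LCS-based recall and precision into an F-measure, so library scores will generally differ from this simplified version.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_simple(generated, reference):
    """LCS length divided by the longer of the two texts, in tokens."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    longest = max(len(gen), len(ref))
    return lcs_length(gen, ref) / longest if longest else 0.0


# The shared subsequence is "the cat ... on ... mat" (4 tokens out of 6).
print(round(rouge_l_simple("the cat lay on a mat", "the cat sat on the mat"), 3))  # 0.667
```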

ROUGE-W is a weighted version of the LCS approach. It gives more credit to matches that occur consecutively than to matches scattered across the summary.

Translation

For machine translation tasks, BLEU (Bilingual Evaluation Understudy) is a popular metric. BLEU measures the similarity between the model's output translation and professional human translations, using n-gram precision and a brevity penalty.

Key aspects of how BLEU works:

Compares overlaps of n-grams for n up to 4 (unigrams, bigrams, trigrams, 4-grams).
Calculates a geometric mean of the n-gram precisions.
Applies a brevity penalty if the translation is shorter than the reference.
Generally ranges from 0 to 1, with 1 being perfect match to reference.
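The pieces above can be put together in a short sketch. This is a simplified, sentence-level BLEU against a single reference, with uniform weights, whitespace tokenization, and no smoothing, all of which are assumptions for illustration. BLEU is normally computed over a whole corpus, and established toolkits handle tokenization and smoothing more carefully.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with one reference and no smoothing."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped precision: a candidate n-gram only counts as often
        # as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - reference_length / candidate_length).
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Every n-gram matches, so only the brevity penalty lowers the score.
print(round(bleu("the cat sat on the", "the cat sat on the mat"), 4))  # 0.8187
```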

BLEU correlates reasonably well with human judgments of translation quality. But it still has limitations:

Only measures precision against references, not recall or F1.
Struggles with creative translations using different wording.
Susceptible to “gaming” with translation tricks.

Other translation metrics like METEOR and TER attempt to improve on BLEU's weaknesses. But in general, automatic metrics don't fully capture translation quality.

Other Tasks

In addition to summarization and translation, metrics like F1, accuracy, MSE, and more can be used to evaluate LLM performance on tasks like:

Text classification
Information extraction
Question answering
Sentiment analysis
Grammatical error detection

The advantage of task-specific metrics is that evaluation can be fully automated using standardized datasets, such as SQuAD for question answering and the GLUE benchmark for a range of language understanding tasks. Results can easily be tracked over time as models improve.
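For classification-style tasks such as sentiment analysis, the automated scoring can be as simple as the sketch below. The function name and the toy labels are made up for illustration. Real evaluations would run over a full labeled test set.

```python
def classification_metrics(gold, predicted, positive_label):
    """Accuracy, precision, recall, and F1 for one label treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy sentiment labels (hypothetical data).
gold = ["pos", "neg", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg"]
print(classification_metrics(gold, predicted, positive_label="pos"))
```

Here two of the three predicted positives are correct and two of the three true positives are found, so precision, recall, and F1 all come out to 2/3 while accuracy is 3/5.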

However, these metrics are narrowly focused and can't measure overall language quality. LLMs that perform well on metrics for a single task may fail at generating coherent, logical, helpful text in general.

Research Benchmarks

A popular way to evaluate LLMs is to test them against wide-ranging research benchmarks covering diverse topics and skills. These benchmarks allow models to be rapidly tested at scale.

Some well-known benchmarks include:

SuperGLUE – Challenging set of 8 diverse language understanding tasks.
GLUE – Collection of 9 sentence understanding tasks. Simpler than SuperGLUE.
MMLU – 57 different STEM, social sciences, and humanities tasks. Tests knowledge and reasoning ability.
Winograd Schema Challenge – Pronoun resolution problems requiring common sense reasoning.
ARC – Grade-school science questions that require reasoning (AI2 Reasoning Challenge).
HellaSwag – Commonsense reasoning about how everyday situations continue.
PIQA – Physical commonsense questions about everyday objects and actions.

By evaluating on benchmarks like these, researchers can quickly test models on their ability to perform math, logic, reasoning, coding, common sense, and much more. The percentage of questions correctly answered becomes a benchmark metric for comparing models.
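In code, scoring a multiple-choice benchmark mostly comes down to counting correct answers, often broken down by category. The sketch below assumes a hypothetical item format (`category`, `question`, `choices`, `answer`) and a hypothetical `answer_fn` callable that wraps the model. Neither reflects any particular benchmark's actual schema.

```python
from collections import defaultdict


def benchmark_accuracy(items, answer_fn):
    """Return accuracy per category plus an overall score."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in items:
        predicted = answer_fn(item["question"], item["choices"])
        stats = per_category[item["category"]]
        stats[0] += int(predicted == item["answer"])
        stats[1] += 1
    report = {cat: correct / total for cat, (correct, total) in per_category.items()}
    all_correct = sum(correct for correct, _ in per_category.values())
    all_total = sum(total for _, total in per_category.values())
    report["overall"] = all_correct / all_total if all_total else 0.0
    return report


# Toy items and a stand-in "model" that always picks the first choice.
items = [
    {"category": "math", "question": "2 + 2?", "choices": ["4", "5"], "answer": "4"},
    {"category": "math", "question": "3 * 3?", "choices": ["6", "9"], "answer": "9"},
    {"category": "logic", "question": "Is A = A?", "choices": ["yes", "no"], "answer": "yes"},
]
print(benchmark_accuracy(items, lambda question, choices: choices[0]))
```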

However, a major issue with benchmarks is training data contamination. Many benchmarks contain examples that were already seen by models during pre-training. This enables models to “memorize” answers to specific questions and perform better than their true capabilities.

Attempts are made to “decontaminate” benchmarks by removing overlapping examples. But this is challenging to do comprehensively, especially when models may have seen paraphrased or translated versions of questions.
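One simple decontamination heuristic is to flag benchmark items that share long word n-grams with the training corpus. The sketch below is an assumption about how such a check might look. The n-gram size and threshold are illustrative values, not recommendations. As noted above, this kind of exact matching will miss paraphrased or translated versions of a question.

```python
def word_ngrams(text, n):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(benchmark_items, corpus_docs, n=8, threshold=0.5):
    """Return (item, overlap) pairs whose n-gram overlap with the corpus is high."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    flagged = []
    for item in benchmark_items:
        grams = word_ngrams(item, n)
        if not grams:
            continue  # item is shorter than n words
        overlap = len(grams & corpus_ngrams) / len(grams)
        if overlap >= threshold:
            flagged.append((item, overlap))
    return flagged


# Toy data; a small n keeps the example short.
corpus = ["quiz archive: what is the capital of france paris is the answer"]
questions = ["what is the capital of france", "name the largest planet in the solar system"]
print(flag_contaminated(questions, corpus, n=3))
```

For large corpora, the set of corpus n-grams would need to be hashed or indexed rather than held in memory as shown here.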

So while benchmarks can test a broad set of skills efficiently, they cannot reliably measure true reasoning abilities or avoid score inflation due to contamination. Complementary evaluation methods are needed.

LLM Self-Evaluation

An intriguing approach is to have an LLM evaluate another LLM's outputs. The idea is that evaluating an output can be an easier task than generating it:

Producing a high-quality output may be difficult for an LLM.
But determining if a given output is high-quality can be an easier task.

For example, while an LLM may struggle to generate a factual, coherent paragraph from scratch, it can more easily judge if a given paragraph makes logical sense and fits the context.

So the process is:

Pass input prompt to first LLM to generate output.
Pass input prompt + generated output to second “evaluator” LLM.
Ask evaluator LLM a question to assess output quality. e.g. “Does the above response make logical sense?”
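The three steps can be sketched as a small function. The `generate` and `judge` callables are hypothetical stand-ins for whatever model clients you use. No specific vendor API is assumed, and the prompt template and YES/NO parsing are illustrative choices.

```python
from typing import Callable, Dict

# Hypothetical judge prompt built around the example question above.
JUDGE_TEMPLATE = (
    "Prompt:\n{prompt}\n\n"
    "Response:\n{response}\n\n"
    "Does the above response make logical sense? Answer YES or NO."
)


def evaluate_with_judge(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str], str],
) -> Dict[str, object]:
    """Generate a response, then ask a second model to assess it."""
    response = generate(prompt)  # step 1: first LLM produces the output
    judge_input = JUDGE_TEMPLATE.format(prompt=prompt, response=response)  # step 2
    verdict = judge(judge_input)  # step 3: evaluator answers the question
    passed = verdict.strip().upper().startswith("YES")
    return {"prompt": prompt, "response": response, "verdict": verdict, "passed": passed}


# Stub models so the sketch runs without any network calls.
result = evaluate_with_judge(
    "Why is the sky blue?",
    generate=lambda p: "Air scatters short blue wavelengths more than red ones.",
    judge=lambda text: "YES. The explanation is coherent.",
)
print(result["passed"])  # True
```

Asking for a constrained answer format (here YES or NO) makes the verdict easier to parse, though real judge outputs may need more forgiving handling.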

This approach is fast to implement and automates LLM evaluation. But there are some challenges:

Performance depends heavily on choice of evaluator LLM and prompt wording.
Constrained by the difficulty of the original task. Evaluating complex reasoning is still hard for LLMs.
Can be computationally expensive if using API-based LLMs.

Self-evaluation is especially promising for assessing retrieved information in RAG (retrieval-augmented generation) systems. Additional LLM queries can validate if retrieved context is used appropriately.

Overall, self-evaluation shows potential but requires care in implementation. It complements, rather than replaces, human evaluation.

Human Evaluation

Given the limitations of automated metrics and benchmarks, human evaluation is still the gold standard for rigorously assessing LLM quality.

Experts can provide detailed qualitative assessments on:

Accuracy and factual correctness
Logic, reasoning, and common sense
Coherence, consistency and readability
Appropriateness of tone, style and voice
Grammaticality and fluency
Creativity and nuance

To evaluate a model, humans are given a set of input prompts and the LLM-generated responses. They assess the quality of responses, often using rating scales and rubrics.
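Once ratings are collected, they need to be aggregated. The sketch below assumes a hypothetical record format of (rater, response, criterion, score) and a 1 to 5 scale. It averages scores per criterion and flags items where raters disagree by a chosen margin, which can be a useful signal that a rubric needs tightening.

```python
from collections import defaultdict
from statistics import mean


def summarize_ratings(ratings, disagreement_gap=2):
    """Average scores per criterion and flag items where raters differ widely.

    `ratings` is an iterable of (rater_id, response_id, criterion, score) tuples.
    """
    by_criterion = defaultdict(list)
    by_item = defaultdict(list)
    for _rater, response_id, criterion, score in ratings:
        by_criterion[criterion].append(score)
        by_item[(response_id, criterion)].append(score)
    means = {criterion: mean(scores) for criterion, scores in by_criterion.items()}
    disputed = [
        key for key, scores in by_item.items()
        if max(scores) - min(scores) >= disagreement_gap
    ]
    return means, disputed


# Toy ratings from two raters on one response (hypothetical data).
ratings = [
    ("rater1", "resp_a", "accuracy", 4),
    ("rater2", "resp_a", "accuracy", 5),
    ("rater1", "resp_a", "fluency", 5),
    ("rater2", "resp_a", "fluency", 2),
]
means, disputed = summarize_ratings(ratings)
print(means)     # {'accuracy': 4.5, 'fluency': 3.5}
print(disputed)  # [('resp_a', 'fluency')]
```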

The downside is that manual human evaluation is expensive, slow, and difficult to scale. It also requires developing standardized criteria and training raters to apply them consistently.

Some researchers have explored creative ways to crowdsource human LLM evaluations using tournament-style systems where people bet on and judge matchups between models. But coverage is still limited compared to full manual evaluations.
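One common way to turn pairwise matchups into a ranking is an Elo-style rating. The article does not say which scoring scheme these tournament systems use, so treat the sketch below as an assumption. The model names, starting rating, and K-factor are illustrative.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, model_a, model_b, outcome, k=32, initial=1000.0):
    """Update ratings in place after one matchup.

    `outcome` is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    rating_a = ratings.get(model_a, initial)
    rating_b = ratings.get(model_b, initial)
    expected_a = expected_score(rating_a, rating_b)
    ratings[model_a] = rating_a + k * (outcome - expected_a)
    ratings[model_b] = rating_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return ratings


# Hypothetical judged matchups: (model A, model B, outcome for A).
ratings = {}
for a, b, outcome in [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]:
    update_elo(ratings, a, b, outcome)
print(ratings)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating across models stays constant, and results depend on the order in which matchups are processed.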

For business use cases where quality matters more than raw scale, expert human testing remains the gold standard despite its costs. This is especially true for riskier applications of LLMs.

Conclusion

Evaluating large language models thoroughly requires using a diverse toolkit of complementary methods, rather than relying on any single technique.

By combining automated approaches for speed with rigorous human oversight for accuracy, we can develop trustworthy testing methodologies for large language models. With robust evaluation, we can unlock the tremendous potential of LLMs while managing their risks responsibly.