Evaluating LLMs: Using LLM-as-a-Judge for better GenAI evaluation

A couple of weeks ago, I had the pleasure of joining fellow GenAI enthusiasts and practitioners on a panel at Microsoft to discuss reducing hallucinations and improving relevancy in LLM calls and RAG pipelines. A key takeaway from our discussion was the importance of using LLM-as-a-Judge for evaluation. In this blog post, I want to delve deeper into this concept and share insights on why it’s crucial for anyone working seriously with Large Language Models (LLMs).


The challenge of evaluating LLMs

Evaluating LLMs is inherently challenging due to their creative and unpredictable nature. Unlike traditional software outputs, LLM responses can vary widely even when given the same input. It’s not just comparing apples to oranges—it’s like comparing apples to a freshly generated fruit each time. This variability makes it difficult to quantify improvements or regressions in model performance.

Why evaluate LLMs?

Understanding and measuring the performance of LLMs is essential for several reasons:

  • Confirm and detect improvements: Assessing new models helps confirm if updates lead to better performance.
  • Signal regression: Identifying when newer models perform worse than previous versions prevents unintended downgrades.
  • Optimize prompts: Testing different prompt variants ensures that you’re getting the best possible responses from your models.

In essence, evaluation is crucial for automated LLM lifecycle management. For example, in Azure, new model versions are released around every two months. A new iteration can cause GenAI applications to behave unexpectedly or even break. We experienced this firsthand with an update to GPT-3.5, where stricter instruction adherence led to issues in our application due to a previously ignored but now problematic prompt instruction.

The need for objective scoring

Having a system that quantifies the quality of LLM responses over a broad set of questions is vital. Objective scores help you:

  • Measure improvements: Demonstrate that changes to prompts or models lead to better outcomes.
  • Monitor regressions: Ensure that enhancements in one area don’t negatively impact another (see the small comparison sketch after this list).
  • Build measurable indices: Track performance over time to inform future development.
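To make this concrete, here is a minimal sketch in plain Python of how two runs over the same question set could be compared: average the judge’s 1-5 scores per run and flag a regression if the candidate drops below the baseline by more than a chosen tolerance. The scores, the 0.2 tolerance, and the function name are illustrative assumptions, not a fixed recipe.

from statistics import mean

def compare_runs(baseline_scores, candidate_scores, tolerance=0.2):
    # Compare mean judge scores (1-5) of two runs over the same question set.
    base_avg = mean(baseline_scores)
    cand_avg = mean(candidate_scores)
    delta = cand_avg - base_avg
    if delta < -tolerance:
        verdict = "regression"
    elif delta > tolerance:
        verdict = "improvement"
    else:
        verdict = "comparable"
    return base_avg, cand_avg, delta, verdict

# Illustrative judge scores for five questions, one list per run
baseline = [5, 4, 4, 3, 5]   # current prompt / model version
candidate = [5, 5, 4, 4, 5]  # new prompt / model version

base_avg, cand_avg, delta, verdict = compare_runs(baseline, candidate)
print(f"baseline={base_avg:.2f} candidate={cand_avg:.2f} delta={delta:+.2f} -> {verdict}")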

Limitations of traditional evaluation metrics

In the traditional machine learning world, metrics like BLEU and ROUGE (or any other metric that sounds like a French color) were used for evaluation. These metrics rely on n-gram overlap between the generated text and a reference text, which works well for tasks with fairly static outputs. However, they fall short when applied to LLMs, whose answers can be phrased in countless equally valid ways.
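To see why, here is a small, self-contained sketch of a BLEU-style n-gram precision check (a deliberate simplification: no clipping, brevity penalty, or smoothing). A perfectly correct paraphrase scores poorly simply because it uses different words than the reference.

def ngram_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also appear in the reference (simplified).
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not cand_ngrams:
        return 0.0
    return sum(1 for g in cand_ngrams if g in ref_ngrams) / len(cand_ngrams)

reference = "The capital of France is Paris"
paraphrase = "Paris serves as the French capital city"  # correct answer, different wording

print(ngram_precision(paraphrase, reference, 1))  # ~0.43: modest unigram overlap
print(ngram_precision(paraphrase, reference, 2))  # 0.0: no shared bigrams at all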

Introducing LLM-as-a-Judge

To address these challenges, we’ve adopted the concept of LLM-as-a-Judge—using AI to evaluate AI. Interestingly, this approach was also advocated by my fellow panelists. LLMs can be remarkably effective at assessing the quality of generated content, often better at reflecting on an answer than producing it, especially in a single-shot setting.

How it works

  1. Prepare a broad set of questions: Develop a comprehensive list of questions along with their expected answers (ground truth).
  2. Generate model responses: Use your LLM with one or more prompts to produce answers to these questions.
  3. Evaluate with the most capable LLM: Utilize a powerful model like GPT-4 to compare the generated answers with the ground truth.
  4. Assign scores: Implement specific scoring conditions to rate the responses on a scale (e.g., 1 to 5). A minimal sketch of this loop follows right after these steps.
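Wired together, the four steps form a short evaluation loop. The sketch below assumes the generated answers already exist, uses the OpenAI Python SDK as the judge client (on Azure you would construct an AzureOpenAI client instead), and embeds a condensed version of the evaluation prompt shown in the next section. The model name, prompt wording, and data layout are illustrative assumptions.

from openai import OpenAI  # pip install openai; on Azure, use AzureOpenAI instead

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an AI language model tasked with evaluating the quality of another AI's response.
Rate Accuracy from 1 (completely inaccurate) to 5 (completely accurate) against the ground truth,
then give a brief justification.

Question: {question}
Ground truth answer: {ground_truth}
Generated answer: {answer}

Evaluation form:
Accuracy score:"""

dataset = [  # step 1: questions with ground truth; step 2 already produced the answers
    {"question": "What is the capital of France?",
     "ground_truth": "Paris.",
     "answer": "The capital of France is Paris."},
]

for item in dataset:  # steps 3 and 4: have the judge score each answer
    response = client.chat.completions.create(
        model="gpt-4",  # your most capable judge model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**item)}],
    )
    print(item["question"], "->", response.choices[0].message.content)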

Example prompt for evaluation

Here’s an example of a prompt used to evaluate the generated answers:

You are an AI language model tasked with evaluating the quality of another AI's response.

Evaluation Criteria:

Accuracy (1-5): How correct is the answer compared to the ground truth?

5: Completely accurate with no errors.

4: Mostly accurate with minor errors.

3: Some accurate elements but significant errors.

2: Mostly inaccurate with few correct elements.

1: Completely inaccurate.

Instructions:

Read the question and the generated answer carefully.

Compare the generated answer to the ground truth.

Assign an accuracy score based on the criteria above.

Provide a brief justification for the score.

Question: What is the capital of France?

Ground truth answer: Paris.

Generated answer: The capital of France is Paris.

Evaluation form:

Accuracy score: 5

Justification: The generated answer is completely accurate and matches the ground truth.
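Because the judge replies in free text, the score still has to be pulled out of the evaluation form programmatically. One simple, assumed approach is a regular expression over the expected “Accuracy score:” line (alternatively, you can instruct the judge to return JSON); the function name below is my own.

import re

def parse_accuracy_score(judge_output):
    # Extract the 1-5 accuracy score from the judge's evaluation form, or None if missing.
    match = re.search(r"Accuracy score:\s*([1-5])", judge_output, re.IGNORECASE)
    return int(match.group(1)) if match else None

judge_output = """Evaluation form:
Accuracy score: 5
Justification: The generated answer is completely accurate and matches the ground truth."""

print(parse_accuracy_score(judge_output))  # -> 5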

Sample questions with ground truth and evaluation

Sample question 1:

  • Question: Who wrote the play “Romeo and Juliet”?
  • Ground truth answer: William Shakespeare.

Generated answer (Good score): “Romeo and Juliet” was written by William Shakespeare.

  • Accuracy score: 5
  • Justification: The answer is correct and matches the ground truth precisely.

Sample question 2:

  • Question: What is the largest planet in our solar system?
  • Ground truth answer: Jupiter.

Generated answer (Bad score): The largest planet in our solar system is the Sun.

  • Accuracy score: 1
  • Justification: The answer is incorrect; the Sun is a star, not a planet.
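For automated runs, question sets like the two above are usually kept in a simple machine-readable file so the evaluation loop can iterate over them. The snippet below writes them as JSONL, one object per line; the file name and field names are my own convention, and the generation step would add an "answer" field to each record before the judge run.

import json

records = [
    {"question": "Who wrote the play \"Romeo and Juliet\"?", "ground_truth": "William Shakespeare."},
    {"question": "What is the largest planet in our solar system?", "ground_truth": "Jupiter."},
]

with open("questions.jsonl", "w", encoding="utf-8") as f:  # one JSON object per line
    for record in records:
        f.write(json.dumps(record) + "\n")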

Key takeaways

  • LLM-as-a-Judge is effective: Using LLMs to evaluate other LLMs is becoming a standard practice due to its effectiveness.
  • Better at evaluation than generation: LLMs often excel at assessing responses over generating them, especially in single-shot scenarios.
  • Specific scoring conditions are crucial: Due to LLMs’ unpredictability, clearly defined scoring criteria are essential for consistent evaluations.
  • Not perfect but better: While this method isn’t foolproof, it’s a significant improvement over traditional metrics and offers a practical solution for LLM evaluation.

Conclusion

Evaluating LLMs doesn’t have to be an insurmountable challenge. By leveraging LLM-as-a-Judge, we can create a more reliable and objective evaluation system that adapts to the creative nature of these models. This approach not only helps in maintaining and improving current applications but also paves the way for more advanced and robust AI systems in the future.

In my next post, I’m going to show how we can do this whole process in Azure Promptflow. Follow me on LinkedIn to stay posted.

