ROUGE evaluation metric

The ROUGE metric measures how closely generated summaries or translations match reference outputs.

Metric details

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a generative AI quality evaluation metric that measures the overlap of n-grams, word sequences, and word pairs between the generated output and one or more reference texts.
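
To make the recall orientation concrete, the following is a minimal sketch of the ROUGE-1 recall computation (unigram overlap, with counts clipped by the candidate). It is an illustration only; the full metric also reports precision and F-measure and includes ROUGE-2 and ROUGE-L variants.

```python
from collections import Counter

def rouge_1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: the fraction of reference unigrams that also
    appear in the candidate, with per-word counts clipped."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[word]) for word, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

# 5 of the 6 reference unigrams appear in the candidate -> ~0.833
print(rouge_1_recall("the cat sat on the mat", "the cat lay on the mat"))
```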

Scope

The ROUGE metric evaluates generative AI assets only.

  • Types of AI assets: Prompt templates
  • Generative AI tasks:
    • Text summarization
    • Content generation
    • Question answering
    • Entity extraction
    • Retrieval augmented generation (RAG)
  • Supported languages: Arabic (ar), Danish (da), English (en), French (fr), German (de), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Spanish (es).

Scores and values

The ROUGE metric score indicates how similar the generated output is to the reference outputs; higher scores indicate greater similarity. See the example after the following list.

  • Range of values: 0.0-1.0
  • Best possible score: 1.0
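
For illustration only, here is how a score in this range might be computed with the open-source rouge-score Python package; the package and its API are an assumption made for this example and are not part of the metric configuration described here.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Score a generated output against a reference; each result carries
# precision, recall, and F-measure, all in the 0.0-1.0 range.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"])
result = scorer.score(
    "the cat sat on the mat",   # target (reference output)
    "the cat lay on the mat",   # prediction (generated output)
)
for name, score in result.items():
    print(name, round(score.fmeasure, 3))
```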

Settings

  • Thresholds:
    • Lower limit: 0.8
    • Upper limit: 1.0
  • Parameters:
    • Use stemmer: If true, uses the Porter stemmer to strip word suffixes before matching. Defaults to false. See the sketch after this list.
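
The sketch below illustrates the effect of the Use stemmer parameter, again using the open-source rouge-score package (an assumption made for this example). With stemming enabled, "running" and "runs" both reduce to "run" and count as matches, so the score rises.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "the runners were running quickly"
candidate = "the runner runs quickly"

plain = rouge_scorer.RougeScorer(["rouge1"])                      # use_stemmer defaults to False
stemmed = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)  # Porter-stem tokens before matching

# Without stemming, only "the" and "quickly" match (~0.44 F-measure);
# with stemming, "running"/"runs" reduce to "run" and "runners" to
# "runner", so the score rises (~0.89 F-measure).
print(plain.score(reference, candidate)["rouge1"].fmeasure)
print(stemmed.score(reference, candidate)["rouge1"].fmeasure)
```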