Methodology

Overview

LiTERatE (Literary Translation Evaluation and Rating Ensemble) is a benchmark for evaluating machine translation systems on literary text. Unlike traditional machine translation benchmarks that focus on news articles, technical documentation, or general text, LiTERatE specifically targets literary translation, which presents unique challenges due to its creative and nuanced nature.

Dataset Composition

Our dataset consists of excerpts from Chinese, Japanese, and Korean (CJK) novels paired with their English human translations. We include a diverse range of translations:

  • Published professional translations
  • Translations from online publishers
  • Amateur translations

While published professional translations make up the bulk of our samples to ensure high quality, we deliberately include lower-quality translations for two important reasons:

  • To increase diversity, since these translations often cover genres and story types that are rarely translated professionally
  • To test system robustness against varying human translation quality

Evaluation Units

The basic unit of evaluation is a chunk of 200-500 CJK characters. To ensure a fair and consistent evaluation environment, we:

  • Extract terminology used in the original human translation
  • Provide these terms as additional input for all systems
  • Include gender information for each term (neuter, feminine, or masculine)
  • Provide approximately 60-80 CJK characters from the previous and next chunks as context

This approach allows us to evaluate not only translation quality but also term adherence and contextual understanding.
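For concreteness, a single evaluation unit can be represented roughly as in the sketch below. This is purely illustrative; the field names are ours and do not describe the benchmark's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    source: str        # term as it appears in the source text
    translation: str   # rendering used in the human translation
    gender: str        # "neuter", "feminine", or "masculine"

@dataclass
class EvaluationUnit:
    source_chunk: str                              # 200-500 CJK characters to translate
    prev_context: str                              # short span preceding the chunk
    next_context: str                              # short span following the chunk
    glossary: list = field(default_factory=list)   # list of GlossaryTerm
    human_translation: str = ""                    # reference used for judging

# Abbreviated example based on the chunk shown later in this section:
unit = EvaluationUnit(
    source_chunk="长庚蓦地一转身……",
    prev_context="飞奔而去。……",
    next_context="起鸢楼的笙歌还在绕梁不休……",
    glossary=[GlossaryTerm("长庚", "Chang Geng", "masculine")],
    human_translation="Chang Geng spun around. ……",
)
```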

System Input Format

All evaluated systems (except the Google NMT baseline) receive the following inputs:

  • The text chunk to be translated (200-500 CJK characters)
  • Previous and next chunks as context (approximately 60-80 CJK characters each)
  • A glossary of terms with their translations and gender information

The Google NMT baseline, which serves as a traditional machine translation reference point, receives only line-by-line input without additional context or terminology data.
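As a rough sketch of how these pieces might be assembled into a single input for the non-baseline systems (the exact prompt wording and layout are not reproduced here, so treat this only as an illustration):

```python
def build_system_input(chunk, prev_context, next_context, glossary):
    """Combine the chunk, surrounding context, and glossary into one input.

    `glossary` is a list of (source_term, translation, gender) tuples.
    The layout below is illustrative only, not the prompt used in LiTERatE.
    """
    glossary_lines = "\n".join(
        f"- {term} -> {translation} ({gender})"
        for term, translation, gender in glossary
    )
    return (
        "Glossary (use these renderings consistently):\n"
        f"{glossary_lines}\n\n"
        f"Previous context:\n{prev_context}\n\n"
        f"Text to translate:\n{chunk}\n\n"
        f"Next context:\n{next_context}\n"
    )
```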

Evaluation Process

Our evaluation process follows these key steps:

  1. Chunk Curation: We carefully select and prepare text chunks from our dataset, ensuring they represent diverse literary styles, genres, and translation challenges.
  2. Translation Generation: We ask different systems to produce translations based on the raw text, extracted terminology/glossary, and surrounding context.
  3. Human Reference: Each chunk has a corresponding human translation that serves as a reference point (though not necessarily the "gold standard").
  4. Head-to-Head Comparison: Our LLM ensemble judges compare each system's translation against the human translation in a direct comparison.
  5. Scoring: Based on these comparisons, we calculate win rates representing how often each system's translations are judged at least as good as the human translations, counting ties as half a win (see the sketch below and Scoring Methodology).
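The sketch below strings these steps together for a single system. The callables `translate` and `judge` are placeholders standing in for an MT system and our judge ensemble, respectively; they are not components of the benchmark itself.

```python
def evaluate_system(translate, judge, units):
    """Collect one head-to-head verdict per chunk for a single MT system.

    `translate(unit)` returns the system's translation of a chunk (step 2);
    `judge(...)` compares it with the human reference (steps 3-4) and
    returns "machine", "human", or "not-sure".
    """
    verdicts = []
    for unit in units:  # curated chunks (step 1)
        machine_translation = translate(unit)
        verdicts.append(
            judge(
                source=unit.source_chunk,
                machine=machine_translation,
                human=unit.human_translation,
            )
        )
    return verdicts  # converted into a win rate in step 5 (see Scoring Methodology)
```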

Example Chunk

Below is an example chunk from our dataset, showing the source text, glossary terms, surrounding context, and the human translation.

Source Text
长庚蓦地一转身:"备纸笔。"

侍卫连忙追上去:"殿下,你的手……"

长庚闻言一顿,抄起顾昀落下的酒壶,面无表情地将那一壶烈酒全冲到了双手的伤口上,本来已经结痂的伤口再次被冲出血水来,他从怀中取出一块帕子,浑不在意地一裹。

此时京城中,谁也没料到一个老太监的死竟然引发了这样一场轩然大波。

谭鸿飞压抑二十年的冤屈爆发,大约已经失心疯了,先是派兵围了王国舅府邸,得知那老东西竟将老婆孩子抛下,进宫躲风头去了,便立刻掉头,悍然对上了赶来救场的御林军。

御林军素日与北大营一主内、一主外,同为京畿重地的最后一道防线,是抬头不见低头见的交情,御林军主要由京城里走门路吃皇粮的少爷兵和从北大营抽调选拔的精英两部分组成,前者早就吓得尿了裤子,根本指望不上,后者虽然有本事,但骤然与"娘家"对上,一时间也是进退维谷,正如长庚预料,很快便溃不成军。
Previous Context
飞奔而去。

长庚一直盯着他的背影,直到目力无可及,他突然闭了闭眼,几不可闻地喃喃叫了一声:"子熹……"

一边的侯府侍卫没听清,疑惑道:"殿下说什么?"
Next Context
起鸢楼的笙歌还在绕梁不休,温热的花酒白雾未消,四九城中已经炸了锅。

谭鸿飞带人逼至宫禁之外,
Glossary
Term   | Translation         | Gender
长庚   | Chang Geng          | masculine
殿下   | Your Highness       | neuter
顾昀   | Gu Yun              | masculine
谭鸿飞 | Tan Hongfei         | masculine
王国舅 | Imperial Uncle Wang | masculine
御林军 | Imperial Guard      | neuter
北大营 | Northern Camp       | neuter
Human Translation
Chang Geng spun around. "Prepare a brush and paper."

"Your Highness, your hands..." The guard chased after him.

Chang Geng paused, picked up Gu Yun's abandoned jar of wine, and, with no change in expression, poured the whole jar of strong liquor over the wounds on his hands. The cuts, which had already begun to scab over, bled again with the rush of liquid. Chang Geng carelessly retrieved a handkerchief from his lapels and wrapped them tight.

In the capital, no one expected that an old eunuch's death would raise such a storm of controversy.

The resentment Tan Hongfei had suppressed for twenty years erupted—he had very likely already lost his mind. He first sent soldiers to surround Imperial Uncle Wang's estate. Upon learning that the old bastard had abandoned his wife and children to cower within the palace, he did an about-face and brazenly turned his blade on the Imperial Guard who had rushed to the scene.

The Imperial Guard and the Northern Camp had always been the last lines of defense for the capital, one within and one without, and the two constantly crossed paths. The Imperial Guard was by and large made up of two groups: young-master soldiers benefitting from nepotism and living off the imperial coffers, and elite soldiers selected from the Northern Camp. The former had already pissed their pants in terror and could not be relied on. The latter were skilled, but, stuck in the impossible position of drawing blades against their maiden family, quickly crumpled. Just as Chang Geng had predicted, in no time at all, the Imperial Guard was defeated.

Evaluation Approach

Our benchmark uses an ensemble of Large Language Models (LLMs) as judges to evaluate translations. The evaluation is conducted as head-to-head comparisons between machine translations and human translations.

To validate the accuracy of our automated evaluation, we conducted an extensive calibration experiment:

  • Multiple human annotators evaluated several hundred translation pairs
  • We focused on decisive human verdicts—cases where multiple annotators agreed on a clear winner
  • This approach addresses the inherently subjective nature of literary translation evaluation, which typically has low inter-annotator agreement
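As an illustration of how the calibration set can be restricted to decisive cases, the filter below keeps only pairs where the human annotators agree on a clear winner. The unanimity requirement is an assumption made for the sketch, not necessarily our exact criterion.

```python
from collections import Counter

def decisive_pairs(annotations):
    """Keep only translation pairs with a decisive human verdict.

    `annotations` maps a pair id to the list of verdicts ("machine",
    "human", or "not-sure") given by the human annotators for that pair.
    """
    decisive = {}
    for pair_id, verdicts in annotations.items():
        winner, count = Counter(verdicts).most_common(1)[0]
        # Illustrative criterion: all annotators name the same clear winner.
        if winner != "not-sure" and count == len(verdicts):
            decisive[pair_id] = winner
    return decisive
```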

Judge Ensemble

Our experiments revealed that using multiple frontier LLMs as judges, each evaluating a different aspect of translation quality, and then ensembling their verdicts produces the most accurate results among the judging configurations we tested.

This ensemble approach achieves 82% accuracy against decisive human judgments. For comparison, a single LLM judge achieves only around 60% accuracy on the same set.
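The exact aggregation rule behind the ensemble is not spelled out here; one simple reading, shown purely as a sketch, is a majority vote over the individual judges' verdicts, falling back to "not-sure" when no clear majority emerges.

```python
from collections import Counter

def ensemble_verdict(judge_verdicts):
    """Combine per-judge verdicts ("machine", "human", "not-sure") into one.

    A plain majority vote is used here for illustration only; ties between
    the leading verdicts fall back to "not-sure".
    """
    ranked = Counter(judge_verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "not-sure"
    return ranked[0][0]

# e.g. ensemble_verdict(["machine", "machine", "human"]) -> "machine"
```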

Scoring Methodology

For each evaluation unit, our judge ensemble determines whether the machine translation or the human translation is superior, or if the comparison is too close to call ("not-sure").

Points are assigned as follows:

  • Machine translation wins: 1 point
  • Tie or "not-sure": 0.5 points
  • Human translation wins: 0 points

The final score for each system is calculated as the average of these points multiplied by 100, representing the system's win rate against human translators. A score of 50 indicates parity with human translation quality.
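In code, the point scheme and win-rate computation amount to the following (a direct transcription of the rules above):

```python
POINTS = {"machine": 1.0, "tie": 0.5, "not-sure": 0.5, "human": 0.0}

def win_rate(verdicts):
    """Average the per-chunk points and scale to 0-100.

    A score of 50 indicates parity with the human translations; higher
    scores mean the system's output was preferred more often than not.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(POINTS[v] for v in verdicts) / len(verdicts)

# e.g. win_rate(["machine", "human", "not-sure", "machine"]) == 62.5
```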

Limitations

While our methodology represents a significant advancement in evaluating literary translation, we acknowledge several limitations:

  • Literary translation evaluation is inherently subjective with low inter-annotator agreement
  • Our current dataset is limited to Chinese, Japanese, and Korean source languages
  • The evaluation focuses on chunk-level translation rather than document-level coherence
  • Even with our ensemble approach, there remains an 18% gap between our automated evaluation and decisive human judgment