| Reference | Task | Phenomena | Language | Size | Construction | Comments |
|---|---|---|---|---|---|---|
| | NLI | Antonyms, quantities, spelling, word overlap, negation, length | English | 7596 | Automatic | |
| | NLI | Compositionality | English | 44010 | Automatic | |
| | NLI | Antonyms, hyper/hyponyms | English | 6279 | Semi-automatic | |
| | NLI | Diverse semantics | English | 550 | Manual | |
| | NLI | Lexical inference | English | 8193 | Semi-automatic | |
| | NLI | Diverse | English | 570K | Manual, semi-automatic, automatic | |
| | MT | Word sense disambiguation | German→English/French | 13900 | Semi-automatic | |
| | MT | Morphology | English→Czech/Latvian | 18500 | Automatic | |
| | MT | Polarity, verb-particle constructions, agreement, transliteration | English→German | 97000 | Automatic | |
| | MT | Discourse | English→French | 400 | Manual | |
| | MT | Morpho-syntax, syntax, lexicon | English↔French | 108+506 | Manual | |
| | MT | Diverse | English↔German | 10000 | Manual | |
| | MT | Discourse | English→German | 4627 | Automatic | Test sets created using oracles, an alternative to challenge sets. The method can be applied to different language pairs and datasets. |
| | MT | Coreference, pronouns | English→German | 12000 | Automatic | |
| | LM | Subject-verb agreement | English | ∼1.35M | Automatic | |
| | LM | Number agreement | English, Russian, Hebrew, Italian | ∼10K | Automatic | |
| | Coreference | Gender bias | English | 720 | Semi-automatic | |
| | Coreference | Gender bias | English | 3160 | Semi-automatic | |
| | Seq2Seq | Compositionality | English | 20910 | Automatic | |
| | POS tagging | Noun-verb ambiguity | English | 32654 | Semi-automatic | |
| | NLI | Psychometric assessment | English | 180 | Manual | |
| | Sentiment | Psychometric assessment | English | 134 | Manual | |